Long Short-Term Memory
论文原文
地址01:https://arxiv.org/pdf/1506.04214.pdf
地址02:https://www.bioinf.jku.at/publications/older/2604.pdf
Abstract
Learning to store information over extended time intervals via recurrent backpropagation takes a very long time, mostly due to insucient, decaying error back ow. We brie y review Hochreiter's 1991 analysis of this problem, then address it by introducing a novel, ecient, gradient-based method called \Long Short-Term Memory" (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error ow through \constant error carrousels" within special units. Multiplicative gate units learn to open and close access to the constant error ow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with arti cial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with RTRL, BPTT, Recurrent Cascade-Correlation, Elman nets, and Neural Sequence Chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, arti cial long time lag tasks that have never been solved by previous recurrent network algorithms. 通过周期性的反向传播学习,在扩展的时间间隔内存储信息需要很长的时间,这主要是由于不确定的、衰减的错误导致的。我们简要回顾了Hochreiter在1991年对这个问题的分析,然后介绍了一种新颖的、独特的、基于梯度的方法,称为LSTM (LSTM)。在不造成伤害的情况下截断梯度,LSTM可以学习在超过1000个离散时间步长的最小时间滞后上桥接,方法是通过在特殊单元内的“恒定误差轮盘”强制执行恒定误差。乘性门单元学习打开和关闭访问的恒定误差低。LSTM在空间和时间上都是局部的;其每时间步长的计算复杂度和权值为O(1)。我们对人工数据的实验包括局部的、分布式的、实值的和有噪声的模式表示。在与RTRL、BPTT、周期性级联相关、Elman网和神经序列分块的比较中,LSTM带来了更多的成功运行,并且学习速度更快。LSTM还解决了以前的递归网络算法所不能解决的复杂、人工的长时间滞后问题。
1 INTRODUCTION
Recurrent networks can in principle use their feedback connections to store representations of recent input events in form of activations (\short-term memory", as opposed to \long-term memory" embodied by slowly changing weights). This is potentially signicant for many applications, including speech processing, non-Markovian control, and music composition (e.g., Mozer 1992). The most widely used algorithms for learning what to put in short-term memory, however, take too much time or do not work well at all, especially when minimal time lags between inputs and corresponding teacher signals are long. Although theoretically fascinating, existing methods do not provide clear practical advantages over, say, backprop in feedforward nets with limited time windows. This paper will review an analysis of the problem and suggest a remedy.
递归网络原则上可以使用它们的反馈连接以激活的形式存储最近输入事件的表示(“短期记忆”,而不是“长期记忆”,后者由缓慢变化的权重表示)。这对许多应用程序都有潜在的重要性,包括语音处理、非马尔可夫控制和音乐作曲(例如,Mozer 1992)。然而,最广泛使用的学习短期记忆的算法要么花费了太多时间,要么根本就不能很好地工作,尤其是在输入和相应教师信号之间的最小时滞很长时。虽然理论上很吸引人,但现有的方法并没有提供明显的实际优势,例如,在有限时间窗口的前馈网络中,backprop。本文将对这一问题进行分析,并提出解决办法。
The problem. With conventional \Back-Propagation Through Time" (BPTT, e.g., Williams and Zipser 1992, Werbos 1988) or \Real-Time Recurrent Learning" (RTRL, e.g., Robinson and Fallside 1987), error signals \ owing backwards in time" tend to either (1) blow up or (2) vanish: the temporal evolution of the backpropagated error exponentially depends on the size of the weights (Hochreiter 1991). Case (1) may lead to oscillating weights, while in case (2) learning to bridge long time lags takes a prohibitive amount of time, or does not work at all (see section 3). 这个问题。与传统\反向传播通过时间”(BPTT,例如,1992年威廉姆斯和拉链,Werbos 1988)或\实时复发性学习”(RTRL,例如,罗宾逊和Fallside 1987),误差信号在时间上向后\由于”倾向于(1)炸毁或(2):消失的时间演化backpropagated误差指数的大小取决于重量(Hochreiter 1991)。情形(1)可能会导致权值的振荡,而情形(2)学习如何桥接长时间滞后的情况会花费大量的时间,或者根本不起作用(参见第3节)。
The remedy. This paper presents \Long Short-Term Memory" (LSTM), a novel recurrent network architecture in conjunction with an appropriate gradient-based learning algorithm. LSTM is designed to overcome these error back- ow problems. It can learn to bridge time intervals in excess of 1000 steps even in case of noisy, incompressible input sequences, without loss of short time lag capabilities. This is achieved by an ecient, gradient-based algorithm for an architecture enforcing constant (thus neither exploding nor vanishing) error ow through internal states of special units (provided the gradient computation is truncated at certain architecture-specic points | this does not aect long-term error ow though). 补救措施。本文提出了一种新的递归网络结构——长短时记忆(LSTM),并结合适当的梯度学习算法。LSTM的设计就是为了克服这些错误的反向问题。它可以学习桥接超过1000步的时间间隔,即使在有噪声、不可压缩的输入序列的情况下,也不会损失短时间延迟能力。这是通过一种特殊的、基于梯度的算法来实现的,它针对的是一种通过特殊单元的内部状态来执行常量(因此既不会爆炸也不会消失)的错误(假设梯度计算在某些特定的体系结构点|被截断,但这并不影响长期的错误)。
Outline of paper. Section 2 will brie y review previous work. Section 3 begins with an outline of the detailed analysis of vanishing errors due to Hochreiter (1991). It will then introduce a naive approach to constant error backprop for didactic purposes, and highlight its problems concerning information storage and retrieval. These problems will lead to the LSTM architecture as described in Section 4. Section 5 will present numerous experiments and comparisons with competing methods. LSTM outperforms them, and also learns to solve complex, articial tasks no other recurrent net algorithm has solved. Section 6 will discuss LSTM's limitations and advantages. The appendix contains a detailed description of the algorithm (A.1), and explicit error ow formulae (A.2). 第二部分将简要回顾以前的工作。第3节以详细分析Hochreiter(1991)所造成的消失误差的大纲开始。然后,它将介绍一种用于教学目的的幼稚的不断错误支持方法,并突出其在信息存储和检索方面的问题。这些问题将导致第4节中描述的LSTM体系结构。第5节将提供大量的实验和与竞争方法的比较。LSTM比它们做得更好,而且还学会了解决复杂的人工任务,这是其他递归网络算法所不能解决的。第6节将讨论LSTM的局限性和优点。附录中有算法的详细描述(a .1),以及公式的显式误差(a .2)。
2 PREVIOUS WORK
This section will focus on recurrent nets with time-varying inputs (as opposed to nets with stationary inputs and xpoint-based gradient calculations, e.g., Almeida 1987, Pineda 1987).
本节将集中讨论具有时变输入的递归网络(而不是具有固定输入和基于x点的梯度计算的网络,例如Almeida 1987和Pineda 1987)。
Gradient-descent variants. The approaches of Elman (1988), Fahlman (1991), Williams (1989), Schmidhuber (1992a), Pearlmutter (1989), and many of the related algorithms in Pearlmutter's comprehensive overview (1995) suer from the same problems as BPTT and RTRL (see Sections 1 and 3).
梯度下降法变体。Elman(1988)、Fahlman(1991)、Williams(1989)、Schmidhuber (1992a)、Pearlmutter(1989)的方法,以及Pearlmutter的综合综述(1995)中的许多相关算法,都是从与BPTT和RTRL相同的问题中提出的(见第1节和第3节)
Time-delays. Other methods that seem practical for short time lags only are Time-Delay Neural Networks (Lang et al. 1990) and Plate's method (Plate 1993), which updates unit activations based on a weighted sum of old activations (see also de Vries and Principe 1991). Lin et al. (1995) propose variants of time-delay networks called NARX networks.
时间延迟。其他似乎只适用于短时间滞后的方法有时滞神经网络(Lang et al. 1990)和Plate法(Plate 1993),后者基于旧激活的加权和更新单位激活(参见de Vries和Principe 1991)。Lin等人(1995)提出了时延网络的变体NARX网络。
Time constants. To deal with long time lags, Mozer (1992) uses time constants in uencing changes of unit activations (deVries and Principe's above-mentioned approach (1991) may in fact be viewed as a mixture of TDNN and time constants). For long time lags, however, the time constants need external ne tuning (Mozer 1992). Sun et al.'s alternative approach (1993) updates the activation of a recurrent unit by adding the old activation and the (scaled) current net input. The net input, however, tends to perturb the stored information, which makes long-term storage impractical.
时间常量。为了处理长时间滞后,Mozer(1992)使用时间常数来表示单位激活的变化(deVries and Principe’s上述方法(1991)实际上可以看作是TDNN和时间常数的混合物)。然而,对于长时间滞后,时间常数需要外部ne调谐(Mozer 1992)。Sun等人的替代方法(1993)通过添加旧的激活和(缩放的)当前净输入来更新一个经常性单元的激活。然而,净输入往往会干扰所存储的信息,这使得长期存储变得不切实际。
Ring's approach. Ring (1993) also proposed a method for bridging long time lags. Whenever a unit in his network receives con icting error signals, he adds a higher order unit in uencing appropriate connections. Although his approach can sometimes be extremely fast, to bridge a time lag involving 100 steps may require the addition of 100 units. Also, Ring's net does not generalize to unseen lag durations.
环的方法。Ring(1993)也提出了一种桥接长时间滞后的方法。当他的网络中的一个单元接收到通信错误信号时,他就增加一个更高阶的单元来建立适当的连接。虽然他的方法有时非常快,但要跨越100步的时间延迟可能需要增加100个单元。同样,环网也不能推广到看不见的滞后时间。
Bengio et al.'s approaches. Bengio et al. (1994) investigate methods such as simulated annealing, multi-grid random search, time-weighted pseudo-Newton optimization, and discrete error propagation. Their \latch" and \2-sequence" problems are very similar to problem 3a with minimal time lag 100 (see Experiment 3). Bengio and Frasconi (1994) also propose an EM approach for propagating targets. With n so-called \state networks", at a given time, their system can be in one of only n dierent states. See also beginning of Section 5. But to solve continuous problems such as the \adding problem" (Section 5.4), their system would require an unacceptable number of states (i.e., state networks).
Bengio等人的方法。Bengio等人(1994)研究了模拟退火、多网格随机搜索、时间加权伪牛顿优化和离散误差传播等方法。他们的“闩锁”和“2-序列”问题与3a问题非常相似,只有最小的滞后时间100(见实验3)。Bengio和Frasconi(1994)也提出了一种EM方法来传播目标。对于n个所谓的“状态网络”,在给定的时间内,它们的系统只能处于n种不同状态中的一种。参见第5节的开头。但是,为了解决诸如“\添加问题”(第5.4节)之类的连续问题,它们的系统将需要不可接受的状态数(即状态的网络)。
Kalman lters. Puskorius and Feldkamp (1994) use Kalman lter techniques to improve recurrent net performance. Since they use \a derivative discount factor imposed to decay exponentially the eects of past dynamic derivatives," there is no reason to believe that their Kalman Filter Trained Recurrent Networks will be useful for very long minimal time lags. Second order nets. We will see that LSTM uses multiplicative units (MUs) to protect error ow from unwanted perturbations. It is not the rst recurrent net method using MUs though. For instance, Watrous and Kuhn (1992) use MUs in second order nets. Some dierences to LSTM are:
(1) Watrous and Kuhn's architecture does not enforce constant error ow and is not designed to solve long time lag problems.
(2) It has fully connected second-order sigma-pi units, while the LSTM architecture's MUs are used only to gate access to constant error ow.
(3) Watrous and Kuhn's algorithm costs O(W2 ) operations per time step, ours only O(W), where W is the number of weights. See also Miller and Giles (1993) for additional work on MUs.
Kalman lters. Puskorius and Feldkamp (1994)使用Kalman lter技术来提高经常性净绩效。由于他们使用一个衍生品折扣因子来指数衰减过去动态衍生品的影响,“我们没有理由相信他们的卡尔曼滤波训练的递归网络在很长一段时间内都是有用的。”二阶网络。我们将看到LSTM使用乘法单位(MUs)来保护错误不受不必要的干扰。但它不是第一个使用MUs的递归网络方法。例如,Watrous和Kuhn(1992)在二阶网中使用MUs。LSTM的一些不同之处是:(1)Watrous和Kuhn的架构不强制恒定的错误,也不是为了解决长时间滞后的问题而设计的。
(2)它具有完全连通的二阶sigma-pi单元,而LSTM体系结构的MUs仅用于对恒定误差低的门访问。
(3) Watrous和Kuhn的算法每时间步需要O(W2)个操作,我们的算法只需要O(W)个操作,其中W是权值的个数。有关MUs的其他工作也见Miller和Giles(1993)。
Simple weight guessing. To avoid long time lag problems of gradient-based approaches we may simply randomly initialize all network weights until the resulting net happens to classify all training sequences correctly. In fact, recently we discovered (Schmidhuber and Hochreiter 1996, Hochreiter and Schmidhuber 1996, 1997) that simple weight guessing solves many of the problems in (Bengio 1994, Bengio and Frasconi 1994, Miller and Giles 1993, Lin et al. 1995) faster than the algorithms proposed therein. This does not mean that weight guessing is a good algorithm. It just means that the problems are very simple. More realistic tasks require either many free parameters (e.g., input weights) or high weight precision (e.g., for continuous-valued parameters), such that guessing becomes completely infeasible.
简单的猜测。为了避免基于梯度的方法的长时间滞后问题,我们可以简单地随机初始化所有网络权值,直到最终得到的网络正确地对所有训练序列进行分类。事实上,最近我们发现(Schmidhuber and Hochreiter 1996, Hochreiter and Schmidhuber 1996, 1997)简单的重量猜测解决了(Bengio 1994, Bengio and Frasconi 1994, Miller and Giles 1993, Lin et al. 1995)中的许多问题,比其中提出的算法更快。这并不意味着猜测权重是一个好的算法。这意味着问题很简单。更实际的任务需要许多自由参数(例如,输入权值)或较高的权值精度(例如,连续值参数),这样猜测就变得完全不可行的。
Adaptive sequence chunkers. Schmidhuber's hierarchical chunker systems (1992b, 1993) do have a capability to bridge arbitrary time lags, but only if there is local predictability across the subsequences causing the time lags (see also Mozer 1992). For instance, in his postdoctoral thesis (1993), Schmidhuber uses hierarchical recurrent nets to rapidly solve certain grammar learning tasks involving minimal time lags in excess of 1000 steps. The performance of chunker systems, however, deteriorates as the noise level increases and the input sequences become less compressible. LSTM does not suer from this problem. 自适应序列chunkers。Schmidhuber的分层chunker系统(1992b, 1993)确实具有桥接任意时间滞后的能力,但前提是子序列具有局部可预测性,从而导致时间滞后(参见Mozer 1992)。例如,在他的博士后论文(1993)中,Schmidhuber使用层次递归网络来快速解决某些语法学习任务,这些任务涉及的时间延迟最小,超过了1000步。然而,随着噪声水平的提高和输入序列的可压缩性的降低,chunker系统的性能会下降。LSTM不能解决这个问题。