3 CONSTANT ERROR BACKPROP
3.1 EXPONENTIALLY DECAYING ERROR
Conventional BPTT (e.g. Williams and Zipser 1992). Output unit $k$'s target at time $t$ is denoted by $d_k(t)$. Using mean squared error, $k$'s error signal is
$$\vartheta_k(t) = f'_k(\mathrm{net}_k(t))\,\bigl(d_k(t) - y^k(t)\bigr),$$
where $y^i(t) = f_i(\mathrm{net}_i(t))$ is the activation of a non-input unit $i$ with differentiable activation function $f_i$, $\mathrm{net}_i(t) = \sum_j w_{ij}\, y^j(t-1)$ is unit $i$'s current net input, and $w_{ij}$ is the weight on the connection from unit $j$ to unit $i$. Some non-output unit $j$'s backpropagated error signal is
$$\vartheta_j(t) = f'_j(\mathrm{net}_j(t)) \sum_i w_{ij}\,\vartheta_i(t+1).$$
The corresponding contribution to $w_{jl}$'s total weight update is $\alpha\,\vartheta_j(t)\,y^l(t-1)$, where $\alpha$ is the learning rate, and $l$ stands for an arbitrary unit connected to unit $j$.
Outline of Hochreiter's analysis (1991, pages 19-21). Suppose we have a fully connected net whose non-input unit indices range from 1 to $n$. Let us focus on local error flow from unit $u$ to unit $v$ (later we will see that the analysis immediately extends to global error flow). The error occurring at an arbitrary unit $u$ at time step $t$ is propagated "back into time" for $q$ time steps, to an arbitrary unit $v$. This will scale the error by the following factor:
$$\frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} =
\begin{cases}
f'_v(\mathrm{net}_v(t-1))\, w_{uv} & q = 1,\\[4pt]
f'_v(\mathrm{net}_v(t-q)) \displaystyle\sum_{l=1}^{n} \frac{\partial \vartheta_l(t-q+1)}{\partial \vartheta_u(t)}\, w_{lv} & q > 1.
\end{cases}$$
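To make the decay concrete, here is a minimal numerical sketch, not taken from the paper, of how an error signal propagated back along a single path for $q$ steps is scaled by a product of $q$ factors of the form $f'(\mathrm{net})\,w$. The weight and net-input values below are arbitrary assumptions; with a logistic sigmoid, $f' \le 0.25$, so each factor has magnitude below 1 whenever $|w| < 4.0$ and the product shrinks exponentially in $q$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical single back-propagation path: after q steps the error is scaled by a
# product of q factors of the form f'(net) * w.
w = 2.0      # assumed recurrent weight (illustrative)
net = 0.5    # assumed constant net input (illustrative)
factor = sigmoid(net) * (1.0 - sigmoid(net)) * w   # f'(net) * w, roughly 0.47 here

for q in (1, 5, 10, 50):
    print(f"q = {q:2d}   error scaling: {factor ** q:.3e}")
```

Conversely, if the magnitude of $f'(\mathrm{net})\,w$ exceeds 1.0, the back-flowing error blows up instead of vanishing; both regimes make learning long time lag dependencies hard.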
3.2 CONSTANT ERROR FLOW: NAIVE APPROACH
A single unit. To avoid vanishing error signals, how can we achieve constant error flow through a single unit $j$ with a single connection to itself? According to the rules above, at time $t$, $j$'s local error back flow is $\vartheta_j(t) = f'_j(\mathrm{net}_j(t))\,\vartheta_j(t+1)\,w_{jj}$. To enforce constant error flow through $j$, we require $f'_j(\mathrm{net}_j(t))\,w_{jj} = 1.0$. Integrating this gives $f_j(\mathrm{net}_j(t)) = \mathrm{net}_j(t)/w_{jj}$ for arbitrary $\mathrm{net}_j(t)$: $f_j$ has to be linear, and unit $j$'s activation has to remain constant, $y^j(t+1) = f_j(\mathrm{net}_j(t+1)) = f_j(w_{jj}\,y^j(t)) = y^j(t)$.
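As a quick check of this rule, the sketch below (illustrative only; the step count and factor values are assumptions, not from the paper) contrasts the local back flow through a single self-connected unit when $f'_j\,w_{jj} = 1.0$ with cases where the factor's magnitude is below 1:

```python
def backflow(q, w_jj, f_prime):
    """Scaling of unit j's local error after q steps of back flow:
    theta_j(t) = f'_j(net_j(t)) * theta_j(t+1) * w_jj, with f'_j held constant here."""
    err = 1.0
    for _ in range(q):
        err *= f_prime * w_jj
    return err

print(backflow(100, w_jj=1.0, f_prime=1.0))   # identity f_j, w_jj = 1.0: error stays 1.0
print(backflow(100, w_jj=0.9, f_prime=1.0))   # |f' * w| < 1: error vanishes (~2.7e-5)
print(backflow(100, w_jj=1.0, f_prime=0.25))  # logistic-like f': error vanishes (~6.2e-61)
```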
In the experiments, this will be ensured by using the identity function $f_j$: $f_j(x) = x,\ \forall x$, and by setting $w_{jj} = 1.0$. We refer to this as the constant error carrousel (CEC). CEC will be LSTM's central feature (see Section 4). Of course unit $j$ will not only be connected to itself but also to other units. This invokes two obvious, related problems (also inherent in all other gradient-based approaches):
1. Input weight conflict: for simplicity, let us focus on a single additional input weight $w_{ji}$. Assume that the total error can be reduced by switching on unit $j$ in response to a certain input, and keeping it active for a long time (until it helps to compute a desired output). Provided $y^i$ is non-zero, since the same incoming weight has to be used for both storing certain inputs and ignoring others, $w_{ji}$ will often receive conflicting weight update signals during this time (recall that $j$ is linear): these signals will attempt to make $w_{ji}$ participate in (1) storing the input (by switching on $j$) and (2) protecting the input (by preventing $j$ from being switched off by irrelevant later inputs). This conflict makes learning difficult, and calls for a more context-sensitive mechanism for controlling "write operations" through input weights.
2. Output weight conflict: assume $j$ is switched on and currently stores some previous input. For simplicity, let us focus on a single additional outgoing weight $w_{kj}$. The same $w_{kj}$ has to be used for both retrieving $j$'s content at certain times and preventing $j$ from disturbing $k$ at other times. As long as unit $j$ is non-zero, $w_{kj}$ will attract conflicting weight update signals generated during sequence processing: these signals will attempt to make $w_{kj}$ participate in (1) accessing the information stored in $j$ and, at different times, (2) protecting unit $k$ from being perturbed by $j$. For instance, with many tasks there are certain "short time lag errors" that can be reduced in early training stages. However, at later training stages $j$ may suddenly start to cause avoidable errors in situations that already seemed under control, by attempting to participate in reducing more difficult "long time lag errors". Again, this conflict makes learning difficult, and calls for a more context-sensitive mechanism for controlling "read operations" through output weights.
Of course, input and output weight conflicts are not specific to long time lags, but occur for short time lags as well. Their effects, however, become particularly pronounced in the long time lag case: as the time lag increases, (1) stored information must be protected against perturbation for longer and longer periods, and, especially in advanced stages of learning, (2) more and more already correct outputs also require protection against perturbation.
Due to the problems above, the naive approach does not work well except in case of certain simple problems involving local input/output representations and non-repeating input patterns (see Hochreiter 1991 and Silva et al. 1996). The next section shows how to do it right.
4 LONG SHORT-TERM MEMORY
Memory cells and gate units. To construct an architecture that allows for constant error flow through special, self-connected units without the disadvantages of the naive approach, we extend the constant error carrousel CEC embodied by the self-connected, linear unit $j$ from Section 3.2 by introducing additional features. A multiplicative input gate unit is introduced to protect the memory contents stored in $j$ from perturbation by irrelevant inputs. Likewise, a multiplicative output gate unit is introduced which protects other units from perturbation by currently irrelevant memory contents stored in $j$.
Figure 1: Architecture of memory cell $c_j$ (the box) and its gate units $in_j$, $out_j$. The self-recurrent connection (with weight 1.0) indicates feedback with a delay of 1 time step. It builds the basis of the "constant error carrousel" CEC. The gate units open and close access to CEC. See text and appendix A.1 for details.
Why gate units? To avoid input weight conflicts, $in_j$ controls the error flow to memory cell $c_j$'s input connections $w_{c_j i}$. To circumvent $c_j$'s output weight conflicts, $out_j$ controls the error flow from unit $j$'s output connections. In other words, the net can use $in_j$ to decide when to keep or override information in memory cell $c_j$, and $out_j$ to decide when to access memory cell $c_j$ and when to prevent other units from being perturbed by $c_j$ (see Figure 1).
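To make the roles of the two gates concrete, here is a minimal sketch of one forward step of a single memory cell of this kind (the 1997 cell has no forget gate; the internal state simply accumulates). The tanh squashing functions and the simplified way net inputs are formed from a single source vector are illustrative assumptions, not the paper's exact $g$, $h$, and topology:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def memory_cell_step(z, s_prev, w_in, w_out, w_c):
    """One forward step of a single memory cell c_j with gate units in_j and out_j.

    z      : vector of source-unit activations feeding the cell and its gates
    s_prev : previous internal state s_{c_j}(t-1)
    w_in, w_out, w_c : hypothetical weight vectors of in_j, out_j, and the cell input
    """
    y_in = sigmoid(w_in @ z)       # input gate activation  y^{in_j}(t)
    y_out = sigmoid(w_out @ z)     # output gate activation y^{out_j}(t)
    g = np.tanh(w_c @ z)           # squashed cell input g(net_{c_j}(t))
    s = s_prev + y_in * g          # CEC: s_{c_j}(t) = s_{c_j}(t-1) + y^{in_j}(t) g(net_{c_j}(t))
    y_c = y_out * np.tanh(s)       # cell output y^{c_j}(t) = y^{out_j}(t) h(s_{c_j}(t))
    return y_c, s

# Toy usage with random weights and an arbitrary previously stored internal state of 0.7.
rng = np.random.default_rng(0)
z = rng.normal(size=4)
w_in, w_out, w_c = rng.normal(size=(3, 4))
print(memory_cell_step(z, s_prev=0.7, w_in=w_in, w_out=w_out, w_c=w_c))
```

Driving the input gate's net input strongly negative leaves the stored state untouched by later inputs, while a closed output gate shields the rest of the net from $c_j$'s content.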
Error signals trapped within a memory cell's CEC cannot change, but different error signals flowing into the cell (at different times) via its output gate may get superimposed. The output gate will have to learn which errors to trap in its CEC, by appropriately scaling them. The input gate will have to learn when to release errors, again by appropriately scaling them. Essentially, the multiplicative gate units open and close access to constant error flow through CEC.
Distributed output representations typically do require output gates. Not always are both gate types necessary, though; one may be sufficient. For instance, in Experiments 2a and 2b in Section 5, it will be possible to use input gates only. In fact, output gates are not required in case of local output encoding; preventing memory cells from perturbing already learned outputs can be done by simply setting the corresponding weights to zero. Even in this case, however, output gates can be beneficial: they prevent the net's attempts at storing long time lag memories (which are usually hard to learn) from perturbing activations representing easily learnable short time lag memories. (This will prove quite useful in Experiment 1, for instance.)
Network topology. We use networks with one input layer, one hidden layer, and one output layer. The (fully) self-connected hidden layer contains memory cells and corresponding gate units (for convenience, we refer to both memory cells and gate units as being located in the hidden layer). The hidden layer may also contain "conventional" hidden units providing inputs to gate units and memory cells. All units (except for gate units) in all layers have directed connections (serve as inputs) to all units in the layer above (or to all higher layers; Experiments 2a and 2b).
Memory cell blocks. $S$ memory cells sharing the same input gate and the same output gate form a structure called a "memory cell block of size $S$". Memory cell blocks facilitate information storage; as with conventional neural nets, it is not so easy to code a distributed input within a single cell. Since each memory cell block has as many gate units as a single memory cell (namely two), the block architecture can be even slightly more efficient (see paragraph "computational complexity"). A memory cell block of size 1 is just a simple memory cell. In the experiments (Section 5), we will use memory cell blocks of various sizes.
Learning. We use a variant of RTRL (e.g., Robinson and Fallside 1987) which properly takes into account the altered, multiplicative dynamics caused by input and output gates. However, to ensure non-decaying error backprop through internal states of memory cells, as with truncated BPTT (e.g., Williams and Peng 1990), errors arriving at "memory cell net inputs" (for cell $c_j$, this includes $\mathrm{net}_{c_j}$, $\mathrm{net}_{in_j}$, $\mathrm{net}_{out_j}$) do not get propagated back further in time (although they do serve to change the incoming weights). Only within memory cells are errors propagated back through previous internal states $s_{c_j}$. To visualize this: once an error signal arrives at a memory cell output, it gets scaled by output gate activation and $h'$. Then it is within the memory cell's CEC, where it can flow back indefinitely without ever being scaled. Only when it leaves the memory cell through the input gate and $g$ is it scaled once more, by input gate activation and $g'$. It then serves to change the incoming weights before it is truncated (see appendix for explicit formulae).
Computational complexity. As with Mozer's focused recurrent backprop algorithm (Mozer 1989), only the derivatives $\partial s_{c_j} / \partial w_{il}$ need to be stored and updated. Hence the LSTM algorithm is very efficient, with an excellent update complexity of $O(W)$, where $W$ is the number of weights (see details in appendix A.1). Hence, LSTM and BPTT for fully recurrent nets have the same update complexity per time step (while RTRL's is much worse). Unlike full BPTT, however, LSTM is local in space and time: there is no need to store activation values observed during sequence processing in a stack with potentially unlimited size.
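As a rough illustration of why the update is $O(W)$ and local in space and time, the sketch below keeps one running partial $\partial s_{c_j}/\partial w_{il}$ per weight feeding the cell and updates it at every time step, in the spirit of appendix A.1. The expressions are a simplified rendering under assumed notation (a single scalar cell, one source activation per weight), not the paper's exact formulae:

```python
import numpy as np

def accumulate_partials(dS_c, dS_in, y_src, y_in, g_val, g_prime, f_in_prime):
    """Update the stored partials d s_{c_j} / d w for the cell-input weights (dS_c)
    and the input-gate weights (dS_in); y_src holds the source activations y^l(t-1)."""
    dS_c = dS_c + y_in * g_prime * y_src        # contribution via g(net_{c_j})
    dS_in = dS_in + g_val * f_in_prime * y_src  # contribution via y^{in_j}
    return dS_c, dS_in

def weight_deltas(alpha, e_s, dS):
    """Weight change alpha * e_{s_{c_j}}(t) * d s_{c_j} / d w_{il}, where e_s is the
    (truncated) error arriving at the cell's internal state."""
    return alpha * e_s * dS

# One running partial per incoming weight: storage and work per step stay O(W).
dS_c, dS_in = np.zeros(3), np.zeros(3)
y_src = np.array([0.2, -0.1, 0.5])              # assumed source activations y^l(t-1)
dS_c, dS_in = accumulate_partials(dS_c, dS_in, y_src,
                                  y_in=0.6, g_val=0.3, g_prime=0.9, f_in_prime=0.24)
print(weight_deltas(alpha=0.1, e_s=0.05, dS=dS_c))
```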
Abuse problem and solutions. In the beginning of the learning phase, error reduction may be possible without storing information over time. The network will thus tend to abuse memory cells, e.g., as bias cells (i.e., it might make their activations constant and use the outgoing connections as adaptive thresholds for other units). The potential difficulty is: it may take a long time to release abused memory cells and make them available for further learning. A similar "abuse problem" appears if two memory cells store the same (redundant) information. There are at least two solutions to the abuse problem: (1) Sequential network construction (e.g., Fahlman 1991): a memory cell and the corresponding gate units are added to the network whenever the error stops decreasing (see Experiment 2 in Section 5). (2) Output gate bias: each output gate gets a negative initial bias, to push initial memory cell activations towards zero. Memory cells with more negative bias automatically get "allocated" later (see Experiments 1, 3, 4, 5, 6 in Section 5).
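A minimal sketch of solution (2): give each output gate a negative initial bias, staggered so that cells with more negative bias get "allocated" later. The concrete values below are illustrative assumptions, not the biases used in the paper's experiments:

```python
import numpy as np

n_cells = 4
# Staggered negative initial biases for output gates out_1 .. out_4 (illustrative values).
out_gate_bias = np.array([-1.0 * (k + 1) for k in range(n_cells)])   # [-1., -2., -3., -4.]

# A more negative bias keeps y^{out_j} = sigmoid(net_{out_j} + bias) closer to zero at the
# start of training, so that cell's output stays near zero until it is actually needed.
print(1.0 / (1.0 + np.exp(-out_gate_bias)))
```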
Internal state drift and remedies. If memory cell $c_j$'s inputs are mostly positive or mostly negative, then its internal state $s_j$ will tend to drift away over time. This is potentially dangerous, for $h'(s_j)$ will then adopt very small values, and the gradient will vanish. One way to circumvent this problem is to choose an appropriate function $h$. But $h(x) = x$, for instance, has the disadvantage of unrestricted memory cell output range. Our simple but effective way of solving drift problems at the beginning of learning is to initially bias the input gate $in_j$ towards zero. Although there is a trade-off between the magnitudes of $h'(s_j)$ on the one hand and of $y^{in_j}$ and $f'_{in_j}$ on the other, the potential negative effect of input gate bias is negligible compared to the one of the drifting effect. With logistic sigmoid activation functions, there appears to be no need for fine-tuning the initial bias, as confirmed by Experiments 4 and 5 in Section 5.4.
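A small numerical sketch of the drift effect under illustrative assumptions (constant positive gate and cell inputs, $h = \tanh$, 50 steps; none of these values come from the paper): without a negative input gate bias the internal state grows steadily and $h'(s_j)$ collapses, while a negative bias keeps the state, and hence the gradient through $h$, in a usable range early in learning:

```python
import numpy as np

def drift(steps, in_gate_bias, net_in=0.5, cell_input=0.8):
    """Accumulate s_j over `steps` time steps with a constant positive cell input."""
    s = 0.0
    for _ in range(steps):
        y_in = 1.0 / (1.0 + np.exp(-(net_in + in_gate_bias)))  # input gate activation
        s += y_in * cell_input                                   # s_j(t) = s_j(t-1) + y^in * g
    h_prime = 1.0 - np.tanh(s) ** 2                              # derivative of h = tanh at s_j
    return s, h_prime

print(drift(50, in_gate_bias=0.0))    # s_j drifts to ~25, h'(s_j) is essentially zero
print(drift(50, in_gate_bias=-4.0))   # negative bias keeps s_j near 1.2, h'(s_j) ~ 0.3
```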