LSTM: Translation and Commentary on "Long Short-Term Memory" (Part 2) - Alibaba Cloud Developer Community




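Conventional BPTT (e.g. Williams and Zipser 1992). Output unit $k$'s target at time $t$ is denoted by $d_k(t)$. Using mean squared error, $k$'s error signal is

$$\vartheta_k(t) = f'_k(net_k(t))\,(d_k(t) - y^k(t)),$$

where $y^i(t) = f_i(net_i(t))$ is the activation of a non-input unit $i$ with differentiable activation function $f_i$, $net_i(t) = \sum_j w_{ij}\, y^j(t-1)$ is unit $i$'s current net input, and $w_{ij}$ is the weight on the connection from unit $j$ to unit $i$. Some non-output unit $j$'s backpropagated error signal is

$$\vartheta_j(t) = f'_j(net_j(t)) \sum_i w_{ij}\, \vartheta_i(t+1).$$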

The corresponding contribution to $w_{jl}$'s total weight update is $\alpha\, \vartheta_j(t)\, y^l(t-1)$, where $\alpha$ is the learning rate, and $l$ stands for an arbitrary unit connected to unit $j$.

Outline of Hochreiter's analysis (1991, pages 19-21). Suppose we have a fully connected net whose non-input unit indices range from 1 to $n$. Let us focus on local error flow from unit $u$ to unit $v$ (later we will see that the analysis immediately extends to global error flow). The error occurring at an arbitrary unit $u$ at time step $t$ is propagated "back into time" for $q$ time steps, to an arbitrary unit $v$. This will scale the error by the following factor:

$$\frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} = \begin{cases} f'_v(net_v(t-1))\, w_{uv} & q = 1 \\ f'_v(net_v(t-q)) \sum_{l=1}^{n} \dfrac{\partial \vartheta_l(t-q+1)}{\partial \vartheta_u(t)}\, w_{lv} & q > 1 \end{cases}$$
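The scaling of a local error over $q$ steps can be evaluated numerically by applying the backpropagation rule once per step. The following is a minimal sketch, not the authors' code: the function name `error_scaling`, the tiny 3-unit net, the all-ones weights, and the assumption that each $f'_i(net_i)$ stays constant over time are illustrative choices made here.

```python
def error_scaling(w, fprime, u, v, q):
    """Factor by which an error at unit u is scaled after flowing back q steps to unit v.

    w[l][i] is the weight on the connection from unit i to unit l (w_li).
    fprime[i] is f'_i(net_i), assumed constant over time in this sketch.
    """
    n = len(w)
    # q = 1: f'_i(net_i) * w_ui for every candidate target unit i
    d = [fprime[i] * w[u][i] for i in range(n)]
    # q > 1: f'_i(net_i) * sum over l of (factor for l after q-1 steps) * w_li
    for _ in range(q - 1):
        d = [fprime[i] * sum(d[l] * w[l][i] for l in range(n)) for i in range(n)]
    return d[v]


# Hypothetical example: 3 fully connected units, all weights 1.0, and the
# largest possible logistic-sigmoid derivative (0.25) at every unit.
n = 3
w = [[1.0] * n for _ in range(n)]
fprime = [0.25] * n
factors = [error_scaling(w, fprime, u=0, v=1, q=q) for q in (1, 2, 5, 10)]
print(factors)
```

In this setting every extra step multiplies the factor by $3 \times 0.25 = 0.75$, so the error shrinks geometrically with $q$; larger weights would instead make it grow.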




A single unit. To avoid vanishing error signals, how can we achieve constant error flow through a single unit $j$ with a single connection to itself? According to the rules above, at time $t$, $j$'s local error back flow is $\vartheta_j(t) = f'_j(net_j(t))\, \vartheta_j(t+1)\, w_{jj}$. To enforce constant error flow through $j$, we require

$$f'_j(net_j(t))\, w_{jj} = 1.0.$$
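The local back flow rule for a single self-connected unit can be iterated directly to see when the error stays constant. This is a small sketch under simplifying assumptions made here (the net input is held fixed, so the per-step factor $f'_j(net_j)\, w_{jj}$ is the same at every step); `backflow` and `sigmoid_prime` are hypothetical helper names.

```python
import math


def sigmoid_prime(x):
    """Derivative of the logistic sigmoid at x."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def backflow(fprime, w_jj, q, error=1.0):
    """Apply theta_j(t) = f'_j(net_j(t)) * theta_j(t+1) * w_jj for q steps back in time."""
    for _ in range(q):
        error = fprime * error * w_jj
    return error


q = 50
# Logistic unit at net = 0 with self-weight 1.0: factor 0.25 per step.
sigmoid_error = backflow(sigmoid_prime(0.0), w_jj=1.0, q=q)
# Identity activation (f' = 1) with self-weight 1.0: factor exactly 1 per step.
linear_error = backflow(1.0, w_jj=1.0, q=q)
print(sigmoid_error, linear_error)
```

The sigmoid unit's error collapses toward zero after a few dozen steps, while the linear unit with self-weight 1.0 passes the error back unchanged.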

In the experiments, this will be ensured by using the identity function $f_j$: $f_j(x) = x, \forall x$, and by setting $w_{jj} = 1.0$. We refer to this as the constant error carrousel (CEC). CEC will be LSTM's central feature (see Section 4). Of course unit $j$ will not only be connected to itself but also to other units. This invokes two obvious, related problems (also inherent in all other gradient-based approaches):

1. Input weight conflict: for simplicity, let us focus on a single additional input weight $w_{ji}$. Assume that the total error can be reduced by switching on unit $j$ in response to a certain input, and keeping it active for a long time (until it helps to compute a desired output). Provided $i$ is non-zero, since the same incoming weight has to be used for both storing certain inputs and ignoring others, $w_{ji}$ will often receive conflicting weight update signals during this time (recall that $j$ is linear): these signals will attempt to make $w_{ji}$ participate in (1) storing the input (by switching on $j$) and (2) protecting the input (by preventing $j$ from being switched off by irrelevant later inputs). This conflict makes learning difficult, and calls for a more context-sensitive mechanism for controlling "write operations" through input weights.

2. Output weight conflict: assume $j$ is switched on and currently stores some previous input. For simplicity, let us focus on a single additional outgoing weight $w_{kj}$. The same $w_{kj}$ has to be used for both retrieving $j$'s content at certain times and preventing $j$ from disturbing $k$ at other times. As long as unit $j$ is non-zero, $w_{kj}$ will attract conflicting weight update signals generated during sequence processing: these signals will attempt to make $w_{kj}$ participate in (1) accessing the information stored in $j$ and, at different times, (2) protecting unit $k$ from being perturbed by $j$. For instance, with many tasks there are certain "short time lag errors" that can be reduced in early training stages. However, at later training stages $j$ may suddenly start to cause avoidable errors in situations that already seemed under control by attempting to participate in reducing more difficult "long time lag errors". Again, this conflict makes learning difficult, and calls for a more context-sensitive mechanism for controlling "read operations" through output weights.

Of course, input and output weight conflicts are not specific to long time lags, but occur for short time lags as well. Their effects, however, become particularly pronounced in the long time lag case: as the time lag increases, (1) stored information must be protected against perturbation for longer and longer periods, and, especially in advanced stages of learning, (2) more and more already correct outputs also require protection against perturbation.

Due to the problems above, the naive approach does not work well except in the case of certain simple problems involving local input/output representations and non-repeating input patterns (see Hochreiter 1991 and Silva et al. 1996). The next section shows how to do it right.
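The input weight conflict can be made concrete with a toy simulation of the naive unit. This is only an illustrative sketch: the function `run_naive_unit`, the input sequence, and the weight values are assumptions chosen here, not taken from the paper.

```python
def run_naive_unit(inputs, w_ji=1.0):
    """Linear self-connected unit j (self-weight 1.0) with one input weight w_ji."""
    y = 0.0
    trace = []
    for x in inputs:
        # Identity activation: the old activation is carried over unchanged
        # and every new input is added through the same weight w_ji.
        y = y + w_ji * x
        trace.append(y)
    return trace


# One relevant input at t = 0, followed by irrelevant later inputs.
inputs = [1.0, 0.5, -0.25, 0.5, 0.25]
trace = run_naive_unit(inputs, w_ji=1.0)
blocked = run_naive_unit(inputs, w_ji=0.0)
print(trace)    # the value stored at t = 0 is overwritten step by step
print(blocked)  # shutting the weight protects the unit but stores nothing
```

A large $w_{ji}$ stores the first input but lets every later input perturb it; a small $w_{ji}$ protects the unit but prevents storage. A single fixed weight cannot do both, which is the motivation for a separate, input-dependent gate.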


Memory cells and gate units. To construct an architecture that allows for constant error flow through special, self-connected units without the disadvantages of the naive approach, we extend the constant error carrousel CEC embodied by the self-connected, linear unit $j$ from Section 3.2 by introducing additional features. A multiplicative input gate unit is introduced to protect the memory contents stored in $j$ from perturbation by irrelevant inputs. Likewise, a multiplicative output gate unit is introduced which protects other units from perturbation by currently irrelevant memory contents stored in $j$.



Figure 1: Architecture of memory cell $c_j$ (the box) and its gate units $in_j$, $out_j$. The self-recurrent connection (with weight 1.0) indicates feedback with a delay of 1 time step. It builds the basis of the "constant error carrousel" CEC. The gate units open and close access to CEC. See text and appendix A.1 for details.

Why gate units? To avoid input weight conflicts, $in_j$ controls the error flow to memory cell $c_j$'s input connections $w_{c_j i}$. To circumvent $c_j$'s output weight conflicts, $out_j$ controls the error flow from unit $j$'s output connections. In other words, the net can use $in_j$ to decide when to keep or override information in memory cell $c_j$, and $out_j$ to decide when to access memory cell $c_j$ and when to prevent other units from being perturbed by $c_j$ (see Figure 1).
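The role of the two gates can be sketched as a single forward step of one memory cell. The update used here, $s_{c_j}(t) = s_{c_j}(t-1) + y^{in_j}(t)\, g(net_{c_j}(t))$ and $y^{c_j}(t) = y^{out_j}(t)\, h(s_{c_j}(t))$, follows the original paper's definitions, which are not reproduced in this section. The squashing ranges of `g` and `h`, the function names, and the net input values are assumptions for illustration.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def g(x):
    """Input squashing function, assumed here to have range (-2, 2)."""
    return 4.0 * logistic(x) - 2.0


def h(x):
    """Output squashing function, assumed here to have range (-1, 1)."""
    return 2.0 * logistic(x) - 1.0


def cell_step(s_prev, net_c, net_in, net_out):
    """One time step of a memory cell with input gate and output gate."""
    y_in = logistic(net_in)        # input gate activation in (0, 1)
    y_out = logistic(net_out)      # output gate activation in (0, 1)
    s = s_prev + y_in * g(net_c)   # CEC: self-connection with weight 1.0
    y_c = y_out * h(s)             # gated cell output
    return s, y_c


# Write: input gate open, output gate closed.
s1, y1 = cell_step(0.0, net_c=2.0, net_in=10.0, net_out=-10.0)
# Hold: both gates closed, an irrelevant input arrives.
s2, y2 = cell_step(s1, net_c=-3.0, net_in=-10.0, net_out=-10.0)
# Read: output gate open, no new cell input.
s3, y3 = cell_step(s2, net_c=0.0, net_in=-10.0, net_out=10.0)
print(s1, s2, s3, y3)
```

With the input gate closed the internal state barely moves despite the irrelevant input, and with the output gate closed the cell contributes almost nothing to other units until it is read.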


Error signals trapped within a memory cell's CEC cannot change, but different error signals flowing into the cell (at different times) via its output gate may get superimposed. The output gate will have to learn which errors to trap in its CEC, by appropriately scaling them. The input gate will have to learn when to release errors, again by appropriately scaling them. Essentially, the multiplicative gate units open and close access to constant error flow through CEC.


Distributed output representations typically do require output gates. Both gate types are not always necessary, though; one may be sufficient. For instance, in Experiments 2a and 2b in Section 5, it will be possible to use input gates only. In fact, output gates are not required in case of local output encoding: preventing memory cells from perturbing already learned outputs can be done by simply setting the corresponding weights to zero. Even in this case, however, output gates can be beneficial: they prevent the net's attempts at storing long time lag memories (which are usually hard to learn) from perturbing activations representing easily learnable short time lag memories. (This will prove quite useful in Experiment 1, for instance.)


Network topology. We use networks with one input layer, one hidden layer, and one output layer. The (fully) self-connected hidden layer contains memory cells and corresponding gate units (for convenience, we refer to both memory cells and gate units as being located in the hidden layer). The hidden layer may also contain "conventional" hidden units providing inputs to gate units and memory cells. All units (except for gate units) in all layers have directed connections (serve as inputs) to all units in the layer above (or to all higher layers, as in Experiments 2a and 2b).

Memory cell blocks. $S$ memory cells sharing the same input gate and the same output gate form a structure called a "memory cell block of size $S$". Memory cell blocks facilitate information storage; as with conventional neural nets, it is not so easy to code a distributed input within a single cell. Since each memory cell block has as many gate units as a single memory cell (namely two), the block architecture can be even slightly more efficient (see paragraph "computational complexity"). A memory cell block of size 1 is just a simple memory cell. In the experiments (Section 5), we will use memory cell blocks of various sizes.
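A memory cell block can be sketched as several internal states driven by one shared pair of gates. In the following hypothetical example, `block_step` is an illustrative name and `math.tanh` stands in for the squashing functions $g$ and $h$, which is an assumption made here for brevity.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def block_step(states, net_cells, net_in, net_out):
    """One step of a memory cell block: S cells, one input gate, one output gate."""
    y_in = logistic(net_in)     # single input gate shared by all cells
    y_out = logistic(net_out)   # single output gate shared by all cells
    new_states = [s + y_in * math.tanh(net) for s, net in zip(states, net_cells)]
    outputs = [y_out * math.tanh(s) for s in new_states]
    return new_states, outputs


# Block of size S = 3: three cell inputs, but only two gate net inputs.
states = [0.0, 0.0, 0.0]
states, outputs = block_step(states, [1.0, -1.0, 0.5], net_in=5.0, net_out=5.0)
# Closing the shared input gate protects all three states at once.
held, _ = block_step(states, [3.0, 3.0, 3.0], net_in=-20.0, net_out=5.0)
print(states, held)
```

The number of gate units stays at two regardless of $S$, which is where the small efficiency gain mentioned above comes from.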





Learning. We use a variant of RTRL (e.g., Robinson and Fallside 1987) which properly takes into account the altered, multiplicative dynamics caused by input and output gates. However, to ensure non-decaying error backprop through internal states of memory cells, as with truncated BPTT (e.g., Williams and Peng 1990), errors arriving at "memory cell net inputs" (for cell $c_j$, this includes $net_{c_j}$, $net_{in_j}$, $net_{out_j}$) do not get propagated back further in time (although they do serve to change the incoming weights). Only within memory cells are errors propagated back through previous internal states $s_{c_j}$. To visualize this: once an error signal arrives at a memory cell output, it gets scaled by output gate activation and $h'$. Then it is within the memory cell's CEC, where it can flow back indefinitely without ever being scaled. Only when it leaves the memory cell through the input gate and $g$ is it scaled once more, by input gate activation and $g'$. It then serves to change the incoming weights before it is truncated (see appendix for explicit formulae).

Computational complexity. As with Mozer's focused recurrent backprop algorithm (Mozer 1989), only the derivatives $\partial s_{c_j} / \partial w_{il}$ need to be stored and updated. Hence the LSTM algorithm is very efficient, with an excellent update complexity of $O(W)$, where $W$ is the number of weights (see details in appendix A.1). Hence, LSTM and BPTT for fully recurrent nets have the same update complexity per time step (while RTRL's is much worse). Unlike full BPTT, however, LSTM is local in space and time: there is no need to store activation values observed during sequence processing in a stack with potentially unlimited size.
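The error path described in the Learning paragraph (scaled once on entry, unscaled inside the CEC, scaled once on exit) can be written out as a short calculation. This is a simplified sketch with hypothetical names and values; it treats the gate activations and derivatives as given numbers rather than computing them from a trained network.

```python
def error_at_cell_input(e_out, y_out, h_prime, y_in, g_prime, q):
    """Follow one error signal from a cell's output back to its net input.

    e_out   : error arriving at the memory cell output
    y_out   : output gate activation when the error enters the cell
    h_prime : h'(s) at that time step
    y_in    : input gate activation when the error leaves the cell
    g_prime : g'(net_c) at that time step
    q       : number of time steps spent inside the CEC
    """
    e_state = e_out * y_out * h_prime   # scaled once on entry
    for _ in range(q):
        e_state = e_state * 1.0         # linear state, self-weight 1.0: no scaling
    return e_state * y_in * g_prime     # scaled once more on exit


short_lag = error_at_cell_input(1.0, 0.5, 0.5, 0.5, 1.0, q=1)
long_lag = error_at_cell_input(1.0, 0.5, 0.5, 0.5, 1.0, q=1000)
print(short_lag, long_lag)
```

The result does not depend on `q`: the time lag itself contributes no decay, only the two gate-dependent scalings at the boundaries of the cell do.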

Abuse problem and solutions. In the beginning of the learning phase, error reduction may be possible without storing information over time. The network will thus tend to abuse memory cells, e.g., as bias cells (i.e., it might make their activations constant and use the outgoing connections as adaptive thresholds for other units). The potential difficulty is: it may take a long time to release abused memory cells and make them available for further learning. A similar "abuse problem" appears if two memory cells store the same (redundant) information. There are at least two solutions to the abuse problem: (1) Sequential network construction (e.g., Fahlman 1991): a memory cell and the corresponding gate units are added to the network whenever the error stops decreasing (see Experiment 2 in Section 5). (2) Output gate bias: each output gate gets a negative initial bias, to push initial memory cell activations towards zero. Memory cells with more negative bias automatically get "allocated" later (see Experiments 1, 3, 4, 5, 6 in Section 5).
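The effect of a negative output gate bias can be checked with a few lines of arithmetic. The bias values below are illustrative assumptions, and the sketch assumes the gate's other incoming weights are near zero at initialization, so the gate activation is approximately the logistic function of its bias.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical initial biases for the output gates of three memory cells.
biases = [-1.0, -2.0, -3.0]
initial_gate = [logistic(b) for b in biases]

# The cell output is y_out * h(s), so with h(s) in (-1, 1) the initial
# output magnitude is bounded by the gate activation.
for b, y_out in zip(biases, initial_gate):
    print(f"bias {b:+.1f} -> initial output gate activation {y_out:.3f}")
```

A more negative bias gives a smaller initial gate activation, so that cell starts out contributing less and needs more weight change before it is put to use, which matches the later "allocation" described above.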

Internal state drift and remedies. If memory cell $c_j$'s inputs are mostly positive or mostly negative, then its internal state $s_j$ will tend to drift away over time. This is potentially dangerous, for then $h'(s_j)$ will adopt very small values, and the gradient will vanish. One way to circumvent this problem is to choose an appropriate function $h$. But $h(x) = x$, for instance, has the disadvantage of unrestricted memory cell output range. Our simple but effective way of solving drift problems at the beginning of learning is to initially bias the input gate $in_j$ towards zero. Although there is a tradeoff between the magnitudes of $h'(s_j)$ on the one hand and of $y^{in_j}$ and $f'_{in_j}$ on the other, the potential negative effect of input gate bias is negligible compared to that of the drifting effect. With logistic sigmoid activation functions, there appears to be no need for fine-tuning the initial bias, as confirmed by Experiments 4 and 5 in Section 5.4.
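The drift effect and the input gate remedy can be illustrated with a toy simulation. The sketch assumes a constant positive squashed cell input of 1.0, an input gate driven only by its bias, and $h(s) = 2\,\sigma(s) - 1$ with $\sigma$ the logistic function; the names `drift` and `h_prime` and the bias value $-5$ are choices made here for illustration.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def h_prime(s):
    """Derivative of h(s) = 2 * logistic(s) - 1."""
    y = logistic(s)
    return 2.0 * y * (1.0 - y)


def drift(in_gate_bias, steps=100, cell_input=1.0):
    """Accumulate a constant positive input through a bias-only input gate."""
    y_in = logistic(in_gate_bias)
    s = 0.0
    for _ in range(steps):
        s += y_in * cell_input
    return s, h_prime(s)


s_plain, grad_plain = drift(in_gate_bias=0.0)     # gate half open from the start
s_biased, grad_biased = drift(in_gate_bias=-5.0)  # gate initially biased towards zero
print(s_plain, grad_plain)
print(s_biased, grad_biased)
```

With an unbiased gate the state drifts far into the saturated region of $h$ and $h'(s_j)$ becomes vanishingly small, while the biased gate keeps the state small enough for $h'(s_j)$ to stay useful.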

