# LSTM: Translation and Commentary on "Long Short-Term Memory" (Part 2)

## 3 CONSTANT ERROR BACKPROP

3.1 EXPONENTIALLY DECAYING ERROR

Conventional BPTT (e.g., Williams and Zipser 1992). Output unit k's target at time t is denoted by dk(t). Using mean squared error, k's error signal is

ϑk(t) = f'k(netk(t)) (dk(t) − yk(t)),

where yi(t) = fi(neti(t)) is the activation of a non-input unit i with differentiable activation function fi, neti(t) = Σj wij yj(t−1) is unit i's current net input, and wij is the weight on the connection from unit j to unit i. Some non-output unit j's backpropagated error signal is

ϑj(t) = f'j(netj(t)) Σi wij ϑi(t+1).

The corresponding contribution to wjl's total weight update is α ϑj(t) yl(t−1), where α is the learning rate, and l stands for an arbitrary unit connected to unit j.

Outline of Hochreiter's analysis (1991, pages 19-21). Suppose we have a fully connected net whose non-input unit indices range from 1 to n. Let us focus on local error flow from unit u to unit v (later we will see that the analysis immediately extends to global error flow). The error occurring at an arbitrary unit u at time step t is propagated "back into time" for q time steps, to an arbitrary unit v. This will scale the error by the following factor:

∂ϑv(t−q)/∂ϑu(t) = f'v(netv(t−1)) wuv, if q = 1;
∂ϑv(t−q)/∂ϑu(t) = f'v(netv(t−q)) Σ (l=1 to n) [∂ϑl(t−q+1)/∂ϑu(t)] wlv, if q > 1.

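The decay this factor describes is easy to reproduce numerically. The following minimal Python sketch (illustrative, not from the paper; the function names are my own) multiplies the per-step scaling factor f'(net)·w for a single self-recurrent unit with a logistic sigmoid activation:

```python
# Sketch of the vanishing-error effect: propagating an error back through
# q steps of a single self-recurrent unit scales it by (f'(net) * w) per step.
import numpy as np

def sigmoid_prime(net):
    """Derivative of the logistic sigmoid f(x) = 1 / (1 + exp(-x))."""
    s = 1.0 / (1.0 + np.exp(-net))
    return s * (1.0 - s)

def backprop_scaling(w, net, q):
    """Factor by which an error signal is scaled after q backward steps."""
    return (sigmoid_prime(net) * w) ** q

# The sigmoid derivative is at most 0.25, so for |w| < 4 the factor decays
# exponentially in q -- the vanishing-gradient problem.
for q in (1, 10, 50):
    print(q, backprop_scaling(w=1.0, net=0.0, q=q))
```

With w = 1.0 and net = 0.0, each backward step multiplies the error by 0.25, so after 50 steps the error signal is negligible.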

3.2 CONSTANT ERROR FLOW: NAIVE APPROACH

A single unit. To avoid vanishing error signals, how can we achieve constant error flow through a single unit j with a single connection to itself? According to the rules above, at time t, j's local error back flow is ϑj(t) = f'j(netj(t)) ϑj(t+1) wjj. To enforce constant error flow through j, we require

f'j(netj(t)) wjj = 1.0.
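A toy Python sketch (illustrative, not from the paper) of why this requirement yields constant error flow: with a linear unit (so f' = 1) and self-weight 1.0, the backpropagated error is unchanged over arbitrarily many steps:

```python
# Sketch of the constant error carrousel: with f_j(x) = x and w_jj = 1.0,
# the local backward error theta_j(t) = f'_j(net_j(t)) * theta_j(t+1) * w_jj
# stays constant no matter how far back in time it flows.
def backward_error(theta_next, w_jj, f_prime):
    """One backward step of the local error recurrence."""
    return f_prime * theta_next * w_jj

theta = 1.0
for _ in range(1000):                                     # 1000 backward steps
    theta = backward_error(theta, w_jj=1.0, f_prime=1.0)  # identity: f' = 1
print(theta)                                              # still exactly 1.0
```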

In the experiments, this will be ensured by using the identity function fj: fj(x) = x, ∀x, and by setting wjj = 1.0. We refer to this as the constant error carrousel (CEC). CEC will be LSTM's central feature (see Section 4). Of course unit j will not only be connected to itself but also to other units. This invokes two obvious, related problems (also inherent in all other gradient-based approaches):

1. Input weight conflict: for simplicity, let us focus on a single additional input weight wji. Assume that the total error can be reduced by switching on unit j in response to a certain input, and keeping it active for a long time (until it helps to compute a desired output). Provided i is nonzero, since the same incoming weight has to be used for both storing certain inputs and ignoring others, wji will often receive conflicting weight update signals during this time (recall that j is linear): these signals will attempt to make wji participate in (1) storing the input (by switching on j) and (2) protecting the input (by preventing j from being switched off by irrelevant later inputs). This conflict makes learning difficult, and calls for a more context-sensitive mechanism for controlling "write operations" through input weights.

2. Output weight conflict: assume j is switched on and currently stores some previous input. For simplicity, let us focus on a single additional outgoing weight wkj. The same wkj has to be used for both retrieving j's content at certain times and preventing j from disturbing k at other times. As long as unit j is non-zero, wkj will attract conflicting weight update signals generated during sequence processing: these signals will attempt to make wkj participate in (1) accessing the information stored in j and, at different times, (2) protecting unit k from being perturbed by j. For instance, with many tasks there are certain "short time lag errors" that can be reduced in early training stages. However, at later training stages j may suddenly start to cause avoidable errors in situations that already seemed under control by attempting to participate in reducing more difficult "long time lag errors". Again, this conflict makes learning difficult, and calls for a more context-sensitive mechanism for controlling "read operations" through output weights.



Of course, input and output weight conflicts are not specific to long time lags, but occur for short time lags as well. Their effects, however, become particularly pronounced in the long time lag case: as the time lag increases, (1) stored information must be protected against perturbation for longer and longer periods, and, especially in advanced stages of learning, (2) more and more already correct outputs also require protection against perturbation.

Due to the problems above, the naive approach does not work well except in case of certain simple problems involving local input/output representations and non-repeating input patterns (see Hochreiter 1991 and Silva et al. 1996). The next section shows how to do it right.

## 4 LONG SHORT-TERM MEMORY

Memory cells and gate units. To construct an architecture that allows for constant error flow through special, self-connected units without the disadvantages of the naive approach, we extend the constant error carrousel CEC embodied by the self-connected, linear unit j from Section 3.2 by introducing additional features. A multiplicative input gate unit is introduced to protect the memory contents stored in j from perturbation by irrelevant inputs. Likewise, a multiplicative output gate unit is introduced which protects other units from perturbation by currently irrelevant memory contents stored in j.

Figure 1: Architecture of memory cell cj (the box) and its gate units inj, outj. The self-recurrent connection (with weight 1.0) indicates feedback with a delay of 1 time step. It builds the basis of the "constant error carrousel" CEC. The gate units open and close access to CEC. See text and appendix A.1 for details.

Why gate units? To avoid input weight conflicts, inj controls the error flow to memory cell cj's input connections wcj i. To circumvent cj's output weight conflicts, outj controls the error flow from unit j's output connections. In other words, the net can use inj to decide when to keep or override information in memory cell cj, and outj to decide when to access memory cell cj and when to prevent other units from being perturbed by cj (see Figure 1).
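One forward step of such a memory cell can be sketched as follows. This is an illustrative Python sketch, not the paper's implementation; tanh stands in for the squashing functions g and h (whose exact output ranges differ in the paper), and the net inputs are taken as given scalars:

```python
# Sketch of one forward step of the memory cell in Figure 1: input gate in_j,
# output gate out_j, and the self-recurrent internal state (original LSTM,
# i.e., no forget gate -- the state simply integrates gated input).
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def memory_cell_step(s_prev, net_c, net_in, net_out):
    """One time step of memory cell c_j given its three net inputs."""
    y_in = sigmoid(net_in)                  # input gate activation
    y_out = sigmoid(net_out)                # output gate activation
    s = s_prev + y_in * math.tanh(net_c)    # CEC: state integrates gated input
    y_c = y_out * math.tanh(s)              # gated cell output
    return s, y_c

# With the input gate shut (large negative net_in), the stored state is
# protected from the irrelevant input net_c:
s, y = memory_cell_step(s_prev=2.0, net_c=5.0, net_in=-100.0, net_out=0.0)
print(s)   # remains (almost exactly) 2.0
```

Driving net_out negative would likewise shut the output gate, hiding the stored content from downstream units.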

Error signals trapped within a memory cell's CEC cannot change; but different error signals flowing into the cell (at different times) via its output gate may get superimposed. The output gate will have to learn which errors to trap in its CEC, by appropriately scaling them. The input gate will have to learn when to release errors, again by appropriately scaling them. Essentially, the multiplicative gate units open and close access to constant error flow through CEC.

Distributed output representations typically do require output gates. Not always are both gate types necessary, though; one may be sufficient. For instance, in Experiments 2a and 2b in Section 5, it will be possible to use input gates only. In fact, output gates are not required in case of local output encoding: preventing memory cells from perturbing already learned outputs can be done by simply setting the corresponding weights to zero. Even in this case, however, output gates can be beneficial: they prevent the net's attempts at storing long time lag memories (which are usually hard to learn) from perturbing activations representing easily learnable short time lag memories. (This will prove quite useful in Experiment 1, for instance.)

Network topology. We use networks with one input layer, one hidden layer, and one output layer. The (fully) self-connected hidden layer contains memory cells and corresponding gate units (for convenience, we refer to both memory cells and gate units as being located in the hidden layer). The hidden layer may also contain "conventional" hidden units providing inputs to gate units and memory cells. All units (except for gate units) in all layers have directed connections (serve as inputs) to all units in the layer above (or to all higher layers, as in Experiments 2a and 2b).

Memory cell blocks. S memory cells sharing the same input gate and the same output gate form a structure called a "memory cell block of size S". Memory cell blocks facilitate information storage; as with conventional neural nets, it is not so easy to code a distributed input within a single cell. Since each memory cell block has as many gate units as a single memory cell (namely two), the block architecture can be even slightly more efficient (see paragraph "computational complexity"). A memory cell block of size 1 is just a simple memory cell. In the experiments (Section 5), we will use memory cell blocks of various sizes.

Learning. We use a variant of RTRL (e.g., Robinson and Fallside 1987) which properly takes into account the altered, multiplicative dynamics caused by input and output gates. However, to ensure non-decaying error backprop through internal states of memory cells, as with truncated BPTT (e.g., Williams and Peng 1990), errors arriving at "memory cell net inputs" (for cell cj, this includes netcj, netinj, netoutj) do not get propagated back further in time (although they do serve to change the incoming weights). Only within memory cells are errors propagated back through previous internal states scj. To visualize this: once an error signal arrives at a memory cell output, it gets scaled by the output gate activation and h'. Then it is within the memory cell's CEC, where it can flow back indefinitely without ever being scaled. Only when it leaves the memory cell through the input gate and g is it scaled once more, by the input gate activation and g'. It then serves to change the incoming weights before it is truncated (see appendix for explicit formulae).

Computational complexity. As with Mozer's focused recurrent backprop algorithm (Mozer 1989), only the derivatives ∂scj/∂wil need to be stored and updated. Hence the LSTM algorithm is very efficient, with an excellent update complexity of O(W), where W is the number of weights (see details in appendix A.1). Hence, LSTM and BPTT for fully recurrent nets have the same update complexity per time step (while RTRL's is much worse). Unlike full BPTT, however, LSTM is local in space and time: there is no need to store activation values observed during sequence processing in a stack with potentially unlimited size.

Abuse problem and solutions. In the beginning of the learning phase, error reduction may be possible without storing information over time. The network will thus tend to abuse memory cells, e.g., as bias cells (i.e., it might make their activations constant and use the outgoing connections as adaptive thresholds for other units). The potential difficulty is: it may take a long time to release abused memory cells and make them available for further learning. A similar "abuse problem" appears if two memory cells store the same (redundant) information. There are at least two solutions to the abuse problem: (1) Sequential network construction (e.g., Fahlman 1991): a memory cell and the corresponding gate units are added to the network whenever the error stops decreasing (see Experiment 2 in Section 5). (2) Output gate bias: each output gate gets a negative initial bias, to push initial memory cell activations towards zero. Memory cells with more negative bias automatically get "allocated" later (see Experiments 1, 3, 4, 5, 6 in Section 5).

Internal state drift and remedies. If memory cell cj's inputs are mostly positive or mostly negative, then its internal state sj will tend to drift away over time. This is potentially dangerous, for h'(sj) will then adopt very small values, and the gradient will vanish. One way to circumvent this problem is to choose an appropriate function h. But h(x) = x, for instance, has the disadvantage of unrestricted memory cell output range. Our simple but effective way of solving drift problems at the beginning of learning is to initially bias the input gate inj towards zero. Although there is a tradeoff between the magnitudes of h'(sj) on the one hand and of yinj and f'inj on the other, the potential negative effect of input gate bias is negligible compared to the one of the drifting effect. With logistic sigmoid activation functions, there appears to be no need for fine-tuning the initial bias, as confirmed by Experiments 4 and 5 in Section 5.4.
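The effect of this remedy can be sketched numerically. In the illustrative Python below (names and magnitudes are my own assumptions), b_in plays the role of the initial input gate bias; a negative bias keeps the early internal state small, so h'(sj) does not shrink toward zero:

```python
# Sketch of the drift remedy: with mostly-positive gated input, the internal
# state s grows linearly over time unless the input gate is biased shut.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def state_after(T, net_in, b_in, g_val=1.0):
    """Internal state after T steps of constant positive gated input g_val."""
    s = 0.0
    for _ in range(T):
        s += sigmoid(net_in + b_in) * g_val   # CEC integrates gated input
    return s

unbiased = state_after(T=100, net_in=0.0, b_in=0.0)   # drifts to 50.0
biased = state_after(T=100, net_in=0.0, b_in=-6.0)    # stays below 0.25
print(unbiased, biased)
```

With no bias the state drifts far into the saturated region of h, while the negative bias keeps it near zero at the start of learning.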
