LSTM: Translation and Commentary on "Long Short-Term Memory" (Part 3) - Alibaba Cloud Developer Community



Introduction. Which tasks are appropriate to demonstrate the quality of a novel long time lag algorithm? First of all, minimal time lags between relevant input signals and corresponding teacher signals must be long for all training sequences. In fact, many previous recurrent net algorithms sometimes manage to generalize from very short training sequences to very long test sequences. See, e.g., Pollack (1991). But a real long time lag problem does not have any short time lag exemplars in the training set. For instance, Elman's training procedure, BPTT, offline RTRL, online RTRL, etc., fail miserably on real long time lag problems. See, e.g., Hochreiter (1991) and Mozer (1992). A second important requirement is that the tasks should be complex enough such that they cannot be solved quickly by simple-minded strategies such as random weight guessing.

Guessing can outperform many long time lag algorithms. Recently we discovered (Schmidhuber and Hochreiter 1996, Hochreiter and Schmidhuber 1996, 1997) that many long time lag tasks used in previous work can be solved more quickly by simple random weight guessing than by the proposed algorithms. For instance, guessing solved a variant of Bengio and Frasconi's "parity problem" (1994) much faster than the seven methods tested by Bengio et al. (1994) and Bengio and Frasconi (1994). The same holds for some of Miller and Giles' problems (1993). Of course, this does not mean that guessing is a good algorithm. It just means that some previously used problems are not extremely appropriate to demonstrate the quality of previously proposed algorithms.
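To make the idea of random weight guessing concrete, here is a minimal sketch. It assumes guessing means redrawing all weights at random until every training sequence is classified correctly. The toy "latch" task, the single recurrent threshold unit, and all function names are hypothetical illustrations and are not taken from the paper's experiments.

```python
import random


def run_latch_unit(w_in, w_rec, bias, sequence):
    """Run one recurrent threshold unit over a sequence; return its final state."""
    h = 0
    for x in sequence:
        h = 1 if (w_in * x + w_rec * h + bias) > 0 else 0
    return h


def make_dataset(lag):
    """Toy task (assumption): the first symbol is the class, then `lag` zeros follow."""
    return [([1] + [0] * lag, 1), ([0] + [0] * lag, 0)]


def guess_weights(dataset, max_trials=10000, seed=0):
    """Redraw all three weights until every sequence is classified correctly."""
    rng = random.Random(seed)
    for trial in range(1, max_trials + 1):
        weights = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        if all(run_latch_unit(*weights, seq) == target for seq, target in dataset):
            return weights, trial
    return None, max_trials


weights, trials = guess_weights(make_dataset(lag=100))
print("solved after", trials, "guesses:", weights)
```

On this toy task a sizeable fraction of weight space is a solution, so guessing succeeds after few trials even though the time lag is 100 steps. That is the sense in which a long lag alone does not make a benchmark hard.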

What's common to Experiments 1-6. All our experiments (except for Experiment 1) involve long minimal time lags: there are no short time lag training exemplars facilitating learning. Solutions to most of our tasks are sparse in weight space. They require either many parameters/inputs or high weight precision, such that random weight guessing becomes infeasible.


We always use on-line learning (as opposed to batch learning), and logistic sigmoids as activation functions. For Experiments 1 and 2, initial weights are chosen in the range [-0.2, 0.2], for the other experiments in [-0.1, 0.1]. Training sequences are generated randomly according to the various task descriptions. In slight deviation from the notation in Appendix A1, each discrete time step of each input sequence involves three processing steps:

(1) Use current input to set the input units.

(2) Compute activations of hidden units (including input gates, output gates, memory cells).

(3) Compute output unit activations.

Except for Experiments 1, 2a, and 2b, sequence elements are randomly generated on-line, and error signals are generated only at sequence ends. Net activations are reset after each processed input sequence.
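The three processing steps can be sketched as a forward pass. This is a minimal illustration rather than the authors' implementation. It assumes a single memory cell without a forget gate, gates that see the inputs plus the previous cell output, and centred squashing functions for the cell input and cell state, none of which are specified in this section. The class and method names are hypothetical, and no learning is performed.

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTM:
    """One memory cell with an input gate and an output gate, plus one output unit."""

    def __init__(self, n_inputs, init_range=0.2, seed=0):
        rng = random.Random(seed)

        def draw(n):
            # Initial weights drawn uniformly from [-init_range, init_range].
            return [rng.uniform(-init_range, init_range) for _ in range(n)]

        n_src = n_inputs + 2  # input units, previous cell output, bias
        self.w_in_gate = draw(n_src)
        self.w_out_gate = draw(n_src)
        self.w_cell = draw(n_src)
        self.w_output = draw(2)  # cell output, bias
        self.reset()

    def reset(self):
        """Reset net activations, as done after each processed input sequence."""
        self.state = 0.0
        self.cell_out = 0.0

    def step(self, x):
        # (1) Use the current input to set the input units.
        source = list(x) + [self.cell_out, 1.0]

        def net(w):
            return sum(wi * si for wi, si in zip(w, source))

        # (2) Compute hidden activations: input gate, output gate, memory cell.
        in_gate = logistic(net(self.w_in_gate))
        out_gate = logistic(net(self.w_out_gate))
        cell_input = 4.0 * logistic(net(self.w_cell)) - 2.0  # assumed range [-2, 2]
        self.state += in_gate * cell_input  # self-connection with weight 1.0
        self.cell_out = out_gate * (2.0 * logistic(self.state) - 1.0)

        # (3) Compute the output unit activation.
        return logistic(self.w_output[0] * self.cell_out + self.w_output[1])

    def run(self, sequence):
        """Process a whole sequence from a reset state; return the final output."""
        self.reset()
        y = None
        for x in sequence:
            y = self.step(x)
        return y


net = TinyLSTM(n_inputs=3)
print(net.run([[1, 0, 0], [0, 1, 0], [0, 0, 1]]))
```

Because `run` resets the activations first, calling it twice on the same sequence gives the same output, mirroring the reset after each processed sequence.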





For comparisons with recurrent nets taught by gradient descent, we give results only for RTRL, except for comparison 2a, which also includes BPTT. Note, however, that untruncated BPTT (see, e.g., Williams and Peng 1990) computes exactly the same gradient as offline RTRL. With long time lag problems, offline RTRL (or BPTT) and the online version of RTRL (no activation resets, online weight changes) lead to almost identical, negative results (as confirmed by additional simulations in Hochreiter 1991; see also Mozer 1992). This is because offline RTRL, online RTRL, and full BPTT all suffer badly from exponential error decay.
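A small numeric sketch shows why the error decays exponentially. It assumes the simplest possible case, a single self-connected logistic unit, where each backward step scales the error by the recurrent weight times the activation derivative. The logistic derivative is at most 0.25, so the scaling after `lag` steps is bounded as computed below. The function name is hypothetical.

```python
def max_error_scaling(weight, lag, max_derivative=0.25):
    """Upper bound on the error scaling after `lag` backward steps through
    one self-connected logistic unit with recurrent weight `weight`."""
    return (abs(weight) * max_derivative) ** lag


for lag in (1, 10, 100):
    print(lag, max_error_scaling(1.0, lag))
```

With a recurrent weight of 1.0 the bound shrinks by a factor of four per step, so after 100 steps practically no error signal is left. Only when the weight magnitude reaches 4.0 does the bound stop shrinking.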

Our LSTM architectures are selected quite arbitrarily. If nothing is known about the complexity of a given problem, a more systematic approach would be: start with a very small net consisting of one memory cell. If this does not work, try two cells, etc. Alternatively, use sequential network construction (e.g., Fahlman 1991).
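The "start small and grow" strategy can be sketched as a loop. The callback `train_and_test` is a hypothetical placeholder for a full training run that reports whether a net with the given number of memory cells solved the task. The upper limit `max_cells` is likewise an assumption.

```python
def grow_until_solved(train_and_test, max_cells=8):
    """Try nets with 1, 2, ... memory cells; return the first size that succeeds."""
    for n_cells in range(1, max_cells + 1):
        if train_and_test(n_cells):
            return n_cells
    return None  # no size up to max_cells solved the task


# Hypothetical stand-in for a real training run: pretend the task needs 3 cells.
def fake_training_run(n_cells):
    return n_cells >= 3


print(grow_until_solved(fake_training_run))  # prints 3
```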

Outline of experiments

Experiment 1 focuses on a standard benchmark test for recurrent nets: the embedded Reber grammar. Since it allows for training sequences with short time lags, it is not a long time lag problem. We include it because (1) it provides a nice example where LSTM's output gates are truly beneficial, and (2) it is a popular benchmark for recurrent nets that has been used by many authors; we want to include at least one experiment where conventional BPTT and RTRL do not fail completely (LSTM, however, clearly outperforms them). The embedded Reber grammar's minimal time lags represent a border case in the sense that it is still possible to learn to bridge them with conventional algorithms. Only slightly longer minimal time lags would make this almost impossible. The more interesting tasks in our paper, however, are those that RTRL, BPTT, etc. cannot solve at all.

Experiment 2 focuses on noise-free and noisy sequences involving numerous input symbols distracting from the few important ones. The most difficult task (Task 2c) involves hundreds of distractor symbols at random positions, and minimal time lags of 1000 steps. LSTM solves it, while BPTT and RTRL already fail in case of 10-step minimal time lags (see also, e.g., Hochreiter 1991 and Mozer 1992). For this reason RTRL and BPTT are omitted in the remaining, more complex experiments, all of which involve much longer time lags.

Experiment 3 addresses long time lag problems with noise and signal on the same input line. Experiments 3a/3b focus on Bengio et al.'s 1994 "2-sequence problem". Because this problem actually can be solved quickly by random weight guessing, we also include a far more difficult 2-sequence problem (3c) which requires learning real-valued, conditional expectations of noisy targets, given the inputs.

Experiments 4 and 5 involve distributed, continuous-valued input representations and require learning to store precise, real values for very long time periods. Relevant input signals can occur at quite different positions in input sequences. Again minimal time lags involve hundreds of steps. Similar tasks have never been solved by other recurrent net algorithms.

Experiment 6 involves tasks of a different complex type that also have not been solved by other recurrent net algorithms. Again, relevant input signals can occur at quite different positions in input sequences. The experiment shows that LSTM can extract information conveyed by the temporal order of widely separated inputs.

Subsection 5.7 will provide a detailed summary of experimental conditions in two tables for reference.







Task. Our first task is to learn the "embedded Reber grammar", e.g. Smith and Zipser (1989), Cleeremans et al. (1989), and Fahlman (1991). Since it allows for training sequences with short time lags (of as few as 9 steps), it is not a long time lag problem. We include it for two reasons: (1) it is a popular recurrent net benchmark used by many authors (we wanted to have at least one experiment where RTRL and BPTT do not fail completely), and (2) it shows nicely how output gates can be beneficial.
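For readers unfamiliar with the benchmark, the following sketch generates embedded Reber strings. The excerpt does not reproduce the grammar's transition diagram, so the table below is the commonly used version from the literature and should be treated as an assumption. The function names are hypothetical. An embedded string wraps an ordinary Reber string in a matching T or P pair, and the net must remember that second symbol in order to predict the second-to-last one.

```python
import random

# Assumed Reber grammar: state -> list of (symbol, next_state); state 5 is final.
REBER = {
    0: [("T", 1), ("P", 2)],
    1: [("S", 1), ("X", 3)],
    2: [("T", 2), ("V", 4)],
    3: [("X", 2), ("S", 5)],
    4: [("P", 3), ("V", 5)],
}
END_STATE = 5


def reber_string(rng):
    """Random walk through the grammar, framed by the symbols B and E."""
    symbols = ["B"]
    state = 0
    while state != END_STATE:
        symbol, state = rng.choice(REBER[state])
        symbols.append(symbol)
    symbols.append("E")
    return "".join(symbols)


def embedded_reber_string(rng):
    """B, then T or P, a full Reber string, the same T or P again, then E."""
    branch = rng.choice("TP")
    return "B" + branch + reber_string(rng) + branch + "E"


def is_reber(s):
    """Check that s is B + a valid walk to the final state + E."""
    if len(s) < 2 or s[0] != "B" or s[-1] != "E":
        return False
    state = 0
    for ch in s[1:-1]:
        if state == END_STATE:
            return False
        state = dict(REBER[state]).get(ch)
        if state is None:
            return False
    return state == END_STATE


def is_embedded_reber(s):
    return (len(s) >= 9 and s[0] == "B" and s[-1] == "E"
            and s[1] in "TP" and s[1] == s[-2] and is_reber(s[2:-2]))


rng = random.Random(0)
for _ in range(3):
    print(embedded_reber_string(rng))
```

Under this table the shortest embedded strings, such as `BTBTXSETE`, have 9 symbols, which agrees with the shortest time lag mentioned in the task description.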






