# LSTM: Translation and Commentary on "Long Short-Term Memory" (Part 3)

## 5 EXPERIMENTS

Introduction. Which tasks are appropriate to demonstrate the quality of a novel long time lag algorithm? First of all, minimal time lags between relevant input signals and corresponding teacher signals must be long for all training sequences. In fact, many previous recurrent net algorithms sometimes manage to generalize from very short training sequences to very long test sequences. See, e.g., Pollack (1991). But a real long time lag problem does not have any short time lag exemplars in the training set. For instance, Elman's training procedure, BPTT, offline RTRL, online RTRL, etc., fail miserably on real long time lag problems. See, e.g., Hochreiter (1991) and Mozer (1992). A second important requirement is that the tasks should be complex enough such that they cannot be solved quickly by simple-minded strategies such as random weight guessing.

Guessing can outperform many long time lag algorithms. Recently we discovered (Schmidhuber and Hochreiter 1996, Hochreiter and Schmidhuber 1996, 1997) that many long time lag tasks used in previous work can be solved more quickly by simple random weight guessing than by the proposed algorithms. For instance, guessing solved a variant of Bengio and Frasconi's "parity problem" (1994) much faster than the seven methods tested by Bengio et al. (1994) and Bengio and Frasconi (1994). Similarly for some of Miller and Giles' problems (1993). Of course, this does not mean that guessing is a good algorithm. It just means that some previously used problems are not extremely appropriate to demonstrate the quality of previously proposed algorithms.
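As a concrete illustration, the guessing baseline amounts to nothing more than the following loop. This is a minimal sketch: the `evaluate` black box and the toy success criterion below are hypothetical stand-ins for a real task, not anything specified in the paper.

```python
import random

def guess_weights(evaluate, n_weights, max_trials=10000, bound=1.0):
    """Random weight guessing baseline: repeatedly draw all weights
    uniformly from [-bound, bound] until `evaluate` reports success.
    `evaluate` is a task-specific black box (hypothetical here) that
    returns True if a net with the guessed weights solves the task."""
    for trial in range(1, max_trials + 1):
        weights = [random.uniform(-bound, bound) for _ in range(n_weights)]
        if evaluate(weights):
            return trial, weights  # number of guesses needed, and the solution
    return None, None

# Toy stand-in task: "solved" when the first weight exceeds 0.9.
random.seed(42)
trials, w = guess_weights(lambda ws: ws[0] > 0.9, n_weights=5)
```

If solutions are dense in weight space, such a loop succeeds after few trials, which is exactly why it can embarrass algorithms evaluated only on such tasks.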

What's common to Experiments 1–6. All our experiments (except for Experiment 1) involve long minimal time lags: there are no short time lag training exemplars facilitating learning. Solutions to most of our tasks are sparse in weight space. They require either many parameters/inputs or high weight precision, such that random weight guessing becomes infeasible.

We always use on-line learning (as opposed to batch learning), and logistic sigmoids as activation functions. For Experiments 1 and 2, initial weights are chosen in the range [-0.2, 0.2], for the other experiments in [-0.1, 0.1]. Training sequences are generated randomly according to the various task descriptions. In slight deviation from the notation in Appendix A1, each discrete time step of each input sequence involves three processing steps:

(1) Use the current input to set the input units.

(2) Compute activations of hidden units (including input gates, output gates, memory cells).

(3) Compute output unit activations.

Except for Experiments 1, 2a, and 2b, sequence elements are randomly generated on-line, and error signals are generated only at sequence ends. Net activations are reset after each processed input sequence.

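The three processing steps can be sketched for a single memory cell as follows. This is a schematic illustration only: it assumes one cell with one input gate and one output gate, the weight names are illustrative rather than the paper's notation, and it does not reproduce the paper's exact cell equations.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, the activation function used throughout."""
    return 1.0 / (1.0 + math.exp(-x))

def run_sequence(sequence, weights):
    """Schematic forward pass over one input sequence, following the
    three per-time-step processing steps described in the text."""
    s = 0.0  # internal cell state, reset before each new sequence
    outputs = []
    for x in sequence:
        # (1) The current input sets the input unit.
        net_in   = weights["w_in"] * x    # net input to the cell
        net_gin  = weights["w_gin"] * x   # net input to the input gate
        net_gout = weights["w_gout"] * x  # net input to the output gate
        # (2) Hidden-unit activations: gates and memory cell.
        y_gin  = sigmoid(net_gin)
        y_gout = sigmoid(net_gout)
        s += y_gin * sigmoid(net_in)      # gated, additive cell update
        y_cell = y_gout * sigmoid(s)
        # (3) Output unit activation.
        outputs.append(sigmoid(weights["w_out"] * y_cell))
    return outputs  # error signals would be injected only at sequence end

outs = run_sequence([1.0, 0.0, 1.0],
                    {"w_in": 1.0, "w_gin": 1.0, "w_gout": 1.0, "w_out": 1.0})
```

Resetting `s` before each call corresponds to the activation resets between processed sequences mentioned above.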

For comparisons with recurrent nets taught by gradient descent, we give results only for RTRL, except for comparison 2a, which also includes BPTT. Note, however, that untruncated BPTT (see, e.g., Williams and Peng 1990) computes exactly the same gradient as offline RTRL. With long time lag problems, offline RTRL (or BPTT) and the online version of RTRL (no activation resets, online weight changes) lead to almost identical, negative results (as confirmed by additional simulations in Hochreiter 1991; see also Mozer 1992). This is because offline RTRL, online RTRL, and full BPTT all suffer badly from exponential error decay.
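The exponential error decay can be illustrated with a one-unit toy calculation (not the paper's full analysis): for a single recurrent sigmoid unit with recurrent weight w, an error signal propagated back over T steps is scaled by roughly (w · f′(net))^T, and since the logistic sigmoid's derivative never exceeds 0.25, this factor shrinks exponentially whenever |w| < 4.

```python
def logistic_deriv(y):
    """Derivative of the logistic sigmoid, expressed via its output y:
    f'(x) = f(x) * (1 - f(x)), which peaks at 0.25 when y = 0.5."""
    return y * (1.0 - y)

def error_scaling(w, y, T):
    """Factor by which a backpropagated error signal is scaled over
    T time steps through one recurrent sigmoid unit with weight w."""
    return (w * logistic_deriv(y)) ** T

# Even with a generous weight and the activation at 0.5 (where the
# derivative peaks), the signal is essentially gone after 100 steps:
factor = error_scaling(w=3.0, y=0.5, T=100)  # (3 * 0.25)^100 = 0.75^100
```

With 0.75 per step, 100 steps shrink the error by more than twelve orders of magnitude, which is why the gradient-based methods above return negative results on long time lag tasks.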

Our LSTM architectures are selected quite arbitrarily. If nothing is known about the complexity  of a given problem, a more systematic approach would be: start with a very small net consisting  of one memory cell. If this does not work, try two cells, etc. Alternatively, use sequential network  construction (e.g., Fahlman 1991).

Outline of experiments

Experiment 1 focuses on a standard benchmark test for recurrent nets: the embedded Reber grammar. Since it allows for training sequences with short time lags, it is not a long time lag problem. We include it because (1) it provides a nice example where LSTM's output gates are truly beneficial, and (2) it is a popular benchmark for recurrent nets that has been used by many authors; we want to include at least one experiment where conventional BPTT and RTRL do not fail completely (LSTM, however, clearly outperforms them). The embedded Reber grammar's minimal time lags represent a border case in the sense that it is still possible to learn to bridge them with conventional algorithms. Only slightly longer minimal time lags would make this almost impossible. The more interesting tasks in our paper, however, are those that RTRL, BPTT, etc. cannot solve at all.

Experiment 2 focuses on noise-free and noisy sequences involving numerous input symbols distracting from the few important ones. The most difficult task (Task 2c) involves hundreds of distractor symbols at random positions, and minimal time lags of 1000 steps. LSTM solves it, while BPTT and RTRL already fail in case of 10-step minimal time lags (see also, e.g., Hochreiter 1991 and Mozer 1992). For this reason RTRL and BPTT are omitted in the remaining, more complex experiments, all of which involve much longer time lags.

Experiment 3 addresses long time lag problems with noise and signal on the same input line. Experiments 3a/3b focus on Bengio et al.'s 1994 "2-sequence problem". Because this problem actually can be solved quickly by random weight guessing, we also include a far more difficult 2-sequence problem (3c), which requires learning real-valued, conditional expectations of noisy targets, given the inputs.

Experiments 4 and 5 involve distributed, continuous-valued input representations and require learning to store precise, real values for very long time periods. Relevant input signals can occur at quite different positions in input sequences. Again, minimal time lags involve hundreds of steps. Similar tasks have never been solved by other recurrent net algorithms.

Experiment 6 involves tasks of a different complex type that also has not been solved by other recurrent net algorithms. Again, relevant input signals can occur at quite different positions in input sequences. The experiment shows that LSTM can extract information conveyed by the temporal order of widely separated inputs.

Subsection 5.7 will provide a detailed summary of experimental conditions in two tables for reference.

### 5.1 EXPERIMENT 1: EMBEDDED REBER GRAMMAR

Task. Our first task is to learn the "embedded Reber grammar", e.g. Smith and Zipser (1989), Cleeremans et al. (1989), and Fahlman (1991). Since it allows for training sequences with short time lags (of as few as 9 steps), it is not a long time lag problem. We include it for two reasons: (1) it is a popular recurrent net benchmark used by many authors; we wanted to have at least one experiment where RTRL and BPTT do not fail completely, and (2) it shows nicely how output gates can be beneficial.

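For readers unfamiliar with the benchmark, strings of the embedded Reber grammar can be generated from the standard transition graph as follows. This is a sketch: the node numbering and helper names are ours, but the graph itself is the usual one from the literature, and its shortest embedded string is exactly 9 symbols long, matching the minimal time lag noted above.

```python
import random

# Transition graph of the standard Reber grammar. From each node, one of
# two (symbol, next-node) moves is chosen uniformly at random; node 5 is
# the exit node, where the final "E" is emitted.
REBER = {
    0: [("T", 1), ("P", 2)],
    1: [("S", 1), ("X", 3)],
    2: [("T", 2), ("V", 4)],
    3: [("X", 2), ("S", 5)],
    4: [("P", 3), ("V", 5)],
}

def reber_string(rng):
    """One random walk through the Reber graph, e.g. 'BTXSE'."""
    node, out = 0, ["B"]
    while node != 5:
        sym, node = rng.choice(REBER[node])
        out.append(sym)
    out.append("E")
    return "".join(out)

def embedded_reber_string(rng):
    """Embedded Reber grammar: B, then T or P, then an inner Reber
    string, then the SAME branch symbol again, then E. Predicting the
    second-to-last symbol requires remembering the second symbol across
    the entire inner string."""
    branch = rng.choice(["T", "P"])
    return "B" + branch + reber_string(rng) + branch + "E"

s = embedded_reber_string(random.Random(0))
```

The need to carry the branch symbol (T or P) across the whole inner string is what makes the task a memory benchmark at all, and, as discussed in the text, the self-loops at nodes 1 and 2 keep the minimal lag short enough for conventional algorithms.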
