LSTM: A Translation and Commentary of "Long Short-Term Memory" (Part 3) - Alibaba Cloud Developer Community



Summary: a translation and commentary of the paper "Long Short-Term Memory" (LSTM).

5 EXPERIMENTS


Introduction. Which tasks are appropriate to demonstrate the quality of a novel long time lag algorithm? First of all, minimal time lags between relevant input signals and corresponding teacher signals must be long for all training sequences. In fact, many previous recurrent net algorithms sometimes manage to generalize from very short training sequences to very long test sequences. See, e.g., Pollack (1991). But a real long time lag problem does not have any short time lag exemplars in the training set. For instance, Elman's training procedure, BPTT, offline RTRL, online RTRL, etc., fail miserably on real long time lag problems. See, e.g., Hochreiter (1991) and Mozer (1992). A second important requirement is that the tasks should be complex enough such that they cannot be solved quickly by simple-minded strategies such as random weight guessing.


Guessing can outperform many long time lag algorithms. Recently we discovered (Schmidhuber and Hochreiter 1996, Hochreiter and Schmidhuber 1996, 1997) that many long time lag tasks used in previous work can be solved more quickly by simple random weight guessing than by the proposed algorithms. For instance, guessing solved a variant of Bengio and Frasconi's "parity problem" (1994) much faster than the seven methods tested by Bengio et al. (1994) and Bengio and Frasconi (1994). Similarly for some of Miller and Giles' problems (1993). Of course, this does not mean that guessing is a good algorithm. It just means that some previously used problems are not extremely appropriate to demonstrate the quality of previously proposed algorithms.

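The random-weight-guessing baseline described above can be sketched in a few lines. The toy below is purely illustrative and is not the authors' setup: it uses XOR as a stand-in for a small parity-style task, a hypothetical 2-2-1 network, and an assumed guessing range of [-10, 10]. The point is only that sampling all weights at once, with no gradient information, can solve small tasks surprisingly fast.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w, x1, x2):
    """2-2-1 feedforward net; w is a flat list of 9 weights (incl. biases)."""
    h1 = sigmoid(w[0] * x1 + w[1] * x2 + w[2])
    h2 = sigmoid(w[3] * x1 + w[4] * x2 + w[5])
    return sigmoid(w[6] * h1 + w[7] * h2 + w[8])

def guess_weights(max_trials=500000, seed=0):
    """Random weight guessing: resample every weight until the task is solved."""
    rng = random.Random(seed)
    cases = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # XOR
    for trial in range(1, max_trials + 1):
        w = [rng.uniform(-10.0, 10.0) for _ in range(9)]  # guess all weights at once
        if all((forward(w, a, b) > 0.5) == (t == 1) for (a, b), t in cases):
            return w, trial
    return None, max_trials
```

Note that this only works because the solution region in weight space is not too sparse; as the next paragraph argues, the paper's own tasks are chosen so that exactly this strategy becomes infeasible.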

What's common to Experiments 1-6. All our experiments (except for Experiment 1) involve long minimal time lags: there are no short time lag training exemplars facilitating learning. Solutions to most of our tasks are sparse in weight space. They require either many parameters/inputs or high weight precision, such that random weight guessing becomes infeasible.


We always use on-line learning (as opposed to batch learning), and logistic sigmoids as activation functions. For Experiments 1 and 2, initial weights are chosen in the range [-0.2, 0.2], for the other experiments in [-0.1, 0.1]. Training sequences are generated randomly according to the various task descriptions. In slight deviation from the notation in Appendix A1, each discrete time step of each input sequence involves three processing steps:

(1) Use the current input to set the input units.

(2) Compute activations of hidden units (including input gates, output gates, memory cells).

(3) Compute output unit activations.

Except for Experiments 1, 2a, and 2b, sequence elements are randomly generated on-line, and error signals are generated only at sequence ends. Net activations are reset after each processed input sequence.

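The per-time-step processing above, applied to the paper's memory-cell architecture (input gate, output gate, constant-error-carousel state, and no forget gate in the 1997 design), can be sketched roughly as follows. This is a minimal single-cell sketch with assumed names; the paper's networks also contain recurrent connections among many units and separate output units, which are omitted here. The squashing ranges [-2, 2] for the cell input and [-1, 1] for the cell output, and the [-0.1, 0.1] weight initialization, follow the experimental setup described above.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class MemoryCell:
    """One LSTM memory cell in the 1997 style: input gate, output gate,
    identity self-connection on the internal state, no forget gate.
    Gates and the cell input see only the external input vector, the
    cell's previous output, and a bias (a simplification)."""

    def __init__(self, n_in, seed=0):
        rng = random.Random(seed)
        u = lambda: rng.uniform(-0.1, 0.1)            # init range used in Exps 3-6
        self.w_in  = [u() for _ in range(n_in + 2)]   # input gate weights
        self.w_out = [u() for _ in range(n_in + 2)]   # output gate weights
        self.w_c   = [u() for _ in range(n_in + 2)]   # cell input weights
        self.s = 0.0                                  # internal state (CEC)
        self.y = 0.0                                  # cell output

    def step(self, x):
        """One discrete time step; x is a list of external inputs (step 1)."""
        z = x + [self.y, 1.0]                         # input + recurrence + bias
        dot = lambda w: sum(wi * zi for wi, zi in zip(w, z))
        g_in  = sigmoid(dot(self.w_in))               # step 2: gate activations
        g_out = sigmoid(dot(self.w_out))
        g = 4.0 * sigmoid(dot(self.w_c)) - 2.0        # squashed cell input in [-2, 2]
        self.s = self.s + g_in * g                    # CEC: unit self-connection
        h = 2.0 * sigmoid(self.s) - 1.0               # output squashing in [-1, 1]
        self.y = g_out * h                            # step 3 would read this value
        return self.y

    def reset(self):
        """Reset activations after each processed input sequence."""
        self.s = self.y = 0.0
```

The additive state update `self.s += g_in * g` is what lets error flow back unchanged over long lags, while the two gates decide when the state is written and when it is exposed.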

For comparisons with recurrent nets taught by gradient descent, we give results only for RTRL, except for comparison 2a, which also includes BPTT. Note, however, that untruncated BPTT (see, e.g., Williams and Peng 1990) computes exactly the same gradient as offline RTRL. With long time lag problems, offline RTRL (or BPTT) and the online version of RTRL (no activation resets, online weight changes) lead to almost identical, negative results (as confirmed by additional simulations in Hochreiter 1991; see also Mozer 1992). This is because offline RTRL, online RTRL, and full BPTT all suffer badly from exponential error decay.


Our LSTM architectures are selected quite arbitrarily. If nothing is known about the complexity  of a given problem, a more systematic approach would be: start with a very small net consisting  of one memory cell. If this does not work, try two cells, etc. Alternatively, use sequential network  construction (e.g., Fahlman 1991).

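The incremental strategy suggested above (start with one memory cell, add cells only on failure) can be expressed as a small driver. Here `train` is a stand-in callable for a full LSTM training run on the given task; the name and signature are assumptions for illustration only.

```python
def smallest_sufficient_net(train, max_cells=8):
    """Grow the architecture incrementally: try one memory cell first,
    then two, and so on, returning the smallest count that succeeds.

    `train(n_cells)` is a hypothetical trainer that returns True if a
    net with `n_cells` memory cells learns the task, else False.
    """
    for n_cells in range(1, max_cells + 1):
        if train(n_cells):
            return n_cells
    return None  # no tested size sufficed
```

A sequential-construction method such as Fahlman's (1991) automates the same idea by adding units during training rather than restarting from scratch.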

Outline of experiments


Experiment 1 focuses on a standard benchmark test for recurrent nets: the embedded Reber grammar. Since it allows for training sequences with short time lags, it is not a long time lag problem. We include it because (1) it provides a nice example where LSTM's output gates are truly beneficial, and (2) it is a popular benchmark for recurrent nets that has been used by many authors: we want to include at least one experiment where conventional BPTT and RTRL do not fail completely (LSTM, however, clearly outperforms them). The embedded Reber grammar's minimal time lags represent a border case in the sense that it is still possible to learn to bridge them with conventional algorithms. Only slightly longer minimal time lags would make this almost impossible. The more interesting tasks in our paper, however, are those that RTRL, BPTT, etc. cannot solve at all.

Experiment 2 focuses on noise-free and noisy sequences involving numerous input symbols distracting from the few important ones. The most difficult task (Task 2c) involves hundreds of distractor symbols at random positions, and minimal time lags of 1000 steps. LSTM solves it, while BPTT and RTRL already fail in case of 10-step minimal time lags (see also, e.g., Hochreiter 1991 and Mozer 1992). For this reason RTRL and BPTT are omitted in the remaining, more complex experiments, all of which involve much longer time lags.

Experiment 3 addresses long time lag problems with noise and signal on the same input line. Experiments 3a/3b focus on Bengio et al.'s 1994 "2-sequence problem". Because this problem actually can be solved quickly by random weight guessing, we also include a far more difficult 2-sequence problem (3c) which requires learning real-valued, conditional expectations of noisy targets, given the inputs.

Experiments 4 and 5 involve distributed, continuous-valued input representations and require learning to store precise, real values for very long time periods. Relevant input signals can occur at quite different positions in input sequences. Again, minimal time lags involve hundreds of steps. Similar tasks have never been solved by other recurrent net algorithms.

Experiment 6 involves tasks of a different complex type that also has not been solved by other recurrent net algorithms. Again, relevant input signals can occur at quite different positions in input sequences. The experiment shows that LSTM can extract information conveyed by the temporal order of widely separated inputs.

Subsection 5.7 will provide a detailed summary of experimental conditions in two tables for reference.



5.1 EXPERIMENT 1: EMBEDDED REBER GRAMMAR


Task. Our first task is to learn the "embedded Reber grammar", e.g. Smith and Zipser (1989), Cleeremans et al. (1989), and Fahlman (1991). Since it allows for training sequences with short time lags (of as few as 9 steps), it is not a long time lag problem. We include it for two reasons: (1) it is a popular recurrent net benchmark used by many authors: we wanted to have at least one experiment where RTRL and BPTT do not fail completely, and (2) it shows nicely how output gates can be beneficial.

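A generator for the benchmark strings can be sketched from the standard embedded Reber grammar diagram. The transition table below follows the commonly published five-state graph (an assumption about the exact variant used; function names are ours). The key property is that the second symbol (T or P) must be remembered across the whole inner string in order to predict the second-to-last symbol.

```python
import random

# Transition table for the inner Reber grammar: state -> two (symbol, next) choices.
# State 5 is the accepting state (followed by the terminal 'E').
REBER = [
    [('T', 1), ('P', 2)],   # state 0, entered after the initial 'B'
    [('S', 1), ('X', 3)],   # state 1: 'S' self-loop
    [('T', 2), ('V', 4)],   # state 2: 'T' self-loop
    [('X', 2), ('S', 5)],
    [('P', 3), ('V', 5)],
]

def reber_string(rng):
    """Generate one string of the inner Reber grammar."""
    out, state = ['B'], 0
    while state != 5:
        sym, state = rng.choice(REBER[state])
        out.append(sym)
    out.append('E')
    return ''.join(out)

def embedded_reber_string(rng=None):
    """B, then T or P, then an inner Reber string, then the SAME T/P, then E.
    Predicting that second-to-last symbol requires storing the second one
    across the entire inner string."""
    rng = rng or random.Random()
    wrapper = rng.choice('TP')
    return 'B' + wrapper + reber_string(rng) + wrapper + 'E'
```

The shortest inner string (e.g. BPVVE) gives a 9-symbol embedded string, matching the minimal 9-step time lag mentioned above.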

 

 
