LSTM: “Understanding LSTM Networks”, Translated and Annotated (Part 2)

Summary: a translation and annotated reading of “Understanding LSTM Networks”.

The Core Idea Behind LSTMs


The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.


The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation.


The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”

An LSTM has three of these gates, to protect and control the cell state.
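
As a minimal sketch of what a gate does, assuming plain Python lists stand in for vectors: a sigmoid squashes each gate value into the range between zero and one, and a pointwise multiplication then scales the signal. The function names and the numbers are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def apply_gate(gate_logits, values):
    """Scale each value by its gate activation (pointwise multiplication)."""
    return [sigmoid(g) * v for g, v in zip(gate_logits, values)]

# Hypothetical numbers: a very negative logit closes the gate,
# zero lets half through, a very positive logit opens it.
gate_logits = [-20.0, 0.0, 20.0]
values = [5.0, 5.0, 5.0]
gated = apply_gate(gate_logits, values)
print([round(v, 3) for v in gated])  # [0.0, 2.5, 5.0]
```

The first component is almost entirely blocked, the second is halved, and the third passes through nearly unchanged.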


Step-by-Step LSTM Walk Through


The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at $h_{t-1}$ and $x_t$, and outputs a number between $0$ and $1$ for each number in the cell state $C_{t-1}$. A $1$ represents “completely keep this” while a $0$ represents “completely get rid of this.”
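
The forget gate described above is commonly written as $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$. The following is a small sketch of that formula using only the Python standard library; the weight matrix `W_f`, the bias `b_f`, and the toy sizes are assumptions made up for the example, not values from the text.

```python
import math

def sigmoid(z):
    """Logistic function with outputs strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def forget_gate(W_f, b_f, h_prev, x_t):
    """f_t = sigmoid(W_f . [h_prev, x_t] + b_f), one value per cell-state entry."""
    concat = h_prev + x_t  # list concatenation plays the role of [h_{t-1}, x_t]
    return [
        sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
        for row, b in zip(W_f, b_f)
    ]

# Hypothetical toy sizes: 2 hidden units and 1 input feature.
h_prev = [0.5, -0.5]
x_t = [1.0]
W_f = [
    [0.0, 0.0, 0.0],  # zero weights: this gate depends only on its bias
    [1.0, 1.0, 1.0],
]
b_f = [10.0, -1.0]
f_t = forget_gate(W_f, b_f, h_prev, x_t)
print(f_t)
```

With these numbers the first gate is close to 1 (keep) because of its large bias, and the second is exactly 0.5 because its pre-activation sums to zero.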

Let’s go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.

The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, $\tilde{C}_t$, that could be added to the state. In the next step, we’ll combine these two to create an update to the state.

In the example of our language model, we’d want to add the gender of the new subject to the cell state, to replace the old one we’re forgetting.
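
The two parts of this step are usually written as $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ and $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$. A rough sketch follows, again with standard-library Python; the helper `affine` and all weights are hypothetical and exist only to make the example runnable.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W . v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def input_gate_and_candidate(W_i, b_i, W_c, b_c, h_prev, x_t):
    """Return (i_t, c_tilde): which entries to update, and the proposed values."""
    concat = h_prev + x_t  # [h_{t-1}, x_t]
    i_t = [sigmoid(z) for z in affine(W_i, b_i, concat)]
    c_tilde = [math.tanh(z) for z in affine(W_c, b_c, concat)]
    return i_t, c_tilde

# Hypothetical toy sizes: 1 hidden unit and 1 input feature.
h_prev = [0.0]
x_t = [1.0]
i_t, c_tilde = input_gate_and_candidate(
    W_i=[[0.0, 0.0]], b_i=[0.0],
    W_c=[[0.0, 2.0]], b_c=[0.0],
    h_prev=h_prev, x_t=x_t,
)
print(i_t, c_tilde)
```

The input gate lives in (0, 1) because of the sigmoid, while the candidate values live in (-1, 1) because of the tanh.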

It’s now time to update the old cell state, $C_{t-1}$, into the new cell state $C_t$. The previous steps already decided what to do; we just need to actually do it.

We multiply the old state by $f_t$, forgetting the things we decided to forget earlier. Then we add $i_t * \tilde{C}_t$. These are the new candidate values, scaled by how much we decided to update each state value.

In the case of the language model, this is where we’d actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps.
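
The update just described amounts to $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$, with both products taken elementwise. Here is a minimal sketch with hypothetical gate values, picked so that the first entry is kept as-is and the second is fully replaced:

```python
def update_cell_state(f_t, c_prev, i_t, c_tilde):
    """C_t = f_t * C_{t-1} + i_t * c_tilde, computed entry by entry."""
    return [f * c + i * ct for f, c, i, ct in zip(f_t, c_prev, i_t, c_tilde)]

# Hypothetical values: entry 0 is remembered, entry 1 is overwritten.
c_prev = [2.0, -1.0]
f_t = [1.0, 0.0]      # keep entry 0, forget entry 1
i_t = [0.0, 1.0]      # write nothing to entry 0, write fully to entry 1
c_tilde = [0.9, 0.5]  # candidate values proposed by the tanh layer

c_t = update_cell_state(f_t, c_prev, i_t, c_tilde)
print(c_t)  # [2.0, 0.5]
```

In the language-model picture, the second entry plays the role of the old subject’s gender being dropped and the new one written in its place.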

Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through $\tanh$ (to push the values to be between $-1$ and $1$) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.
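
This last step is usually written as $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ followed by $h_t = o_t * \tanh(C_t)$. The sketch below shows one way to compute it with the standard library; `W_o`, `b_o`, and the toy inputs are assumptions for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output_step(W_o, b_o, h_prev, x_t, c_t):
    """Return (o_t, h_t) where h_t = o_t * tanh(C_t), elementwise."""
    concat = h_prev + x_t  # [h_{t-1}, x_t]
    o_t = [
        sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
        for row, b in zip(W_o, b_o)
    ]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return o_t, h_t

# Hypothetical toy sizes: 1 hidden unit and 1 input feature.
o_t, h_t = output_step(
    W_o=[[1.0, 1.0]], b_o=[0.0],
    h_prev=[0.0], x_t=[0.0],
    c_t=[3.0],
)
print(o_t, h_t)
```

Because the tanh output lies in (-1, 1) and the gate lies in (0, 1), every entry of the hidden state `h_t` also stays within (-1, 1), no matter how large the cell state grows.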


Variants on Long Short Term Memory


What I’ve described so far is a pretty normal LSTM. But not all LSTMs are the same as the above. In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The differences are minor, but it’s worth mentioning some of them.

One popular LSTM variant, introduced by Gers & Schmidhuber (2000), adds “peephole connections.” This means that we let the gate layers look at the cell state.
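
One common way to write the peephole variant is to add the cell state to the input of each gate. The equations below are a sketch in the same notation as the walkthrough, with peepholes on all three gates; note that the output gate looks at the already updated state $C_t$, while the other two look at $C_{t-1}$.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i\right) \\
o_t &= \sigma\left(W_o \cdot [C_t, h_{t-1}, x_t] + b_o\right)
\end{aligned}
```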

The above diagram adds peepholes to all the gates, but many papers will give some peepholes and not others.

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what we should add new information to, we make those decisions together. We only forget when we’re going to input something in its place. We only input new values to the state when we forget something older.
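
With coupled gates, the input gate is tied to the forget gate as $1 - f_t$, so the update becomes $C_t = f_t * C_{t-1} + (1 - f_t) * \tilde{C}_t$. A minimal sketch with hypothetical values:

```python
def coupled_update(f_t, c_prev, c_tilde):
    """C_t = f_t * C_{t-1} + (1 - f_t) * c_tilde: forgetting and writing are tied."""
    return [f * c + (1.0 - f) * ct for f, c, ct in zip(f_t, c_prev, c_tilde)]

# Hypothetical values: keep everything, replace everything, or blend.
f_t = [1.0, 0.0, 0.25]
c_prev = [4.0, 4.0, 4.0]
c_tilde = [-1.0, -1.0, -1.0]

c_t = coupled_update(f_t, c_prev, c_tilde)
print(c_t)  # [4.0, -1.0, 0.25]
```

Each entry of the new state is a weighted average of the old value and the candidate, so a new value is only written to the extent that the old one is forgotten.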

A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.
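
As a rough sketch of a single GRU step, one common formulation uses an update gate $z_t$, a reset gate $r_t$, and a candidate state $\tilde{h}_t$, then blends them as $h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$. Biases are omitted here for brevity, and conventions differ between sources on which term $z_t$ weights, so treat the details as assumptions. All weights below are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Multiply a list-of-rows matrix W by the vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(W_z, W_r, W_h, h_prev, x_t):
    """One GRU step without biases; returns the new hidden state h_t."""
    concat = h_prev + x_t                                # [h_{t-1}, x_t]
    z_t = [sigmoid(a) for a in matvec(W_z, concat)]      # update gate
    r_t = [sigmoid(a) for a in matvec(W_r, concat)]      # reset gate
    reset_concat = [r * h for r, h in zip(r_t, h_prev)] + x_t
    h_tilde = [math.tanh(a) for a in matvec(W_h, reset_concat)]
    # Blend the old state and the candidate using the update gate.
    return [(1.0 - z) * h + z * ht for z, h, ht in zip(z_t, h_prev, h_tilde)]

# Hypothetical toy sizes: 1 hidden unit and 1 input feature.
h_t = gru_step(
    W_z=[[0.0, 0.0]], W_r=[[0.0, 0.0]], W_h=[[2.0, 0.0]],
    h_prev=[1.0], x_t=[0.0],
)
print(h_t)
```

There is no separate cell state here: the single vector `h_t` serves as both the memory and the output, which is the merging of cell state and hidden state mentioned above.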

These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated RNNs by Yao, et al. (2015). There are also some completely different approaches to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014).

Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice comparison of popular variants, finding that they’re all about the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.


Conclusion


Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially all of these are achieved using LSTMs. They really work a lot better for most tasks!

Written down as a set of equations, LSTMs look pretty intimidating. Hopefully, walking through them step by step in this essay has made them a bit more approachable.

LSTMs were a big step in what we can accomplish with RNNs. It’s natural to wonder: is there another big step? A common opinion among researchers is: “Yes! There is a next step and it’s attention!” The idea is to let every step of an RNN pick information to look at from some larger collection of information. For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs. In fact, Xu, et al. (2015) do exactly this – it might be a fun starting point if you want to explore attention! There have been a number of really exciting results using attention, and it seems like a lot more are around the corner…

Attention isn’t the only exciting thread in RNN research. For example, Grid LSTMs by Kalchbrenner, et al. (2015) seem extremely promising. Work using RNNs in generative models – such as Gregor, et al. (2015), Chung, et al. (2015), or Bayer & Osendorfer (2015) – also seems very interesting. The last few years have been an exciting time for recurrent neural networks, and the coming ones promise to only be more so!


Acknowledgments


I’m grateful to a number of people for helping me better understand LSTMs, commenting on the visualizations, and providing feedback on this post.

I’m very grateful to my colleagues at Google for their helpful feedback, especially Oriol Vinyals, Greg Corrado, Jon Shlens, Luke Vilnis, and Ilya Sutskever. I’m also thankful to many other friends and colleagues for taking the time to help me, including Dario Amodei and Jacob Steinhardt. I’m especially thankful to Kyunghyun Cho for extremely thoughtful correspondence about my diagrams.

Before this post, I practiced explaining LSTMs during two seminar series I taught on neural networks. Thanks to everyone who participated in those for their patience with me, and for their feedback.


 
