Understanding LSTM Networks
Posted on August 27, 2015
Original post: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Recurrent Neural Networks
Humans don’t start their thinking from scratch every second. As you read this essay, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again. Your thoughts have persistence.
Traditional neural networks can’t do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a movie. It’s unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones.
Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist.
In the above diagram, a chunk of neural network, A, looks at some input x_t and outputs a value h_t. A loop allows information to be passed from one step of the network to the next.
These loops make recurrent neural networks seem kind of mysterious. However, if you think a bit more, it turns out that they aren’t all that different than a normal neural network. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop:
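To make the "multiple copies of the same network" picture concrete, here is a minimal Python sketch (the names `unroll`, `cell`, and `h0` are just illustrative): the same cell, with the same parameters, is applied at every time step, and the hidden state is the message each copy passes to its successor.

```python
def unroll(cell, inputs, h0):
    """Apply the same cell at every time step of the sequence."""
    h, outputs = h0, []
    for x_t in inputs:      # one 'copy' of the network per time step
        h = cell(x_t, h)    # identical parameters reused at every step
        outputs.append(h)   # h_t, the message passed to the next copy
    return outputs
```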
This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They're the natural neural network architecture to use for such data.
And they certainly are used! In the last few years, there has been incredible success applying RNNs to a variety of problems: speech recognition, language modeling, translation, image captioning… The list goes on. I'll leave discussion of the amazing feats one can achieve with RNNs to Andrej Karpathy's excellent blog post, The Unreasonable Effectiveness of Recurrent Neural Networks. But they really are pretty amazing.
Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version. Almost all exciting results based on recurrent neural networks are achieved with them. It’s these LSTMs that this essay will explore.
The Problem of Long-Term Dependencies
One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the present frame. If RNNs could do this, they'd be extremely useful. But can they? It depends.
Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context – it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information.
But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large.
In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them. The problem was explored in depth by Hochreiter (1991) [German] and Bengio, et al. (1994), who found some pretty fundamental reasons why it might be difficult.
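One commonly cited reason, at the heart of those analyses, is that the gradient reaching a step far in the past involves a product of roughly similar per-step factors, one for each intervening step, so it tends to shrink (or blow up) exponentially with the gap. A toy illustration with made-up numbers:

```python
# Toy illustration: backpropagating through T recurrent steps multiplies the
# gradient by roughly one per-step factor per step. With a factor below 1,
# the signal from distant steps all but vanishes.
per_step = 0.9                  # assumed per-step factor (illustrative)
for T in (5, 20, 100):
    print(T, per_step ** T)     # ~0.59, ~0.12, ~2.7e-05
```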
Thankfully, LSTMs don’t have this problem!
LSTM Networks
Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used.
LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!
All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.
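As a rough sketch of that repeating module (the weight names `W` and `b` are illustrative), a standard RNN cell concatenates the previous hidden state with the current input and passes the result through one learned tanh layer:

```python
import numpy as np

def simple_rnn_cell(x_t, h_prev, W, b):
    """Standard RNN repeating module: a single tanh layer.
    W has shape (hidden, hidden + input); b has shape (hidden,)."""
    z = np.concatenate([h_prev, x_t])   # merge h_{t-1} and x_t
    return np.tanh(W @ z + b)           # the one learned layer produces h_t
```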
LSTMs also have this chain-like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.
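To give a concrete sense of what those four layers are and how they interact, here is a minimal NumPy sketch of the standard LSTM cell; the four learned layers are the forget gate, the input gate, the candidate layer, and the output gate (parameter names like W_f, W_i, W_C, W_o are just labels for this sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """LSTM repeating module: four learned layers interacting through
    pointwise operations on the cell state C."""
    z = np.concatenate([h_prev, x_t])    # merge h_{t-1} and x_t

    f_t = sigmoid(W_f @ z + b_f)         # forget gate layer
    i_t = sigmoid(W_i @ z + b_i)         # input gate layer
    C_hat = np.tanh(W_C @ z + b_C)       # candidate values layer
    o_t = sigmoid(W_o @ z + b_o)         # output gate layer

    C_t = f_t * C_prev + i_t * C_hat     # pointwise update of the cell state
    h_t = o_t * np.tanh(C_t)             # new hidden state / output
    return h_t, C_t
```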
Don’t worry about the details of what’s going on. We’ll walk through the LSTM diagram step by step later. For now, let’s just try to get comfortable with the notation we’ll be using.
In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.
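Read as array operations (a rough mapping of the diagram symbols, not something spelled out in the original post), those elements look like this:

```python
import numpy as np

h_prev, x_t = np.zeros(4), np.ones(3)        # each line carries an entire vector

merged = np.concatenate([h_prev, x_t])       # lines merging: concatenation
W = np.random.randn(4, 7)                    # yellow box: a learned layer
layer_out = np.tanh(W @ merged)              #   (random weights stand in for learned ones)
added = layer_out + np.ones(4)               # pink circle: a pointwise operation
copy_a, copy_b = layer_out, layer_out.copy() # a line forking: the content is copied
```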