2. Background
The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as their basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.
Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22].
End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [34].
To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and [9].
3. Model Architecture
Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.
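As a rough illustration of this auto-regressive loop (a sketch, not code from the paper; `encode` and `decode_step` are hypothetical stand-ins for the encoder and decoder described below):

```python
# Minimal sketch of auto-regressive generation; `encode` and `decode_step`
# are hypothetical placeholders for the encoder and decoder of this section.
def generate(encode, decode_step, src_symbols, bos, eos, max_len=100):
    z = encode(src_symbols)          # continuous representations z = (z1, ..., zn)
    ys = [bos]                       # previously generated symbols
    for _ in range(max_len):
        y_next = decode_step(z, ys)  # next symbol, conditioned on z and past outputs
        ys.append(y_next)
        if y_next == eos:
            break
    return ys[1:]
```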
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
Figure 1: The Transformer - model architecture.
3.1. Encoder and Decoder Stacks
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is
LayerNorm(x + Sublayer(x)),
where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension dmodel = 512.
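A minimal PyTorch sketch of this residual-plus-normalization wrapper (the class name and structure are illustrative, not the paper's reference code):

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Computes LayerNorm(x + Sublayer(x)) around a given sub-layer."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        # `sublayer` is the self-attention or feed-forward function of the layer.
        return self.norm(x + sublayer(x))
```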
Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
3.2. Attention
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
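As a toy illustration of this definition (not from the paper), the sketch below computes the output for a single query as a softmax-weighted sum of the values, with the compatibility function passed in as an argument:

```python
import numpy as np

def attend(query, keys, values, compatibility):
    # Weight for each value = softmax of the compatibility of the query with its key.
    scores = np.array([compatibility(query, k) for k in keys])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # The output is the weighted sum of the values.
    return weights @ values

# Example with dot product as the compatibility function.
q = np.array([1.0, 0.0])
K = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
V = np.array([[10.0, 0.0], [0.0, 10.0]])
print(attend(q, K, V, compatibility=lambda a, b: a @ b))  # weighted toward the first value
```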
Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.
3.2.1. Scaled Dot-Product Attention
We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries and keys of dimension dk, and values of dimension dv. We compute the dot products of the query with all keys, divide each by √dk, and apply a softmax function to obtain the weights on the values.
In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V. We compute the matrix of outputs as:
Attention(Q, K, V) = softmax(QKᵀ / √dk) V (1)
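A minimal PyTorch sketch of equation (1) (function and argument names are illustrative; the optional mask argument anticipates the decoder masking of Section 3.2.3):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(dk)) V."""
    dk = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(dk)
    if mask is not None:
        # Illegal connections are masked out (set to -inf) before the softmax.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, V)
```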
The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of 1/√dk. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
While for small values of dk the two mechanisms perform similarly, additive attention outperforms dot-product attention without scaling for larger values of dk [3]. We suspect that for large values of dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by 1/√dk.
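A quick numeric check of this argument (an illustration, not from the paper): if the components of q and k are independent with mean 0 and variance 1, their dot product has variance dk, so unscaled scores have magnitude around √dk and the softmax saturates:

```python
import numpy as np

rng = np.random.default_rng(0)
dk = 512
q, K = rng.standard_normal(dk), rng.standard_normal((10, dk))

scores = K @ q                            # standard deviation ~ sqrt(dk) ~ 22.6
for s in (scores, scores / np.sqrt(dk)):  # unscaled vs. scaled
    w = np.exp(s - s.max())
    w /= w.sum()
    print(w.max())  # unscaled weights typically collapse toward one-hot; scaled ones do not
```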
3.2.2. Multi-Head Attention
Instead of performing a single attention function with dmodel-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to dk, dk and dv dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.
In this work we employ h = 8 parallel attention layers, or heads. For each of these we use dk = dv = dmodel/h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
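A hedged PyTorch sketch of this multi-head computation (class and parameter names are illustrative, not the paper's reference implementation):

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, h: int = 8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h  # h = 8 heads, d_k = d_v = d_model / h = 64
        # Learned projections for queries, keys and values, plus the final output projection.
        self.W_q, self.W_k, self.W_v, self.W_o = (nn.Linear(d_model, d_model) for _ in range(4))

    def forward(self, query, key, value, mask=None):
        B = query.size(0)

        def split(x, W):
            # Project, then reshape to (batch, heads, seq_len, d_k).
            return W(x).view(B, -1, self.h, self.d_k).transpose(1, 2)

        Q, K, V = split(query, self.W_q), split(key, self.W_k), split(value, self.W_v)
        # Scaled dot-product attention on every head in parallel (Section 3.2.1).
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        heads = torch.softmax(scores, dim=-1) @ V
        # Concatenate the h head outputs and project once more.
        out = heads.transpose(1, 2).contiguous().view(B, -1, self.h * self.d_k)
        return self.W_o(out)
```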
3.2.3. Applications of Attention in our Model
The Transformer uses multi-head attention in three different ways:
In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9].
The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections. See Figure 2.
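To make the mask concrete, here is a small sketch using the mask == 0 convention from the attention sketch in Section 3.2.1 (illustrative, not the paper's code): position i may attend only to positions j ≤ i.

```python
import torch

def subsequent_mask(size: int) -> torch.Tensor:
    # Entry (i, j) is True iff position i may attend to position j, i.e. j <= i.
    return torch.tril(torch.ones(size, size, dtype=torch.bool))

print(subsequent_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```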
3.3. Position-wise Feed-Forward Networks
In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.
FFN(x) = max(0, xW1 + b1)W2 + b2 (2)
While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is dmodel = 512, and the inner-layer has dimensionality dff = 2048.
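A minimal PyTorch sketch of equation (2) with the dimensions given above (class and attribute names are illustrative):

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position separately."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # nn.Linear acts on the last dimension, so the same transformation is
        # applied independently and identically at every position.
        return self.w_2(torch.relu(self.w_1(x)))
```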