3.4、Embeddings and Softmax
Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension dmodel. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [30]. In the embedding layers, we multiply those weights by √dmodel.
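The paper does not give code for this; the following is a minimal NumPy sketch (shapes and initialization are illustrative assumptions) of how one weight matrix can serve as both embedding tables and the pre-softmax projection, with the √dmodel scaling applied at lookup time:

```python
import numpy as np

d_model, vocab_size = 512, 37000
rng = np.random.default_rng(0)

# One weight matrix shared by the input embedding, the output embedding,
# and the pre-softmax linear transformation.
W = rng.normal(0.0, d_model ** -0.5, size=(vocab_size, d_model))

def embed(token_ids):
    # Embedding lookup, scaled by sqrt(d_model) as described above.
    return W[token_ids] * np.sqrt(d_model)

def output_distribution(decoder_states):
    # The pre-softmax projection reuses the transposed embedding matrix.
    logits = decoder_states @ W.T
    logits -= logits.max(-1, keepdims=True)      # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(-1, keepdims=True)  # softmax over the vocabulary

tokens = np.array([5, 42, 7])
probs = output_distribution(embed(tokens))       # shape (3, vocab_size)
```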
3.5、Positional Encoding
Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodel as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9].
Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types. n is the sequence length, d is the representation dimension, k is the kernel size of convolutions and r the size of the neighborhood in restricted self-attention.
In this work, we use sine and cosine functions of different frequencies:

PE(pos, 2i) = sin(pos / 10000^(2i/dmodel))
PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel))
where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos).
We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.
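As a concrete illustration, here is a minimal NumPy sketch of the sinusoidal encoding defined above (assuming an even dmodel; the function and variable names are ours):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(...)
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=100, d_model=512)
# The encodings are simply added to the (scaled) token embeddings:
# x = embed(tokens) + pe[:len(tokens)]
```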
4、Why Self-Attention
In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations (x1, ..., xn) to another sequence of equal length (z1, ..., zn), with xi, zi ∈ Rd, such as a hidden layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention, we consider three desiderata.
One is the total computational complexity per layer.
Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.
The third is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [12]. Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.
As noted in Table 1, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires O(n) sequential operations. In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d, which is most often the case with the sentence representations used by state-of-the-art models in machine translation, such as word-piece [38] and byte-pair [31] representations. To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size r in the input sequence centered around the respective output position. This would increase the maximum path length to O(n/r). We plan to investigate this approach further in future work.
A single convolutional layer with kernel width k < n does not connect all pairs of input and output positions. Doing so requires a stack of O(n/k) convolutional layers in the case of contiguous kernels, or O(log_k(n)) in the case of dilated convolutions [18], increasing the length of the longest paths between any two positions in the network. Convolutional layers are generally more expensive than recurrent layers, by a factor of k. Separable convolutions [6], however, decrease the complexity considerably, to O(k · n · d + n · d^2). Even with k = n, however, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model.
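To make the comparison concrete, the short sketch below evaluates the per-layer cost terms from Table 1 for illustrative sizes chosen here; these are operation counts up to constant factors, not measured FLOPs:

```python
# n: sequence length, d: representation dimension, k: kernel size,
# r: restricted-attention neighborhood (illustrative values).
n, d, k, r = 70, 512, 3, 16

costs = {
    "self-attention            O(n^2 * d)":   n * n * d,
    "recurrent                 O(n * d^2)":   n * d * d,
    "convolutional             O(k * n * d^2)": k * n * d * d,
    "restricted self-attention O(r * n * d)": r * n * d,
}
for layer, ops in costs.items():
    print(f"{layer}: ~{ops:,}")
# With n < d (typical for sentence-level translation), n^2 * d < n * d^2,
# so the self-attention layer does less work per layer than the recurrent one.
```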
As a side benefit, self-attention could yield more interpretable models. We inspect attention distributions from our models and present and discuss examples in the appendix. Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.
5、Training
This section describes the training regime for our models.
5.1、Training Data and Batching
We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding [3], which has a shared source-target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [38]. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.
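The batching procedure is not specified beyond the token budget; one plausible sketch of grouping sentence pairs by approximate length up to roughly 25000 source tokens per batch (function and parameter names are ours):

```python
def batch_by_tokens(sentence_pairs, max_src_tokens=25000):
    """sentence_pairs: list of (source_tokens, target_tokens) lists.
    Sort by source length and fill each batch up to ~max_src_tokens."""
    batches, batch, n_src = [], [], 0
    for src, tgt in sorted(sentence_pairs, key=lambda p: len(p[0])):
        if batch and n_src + len(src) > max_src_tokens:
            batches.append(batch)
            batch, n_src = [], 0
        batch.append((src, tgt))
        n_src += len(src)
    if batch:
        batches.append(batch)
    return batches
```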
5.2、Hardware and Schedule: 8 GPUs, 12 hours / 3.5 days
We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models (described on the bottom line of Table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days).
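The reported wall-clock times follow directly from the step counts and step times:

```python
base_hours = 100_000 * 0.4 / 3600         # ≈ 11.1 hours, reported as 12 hours
big_days   = 300_000 * 1.0 / (3600 * 24)  # ≈ 3.47 days, reported as 3.5 days
```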
5.3、Optimizer: Adam
We used the Adam optimizer [20] with β1 = 0.9, β2 = 0.98 and ε = 10^−9. We varied the learning rate over the course of training, according to the formula:

lrate = dmodel^(−0.5) · min(step_num^(−0.5), step_num · warmup_steps^(−1.5))
This corresponds to increasing the learning rate linearly for the first warmup_steps training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used warmup_steps = 4000.
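A minimal sketch of this schedule (the function name is ours):

```python
def transformer_lr(step_num, d_model=512, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step_num^-0.5, step_num * warmup_steps^-1.5)
    step_num = max(step_num, 1)  # guard against step 0
    return d_model ** -0.5 * min(step_num ** -0.5,
                                 step_num * warmup_steps ** -1.5)

# Linear warmup up to step 4000, then inverse-square-root decay:
for s in (1, 1000, 4000, 40000, 100000):
    print(s, f"{transformer_lr(s):.2e}")
```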
5.4、Regularization: three types
We employ three types of regularization during training:
Residual Dropout We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized.
In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks.
For the base model, we use a rate of Pdrop = 0.1.
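A minimal NumPy sketch of residual dropout as described above; the layer normalization here omits the learned gain and bias for brevity, and the stand-in sub-layer is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p_drop=0.1, train=True):
    if not train or p_drop == 0.0:
        return x
    mask = rng.random(x.shape) >= p_drop
    return x * mask / (1.0 - p_drop)  # inverted dropout keeps the expectation unchanged

def layer_norm(x, eps=1e-6):
    # Learned gain and bias omitted for brevity.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def residual_sublayer(x, sublayer, p_drop=0.1):
    # Dropout on the sub-layer output, then residual addition and normalization.
    return layer_norm(x + dropout(sublayer(x), p_drop))

x = rng.normal(size=(3, 512))                         # e.g. embeddings + positional encodings
x = dropout(x)                                        # dropout on the summed embeddings
y = residual_sublayer(x, sublayer=lambda h: h * 2.0)  # stand-in for attention / feed-forward
```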
Table 2: The Transformer achieves better BLEU scores than previous state-of-the-art models on the English-to-German and English-to-French newstest2014 tests at a fraction of the training cost.
Label Smoothing During training, we employed label smoothing of value εls = 0.1 [36]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
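A minimal sketch of label smoothing with εls = 0.1; how the smoothed mass is distributed is an implementation detail assumed here (uniform over the non-target vocabulary entries):

```python
import numpy as np

def label_smoothing_targets(labels, vocab_size, eps_ls=0.1):
    """One-hot targets smoothed so the correct class gets 1 - eps_ls and the
    remaining eps_ls is spread uniformly over the other classes."""
    targets = np.full((len(labels), vocab_size), eps_ls / (vocab_size - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps_ls
    return targets

def smoothed_cross_entropy(log_probs, labels, eps_ls=0.1):
    targets = label_smoothing_targets(labels, log_probs.shape[-1], eps_ls)
    return -(targets * log_probs).sum(-1).mean()
```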
6、Results
6.1、Machine Translation
On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is listed in the bottom line of Table 3. Training took 3.5 days on 8 P100 GPUs. Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.
On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0, outperforming all of the previously published single models, at less than 1/4 the training cost of the previous state-of-the-art model. The Transformer (big) model trained for English-to-French used dropout rate Pdrop = 0.1, instead of 0.3.
For the base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals. For the big models, we averaged the last 20 checkpoints. We used beam search with a beam size of 4 and length penalty α = 0.6 [38]. These hyperparameters were chosen after experimentation on the development set. We set the maximum output length during inference to input length + 50, but terminate early when possible [38].
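A minimal sketch of the checkpoint averaging step (the parameter-dictionary representation is hypothetical):

```python
def average_checkpoints(checkpoints):
    """checkpoints: list of dicts mapping parameter name -> np.ndarray,
    e.g. the last 5 (base) or last 20 (big) saved checkpoints.
    Returns the element-wise average used as the final model."""
    return {name: sum(ckpt[name] for ckpt in checkpoints) / len(checkpoints)
            for name in checkpoints[0]}
```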
Table 2 summarizes our results and compares our translation quality and training costs to other model architectures from the literature. We estimate the number of floating point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU.
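A sketch of that estimate for the big model, assuming a sustained throughput of roughly 9.5 TFLOPS per P100 (the per-GPU figures used are given in a footnote of the original paper):

```python
# Estimated training FLOPs = wall-clock time * number of GPUs * sustained
# single-precision throughput per GPU. For the big model:
seconds = 3.5 * 24 * 3600   # 3.5 days of training
gpus = 8                    # 8 P100s
sustained_flops = 9.5e12    # assumed sustained TFLOPS per P100
print(f"{seconds * gpus * sustained_flops:.1e} FLOPs")  # on the order of 10^19
```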