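Paper review
In 2017, the Google machine translation team published "Attention Is All You Need", which relies heavily on the self-attention mechanism to learn text representations.
Reference article: "Attention Is All You Need" explained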
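1. Motivation:
Relies on attention alone, with no RNN or CNN, so computation is highly parallelizable
Attention captures long-range dependencies better than an RNN does
2. Key contributions:
Self-attention, in which a sequence attends to itself, gives every token access to global semantic information (long-range dependencies)
Because self-attention computes attention between every token and all tokens in the sequence, the maximum path length between any two positions is just 1, no matter how far apart they are. This is what lets it capture long-range dependencies
Multi-head attention, which can be viewed as an ensemble version of attention in which different heads learn semantics in different subspaces.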
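The two ideas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: it uses the scaled dot-product form softmax(QK^T / sqrt(d_k)) V, but sets Q = K = V = the input and splits heads by slicing the feature dimension, whereas the real model applies learned projection matrices per head. The function names and the toy `tokens` input are assumptions made up for this example.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x.

    Assumption: no learned projections, to keep the sketch short.
    Returns (outputs, attention_weights).
    """
    d_k = len(x[0])
    weights = []
    for q in x:
        # One score per position: every token looks at every token directly.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in x]
        weights.append(softmax(scores))
    out = [[sum(w * v[j] for w, v in zip(row, x)) for j in range(d_k)]
           for row in weights]
    return out, weights

def multi_head_self_attention(x, num_heads):
    """Run attention independently on slices of the feature dimension.

    Assumption: slicing stands in for the learned per-head projections.
    """
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        sub = [row[h * d_head:(h + 1) * d_head] for row in x]
        head_out, _ = self_attention(sub)
        head_outputs.append(head_out)
    # Concatenate the heads back to d_model features per token.
    return [sum((head[i] for head in head_outputs), []) for i in range(len(x))]

# Toy input: 3 tokens with 4 features each (hypothetical values).
tokens = [[1.0, 0.0, 1.0, 0.0],
          [0.0, 1.0, 0.0, 1.0],
          [1.0, 1.0, 0.0, 0.0]]
out, weights = self_attention(tokens)
mh = multi_head_self_attention(tokens, num_heads=2)
```

Every row of `weights` is a probability distribution over all positions, so each output token mixes information from the whole sequence in one step. That is the "path length 1" property described above.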
Paper links
Link: "Attention Is All You Need"
PDF: "Attention Is All You Need"
Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
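In short: most sequence transduction models are built on complex recurrent or convolutional networks with an encoder and a decoder, and the strongest ones also link the two through attention. The Transformer keeps only the attention and drops recurrence and convolution entirely. On two machine translation tasks it delivers better quality, parallelizes better, and trains in far less time: 28.4 BLEU on WMT 2014 English-to-German (over 2 BLEU above the previous best results, ensembles included) and a single-model state-of-the-art 41.8 BLEU on WMT 2014 English-to-French after 3.5 days of training on eight GPUs, a small fraction of the training cost of the best earlier models. It also transfers to English constituency parsing, with both large and limited training data.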
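Note: BLEU (Bilingual Evaluation Understudy) is a text evaluation metric that measures how closely a machine translation matches professional human translations. The core idea is that the closer the machine output is to the professional translation, the better its quality. The resulting score serves as one indicator of machine translation quality.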
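To make the idea concrete, here is a simplified sentence-level sketch of BLEU: the geometric mean of clipped n-gram precisions, multiplied by a brevity penalty. It is only an illustration under stated assumptions: one reference, n-grams up to length 2, no smoothing. Scores like those quoted from the paper are computed over a whole test corpus, normally with n-grams up to length 4, and reported on a 0 to 100 scale. The function names are made up for this example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Count every contiguous n-gram in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited
    at most as many times as it appears in the reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

def simple_bleu(candidate, reference, max_n=2):
    """Simplified BLEU on a 0..1 scale (assumption: single reference,
    uniform weights, no smoothing)."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    # Brevity penalty: punish candidates shorter than the reference.
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
score = simple_bleu(candidate, reference)
```

For this pair the unigram precision is 5/6 and the bigram precision is 3/5, so `score` is sqrt(0.5), about 0.707. An exact match scores 1.0.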
1. Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].
In short: recurrent networks, LSTMs [13] and GRUs [7] in particular, are firmly established as the state of the art for sequence modeling and transduction tasks such as language modeling and machine translation [35, 2, 5], and much subsequent work has kept pushing recurrent language models and encoder-decoder architectures further [38, 24, 15].
Footnote from the paper's first page: ∗Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research. †Work performed while at Google Brain. ‡Work performed while at Google Research.
Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states h_t, as a function of the previous hidden state h_{t-1} and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.
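In short: a recurrent model factors its computation along the positions of the input and output sequences, tying each position to one time step. Each hidden state h_t depends on the previous state h_{t-1} and the input at position t, so the positions within a single training example cannot be processed in parallel. This matters most for long sequences, where memory limits how many examples fit in a batch. Factorization tricks [21] and conditional computation [32] have improved efficiency considerably (the latter also improves model quality), but the underlying sequential constraint remains.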
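The sequential constraint is easy to see in code. The sketch below is a toy scalar recurrence, not any model from the paper: the `tanh` update and the weights `w_h` and `w_x` are assumptions chosen only to show the data dependency h_t = f(h_{t-1}, x_t).

```python
import math

def rnn_hidden_states(inputs, w_h=0.5, w_x=1.0, h0=0.0):
    """Toy scalar RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t).

    Assumption: scalar state and fixed weights, for illustration only.
    """
    states = []
    h = h0
    for x_t in inputs:
        # Step t cannot start until step t-1 has produced h.
        h = math.tanh(w_h * h + w_x * x_t)
        states.append(h)
    return states

with_signal = rnn_hidden_states([1.0, 0.0, 0.0])
all_zero = rnn_hidden_states([0.0, 0.0, 0.0])
```

Changing only the first input changes every later state, because information reaches position t by passing through all the steps before it. The loop body therefore has to run in order, which is the limitation that motivates replacing recurrence with attention.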
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.
In short: attention is now a standard component of strong sequence modeling and transduction models because it can model dependencies regardless of how far apart they are in the input or output sequences [2, 19], but in all but a few cases [27] it is still used alongside a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
In short: the Transformer drops recurrence and relies entirely on attention to capture global dependencies between input and output. It parallelizes far better and reaches a new state of the art in translation quality after as little as twelve hours of training on eight P100 GPUs.