Paper: Translation and Commentary on the 2017 Google Machine Translation Team's "Transformer: Attention Is All You Need" (Part 4)

Overview: Translation and commentary on the 2017 Google machine translation team's "Transformer: Attention Is All You Need".

6.2 Model Variations


To evaluate the importance of different components of the Transformer, we varied our base model in different ways, measuring the change in performance on English-to-German translation on the development set, newstest2013. We used beam search as described in the previous section, but no checkpoint averaging. We present these results in Table 3.
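
As a side note on the evaluation setup: checkpoint averaging, which the authors deliberately switch off here, replaces the final weights with the element-wise mean of the last few saved checkpoints before decoding. Below is a minimal sketch of that idea, assuming checkpoints are already held in memory as NumPy parameter dictionaries (file handling and the training loop are omitted).

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Element-wise average of several checkpoints.

    `checkpoints` is a list of dicts mapping parameter names to arrays;
    every checkpoint is assumed to contain the same names and shapes.
    """
    averaged = {}
    for name in checkpoints[0]:
        stacked = np.stack([ckpt[name] for ckpt in checkpoints])
        averaged[name] = stacked.mean(axis=0)
    return averaged

# Toy usage: three "checkpoints" of a single 2x2 weight matrix.
ckpts = [{"w": np.full((2, 2), float(i))} for i in range(3)]
print(average_checkpoints(ckpts)["w"])  # every entry equals 1.0
```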



Table 4: The Transformer generalizes well to English constituency parsing (results are on Section 23 of the WSJ).




In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2. While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.
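
The computation stays constant in rows (A) because the per-head dimensions shrink as the number of heads grows: with d_k = d_v = d_model / h, each of the query, key and value projections totals d_model × d_model parameters regardless of h. The NumPy sketch below illustrates only this bookkeeping; it is not the paper's implementation, and it omits the output projection and masking of a full Transformer layer.

```python
import numpy as np

def multi_head_attention(x, h, rng):
    """Scaled dot-product attention with h heads over x of shape (n, d_model).

    d_k = d_v = d_model // h, so the concatenated head outputs and the total
    projection size stay the same for any h that divides d_model.
    """
    n, d_model = x.shape
    d_k = d_model // h
    heads = []
    for _ in range(h):
        # Per-head projections of shape (d_model, d_k): summed over h heads,
        # each projection type has d_model * d_k * h = d_model^2 parameters.
        w_q, w_k, w_v = (rng.standard_normal((d_model, d_k)) * d_k ** -0.5
                         for _ in range(3))
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = q @ k.T / np.sqrt(d_k)                 # (n, n) attention logits
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        heads.append(weights @ v)                       # (n, d_k) per head
    return np.concatenate(heads, axis=-1)               # (n, d_model)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 512))
for h in (1, 8, 16):
    print(h, multi_head_attention(x, h, rng).shape)  # always (5, 512)
```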


6.3 English Constituency Parsing


To evaluate whether the Transformer can generalize to other tasks, we performed experiments on English constituency parsing. This task presents specific challenges: the output is subject to strong structural constraints and is significantly longer than the input. Furthermore, RNN sequence-to-sequence models have not been able to attain state-of-the-art results in small-data regimes [37].

We trained a 4-layer Transformer with d_model = 1024 on the Wall Street Journal (WSJ) portion of the Penn Treebank [25], about 40K training sentences. We also trained it in a semi-supervised setting, using the larger high-confidence and BerkeleyParser corpora of approximately 17M sentences [37]. We used a vocabulary of 16K tokens for the WSJ-only setting and a vocabulary of 32K tokens for the semi-supervised setting.
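
For reference, the hyperparameters stated in this paragraph can be collected into a small configuration record. Only the values named above (4 layers, d_model = 1024, the 16K/32K vocabularies and the corpus sizes) come from the text; everything else about the parsing model is left unspecified here.

```python
from dataclasses import dataclass

@dataclass
class ParsingSetup:
    """Settings stated in the text for the constituency-parsing experiments."""
    num_layers: int = 4                 # 4-layer Transformer
    d_model: int = 1024                 # model dimension
    vocab_size: int = 16_000            # 16K tokens in the WSJ-only setting
    training_sentences: int = 40_000    # ~40K WSJ training sentences

# Semi-supervised setting: larger vocabulary, ~17M sentences of
# high-confidence and BerkeleyParser data in addition to the WSJ.
semi_supervised = ParsingSetup(vocab_size=32_000, training_sentences=17_000_000)
```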

We performed only a small number of experiments to select the dropout, both attention and residual (Section 5.4), learning rates and beam size on the Section 22 development set; all other parameters remained unchanged from the English-to-German base translation model. During inference, we increased the maximum output length to input length + 300. We used a beam size of 21 and α = 0.3 for both the WSJ-only and the semi-supervised setting.
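
These inference settings interact through the score used to rank beam hypotheses. The length penalty itself is not restated in this section; assuming the GNMT-style form of Wu et al. [38], which the paper also uses for its translation experiments, a hypothesis Y is ranked by its log-probability divided by ((5 + |Y|) / 6)^α, so α = 0.3 is a fairly mild push towards longer outputs. The sketch below shows only that ranking step, with made-up hypothesis scores.

```python
def length_penalty(length, alpha):
    """GNMT-style length penalty ((5 + |Y|) / 6) ** alpha, following [38]."""
    return ((5 + length) / 6.0) ** alpha

def rank_hypotheses(hypotheses, alpha=0.3, beam_size=21):
    """Keep the best `beam_size` hypotheses by length-normalised log-probability.

    Each hypothesis is a (token_list, log_prob) pair.
    """
    scored = [(log_prob / length_penalty(len(tokens), alpha), tokens)
              for tokens, log_prob in hypotheses]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:beam_size]

# Toy example: the longer hypothesis has a worse total log-probability but
# wins once the penalty normalises for length.
hyps = [(["a"] * 10, -12.0), (["b"] * 20, -13.5)]
print(rank_hypotheses(hyps, alpha=0.3, beam_size=2))
```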

Our results in Table 4 show that despite the lack of task-specific tuning our model performs surprisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar [8].

In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the BerkeleyParser [29] even when training only on the WSJ training set of 40K sentences.

7 Conclusion


In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.

For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both the WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles.

We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goal of ours.

The code we used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor.

Acknowledgements We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful comments, corrections and inspiration.

References


[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.


[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.


[3] Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V. Le. Massive exploration of neural machine translation architectures. CoRR, abs/1703.03906, 2017.


[4] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733, 2016.


[5] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.


[6] Francois Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.


[7] Junyoung Chung, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014.


[8] Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. Recurrent neural network grammars. In Proc. of NAACL, 2016.


[9] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122v2, 2017.


[10] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.


[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.


[12] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.


[13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.


[14] Zhongqiang Huang and Mary Harper. Self-training PCFG grammars with latent annotations across languages. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 832–841. ACL, August 2009.


[15] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.


[16] Łukasz Kaiser and Samy Bengio. Can active memory replace attention? In Advances in Neural Information Processing Systems, (NIPS), 2016.


[17] Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In International Conference on Learning Representations (ICLR), 2016.


[18] Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099v2, 2017.


[19] Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. Structured attention networks. In International Conference on Learning Representations, 2017.


[20] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.


[21] Oleksii Kuchaiev and Boris Ginsburg. Factorization tricks for LSTM networks. arXiv preprint arXiv:1703.10722, 2017.


[22] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017.


[23] Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114, 2015.


[24] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention- based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.


[25] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.


[26] David McClosky, Eugene Charniak, and Mark Johnson. Effective self-training for parsing. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 152–159. ACL, June 2006.


[27] Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model. In Empirical Methods in Natural Language Processing, 2016.


[28] Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304, 2017.


[29] Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 433–440. ACL, July 2006.


[30] Ofir Press and Lior Wolf. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859, 2016.


[31] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.


[32] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.


[33] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.


[34] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2440–2448. Curran Associates, Inc., 2015.


[35] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.


[36] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.


[37] Vinyals & Kaiser, Koo, Petrov, Sutskever, and Hinton. Grammar as a foreign language. In Advances in Neural Information Processing Systems, 2015.


[38] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.


[39] Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. Deep recurrent models with fast-forward connections for neural machine translation. CoRR, abs/1606.04199, 2016.


[40] Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang, and Jingbo Zhu. Fast and accurate shift-reduce constituent parsing. In Proceedings of the 51st Annual Meeting of the ACL (Volume 1: Long Papers), pages 434–443. ACL, August 2013.

