Paper:《Pre-Trained Models: Past, Present and Future大规模预训练模型的发展历史、最新现状和未来发展三个方向》翻译与解读


目录

Paper:《Pre-Trained Models: Past, Present and Future大规模预训练模型的发展历史、最新现状和未来发展三个方向》翻译与解读

Abstract

1 Introduction简介

2 Background背景

2.1 Transfer Learning and Supervised Pre-Training 迁移学习和有监督预训练

2.2 Self-Supervised Learning and Self-Supervised Pre-Training 自监督学习和自监督预训练

3 Transformer and Representative PTMs 代表性的预训练模型

3.1 Transformer

3.2 GPT

3.3 BERT

3.4 After GPT and BERT

4 Designing Effective Architecture设计有效的架构

4.1 Unified Sequence Modeling统一序列建模

4.2 Cognitive-Inspired Architectures认知启发式架构

4.3 More Variants of Existing PTMs现有PTMs的更多变体

5 Utilizing Multi-Source Data利用多源数据

5.1 Multilingual Pre-Training多语言预训练

5.2 Multimodal Pre-Training多模态预训练

5.3 Knowledge-Enhanced Pre-Training增强知识的预训练

6 Improving Computational Efficiency提高计算效率

6.1 System-Level Optimization系统级优化

6.2 Efficient Pre-Training 高效的预训练

6.3 Model Compression模型压缩

7 Interpretation and Theoretical Analysis解释与理论分析

7.1 Knowledge of PTMs知识

7.2 Robustness of PTMs鲁棒性

7.3 Structural Sparsity of PTMs结构稀疏性

7.4 Theoretical Analysis of PTMs理论分析

8 Future Direction未来方向

8.1 Architectures and Pre-Training Methods架构和预训练方法

8.2 Multilingual and Multimodal Pre-Training多语言、多模态的预训练

8.3 Computational Efficiency计算效率

8.4 Theoretical Foundation理论基础

8.5 Modeledge Learning

8.6 Cognitive and Knowledgeable Learning认知和知识学习

8.7 Applications应用

9 Conclusion

Note and Contribution


Paper:《Pre-Trained Models: Past, Present and Future大规模预训练模型的发展历史、最新现状和未来发展三个方向》翻译与解读

作者:清华唐杰团队等

发布时间:2021年9月

文章地址

http://keg.cs.tsinghua.edu.cn/jietang/publications/AIOPEN21-Han-et-al-Pre-Trained%20Models-%20Past,%20Present%20and%20Future.pdf

Xu Han1*, Zhengyan Zhang1*, Ning Ding1*, Yuxian Gu1*, Xiao Liu1*, Yuqi Huo2*, Jiezhong Qiu1, Liang Zhang2, Wentao Han1†, Minlie Huang1†, Qin Jin2†, Yanyan Lan4†, Yang Liu1,4†, Zhiyuan Liu1†, Zhiwu Lu3†, Xipeng Qiu5†, Ruihua Song3†, Jie Tang1†, Ji-Rong Wen3†, Jinhui Yuan6†, Wayne Xin Zhao3†, Jun Zhu1†
1 Department of Computer Science and Technology, Tsinghua University, Beijing, China
2 School of Information, Renmin University of China, Beijing, China
3 Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
4 Institute for AI Industry Research, Tsinghua University, Beijing, China
5 School of Computer Science, Fudan University, Shanghai, China
6 OneFlow Inc., Beijing, China
{hanxu17,zy-z19,dingn18,gu-yx17,liuxiao17,qiujz16}@mails.tsinghua.edu.cn,
{hanwentao,aihuang,lanyanyan,liuyang2011,liuzy,jietang,dcszj}@tsinghua.edu.cn,
{bnhony,zhangliang00,qjin,luzhiwu,jrwen,batmanfly}@ruc.edu.cn,
xpqiu@fudan.edu.cn, songruihua_bloon@outlook.com, yuanjinhui@oneflow.org

Abstract

Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved great success and become a milestone in the field of artificial intelligence (AI). Owing to sophisticated pre-training objectives and huge model parameters, large-scale PTMs can effectively capture knowledge from massive labeled and unlabeled data. By storing knowledge into huge parameters and fine-tuning on specific tasks, the rich knowledge implicitly encoded in huge parameters can benefit a variety of downstream tasks, which has been extensively demonstrated via experimental verification and empirical analysis. It is now the consensus of the AI community to adopt PTMs as backbone for downstream tasks rather than learning models from scratch. In this paper, we take a deep look into the history of pre-training, especially its special relation with transfer learning and self-supervised learning, to reveal the crucial position of PTMs in the AI development spectrum. Further, we comprehensively review the latest breakthroughs of PTMs. These breakthroughs are driven by the surge of computational power and the increasing availability of data, towards four important directions: designing effective architectures, utilizing rich contexts, improving computational efficiency, and conducting interpretation and theoretical analysis. Finally, we discuss a series of open problems and research directions of PTMs, and hope our view can inspire and advance the future study of PTMs.

近年来,BERT和GPT等大型预训练模型(PTM)取得了巨大的成功,成为人工智能(AI)领域的一个里程碑。由于复杂的预训练目标和庞大的模型参数,大规模PTMs能够有效地从大量有标签和无标签的数据中捕获知识。通过将知识存储到巨大的参数中,并对特定的任务进行微调,隐含在巨大参数中的丰富知识可以使各种下游任务受益,这已通过实验验证和经验分析得到广泛证明。现在AI社区的共识是采用PTMs作为下游任务的骨干,而不是从零开始学习模型。在本文中,我们深入研究了预训练的历史,特别是它与迁移学习和自监督学习的特殊关系,以揭示PTMs在人工智能发展谱系中的关键地位。此外,我们全面回顾了PTMs的最新突破。这些突破是由计算能力的激增和数据可用性的增加驱动的,朝着四个重要方向发展:设计有效的架构,利用丰富的上下文,提高计算效率,以及进行解释和理论分析。最后,我们讨论了PTMs的一系列有待解决的问题和研究方向,希望我们的观点能对PTMs的未来研究有所启发和推动。

1 Introduction简介

Deep neural networks, such as convolutional neural networks (CNNs) (Krizhevsky et al., 2012; Kim, 2014; Kalchbrenner et al., 2014; He et al., 2016), recurrent neural networks (RNNs) (Sutskever et al., 2014; Donahue et al., 2015; Liu et al., 2016; Wu et al., 2016), graph neural networks (GNNs) (Kipf and Welling, 2016; Veličković et al., 2018; Schlichtkrull et al., 2018), and attention neural networks (Jaderberg et al., 2015; Wang et al., 2017), have been widely applied for various artificial intelligence (AI) tasks in recent years. Different from previous non-neural models that largely relied on hand-crafted features and statistical methods, neural models can automatically learn low-dimensional continuous vectors (a.k.a., distributed representations) from data as task-specific features, thereby getting rid of complex feature engineering. Despite the success of deep neural networks, a number of studies have found that one of their critical challenges is data hungry. Since deep neural networks usually have a large number of parameters, they are thus easy to overfit and have poor generalization ability (Belkin et al., 2019; Xu et al., 2021) without sufficient training data.

Considering this issue, over the same period of developing deep neural networks, massive efforts have been devoted to manually constructing high-quality datasets for AI tasks (Deng et al., 2009; Lin et al., 2014; Bojar et al., 2014), making it possible to learn effective neural models for specific tasks that are superior to conventional non-neural models. However, it is expensive and time-consuming to manually annotate large-scale data. For example, utilizing crowdsourcing to segment images costs about $6.4 per image (Liu et al., 2020b). Some complex tasks that require expert annotations may charge much more to build their datasets. Several tasks such as visual recognition (Deng et al., 2009) and machine translation (Bojar et al., 2014) have datasets containing millions of samples, yet it is impossible to build such large-scale datasets for all AI tasks. More generally, the dataset of a specific AI task usually has a limited size. Hence, for a long time until now, it has been a key research issue: how to train effective deep neural models for specific tasks with limited human-annotated data.

深度神经网络,如卷积神经网络(CNNs) (Krizhevsky et al., 2012; Kim, 2014; Kalchbrenner et al., 2014; He et al., 2016),循环神经网络(RNNs) (Sutskever et al., 2014; Donahue et al., 2015; Liu et al., 2016; Wu et al., 2016),图神经网络(GNNs) (Kipf and Welling, 2016; Veličković et al., 2018; Schlichtkrull et al., 2018) 和注意力神经网络 (Jaderberg et al., 2015; Wang et al., 2017) 近年来已广泛应用于各种人工智能 (AI) 任务。与以前主要依赖手工特征和统计方法的非神经模型不同,神经模型可以从数据中自动学习低维连续向量(又称分布式表示)作为任务特定的特征,从而摆脱复杂的特征工程。尽管深度神经网络取得了成功,但许多研究发现,它们面临的关键挑战之一是数据匮乏。由于深度神经网络通常具有大量参数,因此在没有足够训练数据的情况下,它们很容易过度拟合并且泛化能力较差(Belkin 等人,2019;Xu 等人,2021)。

考虑到这个问题,在开发深度神经网络的同一时期,大量努力致力于为人工智能任务手动构建高质量的数据集(Deng et al., 2009; Lin et al., 2014; Bojar et al., 2014),从而有可能为特定任务学习有效的神经模型,这些模型优于传统的非神经模型。但是,手动注释大规模数据既昂贵又耗时。例如,使用众包分割图像的成本约为每张图像 6.4 美元(Liu et al., 2020b)。一些需要专家注释的复杂任务可能需要花费更多的钱来构建它们的数据集。视觉识别 (Deng et al., 2009) 和机器翻译 (Bojar et al., 2014) 等一些任务的数据集包含数百万个样本,但不可能为所有 AI 任务构建如此大规模的数据集。更普遍地说,特定 AI 任务的数据集通常具有有限的大小。因此,到目前为止,很长一段时间以来,它一直是一个关键的研究问题:如何在有限的人工注释数据的情况下为特定任务训练有效的深度神经模型。

 Figure 1: The two figures show the significant improvement on performance of both language understanding and language generation after using large-scale PTMs.

图1:这两幅图显示了在使用大规模的PTMs后,语言理解和语言生成性能显著提高

One milestone for this issue is the introduction of transfer learning (Thrun and Pratt, 1998; Pan and Yang, 2009). Instead of training a model from scratch with large amounts of data, human beings can learn to solve new problems with very few samples. This amazing learning process is motivated by the fact that human beings can use previously learned knowledge to handle new problems. Inspired by this, transfer learning formalizes a two-phase learning framework: a pre-training phase to capture knowledge from one or more source tasks, and a fine-tuning stage to transfer the captured knowledge to target tasks. Owing to the wealth of knowledge obtained in the pre-training phase, the fine-tuning phase can enable models to well handle target tasks with limited samples.

Transfer learning provides a feasible method for alleviating the challenge of data hungry, and it has soon been widely applied to the field of computer vision (CV). A series of CNNs (Krizhevsky et al., 2012; Simonyan and Zisserman, 2015; Szegedy et al., 2015; He et al., 2016) are pre-trained on the human-annotated visual recognition dataset ImageNet (Deng et al., 2009). Benefiting from the strong visual knowledge distributed in ImageNet, fine-tuning these pre-trained CNNs with a small amount of task-specific data can perform well on downstream tasks. This triggers the first wave of exploring pre-trained models (PTMs) in the era of deep learning. In this wave, PTMs are used for almost all CV tasks such as image classification (He et al., 2016), object detection (Sermanet et al., 2014; Ren et al., 2016), image segmentation (Long et al., 2015), and image captioning (Vinyals et al., 2015).

The natural language processing (NLP) community was also aware of the potential of PTMs and started to develop PTMs for NLP tasks (Qiu et al., 2020). To take full advantage of large-scale unlabeled corpora to provide versatile linguistic knowledge for NLP tasks, the NLP community adopts self-supervised learning (Liu et al., 2020b) to develop PTMs. The motivation of self-supervised learning is to leverage intrinsic correlations in the text as supervision signals instead of human supervision. For example, given the sentence “Beijing is the capital of China”, we mask the last word in the sentence, and then require models to predict the masked position with the word “China”. Through self-supervised learning, tremendous amounts of unlabeled textual data can be utilized to capture versatile linguistic knowledge without labor-intensive workload. This self-supervised setting in essence follows the well-known language model learning (Bengio et al., 2003).

这个问题的一个里程碑是迁移学习的引入(Thrun 和 Pratt,1998;Pan 和 Yang,2009)。人类可以学习用很少的样本来解决新问题,而不是用大量的数据从头开始训练模型。这一惊人的学习过程的动机是人类可以使用以前学到的知识来处理新问题。受此启发,迁移学习形式化了一个两阶段的学习框架:从一个或多个源任务中获取知识的预训练阶段,以及将获取的知识转移到目标任务的微调阶段。由于在预训练阶段获得了丰富的知识,因此微调阶段可以使模型在样本有限的情况下很好地处理目标任务。

迁移学习为缓解数据饥饿的挑战提供了一种可行的方法,并很快被广泛应用于计算机视觉(CV)领域。一系列CNN(Krizhevsky 等人,2012;Simonyan 和 Zisserman,2015;Szegedy 等人,2015;He 等人,2016)在人类标注的视觉识别数据集ImageNet上进行预训练(Deng等人,2009)。受益于 ImageNet 中分布的强大视觉知识,使用少量特定于任务的数据对这些预训练的 CNN 进行微调可以在下游任务上表现良好。这引发了深度学习时代探索预训练模型(PTM)的第一波浪潮。在这一浪潮中,PTMs被用于几乎所有的CV任务,如图像分类(He et al., 2016),目标检测(Sermanet et al., 2014;Ren et al., 2016)、图像分割(Long et al., 2015)和图像字幕(Vinyals et al., 2015)。

自然语言处理(NLP)社区也意识到了PTMs的潜力,并开始为NLP任务开发PTMs (Qiu et al., 2020)。为了充分利用大规模未标记语料库为 NLP 任务提供通用的语言知识,NLP社区采用了自我监督学习(Liu et al., 2020b)来开发PTMs。自我监督学习的动机是利用文本中的内在相关性作为监督信号而不是人工监督。例如,给定“Beijing is the capital of China”这句话,我们将最后一个单词屏蔽掉,然后要求模型用“China”这个词来预测被屏蔽的位置。通过自我监督学习,可以利用大量的未标记文本数据来获取通用的语言知识,而无需耗费大量的劳动密集型工作量。这种自我监督的设置在本质上遵循了著名的语言模型学习(Bengio et al., 2003)。
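To make the masking example above concrete, here is a minimal, hypothetical Python sketch of how a self-supervised training pair can be derived from raw text alone: the sentence itself supplies both the corrupted input and the prediction target, with no human annotation involved. The function name and mask token are illustrative, not from the paper.

```python
# Minimal illustration of building a self-supervised training pair from raw text,
# following the "Beijing is the capital of China" example above.
# No real model is involved; this only shows how the supervision signal
# comes from the text itself rather than from human labels.

def make_masked_example(sentence, mask_token="[MASK]"):
    """Mask the last word and return (corrupted input, target word)."""
    words = sentence.split()
    target = words[-1]                      # the word the model must recover
    corrupted = words[:-1] + [mask_token]   # replace it with a mask placeholder
    return " ".join(corrupted), target

inputs, label = make_masked_example("Beijing is the capital of China")
print(inputs)  # Beijing is the capital of [MASK]
print(label)   # China
```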

 Figure 2: Figure 2(a) shows the number of publications with the keyword “language model” as well as their citations in different years. Figure 2(b) shows the parameter size of large-scale PTMs for NLP tasks and the pre-training data size are increasing by 10 times per year. From these figures, we can find that, after 2018, when large-scale NLP PTMs begin to be explored, more and more efforts are devoted to this field, and the model size and data size used by the PTMs are also getting larger.

图2:图2(a)显示了关键词为“language model”的出版物数量及其在不同年份的被引次数。图2(b)显示了针对NLP任务的大规模PTMs的参数规模和预训练数据规模都在以每年10倍的速度增长。从这些数据可以看出,在2018年以后,当大规模的NLP PTMs开始被探索的时候,越来越多的精力投入到这个领域,PTMs所使用的模型大小和数据大小也越来越大。

For a long time, the problem of vanishing or exploding gradients (Bengio et al., 1994) is the pain point of using deep neural networks for NLP tasks. Therefore, when the CV community advances the research of deep PTMs, the early exploration of the NLP community focuses on pre-training shallow networks to capture semantic meanings of words, like Word2Vec (Mikolov et al., 2013b,a,c) and GloVe (Pennington et al., 2014). Although these pre-trained word embeddings play an important role in various NLP tasks, they still face a major limitation to represent polysemous words in different contexts, as each word is represented by only one dense vector. A famous example in NLP is that the word “bank” has entirely different meanings in the sentences “open a bank account” and “on a bank of the river”. This motivates pre-training RNNs to provide contextualized word embeddings (Melamud et al., 2016; Peters et al., 2018; Howard and Ruder, 2018), yet the performance of these models is still limited by their model size and depth.

With the development of deep neural networks in the NLP community, the introduction of Transformers (Vaswani et al., 2017) makes it feasible to train very deep neural models for NLP tasks. With Transformers as architectures and language model learning as objectives, deep PTMs GPT (Radford and Narasimhan, 2018) and BERT (Devlin et al., 2019) are proposed for NLP tasks in 2018. From GPT and BERT, we can find that when the size of PTMs becomes larger, large-scale PTMs with hundreds of millions of parameters can capture polysemous disambiguation, lexical and syntactic structures, as well as factual knowledge from the text. By fine-tuning large-scale PTMs with quite a few samples, rich linguistic knowledge of PTMs brings awesome performance on downstream NLP tasks. As shown in Figure 1(a) and Figure 1(b), large-scale PTMs well perform on both language understanding and language generation tasks in the past several years and even achieve better results than human performance. As shown in Figure 2(a), all these efforts and achievements in the NLP community let large-scale PTMs become the focus of AI research, after the last wave that PTMs allow for huge advances in the CV community.

长期以来,梯度消失或爆炸的问题(Bengio et al., 1994)是使用深度神经网络进行 NLP 任务的痛点。因此,在CV社区推进深度PTMs的研究时,NLP社区的早期探索主要集中在对浅层网络进行预训练,以捕获单词的语义意义,如Word2Vec (Mikolov et al., 2013b,a,c)和GloVe (Pennington et al., 2014)。尽管这些预先训练好的词嵌入在各种NLP任务中发挥着重要作用,但由于每个词仅由一个密集向量表示,因此在不同语境下对多义词的表示仍存在很大的局限性。NLP中一个著名的例子是“bank”这个词在“open a bank account”和“on a bank of the river”这两个句子中有着完全不同的含义。这促使预训练 RNN 提供上下文化的词嵌入(Melamud 等人,2016;Peters 等人,2018;Howard 和 Ruder,2018),但这些模型的性能仍然受到模型大小和深度的限制。

随着深度神经网络在NLP领域的发展,Transformers (Vaswani et al., 2017)的引入,使得为NLP任务训练非常深度的神经模型成为可能。以Transformer为架构,以语言模型学习为目标,2018 年针对 NLP 任务提出了深度 PTM GPT (Radford and Narasimhan, 2018) 和 BERT (Devlin et al., 2019)。从GPT和BERT中我们可以发现,当PTM 的规模变得更大时,具有数亿个参数的大规模 PTM 可以从文本中捕获多义消歧、词汇和句法结构以及事实知识。通过使用相当多的样本微调大规模 PTM,PTM 丰富的语言知识为下游 NLP 任务带来了出色的性能。如图1(a)和图1(b)所示,在过去的几年中,大规模的PTMs在语言理解和语言生成任务上都表现良好,甚至达到了比人类性能更好的结果。如图 2(a) 所示,在上一波 PTM 为 CV 社区带来巨大进步之后,NLP 社区的所有这些努力和成就让大规模 PTM 成为 AI 研究的重点。

Up to now, various efforts have been devoted to exploring large-scale PTMs, either for NLP (Radford et al., 2019; Liu et al., 2020d; Raffel et al., 2020; Lewis et al., 2020a), or for CV (Lu et al., 2019; Li et al., 2019; Tan and Bansal, 2019). Fine-tuning large-scale PTMs for specific AI tasks instead of learning models from scratch has also become a consensus (Qiu et al., 2020). As shown in Figure 2(b), with the increasing computational power boosted by the wide use of distributed computing devices and strategies, we can further advance the parameter scale of PTMs from million-level to billion-level (Brown et al., 2020; Lepikhin et al., 2021; Zeng et al., 2021; Zhang et al., 2020c, 2021a) and even trillion-level (Fedus et al., 2021). And the emergence of GPT-3 (Brown et al., 2020), which has hundreds of billions of parameters, enables us to take a glimpse of the latent power distributed in massive model parameters, especially the great abilities of few-shot learning like human beings (shown in Figure 3).

The existing large-scale PTMs have improved the model performance on various AI tasks and even subverted our current perception of the performance of deep learning models. However, several fundamental issues about PTMs still remain: it is still not clear for us the nature hidden in huge amounts of model parameters, and huge computational cost of training these behemoths also prevents us from further exploration. At this moment, these PTMs have pushed our AI researchers to a crossroad, with a number of open directions to go. “Rome wasn’t built in a day”— PTMs also experience a long development before achieving the latest success. To this end, we try to trace the development history of PTMs and draw their positions in the AI spectrum, which can give us a clear understanding of the core research issues of PTMs. Then, we introduce the details of various latest PTMs, following four important lines that are currently being advanced, including designing effective architectures, utilizing rich contexts, improving computational efficiency, and conducting interpretation and theoretical analysis. By integrating the current development of PTMs into the context of the historical spectrum, we discuss several open problems and conclude promising future directions for PTMs. We hope our efforts in this paper can advance further development of PTMs. In what follows, we will introduce the background of pre-training in Section 2 and Section 3, the model architectures of PTMs in Section 4, using multi-source heterogeneous data for PTMs in Section 5, the computational efficiency optimization of PTMs in Section 6, and the theoretical analysis of PTMs in Section 7. Finally, we will briefly discuss a series of open problems and promising directions towards better PTMs in the future.

到目前为止,人们已经做出了各种努力来探索大规模的PTMs,无论是针对NLP (Radford et al., 2019;Liu et al., 2020d;Raffel等人,2020;Lewis等人,2020a),或CV (Lu等人,2019;Li et al., 2019;Tan and Bansal, 2019)。针对特定的人工智能任务对大规模PTMs进行微调,而不是从零开始的学习模型,也已成为共识(Qiu等人,2020年)。如图2(b)所示,随着分布式计算设备和策略的广泛应用,计算能力不断增强,我们可以进一步将PTMs的参数规模从百万级提升到十亿级(Brown et al., 2020; Lepikhin et al., 2021; Zeng et al., 2021; Zhang et al., 2020c, 2021a),甚至是万亿级(Fedus et al., 2021)。而具有数千亿参数的GPT-3 (Brown et al., 2020)的出现,让我们得以一窥海量模型参数中分布的潜在能力,特别是像人类一样的少样本学习能力(如图3所示)。

现有的大规模PTMs提高了模型在各种人工智能任务上的性能,甚至颠覆了我们目前对深度学习模型性能的认知。然而,关于PTMs的几个基本问题仍然存在:我们仍然不清楚隐藏在大量模型参数中的本质,训练这些庞然大物的巨大计算成本也阻碍了我们进一步的探索。目前,这些 PTM 已将我们的 AI 研究人员推到了一个十字路口,面前有许多开放的方向可走。“罗马不是一天建成的”——PTMs在取得最新的成功之前也经历了一个漫长的发展过程。为此,我们试图追溯PTMs的发展历史,并绘制出它们在AI谱中的位置,这可以让我们清楚地了解PTMs的核心研究问题。然后,我们介绍了各种最新 PTM 的细节,遵循目前正在推进的四个重要方面,包括设计有效的架构,利用丰富的上下文,提高计算效率,以及进行解释和理论分析。通过将 PTM 的当前发展整合到历史范围的背景下,我们讨论了几个未解决的问题并得出 PTM 的有希望的未来方向。我们希望本文的工作能够推动PTMs的进一步发展。接下来,我们将在第 2 节和第 3 节介绍预训练的背景,第 4 节介绍 PTM 的模型架构,第 5 节介绍 PTM 的多源异构数据,第 6 节介绍 PTM 的计算效率优化,以及第 7 节中对 PTM 的理论分析。最后,我们将简要讨论一系列未解决的问题,以及未来通往更好PTMs的有前景的方向。

 Figure 3: GPT-3, with 175 billion parameters, uses 560 GB data and 10,000 GPUs for its training. It has shown the abilities of learning world knowledge, common sense, and logical reasoning.

图3:具有 1750 亿个参数的 GPT-3 使用 560 GB 数据和 10,000 个 GPU 进行训练。 它展示了学习世界知识、常识和逻辑推理的能力。

2 Background背景

Although effective PTMs have recently gained the attention of researchers, pre-training is not a novel machine learning tool. In fact, pre-training has been developed for decades, as a typical machine learning paradigm. In this section, we introduce the development of pre-training in the AI spectrum, from early supervised pre-training to current self-supervised pre-training, which can lead to a brief understanding of the background of PTMs.

尽管有效的PTMs最近引起了研究人员的关注,但预训练并不是一种新的机器学习工具。事实上,作为一种典型的机器学习范式,预训练已经发展了几十年。在本节中,我们将介绍人工智能领域中预训练的发展,从早期的有监督的预训练到目前的自监督的预训练,从而可以简要了解 PTM 的背景。

2.1 Transfer Learning and Supervised Pre-Training 迁移学习和监督预训练

The early efforts of pre-training are mainly involved in transfer learning (Thrun and Pratt, 1998). The study of transfer learning is heavily motivated by the fact that people can rely on previously learned knowledge to solve new problems and even achieve better results. More formally, transfer learning aims to capture important knowledge from multiple source tasks and then apply the knowledge to a target task.

In transfer learning, source tasks and target tasks may have completely different data domains and task settings, yet the knowledge required to handle these tasks is consistent (Pan and Yang, 2009). It is thus important to select a feasible method to transfer knowledge from source tasks to target tasks. To this end, various pre-training methods have been proposed to work as the bridge between source and target tasks. Specifically, these methods first pre-train models on the data of multiple source tasks to pre-encode knowledge and then transfer the pre-encoded knowledge to train models for target tasks.

Generally, two pre-training approaches are widely explored in transfer learning: feature transfer and parameter transfer. Feature transfer methods pre-train effective feature representations to pre-encode knowledge across domains and tasks (Johnson and Zhang, 2005; Evgeniou and Pontil, 2007; Dai et al., 2007; Raina et al., 2007). By injecting these pre-trained representations into target tasks, model performance of target tasks can be significantly improved. Parameter transfer methods follow an intuitive assumption that source tasks and target tasks can share model parameters or prior distributions of hyper-parameters. Therefore, these methods pre-encode knowledge into shared model parameters (Lawrence and Platt, 2004; Evgeniou and Pontil, 2004; Williams et al., 2007; Gao et al., 2008), and then transfer the knowledge by fine-tuning pre-trained parameters with the data of target tasks.

早期的预训练工作主要涉及迁移学习(Thrun和Pratt, 1998)。迁移学习的研究很大程度上是基于人们可以依靠以前学到的知识来解决新问题,甚至取得更好的结果。更正式地说,迁移学习的目的是从多个源任务中获取重要的知识,然后将这些知识应用到目标任务中

在迁移学习中,源任务和目标任务可能具有完全不同的数据域和任务设置,但处理这些任务所需的知识是一致的(Pan和Yang, 2009)。因此,选择一种可行的方法将知识从源任务转移到目标任务是非常重要的。为此,提出了各种预训练方法,作为源任务和目标任务之间的桥梁。具体来说,这些方法首先对多个源任务的数据进行预训练,对知识进行预编码,然后将预编码的知识转移到目标任务的训练模型中。

一般来说,迁移学习的两种预训练方法是特征迁移参数迁移特征迁移方法对有效的特征表示进行预训练,以对跨领域和任务的知识进行预编码(Johnson和Zhang, 2005;Evgeniou和Pontil, 2007;Dai等人,2007;Raina等人,2007)。通过将这些预先训练的表征注入到目标任务中,可以显著提高目标任务的模型性能。参数迁移方法遵循一个直观的假设,即源任务和目标任务可以共享模型参数或超参数的先验分布。因此,这些方法将知识预编码为共享的模型参数(Lawrence和Platt, 2004;Ev-geniou和Pontil, 2004年;Williams等人,2007;Gao et al., 2008),然后利用目标任务的数据,通过微调预训练参数传递知识
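As a rough illustration of the parameter-transfer idea described above, the following hypothetical PyTorch sketch pre-trains a small encoder on a toy source task and then reuses its parameters as the starting point for fine-tuning on a target task with only a few samples. All shapes, data, and hyper-parameters are made up for illustration; this is a sketch of the paradigm, not of any particular PTM.

```python
import torch
import torch.nn as nn

# A toy encoder whose parameters are shared between the source and target tasks.
encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 32))

# --- "Pre-training" on a toy source task: 4-way classification on random data ---
source_head = nn.Linear(32, 4)
opt = torch.optim.Adam(list(encoder.parameters()) + list(source_head.parameters()), lr=1e-3)
x, y = torch.randn(64, 16), torch.randint(0, 4, (64,))
for _ in range(10):
    loss = nn.functional.cross_entropy(source_head(encoder(x)), y)
    opt.zero_grad(); loss.backward(); opt.step()

# --- Parameter transfer: reuse the pre-trained encoder for a new target task ---
target_head = nn.Linear(32, 2)              # only the task-specific output layer is new
ft_opt = torch.optim.Adam(list(encoder.parameters()) + list(target_head.parameters()), lr=1e-4)
x_t, y_t = torch.randn(8, 16), torch.randint(0, 2, (8,))   # only a few target samples
for _ in range(5):
    loss = nn.functional.cross_entropy(target_head(encoder(x_t)), y_t)
    ft_opt.zero_grad(); loss.backward(); ft_opt.step()
```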

 Figure 4: The spectrum of pre-training methods from transfer learning, self-supervised learning to the latest pre-training neural models.

图4:从迁移学习、自监督学习到最新的预训练神经模型的预训练方法的范围。

To some extent, both representation transfer and parameter transfer lay the foundation of PTMs. Word embeddings, widely used as the input of NLP tasks, are built on the framework of feature transfer. Inspired by parameter transfer, pre-trained CNNs are applied as the backbone of most state-of-the-art CV models. Some recent well-known PTMs are also based on representation transfer and parameter transfer, e.g., ELMo (Peters et al., 2018) and BERT apply representation transfer and parameter transfer respectively.

Since AlexNet (Krizhevsky et al., 2012), a series of deep neural networks have been developed for AI tasks. As compared with those conventional machine learning models, deep neural models have more parameters and show better capabilities of fitting complex data. Therefore, from AlexNet to later VGG (Simonyan and Zisserman, 2015) and GoogleNet (Szegedy et al., 2015), the architecture of these neural networks becomes deeper and deeper, and their performance accordingly becomes better and better. Although the network depth is important, training a deep network is not easy, as stacking more network layers inevitably brings the problem of vanishing or exploding gradients (Bengio et al., 1994). Besides the gradient issues, model performance may soon meet a ceiling and then degrade rapidly with continually increasing network depths.

在一定程度上,表征迁移和参数迁移都是PTMs的基础。词嵌入建立在特征迁移的框架上,被广泛用作自然语言处理任务的输入。受参数迁移的启发,预训练的CNN被用作最先进的CV模型的主干。最近一些著名的PTMs也是基于表征迁移和参数迁移的,如ELMo (Peters et al., 2018)和BERT分别采用了表征迁移和参数迁移。

AlexNet (Krizhevsky et al., 2012)以来,一系列用于人工智能任务的深度神经网络被开发出来。与传统的机器学习模型相比,深度神经模型具有更多的参数,显示出更好的拟合复杂数据的能力。因此,从AlexNet到后来的VGG (Simonyan and Zisserman, 2015)和GoogleNet (Szegedy et al., 2015),这些神经网络的架构越来越深性能也越来越好。虽然网络深度很重要,但训练一个深度网络并,因为堆叠更多的网络不可避免地会带来梯度消失或爆炸的问题(Ben-gio et al., 1994)。除了梯度问题外,模型性能可能很快就会遇到一个上限,然后随着网络深度的不断增加而迅速下降。

By adding normalization to parameter initialization (LeCun et al., 2012; Saxe et al., 2013) and hidden states (Ioffe and Szegedy, 2015), and introducing shortcut connections with residual layers, ResNet (He et al., 2016) effectively tackles these problems. As we mentioned before, deep neural networks require large amounts of data for training. To provide sufficient data to train deep models, some large-scale supervised datasets have also been built (Russakovsky et al., 2015; Lin et al., 2014; Krishna et al., 2017; Chen et al., 2015; Cordts et al., 2016), and the most representative one is ImageNet. ImageNet contains millions of images divided into thousands of categories, representing a wide variety of everyday objects. Based on the combination of effective model ResNet, informative dataset ImageNet, as well as mature knowledge transfer methods, a wave of pre-training models on labeled data emerges.

The CV community benefits a lot from this wave. By applying ResNet pre-trained on ImageNet as the backbone, various CV tasks have been quickly advanced, like image classification (He et al., 2016; Lee et al., 2015), object detection (Ren et al., 2016; Sermanet et al., 2014; Gidaris and Komodakis, 2015), image segmentation (Long et al., 2015; Zheng et al., 2015), image caption (Vinyals et al., 2015; Johnson et al., 2016), visual question answering (Antol et al., 2015; Gao et al., 2015; Xiong et al., 2016), etc. Utilizing PTMs like ResNet50 has proven to be a crucial step to obtain highly accurate results on most CV tasks. Inspired by the success of PTMs for CV tasks, some NLP researchers also explore supervised pre-training, and the most representative work is CoVE (McCann et al., 2017). CoVE adopts machine translation as its pre-training objective. After pre-training, the encoder of source languages can work as a powerful backbone for downstream NLP tasks.

通过对参数初始化 (LeCun et al., 2012; Saxe et al., 2013) 和隐藏状态 (Ioffe and Szegedy, 2015) 添加归一化,并引入带有残差层的快捷连接,ResNet (He et al., 2016) 有效地解决了这些问题。正如我们之前提到的,深度神经网络需要大量的数据进行训练。为了提供足够的数据来训练深度模型,还建立了一些大规模的监督数据集(Russakovsky et al., 2015;Lin等人,2014;Krishna等人,2017年;Chen et al., 2015;Cordts et al., 2016),其中最具代表性的是ImageNet。ImageNet 包含数百万张图像,分为数千个类别,代表各种各样的日常对象。基于有效模型 ResNet、信息丰富的数据集 ImageNet 以及成熟的知识转移方法的组合,出现了一波在标记数据上预训练模型的浪潮。

CV社区从这波浪潮中获益良多。通过将在ImageNet上预训练的ResNet作为骨干,各种CV任务得到了快速的推进,比如图像分类(He et al., 2016;Lee et al., 2015),目标检测(Ren et al., 2016;Sermanet et al., 2014;Gidaris和Komodakis, 2015),图像分割(Long et al., 2015;Zheng等人,2015),图像说明(Vinyals等人,2015;Johnson等人,2016),视觉问答(Antol等人,2015;Gao等人,2015;Xiong et al., 2016)等。事实证明,利用 ResNet50 等 PTM 是在大多数 CV 任务上获得高度准确结果的关键步骤。受PTMs在CV任务上取得成功的启发,一些NLP研究者也探索了有监督的预训练,其中最具代表性的工作是CoVE (McCann et al., 2017)。CoVE采用机器翻译作为其预训练目标。经过预训练后,源语言的编码器可以作为下游NLP任务的强大主干。
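The shortcut connection with residual layers mentioned above can be illustrated with a minimal, hypothetical PyTorch module: the block's output is the transformed input plus the untouched input, so gradients have a direct path through the addition. This is a simplified sketch (fixed channel count, no downsampling), not the exact ResNet building block.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A simplified residual block: two conv layers with batch normalization,
    plus a shortcut connection that adds the input back to the output."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # shortcut: gradients flow through "+ x" directly

block = ResidualBlock(8)
print(block(torch.randn(1, 8, 16, 16)).shape)  # torch.Size([1, 8, 16, 16])
```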

2.2 Self-Supervised Learning and Self-Supervised Pre-Training 自监督学习和自监督预训练

As shown in Figure 4, transfer learning can be categorized under four sub-settings, inductive transfer learning (Lawrence and Platt, 2004; Mihalkova et al., 2007; Evgeniou and Pontil, 2007), transductive transfer learning (Shimodaira, 2000; Zadrozny, 2004; Daume III and Marcu, 2006), self-taught learning (Raina et al., 2007; Dai et al., 2008), and unsupervised transfer learning (Wang et al., 2008).

Among these four settings, the inductive and transductive settings are the core of research, as these two settings aim to transfer knowledge from supervised source tasks to target tasks. Although supervised learning is always one of the core issues of machine learning research, the scale of unlabeled data is much larger than that of manually labeled data. Recently, more and more researchers have noticed the importance of large-scale unlabeled data and are committed to extracting information from unlabeled data. Self-supervised learning has been proposed to extract knowledge from large-scale unlabeled data by leveraging input data itself as supervision.

Self-supervised learning and unsupervised learning have many similarities in their settings. To a certain extent, self-supervised learning can be regarded as a branch of unsupervised learning because they both apply unlabeled data. However, unsupervised learning mainly focuses on detecting data patterns (e.g., clustering, community discovery, and anomaly detection), while self-supervised learning is still in the paradigm of supervised settings (e.g., classification and generation) (Liu et al., 2020b).

如图4所示,迁移学习可以分为四个子设置,即归纳式迁移学习(Lawrence and Platt, 2004;Mihalkova等人,2007年;Evgeniou和Pontil, 2007),转导式迁移学习(Shimodaira, 2000;Zadrozny, 2004;Daume III和Marcu, 2006),自学学习(Raina等人,2007;Dai等人,2008)和无监督式迁移学习(Wang等人,2008)。

在这四种设置中,归纳转导设置是研究的核心,因为这两种设置旨在将知识从有监督的源任务转移到目标任务。虽然监督学习一直是机器学习研究的核心问题之一,但未标注数据的规模远大于人工标注的数据。最近,越来越多的研究人员注意到了大规模未标记数据的重要性,并致力于从未标记数据中提取信息自监督学习一种利用输入数据本身作为监督,来从大规模的未标记数据中提取知识的方法

自监督学习无监督学习在它们的设置上有很多相似之处。在一定程度上,自监督学习可以看作无监督学习的一个分支,因为它们都应用了未标记的数据。然而,无监督学习主要侧重于检测数据模式(如聚类、社区发现和异常检测),而自监督学习仍处于监督设置的范式(如分类和生成)(Liu et al., 2020b)。

The development of self-supervised learning makes it possible to perform pre-training on large-scale unsupervised data. Compared to supervised pre-training working as the cornerstone of CV in the deep learning era, self-supervised pre-training allows for huge advances in the field of NLP. Although some supervised pre-training methods like CoVE have achieved promising results on NLP tasks, it is nearly impossible to annotate a textual dataset as large as ImageNet, considering annotating textual data is far more complex than annotating images. Hence, applying self-supervised learning to utilize unlabeled data becomes the best choice to pre-train models for NLP tasks. The recent stunning breakthroughs in PTMs are mainly towards NLP tasks, more specifically pre-trained language models.

The early PTMs for NLP tasks exist in the form of well-known word embeddings (Collobert and Weston, 2008; Mikolov et al., 2013b; Pennington et al., 2014), which apply self-supervised methods to transform words into distributed representations. As these pre-trained word representations capture syntactic and semantic information in the text, they are often used as input embeddings and initialization parameters for NLP models and offer significant improvements over random initialization parameters (Turian et al., 2010). Since these word-level models often suffer from the word polysemy, Peters et al. (2018) further adopt a sequence-level neural model to capture complex word features across different linguistic contexts and generates context-aware word embeddings. Using word embeddings as the input of neural models has almost become the common mode for NLP tasks.

自监督学习的发展使得大规模非监督数据进行预训练成为可能。与监督的预训练作为深度学习时代CV的基石相比,自监督预训练NLP 领域取得了巨大进步。尽管一些监督预训练方法(如CoVE)在NLP任务上取得了很好的效果,但要注释像 ImageNet 这样大的文本数据集几乎是不可能的,因为注释文本数据要比注释图像复杂得多。因此,应用自监督学习来利用未标记数据成为NLP任务预训练模型的最佳选择。PTMs 最近的惊人突破主要是针对 NLP 任务,更具体地说是预训练的语言模型

NLP任务早期的PTMs以众所周知的词嵌入形式存在(Collobert 和 Weston,2008;Mikolov 等人,2013b;Pennington 等人,2014),该方法应用自监督方法单词转换为分布式表示。由于这些预训练的词表示捕获文本中的句法和语义信息,它们经常被用作NLP模型的输入嵌入初始化参数,并比随机初始化参数提供了显著的改进(Turian et al., 2010)。由于这些词级模型经常遭受一词多义现象的困扰,Peters等人(2018)进一步采用序列级神经模型来捕获不同语言上下文中的复杂单词特征,并生成上下文感知词嵌入。使用词嵌入作为神经模型的输入几乎已成为自然语言处理任务的常用模式

After Vaswani et al. (2017) propose Transformers to deal with sequential data, PTMs for NLP tasks have entered a new stage, because it is possible to train deeper language models compared to conventional CNNs and RNNs. Different from those word-level PTMs used as input features, the Transformer-based PTMs such as GPT and BERT can be used as the model backbone of various specific tasks. After pre-training these Transformer-based PTMs on large-scale textual corpora, both the architecture and parameters of PTMs can serve as a starting point for specific NLP tasks, i.e., just fine-tuning the parameters of PTMs for specific NLP tasks can achieve competitive performance. So far, these Transformer-based PTMs have achieved state-of-the-art results on almost all NLP tasks. Inspired by GPT and BERT, many more effective PTMs for NLP tasks have also been proposed, like XLNET (Yang et al., 2019), RoBERTa (Liu et al., 2020d), BART (Lewis et al., 2020a), and T5 (Raffel et al., 2020).

With the recent advance of PTMs for NLP tasks, applying Transformer-based PTMs as the backbone of NLP tasks has become a standard procedure. Motivated by the success of self-supervised learning and Transformers in NLP, some researchers explore self-supervised learning (Wu et al., 2018; Chen et al., 2020c; Chen and He, 2020; He et al., 2020) and Transformers (Carion et al., 2020; Liu et al., 2021c) for CV tasks. These preliminary efforts have shown that self-supervised learning and Transformers can outperform conventional supervised CNNs. Furthermore, Transformer-based multimodal PTMs (Lu et al., 2019; Li et al., 2019; Tan and Bansal, 2019) have also been proposed and shown promising results. After the last wave of supervised pre-training, self-supervised pre-training has become the focus of current AI research.

Looking back at the pre-training in the AI spectrum, it is not difficult to find that pre-training has been developed for decades, focusing on how to acquire versatile knowledge for various downstream tasks. Next, we will comprehensively introduce the latest breakthroughs of PTMs in this wave of self-supervised pre-training. Considering that almost all the latest PTMs are related to pre-trained language models, “PTMs” in the following sections refers to pre-trained language models or multimodal models. For those conventional PTMs based on supervised pre-training, we refer to the papers of He et al. (2019) and Zoph et al. (2020).

在Vaswani et al.(2017)提出Transformers处理序列数据后,用于 NLP 任务的 PTM 进入了一个新阶段,因为与传统的 CNN 和 RNN 相比,它可以训练更深的语言模型。与那些用作输入特征的单词级PTMs不同,基于Transformer的PTMs(如GPT和BERT)可以用作各种特定任务的模型主干。在大规模文本语料库上对这些基于Transformer的PTM进行预训练后,PTM的结构和参数都可以作为特定NLP任务的起点,即仅针对特定的NLP任务对PTM的参数进行微调就可以获得竞争性能。到目前为止,这些基于Transformer的PTMs在几乎所有的NLP任务上都取得了最先进的结果。在GPT和BERT的启发下,许多针对NLP任务的更有效的PTMs也被提出,如XLNET (Yang等人,2019)、RoBERTa (Liu等人,2020d)、BART (Lewis等人,2020a)和T5 (Raffel等人,2020)。

随着近年来NLP任务的PTMs的进步,应用基于Transformer的PTMs作为NLP任务的主干已经成为一种标准流程。在NLP中自监督学习和Transformers成功的推动下,一些研究者探索了将自监督学习(Wu等人,2018;Chen et al., 2020c;Chen and He, 2020;He等人,2020)和Transformers (Carion等人,2020;Liu等人,2021c)用于CV任务。这些初步的努力已经表明,自监督学习和Transformers可以超越传统的有监督CNN。此外,基于Transformer的多模态PTMs (Lu等人,2019;Li et al., 2019;Tan和Bansal, 2019年)也被提出并显示出可喜的结果。在上一波有监督的预训练之后,自监督的预训练成为当前人工智能研究的焦点。

回顾人工智能领域的预训练,不难发现预训练已经发展了几十年专注于如何获取下游各种任务的通用知识。接下来,我们将全面介绍PTMs在这一波自监督式预训练中的最新突破。考虑到几乎所有最新的PTMs都与预训练的语言模型有关,以下章节中的“PTMs”是指预训练的语言模型或多模态模型。对于那些基于监督预训练的传统PTMs,我们参考He et al.(2019)和Zoph et al.(2020)的论文。

Figure 5: An illustration of the self-attention mechanism of Transformer. The figure shows the self-attention results when encoding the word “he”, where the darker the color of the square is, the larger the corresponding attention score is.

图5:Transformer的自注意力机制示意图。如图所示为编码单词“he”时的自注意力结果,正方形的颜色越深,对应的注意力分数越大。

3 Transformer and Representative PTMs 代表性的预训练模型

As we mentioned before, the key to the success of recent PTMs is an integration of self-supervised learning and Transformers. Hence, this section begins with the dominant basic neural architecture, Transformer. Then, we will introduce two landmark Transformer-based PTMs, GPT and BERT, which respectively use autoregressive language modeling and autoencoding language modeling as the pre-training objective. All subsequent PTMs are variants of these two models. The final part of this section gives a brief review of typical variants after GPT and BERT to reveal the recent development of PTMs.

正如我们前面提到的,最近PTMs成功的关键是自监督学习Transformers集成。因此,本节从占主导地位的基本神经结构Transformer开始。然后,我们将引入两种具有里程碑意义的基于Transformer的PTMs, GPTBERT,它们分别使用自回归语言建模和自编码语言建模作为预训练目标。所有后续的PTMs都是这两个模型的变体。本节的最后一部分简要回顾了GPTBERT之后的典型变体,揭示了PTMs的最新发展。

 Figure 6: The difference between GPT and BERT in their self-attention mechanisms and pre-training objectives.

图6:GPTBERT在自注意力机制和预训练目标方面的差异。

3.1 Transformer

Before Transformer, RNNs have been typical neural networks for processing sequential data (especially for natural languages) for a long time. As RNNs are equipped with sequential nature, they read a word at each time step in order and refer to the hidden states of the previous words to process it. Such a mechanism is considered to be difficult to take advantage of the parallel capabilities of high-performance computing devices such as GPUs and TPUs.

As compared to RNNs, Transformer is an encoder-decoder structure that applies a self-attention mechanism, which can model correlations between all words of the input sequence in parallel. Hence, owing to the parallel computation of the self-attention mechanism, Transformer could fully take advantage of advanced computing devices to train large-scale models. In both the encoding and decoding phases of Transformer, the self-attention mechanism of Transformer computes representations for all input words. Next, we dive into the self-attention mechanism more specifically.

在Transformer之前,RNN长期以来一直是处理顺序数据(特别是自然语言)的典型神经网络。由于RNN具有顺序性,它们在每个时间步按顺序读取一个单词,并参考之前单词的隐藏状态来处理它。这种机制被认为很难利用高性能计算设备(如GPU和TPU)的并行能力。

与RNNs相比,Transformer是一种编码器-解码器结构,它应用了一种自注意力机制,可以并行地对输入序列中所有单词之间的相关性进行建模。因此,由于自注意力机制的并行计算Transformer可以充分利用先进的计算设备来训练大规模模型。在Transformer的编码和解码阶段,Transformer的自注意力机制计算所有输入单词的表示。接下来,我们将更具体地探讨自注意力机制

In the encoding phase, for a given word, Transformer computes an attention score by comparing it with each other word in the input sequence. And such attention scores indicate how much each of the other words should contribute to the next representation of the given word. Then, the attention scores are utilized as weights to compute a weighted average of the representations of all the words. We give an example in Figure 5, where the self-attention mechanism accurately captures the referential relationships between “Jack” and “he”, generating the highest attention score. By feeding the weighted average of all word representations into a fully connected network, we obtain the representation of the given word. Such a procedure is essentially an aggregation of the information of the whole input sequence, and it will be applied to all the words to generate representations in parallel. In the decoding phase, the attention mechanism is similar to the encoding, except that it only decodes one representation from left to right at one time. And each step of the decoding phase consults the previously decoded results. For more details of Transformer, please refer to its original paper (Vaswani et al., 2017) and the survey paper (Lin et al., 2021).

Due to the prominent nature, Transformer gradually becomes a standard neural structure for natural language understanding and generation. Moreover, it also serves as the backbone neural structure for the subsequently derived PTMs. Next, we introduce two landmarks that completely open the door towards the era of large-scale self-supervised PTMs, GPT and BERT. In general, GPT is good at natural language generation, while BERT focuses more on natural language understanding.

编码阶段,对于给定的单词Transformer通过与输入序列中的其他单词进行比较计算出一个注意力分数。这样的注意力分数表明了其他每个单词在对给定单词的下一次表征中应该起到多大作用。然后,注意力分数被用作权重来计算所有单词表示的加权平均值。我们在图5中给出了一个示例,其中自注意力机制准确地捕捉了“Jack”和“he”之间的引用关系,从而产生了最高的注意力分数。过将所有单词表示的加权平均值输入到一个完全连接的网络中,我们获得了给定单词的表示。这一过程本质上是整个输入序列信息的集合,它将应用于所有的单词以并行生成表示。在解码阶段,注意力机制与编码类似,不同的是它一次只从左到右解码一种表示。并且,解码阶段的每一步都参考先前解码的结果。关于Transformer的更多细节,请参考其原始论文(Vaswani et al., 2017)和调查论文(Lin et al., 2021)。

由于其突出的特性Transformer逐渐成为一种用于自然语言理解和生成的标准神经结构。此外,它还作为随后衍生的PTMs的中枢神经结构。接下来,我们将介绍两个完全开启大规模自监督PTMs时代大门的里程碑:GPTBERT。一般来说,GPT擅长于自然语言生成,而BERT更侧重于自然语言理解
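To make the computation above concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention: every word is compared with every other word to obtain attention scores, and those scores weight an average of all word representations. The random projection matrices and dimensions are illustrative assumptions; the full Transformer additionally uses multiple heads, learned parameters, residual connections, and feed-forward layers.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence of word vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # compare every word with every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax -> attention scores per word
    return weights @ V, weights                       # weighted average of all word representations

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                          # 5 "words", 16-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
print(out.shape, attn.shape)                          # (5, 16) (5, 5): one attention row per word
```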

3.2 GPT

As introduced in Section 2, PTMs typically consist of two phases, the pre-training phase and the fine-tuning phase. Equipped with the Transformer decoder as the backbone, GPT applies generative pre-training and discriminative fine-tuning. Theoretically, compared to precedents of PTMs, GPT is the first model that combines the modern Transformer architecture and the self-supervised pre-training objective. Empirically, GPT achieves significant success on almost all NLP tasks, including natural language inference, question answering, commonsense reasoning, semantic similarity and classification.

如第2节所介绍的,PTMs 通常由两个阶段组成:预训练阶段和微调阶段。GPT以Transformer解码器为骨干,应用了生成式预训练和判别式微调。从理论上讲,与以往的PTMs先例相比,GPT是第一个将现代Transformer架构和自监督的预训练目标结合起来的模型。从经验上看,GPT在几乎所有的NLP任务中都取得了显著的成功,包括自然语言推理、问题回答、常识推理、语义相似度和分类。

Given large-scale corpora without labels, GPT optimizes a standard autoregressive language modeling, that is, maximizing the conditional probabilities of all the words given their corresponding previous words as contexts. In the pre-training phase of GPT, the conditional probability of each word is modeled by Transformer. As shown in Figure 6, for each word, GPT computes its probability distributions by applying multi-head self-attention operations over its previous words followed by position-wise feed-forward layers.

The adaptation procedure of GPT to specific tasks is fine-tuning, by using the pre-trained parameters of GPT as a start point of downstream tasks. In the fine-tuning phase, passing the input sequence through GPT, we can obtain the representations of the final layer of the GPT Transformer. By using the representations of the final layer and task-specific labels, GPT optimizes standard objectives of downstream tasks with simple extra output layers. As GPT has hundreds of millions of parameters, it is trained for 1 month on 8 GPUs, which is fairly the first “large-scale” PTM in the history of NLP. And undoubtedly, the success of GPT paves the way for the subsequent rise of a series of large-scale PTMs. In the next part, we introduce another most representative model BERT.

预训练阶段:对于给定的无标签的大型语料库,GPT优化了一种标准的自回归语言建模,也就是说,在给定其对应的前面的词作为上下文的情况下,最大化所有单词的条件概率。在GPT的预训练阶段,Transformer对每个单词的条件概率进行建模。如图6所示,对于每个单词,GPT通过对其前面的单词应用多头自注意力操作,再接位置前馈层,来计算其概率分布。

微调阶段:GPT特定任务的适应过程是通过使用GPT预训练的参数作为下游任务的起点进行微调。在微调阶段,通过GPT传递输入序列,我们可以得到GPTTransformer最后一层的表示。通过使用最后一层表示特定于任务的标签GPT使用简单的额外输出层优化下游任务的标准目标。由于GPT有数亿个参数,因此在8 个 GPU 上训练了 1 个月,这是NLP历史上第一个“大规模”的PTM。毫无疑问,GPT的成功为随后一系列大型PTMs的兴起铺平了道路。在接下来的部分中,我们将介绍另一种最具代表性的BERT模型。
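The autoregressive objective described above can be sketched in a few lines of PyTorch: inputs and targets are the same token sequence shifted by one position, and the loss is the negative log-likelihood of each next token. The tiny model below merely stands in for the Transformer decoder, so this is an illustration of the objective, not of GPT itself; all sizes are made up.

```python
import torch
import torch.nn as nn

# A toy "language model": an embedding plus a linear layer that scores the next token.
# It only stands in for the Transformer decoder; the point is the autoregressive
# objective itself, expressed by shifting inputs and targets by one position.
vocab_size, dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))

tokens = torch.randint(0, vocab_size, (1, 12))    # a toy "sentence" of 12 token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # targets are the inputs shifted one step left

logits = model(inputs)                            # shape (1, 11, vocab_size)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(float(loss))   # negative log-likelihood of the next tokens, averaged over positions
```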

 Figure 7: The pre-training and fine-tuning phases for BERT.

图7:BERT的预训练和微调阶段。

3.3 BERT

The emergence of BERT has also greatly promoted the development of the PTM field. Theoretically, compared with GPT, BERT uses a bidirectional deep Transformer as the main structure. There are also two separate stages to adapt BERT for specific tasks, pre-training and fine-tuning (see Figure 7).

In the pre-training phase, BERT applies autoencoding language modeling rather than autoregressive language modeling used in GPT. More specifically, inspired by cloze (Taylor, 1953), the masked language modeling (MLM) objective is designed. As shown in Figure 6, in the procedure of MLM, tokens are randomly masked with a special token [MASK], the objective is to predict words at the masked positions with contexts. Compared with standard unidirectional autoregressive language modeling, MLM can lead to a deep bidirectional representation of all tokens.

Besides MLM, the objective of next sentence prediction (NSP) is adopted to capture discourse relationships between sentences for some downstream tasks with multiple sentences, such as natural language inference and question answering. For this task, a binary classifier is used to predict whether two sentences are coherent. In the pre-training phase, MLM and NSP work together to optimize the parameters of BERT.

BERT的出现也极大地促进了PTM领域的发展。理论上,与GPT相比,BERT采用了双向深度Transformer结构作为主体结构。还有两个单独的阶段可以使BERT适应特定的任务,即预训练微调(参见图7)。

在预训练阶段,BERT使用的是自动编码语言建模,而不是GPT中使用的自回归语言建模。更具体地说,受完形填空(Taylor, 1953)的启发,设计了目标掩码语言建模(MLM)。如图6所示,在MLM过程中,token 被一个特殊的 token [MASK] 随机掩码,目标是根据上下文预测掩码位置上的单词。与标准的单向自回归语言建模相比,MLM可以实现所有符号的深度双向表示

除了 MLM,还采用下一句预测(NSP)的目标来捕获句子之间的话语关系,用于一些具有多个句子的下游任务,例如自然语言推理和问答。对于此任务,使用二元分类器预测两个句子是否连贯。在预训练阶段,MLM和NSP协同工作以优化BERT的参数。
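A minimal sketch of the MLM corruption step is shown below: positions are selected at random, replaced with [MASK], and the original tokens at those positions become the only prediction targets. BERT's actual recipe is more involved (15% of tokens are selected, of which 80% become [MASK], 10% a random token, and 10% stay unchanged); the simplified version here only masks, as an assumption for brevity.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Randomly mask tokens; targets keep the original tokens at masked positions only."""
    random.seed(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            corrupted.append(mask_token)
            targets.append(tok)        # the model must recover this token from context
        else:
            corrupted.append(tok)
            targets.append(None)       # no loss is computed at unmasked positions
    return corrupted, targets

sentence = "the man went to the store to buy a gallon of milk".split()
print(mask_tokens(sentence))
```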

After pre-training, BERT can obtain robust parameters for downstream tasks. By modifying inputs and outputs with the data of downstream tasks, BERT could be fine-tuned for any NLP tasks. BERT could effectively handle those applications with the input of a single sentence or sentence pairs. For the input, its schema is two sentences concatenated with the special token [SEP], which could represent:

(1) sentence pairs in paraphrase,

(2) hypothesis-premise pairs in entailment,

(3) question-passage pairs in question answering, and

(4) a single sentence for text classification or sequence tagging.

For the output, BERT will produce a token-level representation for each token, which can be used to handle sequence tagging or question answering, and the special token [CLS] can be fed into an extra layer for classification. After GPT, BERT has further achieved significant improvements on 17 different NLP tasks, including SQuAD (better than human performance), GLUE (7.7% point absolute improvements), MNLI (4.6% point absolute improvements), etc.

通过预训练,BERT可以获得稳健的下游任务参数。通过使用下游任务的数据修改输入和输出,BERT可以对任何NLP任务进行微调BERT能够有效地处理那些只输入一个句子或句子对的应用程序。对于输入,其模式是两个与特殊标记 [SEP] 连接的句子,可以表示:

(1)复述(paraphrase)任务中的句子对,

(2)蕴含中的假设-前提对,

(3)问答中的问题-段落对,

(4)用于文本分类或序列标注的单句。

对于输出,BERT将为每个token生成一个token级别的表示,它可以用于处理序列标注或问题回答,并且特殊的token[CLS]可以被输入到额外的层进行分类。在GPT之后,BERT在17个不同的NLP任务上进一步取得了显著的改进,包括SQuAD(优于人类表现)、GLUE(7.7个百分点的绝对提升)、MNLI(4.6个百分点的绝对提升)等。
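The input schema above (a single sentence, or two sentences joined by [SEP], with [CLS] in front) can be sketched as follows. Tokenization here is naive whitespace splitting and the function name is hypothetical; real BERT uses WordPiece tokens plus position embeddings, which are omitted.

```python
def build_bert_input(text_a, text_b=None):
    """Pack one sentence or a sentence pair into BERT's input schema.
    Whitespace tokenization is used only for illustration (BERT really uses WordPiece)."""
    tokens = ["[CLS]"] + text_a.split() + ["[SEP]"]
    segment_ids = [0] * len(tokens)              # segment 0 for the first sentence
    if text_b is not None:
        b_tokens = text_b.split() + ["[SEP]"]
        tokens += b_tokens
        segment_ids += [1] * len(b_tokens)       # segment 1 for the second sentence
    return tokens, segment_ids

# A question-passage pair, as used in question answering:
toks, segs = build_bert_input("where is the Eiffel Tower", "the Eiffel Tower is in Paris")
print(toks)
print(segs)
# The representation of "[CLS]" would feed an extra classification layer;
# the per-token representations would feed a tagging or span-prediction head.
```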

3.4 After GPT and BERT

After GPT and BERT, some of their improvements have been proposed, such as RoBERTa and ALBERT. RoBERTa (Liu et al., 2020d) is one of the successful variants of BERT, which mainly has four simple and effective changes:

(1) Removing the NSP task;

(2) More training steps, with bigger batch size and more data;

(3) Longer training sentences;

(4) Dynamically changing the [MASK] pattern.

RoBERTa achieves impressive empirical results on the basis of BERT. Moreover, RoBERTa has pointed out that the NSP task is relatively useless for the training of BERT. ALBERT (Lan et al., 2019) is another important variant of BERT, which provides several interesting observations on reducing parameters. First, it factorizes the input word embedding matrix into two smaller ones. Second, it enforces parameter-sharing between all Transformer layers to significantly reduce parameters. Third, it proposes the sentence order prediction (SOP) task to substitute BERT's NSP task. As a sacrifice to its space efficiency, ALBERT has a slower fine-tuning and inference speed.

GPTBERT之后,也有人提出了一些改进方案,如RoBERTaALBERTRoBERTa(Liu et al., 2020d)是BERT成功变体之一,它主要有四个简单而有效的改变:

(1)去除NSP任务;

(2)更多的训练步骤,更大的batch size和更多的数据;

(3)较长的训练句;

(4)动态改变[MASK]模式。

RoBERTaBERT的基础上取得了令人印象深刻的实证结果。此外,RoBERTa还指出,NSP任务对于BERT的训练相对来说用处不大ALBERT(Lan et al., 2019)是BERT的另一个重要变体,它提供了一些关于减少参数的有趣观察。首先,它将输入的词嵌入矩阵分解为两个较小的矩阵。其次,它强制所有Transformer层之间的参数共享,以显著减少参数。第三,提出了句子顺序预测(SOP)任务来替代BERT的NSP任务。作为对其空间效率的牺牲ALBERT具有较慢的微调推理速度

As shown in Figure 8, besides RoBERTa and ALBERT, there are various PTMs being proposed in recent years towards better capturing knowledge from unlabeled data. Some work improves the model architectures and explores novel pre-training tasks, such as XLNet (Yang et al., 2019), UniLM (Dong et al., 2019), MASS (Song et al., 2019), SpanBERT (Joshi et al., 2020) and ELECTRA (Clark et al., 2020). Besides, incorporating rich data sources is also an important direction, such as utilizing multilingual corpora, knowledge graphs, and images. Since the model scale is a crucial success factor of PTMs, researchers also explore to build larger models to reach over hundreds of billions of parameters, such as the series of GPT (Radford et al., 2019; Brown et al., 2020), Switch Transformer (Fedus et al., 2021), and meanwhile conduct computational efficiency optimization for training PTMs (Shoeybi et al., 2019; Rajbhandari et al., 2020; Ren et al., 2021). In the following sections, we will further introduce all these efforts for PTMs in detail.

如图8所示,除了RoBERTa和ALBERT之外,最近几年还提出了各种各样的PTMs,以更好地从未标记的数据中捕获知识。一些工作改进了模型架构并探索了新的预训练任务,如XLNet (Yang et al., 2019)、UniLM (Dong et al., 2019)、MASS (Song et al., 2019)、SpanBERT (Joshi et al., 2020)和ELECTRA (Clark et al., 2020)。此外,整合丰富的数据源也是一个重要的方向,如利用多语言语料库、知识图谱和图像。由于模型规模是PTMs成功的一个关键因素,研究人员还探索构建更大的模型,以达到数千亿参数,如GPT系列(Radford等人,2019;Brown et al., 2020),Switch Transformer (Fedus et al., 2021),同时对训练PTMs进行计算效率优化(Shoeybi et al., 2019;Rajbhandari等人,2020年;Ren等,2021)。在下面的小节中,我们将进一步详细介绍所有这些针对PTMs的工作。

4 Designing Effective Architecture设计有效的架构

In this section, we dive into the after-BERT PTMs deeper. The success of Transformer-based PTMs has stimulated a stream of novel architectures for modeling sequences for natural language and beyond. Generally, all the after-BERT Transformer architectures for language pre-training could be categorized according to two motivations: toward unified sequence modeling and cognitive-inspired architectures. Besides, we also take a glimpse over other important BERT variants in the third subsection, which mostly focus on improving natural language understanding.

在本节中,我们将更深入地探讨BERT之后的PTMs。基于Transformer的PTMs的成功激发了一系列为自然语言及其他领域建模序列的新颖架构。一般来说,所有用于语言预训练的后BERT的Transformer架构都可以根据两个动机进行分类:统一序列建模和认知启发式架构。此外,我们还在第三小节中对其他重要的BERT变体进行了简要介绍,这些BERT变体主要侧重于提高自然语言的理解能力。

4.1 Unified Sequence Modeling统一序列建模

Why is NLP so challenging? One of the fundamental reasons is that it has versatile downstream tasks and applications, which could be generally categorized into three genres:

(1)、Natural language understanding: includes grammatical analysis, syntactic analysis, word/sentence/paragraph classification, question answering, factual/commonsense knowledge inference, etc.

(2)、Open-ended language generation: includes dialog generation, story generation, data-to-text generation, etc.

(3)、Non-open-ended language generation: includes machine translation, abstract summarizing, blank filling, etc.

为什么NLP如此具有挑战性?一个根本的原因是它有多种多样的下游任务和应用,大致可以分为三种类型:

(1)、自然语言理解:包括语法分析、句法分析、单词/句子/段落分类、问答、事实/常识知识推理等。

(2)、开放式语言生成:包括对话生成、故事生成、数据转文本生成等。

(3)、非开放式语言生成:包括机器翻译、摘要总结、填空等。

Nevertheless, the differences between them are not so significant. As Feynman’s saying goes, “What I cannot create, I do not understand”. On one hand, a model that can not understand must not fluently generate; on the other hand, we can easily turn understanding tasks into generation tasks (Schick and Schütze, 2020). Recent studies also show that GPTs can achieve similar and even better performance on understanding benchmarks than BERTs (Liu et al., 2021b). The boundary between understanding and generation is vague.

Based on the observation, a bunch of novel architectures has been seeking for unifying different types of language tasks with one PTM. We will take a look over its development and discuss the inspirations they bring towards a unified foundation of natural language processing.

Combining Autoregressive and Autoencoding Modeling. The pioneer work to unify GPT-style unidirectional generation and BERT-style bidirectional understanding is XLNet (Yang et al., 2019), which proposes the permutated language modeling. The masked-recover strategy in BERT naturally contradicts with its downstream application, where there is no [MASK] in input sentences. XLNet solves the problem by permutating tokens’ order in the pre-training and then applying the autoregressive prediction paradigm, which endows XLNet with the ability for both understanding and generation. An important follower of permutation language modeling is MPNet (Song et al., 2020), which amends the XLNet’s discrepancy that in pre-training XLNet does not know the sentence’s length while in downstream it knows.

Besides permutated language modeling, another stream would be multi-task training. UniLM (Dong et al., 2019) proposes to jointly train different language modeling objectives together, including unidirectional, bidirectional, and sequence-to-sequence (seq2seq) objectives. This can be achieved by changing the attention masks in Transformers. UniLM performs quite well in generative question answering and abstract summarization.

然而,它们之间的差异并不那么显著。正如费曼所说:“我不能创造的,我就不能理解。”一方面,不能理解的模型不能流畅地生成;另一方面,我们可以很容易地将理解任务转化为生成任务(Schick和Schütze, 2020)。最近的研究还表明,与BERTs相比,GPTs可以在理解基准上取得类似甚至更好的性能(Liu等人,2021b)。理解和生成之间的界限是模糊的

基于这种观察,一堆新颖的架构一直在寻求用一个 PTM 统一不同类型的语言任务。我们将回顾它的发展,并讨论它们给自然语言处理的统一基础带来的启示。

结合自回归和自编码建模。将GPT风格的单向生成BERT风格的双向理解统一起来的先锋工作是XLNet (Yang等人,2019),它提出了置换语言建模。BERT中的masked-recover策略自然与其下游应用相矛盾,后者在输入句中没有[MASK]。XLNet通过在预训练打乱token的顺序,然后应用自回归预测范式解决了这一问题,使XLNet具备了既理解又生成的能力。置换语言建模的一个重要追随者是MPNet (Song et al., 2020),它修正XLNet的差异,即在预训练中XLNet不知道句子的长度,而在下游它知道句子的长度。

除了置换语言建模,另一个方向是多任务训练。UniLM (Dong et al., 2019)提出联合训练不同的语言建模目标,包括单向、双向和序列到序列(seq2seq)目标。这可以通过改变Transformer中的注意力掩码来实现。UniLM在生成式问答和抽象总结方面表现得很好。
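The idea of switching objectives by changing the self-attention mask, as UniLM does, can be illustrated with a minimal NumPy sketch: a bidirectional mask lets every token attend everywhere (BERT-like), a lower-triangular mask gives left-to-right generation (GPT-like), and a seq2seq mask lets the source segment attend bidirectionally while the target segment attends to the source plus its own prefix. The exact masks used in UniLM's implementation may differ in detail; this only shows the principle.

```python
import numpy as np

def attention_mask(mode, src_len, tgt_len=0):
    """mask[i, j] = 1 means position i may attend to position j."""
    n = src_len + tgt_len
    if mode == "bidirectional":          # BERT-style: every token sees every token
        return np.ones((n, n), dtype=int)
    if mode == "unidirectional":         # GPT-style: each token sees only its left context
        return np.tril(np.ones((n, n), dtype=int))
    if mode == "seq2seq":                # UniLM-style seq2seq objective
        mask = np.zeros((n, n), dtype=int)
        mask[:, :src_len] = 1                                   # everyone may attend to the source
        mask[src_len:, src_len:] = np.tril(np.ones((tgt_len, tgt_len), dtype=int))  # target: left-to-right
        return mask
    raise ValueError(mode)

print(attention_mask("seq2seq", src_len=3, tgt_len=2))
```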

Recently, GLM (Du et al., 2021) proposes a more elegant approach for combining autoregressive and autoencoding. Given a variable-length masked span, instead of providing the number of [MASK] to model as BERT and SpanBERT (Joshi et al., 2020) do, GLM asks Transformer blocks to autoregressively generate the masked tokens. And to preserve the information of [MASK]s’ number, GLM proposes a 2D positional encoding strategy. GLM is the first model to achieve the best performance on all types of tasks including natural language understanding, conditional generation, and unconditional generation at the same time.

Applying Generalized Encoder-Decoder. Before GLM, neither the encoder structure (e.g., BERT) nor the decoder structure (e.g., GPT) can solve an important problem: to fill in blanks with variable lengths (Du et al., 2021; Shen et al., 2020b). The decoder-based models can not make it because they can only generate at the end of the sequence and neither the encoder-based models because the number of [MASK]s will leak information. A natural idea is to turn to encoder-decoder architecture originally designed for machine translation, which would produce variable lengths of target sequences conditioned on the sources.

最近,GLM (Du et al., 2021)提出了一种更优雅的方法来结合自回归和自编码。给定一个可变长度的掩码跨度,GLM 要求Transformer块自回归地生成被掩码的token,而不是像BERT和SpanBERT (Joshi et al., 2020)那样把[MASK]的数量提供给模型。为了保留[MASK]数量的信息,GLM 提出了一种2D位置编码策略。GLM 是第一个在所有类型的任务(包括自然语言理解、条件生成和无条件生成)上同时达到最佳性能的模型。

应用广义编码器-解码器(Encoder-Decoder)。在GLM之前,无论是编码器结构(如BERT),还是解码器结构(如GPT)都不能解决一个重要问题:用可变长度填充空白(Du等人,2021;沈等,2020b)。基于解码器的模型不能做到这一点,因为它们只能在序列的末尾生成,而基于编码器的模型也不能做到这一点,因为[MASK]的数量会泄露信息。一个自然的想法是转向最初为机器翻译设计的编码器-解码器架构,它将根据源产生可变长度的目标序列

The pioneer of this genre is MASS (Song et al., 2019), which introduces the masked-prediction strategy into the encoder-decoder structure. However, MASS does not touch the problem of filling variable-length blanks. T5 (Raffel et al., 2020) solves the problem by masking a variable-length of span in text with only one mask token and asks the decoder to recover the whole masked sequence. BART (Lewis et al., 2020a) introduces the interesting idea of corrupting the source sequence with multiple operations such as truncation, deletion, replacement, shuffling, and masking, instead of mere masking. There are follow-up works that specialize in typical seq2seq tasks, such as PEGASUS (Zhang et al., 2020a) and PALM (Bi et al., 2020).

However, several challenges lie in front of encoder-decoder architectures. First, the encoder-decoder introduces much more parameters compared to a single encoder/decoder. Although this problem could be alleviated by parameter-sharing of the encoder and decoder, its parameter-efficiency is still doubtful. Second, encoder-decoder structures generally do not perform very well on natural language understanding. Despite reported improvements over similar-sized vanilla BERT, well-trained RoBERTa or GLM encoder performs much better than them.

这一流派的先驱是MASS (Song et al., 2019),它将掩码预测策略引入到编码器-解码器结构中。然而,MASS 并没有涉及填充可变长度空白的问题。T5 (Raffel et al., 2020)通过使用一个掩码token在文本中掩码一个可变长度的跨度来解决这个问题,并要求解码器恢复整个被掩码的序列。BART (Lewis等人,2020a)引入了一种有趣的思想,即通过多种操作破坏源序列,例如截断、删除、替换、打乱和掩码,而不仅仅是掩码。还有一些后续工作专注于典型的seq2seq任务,如PEGASUS (Zhang et al., 2020a)和PALM (Bi et al., 2020)。

然而,编码器-解码器架构面前还存在几个挑战。首先,与单个编码器/解码器相比,编码器-解码器引入了更多的参数。虽然这个问题可以通过编码器和解码器之间的参数共享来缓解,但其参数效率仍然值得怀疑。其次,编码器-解码器结构通常在自然语言理解方面表现得不太好。尽管有报道称其相对同等规模的原始BERT有所提升,但训练充分的RoBERTa或GLM编码器的性能仍然要比它们好得多。

Table 1: Three fundamental types of framework and their suitable downstream tasks. “NLU” refers to natural language understanding. “Cond. Gen.” and “Uncond. Gen.” refer to conditional and unconditional text generation, respectively. “✓” means “is good at”, “—” means “could be adapted to”, and “✗” means “cannot be directly applied to”. We define unconditional generation as the task of generating text without further training as in a standard language model, while conditional generation refers to seq2seq tasks such as text summarization. Taken from (Du et al., 2021).

表1:三种基本的框架类型和它们适合的下游任务。“NLU”指自然语言理解。“Cond. Gen.”和“Uncond. Gen.”分别指有条件和无条件的文本生成。“✓”表示“擅长”,“—”表示“可以适配”,“✗”表示“不能直接适用”。我们将无条件生成定义为像标准语言模型那样无需进一步训练即可生成文本的任务,而条件生成则指如文本摘要之类的seq2seq任务。引自(Du等人,2021)。

4.2 Cognitive-Inspired Architectures架构

 Figure 8: The family of recent typical PTMs, including both pre-trained language models and multimodal models.

图8:最近典型的PTMs家族,包括预训练的语言模型多模态模型

Is the current Transformer a good enough implementation of human beings' cognitive system? Of course not. Attention mechanism, the core module in Transformer architecture, is inspired by the micro and atom operation of the human's cognitive system and is only responsible for the perceptive function. However, human-level intelligence is far more complex than the mere understanding of the association between different things.

In pursuit of human-level intelligence, understanding the macro architecture of our cognitive functions including decision making, logical reasoning, counterfactual reasoning and working memory (Baddeley, 1992) is crucial. In this subsection, we will take a look over the novel attempts inspired by advances of cognitive science, especially on maintainable working memory and sustainable long-term memory.

现在的Transformer是对人类认知系统的足够好的实现吗?当然不是。Transformer架构的核心模块——注意力机制,其灵感来源于人类认知系统的微观和原子操作,且只负责感知功能。然而,人类水平的智能远比仅仅理解不同事物之间的关联复杂。

在追求人类水平的智能时,理解我们认知功能的宏观架构,包括决策、逻辑推理、反事实推理和工作记忆(Baddeley, 1992),是至关重要的。在这一小节中,我们将回顾受认知科学进展启发的新尝试,特别是在可维持的工作记忆和可持续的长期记忆方面。

Maintainable Working Memory. A natural problem of Transformer is its fixed window size and quadratic space complexity, which significantly hinders its applications in long document understanding.

Despite the bunch of modifications on approximate computing of the quadratically growing point-wise attention (Tay et al., 2020), a question is that we humans do not present such a long-range attention mechanism. As an alternative, cognitive scientists have revealed that humans could maintain a working memory (Baddeley, 1992; Brown, 1958; Barrouillet et al., 2004; Wharton et al., 1994), which not only memorizes and organizes but also forgets. The conventional long short-term memory network is an exemplar practice for such a philosophy.

For Transformer-based architecture, the Transformer-XL (Dai et al., 2019) is the first to introduce segment-level recurrence and relative positional encoding to fulfill this goal. However, the recurrence only implicitly models the working memory. As a more explicit solution, CogQA (Ding et al., 2019) proposes to maintain a cognitive graph in the multi-hop reading. It is composed of two systems: the System 1 based on PTMs and the System 2 based on GNNs to model the cognitive graph for multi-hop understanding.

A limitation of CogQA is that its use of the System 1 is still based on fixed window size. To endow working memory with the ability to understand long documents, CogLTX (Ding et al., 2020) leverages a MemRecall language model to select sentences that should be maintained in the working memory and another model for answering or classification.

可维持工作记忆Transformer的一个自然问题是其固定的窗口大小二次空间复杂度,这严重阻碍了其在长文档理解中的应用。

尽管已经有大量工作对二次增长的逐点注意力做近似计算上的改进(Tay等人,2020),但问题在于,我们人类并没有表现出这样的长程注意力机制。作为一种替代,认知科学家已经揭示,人类能够维持一种工作记忆(Baddeley, 1992;Brown, 1958;Barrouillet等人,2004;Wharton等人,1994),它不仅能够记忆和组织信息,还会遗忘。传统的长短期记忆网络就是这种理念的一个范例。

对于基于Transformer的架构,Transformer-XL (Dai等人,2019)是第一个引入片段级循环(segment-level recurrence)和相对位置编码来实现这一目标的工作。然而,这种循环机制只是隐式地对工作记忆进行建模。作为一种更明确的解决方案,CogQA (Ding et al., 2019)提出在多跳阅读中维护一个认知图。它由两个系统组成:基于PTMs的系统1和基于GNN的系统2,后者对认知图进行建模以实现多跳理解。
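
下面给出一个片段级循环注意力的示意草图(基于PyTorch;省略了Transformer-XL的相对位置编码与多头机制,函数与变量名均为假设,仅用于说明"缓存上一片段的隐状态并停止梯度"这一点):

```python
# 假设性的最小示意:Transformer-XL 式的片段级循环。
# 处理当前片段时,把上一片段缓存的隐状态(停止梯度)拼接到键/值的前面,
# 使有效上下文跨越片段边界,起到一种"工作记忆"的作用。
import torch

def attend_with_memory(h, mem, Wq, Wk, Wv):
    """h: (seg_len, d) 当前片段的隐状态;mem: (mem_len, d) 上一片段的缓存。"""
    ctx = torch.cat([mem.detach(), h], dim=0)        # 记忆不回传梯度
    q, k, v = h @ Wq, ctx @ Wk, ctx @ Wv
    attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v                                   # (seg_len, d)

d, seg_len = 16, 8
Wq, Wk, Wv = (torch.randn(d, d) * 0.1 for _ in range(3))
mem = torch.zeros(seg_len, d)                         # 初始记忆为空(全零)
for segment in [torch.randn(seg_len, d) for _ in range(3)]:
    out = attend_with_memory(segment, mem, Wq, Wk, Wv)
    mem = segment                                     # 缓存当前片段供下一片段使用
```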

CogQA的一个限制是它对系统1的使用仍然基于固定的窗口大小。为了赋予工作记忆理解长文档的能力,CogLTX (Ding et al., 2020)利用MemRecall语言模型来选择应该保存在工作记忆中的句子,并使用另一个模型来回答或分类。

Sustainable Long-Term Memory. The success of GPT-3 (Brown et al., 2020) and recent studies on language models' ability in recalling factual knowledge (Petroni et al., 2019; Wang et al., 2020a; Liu et al., 2021b) have revealed the fact that Transformers can memorize. But how do Transformers make it?

In Lample et al. (2019), the authors provide some inspiring evidence on how Transformers memorize. They replace the feed-forward networks in a Transformer layer with large key-value memory networks, and find it to work pretty well. This somehow proves that the feed-forward networks in Transformers are equivalent to memory networks.

Nevertheless, the memory capacity in Transformers is quite limited. For human intelligence, besides working memory for deciding and reasoning, the long-term memory also plays a key role in recalling facts and experiences. REALM (Guu et al., 2020) is a pioneer to explore how to construct a sustainable external memory for Transformers. The authors tensorize the whole Wikipedia sentence by sentence, and retrieve relevant sentences as context for masked pre-training. The tensorized Wikipedia is asynchronously updated for a given number of training steps. RAG (Lewis et al., 2020b) extends the masked pre-training to autoregressive generation, which could be better than extractive question answering.

Besides tensorizing the text corpora, (Verga et al., 2020; Févry et al., 2020) propose to tensorize entities and triples in existing knowledge bases. When entities appear in contexts, they replace entity tokens' embedding in an internal Transformer layer with the embedding from outer memory networks. (Dhingra et al., 2020; Sun et al., 2021) maintain a virtual knowledge from scratch, and propose a differentiable reasoning training objective over it. All of these methods achieve promising improvement on many open-domain question answering benchmarks.

可持续的长期记忆。GPT-3的成功(Brown et al., 2020)以及最近关于语言模型回忆事实知识能力的研究(Petroni et al., 2019;Wang et al., 2020a;Liu等人,2021b)揭示了Transformer能够记忆这一事实。但是Transformer是怎么做到的呢?

在Lample等人(2019)的工作中,作者就Transformer如何记忆提供了一些有启发性的证据。他们用大型键值记忆网络替换了Transformer层中的前馈网络,并且发现它工作得非常好。这在某种程度上证明了Transformer中的前馈网络与记忆网络是等价的。
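
下面用几行NumPy给出"前馈网络≈键值记忆"这一观点的示意(概念性草图,并非Lample等人的记忆网络实现):

```python
# 假设性的最小示意:把 Transformer 中的前馈网络(FFN)看作键值记忆。
# W1 的每一行相当于一个"键",W2 的对应列相当于一个"值":
# 输入与键的匹配程度(经非线性)决定取回哪些值的加权和。
import numpy as np

def ffn_as_memory(x, keys, values):
    """x: (d,);keys: (m, d) 对应 W1;values: (m, d) 对应 W2 的转置。"""
    scores = np.maximum(keys @ x, 0.0)   # ReLU(W1 x):每个"记忆槽"的激活强度
    return scores @ values               # 按激活强度加权求和,取回对应的"值"

d, m = 8, 32
rng = np.random.default_rng(0)
x = rng.normal(size=d)
keys, values = rng.normal(size=(m, d)), rng.normal(size=(m, d))
print(ffn_as_memory(x, keys, values).shape)   # (8,)
```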

然而,Transformer的记忆容量是相当有限的。对于人类智能而言,除了用于决策和推理的工作记忆外,长期记忆在回忆事实和经验方面也起着关键作用。REALM (Guu等人,2020)是探索如何为Transformer构建可持续外部记忆的先驱。作者将整个维基百科逐句张量化(tensorize),并检索相关的句子作为掩码预训练的上下文。张量化后的维基百科每隔一定数量的训练步会被异步更新。RAG (Lewis et al., 2020b)将掩码预训练扩展到自回归生成,这可能比抽取式问答效果更好。
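
下面是检索增强掩码预训练前向流程的一个极简示意(并非REALM官方实现;embed等均为假设的占位函数,真实系统使用训练得到的稠密编码器和MIPS索引):

```python
# 假设性的最小示意:REALM 式检索增强掩码预训练的前向流程。
# 语料被预先编码("张量化")为向量索引;每个带 [MASK] 的输入先检索相关句子,
# 再把检索结果拼接为上下文做掩码预测。索引每隔若干训练步异步刷新。
import numpy as np

def retrieve(query_vec, corpus_vecs, corpus_texts, k=2):
    scores = corpus_vecs @ query_vec                 # 内积检索(MIPS 的简化版)
    topk = np.argsort(-scores)[:k]
    return [corpus_texts[i] for i in topk]

def retrieval_augmented_input(masked_text, embed, corpus_vecs, corpus_texts):
    q = embed(masked_text)
    evidence = retrieve(q, corpus_vecs, corpus_texts)
    # 实际模型中这里会送入 Transformer 做 [MASK] 预测;此处只展示输入的拼接
    return " [SEP] ".join(evidence + [masked_text])

# 玩具语料与随机"编码器",仅为演示数据流
corpus_texts = ["Paris is the capital of France.",
                "The Transformer was proposed in 2017.",
                "BERT uses masked language modeling."]
rng = np.random.default_rng(0)
embed = lambda text: rng.normal(size=16)             # 占位编码器(随机向量)
corpus_vecs = np.stack([embed(t) for t in corpus_texts])
print(retrieval_augmented_input("[MASK] is the capital of France.",
                                embed, corpus_vecs, corpus_texts))
```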

除了对文本语料库进行张量化外,(Verga等人,2020;Févry等人,2020)提出对现有知识库中的实体和三元组进行张量化。当实体出现在上下文中时,他们用来自外部记忆网络的嵌入替换Transformer内部层中实体token的嵌入。(Dhingra et al., 2020; Sun et al., 2021)则从零开始维护一个虚拟知识,并在其上提出了一个可微分的推理训练目标。所有这些方法都在许多开放域问答基准上取得了可观的提升。

4.3 More Variants of Existing PTMs现有PTMs的更多变体

Besides the practice to unify sequence modeling and construct cognitive-inspired architectures, most current studies focus on optimizing BERT's architecture to boost language models' performance on natural language understanding.

A stream of work aims at improving the masking strategy, which could be regarded as a certain kind of data augmentation (Gu et al., 2020). SpanBERT (Joshi et al., 2020) shows that masking a continuous random-length span of tokens with a span boundary objective (SBO) could improve BERT's performance. Similar ideas have also been explored in ERNIE (Sun et al., 2019b,c) (where a whole entity is masked), NEZHA (Wei et al., 2019), and Whole Word Masking (Cui et al., 2019).

除了统一序列建模和构建受认知启发的架构之外,目前大多数研究集中在优化BERT的架构上,以提升语言模型在自然语言理解方面的性能。

一系列工作旨在改进掩码策略,这可以被看作一种数据增强(Gu et al., 2020)。SpanBERT (Joshi et al., 2020)表明,用跨度边界目标(SBO)掩码一段连续的随机长度的token跨度可以提升BERT的性能。ERNIE (Sun et al., 2019b,c)(其中整个实体被掩码)、NEZHA (Wei et al., 2019)和Whole Word Masking (Cui et al., 2019)也探索了类似的想法。
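
下面给出SpanBERT风格连续跨度掩码的一个示意(采样细节为假设,仅说明"按几何分布采样跨度长度、整段掩码"的思路):

```python
# 假设性的最小示意:连续跨度掩码(与逐 token 随机掩码对比)。
# 跨度长度从截断的几何分布采样,整段替换为 [MASK],可视作一种数据增强。
import random

def sample_span_length(p=0.2, max_len=10):
    # 几何分布:每一步以概率 p 停止,期望长度约 1/p,并截断到 max_len
    length = 1
    while random.random() > p and length < max_len:
        length += 1
    return length

def span_mask(tokens, mask_ratio=0.15):
    tokens = list(tokens)
    budget = max(1, int(len(tokens) * mask_ratio))   # 预计被掩掉的 token 数
    masked = set()
    while len(masked) < budget:
        length = sample_span_length()
        start = random.randrange(len(tokens))
        for i in range(start, min(start + length, len(tokens))):
            masked.add(i)
    corrupted = ["[MASK]" if i in masked else tok for i, tok in enumerate(tokens)]
    return corrupted, sorted(masked)

random.seed(0)
print(span_mask("large scale pre trained language models capture rich knowledge from data".split()))
```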

Another interesting practice is to change the masked-prediction objective to a harder one. ELECTRA (Clark et al., 2020) transforms MLM into a replaced token detection (RTD) objective, in which a generator replaces tokens in original sequences and a discriminator predicts whether a token has been replaced.

另一个有趣的实践是将掩码预测目标更改为一个更难的目标。ELECTRA (Clark et al., 2020)将MLM转换为替换token检测(RTD)目标:生成器替换原始序列中的token,判别器则预测每个token是否被替换过。
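
下面是RTD训练样本构造的一个极简示意(生成器用随机词表采样代替真实的小型MLM,仅用于说明判别器的二分类标签如何产生):

```python
# 假设性的最小示意:ELECTRA 的替换 token 检测(RTD)数据构造。
# 生成器在部分位置给出替换词,判别器对每个位置做二分类:是否被替换过。
import random

def build_rtd_example(tokens, mask_prob=0.3, generator=None):
    random.seed(0)
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            fake = generator(tok)                 # 生成器采样一个替换词
            corrupted.append(fake)
            labels.append(int(fake != tok))       # 若恰好采样回原词,标签仍为"未替换"
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

# 用固定词表的随机采样代替真实的小型 MLM 生成器,仅为演示
vocab = ["the", "a", "cat", "dog", "sat", "ran", "mat", "rug"]
toy_generator = lambda tok: random.choice(vocab)
print(build_rtd_example("the cat sat on the mat".split(), generator=toy_generator))
```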

5 Utilizing Multi-Source Data利用多源数据

In this section, we introduce some typical PTMs that take advantage of multi-source heterogeneous data, including multilingual PTMs, multimodal PTMs, and knowledge-enhanced PTMs.

在本节中,我们将介绍一些利用多源异构数据的典型PTM,包括多语言PTM多模态PTM知识增强的PTM

5.1 Multilingual Pre-Training多语言预训练

Language models trained on large-scale English corpora have achieved great success in many benchmarks. However, we live in a multilingual world, and training a large language model for each language is not an elegant solution because of the cost and the amount of data required. In fact, although people from all over the world use different languages, they can express the same meaning. This may indicate that semantics is independent of symbol systems. Additionally, some researchers found that they could get even better performance on benchmarks when training one model with several languages compared with training several monolingual models (Lample and Conneau, 2019; Huang et al., 2020b). Hence, training one model to learn multilingual representations rather than monolingual representations may be a better way.

Before BERT, some researchers have explored multilingual representations. There are mainly two ways to learn multilingual representations. One way is to learn through parameter sharing. For example, training multilingual LSTMs with several language pairs together achieves multilingual translation. Another way is to learn language-agnostic constraints, such as decoupling language representations into language-specific and language-agnostic representations utilizing the WGAN (Arjovsky et al., 2017) framework. Both of these two ways enable models to be applied to multilingual scenarios, but only for specific tasks. The model in each of them is trained with one specific task from beginning to end, and cross-lingual knowledge cannot be generalized to other tasks. Hence, for any other multilingual tasks, training new models from scratch is still required. Learning new models from scratch needs a large volume of task-specific data.

基于大规模英语语料库训练的语言模型在许多基准测试中取得了巨大的成功。然而,我们生活在一个多语言的世界,由于成本和所需数据量的原因,为每种语言训练一个大型语言模型并不是一个优雅的解决方案。事实上,尽管来自世界各地的人们使用不同的语言,但他们可以表达相同的意思,这可能表明语义是独立于符号系统的。此外,一些研究人员发现,与训练多个单语言模型相比,用多种语言训练一个模型可以在基准测试中获得更好的性能(Lample和Conneau, 2019;Huang等,2020b)。因此,训练一个模型来学习多语言表征而不是单语表征,可能是一种更好的方法。

BERT之前,一些研究人员已经探索了多语言表征。学习多语言表征的方法主要有两种:一种方法是通过参数共享来学习。例如,用几个语言对一起训练多语言LSTM,就可以实现多语言翻译。另一种方法是学习语言不可知的约束,例如使用WGAN (Ar-jovsky et al., 2017)框架将语言表示解耦为特定的语言和与语言无关的表示。这两种方法都可以将模型应用于多语言场景,但仅适用于特定的任务。它们中的模型从头到尾都是用一个特定的任务来训练的,跨语言知识不能推广到其他任务中。因此,对于任何其他多语言任务,仍然需要从头开始训练新的模型。从头开始学习新模型需要大量特定于任务的数据。

The appearance of BERT shows that the framework of pre-training with general self-supervised tasks and then fine-tuning on specific downstream tasks is feasible. This motivates researchers to design tasks to pre-train versatile multilingual models. Multilingual tasks could be divided into understanding tasks and generation tasks according to task objectives. Understanding tasks focus on sentence-level or word-level classification, and are of help for downstream classification tasks such as natural language inference (Conneau et al., 2018b). Generation tasks focus on sentence generation, and are crucial in downstream generation tasks such as machine translation.

Some understanding tasks are first used to pre-train multilingual PTMs on non-parallel multilingual corpora. For example, multilingual BERT (mBERT) released by Devlin et al. (2019) is pre-trained with the multilingual masked language modeling (MMLM) task using non-parallel multilingual Wikipedia corpora in 104 languages. The research conducted by Pires et al. (2019) shows that mBERT has the ability to generalize cross-lingual knowledge in zero-shot scenarios. This indicates that even with the same structure of BERT, using multilingual data can enable the model to learn cross-lingual representations. XLM-R (Conneau et al., 2020) builds a non-parallel multilingual dataset called CC-100, which supports 100 languages. The scale of CC-100 is much larger than the Wikipedia corpora used by mBERT, especially for those low-resource languages. XLM-R is pre-trained with MMLM as the only task on CC-100 and gets better performance on several benchmarks than mBERT, which indicates that a larger scale of multilingual corpora can bring better performance.

BERT出现表明,基于一般自监督任务的预训练框架,再对特定的下游任务进行微调可行的。这促使研究人员设计任务来训练通用的多语言模型。多语言任务根据任务目标,可以分为理解任务生成任务理解任务侧重于句子级别或单词级别的分类,并有助于下游的分类任务,如自然语言推理(Conneau等,2018b)。生成任务主要专注于句子的生成,对于机器翻译等下游生成任务至关重要。

一些理解任务首先被用于在非平行多语言语料库上预训练多语言PTMs。例如,Devlin等人(2019)发布的多语言BERT (mBERT)使用104种语言的非平行多语言维基百科语料库,通过多语言掩码语言建模(MMLM)任务进行预训练。Pires等人(2019)的研究表明,mBERT具有在零样本场景下泛化跨语言知识的能力。这表明,即使使用与BERT相同的结构,使用多语言数据也可以使模型学习到跨语言表示。XLM-R (Conneau等人,2020)构建了一个称为CC-100的非平行多语言数据集,它支持100种语言。CC-100的规模比mBERT使用的维基百科语料库大得多,特别是对于那些低资源语言。XLM-R在CC-100上以MMLM作为唯一任务进行预训练,在多个基准测试上的性能都优于mBERT,这表明更大规模的多语言语料库可以带来更好的性能。

However, the MMLM task cannot well utilize parallel corpora. In fact, parallel corpora are quite important for some NLP tasks such as machine translation. Intuitively, parallel corpora are very helpful to directly learn cross-lingual representations for those sentences in different languages with the same meanings. From this point, XLM (Lample and Conneau, 2019) leverages bilingual sentence pairs to perform the translation language modeling (TLM) task. Similar to MLM in BERT, TLM combines two semantically matched sentences into one and randomly masks tokens in both parts. Compared with MLM, TLM requires models to predict the masked tokens depending on the bilingual contexts. This encourages models to align the representations of two languages together.

Besides TLM, there are some other effective methods to learn multilingual representations from parallel corpora. Unicoder (Huang et al., 2019a) provides two novel pre-training tasks based on parallel corpora: cross-lingual word recovery (CLWR) and cross-lingual paraphrase classification (CLPC). CLWR uses target language embeddings to represent source language embeddings by leveraging attention mechanisms, and its objective is to recover the source language embeddings. This task enables models to learn word-level alignments between different languages. CLPC treats aligned sentences as positive pairs and samples misaligned sentences as negative pairs to perform sentence-level classification, letting models predict whether the input pair is aligned or not. With CLPC, models can learn sentence-level alignments between different languages. ALM (Yang et al., 2020) automatically generates code-switched sequences from parallel sentences and performs MLM on them, which forces models to make predictions based only on contexts of other languages. InfoXLM (Chi et al., 2020b) analyzes MMLM and TLM from the perspective of information theory, and encourages models to distinguish aligned sentence pairs from misaligned negative examples under the framework of contrastive learning. HICTL (Wei et al., 2021) extends the idea of using contrastive learning to learn both sentence-level and word-level cross-lingual representations. ERNIE-M (Ouyang et al., 2020) proposes back-translation masked language modeling (BTMLM), and expands the scale of parallel corpora through back-translation mechanisms. These works show that leveraging parallel corpora can bring much help towards learning cross-lingual representations.

然而,MMLM任务不能很好地利用平行语料库。事实上,平行语料库对于机器翻译等NLP任务非常重要。直观上,平行语料库对于直接学习具有相同含义的不同语言句子的跨语言表征非常有帮助。基于这一点,XLM (Lample and Conneau, 2019)利用双语句子对来执行翻译语言建模(TLM)任务。与BERT中的MLM相似,TLM将两个语义匹配的句子拼接成一个序列,并随机掩码两部分中的token。与MLM相比,TLM要求模型依赖双语上下文来预测被掩掉的token,这鼓励模型将两种语言的表示对齐。
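
下面用一个极简示意说明TLM输入的构造方式(并非XLM官方实现;语言标记、特殊符号为假设,且省略了XLM对目标句位置编码重置等细节):

```python
# 假设性的最小示意:XLM 的翻译语言建模(TLM)输入构造。
# 把一对语义对应的双语句子拼接成一个序列,并在两侧随机掩码,
# 迫使模型利用另一种语言的上下文来恢复被掩的词,从而对齐两种语言的表示。
import random

def build_tlm_example(src_tokens, tgt_tokens, mask_prob=0.15, seed=0):
    random.seed(seed)
    pair = src_tokens + ["</s>"] + tgt_tokens
    langs = ["en"] * (len(src_tokens) + 1) + ["fr"] * len(tgt_tokens)   # 语言嵌入 id
    inputs, targets = [], []
    for tok in pair:
        if tok != "</s>" and random.random() < mask_prob:
            inputs.append("[MASK]")
            targets.append(tok)          # 只在被掩位置计算损失
        else:
            inputs.append(tok)
            targets.append("-")          # "-" 表示该位置不参与损失
    return inputs, targets, langs

en = "the cat sits on the mat".split()
fr = "le chat est assis sur le tapis".split()
print(build_tlm_example(en, fr))
```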

除了TLM外,还有一些从平行语料库学习多语言表征的有效方法。Unicoder (Huang et al., 2019a)提出了两个基于平行语料库的新型预训练任务:跨语言单词恢复(CLWR)和跨语言释义分类(CLPC)。CLWR借助注意力机制,用目标语言嵌入来表示源语言嵌入,其目标是恢复源语言嵌入,这使模型能够学习不同语言之间的单词级对齐。CLPC将对齐的句子作为正例对,并采样未对齐的句子作为负例对来进行句子级分类,让模型预测输入的句对是否对齐;借助CLPC,模型可以学习不同语言之间的句子级对齐。ALM (Yang et al., 2020)自动从平行句中生成语码转换(code-switched)序列并在其上执行MLM,这迫使模型仅基于其他语言的上下文进行预测。InfoXLM (Chi et al., 2020b)从信息论的角度分析MMLM和TLM,并鼓励模型在对比学习框架下区分对齐的句子对和未对齐的负例。HICTL (Wei et al., 2021)扩展了使用对比学习的想法,同时学习句子级和单词级的跨语言表示。ERNIE-M (Ouyang et al., 2020)提出了回译掩码语言建模(BTMLM),并通过回译机制扩大平行语料库的规模。这些工作表明,利用平行语料库对学习跨语言表征有很大帮助。

Researchers have also widely explored generative models for multilingual PTMs. Normally, a generative model consists of a Transformer encoder and a Transformer decoder. For example, MASS (Song et al., 2019) extends MLM to language generation. It randomly masks a span of tokens in the input sentence and predicts the masked tokens in an autoregressive manner. Denoising autoencoding (DAE) is a typical generation task, which applies noise functions to the input sentence and then restores the original sentence with the decoder. The noise functions of DAE usually contain two operations: replacing a span of tokens with a mask token as well as permuting the order of tokens. mBART (Liu et al., 2020c) extends DAE to support multiple languages by adding special symbols. It adds a language symbol both to the end of the encoder input and the beginning of the decoder input. This enables models to know the languages to be encoded and generated.

Although DAE in mBART (Liu et al., 2020c) is trained with multiple languages, the encoding input and the decoding output are always in the same language. This leads models to capture spurious correlations between language symbols and generated sentences. In other words, models may ignore the given language symbols and directly generate sentences in the same language as the input. To address this issue, XNLG (Chi et al., 2020a) proposes the cross-lingual autoencoding (XAE) task. Different from DAE, the encoding input and the decoding output of XAE are in different languages, which is similar to machine translation. In addition, XNLG optimizes parameters in a two-stage manner. It trains the encoder with the MLM and TLM tasks in the first stage. Then, it fixes the encoder and trains the decoder with the DAE and XAE tasks in the second stage. All parameters are well pre-trained in this way, and the gap between pre-training with MLM and fine-tuning with autoregressive decoding is also filled.

研究人员也广泛探索了多语言PTMs的生成式模型。通常,生成式模型由一个Transformer编码器和一个Transformer解码器组成。例如,MASS (Song et al., 2019)将MLM扩展到语言生成:它在输入句中随机掩码一段token跨度,并以自回归的方式预测被掩掉的token。去噪自编码(DAE)是一种典型的生成任务,它对输入句子施加噪声函数,然后用解码器恢复原始句子。DAE的噪声函数通常包含两种操作:用一个掩码token替换一段token跨度,以及打乱token的顺序。mBART (Liu et al., 2020c)通过添加特殊符号扩展了DAE以支持多种语言:它在编码器输入的末尾和解码器输入的开头都添加了语言符号,这使模型能够知道要编码和生成的语言。
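
下面是mBART式去噪输入构造的一个极简示意(噪声超参数与符号为假设,真实实现在子词级别操作并按特定分布采样跨度长度):

```python
# 假设性的最小示意:mBART 的去噪自编码(DAE)输入构造。
# 噪声包含两种操作:把一段 token 替换为单个 [MASK],以及打乱句子顺序;
# 编码端序列末尾加语言符号,解码端以语言符号开头,告知要生成哪种语言。
import random

def mbart_noise(sentences, lang="<en_XX>", seed=0):
    random.seed(seed)
    # 1) 打乱句子顺序(sentence permutation)
    shuffled = sentences[:]
    random.shuffle(shuffled)
    # 2) 随机选一个句子,把其中一段连续 token 替换成一个 [MASK](span masking)
    idx = random.randrange(len(shuffled))
    toks = shuffled[idx].split()
    s = random.randrange(len(toks))
    e = min(len(toks), s + random.randint(1, 3))
    shuffled[idx] = " ".join(toks[:s] + ["[MASK]"] + toks[e:])
    encoder_input = " ".join(shuffled) + f" </s> {lang}"
    decoder_input = f"{lang} " + " ".join(sentences)    # 目标:恢复原始顺序与内容
    return encoder_input, decoder_input

doc = ["I ate an apple .", "Then I went to school .", "It was sunny ."]
print(mbart_noise(doc))
```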

虽然mBART中的DAE (Liu et al., 2020c)使用多种语言进行训练,但编码输入和解码输出总是使用同一种语言。这导致模型捕捉到语言符号和生成句子之间的虚假相关性。换句话说,模型可能会忽略给定的语言符号,直接以与输入相同的语言生成句子。为了解决这个问题,XNLG (Chi et al., 2020a)提出了跨语言自编码(XAE)任务。与DAE不同的是,XAE的编码输入和解码输出是不同的语言,这与机器翻译类似。此外,XNLG以两阶段的方式优化参数:第一阶段用MLM和TLM任务训练编码器;第二阶段固定编码器,用DAE和XAE任务训练解码器。通过这种方式,所有参数都得到了很好的预训练,也弥合了MLM预训练与自回归解码微调之间的差距。

5.2 Multimodal Pre-Training多模态预训练

Large-scale pre-training and its downstream applications have cascaded impactful research and development with diverse real-world modalities. We see objects, hear sounds and speak languages. Modalities, such as audio, video, image and text, refer to how something happens or is experienced. Tasks involving multiple modalities are developing at a fast pace. More recently, large-scale PTMs have enhanced research interests in the intersection of multiple modalities, such as the intersection of image and text, or the intersection of video and text. Specifically, these modalities can all be classified as vision and language (V&L), considering that images and videos belong to vision while text and speech (audio) belong to language. V&L tasks can be further divided into image-text-based tasks, video-text-based tasks, and video-audio-based tasks according to the specific modalities being used.

We now present a detailed overview of the previous trends in pre-training on V&L modalities. First, for image-text-based PTMs, most current solutions adopt visual-linguistic BERT. The main difficulty lies in integrating non-text information into the framework of BERT. ViLBERT (Lu et al., 2019) is a model to learn task-agnostic joint representations of images and languages. It extends the BERT architecture to a multimodal model that supports two streams of input, by preprocessing textual and visual information separately. After two encoders, it uses Transformer layers to obtain united attention results for both textual and visual information. ViLBERT first provides a new perspective for learning the relationship between vision and language, which is no longer limited to learning a specific task but takes the relationship between vision and language as a pre-trainable and transferable ability of models. It uses three pre-training tasks: MLM, sentence-image alignment (SIA) and masked region classification (MRC). It is evaluated on five downstream tasks: visual question answering (VQA), visual commonsense reasoning (VCR), grounding referring expressions (GRE), image-text retrieval (ITIR) and zero-shot image-text retrieval (ZSIR). LXMERT (Tan and Bansal, 2019) has a similar architecture to ViLBERT but uses more pre-training tasks: MLM, SIA, MRC, masked region feature regression (MRFR) and VQA. LXMERT is tested on three downstream tasks: VQA, graph question answering (GQA) and natural language for visual reasoning (NLVR2).

大规模预训练及其下游应用已经将有影响力的研究和开发与各种现实世界的模态串联起来。我们看到物体,听到声音,说出语言。模态,如音频、视频、图像和文本,指的是某件事如何发生或被体验。涉及多种模态的任务正在快速发展。最近,大规模PTMs增强了人们对多种模态交叉的研究兴趣,如图像和文本的交叉,或视频和文本的交叉。具体来说,这些模态都可以归入视觉和语言(V&L)的范畴:图像和视频属于视觉,而文本和语音(音频)属于语言。根据所使用的具体模态,V&L任务又可分为基于图像-文本的任务、基于视频-文本的任务和基于视频-音频的任务。

现在,我们将详细概述V&L模态预训练此前的发展趋势。首先,对于基于图像-文本的PTMs,当前大多数解决方案是采用视觉-语言BERT,主要的困难在于将非文本信息整合到BERT框架中。ViLBERT (Lu et al., 2019)是一个学习与任务无关的图像和语言联合表示的模型。它通过分别对文本和视觉信息进行预处理,将BERT架构扩展为支持两路输入流的多模态模型;经过两个编码器后,再使用Transformer层对文本和视觉信息获得统一的注意力结果。ViLBERT首先为学习视觉与语言之间的关系提供了一种新的思路:不再局限于学习特定的任务,而是将视觉与语言之间的关系看作模型的一种可预训练、可迁移的能力。它使用三个预训练任务:MLM、句子-图像对齐(SIA)和掩码区域分类(MRC),并在5个下游任务上进行评估:视觉问答(VQA)、视觉常识推理(VCR)、指代表达定位(GRE)、图像-文本检索(ITIR)和零样本图像-文本检索(ZSIR)。LXMERT (Tan和Bansal, 2019)的架构与ViLBERT类似,但使用了更多的预训练任务:MLM、SIA、MRC、掩码区域特征回归(MRFR)和VQA。LXMERT在三个下游任务上进行测试:VQA、图问答(GQA)和用于视觉推理的自然语言(NLVR2)。

VisualBERT (Li et al., 2019), on the other hand, extends the BERT architecture minimally. It can be regarded as a simple and effective baseline for V&L pre-training. The Transformer layers of VisualBERT implicitly align elements in the input text and image regions. It uses two pre-training tasks: MLM and IA, and is tested on four downstream tasks: VQA, VCR, NLVR2, and ITIR. Unicoder-VL (Li et al., 2020a) moves the offsite visual detector in VisualBERT into an end-to-end version. It designs the image token for Transformers as the sum of the bounding box and object label features. It uses MLM, SIA and masked object classification (MOC) as its pre-training tasks, as well as uses IR, ZSIR and VCR as its downstream tasks. VL-BERT (Su et al., 2020) also uses a similar architecture to VisualBERT. For VL-BERT, each input element is either a token from the input sentence or a region-of-interest (RoI) from the input image. It uses MLM and MOC as the pre-training tasks and finds that adding SIA will decrease model performance. It is evaluated on three downstream tasks: VQA, VCR and GRE.

Some multimodal PTMs are designed to solve specific tasks such as VQA. B2T2 (Alberti et al., 2019) is the model that mainly focuses on VQA. It designs a model for early fusion of the co-reference between textual tokens and visual object features, and then uses MLM and SIA as the pre-training tasks. VLP (Zhou et al., 2020a) focuses on VQA and image captioning. It uses a shared multi-layer Transformer for both encoding and decoding, different from many existing methods whose encoder and decoder are implemented using separate models. It is pre-trained on bidirectional masked language prediction (BMLP) and sequence to sequence masked language prediction (s2sMLP). Furthermore, UNITER (Chen et al., 2020e) learns unified representations between the two modalities. UNITER tries many pre-training tasks, such as MLM, SIA, MRC and MRFR. UNITER is also tested on various downstream tasks: VQA, IR, VCR, NLVR2, referring expression comprehension (REC), and visual entailment (VE).

另一方面,VisualBERT (Li et al., 2019)对BERT架构的扩展最小,它可以看作是一个简单而有效的V&L预训练基线。VisualBERT的Transformer层隐式地对齐输入文本中的元素和图像区域。它使用两个预训练任务:MLM和IA,并在四个下游任务上测试:VQA、VCR、NLVR2和ITIR。Unicoder-VL (Li et al., 2020a)将VisualBERT中的外部视觉检测器改造为端到端的版本。它将送入Transformer的图像token设计为包围盒特征与对象标签特征之和。它以MLM、SIA和掩码对象分类(MOC)作为预训练任务,并使用IR、ZSIR和VCR作为其下游任务。VL-BERT (Su等人,2020)也使用了与VisualBERT类似的架构。对于VL-BERT,每个输入元素要么是来自输入句子的token,要么是来自输入图像的感兴趣区域(RoI)。它以MLM和MOC作为预训练任务,并发现添加SIA会降低模型性能。它在三个下游任务上进行评估:VQA、VCR和GRE。

一些多模态PTMs被设计用来解决特定的任务,比如VQA。B2T2 (Alberti et al., 2019)是主要关注VQA的模型,它设计了一个对文本token与视觉对象特征之间的共指关系进行早期融合的模型,并以MLM和SIA作为预训练任务。VLP (Zhou et al., 2020a)专注于VQA和图像描述生成。它使用一个共享的多层Transformer同时进行编码和解码,这与许多现有方法不同,后者的编码器和解码器是用单独的模型实现的。它在双向掩码语言预测(BMLP)和序列到序列掩码语言预测(s2sMLP)上进行了预训练。此外,UNITER (Chen等人,2020e)学习了两种模态之间的统一表示。UNITER尝试了许多预训练任务,如MLM、SIA、MRC和MRFR,并在各种下游任务上进行了测试:VQA、IR、VCR、NLVR2、指代表达理解(REC)和视觉蕴涵(VE)。

ImageBERT (Qi et al., 2020) is the same as Unicoder-VL. It designs a novel weakly supervised approach to collect large-scale image-text data from the web, whose volume and quality are essential to V&L pre-training tasks. The collecting steps include web-page collection, image filtering, sentence detection, sentence cleaning, image-text semantic scoring, and image-text aggregation. The resulting dataset contains ten million images and their descriptions with an average length of 13 words, which benefits the pre-training of multimodal PTMs. The pre-training tasks include MLM, SIA, MOC and MRFR, while only being tested on one downstream task: ITIR. Lu et al. (2020) investigate relationships between nearly all V&L tasks by developing a large-scale, multi-task training regime. It classifies the common tasks into four groups: VQA, caption-based image retrieval, grounding referring expressions, and multimodal verification. It adopts two pre-training tasks, masked multimodal modeling only for aligned image-caption pairs and masking overlapped image regions, while performing well on five downstream tasks: VQA, GQA, IR, RE and NLVR2.

X-GPT (Xia et al., 2020) finds that while previous BERT-based multimodal PTMs produce excellent results on downstream understanding tasks, they cannot be applied to generation tasks directly. It is then proposed to pre-train text-to-image caption generators through three novel generation tasks, including image-conditioned masked language modeling (IMLM), image-conditioned denoising autoencoding (IDA), and text-conditioned image feature generation (TIFG). For downstream tasks, it focuses only on image captioning (IC). Oscar (Li et al., 2020e) uses object tags detected in images as anchor points to ease the learning of alignments significantly. It is motivated by the observation that the salient objects in an image can be accurately detected and are often mentioned in the paired text. It performs well on six downstream tasks: ITIR, IC, novel object captioning (NOC), VQA, GCQ and NLVR2.

ImageBERT (Qi et al., 2020)与Unicoder-VL相同。它设计了一种新颖的弱监督方法从网络上收集大规模的图像-文本数据,这些数据的数量和质量对V&L预训练任务至关重要。收集步骤包括网页收集、图像过滤、句子检测、句子清洗、图像-文本语义评分和图像-文本聚合。得到的数据集包含1000万幅图像及其描述,描述的平均长度为13个单词,这对多模态PTMs的预训练很有帮助。其预训练任务包括MLM、SIA、MOC和MRFR,但只在一个下游任务(ITIR)上进行了测试。Lu等人(2020)通过开发大规模的多任务训练机制,研究了几乎所有V&L任务之间的关系。它将常见任务分为四组:VQA、基于描述的图像检索、指代表达定位和多模态验证。它采用两种预训练任务:仅针对对齐的图像-描述对进行掩码多模态建模,以及对重叠的图像区域进行掩码,并在5个下游任务上表现良好:VQA、GQA、IR、RE和NLVR2。

X-GPT (Xia et al., 2020)发现,尽管之前基于BERT的多模态PTMs在下游理解任务上产生了很好的结果,但它们不能直接应用于生成任务。因此,它提出通过三种新的生成任务对文本到图像的描述生成器进行预训练,包括图像条件掩码语言建模(IMLM)、图像条件去噪自编码(IDA)和文本条件图像特征生成(TIFG)。对于下游任务,它只关注图像描述生成(IC)。Oscar (Li et al., 2020e)使用图像中检测到的对象标签作为锚点,以显著简化对齐的学习。它的动机是:图像中的显著物体可以被准确地检测到,并且经常在配对的文本中被提及。它在ITIR、IC、新对象描述生成(NOC)、VQA、GCQ和NLVR2这6个下游任务上表现良好。

A bigger step towards conditional zero-shot image generation is taken by DALLE (Ramesh et al., 2021) from OpenAI and CogView (Ding et al., 2021) from Tsinghua and BAAI. DALLE is the very first transformer-based text-to-image zero-shot pre-trained model with around 10 billion parameters. It shows the potential of multi-modal pre-trained models to bridge the gap between text descriptions and image generation, especially the excellent ability in combining different objects, such as “an armchair in the shape of an avocado”. CogView improves the numerical precision and training stability by introducing the sandwich transformer and sparse attention mechanism, and thus surpasses DALLE in FID. It is also the first text-to-image model in Chinese.

Recently, CLIP (Radford et al., 2021) and WenLan (Huo et al., 2021) explore enlarging web-scale data for V&L pre-training with big success. Compared to previous works, they face a large-scale distributed pre-training challenge. We will introduce how to handle the large-scale distributed pre-training challenge in the next section.

OpenAI的DALLE (Ramesh et al., 2021)以及清华大学和BAAI的CogView (Ding et al., 2021)向着条件式零样本图像生成迈出了更大的一步。DALLE是第一个基于Transformer的文本到图像零样本预训练模型,具有大约100亿个参数。它展示了多模态预训练模型在弥合文本描述和图像生成之间差距方面的潜力,特别是在组合不同对象方面的出色能力,例如"一个鳄梨形状的扶手椅"。CogView通过引入三明治Transformer和稀疏注意力机制,提高了数值精度和训练稳定性,从而在FID指标上超过了DALLE。它也是第一个中文的文本到图像模型。

最近,CLIP (Radford et al., 2021)和WenLan (Huo et al., 2021)探索了扩大网络规模的V&L预训练数据,并取得了巨大成功。与以往的工作相比,它们面临着大规模分布式预训练的挑战。我们将在下一节介绍如何应对大规模分布式预训练的挑战。

5.3 Knowledge-Enhanced Pre-Training增强知识的预训练

 Figure 9: An illustration of ZeRO-Offload and ZeRO-Offload with delayed parameter update

图9:ZeRO-Offload以及带延迟参数更新的ZeRO-Offload示意图

PTMs can extract plenty of statistical information from large amounts of data. Besides, external knowledge, such as knowledge graphs, domain-specific data and extra annotations of pre-training data, is the outcome of human wisdom which can be a good prior to the modeling of statistics. In this subsection, we classify external knowledge according to the knowledge format and introduce several methods attempting to combine knowledge with PTMs.

The typical form of structured knowledge is knowledge graphs. Many works try to enhance PTMs by integrating entity and relation embeddings (Zhang et al., 2019b; Liu et al., 2020a; Peters et al., 2019; Sun et al., 2020; Rosset et al., 2020; Qin et al., 2021) or their alignments with the text (Xiong et al., 2019; Sun et al., 2019b). However, real-world knowledge graphs like Wikidata contain more information than entities and relations. Wang et al. (2021) pre-train models based on the descriptions of Wikidata entities, by incorporating a language model loss and a knowledge embedding loss together to get knowledge-enhanced representations. Some works regard the paths and even sub-graphs in knowledge graphs as a whole, and directly model them and the aligned text to retain more structural information. Since aligning entities and relations to raw text is often troublesome and can introduce noise in data pre-processing, another line of works (Bosselut et al., 2019; Guan et al., 2020; Chen et al., 2020d) can directly convert structural knowledge into the serialized text and let models learn knowledge-text alignments by themselves. An interesting attempt is OAG-BERT (Liu et al., 2021a), which integrates heterogeneous structural knowledge in the open academic graph (OAG) (Zhang et al., 2019a), which covers 0.7 billion heterogeneous entities and 2 billion relations.

PTMs可以从大量的数据中提取丰富的统计信息。此外,外部知识,如知识图谱、特定领域的数据和对预训练数据的额外标注等,都是人类智慧的产物,可以作为统计建模的良好先验。在这个小节中,我们将根据知识的格式对外部知识进行分类,并介绍几种尝试将知识与PTMs相结合的方法。

结构化知识的典型形式是知识图谱。许多工作试图通过整合实体和关系嵌入(Zhang et al., 2019b; Liu et al., 2020a; Peters et al., 2019; Sun et al., 2020; Rosset et al., 2020; Qin et al., 2021)或它们与文本的对齐(Xiong et al., 2019; Sun et al., 2019b)来增强PTMs。然而,像Wikidata这样的现实世界知识图谱包含的不只是实体和关系。Wang等人(2021)基于Wikidata实体的描述来预训练模型,通过将语言模型损失和知识嵌入损失结合在一起,得到知识增强的表示。有些工作将知识图谱中的路径甚至子图视为一个整体,直接对它们和对齐的文本进行建模,以保留更多的结构信息。由于将实体和关系与原始文本对齐通常很麻烦,并且可能在数据预处理中引入噪声,另一类工作(Bosselut等人,2019;Guan等人,2020;Chen et al., 2020d)直接将结构化知识转换为序列化的文本,让模型自行学习知识与文本的对齐。OAG-BERT (Liu et al., 2021a)是一个有趣的尝试,它整合了开放学术图(OAG) (Zhang et al., 2019a)中的异构结构知识,该图覆盖了7亿个异构实体和20亿条关系。
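
以"语言模型损失+知识嵌入损失"的联合训练目标为例,下面给出一个示意草图(基于PyTorch;编码器用线性层占位,损失组合方式是假设性的简化,并非Wang等人(2021)的原始实现):

```python
# 假设性的最小示意:联合语言模型损失与知识嵌入损失的知识增强预训练目标。
# 用同一个编码器把实体的文本描述编码为实体向量,再用 TransE 风格的得分函数
# 约束 (头实体 + 关系 ≈ 尾实体),并与掩码语言模型损失加权求和。
import torch

def transe_loss(h, r, t, t_neg, margin=1.0):
    # TransE 得分:希望 (h + r) 接近正例尾实体 t,远离负例 t_neg
    pos = torch.norm(h + r - t, p=2)
    neg = torch.norm(h + r - t_neg, p=2)
    return torch.clamp(margin + pos - neg, min=0.0)

def joint_loss(mlm_loss, h_desc, t_desc, t_neg_desc, rel_emb, encoder, alpha=1.0):
    # encoder 把实体描述映射为实体向量(此处用线性层占位,真实模型应为 PTM 的句子表示)
    h, t, t_neg = encoder(h_desc), encoder(t_desc), encoder(t_neg_desc)
    ke_loss = transe_loss(h, rel_emb, t, t_neg)
    return mlm_loss + alpha * ke_loss          # 语言模型损失 + 知识嵌入损失

d = 16
encoder = torch.nn.Linear(32, d)
rel_emb = torch.randn(d, requires_grad=True)
h_desc, t_desc, t_neg_desc = (torch.randn(32) for _ in range(3))
mlm_loss = torch.tensor(2.3)               # 假设已由掩码语言模型部分计算得到
print(joint_loss(mlm_loss, h_desc, t_desc, t_neg_desc, rel_emb, encoder))
```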

Compared to structured knowledge, unstructured knowledge is more intact but also noisier. How to effectively model this kind of knowledge from the data is also worth being explored. The data of a specific domain or task can be considered as a kind of unstructured knowledge. Many works (Beltagy et al., 2019; Lee et al., 2020) further pre-train the general PTMs on this data to get better domain-specific or task-specific models. Since there are some domain-specific and task-specific human annotations, Ke et al. (2020) incorporate these extra annotations to get better domain-specific and task-specific language representations. For all the above-mentioned works, knowledge is implicitly stored in their model parameters. To model external knowledge in a more interpretable way, some works (Lewis et al., 2020b; Guu et al., 2020) design retrieval-based methods to use structured knowledge on downstream tasks. Another kind of works (Wang et al., 2020b) can use adapters trained on different knowledge sources with extra annotations to distinguish where the knowledge is from.

与结构化知识相比,非结构化知识更完整,但也更嘈杂。如何从数据中有效地对这类知识建模也值得探索。特定领域或任务的数据可以看作是一种非结构化知识。许多工作(Beltagy等人,2019;Lee等人,2020)在这些数据上进一步预训练通用的PTMs,以获得更好的领域特定或任务特定模型。由于存在一些领域特定和任务特定的人工标注,Ke等人(2020)引入这些额外的标注,以获得更好的领域特定和任务特定的语言表示。对于上述所有工作,知识都是隐式地存储在模型参数中的。为了以一种更易于解释的方式对外部知识建模,一些研究(Lewis等人,2020b;Guu等人,2020)设计了基于检索的方法,在下游任务上使用结构化知识。另一类工作(Wang et al., 2020b)则使用在带有额外标注的不同知识源上训练的适配器,来区分知识的来源。

