Abstract
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
1 Introduction
Recent years have featured a trend towards pre-trained language representations in NLP systems, applied in increasingly flexible and task-agnostic ways for downstream transfer. First, single-layer representations were learned using word vectors [MCCD13, PSM14] and fed to task-specific architectures, then RNNs with multiple layers of representations and contextual state were used to form stronger representations [DL15, MBXS17, PNZtY18] (though still applied to task-specific architectures), and more recently pre-trained recurrent or transformer language models [VSP+17] have been directly fine-tuned, entirely removing the need for task-specific architectures [RNSS18, DCLT18, HR18].
This last paradigm has led to substantial progress on many challenging NLP tasks such as reading comprehension, question answering, textual entailment, and many others, and has continued to advance based on new architectures and algorithms [RSR+19, LOG+19, YDY+19, LCG+19]. However, a major limitation to this approach is that while the architecture is task-agnostic, there is still a need for task-specific datasets and task-specific fine-tuning: to achieve strong performance on a desired task typically requires fine-tuning on a dataset of thousands to hundreds of thousands of examples specific to that task. Removing this limitation would be desirable, for several reasons.
First, from a practical perspective, the need for a large dataset of labeled examples for every new task limits the applicability of language models. There exists a very wide range of possible useful language tasks, encompassing anything from correcting grammar, to generating examples of an abstract concept, to critiquing a short story. For many of these tasks it is difficult to collect a large supervised training dataset, especially when the process must be repeated for every new task.
Second, the potential to exploit spurious correlations in training data fundamentally grows with the expressiveness of the model and the narrowness of the training distribution. This can create problems for the pre-training plus fine-tuning paradigm, where models are designed to be large to absorb information during pre-training, but are then fine-tuned on very narrow task distributions. For instance [HLW+20] observe that larger models do not necessarily generalize better out-of-distribution. There is evidence that suggests that the generalization achieved under this paradigm can be poor because the model is overly specific to the training distribution and does not generalize well outside it [YdC+19, MPL19]. Thus, the performance of fine-tuned models on specific benchmarks, even when it is nominally at human-level, may exaggerate actual performance on the underlying task [GSL+18, NK19].
Third, humans do not require large supervised datasets to learn most language tasks – a brief directive in natural language (e.g. “please tell me if this sentence describes something happy or something sad”) or at most a tiny number of demonstrations (e.g. “here are two examples of people acting brave; please give a third example of bravery”) is often sufficient to enable a human to perform a new task to at least a reasonable degree of competence. Aside from pointing to a conceptual limitation in our current NLP techniques, this adaptability has practical advantages – it allows humans to seamlessly mix together or switch between many tasks and skills, for example performing addition during a lengthy dialogue. To be broadly useful, we would someday like our NLP systems to have this same fluidity and generality.
One potential route towards addressing these issues is meta-learning – which in the context of language models means the model develops a broad set of skills and pattern recognition abilities at training time, and then uses those abilities at inference time to rapidly adapt to or recognize the desired task (illustrated in Figure 1.1). Recent work [RWC+19] attempts to do this via what we call “in-context learning”, using the text input of a pretrained language model as a form of task specification: the model is conditioned on a natural language instruction and/or a few demonstrations of the task and is then expected to complete further instances of the task simply by predicting what comes next.
While it has shown some initial promise, this approach still achieves results far inferior to fine-tuning – for example [RWC+19] achieves only 4% on Natural Questions, and even its 55 F1 CoQA result is now more than 35 points behind the state of the art. Meta-learning clearly requires substantial improvement in order to be viable as a practical method of solving language tasks.
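To make the in-context format concrete, the sketch below assembles the kind of prompt described above: a natural-language instruction, a handful of demonstrations, and a final query that the model is expected to complete by simply predicting the next tokens. The helper name and the exact prompt wording are our own illustrative choices rather than anything prescribed by [RWC+19].

```python
# A minimal sketch of in-context task specification. The "supervision" is
# plain text placed in the model's context window; no gradients are computed.
# The instruction wording and the build_prompt helper are illustrative.

def build_prompt(instruction, demonstrations, query):
    """Compose an instruction, a few demonstrations, and one unanswered query."""
    lines = [instruction]
    for source, target in demonstrations:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model continues the text from here
    return "\n".join(lines)

prompt = build_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
)
# `prompt` is fed to the pretrained model as ordinary text; whatever completion
# it generates is taken as its answer to the final query.
```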
Another recent trend in language modeling may offer a way forward. In recent years the capacity of transformer language models has increased substantially, from 100 million parameters [RNSS18], to 300 million parameters [DCLT18], to 1.5 billion parameters [RWC+19], to 8 billion parameters [SPP+19], 11 billion parameters [RSR+19], and finally 17 billion parameters [Tur20]. Each increase has brought improvements in text synthesis and/or downstream NLP tasks, and there is evidence suggesting that log loss, which correlates well with many downstream tasks, follows a smooth trend of improvement with scale [KMH+20]. Since in-context learning involves absorbing many skills and tasks within the parameters of the model, it is plausible that in-context learning abilities might show similarly strong gains with scale.
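Concretely, the smooth trend reported in [KMH+20] takes the form of a power law in the non-embedding parameter count N, written schematically as

$$ L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, $$

where the constant N_c and the exponent \alpha_N are fit empirically (we omit the fitted values here); each multiplicative increase in N therefore buys a roughly constant reduction in cross-entropy loss.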
In this paper, we test this hypothesis by training a 175 billion parameter autoregressive language model, which we call GPT-3, and measuring its in-context learning abilities. Specifically, we evaluate GPT-3 on over two dozen NLP datasets, as well as several novel tasks designed to test rapid adaptation to tasks unlikely to be directly contained in the training set. For each task, we evaluate GPT-3 under 3 conditions: (a) “few-shot learning”, or in-context learning where we allow as many demonstrations as will fit into the model’s context window (typically 10 to 100), (b) “one-shot learning”, where we allow only one demonstration, and (c) “zero-shot” learning, where no demonstrations are allowed and only an instruction in natural language is given to the model. GPT-3 could also in principle be evaluated in the traditional fine-tuning setting, but we leave this to future work.
Figure 1.2 illustrates the conditions we study, and shows few-shot learning of a simple task requiring the model to remove extraneous symbols from a word. Model performance improves with the addition of a natural language task description, and with the number of examples in the model’s context, K. Few-shot learning also improves dramatically with model size. Though the results in this case are particularly striking, the general trends with both model size and number of examples in-context hold for most tasks we study. We emphasize that these “learning” curves involve no gradient updates or fine-tuning, just increasing numbers of demonstrations given as conditioning.
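The sketch below spells out what the three conditions look like as raw text for the symbol-removal task; the task description, the separator, and the scrambled words are illustrative stand-ins rather than the exact strings used in Figure 1.2.

```python
# Zero-, one-, and few-shot prompts for the symbol-removal task, written out
# as raw text. Wording, separators, and example words are illustrative; only
# the structure (task description + K demonstrations + query) follows the paper.

description = "Remove the extra symbols from the word."
demos = [("s.u!c/c!e.s s i/o/n", "succession"),
         ("c;o-m!p,u/t;e*r", "computer")]
query = "l-a.n!g,u/a?g;e"

zero_shot = f"{description}\n{query} ="
one_shot  = f"{description}\n{demos[0][0]} = {demos[0][1]}\n{query} ="
few_shot  = description + "\n" + \
            "\n".join(f"{x} = {y}" for x, y in demos) + f"\n{query} ="
# In all three cases the model is only asked to continue the text; the
# "learning curves" come from varying K, never from gradient updates.
```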
Broadly, on NLP tasks GPT-3 achieves promising results in the zero-shot and one-shot settings, and in the few-shot setting is sometimes competitive with or even occasionally surpasses state-of-the-art (despite state-of-the-art being held by fine-tuned models). For example, GPT-3 achieves 81.5 F1 on CoQA in the zero-shot setting, 84.0 F1 on CoQA in the one-shot setting, and 85.0 F1 in the few-shot setting. Similarly, GPT-3 achieves 64.3% accuracy on TriviaQA in the zero-shot setting, 68.0% in the one-shot setting, and 71.2% in the few-shot setting, the last of which is state-of-the-art relative to fine-tuned models operating in the same closed-book setting.
GPT-3 also displays one-shot and few-shot proficiency at tasks designed to test rapid adaptation or on-the-fly reasoning, which include unscrambling words, performing arithmetic, and using novel words in a sentence after seeing them defined only once. We also show that in the few-shot setting, GPT-3 can generate synthetic news articles which human evaluators have difficulty distinguishing from human-generated articles.
At the same time, we also find some tasks on which few-shot performance struggles, even at the scale of GPT-3. This includes natural language inference tasks like the ANLI dataset, and some reading comprehension datasets like RACE or QuAC. By presenting a broad characterization of GPT-3’s strengths and weaknesses, including these limitations, we hope to stimulate study of few-shot learning in language models and draw attention to where progress is most needed.
A heuristic sense of the overall results can be seen in Figure 1.3, which aggregates the various tasks (though it should not be seen as a rigorous or meaningful benchmark in itself).
We also undertake a systematic study of “data contamination” – a growing problem when training high capacity models on datasets such as Common Crawl, which can potentially include content from test datasets simply because such content often exists on the web. In this paper we develop systematic tools to measure data contamination and quantify its distorting effects. Although we find that data contamination has a minimal effect on GPT-3’s performance on most datasets, we do identify a few datasets where it could be inflating results, and we either do not report results on these datasets or we note them with an asterisk, depending on the severity.
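As a rough illustration of what such a contamination check involves, the sketch below flags test examples whose text shares a long n-gram with the training corpus; the n-gram length, the normalization, and the matching rule here are simplifying assumptions of ours, not the exact procedure described in Section 4.

```python
# A simplified sketch of a train-test overlap ("data contamination") check:
# flag any test example that shares a long n-gram with the training corpus.
# The n-gram length and the crude normalization below are assumptions.

import re

def ngrams(text, n=13):
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # lowercase, strip punctuation
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(test_examples, train_ngrams, n=13):
    """Return the test examples whose text overlaps the training n-gram set."""
    return [ex for ex in test_examples if ngrams(ex, n) & train_ngrams]

# train_ngrams would be built once by streaming the (much larger) training
# corpus through ngrams(); flagged examples are then removed or reported
# separately, depending on the severity of the overlap.
```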
In addition to all the above, we also train a series of smaller models (ranging from 125 million parameters to 13 billion parameters) in order to compare their performance to GPT-3 in the zero, one and few-shot settings. Broadly, for most tasks we find relatively smooth scaling with model capacity in all three settings; one notable pattern is that the gap between zero-, one-, and few-shot performance often grows with model capacity, perhaps suggesting that larger models are more proficient meta-learners.
Finally, given the broad spectrum of capabilities displayed by GPT-3, we discuss concerns about bias, fairness, and broader societal impacts, and attempt a preliminary analysis of GPT-3’s characteristics in this regard.
The remainder of this paper is organized as follows. In Section 2, we describe our approach and methods for training GPT-3 and evaluating it. Section 3 presents results on the full range of tasks in the zero-, one- and few-shot settings. Section 4 addresses questions of data contamination (train-test overlap). Section 5 discusses limitations of GPT-3. Section 6 discusses broader impacts. Section 7 reviews related work and Section 8 concludes.
2 Approach
Our basic pre-training approach, including model, data, and training, is similar to the process described in [RWC+19], with relatively straightforward scaling up of the model size, dataset size and diversity, and length of training. Our use of in-context learning is also similar to [RWC+19], but in this work we systematically explore different settings for learning within the context. Therefore, we start this section by explicitly defining and contrasting the different settings that we will be evaluating GPT-3 on or could in principle evaluate GPT-3 on. These settings can be seen as lying on a spectrum of how much task-specific data they tend to rely on. Specifically, we can identify at least four points on this spectrum (see Figure 2.1 for an illustration):
Fine-Tuning (FT) has been the most common approach in recent years, and involves updating the weights of a pre-trained model by training on a supervised dataset specific to the desired task. Typically thousands to hundreds of thousands of labeled examples are used. The main advantage of fine-tuning is strong performance on many benchmarks. The main disadvantages are the need for a new large dataset for every task, the potential for poor generalization out-of-distribution [MPL19], and the potential to exploit spurious features of the training data [GSL+18, NK19], potentially resulting in an unfair comparison with human performance. In this work we do not fine-tune GPT-3 because our focus is on task-agnostic performance, but GPT-3 can be fine-tuned in principle and this is a promising direction for future work.
Few-Shot (FS) is the term we will use in this work to refer to the setting where the model is given a few demonstrations of the task at inference time as conditioning [RWC+19], but no weight updates are allowed. As shown in Figure 2.1, for a typical dataset an example has a context and a desired completion (for example an English sentence and the French translation), and few-shot works by giving K examples of context and completion, and then one final example of context, with the model expected to provide the completion. We typically set K in the range of 10 to 100 as this is how many examples can fit in the model’s context window (n_ctx = 2048); a simple packing heuristic along these lines is sketched after these definitions. The main advantages of few-shot are a major reduction in the need for task-specific data and reduced potential to learn an overly narrow distribution from a large but narrow fine-tuning dataset. The main disadvantage is that results from this method have so far been much worse than state-of-the-art fine-tuned models. Also, a small amount of task-specific data is still required. As indicated by the name, few-shot learning as described here for language models is related to few-shot learning as used in other contexts in ML [HYC01, VBL+16] – both involve learning based on a broad distribution of tasks (in this case implicit in the pre-training data) and then rapidly adapting to a new task.
One-Shot (1S) is the same as few-shot except that only one demonstration is allowed, in addition to a natural language description of the task, as shown in Figure 2.1. The reason to distinguish one-shot from few-shot and zero-shot (below) is that it most closely matches the way in which some tasks are communicated to humans. For example, when asking humans to generate a dataset on a human worker service (for example Mechanical Turk), it is common to give one demonstration of the task. By contrast it is sometimes difficult to communicate the content or format of a task if no examples are given.
Zero-Shot (0S) is the same as one-shot except that no demonstrations are allowed, and the model is only given a natural language instruction describing the task. This method provides maximum convenience, potential for robustness, and avoidance of spurious correlations (unless they occur very broadly across the large corpus of pre-training data), but is also the most challenging setting. In some cases it may even be difficult for humans to understand the format of the task without prior examples, so this setting is in some cases “unfairly hard”. For example, if someone is asked to “make a table of world records for the 200m dash”, this request can be ambiguous, as it may not be clear exactly what format the table should have or what should be included (and even with careful clarification, understanding precisely what is desired can be difficult). Nevertheless, for at least some settings zero-shot is closest to how humans perform tasks – for example, in the translation example in Figure 2.1, a human would likely know what to do from just the text instruction.
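Since the few-shot setting above bounds K only by what fits in the model’s context window (n_ctx = 2048 tokens), a simple way to choose the demonstrations in practice is to pack them until the token budget is exhausted, as in the sketch below; splitting on whitespace is a crude stand-in for the model’s actual BPE tokenizer, and the reserved margin is an arbitrary choice of ours.

```python
# Packing K few-shot demonstrations into a fixed context window.
# Splitting on whitespace is a stand-in for the real BPE tokenizer; with the
# actual tokenizer the budget would be n_ctx = 2048 BPE tokens.

def count_tokens(text):
    return len(text.split())  # crude proxy for the BPE token count

def pack_prompt(description, demonstrations, query, n_ctx=2048):
    """Add demonstrations until the next one would overflow the context window."""
    parts = [description]
    # Reserve room for the description, the final query, and a small margin
    # for the completion the model will generate (margin chosen arbitrarily).
    budget = n_ctx - count_tokens(description) - count_tokens(query) - 64
    for source, target in demonstrations:
        demo = f"{source} => {target}"
        cost = count_tokens(demo)
        if cost > budget:
            break
        parts.append(demo)
        budget -= cost
    parts.append(f"{query} =>")
    return "\n".join(parts)
```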
Figure 2.1 shows the four methods using the example of translating English to French. In this paper we focus on zero-shot, one-shot and few-shot, with the aim of comparing them not as competing alternatives, but as different problem settings which offer a varying trade-off between performance on specific benchmarks and sample efficiency. We especially highlight the few-shot results as many of them are only slightly behind state-of-the-art fine-tuned models. Ultimately, however, one-shot, or even sometimes zero-shot, seem like the fairest comparisons to human performance, and are important targets for future work.
Sections 2.1-2.3 below give details on our models, training data, and training process respectively. Section 2.4 discusses the details of how we do few-shot, one-shot, and zero-shot evaluations.
2.1 Model and Architectures
We use the same model and architecture as GPT-2 [RWC+19], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer [CGRS19]. To study the dependence of ML performance on model size, we train 8 different sizes of model, ranging over three orders of magnitude from 125 million parameters to 175 billion parameters, with the last being the model we call GPT-3. Previous work [KMH+20] suggests that with enough training data, scaling of validation loss should be approximately a smooth power law as a function of size; training models of many different sizes allows us to test this hypothesis both for validation loss and for downstream language tasks.
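To illustrate what “locally banded sparse attention” means at the level of the attention mask, the sketch below builds a dense causal mask and a banded causal mask and alternates them across layers; the band width and the strict every-other-layer alternation are our own assumptions in the spirit of [CGRS19], not GPT-3’s exact configuration.

```python
import numpy as np

# Causal attention masks: a dense layer attends to every earlier position,
# while a locally banded layer attends only to a recent window of positions.
# The band width (128 here) and the alternation pattern are illustrative.

def dense_causal_mask(n):
    return np.tril(np.ones((n, n), dtype=bool))          # attend to all previous tokens

def banded_causal_mask(n, band=128):
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - band)                      # attend to the last `band` tokens

masks = [dense_causal_mask(2048) if layer % 2 == 0 else banded_causal_mask(2048)
         for layer in range(4)]  # an alternating dense/sparse pattern, for illustration
```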
Table 2.1 shows the sizes and architectures of our 8 models. Here n_params is the total number of trainable parameters, n_layers is the total number of layers, d_model is the number of units in each bottleneck layer (we always have the feedforward layer four times the size of the bottleneck layer, d_ff = 4 * d_model), and d_head is the dimension of each attention head. All models use a context window of n_ctx = 2048 tokens. We partition the model across GPUs along both the depth and width dimension in order to minimize data-transfer between nodes. The precise architectural parameters for each model are chosen based on computational efficiency and load-balancing in the layout of models across GPUs. Previous work [KMH+20] suggests that validation loss is not strongly sensitive to these parameters within a reasonably broad range.
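A useful sanity check on Table 2.1 is that, with d_ff = 4 * d_model, each transformer layer contributes roughly 4 * d_model^2 attention parameters and 8 * d_model^2 feedforward parameters, so the non-embedding parameter count is approximately 12 * n_layers * d_model^2. The sketch below applies this approximation; the vocabulary size is an assumption (the 50257-token BPE vocabulary inherited from GPT-2).

```python
# Approximate parameter count for a decoder-only transformer with
# d_ff = 4 * d_model: ~12 * d_model^2 weights per layer (4 d^2 for the
# attention projections, 8 d^2 for the feedforward block), plus the token
# embedding matrix. The 50257-token vocabulary is an assumption (GPT-2's BPE).

def approx_params(n_layers, d_model, vocab_size=50257):
    non_embedding = 12 * n_layers * d_model ** 2
    embedding = vocab_size * d_model
    return non_embedding + embedding

# For the largest model in Table 2.1 (n_layers = 96, d_model = 12288):
print(f"{approx_params(96, 12288) / 1e9:.1f}B")  # ~174.6B, i.e. the "175B" model
```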