3.9.1 Arithmetic

To test GPT-3’s ability to perform simple arithmetic operations without task-specific training, we developed a small  battery of 10 tests that involve asking GPT-3 a simple arithmetic problem in natural language:

2 digit addition (2D+) – The model is asked to add two integers sampled uniformly from [0, 100), phrased in  the form of a question, e.g. “Q: What is 48 plus 76? A: 124.”  

2 digit subtraction (2D-) – The model is asked to subtract two integers sampled uniformly from [0, 100); the  answer may be negative. Example: “Q: What is 34 minus 53? A: -19”.  

3 digit addition (3D+) – Same as 2 digit addition, except numbers are uniformly sampled from [0, 1000).

3 digit subtraction (3D-) – Same as 2 digit subtraction, except numbers are uniformly sampled from [0, 1000).

4 digit addition (4D+) – Same as 3 digit addition, except uniformly sampled from [0, 10000).

4 digit subtraction (4D-) – Same as 3 digit subtraction, except uniformly sampled from [0, 10000).

5 digit addition (5D+) – Same as 3 digit addition, except uniformly sampled from [0, 100000).

5 digit subtraction (5D-) – Same as 3 digit subtraction, except uniformly sampled from [0, 100000).

2 digit multiplication (2Dx) – The model is asked to multiply two integers sampled uniformly from [0, 100), e.g. “Q: What is 24 times 42? A: 1008”.

One-digit composite (1DC) – The model is asked to perform a composite operation on three 1 digit numbers, with parentheses around the last two. For example, “Q: What is 6+(4*8)? A: 38”. The three 1 digit numbers are selected uniformly on [0, 10) and the operations are selected uniformly from {+,-,*}.
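The task definitions above can be sketched as a small generator. This is a minimal illustration, not the authors' code: the helper names (`make_problem`, `make_composite`, `exact_match`) are hypothetical, and the prompt wording is copied from the examples in the list.

```python
import random

# Operation table: symbol -> (word used in the question, function).
# These names are illustrative assumptions, not taken from the paper.
OPS = {
    "+": ("plus", lambda a, b: a + b),
    "-": ("minus", lambda a, b: a - b),
    "*": ("times", lambda a, b: a * b),
}

def make_problem(digits, op, rng):
    """Return (prompt, answer) with both operands uniform on [0, 10**digits)."""
    a = rng.randrange(10 ** digits)
    b = rng.randrange(10 ** digits)
    word, fn = OPS[op]
    return f"Q: What is {a} {word} {b}? A:", str(fn(a, b))

def make_composite(rng):
    """One-digit composite (1DC): a op1 (b op2 c), digits uniform on [0, 10)."""
    a, b, c = (rng.randrange(10) for _ in range(3))
    op1, op2 = rng.choice("+-*"), rng.choice("+-*")
    inner = OPS[op2][1](b, c)
    return f"Q: What is {a}{op1}({b}{op2}{c})? A:", str(OPS[op1][1](a, inner))

def exact_match(prediction, answer):
    """Score 1 only if the generated answer matches exactly (whitespace stripped)."""
    return prediction.strip() == answer
```

Calling `make_problem(3, "-", random.Random(0))` would produce one 3D- instance; repeating the call 2,000 times gives a dataset of the size the section describes.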



In all 10 tasks the model must generate the correct answer exactly. For each task we generate a dataset of 2,000 random  instances of the task and evaluate all models on those instances.  First we evaluate GPT-3 in the few-shot setting, for which results are shown in Figure 3.10. On addition and subtraction,  GPT-3 displays strong proficiency when the number of digits is small, achieving 100% accuracy on 2 digit addition,  98.9% at 2 digit subtraction, 80.2% at 3 digit addition, and 94.2% at 3-digit subtraction. Performance decreases as the  number of digits increases, but GPT-3 still achieves 25-26% accuracy on four digit operations and 9-10% accuracy on  five digit operations, suggesting at least some capacity to generalize to larger numbers of digits. GPT-3 also achieves  29.2% accuracy at 2 digit multiplication, an especially computationally intensive operation. Finally, GPT-3 achieves  21.3% accuracy at single digit combined operations (for example, 9*(7+5)), suggesting that it has some robustness  beyond just single operations.  


As Figure 3.10 makes clear, small models do poorly on all of these tasks – even the 13 billion parameter model (the  second largest after the 175 billion full GPT-3) can solve 2 digit addition and subtraction only half the time, and all  other operations less than 10% of the time.  

One-shot and zero-shot performance are somewhat degraded relative to few-shot performance, suggesting that adaptation  to the task (or at the very least recognition of the task) is important to performing these computations correctly.  Nevertheless, one-shot performance is still quite strong, and even zero-shot performance of the full GPT-3 significantly outperforms few-shot learning for all smaller models. All three settings for the full GPT-3 are shown in Table 3.9, and  model capacity scaling for all three settings is shown in Appendix H.

To spot-check whether the model is simply memorizing specific arithmetic problems, we took the 3-digit arithmetic  problems in our test set and searched for them in our training data in both the forms "<NUM1> + <NUM2> =" and  "<NUM1> plus <NUM2>". Out of 2,000 addition problems we found only 17 matches (0.8%) and out of 2,000  subtraction problems we found only 2 matches (0.1%), suggesting that only a trivial fraction of the correct answers  could have been memorized. In addition, inspection of incorrect answers reveals that the model often makes mistakes  such as not carrying a “1”, suggesting it is actually attempting to perform the relevant computation rather than  memorizing a table.  
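The spot check described above amounts to searching the training text for two surface forms of each problem. A minimal sketch, assuming the training data fits in a single string and that plain substring search is acceptable; the function names are hypothetical and the real search over the training corpus is not described in more detail here.

```python
def surface_forms(a, b, symbol="+", word="plus"):
    """The two forms searched for: "<NUM1> + <NUM2> =" and "<NUM1> plus <NUM2>"."""
    return [f"{a} {symbol} {b} =", f"{a} {word} {b}"]

def count_matches(problems, corpus_text, symbol="+", word="plus"):
    """Count problems for which at least one surface form occurs in the corpus.

    Plain substring search is an assumption of this sketch; it would also
    match inside longer numbers (for example "12 + 3 =" inside "112 + 3 =").
    """
    hits = 0
    for a, b in problems:
        if any(form in corpus_text for form in surface_forms(a, b, symbol, word)):
            hits += 1
    return hits
```

Dividing the returned count by the number of problems gives the match rate reported in the text (17 of 2,000 for addition, 2 of 2,000 for subtraction).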

Overall, GPT-3 displays reasonable proficiency at moderately complex arithmetic in few-shot, one-shot, and even  zero-shot settings.



3.9.2 Word Scrambling and Manipulation Tasks

To test GPT-3’s ability to learn novel symbolic manipulations from a few examples, we designed a small battery of  5 “character manipulation” tasks. Each task involves giving the model a word distorted by some combination of  scrambling, addition, or deletion of characters, and asking it to recover the original word. The 5 tasks are:

Cycle letters in word (CL) – The model is given a word with its letters cycled, then the “=” symbol, and is expected to generate the original word. For example, it might be given “lyinevitab” and should output “inevitably”.

Anagrams of all but first and last characters (A1) – The model is given a word where every letter except the first and last has been scrambled randomly, and must output the original word. Example: criroptuon = corruption.

Anagrams of all but first and last 2 characters (A2) – The model is given a word where every letter except the first 2 and last 2 has been scrambled randomly, and must recover the original word. Example: opoepnnt → opponent.

Random insertion in word (RI) – A random punctuation or space character is inserted between each letter of a word, and the model must output the original word. Example: s.u!c/c!e.s s i/o/n = succession.

Reversed words (RW) – The model is given a word spelled backwards, and must output the original word. Example: stcejbo → objects.
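The five distortions can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the function names are hypothetical, the cycle offset is drawn uniformly, and the inserted characters are drawn from ASCII punctuation plus the space character.

```python
import random
import string

def cycle_letters(word, rng):
    """CL: rotate the word by a random non-zero offset, e.g. inevitably -> lyinevitab."""
    k = rng.randrange(1, len(word))
    return word[k:] + word[:k]

def anagram_keep_ends(word, keep, rng):
    """A1 (keep=1) and A2 (keep=2): shuffle everything except `keep` letters at each end."""
    middle = list(word[keep:len(word) - keep])
    rng.shuffle(middle)
    return word[:keep] + "".join(middle) + word[len(word) - keep:]

def random_insertion(word, rng):
    """RI: insert one random punctuation or space character between adjacent letters."""
    fillers = string.punctuation + " "
    out = [word[0]]
    for ch in word[1:]:
        out.append(rng.choice(fillers))
        out.append(ch)
    return "".join(out)

def reverse_word(word):
    """RW: spell the word backwards, e.g. objects -> stcejbo."""
    return word[::-1]
```

In each case the model would be shown the distorted string and scored on whether it outputs the original word.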





For each task we generate 10,000 examples, which we chose to be the top 10,000 most frequent words as measured by  [Nor09] of length more than 4 characters and less than 15 characters. The few-shot results are shown in Figure 3.11.  Task performance tends to grow smoothly with model size, with the full GPT-3 model achieving 66.9% on removing random insertions, 38.6% on cycling letters, 40.2% on the easier anagram task, and 15.1% on the more difficult anagram  task (where only the first and last letters are held fixed). None of the models can reverse the letters in a word.  

In the one-shot setting, performance is significantly weaker (dropping by half or more), and in the zero-shot setting the model can rarely perform any of the tasks (Table 3.10). This suggests that the model really does appear to learn these tasks at test time, as the model cannot perform them zero-shot and their artificial nature makes them unlikely to appear in the pre-training data (although we cannot confirm this with certainty).


We can further quantify performance by plotting “in-context learning curves”, which show task performance as a  function of the number of in-context examples. We show in-context learning curves for the Symbol Insertion task  in Figure 1.2. We can see that larger models are able to make increasingly effective use of in-context information,  including both task examples and natural language task descriptions.

Finally, it is worth adding that solving these tasks requires character-level manipulations, whereas our BPE encoding  operates on significant fractions of a word (on average ∼ 0.7 words per token), so from the LM’s perspective succeeding  at these tasks involves not just manipulating BPE tokens but understanding and pulling apart their substructure. Also,  CL, A1, and A2 are not bijective (that is, the unscrambled word is not a deterministic function of the scrambled word),  requiring the model to perform some search to find the correct unscrambling. Thus, the skills involved appear to require  non-trivial pattern-matching and computation.



3.9.3 SAT Analogies

To test GPT-3 on another task that is somewhat unusual relative to the typical distribution of text, we collected a set of 374 “SAT analogy” problems [TLBS03]. Analogies are a style of multiple choice question that constituted a section of the SAT college entrance exam before 2005. A typical example is “audacious is to boldness as (a) sanctimonious is to hypocrisy, (b) anonymous is to identity, (c) remorseful is to misdeed, (d) deleterious is to result, (e) impressionable is to temptation”. The student is expected to choose which of the five word pairs has the same relationship as the original word pair; in this example the answer is “sanctimonious is to hypocrisy”. On this task GPT-3 achieves 65.2% in the few-shot setting, 59.1% in the one-shot setting, and 53.7% in the zero-shot setting, whereas the average score among college applicants was 57% [TL05] (random guessing yields 20%). As shown in Figure 3.12, the results improve with scale, with the full 175 billion model improving by over 10% compared to the 13 billion parameter model.

3.9.4 News Article Generation

Previous work on generative language models qualitatively tested their ability to generate synthetic “news articles” by  conditional sampling from the model given a human-written prompt consisting of a plausible first sentence for a news  story [RWC+19]. Relative to [RWC+19], the dataset used to train GPT-3 is much less weighted towards news articles,  so trying to generate news articles via raw unconditional samples is less effective – for example GPT-3 often interprets  the proposed first sentence of a “news article” as a tweet and then posts synthetic responses or follow-up tweets. To  solve this problem we employed GPT-3’s few-shot learning abilities by providing three previous news articles in the  model’s context to condition it. With the title and subtitle of a proposed next article, the model is able to reliably  generate short articles in the “news” genre.  

To gauge the quality of news article generation from GPT-3 (which we believe is likely to be correlated with conditional sample generation quality in general), we decided to measure human ability to distinguish GPT-3-generated articles from real ones. Similar work has been carried out by Kreps et al. [KMB20] and Zellers et al. [ZHR+19]. Generative language models are trained to match the distribution of content generated by humans, so the (in)ability of humans to distinguish the two is a potentially important measure of quality.



In order to see how well humans can detect model generated text, we arbitrarily selected 25 article titles and subtitles from the website newser.com (mean length: 215 words). We then generated completions of these titles and subtitles from four language models ranging in size from 125M to 175B (GPT-3) parameters (mean length: 200 words). For each model, we presented around 80 US-based participants with a quiz consisting of these real titles and subtitles followed by either the human written article or the article generated by the model. Participants were asked to select whether the article was “very likely written by a human”, “more likely written by a human”, “I don’t know”, “more likely written by a machine”, or “very likely written by a machine”.

The articles we selected were not in the models’ training data and the model outputs were formatted and selected programmatically to prevent human cherry-picking. All models used the same context to condition outputs on and were pre-trained with the same context size, and the same article titles and subtitles were used as prompts for each model. However, we also ran an experiment to control for participant effort and attention that followed the same format but involved intentionally bad model generated articles. This was done by generating articles from a “control model”: a 160M parameter model with no context and increased output randomness.



Mean human accuracy (the ratio of correct assignments to non-neutral assignments per participant) at detecting that the intentionally bad articles were model generated was ∼ 86% where 50% is chance level performance. By contrast, mean human accuracy at detecting articles that were produced by the 175B parameter model was barely above chance at ∼ 52% (see Table 3.11). Human abilities to detect model generated text appear to decrease as model size increases: there appears to be a trend towards chance accuracy with model size, and human detection of GPT-3 is close to chance. This is true despite the fact that participants spend more time on each output as model size increases (see Appendix E).
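The accuracy metric defined above (correct assignments divided by non-neutral assignments, computed per participant and then averaged) can be sketched as follows. The shortened label strings and the function names are assumptions made for illustration; "I don't know" responses are treated as neutral and excluded from the denominator.

```python
# Shortened response labels; the exact quiz wording is given in the text above.
HUMAN = {"very likely human", "more likely human"}
MACHINE = {"more likely machine", "very likely machine"}

def participant_accuracy(responses):
    """responses: list of (label, truth) pairs, truth being "human" or "machine".

    Returns correct / non-neutral, or None if every response was neutral.
    """
    correct = non_neutral = 0
    for label, truth in responses:
        if label in HUMAN:
            guess = "human"
        elif label in MACHINE:
            guess = "machine"
        else:
            continue  # neutral answer ("I don't know") is not counted
        non_neutral += 1
        correct += guess == truth
    return correct / non_neutral if non_neutral else None

def mean_accuracy(all_responses):
    """Average the per-participant accuracies, skipping all-neutral participants."""
    scores = [s for s in map(participant_accuracy, all_responses) if s is not None]
    return sum(scores) / len(scores) if scores else None
```

Under this definition a participant who guesses at random on every non-neutral answer scores about 50%, which is the chance level the text refers to.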

Examples of synthetic articles from GPT-3 are given in Figures 3.14 and 3.15. Much of the text is, as indicated by the evaluations, difficult for humans to distinguish from authentic human content. Factual inaccuracies can be an indicator that an article is model generated since, unlike human authors, the models have no access to the specific facts that the article titles refer to or when the article was written. Other indicators include repetition, non sequiturs, and unusual phrasings, though these are often subtle enough that they are not noticed.

Related work on language model detection by Ippolito et al. [IDCBE19] indicates that automatic discriminators like GROVER [ZHR+19] and GLTR [GSR19] may have greater success at detecting model generated text than human evaluators. Automatic detection of these models may be a promising area of future research.

Ippolito et al. [IDCBE19] also note that human accuracy at detecting model generated text increases as humans observe  more tokens. To do a preliminary investigation of how good humans are at detecting longer news articles generated  by GPT-3 175B, we selected 12 world news articles from Reuters with an average length of 569 words and generated  completions of these articles from GPT-3 with an average length of 498 words (298 words longer than our initial  experiments). Following the methodology above, we ran two experiments, each on around 80 US-based participants, to  compare human abilities to detect the articles generated by GPT-3 and a control model.  

We found that mean human accuracy at detecting the intentionally bad longer articles from the control model was  ∼ 88%, while mean human accuracy at detecting the longer articles that were produced by GPT-3 175B was still barely  above chance at ∼ 52% (see Table 3.12). This indicates that, for news articles that are around 500 words long, GPT-3  continues to produce articles that humans find difficult to distinguish from human written news articles.




3.9.5 Learning and Using Novel Words

A task studied in developmental linguistics [CB78] is the ability to learn and utilize new words, for example using a  word in a sentence after seeing it defined only once, or conversely inferring a word’s meaning from only one usage. Here  we qualitatively test GPT-3’s ability to do the former. Specifically, we give GPT-3 the definition of a nonexistent word,  such as “Gigamuru”, and then ask it to use it in a sentence. We provide one to five previous examples of a (separate) nonexistent word being defined and used in a sentence, so the task is few-shot in terms of previous examples of the  broad task and one-shot in terms of the specific word. Table 3.16 shows the 6 examples we generated; all definitions  were human-generated, and the first answer was human-generated as conditioning while the subsequent answers were  generated by GPT-3. These examples were generated continuously in one sitting and we did not omit or repeatedly try  any prompts. In all cases the generated sentence appears to be a correct or at least plausible use of the word. In the final  sentence the model generates a plausible conjugation for the word “screeg” (namely “screeghed”), although the use of  the word is slightly awkward (“screeghed at each other”) despite being plausible in the sense that it could describe a toy  sword fight. Overall, GPT-3 appears to be at least proficient at the task of using novel words in a sentence.


3.9.6 Correcting English Grammar

Another task well suited for few-shot learning is correcting English grammar. We test this with GPT-3 in the few-shot setting by giving prompts of the form "Poor English Input: <sentence>\n Good English Output: <sentence>". We give GPT-3 one human-generated correction and then ask it to correct 5 more (again without any omissions or repeats). Results are shown in Figure 3.17.
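The prompt format quoted above could be assembled along these lines. This is only a sketch: the function name is hypothetical, and the blank line separating demonstrations is an assumption, since the text specifies only the "Poor English Input" / "Good English Output" pair.

```python
def grammar_prompt(demonstrations, sentence):
    """Build a few-shot grammar-correction prompt.

    demonstrations: list of (poor, good) sentence pairs shown as solved examples.
    sentence: the new poor-English sentence the model should correct.
    """
    blocks = [
        f"Poor English Input: {poor}\nGood English Output: {good}"
        for poor, good in demonstrations
    ]
    # The final block is left open so the model completes the correction.
    blocks.append(f"Poor English Input: {sentence}\nGood English Output:")
    return "\n\n".join(blocks)
```

With one demonstration pair, this matches the setup in the text, where a single human-generated correction conditions the model before it is asked to correct further sentences.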

4 Measuring and Preventing Memorization Of Benchmarks

Since our training dataset is sourced from the internet, it is possible that our model was trained on some of our  benchmark test sets. Accurately detecting test contamination from internet-scale datasets is a new area of research  without established best practices. While it is common practice to train large models without investigating contamination,  given the increasing scale of pretraining datasets, we believe this issue is becoming increasingly important to attend to.  

This concern is not just hypothetical. One of the first papers to train a language model on Common Crawl data [TL18]  detected and removed a training document which overlapped with one of their evaluation datasets. Other work such  as GPT-2 [RWC+19] also conducted post-hoc overlap analysis. Their study was relatively encouraging, finding that although models did perform moderately better on data that overlapped between training and testing, this did not  significantly impact reported results due to the small fraction of data which was contaminated (often only a few percent).



GPT-3 operates in a somewhat different regime. On the one hand, the dataset and model size are about two orders of  magnitude larger than those used for GPT-2, and include a large amount of Common Crawl, creating increased potential  for contamination and memorization. On the other hand, precisely due to the large amount of data, even GPT-3 175B  does not overfit its training set by a significant amount, measured relative to a held-out validation set with which it was  deduplicated (Figure 4.1). Thus, we expect that contamination is likely to be frequent, but that its effects may not be as  large as feared.  

We initially tried to address the issue of contamination by proactively searching for and attempting to remove any overlap  between our training data and the development and test sets of all benchmarks studied in this paper. Unfortunately, a  bug resulted in only partial removal of all detected overlaps from the training data. Due to the cost of training, it wasn’t  feasible to retrain the model. To address this, we investigate in detail how the remaining detected overlap impacts  results.  

For each benchmark, we produce a ‘clean’ version which removes all potentially leaked examples, defined roughly as  examples that have a 13-gram overlap with anything in the pretraining set (or that overlap with the whole example when  it is shorter than 13-grams). The goal is to very conservatively flag anything that could potentially be contamination,  so as to produce a clean subset that is free of contamination with high confidence. The exact procedure is detailed in  Appendix C.
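The filtering rule above can be illustrated with a small N-gram overlap check. This is a simplified sketch and not the procedure of Appendix C: whitespace tokenization, lowercasing, and the function names are all assumptions, and a practical version would index the training N-grams once instead of rescanning every document.

```python
def tokenize(text):
    """Lowercased whitespace tokenization (an assumption of this sketch)."""
    return text.lower().split()

def ngrams(tokens, n):
    """Set of all contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_flagged(example, training_docs, n=13):
    """Flag an example that shares an n-gram with any training document.

    Examples shorter than n tokens are flagged only if they appear whole.
    """
    ex = tokenize(example)
    size = min(n, len(ex))
    if size == 0:
        return False
    ex_grams = ngrams(ex, size)
    return any(ex_grams & ngrams(tokenize(doc), size) for doc in training_docs)

def clean_subset(examples, training_docs, n=13):
    """Keep only the examples that are not flagged as potentially leaked."""
    return [e for e in examples if not is_flagged(e, training_docs, n)]
```

Because a single shared 13-gram is enough to flag an example, the check errs on the side of over-flagging, which matches the conservative intent stated in the text.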




We then evaluate GPT-3 on these clean benchmarks, and compare to the original score. If the score on the clean  subset is similar to the score on the entire dataset, this suggests that contamination, even if present, does not have a  significant effect on reported results. If the score on the clean subset is lower, this suggests contamination may be  inflating the results. The results are summarized in Figure 4.2. Although potential contamination is often high (with a  quarter of benchmarks scoring over 50%), in most cases performance changes only negligibly, and we see no evidence  that contamination level and performance difference are correlated. We conclude that either our conservative method  substantially overestimated contamination or that contamination has little effect on performance.  
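The comparison between the full benchmark and its clean subset can be sketched as below. The function name and the returned fields are hypothetical; the sketch assumes per-example correctness flags and per-example contamination flags are already available.

```python
def contamination_report(correct, flagged):
    """Compare accuracy on all examples with accuracy on the unflagged ones.

    correct[i]: True if example i was answered correctly.
    flagged[i]: True if example i was marked as potentially contaminated.
    """
    total = len(correct)
    clean = [c for c, f in zip(correct, flagged) if not f]
    full_score = sum(correct) / total
    clean_score = sum(clean) / len(clean) if clean else None
    relative = None
    if clean_score is not None and full_score > 0:
        # Negative values mean the clean subset scores lower than the full set.
        relative = (clean_score - full_score) / full_score
    return {
        "fraction_flagged": sum(flagged) / total,
        "full_score": full_score,
        "clean_score": clean_score,
        "relative_change": relative,
    }
```

A relative change near zero corresponds to the case where contamination, even if present, has little effect on the reported result; a clearly negative value is the case the text describes as contamination possibly inflating results.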

Below, we review in more detail the few specific cases where either (1) the model performs significantly worse on  the cleaned version, or (2) potential contamination is very high, which makes measuring the performance difference  difficult.



Our analysis flagged six groups of benchmarks for further investigation: Word Scrambling, Reading Comprehension  (QuAC, SQuAD2, DROP), PIQA, Winograd, language modeling tasks (Wikitext tasks, 1BW), and German to English translation. Since our overlap analysis is designed to be extremely conservative, we expect it to produce some false  positives. We summarize the results for each group of tasks below:

Reading Comprehension: Our initial analysis flagged >90% of task examples from QuAC, SQuAD2, and  DROP as potentially contaminated, so large that even measuring the differential on a clean subset was difficult.  Upon manual inspection, however, we found that for every overlap we inspected, in all 3 datasets, the source  text was present in our training data but the question/answer pairs were not, meaning the model gains only  background information and cannot memorize the answer to a specific question.  

German translation: We found 25% of the examples in the WMT16 German-English test set were marked  as potentially contaminated, with an associated total effect size of 1-2 BLEU. Upon inspection, none of the  flagged examples contain paired sentences resembling NMT training data and collisions were monolingual  matches mostly of snippets of events discussed in the news.  

Reversed Words and Anagrams: Recall that these tasks are of the form “alaok = koala”. Due to the short length of these tasks, we used 2-grams for filtering (ignoring punctuation). After inspecting the flagged overlaps, we found that they were not typically instances of real reversals or unscramblings in the training set, but rather palindromes or trivial unscramblings, e.g. “kayak = kayak”. The amount of overlap was small, but removing the trivial tasks leads to an increase in difficulty and thus a spurious signal. Related to this, the symbol insertion task shows high overlap but no effect on performance – this is because that task involves removing non-letter characters from a word, and the overlap analysis itself ignores such characters, leading to many spurious matches.

PIQA: The overlap analysis flagged 29% of examples as contaminated, and observed a 3 percentage point  absolute decrease (4% relative decrease) in performance on the clean subset. Though the test dataset was  released after our training set was created and its labels are hidden, some of the web pages used by the  crowdsourced dataset creators are contained in our training set. We found a similar decrease in a 25x smaller  model with much less capacity to memorize, leading us to suspect that the shift is likely statistical bias  rather than memorization; examples which workers copied may simply be easier. Unfortunately, we cannot  rigorously prove this hypothesis. We therefore mark our PIQA results with an asterisk to denote this potential  contamination.

Winograd: The overlap analysis flagged 45% of examples, and found a 2.6% decrease in performance on the  clean subset. Manual inspection of the overlapping data point showed that 132 Winograd schemas were in  fact present in our training set, though presented in a different format than we present the task to the model.  Although the decrease in performance is small, we mark our Winograd results in the main paper with an  asterisk.

Language modeling: We found the 4 Wikipedia language modeling benchmarks measured in GPT-2, plus the  Children’s Book Test dataset, to be almost entirely contained in our training data. Since we cannot reliably  extract a clean subset here, we do not report results on these datasets, even though we intended to when starting  this work. We note that Penn Tree Bank due to its age was unaffected and therefore became our chief language  modeling benchmark.




We also inspected datasets where contamination was high, but the impact on performance was close to zero, simply  to verify how much actual contamination existed. These appeared to often contain false positives. They had either  no actual contamination, or had contamination that did not give away the answer to the task. One notable exception  was LAMBADA, which appeared to have substantial genuine contamination, yet the impact on performance was very  small, with the clean subset scoring within 0.5% of the full dataset. Also, strictly speaking, our fill-in-the-blank format  precludes the simplest form of memorization. Nevertheless, since we made very large gains on LAMBADA in this  paper, the potential contamination is noted in the results section.  

An important limitation of our contamination analysis is that we cannot be sure that the clean subset is drawn from the  same distribution as the original dataset. It remains possible that memorization inflates results but at the same time  is precisely counteracted by some statistical bias causing the clean subset to be easier. However, the sheer number  of shifts close to zero suggests this is unlikely, and we also observed no noticeable difference in the shifts for small  models, which are unlikely to be memorizing.  

Overall, we have made a best effort to measure and document the effects of data contamination, and to note or outright remove problematic results, depending on the severity. Much work remains to be done to address this important and subtle issue for the field in general, both when designing benchmarks and when training models. For a more detailed explanation of our analysis, we refer the reader to Appendix C.




5 Limitations

GPT-3 and our analysis of it have a number of limitations. Below we describe some of these and suggest directions for  future work.  

First, despite the strong quantitative and qualitative improvements of GPT-3, particularly compared to its direct  predecessor GPT-2, it still has notable weaknesses in text synthesis and several NLP tasks. On text synthesis, although  the overall quality is high, GPT-3 samples still sometimes repeat themselves semantically at the document level, start to  lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences  or paragraphs. We will release a collection of 500 uncurated unconditional samples to help provide a better sense of  GPT-3’s limitations and strengths at text synthesis. Within the domain of discrete language tasks, we have noticed  informally that GPT-3 seems to have special difficulty with “common sense physics”, despite doing well on some  datasets (such as PIQA [BZB+19]) that test this domain. Specifically GPT-3 has difficulty with questions of the type  “If I put cheese into the fridge, will it melt?”. Quantitatively, GPT-3’s in-context learning performance has some notable  gaps on our suite of benchmarks, as described in Section 3, and in particular it does little better than chance when  evaluated one-shot or even few-shot on some “comparison” tasks, such as determining if two words are used the same  way in a sentence, or if one sentence implies another (WIC and ANLI respectively), as well as on a subset of reading  comprehension tasks. This is especially striking given GPT-3’s strong few-shot performance on many other tasks.



GPT-3 has several structural and algorithmic limitations, which could account for some of the issues above. We focused on exploring in-context learning behavior in autoregressive language models because it is straightforward to both sample and compute likelihoods with this model class. As a result, our experiments do not include any bidirectional architectures or other training objectives such as denoising. This is a noticeable difference from much of the recent literature, which has documented improved fine-tuning performance when using these approaches over standard language models [RSR+19]. Thus our design decision comes at the cost of potentially worse performance on tasks which empirically benefit from bidirectionality. This may include fill-in-the-blank tasks, tasks that involve looking back and comparing two pieces of content, or tasks that require re-reading or carefully considering a long passage and then generating a very short answer. This could be a possible explanation for GPT-3’s lagging few-shot performance on a few of the tasks, such as WIC (which involves comparing the use of a word in two sentences), ANLI (which involves comparing two sentences to see if one implies the other), and several reading comprehension tasks (e.g. QuAC and RACE). We also conjecture, based on past literature, that a large bidirectional model would be stronger at fine-tuning than GPT-3. Making a bidirectional model at the scale of GPT-3, and/or trying to make bidirectional models work with few- or zero-shot learning, is a promising direction for future research, and could help achieve the “best of both worlds”.
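To illustrate why sampling and likelihood computation are both straightforward for autoregressive models, the following is a minimal sketch using a toy model. The table `TOY_MODEL` and the functions `log_likelihood` and `sample` are hypothetical names introduced here, not anything from the paper, and the toy conditions only on the previous token, whereas a real autoregressive model conditions on the entire prefix.

```python
import math
import random

# Hypothetical toy next-token table: P(next token | previous token).
# A real autoregressive LM would condition on the whole prefix instead.
TOY_MODEL = {
    "<s>": {"a": 0.5, "b": 0.5},
    "a": {"a": 0.25, "b": 0.25, "</s>": 0.5},
    "b": {"a": 0.5, "</s>": 0.5},
}


def log_likelihood(tokens):
    """Chain rule: sum of log P(token | context), read left to right."""
    prev, total = "<s>", 0.0
    for tok in list(tokens) + ["</s>"]:
        total += math.log(TOY_MODEL[prev][tok])
        prev = tok
    return total


def sample(rng, max_len=20):
    """Ancestral sampling: draw one token at a time until end-of-sequence."""
    prev, out = "<s>", []
    for _ in range(max_len):
        dist = TOY_MODEL[prev]
        tok = rng.choices(list(dist), weights=list(dist.values()))[0]
        if tok == "</s>":
            break
        out.append(tok)
        prev = tok
    return out


if __name__ == "__main__":
    print(log_likelihood(["a", "b"]))  # log(0.5 * 0.25 * 0.5)
    print(sample(random.Random(0)))
```

Both operations need only a single left-to-right pass. A bidirectional or denoising model does not factor the sequence probability this way, which is consistent with the convenience argument made above.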


A more fundamental limitation of the general approach described in this paper – scaling up any LM-like model, whether autoregressive or bidirectional – is that it may eventually run into (or could already be running into) the limits of the pretraining objective. Our current objective weights every token equally and lacks a notion of what is most important to predict and what is less important. [RRS20] demonstrate benefits of customizing prediction to entities of interest. Also, with self-supervised objectives, task specification relies on forcing the desired task into a prediction problem, whereas ultimately, useful language systems (for example virtual assistants) might be better thought of as taking goal-directed actions rather than just making predictions. Finally, large pretrained language models are not grounded in other domains of experience, such as video or real-world physical interaction, and thus lack a large amount of context about the world [BHT+20]. For all these reasons, scaling pure self-supervised prediction is likely to hit limits, and augmentation with a different approach is likely to be necessary. Promising future directions in this vein might include learning the objective function from humans [ZSW+19a], fine-tuning with reinforcement learning, or adding additional modalities such as images to provide grounding and a better model of the world [CLY+19].
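The statement that the objective "weights every token equally" can be made concrete with a small sketch. The function `weighted_nll` and its `weights` argument are hypothetical and introduced only for illustration. With no weights it reduces to the usual average negative log-likelihood, and non-uniform weights show one conceivable way of emphasizing some tokens, which is not the method of [RRS20] or of this paper.

```python
import math


def weighted_nll(token_probs, weights=None):
    """Weighted average negative log-likelihood over one sequence.

    token_probs: probability the model assigned to each correct next token.
    weights: hypothetical per-token importance weights; None means every
             token counts equally, as in the standard LM objective.
    """
    if weights is None:
        weights = [1.0] * len(token_probs)
    if len(weights) != len(token_probs):
        raise ValueError("weights and token_probs must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    loss = sum(-w * math.log(p) for w, p in zip(weights, token_probs))
    return loss / total_weight


if __name__ == "__main__":
    probs = [0.9, 0.1]  # the second token is predicted poorly
    print(weighted_nll(probs))              # uniform weighting
    print(weighted_nll(probs, [0.0, 1.0]))  # all weight on the second token
```

Under uniform weighting, a poorly predicted but important token (for example an entity name) contributes no more to the loss than a function word. The weighted variant makes that choice explicit.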

Another limitation broadly shared by language models is poor sample efficiency during pre-training. While GPT-3 takes a step towards test-time sample efficiency closer to that of humans (one-shot or zero-shot), it still sees much more text during pre-training than a human sees in their lifetime [Lin20]. Improving pre-training sample efficiency is an important direction for future work, and might come from grounding in the physical world to provide additional information, or from algorithmic improvements.

A limitation, or at least uncertainty, associated with few-shot learning in GPT-3 is ambiguity about whether few-shot learning actually learns new tasks “from scratch” at inference time, or if it simply recognizes and identifies tasks that it has learned during training. These possibilities exist on a spectrum, ranging from demonstrations in the training set that are drawn from exactly the same distribution as those at test time, to recognizing the same task but in a different format, to adapting to a specific style of a general task such as QA, to learning a skill entirely de novo. Where GPT-3 is on this spectrum may also vary from task to task. Synthetic tasks such as word scrambling or defining nonsense words seem especially likely to be learned de novo, whereas translation clearly must be learned during pretraining, although possibly from data that is very different in organization and style from the test data. Ultimately, it is not even clear what humans learn from scratch vs. from prior demonstrations. Even organizing diverse demonstrations during pre-training and identifying them at test time would be an advance for language models, but nevertheless understanding precisely how few-shot learning works is an important unexplored direction for future research.

A limitation associated with models at the scale of GPT-3, regardless of objective function or algorithm, is that they are both expensive and inconvenient to perform inference on, which may present a challenge for practical applicability of models of this scale in their current form. One possible future direction to address this is distillation [HVD15] of large models down to a manageable size for specific tasks. Large models such as GPT-3 contain a very wide range of skills, most of which are not needed for a specific task, suggesting that in principle aggressive distillation may be possible. Distillation is well-explored in general [LHCG19a] but has not been tried at the scale of hundreds of billions of parameters; new challenges and opportunities may be associated with applying it to models of this size.
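As a rough illustration of what distillation optimizes, the sketch below computes a soft-target loss between a teacher and a student on one example: both sets of logits are softened with a temperature and compared with a KL divergence. This is a simplified single-example form in the spirit of [HVD15]. The function names, the default temperature of 2.0, and the toy logits are assumptions made for illustration, not details from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5]  # hypothetical logits from a large model
    student = [2.0, 1.5, 0.5]  # hypothetical logits from a small model
    print(distillation_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's softened distribution and positive otherwise. A higher temperature exposes more of the teacher's relative preferences among lower-ranked classes.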

Finally, GPT-3 shares some limitations common to most deep learning systems – its decisions are not easily interpretable, it is not necessarily well-calibrated in its predictions on novel inputs, as evidenced by its much higher variance in performance than humans on standard benchmarks, and it retains the biases of the data it has been trained on. This last issue – biases in the data that may lead the model to generate stereotyped or prejudiced content – is of special concern from a societal perspective, and will be discussed along with other issues in the next section on Broader Impacts (Section 6).


