3.9.1 Arithmetic
To test GPT-3’s ability to perform simple arithmetic operations without task-specific training, we developed a small battery of 10 tests that involve asking GPT-3 a simple arithmetic problem in natural language (a generation sketch follows the list):
2 digit addition (2D+) – The model is asked to add two integers sampled uniformly from [0, 100), phrased in the form of a question, e.g. “Q: What is 48 plus 76? A: 124.”
2 digit subtraction (2D-) – The model is asked to subtract two integers sampled uniformly from [0, 100); the answer may be negative. Example: “Q: What is 34 minus 53? A: -19”.
3 digit addition (3D+) – Same as 2 digit addition, except numbers are uniformly sampled from [0, 1000).
3 digit subtraction (3D-) – Same as 2 digit subtraction, except numbers are uniformly sampled from [0, 1000).
4 digit addition (4D+) – Same as 3 digit addition, except uniformly sampled from [0, 10000).
4 digit subtraction (4D-) – Same as 3 digit subtraction, except uniformly sampled from [0, 10000).
5 digit addition (5D+) – Same as 3 digit addition, except uniformly sampled from [0, 100000).
5 digit subtraction (5D-) – Same as 3 digit subtraction, except uniformly sampled from [0, 100000).
2 digit multiplication (2Dx) – The model is asked to multiply two integers sampled uniformly from [0, 100), e.g. “Q: What is 24 times 42? A: 1008”.
One-digit composite (1DC) – The model is asked to perform a composite operation on three 1 digit numbers, with parentheses around the last two. For example, “Q: What is 6+(4*8)? A: 38”. The three 1 digit numbers are selected uniformly on [0, 10) and the operations are selected uniformly from {+,-,*}.
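The sketch below (ours, not the paper's generation code) shows how prompts for three of these tasks could be constructed under the sampling ranges described above; the helper names are illustrative.

```python
import random

def addition_problem(digits):
    """nD+ : operands sampled uniformly from [0, 10**digits)."""
    a, b = random.randrange(10 ** digits), random.randrange(10 ** digits)
    return f"Q: What is {a} plus {b}? A:", str(a + b)

def subtraction_problem(digits):
    """nD- : same sampling; the answer may be negative."""
    a, b = random.randrange(10 ** digits), random.randrange(10 ** digits)
    return f"Q: What is {a} minus {b}? A:", str(a - b)

def one_digit_composite():
    """1DC: a op1 (b op2 c), digits from [0, 10), ops from {+, -, *}."""
    a, b, c = (random.randrange(10) for _ in range(3))
    op1, op2 = random.choice("+-*"), random.choice("+-*")
    expression = f"{a}{op1}({b}{op2}{c})"
    return f"Q: What is {expression}? A:", str(eval(expression))  # eval is fine for this sketch

# A 2,000-instance test set for, say, 2 digit addition:
dataset = [addition_problem(2) for _ in range(2000)]
```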
In all 10 tasks the model must generate the correct answer exactly. For each task we generate a dataset of 2,000 random instances of the task and evaluate all models on those instances. First we evaluate GPT-3 in the few-shot setting, for which results are shown in Figure 3.10. On addition and subtraction, GPT-3 displays strong proficiency when the number of digits is small, achieving 100% accuracy on 2 digit addition, 98.9% at 2 digit subtraction, 80.2% at 3 digit addition, and 94.2% at 3-digit subtraction. Performance decreases as the number of digits increases, but GPT-3 still achieves 25-26% accuracy on four digit operations and 9-10% accuracy on five digit operations, suggesting at least some capacity to generalize to larger numbers of digits. GPT-3 also achieves 29.2% accuracy at 2 digit multiplication, an especially computationally intensive operation. Finally, GPT-3 achieves 21.3% accuracy at single digit combined operations (for example, 9*(7+5)), suggesting that it has some robustness beyond just single operations.
As Figure 3.10 makes clear, small models do poorly on all of these tasks – even the 13 billion parameter model (the second largest after the 175 billion full GPT-3) can solve 2 digit addition and subtraction only half the time, and all other operations less than 10% of the time.
One-shot and zero-shot performance are somewhat degraded relative to few-shot performance, suggesting that adaptation to the task (or at the very least recognition of the task) is important to performing these computations correctly. Nevertheless, one-shot performance is still quite strong, and even zero-shot performance of the full GPT-3 significantly outperforms few-shot learning for all smaller models. All three settings for the full GPT-3 are shown in Table 3.9, and model capacity scaling for all three settings is shown in Appendix H.
To spot-check whether the model is simply memorizing specific arithmetic problems, we took the 3-digit arithmetic problems in our test set and searched for them in our training data in both the forms "<NUM1> + <NUM2> =" and "<NUM1> plus <NUM2>". Out of 2,000 addition problems we found only 17 matches (0.8%) and out of 2,000 subtraction problems we found only 2 matches (0.1%), suggesting that only a trivial fraction of the correct answers could have been memorized. In addition, inspection of incorrect answers reveals that the model often makes mistakes such as not carrying a “1”, suggesting it is actually attempting to perform the relevant computation rather than memorizing a table.
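As a rough illustration (our sketch, not the actual search tooling), this spot-check amounts to a substring search over the training text for both phrasings of each test problem:

```python
def count_training_matches(problems, training_text, symbol="+", word="plus"):
    """problems: list of (a, b) operand pairs from the test set.
    training_text: the training corpus flattened to a single string
    (a simplification of how the search would actually be run at scale)."""
    hits = 0
    for a, b in problems:
        if f"{a} {symbol} {b} =" in training_text or f"{a} {word} {b}" in training_text:
            hits += 1
    return hits

# The paper reports 17/2000 matches (0.8%) for addition and 2/2000 (0.1%)
# for subtraction (using "-" and "minus" for the latter).
```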
Overall, GPT-3 displays reasonable proficiency at moderately complex arithmetic in few-shot, one-shot, and even zero-shot settings.
3.9.2 Word Scrambling and Manipulation Tasks
To test GPT-3’s ability to learn novel symbolic manipulations from a few examples, we designed a small battery of 5 “character manipulation” tasks. Each task involves giving the model a word distorted by some combination of scrambling, addition, or deletion of characters, and asking it to recover the original word. The 5 tasks (a construction sketch follows the list) are:
Cycle letters in word (CL) – The model is given a word with its letters cycled, then the “=” symbol, and is expected to generate the original word. For example, it might be given “lyinevitab” and should output “inevitably”.
Anagrams of all but first and last characters (A1) – The model is given a word where every letter except the first and last have been scrambled randomly, and must output the original word. Example: criroptuon = corruption.
Anagrams of all but first and last 2 characters (A2) – The model is given a word where every letter except the first 2 and last 2 have been scrambled randomly, and must recover the original word. Example: opoepnnt → opponent.
Random insertion in word (RI) – A random punctuation or space character is inserted between each letter of a word, and the model must output the original word. Example: s.u!c/c!e.s s i/o/n = succession.
Reversed words (RW) – The model is given a word spelled backwards, and must output the original word. Example: stcejbo → objects.
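A rough sketch of how instances of these five tasks could be constructed (our assumed implementation; the punctuation pool for RI is illustrative):

```python
import random

PUNCTUATION = list(".,;!?/ ")   # assumed pool of insertion characters for RI

def cycle_letters(word, shift=None):                 # CL
    shift = random.randrange(1, len(word)) if shift is None else shift
    return word[shift:] + word[:shift]

def anagram_inner(word, keep=1):                     # A1 (keep=1) / A2 (keep=2)
    head, middle, tail = word[:keep], list(word[keep:-keep]), word[-keep:]
    random.shuffle(middle)
    return head + "".join(middle) + tail

def random_insertion(word):                          # RI
    return "".join(c + random.choice(PUNCTUATION) for c in word[:-1]) + word[-1]

def reversed_word(word):                             # RW
    return word[::-1]

print(cycle_letters("inevitably", shift=8))   # lyinevitab
print(reversed_word("objects"))               # stcejbo
```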
For each task we generate 10,000 examples, using the top 10,000 most frequent words as measured by [Nor09] that are more than 4 and fewer than 15 characters long. The few-shot results are shown in Figure 3.11. Task performance tends to grow smoothly with model size, with the full GPT-3 model achieving 66.9% on removing random insertions, 38.6% on cycling letters, 40.2% on the easier anagram task, and 15.1% on the more difficult anagram task (where only the first and last letters are held fixed). None of the models can reverse the letters in a word.
In the one-shot setting, performance is significantly weaker (dropping by half or more), and in the zero-shot setting the model can rarely perform any of the tasks (Table 3.10). This suggests that the model really does appear to learn these tasks at test time, as the model cannot perform them zero-shot and their artificial nature makes them unlikely to appear in the pre-training data (although we cannot confirm this with certainty).
We can further quantify performance by plotting “in-context learning curves”, which show task performance as a function of the number of in-context examples. We show in-context learning curves for the Symbol Insertion task in Figure 1.2. We can see that larger models are able to make increasingly effective use of in-context information, including both task examples and natural language task descriptions.
Finally, it is worth adding that solving these tasks requires character-level manipulations, whereas our BPE encoding operates on significant fractions of a word (on average ∼ 0.7 words per token), so from the LM’s perspective succeeding at these tasks involves not just manipulating BPE tokens but understanding and pulling apart their substructure. Also, CL, A1, and A2 are not bijective (that is, the unscrambled word is not a deterministic function of the scrambled word), requiring the model to perform some search to find the correct unscrambling. Thus, the skills involved appear to require non-trivial pattern-matching and computation.
3.9.3 SAT Analogies
To test GPT-3 on another task that is somewhat unusual relative to the typical distribution of text, we collected a set of 374 “SAT analogy” problems [TLBS03]. Analogies are a style of multiple choice question that constituted a section of the SAT college entrance exam before 2005. A typical example is “audacious is to boldness as (a) sanctimonious is to hypocrisy, (b) anonymous is to identity, (c) remorseful is to misdeed, (d) deleterious is to result, (e) impressionable is to temptation”. The student is expected to choose which of the five word pairs has the same relationship as the original word pair; in this example the answer is “sanctimonious is to hypocrisy”. On this task GPT-3 achieves 65.2% in the few-shot setting, 59.1% in the one-shot setting, and 53.7% in the zero-shot setting, whereas the average score among college applicants was 57% [TL05] (random guessing yields 20%). As shown in Figure 3.12, the results improve with scale, with the full 175 billion model improving by over 10% compared to the 13 billion parameter model.
3.9.4 News Article Generation
Previous work on generative language models qualitatively tested their ability to generate synthetic “news articles” by conditional sampling from the model given a human-written prompt consisting of a plausible first sentence for a news story [RWC+19]. Relative to [RWC+19], the dataset used to train GPT-3 is much less weighted towards news articles, so trying to generate news articles via raw unconditional samples is less effective – for example GPT-3 often interprets the proposed first sentence of a “news article” as a tweet and then posts synthetic responses or follow-up tweets. To solve this problem we employed GPT-3’s few-shot learning abilities by providing three previous news articles in the model’s context to condition it. With the title and subtitle of a proposed next article, the model is able to reliably generate short articles in the “news” genre.
To gauge the quality of news article generation from GPT-3 (which we believe is likely to be correlated with conditional sample generation quality in general), we decided to measure human ability to distinguish GPT-3-generated articles from real ones. Similar work has been carried out by Kreps et al. [KMB20] and Zellers et al. [ZHR+19]. Generative language models are trained to match the distribution of content generated by humans, so the (in)ability of humans to distinguish the two is a potentially important measure of quality.
In order to see how well humans can detect model generated text, we arbitrarily selected 25 article titles and subtitles from the website newser.com (mean length: 215 words). We then generated completions of these titles and subtitles from four language models ranging in size from 125M to 175B (GPT-3) parameters (mean length: 200 words). For each model, we presented around 80 US-based participants with a quiz consisting of these real titles and subtitles followed by either the human written article or the article generated by the model. Participants were asked to select whether the article was “very likely written by a human”, “more likely written by a human”, “I don’t know”, “more likely written by a machine”, or “very likely written by a machine”.
The articles we selected were not in the models’ training data and the model outputs were formatted and selected programmatically to prevent human cherry-picking. All models used the same context to condition outputs on and were pre-trained with the same context size and the same article titles and subtitles were used as prompts for each model. However, we also ran an experiment to control for participant effort and attention that followed the same format but involved intentionally bad model generated articles. This was done by generating articles from a “control model”: a 160M parameter model with no context and increased output randomness.
Mean human accuracy (the ratio of correct assignments to non-neutral assignments per participant) at detecting that the intentionally bad articles were model generated was ∼ 86%, where 50% is chance level performance. By contrast, mean human accuracy at detecting articles that were produced by the 175B parameter model was barely above chance at ∼ 52% (see Table 3.11). Human abilities to detect model generated text appear to decrease as model size increases: there appears to be a trend towards chance accuracy with model size, and human detection of GPT-3 is close to chance. This is true despite the fact that participants spend more time on each output as model size increases (see Appendix E).
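The accuracy metric can be summarized by the following sketch (the response coding and field names are our illustrative assumptions):

```python
def mean_human_accuracy(responses):
    """responses: participant id -> list of (rating, article_is_model_generated).
    rating is one of the five quiz options quoted above."""
    per_participant = []
    for answers in responses.values():
        correct = non_neutral = 0
        for rating, is_model_generated in answers:
            if rating == "I don't know":
                continue                      # neutral answers are excluded
            non_neutral += 1
            said_machine = "machine" in rating
            correct += int(said_machine == is_model_generated)
        if non_neutral:
            per_participant.append(correct / non_neutral)
    return sum(per_participant) / len(per_participant)
```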
Examples of synthetic articles from GPT-3 are given in Figures 3.14 and 3.15. Much of the text is, as indicated by the evaluations, difficult for humans to distinguish from authentic human content. Factual inaccuracies can be an indicator that an article is model generated since, unlike human authors, the models have no access to the specific facts that the article titles refer to or when the article was written. Other indicators include repetition, non sequiturs, and unusual phrasings, though these are often subtle enough that they are not noticed.
Related work on language model detection by Ippolito et al. [IDCBE19] indicates that automatic discriminators like GROVER [ZHR+19] and GLTR [GSR19] may have greater success at detecting model generated text than human evaluators. Automatic detection of these models may be a promising area of future research.
Ippolito et al. [IDCBE19] also note that human accuracy at detecting model generated text increases as humans observe more tokens. To do a preliminary investigation of how good humans are at detecting longer news articles generated by GPT-3 175B, we selected 12 world news articles from Reuters with an average length of 569 words and generated completions of these articles from GPT-3 with an average length of 498 words (298 words longer than our initial experiments). Following the methodology above, we ran two experiments, each on around 80 US-based participants, to compare human abilities to detect the articles generated by GPT-3 and a control model.
We found that mean human accuracy at detecting the intentionally bad longer articles from the control model was ∼ 88%, while mean human accuracy at detecting the longer articles that were produced by GPT-3 175B was still barely above chance at ∼ 52% (see Table 3.12). This indicates that, for news articles that are around 500 words long, GPT-3 continues to produce articles that humans find difficult to distinguish from human written news articles.
3.9.5 Learning and Using Novel Words
A task studied in developmental linguistics [CB78] is the ability to learn and utilize new words, for example using a word in a sentence after seeing it defined only once, or conversely inferring a word’s meaning from only one usage. Here we qualitatively test GPT-3’s ability to do the former. Specifically, we give GPT-3 the definition of a nonexistent word, such as “Gigamuru”, and then ask it to use it in a sentence. We provide one to five previous examples of a (separate) nonexistent word being defined and used in a sentence, so the task is few-shot in terms of previous examples of the broad task and one-shot in terms of the specific word. Table 3.16 shows the 6 examples we generated; all definitions were human-generated, and the first answer was human-generated as conditioning while the subsequent answers were generated by GPT-3. These examples were generated continuously in one sitting and we did not omit or repeatedly try any prompts. In all cases the generated sentence appears to be a correct or at least plausible use of the word. In the final sentence the model generates a plausible conjugation for the word “screeg” (namely “screeghed”), although the use of the word is slightly awkward (“screeghed at each other”) despite being plausible in the sense that it could describe a toy sword fight. Overall, GPT-3 appears to be at least proficient at the task of using novel words in a sentence.
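The prompt format can be pictured roughly as follows (our sketch; the priming example and the definitions are illustrative, not copied from Table 3.16):

```python
def novel_word_prompt(primes, new_word, new_definition):
    """primes: list of (word, definition, example_sentence) for previously
    defined nonexistent words, used as few-shot conditioning."""
    parts = []
    for word, definition, sentence in primes:
        parts.append(f'A "{word}" is {definition} '
                     f'An example of a sentence that uses the word {word} is: {sentence}')
    parts.append(f'A "{new_word}" is {new_definition} '
                 f'An example of a sentence that uses the word {new_word} is:')
    return "\n\n".join(parts)

prompt = novel_word_prompt(
    [("flubbin", "a small woven basket used for carrying berries.",
      "She filled her flubbin with blueberries before heading home.")],
    "Gigamuru", "a type of Japanese musical instrument.")
```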
3.9.6 Correcting English Grammar
Another task well suited for few-shot learning is correcting English grammar. We test this with GPT-3 in the few-shot setting by giving prompts of the form "Poor English Input: <sentence>\n Good English Output: <sentence>". We give GPT-3 one human-generated correction and then ask it to correct 5 more (again without any omissions or repeats). Results are shown in Figure 3.17.
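A minimal sketch of how such a prompt could be assembled from the quoted format (the example pair and test sentence are invented for illustration):

```python
def grammar_correction_prompt(examples, sentence):
    """examples: list of (poor, good) correction pairs used as conditioning."""
    lines = []
    for poor, good in examples:
        lines.append(f"Poor English Input: {poor}")
        lines.append(f"Good English Output: {good}")
    lines.append(f"Poor English Input: {sentence}")
    lines.append("Good English Output:")
    return "\n".join(lines)

print(grammar_correction_prompt(
    [("I eated the purple berries.", "I ate the purple berries.")],
    "The school was founded at 1921."))
```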
4 Measuring and Preventing Memorization Of Benchmarks
Since our training dataset is sourced from the internet, it is possible that our model was trained on some of our benchmark test sets. Accurately detecting test contamination from internet-scale datasets is a new area of research without established best practices. While it is common practice to train large models without investigating contamination, given the increasing scale of pretraining datasets, we believe this issue is becoming increasingly important to attend to.
This concern is not just hypothetical. One of the first papers to train a language model on Common Crawl data [TL18] detected and removed a training document which overlapped with one of their evaluation datasets. Other work such as GPT-2 [RWC+19] also conducted post-hoc overlap analysis. Their study was relatively encouraging, finding that although models did perform moderately better on data that overlapped between training and testing, this did not significantly impact reported results due to the small fraction of data which was contaminated (often only a few percent).
GPT-3 operates in a somewhat different regime. On the one hand, the dataset and model size are about two orders of magnitude larger than those used for GPT-2, and include a large amount of Common Crawl, creating increased potential for contamination and memorization. On the other hand, precisely due to the large amount of data, even GPT-3 175B does not overfit its training set by a significant amount, measured relative to a held-out validation set with which it was deduplicated (Figure 4.1). Thus, we expect that contamination is likely to be frequent, but that its effects may not be as large as feared.
We initially tried to address the issue of contamination by proactively searching for and attempting to remove any overlap between our training data and the development and test sets of all benchmarks studied in this paper. Unfortunately, a bug resulted in only partial removal of all detected overlaps from the training data. Due to the cost of training, it wasn’t feasible to retrain the model. To address this, we investigate in detail how the remaining detected overlap impacts results.
For each benchmark, we produce a ‘clean’ version which removes all potentially leaked examples, defined roughly as examples that have a 13-gram overlap with anything in the pretraining set (or that overlap with the whole example when it is shorter than 13-grams). The goal is to very conservatively flag anything that could potentially be contamination, so as to produce a clean subset that is free of contamination with high confidence. The exact procedure is detailed in Appendix C.
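A simplified sketch of this flagging rule follows (whitespace tokenization and the flattened training string are stand-ins for the exact, more conservative procedure in Appendix C):

```python
N = 13  # n-gram size used for flagging

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_ngram_index(pretraining_docs):
    """All 13-grams seen anywhere in the pretraining data."""
    index = set()
    for doc in pretraining_docs:
        index |= ngrams(doc.lower().split(), N)   # stand-in for the real normalization
    return index

def is_potentially_contaminated(example, index, pretraining_text):
    toks = example.lower().split()
    if len(toks) < N:
        # whole-example check for examples shorter than 13 tokens
        return " ".join(toks) in pretraining_text
    return any(g in index for g in ngrams(toks, N))

def clean_version(benchmark_examples, index, pretraining_text):
    return [ex for ex in benchmark_examples
            if not is_potentially_contaminated(ex, index, pretraining_text)]
```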
We then evaluate GPT-3 on these clean benchmarks, and compare to the original score. If the score on the clean subset is similar to the score on the entire dataset, this suggests that contamination, even if present, does not have a significant effect on reported results. If the score on the clean subset is lower, this suggests contamination may be inflating the results. The results are summarized in Figure 4.2. Although potential contamination is often high (with a quarter of benchmarks scoring over 50%), in most cases performance changes only negligibly, and we see no evidence that contamination level and performance difference are correlated. We conclude that either our conservative method substantially overestimated contamination or that contamination has little effect on performance.
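The per-benchmark comparison amounts to the following small sketch (names are illustrative):

```python
def contamination_summary(score_full, score_clean, n_total, n_flagged):
    """Fraction of examples flagged and the relative change in score when
    evaluating only on the clean subset."""
    return {
        "percent_potentially_contaminated": 100.0 * n_flagged / n_total,
        "relative_score_change_percent": 100.0 * (score_clean - score_full) / score_full,
    }

# e.g. for PIQA, discussed below: ~29% of examples flagged,
# roughly a 4% relative decrease on the clean subset.
```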
Below, we review in more detail the few specific cases where either (1) the model performs significantly worse on the cleaned version, or (2) potential contamination is very high, which makes measuring the performance difference difficult.
Our analysis flagged six groups of benchmarks for further investigation: Word Scrambling, Reading Comprehension (QuAC, SQuAD2, DROP), PIQA, Winograd, language modeling tasks (Wikitext tasks, 1BW), and German to English translation. Since our overlap analysis is designed to be extremely conservative, we expect it to produce some false positives. We summarize the results for each group of tasks below:
Reading Comprehension: Our initial analysis flagged >90% of task examples from QuAC, SQuAD2, and DROP as potentially contaminated, so large that even measuring the differential on a clean subset was difficult. Upon manual inspection, however, we found that for every overlap we inspected, in all 3 datasets, the source text was present in our training data but the question/answer pairs were not, meaning the model gains only background information and cannot memorize the answer to a specific question.
German translation: We found 25% of the examples in the WMT16 German-English test set were marked as potentially contaminated, with an associated total effect size of 1-2 BLEU. Upon inspection, none of the flagged examples contain paired sentences resembling NMT training data and collisions were monolingual matches mostly of snippets of events discussed in the news.
Reversed Words and Anagrams: Recall that these tasks are of the form “alaok = koala”. Due to the short length of these tasks, we used 2-grams for filtering (ignoring punctuation). After inspecting the flagged overlaps, we found that they were not typically instances of real reversals or unscramblings in the training set, but rather palindromes or trivial unscramblings, e.g. “kayak = kayak”. The amount of overlap was small, but removing the trivial tasks leads to an increase in difficulty and thus a spurious signal. Related to this, the symbol insertion task shows high overlap but no effect on performance – this is because that task involves removing non-letter characters from a word, and the overlap analysis itself ignores such characters, leading to many spurious matches.
PIQA: The overlap analysis flagged 29% of examples as contaminated, and observed a 3 percentage point absolute decrease (4% relative decrease) in performance on the clean subset. Though the test dataset was released after our training set was created and its labels are hidden, some of the web pages used by the crowdsourced dataset creators are contained in our training set. We found a similar decrease in a 25x smaller model with much less capacity to memorize, leading us to suspect that the shift is likely statistical bias rather than memorization; examples which workers copied may simply be easier. Unfortunately, we cannot rigorously prove this hypothesis. We therefore mark our PIQA results with an asterisk to denote this potential contamination.
Winograd: The overlap analysis flagged 45% of examples, and found a 2.6% decrease in performance on the clean subset. Manual inspection of the overlapping data points showed that 132 Winograd schemas were in fact present in our training set, though presented in a different format than we present the task to the model. Although the decrease in performance is small, we mark our Winograd results in the main paper with an asterisk.
Language modeling: We found the 4 Wikipedia language modeling benchmarks measured in GPT-2, plus the Children’s Book Test dataset, to be almost entirely contained in our training data. Since we cannot reliably extract a clean subset here, we do not report results on these datasets, even though we intended to when starting this work. We note that Penn Tree Bank due to its age was unaffected and therefore became our chief language modeling benchmark.
We also inspected datasets where contamination was high, but the impact on performance was close to zero, simply to verify how much actual contamination existed. These appeared to often contain false positives. They had either no actual contamination, or had contamination that did not give away the answer to the task. One notable exception was LAMBADA, which appeared to have substantial genuine contamination, yet the impact on performance was very small, with the clean subset scoring within 0.5% of the full dataset. Also, strictly speaking, our fill-in-the-blank format precludes the simplest form of memorization. Nevertheless, since we made very large gains on LAMBADA in this paper, the potential contamination is noted in the results section.
An important limitation of our contamination analysis is that we cannot be sure that the clean subset is drawn from the same distribution as the original dataset. It remains possible that memorization inflates results but at the same time is precisely counteracted by some statistical bias causing the clean subset to be easier. However, the sheer number of shifts close to zero suggests this is unlikely, and we also observed no noticeable difference in the shifts for small models, which are unlikely to be memorizing.
Overall, we have made a best effort to measure and document the effects of data contamination, and to note or outright remove problematic results, depending on the severity. Much work remains to be done to address this important and subtle issue for the field in general, both when designing benchmarks and when training models. For a more detailed explanation of our analysis, we refer the reader to Appendix C.
5 Limitations
GPT-3 and our analysis of it have a number of limitations. Below we describe some of these and suggest directions for future work.
First, despite the strong quantitative and qualitative improvements of GPT-3, particularly compared to its direct predecessor GPT-2, it still has notable weaknesses in text synthesis and several NLP tasks. On text synthesis, although the overall quality is high, GPT-3 samples still sometimes repeat themselves semantically at the document level, start to lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences or paragraphs. We will release a collection of 500 uncurated unconditional samples to help provide a better sense of GPT-3’s limitations and strengths at text synthesis. Within the domain of discrete language tasks, we have noticed informally that GPT-3 seems to have special difficulty with “common sense physics”, despite doing well on some datasets (such as PIQA [BZB+19]) that test this domain. Specifically GPT-3 has difficulty with questions of the type “If I put cheese into the fridge, will it melt?”. Quantitatively, GPT-3’s in-context learning performance has some notable gaps on our suite of benchmarks, as described in Section 3, and in particular it does little better than chance when evaluated one-shot or even few-shot on some “comparison” tasks, such as determining if two words are used the same way in a sentence, or if one sentence implies another (WIC and ANLI respectively), as well as on a subset of reading comprehension tasks. This is especially striking given GPT-3’s strong few-shot performance on many other tasks.
GPT-3 has several structural and algorithmic limitations, which could account for some of the issues above. We focused on exploring in-context learning behavior in autoregressive language models because it is straightforward to both sample and compute likelihoods with this model class. As a result our experiments do not include any bidirectional architectures or other training objectives such as denoising. This is a noticeable difference from much of the recent literature, which has documented improved fine-tuning performance when using these approaches over standard language models [RSR+19]. Thus our design decision comes at the cost of potentially worse performance on tasks which empirically benefit from bidirectionality. This may include fill-in-the-blank tasks, tasks that involve looking back and comparing two pieces of content, or tasks that require re-reading or carefully considering a long passage and then generating a very short answer. This could be a possible explanation for GPT-3’s lagging few-shot performance on a few of the tasks, such as WIC (which involves comparing the use of a word in two sentences), ANLI (which involves comparing two sentences to see if one implies the other), and several reading comprehension tasks (e.g. QuAC and RACE). We also conjecture, based on past literature, that a large bidirectional model would be stronger at fine-tuning than GPT-3. Making a bidirectional model at the scale of GPT-3, and/or trying to make bidirectional models work with few- or zero-shot learning, is a promising direction for future research, and could help achieve the “best of both worlds”.
A more fundamental limitation of the general approach described in this paper – scaling up any LM-like model, whether autoregressive or bidirectional – is that it may eventually run into (or could already be running into) the limits of the pretraining objective. Our current objective weights every token equally and lacks a notion of what is most important to predict and what is less important. [RRS20] demonstrate benefits of customizing prediction to entities of interest. Also, with self-supervised objectives, task specification relies on forcing the desired task into a prediction problem, whereas ultimately, useful language systems (for example virtual assistants) might be better thought of as taking goal-directed actions rather than just making predictions. Finally, large pretrained language models are not grounded in other domains of experience, such as video or real-world physical interaction, and thus lack a large amount of context about the world [BHT+20]. For all these reasons, scaling pure self-supervised prediction is likely to hit limits, and augmentation with a different approach is likely to be necessary. Promising future directions in this vein might include learning the objective function from humans [ZSW+19a], fine-tuning with reinforcement learning, or adding additional modalities such as images to provide grounding and a better model of the world [CLY+19].
Another limitation broadly shared by language models is poor sample efficiency during pre-training. While GPT-3 takes a step towards test-time sample efficiency closer to that of humans (one-shot or zero-shot), it still sees much more text during pre-training than a human sees in their lifetime [Lin20]. Improving pre-training sample efficiency is an important direction for future work, and might come from grounding in the physical world to provide additional information, or from algorithmic improvements.
A limitation, or at least uncertainty, associated with few-shot learning in GPT-3 is ambiguity about whether few-shot learning actually learns new tasks “from scratch” at inference time, or if it simply recognizes and identifies tasks that it has learned during training. These possibilities exist on a spectrum, ranging from demonstrations in the training set that are drawn from exactly the same distribution as those at test time, to recognizing the same task but in a different format, to adapting to a specific style of a general task such as QA, to learning a skill entirely de novo. Where GPT-3 is on this spectrum may also vary from task to task. Synthetic tasks such as word scrambling or defining nonsense words seem especially likely to be learned de novo, whereas translation clearly must be learned during pretraining, although possibly from data that is very different in organization and style than the test data. Ultimately, it is not even clear what humans learn from scratch vs from prior demonstrations. Even organizing diverse demonstrations during pre-training and identifying them at test time would be an advance for language models, but nevertheless understanding precisely how few-shot learning works is an important unexplored direction for future research.
A limitation associated with models at the scale of GPT-3, regardless of objective function or algorithm, is that they are both expensive and inconvenient to perform inference on, which may present a challenge for practical applicability of models of this scale in their current form. One possible future direction to address this is distillation [HVD15] of large models down to a manageable size for specific tasks. Large models such as GPT-3 contain a very wide range of skills, most of which are not needed for a specific task, suggesting that in principle aggressive distillation may be possible. Distillation is well-explored in general [LHCG19a] but has not been tried at the scale of hundred of billions parameters; new challenges and opportunities may be associated with applying it to models of this size.
Finally, GPT-3 shares some limitations common to most deep learning systems – its decisions are not easily interpretable, it is not necessarily well-calibrated in its predictions on novel inputs as observed by the much higher variance in performance than humans on standard benchmarks, and it retains the biases of the data it has been trained on. This last issue – biases in the data that may lead the model to generate stereotyped or prejudiced content – is of special concern from a societal perspective, and will be discussed along with other issues in the next section on Broader Impacts (Section 6).