# Paper: GPT-3 "Language Models are Few-Shot Learners", Translation and Commentary (Part 3)

3.9.1 Arithmetic

To test GPT-3’s ability to perform simple arithmetic operations without task-specific training, we developed a small  battery of 10 tests that involve asking GPT-3 a simple arithmetic problem in natural language:

2 digit addition (2D+) – The model is asked to add two integers sampled uniformly from [0, 100), phrased in  the form of a question, e.g. “Q: What is 48 plus 76? A: 124.”

2 digit subtraction (2D-) – The model is asked to subtract two integers sampled uniformly from [0, 100); the  answer may be negative. Example: “Q: What is 34 minus 53? A: -19”.

3 digit addition (3D+) – Same as 2 digit addition, except numbers are uniformly sampled from [0, 1000).

3 digit subtraction (3D-) – Same as 2 digit subtraction, except numbers are uniformly sampled from [0, 1000).

4 digit addition (4D+) – Same as 3 digit addition, except uniformly sampled from [0, 10000).

4 digit subtraction (4D-) – Same as 3 digit subtraction, except uniformly sampled from [0, 10000).

5 digit addition (5D+) – Same as 3 digit addition, except uniformly sampled from [0, 100000).

5 digit subtraction (5D-) – Same as 3 digit subtraction, except uniformly sampled from [0, 100000).

2 digit multiplication (2Dx) – The model is asked to multiply two integers sampled uniformly from [0, 100), e.g. “Q: What is 24 times 42? A: 1008”.

One-digit composite (1DC) – The model is asked to perform a composite operation on three 1 digit numbers, with parentheses around the last two. For example, “Q: What is 6+(4*8)? A: 38”. The three 1 digit numbers are selected uniformly on [0, 10) and the operations are selected uniformly from {+,-,*}.
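The ten task definitions above can be turned into a small problem generator. The following is a minimal sketch, not the authors' code: the task codes are taken from the list, while the function name `make_problem`, the lookup tables, and the exact prompt wording for tasks without a quoted example are assumptions made for illustration.

```python
import operator
import random

# Task codes from the list above; everything else here is a hypothetical helper.
TASKS = ["2D+", "2D-", "3D+", "3D-", "4D+", "4D-", "5D+", "5D-", "2Dx", "1DC"]
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
WORDS = {"+": "plus", "-": "minus", "x": "times"}


def make_problem(task, rng):
    """Return one (prompt, answer) pair for a task code such as '3D-' or '1DC'."""
    if task == "1DC":
        # Three one-digit numbers, parentheses around the last two,
        # operations drawn uniformly from {+, -, *}.
        a, b, c = (rng.randrange(10) for _ in range(3))
        op1, op2 = rng.choice("+-*"), rng.choice("+-*")
        answer = OPS[op1](a, OPS[op2](b, c))
        return f"Q: What is {a}{op1}({b}{op2}{c})? A:", str(answer)
    digits, symbol = int(task[0]), task[2]  # e.g. '3D-' -> 3, '-'
    # Uniform sampling from [0, 10**digits), so shorter numbers can occur.
    a, b = rng.randrange(10 ** digits), rng.randrange(10 ** digits)
    answer = OPS["*" if symbol == "x" else symbol](a, b)
    return f"Q: What is {a} {WORDS[symbol]} {b}? A:", str(answer)


if __name__ == "__main__":
    rng = random.Random(0)
    for task in TASKS:
        print(task, *make_problem(task, rng))
```

Passing an explicit `random.Random` instance keeps a generated dataset reproducible, which matters if several models are to be scored on the same instances.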

In all 10 tasks the model must generate the correct answer exactly. For each task we generate a dataset of 2,000 random instances of the task and evaluate all models on those instances.

First we evaluate GPT-3 in the few-shot setting, for which results are shown in Figure 3.10. On addition and subtraction, GPT-3 displays strong proficiency when the number of digits is small, achieving 100% accuracy on 2 digit addition, 98.9% at 2 digit subtraction, 80.2% at 3 digit addition, and 94.2% at 3-digit subtraction. Performance decreases as the number of digits increases, but GPT-3 still achieves 25-26% accuracy on four digit operations and 9-10% accuracy on five digit operations, suggesting at least some capacity to generalize to larger numbers of digits. GPT-3 also achieves 29.2% accuracy at 2 digit multiplication, an especially computationally intensive operation. Finally, GPT-3 achieves 21.3% accuracy at single digit combined operations (for example, 9*(7+5)), suggesting that it has some robustness beyond just single operations.
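The exact-match criterion described above can be sketched as a few lines of scoring code. This is an illustrative assumption about how such scoring might look: the function name and the choice to trim surrounding whitespace before comparing are not taken from the paper.

```python
def exact_match_accuracy(predictions, targets):
    """Fraction of predictions that equal the target string exactly.

    Surrounding whitespace is trimmed before comparing; that normalization
    is an assumption of this sketch.
    """
    if not targets or len(predictions) != len(targets):
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return correct / len(targets)


if __name__ == "__main__":
    # Toy data: the third prediction is off by one, so it counts as wrong.
    print(exact_match_accuracy(["124", "-19", "1009"], ["124", "-19", "1008"]))
```

Under this metric a near miss such as 1009 for 1008 earns no partial credit.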

As Figure 3.10 makes clear, small models do poorly on all of these tasks – even the 13 billion parameter model (the  second largest after the 175 billion full GPT-3) can solve 2 digit addition and subtraction only half the time, and all  other operations less than 10% of the time.

One-shot and zero-shot performance are somewhat degraded relative to few-shot performance, suggesting that adaptation  to the task (or at the very least recognition of the task) is important to performing these computations correctly.  Nevertheless, one-shot performance is still quite strong, and even zero-shot performance of the full GPT-3 significantly outperforms few-shot learning for all smaller models. All three settings for the full GPT-3 are shown in Table 3.9, and  model capacity scaling for all three settings is shown in Appendix H.

To spot-check whether the model is simply memorizing specific arithmetic problems, we took the 3-digit arithmetic  problems in our test set and searched for them in our training data in both the forms "<NUM1> + <NUM2> =" and  "<NUM1> plus <NUM2>". Out of 2,000 addition problems we found only 17 matches (0.8%) and out of 2,000  subtraction problems we found only 2 matches (0.1%), suggesting that only a trivial fraction of the correct answers  could have been memorized. In addition, inspection of incorrect answers reveals that the model often makes mistakes  such as not carrying a “1”, suggesting it is actually attempting to perform the relevant computation rather than  memorizing a table.
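A spot check of this kind could be approximated with a pattern search over the training text. The sketch below only assumes the two surface forms quoted above; the function names, the digit-boundary guards, and the in-memory string standing in for the corpus are all hypothetical simplifications.

```python
import re


def appears_in_corpus(a, b, corpus, symbol="+", word="plus"):
    """True if '<a> <symbol> <b> =' or '<a> <word> <b>' occurs in corpus."""
    sym, wrd = re.escape(symbol), re.escape(word)
    # The digit guards stop 23 from matching inside 123 (an added precaution).
    pattern = rf"(?<!\d){a} (?:{sym} {b} =|{wrd} {b}(?!\d))"
    return re.search(pattern, corpus) is not None


def memorized_fraction(problems, corpus, symbol="+", word="plus"):
    """Share of (a, b) problems whose text appears verbatim in corpus."""
    hits = sum(appears_in_corpus(a, b, corpus, symbol, word) for a, b in problems)
    return hits / len(problems)


if __name__ == "__main__":
    toy_corpus = "We know 123 + 456 = 579 and that 200 plus 300 is 500."
    print(memorized_fraction([(123, 456), (200, 300), (7, 8)], toy_corpus))
```

For subtraction the same functions can be called with `symbol="-"` and `word="minus"`.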

Overall, GPT-3 displays reasonable proficiency at moderately complex arithmetic in few-shot, one-shot, and even  zero-shot settings.

3.9.2 Word Scrambling and Manipulation Tasks

To test GPT-3’s ability to learn novel symbolic manipulations from a few examples, we designed a small battery of  5 “character manipulation” tasks. Each task involves giving the model a word distorted by some combination of  scrambling, addition, or deletion of characters, and asking it to recover the original word. The 5 tasks are:

Cycle letters in word (CL) – The model is given a word with its letters cycled, then the “=” symbol, and is expected to generate the original word. For example, it might be given “lyinevitab” and should output “inevitably”.

Anagrams of all but first and last characters (A1) – The model is given a word where every letter except the first and last have been scrambled randomly, and must output the original word. Example: criroptuon = corruption.

Anagrams of all but first and last 2 characters (A2) – The model is given a word where every letter except the first 2 and last 2 have been scrambled randomly, and must recover the original word. Example: opoepnnt → opponent.

Random insertion in word (RI) – A random punctuation or space character is inserted between each letter of a word, and the model must output the original word. Example: s.u!c/c!e.s s i/o/n = succession.

Reversed words (RW) – The model is given a word spelled backwards, and must output the original word. Example: stcejbo → objects.
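The five transformations above are simple enough to sketch directly. The code below is an illustrative reading of the task descriptions rather than the authors' generator; the function names and the pool of insertion characters (`string.punctuation` plus a space) are assumptions.

```python
import random
import string


def cycle_letters(word, rng):
    """CL: rotate the word by a random non-zero offset."""
    k = rng.randrange(1, len(word))
    return word[k:] + word[:k]


def anagram_keep_ends(word, rng, keep=1):
    """A1 (keep=1) or A2 (keep=2): shuffle all but the first/last `keep` letters.

    A shuffle can reproduce the original order by chance; this sketch does not retry.
    """
    middle = list(word[keep:len(word) - keep])
    rng.shuffle(middle)
    return word[:keep] + "".join(middle) + word[len(word) - keep:]


def random_insertion(word, rng):
    """RI: put one random punctuation or space character between each pair of letters."""
    fillers = string.punctuation + " "
    pieces = [word[0]]
    for ch in word[1:]:
        pieces.append(rng.choice(fillers))
        pieces.append(ch)
    return "".join(pieces)


def reverse_word(word):
    """RW: spell the word backwards."""
    return word[::-1]


if __name__ == "__main__":
    rng = random.Random(0)
    print(cycle_letters("inevitably", rng), anagram_keep_ends("corruption", rng))
    print(random_insertion("succession", rng), reverse_word("objects"))
```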

For each task we generate 10,000 examples, which we chose to be the top 10,000 most frequent words as measured by  [Nor09] of length more than 4 characters and less than 15 characters. The few-shot results are shown in Figure 3.11.  Task performance tends to grow smoothly with model size, with the full GPT-3 model achieving 66.9% on removing random insertions, 38.6% on cycling letters, 40.2% on the easier anagram task, and 15.1% on the more difficult anagram  task (where only the first and last letters are held fixed). None of the models can reverse the letters in a word.

In the one-shot setting, performance is significantly weaker (dropping by half or more), and in the zero-shot setting the model can rarely perform any of the tasks (Table 3.10). This suggests that the model really does appear to learn these tasks at test time, as the model cannot perform them zero-shot and their artificial nature makes them unlikely to appear in the pre-training data (although we cannot confirm this with certainty).

We can further quantify performance by plotting “in-context learning curves”, which show task performance as a  function of the number of in-context examples. We show in-context learning curves for the Symbol Insertion task  in Figure 1.2. We can see that larger models are able to make increasingly effective use of in-context information,  including both task examples and natural language task descriptions.

Finally, it is worth adding that solving these tasks requires character-level manipulations, whereas our BPE encoding  operates on significant fractions of a word (on average ∼ 0.7 words per token), so from the LM’s perspective succeeding  at these tasks involves not just manipulating BPE tokens but understanding and pulling apart their substructure. Also,  CL, A1, and A2 are not bijective (that is, the unscrambled word is not a deterministic function of the scrambled word),  requiring the model to perform some search to find the correct unscrambling. Thus, the skills involved appear to require  non-trivial pattern-matching and computation.
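The non-bijectivity point can be made concrete for the A1 task: two words with the same first letter, last letter, and multiset of middle letters can scramble to the same string, so the scrambled form alone does not determine the answer. The sketch below groups a vocabulary by that signature. The helper names are hypothetical, and the example words were picked for illustration without checking whether they occur in the word list used for the experiments.

```python
from collections import defaultdict


def a1_signature(word):
    """First letter, sorted middle letters, last letter.

    Any A1 scramble of a word keeps this signature unchanged.
    """
    return (word[0], "".join(sorted(word[1:-1])), word[-1])


def ambiguous_groups(vocabulary):
    """Groups of distinct words that share an A1 signature."""
    groups = defaultdict(list)
    for word in vocabulary:
        groups[a1_signature(word)].append(word)
    return [sorted(words) for words in groups.values() if len(words) > 1]


if __name__ == "__main__":
    # 'trail' and 'trial' differ only in the order of their middle letters.
    print(ambiguous_groups(["trail", "trial", "koala", "objects"]))
```

When a group has more than one member, a model has to fall back on something like word frequency or context to choose among the candidates.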

3.9.3 SAT Analogies

To test GPT-3 on another task that is somewhat unusual relative to the typical distribution of text, we collected a set of 374 “SAT analogy” problems [TLBS03]. Analogies are a style of multiple choice question that constituted a section of the SAT college entrance exam before 2005. A typical example is “audacious is to boldness as (a) sanctimonious is to hypocrisy, (b) anonymous is to identity, (c) remorseful is to misdeed, (d) deleterious is to result, (e) impressionable is to temptation”. The student is expected to choose which of the five word pairs has the same relationship as the original word pair; in this example the answer is “sanctimonious is to hypocrisy”. On this task GPT-3 achieves 65.2% in the few-shot setting, 59.1% in the one-shot setting, and 53.7% in the zero-shot setting, whereas the average score among college applicants was 57% [TL05] (random guessing yields 20%). As shown in Figure 3.12, the results improve with scale, with the full 175 billion model improving by over 10% compared to the 13 billion parameter model.

3.9.4 News Article Generation

Previous work on generative language models qualitatively tested their ability to generate synthetic “news articles” by  conditional sampling from the model given a human-written prompt consisting of a plausible first sentence for a news  story [RWC+19]. Relative to [RWC+19], the dataset used to train GPT-3 is much less weighted towards news articles,  so trying to generate news articles via raw unconditional samples is less effective – for example GPT-3 often interprets  the proposed first sentence of a “news article” as a tweet and then posts synthetic responses or follow-up tweets. To  solve this problem we employed GPT-3’s few-shot learning abilities by providing three previous news articles in the  model’s context to condition it. With the title and subtitle of a proposed next article, the model is able to reliably  generate short articles in the “news” genre.

To gauge the quality of news article generation from GPT-3 (which we believe is likely to be correlated with conditional sample generation quality in general), we decided to measure human ability to distinguish GPT-3-generated articles from real ones. Similar work has been carried out by Kreps et al. [KMB20] and Zellers et al. [ZHR+19]. Generative language models are trained to match the distribution of content generated by humans, so the (in)ability of humans to distinguish the two is a potentially important measure of quality.

In order to see how well humans can detect model generated text, we arbitrarily selected 25 article titles and subtitles from the website newser.com (mean length: 215 words). We then generated completions of these titles and subtitles from four language models ranging in size from 125M to 175B (GPT-3) parameters (mean length: 200 words). For each model, we presented around 80 US-based participants with a quiz consisting of these real titles and subtitles followed by either the human written article or the article generated by the model. Participants were asked to select whether the article was “very likely written by a human”, “more likely written by a human”, “I don’t know”, “more likely written by a machine”, or “very likely written by a machine”.

The articles we selected were not in the models’ training data and the model outputs were formatted and selected programmatically to prevent human cherry-picking. All models used the same context to condition outputs and were pre-trained with the same context size, and the same article titles and subtitles were used as prompts for each model. However, we also ran an experiment to control for participant effort and attention that followed the same format but involved intentionally bad model generated articles. This was done by generating articles from a “control model”: a 160M parameter model with no context and increased output randomness.

Mean human accuracy (the ratio of correct assignments to non-neutral assignments per participant) at detecting that the intentionally bad articles were model generated was ∼ 86% where 50% is chance level performance. By contrast, mean human accuracy at detecting articles that were produced by the 175B parameter model was barely above chance at ∼ 52% (see Table 3.11). Human abilities to detect model generated text appear to decrease as model size increases: there appears to be a trend towards chance accuracy with model size, and human detection of GPT-3 is close to chance. This is true despite the fact that participants spend more time on each output as model size increases (see Appendix E).
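The accuracy definition above (correct assignments divided by non-neutral assignments, computed per participant and then averaged) can be sketched as follows. The shortened label strings, the data layout, and the choice to skip participants who only answered “I don’t know” are assumptions of this sketch, not details given in the text.

```python
# Shortened stand-ins for the five quiz options (hypothetical encoding).
MACHINE = {"more likely machine", "very likely machine"}
HUMAN = {"more likely human", "very likely human"}
NEUTRAL = "I don't know"


def participant_accuracy(responses):
    """responses: list of (label, is_model_generated) pairs for one participant.

    Returns correct / non-neutral, or None if every answer was neutral.
    """
    correct = total = 0
    for label, is_model_generated in responses:
        if label == NEUTRAL:
            continue  # neutral answers are excluded from the denominator
        total += 1
        correct += (label in MACHINE) == is_model_generated
    return correct / total if total else None


def mean_human_accuracy(all_responses):
    """Average the per-participant ratios, skipping all-neutral participants."""
    scores = [s for s in map(participant_accuracy, all_responses) if s is not None]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    p1 = [("very likely machine", True), ("more likely human", True),
          ("I don't know", True), ("more likely human", False)]
    p2 = [("more likely machine", True)]
    print(mean_human_accuracy([p1, p2]))
```

Averaging per-participant ratios, rather than pooling all answers, gives each participant equal weight regardless of how many neutral answers they gave.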

Examples of synthetic articles from GPT-3 are given in Figures 3.14 and 3.15. Much of the text is, as indicated by the evaluations, difficult for humans to distinguish from authentic human content. Factual inaccuracies can be an indicator that an article is model generated since, unlike human authors, the models have no access to the specific facts that the article titles refer to or when the article was written. Other indicators include repetition, non sequiturs, and unusual phrasings, though these are often subtle enough that they are not noticed.

Related work on language model detection by Ippolito et al. [IDCBE19] indicates that automatic discriminators like GROVER [ZHR+19] and GLTR [GSR19] may have greater success at detecting model generated text than human evaluators. Automatic detection of these models may be a promising area of future research.

Ippolito et al. [IDCBE19] also note that human accuracy at detecting model generated text increases as humans observe  more tokens. To do a preliminary investigation of how good humans are at detecting longer news articles generated  by GPT-3 175B, we selected 12 world news articles from Reuters with an average length of 569 words and generated  completions of these articles from GPT-3 with an average length of 498 words (298 words longer than our initial  experiments). Following the methodology above, we ran two experiments, each on around 80 US-based participants, to  compare human abilities to detect the articles generated by GPT-3 and a control model.

We found that mean human accuracy at detecting the intentionally bad longer articles from the control model was  ∼ 88%, while mean human accuracy at detecting the longer articles that were produced by GPT-3 175B was still barely  above chance at ∼ 52% (see Table 3.12). This indicates that, for news articles that are around 500 words long, GPT-3  continues to produce articles that humans find difficult to distinguish from human written news articles.

3.9.5 Learning and Using Novel Words

A task studied in developmental linguistics [CB78] is the ability to learn and utilize new words, for example using a  word in a sentence after seeing it defined only once, or conversely inferring a word’s meaning from only one usage. Here  we qualitatively test GPT-3’s ability to do the former. Specifically, we give GPT-3 the definition of a nonexistent word,  such as “Gigamuru”, and then ask it to use it in a sentence. We provide one to five previous examples of a (separate) nonexistent word being defined and used in a sentence, so the task is few-shot in terms of previous examples of the  broad task and one-shot in terms of the specific word. Table 3.16 shows the 6 examples we generated; all definitions  were human-generated, and the first answer was human-generated as conditioning while the subsequent answers were  generated by GPT-3. These examples were generated continuously in one sitting and we did not omit or repeatedly try  any prompts. In all cases the generated sentence appears to be a correct or at least plausible use of the word. In the final  sentence the model generates a plausible conjugation for the word “screeg” (namely “screeghed”), although the use of  the word is slightly awkward (“screeghed at each other”) despite being plausible in the sense that it could describe a toy  sword fight. Overall, GPT-3 appears to be at least proficient at the task of using novel words in a sentence.

3.9.6 Correcting English Grammar

Another task well suited for few-shot learning is correcting English grammar. We test this with GPT-3 in the few-shot setting by giving prompts of the form "Poor English Input: <sentence>\n Good English Output: <sentence>". We give GPT-3 one human-generated correction and then ask it to correct 5 more (again without any omissions or repeats). Results are shown in Figure 3.17.
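A prompt in the format quoted above could be assembled as in the sketch below. The two field labels come from the text; the function name, the blank line between examples, and the sample sentences are invented for illustration and are not the prompts used in the experiment.

```python
def build_grammar_prompt(examples, query):
    """Build a few-shot prompt from (poor, good) pairs plus one open query.

    Separating examples with a blank line is an assumption of this sketch.
    """
    blocks = [
        f"Poor English Input: {poor}\nGood English Output: {good}"
        for poor, good in examples
    ]
    # The final block leaves the output open for the model to complete.
    blocks.append(f"Poor English Input: {query}\nGood English Output:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    demo = [("He go to school yesterday.", "He went to school yesterday.")]
    print(build_grammar_prompt(demo, "They was late for the meeting."))
```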

## 4 Measuring and Preventing Memorization Of Benchmarks

Since our training dataset is sourced from the internet, it is possible that our model was trained on some of our  benchmark test sets. Accurately detecting test contamination from internet-scale datasets is a new area of research  without established best practices. While it is common practice to train large models without investigating contamination,  given the increasing scale of pretraining datasets, we believe this issue is becoming increasingly important to attend to.

This concern is not just hypothetical. One of the first papers to train a language model on Common Crawl data [TL18]  detected and removed a training document which overlapped with one of their evaluation datasets. Other work such  as GPT-2 [RWC+19] also conducted post-hoc overlap analysis. Their study was relatively encouraging, finding that although models did perform moderately better on data that overlapped between training and testing, this did not  significantly impact reported results due to the small fraction of data which was contaminated (often only a few percent).

GPT-3 operates in a somewhat different regime. On the one hand, the dataset and model size are about two orders of  magnitude larger than those used for GPT-2, and include a large amount of Common Crawl, creating increased potential  for contamination and memorization. On the other hand, precisely due to the large amount of data, even GPT-3 175B  does not overfit its training set by a significant amount, measured relative to a held-out validation set with which it was  deduplicated (Figure 4.1). Thus, we expect that contamination is likely to be frequent, but that its effects may not be as  large as feared.

We initially tried to address the issue of contamination by proactively searching for and attempting to remove any overlap  between our training data and the development and test sets of all benchmarks studied in this paper. Unfortunately, a  bug resulted in only partial removal of all detected overlaps from the training data. Due to the cost of training, it wasn’t  feasible to retrain the model. To address this, we investigate in detail how the remaining detected overlap impacts  results.

For each benchmark, we produce a ‘clean’ version which removes all potentially leaked examples, defined roughly as  examples that have a 13-gram overlap with anything in the pretraining set (or that overlap with the whole example when  it is shorter than 13-grams). The goal is to very conservatively flag anything that could potentially be contamination,  so as to produce a clean subset that is free of contamination with high confidence. The exact procedure is detailed in  Appendix C.
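The rough definition above (a 13-gram overlap, or a whole-example overlap for shorter examples) can be sketched in a few lines. This is a simplified illustration only: the lowercase whitespace tokenizer, the function names, and the brute-force scan over training documents are assumptions, and the procedure in Appendix C should be treated as the authoritative description.

```python
def tokenize(text):
    """Lowercase whitespace tokenization (an assumption of this sketch)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Set of all contiguous n-token windows."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_potentially_leaked(example, training_docs, n=13):
    """Flag an example that shares an n-gram with any training document.

    Examples shorter than n tokens are matched as a whole.
    """
    tokens = tokenize(example)
    if not tokens:
        return False
    size = min(n, len(tokens))
    example_grams = ngrams(tokens, size)
    return any(example_grams & ngrams(tokenize(doc), size) for doc in training_docs)


def clean_subset(examples, training_docs, n=13):
    """Keep only the examples that were not flagged."""
    return [ex for ex in examples if not is_potentially_leaked(ex, training_docs, n)]


if __name__ == "__main__":
    docs = ["the quick brown fox jumps over the lazy dog and then runs far away"]
    print(clean_subset(["the lazy dog", "an unrelated sentence"], docs))
```

At real corpus scale the per-document scan would need to be replaced by a precomputed index of training n-grams; the loop here is only meant to show the flagging rule.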

We then evaluate GPT-3 on these clean benchmarks, and compare to the original score. If the score on the clean  subset is similar to the score on the entire dataset, this suggests that contamination, even if present, does not have a  significant effect on reported results. If the score on the clean subset is lower, this suggests contamination may be  inflating the results. The results are summarized in Figure 4.2. Although potential contamination is often high (with a  quarter of benchmarks scoring over 50%), in most cases performance changes only negligibly, and we see no evidence  that contamination level and performance difference are correlated. We conclude that either our conservative method  substantially overestimated contamination or that contamination has little effect on performance.
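The comparison described above reduces to a difference between two scores, which can be reported either in absolute points or relative to the full-dataset score. The helper below is a minimal sketch with a hypothetical name, and the numbers in the usage example are made up to show the arithmetic; they are not results from the paper.

```python
def contamination_shift(full_score, clean_score):
    """Return (absolute change in points, relative change in percent).

    Negative values mean the clean subset scored lower than the full set.
    """
    absolute = clean_score - full_score
    relative = 100.0 * absolute / full_score
    return absolute, relative


if __name__ == "__main__":
    # Made-up scores: 80.0 on the full benchmark, 77.0 on the clean subset.
    print(contamination_shift(80.0, 77.0))
```

With these made-up inputs the drop is 3 points in absolute terms and 3.75% in relative terms, which illustrates why the two figures should not be read interchangeably.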

Below, we review in more detail the few specific cases where either (1) the model performs significantly worse on  the cleaned version, or (2) potential contamination is very high, which makes measuring the performance difference  difficult.

Our analysis flagged six groups of benchmarks for further investigation: Word Scrambling, Reading Comprehension  (QuAC, SQuAD2, DROP), PIQA, Winograd, language modeling tasks (Wikitext tasks, 1BW), and German to English translation. Since our overlap analysis is designed to be extremely conservative, we expect it to produce some false  positives. We summarize the results for each group of tasks below:

Reading Comprehension: Our initial analysis flagged >90% of task examples from QuAC, SQuAD2, and  DROP as potentially contaminated, so large that even measuring the differential on a clean subset was difficult.  Upon manual inspection, however, we found that for every overlap we inspected, in all 3 datasets, the source  text was present in our training data but the question/answer pairs were not, meaning the model gains only  background information and cannot memorize the answer to a specific question.

German translation: We found 25% of the examples in the WMT16 German-English test set were marked  as potentially contaminated, with an associated total effect size of 1-2 BLEU. Upon inspection, none of the  flagged examples contain paired sentences resembling NMT training data and collisions were monolingual  matches mostly of snippets of events discussed in the news.

Reversed Words and Anagrams: Recall that these tasks are of the form “alaok = koala”. Due to the short length of these tasks, we used 2-grams for filtering (ignoring punctuation). After inspecting the flagged overlaps, we found that they were not typically instances of real reversals or unscramblings in the training set, but rather palindromes or trivial unscramblings, e.g. “kayak = kayak”. The amount of overlap was small, but removing the trivial tasks led to an increase in difficulty and thus a spurious signal. Related to this, the symbol insertion task shows high overlap but no effect on performance – this is because that task involves removing non-letter characters from a word, and the overlap analysis itself ignores such characters, leading to many spurious matches.

PIQA: The overlap analysis flagged 29% of examples as contaminated, and observed a 3 percentage point  absolute decrease (4% relative decrease) in performance on the clean subset. Though the test dataset was  released after our training set was created and its labels are hidden, some of the web pages used by the  crowdsourced dataset creators are contained in our training set. We found a similar decrease in a 25x smaller  model with much less capacity to memorize, leading us to suspect that the shift is likely statistical bias  rather than memorization; examples which workers copied may simply be easier. Unfortunately, we cannot  rigorously prove this hypothesis. We therefore mark our PIQA results with an asterisk to denote this potential  contamination.

Winograd: The overlap analysis flagged 45% of examples, and found a 2.6% decrease in performance on the clean subset. Manual inspection of the overlapping data points showed that 132 Winograd schemas were in fact present in our training set, though presented in a different format than we present the task to the model. Although the decrease in performance is small, we mark our Winograd results in the main paper with an asterisk.

Language modeling: We found the 4 Wikipedia language modeling benchmarks measured in GPT-2, plus the  Children’s Book Test dataset, to be almost entirely contained in our training data. Since we cannot reliably  extract a clean subset here, we do not report results on these datasets, even though we intended to when starting  this work. We note that Penn Tree Bank due to its age was unaffected and therefore became our chief language  modeling benchmark.

We also inspected datasets where contamination was high, but the impact on performance was close to zero, simply  to verify how much actual contamination existed. These appeared to often contain false positives. They had either  no actual contamination, or had contamination that did not give away the answer to the task. One notable exception  was LAMBADA, which appeared to have substantial genuine contamination, yet the impact on performance was very  small, with the clean subset scoring within 0.5% of the full dataset. Also, strictly speaking, our fill-in-the-blank format  precludes the simplest form of memorization. Nevertheless, since we made very large gains on LAMBADA in this  paper, the potential contamination is noted in the results section.

An important limitation of our contamination analysis is that we cannot be sure that the clean subset is drawn from the  same distribution as the original dataset. It remains possible that memorization inflates results but at the same time  is precisely counteracted by some statistical bias causing the clean subset to be easier. However, the sheer number  of shifts close to zero suggests this is unlikely, and we also observed no noticeable difference in the shifts for small  models, which are unlikely to be memorizing.

Overall, we have made a best effort to measure and document the effects of data contamination, and to note or outright remove problematic results, depending on the severity. Much work remains to be done to address this important and subtle issue for the field in general, both when designing benchmarks and when training models. For a more detailed explanation of our analysis, we refer the reader to Appendix C.

## 5 Limitations

GPT-3 and our analysis of it have a number of limitations. Below we describe some of these and suggest directions for  future work.

First, despite the strong quantitative and qualitative improvements of GPT-3, particularly compared to its direct  predecessor GPT-2, it still has notable weaknesses in text synthesis and several NLP tasks. On text synthesis, although  the overall quality is high, GPT-3 samples still sometimes repeat themselves semantically at the document level, start to  lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences  or paragraphs. We will release a collection of 500 uncurated unconditional samples to help provide a better sense of  GPT-3’s limitations and strengths at text synthesis. Within the domain of discrete language tasks, we have noticed  informally that GPT-3 seems to have special difficulty with “common sense physics”, despite doing well on some  datasets (such as PIQA [BZB+19]) that test this domain. Specifically GPT-3 has difficulty with questions of the type  “If I put cheese into the fridge, will it melt?”. Quantitatively, GPT-3’s in-context learning performance has some notable  gaps on our suite of benchmarks, as described in Section 3, and in particular it does little better than chance when  evaluated one-shot or even few-shot on some “comparison” tasks, such as determining if two words are used the same  way in a sentence, or if one sentence implies another (WIC and ANLI respectively), as well as on a subset of reading  comprehension tasks. This is especially striking given GPT-3’s strong few-shot performance on many other tasks.

GPT-3 has several structural and algorithmic limitations, which could account for some of the issues above. We focused  on exploring in-context learning behavior in autoregressive language models because it is straightforward to both  sample and compute likelihoods with this model class. As a result our experiments do not include any bidirectional  architectures or other training objectives such as denoising. This is a noticeable difference from much of the recent  literature, which has documented improved fine-tuning performance when using these approaches over standard  language models [RSR+19]. Thus our design decision comes at the cost of potentially worse performance on tasks  which empirically benefit from bidirectionality. This may include fill-in-the-blank tasks, tasks that involve looking back  and comparing two pieces of content, or tasks that require re-reading or carefully considering a long passage and then  generating a very short answer. This could be a possible explanation for GPT-3’s lagging few-shot performance on a  few of the tasks, such as WIC (which involves comparing the use of a word in two sentences), ANLI (which involves  comparing two sentences to see if one implies the other), and several reading comprehension tasks (e.g. QuAC and  RACE). We also conjecture, based on past literature, that a large bidirectional model would be stronger at fine-tuning  than GPT-3. Making a bidirectional model at the scale of GPT-3, and/or trying to make bidirectional models work with  few- or zero-shot learning, is a promising direction for future research, and could help achieve the “best of both worlds”.

A more fundamental limitation of the general approach described in this paper – scaling up any LM-like model, whether autoregressive or bidirectional – is that it may eventually run into (or could already be running into) the limits of the pretraining objective. Our current objective weights every token equally and lacks a notion of what is most important to predict and what is less important. [RRS20] demonstrate benefits of customizing prediction to entities of interest. Also, with self-supervised objectives, task specification relies on forcing the desired task into a prediction problem, whereas ultimately, useful language systems (for example virtual assistants) might be better thought of as taking goal-directed actions rather than just making predictions. Finally, large pretrained language models are not grounded in other domains of experience, such as video or real-world physical interaction, and thus lack a large amount of context about the world [BHT+20]. For all these reasons, scaling pure self-supervised prediction is likely to hit limits, and augmentation with a different approach is likely to be necessary. Promising future directions in this vein might include learning the objective function from humans [ZSW+19a], fine-tuning with reinforcement learning, or adding additional modalities such as images to provide grounding and a better model of the world [CLY+19].

Another limitation broadly shared by language models is poor sample efficiency during pre-training. While GPT-3 takes a step towards test-time sample efficiency closer to that of humans (one-shot or zero-shot), it still sees much more text during pre-training than a human sees in their lifetime [Lin20]. Improving pre-training sample efficiency is an important direction for future work, and might come from grounding in the physical world to provide additional information, or from algorithmic improvements.

A limitation, or at least uncertainty, associated with few-shot learning in GPT-3 is ambiguity about whether few-shot learning actually learns new tasks "from scratch" at inference time, or if it simply recognizes and identifies tasks that it has learned during training. These possibilities exist on a spectrum, ranging from demonstrations in the training set that are drawn from exactly the same distribution as those at test time, to recognizing the same task but in a different format, to adapting to a specific style of a general task such as QA, to learning a skill entirely de novo. Where GPT-3 is on this spectrum may also vary from task to task. Synthetic tasks such as word scrambling or defining nonsense words seem especially likely to be learned de novo, whereas translation clearly must be learned during pretraining, although possibly from data that is very different in organization and style from the test data. Ultimately, it is not even clear what humans learn from scratch vs. from prior demonstrations. Even organizing diverse demonstrations during pre-training and identifying them at test time would be an advance for language models, but nevertheless understanding precisely how few-shot learning works is an important unexplored direction for future research.

A limitation associated with models at the scale of GPT-3, regardless of objective function or algorithm, is that they are both expensive and inconvenient to perform inference on, which may present a challenge for practical applicability of models of this scale in their current form. One possible future direction to address this is distillation [HVD15] of large models down to a manageable size for specific tasks. Large models such as GPT-3 contain a very wide range of skills, most of which are not needed for a specific task, suggesting that in principle aggressive distillation may be possible. Distillation is well-explored in general [LHCG19a] but has not been tried at the scale of hundreds of billions of parameters; new challenges and opportunities may be associated with applying it to models of this size.
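As a rough illustration of the distillation idea mentioned above, the sketch below computes a soft-target loss between a teacher's and a student's logits using a temperature-scaled softmax. The logits, the function names, and the choice of KL divergence scaled by the squared temperature are assumptions made for this example; the paper does not describe a distillation setup for GPT-3.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl

if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5]   # hypothetical teacher logits over three classes
    student = [2.0, 1.5, 0.5]   # hypothetical student logits
    print(distillation_loss(teacher, student))
```

In a real training loop this term would usually be minimized with respect to the student's parameters, often combined with an ordinary loss on the ground-truth labels.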

Finally, GPT-3 shares some limitations common to most deep learning systems – its decisions are not easily interpretable, it is not necessarily well-calibrated in its predictions on novel inputs as observed by the much higher variance in performance than humans on standard benchmarks, and it retains the biases of the data it has been trained on. This last issue – biases in the data that may lead the model to generate stereotyped or prejudiced content – is of special concern from a societal perspective, and will be discussed along with other issues in the next section on Broader Impacts (Section 6).
