Paper: Translation and Commentary on GPT-3, "Language Models are Few-Shot Learners" (Part 2)


2.2 Training Dataset


Datasets for language models have rapidly expanded, culminating in the Common Crawl dataset [RSR+19] constituting nearly a trillion words. A dataset of this size is sufficient to train our largest models without ever updating on the same sequence twice. However, we have found that unfiltered or lightly filtered versions of Common Crawl tend to have lower quality than more curated datasets. Therefore, we took 3 steps to improve the average quality of our datasets: (1) we downloaded and filtered a version of Common Crawl based on similarity to a range of high-quality reference corpora, (2) we performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting, and (3) we also added known high-quality reference corpora to the training mix to augment Common Crawl and increase its diversity.

Details of the first two points (processing of Common Crawl) are described in Appendix A. For the third, we added several curated high-quality datasets, including an expanded version of the WebText dataset [RWC+19], collected by scraping links over a longer period of time, and first described in [KMH+20], two internet-based books corpora (Books1 and Books2) and English-language Wikipedia.
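The fuzzy document-level deduplication in step (2) can be illustrated with a small sketch. The paper defers its actual procedure to Appendix A, so everything below is an assumption made for illustration: documents are compared by the Jaccard similarity of their word 5-gram sets, and the names `shingles`, `jaccard`, `dedup` and the 0.8 threshold are hypothetical.

```python
def shingles(text, n=5):
    """Return the set of word n-grams (shingles) in a document."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 1.0 means identical shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup(docs, threshold=0.8):
    """Keep a document only if it is not a near-duplicate of one already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "the quick brown fox jumps over the lazy dog near the river bank today !",
    "language models are few-shot learners when scaled to many parameters",
]
unique_docs = dedup(docs)
print(len(unique_docs))
```

This pairwise comparison is quadratic in the number of documents, so it only conveys the idea; at web scale an approximate scheme such as MinHash with locality-sensitive hashing would be needed.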

Table 2.2 shows the final mixture of datasets that we used in training. The Common Crawl data was downloaded from 41 shards of monthly Common Crawl covering 2016 to 2019, constituting 45TB of compressed plaintext before filtering and 570GB after filtering, roughly equivalent to 400 billion byte-pair-encoded tokens. Note that during training, datasets are not sampled in proportion to their size, but rather datasets we view as higher-quality are sampled more frequently, such that the Common Crawl and Books2 datasets are sampled less than once during training, but the other datasets are sampled 2-3 times. This essentially accepts a small amount of overfitting in exchange for higher quality training data.
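To see how quality-weighted sampling leads to fewer than one pass over a large corpus and several passes over a small one, here is a minimal sketch. The dataset names, sizes, weights, and total token budget are hypothetical values chosen for round numbers; they are not the figures from Table 2.2.

```python
def epochs_seen(dataset_tokens, sampling_weights, total_training_tokens):
    """Return how many passes are made over each dataset during training.

    A dataset with sampling weight w contributes w * total_training_tokens
    tokens to training; dividing by its size gives the number of epochs.
    """
    epochs = {}
    for name, size in dataset_tokens.items():
        tokens_drawn = sampling_weights[name] * total_training_tokens
        epochs[name] = tokens_drawn / size
    return epochs


# Hypothetical corpus sizes (in tokens) and sampling weights (summing to 1).
dataset_tokens = {"web_crawl": 400e9, "books": 30e9, "encyclopedia": 15e9}
sampling_weights = {"web_crawl": 0.60, "books": 0.25, "encyclopedia": 0.15}

epochs = epochs_seen(dataset_tokens, sampling_weights, total_training_tokens=300e9)
for name, value in epochs.items():
    print(f"{name}: {value:.2f} epochs")
```

With these made-up numbers the large crawl is seen less than once while the two small curated sets are repeated two to three times, which is the trade-off the paragraph describes.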

A major methodological concern with language models pretrained on a broad swath of internet data, particularly large models with the capacity to memorize vast amounts of content, is potential contamination of downstream tasks by having their test or development sets inadvertently seen during pre-training. To reduce such contamination, we searched for and attempted to remove any overlaps with the development and test sets of all benchmarks studied in this paper. Unfortunately, a bug in the filtering caused us to ignore some overlaps, and due to the cost of training it was not feasible to retrain the model. In Section 4 we characterize the impact of the remaining overlaps, and in future work we will more aggressively remove data contamination.
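One common way to search for such overlaps is to look for shared word n-grams between training documents and benchmark examples. The sketch below assumes that approach; the function names and the choice of n = 8 are hypothetical and are not taken from the paper, whose own procedure is described in Section 4.

```python
def ngrams(text, n):
    """Return the set of word n-grams in a text (empty if the text is too short)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(train_doc, benchmark_examples, n=8):
    """True if the training document shares any word n-gram with a benchmark example."""
    doc_grams = ngrams(train_doc, n)
    return any(doc_grams & ngrams(example, n) for example in benchmark_examples)


benchmark_examples = ["what is the capital city of france and when was it founded"]
clean_doc = "common crawl contains many pages about cooking travel and sports news"
dirty_doc = "quiz answers : what is the capital city of france and when was it founded ? paris"

print(is_contaminated(clean_doc, benchmark_examples))
print(is_contaminated(dirty_doc, benchmark_examples))
```

A smaller n flags more documents (including coincidental matches), while a larger n misses paraphrased or partially quoted test items, so the value is a tuning choice.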


2.3 Training Process


As found in [KMH+20, MKAT18], larger models can typically use a larger batch size, but require a smaller learning rate. We measure the gradient noise scale during training and use it to guide our choice of batch size [MKAT18]. Table 2.1 shows the parameter settings we used. To train the larger models without running out of memory, we use a mixture of model parallelism within each matrix multiply and model parallelism across the layers of the network. All models were trained on V100 GPUs on part of a high-bandwidth cluster provided by Microsoft. Details of the training process and hyperparameter settings are described in Appendix B.


2.4 Evaluation


For few-shot learning, we evaluate each example in the evaluation set by randomly drawing K examples from that task’s training set as conditioning, delimited by 1 or 2 newlines depending on the task. For LAMBADA and Storycloze there is no supervised training set available so we draw conditioning examples from the development set and evaluate on the test set. For Winograd (the original, not SuperGLUE version) there is only one dataset, so we draw conditioning examples directly from it.

K can be any value from 0 to the maximum amount allowed by the model’s context window, which is n_ctx = 2048 for all models and typically fits 10 to 100 examples. Larger values of K are usually but not always better, so when a separate development and test set are available, we experiment with a few values of K on the development set and then run the best value on the test set. For some tasks (see Appendix G) we also use a natural language prompt in addition to (or for K = 0, instead of) demonstrations.
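The few-shot setup described above amounts to drawing K demonstrations at random and joining them with the query using a newline delimiter. The sketch below is an illustration under that reading; `build_few_shot_prompt`, the question/answer template, and the fixed seed are hypothetical, and a real harness would also check that the prompt fits in the 2048-token context window.

```python
import random


def build_few_shot_prompt(train_examples, query, k, delimiter="\n\n", seed=0):
    """Draw k (context, completion) pairs at random and prepend them to the query."""
    rng = random.Random(seed)
    demos = rng.sample(train_examples, k)
    parts = [f"{context} {completion}" for context, completion in demos]
    parts.append(query)  # the final example has context only; the model completes it
    return delimiter.join(parts)


train_examples = [
    ("Q: What is 2 + 2? A:", "4"),
    ("Q: What is 3 + 5? A:", "8"),
    ("Q: What is 7 + 1? A:", "8"),
]
prompt = build_few_shot_prompt(train_examples, "Q: What is 5 + 5? A:", k=2)
print(prompt)
```

Setting `k=0` yields the zero-shot case, where the prompt is just the query (optionally preceded by a natural language instruction).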

On tasks that involve choosing one correct completion from several options (multiple choice), we provide K examples of context plus correct completion, followed by one example of context only, and compare the LM likelihood of each completion. For most tasks we compare the per-token likelihood (to normalize for length); however, on a small number of datasets (ARC, OpenBookQA, and RACE) we gain additional benefit, as measured on the development set, by normalizing by the unconditional probability of each completion, computing $\frac{P(\text{completion} \mid \text{context})}{P(\text{completion} \mid \text{answer\_context})}$, where answer_context is the string "Answer: " or "A: " and is used to prompt that the completion should be an answer but is otherwise generic.
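Both scoring rules are easy to state in log space. The sketch below assumes the per-token log-probabilities of each candidate have already been obtained from a language model; the numbers and candidate strings are made up for illustration, and the function names are hypothetical.

```python
def per_token_logprob(token_logprobs):
    """Average log-probability per token, which normalizes for completion length."""
    return sum(token_logprobs) / len(token_logprobs)


def unconditional_normalized(cond_logprobs, uncond_logprobs):
    """log of P(completion | context) / P(completion | answer_context)."""
    return sum(cond_logprobs) - sum(uncond_logprobs)


# Hypothetical per-token log-probabilities for two candidate completions:
# "cond" is scored after the real context, "uncond" after the generic "Answer: ".
candidates = {
    "Paris": {"cond": [-0.5], "uncond": [-4.0]},
    "the city of Lyon": {
        "cond": [-1.0, -0.8, -0.9, -1.5],
        "uncond": [-1.0, -1.0, -1.0, -4.0],
    },
}

best_per_token = max(
    candidates, key=lambda c: per_token_logprob(candidates[c]["cond"])
)
best_normalized = max(
    candidates,
    key=lambda c: unconditional_normalized(candidates[c]["cond"], candidates[c]["uncond"]),
)
print(best_per_token, best_normalized)
```

The second rule rewards completions that become more likely because of the context, rather than completions that are simply common strings.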

On tasks that involve binary classification, we give the options more semantically meaningful names (e.g. “True” or “False” rather than 0 or 1) and then treat the task like multiple choice; we also sometimes frame the task similar to what is done by [RSR+19] (see Appendix G for details).

On tasks with free-form completion, we use beam search with the same parameters as [RSR+19]: a beam width of 4 and a length penalty of α = 0.6. We score the model using F1 similarity score, BLEU, or exact match, depending on what is standard for the dataset at hand.
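Of the metrics mentioned, exact match and token-level F1 are simple enough to sketch directly. The version below is a simplified assumption: it only lowercases and splits on whitespace, whereas the official scoring scripts for datasets such as SQuAD also strip punctuation and articles before comparing.

```python
from collections import Counter


def exact_match(prediction, reference):
    """True if the two strings are identical after trimming and lowercasing."""
    return prediction.strip().lower() == reference.strip().lower()


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between prediction and reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Eiffel Tower", "eiffel tower"))
print(round(token_f1("the eiffel tower", "eiffel tower"), 3))
```

F1 gives partial credit when a free-form answer contains extra or missing words, which exact match does not.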

Final results are reported on the test set when publicly available, for each model size and learning setting (zero-, one-, and few-shot). When the test set is private, our model is often too large to fit on the test server, so we report results on the development set. We do submit to the test server on a small number of datasets (SuperGLUE, TriviaQA, PiQa) where we were able to make submission work, and we submit only the 200B few-shot results, and report development set results for everything else.


3 Results


In Figure 3.1 we display training curves for the 8 models described in Section 2. For this graph we also include 6  additional extra-small models with as few as 100,000 parameters. As observed in [KMH+20], language modeling  performance follows a power-law when making efficient use of training compute. After extending this trend by two  more orders of magnitude, we observe only a slight (if any) departure from the power-law. One might worry that these  improvements in cross-entropy loss come only from modeling spurious details of our training corpus. However, we will  see in the following sections that improvements in cross-entropy loss lead to consistent performance gains across a  broad spectrum of natural language tasks.  
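A power-law relationship between loss and compute, L = a · C^b, becomes a straight line in log-log space, so it can be fitted with ordinary least squares. The sketch below demonstrates this on synthetic data; the constants 2.5 and -0.05 are invented for the example and are not the coefficients reported in the paper or in [KMH+20].

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) + b * log(compute); returns (a, b)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    b = covariance / variance
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic points that follow an exact power law (hypothetical constants).
compute = [10.0 ** k for k in range(-4, 3)]
loss = [2.5 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
print(f"a = {a:.3f}, b = {b:.3f}")
```

A departure from the power law would show up as systematic residuals between the fitted line and the measured losses at the largest compute values.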

Below, we evaluate the 8 models described in Section 2 (the 175 billion parameter GPT-3 and 7 smaller models) on a wide range of datasets. We group the datasets into 9 categories representing roughly similar tasks.

In Section 3.1 we evaluate on traditional language modeling tasks and tasks that are similar to language modeling, such as Cloze tasks and sentence/paragraph completion tasks. In Section 3.2 we evaluate on “closed book” question answering tasks: tasks which require using the information stored in the model’s parameters to answer general knowledge questions. In Section 3.3 we evaluate the model’s ability to translate between languages (especially one-shot and few-shot). In Section 3.4 we evaluate the model’s performance on Winograd Schema-like tasks. In Section 3.5 we evaluate on datasets that involve commonsense reasoning or question answering. In Section 3.6 we evaluate on reading comprehension tasks, in Section 3.7 we evaluate on the SuperGLUE benchmark suite, and in Section 3.8 we briefly explore NLI. Finally, in Section 3.9, we invent some additional tasks designed especially to probe in-context learning abilities – these tasks focus on on-the-fly reasoning, adaptation skills, or open-ended text synthesis. We evaluate all tasks in the few-shot, one-shot, and zero-shot settings.


3.1 Language Modeling, Cloze, and Completion Tasks


In this section we test GPT-3’s performance on the traditional task of language modeling, as well as related tasks that involve predicting a single word of interest, completing a sentence or paragraph, or choosing between possible completions of a piece of text.


3.1.1 Language Modeling


We calculate zero-shot perplexity on the Penn Tree Bank (PTB) [MKM+94] dataset measured in [RWC+19]. We omit the 4 Wikipedia-related tasks in that work because they are entirely contained in our training data, and we also omit the one-billion word benchmark due to a high fraction of the dataset being contained in our training set. PTB escapes these issues due to predating the modern internet. Our largest model sets a new SOTA on PTB by a substantial margin of 15 points, achieving a perplexity of 20.50. Note that since PTB is a traditional language modeling dataset it does not have a clear separation of examples to define one-shot or few-shot evaluation around, so we measure only zero-shot.
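Perplexity is the exponential of the average negative log-likelihood per prediction unit. The sketch below computes it from a list of natural-log probabilities; the input values are synthetic, and whether the average is taken per word or per sub-word token is a convention that must match the benchmark being compared against.

```python
import math


def perplexity(logprobs):
    """exp of the mean negative log-probability (natural log) over all units."""
    return math.exp(-sum(logprobs) / len(logprobs))


# Synthetic example: a model that assigns probability 1/20 to every unit.
logprobs = [math.log(1 / 20)] * 50
print(round(perplexity(logprobs), 2))
```

A perplexity of 20 therefore corresponds to the model being, on average, as uncertain as a uniform choice among 20 options.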


3.1.2 LAMBADA


The LAMBADA dataset [PKL+16] tests the modeling of long-range dependencies in text – the model is asked to predict the last word of sentences which require reading a paragraph of context. It has recently been suggested that the continued scaling of language models is yielding diminishing returns on this difficult benchmark. [BHT+20] reflect on the small 1.5% improvement achieved by a doubling of model size between two recent state of the art results ([SPP+19] and [Tur20]) and argue that “continuing to expand hardware and data sizes by orders of magnitude is not the path forward”. We find that path is still promising and in a zero-shot setting GPT-3 achieves 76% on LAMBADA, a gain of 8% over the previous state of the art.

LAMBADA is also a demonstration of the flexibility of few-shot learning as it provides a way to address a problem that  classically occurs with this dataset. Although the completion in LAMBADA is always the last word in a sentence, a  standard language model has no way of knowing this detail. It thus assigns probability not only to the correct ending but  also to other valid continuations of the paragraph. This problem has been partially addressed in the past with stop-word  filters [RWC+19] (which ban “continuation” words). The few-shot setting instead allows us to “frame” the task as a  cloze-test and allows the language model to infer from examples that a completion of exactly one word is desired. We  use the following fill-in-the-blank format:
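In the original paper the format is shown by example: each demonstration is a passage with its final word replaced by a blank, followed by an arrow and the answer, and the query ends at the arrow. The sketch below builds such a prompt; the two sentences are the illustrative ones used in the paper, while the helper names and the single-newline separator are assumptions.

```python
def cloze_line(context, answer=None):
    """Format one example; the answer is omitted for the final query."""
    line = f"{context} ____. →"
    return line if answer is None else f"{line} {answer}"


def build_cloze_prompt(demonstrations, query_context):
    """Join answered demonstrations and the unanswered query, one per line."""
    lines = [cloze_line(context, answer) for context, answer in demonstrations]
    lines.append(cloze_line(query_context))
    return "\n".join(lines)


demonstrations = [
    ("Alice was friends with Bob. Alice went to visit her friend", "Bob"),
]
prompt = build_cloze_prompt(
    demonstrations,
    "George bought some baseball equipment, a ball, a glove, and a",
)
print(prompt)
```

Because every demonstration is completed with exactly one word, the model can infer that a single-word completion is expected for the query.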


When presented with examples formatted this way, GPT-3 achieves 86.4% accuracy in the few-shot setting, an increase  of over 18% from the previous state-of-the-art. We observe that few-shot performance improves strongly with model  size. While this setting decreases the performance of the smallest model by almost 20%, for GPT-3 it improves accuracy  by 10%. Finally, the fill-in-blank method is not effective one-shot, where it always performs worse than the zero-shot  setting. Perhaps this is because all models still require several examples to recognize the pattern.

One note of caution is that an analysis of test set contamination identified that a significant minority of the LAMBADA dataset appears to be present in our training data – however analysis performed in Section 4 suggests negligible impact on performance.


3.1.3 HellaSwag


The HellaSwag dataset [ZHB+19] involves picking the best ending to a story or set of instructions. The examples were adversarially mined to be difficult for language models while remaining easy for humans (who achieve 95.6% accuracy). GPT-3 achieves 78.1% accuracy in the one-shot setting and 79.3% accuracy in the few-shot setting, outperforming the 75.4% accuracy of a fine-tuned 1.5B parameter language model [ZHR+19] but still a fair amount lower than the overall SOTA of 85.6% achieved by the fine-tuned multi-task model ALUM.


3.1.4 StoryCloze


We next evaluate GPT-3 on the StoryCloze 2016 dataset [MCH+16], which involves selecting the correct ending sentence for five-sentence long stories. Here GPT-3 achieves 83.2% in the zero-shot setting and 87.7% in the few-shot setting (with K = 70). This is still 4.1% lower than the fine-tuned SOTA using a BERT based model [LDL19] but improves over previous zero-shot results by roughly 10%.


3.2 Closed Book Question Answering


In this section we measure GPT-3’s ability to answer questions about broad factual knowledge. Due to the immense amount of possible queries, this task has normally been approached by using an information retrieval system to find relevant text in combination with a model which learns to generate an answer given the question and the retrieved text. Since this setting allows a system to search for and condition on text which potentially contains the answer it is denoted “open-book”. [RRS20] recently demonstrated that a large language model can perform surprisingly well directly answering the questions without conditioning on auxiliary information. They denote this more restrictive evaluation setting as “closed-book”. Their work suggests that even higher-capacity models could perform even better and we test this hypothesis with GPT-3. We evaluate GPT-3 on the 3 datasets in [RRS20]: Natural Questions [KPR+19], WebQuestions [BCFL13], and TriviaQA [JCWZ17], using the same splits. Note that in addition to all results being in the closed-book setting, our use of few-shot, one-shot, and zero-shot evaluations represent an even stricter setting than previous closed-book QA work: in addition to external content not being allowed, fine-tuning on the Q&A dataset itself is also not permitted.

The results for GPT-3 are shown in Table 3.3. On TriviaQA, we achieve 64.3% in the zero-shot setting, 68.0% in the  one-shot setting, and 71.2% in the few-shot setting. The zero-shot result already outperforms the fine-tuned T5-11B by  14.2%, and also outperforms a version with Q&A tailored span prediction during pre-training by 3.8%. The one-shot  result improves by 3.7% and matches the SOTA for an open-domain QA system which not only fine-tunes but also  makes use of a learned retrieval mechanism over a 15.3B parameter dense vector index of 21M documents [LPP+20].  GPT-3’s few-shot result further improves performance another 3.2% beyond this.  

On WebQuestions (WebQs), GPT-3 achieves 14.4% in the zero-shot setting, 25.3% in the one-shot setting, and 41.5% in the few-shot setting. This compares to 37.4% for fine-tuned T5-11B, and 44.7% for fine-tuned T5-11B+SSM, which uses a Q&A-specific pre-training procedure. GPT-3 in the few-shot setting approaches the performance of state-of-the-art fine-tuned models. Notably, compared to TriviaQA, WebQs shows a much larger gain from zero-shot to few-shot (and indeed its zero-shot and one-shot performance are poor), perhaps suggesting that the WebQs questions and/or the style of their answers are out-of-distribution for GPT-3. Nevertheless, GPT-3 appears able to adapt to this distribution, recovering strong performance in the few-shot setting.


On Natural Questions (NQs) GPT-3 achieves 14.6% in the zero-shot setting, 23.0% in the one-shot setting, and 29.9% in the few-shot setting, compared to 36.6% for fine-tuned T5 11B+SSM. Similar to WebQs, the large gain from zero-shot to few-shot may suggest a distribution shift, and may also explain the less competitive performance compared to TriviaQA and WebQs. In particular, the questions in NQs tend towards very fine-grained knowledge on Wikipedia specifically which could be testing the limits of GPT-3’s capacity and broad pretraining distribution.

Overall, on one of the three datasets GPT-3’s one-shot matches the open-domain fine-tuning SOTA. On the other two  datasets it approaches the performance of the closed-book SOTA despite not using fine-tuning. On all 3 datasets, we  find that performance scales very smoothly with model size (Figure 3.3 and Appendix H Figure H.7), possibly reflecting  the idea that model capacity translates directly to more ‘knowledge’ absorbed in the parameters of the model.


3.3 Translation


For GPT-2 a filter was used on a multilingual collection of documents to produce an English only dataset due to capacity  concerns. Even with this filtering GPT-2 showed some evidence of multilingual capability and performed non-trivially  when translating between French and English despite only training on 10 megabytes of remaining French text. Since we  increase the capacity by over two orders of magnitude from GPT-2 to GPT-3, we also expand the scope of the training  dataset to include more representation of other languages, though this remains an area for further improvement. As  discussed in 2.2 the majority of our data is derived from raw Common Crawl with only quality-based filtering. Although  GPT-3’s training data is still primarily English (93% by word count), it also includes 7% of text in other languages.  These languages are documented in the supplemental material. In order to better understand translation capability, we  also expand our analysis to include two additional commonly studied languages, German and Romanian.  

Existing unsupervised machine translation approaches often combine pretraining on a pair of monolingual datasets with back-translation [SHB15] to bridge the two languages in a controlled way. By contrast, GPT-3 learns from a blend of training data that mixes many languages together in a natural way, combining them on a word, sentence, and document level. GPT-3 also uses a single training objective which is not customized or designed for any task in particular. However, our one / few-shot settings aren’t strictly comparable to prior unsupervised work since they make use of a small amount of paired examples (1 or 64). This corresponds to up to a page or two of in-context training data.

Results are shown in Table 3.4. Zero-shot GPT-3, which only receives a natural language description of the task, still underperforms recent unsupervised NMT results. However, providing only a single example demonstration for each translation task improves performance by over 7 BLEU and nears competitive performance with prior work. GPT-3 in the full few-shot setting further improves another 4 BLEU resulting in similar average performance to prior unsupervised NMT work. GPT-3 has a noticeable skew in its performance depending on language direction. For the three input languages studied, GPT-3 significantly outperforms prior unsupervised NMT work when translating into English but underperforms when translating in the other direction. Performance on En-Ro is a noticeable outlier at over 10 BLEU worse than prior unsupervised NMT work. This could be a weakness due to reusing the byte-level BPE tokenizer of GPT-2 which was developed for an almost entirely English training dataset. For both Fr-En and De-En, few shot GPT-3 outperforms the best supervised result we could find but due to our unfamiliarity with the literature and the appearance that these are un-competitive benchmarks we do not suspect those results represent true state of the art. For Ro-En, few shot GPT-3 performs within 0.5 BLEU of the overall SOTA which is achieved by a combination of unsupervised pretraining, supervised finetuning on 608K labeled examples, and backtranslation [LHCG19b].

现有的无监督机器翻译方法通常结合对单语数据集的预训练和反向翻译[SHB15],以一种可控的方式连接两种语言。相比之下,GPT-3从混合的训练数据中学习,这些数据以自然的方式将多种语言混合在一起,在单词、句子和文档级别上将它们组合在一起。GPT-3也使用单一的训练目标,它不是为任何特定任务定制或设计的。然而,我们的单样本/小样本设置并不能严格地与之前的无监督工作相比,因为它们使用了少量成对的例子(1或64个)。这相当于一页或两页上下文内训练数据。结果如表3.4所示。Zero-shot GPT-3,它只接收任务的自然语言描述,仍然表现不佳,最近的非监督NMT结果。然而,仅为每个翻译任务提供一个示例演示,就可以提高7个蓝度以上的翻译性能,接近与之前工作的竞争性能。GPT-3在全小样本设置中进一步提高了另外4个蓝度,使得平均性能与之前的无监督NMT工作相似。根据语言方向的不同,GPT-3在性能上有明显的偏差。在研究的三种输入语言中,GPT-3在翻译成英语时显著优于之前的无监督的NMT工作,但在翻译成英语时表现不佳。在enro上的性能是一个明显的异常值,比之前的无监督的NMT工作差10蓝度以上。这可能是一个弱点,因为重用了GPT-2的字节级BPE标记器,它是为一个几乎完全是英语的训练数据集开发的。对于Fr-En和De-En,很少有shot GPT-3优于我们所能找到的最佳监督结果,但由于我们不熟悉文献和这些是非竞争性基准的外观,我们不怀疑这些结果代表了真正的艺术状态。对于roen来说,很少有shot GPT-3能在整体SOTA的0.5 BLEU范围内完成,这是通过结合无监督的预训练、对608K标记示例的监督微调和反向翻译来实现的[LHCG19b]。

Finally, across all language pairs and across all three settings (zero-, one-, and few-shot), there is a smooth trend of improvement with model capacity. This is shown in Figure 3.4 in the case of few-shot results, and scaling for all three settings is shown in Appendix H.


3.4 Winograd-Style Tasks


The Winograd Schemas Challenge [LDM12] is a classical task in NLP that involves determining which word a pronoun  refers to, when the pronoun is grammatically ambiguous but semantically unambiguous to a human. Recently fine-tuned  language models have achieved near-human performance on the original Winograd dataset, but more difficult versions such as the adversarially-mined Winogrande dataset [SBBC19] still significantly lag human performance. We test  GPT-3’s performance on both Winograd and Winogrande, as usual in the zero-, one-, and few-shot setting.  

On Winograd we test GPT-3 on the original set of 273 Winograd schemas, using the same “partial evaluation” method  described in [RWC+19]. Note that this setting differs slightly from the WSC task in the SuperGLUE benchmark, which  is presented as binary classification and requires entity extraction to convert to the form described in this section. On  Winograd GPT-3 achieves 88.3%, 89.7%, and 88.6% in the zero-shot, one-shot, and few-shot settings, showing no clear  in-context learning but in all cases achieving strong results just a few points below state-of-the-art and estimated human  performance. We note that contamination analysis found some Winograd schemas in the training data but this appears  to have only a small effect on results (see Section 4).  

On the more difficult Winogrande dataset, we do find gains to in-context learning: GPT-3 achieves 70.2% in the zero-shot setting, 73.2% in the one-shot setting, and 77.7% in the few-shot setting. For comparison a fine-tuned RoBERTa model achieves 79%, state-of-the-art is 84.6% achieved with a fine-tuned high capacity model (T5), and human performance on the task as reported by [SBBC19] is 94.0%.


3.5 Common Sense Reasoning


Next we consider three datasets which attempt to capture physical or scientific reasoning, as distinct from sentence completion, reading comprehension, or broad knowledge question answering. The first, PhysicalQA (PIQA) [BZB+19], asks common sense questions about how the physical world works and is intended as a probe of grounded understanding of the world. GPT-3 achieves 81.0% accuracy zero-shot, 80.5% accuracy one-shot, and 82.8% accuracy few-shot (the last measured on PIQA’s test server). This compares favorably to the 79.4% accuracy prior state-of-the-art of a fine-tuned RoBERTa. PIQA shows relatively shallow scaling with model size and is still over 10% worse than human performance, but GPT-3’s few-shot and even zero-shot result outperform the current state-of-the-art. Our analysis flagged PIQA for a potential data contamination issue (despite hidden test labels), and we therefore conservatively mark the result with an asterisk. See Section 4 for details.


ARC [CCE+18] is a dataset of multiple-choice questions collected from 3rd to 9th grade science exams. On the  “Challenge” version of the dataset which has been filtered to questions which simple statistical or information retrieval  methods are unable to correctly answer, GPT-3 achieves 51.4% accuracy in the zero-shot setting, 53.2% in the one-shot  setting, and 51.5% in the few-shot setting. This is approaching the performance of a fine-tuned RoBERTa baseline  (55.9%) from UnifiedQA [KKS+20]. On the “Easy” version of the dataset (questions which either of the mentioned  baseline approaches answered correctly), GPT-3 achieves 68.8%, 71.2%, and 70.1% which slightly exceeds a fine-tuned  RoBERTa baseline from [KKS+20]. However, both of these results are still much worse than the overall SOTAs  achieved by the UnifiedQA which exceeds GPT-3’s few-shot results by 27% on the challenge set and 22% on the easy  set.  

On OpenBookQA [MCKS18], GPT-3 improves significantly from zero to few shot settings but is still over 20 points  short of the overall SOTA. GPT-3’s few-shot performance is similar to a fine-tuned BERT Large baseline on the  leaderboard.  

Overall, in-context learning with GPT-3 shows mixed results on commonsense reasoning tasks, with only small and  inconsistent gains observed in the one and few-shot learning settings for both PIQA and ARC, but a significant  improvement is observed on OpenBookQA. GPT-3 sets SOTA on the new PIQA dataset in all evaluation settings.




3.6 Reading Comprehension


Next we evaluate GPT-3 on the task of reading comprehension. We use a suite of 5 datasets including abstractive, multiple choice, and span-based answer formats in both dialog and single question settings. We observe a wide spread in GPT-3's performance across these datasets, suggestive of varying capability with different answer formats. In general we observe GPT-3 is on par with initial baselines and early results trained using contextual representations on each respective dataset.


GPT-3 performs best (within 3 points of the human baseline) on CoQA [RCM19], a free-form conversational dataset, and performs worst (13 F1 below an ELMo baseline) on QuAC [CHI+18], a dataset which requires modeling structured dialog acts and answer span selections of teacher-student interactions. On DROP [DWD+19], a dataset testing discrete reasoning and numeracy in the context of reading comprehension, GPT-3 in a few-shot setting outperforms the fine-tuned BERT baseline from the original paper but is still well below both human performance and state-of-the-art approaches which augment neural networks with symbolic systems [RLL+19]. On SQuAD 2.0 [RJL18], GPT-3 demonstrates its few-shot learning capabilities, improving by almost 10 F1 (to 69.8) compared to a zero-shot setting. This allows it to slightly outperform the best fine-tuned result in the original paper. On RACE [LXL+17], a multiple choice dataset of middle school and high school English examinations, GPT-3 performs relatively weakly: it is only competitive with the earliest work utilizing contextual representations and is still 45% behind SOTA.
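Several of the scores above are reported in F1. As a rough illustration of what that number measures, here is a minimal sketch of a token-overlap F1 between a predicted answer and a reference answer. It assumes lowercasing and whitespace tokenization only; the official evaluation scripts of these datasets apply their own normalization, so this is not a reproduction of any of them, and the function name `token_f1` is made up for the example.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two answer strings (simplified sketch)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match; exactly one empty answer is a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: a shared token counts as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # share of predicted tokens that are correct
    recall = overlap / len(ref_tokens)      # share of reference tokens that were found
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(token_f1("the cat sat", "the cat"))
```

A dataset-level F1 would typically average this per-question score, often taking the maximum over several reference answers where a dataset provides them.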



3.7 SuperGLUE


In order to better aggregate results on NLP tasks and compare to popular models such as BERT and RoBERTa in a more systematic way, we also evaluate GPT-3 on a standardized collection of datasets, the SuperGLUE benchmark [WPN+19] [CLC+19] [DMST19] [RBG11] [KCR+18] [ZLL+18] [DGM06] [BHDD+06] [GMDD07] [BDD+09] [PCC18] [PHR+18]. GPT-3's test-set performance on the SuperGLUE dataset is shown in Table 3.8. In the few-shot setting, we used 32 examples for all tasks, sampled randomly from the training set. For all tasks except WSC and MultiRC, we sampled a new set of examples to use in the context for each problem. For WSC and MultiRC, we used the same set of randomly drawn examples from the training set as context for all of the problems we evaluated.
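The sampling scheme in the paragraph above can be sketched as follows. This is only an illustration under stated assumptions: the `Input:`/`Output:` template, the dictionary schema, and the names `format_example` and `build_prompt` are hypothetical and are not taken from the paper, which uses task-specific prompt formats.

```python
import random
from typing import Dict, List, Optional

# Hypothetical schema for one example: {"input": ..., "output": ...}
Example = Dict[str, str]


def format_example(example: Example, with_answer: bool = True) -> str:
    """Render one example; the query is rendered with its answer left blank."""
    answer = example["output"] if with_answer else ""
    return f"Input: {example['input']}\nOutput: {answer}".rstrip()


def build_prompt(
    train_set: List[Example],
    query: Example,
    k: int = 32,
    rng: Optional[random.Random] = None,
    fixed_context: Optional[List[Example]] = None,
) -> str:
    """Build a few-shot prompt: k demonstrations followed by the unanswered query.

    If fixed_context is given, it is reused as-is (the WSC / MultiRC case);
    otherwise k examples are drawn fresh from train_set for this query.
    """
    if rng is None:
        rng = random.Random()
    context = fixed_context if fixed_context is not None else rng.sample(train_set, k)
    blocks = [format_example(ex) for ex in context]
    blocks.append(format_example(query, with_answer=False))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    train = [{"input": f"q{i}", "output": f"a{i}"} for i in range(100)]
    rng = random.Random(0)
    # Per-problem resampling (most tasks): a new context for every query.
    prompt = build_prompt(train, {"input": "new question"}, k=32, rng=rng)
    # Shared context (WSC and MultiRC): draw once, reuse for every query.
    shared = rng.sample(train, 32)
    prompt_shared = build_prompt(train, {"input": "other question"}, fixed_context=shared)
    print(prompt.count("Input:"), prompt_shared.count("Input:"))
```

Drawing with `random.sample` selects without replacement, so no demonstration repeats within one context; whether the original evaluation did the same is not stated in the text.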

We observe a wide range in GPT-3's performance across tasks. On COPA and ReCoRD, GPT-3 achieves near-SOTA performance in the one-shot and few-shot settings, with COPA falling only a couple of points short and achieving second place on the leaderboard, where first place is held by a fine-tuned 11 billion parameter model (T5). On WSC, performance is still relatively strong, achieving 80.1% in the few-shot setting (note that GPT-3 achieves 88.6% on the original Winograd dataset as described in Section 3.4). On BoolQ, MultiRC, and RTE, performance is reasonable, roughly matching that of a fine-tuned BERT-Large. On CB, we see signs of life at 75.6% in the few-shot setting.


WiC is a notable weak spot with few-shot performance at 49.4% (at random chance). We tried a number of different phrasings and formulations for WiC (which involves determining if a word is being used with the same meaning in two sentences), none of which was able to achieve strong performance. This hints at a phenomenon that will become clearer in the next section (which discusses the ANLI benchmark): GPT-3 appears to be weak in the few-shot or one-shot setting at some tasks that involve comparing two sentences or snippets, for example whether a word is used the same way in two sentences (WiC), whether one sentence is a paraphrase of another, or whether one sentence implies another. This could also explain the comparatively low scores for RTE and CB, which also follow this format. Despite these weaknesses, GPT-3 still outperforms a fine-tuned BERT-Large on four of eight tasks, and on two tasks GPT-3 is close to the state-of-the-art held by a fine-tuned 11 billion parameter model.

Finally, we note that the few-shot SuperGLUE score steadily improves with both model size and the number of examples in the context, showing increasing benefits from in-context learning (Figure 3.8). We scale K up to 32 examples per task, after which point additional examples will not reliably fit into our context. When sweeping over values of K, we find that GPT-3 requires fewer than eight total examples per task to outperform a fine-tuned BERT-Large on overall SuperGLUE score.



3.8 NLI (Natural Language Inference)


Natural Language Inference (NLI) [Fyo00] concerns the ability to understand the relationship between two sentences. In practice, this task is usually structured as a two- or three-class classification problem where the model classifies whether the second sentence logically follows from the first, contradicts the first sentence, or is possibly true (neutral). SuperGLUE includes an NLI dataset, RTE, which evaluates the binary version of the task. On RTE, only the largest version of GPT-3 performs convincingly better than random (56%) in any evaluation setting, but in a few-shot setting GPT-3 performs similarly to a single-task fine-tuned BERT Large. We also evaluate on the recently introduced Adversarial Natural Language Inference (ANLI) dataset [NWD+19]. ANLI is a difficult dataset employing a series of adversarially mined natural language inference questions in three rounds (R1, R2, and R3). Similar to RTE, all of our models smaller than GPT-3 perform at almost exactly random chance on ANLI, even in the few-shot setting (∼33%), whereas GPT-3 itself shows signs of life on Round 3. Results for ANLI R3 are highlighted in Figure 3.9 and full results for all rounds can be found in Appendix H. These results on both RTE and ANLI suggest that NLI is still a very difficult task for language models and they are only just beginning to show signs of progress.
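To make the "random chance" reference point concrete: with three balanced classes (entailment, contradiction, neutral), uniform guessing yields about 33% accuracy. The sketch below computes that baseline and, as an additional assumption not described in the paper, a simple normal-approximation z-score for judging how far an observed accuracy sits from chance. The function names are hypothetical and no numbers from the paper are used.

```python
import math


def chance_accuracy(num_classes: int) -> float:
    """Expected accuracy of uniform random guessing on balanced classes."""
    return 1.0 / num_classes


def z_score_vs_chance(accuracy: float, num_examples: int, chance: float) -> float:
    """Standard deviations separating an observed accuracy from the chance level.

    Uses the normal approximation to the binomial, which is reasonable only
    for fairly large num_examples.
    """
    standard_error = math.sqrt(chance * (1.0 - chance) / num_examples)
    return (accuracy - chance) / standard_error


if __name__ == "__main__":
    three_way = chance_accuracy(3)
    print(f"three-class chance level: {three_way:.3f}")
    # Illustrative inputs only, not results from the paper.
    print(z_score_vs_chance(accuracy=0.40, num_examples=1000, chance=three_way))
```

A z-score near zero means the model is indistinguishable from guessing on that evaluation set; larger values indicate a gap that is less likely to be noise.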



3.9 Synthetic and Qualitative Tasks


One way to probe GPT-3's range of abilities in the few-shot (or zero- and one-shot) setting is to give it tasks which require it to perform simple on-the-fly computational reasoning, recognize a novel pattern that is unlikely to have occurred in training, or adapt quickly to an unusual task. We devise several tasks to test this class of abilities. First, we test GPT-3's ability to perform arithmetic. Second, we create several tasks that involve rearranging or unscrambling the letters in a word, tasks which are unlikely to have been exactly seen during training. Third, we test GPT-3's ability to solve SAT-style analogy problems few-shot. Finally, we test GPT-3 on several qualitative tasks, including using new words in a sentence, correcting English grammar, and news article generation. We will release the synthetic datasets with the hope of stimulating further study of test-time behavior of language models.
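Synthetic tasks of the first two kinds are easy to generate programmatically. The following is a minimal sketch, assuming a simple question-and-answer template for addition and a uniform letter shuffle for scrambling; the templates, function names, and parameters are illustrative assumptions rather than the datasets the authors released.

```python
import random
from typing import Tuple


def make_addition_problem(rng: random.Random, digits: int = 2) -> Tuple[str, str]:
    """Return a (question, answer) pair for adding two numbers of up to `digits` digits."""
    a = rng.randrange(10 ** digits)
    b = rng.randrange(10 ** digits)
    return f"Q: What is {a} plus {b}?", f"A: {a + b}"


def scramble_word(word: str, rng: random.Random) -> str:
    """Return the letters of `word` in a random order (may equal the original by chance)."""
    letters = list(word)
    rng.shuffle(letters)
    return "".join(letters)


def make_unscramble_problem(word: str, rng: random.Random) -> Tuple[str, str]:
    """Return a (scrambled, original) pair; the model is asked to recover the original."""
    return scramble_word(word, rng), word


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the generated set is reproducible
    question, answer = make_addition_problem(rng)
    print(question, answer)
    scrambled, target = make_unscramble_problem("language", rng)
    print(f"{scrambled} = {target}")
```

Because the problems are generated from a seeded random source rather than scraped text, any particular instance is unlikely to appear verbatim in a web-scale training corpus, which is the property the paragraph above relies on.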


