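Causal language modeling
Original text:
huggingface.co/docs/transformers/v4.37.2/en/tasks/language_modeling
There are two types of language modeling: causal and masked. This guide covers causal language modeling. Causal language models are frequently used for text generation. You can use them for creative applications such as a choose-your-own text adventure, or for an intelligent coding assistant like Copilot or CodeParrot.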
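Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model.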
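The left-only attention described above is often pictured as a lower-triangular mask. The following is a minimal pure-Python sketch of that idea; the function name `causal_mask` is made up for illustration and is not part of the Transformers API.

```python
def causal_mask(seq_len):
    """Build a seq_len x seq_len mask: entry [i][j] is 1 when position i may attend to position j."""
    # Position i can see itself and everything to its left (j <= i), never the future.
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Each row corresponds to one position in the sequence; the zeros above the diagonal are the future tokens the model is not allowed to see.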
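This guide will show you how to:
- Finetune DistilGPT2 on the r/askscience subset of the ELI5 dataset.
- Use your finetuned model for inference.
You can follow the same steps in this guide to finetune other architectures for causal language modeling. Choose one of the following architectures: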
BART, BERT, Bert Generation, BigBird, BigBird-Pegasus, BioGpt, Blenderbot, BlenderbotSmall, BLOOM, CamemBERT, CodeLlama, CodeGen, CPM-Ant, CTRL, Data2VecText, ELECTRA, ERNIE, Falcon, Fuyu, GIT, GPT-Sw3, OpenAI GPT-2, GPTBigCode, GPT Neo, GPT NeoX, GPT NeoX Japanese, GPT-J, LLaMA, Marian, mBART, MEGA, Megatron-BERT, Mistral, Mixtral, MPT, MusicGen, MVP, OpenLlama, OpenAI GPT, OPT, Pegasus, Persimmon, Phi, PLBart, ProphetNet, QDQBert, Qwen2, Reformer, RemBERT, RoBERTa, RoBERTa-PreLayerNorm, RoCBert, RoFormer, RWKV, Speech2Text2, Transformer-XL, TrOCR, Whisper, XGLM, XLM, XLM-ProphetNet, XLM-RoBERTa, XLM-RoBERTa-XL, XLNet, X-MOD
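Before you begin, make sure you have all the necessary libraries installed:
pip install transformers datasets evaluate
We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:
>>> from huggingface_hub import notebook_login
>>> notebook_login()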
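Load ELI5 dataset
Start by loading a smaller subset of the r/askscience subset of the ELI5 dataset from the 🤗 Datasets library. This gives you a chance to experiment and make sure everything works before spending more time training on the full dataset.
>>> from datasets import load_dataset
>>> eli5 = load_dataset("eli5", split="train_asks[:5000]")
Split the dataset's train_asks split into a train and test set with the train_test_split method:
>>> eli5 = eli5.train_test_split(test_size=0.2)
Then take a look at an example: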
>>> eli5["train"][0]
{'answers': {'a_id': ['c3d1aib', 'c3d4lya'], 'score': [6, 3], 'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.", "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]}, 'answers_urls': {'url': []}, 'document': '', 'q_id': 'nyxfp', 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?', 'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']}, 'subreddit': 'askscience', 'title': 'Few questions about this space walk photograph.', 'title_urls': {'url': []}}
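While this may look like a lot, you're only really interested in the text field. What's cool about language modeling tasks is that you don't need labels (this is also known as an unsupervised task) because the next word is the label.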
Preprocess
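The next step is to load a DistilGPT2 tokenizer to process the text subfield:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
You'll notice from the example above that the text field is actually nested inside answers. This means you'll need to extract the text subfield from its nested structure with the flatten method: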
>>> eli5 = eli5.flatten()
>>> eli5["train"][0]
{'answers.a_id': ['c3d1aib', 'c3d4lya'], 'answers.score': [6, 3], 'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.", "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"], 'answers_urls.url': [], 'document': '', 'q_id': 'nyxfp', 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?', 'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'], 'subreddit': 'askscience', 'title': 'Few questions about this space walk photograph.', 'title_urls.url': []}
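Each subfield is now a separate column, as indicated by the answers prefix, and the text field is now a list. Instead of tokenizing each sentence separately, convert the list to a string so you can tokenize them jointly.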
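To make the effect of flattening concrete, here is a small pure-Python sketch that turns a nested record into dotted column names. It only illustrates the idea; `flatten_record` is a hypothetical helper, not how the 🤗 Datasets library implements flatten.

```python
def flatten_record(record, parent_key=""):
    """Flatten nested dicts into a single dict with dotted keys such as 'answers.text'."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested structures, carrying the prefix along.
            flat.update(flatten_record(value, name))
        else:
            flat[name] = value
    return flat


example = {"q_id": "nyxfp", "answers": {"a_id": ["c3d1aib", "c3d4lya"], "score": [6, 3]}}
flat = flatten_record(example)
print(flat)
# {'q_id': 'nyxfp', 'answers.a_id': ['c3d1aib', 'c3d4lya'], 'answers.score': [6, 3]}
```

Note that list values such as answers.score stay lists; only the dictionary nesting is removed, which is why the text field still needs to be joined into one string afterwards.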
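Here is a first preprocessing function that joins the list of strings for each example and tokenizes the result:
>>> def preprocess_function(examples):
...     return tokenizer([" ".join(x) for x in examples["answers.text"]])
To apply this preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up map by setting batched=True to process multiple elements of the dataset at once, and by increasing the number of processes with num_proc. Remove any columns you don't need:
>>> tokenized_eli5 = eli5.map(
...     preprocess_function,
...     batched=True,
...     num_proc=4,
...     remove_columns=eli5["train"].column_names,
... )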
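This dataset contains the token sequences, but some of them are longer than the model's maximum input length.
You can now use a second preprocessing function to:
- concatenate all the sequences
- split the concatenated sequences into shorter chunks defined by block_size, which should be both shorter than the maximum input length and short enough for your GPU RAM.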
>>> block_size = 128
>>> def group_texts(examples):
...     # Concatenate all texts.
...     concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
...     total_length = len(concatenated_examples[list(examples.keys())[0]])
...     # We drop the small remainder; we could add padding instead if the model supported it.
...     # You can customize this part to your needs.
...     if total_length >= block_size:
...         total_length = (total_length // block_size) * block_size
...     # Split by chunks of block_size.
...     result = {
...         k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
...         for k, t in concatenated_examples.items()
...     }
...     result["labels"] = result["input_ids"].copy()
...     return result
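To see what the grouping step does, it may help to run the same logic on a toy batch. The sketch below is a slight variant of the function above: it takes block_size as a parameter (an assumption made here for convenience) and uses made-up token ids and a block size of 4 instead of 128.

```python
def group_texts(examples, block_size):
    # Concatenate all sequences in the batch, column by column.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Toy batch: three "tokenized" examples with invented ids (10 tokens in total).
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
grouped = group_texts(toy_batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(grouped["labels"])     # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The ten tokens become two blocks of four, and the last two tokens (9 and 10) are dropped. Example boundaries are not preserved: the first block mixes tokens from the first and second examples.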
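Apply the group_texts function over the entire dataset:
>>> lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)
Now create a batch of examples using DataCollatorForLanguageModeling. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.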
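Dynamic padding can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the collator's actual code; `pad_batch` and the padding id 0 are assumptions chosen for the example.

```python
def pad_batch(sequences, pad_id):
    """Pad every sequence to the longest length found in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[5, 6, 7], [8, 9], [10]], pad_id=0)
print(batch["input_ids"])       # [[5, 6, 7], [8, 9, 0], [10, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0], [1, 0, 0]]
```

Because the target length is computed per batch, a batch of short sequences is never padded out to the length of the longest sequence in the whole dataset.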
PyTorch
Use the end-of-sequence token as the padding token and set mlm=False. This will use the inputs as labels, shifted to the right by one element:
>>> from transformers import DataCollatorForLanguageModeling
>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
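The "shifted by one element" relationship between inputs and labels can be shown with a toy sequence. This is a conceptual sketch with invented token ids; `next_token_pairs` is a hypothetical helper, and where exactly the shift is applied in the library is not shown here.

```python
def next_token_pairs(input_ids):
    """Pair each input position with the token it is expected to predict."""
    labels = list(input_ids)   # labels start out as a copy of the inputs
    contexts = input_ids[:-1]  # the last position has nothing left to predict
    targets = labels[1:]       # each position predicts the token one step to its right
    return list(zip(contexts, targets))


pairs = next_token_pairs([10, 11, 12, 13])
print(pairs)  # [(10, 11), (11, 12), (12, 13)]
```

A sequence of n tokens therefore yields n - 1 training targets, which is why no separate label column has to be annotated by hand.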
TensorFlow
Use the end-of-sequence token as the padding token and set mlm=False. This will use the inputs as labels, shifted to the right by one element:
>>> from transformers import DataCollatorForLanguageModeling
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")
Train
PyTorch
If you aren't familiar with finetuning a model with the Trainer, take a look at the basic tutorial!
You're ready to start training your model now! Load DistilGPT2 with AutoModelForCausalLM:
>>> from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
>>> model = AutoModelForCausalLM.from_pretrained("distilgpt2")
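At this point, only three steps remain:
- Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. Push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model).
- Pass the training arguments to Trainer along with the model, datasets, and data collator.
- Call train() to finetune your model.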
>>> training_args = TrainingArguments(
...     output_dir="my_awesome_eli5_clm-model",
...     evaluation_strategy="epoch",
...     learning_rate=2e-5,
...     weight_decay=0.01,
...     push_to_hub=True,
... )
>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=lm_dataset["train"],
...     eval_dataset=lm_dataset["test"],
...     data_collator=data_collator,
... )
>>> trainer.train()
Once training is completed, use the evaluate() method to evaluate your model and get its perplexity:
>>> import math
>>> eval_results = trainer.evaluate()
>>> print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
Perplexity: 49.61
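Perplexity is the exponential of the average per-token cross-entropy loss, which is why the snippet above applies math.exp to eval_loss. The sketch below works this out for a toy case; the probabilities are invented, and `perplexity` is a helper written for this example.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to the correct tokens."""
    # Cross-entropy loss per token is the negative log-probability of the correct token.
    losses = [-math.log(p) for p in token_probs]
    mean_loss = sum(losses) / len(losses)
    return math.exp(mean_loss)


# A model that gives every correct token probability 1/4 is as uncertain as a
# uniform choice among 4 tokens, so its perplexity is 4.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
print(round(ppl, 2))  # 4.0
```

Read the other way around, the reported perplexity of 49.61 corresponds to an evaluation loss of about 3.90, since math.log(49.61) is roughly 3.90.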
Then share your model to the Hub with the push_to_hub() method so everyone can use your model:
>>> trainer.push_to_hub()
TensorFlow
If you aren't familiar with finetuning a model with Keras, take a look at the basic tutorial!
To finetune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters:
>>> from transformers import create_optimizer, AdamWeightDecay
>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
Then you can load DistilGPT2 with TFAutoModelForCausalLM:
>>> from transformers import TFAutoModelForCausalLM
>>> model = TFAutoModelForCausalLM.from_pretrained("distilgpt2")
Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():
>>> tf_train_set = model.prepare_tf_dataset(
...     lm_dataset["train"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )
>>> tf_test_set = model.prepare_tf_dataset(
...     lm_dataset["test"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )
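Configure the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:
>>> import tensorflow as tf
>>> model.compile(optimizer=optimizer)  # No loss argument!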
To share your model on the Hub, specify where to push your model and tokenizer in the PushToHubCallback:
>>> from transformers.keras_callbacks import PushToHubCallback
>>> callback = PushToHubCallback(
...     output_dir="my_awesome_eli5_clm-model",
...     tokenizer=tokenizer,
... )
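Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callback to finetune the model:
>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback])
Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!
For a more in-depth example of how to finetune a model for causal language modeling, take a look at the corresponding PyTorch notebook or TensorFlow notebook.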
Inference
Great, now that you've finetuned a model, you can use it for inference!
Come up with a prompt you'd like to generate text from:
>>> prompt = "Somatic hypermutation allows the immune system to"
The simplest way to try out your finetuned model for inference is to use it in a pipeline(). Instantiate a pipeline for text generation with your model, and pass your text to it:
>>> from transformers import pipeline
>>> generator = pipeline("text-generation", model="my_awesome_eli5_clm-model")
>>> generator(prompt)
[{'generated_text': "Somatic hypermutation allows the immune system to be able to effectively reverse the damage caused by an infection.\n\n\nThe damage caused by an infection is caused by the immune system's ability to perform its own self-correcting tasks."}]
PyTorch
Tokenize the text and return the input_ids as PyTorch tensors:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model")
>>> inputs = tokenizer(prompt, return_tensors="pt").input_ids
Use the generate() method to generate text. For more details about the different text generation strategies and the parameters for controlling generation, check out the Text generation strategies page.
>>> from transformers import AutoModelForCausalLM
>>> model = AutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model")
>>> outputs = model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
Decode the generated token ids back into text:
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
["Somatic hypermutation allows the immune system to react to drugs with the ability to adapt to a different environmental situation. In other words, a system of 'hypermutation' can help the immune system to adapt to a different environmental situation or in some cases even a single life. In contrast, researchers at the University of Massachusetts-Boston have found that 'hypermutation' is much stronger in mice than in humans but can be found in humans, and that it's not completely unknown to the immune system. A study on how the immune system"]
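The top_k and top_p arguments passed to generate() restrict which tokens can be sampled at each step. The following is a simplified pure-Python sketch of that filtering on an invented four-word vocabulary; `filter_top_k_top_p` is a hypothetical helper, and the library's own implementation works on logits and handles edge cases differently.

```python
import random


def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix whose mass reaches top_p."""
    # Top-k: rank tokens by probability and keep only the k best.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(token, p / total) for token, p in ranked]  # renormalize after top-k
    # Top-p (nucleus): accumulate probability until the threshold is reached.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
filtered = filter_top_k_top_p(probs, top_k=3, top_p=0.7)
print(sorted(filtered))  # ['a', 'the']

# Sampling then draws from the filtered distribution only.
rng = random.Random(0)
next_token = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
```

Here top_k=3 removes "zebra", and top_p=0.7 then keeps only "the" and "a", because together they already cover more than 70% of the remaining probability mass.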
TensorFlow
Tokenize the text and return the input_ids as TensorFlow tensors:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model")
>>> inputs = tokenizer(prompt, return_tensors="tf").input_ids
Use the generate() method to generate text. For more details about the different text generation strategies and the parameters for controlling generation, check out the Text generation strategies page.
>>> from transformers import TFAutoModelForCausalLM
>>> model = TFAutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model")
>>> outputs = model.generate(input_ids=inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
Decode the generated token ids back into text:
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Somatic hypermutation allows the immune system to detect the presence of other viruses as they become more prevalent. Therefore, researchers have identified a high proportion of human viruses. The proportion of virus-associated viruses in our study increases with age. Therefore, we propose a simple algorithm to detect the presence of these new viruses in our samples as a sign of improved immunity. A first study based on this algorithm, which will be published in Science on Friday, aims to show that this finding could translate into the development of a better vaccine that is more effective for']
Masked language modeling
Original text:
huggingface.co/docs/transformers/v4.37.2/en/tasks/masked_language_modeling
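Masked language modeling predicts a masked token in a sequence, and the model can attend to tokens bidirectionally. This means the model has full access to the tokens on the left and right. Masked language modeling is great for tasks that require a good contextual understanding of an entire sequence. BERT is an example of a masked language model.
This guide will show you how to:
- Finetune DistilRoBERTa on the r/askscience subset of the ELI5 dataset.
- Use your finetuned model for inference.
You can follow the same steps in this guide to finetune other architectures for masked language modeling. Choose one of the following architectures: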
ALBERT, BART, BERT, BigBird, CamemBERT, ConvBERT, Data2VecText, DeBERTa, DeBERTa-v2, DistilBERT, ELECTRA, ERNIE, ESM, FlauBERT, FNet, Funnel Transformer, I-BERT, LayoutLM, Longformer, LUKE, mBART, MEGA, Megatron-BERT, MobileBERT, MPNet, MRA, MVP, Nezha, Nyströmformer, Perceiver, QDQBert, Reformer, RemBERT, RoBERTa, RoBERTa-PreLayerNorm, RoCBert, RoFormer, SqueezeBERT, TAPAS, Wav2Vec2, XLM, XLM-RoBERTa, XLM-RoBERTa-XL, X-MOD, YOSO
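Before you begin, make sure you have all the necessary libraries installed:
pip install transformers datasets evaluate
We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:
>>> from huggingface_hub import notebook_login
>>> notebook_login()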
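Load ELI5 dataset
Start by loading a smaller subset of the r/askscience subset of the ELI5 dataset from the 🤗 Datasets library. This gives you a chance to experiment and make sure everything works before spending more time training on the full dataset.
>>> from datasets import load_dataset
>>> eli5 = load_dataset("eli5", split="train_asks[:5000]")
Split the dataset's train_asks split into a train and test set with the train_test_split method:
>>> eli5 = eli5.train_test_split(test_size=0.2)
Then take a look at an example: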
>>> eli5["train"][0]
{'answers': {'a_id': ['c3d1aib', 'c3d4lya'], 'score': [6, 3], 'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.", "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]}, 'answers_urls': {'url': []}, 'document': '', 'q_id': 'nyxfp', 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?', 'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']}, 'subreddit': 'askscience', 'title': 'Few questions about this space walk photograph.', 'title_urls': {'url': []}}
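While this may look like a lot, you're only really interested in the text field. What's cool about language modeling tasks is that you don't need labels (this is also known as an unsupervised task) because the text itself supplies them: for masked language modeling, the original tokens at the masked positions are the labels.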
Preprocess
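For masked language modeling, the next step is to load a DistilRoBERTa tokenizer to process the text subfield:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
You'll notice from the example above that the text field is actually nested inside answers. This means you'll need to extract the text subfield from its nested structure with the flatten method: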
>>> eli5 = eli5.flatten()
>>> eli5["train"][0]
{'answers.a_id': ['c3d1aib', 'c3d4lya'], 'answers.score': [6, 3], 'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.", "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"], 'answers_urls.url': [], 'document': '', 'q_id': 'nyxfp', 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?', 'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'], 'subreddit': 'askscience', 'title': 'Few questions about this space walk photograph.', 'title_urls.url': []}
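Each subfield is now a separate column, as indicated by the answers prefix, and the text field is now a list. Instead of tokenizing each sentence separately, convert the list to a string so you can tokenize them jointly.
Here is a first preprocessing function that joins the list of strings for each example and tokenizes the result:
>>> def preprocess_function(examples):
...     return tokenizer([" ".join(x) for x in examples["answers.text"]])
To apply this preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up map by setting batched=True to process multiple elements of the dataset at once, and by increasing the number of processes with num_proc. Remove any columns you don't need:
>>> tokenized_eli5 = eli5.map(
...     preprocess_function,
...     batched=True,
...     num_proc=4,
...     remove_columns=eli5["train"].column_names,
... )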
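This dataset contains the token sequences, but some of them are longer than the model's maximum input length.
You can now use a second preprocessing function to:
- concatenate all the sequences
- split the concatenated sequences into shorter chunks defined by block_size, which should be both shorter than the maximum input length and short enough for your GPU RAM.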
>>> block_size = 128
>>> def group_texts(examples):
...     # Concatenate all texts.
...     concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
...     total_length = len(concatenated_examples[list(examples.keys())[0]])
...     # We drop the small remainder; we could add padding instead if the model supported it.
...     # You can customize this part to your needs.
...     if total_length >= block_size:
...         total_length = (total_length // block_size) * block_size
...     # Split by chunks of block_size.
...     result = {
...         k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
...         for k, t in concatenated_examples.items()
...     }
...     return result
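Apply the group_texts function over the entire dataset:
>>> lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)
Now create a batch of examples using DataCollatorForLanguageModeling. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.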
PyTorch
Use the end-of-sequence token as the padding token and specify mlm_probability to randomly mask tokens each time you iterate over the data:
>>> from transformers import DataCollatorForLanguageModeling
>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
TensorFlow
Use the end-of-sequence token as the padding token and specify mlm_probability to randomly mask tokens each time you iterate over the data:
>>> from transformers import DataCollatorForLanguageModeling
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15, return_tensors="tf")
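What mlm_probability controls can be sketched in plain Python. The example below selects each token with the given probability and then follows the BERT-style recipe of replacing 80% of the selected tokens with the mask token, 10% with a random token, and leaving 10% unchanged. The function `mask_tokens`, the mask id 4, the vocabulary size 50, and the token ids are all assumptions made for this illustration; it is not the collator's actual code.

```python
import random


def mask_tokens(input_ids, mask_id, vocab_size, mlm_probability=0.15, seed=0):
    """Return (masked_inputs, labels) for one sequence, BERT-style."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 marks positions the loss ignores
    for i, token in enumerate(input_ids):
        if rng.random() < mlm_probability:
            labels[i] = token  # the model must recover the original token here
            roll = rng.random()
            if roll < 0.8:
                masked[i] = mask_id                    # 80%: replace with the mask token
            elif roll < 0.9:
                masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return masked, labels


ids = list(range(100, 120))  # invented token ids
masked, labels = mask_tokens(ids, mask_id=4, vocab_size=50)
selected = [i for i, label in enumerate(labels) if label != -100]
print(f"{len(selected)} of {len(ids)} positions selected for prediction")
```

Because the selection is random, calling the function with a different seed masks different positions, which mirrors how the collator produces a fresh masking each time the data is iterated.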