Multiple choice
Original text: huggingface.co/docs/transformers/v4.37.2/en/tasks/multiple_choice
A multiple choice task is similar to question answering, except several candidate answers are provided along with a context, and the model is trained to select the correct answer.
This guide will show you how to:
1. Finetune BERT on the regular configuration of the SWAG dataset to select the best answer given multiple options and some context.
2. Use your finetuned model for inference.
The task illustrated in this tutorial is supported by the following model architectures:
ALBERT、BERT、BigBird、CamemBERT、CANINE、ConvBERT、Data2VecText、DeBERTa-v2、DistilBERT、ELECTRA、ERNIE、ErnieM、FlauBERT、FNet、Funnel Transformer、I-BERT、Longformer、LUKE、MEGA、Megatron-BERT、MobileBERT、MPNet、MRA、Nezha、Nyströmformer、QDQBert、RemBERT、RoBERTa、RoBERTa-PreLayerNorm、RoCBert、RoFormer、SqueezeBERT、XLM、XLM-RoBERTa、XLM-RoBERTa-XL、XLNet、X-MOD、YOSO
Before you begin, make sure you have all the necessary libraries installed:
pip install transformers datasets evaluate
We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:
>>> from huggingface_hub import notebook_login

>>> notebook_login()
Load SWAG dataset
Start by loading the regular configuration of the SWAG dataset from the 🤗 Datasets library:
>>> from datasets import load_dataset

>>> swag = load_dataset("swag", "regular")
Then take a look at an example:
>>> swag["train"][0] {'ending0': 'passes by walking down the street playing their instruments.', 'ending1': 'has heard approaching them.', 'ending2': "arrives and they're outside dancing and asleep.", 'ending3': 'turns the lead singer watches the performance.', 'fold-ind': '3416', 'gold-source': 'gold', 'label': 0, 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.', 'sent2': 'A drum line', 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line', 'video-id': 'anetv_jkn6uvmqwh4'}
While it looks like there are a lot of fields here, it is actually pretty straightforward:
- sent1 and sent2: these fields show how a sentence starts, and putting the two together gives you the startphrase field.
- ending: suggests a possible ending for the sentence, but only one of them is correct.
- label: identifies the correct sentence ending.
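To make the relationship between these fields concrete, here is a small illustrative check (not part of the original guide) on the example above: joining sent1 and sent2 with a space reproduces startphrase, and label indexes the matching ending field:

>>> example = swag["train"][0]
>>> example["sent1"] + " " + example["sent2"] == example["startphrase"]
True
>>> example[f"ending{example['label']}"]
'passes by walking down the street playing their instruments.'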
Preprocess
The next step is to load a BERT tokenizer to process the sentence starts and the four possible endings:
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
The preprocessing function you want to create needs to:
1. Make four copies of the sent1 field and combine each of them with sent2 to recreate how a sentence starts.
2. Combine sent2 with each of the four possible sentence endings.
3. Flatten these two lists so you can tokenize them, and then unflatten them afterward so each example has corresponding input_ids, attention_mask, and labels fields.
>>> ending_names = ["ending0", "ending1", "ending2", "ending3"]

>>> def preprocess_function(examples):
...     first_sentences = [[context] * 4 for context in examples["sent1"]]
...     question_headers = examples["sent2"]
...     second_sentences = [
...         [f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)
...     ]
...
...     first_sentences = sum(first_sentences, [])
...     second_sentences = sum(second_sentences, [])
...
...     tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
...     return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}
To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once:
tokenized_swag = swag.map(preprocess_function, batched=True)
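As a quick sanity check (illustrative, not part of the original guide), each processed example should now contain four candidate sequences, one per possible ending:

>>> len(tokenized_swag["train"][0]["input_ids"])
4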
🤗 Transformers doesn't have a data collator for multiple choice, so you'll need to adapt DataCollatorWithPadding to create a batch of examples. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
DataCollatorForMultipleChoice flattens all the model inputs, applies padding, and then unflattens the results:
Pytorch
>>> from dataclasses import dataclass
>>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
>>> from typing import Optional, Union
>>> import torch

>>> @dataclass
... class DataCollatorForMultipleChoice:
...     """
...     Data collator that will dynamically pad the inputs for multiple choice received.
...     """
...
...     tokenizer: PreTrainedTokenizerBase
...     padding: Union[bool, str, PaddingStrategy] = True
...     max_length: Optional[int] = None
...     pad_to_multiple_of: Optional[int] = None
...
...     def __call__(self, features):
...         label_name = "label" if "label" in features[0].keys() else "labels"
...         labels = [feature.pop(label_name) for feature in features]
...         batch_size = len(features)
...         num_choices = len(features[0]["input_ids"])
...         flattened_features = [
...             [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
...         ]
...         flattened_features = sum(flattened_features, [])
...
...         batch = self.tokenizer.pad(
...             flattened_features,
...             padding=self.padding,
...             max_length=self.max_length,
...             pad_to_multiple_of=self.pad_to_multiple_of,
...             return_tensors="pt",
...         )
...
...         batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
...         batch["labels"] = torch.tensor(labels, dtype=torch.int64)
...         return batch
TensorFlow
>>> from dataclasses import dataclass
>>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
>>> from typing import Optional, Union
>>> import tensorflow as tf

>>> @dataclass
... class DataCollatorForMultipleChoice:
...     """
...     Data collator that will dynamically pad the inputs for multiple choice received.
...     """
...
...     tokenizer: PreTrainedTokenizerBase
...     padding: Union[bool, str, PaddingStrategy] = True
...     max_length: Optional[int] = None
...     pad_to_multiple_of: Optional[int] = None
...
...     def __call__(self, features):
...         label_name = "label" if "label" in features[0].keys() else "labels"
...         labels = [feature.pop(label_name) for feature in features]
...         batch_size = len(features)
...         num_choices = len(features[0]["input_ids"])
...         flattened_features = [
...             [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
...         ]
...         flattened_features = sum(flattened_features, [])
...
...         batch = self.tokenizer.pad(
...             flattened_features,
...             padding=self.padding,
...             max_length=self.max_length,
...             pad_to_multiple_of=self.pad_to_multiple_of,
...             return_tensors="tf",
...         )
...
...         batch = {k: tf.reshape(v, (batch_size, num_choices, -1)) for k, v in batch.items()}
...         batch["labels"] = tf.convert_to_tensor(labels, dtype=tf.int64)
...         return batch
Evaluate
Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the accuracy metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):
>>> import evaluate

>>> accuracy = evaluate.load("accuracy")
Then create a function that passes your predictions and labels to compute to calculate the accuracy:
>>> import numpy as np

>>> def compute_metrics(eval_pred):
...     predictions, labels = eval_pred
...     predictions = np.argmax(predictions, axis=1)
...     return accuracy.compute(predictions=predictions, references=labels)
Your compute_metrics function is ready to go now, and you'll return to it when you set up your training.
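If you want to see what compute_metrics returns before training, you can call it on a couple of dummy predictions (a hypothetical example, reusing the np import from above):

>>> dummy_logits = np.array([[0.1, 0.9, 0.0, 0.0], [0.8, 0.1, 0.05, 0.05]])
>>> dummy_labels = np.array([1, 0])
>>> compute_metrics((dummy_logits, dummy_labels))
{'accuracy': 1.0}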
Train
Pytorch
If you aren't familiar with finetuning a model with the Trainer, take a look at the basic tutorial here!
You're ready to start training your model now! Load BERT with AutoModelForMultipleChoice:
>>> from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer

>>> model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")
At this point, only three steps remain:
1. Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. You'll push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the Trainer will evaluate the accuracy and save the training checkpoint.
2. Pass the training arguments to the Trainer along with the model, dataset, tokenizer, data collator, and the compute_metrics function.
3. Call train() to finetune your model.
>>> training_args = TrainingArguments(
...     output_dir="my_awesome_swag_model",
...     evaluation_strategy="epoch",
...     save_strategy="epoch",
...     load_best_model_at_end=True,
...     learning_rate=5e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     num_train_epochs=3,
...     weight_decay=0.01,
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_swag["train"],
...     eval_dataset=tokenized_swag["validation"],
...     tokenizer=tokenizer,
...     data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
...     compute_metrics=compute_metrics,
... )

>>> trainer.train()
Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use your model:
>>> trainer.push_to_hub()
TensorFlow
If you aren't familiar with finetuning a model with Keras, take a look at the basic tutorial here!
To finetune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters:
>>> from transformers import create_optimizer

>>> batch_size = 16
>>> num_train_epochs = 2
>>> total_train_steps = (len(tokenized_swag["train"]) // batch_size) * num_train_epochs
>>> optimizer, schedule = create_optimizer(init_lr=5e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
Then you can load BERT with TFAutoModelForMultipleChoice:
>>> from transformers import TFAutoModelForMultipleChoice

>>> model = TFAutoModelForMultipleChoice.from_pretrained("bert-base-uncased")
Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():
>>> data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
>>> tf_train_set = model.prepare_tf_dataset(
...     tokenized_swag["train"],
...     shuffle=True,
...     batch_size=batch_size,
...     collate_fn=data_collator,
... )

>>> tf_validation_set = model.prepare_tf_dataset(
...     tokenized_swag["validation"],
...     shuffle=False,
...     batch_size=batch_size,
...     collate_fn=data_collator,
... )
Configure the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:
>>> model.compile(optimizer=optimizer) # No loss argument!
The last two things to set up before you start training are computing the accuracy from the predictions, and providing a way to push your model to the Hub. Both are done by using Keras callbacks.
Pass your compute_metrics function to KerasMetricCallback:
>>> from transformers.keras_callbacks import KerasMetricCallback

>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)
Specify where to push your model and tokenizer in the PushToHubCallback:
>>> from transformers.keras_callbacks import PushToHubCallback

>>> push_to_hub_callback = PushToHubCallback(
...     output_dir="my_awesome_model",
...     tokenizer=tokenizer,
... )
Then bundle your callbacks together:
>>> callbacks = [metric_callback, push_to_hub_callback]
Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callbacks to finetune the model:
>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=2, callbacks=callbacks)
Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!
For a more in-depth example of how to finetune a model for multiple choice, take a look at the corresponding PyTorch notebook or TensorFlow notebook.
Inference
Great, now that you've finetuned a model, you can use it for inference!
Come up with some text and two candidate answers:
>>> prompt = "France has a bread law, Le Décret Pain, with strict rules on what is allowed in a traditional baguette."
>>> candidate1 = "The law does not apply to croissants and brioche."
>>> candidate2 = "The law applies to baguettes."
Pytorch
Tokenize each prompt and candidate answer pair and return PyTorch tensors. You should also create some labels:
>>> from transformers import AutoTokenizer
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_swag_model")
>>> inputs = tokenizer([[prompt, candidate1], [prompt, candidate2]], return_tensors="pt", padding=True)
>>> labels = torch.tensor(0).unsqueeze(0)
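As a quick illustrative check (not part of the original guide): the tokenizer returns one row per prompt/candidate pair, which is why the inputs are expanded with unsqueeze into a (batch_size, num_choices, sequence_length) shape before being passed to the model below:

>>> inputs["input_ids"].shape[0]  # one row per candidate pair
2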
Pass your inputs and labels to the model and return the logits:
>>> from transformers import AutoModelForMultipleChoice

>>> model = AutoModelForMultipleChoice.from_pretrained("my_awesome_swag_model")
>>> outputs = model(**{k: v.unsqueeze(0) for k, v in inputs.items()}, labels=labels)
>>> logits = outputs.logits
Get the class with the highest probability:
>>> predicted_class = logits.argmax().item()
>>> predicted_class
0
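To map the predicted index back to the candidate text, you can index into the list of candidates (a small convenience step, not part of the original example):

>>> [candidate1, candidate2][predicted_class]
'The law does not apply to croissants and brioche.'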
TensorFlow
Tokenize each prompt and candidate answer pair and return TensorFlow tensors:
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_swag_model")
>>> inputs = tokenizer([[prompt, candidate1], [prompt, candidate2]], return_tensors="tf", padding=True)
Pass your inputs to the model and return the logits:
>>> from transformers import TFAutoModelForMultipleChoice
>>> import tensorflow as tf

>>> model = TFAutoModelForMultipleChoice.from_pretrained("my_awesome_swag_model")
>>> inputs = {k: tf.expand_dims(v, 0) for k, v in inputs.items()}
>>> outputs = model(inputs)
>>> logits = outputs.logits
Get the class with the highest probability:
>>> predicted_class = int(tf.math.argmax(logits, axis=-1)[0])
>>> predicted_class
0