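This notebook accompanies Chapter 4 of the code repository. The tutorial can also be opened as a Google Colab notebook, which downloads the required datasets and models. If you open it in Colab, you first need to install the 🤗 Transformers and 🤗 Datasets libraries.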
!pip install transformers datasets
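A multi-GPU distributed-training version of this notebook is also available here.
Task: fine-tuning a pretrained model for text classification
We show how to use models from the 🤗 Transformers library to solve text classification tasks taken from the GLUE Benchmark.
The GLUE benchmark contains nine sentence-level tasks: CoLA, MNLI, MRPC, QNLI, QQP, RTE, SST-2, STS-B, and WNLI.
[Hung-yi Lee, Deep Learning, CP18-19] Self-supervised learning with BERT: GLUE has nine tasks in total. To find out whether a model such as BERT has been trained well, you fine-tune it on each task separately, which gives nine models for nine tasks. The average accuracy over these nine tasks then represents the performance of the self-supervised model.
For these tasks, we show how to load the dataset with the 🤗 Datasets library and how to fine-tune a pretrained model with the Trainer interface from 🤗 Transformers.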
GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]
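In principle, this notebook works with any transformer model from the model hub and with any text classification task. If your task differs, only small changes are likely to be needed. You should also adjust the fine-tuning batch size to your GPU memory to avoid out-of-memory errors.
task = "cola"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16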
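1. Loading the data
1.1 Loading the dataset
We use the 🤗 Datasets library to load the data and the matching evaluation metric. Each takes a single call: load_dataset for the data and load_metric for the metric. Note that recent releases of 🤗 Datasets have moved metrics to the separate evaluate library, so load_metric may be unavailable there.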
from datasets import load_dataset, load_metric
Every task except mnli-mm can be loaded directly by its name; mnli-mm reuses the mnli data. Loaded data is cached automatically.
actual_task = "mnli" if task == "mnli-mm" else task
dataset = load_dataset("glue", actual_task)
metric = load_metric('glue', actual_task)
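The mnli-mm special case comes up again later when the validation split is chosen, so it can help to keep both lookups in one place. The following is a minimal sketch in plain Python; the helper name resolve_glue_task is hypothetical, and the split names are the ones this tutorial uses when it builds the Trainer.

```python
def resolve_glue_task(task):
    """Return (config_name, validation_split) for a task name used in this tutorial."""
    # "mnli-mm" is not a GLUE config of its own: it reuses the "mnli" data
    # and is evaluated on the mismatched validation split.
    config_name = "mnli" if task == "mnli-mm" else task
    if task == "mnli-mm":
        validation_split = "validation_mismatched"
    elif task == "mnli":
        validation_split = "validation_matched"
    else:
        validation_split = "validation"
    return config_name, validation_split


print(resolve_glue_task("cola"))     # ('cola', 'validation')
print(resolve_glue_task("mnli-mm"))  # ('mnli', 'validation_mismatched')
```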
The dataset object is a DatasetDict. Use the keys train, validation, and test to get the training, validation, and test sets.
1.2 Inspecting the data
dataset
DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})
Given a split key (train, validation, or test) and an index, you can look at a single example.
dataset["train"][0]
{'idx': 0, 'label': 1, 'sentence': "Our friends won't buy this analysis, let alone the next one we propose."}
To get a better sense of what the data looks like, the following function displays a few randomly chosen examples from the dataset.
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw distinct random row indices.
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    # Replace integer class labels with their readable names.
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
show_random_elements(dataset["train"])
1.3 Inspecting the evaluation metric
The metric is an instance of datasets.Metric:
metric
Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:

    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0}

    >>> glue_metric = datasets.load_metric('glue', 'mrpc')  # 'mrpc' or 'qqp'
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0, 'f1': 1.0}

    >>> glue_metric = datasets.load_metric('glue', 'stsb')
    >>> references = [0., 1., 2., 3., 4., 5.]
    >>> predictions = [0., 1., 2., 3., 4., 5.]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print({"pearson": round(results["pearson"], 2), "spearmanr": round(results["spearmanr"], 2)})
    {'pearson': 1.0, 'spearmanr': 1.0}

    >>> glue_metric = datasets.load_metric('glue', 'cola')
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'matthews_correlation': 1.0}
""", stored examples: 0)
Call the metric's compute method with predictions and references (the labels) to get the metric value, shown here on random fake data:
import numpy as np

fake_preds = np.random.randint(0, 2, size=(64,))
fake_labels = np.random.randint(0, 2, size=(64,))
metric.compute(predictions=fake_preds, references=fake_labels)
{'matthews_correlation': 0.1513518081969605}
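The metric differs from task to task: as the usage text above shows, CoLA reports the Matthews correlation, STS-B the Pearson and Spearman correlations, MRPC and QQP accuracy and F1, and the other tasks accuracy.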
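To make the CoLA number less opaque, here is a minimal pure-Python sketch of the Matthews correlation coefficient for the binary case, computed from the confusion-matrix counts. The function name matthews_corrcoef_binary is made up for this illustration, and this is not the code the GLUE metric runs internally.

```python
import math


def matthews_corrcoef_binary(predictions, references):
    """Matthews correlation coefficient for 0/1 predictions and labels."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)  # true positives
    tn = sum(1 for p, r in pairs if p == 0 and r == 0)  # true negatives
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)  # false positives
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)  # false negatives
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention: the score is 0 when any row or column of the
        # confusion matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


print(matthews_corrcoef_binary([0, 1, 1, 0], [0, 1, 1, 0]))  # 1.0
print(matthews_corrcoef_binary([1, 0, 0, 1], [0, 1, 1, 0]))  # -1.0
```

A value of 1 means perfect agreement, 0 is what random guessing tends toward, and -1 means every prediction is inverted, which is why the random fake predictions above score close to 0.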
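2. Data preprocessing
2.1 Building the tokenizer
Before feeding data to the model, we need to preprocess it.
The basic preprocessing flow is:
(1) tokenize; (2) convert the result into the input format the model expects for the task.
The preprocessing tool is the Tokenizer:
(1) the Tokenizer first tokenizes the input;
(2) it converts the tokens into the token IDs used by the pretrained model;
(3) it then converts them into the input format the model needs.
Recall that in the previous section we built a tokenizer with the from_pretrained method of BertTokenizer. BertTokenizer has the following commonly used methods:
from_pretrained: initializes a tokenizer from a directory containing the vocabulary file (vocab.txt);
tokenize: splits text (a word or a sentence) into a list of subwords;
convert_tokens_to_ids: converts a list of subwords into the list of their vocabulary indices;
convert_ids_to_tokens: the inverse of the previous method;
convert_tokens_to_string: joins a list of subwords back into words or a sentence by merging the "##" continuation pieces;
encode: for a single sentence, tokenizes it, adds special tokens to form "[CLS], x, [SEP]", and converts the result into a list of vocabulary indices; for two sentences (if more are given, only the first two are used), tokenizes them, adds special tokens to form "[CLS], x1, [SEP], x2, [SEP]", and converts the result into a list of indices;
decode: turns the output of encode back into a full sentence.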
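The methods listed above can be mimicked on a toy scale. The sketch below is not the library implementation: the ten-entry vocabulary is invented for the example, and the splitting rule is a simplified greedy longest-match WordPiece that marks an unsplittable word as [UNK].

```python
# Toy vocabulary (hypothetical, far smaller than a real vocab.txt).
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "play", "##ing", "##ed",
         "un", "##play", "##able"]
token_to_id = {token: index for index, token in enumerate(vocab)}


def tokenize_word(word):
    """Greedy longest-match-first split of one word into subwords."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in token_to_id:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


def tokenize(text):
    return [t for word in text.lower().split() for t in tokenize_word(word)]


def convert_tokens_to_ids(tokens):
    return [token_to_id.get(t, token_to_id["[UNK]"]) for t in tokens]


def convert_tokens_to_string(tokens):
    return " ".join(tokens).replace(" ##", "").strip()


def encode(text):
    # Single-sentence layout: [CLS] x [SEP]
    return convert_tokens_to_ids(["[CLS]"] + tokenize(text) + ["[SEP]"])


tokens = tokenize("playing unplayable")
print(tokens)                            # ['play', '##ing', 'un', '##play', '##able']
print(encode("playing unplayable"))      # [2, 4, 5, 7, 8, 9, 3]
print(convert_tokens_to_string(tokens))  # playing unplayable
```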
# Excerpt from the transformers source; module-level imports and constants are omitted.
class BertTokenizer(PreTrainedTokenizer):

    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES

    def __init__(
        self,
        vocab_file,
        do_lower_case=True,
        do_basic_tokenize=True,
        never_split=None,
        unk_token="[UNK]",
        sep_token="[SEP]",
        pad_token="[PAD]",
        cls_token="[CLS]",
        mask_token="[MASK]",
        tokenize_chinese_chars=True,
        strip_accents=None,
        **kwargs
    ):
        super().__init__(
            do_lower_case=do_lower_case,
            do_basic_tokenize=do_basic_tokenize,
            never_split=never_split,
            unk_token=unk_token,
            sep_token=sep_token,
            pad_token=pad_token,
            cls_token=cls_token,
            mask_token=mask_token,
            tokenize_chinese_chars=tokenize_chinese_chars,
            strip_accents=strip_accents,
            **kwargs,
        )

        if not os.path.isfile(vocab_file):
            raise ValueError(
                f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained "
                "model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`"
            )
        self.vocab = load_vocab(vocab_file)
        self.ids_to_tokens = collections.OrderedDict([(ids, tok) for tok, ids in self.vocab.items()])
        self.do_basic_tokenize = do_basic_tokenize
        if do_basic_tokenize:
            self.basic_tokenizer = BasicTokenizer(
                do_lower_case=do_lower_case,
                never_split=never_split,
                tokenize_chinese_chars=tokenize_chinese_chars,
                strip_accents=strip_accents,
            )
        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)

    @property
    def do_lower_case(self):
        return self.basic_tokenizer.do_lower_case

    @property
    def vocab_size(self):
        return len(self.vocab)

    def get_vocab(self):
        return dict(self.vocab, **self.added_tokens_encoder)

    def _tokenize(self, text):
        split_tokens = []
        if self.do_basic_tokenize:
            for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):

                # If the token is part of the never_split set
                if token in self.basic_tokenizer.never_split:
                    split_tokens.append(token)
                else:
                    split_tokens += self.wordpiece_tokenizer.tokenize(token)
        else:
            split_tokens = self.wordpiece_tokenizer.tokenize(text)
        return split_tokens

    def _convert_token_to_id(self, token):
        """Converts a token (str) in an id using the vocab."""
        return self.vocab.get(token, self.vocab.get(self.unk_token))

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        return self.ids_to_tokens.get(index, self.unk_token)

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        out_string = " ".join(tokens).replace(" ##", "").strip()
        return out_string

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
        adding special tokens. A BERT sequence has the following format:

        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``

        Args:
            token_ids_0 (:obj:`List[int]`):
                List of IDs to which the special tokens will be added.
            token_ids_1 (:obj:`List[int]`, `optional`):
                Optional second list of IDs for sequence pairs.

        Returns:
            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
        """
        if token_ids_1 is None:
            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
        cls = [self.cls_token_id]
        sep = [self.sep_token_id]
        return cls + token_ids_0 + sep + token_ids_1 + sep

    def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
        """
        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
        special tokens using the tokenizer ``prepare_for_model`` method.

        Args:
            token_ids_0 (:obj:`List[int]`):
                List of IDs.
            token_ids_1 (:obj:`List[int]`, `optional`):
                Optional second list of IDs for sequence pairs.
            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
                Whether or not the token list is already formatted with special tokens for the model.

        Returns:
            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
        """

        if already_has_special_tokens:
            return super().get_special_tokens_mask(
                token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
            )

        if token_ids_1 is not None:
            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
        return [1] + ([0] * len(token_ids_0)) + [1]

    def create_token_type_ids_from_sequences(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Create a mask from the two sequences passed to be used in a sequence-pair classification task. A BERT sequence
        pair mask has the following format:

        ::

            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
            | first sequence    | second sequence |

        If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).

        Args:
            token_ids_0 (:obj:`List[int]`):
                List of IDs.
            token_ids_1 (:obj:`List[int]`, `optional`):
                Optional second list of IDs for sequence pairs.

        Returns:
            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
            sequence(s).
        """
        sep = [self.sep_token_id]
        cls = [self.cls_token_id]
        if token_ids_1 is None:
            return len(cls + token_ids_0 + sep) * [0]
        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        index = 0
        if os.path.isdir(save_directory):
            vocab_file = os.path.join(
                save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
            )
        else:
            vocab_file = (filename_prefix + "-" if filename_prefix else "") + save_directory
        with open(vocab_file, "w", encoding="utf-8") as writer:
            for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]):
                if index != token_index:
                    logger.warning(
                        f"Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive."
                        " Please check that the vocabulary is not corrupted!"
                    )
                    index = token_index
                writer.write(token + "\n")
                index += 1
        return (vocab_file,)
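For preprocessing we instantiate the tokenizer with AutoTokenizer.from_pretrained, which ensures that:
(1) we get the tokenizer that matches the pretrained model;
(2) the vocabulary (more precisely, the token vocabulary) that the checkpoint's tokenizer needs is downloaded;
(3) the downloaded vocabulary is cached, so it is not downloaded again on later use.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
Note: use_fast=True requests a tokenizer of type transformers.PreTrainedTokenizerFast, because preprocessing relies on features specific to fast tokenizers (such as fast multithreaded tokenization). If the model has no fast tokenizer, simply drop this option. Almost every model has one; the model/tokenizer table lists the features of the tokenizer that belongs to each pretrained model.
A tokenizer can preprocess a single text or a pair of texts, and its output is in the input format the pretrained model expects: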
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")
{'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
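The tokenizer's output depends on the chosen pretrained model, because tokenizers and pretrained models are paired one to one. More information is available in the Hugging Face preprocessing documentation listed under Reference.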
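In the sentence-pair output above, 101 and 102 are the IDs of [CLS] and [SEP] for this checkpoint, which gives the "[CLS] A [SEP] B [SEP]" layout. As a rough sketch, the two segments can be recovered from input_ids, and the segment mask can be rebuilt with the same rule as create_token_type_ids_from_sequences in the BertTokenizer source quoted earlier. DistilBERT does not return token_type_ids, as the output above shows, so the mask here is illustrative only, and the helper names are hypothetical.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in the output shown above

input_ids = [101, 7592, 1010, 2023, 2028, 6251, 999, 102,
             1998, 2023, 6251, 3632, 2007, 2009, 1012, 102]


def split_pair(ids):
    """Split '[CLS] A [SEP] B [SEP]' into the token IDs of A and B."""
    assert ids[0] == CLS_ID and ids[-1] == SEP_ID
    first_sep = ids.index(SEP_ID)
    return ids[1:first_sep], ids[first_sep + 1:-1]


def segment_mask(first, second):
    """0 for [CLS] A [SEP], 1 for B [SEP] (the BERT token type rule)."""
    return [0] * (len(first) + 2) + [1] * (len(second) + 1)


first, second = split_pair(input_ids)
print(len(first), len(second))      # 6 7
print(segment_mask(first, second))  # eight 0s followed by eight 1s
```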
2.2 Preprocessing all examples in the dataset
To preprocess the data we need to know which columns hold the text for each task, so we define the following dict.
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}
Check the data format:
sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")
The output is:
Sentence: Our friends won't buy this analysis, let alone the next one we propose.
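Next, put the preprocessing code into a function:
def preprocess_function(examples):
    # Single-sentence tasks have no second text column.
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)
The preprocessing function works on a single example or on several. For several examples, each key of the result holds a list with one entry per example: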
preprocess_function(dataset['train'][:5])
The output is:
{'input_ids': [[101, 2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010, 2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012, 102], [101, 2028, 2062, 18404, 2236, 3989, 1998, 1045, 1005, 1049, 3228, 2039, 1012, 102], [101, 2028, 2062, 18404, 2236, 3989, 2030, 1045, 1005, 1049, 3228, 2039, 1012, 102], [101, 1996, 2062, 2057, 2817, 16025, 1010, 1996, 13675, 16103, 2121, 2027, 2131, 1012, 102], [101, 2154, 2011, 2154, 1996, 8866, 2024, 2893, 14163, 8024, 3771, 1012, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
Next, preprocess every example in the dataset: use map to apply the preprocessing function preprocess_function to all examples in every split.
encoded_dataset = dataset.map(preprocess_function, batched=True)
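Better still, the results are cached automatically so they are not recomputed next time (but be aware that a stale cache can get in the way if the inputs have changed).
The 🤗 Datasets library inspects the arguments passed to map: if nothing has changed it reuses the cached data, otherwise it processes the data again. If the arguments look unchanged but you still want the data reprocessed, bypass the cache by passing load_from_cache_file=False to map.
The batched=True argument used above belongs to map: the preprocessing function then receives a batch of examples at a time, which lets the fast tokenizer process them in parallel using multiple threads.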
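To see what batched=True changes from the point of view of the preprocessing function, the following toy emulation may help. It is a sketch in plain Python, not the 🤗 Datasets implementation: batched_map and toy_preprocess are hypothetical names, and a whitespace split stands in for the real tokenizer. The default of 1000 rows per batch mirrors the default batch_size of Dataset.map.

```python
def batched_map(columns, function, batch_size=1000):
    """Toy stand-in for Dataset.map(function, batched=True).

    `columns` maps column names to equally long lists. `function` receives
    one slice of every column at a time and returns new columns.
    """
    num_rows = len(next(iter(columns.values())))
    output = {}
    for start in range(0, num_rows, batch_size):
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        result = function(batch)
        # Existing columns are kept and the returned columns are added.
        for name, values in {**batch, **result}.items():
            output.setdefault(name, []).extend(values)
    return output


def toy_preprocess(examples):
    # `examples["sentence"]` is a list of strings, not a single string.
    return {"num_words": [len(s.split()) for s in examples["sentence"]]}


data = {"sentence": ["a b c", "d e", "f"], "label": [1, 0, 1]}
encoded = batched_map(data, toy_preprocess, batch_size=2)
print(encoded["num_words"])  # [3, 2, 1]
```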
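3. Fine-tuning the pretrained model
3.1 Loading the classification model
Now that the data is ready, we download and load the pretrained model and then fine-tune it.
Since this is a sequence classification task, we need a model class built for it: AutoModelForSequenceClassification.
As with the tokenizer, from_pretrained downloads and loads the model and caches it, so it is not downloaded again.
Note that STS-B is a regression problem (one output) and MNLI is a 3-class problem; the remaining tasks are binary:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# Define the model
num_labels = 3 if task.startswith("mnli") else 1 if task=="stsb" else 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)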
Downloading: 100% 268M/268M [00:08<00:00, 31.2MB/s]
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
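Because we are fine-tuning on a text classification task but loaded a pretrained language model, the warning tells us that some mismatched parameters were discarded when the model was loaded: the pretrained language-modeling head was dropped and a text classification head was randomly initialized.
3.2 Setting the training arguments
To build a Trainer we need three more things, the most important of which is the training configuration, TrainingArguments. It holds all the attributes that define the training process.
metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy"

args = TrainingArguments(
    "test-glue",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
)
The evaluation_strategy = "epoch" argument tells the training code to run a validation pass once per epoch.
batch_size was defined earlier in this notebook.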
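3.3 The evaluation metric for the task
Finally, because different tasks need different metrics, we define a function that turns the model outputs into predictions and scores them with the metric loaded for the task:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        # Classification: pick the class with the highest logit.
        predictions = np.argmax(predictions, axis=1)
    else:
        # Regression (STS-B): the single output is the predicted score.
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)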
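The predictions that compute_metrics receives are raw model outputs with one row per example and one column per label. As a small illustration without NumPy (the helper name logits_to_predictions and the numbers are made up for this sketch), the two branches do the following:

```python
def logits_to_predictions(logits, is_regression=False):
    """Turn per-example model outputs into class indices or scores."""
    if is_regression:
        # STS-B style: one output column holding the predicted score.
        return [row[0] for row in logits]
    # Classification: index of the largest logit in each row
    # (the first index wins on ties, like np.argmax).
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


print(logits_to_predictions([[0.2, 1.3], [2.0, -1.0]]))           # [1, 0]
print(logits_to_predictions([[3.7], [0.4]], is_regression=True))  # [3.7, 0.4]
```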
Pass everything to the Trainer:
validation_key = "validation_mismatched" if task == "mnli-mm" else "validation_matched" if task == "mnli" else "validation"

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
3.4 Training
trainer.train()
The output is:
The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running training *****
  Num examples = 8551
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2675
[2675/2675 02:49, Epoch 5/5] (originally a progress bar)

Epoch | Training Loss | Validation Loss | Matthews Correlation
1 | 0.525400 | 0.520955 | 0.409248
2 | 0.351600 | 0.570341 | 0.477499
3 | 0.236100 | 0.622785 | 0.499872
4 | 0.166300 | 0.806475 | 0.491623
5 | 0.125700 | 0.882225 | 0.513900

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
Saving model checkpoint to test-glue/checkpoint-535
Configuration saved in test-glue/checkpoint-535/config.json
Model weights saved in test-glue/checkpoint-535/pytorch_model.bin
tokenizer config file saved in test-glue/checkpoint-535/tokenizer_config.json
Special tokens file saved in test-glue/checkpoint-535/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
Saving model checkpoint to test-glue/checkpoint-1070
Configuration saved in test-glue/checkpoint-1070/config.json
Model weights saved in test-glue/checkpoint-1070/pytorch_model.bin
tokenizer config file saved in test-glue/checkpoint-1070/tokenizer_config.json
Special tokens file saved in test-glue/checkpoint-1070/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
Saving model checkpoint to test-glue/checkpoint-1605
Configuration saved in test-glue/checkpoint-1605/config.json
Model weights saved in test-glue/checkpoint-1605/pytorch_model.bin
tokenizer config file saved in test-glue/checkpoint-1605/tokenizer_config.json
Special tokens file saved in test-glue/checkpoint-1605/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
Saving model checkpoint to test-glue/checkpoint-2140
Configuration saved in test-glue/checkpoint-2140/config.json
Model weights saved in test-glue/checkpoint-2140/pytorch_model.bin
tokenizer config file saved in test-glue/checkpoint-2140/tokenizer_config.json
Special tokens file saved in test-glue/checkpoint-2140/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
Saving model checkpoint to test-glue/checkpoint-2675
Configuration saved in test-glue/checkpoint-2675/config.json
Model weights saved in test-glue/checkpoint-2675/pytorch_model.bin
tokenizer config file saved in test-glue/checkpoint-2675/tokenizer_config.json
Special tokens file saved in test-glue/checkpoint-2675/special_tokens_map.json

Training completed. Do not forget to share your model on huggingface.co/models =)

Loading best model from test-glue/checkpoint-2675 (score: 0.5138995234247261).
TrainOutput(global_step=2675, training_loss=0.27181456521292713, metrics={'train_runtime': 169.649, 'train_samples_per_second': 252.02, 'train_steps_per_second': 15.768, 'total_flos': 229537542078168.0, 'train_loss': 0.27181456521292713, 'epoch': 5.0})
The last line of the output above reports the training loss and related values:
TrainOutput(global_step=2675, training_loss=0.27181456521292713, metrics={'train_runtime': 169.649, 'train_samples_per_second': 252.02, 'train_steps_per_second': 15.768, 'total_flos': 229537542078168.0, 'train_loss': 0.27181456521292713, 'epoch': 5.0})
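The step counts in the training log follow directly from the dataset size and the batch size. The sketch below reproduces them under the assumptions that hold for this run: a single device, gradient accumulation of 1, and a last partial batch that is kept. The helper name is hypothetical.

```python
import math


def batches_per_pass(num_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)


steps_per_epoch = batches_per_pass(8551, 16)  # training set, batch size 16
total_steps = steps_per_epoch * 5             # num_train_epochs=5
eval_batches = batches_per_pass(1043, 16)     # validation set

print(steps_per_epoch)  # 535, hence checkpoint-535 after the first epoch
print(total_steps)      # 2675, matching "Total optimization steps = 2675"
print(eval_batches)     # 66 batches per evaluation pass
```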
3.5 Evaluating the model
After training, run the evaluation:
trainer.evaluate()
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
[66/66 00:00] (originally a progress bar)
{'epoch': 5.0, 'eval_loss': 0.8822253346443176, 'eval_matthews_correlation': 0.5138995234247261, 'eval_runtime': 0.9319, 'eval_samples_per_second': 1119.255, 'eval_steps_per_second': 70.825}
To see how your model fared you can compare it to the GLUE Benchmark leaderboard.
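4. Hyperparameter search
The Trainer also supports hyperparameter search, using the optuna or Ray Tune libraries.
Run the following two lines to install the dependencies:
! pip install optuna
! pip install ray[tune]
4.1 Setting up model initialization
During hyperparameter search the Trainer trains several models, so instead of a model instance it needs a function that defines the model (model_init), which lets it re-initialize the model for every run: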
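The definition of model_init is not shown in this text. A minimal sketch, assuming it simply repeats the model loading from section 3.1; the values of model_checkpoint and num_labels below are the ones this notebook uses for CoLA, and the import is placed inside the function only so that defining it does not load anything by itself.

```python
model_checkpoint = "distilbert-base-uncased"
num_labels = 2  # CoLA is a binary task


def model_init():
    """Return a freshly initialized model; the Trainer calls this per trial."""
    # Imported lazily so that merely defining the function has no side effects.
    from transformers import AutoModelForSequenceClassification

    return AutoModelForSequenceClassification.from_pretrained(
        model_checkpoint, num_labels=num_labels
    )
```

Because every call loads the pretrained weights again and re-creates the classification head, each trial starts from the same pretrained state rather than from the previous trial's fine-tuned weights.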
As in the earlier call to Trainer:
trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
The output is:
loading configuration file https://huggingface.co/distilbert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.d423bdf2f58dc8b77d5f5d18028d7ae4a72dcfd8f468e81fe979ada957a8c361
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.9.1",
  "vocab_size": 30522
}

loading weights file https://huggingface.co/distilbert-base-uncased/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/9c169103d7e5a73936dd2b627e42851bec0831212b677c637033ee4bce9ab5ee.126183e36667471617ae2f0835fab707baa54b731f991507ebbb55ea85adb12a
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
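4.2 Running the search
Call the hyperparameter_search method. This can take a long time, so one option is to search on part of the dataset first and then train on the full data.
For example, search using 1/10 of the data:
# Hyperparameter search
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")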
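The call above searches on whatever train_dataset the Trainer was built with, so using one tenth of the data means sharding the training set before it is handed to the Trainer. A sketch assuming the shard method of a 🤗 Datasets Dataset; the helper name one_tenth is hypothetical, and the commented usage relies on the objects defined earlier in the notebook.

```python
def one_tenth(dataset, index=0):
    """Return one of ten roughly equal shards of a 🤗 Datasets `Dataset`."""
    return dataset.shard(num_shards=10, index=index)


# Usage (not executed here; needs the objects defined earlier in the notebook):
# trainer = Trainer(
#     model_init=model_init,
#     args=args,
#     train_dataset=one_tenth(encoded_dataset["train"]),
#     eval_dataset=encoded_dataset[validation_key],
#     tokenizer=tokenizer,
#     compute_metrics=compute_metrics,
# )
```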
hyperparameter_search returns the hyperparameters of the best-performing run:
best_run
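I was not able to run this step myself (possibly because of a poor network connection); the output below is given for reference: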
BestRun(run_id='3', objective=0.5504031254980248, hyperparameters={'learning_rate': 4.301257551502102e-05, 'num_train_epochs': 5, 'seed': 20, 'per_device_train_batch_size': 8})
4.3 Applying the best hyperparameters
Set the Trainer to the best hyperparameters found and train again:
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()
The output is:
TrainOutput(global_step=5345, training_loss=0.26719996967083726, metrics={'train_runtime': 178.4912, 'train_samples_p
Finally, run the evaluation:
trainer.evaluate()
The output is:
{'eval_loss': 0.9789257049560547, 'eval_matthews_correlation': 0.5548273578107759, 'eval_runtime': 0.6556, 'eval_samples_per_second': 1590.796, 'eval_steps_per_second': 100.664, 'epoch': 5.0}
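Comparing this with the earlier result without hyperparameter search (repeated below), the evaluation loss rose from 0.88 to 0.98, which on its own is worse, while the evaluation metric eval_matthews_correlation improved from 0.514 to 0.555: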
{'epoch': 5.0, 'eval_loss': 0.8822253346443176, 'eval_matthews_correlation': 0.5138995234247261, 'eval_runtime': 0.9319, 'eval_samples_per_second': 1119.255, 'eval_steps_per_second': 70.825}
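Finally, don't forget to check how to upload a model and to upload yours to the 🤗 Model Hub. Afterwards you can use your uploaded model directly by its name, just as at the start of this notebook.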
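5. Notes
(1) [Loading a dataset from a local copy]
1) Put the downloaded dataset in the {user_dir}\.cache\huggingface\datasets directory.
Note: on Windows, with C:\Users\{username} as the user directory, this is:
C:\Users\{username}\.cache\huggingface\datasets
2) Re-run the dataset loading code.
For example, if the network is unreliable, the GLUE benchmark dataset used here can be downloaded manually first:
Link: https://pan.baidu.com/s/1WTYY37dooKN0AWXkIfECmg
Extraction code: kq85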
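To check where the cache directory lives on a given machine, the path can be computed rather than typed by hand. This is a simplified sketch: it honors only the HF_DATASETS_CACHE environment variable and otherwise falls back to the default location described above, ignoring other settings (such as HF_HOME) that can also move the cache. The function name is hypothetical.

```python
import os
from pathlib import Path


def default_datasets_cache():
    """Best-guess location of the 🤗 Datasets cache (simplified)."""
    override = os.environ.get("HF_DATASETS_CACHE")
    if override:
        return Path(override)
    # Default location under the user's home directory.
    return Path.home() / ".cache" / "huggingface" / "datasets"


print(default_datasets_cache())
```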
(2) [Setting a proxy for Jupyter Notebook]
1) Configure the proxy by entering these commands in the console:
set HTTPS_PROXY=http://127.0.0.1:19180
set HTTP_PROXY=http://127.0.0.1:19180
2) Start Jupyter Notebook with:
jupyter notebook
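The set commands above use Windows command prompt syntax. As an alternative sketch, the same variables can be set from inside the notebook before any download starts. The address is the example one from this section and must be replaced with your own proxy; the function name is hypothetical.

```python
import os

PROXY_URL = "http://127.0.0.1:19180"  # example address from this section


def configure_proxy(url=PROXY_URL):
    """Set the proxy variables for the current Python process."""
    os.environ["HTTP_PROXY"] = url
    os.environ["HTTPS_PROXY"] = url


configure_proxy()
print(os.environ["HTTPS_PROXY"])
```

Variables set this way only affect the running kernel, so the cell has to be run again after a kernel restart.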
Reference
(1) Datawhale course
(3) Hugging Face documentation: https://huggingface.co/transformers/preprocessing.html