[NLP] (Task 6) Text classification with Transformers + hyperparameter search


This tutorial comes from the Chapter 4 code repository and can also be opened as a Google Colab notebook, which will download the relevant datasets and models. If you open this notebook in Colab, you need to install the Transformers and 🤗 Datasets libraries.

!pip install transformers datasets

A multi-GPU distributed-training version of this notebook is also available here.

Task: fine-tune a pretrained model for text classification

We will show how to use models from the 🤗 Transformers library to solve a text classification task taken from the GLUE Benchmark.


The GLUE benchmark contains 9 sentence-level classification tasks:

CoLA (Corpus of Linguistic Acceptability): decide whether a sentence is grammatically acceptable.
MNLI (Multi-Genre Natural Language Inference): given a premise and a hypothesis, predict entailment, contradiction, or neutral (with matched and mismatched validation sets).
MRPC (Microsoft Research Paraphrase Corpus): decide whether two sentences are paraphrases of each other.
QNLI (Question-answering Natural Language Inference): decide whether a sentence contains the answer to a question.
QQP (Quora Question Pairs): decide whether two questions are semantically equivalent.
RTE (Recognizing Textual Entailment): decide whether one sentence entails another.
SST-2 (Stanford Sentiment Treebank): binary sentiment classification of a single sentence.
STS-B (Semantic Textual Similarity Benchmark): rate the similarity of two sentences on a 0-5 scale (a regression task).
WNLI (Winograd Natural Language Inference): decide whether a sentence with a substituted pronoun is entailed by the original.

[Hung-yi Lee, Deep Learning CP18-19, Self-supervised learning and BERT] There are 9 tasks in GLUE. Generally, if you want to know whether a model like BERT is well trained, you fine-tune 9 models, one for each of the 9 tasks, and look at their average accuracy; that value represents the performance of the self-supervised model.

For these tasks, we will show how to load the dataset with the 🤗 Datasets library and fine-tune a pretrained model with the Trainer API from Transformers.

GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]

In principle, this notebook can use any of the many transformer models (see the model hub) to solve any text classification task. If your task is different, you will most likely only need small changes to reuse this notebook. You should also adjust the fine-tuning batch size to your GPU memory to avoid out-of-memory errors.

task = "cola"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

I. Loading the data

1.1 Loading the data

We will use the 🤗 Datasets library to load the data and the corresponding evaluation metric. Both can be loaded with a simple call to load_dataset and load_metric.

from datasets import load_dataset, load_metric

Except for mnli-mm, every task can be loaded directly by its name. The data is cached automatically after loading.

actual_task = "mnli" if task == "mnli-mm" else task
dataset = load_dataset("glue", actual_task)
metric = load_metric('glue', actual_task)

The resulting object is a DatasetDict. To access the training, validation, or test split, simply index it with the corresponding key (train, validation, or test).

1.2 Inspecting the data

dataset
    DatasetDict({
        train: Dataset({
            features: ['sentence', 'label', 'idx'],
            num_rows: 8551
        })
        validation: Dataset({
            features: ['sentence', 'label', 'idx'],
            num_rows: 1043
        })
        test: Dataset({
            features: ['sentence', 'label', 'idx'],
            num_rows: 1063
        })
    })

Given a split key (train, validation, or test) and an index, you can look at an individual example.

dataset["train"][0]
    {'idx': 0,
     'label': 1,
     'sentence': "Our friends won't buy this analysis, let alone the next one we propose."}

To get a better feel for what the data looks like, the following function randomly picks a few examples from the dataset and displays them.

import datasets
import random
import pandas as pd
from IPython.display import display, HTML
def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # sample num_examples distinct random indices
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    df = pd.DataFrame(dataset[picks])
    # replace ClassLabel integer ids with their human-readable names
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
show_random_elements(dataset["train"])

(The output is an HTML table of 10 random training examples with their idx, label, and sentence.)

1.3 Inspecting the evaluation metric

The evaluation metric is an instance of datasets.Metric:

metric
    Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
    Compute GLUE evaluation metric associated to each GLUE dataset.
    Args:
        predictions: list of predictions to score.
            Each translation should be tokenized into a list of tokens.
        references: list of lists of references for each translation.
            Each reference should be tokenized into a list of tokens.
    Returns: depending on the GLUE subset, one or several of:
        "accuracy": Accuracy
        "f1": F1 score
        "pearson": Pearson Correlation
        "spearmanr": Spearman Correlation
        "matthews_correlation": Matthew Correlation
    Examples:
        >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
        >>> references = [0, 1]
        >>> predictions = [0, 1]
        >>> results = glue_metric.compute(predictions=predictions, references=references)
        >>> print(results)
        {'accuracy': 1.0}
        >>> glue_metric = datasets.load_metric('glue', 'mrpc')  # 'mrpc' or 'qqp'
        >>> references = [0, 1]
        >>> predictions = [0, 1]
        >>> results = glue_metric.compute(predictions=predictions, references=references)
        >>> print(results)
        {'accuracy': 1.0, 'f1': 1.0}
        >>> glue_metric = datasets.load_metric('glue', 'stsb')
        >>> references = [0., 1., 2., 3., 4., 5.]
        >>> predictions = [0., 1., 2., 3., 4., 5.]
        >>> results = glue_metric.compute(predictions=predictions, references=references)
        >>> print({"pearson": round(results["pearson"], 2), "spearmanr": round(results["spearmanr"], 2)})
        {'pearson': 1.0, 'spearmanr': 1.0}
        >>> glue_metric = datasets.load_metric('glue', 'cola')
        >>> references = [0, 1]
        >>> predictions = [0, 1]
        >>> results = glue_metric.compute(predictions=predictions, references=references)
        >>> print(results)
        {'matthews_correlation': 1.0}
    """, stored examples: 0)

Calling the metric's compute method with labels and predictions returns the metric value:

import numpy as np
fake_preds = np.random.randint(0, 2, size=(64,))
fake_labels = np.random.randint(0, 2, size=(64,))
metric.compute(predictions=fake_preds, references=fake_labels)
    {'matthews_correlation': 0.1513518081969605}

Each text classification task has its own corresponding metric.
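For example, STS-B is a regression task scored with Pearson and Spearman correlations rather than accuracy. A quick sketch (the fake scores below are illustrative, following the same pattern as the fake predictions above):

from datasets import load_metric
import numpy as np

stsb_metric = load_metric("glue", "stsb")
fake_scores = np.random.uniform(0, 5, size=(16,))
# identical predictions and references give perfect correlations
stsb_metric.compute(predictions=fake_scores, references=fake_scores)
    {'pearson': 1.0, 'spearmanr': 1.0}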

II. Data preprocessing

2.1 Building the tokenizer

Before feeding the data into the model, we need to preprocess it.

The basic preprocessing pipeline:

(1) tokenize the text; (2) convert it into the input format the model expects for the task.

The tool that does this is called a Tokenizer:

(1) the Tokenizer first tokenizes the input;

(2) then converts the tokens into the corresponding token IDs from the pretrained model's vocabulary;

(3) and finally assembles them into the input format the model needs.

Recall from the previous chapter that we built a tokenizer with BertTokenizer.from_pretrained. BertTokenizer has the following commonly used methods (a short demo follows the list):

from_pretrained: initialize a tokenizer from a directory containing a vocabulary file (vocab.txt);

tokenize: split text (a word or a sentence) into a list of subwords;

convert_tokens_to_ids: convert a list of subwords into the list of their vocabulary indices;

convert_ids_to_tokens: the inverse of the above;

convert_tokens_to_string: join a subword list back into a word or sentence using the "##" markers;

encode: for a single sentence, tokenize it, add the special tokens to form the "[CLS], x, [SEP]" structure, and convert to a list of vocabulary indices; for a pair of sentences (only the first two are used if more are given), form "[CLS], x1, [SEP], x2, [SEP]" and convert to indices;

decode: turn the output of encode back into a full sentence.
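A minimal demonstration of these methods (a sketch; it downloads the bert-base-uncased vocabulary, and the exact subword splits depend on that vocab):

from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = bert_tokenizer.tokenize("Transformers are awesome!")   # list of subword strings
ids = bert_tokenizer.convert_tokens_to_ids(tokens)              # list of vocabulary indices
assert bert_tokenizer.convert_ids_to_tokens(ids) == tokens      # the two conversions are inverses
print(bert_tokenizer.convert_tokens_to_string(tokens))          # subwords rejoined on "##"
encoded = bert_tokenizer.encode("Transformers are awesome!")    # ids for "[CLS] ... [SEP]"
print(bert_tokenizer.decode(encoded))                           # "[CLS] transformers are awesome! [SEP]"

For reference, here is the BertTokenizer implementation from the Transformers source: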

class BertTokenizer(PreTrainedTokenizer):
    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    def __init__(
        self,
        vocab_file,
        do_lower_case=True,
        do_basic_tokenize=True,
        never_split=None,
        unk_token="[UNK]",
        sep_token="[SEP]",
        pad_token="[PAD]",
        cls_token="[CLS]",
        mask_token="[MASK]",
        tokenize_chinese_chars=True,
        strip_accents=None,
        **kwargs
    ):
        super().__init__(
            do_lower_case=do_lower_case,
            do_basic_tokenize=do_basic_tokenize,
            never_split=never_split,
            unk_token=unk_token,
            sep_token=sep_token,
            pad_token=pad_token,
            cls_token=cls_token,
            mask_token=mask_token,
            tokenize_chinese_chars=tokenize_chinese_chars,
            strip_accents=strip_accents,
            **kwargs,
        )
        if not os.path.isfile(vocab_file):
            raise ValueError(
                f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained "
                "model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`"
            )
        self.vocab = load_vocab(vocab_file)
        self.ids_to_tokens = collections.OrderedDict([(ids, tok) for tok, ids in self.vocab.items()])
        self.do_basic_tokenize = do_basic_tokenize
        if do_basic_tokenize:
            self.basic_tokenizer = BasicTokenizer(
                do_lower_case=do_lower_case,
                never_split=never_split,
                tokenize_chinese_chars=tokenize_chinese_chars,
                strip_accents=strip_accents,
            )
        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
    @property
    def do_lower_case(self):
        return self.basic_tokenizer.do_lower_case
    @property
    def vocab_size(self):
        return len(self.vocab)
    def get_vocab(self):
        return dict(self.vocab, **self.added_tokens_encoder)
    def _tokenize(self, text):
        split_tokens = []
        if self.do_basic_tokenize:
            for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):
                # If the token is part of the never_split set
                if token in self.basic_tokenizer.never_split:
                    split_tokens.append(token)
                else:
                    split_tokens += self.wordpiece_tokenizer.tokenize(token)
        else:
            split_tokens = self.wordpiece_tokenizer.tokenize(text)
        return split_tokens
    def _convert_token_to_id(self, token):
        """Converts a token (str) in an id using the vocab."""
        return self.vocab.get(token, self.vocab.get(self.unk_token))
    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        return self.ids_to_tokens.get(index, self.unk_token)
    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        out_string = " ".join(tokens).replace(" ##", "").strip()
        return out_string
    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
        adding special tokens. A BERT sequence has the following format:
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``
        Args:
            token_ids_0 (:obj:`List[int]`):
                List of IDs to which the special tokens will be added.
            token_ids_1 (:obj:`List[int]`, `optional`):
                Optional second list of IDs for sequence pairs.
        Returns:
            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
        """
        if token_ids_1 is None:
            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
        cls = [self.cls_token_id]
        sep = [self.sep_token_id]
        return cls + token_ids_0 + sep + token_ids_1 + sep
    def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
        """
        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
        special tokens using the tokenizer ``prepare_for_model`` method.
        Args:
            token_ids_0 (:obj:`List[int]`):
                List of IDs.
            token_ids_1 (:obj:`List[int]`, `optional`):
                Optional second list of IDs for sequence pairs.
            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
                Whether or not the token list is already formatted with special tokens for the model.
        Returns:
            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
        """
        if already_has_special_tokens:
            return super().get_special_tokens_mask(
                token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
            )
        if token_ids_1 is not None:
            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
        return [1] + ([0] * len(token_ids_0)) + [1]
    def create_token_type_ids_from_sequences(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Create a mask from the two sequences passed to be used in a sequence-pair classification task. A BERT sequence
        pair mask has the following format:
        ::
            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
            | first sequence    | second sequence |
        If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
        Args:
            token_ids_0 (:obj:`List[int]`):
                List of IDs.
            token_ids_1 (:obj:`List[int]`, `optional`):
                Optional second list of IDs for sequence pairs.
        Returns:
            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
            sequence(s).
        """
        sep = [self.sep_token_id]
        cls = [self.cls_token_id]
        if token_ids_1 is None:
            return len(cls + token_ids_0 + sep) * [0]
        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]
    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        index = 0
        if os.path.isdir(save_directory):
            vocab_file = os.path.join(
                save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
            )
        else:
            vocab_file = (filename_prefix + "-" if filename_prefix else "") + save_directory
        with open(vocab_file, "w", encoding="utf-8") as writer:
            for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]):
                if index != token_index:
                    logger.warning(
                        f"Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive."
                        " Please check that the vocabulary is not corrupted!"
                    )
                    index = token_index
                writer.write(token + "\n")
                index += 1
        return (vocab_file,)

For our preprocessing, we instantiate the tokenizer with the AutoTokenizer.from_pretrained method, which guarantees that:

we get a tokenizer that corresponds one-to-one to the pretrained model;

when using the tokenizer for the specified model checkpoint, the vocabulary the model needs (more precisely, the tokens vocabulary) is downloaded as well;

the downloaded tokens vocabulary is cached, so it is not downloaded again on later runs.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

Note: use_fast=True requires the tokenizer to be a transformers.PreTrainedTokenizerFast, because our preprocessing relies on some features of fast tokenizers (such as multithreaded batch tokenization). If the model has no fast tokenizer, just drop this option; almost every model has one. You can check the features of each pretrained model's tokenizer in the model/tokenizer correspondence table.
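To check which kind of tokenizer was actually loaded, you can inspect the is_fast attribute (a standard property on Transformers tokenizers):

print(tokenizer.is_fast)  # True when a PreTrainedTokenizerFast was loaded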

The tokenizer can preprocess either a single text or a pair of texts, and its output matches the input format the pretrained model expects:

tokenizer("Hello, this one sentence!", "And this sentence goes with it.")
    {'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 
    'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Depending on the pretrained model we chose, the tokenizer returns different fields; tokenizer and pretrained model correspond one-to-one. More information can be found here.

2.2 Preprocessing every example in the dataset

To preprocess the data, we need to know which text fields each task uses, so we define the following dict.

task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

A quick sanity check on the data format:

sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

The output is:

Sentence: Our friends won't buy this analysis, let alone the next one we propose.

Next we wrap the preprocessing in a function:

def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

The preprocessing function works on a single example as well as on a batch of examples. If the input is a batch, each returned field is a list of lists (one inner list per example):

preprocess_function(dataset['train'][:5])

The output is:

    {'input_ids': [[101, 2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010, 2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012, 102], [101, 2028, 2062, 18404, 2236, 3989, 1998, 1045, 1005, 1049, 3228, 2039, 1012, 102], [101, 2028, 2062, 18404, 2236, 3989, 2030, 1045, 1005, 1049, 3228, 2039, 1012, 102], [101, 1996, 2062, 2057, 2817, 16025, 1010, 1996, 13675, 16103, 2121, 2027, 2131, 1012, 102], [101, 2154, 2011, 2154, 1996, 8866, 2024, 2893, 14163, 8024, 3771, 1012, 102]], 
    'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

Next we preprocess all examples in the dataset by applying the preprocess_function defined above to every example with map.

encoded_dataset = dataset.map(preprocess_function, batched=True)

Even better, the results are cached automatically so they are not recomputed the next time (but be careful: if your inputs change, stale cache entries can bite you!).

The datasets library fingerprints the arguments passed to map: if nothing changed, it reuses the cache; otherwise it reprocesses. If the arguments look unchanged but you still want to force reprocessing, pass load_from_cache_file=False to map to bypass the cache.

The batched=True argument used above means map hands the preprocessing function batches of examples at once, which lets the fast tokenizer encode many texts in parallel (its backend is multithreaded).
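For example, to bypass the cache and force reprocessing (a sketch reusing the dataset and preprocess_function defined above; load_from_cache_file is a standard argument of map):

encoded_dataset = dataset.map(preprocess_function, batched=True, load_from_cache_file=False)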

III. Fine-tuning the pretrained model

3.1 Loading the classification model

Now that the data is ready, we need to download and load the pretrained model, then fine-tune it.

Since we are solving a sequence classification task, we need a model class suited to it. We use the AutoModelForSequenceClassification class.

As with the tokenizer, the from_pretrained method downloads and loads the model, and also caches it so it won't be downloaded twice.

Note that STS-B is a regression problem and MNLI is a 3-way classification problem:

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
# define the model
num_labels = 3 if task.startswith("mnli") else 1 if task=="stsb" else 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
Downloading: 100%
268M/268M [00:08<00:00, 31.2MB/s]
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Because we are fine-tuning on a text classification task while the checkpoint is a pretrained language model, the warning tells us that some weights that don't fit were discarded when loading (the pretrained language-model head was dropped) and that a randomly initialized text classification head was added in its place.

3.2 Setting the training arguments

To build a Trainer, we need three more ingredients, the most important of which is the training configuration, TrainingArguments. It contains all the attributes that define the training run.

metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy"
args = TrainingArguments(
    "test-glue",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
)

The evaluation_strategy = "epoch" argument above tells the training loop to run an evaluation at the end of every epoch.

batch_size was defined earlier in this notebook.

3.3 Task-specific evaluation metrics

Finally, since different tasks need different metrics, we define a function that picks the evaluation method based on the task name:

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)

Pass everything to the Trainer:

validation_key = "validation_mismatched" if task == "mnli-mm" else "validation_matched" if task == "mnli" else "validation"
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

3.4 Training

trainer.train()
    The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
    ***** Running training *****
      Num examples = 8551
      Num Epochs = 5
      Instantaneous batch size per device = 16
      Total train batch size (w. parallel, distributed & accumulation) = 16
      Gradient Accumulation steps = 1
      Total optimization steps = 2675

The results:

 [2675/2675 02:49, Epoch 5/5]
Epoch | Training Loss | Validation Loss | Matthews Correlation
  1   |   0.525400    |    0.520955     |       0.409248
  2   |   0.351600    |    0.570341     |       0.477499
  3   |   0.236100    |    0.622785     |       0.499872
  4   |   0.166300    |    0.806475     |       0.491623
  5   |   0.125700    |    0.882225     |       0.513900
The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
Saving model checkpoint to test-glue/checkpoint-535
Configuration saved in test-glue/checkpoint-535/config.json
Model weights saved in test-glue/checkpoint-535/pytorch_model.bin
tokenizer config file saved in test-glue/checkpoint-535/tokenizer_config.json
Special tokens file saved in test-glue/checkpoint-535/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
Saving model checkpoint to test-glue/checkpoint-1070
Configuration saved in test-glue/checkpoint-1070/config.json
Model weights saved in test-glue/checkpoint-1070/pytorch_model.bin
tokenizer config file saved in test-glue/checkpoint-1070/tokenizer_config.json
Special tokens file saved in test-glue/checkpoint-1070/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
Saving model checkpoint to test-glue/checkpoint-1605
Configuration saved in test-glue/checkpoint-1605/config.json
Model weights saved in test-glue/checkpoint-1605/pytorch_model.bin
tokenizer config file saved in test-glue/checkpoint-1605/tokenizer_config.json
Special tokens file saved in test-glue/checkpoint-1605/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
Saving model checkpoint to test-glue/checkpoint-2140
Configuration saved in test-glue/checkpoint-2140/config.json
Model weights saved in test-glue/checkpoint-2140/pytorch_model.bin
tokenizer config file saved in test-glue/checkpoint-2140/tokenizer_config.json
Special tokens file saved in test-glue/checkpoint-2140/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
Saving model checkpoint to test-glue/checkpoint-2675
Configuration saved in test-glue/checkpoint-2675/config.json
Model weights saved in test-glue/checkpoint-2675/pytorch_model.bin
tokenizer config file saved in test-glue/checkpoint-2675/tokenizer_config.json
Special tokens file saved in test-glue/checkpoint-2675/special_tokens_map.json
Training completed. Do not forget to share your model on huggingface.co/models =)
Loading best model from test-glue/checkpoint-2675 (score: 0.5138995234247261).
TrainOutput(global_step=2675, training_loss=0.27181456521292713, metrics={'train_runtime': 169.649, 'train_samples_per_second': 252.02, 'train_steps_per_second': 15.768, 'total_flos': 229537542078168.0, 'train_loss': 0.27181456521292713, 'epoch': 5.0})

The last line of the output above gives the final training loss and related metrics:

TrainOutput(global_step=2675, training_loss=0.27181456521292713, 
metrics={'train_runtime': 169.649, 
'train_samples_per_second': 252.02, 
'train_steps_per_second': 15.768, 
'total_flos': 229537542078168.0, 
'train_loss': 0.27181456521292713, 
'epoch': 5.0})

3.5 Model evaluation

After training completes, run evaluation:

trainer.evaluate()
The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
 (progress bar omitted) [66/66 00:00]
{'epoch': 5.0,
 'eval_loss': 0.8822253346443176,
 'eval_matthews_correlation': 0.5138995234247261,
 'eval_runtime': 0.9319,
 'eval_samples_per_second': 1119.255,
 'eval_steps_per_second': 70.825}

To see how your model fared you can compare it to the GLUE Benchmark leaderboard.

IV. Hyperparameter search

Trainer also supports hyperparameter search, using the optuna or Ray Tune libraries.

Uncomment and run the following two lines to install the dependencies:

! pip install optuna
! pip install ray[tune]

4.1 Providing a model initializer

During hyperparameter search, the Trainer trains multiple models, so it needs a way to re-initialize a fresh copy of the model for each trial. Instead of an instantiated model, we pass in a model_init function:
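The function below mirrors the model definition from section 3.1 (this is the standard model_init pattern; num_labels and model_checkpoint are the variables defined above):

def model_init():
    # return a freshly initialized model for each hyperparameter-search trial
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)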

Constructing the Trainer is similar to before:

trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

The output:

loading configuration file https://huggingface.co/distilbert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.d423bdf2f58dc8b77d5f5d18028d7ae4a72dcfd8f468e81fe979ada957a8c361
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.9.1",
  "vocab_size": 30522
}
loading weights file https://huggingface.co/distilbert-base-uncased/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/9c169103d7e5a73936dd2b627e42851bec0831212b677c637033ee4bce9ab5ee.126183e36667471617ae2f0835fab707baa54b731f991507ebbb55ea85adb12a
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

4.2 Running the search

Call the hyperparameter_search method. Note that this can take a long time, so we can first search on a subset of the data and then run full training with the best hyperparameters.

For example, search using 1/10 of the data (see the sketch below):
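The subset can be selected with Dataset.shard (a sketch; shard(index=1, num_shards=10) keeps one of ten equal shards of the training set):

trainer.train_dataset = encoded_dataset["train"].shard(index=1, num_shards=10)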

# hyperparameter search
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")

hyperparameter_search returns the hyperparameters of the best run:

best_run

I couldn't get this to finish myself (probably a flaky network); an example result looks like:

BestRun(run_id='3', objective=0.5504031254980248, hyperparameters={'learning_rate': 4.301257551502102e-05, 'num_train_epochs': 5, 'seed': 20, 'per_device_train_batch_size': 8})

4.3 Training with the best hyperparameters

Set the Trainer to the best hyperparameters found, then train:

for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)
trainer.train()

The output (truncated):

TrainOutput(global_step=5345, training_loss=0.26719996967083726, metrics={'train_runtime': 178.4912, 'train_samples_p

Finally, evaluate:

trainer.evaluate()

The output:

{'eval_loss': 0.9789257049560547,
 'eval_matthews_correlation': 0.5548273578107759,
 'eval_runtime': 0.6556,
 'eval_samples_per_second': 1590.796,
 'eval_steps_per_second': 100.664,
 'epoch': 5.0}

Comparing with the earlier run without hyperparameter search (below): the eval loss actually rose from 0.88 to 0.98, but the evaluation metric eval_matthews_correlation improved from 0.514 to 0.555:

{'epoch': 5.0,
 'eval_loss': 0.8822253346443176,
 'eval_matthews_correlation': 0.5138995234247261,
 'eval_runtime': 0.9319,
 'eval_samples_per_second': 1119.255,
 'eval_steps_per_second': 70.825}

Finally, don't forget to check how to upload your model to the 🤗 Model Hub. Afterwards you can use your uploaded model simply by its name, just as we did at the beginning of this notebook.
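A minimal sketch of the upload, assuming you have authenticated first (e.g. via huggingface-cli login); Trainer.push_to_hub uploads the model and tokenizer from the output directory to the Hub:

trainer.push_to_hub()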

V. Notes

(1) [Loading datasets offline]

1) Put the downloaded dataset files into the {user_dir}\.cache\huggingface\datasets directory

Note: on Windows the user directory is:

C:\Users\{username}\.cache\huggingface\datasets

2) Re-run the dataset-loading code

For the GLUE benchmark data used here, if your network is unreliable you can download it manually first:

Link: https://pan.baidu.com/s/1WTYY37dooKN0AWXkIfECmg

Access code: kq85
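Alternatively, load_dataset can be pointed at an explicit cache directory (a sketch; cache_dir is a standard load_dataset argument, and the path below is illustrative):

from datasets import load_dataset
dataset = load_dataset("glue", "cola", cache_dir=r"C:\Users\me\.cache\huggingface\datasets")  # adjust to your user directory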

(2) [Setting a proxy for Jupyter Notebook]

1) Configure the proxy by entering these commands in a console:

set HTTPS_PROXY=http://127.0.0.1:19180

set HTTP_PROXY=http://127.0.0.1:19180

2) Start Jupyter Notebook:

jupyter notebook
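Alternatively, the proxy can be set from inside a running notebook via environment variables (a sketch; the port must match your local proxy):

import os
os.environ["HTTP_PROXY"] = "http://127.0.0.1:19180"
os.environ["HTTPS_PROXY"] = "http://127.0.0.1:19180"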

Reference

(1) Datawhale course

(2) https://relph1119.github.io/my-team-learning/#/transformers_nlp28/task06?id=_44-%e6%a8%a1%e5%9e%8b%e8%af%84%e4%bc%b0

(3) Hugging Face docs: https://huggingface.co/transformers/preprocessing.html
