[NLP] (Task 6) Text classification with Transformers + hyperparameter search


This tutorial comes from the Chapter 4 code repository and can also be opened as a Google Colab notebook, which will download the relevant datasets and models. If you open this notebook in Colab, you need to install the Transformers and 🤗 Datasets libraries.

!pip install transformers datasets

A multi-GPU distributed-training version of this notebook is also available here.

Task: fine-tune a pretrained model for text classification

We will show how to use models from the 🤗 Transformers library to solve a text classification task taken from the GLUE Benchmark.


The GLUE benchmark contains 9 sentence-level classification tasks:

CoLA (Corpus of Linguistic Acceptability): decide whether a sentence is grammatically acceptable.
MNLI (Multi-Genre Natural Language Inference): given a premise and a hypothesis, predict entailment, contradiction, or neutral (with matched and mismatched validation sets).
MRPC (Microsoft Research Paraphrase Corpus): decide whether two sentences are paraphrases of each other.
QNLI (Question-answering Natural Language Inference): decide whether a sentence contains the answer to a question.
QQP (Quora Question Pairs): decide whether two questions are semantically equivalent.
RTE (Recognizing Textual Entailment): decide whether one sentence entails another.
SST-2 (Stanford Sentiment Treebank): binary sentiment classification of a single sentence.
STS-B (Semantic Textual Similarity Benchmark): rate the similarity of two sentences on a 0-5 scale (a regression task).
WNLI (Winograd Natural Language Inference): decide whether a sentence with a substituted pronoun is entailed by the original.

[Hung-yi Lee, Deep Learning CP18-19, Self-supervised learning and BERT] There are 9 tasks in GLUE. Generally, if you want to know whether a model like BERT is well trained, you fine-tune 9 models, one for each of the 9 tasks, and look at their average accuracy; that value represents the performance of the self-supervised model.

For these tasks, we will show how to load the dataset with the 🤗 Datasets library and fine-tune a pretrained model with the Trainer API from Transformers.

GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]

In principle, this notebook can use any of the many transformer models (see the model hub) to solve any text classification task. If your task is different, you will most likely only need small changes to reuse this notebook. You should also adjust the fine-tuning batch size to your GPU memory to avoid out-of-memory errors.

task = "cola"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

I. Loading the data

1.1 Loading the data

We will use the 🤗 Datasets library to load the data and the corresponding evaluation metric. Both can be loaded with a simple call to load_dataset and load_metric.

from datasets import load_dataset, load_metric

Except for mnli-mm, every task can be loaded directly by its name. The data is cached automatically after loading.

actual_task = "mnli" if task == "mnli-mm" else task
dataset = load_dataset("glue", actual_task)
metric = load_metric('glue', actual_task)

The resulting object is a DatasetDict. To access the training, validation, or test split, simply index it with the corresponding key (train, validation, or test).

1.2 Inspecting the data

dataset
    DatasetDict({
        train: Dataset({
            features: ['sentence', 'label', 'idx'],
            num_rows: 8551
        })
        validation: Dataset({
            features: ['sentence', 'label', 'idx'],
            num_rows: 1043
        })
        test: Dataset({
            features: ['sentence', 'label', 'idx'],
            num_rows: 1063
        })
    })

Given a split key (train, validation, or test) and an index, you can look at an individual example.

dataset["train"][0]
    {'idx': 0,
     'label': 1,
     'sentence': "Our friends won't buy this analysis, let alone the next one we propose."}

To get a better feel for what the data looks like, the following function randomly picks a few examples from the dataset and displays them.

import datasets
import random
import pandas as pd
from IPython.display import display, HTML
def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # sample num_examples distinct random indices
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    df = pd.DataFrame(dataset[picks])
    # replace ClassLabel integer ids with their human-readable names
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
show_random_elements(dataset["train"])

(The output is an HTML table of 10 random training examples with their idx, label, and sentence.)

1.3 Inspecting the evaluation metric

The evaluation metric is an instance of datasets.Metric:

metric
    Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
    Compute GLUE evaluation metric associated to each GLUE dataset.
    Args:
        predictions: list of predictions to score.
            Each translation should be tokenized into a list of tokens.
        references: list of lists of references for each translation.
            Each reference should be tokenized into a list of tokens.
    Returns: depending on the GLUE subset, one or several of:
        "accuracy": Accuracy
        "f1": F1 score
        "pearson": Pearson Correlation
        "spearmanr": Spearman Correlation
        "matthews_correlation": Matthew Correlation
    Examples:
        >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
        >>> references = [0, 1]
        >>> predictions = [0, 1]
        >>> results = glue_metric.compute(predictions=predictions, references=references)
        >>> print(results)
        {'accuracy': 1.0}
        >>> glue_metric = datasets.load_metric('glue', 'mrpc')  # 'mrpc' or 'qqp'
        >>> references = [0, 1]
        >>> predictions = [0, 1]
        >>> results = glue_metric.compute(predictions=predictions, references=references)
        >>> print(results)
        {'accuracy': 1.0, 'f1': 1.0}
        >>> glue_metric = datasets.load_metric('glue', 'stsb')
        >>> references = [0., 1., 2., 3., 4., 5.]
        >>> predictions = [0., 1., 2., 3., 4., 5.]
        >>> results = glue_metric.compute(predictions=predictions, references=references)
        >>> print({"pearson": round(results["pearson"], 2), "spearmanr": round(results["spearmanr"], 2)})
        {'pearson': 1.0, 'spearmanr': 1.0}
        >>> glue_metric = datasets.load_metric('glue', 'cola')
        >>> references = [0, 1]
        >>> predictions = [0, 1]
        >>> results = glue_metric.compute(predictions=predictions, references=references)
        >>> print(results)
        {'matthews_correlation': 1.0}
    """, stored examples: 0)

Calling the metric's compute method with labels and predictions returns the metric value:

import numpy as np
fake_preds = np.random.randint(0, 2, size=(64,))
fake_labels = np.random.randint(0, 2, size=(64,))
metric.compute(predictions=fake_preds, references=fake_labels)
    {'matthews_correlation': 0.1513518081969605}

Each text classification task has its own corresponding metric.
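For example, STS-B is a regression task scored with Pearson and Spearman correlations rather than accuracy. A quick sketch (the fake scores below are illustrative, following the same pattern as the fake predictions above):

from datasets import load_metric
import numpy as np

stsb_metric = load_metric("glue", "stsb")
fake_scores = np.random.uniform(0, 5, size=(16,))
# identical predictions and references give perfect correlations
stsb_metric.compute(predictions=fake_scores, references=fake_scores)
    {'pearson': 1.0, 'spearmanr': 1.0}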

II. Data preprocessing

2.1 Building the tokenizer

Before feeding the data into the model, we need to preprocess it.

The basic preprocessing pipeline:

(1) tokenize the text; (2) convert it into the input format the model expects for the task.

The tool that does this is called a Tokenizer:

(1) the Tokenizer first tokenizes the input;

(2) then converts the tokens into the corresponding token IDs from the pretrained model's vocabulary;

(3) and finally assembles them into the input format the model needs.

Recall from the previous chapter that we built a tokenizer with BertTokenizer.from_pretrained. BertTokenizer has the following commonly used methods (a short demo follows the list):

from_pretrained: initialize a tokenizer from a directory containing a vocabulary file (vocab.txt);

tokenize: split text (a word or a sentence) into a list of subwords;

convert_tokens_to_ids: convert a list of subwords into the list of their vocabulary indices;

convert_ids_to_tokens: the inverse of the above;

convert_tokens_to_string: join a subword list back into a word or sentence using the "##" markers;

encode: for a single sentence, tokenize it, add the special tokens to form the "[CLS], x, [SEP]" structure, and convert to a list of vocabulary indices; for a pair of sentences (only the first two are used if more are given), form "[CLS], x1, [SEP], x2, [SEP]" and convert to indices;

decode: turn the output of encode back into a full sentence.
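A minimal demonstration of these methods (a sketch; it downloads the bert-base-uncased vocabulary, and the exact subword splits depend on that vocab):

from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = bert_tokenizer.tokenize("Transformers are awesome!")   # list of subword strings
ids = bert_tokenizer.convert_tokens_to_ids(tokens)              # list of vocabulary indices
assert bert_tokenizer.convert_ids_to_tokens(ids) == tokens      # the two conversions are inverses
print(bert_tokenizer.convert_tokens_to_string(tokens))          # subwords rejoined on "##"
encoded = bert_tokenizer.encode("Transformers are awesome!")    # ids for "[CLS] ... [SEP]"
print(bert_tokenizer.decode(encoded))                           # "[CLS] transformers are awesome! [SEP]"

For reference, here is the BertTokenizer implementation from the Transformers source: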

class BertTokenizer(PreTrainedTokenizer):
    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    def __init__(
        self,
        vocab_file,
        do_lower_case=True,
        do_basic_tokenize=True,
        never_split=None,
        unk_token="[UNK]",
        sep_token="[SEP]",
        pad_token="[PAD]",
        cls_token="[CLS]",
        mask_token="[MASK]",
        tokenize_chinese_chars=True,
        strip_accents=None,
        **kwargs
    ):
        super().__init__(
            do_lower_case=do_lower_case,
            do_basic_tokenize=do_basic_tokenize,
            never_split=never_split,
            unk_token=unk_token,
            sep_token=sep_token,
            pad_token=pad_token,
            cls_token=cls_token,
            mask_token=mask_token,
            tokenize_chinese_chars=tokenize_chinese_chars,
            strip_accents=strip_accents,
            **kwargs,
        )
        if not os.path.isfile(vocab_file):
            raise ValueError(
                f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained "
                "model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`"
            )
        self.vocab = load_vocab(vocab_file)
        self.ids_to_tokens = collections.OrderedDict([(ids, tok) for tok, ids in self.vocab.items()])
        self.do_basic_tokenize = do_basic_tokenize
        if do_basic_tokenize:
            self.basic_tokenizer = BasicTokenizer(
                do_lower_case=do_lower_case,
                never_split=never_split,
                tokenize_chinese_chars=tokenize_chinese_chars,
                strip_accents=strip_accents,
            )
        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
    @property
    def do_lower_case(self):
        return self.basic_tokenizer.do_lower_case
    @property
    def vocab_size(self):
        return len(self.vocab)
    def get_vocab(self):
        return dict(self.vocab, **self.added_tokens_encoder)
    def _tokenize(self, text):
        split_tokens = []
        if self.do_basic_tokenize:
            for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):
                # If the token is part of the never_split set
                if token in self.basic_tokenizer.never_split:
                    split_tokens.append(token)
                else:
                    split_tokens += self.wordpiece_tokenizer.tokenize(token)
        else:
            split_tokens = self.wordpiece_tokenizer.tokenize(text)
        return split_tokens
    def _convert_token_to_id(self, token):
        """Converts a token (str) in an id using the vocab."""
        return self.vocab.get(token, self.vocab.get(self.unk_token))
    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        return self.ids_to_tokens.get(index, self.unk_token)
    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        out_string = " ".join(tokens).replace(" ##", "").strip()
        return out_string
    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
        adding special tokens. A BERT sequence has the following format:
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``
        Args:
            token_ids_0 (:obj:`List[int]`):
                List of IDs to which the special tokens will be added.
            token_ids_1 (:obj:`List[int]`, `optional`):
                Optional second list of IDs for sequence pairs.
        Returns:
            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
        """
        if token_ids_1 is None:
            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
        cls = [self.cls_token_id]
        sep = [self.sep_token_id]
        return cls + token_ids_0 + sep + token_ids_1 + sep
    def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
        """
        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
        special tokens using the tokenizer ``prepare_for_model`` method.
        Args:
            token_ids_0 (:obj:`List[int]`):
                List of IDs.
            token_ids_1 (:obj:`List[int]`, `optional`):
                Optional second list of IDs for sequence pairs.
            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
                Whether or not the token list is already formatted with special tokens for the model.
        Returns:
            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
        """
        if already_has_special_tokens:
            return super().get_special_tokens_mask(
                token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
            )
        if token_ids_1 is not None:
            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
        return [1] + ([0] * len(token_ids_0)) + [1]
    def create_token_type_ids_from_sequences(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Create a mask from the two sequences passed to be used in a sequence-pair classification task. A BERT sequence
        pair mask has the following format:
        ::
            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
            | first sequence    | second sequence |
        If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
        Args:
            token_ids_0 (:obj:`List[int]`):
                List of IDs.
            token_ids_1 (:obj:`List[int]`, `optional`):
                Optional second list of IDs for sequence pairs.
        Returns:
            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
            sequence(s).
        """
        sep = [self.sep_token_id]
        cls = [self.cls_token_id]
        if token_ids_1 is None:
            return len(cls + token_ids_0 + sep) * [0]
        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]
    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        index = 0
        if os.path.isdir(save_directory):
            vocab_file = os.path.join(
                save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
            )
        else:
            vocab_file = (filename_prefix + "-" if filename_prefix else "") + save_directory
        with open(vocab_file, "w", encoding="utf-8") as writer:
            for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]):
                if index != token_index:
                    logger.warning(
                        f"Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive."
                        " Please check that the vocabulary is not corrupted!"
                    )
                    index = token_index
                writer.write(token + "\n")
                index += 1
        return (vocab_file,)

For our preprocessing, we instantiate the tokenizer with the AutoTokenizer.from_pretrained method, which guarantees that:

we get a tokenizer that corresponds one-to-one to the pretrained model;

when using the tokenizer for the specified model checkpoint, the vocabulary the model needs (more precisely, the tokens vocabulary) is downloaded as well;

the downloaded tokens vocabulary is cached, so it is not downloaded again on later runs.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

Note: use_fast=True requires the tokenizer to be a transformers.PreTrainedTokenizerFast, because our preprocessing relies on some features of fast tokenizers (such as multithreaded batch tokenization). If the model has no fast tokenizer, just drop this option; almost every model has one. You can check the features of each pretrained model's tokenizer in the model/tokenizer correspondence table.
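To check which kind of tokenizer was actually loaded, you can inspect the is_fast attribute (a standard property on Transformers tokenizers):

print(tokenizer.is_fast)  # True when a PreTrainedTokenizerFast was loaded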

The tokenizer can preprocess either a single text or a pair of texts, and its output matches the input format the pretrained model expects:

tokenizer("Hello, this one sentence!", "And this sentence goes with it.")
    {'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 
    'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Depending on the pretrained model we chose, the tokenizer returns different fields; tokenizer and pretrained model correspond one-to-one. More information can be found here.

2.2 Preprocessing every example in the dataset

To preprocess the data, we need to know which text fields each task uses, so we define the following dict.

task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

A quick sanity check on the data format:

sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

The output is:

Sentence: Our friends won't buy this analysis, let alone the next one we propose.

Next we wrap the preprocessing in a function:

def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

The preprocessing function works on a single example as well as on a batch of examples. If the input is a batch, each returned field is a list of lists (one inner list per example):

preprocess_function(dataset['train'][:5])

The output is:

    {'input_ids': [[101, 2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010, 2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012, 102], [101, 2028, 2062, 18404, 2236, 3989, 1998, 1045, 1005, 1049, 3228, 2039, 1012, 102], [101, 2028, 2062, 18404, 2236, 3989, 2030, 1045, 1005, 1049, 3228, 2039, 1012, 102], [101, 1996, 2062, 2057, 2817, 16025, 1010, 1996, 13675, 16103, 2121, 2027, 2131, 1012, 102], [101, 2154, 2011, 2154, 1996, 8866, 2024, 2893, 14163, 8024, 3771, 1012, 102]], 
    'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

Next we preprocess all examples in the dataset by applying the preprocess_function defined above to every example with map.

encoded_dataset = dataset.map(preprocess_function, batched=True)

Even better, the results are cached automatically so they are not recomputed the next time (but be careful: if your inputs change, stale cache entries can bite you!).

The datasets library fingerprints the arguments passed to map: if nothing changed, it reuses the cache; otherwise it reprocesses. If the arguments look unchanged but you still want to force reprocessing, pass load_from_cache_file=False to map to bypass the cache.

The batched=True argument used above means map hands the preprocessing function batches of examples at once, which lets the fast tokenizer encode many texts in parallel (its backend is multithreaded).
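For example, to bypass the cache and force reprocessing (a sketch reusing the dataset and preprocess_function defined above; load_from_cache_file is a standard argument of map):

encoded_dataset = dataset.map(preprocess_function, batched=True, load_from_cache_file=False)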

III. Fine-tuning the pretrained model

3.1 Loading the classification model

Now that the data is ready, we need to download and load the pretrained model, then fine-tune it.

Since we are solving a sequence classification task, we need a model class suited to it. We use the AutoModelForSequenceClassification class.

As with the tokenizer, the from_pretrained method downloads and loads the model, and also caches it so it won't be downloaded twice.

Note that STS-B is a regression problem and MNLI is a 3-way classification problem:

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
# define the model
num_labels = 3 if task.startswith("mnli") else 1 if task=="stsb" else 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
Downloading: 100%
268M/268M [00:08<00:00, 31.2MB/s]
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Because we are fine-tuning on a text classification task while the checkpoint is a pretrained language model, the warning tells us that some weights that don't fit were discarded when loading (the pretrained language-model head was dropped) and that a randomly initialized text classification head was added in its place.

3.2 Setting the training arguments

To build a Trainer, we need three more ingredients, the most important of which is the training configuration, TrainingArguments. It contains all the attributes that define the training run.

metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy"
args = TrainingArguments(
    "test-glue",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
)

The evaluation_strategy = "epoch" argument above tells the training loop to run an evaluation at the end of every epoch.

batch_size was defined earlier in this notebook.

3.3 Task-specific evaluation metrics

Finally, since different tasks need different metrics, we define a function that picks the evaluation method based on the task name:

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)

Pass everything to the Trainer:

validation_key = "validation_mismatched" if task == "mnli-mm" else "validation_matched" if task == "mnli" else "validation"
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

3.4 Training

trainer.train()
    The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
    ***** Running training *****
      Num examples = 8551
      Num Epochs = 5
      Instantaneous batch size per device = 16
      Total train batch size (w. parallel, distributed & accumulation) = 16
      Gradient Accumulation steps = 1
      Total optimization steps = 2675

The results:

 [2675/2675 02:49, Epoch 5/5]
Epoch | Training Loss | Validation Loss | Matthews Correlation
  1   |   0.525400    |    0.520955     |       0.409248
  2   |   0.351600    |    0.570341     |       0.477499
  3   |   0.236100    |    0.622785     |       0.499872
  4   |   0.166300    |    0.806475     |       0.491623
  5   |   0.125700    |    0.882225     |       0.513900
The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
Saving model checkpoint to test-glue/checkpoint-535
Configuration saved in test-glue/checkpoint-535/config.json
Model weights saved in test-glue/checkpoint-535/pytorch_model.bin
tokenizer config file saved in test-glue/checkpoint-535/tokenizer_config.json
Special tokens file saved in test-glue/checkpoint-535/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
Saving model checkpoint to test-glue/checkpoint-1070
Configuration saved in test-glue/checkpoint-1070/config.json
Model weights saved in test-glue/checkpoint-1070/pytorch_model.bin
tokenizer config file saved in test-glue/checkpoint-1070/tokenizer_config.json
Special tokens file saved in test-glue/checkpoint-1070/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
Saving model checkpoint to test-glue/checkpoint-1605
Configuration saved in test-glue/checkpoint-1605/config.json
Model weights saved in test-glue/checkpoint-1605/pytorch_model.bin
tokenizer config file saved in test-glue/checkpoint-1605/tokenizer_config.json
Special tokens file saved in test-glue/checkpoint-1605/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
Saving model checkpoint to test-glue/checkpoint-2140
Configuration saved in test-glue/checkpoint-2140/config.json
Model weights saved in test-glue/checkpoint-2140/pytorch_model.bin
tokenizer config file saved in test-glue/checkpoint-2140/tokenizer_config.json
Special tokens file saved in test-glue/checkpoint-2140/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
Saving model checkpoint to test-glue/checkpoint-2675
Configuration saved in test-glue/checkpoint-2675/config.json
Model weights saved in test-glue/checkpoint-2675/pytorch_model.bin
tokenizer config file saved in test-glue/checkpoint-2675/tokenizer_config.json
Special tokens file saved in test-glue/checkpoint-2675/special_tokens_map.json
Training completed. Do not forget to share your model on huggingface.co/models =)
Loading best model from test-glue/checkpoint-2675 (score: 0.5138995234247261).
TrainOutput(global_step=2675, training_loss=0.27181456521292713, metrics={'train_runtime': 169.649, 'train_samples_per_second': 252.02, 'train_steps_per_second': 15.768, 'total_flos': 229537542078168.0, 'train_loss': 0.27181456521292713, 'epoch': 5.0})

The last line of the output above gives the final training loss and related metrics:

TrainOutput(global_step=2675, training_loss=0.27181456521292713, 
metrics={'train_runtime': 169.649, 
'train_samples_per_second': 252.02, 
'train_steps_per_second': 15.768, 
'total_flos': 229537542078168.0, 
'train_loss': 0.27181456521292713, 
'epoch': 5.0})

3.5 Model evaluation

After training completes, run evaluation:

trainer.evaluate()
The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
 (progress bar omitted) [66/66 00:00]
{'epoch': 5.0,
 'eval_loss': 0.8822253346443176,
 'eval_matthews_correlation': 0.5138995234247261,
 'eval_runtime': 0.9319,
 'eval_samples_per_second': 1119.255,
 'eval_steps_per_second': 70.825}

To see how your model fared you can compare it to the GLUE Benchmark leaderboard.

IV. Hyperparameter search

Trainer also supports hyperparameter search, using the optuna or Ray Tune libraries.

Uncomment and run the following two lines to install the dependencies:

! pip install optuna
! pip install ray[tune]

4.1 Providing a model initializer

During hyperparameter search, the Trainer trains multiple models, so it needs a way to re-initialize a fresh copy of the model for each trial. Instead of an instantiated model, we pass in a model_init function:
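The function below mirrors the model definition from section 3.1 (this is the standard model_init pattern; num_labels and model_checkpoint are the variables defined above):

def model_init():
    # return a freshly initialized model for each hyperparameter-search trial
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)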

Constructing the Trainer is similar to before:

trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

The output:

loading configuration file https://huggingface.co/distilbert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.d423bdf2f58dc8b77d5f5d18028d7ae4a72dcfd8f468e81fe979ada957a8c361
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.9.1",
  "vocab_size": 30522
}
loading weights file https://huggingface.co/distilbert-base-uncased/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/9c169103d7e5a73936dd2b627e42851bec0831212b677c637033ee4bce9ab5ee.126183e36667471617ae2f0835fab707baa54b731f991507ebbb55ea85adb12a
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

4.2 Running the search

Call the hyperparameter_search method. Note that this can take a long time, so we can first search on a subset of the data and then run full training with the best hyperparameters.

For example, search using 1/10 of the data (see the sketch below):
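The subset can be selected with Dataset.shard (a sketch; shard(index=1, num_shards=10) keeps one of ten equal shards of the training set):

trainer.train_dataset = encoded_dataset["train"].shard(index=1, num_shards=10)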

# hyperparameter search
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")

hyperparameter_search returns the hyperparameters of the best run:

best_run

I couldn't get this to finish myself (probably a flaky network); an example result looks like:

BestRun(run_id='3', objective=0.5504031254980248, hyperparameters={'learning_rate': 4.301257551502102e-05, 'num_train_epochs': 5, 'seed': 20, 'per_device_train_batch_size': 8})

4.3 Training with the best hyperparameters

Set the Trainer to the best hyperparameters found, then train:

for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)
trainer.train()

The output (truncated):

TrainOutput(global_step=5345, training_loss=0.26719996967083726, metrics={'train_runtime': 178.4912, 'train_samples_p

Finally, evaluate:

trainer.evaluate()

The output:

{'eval_loss': 0.9789257049560547,
 'eval_matthews_correlation': 0.5548273578107759,
 'eval_runtime': 0.6556,
 'eval_samples_per_second': 1590.796,
 'eval_steps_per_second': 100.664,
 'epoch': 5.0}

Comparing with the earlier run without hyperparameter search (below): the eval loss actually rose from 0.88 to 0.98, but the evaluation metric eval_matthews_correlation improved from 0.514 to 0.555:

{'epoch': 5.0,
 'eval_loss': 0.8822253346443176,
 'eval_matthews_correlation': 0.5138995234247261,
 'eval_runtime': 0.9319,
 'eval_samples_per_second': 1119.255,
 'eval_steps_per_second': 70.825}

Finally, don't forget to check how to upload your model to the 🤗 Model Hub. Afterwards you can use your uploaded model simply by its name, just as we did at the beginning of this notebook.
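A minimal sketch of the upload, assuming you have authenticated first (e.g. via huggingface-cli login); Trainer.push_to_hub uploads the model and tokenizer from the output directory to the Hub:

trainer.push_to_hub()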

V. Notes

(1) [Loading datasets offline]

1) Put the downloaded dataset files into the {user_dir}\.cache\huggingface\datasets directory

Note: on Windows the user directory is:

C:\Users\{username}\.cache\huggingface\datasets

2) Re-run the dataset-loading code

For the GLUE benchmark data used here, if your network is unreliable you can download it manually first:

Link: https://pan.baidu.com/s/1WTYY37dooKN0AWXkIfECmg

Access code: kq85
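Alternatively, load_dataset can be pointed at an explicit cache directory (a sketch; cache_dir is a standard load_dataset argument, and the path below is illustrative):

from datasets import load_dataset
dataset = load_dataset("glue", "cola", cache_dir=r"C:\Users\me\.cache\huggingface\datasets")  # adjust to your user directory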

(2) [Setting a proxy for Jupyter Notebook]

1) Configure the proxy by entering these commands in a console:

set HTTPS_PROXY=http://127.0.0.1:19180

set HTTP_PROXY=http://127.0.0.1:19180

2) Start Jupyter Notebook:

jupyter notebook
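Alternatively, the proxy can be set from inside a running notebook via environment variables (a sketch; the port must match your local proxy):

import os
os.environ["HTTP_PROXY"] = "http://127.0.0.1:19180"
os.environ["HTTPS_PROXY"] = "http://127.0.0.1:19180"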

Reference

(1) Datawhale course

(2) https://relph1119.github.io/my-team-learning/#/transformers_nlp28/task06?id=_44-%e6%a8%a1%e5%9e%8b%e8%af%84%e4%bc%b0

(3) Hugging Face docs: https://huggingface.co/transformers/preprocessing.html
