Transformers for Natural Language Processing (Part 1) (3) - Alibaba Cloud Developer Community

Previous part: Transformers for Natural Language Processing (Part 1) (2) https://developer.aliyun.com/article/1514346

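Importing the modules

We will import the pretrained modules we need, such as the pretrained BERT tokenizer and the configuration of the BERT model. The AdamW optimizer is imported along with the sequence classification module: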

#@title Importing the modules
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertConfig
from transformers import AdamW, BertForSequenceClassification, get_linear_schedule_with_warmup

We import a handy progress bar module from tqdm:

from tqdm import tqdm, trange

Now we can import the widely used standard Python modules:

import pandas as pd
import io
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

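If all goes well, no message is displayed. Keep in mind that Google Colab has preinstalled these modules on the virtual machine we are using.

Specifying CUDA as the device for torch

We will now specify that torch uses the Compute Unified Device Architecture (CUDA) to put the parallel computing power of the NVIDIA card to work for our multi-head attention model:

#@title Hardware verification and device attribution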
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
!nvidia-smi

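The output may vary with the Google Colab configuration. See Appendix II, Hardware Constraints for Transformer Models, for explanations and screenshots.

We will now load the dataset.

Loading the dataset

We will now load CoLA, based on the paper by Warstadt et al. (2018).

The General Language Understanding Evaluation (GLUE) benchmark treats linguistic acceptability as a top-priority NLP task. In Chapter 5, Downstream NLP Tasks with Transformers, we will explore the key tasks transformers must perform to prove their efficiency.

The following cells in the notebook automatically download the necessary files: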

import os
!curl -L https://raw.githubusercontent.com/Denis2054/Transformers-for-NLP-2nd-Edition/master/Chapter03/in_domain_train.tsv --output "in_domain_train.tsv"
!curl -L https://raw.githubusercontent.com/Denis2054/Transformers-for-NLP-2nd-Edition/master/Chapter03/out_of_domain_dev.tsv --output "out_of_domain_dev.tsv"

You should see them appear in the file manager:

Figure 3.5: Uploading the datasets

The program will now load the dataset:

#@title Loading the Dataset
#source of dataset : https://nyu-mll.github.io/CoLA/
df = pd.read_csv("in_domain_train.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])
df.shape

The output displays the shape of the dataset we imported:

(8551, 4)

A sample of 10 rows is displayed to visualize the acceptability judgment task and check that the sequences make sense:

df.sample(10)

The output shows 10 rows of the labeled dataset, which may change after each run:

sentence_source  label  label_notes  sentence
r-67                    NaN          they said that tom would n't pay up , but pay…
bc01                    NaN          although he likes cabbage too , fred likes egg…
c_13                    NaN          wendy 's mother country is iceland .
bc01                                 john is wanted to win .
ks08                    NaN          i did n't find any bugs in my bed .
sks13                   NaN          the girl he met at the departmental party will...
ad03                                 peter is the old pigs .
bc01                                 frank promised the men all to leave .
b_73                                 i 've seen as much of a coward as frank .
c_13                    NaN          we drove all the way to buenos aires .

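Each sample in the .tsv files contains four tab-separated columns:

Column 1: the source of the sentence (code)
Column 2: the label (0 = unacceptable, 1 = acceptable)
Column 3: the label annotated by the author
Column 4: the sentence to be classified

You can open the .tsv files locally to read a few samples of the dataset. The program will now process the data for the BERT model.

Creating sentences, label lists, and adding BERT tokens

The program will now create the sentences as described in the Preparing the pretraining input environment section of this chapter:

#@title Creating sentence, label lists and adding Bert tokens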
sentences = df.sentence.values
# Adding CLS and SEP tokens at the beginning and end of each sentence for BERT
sentences = ["[CLS] " + sentence + " [SEP]" for sentence in sentences]
labels = df.label.values

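[CLS] and [SEP] have now been added.

The program now activates the tokenizer.

Activating the BERT tokenizer

In this section, we will initialize a pretrained BERT tokenizer. This saves the time it would take to train one from scratch.

The program selects an uncased tokenizer, activates it, and displays the first tokenized sentence: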

#@title Activating the BERT Tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
print ("Tokenize the first sentence:")
print (tokenized_texts[0])

The output contains the classification token and the sequence separator token:

Tokenize the first sentence:
['[CLS]', 'our', 'friends', 'wo', 'n', "'", 't', 'buy', 'this', 'analysis', ',', 'let', 'alone', 'the', 'next', 'one', 'we', 'propose', '.', '[SEP]']

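The program will now process the data.

Processing the data

We need to determine a fixed maximum length and process the data for the model. The sentences in the dataset are short. To be on the safe side, the program sets the maximum length of a sequence to 128 and pads the sequences: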

#@title Processing the data
# Set the maximum sequence length. The longest sequence in our training set is 47, but we'll leave room on the end anyway. 
# In the original paper, the authors used a length of 512.
MAX_LEN = 128
# Use the BERT tokenizer to convert the tokens to their index numbers in the BERT vocabulary
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
# Pad our input tokens
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
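The effect of padding="post" and truncating="post" can be sketched in plain Python. This is only an illustration of the behavior, not the Keras implementation; pad_post and the token ids are hypothetical names and values chosen for the example.

```python
def pad_post(seq, max_len, pad_id=0):
    """Cut a sequence at the end if it is too long, then pad at the end with pad_id."""
    truncated = seq[:max_len]
    return truncated + [pad_id] * (max_len - len(truncated))

# Hypothetical token ids, used only to show the two cases
short_seq = [101, 2256, 2814, 102]
long_seq = list(range(1, 11))

print(pad_post(short_seq, 8))  # shorter than max_len: zeros are appended
print(pad_post(long_seq, 8))   # longer than max_len: the tail is dropped
```

Every sequence comes out with the same length, which is what lets the batches be stacked into a single tensor later.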

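The sequences have been processed, and the program now creates the attention masks.

Creating attention masks

Now comes a tricky part of the process. We padded the sequences in the previous cell, but we want to prevent the model from performing attention on those padded tokens!

The idea is to apply a mask with a value of 1 for each token, followed by 0s for padding: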

#@title Create attention masks
attention_masks = []
# Create a mask of 1s for each token followed by 0s for padding
for seq in input_ids:
  seq_mask = [float(i>0) for i in seq]
  attention_masks.append(seq_mask)
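As a small worked example of the cell above, the same rule can be applied to one padded sequence. The token ids are hypothetical; the sketch assumes, as the cell does, that the padding id is 0 and that no real token has id 0.

```python
PAD_ID = 0  # assumption: padding uses id 0, as in the padded sequences above

def attention_mask(padded_ids):
    """Return 1.0 for real tokens and 0.0 for padding positions."""
    return [float(token_id > PAD_ID) for token_id in padded_ids]

padded = [101, 2256, 2814, 102, 0, 0, 0, 0]  # hypothetical ids
mask = attention_mask(padded)
print(mask)
```

The mask has the same length as the input sequence, so it can be split and batched alongside it.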

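The program will now split the data.

Splitting the data into training and validation sets

The program now performs the standard process of splitting the data into training and validation sets: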

#@title Splitting data into train and validation sets
# Use train_test_split to split our data into train and validation sets for training
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids, labels, random_state=2018, test_size=0.1)
train_masks, validation_masks, _, _ = train_test_split(attention_masks, input_ids,random_state=2018, test_size=0.1)
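The two calls above use the same random_state and test_size, which is what keeps each mask paired with its sequence after the split. The principle can be sketched with the standard library alone. This is not how scikit-learn implements train_test_split; split_indices and take are hypothetical helpers, and the data is made up.

```python
import random

def split_indices(n, test_size, seed):
    """Shuffle range(n) with a fixed seed and cut it into (train, validation) indices."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val = max(1, round(n * test_size))
    return idx[n_val:], idx[:n_val]

def take(items, indices):
    return [items[i] for i in indices]

# Hypothetical padded ids and their masks
input_ids = [[101, 7, 102, 0], [101, 8, 9, 102], [101, 5, 102, 0],
             [101, 6, 4, 102], [101, 3, 102, 0]]
masks = [[float(t > 0) for t in row] for row in input_ids]

# Same seed and same size give the same permutation for both lists
train_a, val_a = split_indices(len(input_ids), 0.2, seed=2018)
train_b, val_b = split_indices(len(masks), 0.2, seed=2018)

train_inputs, train_masks = take(input_ids, train_a), take(masks, train_b)
aligned = all(m == [float(t > 0) for t in row]
              for row, m in zip(train_inputs, train_masks))
print(aligned)
```

If the two seeds differed, the masks would no longer line up with their sequences, and the model would attend to the wrong positions.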

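The data is ready for training but still needs to be adapted to torch.

Converting all the data into torch tensors

The fine-tuning model uses torch tensors. The program must convert the data into torch tensors: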

#@title Converting all the data into torch tensors
# Torch tensors are the required datatype for our model
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

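The conversion is done. We now need to create an iterator.

Selecting a batch size and creating an iterator

In this cell, the program selects a batch size and creates an iterator. The iterator, built with the torch DataLoader, is a clever way to avoid loading all the data into memory at once, so large datasets can be trained in batches without exhausting the machine's memory.

In this model, the batch size is 32:

#@title Selecting a Batch Size and Creating an Iterator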
# Select a batch size for training. For fine-tuning BERT on a specific task, the authors recommend a batch size of 16 or 32
batch_size = 32
# Create an iterator of our data with torch DataLoader. This helps save on memory during training because, unlike a for loop, 
# with an iterator the entire dataset does not need to be loaded into memory
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)
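What batching does to the data can be sketched with a plain generator. This only illustrates the slicing, not the DataLoader itself, which also handles sampling and tensor collation. The training-set size of 7695 is an assumption (about 90% of the 8551 rows), and batches is a hypothetical helper.

```python
import math

def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

n_train = 7695      # assumption: roughly 90% of the 8551 rows
batch_size = 32

sizes = [len(b) for b in batches(list(range(n_train)), batch_size)]
print(len(sizes), sizes[0], sizes[-1])
```

The number of batches per epoch is the ceiling of the dataset size divided by the batch size, and the last batch is usually smaller than the others.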

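The data has been processed and is all set. The program can now load and configure the BERT model.

BERT model configuration

The program now initializes a BERT uncased configuration: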

#@title BERT Model Configuration
# Initializing a BERT bert-base-uncased style configuration
#@title Transformer Installation
try:
  import transformers
except ImportError:
  print("Installing transformers")
  !pip -qq install transformers
from transformers import BertModel, BertConfig
configuration = BertConfig()
# Initializing a model from the bert-base-uncased style configuration
model = BertModel(configuration)
# Accessing the model configuration
configuration = model.config
print(configuration)

The output contains the main Hugging Face parameters, similar to the following (the library is updated frequently):

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

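Let's go through these main parameters:

attention_probs_dropout_prob: 0.1 applies a dropout ratio of 0.1 to the attention probabilities.
hidden_act: "gelu" is the non-linear activation function in the encoder. It is a Gaussian Error Linear Unit activation function. The input is weighted by its magnitude, which makes it non-linear.
hidden_dropout_prob: 0.1 is the dropout probability applied to the fully connected layers. Fully connected layers are found in the embeddings, the encoder, and the pooler. The output is not always a good reflection of the content of a sequence. Pooling the sequence of hidden states helps improve the output sequence.
hidden_size: 768 is the dimension of the encoder layers and the pooler layer.
initializer_range: 0.02 is the standard deviation used to initialize the weight matrices.
intermediate_size: 3072 is the dimension of the feedforward layer of the encoder.
layer_norm_eps: 1e-12 is the epsilon value of the layer normalization layers.
max_position_embeddings: 512 is the maximum sequence length the model uses.
model_type: "bert" is the name of the model.
num_attention_heads: 12 is the number of heads.
num_hidden_layers: 12 is the number of layers.
pad_token_id: 0 is the ID of the padding token, so that padding tokens are not trained on.
type_vocab_size: 2 is the size of the token_type_ids vocabulary, which identifies the sequences. For example, "the dog[SEP] The cat.[SEP]" can be represented with token type IDs [0,0,0, 1,1,1].
vocab_size: 30522 is the number of different tokens the model uses to represent the input_ids.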
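To get a feel for how these values determine the size of the model, the arithmetic can be sketched by hand. The estimate below assumes the usual layout of a BERT encoder (embeddings, 12 identical encoder layers, and a pooler, each linear layer with a bias, each layer normalization with a weight and a bias) and leaves out any classification head. The cfg dictionary simply copies values from the printed configuration; linear is a hypothetical helper.

```python
# Values copied from the printed configuration
cfg = {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 30522,
}

def linear(n_in, n_out):
    """Parameter count of a fully connected layer with a bias."""
    return n_in * n_out + n_out

h, ff = cfg["hidden_size"], cfg["intermediate_size"]
layer_norm = 2 * h  # one weight vector and one bias vector

embeddings = (cfg["vocab_size"] + cfg["max_position_embeddings"]
              + cfg["type_vocab_size"]) * h + layer_norm
encoder_layer = (
    4 * linear(h, h)   # query, key, value, and attention output projections
    + layer_norm
    + linear(h, ff)    # feedforward expansion
    + linear(ff, h)    # feedforward projection back to hidden_size
    + layer_norm
)
pooler = linear(h, h)

total = embeddings + cfg["num_hidden_layers"] * encoder_layer + pooler
head_size = h // cfg["num_attention_heads"]
print(head_size, total)
```

Under these assumptions, each attention head works on 64 dimensions and the encoder holds roughly 110 million parameters, most of them in the 12 encoder layers.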

With these parameters in mind, we can now load the pretrained model.

Loading the Hugging Face BERT uncased base model

The program now loads the pretrained BERT model:

#@title Loading the Hugging Face Bert uncased base model 
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model = nn.DataParallel(model)
model.to(device)

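We have defined the model, defined parallel processing, and sent the model to the device. For more explanations, see Appendix II, Hardware Constraints for Transformer Models.

This pretrained model can be trained further if necessary. Exploring the architecture in detail makes it possible to visualize the parameters of each sublayer, as shown in the following excerpt: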

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (1): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )

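Let's now go through the main parameters of the optimizer.

Optimizer grouped parameters

The program will now initialize the optimizer for the model's parameters. Fine-tuning a model starts from the parameter values of the pretrained model (not their names).

The parameters of the optimizer include a weight decay rate to avoid overfitting, and some parameters are filtered out.

The goal is to prepare the model's parameters for the training loop: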

#@title Optimizer Grouped Parameters
# This code is taken from:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L102
# Don't apply weight decay to any parameters whose names include these tokens.
# (Here, BERT doesn't have 'gamma' or 'beta' parameters, only 'bias' and 'LayerNorm.weight' terms)
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.weight']
# Separate the parameters that receive weight decay from those that don't.
# - For the decayed parameters, this specifies a 'weight_decay' of 0.1.
# - For the 'bias' and 'LayerNorm.weight' parameters, the 'weight_decay' is 0.0.
# PyTorch-style optimizers read the 'weight_decay' key; a 'weight_decay_rate' key would be ignored.
optimizer_grouped_parameters = [
    # Filter for all parameters whose names *don't* include 'bias' or 'LayerNorm.weight'.
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.1},
    # Filter for parameters whose names *do* include those.
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]
# Note - 'optimizer_grouped_parameters' only includes the parameter values, not the names.
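The filtering relies on substring matching against parameter names. A minimal sketch on a handful of names shows which group each one lands in. The names below are hypothetical, modeled on the layer structure printed earlier, and stand in for what model.named_parameters() would return.

```python
no_decay = ["bias", "LayerNorm.weight"]

# Hypothetical parameter names, modeled on the printed layer structure
names = [
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.0.attention.self.query.bias",
    "bert.encoder.layer.0.attention.output.LayerNorm.weight",
    "bert.encoder.layer.0.attention.output.LayerNorm.bias",
    "classifier.weight",
]

# A name is exempt from weight decay if it contains any of the no_decay substrings
decayed = [n for n in names if not any(nd in n for nd in no_decay)]
exempt = [n for n in names if any(nd in n for nd in no_decay)]

print("decayed:", decayed)
print("exempt:", exempt)
```

Because the test is a substring match, a prefix added to the names (for example by a wrapper module) does not change which group a parameter falls into.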

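The parameters have been prepared and cleaned up. They are ready for the training loop.

The hyperparameters for the training loop

The hyperparameters for the training loop may seem innocuous, but they are critical. Adam, for example, will activate weight decay and will also go through a warmup phase.

The learning rate (lr) and warmup rate (warmup) should be set to a very small value early in the optimization phase and gradually increased after a certain number of iterations. This avoids large gradients and overshooting the optimization goals.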
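A linear warmup can be sketched as a multiplier applied to the base learning rate. The function below follows the common shape of a linear warmup followed by a linear decay; it is an illustration under stated assumptions, not the scheduler used in the notebook. The function name, total_steps, and warmup_steps are hypothetical; only the base learning rate of 2e-5 and the 10% warmup ratio come from the text.

```python
def linear_warmup_then_decay(step, warmup_steps, total_steps):
    """Multiplier in [0, 1]: ramps up linearly during warmup, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 2e-5
total_steps = 1000   # hypothetical number of optimizer steps
warmup_steps = 100   # 10% of the steps, matching warmup=.1

for step in (0, 50, 100, 550, 1000):
    lr = base_lr * linear_warmup_then_decay(step, warmup_steps, total_steps)
    print(step, lr)
```

The learning rate starts at zero, reaches its full value at the end of warmup, and then shrinks toward zero, so the largest updates never happen while the gradients are at their noisiest.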

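Some researchers argue that the gradients at the output level of the sublayers before layer normalization do not require a warmup rate. Settling this question requires many experimental runs.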

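The optimizer is a version of Adam. BertAdam, the BERT version of Adam this step was written for, comes from the older pytorch_pretrained_bert package, which is not among the modules imported above, so the cell uses the AdamW optimizer imported earlier:

#@title The Hyperparameters for the Training Loop
# AdamW (imported earlier) stands in for BertAdam(optimizer_grouped_parameters, lr=2e-5, warmup=.1).
# For a 10% linear warmup, pair it with get_linear_schedule_with_warmup (also imported earlier).
optimizer = AdamW(optimizer_grouped_parameters,
                  lr=2e-5)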

The program adds an accuracy measurement function to compare the predictions to the labels:

#Creating the Accuracy Measurement Function
# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
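The same computation can be followed by hand on a tiny batch. The sketch below is a pure Python equivalent written for illustration, with hypothetical logits and labels; argmax and accuracy are not names from the notebook.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    preds = [argmax(row) for row in logits]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Hypothetical logits for 4 sentences and 2 classes (0 = unacceptable, 1 = acceptable)
logits = [[2.1, -1.3], [0.2, 0.9], [-0.5, 1.7], [1.0, 0.4]]
labels = [0, 1, 0, 0]

print(accuracy(logits, labels))  # 3 of 4 predictions match
```

Only the position of the largest logit matters here, so the logits do not need to be converted to probabilities first.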

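The data is ready. The parameters are ready. It's time to activate the training loop!

The training loop

The training loop follows standard learning processes. The number of epochs is set to 4, and the measurements of loss and accuracy will be plotted. The training loop uses the dataloader to load and train batches. The training process is measured and evaluated.

The code starts by initializing train_loss_set, which will store the loss and accuracy values to be plotted. It then trains over its epochs and runs a standard training loop, as shown in the following excerpt: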

#@title The Training Loop
t = [] 
# Store our loss and accuracy for plotting
train_loss_set = []
# Number of training epochs (authors recommend between 2 and 4)
epochs = 4
# trange is a tqdm wrapper around the normal python range
for _ in trange(epochs, desc="Epoch"):
…./…
    tmp_eval_accuracy = flat_accuracy(logits, label_ids)
    eval_accuracy += tmp_eval_accuracy
    nb_eval_steps += 1
  print("Validation Accuracy: {}".format(eval_accuracy/nb_eval_steps))

The output displays information for each epoch using the trange wrapper, for _ in trange(epochs, desc="Epoch"):

***output***
Epoch:   0%|          | 0/4 [00:00<?, ?it/s]
Train loss: 0.5381132976395461
Epoch:  25%|██▌       | 1/4 [07:54<23:43, 474.47s/it]
Validation Accuracy: 0.788966049382716
Train loss: 0.315329696132929
Epoch:  50%|█████     | 2/4 [15:49<15:49, 474.55s/it]
Validation Accuracy: 0.836033950617284
Train loss: 0.1474070605354314
Epoch:  75%|███████▌  | 3/4 [23:43<07:54, 474.53s/it]
Validation Accuracy: 0.814429012345679
Train loss: 0.07655430570461196
Epoch: 100%|██████████| 4/4 [31:38<00:00, 474.58s/it]
Validation Accuracy: 0.810570987654321

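Transformer models are evolving very quickly, and deprecation messages or even errors may appear. Hugging Face is no exception, and when this happens we must update our code accordingly.

The model is trained. We can now display the training evaluation.

Training evaluation

The loss and accuracy values were stored in train_loss_set, as defined at the beginning of the training loop.

The program now plots the measurements: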

#@title Training Evaluation
plt.figure(figsize=(15,8))
plt.title("Training loss")
plt.xlabel("Batch")
plt.ylabel("Loss")
plt.plot(train_loss_set)
plt.show()

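The output is a graph showing that the training process went well and was efficient:

Figure 3.6: Training loss per batch

The model has been fine-tuned. We can now run predictions.

Predicting and evaluating using the holdout dataset

The BERT downstream model was trained with the in_domain_train.tsv dataset. The program will now make predictions using the holdout (testing) dataset contained in the out_of_domain_dev.tsv file. The goal is to predict whether a sentence is grammatically correct.

The following excerpt shows that the data preparation process applied to the training data is repeated in the part of the code for the holdout dataset: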

#@title Predicting and Evaluating Using the Holdout Dataset 
df = pd.read_csv("out_of_domain_dev.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])
# Create sentence and label lists
sentences = df.sentence.values
# We need to add special tokens at the beginning and end of each sentence for BERT to work properly
sentences = ["[CLS] " + sentence + " [SEP]" for sentence in sentences]
labels = df.label.values
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
.../...

The program then runs batch predictions using the dataloader:

# Predict 
for batch in prediction_dataloader:
  # Add batch to GPU
  batch = tuple(t.to(device) for t in batch)
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask, b_labels = batch
  # Telling the model not to compute or store gradients, saving memory and speeding up prediction
  with torch.no_grad():
    # Forward pass, calculate logit predictions
    logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

The logits and labels of the predictions are moved to the CPU:

  # Move logits and labels to CPU
  logits = logits['logits'].detach().cpu().numpy()
  label_ids = b_labels.to('cpu').numpy()

The predictions and their true labels are stored:

  # Store predictions and true labels
  predictions.append(logits)
  true_labels.append(label_ids)

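The program can now evaluate the predictions.

Evaluating using the Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) was initially designed to measure the quality of binary classifications and can be modified into a multi-class correlation coefficient. In a binary classification, each prediction falls into one of four outcomes:

TP = True Positive
TN = True Negative
FP = False Positive
FN = False Negative

Brian W. Matthews, a biochemist, designed it in 1975, inspired by his predecessors' phi function. Since then, it has evolved into various formats such as the following one:

MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

The value produced by MCC is between -1 and +1. +1 is the maximum positive value of a prediction. -1 is an inverse prediction. 0 is an average random prediction.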
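The three reference values can be checked with a short sketch that computes MCC from the four counts, using the standard formula MCC = (TP·TN - FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The function name mcc and the counts are hypothetical. Returning 0.0 when the denominator is zero is a common convention, assumed here; it applies when all predictions or all labels belong to a single class.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from the four confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

print(mcc(tp=5, tn=5, fp=0, fn=0))  # every prediction correct
print(mcc(tp=0, tn=0, fp=5, fn=5))  # every prediction inverted
print(mcc(tp=6, tn=3, fp=1, fn=2))  # mostly correct
print(mcc(tp=4, tn=0, fp=0, fn=0))  # only one class present
```

Unlike plain accuracy, the score uses all four counts, so a model that always predicts the majority class does not get credit for it.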

GLUE evaluates linguistic acceptability with MCC.

MCC is imported from sklearn.metrics:

#@title Evaluating Using Matthew's Correlation Coefficient
# Import and evaluate each test batch using Matthew's correlation coefficient
from sklearn.metrics import matthews_corrcoef

An empty list is created to hold the per-batch scores:

matthews_set = []

The MCC value is calculated for each batch and stored in matthews_set:

for i in range(len(true_labels)):
  matthews = matthews_corrcoef(true_labels[i],
                 np.argmax(predictions[i], axis=1).flatten())
  matthews_set.append(matthews)

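You may see some messages caused by library and module version changes. The final score will be based on the entire test set, but let's look at the scores of the individual batches to get a sense of how the metric varies between batches.

The score of individual batches

Let's view the score of the individual batches: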

#@title Score of Individual Batches
matthews_set

The output produces MCC values between -1 and +1, as expected:

[0.049286405809014416,
 -0.2548235957188128,
 0.4732058754737091,
 0.30508307783296046,
 0.3567530340063379,
 0.8050112948805689,
 0.23329882422520506,
 0.47519096331149147,
 0.4364357804719848,
 0.4700159919404217,
 0.7679476477883045,
 0.8320502943378436,
 0.5807564950208268,
 0.5897435897435898,
 0.38461538461538464,
 0.5716350506349809,
 0.0]

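Almost all the MCC values are positive, which is good news. Let's see what the evaluation is for the whole dataset.

Matthews evaluation for the whole dataset

The MCC is a practical way to evaluate a classification model.

The program will now aggregate the true values for the whole dataset: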

#@title Matthew's Evaluation on the Whole Dataset
# Flatten the predictions and true values for aggregate Matthew's evaluation on the whole dataset
flat_predictions = [item for sublist in predictions for item in sublist]
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()
flat_true_labels = [item for sublist in true_labels for item in sublist]
matthews_corrcoef(flat_true_labels, flat_predictions)

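MCC produces a correlation value between -1 and +1. 0 is an average prediction, -1 is an inverse prediction, and 1 is a perfect prediction. In this case, the output confirms that the MCC is positive, which shows that there is a correlation between the model and this dataset: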

0.45439842471680725

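With this final positive evaluation of the fine-tuning of the BERT model, we have an overall view of the BERT training framework.

Summary

BERT brings bidirectional attention to transformers. Predicting sequences from left to right and masking the future tokens to train a model has serious limitations. If the masked sequence contains the meaning we are looking for, the model will produce errors. BERT attends to all of the tokens of a sequence at the same time.

We explored the architecture of BERT, which only uses the encoder stack of transformers. BERT was designed as a two-step framework. The first step of the framework is to pretrain a model. The second step is to fine-tune the model. We built a fine-tuned BERT model for an acceptability judgment downstream task. The fine-tuning process went through every phase: first, we loaded the dataset and the necessary pretrained modules of the model. Then the model was trained, and its performance was measured.

Fine-tuning a pretrained model takes fewer machine resources than training downstream tasks from scratch. Fine-tuned models can perform a variety of tasks. BERT proves that pretraining a model on only two tasks is enough to achieve this, which is remarkable in itself. But producing a multitask fine-tuned model based on the trained parameters of the BERT pretrained model is extraordinary.

Chapter 7, The Rise of Suprahuman Transformers with GPT-3 Engines, shows that OpenAI has reached a zero-shot level.

In this chapter, we fine-tuned a BERT model. In the next chapter, Chapter 4, Pretraining a RoBERTa Model from Scratch, we will dig into the BERT framework and build a pretrained BERT-like model from scratch.

Questions

BERT stands for Bidirectional Encoder Representations from Transformers. (True/False)
BERT is a two-step framework. Step 1 is pretraining. Step 2 is fine-tuning. (True/False)
Fine-tuning a BERT model implies training parameters from scratch. (True/False)
BERT only pretrains using all downstream tasks. (True/False)
BERT pretrains with Masked Language Modeling (MLM). (True/False)
BERT pretrains with Next Sentence Prediction (NSP). (True/False)
BERT pretrains mathematical functions. (True/False)
A question-answer task is a downstream task. (True/False)
A BERT pretraining model does not require tokenization. (True/False)
Fine-tuning a BERT model takes less time than pretraining. (True/False)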

References

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017, Attention Is All You Need: arxiv.org/abs/1706.03762
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, 2018, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: arxiv.org/abs/1810.04805
Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman, 2018, Neural Network Acceptability Judgments: arxiv.org/abs/1805.12471
The Corpus of Linguistic Acceptability (CoLA): nyu-mll.github.io/CoLA/
Hugging Face model documentation:

Join our book's Discord space

Join the book's Discord workspace for a monthly Ask me Anything session with the author:

www.packt.link/Transformers
