Transformers 4.37 Chinese Documentation (Part 1) (4) - Alibaba Cloud Developer Community

Previous part: Transformers 4.37 Chinese Documentation (Part 1) (3) https://developer.aliyun.com/article/1565782

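Load pretrained instances with an AutoClass

Original link: huggingface.co/docs/transformers/v4.37.2/en/autoclass_tutorial

Because there are so many different Transformer architectures, it can be challenging to create one for your checkpoint. As part of the 🤗 Transformers core philosophy of making the library easy, simple, and flexible to use, an AutoClass automatically infers and loads the correct architecture from a given checkpoint. The from_pretrained() method lets you quickly load a pretrained model for any architecture, so you don't have to spend time and resources training a model from scratch. Writing this kind of checkpoint-agnostic code means that if your code works for one checkpoint, it will work with another checkpoint, as long as it was trained for a similar task, even if the architecture is different.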

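Remember, architecture refers to the skeleton of the model, and checkpoints are the weights for a given architecture. For example, BERT is an architecture, while bert-base-uncased is a checkpoint. Model is a general term that can mean either architecture or checkpoint.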
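The dispatch idea behind an AutoClass can be pictured with a toy registry. This is only an illustrative sketch, not the library's implementation: ToyBert, ToyDistilBert, TOY_REGISTRY, and toy_auto_from_config are hypothetical names, and the sketch assumes that a checkpoint's configuration carries a "model_type" string identifying its architecture.

```python
# Toy illustration of AutoClass-style dispatch (all names are hypothetical).
# Assumption: each checkpoint ships a config with a "model_type" field.

class ToyBert:
    def __init__(self, config):
        self.config = config


class ToyDistilBert:
    def __init__(self, config):
        self.config = config


# Map from the architecture name stored in the config to a Python class.
TOY_REGISTRY = {"bert": ToyBert, "distilbert": ToyDistilBert}


def toy_auto_from_config(config):
    """Pick the class that matches config["model_type"] and instantiate it."""
    model_type = config["model_type"]
    if model_type not in TOY_REGISTRY:
        raise ValueError(f"Unknown architecture: {model_type!r}")
    return TOY_REGISTRY[model_type](config)


# The calling code is identical for both checkpoints; only the config differs.
model_a = toy_auto_from_config({"model_type": "bert"})
model_b = toy_auto_from_config({"model_type": "distilbert"})
print(type(model_a).__name__, type(model_b).__name__)
```

The caller never names a concrete class, which is what makes the calling code checkpoint-agnostic.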

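In this tutorial, learn to:

Load a pretrained tokenizer.
Load a pretrained image processor.
Load a pretrained feature extractor.
Load a pretrained processor.
Load a pretrained model.
Load a model as a backbone.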

AutoTokenizer

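Nearly every NLP task begins with a tokenizer. A tokenizer converts your input into a format that can be processed by the model.

Load a tokenizer with AutoTokenizer.from_pretrained():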

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

Then tokenize your input as shown below:

>>> sequence = "In a hole in the ground there lived a hobbit."
>>> print(tokenizer(sequence))
{'input_ids': [101, 1999, 1037, 4920, 1999, 1996, 2598, 2045, 2973, 1037, 7570, 10322, 4183, 1012, 102], 
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

AutoImageProcessor

For vision tasks, an image processor processes the image into the correct input format.

>>> from transformers import AutoImageProcessor
>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

AutoFeatureExtractor

For audio tasks, a feature extractor processes the audio signal into the correct input format.

Load a feature extractor with AutoFeatureExtractor.from_pretrained():

>>> from transformers import AutoFeatureExtractor
>>> feature_extractor = AutoFeatureExtractor.from_pretrained(
...     "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
... )

AutoProcessor

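Multimodal tasks require a processor that combines two types of preprocessing tools. For example, the LayoutLMv2 model requires an image processor to handle images and a tokenizer to handle text; a processor combines both of them.

Load a processor with AutoProcessor.from_pretrained():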

>>> from transformers import AutoProcessor
>>> processor = AutoProcessor.from_pretrained("microsoft/layoutlmv2-base-uncased")

AutoModel

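PyTorch

The AutoModelFor classes let you load a pretrained model for a given task (see the library's Auto classes reference for a complete list of available tasks). For example, load a model for sequence classification with AutoModelForSequenceClassification.from_pretrained():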

>>> from transformers import AutoModelForSequenceClassification
>>> model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

Easily reuse the same checkpoint to load an architecture for a different task:

>>> from transformers import AutoModelForTokenClassification
>>> model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased")

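For PyTorch models, the from_pretrained() method uses torch.load(), which internally uses pickle and is known to be insecure. In general, never load a model that could have come from an untrusted source, or that could have been tampered with. This security risk is partially mitigated for public models hosted on the Hugging Face Hub, which are scanned for malware at each commit. See the Hub documentation for best practices such as signed commit verification with GPG.

TensorFlow and Flax checkpoints are not affected, and can be loaded within PyTorch architectures using the from_tf and from_flax arguments of from_pretrained() to circumvent this issue.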
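To make the pickle warning concrete, here is a minimal, harmless sketch using only the Python standard library. It is not taken from the tutorial, and the Payload class is a hypothetical stand-in for a tampered file: it shows that unpickling can call an arbitrary function chosen by whoever produced the bytes.

```python
import pickle


class Payload:
    # __reduce__ tells pickle how to "rebuild" the object on load.
    # Here it asks pickle to call eval("6 * 7"); a hostile file could
    # name any other importable callable instead.
    def __reduce__(self):
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading the bytes runs eval() with the string chosen by the file's author.
result = pickle.loads(blob)
print(result)
```

Nothing in the loading code mentions eval, yet it runs; this is why a checkpoint from an untrusted source should not be unpickled.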

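Generally, we recommend using the AutoTokenizer class and the AutoModelFor classes to load pretrained instances of models. This ensures you load the correct architecture every time. In the next tutorial, learn how to use your newly loaded tokenizer, image processor, feature extractor, and processor to preprocess a dataset for fine-tuning.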

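TensorFlow

Finally, the TFAutoModelFor classes let you load a pretrained model for a given task (see the library's Auto classes reference for a complete list of available tasks). For example, load a model for sequence classification with TFAutoModelForSequenceClassification.from_pretrained():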

>>> from transformers import TFAutoModelForSequenceClassification
>>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

Easily reuse the same checkpoint to load an architecture for a different task:

>>> from transformers import TFAutoModelForTokenClassification
>>> model = TFAutoModelForTokenClassification.from_pretrained("distilbert-base-uncased")

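Generally, we recommend using the AutoTokenizer class and the TFAutoModelFor classes to load pretrained instances of models. This ensures you load the correct architecture every time. In the next tutorial, learn how to use your newly loaded tokenizer, image processor, feature extractor, and processor to preprocess a dataset for fine-tuning.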

AutoBackbone

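AutoBackbone lets you use pretrained models as backbones and get feature maps as outputs from different stages of the model. Below you can see how to get feature maps from a Swin checkpoint.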

>>> from transformers import AutoImageProcessor, AutoBackbone
>>> import torch
>>> from PIL import Image
>>> import requests
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> processor = AutoImageProcessor.from_pretrained("microsoft/swin-tiny-patch4-window7-224")
>>> model = AutoBackbone.from_pretrained("microsoft/swin-tiny-patch4-window7-224", out_indices=(0,))
>>> inputs = processor(image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> feature_maps = outputs.feature_maps
>>> list(feature_maps[-1].shape)
[1, 96, 56, 56]
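The shape above can be sanity-checked with a little arithmetic. The sketch below rests on assumptions that this tutorial does not state: that the checkpoint name encodes a patch size of 4 and an input resolution of 224, and that the first stage uses 96 channels. expected_first_stage_shape is a hypothetical helper, not a library function.

```python
def expected_first_stage_shape(image_size, patch_size, embed_dim, batch_size=1):
    """Shape of a patch-embedding feature map: one cell per non-overlapping patch."""
    side = image_size // patch_size
    return [batch_size, embed_dim, side, side]


# Assumed values: 224x224 input, 4x4 patches, 96 channels in the first stage.
shape = expected_first_stage_shape(image_size=224, patch_size=4, embed_dim=96)
print(shape)
```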

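Preprocess

Original link: huggingface.co/docs/transformers/v4.37.2/en/preprocessing

Before you can train a model on a dataset, the data needs to be preprocessed into the input format the model expects. Whether your data is text, images, or audio, it needs to be converted and assembled into batches of tensors. 🤗 Transformers provides a set of preprocessing classes to help prepare your data for the model. In this tutorial, you'll learn that for:

Text, use a Tokenizer to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.
Speech and audio, use a Feature extractor to extract sequential features from audio waveforms and convert them into tensors.
Image inputs, use an ImageProcessor to convert images into tensors.
Multimodal inputs, use a Processor to combine a tokenizer and a feature extractor or image processor.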

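AutoProcessor always works and automatically chooses the correct class for the model you're using, whether that is a tokenizer, image processor, feature extractor, or processor.

Before you begin, install 🤗 Datasets so you can load some datasets to experiment with: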

pip install datasets

Natural Language Processing

Video: www.youtube-nocookie.com/embed/Yffk5aydLzg

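The main tool for preprocessing textual data is a tokenizer. A tokenizer splits text into tokens according to a set of rules. The tokens are converted into numbers and then into tensors, which become the model inputs. The tokenizer also adds any additional inputs the model requires.

If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer. This ensures the text is split the same way as the pretraining corpus, and that the same token-to-index mapping (usually referred to as the vocab) is used as during pretraining.

Get started by loading a pretrained tokenizer with the AutoTokenizer.from_pretrained() method. This downloads the vocab the model was pretrained with: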

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

Then pass your text to the tokenizer:

>>> encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
>>> print(encoded_input)
{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

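The tokenizer returns a dictionary with three important items:

input_ids are the indices corresponding to each token in the sentence.
attention_mask indicates whether a token should be attended to or not.
token_type_ids identifies which sequence a token belongs to when there is more than one sequence.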
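To see how the three entries relate, here is a toy sketch that builds them by hand for a pair of sequences. It is not the tokenizer's code: the vocabulary toy_vocab and the helper toy_encode_pair are made up, whitespace splitting stands in for real subword tokenization, and the [CLS] A [SEP] B [SEP] layout is assumed from BERT-style models.

```python
# Made-up vocabulary; the ids are arbitrary and match no real checkpoint.
toy_vocab = {
    "[PAD]": 0, "[CLS]": 1, "[SEP]": 2,
    "second": 3, "breakfast": 4, "what": 5, "about": 6,
}


def toy_encode_pair(first, second):
    """Encode two whitespace-tokenized sequences as [CLS] A [SEP] B [SEP]."""
    tokens_a = first.lower().split()
    tokens_b = second.lower().split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    input_ids = [toy_vocab[t] for t in tokens]
    # Segment 0 covers [CLS], sequence A, and its [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # No padding here, so every position is marked as attendable.
    attention_mask = [1] * len(tokens)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


encoded = toy_encode_pair("what about", "second breakfast")
print(encoded)
```

With a single input sequence, as in the example output above, token_type_ids is all zeros because there is only one segment.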

Recover your input by decoding the input_ids:

>>> tokenizer.decode(encoded_input["input_ids"])
'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'

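As you can see, the tokenizer added two special tokens, CLS and SEP (classifier and separator), to the sentence. Not all models need special tokens, but if they do, the tokenizer automatically adds them for you.

If there are several sentences you want to preprocess, pass them as a list to the tokenizer: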

>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_inputs = tokenizer(batch_sentences)
>>> print(encoded_inputs)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102],
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
               [101, 1327, 1164, 5450, 23434, 136, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1]]}

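Pad

Sentences aren't always the same length, which can be an issue because tensors, the model inputs, need to have a uniform shape. Padding is a strategy for ensuring tensors are rectangular by adding a special padding token to shorter sentences.

Set the padding parameter to True to pad the shorter sequences in the batch to match the longest sequence: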

>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True)
>>> print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}

The first and third sentences are now padded with 0's because they are shorter.
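The padding behavior can be reproduced by hand. The following is a minimal sketch, assuming right-padding with a pad id of 0 as seen in the output above; pad_batch is a hypothetical helper, not a library function.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded ids copied from the batch output shown earlier.
batch = [
    [101, 1252, 1184, 1164, 1248, 6462, 136, 102],
    [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
    [101, 1327, 1164, 5450, 23434, 136, 102],
]
padded = pad_batch(batch)
print([len(row) for row in padded["input_ids"]])
```

The attention_mask is what lets the model tell real tokens from padding, so it has to be built together with the padded ids.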

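Truncation

On the other end of the spectrum, sometimes a sequence may be too long for a model to handle. In this case, you'll need to truncate the sequence to a shorter length.

Set the truncation parameter to True to truncate a sequence to the maximum length accepted by the model: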

>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
>>> print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}

Check out the Padding and truncation concept guide to learn more about the different padding and truncation arguments.
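The example output is unchanged by truncation=True because none of the three sentences exceeds the model's limit. To show what truncation does when a sequence is too long, here is a rough sketch for a single sequence. It assumes BERT-style special tokens with the ids 101 and 102 seen in the outputs above; truncate_with_specials is a hypothetical helper, and the limit of 512 in the last call is an assumed value for illustration.

```python
def truncate_with_specials(token_ids, max_length, cls_id=101, sep_id=102):
    """Keep at most max_length ids, reserving two slots for [CLS] and [SEP]."""
    if max_length < 2:
        raise ValueError("max_length must leave room for the two special tokens")
    body = token_ids[: max_length - 2]
    return [cls_id] + body + [sep_id]


# Ids of the second sentence without its special tokens (copied from the output above).
body_ids = [1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119]

short = truncate_with_specials(body_ids, max_length=8)
print(short)

# With a generous limit (assumed value), nothing is cut.
unchanged = truncate_with_specials(body_ids, max_length=512)
print(len(unchanged))
```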

Build tensors

Finally, you want the tokenizer to return the actual tensors that get fed to the model.

Set the return_tensors parameter to either pt for PyTorch, or tf for TensorFlow:

PyTorch

>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
>>> print(encoded_input)
{'input_ids': tensor([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
                      [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
                      [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                           [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}

TensorFlow

>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
>>> print(encoded_input)
{'input_ids': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
array([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
       [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
       [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int32)>,
 'token_type_ids': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>,
 'attention_mask': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>}

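Different pipelines support tokenizer arguments in their __call__() differently. text-2-text-generation pipelines support (i.e. pass on) only truncation. text-generation pipelines support max_length, truncation, padding, and add_special_tokens. In fill-mask pipelines, tokenizer arguments can be passed in the tokenizer_kwargs argument (a dictionary).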

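Audio

For audio tasks, you'll need a feature extractor to prepare your dataset for the model. The feature extractor is designed to extract features from raw audio data and convert them into tensors.

Load the MInDS-14 dataset (see the 🤗 Datasets tutorial for more details on how to load a dataset) to see how you can use a feature extractor with audio datasets: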

>>> from datasets import load_dataset, Audio
>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")

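Access the first element of the audio column to take a look at the input. Calling the audio column automatically loads and resamples the audio file: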

>>> dataset[0]["audio"]
{'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
         0.        ,  0.        ], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 8000}

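This returns three items:

array is the speech signal loaded, and potentially resampled, as a 1D array.
path points to the location of the audio file.
sampling_rate refers to how many data points in the speech signal are measured per second.

For this tutorial, you'll use the Wav2Vec2 model. Take a look at the model card, and you'll learn that Wav2Vec2 is pretrained on 16kHz sampled speech audio. It is important that your audio data's sampling rate matches the sampling rate of the dataset used to pretrain the model. If your data's sampling rate isn't the same, you need to resample your data.

Use 🤗 Datasets' cast_column method to upsample the sampling rate to 16kHz: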

>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

Call the audio column again to resample the audio file:

>>> dataset[0]["audio"]
{'array': array([ 2.3443763e-05,  2.1729663e-04,  2.2145823e-04, ...,
         3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 16000}
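To make the effect of resampling concrete, here is a deliberately naive sketch based on linear interpolation. It is not what 🤗 Datasets does internally (real resamplers also apply anti-aliasing filters), and resample_linear is a hypothetical helper. The point is only that doubling the sampling rate doubles the number of samples while the duration stays the same.

```python
def resample_linear(samples, orig_rate, target_rate):
    """Naive linear-interpolation resampler (illustration only)."""
    if not samples:
        return []
    n_out = round(len(samples) * target_rate / orig_rate)
    out = []
    for i in range(n_out):
        # Position of output sample i on the original sample grid.
        pos = i * orig_rate / target_rate
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


one_second_8k = [0.0] * 8000  # one second of silence at 8 kHz
one_second_16k = resample_linear(one_second_8k, 8000, 16000)

# Duration in seconds = number of samples / sampling rate.
print(len(one_second_8k) / 8000, len(one_second_16k) / 16000)
```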

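Next, load a feature extractor to normalize and pad the input. When padding textual data, a 0 is added for shorter sequences. The same idea applies to audio data. The feature extractor adds a 0, interpreted as silence, to array.

Load the feature extractor with AutoFeatureExtractor.from_pretrained():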

>>> from transformers import AutoFeatureExtractor
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

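Pass the audio array to the feature extractor. We also recommend adding the sampling_rate argument in the feature extractor in order to better debug any silent errors that may occur.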

>>> audio_input = [dataset[0]["audio"]["array"]]
>>> feature_extractor(audio_input, sampling_rate=16000)
{'input_values': [array([ 3.8106556e-04,  2.7506407e-03,  2.8015103e-03, ...,
        5.6335266e-04,  4.6588284e-06, -1.7142107e-04], dtype=float32)]}
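The input_values differ from the raw array because the extractor normalizes the signal. The sketch below shows one common form of such normalization, zero-mean and unit-variance scaling per utterance. That this is the exact scheme used here is an assumption, as is the eps constant; zero_mean_unit_variance is a hypothetical helper.

```python
import math


def zero_mean_unit_variance(samples, eps=1e-7):
    """Shift samples to mean 0 and scale them to (approximately) variance 1."""
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    # eps (assumed value) guards against division by zero on silent input.
    return [(x - mean) / math.sqrt(var + eps) for x in samples]


normalized = zero_mean_unit_variance([0.0, 0.5, -0.5, 0.25, -0.25])
new_mean = sum(normalized) / len(normalized)
new_var = sum((x - new_mean) ** 2 for x in normalized) / len(normalized)
print(abs(new_mean) < 1e-9, abs(new_var - 1.0) < 1e-5)
```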

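Just like the tokenizer, you can apply padding or truncation to handle variable-length sequences in a batch. Take a look at the sequence length of these two audio samples: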

>>> dataset[0]["audio"]["array"].shape
(173398,)
>>> dataset[1]["audio"]["array"].shape
(106496,)

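Create a function to preprocess the dataset so the audio samples are the same length. Specify a maximum sample length, and the feature extractor will either pad or truncate the sequences to match it: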

>>> def preprocess_function(examples):
...     audio_arrays = [x["array"] for x in examples["audio"]]
...     inputs = feature_extractor(
...         audio_arrays,
...         sampling_rate=16000,
...         padding=True,
...         max_length=100000,
...         truncation=True,
...     )
...     return inputs

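Apply the preprocess_function to the first few examples in the dataset:

>>> processed_dataset = preprocess_function(dataset[:5])

The sample lengths are now the same and match the specified maximum length. You can pass your processed dataset to the model now!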

>>> processed_dataset["input_values"][0].shape
(100000,)
>>> processed_dataset["input_values"][1].shape
(100000,)
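The length handling can be pictured with plain Python lists. This is a simplified sketch, not the feature extractor's code: pad_or_truncate is a hypothetical helper that always pads up to max_length, whereas padding=True in the function above pads to the longest sample in the batch. Both samples here are longer than the limit, so only truncation applies.

```python
def pad_or_truncate(samples, max_length, pad_value=0.0):
    """Cut long arrays and append zeros (silence) to short ones."""
    clipped = samples[:max_length]
    return clipped + [pad_value] * (max_length - len(clipped))


# Lengths of the two samples shown earlier; the sample values are dummies.
lengths = [173398, 106496]
fixed = [pad_or_truncate([0.1] * n, 100000) for n in lengths]
print([len(x) for x in fixed])

# A short input gets trailing zeros instead.
print(pad_or_truncate([0.1, 0.2], 4))
```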

Next part: Transformers 4.37 Chinese Documentation (Part 1) (5) https://developer.aliyun.com/article/1565784
