Transformers 4.37 Chinese Documentation (11) (1) - Alibaba Cloud Developer Community

Original: huggingface.co/docs/transformers

How to create a custom pipeline?

Original text: huggingface.co/docs/transformers/v4.37.2/en/add_new_pipeline

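In this guide, we will see how to create a custom pipeline and share it on the Hub or add it to the 🤗 Transformers library.

First, you need to decide which raw inputs the pipeline will accept. They can be strings, raw bytes, dictionaries, or whatever seems to be the most likely desired input. Try to keep these inputs as pure Python as possible, since that makes compatibility easier (even from other languages via JSON). These will be the inputs of the pipeline (preprocess).

Then define the outputs, following the same policy as for the inputs: the simpler, the better. These will be the outputs of the postprocess method.

Start by inheriting from the base class Pipeline and implementing the 4 required methods: preprocess, _forward, postprocess, and _sanitize_parameters.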

import torch
from transformers import Pipeline


class MyPipeline(Pipeline):
    def _sanitize_parameters(self, **kwargs):
        # Route user-supplied kwargs to preprocess, _forward, and postprocess.
        preprocess_kwargs = {}
        if "maybe_arg" in kwargs:
            preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]
        return preprocess_kwargs, {}, {}

    def preprocess(self, inputs, maybe_arg=2):
        # Turn the raw inputs into tensors the model can consume.
        model_input = torch.tensor(inputs["input_ids"])
        return {"model_input": model_input}

    def _forward(self, model_inputs):
        # model_inputs == {"model_input": model_input}
        outputs = self.model(**model_inputs)
        # Maybe {"logits": Tensor(...)}
        return outputs

    def postprocess(self, model_outputs):
        best_class = model_outputs["logits"].softmax(-1)
        return best_class

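This breakdown makes CPU/GPU support relatively seamless, while also allowing pre/postprocessing to run on the CPU in different threads.

preprocess takes the originally defined inputs and turns them into something the model can consume. It may contain more information and is usually a Dict.

_forward is the implementation detail and is not meant to be called directly. forward is the preferred method to call, as it contains safeguards to make sure everything is working on the expected device. Anything linked to a real model belongs in the _forward method; anything else belongs in preprocess/postprocess.

postprocess takes the output of _forward and turns it into the final output decided on earlier.

_sanitize_parameters exists to allow users to pass any parameters whenever they wish, be it at initialization time, pipeline(...., maybe_arg=4), or at call time, pipe = pipeline(...); output = pipe(...., maybe_arg=4).

_sanitize_parameters returns 3 dicts of kwargs that are passed directly to preprocess, _forward, and postprocess. Don't fill in anything the caller didn't pass as an extra parameter. That keeps the default arguments in the function definitions, which is always more "natural".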
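To make the parameter flow concrete, here is a minimal pure-Python sketch of how init-time and call-time arguments could be merged before reaching the three stages. ToyPipeline and its arithmetic "model" are hypothetical stand-ins written for this illustration; the merging order shown is an assumption about the general mechanism, not the actual Pipeline implementation.

```python
class ToyPipeline:
    """Hypothetical stand-in for a Pipeline subclass; not the real base class."""

    def __init__(self, **kwargs):
        # Parameters given at construction time act as per-instance defaults.
        self._init_params = self._sanitize_parameters(**kwargs)

    def _sanitize_parameters(self, **kwargs):
        preprocess_kwargs = {}
        if "maybe_arg" in kwargs:
            preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]
        # Only fill in what the caller actually passed.
        return preprocess_kwargs, {}, {}

    def preprocess(self, inputs, maybe_arg=2):
        return {"model_input": inputs * maybe_arg}

    def _forward(self, model_inputs):
        # Stands in for the model call.
        return {"logits": model_inputs["model_input"] + 1}

    def postprocess(self, model_outputs):
        return model_outputs["logits"]

    def __call__(self, inputs, **kwargs):
        call_params = self._sanitize_parameters(**kwargs)
        # Call-time values override init-time values; anything left unset
        # falls back to the defaults in the method signatures.
        pre, fwd, post = (
            {**init, **call} for init, call in zip(self._init_params, call_params)
        )
        model_inputs = self.preprocess(inputs, **pre)
        model_outputs = self._forward(model_inputs, **fwd)
        return self.postprocess(model_outputs, **post)


pipe = ToyPipeline()
print(pipe(10))                      # signature default maybe_arg=2 -> 21
print(pipe(10, maybe_arg=4))         # call-time value -> 41
print(ToyPipeline(maybe_arg=3)(10))  # init-time value -> 31
```

Because _sanitize_parameters returns empty dicts when nothing is passed, the default maybe_arg=2 in the preprocess signature stays the single source of truth.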

A classic example is adding a top_k argument to the postprocessing of a classification task.

>>> pipe = pipeline("my-new-task")
>>> pipe("This is a test")
[{"label": "1-star", "score": 0.8}, {"label": "2-star", "score": 0.1}, {"label": "3-star", "score": 0.05},
{"label": "4-star", "score": 0.025}, {"label": "5-star", "score": 0.025}]
>>> pipe("This is a test", top_k=2)
[{"label": "1-star", "score": 0.8}, {"label": "2-star", "score": 0.1}]

To do so, we'll update our postprocess method with a default of 5 and edit _sanitize_parameters to allow this new parameter.

def postprocess(self, model_outputs, top_k=5):
    best_class = model_outputs["logits"].softmax(-1)
    # Add logic to handle top_k
    return best_class
def _sanitize_parameters(self, **kwargs):
    preprocess_kwargs = {}
    if "maybe_arg" in kwargs:
        preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]
    postprocess_kwargs = {}
    if "top_k" in kwargs:
        postprocess_kwargs["top_k"] = kwargs["top_k"]
    return preprocess_kwargs, {}, postprocess_kwargs
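The "Add logic to handle top_k" placeholder could be filled in along the following lines. This is a framework-free sketch that works on a flat list of logits; the helper names softmax and top_k_labels, the label mapping, and the logits are all made up for illustration, and a real postprocess would read them from the model outputs and the model config.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(logits, id2label, top_k=5):
    """Return the top_k highest-scoring labels as JSON-friendly dicts."""
    scores = softmax(logits)
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return [{"label": id2label[i], "score": score} for i, score in ranked[:top_k]]


# Hypothetical label mapping and logits, for illustration only.
id2label = {0: "1-star", 1: "2-star", 2: "3-star"}
print(top_k_labels([2.0, 1.0, 0.1], id2label, top_k=2))
```

Slicing with ranked[:top_k] also behaves sensibly when top_k exceeds the number of classes: it simply returns every label.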

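Try to keep the inputs/outputs very simple and ideally JSON-serializable, as that makes the pipeline very easy to use without requiring users to understand new kinds of objects. It is also relatively common to support many different types of arguments for ease of use (for example audio files, which can be filenames, URLs, or raw bytes).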
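One way to accept several input types is to normalize them at the start of preprocess. The sketch below assumes a hypothetical helper named read_audio_bytes that turns a filename, an http(s) URL, or raw bytes into bytes; it is an illustration of the idea, not code taken from the library.

```python
import os
import tempfile
from urllib.request import urlopen


def read_audio_bytes(inputs):
    """Normalize a filename, an http(s) URL, or raw bytes into bytes."""
    if isinstance(inputs, (bytes, bytearray)):
        return bytes(inputs)
    if isinstance(inputs, str):
        if inputs.startswith(("http://", "https://")):
            # Remote file: download the content.
            with urlopen(inputs) as response:
                return response.read()
        # Anything else is treated as a local path.
        with open(inputs, "rb") as handle:
            return handle.read()
    raise TypeError(f"Unsupported input type: {type(inputs).__name__}")


# Usage: raw bytes pass through, and a local file is read from disk.
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
    tmp.write(b"fake-audio")
print(read_audio_bytes(b"fake-audio") == read_audio_bytes(tmp.name))
os.remove(tmp.name)
```

Rejecting unknown types with a TypeError keeps the accepted inputs explicit, which matters when callers reach the pipeline through JSON from another language.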

Adding it to the list of supported tasks

To register your new-task in the list of supported tasks, you have to add it to the PIPELINE_REGISTRY:

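from transformers import AutoModelForSequenceClassification
from transformers.pipelines import PIPELINE_REGISTRY
# MyPipeline is the Pipeline subclass defined earlier in this guide.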
PIPELINE_REGISTRY.register_pipeline(
    "new-task",
    pipeline_class=MyPipeline,
    pt_model=AutoModelForSequenceClassification,
)

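If you want, you can specify a default model, in which case it should come with a specific revision (which can be a branch name or a commit hash; here we use "abcdef") as well as the type: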

PIPELINE_REGISTRY.register_pipeline(
    "new-task",
    pipeline_class=MyPipeline,
    pt_model=AutoModelForSequenceClassification,
    default={"pt": ("user/awesome_model", "abcdef")},
    type="text",  # current support type: text, audio, image, multimodal
)

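Sharing your pipeline on the Hub

To share your custom pipeline on the Hub, you just have to save the custom code of your Pipeline subclass in a Python file. For instance, let's say we want to use a custom pipeline for sentence pair classification like this: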

import numpy as np
from transformers import Pipeline
def softmax(outputs):
    maxes = np.max(outputs, axis=-1, keepdims=True)
    shifted_exp = np.exp(outputs - maxes)
    return shifted_exp / shifted_exp.sum(axis=-1, keepdims=True)
class PairClassificationPipeline(Pipeline):
    def _sanitize_parameters(self, **kwargs):
        preprocess_kwargs = {}
        if "second_text" in kwargs:
            preprocess_kwargs["second_text"] = kwargs["second_text"]
        return preprocess_kwargs, {}, {}
    def preprocess(self, text, second_text=None):
        return self.tokenizer(text, text_pair=second_text, return_tensors=self.framework)
    def _forward(self, model_inputs):
        return self.model(**model_inputs)
    def postprocess(self, model_outputs):
        logits = model_outputs.logits[0].numpy()
        probabilities = softmax(logits)
        best_class = np.argmax(probabilities)
        label = self.model.config.id2label[best_class]
        score = probabilities[best_class].item()
        logits = logits.tolist()
        return {"label": label, "score": score, "logits": logits}

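This implementation is framework-agnostic and will work for both PyTorch and TensorFlow models. If we save it in a file named pair_classification.py, we can then import and register it like this: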

from pair_classification import PairClassificationPipeline
from transformers.pipelines import PIPELINE_REGISTRY
from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification
PIPELINE_REGISTRY.register_pipeline(
    "pair-classification",
    pipeline_class=PairClassificationPipeline,
    pt_model=AutoModelForSequenceClassification,
    tf_model=TFAutoModelForSequenceClassification,
)

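Once this is done, we can use it with a pretrained model. For instance, sgugger/finetuned-bert-mrpc has been fine-tuned on the MRPC dataset, which classifies pairs of sentences as paraphrases or not.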

from transformers import pipeline
classifier = pipeline("pair-classification", model="sgugger/finetuned-bert-mrpc")

Then we can share it on the Hub by using the save_pretrained method in a Repository:

from huggingface_hub import Repository
repo = Repository("test-dynamic-pipeline", clone_from="{your_username}/test-dynamic-pipeline")
classifier.save_pretrained("test-dynamic-pipeline")
repo.push_to_hub()

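This will copy the file in which you defined PairClassificationPipeline into the folder "test-dynamic-pipeline", along with saving the model and tokenizer of the pipeline, before pushing everything to the repository {your_username}/test-dynamic-pipeline. After that, anyone can use it as long as they provide the option trust_remote_code=True: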

from transformers import pipeline
classifier = pipeline(model="{your_username}/test-dynamic-pipeline", trust_remote_code=True)

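Adding the pipeline to 🤗 Transformers

If you want to contribute your pipeline to 🤗 Transformers, you will need to add a new module in the pipelines submodule with the code of your pipeline, then add it to the list of tasks defined in pipelines/__init__.py.

Then you will need to add tests. Create a new file tests/test_pipelines_MY_PIPELINE.py, using the other tests as examples.

The run_pipeline_test function will be very generic and run on small random models for every possible architecture defined by model_mapping and tf_model_mapping.

This is very important for testing future compatibility: if someone adds a new model for XXXForQuestionAnswering, the pipeline test will attempt to run on it. Because the models are random, it's impossible to check actual values, which is why there is a helper ANY that simply attempts to match the type of the pipeline output.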
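The idea behind a type-matching helper like ANY can be sketched in a few lines of plain Python. The class below is a minimal illustration of the technique (an object whose equality check is an isinstance test); the helper actually used in the test suite may differ in its details, and the sample output is invented.

```python
class ANY:
    """Compares equal to any value that is an instance of the given types."""

    def __init__(self, *types):
        self._types = types

    def __eq__(self, other):
        return isinstance(other, self._types)

    def __repr__(self):
        return f"ANY({', '.join(t.__name__ for t in self._types)})"


# Hypothetical output from a pipeline running on a small random model.
outputs = [{"label": "LABEL_0", "score": 0.4973}]

# The values are meaningless for a random model, so only the shape is checked.
assert outputs == [{"label": ANY(str), "score": ANY(float)}]
assert outputs != [{"label": ANY(int), "score": ANY(float)}]
```

Since the comparison lives in __eq__, the helper can be nested anywhere inside the expected lists and dicts, and a failing assertion still prints a readable ANY(str) thanks to __repr__.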

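You also need to implement 2 (ideally 4) tests.

test_small_model_pt: Define 1 small model for this pipeline (it doesn't matter if the results don't make sense) and test the pipeline outputs. The results should be the same as test_small_model_tf.
test_small_model_tf: Define 1 small model for this pipeline (it doesn't matter if the results don't make sense) and test the pipeline outputs. The results should be the same as test_small_model_pt.
test_large_model_pt (optional): Tests the pipeline on a real pipeline where the results are supposed to make sense. These tests are slow and should be marked as such. The goal here is to showcase the pipeline and to make sure there is no drift in future releases.
test_large_model_tf (optional): Tests the pipeline on a real pipeline where the results are supposed to make sense. These tests are slow and should be marked as such. The goal here is to showcase the pipeline and to make sure there is no drift in future releases.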

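Testing

Original text: huggingface.co/docs/transformers/v4.37.2/en/testing

Let's take a look at how 🤗 Transformers models are tested and how you can write new tests and improve the existing ones.

There are 2 test suites in the repository:

tests: tests for the general API
examples: tests primarily for various applications that aren't part of the API

How transformers are tested

Once a PR is submitted, it gets tested with 9 CircleCI jobs. Every new commit to that PR gets retested. These jobs are defined in this config file, so that if needed you can reproduce the same environment on your machine.
These CI jobs don't run @slow tests.
There are 3 jobs run by GitHub Actions:

torch hub integration: checks whether torch hub integration works.
self-hosted (push): runs fast tests on GPU only on commits on main. It only runs if a commit on main has updated the code in one of the following folders: src, tests, .github (to prevent running on added model cards, notebooks, etc.).
self-hosted runner: runs normal and slow tests on GPU in tests and examples: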

RUN_SLOW=1 pytest tests/
RUN_SLOW=1 pytest examples/

The results can be observed here.
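The RUN_SLOW=1 switch used above can be pictured as an environment flag that a decorator checks before letting a test run. The following is a simplified sketch built on unittest.skipUnless; the helper env_flag and the set of accepted truthy strings are assumptions made for this illustration, and the decorator shipped with the library may parse the flag differently.

```python
import os
import unittest


def env_flag(name, default=False):
    """Read a boolean flag such as RUN_SLOW=1 from the environment."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "y", "on"}


def slow(test_case):
    """Skip the decorated test unless RUN_SLOW is set to a truthy value."""
    return unittest.skipUnless(env_flag("RUN_SLOW"), "test is slow")(test_case)


class ExampleTest(unittest.TestCase):
    def test_fast(self):
        self.assertEqual(1 + 1, 2)

    @slow
    def test_large_model(self):
        # Placeholder for an expensive check that only runs with RUN_SLOW=1.
        self.assertTrue(True)
```

Note that in this sketch the flag is read when the decorator is applied, that is, at import time of the test module, so it has to be set before pytest starts.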

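Running tests

Choosing which tests to run

This document goes into many details of how tests can be run. If after reading everything you need even more details, you will find them here.

Here are some of the most useful ways of running tests.

Run all:

pytest

or:

make test

Note that the latter is defined as: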

python -m pytest -n auto --dist=loadfile -s -v ./tests/

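which tells pytest to:

run as many test processes as there are CPU cores (which could be too many if you don't have a lot of RAM!)
ensure that all tests from the same file will be run by the same test process
do not capture output
run in verbose mode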

Getting the list of all tests

All tests of the test suite:

pytest --collect-only -q

All tests of a given test file:

pytest tests/test_optimization.py --collect-only -q

Run a specific test module

To run an individual test module:

pytest tests/utils/test_logging.py

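Run specific tests

Since unittest is used inside most of the tests, to run specific subtests you need to know the name of the unittest class containing those tests. For example, it could be: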

pytest tests/test_optimization.py::OptimizationTest::test_adam_w

Here:

tests/test_optimization.py - the file with the tests
OptimizationTest - the name of the class
test_adam_w - the name of the specific test function
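The three parts of such a test identifier are joined by a double colon, which makes them easy to take apart. The helper below, parse_node_id, is a hypothetical name used only for this illustration, and it assumes the unittest-style layout file::Class::test described above (a module-level test function would have only two parts).

```python
def parse_node_id(node_id):
    """Split a file::Class::test identifier into its named parts."""
    path, *rest = node_id.split("::")
    class_name = rest[0] if len(rest) >= 1 else None
    test_name = rest[1] if len(rest) >= 2 else None
    return {"file": path, "class": class_name, "test": test_name}


print(parse_node_id("tests/test_optimization.py::OptimizationTest::test_adam_w"))
print(parse_node_id("tests/test_optimization.py::OptimizationTest"))
```

Dropping parts from the right widens the selection: the full identifier selects one test, the first two parts select a whole class, and the path alone selects the whole file.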

If the file contains multiple classes, you can choose to run only the tests of a given class. For example:

pytest tests/test_optimization.py::OptimizationTest

will run all the tests inside that class.

As mentioned earlier, you can see which tests are contained inside the OptimizationTest class by running:

pytest tests/test_optimization.py::OptimizationTest --collect-only -q

You can run tests by keyword expressions.

To run only tests whose name contains adam:

pytest -k adam tests/test_optimization.py

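Logical and and or can be used to indicate whether all keywords should match or either. not can be used to negate.

To run all tests except those whose name contains adam: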

pytest -k "not adam" tests/test_optimization.py

And you can combine the two patterns in one:

pytest -k "ada and not adam" tests/test_optimization.py

For example, to run both test_adafactor and test_adam_w you can use:

pytest -k "test_adafactor or test_adam_w" tests/test_optimization.py

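Note that we use or here, since we want either of the keywords to match to include both.

If you want to include only tests that include both patterns, and is to be used: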

pytest -k "test and ada" tests/test_optimization.py
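The effect of these keyword expressions can be approximated with plain substring checks. The sketch below is a deliberately simplified model: it only looks at test function names, whereas pytest also considers other names such as the class and module and matches case-insensitively. test_adam_w and test_adafactor come from the examples above; test_schedulers is a hypothetical extra name added for contrast.

```python
# Test names from the examples above, plus one hypothetical name for contrast.
names = ["test_adam_w", "test_adafactor", "test_schedulers"]

# Each lambda mirrors one -k expression as plain substring checks.
expressions = {
    "adam": lambda n: "adam" in n,
    "not adam": lambda n: "adam" not in n,
    "ada and not adam": lambda n: "ada" in n and "adam" not in n,
    "test_adafactor or test_adam_w": lambda n: "test_adafactor" in n or "test_adam_w" in n,
    "test and ada": lambda n: "test" in n and "ada" in n,
}

selected = {
    expr: [n for n in names if match(n)] for expr, match in expressions.items()
}
for expr, chosen in selected.items():
    print(f'-k "{expr}": {chosen}')
```

Note that "ada" is a substring of "adam", which is why "ada and not adam" is needed to single out test_adafactor.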

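Run accelerate tests

Sometimes you need to run accelerate tests on your models. For that you can just add -m accelerate_tests to your command; for example, if you want to run these tests on OPT, run: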

RUN_SLOW=1 pytest -m accelerate_tests tests/models/opt/test_modeling_opt.py

Transformers 4.37 Chinese Documentation (11) (2): https://developer.aliyun.com/article/1564971
