Transformers 4.37 Chinese Documentation (Part 17) (1) - Alibaba Cloud Developer Community

Original: huggingface.co/docs/transformers

Pipelines

Original link: huggingface.co/docs/transformers/v4.37.2/en/main_classes/pipelines

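The pipelines are a great and easy way to use models for inference. They are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering. See the task summary for examples of use.

There are two categories of pipeline abstractions to be aware of:

The pipeline(), which is the most powerful object encapsulating all other pipelines.
Task-specific pipelines, available for audio, computer vision, natural language processing, and multimodal tasks.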

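The pipeline abstraction

The pipeline abstraction is a wrapper around all the other available pipelines. It is instantiated like any other pipeline but can provide additional quality-of-life features.

Simple call on one item: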

>>> from transformers import pipeline
>>> pipe = pipeline("text-classification")
>>> pipe("This restaurant is awesome")
[{'label': 'POSITIVE', 'score': 0.9998743534088135}]

If you want to use a specific model from the Hub, you can omit the task if the model on the Hub already defines it:

>>> pipe = pipeline(model="roberta-large-mnli")
>>> pipe("This restaurant is awesome")
[{'label': 'NEUTRAL', 'score': 0.7313136458396912}]

To call a pipeline on many items, you can call it with a list.

>>> pipe = pipeline("text-classification")
>>> pipe(["This restaurant is awesome", "This restaurant is awful"])
[{'label': 'POSITIVE', 'score': 0.9998743534088135},
 {'label': 'NEGATIVE', 'score': 0.9996669292449951}]

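To iterate over full datasets, it is recommended to use a dataset directly. This means you don't need to allocate the whole dataset at once, nor do you need to do batching yourself. This should be just as fast as a custom loop on GPU. If it isn't, don't hesitate to create an issue.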

import datasets
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from tqdm.auto import tqdm
pipe = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=0)
dataset = datasets.load_dataset("superb", name="asr", split="test")
# KeyDataset (only *pt*) will simply return the item in the dict returned by the dataset item
# as we're not interested in the *target* part of the dataset. For sentence pair use KeyPairDataset
for out in tqdm(pipe(KeyDataset(dataset, "file"))):
    print(out)
    # {"text": "NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND"}
    # {"text": ....}
    # ....
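KeyDataset itself is PyTorch-only. As a rough illustration of what such a wrapper does, here is a plain-Python sketch assuming a map-style dataset whose items are dicts. `SimpleKeyDataset` and `SimpleKeyPairDataset` are hypothetical names, not the transformers classes, and the `text`/`text_pair` keys of the pair variant are an assumption chosen to match how text pipelines usually receive sentence pairs.

```python
from typing import Any, Dict, Sequence


class SimpleKeyDataset:
    """Hypothetical stand-in: expose a single field of each dict item."""

    def __init__(self, dataset: Sequence[Dict[str, Any]], key: str):
        self.dataset = dataset
        self.key = key

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, i: int) -> Any:
        # Keep only the input field and drop the rest (e.g. the target)
        return self.dataset[i][self.key]


class SimpleKeyPairDataset:
    """Hypothetical stand-in for sentence-pair inputs."""

    def __init__(self, dataset: Sequence[Dict[str, Any]], key1: str, key2: str):
        self.dataset = dataset
        self.key1 = key1
        self.key2 = key2

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, i: int) -> Dict[str, Any]:
        # Package both fields as a single pair input
        item = self.dataset[i]
        return {"text": item[self.key1], "text_pair": item[self.key2]}


rows = [
    {"file": "a.flac", "target": "HELLO"},
    {"file": "b.flac", "target": "GOOD NIGHT"},
]
pairs = [{"premise": "A man eats.", "hypothesis": "Someone eats."}]

# Iteration works through __getitem__ until the underlying list raises IndexError
print(list(SimpleKeyDataset(rows, "file")))  # ['a.flac', 'b.flac']
print(SimpleKeyPairDataset(pairs, "premise", "hypothesis")[0])
```

The wrapper never copies the data: it only forwards index lookups, which is why the full dataset does not need to be loaded or re-packaged before streaming it into a pipeline.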

For convenience, a generator can also be used:

from transformers import pipeline
pipe = pipeline("text-classification")
def data():
    while True:
        # This could come from a dataset, a database, a queue or HTTP request
        # in a server
        # Caveat: because this is iterative, you cannot use `num_workers > 1` variable
        # to use multiple threads to preprocess data. You can still have 1 thread that
        # does the preprocessing while the main runs the big inference
        yield "This is a test"
for out in pipe(data()):  # data() never ends, so this loop runs until interrupted
    print(out)
    # one classification result ("label" and "score") per yielded text
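Because the generator above never stops, the loop runs forever. A minimal sketch of bounding such a stream with `itertools.islice` follows. `fake_pipe` is a hypothetical stand-in for `pipe(...)` so the snippet runs without downloading a model. We wrap `islice` in a generator function because the pipeline streams lists, datasets and generators, and an `islice` object is none of those; the wrapper keeps the input a true generator to be safe.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator


def data() -> Iterator[str]:
    # Endless stream, as in the example above
    while True:
        yield "This is a test"


def bounded(stream: Iterable[str], limit: int) -> Iterator[str]:
    # A generator function, so the result is a real generator object
    yield from islice(stream, limit)


def fake_pipe(inputs: Iterable[str]) -> Iterator[Dict[str, str]]:
    # Hypothetical stand-in for pipe(...): yields one result per input
    for text in inputs:
        yield {"input": text}


for out in fake_pipe(bounded(data(), 3)):
    print(out)
```

With a real pipeline, the last loop would read `for out in pipe(bounded(data(), 3)):`.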

`transformers.pipeline`


( task: str = None model: Union = None config: Union = None tokenizer: Union = None feature_extractor: Union = None image_processor: Union = None framework: Optional = None revision: Optional = None use_fast: bool = True token: Union = None device: Union = None device_map = None torch_dtype = None trust_remote_code: Optional = None model_kwargs: Dict = None pipeline_class: Optional = None **kwargs ) → Pipeline

Parameters

task (str) — The task defining which pipeline will be returned. Currently accepted tasks are:

"audio-classification": will return an AudioClassificationPipeline.
"automatic-speech-recognition": will return an AutomaticSpeechRecognitionPipeline.
"conversational": will return a ConversationalPipeline.
"depth-estimation": will return a DepthEstimationPipeline.
"document-question-answering": will return a DocumentQuestionAnsweringPipeline.
"feature-extraction": will return a FeatureExtractionPipeline.
"fill-mask": will return a FillMaskPipeline.
"image-classification": will return an ImageClassificationPipeline.
"image-segmentation": will return an ImageSegmentationPipeline.
"image-to-image": will return an ImageToImagePipeline.
"image-to-text": will return an ImageToTextPipeline.
"mask-generation": will return a MaskGenerationPipeline.
"object-detection": will return an ObjectDetectionPipeline.
"question-answering": will return a QuestionAnsweringPipeline.
"summarization": will return a SummarizationPipeline.
"table-question-answering": will return a TableQuestionAnsweringPipeline.
"text2text-generation": will return a Text2TextGenerationPipeline.
"text-classification" (alias "sentiment-analysis" available): will return a TextClassificationPipeline.
"text-generation": will return a TextGenerationPipeline.
"text-to-audio" (alias "text-to-speech" available): will return a TextToAudioPipeline.
"token-classification" (alias "ner" available): will return a TokenClassificationPipeline.
"translation": will return a TranslationPipeline.
"translation_xx_to_yy": will return a TranslationPipeline.
"video-classification": will return a VideoClassificationPipeline.
"visual-question-answering": will return a VisualQuestionAnsweringPipeline.
"zero-shot-classification": will return a ZeroShotClassificationPipeline.
"zero-shot-image-classification": will return a ZeroShotImageClassificationPipeline.
"zero-shot-audio-classification": will return a ZeroShotAudioClassificationPipeline.
"zero-shot-object-detection": will return a ZeroShotObjectDetectionPipeline.

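model (str or PreTrainedModel or TFPreTrainedModel, optional) — The model that will be used by the pipeline to make predictions. This can be a model identifier or an actual instance of a pretrained model inheriting from PreTrainedModel (for PyTorch) or TFPreTrainedModel (for TensorFlow).
If not provided, the default model for the task will be loaded.
config (str or PretrainedConfig, optional) — The configuration that will be used by the pipeline to instantiate the model. This can be a model identifier or an actual pretrained model configuration inheriting from PretrainedConfig.
If not provided, the default configuration file for the requested model will be used. That means that if model is given, its default configuration will be used. However, if model is not supplied, the default model configuration for this task is used instead.
tokenizer (str or PreTrainedTokenizer, optional) — The tokenizer that will be used by the pipeline to encode data for the model. This can be a model identifier or an actual pretrained tokenizer inheriting from PreTrainedTokenizer.
If not provided, the default tokenizer for the given model will be loaded (if model is a string). If model is not specified or is not a string, the default tokenizer for config is loaded (if config is a string). However, if config is also not given or is not a string, the default tokenizer for the given task will be loaded.
feature_extractor (str or PreTrainedFeatureExtractor, optional) — The feature extractor that will be used by the pipeline to encode data for the model. This can be a model identifier or an actual pretrained feature extractor inheriting from PreTrainedFeatureExtractor.
Feature extractors are used for non-NLP models, such as speech or vision models, as well as multimodal models. Multimodal models also require a tokenizer to be passed.
If not provided, the default feature extractor for the given model will be loaded (if model is a string). If model is not specified or is not a string, the default feature extractor for config is loaded (if config is a string). However, if config is also not given or is not a string, the default feature extractor for the given task will be loaded.
framework (str, optional) — The framework to use, either "pt" for PyTorch or "tf" for TensorFlow. The specified framework must be installed.
If no framework is specified, it defaults to the one currently installed. If no framework is specified and both frameworks are installed, it defaults to the framework of the model, or to PyTorch if no model is provided.
revision (str, optional, defaults to "main") — When passing a task name or a string model identifier: the specific model version to use. It can be a branch name, a tag name, or a commit id; since models and other artifacts on huggingface.co are stored in a git-based system, revision can be any identifier allowed by git.
use_fast (bool, optional, defaults to True) — Whether or not to use a fast tokenizer (PreTrainedTokenizerFast) if possible.
use_auth_token (str or bool, optional) — The token to use as HTTP bearer authorization for remote files. If True, the token generated when running huggingface-cli login (stored in ~/.huggingface) will be used.
device (int or str or torch.device) — Defines the device (e.g., "cpu", "cuda:1", "mps", or a GPU ordinal rank like 1) on which this pipeline will be allocated.
device_map (str or Dict[str, Union[int, str, torch.device]], optional) — Sent directly as model_kwargs (just a simpler shortcut). When the accelerate library is present, set device_map="auto" to compute the most optimized device_map automatically (see the accelerate documentation for more information).
Do not use device_map and device at the same time, as they will conflict.
torch_dtype (str or torch.dtype, optional) — Sent directly as model_kwargs (just a simpler shortcut) to use the available precision for this model (torch.float16, torch.bfloat16, … or "auto").
trust_remote_code (bool, optional, defaults to False) — Whether or not to allow custom code defined on the Hub in its own modeling, configuration, tokenization or even pipeline files. This option should only be set to True for repositories you trust and whose code you have read, as it will execute code present on the Hub on your local machine.
model_kwargs (Dict[str, Any], optional) — Additional dictionary of keyword arguments passed along to the model's from_pretrained(..., **model_kwargs) function.
kwargs (Dict[str, Any], optional) — Additional keyword arguments passed along to the specific pipeline init (see the documentation for the corresponding pipeline class for possible values).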
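The task aliases and the `translation_xx_to_yy` form in the task list can be illustrated with a small lookup. This is a hypothetical helper written for this page, not the library's internal task-resolution code; `TASK_ALIASES` and `normalize_task` are illustrative names.

```python
from typing import Optional, Tuple

# Aliases taken from the task list above
TASK_ALIASES = {
    "sentiment-analysis": "text-classification",
    "ner": "token-classification",
    "text-to-speech": "text-to-audio",
}


def normalize_task(task: str) -> Tuple[str, Optional[Tuple[str, str]]]:
    """Map an alias to its canonical task and split "translation_xx_to_yy"
    into ("translation", (source_lang, target_lang))."""
    task = TASK_ALIASES.get(task, task)
    if task.startswith("translation_"):
        parts = task.split("_")
        # Expected shape: ["translation", src, "to", tgt]
        if len(parts) == 4 and parts[2] == "to":
            return "translation", (parts[1], parts[3])
        raise ValueError(f"Invalid translation task {task!r}, use 'translation_xx_to_yy'")
    return task, None


print(normalize_task("ner"))                   # ('token-classification', None)
print(normalize_task("translation_en_to_fr"))  # ('translation', ('en', 'fr'))
```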

Pipeline

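A suitable pipeline for the task.

Utility factory method to build a Pipeline.

Pipelines are made of:

A tokenizer in charge of mapping raw textual input to tokens.
A model to make predictions from the inputs.
Some (optional) post-processing for enhancing the model's output.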
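To make the three components concrete, here is a toy, framework-free sketch. The vocabulary, the word-counting "model" and `ToyPipeline` are invented for illustration; real pipelines also deal with tensors, devices and batching. The method names preprocess / forward / postprocess mirror the stages used in the chunk-batching discussion later on this page.

```python
from typing import Dict, List, Set


class ToyPipeline:
    """Hypothetical three-stage pipeline: tokenize -> model -> post-process."""

    def __init__(self, vocab: Dict[str, int], positive_ids: Set[int]):
        self.vocab = vocab
        self.positive_ids = positive_ids

    def preprocess(self, text: str) -> List[int]:
        # "Tokenizer": map raw text to token ids (unknown words -> 0)
        return [self.vocab.get(word, 0) for word in text.lower().split()]

    def forward(self, ids: List[int]) -> float:
        # "Model": fraction of tokens considered positive
        if not ids:
            return 0.0
        return sum(i in self.positive_ids for i in ids) / len(ids)

    def postprocess(self, score: float) -> Dict[str, object]:
        # Optional post-processing: turn the raw score into a label
        label = "POSITIVE" if score >= 0.5 else "NEGATIVE"
        return {"label": label, "score": score}

    def __call__(self, text: str) -> Dict[str, object]:
        return self.postprocess(self.forward(self.preprocess(text)))


pipe = ToyPipeline({"awesome": 1, "awful": 2, "restaurant": 3}, positive_ids={1})
print(pipe("awesome awesome restaurant"))
```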

Examples:

>>> from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
>>> # Sentiment analysis pipeline
>>> analyzer = pipeline("sentiment-analysis")
>>> # Question answering pipeline, specifying the checkpoint identifier
>>> oracle = pipeline(
...     "question-answering", model="distilbert-base-cased-distilled-squad", tokenizer="bert-base-cased"
... )
>>> # Named entity recognition pipeline, passing in a specific model and tokenizer
>>> model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> recognizer = pipeline("ner", model=model, tokenizer=tokenizer)

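Pipeline batching

All pipelines can use batching. Batching works whenever the pipeline uses its streaming ability (that is, when passing a list, a Dataset or a generator).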

from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
import datasets
dataset = datasets.load_dataset("imdb", name="plain_text", split="unsupervised")
pipe = pipeline("text-classification", device=0)
for out in pipe(KeyDataset(dataset, "text"), batch_size=8, truncation="only_first"):
    print(out)
    # [{'label': 'POSITIVE', 'score': 0.9998743534088135}]
    # Exactly the same output as before, but the contents are passed
    # as batches to the model

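However, this is not automatically a performance win. It can be either a 10x speedup or a 5x slowdown, depending on the hardware, the data and the actual model being used.

Example where it's mostly a speedup: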

from transformers import pipeline
from torch.utils.data import Dataset
from tqdm.auto import tqdm
pipe = pipeline("text-classification", device=0)
class MyDataset(Dataset):
    def __len__(self):
        return 5000
    def __getitem__(self, i):
        return "This is a test"
dataset = MyDataset()
for batch_size in [1, 8, 64, 256]:
    print("-" * 30)
    print(f"Streaming batch_size={batch_size}")
    for out in tqdm(pipe(dataset, batch_size=batch_size), total=len(dataset)):
        pass

# On GTX 970
------------------------------
Streaming no batching
100%|██████████████████████████████████████████████████████████████████████| 5000/5000 [00:26<00:00, 187.52it/s]
------------------------------
Streaming batch_size=8
100%|█████████████████████████████████████████████████████████████████████| 5000/5000 [00:04<00:00, 1205.95it/s]
------------------------------
Streaming batch_size=64
100%|█████████████████████████████████████████████████████████████████████| 5000/5000 [00:02<00:00, 2478.24it/s]
------------------------------
Streaming batch_size=256
100%|█████████████████████████████████████████████████████████████████████| 5000/5000 [00:01<00:00, 2554.43it/s]
(diminishing returns, saturated the GPU)

Example where it's mostly a slowdown:

from torch.utils.data import Dataset
class MyDataset(Dataset):
    def __len__(self):
        return 5000
    def __getitem__(self, i):
        if i % 64 == 0:
            n = 100
        else:
            n = 1
        return "This is a test" * n

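This dataset contains an occasional, very long sentence compared to the others. In that case, the whole batch needs to be 400 tokens long, so the whole batch will be [64, 400] instead of [64, 4], leading to a severe slowdown. Even worse, on bigger batches, the program simply crashes.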

------------------------------
Streaming no batching
100%|█████████████████████████████████████████████████████████████████████| 1000/1000 [00:05<00:00, 183.69it/s]
------------------------------
Streaming batch_size=8
100%|█████████████████████████████████████████████████████████████████████| 1000/1000 [00:03<00:00, 265.74it/s]
------------------------------
Streaming batch_size=64
100%|██████████████████████████████████████████████████████████████████████| 1000/1000 [00:26<00:00, 37.80it/s]
------------------------------
Streaming batch_size=256
  0%|                                                                                 | 0/1000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/nicolas/src/transformers/test.py", line 42, in <module>
    for out in tqdm(pipe(dataset, batch_size=256), total=len(dataset)):
....
    q = q / math.sqrt(dim_per_head)  # (bs, n_heads, q_length, dim_per_head)
RuntimeError: CUDA out of memory. Tried to allocate 376.00 MiB (GPU 0; 3.95 GiB total capacity; 1.72 GiB already allocated; 354.88 MiB free; 2.46 GiB reserved in total by PyTorch)
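The slowdown can be estimated with simple arithmetic. This sketch assumes each batch is padded to its longest sequence and reuses the illustrative lengths from the paragraph above (4 tokens for a normal sentence, 400 for the long one): the padded [64, 400] batch holds 25,600 positions, of which only 400 + 63 × 4 = 652 are real tokens, so roughly 97% of the compute is spent on padding. The helper names are hypothetical.

```python
from typing import List, Tuple


def padded_shape(lengths: List[int]) -> Tuple[int, int]:
    # A padded batch is as long as its longest sequence
    return len(lengths), max(lengths)


def padding_waste(lengths: List[int]) -> float:
    # Share of the padded batch that is padding rather than real tokens
    rows, cols = padded_shape(lengths)
    return 1 - sum(lengths) / (rows * cols)


regular = [4] * 64
with_outlier = [400] + [4] * 63

print(padded_shape(regular))       # (64, 4)
print(padded_shape(with_outlier))  # (64, 400)
print(padding_waste(with_outlier))
```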

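There is no good (general) solution for this problem, and your mileage may vary depending on your use case. For users, a rule of thumb is:

Measure performance on your load, with your hardware. Measure, measure, and keep measuring. Real numbers are the only way to go.
If you are latency constrained (a live product doing inference), don't batch.
If you are using CPU, don't batch.
If you are optimizing for throughput (you want to run your model on a bunch of static data) on GPU, then:

If you have no clue about the size of the sequence length ("natural" data), don't batch by default; measure, try adding it tentatively, and add OOM checks to recover when it fails (and it will fail at some point if you don't control the sequence length).
If your sequence lengths are very regular, batching is more likely to be very beneficial; measure and push it until you get OOMs.
The larger the GPU, the more likely batching is to be beneficial.

As soon as you enable batching, make sure you can handle OOMs nicely.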
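One way to "handle OOMs nicely" is to retry with a smaller batch. The following is a minimal, framework-free sketch of that back-off strategy, not a transformers feature; `run_with_backoff` and `fake_run_batch` are hypothetical. With PyTorch on GPU you would likely pass `oom_errors=(torch.cuda.OutOfMemoryError,)` (available in recent PyTorch releases) and call the pipeline inside `run_batch`, for example `lambda batch: pipe(list(batch), batch_size=len(batch))`; treat that wiring as an assumption to verify on your setup.

```python
from typing import Callable, List, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_with_backoff(
    items: Sequence[T],
    run_batch: Callable[[Sequence[T]], List[R]],
    batch_size: int,
    oom_errors: Tuple[Type[BaseException], ...] = (MemoryError,),
) -> List[R]:
    """Run items in batches, halving batch_size each time a batch runs out of memory."""
    results: List[R] = []
    start = 0
    while start < len(items):
        batch = items[start : start + batch_size]
        try:
            results.extend(run_batch(batch))
        except oom_errors:
            if batch_size == 1:
                raise  # even a single item does not fit: give up
            batch_size = max(1, batch_size // 2)
            continue  # retry the same items with a smaller batch
        start += len(batch)
    return results


def fake_run_batch(batch):
    # Hypothetical model call that "runs out of memory" above 4 items
    if len(batch) > 4:
        raise MemoryError("simulated OOM")
    return [x * 2 for x in batch]


print(run_with_backoff(list(range(10)), fake_run_batch, batch_size=16))
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

The batch size only ever shrinks here; growing it back after a run of successes is a possible refinement, but it makes the failure behavior harder to reason about.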

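Pipeline chunk batching

zero-shot-classification and question-answering are slightly special in the sense that a single input might yield multiple forward passes of a model. Under normal circumstances, this would cause issues with the batch_size argument.

To circumvent this issue, both of these pipelines are a bit special: they are a ChunkPipeline instead of a regular Pipeline. In short: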

preprocessed = pipe.preprocess(inputs)
model_outputs = pipe.forward(preprocessed)
outputs = pipe.postprocess(model_outputs)

Now becomes:

all_model_outputs = []
for preprocessed in pipe.preprocess(inputs):
    model_outputs = pipe.forward(preprocessed)
    all_model_outputs.append(model_outputs)
outputs = pipe.postprocess(all_model_outputs)

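This should be very transparent to your code, because the pipelines are used in the same way.

This is a simplified view, since the pipeline can handle the batching automatically. This means you don't have to care about how many forward passes your inputs are actually going to trigger; you can optimize the batch_size independently of the inputs. The caveats from the previous section still apply.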
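A toy, framework-free sketch of the chunk idea, modeled loosely on zero-shot classification: each input is expanded into one chunk per candidate label, chunks from all inputs are run in fixed-size batches, and an `is_last` flag on each chunk tells the regrouping step where an input ends. The scoring "model" and all function names are hypothetical; the real ChunkPipeline works on model tensors and differs in its details.

```python
from typing import Dict, Iterator, List


def preprocess(text: str, labels: List[str]) -> Iterator[Dict]:
    # One input yields several chunks: one (text, candidate label) pair per label
    for i, label in enumerate(labels):
        yield {"text": text, "label": label, "is_last": i == len(labels) - 1}


def forward_batch(chunks: List[Dict]) -> List[Dict]:
    # Hypothetical "model": score 1.0 if the label appears as a word in the text
    return [{**c, "score": float(c["label"] in c["text"].split())} for c in chunks]


def postprocess(outputs: List[Dict]) -> Dict:
    # Regroup all chunk outputs of one input into a single answer
    best = max(outputs, key=lambda o: o["score"])
    return {"text": outputs[0]["text"], "label": best["label"]}


def run(inputs: List[str], labels: List[str], batch_size: int) -> List[Dict]:
    # Batches are built from chunks, independently of how many chunks each input produced
    chunks = [c for text in inputs for c in preprocess(text, labels)]
    outputs: List[Dict] = []
    for start in range(0, len(chunks), batch_size):
        outputs.extend(forward_batch(chunks[start : start + batch_size]))
    # Use the is_last flag to find where each input's chunks end
    results: List[Dict] = []
    current: List[Dict] = []
    for out in outputs:
        current.append(out)
        if out["is_last"]:
            results.append(postprocess(current))
            current = []
    return results


texts = ["i love sports", "the weather is nice"]
candidate_labels = ["sports", "weather", "politics"]
print(run(texts, candidate_labels, batch_size=4))
```

Here 2 inputs produce 6 chunks, which run as batches of 4 and 2; changing `batch_size` changes how the forward passes are grouped but not the results.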

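Pipeline custom code

If you want to override a specific pipeline:

Don't hesitate to create an issue for your task at hand. The goal of the pipeline is to be easy to use and support most cases, so transformers might already support your use case.

If you simply want to try, you can:

Subclass your pipeline of choice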

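from transformers import TextClassificationPipeline, pipeline

class MyPipeline(TextClassificationPipeline):
    def postprocess(self, model_outputs, **kwargs):
        # Start from the parent's post-processing (a dict, or a list of dicts when top_k is set)
        outputs = super().postprocess(model_outputs, **kwargs)
        # Your code goes here, e.g. report scores as percentages
        for item in outputs if isinstance(outputs, list) else [outputs]:
            item["score"] = item["score"] * 100
        return outputs

my_pipeline = MyPipeline(model=model, tokenizer=tokenizer, ...)
# or if you use *pipeline* function, then:
my_pipeline = pipeline(model="xxxx", pipeline_class=MyPipeline)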

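This should enable you to run all the custom code you want.

Implementing a pipeline

Implementing a new pipeline

Audio

Pipelines available for audio tasks include the following.

AudioClassificationPipeline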

`class transformers.AudioClassificationPipeline`


( *args **kwargs )

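Parameters

model (PreTrainedModel or TFPreTrainedModel) — The model that will be used by the pipeline to make predictions. This needs to be a model inheriting from PreTrainedModel for PyTorch and TFPreTrainedModel for TensorFlow.
tokenizer (PreTrainedTokenizer) — The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from PreTrainedTokenizer.
modelcard (str or ModelCard, optional) — Model card attributed to the model for this pipeline.
framework (str, optional) — The framework to use, either "pt" for PyTorch or "tf" for TensorFlow. The specified framework must be installed.
If no framework is specified, it defaults to the one currently installed. If no framework is specified and both frameworks are installed, it defaults to the framework of the model, or to PyTorch if no model is provided.
task (str, defaults to "") — A task identifier for the pipeline.
num_workers (int, optional, defaults to 8) — When the pipeline uses a DataLoader (when passing a dataset, on GPU for a PyTorch model), the number of workers to use.
batch_size (int, optional, defaults to 1) — When the pipeline uses a DataLoader (when passing a dataset, on GPU for a PyTorch model), the size of the batch to use. For inference this is not always beneficial; please read Pipeline batching.
args_parser (ArgumentHandler, optional) — Reference to the object in charge of parsing the supplied pipeline parameters.
device (int, optional, defaults to -1) — Device ordinal for CPU/GPU support. Setting this to -1 will use the CPU; a positive value will run the model on the associated CUDA device id. You can also pass a native torch.device or a str.
binary_output (bool, optional, defaults to False) — Flag indicating whether the output of the pipeline should be in a binary format (i.e., pickle) or as raw text.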

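Audio classification pipeline using any AutoModelForAudioClassification. This pipeline predicts the class of a raw waveform or an audio file. In the case of an audio file, ffmpeg should be installed to support multiple audio formats.

Example: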

>>> from transformers import pipeline
>>> classifier = pipeline(model="superb/wav2vec2-base-superb-ks")
>>> classifier("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac")
[{'score': 0.997, 'label': '_unknown_'}, {'score': 0.002, 'label': 'left'}, {'score': 0.0, 'label': 'yes'}, {'score': 0.0, 'label': 'down'}, {'score': 0.0, 'label': 'stop'}]

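Learn more about the basics of using a pipeline in the pipeline tutorial.

This pipeline can currently be loaded from pipeline() using the following task identifier: "audio-classification".

See the list of available models on huggingface.co/models.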

__call__


( inputs: Union **kwargs ) → A list of dict with the following keys

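Parameters

inputs (np.ndarray or bytes or str or dict) — The inputs can be either:

str: the filename of the audio file. The file will be read at the correct sampling rate to get the waveform using ffmpeg. This requires ffmpeg to be installed on the system.
bytes: the content of an audio file, interpreted by ffmpeg in the same way.
(np.ndarray of shape (n, ) of type np.float32 or np.float64): raw audio at the correct sampling rate (no further check will be done).
dict: can be used to pass raw audio sampled at an arbitrary sampling_rate and let this pipeline do the resampling. The dict must be in one of the formats {"sampling_rate": int, "raw": np.array} or {"sampling_rate": int, "array": np.array}, where the key "raw" or "array" holds the raw audio waveform.

top_k (int, optional, defaults to None) — The number of top labels that will be returned by the pipeline. If the provided number is None or higher than the number of labels available in the model configuration, it defaults to the number of labels.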
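The two accepted dict layouts can be handled with a small normalization step. This is a hypothetical helper, not the pipeline's own code. In particular, the linear interpolation used here is a crude stand-in for proper resampling (it applies no anti-aliasing filter), chosen only to keep the sketch dependent on NumPy alone; the 16 kHz target rate in the usage line is also an assumption, since the real target comes from the model's feature extractor.

```python
import numpy as np


def normalize_audio_input(inputs: dict, target_sr: int) -> np.ndarray:
    """Accept {"sampling_rate", "raw"} or {"sampling_rate", "array"} and
    return a mono float32 waveform at target_sr."""
    if "raw" in inputs:
        audio = inputs["raw"]
    elif "array" in inputs:
        audio = inputs["array"]
    else:
        raise ValueError('Expected a "raw" or "array" key holding the waveform')
    if "sampling_rate" not in inputs:
        raise ValueError('Expected a "sampling_rate" key')
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim != 1:
        raise ValueError("Expected a mono waveform of shape (n,)")
    sr = inputs["sampling_rate"]
    if sr == target_sr:
        return audio
    # Crude linear-interpolation resampling (no anti-aliasing filter)
    n_out = int(round(len(audio) / sr * target_sr))
    t_in = np.arange(len(audio)) / sr
    t_out = np.arange(n_out) / target_sr
    return np.interp(t_out, t_in, audio).astype(np.float32)


one_second_8k = {"sampling_rate": 8000, "array": np.zeros(8000)}
print(normalize_audio_input(one_second_8k, 16000).shape)  # (16000,)
```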

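A list of dict with the following keys:

label (str) — The predicted label.
score (float) — The corresponding probability.

Classify the sequence(s) given as inputs. See the AutomaticSpeechRecognitionPipeline documentation for more information.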
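The top_k rule and the label/score output can be sketched with NumPy. This sketch assumes the scores are a softmax over the model's logits; the function name, the toy logits and the `id2label` mapping are illustrative and not taken from a real checkpoint.

```python
from typing import Dict, List, Optional

import numpy as np


def top_k_labels(
    logits: np.ndarray, id2label: Dict[int, str], top_k: Optional[int] = None
) -> List[Dict[str, object]]:
    """Softmax over the logits, then keep the top_k labels by probability."""
    num_labels = len(id2label)
    if top_k is None or top_k > num_labels:
        top_k = num_labels  # fall back to all labels, as described above
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    probs = exp / exp.sum()
    order = np.argsort(probs)[::-1][:top_k]  # indices by descending probability
    return [{"score": float(probs[i]), "label": id2label[int(i)]} for i in order]


id2label = {0: "yes", 1: "no", 2: "left"}
print(top_k_labels(np.array([2.0, 1.0, 0.1]), id2label, top_k=2))
```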

AutomaticSpeechRecognitionPipeline

`class transformers.AutomaticSpeechRecognitionPipeline`


( model: PreTrainedModel feature_extractor: Union = None tokenizer: Optional = None decoder: Union = None device: Union = None torch_dtype: Union = None **kwargs )

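Parameters

model (PreTrainedModel or TFPreTrainedModel) — The model that will be used by the pipeline to make predictions. This needs to be a model inheriting from PreTrainedModel for PyTorch or TFPreTrainedModel for TensorFlow.
feature_extractor (SequenceFeatureExtractor) — The feature extractor that will be used by the pipeline to encode the waveform for the model.
tokenizer (PreTrainedTokenizer) — The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from PreTrainedTokenizer.
decoder (pyctcdecode.BeamSearchDecoderCTC, optional) — PyCTCDecode's BeamSearchDecoderCTC can be passed for language-model-boosted decoding. See Wav2Vec2ProcessorWithLM for more information.
chunk_length_s (float, optional, defaults to 0) — The input length of each chunk. If chunk_length_s = 0, chunking is disabled (default).
For more information on how to use chunk_length_s effectively, please have a look at the ASR chunking blog post.
stride_length_s (float, optional, defaults to chunk_length_s / 6) — The length of the stride on the left and right of each chunk. Used only with chunk_length_s > 0. This lets the model see more context and infer letters better than without this context, while the pipeline discards the strided parts at the end to make the final reconstruction as accurate as possible.
For more information on how to use stride_length_s effectively, please have a look at the ASR chunking blog post.
framework (str, optional) — The framework to use, either "pt" for PyTorch or "tf" for TensorFlow. The specified framework must be installed. If no framework is specified, it defaults to the one currently installed. If no framework is specified and both frameworks are installed, it defaults to the framework of the model, or to PyTorch if no model is provided.
device (Union[int, torch.device], optional) — Device ordinal for CPU/GPU support. Setting this to None will use the CPU; setting it to a positive value will run the model on the associated CUDA device id.
torch_dtype (Union[int, torch.dtype], optional) — The data type (dtype) of the computation. Setting this to None will use float32 precision. Set it to torch.float16 or torch.bfloat16 to use half precision in the respective dtype.

Pipeline that aims at extracting the spoken text contained within some audio.

The input can be either a raw waveform or an audio file. In the case of an audio file, ffmpeg should be installed to support multiple audio formats.

Example: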

>>> from transformers import pipeline
>>> transcriber = pipeline(model="openai/whisper-base")
>>> transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac")
{'text': ' He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick, peppered flour-fatten sauce.'}

Learn more about the basics of using a pipeline in the pipeline tutorial.
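The interplay between chunk_length_s and stride_length_s can be made concrete by computing chunk boundaries in samples. This is a simplified sketch under our own assumptions (a symmetric stride of round(stride_length_s * sampling_rate) samples on each inner side, 16 kHz audio in the usage line); it is not the pipeline's internal chunking code, which additionally has to map strides onto the model's output frames.

```python
from typing import Iterator, Optional, Tuple


def chunk_spans(
    n_samples: int,
    sampling_rate: int,
    chunk_length_s: float,
    stride_length_s: Optional[float] = None,
) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (start, end, stride_left, stride_right) in samples for each chunk."""
    if stride_length_s is None:
        stride_length_s = chunk_length_s / 6  # documented default
    chunk_len = int(round(chunk_length_s * sampling_rate))
    stride = int(round(stride_length_s * sampling_rate))
    # Consecutive chunks overlap by 2 * stride samples
    step = chunk_len - 2 * stride
    if step <= 0:
        raise ValueError("chunk_length_s must be larger than 2 * stride_length_s")
    for start in range(0, n_samples, step):
        end = min(start + chunk_len, n_samples)
        is_last = end == n_samples
        # There is no extra context to discard before the first chunk or after the last
        left = 0 if start == 0 else stride
        right = 0 if is_last else stride
        yield start, end, left, right
        if is_last:
            break


# 30 s of 16 kHz audio split into 10 s chunks
for start, end, left, right in chunk_spans(30 * 16000, 16000, chunk_length_s=10.0):
    # Only [start + left, end - right) would be kept after decoding this chunk
    print(start, end, left, right)
```

Dropping the strided edges makes the kept regions tile the input exactly, with each boundary falling inside the context of both neighboring chunks.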

`__call__`


( inputs: Union **kwargs ) → Dict

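Parameters

inputs (np.ndarray or bytes or str or dict) — The inputs can be either:

str: either the filename of a local audio file, or a public URL address to download the audio file. The file will be read at the correct sampling rate to get the waveform using ffmpeg. This requires ffmpeg to be installed on the system.
bytes: the content of an audio file, interpreted by ffmpeg in the same way.
(np.ndarray of shape (n, ) of type np.float32 or np.float64): raw audio at the correct sampling rate (no further check will be done).
dict: can be used to pass raw audio sampled at an arbitrary sampling_rate and let this pipeline do the resampling. The dict must be in the format {"sampling_rate": int, "raw": np.array}, optionally with a "stride": (left: int, right: int) entry that asks the pipeline to ignore the first left samples and the last right samples during decoding (while still using them at inference to give the model more context). Only use stride with CTC models.

return_timestamps (optional, str or bool) — Only available for pure CTC models (Wav2Vec2, HuBERT, etc.) and the Whisper model. Not available for other sequence-to-sequence models. For CTC models, timestamps can take one of two formats:

"char": the pipeline will return timestamps along the text for every character in the text. For instance, if you get [{"text": "h", "timestamp": (0.5, 0.6)}, {"text": "i", "timestamp": (0.7, 0.9)}], it means the model predicts that the letter "h" was spoken after 0.5 and before 0.6 seconds.
"word": the pipeline will return timestamps along the text for every word in the text. For instance, if you get [{"text": "hi ", "timestamp": (0.5, 0.9)}, {"text": "there", "timestamp": (1.0, 1.5)}], it means the model predicts that the word "hi" was spoken after 0.5 and before 0.9 seconds.

For the Whisper model, timestamps can take one of two formats:

"word": same as above for word-level CTC timestamps. Word-level timestamps are predicted through the dynamic time warping (DTW) algorithm, which approximates word-level timestamps by inspecting the cross-attention weights.
True: the pipeline will return timestamps along the text for segments of words in the text. For instance, if you get [{"text": " Hi there!", "timestamp": (0.5, 1.5)}], it means the model predicts that the segment "Hi there!" was spoken after 0.5 and before 1.5 seconds. Note that a text segment refers to a sequence of one or more words, rather than the individual words of word-level timestamps.

generate_kwargs (dict, optional) — Dictionary of ad hoc parametrization of generate_config to be used for the generation call. For a comprehensive overview of generate, check the text generation guide in the Transformers documentation.
max_new_tokens (int, optional) — The maximum number of tokens to generate, ignoring the number of tokens in the prompt.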
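The char-level format above can be turned into word-level chunks after the fact. This is a hypothetical post-processing helper, assuming spaces come through as their own character chunks; for CTC models the pipeline already produces word timestamps itself when you pass return_timestamps="word", so this only illustrates how the two formats relate.

```python
from typing import Dict, List, Optional

Chunk = Dict[str, object]


def chars_to_words(char_chunks: List[Chunk]) -> List[Chunk]:
    """Merge character-level chunks into word-level chunks; a space closes a word."""
    words: List[Chunk] = []
    text = ""
    start: Optional[float] = None
    end: Optional[float] = None
    for chunk in char_chunks:
        char = chunk["text"]
        c_start, c_end = chunk["timestamp"]
        if char == " ":
            if text:
                words.append({"text": text, "timestamp": (start, end)})
            text, start, end = "", None, None
            continue
        if not text:
            start = c_start  # the first character opens the word
        text += char
        end = c_end  # the latest character closes it
    if text:
        words.append({"text": text, "timestamp": (start, end)})
    return words


chars = [
    {"text": "h", "timestamp": (0.5, 0.6)},
    {"text": "i", "timestamp": (0.7, 0.9)},
    {"text": " ", "timestamp": (0.9, 1.0)},
    {"text": "t", "timestamp": (1.0, 1.1)},
    {"text": "h", "timestamp": (1.1, 1.2)},
    {"text": "e", "timestamp": (1.2, 1.3)},
    {"text": "r", "timestamp": (1.3, 1.4)},
    {"text": "e", "timestamp": (1.4, 1.5)},
]
print(chars_to_words(chars))
```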

Dict

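A dictionary with the following keys:

text (str): The recognized text.
chunks (optional, List[Dict]): When using return_timestamps, chunks becomes a list containing all the various text chunks identified by the model, e.g. [{"text": "hi ", "timestamp": (0.5, 0.9)}, {"text": "there", "timestamp": (1.0, 1.5)}]. The original full text can roughly be recovered with "".join(chunk["text"] for chunk in output["chunks"]).

Transcribe the audio sequence(s) given as inputs to text. See the AutomaticSpeechRecognitionPipeline documentation for more information.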
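A short sketch of consuming this output. The `output` dict is a hand-written example in the documented shape, not a real transcription, and `spoken_seconds` is a hypothetical helper; skipping chunks whose end time is None is a defensive assumption, not documented behavior.

```python
from typing import Dict, List

output = {
    "text": " hi there",
    "chunks": [
        {"text": "hi ", "timestamp": (0.5, 0.9)},
        {"text": "there", "timestamp": (1.0, 1.5)},
    ],
}

# Rough reconstruction of the full text, as described above
full_text = "".join(chunk["text"] for chunk in output["chunks"])
print(full_text)  # hi there


def spoken_seconds(chunks: List[Dict]) -> float:
    """Total duration covered by the chunks; chunks without an end time are skipped."""
    total = 0.0
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if end is not None:
            total += end - start
    return total


print(round(spoken_seconds(output["chunks"]), 2))  # 0.9
```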

Continued in Transformers 4.37 Chinese Documentation (Part 17) (2): https://developer.aliyun.com/article/1564938
