Transformers 4.37 Chinese Documentation (86) (5) - Alibaba Cloud Developer Community

Transformers 4.37 Chinese Documentation (86) (4): https://developer.aliyun.com/article/1563239

GitProcessor

`class transformers.GitProcessor`

( image_processor, tokenizer )

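Parameters

image_processor (AutoImageProcessor) — The image processor is a required input.
tokenizer (AutoTokenizer) — The tokenizer is a required input.

Constructs a GIT processor which wraps a CLIP image processor and a BERT tokenizer into a single processor.

GitProcessor offers all the functionalities of CLIPImageProcessor and BertTokenizerFast. See __call__() and decode() for more information.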

`__call__`

( text = None, images = None, return_tensors = None, **kwargs ) → BatchEncoding

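Parameters

text (str, List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as lists of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
images (PIL.Image.Image, np.ndarray, torch.Tensor, List[PIL.Image.Image], List[np.ndarray], List[torch.Tensor]) — The image or batch of images to be prepared. Each image can be a PIL image, a NumPy array or a PyTorch tensor. In the case of a NumPy array or PyTorch tensor, each image should have shape (C, H, W), where C is the number of channels and H and W are the image height and width.
return_tensors (str or TensorType, optional) — If set, returns tensors of a particular framework. Acceptable values are:

'tf': Return TensorFlow tf.constant objects.
'pt': Return PyTorch torch.Tensor objects.
'np': Return NumPy np.ndarray objects.
'jax': Return JAX jnp.ndarray objects.

BatchEncoding

A BatchEncoding with the following fields:

input_ids — List of token ids to be fed to the model. Returned when text is not None.
attention_mask — List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if "attention_mask" is in self.model_input_names, and if text is not None).
pixel_values — Pixel values to be fed to the model. Returned when images is not None.

Main method to prepare one or several sequences and images for the model. If text is not None, this method forwards the text and kwargs arguments to BertTokenizerFast's __call__() to encode the text. To prepare the images, if images is not None, this method forwards the images and kwargs arguments to CLIPImageProcessor's __call__(). Please refer to the documentation of those two methods for more information.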
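The dispatch described above can be sketched with a toy class. This is a minimal illustration, not the library's implementation: `SketchProcessor`, `fake_tokenizer` and `fake_image_processor` are hypothetical stand-ins that only mimic the shape of the returned fields, and raising an error when both inputs are missing is an assumption of this sketch.

```python
class SketchProcessor:
    """Toy stand-in for a processor wrapping a tokenizer and an image processor."""

    def __init__(self, image_processor, tokenizer):
        self.image_processor = image_processor
        self.tokenizer = tokenizer

    def __call__(self, text=None, images=None, **kwargs):
        # Assumption of this sketch: at least one input must be given.
        if text is None and images is None:
            raise ValueError("Provide at least one of `text` or `images`.")
        encoding = {}
        if text is not None:
            # Text and kwargs go to the tokenizer.
            encoding.update(self.tokenizer(text, **kwargs))
        if images is not None:
            # Images and kwargs go to the image processor.
            encoding.update(self.image_processor(images, **kwargs))
        return encoding


def fake_tokenizer(text, **kwargs):
    # Hypothetical tokenizer: one fake id per whitespace-separated word.
    ids = [len(word) for word in text.split()]
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


def fake_image_processor(images, **kwargs):
    # Hypothetical image processor: passes the "pixels" through unchanged.
    return {"pixel_values": images}


processor = SketchProcessor(fake_image_processor, fake_tokenizer)
text_only = processor(text="two cats")
both = processor(text="two cats", images=[[0.0, 1.0]])
print(sorted(text_only))  # ['attention_mask', 'input_ids']
print(sorted(both))       # ['attention_mask', 'input_ids', 'pixel_values']
```

The point of the sketch is only that the set of returned fields depends on which of text and images was passed.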

GitModel

`class transformers.GitModel`


( config )

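Parameters

config (GitConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare GIT Model transformer consisting of a CLIP image encoder and a text decoder, outputting raw hidden states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.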

`forward`

( input_ids: Optional = None, attention_mask: Optional = None, position_ids: Optional = None, pixel_values: Optional = None, head_mask: Optional = None, inputs_embeds: Optional = None, past_key_values: Optional = None, use_cache: Optional = None, output_attentions: Optional = None, output_hidden_states: Optional = None, return_dict: Optional = None ) → transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)

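Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
What are input IDs?
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]:

1 for tokens that are not masked,
0 for tokens that are masked.

What are attention masks?
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of the position of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
What are position IDs?
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See CLIPImageProcessor.__call__() for details.
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:

1 indicates the head is not masked,
0 indicates the head is masked.

inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix provides.
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
past_key_values (tuple(tuple(torch.FloatTensor)) of length config.n_layers, with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)) — Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that do not have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).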
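To make the attention_mask convention above concrete (1 for real tokens, 0 for padding), here is a small sketch in plain Python. The helper `pad_and_mask`, the right-padding strategy, the pad id of 0 and the token ids are assumptions chosen for illustration; in practice the tokenizer produces both lists.

```python
def pad_and_mask(sequences, pad_id=0):
    """Right-pad token id lists to a common length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position to be ignored.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


ids, mask = pad_and_mask([[101, 2023, 102], [101, 102]])
print(ids)   # [[101, 2023, 102], [101, 102, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```

Both lists have shape (batch_size, sequence_length), which matches the shapes documented for input_ids and attention_mask.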

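transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)

A transformers.modeling_outputs.BaseModelOutputWithPooling or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (GitConfig) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden states at the output of the last layer of the model.
pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last-layer hidden state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. For example, for BERT-family models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The GitModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.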
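The difference between calling the instance and calling forward directly can be illustrated with a toy class. This is a plain-Python sketch, not torch.nn.Module: `ToyModule` and its `pre_hooks` / `post_hooks` lists are hypothetical, and real PyTorch hooks have different signatures.

```python
class ToyModule:
    """Toy stand-in for a module whose __call__ wraps forward with extra steps."""

    def __init__(self):
        self.pre_hooks = []   # run on the input before forward
        self.post_hooks = []  # run on the output after forward

    def forward(self, x):
        return x * 2

    def __call__(self, x):
        for hook in self.pre_hooks:
            x = hook(x)
        out = self.forward(x)
        for hook in self.post_hooks:
            out = hook(out)
        return out


module = ToyModule()
module.pre_hooks.append(lambda x: x + 1)
module.post_hooks.append(lambda out: out - 3)

via_call = module(5)             # hooks run: ((5 + 1) * 2) - 3 = 9
via_forward = module.forward(5)  # hooks skipped: 5 * 2 = 10
print(via_call, via_forward)
```

The two results differ because only the instance call runs the registered pre- and post-processing steps.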

Example:

>>> from transformers import AutoProcessor, AutoModel
>>> import requests
>>> from PIL import Image
>>> processor = AutoProcessor.from_pretrained("microsoft/git-base")
>>> model = AutoModel.from_pretrained("microsoft/git-base")
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> text = "this is an image of two cats"
>>> inputs = processor(text, images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state

GitForCausalLM

`class transformers.GitForCausalLM`


( config )

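Parameters

config (GitConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

GIT Model with a language modeling head on top for autoregressive language modeling.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.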

`forward`

( input_ids: Optional = None, attention_mask: Optional = None, position_ids: Optional = None, pixel_values: Optional = None, head_mask: Optional = None, inputs_embeds: Optional = None, labels: Optional = None, past_key_values: Optional = None, use_cache: Optional = None, output_attentions: Optional = None, output_hidden_states: Optional = None, return_dict: Optional = None ) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)

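Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
What are input IDs?
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]:

1 for tokens that are not masked,
0 for tokens that are masked.

What are attention masks?
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of the position of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
What are position IDs?
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See CLIPImageProcessor.__call__() for details.
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:

1 indicates the head is not masked,
0 indicates the head is masked.

inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix provides.
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in [-100, 0, ..., config.vocab_size] (see the input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
past_key_values (tuple(tuple(torch.FloatTensor)) of length config.n_layers, with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)) — Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that do not have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).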
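The role of the -100 label value described above can be checked with a small, dependency-free sketch of a next-token loss. It assumes the usual causal convention that position t predicts the token at position t + 1, handles a single sequence, and leaves image tokens out entirely; `causal_lm_loss` is a hypothetical helper, not the model's own loss code.

```python
import math

IGNORE_INDEX = -100


def causal_lm_loss(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean next-token cross-entropy for one sequence.

    logits: list of per-position score lists (one score per vocabulary entry).
    labels: list of target ids, with ignore_index marking positions to skip.
    """
    total, count = 0.0, 0
    # Shift by one: the scores at position t are compared with labels[t + 1].
    for scores, target in zip(logits[:-1], labels[1:]):
        if target == ignore_index:
            continue  # ignored positions contribute nothing to the loss
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[target]
        count += 1
    return total / count if count else float("nan")


# Uniform scores over a vocabulary of size 2, so every counted position costs log(2).
logits = [[0.0, 0.0]] * 4
labels = [-100, 1, 0, -100]
loss = causal_lm_loss(logits, labels)
print(round(loss, 4))  # 0.6931
```

Only two of the three shifted positions are counted here, because the final label is -100.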

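transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)

A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (GitConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head).
Contains precomputed hidden states (keys and values in the self-attention blocks) that can be used to speed up sequential decoding (see the past_key_values input).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The GitForCausalLM forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.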

Examples:

Image captioning example:

>>> from transformers import AutoProcessor, AutoModelForCausalLM
>>> import requests
>>> from PIL import Image
>>> processor = AutoProcessor.from_pretrained("microsoft/git-base-coco")
>>> model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-coco")
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> pixel_values = processor(images=image, return_tensors="pt").pixel_values
>>> generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
>>> generated_caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
>>> print(generated_caption)
two cats sleeping on a pink blanket next to remotes.

Visual question answering (VQA) example:

>>> import torch
>>> from transformers import AutoProcessor, AutoModelForCausalLM
>>> from huggingface_hub import hf_hub_download
>>> from PIL import Image
>>> processor = AutoProcessor.from_pretrained("microsoft/git-base-textvqa")
>>> model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-textvqa")
>>> file_path = hf_hub_download(repo_id="nielsr/textvqa-sample", filename="bus.png", repo_type="dataset")
>>> image = Image.open(file_path).convert("RGB")
>>> pixel_values = processor(images=image, return_tensors="pt").pixel_values
>>> question = "what does the front of the bus say at the top?"
>>> input_ids = processor(text=question, add_special_tokens=False).input_ids
>>> input_ids = [processor.tokenizer.cls_token_id] + input_ids
>>> input_ids = torch.tensor(input_ids).unsqueeze(0)
>>> generated_ids = model.generate(pixel_values=pixel_values, input_ids=input_ids, max_length=50)
>>> print(processor.batch_decode(generated_ids, skip_special_tokens=True))
['what does the front of the bus say at the top? special']
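In the VQA output above, the decoded string contains the question followed by the generated answer. One way to keep only the answer is to strip the question prefix, as in this sketch. `strip_question` is a hypothetical helper, and the fallback branch exists because the decoded text is not guaranteed to reproduce the question character for character.

```python
question = "what does the front of the bus say at the top?"
# Decoded output copied from the example above.
decoded = ["what does the front of the bus say at the top? special"]


def strip_question(generated_text, question):
    """Return the part of the decoded text that follows the question."""
    if generated_text.startswith(question):
        return generated_text[len(question):].strip()
    # Fallback: the prefix did not match exactly, so return the text unchanged.
    return generated_text.strip()


answers = [strip_question(text, question) for text in decoded]
print(answers)  # ['special']
```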

Video captioning example:

>>> import av
>>> import numpy as np
>>> from PIL import Image
>>> from huggingface_hub import hf_hub_download
>>> from transformers import AutoProcessor, AutoModelForCausalLM
>>> processor = AutoProcessor.from_pretrained("microsoft/git-base-vatex")
>>> model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-vatex")
>>> # set seed for reproducibility
>>> np.random.seed(45)
>>> def read_video_pyav(container, indices):
...     '''
...     Decode the video with PyAV decoder.
...     Args:
...         container (`av.container.input.InputContainer`): PyAV container.
...         indices (`List[int]`): List of frame indices to decode.
...     Returns:
...         result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
...     '''
...     frames = []
...     container.seek(0)
...     start_index = indices[0]
...     end_index = indices[-1]
...     for i, frame in enumerate(container.decode(video=0)):
...         if i > end_index:
...             break
...         if i >= start_index and i in indices:
...             frames.append(frame)
...     return np.stack([x.to_ndarray(format="rgb24") for x in frames])
>>> def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
...     '''
...     Sample a given number of frame indices from the video.
...     Args:
...         clip_len (`int`): Total number of frames to sample.
...         frame_sample_rate (`int`): Sample every n-th frame.
...         seg_len (`int`): Maximum allowed index of sample's last frame.
...     Returns:
...         indices (`List[int]`): List of sampled frame indices
...     '''
...     converted_len = int(clip_len * frame_sample_rate)
...     end_idx = np.random.randint(converted_len, seg_len)
...     start_idx = end_idx - converted_len
...     indices = np.linspace(start_idx, end_idx, num=clip_len)
...     indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
...     return indices
>>> # load video
>>> file_path = hf_hub_download(
...     repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
... )
>>> container = av.open(file_path)
>>> # sample frames
>>> num_frames = model.config.num_image_with_embedding
>>> indices = sample_frame_indices(
...     clip_len=num_frames, frame_sample_rate=4, seg_len=container.streams.video[0].frames
... )
>>> frames = read_video_pyav(container, indices)
>>> pixel_values = processor(images=list(frames), return_tensors="pt").pixel_values
>>> generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
>>> print("Generated caption:", processor.batch_decode(generated_ids, skip_special_tokens=True))
Generated caption: ['a woman is sitting at a table and she is talking about the food she is holding.']
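The arithmetic inside sample_frame_indices can be followed by hand on small numbers. The sketch below reimplements the linspace, clip and integer-cast steps in plain Python; the helper name `frame_indices`, the clip length of 6 and the fixed end index of 100 are assumptions chosen for illustration (the original draws the end index at random).

```python
def frame_indices(start_idx, end_idx, clip_len):
    """Evenly spaced indices from start_idx to end_idx, clipped to end_idx - 1."""
    if clip_len == 1:
        return [start_idx]
    step = (end_idx - start_idx) / (clip_len - 1)
    raw = [start_idx + i * step for i in range(clip_len)]
    # Clip into [start_idx, end_idx - 1], then truncate to integers.
    return [int(min(max(value, start_idx), end_idx - 1)) for value in raw]


clip_len, frame_sample_rate = 6, 4            # assumed values
converted_len = clip_len * frame_sample_rate  # 24 frames covered by the clip
end_idx = 100                                 # assumed, instead of a random draw
start_idx = end_idx - converted_len           # 76

indices = frame_indices(start_idx, end_idx, clip_len)
print(indices)  # [76, 80, 85, 90, 95, 99]
```

Because both endpoints are included, the spacing is converted_len / (clip_len - 1) = 4.8 frames rather than exactly frame_sample_rate, and the last index is pulled back to end_idx - 1 by the clip.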
