Transformers 4.37 中文文档（七十四）（2）-阿里云开发者社区

Transformers 4.37 中文文档（七十四）（1）https://developer.aliyun.com/article/1564202

VivitModel

`class transformers.VivitModel`

( config add_pooling_layer = True )

参数

config (VivitConfig) — 具有模型所有参数的模型配置类。使用配置文件初始化不会加载与模型关联的权重，只加载配置。查看 from_pretrained() 方法以加载模型权重。

ViViT Transformer 模型的基本输出，输出原始隐藏状态，没有特定的头部。此模型是 PyTorch torch.nn.Module 的子类。将其用作常规的 PyTorch 模块，并参考 PyTorch 文档以获取有关一般用法和行为的所有相关信息。

`forward`

< source >

( pixel_values: Optional = None head_mask: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → export const metadata = 'undefined';transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)

参数

pixel_values (torch.FloatTensor，形状为(batch_size, num_frames, num_channels, height, width)) — 像素值。可以使用 VivitImageProcessor 获取像素值。详情请参阅 VivitImageProcessor.preprocess()。
head_mask (torch.FloatTensor，形状为(num_heads,)或(num_layers, num_heads)，可选) — 用于使自注意力模块的选定头部失效的掩码。掩码值选定在[0, 1]之间。

1 表示头部为未屏蔽,
0 表示头部为屏蔽。

output_attentions (bool，可选) — 是否返回所有注意力层的注意力张量。有关更多详细信息，请参阅返回张量下的attentions。
output_hidden_states (bool，可选) — 是否返回所有层的隐藏状态。有关更多详细信息，请参阅返回张量下的hidden_states。
return_dict (bool，可选) — 是否返回一个 ModelOutput 而不是一个普通元组。

返回值

transformers.modeling_outputs.BaseModelOutputWithPooling 或tuple(torch.FloatTensor)

一个 transformers.modeling_outputs.BaseModelOutputWithPooling 或一个torch.FloatTensor元组（如果传递return_dict=False或config.return_dict=False）包含各种元素，具体取决于配置（VivitConfig）和输入。

last_hidden_state (torch.FloatTensor，形状为(batch_size, sequence_length, hidden_size)) — 模型最后一层的隐藏状态序列。
pooler_output (torch.FloatTensor，形状为(batch_size, hidden_size)) — 经过用于辅助预训练任务的层进一步处理后的序列第一个标记（分类标记）的最后一层隐藏状态。例如，对于 BERT 系列模型，这将返回经过线性层和 tanh 激活函数处理后的分类标记。线性层权重是在预训练期间从下一个句子预测（分类）目标中训练的。
hidden_states (tuple(torch.FloatTensor)，可选，当传递output_hidden_states=True或config.output_hidden_states=True时返回） — 形状为(batch_size, sequence_length, hidden_size)的torch.FloatTensor元组（如果模型有嵌入层，则为嵌入层的输出+每层的输出）。
每层模型的隐藏状态加上可选的初始嵌入输出。
attentions (tuple(torch.FloatTensor)，可选，当传递output_attentions=True或config.output_attentions=True时返回） — 形状为(batch_size, num_heads, sequence_length, sequence_length)的torch.FloatTensor元组（每层一个）。
注意力 softmax 后的注意力权重，用于计算自注意力头中的加权平均值。

VivitModel 的前向方法，覆盖了__call__特殊方法。

虽然前向传递的步骤需要在此函数内定义，但应该在此之后调用Module实例，而不是在此处调用，因为前者会处理运行前后处理步骤，而后者会默默地忽略它们。

示例：

>>> import av
>>> import numpy as np
>>> from transformers import VivitImageProcessor, VivitModel
>>> from huggingface_hub import hf_hub_download
>>> np.random.seed(0)
>>> def read_video_pyav(container, indices):
...     '''
...     Decode the video with PyAV decoder.
...     Args:
...         container (`av.container.input.InputContainer`): PyAV container.
...         indices (`List[int]`): List of frame indices to decode.
...     Returns:
...         result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
...     '''
...     frames = []
...     container.seek(0)
...     start_index = indices[0]
...     end_index = indices[-1]
...     for i, frame in enumerate(container.decode(video=0)):
...         if i > end_index:
...             break
...         if i >= start_index and i in indices:
...             frames.append(frame)
...     return np.stack([x.to_ndarray(format="rgb24") for x in frames])
>>> def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
...     '''
...     Sample a given number of frame indices from the video.
...     Args:
...         clip_len (`int`): Total number of frames to sample.
...         frame_sample_rate (`int`): Sample every n-th frame.
...         seg_len (`int`): Maximum allowed index of sample's last frame.
...     Returns:
...         indices (`List[int]`): List of sampled frame indices
...     '''
...     converted_len = int(clip_len * frame_sample_rate)
...     end_idx = np.random.randint(converted_len, seg_len)
...     start_idx = end_idx - converted_len
...     indices = np.linspace(start_idx, end_idx, num=clip_len)
...     indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
...     return indices
>>> # video clip consists of 300 frames (10 seconds at 30 FPS)
>>> file_path = hf_hub_download(
...     repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
... )
>>> container = av.open(file_path)
>>> # sample 32 frames
>>> indices = sample_frame_indices(clip_len=32, frame_sample_rate=1, seg_len=container.streams.video[0].frames)
>>> video = read_video_pyav(container=container, indices=indices)
>>> image_processor = VivitImageProcessor.from_pretrained("google/vivit-b-16x2-kinetics400")
>>> model = VivitModel.from_pretrained("google/vivit-b-16x2-kinetics400")
>>> # prepare video for the model
>>> inputs = image_processor(list(video), return_tensors="pt")
>>> # forward pass
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 3137, 768]

VivitForVideoClassification

`class transformers.VivitForVideoClassification`

<来源>

( config )

参数

config (VivitConfig) — 具有模型所有参数的模型配置类。使用配置文件初始化不会加载与模型关联的权重，只加载配置。查看 from_pretrained()方法以加载模型权重。

带有视频分类头部的 ViViT Transformer 模型（在[CLS]标记的最终隐藏状态之上的线性层），例如用于 Kinetics-400。此模型是 PyTorch torch.nn.Module子类。将其用作常规 PyTorch 模块，并参考 PyTorch 文档以获取有关一般用法和行为的所有相关信息。

`forward`

< source >

( pixel_values: Optional = None head_mask: Optional = None labels: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → export const metadata = 'undefined';transformers.modeling_outputs.ImageClassifierOutput or tuple(torch.FloatTensor)

参数

pixel_values (torch.FloatTensor，形状为(batch_size, num_frames, num_channels, height, width)) — 像素值。像素值可以使用 VivitImageProcessor 获取。有关详细信息，请参阅 VivitImageProcessor.preprocess()。
head_mask (torch.FloatTensor，形状为(num_heads,)或(num_layers, num_heads)，optional) — 用于使自注意力模块的选定头部失效的掩码。掩码值选定在[0, 1]范围内：

1 表示头部未被masked，
0 表示头部被masked。

output_attentions (bool, optional) — 是否返回所有注意力层的注意力张量。有关更多详细信息，请参阅返回张量中的attentions。
output_hidden_states (bool, optional) — 是否返回所有层的隐藏状态。有关更多详细信息，请参阅返回张量中的hidden_states。
return_dict (bool，optional) — 是否返回 ModelOutput 而不是普通元组。
labels (torch.LongTensor，形状为(batch_size,)，optional) — 用于计算图像分类/回归损失的标签。索引应在[0, ..., config.num_labels - 1]范围内。如果config.num_labels==1，则计算回归损失（均方损失），如果config.num_labels>1，则计算分类损失（交叉熵）。

transformers.modeling_outputs.ImageClassifierOutput 或 tuple(torch.FloatTensor)

一个 transformers.modeling_outputs.ImageClassifierOutput 或一个torch.FloatTensor的元组（如果传递了return_dict=False或config.return_dict=False），包括根据配置（VivitConfig）和输入的不同元素。

loss (torch.FloatTensor，形状为(1,)，optional，当提供labels时返回） — 分类（如果config.num_labels==1则为回归）损失。
logits (torch.FloatTensor，形状为(batch_size, config.num_labels)) — 分类（如果config.num_labels==1则为回归）得分（SoftMax 之前）。
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — 模型在每个阶段输出的隐藏状态（也称为特征图）的元组，形状为(batch_size, sequence_length, hidden_size)。
attentions (tuple(torch.FloatTensor), 可选的，当传递output_attentions=True或当config.output_attentions=True时返回) — 形状为(batch_size, num_heads, patch_size, sequence_length)的torch.FloatTensor元组。
注意力 softmax 后的注意力权重，用于计算自注意力头中的加权平均值。

VivitForVideoClassification 的前向方法，覆盖了__call__特殊方法。

虽然前向传递的步骤需要在这个函数内定义，但应该在之后调用Module实例，而不是这个函数，因为前者会处理运行前后的处理步骤，而后者会默默地忽略它们。

示例：

>>> import av
>>> import numpy as np
>>> import torch
>>> from transformers import VivitImageProcessor, VivitForVideoClassification
>>> from huggingface_hub import hf_hub_download
>>> np.random.seed(0)
>>> def read_video_pyav(container, indices):
...     '''
...     Decode the video with PyAV decoder.
...     Args:
...         container (`av.container.input.InputContainer`): PyAV container.
...         indices (`List[int]`): List of frame indices to decode.
...     Returns:
...         result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
...     '''
...     frames = []
...     container.seek(0)
...     start_index = indices[0]
...     end_index = indices[-1]
...     for i, frame in enumerate(container.decode(video=0)):
...         if i > end_index:
...             break
...         if i >= start_index and i in indices:
...             frames.append(frame)
...     return np.stack([x.to_ndarray(format="rgb24") for x in frames])
>>> def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
...     '''
...     Sample a given number of frame indices from the video.
...     Args:
...         clip_len (`int`): Total number of frames to sample.
...         frame_sample_rate (`int`): Sample every n-th frame.
...         seg_len (`int`): Maximum allowed index of sample's last frame.
...     Returns:
...         indices (`List[int]`): List of sampled frame indices
...     '''
...     converted_len = int(clip_len * frame_sample_rate)
...     end_idx = np.random.randint(converted_len, seg_len)
...     start_idx = end_idx - converted_len
...     indices = np.linspace(start_idx, end_idx, num=clip_len)
...     indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
...     return indices
>>> # video clip consists of 300 frames (10 seconds at 30 FPS)
>>> file_path = hf_hub_download(
...     repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
... )
>>> container = av.open(file_path)
>>> # sample 32 frames
>>> indices = sample_frame_indices(clip_len=32, frame_sample_rate=4, seg_len=container.streams.video[0].frames)
>>> video = read_video_pyav(container=container, indices=indices)
>>> image_processor = VivitImageProcessor.from_pretrained("google/vivit-b-16x2-kinetics400")
>>> model = VivitForVideoClassification.from_pretrained("google/vivit-b-16x2-kinetics400")
>>> inputs = image_processor(list(video), return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)
...     logits = outputs.logits
>>> # model predicts one of the 400 Kinetics-400 classes
>>> predicted_label = logits.argmax(-1).item()
>>> print(model.config.id2label[predicted_label])
LABEL_116

YOLOS

原始文本：huggingface.co/docs/transformers/v4.37.2/en/model_doc/yolos

概述

YOLOS 模型是由 Yuxin Fang、Bencheng Liao、Xinggang Wang、Jiemin Fang、Jiyang Qi、Rui Wu、Jianwei Niu、Wenyu Liu 在You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection中提出的。YOLOS 建议仅利用普通的 Vision Transformer (ViT)进行目标检测，受到 DETR 的启发。结果表明，仅使用基本大小的仅编码器 Transformer 也可以在 COCO 上实现 42 AP，类似于 DETR 和更复杂的框架，如 Faster R-CNN。

论文摘要如下：

Transformer 能否从纯序列到序列的角度执行 2D 对象和区域级别的识别，而对 2D 空间结构的了解很少？为了回答这个问题，我们提出了 You Only Look at One Sequence (YOLOS)，这是一系列基于普通 Vision Transformer 的目标检测模型，只进行了最少的修改、区域先验以及目标任务的归纳偏差。我们发现，仅在中等规模的 ImageNet-1k 数据集上预训练的 YOLOS 模型已经可以在具有挑战性的 COCO 目标检测基准上取得相当有竞争力的性能，例如，直接采用 BERT-Base 架构的 YOLOS-Base 在 COCO val 上可以获得 42.0 的 box AP。我们还通过 YOLOS 讨论了当前预训练方案和 Transformer 视觉模型扩展策略的影响以及局限性。

YOLOS 架构。摘自原始论文。

此模型由nielsr贡献。原始代码可以在这里找到。

资源

官方 Hugging Face 和社区（由🌎表示）资源列表，可帮助您开始使用 YOLOS。

目标检测

所有演示推断+微调 YolosForObjectDetection 在自定义数据集上的示例笔记本可以在这里找到。
另请参阅：目标检测任务指南

如果您有兴趣提交资源以包含在此处，请随时提交拉取请求，我们将进行审核！资源应该展示一些新内容，而不是重复现有资源。

使用 YolosImageProcessor 来为模型准备图像（和可选目标）。与 DETR 相反，YOLOS 不需要创建pixel_mask。

YolosConfig

`class transformers.YolosConfig`

<来源>

( hidden_size = 768 num_hidden_layers = 12 num_attention_heads = 12 intermediate_size = 3072 hidden_act = 'gelu' hidden_dropout_prob = 0.0 attention_probs_dropout_prob = 0.0 initializer_range = 0.02 layer_norm_eps = 1e-12 image_size = [512, 864] patch_size = 16 num_channels = 3 qkv_bias = True num_detection_tokens = 100 use_mid_position_embeddings = True auxiliary_loss = False class_cost = 1 bbox_cost = 5 giou_cost = 2 bbox_loss_coefficient = 5 giou_loss_coefficient = 2 eos_coefficient = 0.1 **kwargs )

参数

hidden_size（int，可选，默认为 768）— 编码器层和池化层的维度。
num_hidden_layers（int，可选，默认为 12）— Transformer 编码器中的隐藏层数量。
num_attention_heads（int，可选，默认为 12）— Transformer 编码器中每个注意力层的注意力头数。
intermediate_size（int，可选，默认为 3072）— Transformer 编码器中“中间”（即前馈）层的维度。
hidden_act（str或function，可选，默认为"gelu"）— 编码器和池化器中的非线性激活函数（函数或字符串）。如果是字符串，支持"gelu"、"relu"、"selu"和"gelu_new"。
hidden_dropout_prob（float，可选，默认为 0.0）— 嵌入层、编码器和池化器中所有全连接层的 dropout 概率。
attention_probs_dropout_prob (float, optional, 默认为 0.0) — 注意力概率的丢失比率。
initializer_range (float, optional, 默认为 0.02) — 用于初始化所有权重矩阵的截断正态初始化器的标准差。
layer_norm_eps (float, optional, 默认为 1e-12) — 层归一化层使用的 epsilon。
image_size (List[int], optional, 默认为[512, 864]) — 每个图像的大小（分辨率）。
patch_size (int, optional, 默认为 16) — 每个补丁的大小（分辨率）。
num_channels (int, optional, 默认为 3) — 输入通道的数量。
qkv_bias (bool, optional, 默认为True) — 是否为查询、键和值添加偏置。
num_detection_tokens (int, optional, 默认为 100) — 检测令牌的数量。
use_mid_position_embeddings (bool, optional, 默认为True) — 是否使用中间层位置编码。
auxiliary_loss (bool, optional, 默认为False) — 是否使用辅助解码损失（每个解码器层的损失）。
class_cost (float, optional, 默认为 1) — 匈牙利匹配成本中分类错误的相对权重。
bbox_cost (float, optional, 默认为 5) — 匈牙利匹配成本中边界框坐标的 L1 误差的相对权重。
YolosImageProcessor
bbox_loss_coefficient (float, optional, 默认为 5) — 目标检测损失中 L1 边界框损失的相对权重。
giou_loss_coefficient (float, optional, 默认为 2) — 目标检测损失中广义 IoU 损失的相对权重。
eos_coefficient (float, optional, 默认为 0.1) — 目标检测损失中“无对象”类的相对分类权重。

这是一个配置类，用于存储 YolosModel 的配置。根据指定的参数实例化一个 YOLOS 模型，定义模型架构。使用默认值实例化配置将产生类似于 YOLOS hustvl/yolos-base架构的配置。

配置对象继承自 PretrainedConfig，可用于控制模型输出。阅读 PretrainedConfig 的文档以获取更多信息。

示例：

>>> from transformers import YolosConfig, YolosModel
>>> # Initializing a YOLOS hustvl/yolos-base style configuration
>>> configuration = YolosConfig()
>>> # Initializing a model (with random weights) from the hustvl/yolos-base style configuration
>>> model = YolosModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config

Transformers 4.37 中文文档（七十四）（3）https://developer.aliyun.com/article/1564204

Transformers 4.37 中文文档（七十四）（2）

VivitModel

`class transformers.VivitModel`

`forward`

VivitForVideoClassification

`class transformers.VivitForVideoClassification`

`forward`

YOLOS

概述

资源

YolosConfig

`class transformers.YolosConfig`

热门文章

最新文章

相关课程

相关电子书

热门

活动广场

任务中心

开发者评测

高校计划

乘风者计划

训练营

阿里云MVP

话题

直播

下载

镜像站

技术资料

插件

Transformers 4.37 中文文档（七十四）（2）

VivitModel

class transformers.VivitModel

forward

VivitForVideoClassification

class transformers.VivitForVideoClassification

forward

YOLOS

概述

资源

YolosConfig

class transformers.YolosConfig

热门文章

最新文章

相关课程

相关电子书

`class transformers.VivitModel`

`forward`

`class transformers.VivitForVideoClassification`

`forward`

`class transformers.YolosConfig`