Transformers 4.37 中文文档（七十二）（5）-阿里云开发者社区

Transformers 4.37 中文文档（七十二）（4）https://developer.aliyun.com/article/1564168

VideoMAEModel

`class transformers.VideoMAEModel`

( config )

参数

config（VideoMAEConfig）— 具有模型所有参数的模型配置类。使用配置文件初始化不会加载与模型相关的权重，只加载配置。查看 from_pretrained() 方法以加载模型权重。

裸的 VideoMAE 模型变压器输出原始隐藏状态，没有特定的头部。此模型是 PyTorch torch.nn.Module 子类。将其用作常规 PyTorch 模块，并参考 PyTorch 文档以获取有关一般用法和行为的所有相关信息。

`forward`

<来源>

( pixel_values: FloatTensor bool_masked_pos: Optional = None head_mask: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → export const metadata = 'undefined';transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)

参数

pixel_values（形状为 (batch_size, num_frames, num_channels, height, width) 的 torch.FloatTensor）— 像素值。可以使用 AutoImageProcessor 获取像素值。有关详细信息，请参阅 VideoMAEImageProcessor.call()。
head_mask（形状为 (num_heads,) 或 (num_layers, num_heads) 的 torch.FloatTensor，可选）— 用于使自注意力模块中的选定头部失效的掩码。掩码值选定在 [0, 1]：

1 表示头部未被掩盖，
0 表示头部被“掩盖”。

output_attentions（bool，可选）— 是否返回所有注意力层的注意力张量。有关更多详细信息，请参阅返回张量下的 attentions。
output_hidden_states（bool，可选）— 是否返回所有层的隐藏状态。有关更多详细信息，请参阅返回张量下的 hidden_states。
return_dict（bool，可选）— 是否返回 ModelOutput 而不是普通元组。
bool_masked_pos（形状为 (batch_size, sequence_length) 的 torch.BoolTensor，可选）— 布尔掩码位置。指示哪些补丁被掩盖（1）哪些不被掩盖（0）。批次中的每个视频必须具有相同数量的掩盖补丁。如果为 None，则认为所有补丁都被考虑。序列长度为 (num_frames // tubelet_size) * (image_size // patch_size) ** 2。

transformers.modeling_outputs.BaseModelOutput 或 tuple(torch.FloatTensor)

一个 transformers.modeling_outputs.BaseModelOutput 或一个torch.FloatTensor元组（如果传递return_dict=False或当config.return_dict=False时）包含各种元素，具体取决于配置（VideoMAEConfig）和输入。

last_hidden_state（形状为(batch_size, sequence_length, hidden_size)的torch.FloatTensor）- 模型最后一层的隐藏状态序列的输出。
hidden_states（tuple(torch.FloatTensor)，可选，当传递output_hidden_states=True或当config.output_hidden_states=True时返回）- 形状为(batch_size, sequence_length, hidden_size)的torch.FloatTensor元组（如果模型有嵌入层，则为嵌入的输出和每一层的输出）。
模型在每一层输出的隐藏状态以及可选的初始嵌入输出。
attentions（tuple(torch.FloatTensor)，可选，当传递output_attentions=True或当config.output_attentions=True时返回）- 形状为(batch_size, num_heads, sequence_length, sequence_length)的torch.FloatTensor元组（每层一个）。
在注意力 softmax 之后的注意力权重，用于计算自注意力头中的加权平均值。

VideoMAEModel 的前向方法，覆盖了__call__特殊方法。

虽然前向传递的步骤需要在此函数内定义，但应该在此之后调用Module实例，而不是在此处调用，因为前者负责运行预处理和后处理步骤，而后者会默默地忽略它们。

示例：

>>> import av
>>> import numpy as np
>>> from transformers import AutoImageProcessor, VideoMAEModel
>>> from huggingface_hub import hf_hub_download
>>> np.random.seed(0)
>>> def read_video_pyav(container, indices):
...     '''
...     Decode the video with PyAV decoder.
...     Args:
...         container (`av.container.input.InputContainer`): PyAV container.
...         indices (`List[int]`): List of frame indices to decode.
...     Returns:
...         result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
...     '''
...     frames = []
...     container.seek(0)
...     start_index = indices[0]
...     end_index = indices[-1]
...     for i, frame in enumerate(container.decode(video=0)):
...         if i > end_index:
...             break
...         if i >= start_index and i in indices:
...             frames.append(frame)
...     return np.stack([x.to_ndarray(format="rgb24") for x in frames])
>>> def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
...     '''
...     Sample a given number of frame indices from the video.
...     Args:
...         clip_len (`int`): Total number of frames to sample.
...         frame_sample_rate (`int`): Sample every n-th frame.
...         seg_len (`int`): Maximum allowed index of sample's last frame.
...     Returns:
...         indices (`List[int]`): List of sampled frame indices
...     '''
...     converted_len = int(clip_len * frame_sample_rate)
...     end_idx = np.random.randint(converted_len, seg_len)
...     start_idx = end_idx - converted_len
...     indices = np.linspace(start_idx, end_idx, num=clip_len)
...     indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
...     return indices
>>> # video clip consists of 300 frames (10 seconds at 30 FPS)
>>> file_path = hf_hub_download(
...     repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
... )
>>> container = av.open(file_path)
>>> # sample 16 frames
>>> indices = sample_frame_indices(clip_len=16, frame_sample_rate=1, seg_len=container.streams.video[0].frames)
>>> video = read_video_pyav(container, indices)
>>> image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base")
>>> model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")
>>> # prepare video for the model
>>> inputs = image_processor(list(video), return_tensors="pt")
>>> # forward pass
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 1568, 768]

VideoMAEForPreTraining

VideoMAEForPreTraining包括顶部的解码器用于自监督预训练。

`class transformers.VideoMAEForPreTraining`

<来源>

( config )

参数

config（VideoMAEConfig）- 模型的所有参数的模型配置类。使用配置文件初始化不会加载与模型相关的权重，只会加载配置。查看 from_pretrained()方法以加载模型权重。

带有顶部解码器的 VideoMAE 模型变压器，用于自监督预训练。此模型是 PyTorch torch.nn.Module子类。将其用作常规 PyTorch 模块，并参考 PyTorch 文档以获取有关一般用法和行为的所有相关信息。

`forward`

<来源>

( pixel_values: FloatTensor bool_masked_pos: BoolTensor head_mask: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → export const metadata = 'undefined';transformers.models.videomae.modeling_videomae.VideoMAEForPreTrainingOutput or tuple(torch.FloatTensor)

参数

pixel_values（形状为(batch_size, num_frames, num_channels, height, width)的torch.FloatTensor）- 像素值。可以使用 AutoImageProcessor 获取像素值。有关详细信息，请参阅 VideoMAEImageProcessor.call()。
head_mask（形状为(num_heads,)或(num_layers, num_heads)的torch.FloatTensor，可选）- 用于使自注意力模块中选择的头部失效的掩码。掩码值选在[0, 1]之间：

1 表示头部未被遮蔽，
0 表示头部被遮蔽。

output_attentions（bool，可选）- 是否返回所有注意力层的注意力张量。有关更多详细信息，请查看返回张量下的attentions。
output_hidden_states（bool，可选） - 是否返回所有层的隐藏状态。有关更多详细信息，请参阅返回张量下的hidden_states。
return_dict（bool，可选） - 是否返回 ModelOutput 而不是普通元组。
bool_masked_pos（形状为(batch_size, sequence_length)的torch.BoolTensor） - 布尔掩码位置。指示哪些补丁被掩盖（1）哪些没有（0）。批次中的每个视频必须具有相同数量的掩码补丁。序列长度为(num_frames // tubelet_size) * (image_size // patch_size) ** 2。

transformers.models.videomae.modeling_videomae.VideoMAEForPreTrainingOutput或tuple(torch.FloatTensor)

一个transformers.models.videomae.modeling_videomae.VideoMAEForPreTrainingOutput或一个torch.FloatTensor元组（如果传递return_dict=False或config.return_dict=False），包括根据配置（VideoMAEConfig）和输入的不同元素。

loss（形状为(1,)的torch.FloatTensor） - 像素重建损失。
logits（形状为(batch_size, patch_size ** 2 * num_channels)的torch.FloatTensor） - 像素重建 logits。
hidden_states（tuple(torch.FloatTensor)，可选，当传递output_hidden_states=True或config.output_hidden_states=True时返回） - 形状为(batch_size, sequence_length, hidden_size)的torch.FloatTensor元组。模型在每一层的输出的隐藏状态加上初始嵌入输出。
attentions（tuple(torch.FloatTensor)，可选，当传递output_attentions=True或config.output_attentions=True时返回） - 形状为(batch_size, num_heads, sequence_length, sequence_length)的torch.FloatTensor元组。注意力 softmax 后的注意力权重，用于计算自注意力头中的加权平均值。

VideoMAEForPreTraining 前向方法，覆盖__call__特殊方法。

虽然前向传递的方法需要在此函数内定义，但应该在此之后调用Module实例，而不是在此处调用，因为前者负责运行前处理和后处理步骤，而后者会默默地忽略它们。

示例：

>>> from transformers import AutoImageProcessor, VideoMAEForPreTraining
>>> import numpy as np
>>> import torch
>>> num_frames = 16
>>> video = list(np.random.randint(0, 256, (num_frames, 3, 224, 224)))
>>> image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base")
>>> model = VideoMAEForPreTraining.from_pretrained("MCG-NJU/videomae-base")
>>> pixel_values = image_processor(video, return_tensors="pt").pixel_values
>>> num_patches_per_frame = (model.config.image_size // model.config.patch_size) ** 2
>>> seq_length = (num_frames // model.config.tubelet_size) * num_patches_per_frame
>>> bool_masked_pos = torch.randint(0, 2, (1, seq_length)).bool()
>>> outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
>>> loss = outputs.loss

VideoMAEForVideoClassification

视频分类的 VideoMAEForVideoClassification 类

<来源>

( config )

参数

config（VideoMAEConfig） - 具有模型所有参数的模型配置类。使用配置文件初始化不会加载与模型关联的权重，只加载配置。查看 from_pretrained()方法以加载模型权重。

VideoMAE 模型变压器，顶部带有视频分类头（所有令牌的平均池化隐藏状态之上的线性层），例如用于 ImageNet。此模型是 PyTorch torch.nn.Module子类。将其用作常规 PyTorch 模块，并参考 PyTorch 文档以获取有关一般用法和行为的所有相关信息。

`forward`

<来源>

( pixel_values: Optional = None head_mask: Optional = None labels: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → export const metadata = 'undefined';transformers.modeling_outputs.ImageClassifierOutput or tuple(torch.FloatTensor)

参数

pixel_values (torch.FloatTensor，形状为(batch_size, num_frames, num_channels, height, width)) — 像素值。像素值可以使用 AutoImageProcessor 获得。有关详细信息，请参阅 VideoMAEImageProcessor.call()。
head_mask (torch.FloatTensor，形状为(num_heads,)或(num_layers, num_heads)，可选) — 用于使自注意力模块中选择的头部失效的掩码。掩码值选定在[0, 1]之间：

1 表示头部未被masked，
0 表示头部被masked。

output_attentions (bool，可选) — 是否返回所有注意力层的注意力张量。有关更多详细信息，请参阅返回的张量中的attentions。
output_hidden_states (bool，可选) — 是否返回所有层的隐藏状态。有关更多详细信息，请参阅返回的张量中的hidden_states。
return_dict (bool，可选) — 是否返回一个 ModelOutput 而不是一个普通元组。
labels (torch.LongTensor，形状为(batch_size,)，可选) — 用于计算图像分类/回归损失的标签。索引应在[0, ..., config.num_labels - 1]范围内。如果config.num_labels == 1，则计算回归损失（均方损失），如果config.num_labels > 1，则计算分类损失（交叉熵）。

返回值

transformers.modeling_outputs.ImageClassifierOutput 或tuple(torch.FloatTensor)

一个 transformers.modeling_outputs.ImageClassifierOutput 或一个torch.FloatTensor元组（如果传递了return_dict=False或当config.return_dict=False时）包含根据配置（VideoMAEConfig）和输入的不同元素。

损失 (torch.FloatTensor，形状为(1,)，可选，在提供labels时返回) — 分类（如果config.num_labels==1则为回归）损失。
logits (torch.FloatTensor，形状为(batch_size, config.num_labels)) — 分类（如果config.num_labels==1则为回归）得分（SoftMax 之前）。
hidden_states (tuple(torch.FloatTensor)，可选，在传递output_hidden_states=True或当config.output_hidden_states=True时返回) — 形状为(batch_size, sequence_length, hidden_size)的torch.FloatTensor元组（如果模型具有嵌入层，则为嵌入的输出+每个阶段的输出）隐藏状态（也称为特征图）在每个阶段的模型输出处。
attentions (tuple(torch.FloatTensor)，可选，在传递output_attentions=True或当config.output_attentions=True时返回) — 形状为(batch_size, num_heads, patch_size, sequence_length)的torch.FloatTensor元组（每层一个）。
在注意力 softmax 之后的注意力权重，用于计算自注意力头部的加权平均值。

VideoMAEForVideoClassification 的前向方法，覆盖了__call__特殊方法。

虽然前向传递的配方需要在此函数内定义，但应该在此之后调用Module实例，而不是在此处调用，因为前者会负责运行预处理和后处理步骤，而后者会默默地忽略它们。

示例：

>>> import av
>>> import torch
>>> import numpy as np
>>> from transformers import AutoImageProcessor, VideoMAEForVideoClassification
>>> from huggingface_hub import hf_hub_download
>>> np.random.seed(0)
>>> def read_video_pyav(container, indices):
...     '''
...     Decode the video with PyAV decoder.
...     Args:
...         container (`av.container.input.InputContainer`): PyAV container.
...         indices (`List[int]`): List of frame indices to decode.
...     Returns:
...         result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
...     '''
...     frames = []
...     container.seek(0)
...     start_index = indices[0]
...     end_index = indices[-1]
...     for i, frame in enumerate(container.decode(video=0)):
...         if i > end_index:
...             break
...         if i >= start_index and i in indices:
...             frames.append(frame)
...     return np.stack([x.to_ndarray(format="rgb24") for x in frames])
>>> def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
...     '''
...     Sample a given number of frame indices from the video.
...     Args:
...         clip_len (`int`): Total number of frames to sample.
...         frame_sample_rate (`int`): Sample every n-th frame.
...         seg_len (`int`): Maximum allowed index of sample's last frame.
...     Returns:
...         indices (`List[int]`): List of sampled frame indices
...     '''
...     converted_len = int(clip_len * frame_sample_rate)
...     end_idx = np.random.randint(converted_len, seg_len)
...     start_idx = end_idx - converted_len
...     indices = np.linspace(start_idx, end_idx, num=clip_len)
...     indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
...     return indices
>>> # video clip consists of 300 frames (10 seconds at 30 FPS)
>>> file_path = hf_hub_download(
...     repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
... )
>>> container = av.open(file_path)
>>> # sample 16 frames
>>> indices = sample_frame_indices(clip_len=16, frame_sample_rate=1, seg_len=container.streams.video[0].frames)
>>> video = read_video_pyav(container, indices)
>>> image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
>>> model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
>>> inputs = image_processor(list(video), return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)
...     logits = outputs.logits
>>> # model predicts one of the 400 Kinetics-400 classes
>>> predicted_label = logits.argmax(-1).item()
>>> print(model.config.id2label[predicted_label])
eating spaghetti

for i, frame in enumerate(container.decode(video=0)):
 … if i > end_index:
 … break
 … if i >= start_index and i in indices:
 … frames.append(frame)
 … return np.stack([x.to_ndarray(format=“rgb24”) for x in frames])
def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
 … ‘’’
 … Sample a given number of frame indices from the video.
 … Args:
 … clip_len (int): Total number of frames to sample.
 … frame_sample_rate (int): Sample every n-th frame.
 … seg_len (int): Maximum allowed index of sample’s last frame.
 … Returns:
 … indices (List[int]): List of sampled frame indices
 … ‘’’
 … converted_len = int(clip_len * frame_sample_rate)
 … end_idx = np.random.randint(converted_len, seg_len)
 … start_idx = end_idx - converted_len
 … indices = np.linspace(start_idx, end_idx, num=clip_len)
 … indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
 … return indices

video clip consists of 300 frames (10 seconds at 30 FPS)

file_path = hf_hub_download(
 … repo_id=“nielsr/video-demo”, filename=“eating_spaghetti.mp4”, repo_type=“dataset”
 … )
 container = av.open(file_path)

sample 16 frames

indices = sample_frame_indices(clip_len=16, frame_sample_rate=1, seg_len=container.streams.video[0].frames)
 video = read_video_pyav(container, indices)
image_processor = AutoImageProcessor.from_pretrained(“MCG-NJU/videomae-base-finetuned-kinetics”)
 model = VideoMAEForVideoClassification.from_pretrained(“MCG-NJU/videomae-base-finetuned-kinetics”)
inputs = image_processor(list(video), return_tensors=“pt”)
with torch.no_grad():
 … outputs = model(**inputs)
 … logits = outputs.logits

model predicts one of the 400 Kinetics-400 classes

predicted_label = logits.argmax(-1).item()
 print(model.config.id2label[predicted_label])
 eating spaghetti

Transformers 4.37 中文文档（七十二）（5）

VideoMAEModel

`class transformers.VideoMAEModel`

`forward`

VideoMAEForPreTraining

`class transformers.VideoMAEForPreTraining`

`forward`

VideoMAEForVideoClassification

`forward`

video clip consists of 300 frames (10 seconds at 30 FPS)

sample 16 frames

model predicts one of the 400 Kinetics-400 classes

热门文章

最新文章

相关电子书

热门

活动广场

任务中心

开发者评测

高校计划

乘风者计划

训练营

阿里云MVP

话题

直播

下载

镜像站

技术资料

插件

Transformers 4.37 中文文档（七十二）（5）

VideoMAEModel

class transformers.VideoMAEModel

forward

VideoMAEForPreTraining

class transformers.VideoMAEForPreTraining

forward

VideoMAEForVideoClassification

forward

video clip consists of 300 frames (10 seconds at 30 FPS)

sample 16 frames

model predicts one of the 400 Kinetics-400 classes

热门文章

最新文章

相关电子书

`class transformers.VideoMAEModel`

`forward`

`class transformers.VideoMAEForPreTraining`

`forward`

`forward`