Load a model to fine-tune
Instantiate a video classification model from a pretrained checkpoint and its associated image processor. The model's encoder comes with pretrained parameters, while the classification head is randomly initialized. The image processor will come in handy when writing the preprocessing pipeline for our dataset.
```py
>>> from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

>>> model_ckpt = "MCG-NJU/videomae-base"
>>> image_processor = VideoMAEImageProcessor.from_pretrained(model_ckpt)
>>> model = VideoMAEForVideoClassification.from_pretrained(
...     model_ckpt,
...     label2id=label2id,
...     id2label=id2label,
...     ignore_mismatched_sizes=True,  # provide this in case you're planning to fine-tune an already fine-tuned checkpoint
... )
```
While the model is loading, you might notice the following warning:
```
Some weights of the model checkpoint at MCG-NJU/videomae-base were not used when initializing VideoMAEForVideoClassification: [..., 'decoder.decoder_layers.1.attention.output.dense.bias', 'decoder.decoder_layers.2.attention.attention.key.weight']
- This IS expected if you are initializing VideoMAEForVideoClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing VideoMAEForVideoClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of VideoMAEForVideoClassification were not initialized from the model checkpoint at MCG-NJU/videomae-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
```
The warning is telling us we are discarding some weights (e.g. the weights and bias of the `classifier` layer) and randomly initializing some others (the weights and bias of a new `classifier` layer). This is expected in this case, because we are adding a new head for which we don't have pretrained weights, so the library warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.
Note that this checkpoint leads to better performance on this task, since it was obtained by fine-tuning on a similar downstream task with considerable domain overlap. You can check out this checkpoint, which was obtained by fine-tuning MCG-NJU/videomae-base-finetuned-kinetics.
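If you prefer to start from that stronger, Kinetics-finetuned backbone instead of the plain pretrained one, it can be loaded in the same way. A minimal sketch follows; `label2id` and `id2label` are assumed to be the UCF-101-subset mappings prepared earlier in this guide:

```py
>>> from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

>>> # Alternative starting point: a checkpoint already fine-tuned on Kinetics-400
>>> alt_ckpt = "MCG-NJU/videomae-base-finetuned-kinetics"
>>> image_processor = VideoMAEImageProcessor.from_pretrained(alt_ckpt)
>>> model = VideoMAEForVideoClassification.from_pretrained(
...     alt_ckpt,
...     label2id=label2id,
...     id2label=id2label,
...     ignore_mismatched_sizes=True,  # the Kinetics head has a different number of classes, so it gets re-initialized
... )
```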
Prepare the datasets for training
For preprocessing the videos, you will leverage the PyTorchVideo library. Start by importing the dependencies we need.
```py
>>> import pytorchvideo.data

>>> from pytorchvideo.transforms import (
...     ApplyTransformToKey,
...     Normalize,
...     RandomShortSideScale,
...     RemoveKey,
...     ShortSideScale,
...     UniformTemporalSubsample,
... )

>>> from torchvision.transforms import (
...     Compose,
...     Lambda,
...     RandomCrop,
...     RandomHorizontalFlip,
...     Resize,
... )
```
For the training dataset transformations, use a combination of uniform temporal subsampling, pixel normalization, random cropping, and random horizontal flipping. For the validation and evaluation dataset transformations, keep the same transformation chain except for the random cropping and horizontal flipping. To learn more about the details of these transformations, check out the official documentation of PyTorchVideo.
Use the `image_processor` associated with the pretrained model to obtain the following information:

- The image mean and standard deviation with which the video frame pixels will be normalized.
- The spatial resolution to which the video frames will be resized.
Start by defining some constants.
```py
>>> mean = image_processor.image_mean
>>> std = image_processor.image_std
>>> if "shortest_edge" in image_processor.size:
...     height = width = image_processor.size["shortest_edge"]
>>> else:
...     height = image_processor.size["height"]
...     width = image_processor.size["width"]
>>> resize_to = (height, width)

>>> num_frames_to_sample = model.config.num_frames
>>> sample_rate = 4
>>> fps = 30
>>> clip_duration = num_frames_to_sample * sample_rate / fps
```
Now, define the dataset-specific transformations and the datasets respectively. Starting with the training set:
```py
>>> train_transform = Compose(
...     [
...         ApplyTransformToKey(
...             key="video",
...             transform=Compose(
...                 [
...                     UniformTemporalSubsample(num_frames_to_sample),
...                     Lambda(lambda x: x / 255.0),
...                     Normalize(mean, std),
...                     RandomShortSideScale(min_size=256, max_size=320),
...                     RandomCrop(resize_to),
...                     RandomHorizontalFlip(p=0.5),
...                 ]
...             ),
...         ),
...     ]
... )

>>> train_dataset = pytorchvideo.data.Ucf101(
...     data_path=os.path.join(dataset_root_path, "train"),
...     clip_sampler=pytorchvideo.data.make_clip_sampler("random", clip_duration),
...     decode_audio=False,
...     transform=train_transform,
... )
```
The same sequence of steps can be applied to the validation and evaluation sets:
```py
>>> val_transform = Compose(
...     [
...         ApplyTransformToKey(
...             key="video",
...             transform=Compose(
...                 [
...                     UniformTemporalSubsample(num_frames_to_sample),
...                     Lambda(lambda x: x / 255.0),
...                     Normalize(mean, std),
...                     Resize(resize_to),
...                 ]
...             ),
...         ),
...     ]
... )

>>> val_dataset = pytorchvideo.data.Ucf101(
...     data_path=os.path.join(dataset_root_path, "val"),
...     clip_sampler=pytorchvideo.data.make_clip_sampler("uniform", clip_duration),
...     decode_audio=False,
...     transform=val_transform,
... )

>>> test_dataset = pytorchvideo.data.Ucf101(
...     data_path=os.path.join(dataset_root_path, "test"),
...     clip_sampler=pytorchvideo.data.make_clip_sampler("uniform", clip_duration),
...     decode_audio=False,
...     transform=val_transform,
... )
```
Note: the above dataset pipelines are taken from the official PyTorchVideo example. We're using the `pytorchvideo.data.Ucf101()` function because it's tailored for the UCF-101 dataset. Under the hood, it returns a `pytorchvideo.data.labeled_video_dataset.LabeledVideoDataset` object. The `LabeledVideoDataset` class is the base class for all things video in the PyTorchVideo datasets. So, if you want to use a custom dataset not supported off-the-shelf by PyTorchVideo, you can extend the `LabeledVideoDataset` class accordingly. Refer to the `data` API documentation to learn more. Also, if your dataset follows a similar structure (as shown above), then using `pytorchvideo.data.Ucf101()` should work just fine.
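As an illustration, here is a minimal sketch of what wiring up a custom dataset through `LabeledVideoDataset` directly could look like. The `my_videos/` paths are hypothetical, `clip_duration` and `train_transform` are reused from above, and the exact constructor arguments should be double-checked against the `data` API documentation:

```py
>>> from pytorchvideo.data import LabeledVideoDataset, make_clip_sampler

>>> # Hypothetical list of (video_path, metadata) pairs for a custom dataset
>>> labeled_video_paths = [
...     ("my_videos/clip_0001.mp4", {"label": 0}),
...     ("my_videos/clip_0002.mp4", {"label": 1}),
... ]

>>> custom_dataset = LabeledVideoDataset(
...     labeled_video_paths=labeled_video_paths,
...     clip_sampler=make_clip_sampler("random", clip_duration),
...     transform=train_transform,  # reuse the training transform defined above
...     decode_audio=False,
... )
```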
You can access the `num_videos` argument to know the number of videos in the dataset.
```py
>>> print(train_dataset.num_videos, val_dataset.num_videos, test_dataset.num_videos)
# (300, 30, 75)
```
Visualize the preprocessed video for better debugging
```py
>>> import imageio
>>> import numpy as np
>>> from IPython.display import Image

>>> def unnormalize_img(img):
...     """Un-normalizes the image pixels."""
...     img = (img * std) + mean
...     img = (img * 255).astype("uint8")
...     return img.clip(0, 255)

>>> def create_gif(video_tensor, filename="sample.gif"):
...     """Prepares a GIF from a video tensor.
...
...     The video tensor is expected to have the following shape:
...     (num_frames, num_channels, height, width).
...     """
...     frames = []
...     for video_frame in video_tensor:
...         frame_unnormalized = unnormalize_img(video_frame.permute(1, 2, 0).numpy())
...         frames.append(frame_unnormalized)
...     kargs = {"duration": 0.25}
...     imageio.mimsave(filename, frames, "GIF", **kargs)
...     return filename

>>> def display_gif(video_tensor, gif_name="sample.gif"):
...     """Prepares and displays a GIF from a video tensor."""
...     video_tensor = video_tensor.permute(1, 0, 2, 3)
...     gif_filename = create_gif(video_tensor, gif_name)
...     return Image(filename=gif_filename)

>>> sample_video = next(iter(train_dataset))
>>> video_tensor = sample_video["video"]
>>> display_gif(video_tensor)
```
Train the model
Leverage the `Trainer` from 🤗 Transformers to train the model. To instantiate a `Trainer`, you need to define the training configuration and an evaluation metric. The most important one is `TrainingArguments`, a class that contains all the attributes to configure the training. It requires an output folder name, which will be used to save the checkpoints of the model. It also helps sync all the information in the model repository on the 🤗 Hub.
Most of the training arguments are self-explanatory, but one that is quite important here is `remove_unused_columns=False`. This argument drops any features not used by the model's call function. By default it's `True`, because usually it's ideal to drop unused feature columns, which makes it easier to unpack the inputs into the model's call function. But in this case, you need the unused features ('video' in particular) in order to create `pixel_values` (which is a mandatory key our model expects in its inputs).
```py
>>> from transformers import TrainingArguments, Trainer

>>> model_name = model_ckpt.split("/")[-1]
>>> new_model_name = f"{model_name}-finetuned-ucf101-subset"
>>> num_epochs = 4

>>> args = TrainingArguments(
...     new_model_name,
...     remove_unused_columns=False,
...     evaluation_strategy="epoch",
...     save_strategy="epoch",
...     learning_rate=5e-5,
...     per_device_train_batch_size=batch_size,
...     per_device_eval_batch_size=batch_size,
...     warmup_ratio=0.1,
...     logging_steps=10,
...     load_best_model_at_end=True,
...     metric_for_best_model="accuracy",
...     push_to_hub=True,
...     max_steps=(train_dataset.num_videos // batch_size) * num_epochs,
... )
```
The dataset returned by `pytorchvideo.data.Ucf101()` doesn't implement the `__len__` method. As such, we must define `max_steps` when instantiating `TrainingArguments`.
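For example, with the 300 training videos counted above and a hypothetical `batch_size` of 8, this works out to `max_steps = (300 // 8) * 4 = 148`.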
Next, you need to define a function to compute the metrics from the predictions, which will use the `metric` you'll load now. The only preprocessing you have to do is to take the argmax of our predicted logits:
```py
import evaluate

metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)
```
A note on evaluation:

In the VideoMAE paper, the authors use the following evaluation strategy: they evaluate the model on several clips from test videos, apply different crops to those clips, and report the aggregate score. However, in the interest of simplicity and brevity, we don't consider that in this tutorial.
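If you'd like to approximate that protocol, here is a minimal sketch of multi-clip evaluation. It assumes the `test_dataset` defined above (whose "uniform" clip sampler yields several clips per video) and that each sample dict carries a `video_name` key, which is how PyTorchVideo's `LabeledVideoDataset` typically identifies the source video; logits are averaged over the clips of each video, and the different spatial crops from the paper are left out for brevity:

```py
>>> import torch
>>> from collections import defaultdict

>>> @torch.no_grad()
... def evaluate_multi_clip(model, dataset, device="cpu"):
...     """Averages logits over all clips belonging to the same source video."""
...     model = model.to(device).eval()
...     per_video_logits = defaultdict(list)
...     per_video_label = {}
...     for sample in dataset:
...         # sample["video"] is (num_channels, num_frames, height, width); the model
...         # expects (batch, num_frames, num_channels, height, width)
...         pixel_values = sample["video"].permute(1, 0, 2, 3).unsqueeze(0).to(device)
...         logits = model(pixel_values=pixel_values).logits.squeeze(0).cpu()
...         per_video_logits[sample["video_name"]].append(logits)
...         per_video_label[sample["video_name"]] = sample["label"]
...     correct = 0
...     for name, clip_logits in per_video_logits.items():
...         pred = torch.stack(clip_logits).mean(dim=0).argmax(-1).item()
...         correct += int(pred == per_video_label[name])
...     return correct / len(per_video_logits)

>>> # e.g. evaluate_multi_clip(trainer.model, test_dataset, device="cuda")
```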
Also, define a `collate_fn`, which will be used to batch examples together. Each batch consists of 2 keys, namely `pixel_values` and `labels`.
```py
>>> def collate_fn(examples):
...     # permute to (num_frames, num_channels, height, width)
...     pixel_values = torch.stack(
...         [example["video"].permute(1, 0, 2, 3) for example in examples]
...     )
...     labels = torch.tensor([example["label"] for example in examples])
...     return {"pixel_values": pixel_values, "labels": labels}
```
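As a quick sanity check (a sketch that assumes the `train_dataset` defined above), you can batch a couple of samples and inspect the shapes; with the `MCG-NJU/videomae-base` configuration the pixel values should come out as `(2, 16, 3, 224, 224)`:

```py
>>> train_iter = iter(train_dataset)
>>> batch = collate_fn([next(train_iter) for _ in range(2)])
>>> print(batch["pixel_values"].shape, batch["labels"].shape)
# expected: torch.Size([2, 16, 3, 224, 224]) torch.Size([2])
```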
Then you just pass all of this along with the datasets to `Trainer`:
```py
>>> trainer = Trainer(
...     model,
...     args,
...     train_dataset=train_dataset,
...     eval_dataset=val_dataset,
...     tokenizer=image_processor,
...     compute_metrics=compute_metrics,
...     data_collator=collate_fn,
... )
```
You might wonder why you passed along the `image_processor` as a tokenizer when you preprocessed the data already. This is only to make sure the image processor configuration file (stored as JSON) will also be uploaded to the repo on the Hub.
Now fine-tune our model by calling the `train` method:
```py
>>> train_results = trainer.train()
```
Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use your model:
```py
>>> trainer.push_to_hub()
```
Inference
Great, now that you have fine-tuned a model, you can use it for inference!
Load a video for inference:
```py
>>> sample_test_video = next(iter(test_dataset))
```
The simplest way to try out your fine-tuned model for inference is to use it in a `pipeline`. Instantiate a `pipeline` for video classification with your model, and pass your video to it:
```py
>>> from transformers import pipeline

>>> video_cls = pipeline(model="my_awesome_video_cls_model")
>>> video_cls("https://huggingface.co/datasets/sayakpaul/ucf101-subset/resolve/main/v_BasketballDunk_g14_c06.avi")
[{'score': 0.9272987842559814, 'label': 'BasketballDunk'},
 {'score': 0.017777055501937866, 'label': 'BabyCrawling'},
 {'score': 0.01663011871278286, 'label': 'BalanceBeam'},
 {'score': 0.009560945443809032, 'label': 'BandMarching'},
 {'score': 0.0068979403004050255, 'label': 'BaseballPitch'}]
```
You can also manually replicate the results of the `pipeline` if you'd like.
```py
>>> def run_inference(model, video):
...     # (num_frames, num_channels, height, width)
...     permuted_sample_test_video = video.permute(1, 0, 2, 3)
...     inputs = {
...         "pixel_values": permuted_sample_test_video.unsqueeze(0),
...         "labels": torch.tensor(
...             [sample_test_video["label"]]
...         ),  # this can be skipped if you don't have labels available.
...     }
...     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
...     inputs = {k: v.to(device) for k, v in inputs.items()}
...     model = model.to(device)
...     # forward pass
...     with torch.no_grad():
...         outputs = model(**inputs)
...         logits = outputs.logits
...     return logits
```
Now, pass your input to the model and return the `logits`:
```py
>>> logits = run_inference(trained_model, sample_test_video["video"])
```
Decoding the `logits`, we get:
```py
>>> predicted_class_idx = logits.argmax(-1).item()
>>> print("Predicted class:", model.config.id2label[predicted_class_idx])
# Predicted class: BasketballDunk
```