Transformers 4.37 中文文档（五）（1）-阿里云开发者社区

原文：huggingface.co/docs/transformers

目标检测

原始文本：huggingface.co/docs/transformers/v4.37.2/en/tasks/object_detection

目标检测是计算机视觉任务，用于检测图像中的实例（如人类、建筑物或汽车）。目标检测模型接收图像作为输入，并输出检测到的对象的边界框的坐标和相关标签。一幅图像可以包含多个对象，每个对象都有自己的边界框和标签（例如，它可以有一辆汽车和一座建筑物），每个对象可以出现在图像的不同部分（例如，图像可以有几辆汽车）。这个任务通常用于自动驾驶，用于检测行人、道路标志和交通灯等。其他应用包括在图像中计数对象、图像搜索等。

在本指南中，您将学习如何：

对DETR进行微调，这是一个将卷积主干与编码器-解码器 Transformer 结合的模型，在CPPE-5数据集上进行训练。
使用您微调的模型进行推断。

本教程中所示的任务由以下模型架构支持：

条件 DETR, 可变 DETR, DETA, DETR, 表格 Transformer, YOLOS

在开始之前，请确保已安装所有必要的库：

pip install -q datasets transformers evaluate timm albumentations

您将使用🤗数据集从 Hugging Face Hub 加载数据集，🤗转换器来训练您的模型，并使用albumentations来增强数据。目前需要使用timm来加载 DETR 模型的卷积主干。

我们鼓励您与社区分享您的模型。登录到您的 Hugging Face 帐户并将其上传到 Hub。在提示时，输入您的令牌以登录：

>>> from huggingface_hub import notebook_login
>>> notebook_login()

加载 CPPE-5 数据集

CPPE-5 数据集包含带有注释的图像，用于识别 COVID-19 大流行背景下的医疗个人防护装备（PPE）。

首先加载数据集：

>>> from datasets import load_dataset
>>> cppe5 = load_dataset("cppe-5")
>>> cppe5
DatasetDict({
    train: Dataset({
        features: ['image_id', 'image', 'width', 'height', 'objects'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['image_id', 'image', 'width', 'height', 'objects'],
        num_rows: 29
    })
})

您将看到这个数据集已经带有一个包含 1000 张图像的训练集和一个包含 29 张图像的测试集。

熟悉数据，探索示例的外观。

>>> cppe5["train"][0]
{'image_id': 15,
 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=943x663 at 0x7F9EC9E77C10>,
 'width': 943,
 'height': 663,
 'objects': {'id': [114, 115, 116, 117],
  'area': [3796, 1596, 152768, 81002],
  'bbox': [[302.0, 109.0, 73.0, 52.0],
   [810.0, 100.0, 57.0, 28.0],
   [160.0, 31.0, 248.0, 616.0],
   [741.0, 68.0, 202.0, 401.0]],
  'category': [4, 4, 0, 0]}}

数据集中的示例具有以下字段：

image_id：示例图像 id
image：包含图像的PIL.Image.Image对象
width：图像的宽度
height：图像的高度
objects：包含图像中对象的边界框元数据的字典：

id：注释 id
area：边界框的面积
bbox：对象的边界框（以COCO 格式）
category：对象的类别，可能的值包括防护服（0）、面罩（1）、手套（2）、护目镜（3）和口罩（4）

您可能会注意到bbox字段遵循 COCO 格式，这是 DETR 模型期望的格式。然而，objects内部字段的分组与 DETR 所需的注释格式不同。在使用此数据进行训练之前，您需要应用一些预处理转换。

为了更好地理解数据，可视化数据集中的一个示例。

>>> import numpy as np
>>> import os
>>> from PIL import Image, ImageDraw
>>> image = cppe5["train"][0]["image"]
>>> annotations = cppe5["train"][0]["objects"]
>>> draw = ImageDraw.Draw(image)
>>> categories = cppe5["train"].features["objects"].feature["category"].names
>>> id2label = {index: x for index, x in enumerate(categories, start=0)}
>>> label2id = {v: k for k, v in id2label.items()}
>>> for i in range(len(annotations["id"])):
...     box = annotations["bbox"][i]
...     class_idx = annotations["category"][i]
...     x, y, w, h = tuple(box)
...     # Check if coordinates are normalized or not
...     if max(box) > 1.0:
...         # Coordinates are un-normalized, no need to re-scale them
...         x1, y1 = int(x), int(y)
...         x2, y2 = int(x + w), int(y + h)
...     else:
...         # Coordinates are normalized, re-scale them
...         x1 = int(x * width)
...         y1 = int(y * height)
...         x2 = int((x + w) * width)
...         y2 = int((y + h) * height)
...     draw.rectangle((x, y, x + w, y + h), outline="red", width=1)
...     draw.text((x, y), id2label[class_idx], fill="white")
>>> image

要可视化带有关联标签的边界框，您可以从数据集的元数据中获取标签，特别是category字段。您还需要创建映射标签 id 到标签类别（id2label）以及反向映射（label2id）的字典。在设置模型时，您可以稍后使用它们。包括这些映射将使您的模型在 Hugging Face Hub 上共享时可以被其他人重复使用。请注意，上述代码中绘制边界框的部分假定它是以XYWH（x，y 坐标和框的宽度和高度）格式。对于其他格式如(x1，y1，x2，y2)可能无法正常工作。

作为熟悉数据的最后一步，探索可能存在的问题。目标检测数据集的一个常见问题是边界框“拉伸”到图像边缘之外。这种“失控”的边界框可能会在训练过程中引发错误，应在此阶段加以解决。在这个数据集中有一些示例存在这个问题。为了简化本指南中的操作，我们将这些图像从数据中删除。

>>> remove_idx = [590, 821, 822, 875, 876, 878, 879]
>>> keep = [i for i in range(len(cppe5["train"])) if i not in remove_idx]
>>> cppe5["train"] = cppe5["train"].select(keep)

预处理数据

要微调模型，您必须预处理您计划使用的数据，以精确匹配预训练模型使用的方法。AutoImageProcessor 负责处理图像数据以创建pixel_values，pixel_mask和labels，供 DETR 模型训练。图像处理器具有一些属性，您无需担心：

image_mean = [0.485, 0.456, 0.406 ]
image_std = [0.229, 0.224, 0.225]

这些是用于在模型预训练期间对图像进行归一化的均值和标准差。在进行推理或微调预训练图像模型时，这些值至关重要。

从要微调的模型相同的检查点实例化图像处理器。

>>> from transformers import AutoImageProcessor
>>> checkpoint = "facebook/detr-resnet-50"
>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint)

在将图像传递给image_processor之前，对数据集应用两个预处理转换：

增强图像
重新格式化注释以满足 DETR 的期望

首先，为了确保模型不会在训练数据上过拟合，您可以使用任何数据增强库进行图像增强。这里我们使用Albumentations…此库确保转换影响图像并相应更新边界框。🤗数据集库文档有一个详细的关于如何为目标检测增强图像的指南，它使用相同的数据集作为示例。在这里应用相同的方法，将每个图像调整为(480, 480)，水平翻转并增加亮度：

>>> import albumentations
>>> import numpy as np
>>> import torch
>>> transform = albumentations.Compose(
...     [
...         albumentations.Resize(480, 480),
...         albumentations.HorizontalFlip(p=1.0),
...         albumentations.RandomBrightnessContrast(p=1.0),
...     ],
...     bbox_params=albumentations.BboxParams(format="coco", label_fields=["category"]),
... )

image_processor期望注释采用以下格式：{'image_id': int, 'annotations': List[Dict]}，其中每个字典是一个 COCO 对象注释。让我们添加一个函数来为单个示例重新格式化注释：

>>> def formatted_anns(image_id, category, area, bbox):
...     annotations = []
...     for i in range(0, len(category)):
...         new_ann = {
...             "image_id": image_id,
...             "category_id": category[i],
...             "isCrowd": 0,
...             "area": area[i],
...             "bbox": list(bbox[i]),
...         }
...         annotations.append(new_ann)
...     return annotations

现在，您可以将图像和注释转换组合在一起，用于一批示例：

>>> # transforming a batch
>>> def transform_aug_ann(examples):
...     image_ids = examples["image_id"]
...     images, bboxes, area, categories = [], [], [], []
...     for image, objects in zip(examples["image"], examples["objects"]):
...         image = np.array(image.convert("RGB"))[:, :, ::-1]
...         out = transform(image=image, bboxes=objects["bbox"], category=objects["category"])
...         area.append(objects["area"])
...         images.append(out["image"])
...         bboxes.append(out["bboxes"])
...         categories.append(out["category"])
...     targets = [
...         {"image_id": id_, "annotations": formatted_anns(id_, cat_, ar_, box_)}
...         for id_, cat_, ar_, box_ in zip(image_ids, categories, area, bboxes)
...     ]
...     return image_processor(images=images, annotations=targets, return_tensors="pt")

使用🤗数据集的with_transform方法将此预处理函数应用于整个数据集。此方法在加载数据集元素时动态应用转换。

此时，您可以检查数据集经过转换后的示例是什么样子。您应该看到一个带有pixel_values的张量，一个带有pixel_mask的张量和labels。

>>> cppe5["train"] = cppe5["train"].with_transform(transform_aug_ann)
>>> cppe5["train"][15]
{'pixel_values': tensor([[[ 0.9132,  0.9132,  0.9132,  ..., -1.9809, -1.9809, -1.9809],
          [ 0.9132,  0.9132,  0.9132,  ..., -1.9809, -1.9809, -1.9809],
          [ 0.9132,  0.9132,  0.9132,  ..., -1.9638, -1.9638, -1.9638],
          ...,
          [-1.5699, -1.5699, -1.5699,  ..., -1.9980, -1.9980, -1.9980],
          [-1.5528, -1.5528, -1.5528,  ..., -1.9980, -1.9809, -1.9809],
          [-1.5528, -1.5528, -1.5528,  ..., -1.9980, -1.9809, -1.9809]],
         [[ 1.3081,  1.3081,  1.3081,  ..., -1.8431, -1.8431, -1.8431],
          [ 1.3081,  1.3081,  1.3081,  ..., -1.8431, -1.8431, -1.8431],
          [ 1.3081,  1.3081,  1.3081,  ..., -1.8256, -1.8256, -1.8256],
          ...,
          [-1.3179, -1.3179, -1.3179,  ..., -1.8606, -1.8606, -1.8606],
          [-1.3004, -1.3004, -1.3004,  ..., -1.8606, -1.8431, -1.8431],
          [-1.3004, -1.3004, -1.3004,  ..., -1.8606, -1.8431, -1.8431]],
         [[ 1.4200,  1.4200,  1.4200,  ..., -1.6476, -1.6476, -1.6476],
          [ 1.4200,  1.4200,  1.4200,  ..., -1.6476, -1.6476, -1.6476],
          [ 1.4200,  1.4200,  1.4200,  ..., -1.6302, -1.6302, -1.6302],
          ...,
          [-1.0201, -1.0201, -1.0201,  ..., -1.5604, -1.5604, -1.5604],
          [-1.0027, -1.0027, -1.0027,  ..., -1.5604, -1.5430, -1.5430],
          [-1.0027, -1.0027, -1.0027,  ..., -1.5604, -1.5430, -1.5430]]]),
 'pixel_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         ...,
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1]]),
 'labels': {'size': tensor([800, 800]), 'image_id': tensor([756]), 'class_labels': tensor([4]), 'boxes': tensor([[0.7340, 0.6986, 0.3414, 0.5944]]), 'area': tensor([519544.4375]), 'iscrowd': tensor([0]), 'orig_size': tensor([480, 480])}}

您已成功增强了单个图像并准备好它们的注释。然而，预处理还没有完成。在最后一步中，创建一个自定义的collate_fn来将图像批量处理在一起。将图像（现在是pixel_values）填充到批次中最大的图像，并创建一个相应的pixel_mask来指示哪些像素是真实的（1），哪些是填充的（0）。

>>> def collate_fn(batch):
...     pixel_values = [item["pixel_values"] for item in batch]
...     encoding = image_processor.pad(pixel_values, return_tensors="pt")
...     labels = [item["labels"] for item in batch]
...     batch = {}
...     batch["pixel_values"] = encoding["pixel_values"]
...     batch["pixel_mask"] = encoding["pixel_mask"]
...     batch["labels"] = labels
...     return batch

Transformers 4.37 中文文档（五）（1）https://developer.aliyun.com/article/1565242

Transformers 4.37 中文文档（五）（1）

目标检测

加载 CPPE-5 数据集

预处理数据

热门文章

最新文章

相关课程

相关电子书