Preparing Data for YOLO-World
Overview
For pre-training YOLO-World, we adopt several datasets, as listed in the table below:
| Data | Samples | Type | Boxes |
|------|---------|------|-------|
| Objects365v1 | 609k | detection | 9,621k |
| GQA | 621k | grounding | 3,681k |
| Flickr | 149k | grounding | 641k |
| CC3M-Lite | 245k | image-text | 821k |
Dataset Directory
We put all data into the `data` directory, such as:

```
├── coco
│   ├── annotations
│   ├── lvis
│   ├── train2017
│   ├── val2017
├── flickr
│   ├── annotations
│   └── images
├── mixed_grounding
│   ├── annotations
│   ├── images
├── objects365v1
│   ├── annotations
│   ├── train
│   ├── val
```
NOTE: We strongly suggest that you check the directories or paths in the dataset part of the config file, especially for the values of `ann_file`, `data_root`, and `data_prefix`.
We provide the annotations of the pre-training data in the table below:
| Data | Images | Annotation File |
|------|--------|-----------------|
| Objects365v1 | Objects365 train | `objects365_train.json` |
| MixedGrounding | GQA | `final_mixed_train_no_coco.json` |
| Flickr30k | Flickr30k | `final_flickr_separateGT_train.json` |
| LVIS-minival | COCO val2017 | `lvis_v1_minival_inserted_image_name.json` |
Acknowledgement: We sincerely thank GLIP and mdetr for providing the annotation files for pre-training.
Dataset Class
For training YOLO-World, we mainly adopt two kinds of dataset classes:
1. MultiModalDataset
`MultiModalDataset` is a simple wrapper for a pre-defined dataset class, such as `Objects365` or `COCO`, which adds the texts (category texts) to the dataset instance for formatting the input texts.
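For reference, wrapping a detection dataset with `MultiModalDataset` might look roughly like the sketch below. It follows the same pattern as the COCO fine-tuning dataset shown later in this document; the paths, the `class_text_path`, and `train_pipeline` are placeholders for your own setup rather than the exact released config.

```python
# Sketch only: wrap Objects365 with MultiModalDataset so that category texts
# are attached to every sample. Paths and the text json are illustrative, and
# `train_pipeline` is assumed to be defined elsewhere in the config file.
obj365v1_train_dataset = dict(
    type='MultiModalDataset',
    dataset=dict(
        type='YOLOv5Objects365V1Dataset',
        data_root='data/objects365v1/',
        ann_file='annotations/objects365_train.json',
        data_prefix=dict(img='train/'),
        filter_cfg=dict(filter_empty_gt=False, min_size=32)),
    class_text_path='data/texts/obj365v1_class_texts.json',
    pipeline=train_pipeline)
```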
Text JSON
The json file is formatted as follows:
```
[
    ['A_1', 'A_2'],
    ['B'],
    ['C_1', 'C_2', 'C_3'],
    ...
]
```
We have provided the text json for `LVIS`, `COCO`, and `Objects365`.
2. YOLOv5MixedGroundingDataset
The `YOLOv5MixedGroundingDataset` extends the `COCO` dataset by supporting loading texts/captions from the json file. It is designed for `MixedGrounding` or `Flickr30K` with text tokens for each object.
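For intuition, a grounding record in this style pairs each box with the part of the image caption that describes it. The example below is an illustrative, simplified MDETR-style entry (the field values are made up and not copied from the actual annotation files):

```python
# Illustrative only: a simplified MDETR-style grounding record. Each box refers
# back to the words that describe it via character offsets into the caption.
image_entry = {
    'id': 1,
    'file_name': '000001.jpg',
    'height': 480,
    'width': 640,
    'caption': 'a man riding a red bicycle',
}
annotation_entry = {
    'id': 10,
    'image_id': 1,
    'bbox': [120.0, 80.0, 60.0, 150.0],   # COCO-style [x, y, w, h]
    'category_id': 1,
    'tokens_positive': [[2, 5]],          # character span of "man" in the caption
}
```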
🔥 Custom Datasets
For custom datasets, we suggest that users convert the annotation files according to the intended usage. Note that converting the annotations to the standard COCO format is basically required.
- Large vocabulary, grounding, referring: you can follow the annotation format of the `MixedGrounding` dataset, which adds `caption` and `tokens_positive` for assigning the text to each object. The texts can be a category or noun phrases.
- Custom vocabulary (fixed): you can adopt the `MultiModalDataset` wrapper as for `Objects365` and create a text json for your custom categories (see the sketch below).
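As a concrete example, the text json for a fixed custom vocabulary can be generated with a few lines of Python. The class names and output path below are placeholders for your own data:

```python
# Minimal sketch: write a text json in the format shown above, one entry per
# category, where each entry may list several texts for the same class.
import json

custom_classes = [
    ['person'],
    ['traffic light'],
    ['car', 'automobile'],   # multiple texts can describe one category
]

with open('data/texts/custom_class_texts.json', 'w') as f:
    json.dump(custom_classes, f)
```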
Fine-tuning YOLO-World
Fine-tuning YOLO-World is easy, and we provide samples for COCO object detection as simple guidance.
Fine-tuning Requirements
Fine-tuning YOLO-World is cheap:
- it does not require 32 GPUs for multi-node distributed training. 8 GPUs or even 1 GPU is enough.
- it does not require a long schedule, e.g., the 300 or 500 epochs used for training YOLOv5 or YOLOv8; 80 epochs or fewer is enough, considering that we provide good pre-trained weights.
Data Preparation
The fine-tuning dataset should have a similar format to that of the pre-training dataset.
We suggest you refer to `docs/data` for more details about how to build the datasets:
- if you fine-tune YOLO-World for close-set / custom vocabulary object detection, using `MultiModalDataset` with a `text json` is preferred.
- if you fine-tune YOLO-World for open-vocabulary detection with rich texts or grounding tasks, using `MixedGroundingDataset` is preferred.
Hyper-parameters and Config
Please refer to the config for fine-tuning YOLO-World-L on COCO for more details.
- Basic config file:
If the fine-tuning dataset contains mask annotations:

```python
_base_ = ('../../third_party/mmyolo/configs/yolov8/yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco.py')
```

If the fine-tuning dataset doesn't contain mask annotations:

```python
_base_ = ('../../third_party/mmyolo/configs/yolov8/yolov8_l_syncbn_fast_8xb16-500e_coco.py')
```
- Training Schemes:
Reducing the epochs and adjusting the learning rate:

```python
max_epochs = 80
base_lr = 2e-4
weight_decay = 0.05
train_batch_size_per_gpu = 16
close_mosaic_epochs = 10

train_cfg = dict(
    max_epochs=max_epochs,
    val_interval=5,
    dynamic_intervals=[((max_epochs - close_mosaic_epochs),
                        _base_.val_interval_stage2)])
```
- Datasets:
```python
coco_train_dataset = dict(
    _delete_=True,
    type='MultiModalDataset',
    dataset=dict(
        type='YOLOv5CocoDataset',
        data_root='data/coco',
        ann_file='annotations/instances_train2017.json',
        data_prefix=dict(img='train2017/'),
        filter_cfg=dict(filter_empty_gt=False, min_size=32)),
    class_text_path='data/texts/coco_class_texts.json',
    pipeline=train_pipeline)
```
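The wrapped dataset and the adjusted hyper-parameters then need to be wired into the dataloader and optimizer. The snippet below is a rough sketch in the MMYOLO/MMEngine config style; the optimizer type and exact fields are assumptions here, so check the released fine-tuning config for the authoritative values.

```python
# Sketch only: connect the wrapped dataset and the reduced learning rate to the
# training loop. AdamW is an assumption, not confirmed from the released config.
train_dataloader = dict(
    batch_size=train_batch_size_per_gpu,
    dataset=coco_train_dataset)

optim_wrapper = dict(
    optimizer=dict(
        _delete_=True,
        type='AdamW',
        lr=base_lr,
        weight_decay=weight_decay))
```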
Finetuning without RepVL-PAN or Text Encoder 🚀
For further efficiency and simplicity, we can fine-tune an efficient version of YOLO-World without RepVL-PAN and the text encoder.
The efficient version of YOLO-World has a similar architecture and layers to the original YOLOv8, but we provide pre-trained weights on large-scale datasets.
The pre-trained YOLO-World has strong generalization capabilities and is more robust compared to YOLOv8 trained on the COCO dataset.
You can refer to the config for Efficient YOLO-World for more details.
The efficient YOLO-World adopts `EfficientCSPLayerWithTwoConv`, and the text encoder can be removed during inference or when exporting models.
```python
model = dict(
    type='YOLOWorldDetector',
    mm_neck=True,
    neck=dict(type='YOLOWorldPAFPN',
              guide_channels=text_channels,
              embed_channels=neck_embed_channels,
              num_heads=neck_num_heads,
              block_cfg=dict(type='EfficientCSPLayerWithTwoConv')))
```
Launch Fine-tuning!
It’s easy:
```bash
./dist_train.sh <path/to/config> <NUM_GPUS> --amp
```
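For example, fine-tuning with 8 GPUs and mixed precision could look like `./dist_train.sh configs/finetune_coco/<your_finetune_config>.py 8 --amp`, where the config path is a placeholder for your own file; the `--amp` flag enables automatic mixed-precision training.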
COCO Fine-tuning
| model | efficient neck | AP | AP50 | AP75 | weights |
|-------|----------------|----|------|------|---------|
| YOLO-World-S | ✖️ | 45.7 | 62.3 | 49.9 | coming |
| YOLO-World-M | ✖️ | 50.7 | 67.2 | 55.1 | coming |
| YOLO-World-L | ✖️ | 53.3 | 70.3 | 58.1 | coming |
| YOLO-World-S | ✔️ | 45.9 | 62.3 | 50.1 | coming |
| YOLO-World-M | ✔️ | 51.2 | 68.1 | 55.9 | coming |
| YOLO-World-L | ✔️ | 53.3 | 70.1 | 58.2 | coming |
Update Notes
We provide the details of important updates to YOLO-World in this note.
Model Architecture
[2024-2-29]: YOLO-World-v2:
- We remove the `I-PoolingAttention`: though it improves the performance for zero-shot LVIS evaluation, it affects the inference speed after exporting YOLO-World to ONNX or TensorRT. Considering the trade-off, we remove the `I-PoolingAttention` in the newest version.
- We replace the `L2-Norm` in the contrastive head with `BatchNorm`. The `L2-Norm` contains complex operations, such as `reduce`, which are time-consuming for deployment. However, `BatchNorm` can be fused into the convolution, which is much more efficient and also improves the zero-shot performance.
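To make the trade-off concrete, the sketch below is a simplified PyTorch illustration, not the exact YOLO-World implementation: the L2-normalized head needs reduction ops at inference time, while the BatchNorm variant can be folded into the preceding convolution when deploying.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class L2NormContrastiveHead(nn.Module):
    """Similarity via L2-normalized image features and text embeddings."""

    def forward(self, img_feats, text_embeds):
        # img_feats: (B, C, H, W), text_embeds: (B, N, C)
        img_feats = F.normalize(img_feats, dim=1)      # reduce op, hard to fuse
        text_embeds = F.normalize(text_embeds, dim=-1)
        return torch.einsum('bchw,bnc->bnhw', img_feats, text_embeds)


class BNContrastiveHead(nn.Module):
    """BatchNorm replaces the image-side L2 norm and can be fused into a conv."""

    def __init__(self, embed_dims: int):
        super().__init__()
        self.norm = nn.BatchNorm2d(embed_dims)

    def forward(self, img_feats, text_embeds):
        img_feats = self.norm(img_feats)               # fusable at deploy time
        text_embeds = F.normalize(text_embeds, dim=-1)
        return torch.einsum('bchw,bnc->bnhw', img_feats, text_embeds)
```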