Preparing Data for YOLO-World
Overview
For pre-training YOLO-World, we adopt several datasets, as listed in the table below:
| Data | Samples | Type | Boxes |
|------|---------|------|-------|
| Objects365v1 | 609k | detection | 9,621k |
| GQA | 621k | grounding | 3,681k |
| Flickr | 149k | grounding | 641k |
| CC3M-Lite | 245k | image-text | 821k |
Dataset Directory
We put all data into the `data` directory, such as:

```
├── coco
│   ├── annotations
│   ├── lvis
│   ├── train2017
│   ├── val2017
├── flickr
│   ├── annotations
│   └── images
├── mixed_grounding
│   ├── annotations
│   ├── images
├── objects365v1
│   ├── annotations
│   ├── train
│   ├── val
```
NOTE: We strongly suggest that you check the directories or paths in the dataset part of the config file, especially for the values of `ann_file`, `data_root`, and `data_prefix`.
We provide the annotations of the pre-training data in the table below:
| Data | Images | Annotation File |
|------|--------|-----------------|
| Objects365v1 | Objects365 train | `objects365_train.json` |
| MixedGrounding | GQA | `final_mixed_train_no_coco.json` |
| Flickr30k | Flickr30k | `final_flickr_separateGT_train.json` |
| LVIS-minival | COCO val2017 | `lvis_v1_minival_inserted_image_name.json` |
Acknowledgement: We sincerely thank GLIP and mdetr for providing the annotation files for pre-training.
Dataset Class
For training YOLO-World, we mainly adopt two kinds of dataset classes:
1. MultiModalDataset
`MultiModalDataset` is a simple wrapper for a pre-defined dataset class, such as `Objects365` or `COCO`, which adds the texts (category texts) to the dataset instance for formatting the input texts.
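For reference, wrapping a detection dataset with `MultiModalDataset` might look roughly like the sketch below. It follows the same pattern as the COCO fine-tuning dataset shown later in this document; the paths, the `class_text_path`, and `train_pipeline` are placeholders for your own setup rather than the exact released config.

```python
# Sketch only: wrap Objects365 with MultiModalDataset so that category texts
# are attached to every sample. Paths and the text json are illustrative, and
# `train_pipeline` is assumed to be defined elsewhere in the config file.
obj365v1_train_dataset = dict(
    type='MultiModalDataset',
    dataset=dict(
        type='YOLOv5Objects365V1Dataset',
        data_root='data/objects365v1/',
        ann_file='annotations/objects365_train.json',
        data_prefix=dict(img='train/'),
        filter_cfg=dict(filter_empty_gt=False, min_size=32)),
    class_text_path='data/texts/obj365v1_class_texts.json',
    pipeline=train_pipeline)
```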
Text JSON
The json file is formatted as follows:
```
[
    ['A_1', 'A_2'],
    ['B'],
    ['C_1', 'C_2', 'C_3'],
    ...
]
```
We have provided the text json for `LVIS`, `COCO`, and `Objects365`.
2. YOLOv5MixedGroundingDataset
The `YOLOv5MixedGroundingDataset` extends the `COCO` dataset by supporting loading texts/captions from the json file. It is designed for `MixedGrounding` or `Flickr30K` with text tokens for each object.
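For intuition, a grounding record in this style pairs each box with the part of the image caption that describes it. The example below is an illustrative, simplified MDETR-style entry (the field values are made up and not copied from the actual annotation files):

```python
# Illustrative only: a simplified MDETR-style grounding record. Each box refers
# back to the words that describe it via character offsets into the caption.
image_entry = {
    'id': 1,
    'file_name': '000001.jpg',
    'height': 480,
    'width': 640,
    'caption': 'a man riding a red bicycle',
}
annotation_entry = {
    'id': 10,
    'image_id': 1,
    'bbox': [120.0, 80.0, 60.0, 150.0],   # COCO-style [x, y, w, h]
    'category_id': 1,
    'tokens_positive': [[2, 5]],          # character span of "man" in the caption
}
```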
🔥 Custom Datasets
For custom datasets, we suggest that users convert the annotation files according to the intended usage. Note that converting the annotations to the standard COCO format is basically required.
- Large vocabulary, grounding, referring: you can follow the annotation format of the `MixedGrounding` dataset, which adds `caption` and `tokens_positive` for assigning the text to each object. The texts can be a category or noun phrases.
- Custom vocabulary (fixed): you can adopt the `MultiModalDataset` wrapper as for `Objects365` and create a text json for your custom categories (see the sketch below).
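As a concrete example, the text json for a fixed custom vocabulary can be generated with a few lines of Python. The class names and output path below are placeholders for your own data:

```python
# Minimal sketch: write a text json in the format shown above, one entry per
# category, where each entry may list several texts for the same class.
import json

custom_classes = [
    ['person'],
    ['traffic light'],
    ['car', 'automobile'],   # multiple texts can describe one category
]

with open('data/texts/custom_class_texts.json', 'w') as f:
    json.dump(custom_classes, f)
```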
Fine-tuning YOLO-World
Fine-tuning YOLO-World is easy, and we provide samples for COCO object detection as simple guidance.
Fine-tuning Requirements
Fine-tuning YOLO-World is cheap:
- it does not require 32 GPUs for multi-node distributed training. 8 GPUs or even 1 GPU is enough.
- it does not require a long schedule, e.g., the 300 or 500 epochs used for training YOLOv5 or YOLOv8; 80 epochs or fewer is enough, considering that we provide good pre-trained weights.
Data Preparation
The fine-tuning dataset should have a similar format to that of the pre-training dataset.
We suggest you refer to `docs/data` for more details about how to build the datasets:
- if you fine-tune YOLO-World for close-set / custom vocabulary object detection, using `MultiModalDataset` with a `text json` is preferred.
- if you fine-tune YOLO-World for open-vocabulary detection with rich texts or grounding tasks, using `MixedGroundingDataset` is preferred.
Hyper-parameters and Config
Please refer to the config for fine-tuning YOLO-World-L on COCO for more details.
- Basic config file:
If the fine-tuning dataset contains mask annotations:

```python
_base_ = ('../../third_party/mmyolo/configs/yolov8/yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco.py')
```

If the fine-tuning dataset doesn't contain mask annotations:

```python
_base_ = ('../../third_party/mmyolo/configs/yolov8/yolov8_l_syncbn_fast_8xb16-500e_coco.py')
```
- Training Schemes:
Reducing the epochs and adjusting the learning rate:

```python
max_epochs = 80
base_lr = 2e-4
weight_decay = 0.05
train_batch_size_per_gpu = 16
close_mosaic_epochs = 10

train_cfg = dict(
    max_epochs=max_epochs,
    val_interval=5,
    dynamic_intervals=[((max_epochs - close_mosaic_epochs),
                        _base_.val_interval_stage2)])
```
- Datasets:
```python
coco_train_dataset = dict(
    _delete_=True,
    type='MultiModalDataset',
    dataset=dict(
        type='YOLOv5CocoDataset',
        data_root='data/coco',
        ann_file='annotations/instances_train2017.json',
        data_prefix=dict(img='train2017/'),
        filter_cfg=dict(filter_empty_gt=False, min_size=32)),
    class_text_path='data/texts/coco_class_texts.json',
    pipeline=train_pipeline)
```
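The wrapped dataset and the adjusted hyper-parameters then need to be wired into the dataloader and optimizer. The snippet below is a rough sketch in the MMYOLO/MMEngine config style; the optimizer type and exact fields are assumptions here, so check the released fine-tuning config for the authoritative values.

```python
# Sketch only: connect the wrapped dataset and the reduced learning rate to the
# training loop. AdamW is an assumption, not confirmed from the released config.
train_dataloader = dict(
    batch_size=train_batch_size_per_gpu,
    dataset=coco_train_dataset)

optim_wrapper = dict(
    optimizer=dict(
        _delete_=True,
        type='AdamW',
        lr=base_lr,
        weight_decay=weight_decay))
```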
Finetuning without RepVL-PAN or Text Encoder 🚀
For further efficiency and simplicity, we can fine-tune an efficient version of YOLO-World without RepVL-PAN and the text encoder.
The efficient version of YOLO-World has a similar architecture and layers to the original YOLOv8, but we provide pre-trained weights on large-scale datasets.
The pre-trained YOLO-World has strong generalization capabilities and is more robust compared to YOLOv8 trained on the COCO dataset.
You can refer to the config for Efficient YOLO-World for more details.
The efficient YOLO-World adopts `EfficientCSPLayerWithTwoConv`, and the text encoder can be removed during inference or when exporting models.
```python
model = dict(
    type='YOLOWorldDetector',
    mm_neck=True,
    neck=dict(type='YOLOWorldPAFPN',
              guide_channels=text_channels,
              embed_channels=neck_embed_channels,
              num_heads=neck_num_heads,
              block_cfg=dict(type='EfficientCSPLayerWithTwoConv')))
```
Launch Fine-tuning!
It’s easy:
```bash
./dist_train.sh <path/to/config> <NUM_GPUS> --amp
```
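For example, fine-tuning with 8 GPUs and mixed precision could look like `./dist_train.sh configs/finetune_coco/<your_finetune_config>.py 8 --amp`, where the config path is a placeholder for your own file; the `--amp` flag enables automatic mixed-precision training.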
COCO Fine-tuning
| model | efficient neck | AP | AP50 | AP75 | weights |
|-------|----------------|----|------|------|---------|
| YOLO-World-S | ✖️ | 45.7 | 62.3 | 49.9 | coming |
| YOLO-World-M | ✖️ | 50.7 | 67.2 | 55.1 | coming |
| YOLO-World-L | ✖️ | 53.3 | 70.3 | 58.1 | coming |
| YOLO-World-S | ✔️ | 45.9 | 62.3 | 50.1 | coming |
| YOLO-World-M | ✔️ | 51.2 | 68.1 | 55.9 | coming |
| YOLO-World-L | ✔️ | 53.3 | 70.1 | 58.2 | coming |
Update Notes
We provide the details of important updates to YOLO-World in this note.
Model Architecture
[2024-2-29]: YOLO-World-v2:
- We remove the `I-PoolingAttention`: though it improves the performance for zero-shot LVIS evaluation, it affects the inference speed after exporting YOLO-World to ONNX or TensorRT. Considering the trade-off, we remove the `I-PoolingAttention` in the newest version.
- We replace the `L2-Norm` in the contrastive head with `BatchNorm`. The `L2-Norm` contains complex operations, such as `reduce`, which are time-consuming for deployment. However, `BatchNorm` can be fused into the convolution, which is much more efficient and also improves the zero-shot performance.
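To make the trade-off concrete, the sketch below is a simplified PyTorch illustration, not the exact YOLO-World implementation: the L2-normalized head needs reduction ops at inference time, while the BatchNorm variant can be folded into the preceding convolution when deploying.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class L2NormContrastiveHead(nn.Module):
    """Similarity via L2-normalized image features and text embeddings."""

    def forward(self, img_feats, text_embeds):
        # img_feats: (B, C, H, W), text_embeds: (B, N, C)
        img_feats = F.normalize(img_feats, dim=1)      # reduce op, hard to fuse
        text_embeds = F.normalize(text_embeds, dim=-1)
        return torch.einsum('bchw,bnc->bnhw', img_feats, text_embeds)


class BNContrastiveHead(nn.Module):
    """BatchNorm replaces the image-side L2 norm and can be fused into a conv."""

    def __init__(self, embed_dims: int):
        super().__init__()
        self.norm = nn.BatchNorm2d(embed_dims)

    def forward(self, img_feats, text_embeds):
        img_feats = self.norm(img_feats)               # fusable at deploy time
        text_embeds = F.normalize(text_embeds, dim=-1)
        return torch.einsum('bchw,bnc->bnhw', img_feats, text_embeds)
```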