yolo-world 源码解析(四)(1)

本文涉及的产品
公共DNS(含HTTPDNS解析),每月1000万次HTTP解析
全局流量管理 GTM,标准版 1个月
云解析 DNS,旗舰版 1个月
简介: yolo-world 源码解析(四)

Preparing Data for YOLO-World

Overview

For pre-training YOLO-World, we adopt several datasets as listed in the below table:

Data Samples Type Boxes
Objects365v1 609k detection 9,621k
GQA 621k grounding 3,681k
Flickr 149k grounding 641k
CC3M-Lite 245k image-text 821k

Dataset Directory

We put all data into the data directory, such as:

├── coco
│   ├── annotations
│   ├── lvis
│   ├── train2017
│   ├── val2017
├── flickr
│   ├── annotations
│   └── images
├── mixed_grounding
│   ├── annotations
│   ├── images
├── mixed_grounding
│   ├── annotations
│   ├── images
├── objects365v1
│   ├── annotations
│   ├── train
│   ├── val

NOTE: We strongly suggest that you check the directories or paths in the dataset part of the config file, especially for the values ann_file, data_root, and data_prefix.

We provide the annotations of the pre-training data in the below table:

Acknowledgement: We sincerely thank GLIP and mdetr for providing the annotation files for pre-training.

Dataset Class

For training YOLO-World, we mainly adopt two kinds of dataset classs:

1. MultiModalDataset

MultiModalDataset is a simple wrapper for pre-defined Dataset Class, such as Objects365 or COCO, which add the texts (category texts) into the dataset instance for formatting input texts.

Text JSON

The json file is formatted as follows:

[
    ['A_1','A_2'],
    ['B'],
    ['C_1', 'C_2', 'C_3'],
    ...
]

We have provided the text json for LVIS, COCO, and Objects365

2. YOLOv5MixedGroundingDataset

The YOLOv5MixedGroundingDataset extends the COCO dataset by supporting loading texts/captions from the json file. It’s desgined for MixedGrounding or Flickr30K with text tokens for each object.

🔥 Custom Datasets

For custom dataset, we suggest the users convert the annotation files according to the usage. Note that, converting the annotations to the standard COCO format is basically required.

  1. Large vocabulary, grounding, referring: you can follow the annotation format as the MixedGrounding dataset, which adds caption and tokens_positive for assigning the text for each object. The texts can be a category or a noun phrases.
  2. Custom vocabulary (fixed): you can adopt the MultiModalDataset wrapper as the Objects365 and create a text json for your custom categories.

Fine-tuning YOLO-World

Fine-tuning YOLO-World is easy and we provide the samples for COCO object detection as a simple guidance.

Fine-tuning Requirements

Fine-tuning YOLO-World is cheap:

  • it does not require 32 GPUs for multi-node distributed training. 8 GPUs or even 1 GPU is enough.
  • it does not require the long schedule, e.g., 300 epochs or 500 epochs for training YOLOv5 or YOLOv8. 80 epochs or fewer is enough considering that we provide the good pre-trained weights.

Data Preparation

The fine-tuning dataset should have the similar format as the that of the pre-training dataset.

We suggest you refer to docs/data for more details about how to build the datasets:

  • if you fine-tune YOLO-World for close-set / custom vocabulary object detection, using MultiModalDataset with a text json is preferred.
  • if you fine-tune YOLO-World for open-vocabulary detection with rich texts or grounding tasks, using MixedGroundingDataset is preferred.

Hyper-parameters and Config

Please refer to the config for fine-tuning YOLO-World-L on COCO for more details.

  1. Basic config file:

If the fine-tuning dataset contains mask annotations:

_base_ = ('../../third_party/mmyolo/configs/yolov8/yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco.py')

If the fine-tuning dataset doesn’t contain mask annotations:

_base_ = ('../../third_party/mmyolo/configs/yolov8/yolov8_l_syncbn_fast_8xb16-500e_coco.py')
  1. Training Schemes:

Reducing the epochs and adjusting the learning rate

max_epochs = 80
base_lr = 2e-4
weight_decay = 0.05
train_batch_size_per_gpu = 16
close_mosaic_epochs=10
train_cfg = dict(
    max_epochs=max_epochs,
    val_interval=5,
    dynamic_intervals=[((max_epochs - close_mosaic_epochs),
                        _base_.val_interval_stage2)])
  1. Datasets:
coco_train_dataset = dict(
    _delete_=True,
    type='MultiModalDataset',
    dataset=dict(
        type='YOLOv5CocoDataset',
        data_root='data/coco',
        ann_file='annotations/instances_train2017.json',
        data_prefix=dict(img='train2017/'),
        filter_cfg=dict(filter_empty_gt=False, min_size=32)),
    class_text_path='data/texts/coco_class_texts.json',
    pipeline=train_pipeline)
Finetuning without RepVL-PAN or Text Encoder 🚀

For further efficiency and simplicity, we can fine-tune an efficient version of YOLO-World without RepVL-PAN and the text encoder.

The efficient version of YOLO-World has the similar architecture or layers with the orignial YOLOv8 but we provide the pre-trained weights on large-scale datasets.

The pre-trained YOLO-World has strong generalization capabilities and is more robust compared to YOLOv8 trained on the COCO dataset.

You can refer to the config for Efficient YOLO-World for more details.

The efficient YOLO-World adopts EfficientCSPLayerWithTwoConv and the text encoder can be removed during inference or exporting models.

model = dict(
    type='YOLOWorldDetector',
    mm_neck=True,
    neck=dict(type='YOLOWorldPAFPN',
              guide_channels=text_channels,
              embed_channels=neck_embed_channels,
              num_heads=neck_num_heads,
              block_cfg=dict(type='EfficientCSPLayerWithTwoConv')))

Launch Fine-tuning!

It’s easy:

./dist_train.sh <path/to/config> <NUM_GPUS> --amp

COCO Fine-tuning

model efficient neck AP AP50 AP75 weights
YOLO-World-S ✖️ 45.7 62.3 49.9 comming
YOLO-World-M ✖️ 50.7 67.2 55.1 comming
YOLO-World-L ✖️ 53.3 70.3 58.1 comming
YOLO-World-S ✔️ 45.9 62.3 50.1 comming
YOLO-World-M ✔️ 51.2 68.1 55.9 comming
YOLO-World-L ✔️ 53.3 70.1 58.2 comming

Update Notes

We provide the details for important updates of YOLO-World in this note.

Model Architecture

[2024-2-29]: YOLO-World-v2:

  1. We remove the I-PoolingAttention: though it improves the performance for zero-shot LVIS evaluation, it affects the inference speeds after exporting YOLO-World to ONNX or TensorRT. Considering the trade-off, we remove the I-PoolingAttention in the newest version.
  2. We replace the L2-Norm in the contrastive head with the BatchNorm. The L2-Norm contains complex operations, such as reduce, which is time-consuming for deployment. However, the BatchNorm can be fused into the convolution, which is much more efficient and also improves the zero-shot performance.

yolo-world 源码解析(四)(2)https://developer.aliyun.com/article/1483876

相关文章
|
2月前
|
监控 Java 应用服务中间件
高级java面试---spring.factories文件的解析源码API机制
【11月更文挑战第20天】Spring Boot是一个用于快速构建基于Spring框架的应用程序的开源框架。它通过自动配置、起步依赖和内嵌服务器等特性,极大地简化了Spring应用的开发和部署过程。本文将深入探讨Spring Boot的背景历史、业务场景、功能点以及底层原理,并通过Java代码手写模拟Spring Boot的启动过程,特别是spring.factories文件的解析源码API机制。
110 2
|
28天前
|
存储 设计模式 算法
【23种设计模式·全精解析 | 行为型模式篇】11种行为型模式的结构概述、案例实现、优缺点、扩展对比、使用场景、源码解析
行为型模式用于描述程序在运行时复杂的流程控制,即描述多个类或对象之间怎样相互协作共同完成单个对象都无法单独完成的任务,它涉及算法与对象间职责的分配。行为型模式分为类行为模式和对象行为模式,前者采用继承机制来在类间分派行为,后者采用组合或聚合在对象间分配行为。由于组合关系或聚合关系比继承关系耦合度低,满足“合成复用原则”,所以对象行为模式比类行为模式具有更大的灵活性。 行为型模式分为: • 模板方法模式 • 策略模式 • 命令模式 • 职责链模式 • 状态模式 • 观察者模式 • 中介者模式 • 迭代器模式 • 访问者模式 • 备忘录模式 • 解释器模式
【23种设计模式·全精解析 | 行为型模式篇】11种行为型模式的结构概述、案例实现、优缺点、扩展对比、使用场景、源码解析
|
28天前
|
设计模式 存储 安全
【23种设计模式·全精解析 | 创建型模式篇】5种创建型模式的结构概述、实现、优缺点、扩展、使用场景、源码解析
结构型模式描述如何将类或对象按某种布局组成更大的结构。它分为类结构型模式和对象结构型模式,前者采用继承机制来组织接口和类,后者釆用组合或聚合来组合对象。由于组合关系或聚合关系比继承关系耦合度低,满足“合成复用原则”,所以对象结构型模式比类结构型模式具有更大的灵活性。 结构型模式分为以下 7 种: • 代理模式 • 适配器模式 • 装饰者模式 • 桥接模式 • 外观模式 • 组合模式 • 享元模式
【23种设计模式·全精解析 | 创建型模式篇】5种创建型模式的结构概述、实现、优缺点、扩展、使用场景、源码解析
|
28天前
|
设计模式 存储 安全
【23种设计模式·全精解析 | 创建型模式篇】5种创建型模式的结构概述、实现、优缺点、扩展、使用场景、源码解析
创建型模式的主要关注点是“怎样创建对象?”,它的主要特点是"将对象的创建与使用分离”。这样可以降低系统的耦合度,使用者不需要关注对象的创建细节。创建型模式分为5种:单例模式、工厂方法模式抽象工厂式、原型模式、建造者模式。
【23种设计模式·全精解析 | 创建型模式篇】5种创建型模式的结构概述、实现、优缺点、扩展、使用场景、源码解析
|
4天前
|
自然语言处理 数据处理 索引
mindspeed-llm源码解析(一)preprocess_data
mindspeed-llm是昇腾模型套件代码仓,原来叫"modelLink"。这篇文章带大家阅读一下数据处理脚本preprocess_data.py(基于1.0.0分支),数据处理是模型训练的第一步,经常会用到。
16 0
|
2月前
|
缓存 监控 Java
Java线程池提交任务流程底层源码与源码解析
【11月更文挑战第30天】嘿,各位技术爱好者们,今天咱们来聊聊Java线程池提交任务的底层源码与源码解析。作为一个资深的Java开发者,我相信你一定对线程池并不陌生。线程池作为并发编程中的一大利器,其重要性不言而喻。今天,我将以对话的方式,带你一步步深入线程池的奥秘,从概述到功能点,再到背景和业务点,最后到底层原理和示例,让你对线程池有一个全新的认识。
65 12
|
1月前
|
PyTorch Shell API
Ascend Extension for PyTorch的源码解析
本文介绍了Ascend对PyTorch代码的适配过程,包括源码下载、编译步骤及常见问题,详细解析了torch-npu编译后的文件结构和三种实现昇腾NPU算子调用的方式:通过torch的register方式、定义算子方式和API重定向映射方式。这对于开发者理解和使用Ascend平台上的PyTorch具有重要指导意义。
|
29天前
|
安全 搜索推荐 数据挖掘
陪玩系统源码开发流程解析,成品陪玩系统源码的优点
我们自主开发的多客陪玩系统源码,整合了市面上主流陪玩APP功能,支持二次开发。该系统适用于线上游戏陪玩、语音视频聊天、心理咨询等场景,提供用户注册管理、陪玩者资料库、预约匹配、实时通讯、支付结算、安全隐私保护、客户服务及数据分析等功能,打造综合性社交平台。随着互联网技术发展,陪玩系统正成为游戏爱好者的新宠,改变游戏体验并带来新的商业模式。
|
2月前
|
存储 安全 Linux
Golang的GMP调度模型与源码解析
【11月更文挑战第11天】GMP 调度模型是 Go 语言运行时系统的核心部分,用于高效管理和调度大量协程(goroutine)。它通过少量的操作系统线程(M)和逻辑处理器(P)来调度大量的轻量级协程(G),从而实现高性能的并发处理。GMP 模型通过本地队列和全局队列来减少锁竞争,提高调度效率。在 Go 源码中,`runtime.h` 文件定义了关键数据结构,`schedule()` 和 `findrunnable()` 函数实现了核心调度逻辑。通过深入研究 GMP 模型,可以更好地理解 Go 语言的并发机制。
|
2月前
|
消息中间件 缓存 安全
Future与FutureTask源码解析,接口阻塞问题及解决方案
【11月更文挑战第5天】在Java开发中,多线程编程是提高系统并发性能和资源利用率的重要手段。然而,多线程编程也带来了诸如线程安全、死锁、接口阻塞等一系列复杂问题。本文将深度剖析多线程优化技巧、Future与FutureTask的源码、接口阻塞问题及解决方案,并通过具体业务场景和Java代码示例进行实战演示。
70 3

热门文章

最新文章

推荐镜像

更多