MMDetection Series | 5. Introduction to MMDetection Runtime Configuration

1. Optimizer configuration


optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)


  • Use gradient clipping to stabilize training
optimizer_config = dict(
    _delete_=True, grad_clip=dict(max_norm=35, norm_type=2))


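Here, _delete_=True replaces all the old keys inherited for the optimizer_config field with the new keys, instead of merging the two sets of keys.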
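The effect of _delete_=True can be illustrated with a simplified merge routine. This is only a sketch of the idea, not MMCV's actual implementation: merge_cfg is a hypothetical helper, and the inherited_key entry is made up purely to show what gets discarded.

```python
def merge_cfg(base: dict, child: dict) -> dict:
    """Hypothetical, simplified config merge illustrating `_delete_`."""
    merged = dict(base)
    for key, value in child.items():
        if isinstance(value, dict):
            value = dict(value)  # copy so the caller's dict is not mutated
            delete = value.pop('_delete_', False)
            inherited = merged.get(key)
            if delete or not isinstance(inherited, dict):
                inherited = {}  # nothing is inherited for this field
            # Merge the new keys on top of whatever is still inherited.
            value = merge_cfg(inherited, value)
        merged[key] = value
    return merged


# 'inherited_key' is a made-up entry standing in for any setting from a base config.
base = {'optimizer_config': {'grad_clip': None, 'inherited_key': 'old'}}
clip = {'max_norm': 35, 'norm_type': 2}

merged_plain = merge_cfg(base, {'optimizer_config': {'grad_clip': clip}})
merged_delete = merge_cfg(
    base, {'optimizer_config': {'_delete_': True, 'grad_clip': clip}})

print(merged_plain)   # 'inherited_key' survives the merge
print(merged_delete)  # only the new keys remain
```

Without _delete_, keys that the child config does not mention are carried over from the base config; with it, the whole field is rebuilt from the child's keys alone.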


2. Learning rate configuration


lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[8, 11])       # the learning rate is decayed by a factor of 10 at epochs 8 and 11
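To make the numbers in this config concrete, the following minimal sketch computes the learning rate that a step policy with linear warmup would produce. It assumes a decay factor of 0.1 per step (matching the "factor of 10" above), 0-based epoch counting, and the base learning rate of 0.02 from the optimizer section; step_lr and linear_warmup are illustrative helpers, not MMCV functions.

```python
def step_lr(base_lr, epoch, steps, gamma=0.1):
    """Learning rate after step decay; `epoch` is assumed to be 0-based."""
    num_decays = sum(1 for s in steps if epoch >= s)
    return base_lr * gamma ** num_decays


def linear_warmup(regular_lr, cur_iter, warmup_iters, warmup_ratio):
    """Ramp linearly from regular_lr * warmup_ratio up to regular_lr."""
    if cur_iter >= warmup_iters:
        return regular_lr
    k = (1 - cur_iter / warmup_iters) * (1 - warmup_ratio)
    return regular_lr * (1 - k)


if __name__ == '__main__':
    # Warmup over the first 500 iterations.
    for it in (0, 250, 500):
        print('iter', it, linear_warmup(0.02, it, 500, 0.001))
    # Step decay at epochs 8 and 11.
    for ep in (0, 7, 8, 10, 11):
        print('epoch', ep, step_lr(0.02, ep, [8, 11]))
```

With warmup_ratio=0.001 the first iteration starts at one thousandth of the regular learning rate and reaches the full value after warmup_iters iterations.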


Other scheduling options are also available:


  • Poly schedule
lr_config = dict(policy='poly', power=0.9, min_lr=1e-4, by_epoch=False)


  • CosineAnnealing schedule
lr_config = dict(
    policy='CosineAnnealing',
    warmup='linear',
    warmup_iters=1000,
    warmup_ratio=1.0 / 10,
    min_lr_ratio=1e-5)


  • Use a momentum schedule to accelerate model convergence

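A momentum scheduler is supported that adjusts the model's momentum according to the learning rate, which can make the model converge faster. The momentum scheduler is usually used together with the LR scheduler.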


lr_config = dict(
    policy='cyclic',
    target_ratio=(10, 1e-4),
    cyclic_times=1,
    step_ratio_up=0.4,
)
momentum_config = dict(
    policy='cyclic',
    target_ratio=(0.85 / 0.95, 1),
    cyclic_times=1,
    step_ratio_up=0.4,
)


3. Workflow configuration


A workflow is a list of (phase, epochs) tuples that specifies the running order and the number of epochs for each phase. By default it is set to:


workflow = [('train', 1)]


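This means running 1 epoch of training. Sometimes users may want to check some metrics of the model (e.g. loss, accuracy) on the validation set. In that case, the workflow can be set to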


[('train', 1), ('val', 1)]


so that 1 epoch of training and 1 epoch of validation are run alternately, and the loss on the validation set is computed as well. To validate first and then train, set:


[('val', 1), ('train', n)]


This runs one epoch of validation (including the loss computation) on the validation set first, followed by n epochs of training.
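The order in which phases run can be sketched as a small loop. This is a simplified model of how an epoch-based runner walks through the workflow, written under the assumption that only 'train' phases count toward max_epochs; expand_workflow is an illustrative helper, not part of MMCV.

```python
def expand_workflow(workflow, max_epochs):
    """List the phases in execution order until max_epochs are trained."""
    if not any(mode == 'train' and n > 0 for mode, n in workflow):
        raise ValueError('workflow needs at least one train phase')
    phases = []
    trained = 0
    while trained < max_epochs:
        for mode, epochs in workflow:
            for _ in range(epochs):
                if mode == 'train':
                    if trained >= max_epochs:
                        break  # training budget used up
                    trained += 1  # only training epochs are counted
                phases.append(mode)
    return phases


print(expand_workflow([('train', 1), ('val', 1)], max_epochs=2))
# ['train', 'val', 'train', 'val']
print(expand_workflow([('val', 1), ('train', 2)], max_epochs=3))
# ['val', 'train', 'train', 'val', 'train']
```

The second call shows the validate-first variant with n = 2: the cycle restarts after two training epochs and stops as soon as the third training epoch has run.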


4. Checkpoint configuration


checkpoint_config = dict(interval=20)          # save the weights every 20 epochs


For the parameters, see: https://mmcv.readthedocs.io/en/latest/api.html#mmcv.runner.CheckpointHook


class mmcv.runner.CheckpointHook(interval: int = -1, by_epoch: bool = True, save_optimizer: bool = True, out_dir: Optional[str] = None, max_keep_ckpts: int = -1, save_last: bool = True, sync_buffer: bool = False, file_client_args: Optional[dict] = None, **kwargs)


  • interval (int) – The saving period. If by_epoch=True, interval indicates epochs, otherwise it indicates iterations. Default: -1, which means “never”.
  • by_epoch (bool) – Saving checkpoints by epoch or by iteration. Default: True.
  • save_optimizer (bool) – Whether to save optimizer state_dict in the checkpoint. It is usually used for resuming experiments. Default: True.
  • out_dir (str, optional) – The root directory to save checkpoints. If not specified, runner.work_dir will be used by default. If specified, the out_dir will be the concatenation of out_dir and the last level directory of runner.work_dir. Changed in version 1.3.16.
  • max_keep_ckpts (int, optional) – The maximum checkpoints to keep. In some cases we want only the latest few checkpoints and would like to delete old ones to save the disk space. Default: -1, which means unlimited.
  • save_last (bool, optional) – Whether to force the last checkpoint to be saved regardless of interval. Default: True.
  • sync_buffer (bool, optional) – Whether to synchronize buffers in different gpus. Default: False.
  • file_client_args (dict, optional) – Arguments to instantiate a FileClient. See mmcv.fileio.FileClient for details. Default: None. New in version 1.3.16.
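How interval, save_last, and max_keep_ckpts interact can be illustrated with a short sketch. It assumes by_epoch=True and uses a deliberately simplified pruning rule (keep only the newest files); both helpers are hypothetical and only approximate what CheckpointHook does.

```python
def checkpoint_epochs(max_epochs, interval, save_last=True):
    """1-based epochs after which a checkpoint would be written."""
    epochs = [e for e in range(1, max_epochs + 1)
              if interval > 0 and e % interval == 0]
    # save_last forces a checkpoint at the final epoch regardless of interval.
    if save_last and max_epochs > 0 and (not epochs or epochs[-1] != max_epochs):
        epochs.append(max_epochs)
    return epochs


def kept_checkpoints(saved_epochs, max_keep_ckpts=-1):
    """Simplified pruning: keep only the newest `max_keep_ckpts` files."""
    if max_keep_ckpts <= 0:
        return list(saved_epochs)  # -1 means unlimited
    return list(saved_epochs)[-max_keep_ckpts:]


saved = checkpoint_epochs(max_epochs=150, interval=20)
print(saved)                                       # every 20 epochs plus the last one
print(kept_checkpoints(saved, max_keep_ckpts=3))   # only the 3 newest remain
```

With interval=20 and 150 training epochs, the final epoch does not fall on a multiple of 20, so save_last is what guarantees that the finished model is written to disk.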


5. Logging configuration


log_config wraps multiple logger hooks and lets you set the logging interval. MMCV currently supports WandbLoggerHook, MlflowLoggerHook, and TensorboardLoggerHook.


log_config = dict(
    interval=50,    # print training information every 50 iterations
    hooks=[
        dict(type='TextLoggerHook'),
        # dict(type='TensorboardLoggerHook')
    ])


For the parameters, see: https://mmcv.readthedocs.io/en/latest/api.html#mmcv.runner.EvalHook


class mmcv.runner.LoggerHook(interval: int = 10, ignore_last: bool = True, reset_flag: bool = False, by_epoch: bool = True)


  • interval (int) – Logging interval (every k iterations). Default 10.
  • ignore_last (bool) – Ignore the log of last iterations in each epoch if less than interval. Default True.
  • reset_flag (bool) – Whether to clear the output buffer after logging. Default False.
  • by_epoch (bool) – Whether EpochBasedRunner is used. Default True.
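The interplay between interval and ignore_last is easiest to see by listing the iterations at which a log line would appear within one epoch. The helper below is a hypothetical sketch based on the parameter descriptions above, with a made-up epoch length of 120 iterations.

```python
def logged_iterations(iters_per_epoch, interval=10, ignore_last=True):
    """1-based inner iterations of one epoch at which a log line is emitted."""
    logged = [i for i in range(1, iters_per_epoch + 1) if i % interval == 0]
    # With ignore_last=False the leftover iterations at the end of the
    # epoch are logged too, even though they are fewer than `interval`.
    if not ignore_last and (not logged or logged[-1] != iters_per_epoch):
        logged.append(iters_per_epoch)
    return logged


print(logged_iterations(120, interval=50))                     # [50, 100]
print(logged_iterations(120, interval=50, ignore_last=False))  # [50, 100, 120]
```

Note that interval always counts iterations, not epochs, so a small dataset with fewer iterations per epoch than interval may print nothing unless ignore_last is disabled.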


6. Evaluation configuration


The evaluation config is used to initialize EvalHook. Apart from the interval key, the other arguments, such as metric, are passed to dataset.evaluate().

evaluation = dict(interval=1, metric='bbox')


For the parameters, see: https://mmcv.readthedocs.io/en/latest/api.html?highlight=EpochBasedRunner#mmcv.runner.EpochBasedRunner


mmcv.runner.EvalHook(dataloader: torch.utils.data.dataloader.DataLoader, start: Optional[int] = None, interval: int = 1, by_epoch: bool = True, save_best: Optional[str] = None, rule: Optional[str] = None, test_fn: Optional[Callable] = None, greater_keys: Optional[List[str]] = None, less_keys: Optional[List[str]] = None, out_dir: Optional[str] = None, file_client_args: Optional[dict] = None, **eval_kwargs)


  • dataloader (DataLoader) – A PyTorch dataloader, whose dataset has implemented evaluate function.
  • start (int | None, optional) – Evaluation starting epoch. It enables evaluation before the training starts if start <= the resuming epoch. If None, whether to evaluate is merely decided by interval. Default: None.
  • interval (int) – Evaluation interval. Default: 1.
  • by_epoch (bool) – Determine perform evaluation by epoch or by iteration. If set to True, it will perform by epoch. Otherwise, by iteration. Default: True.
  • save_best (str, optional) – If a metric is specified, it would measure the best checkpoint during evaluation. The information about best checkpoint would be saved in runner.meta['hook_msgs'] to keep best score value and best checkpoint path, which will be also loaded when resume checkpoint. Options are the evaluation metrics on the test dataset. e.g., bbox_mAP, segm_mAP for bbox detection and instance segmentation. AR@100 for proposal recall. If save_best is auto, the first key of the returned OrderedDict result will be used. Default: None.
  • rule (str | None, optional) – Comparison rule for best score. If set to None, it will infer a reasonable rule. Keys such as 'acc', 'top', etc. will be inferred by the 'greater' rule. Keys that contain 'loss' will be inferred by the 'less' rule. Options are 'greater', 'less', None. Default: None.
  • test_fn (callable, optional) – test a model with samples from a dataloader, and return the test results. If None, the default test function mmcv.engine.single_gpu_test will be used. (default: None)
  • greater_keys (List[str] | None, optional) – Metric keys that will be inferred by ‘greater’ comparison rule. If None, _default_greater_keys will be used. (default: None)
  • less_keys (List[str] | None, optional) – Metric keys that will be inferred by ‘less’ comparison rule. If None, _default_less_keys will be used. (default: None)
  • out_dir (str, optional) – The root directory to save checkpoints. If not specified, runner.work_dir will be used by default. If specified, the out_dir will be the concatenation of out_dir and the last level directory of runner.work_dir. New in version 1.3.16.
  • file_client_args (dict) – Arguments to instantiate a FileClient. See mmcv.fileio.FileClient for details. Default: None. New in version 1.3.16.
  • **eval_kwargs – Evaluation arguments fed into the evaluate function of the dataset.
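The way save_best and rule work together can be sketched as follows. The key lists here are an illustrative subset chosen from the names mentioned in the parameter descriptions; the real defaults live in EvalHook's _default_greater_keys and _default_less_keys, and infer_rule and is_better are hypothetical helpers rather than MMCV functions.

```python
# Illustrative subsets only, not the actual default key lists.
GREATER_KEYS = ['acc', 'top', 'AR@', 'mAP']
LESS_KEYS = ['loss']


def infer_rule(metric_key, greater_keys=GREATER_KEYS, less_keys=LESS_KEYS):
    """Guess whether a larger or a smaller value of the metric is better."""
    key = metric_key.lower()
    if any(k.lower() in key for k in greater_keys):
        return 'greater'
    if any(k.lower() in key for k in less_keys):
        return 'less'
    raise ValueError(f'cannot infer a rule for {metric_key!r}; set rule explicitly')


def is_better(new_score, best_score, rule):
    """Compare a new evaluation score against the best one seen so far."""
    if rule == 'greater':
        return new_score > best_score
    return new_score < best_score


rule = infer_rule('bbox_mAP')
print(rule)                         # greater
print(is_better(0.41, 0.39, rule))  # True
```

When a metric name matches neither list, passing rule explicitly avoids relying on the inference.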


7. Training settings


runner = dict(type='EpochBasedRunner', max_epochs=150)   # number of epochs to train the model


For the parameters, see: https://mmcv.readthedocs.io/en/latest/api.html#mmcv.runner.EpochBasedRunner


mmcv.runner.EpochBasedRunner(model: torch.nn.modules.module.Module, batch_processor: Optional[Callable] = None, optimizer: Optional[Union[Dict, torch.optim.optimizer.Optimizer]] = None, work_dir: Optional[str] = None, logger: Optional[logging.Logger] = None, meta: Optional[Dict] = None, max_iters: Optional[int] = None, max_epochs: Optional[int] = None)


Summary:


In general, the config files we write inherit from default_runtime.py:


_base_ = [
    '../_base_/default_runtime.py'
]


The contents of this file are as follows:


checkpoint_config = dict(interval=5)    # save the weights every 5 epochs
# yapf:disable
log_config = dict(
    interval=50,    # print training information every 50 iterations
    hooks=[
        dict(type='TextLoggerHook'),
        # dict(type='TensorboardLoggerHook')
    ])
# yapf:enable
custom_hooks = [dict(type='NumClassCheckHook')]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None        # path of a checkpoint file to load weights from
resume_from = None
workflow = [('train', 1)]
# disable opencv multithreading to avoid system being overloaded
opencv_num_threads = 0
# set multi-process start method as `fork` to speed up the training
mp_start_method = 'fork'
# Default setting for scaling LR automatically
#   - `enable` means enable scaling LR automatically
#       or not by default.
#   - `base_batch_size` = (8 GPUs) x (2 samples per GPU).
auto_scale_lr = dict(enable=False, base_batch_size=16)
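The auto_scale_lr entry at the end of this listing relates the learning rate to the total batch size. A minimal sketch of the linear scaling rule it refers to, assuming the configured learning rate was tuned for base_batch_size (scale_lr is an illustrative helper, not an MMDetection function):

```python
def scale_lr(base_lr, num_gpus, samples_per_gpu, base_batch_size=16):
    """Linear scaling rule: the LR changes in proportion to the batch size."""
    total_batch_size = num_gpus * samples_per_gpu
    return base_lr * total_batch_size / base_batch_size


# lr=0.02 is tuned for 8 GPUs x 2 samples per GPU (batch size 16).
print(scale_lr(0.02, num_gpus=8, samples_per_gpu=2))  # unchanged
print(scale_lr(0.02, num_gpus=4, samples_per_gpu=2))  # halved for batch size 8
```

With enable=False, as in the default file, no automatic scaling is applied, so the learning rate has to be adjusted by hand when the number of GPUs or the per-GPU batch size differs from the base setting.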


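Usually little needs to change: adjust log_config as appropriate so that training information is printed at a sensible frequency, set checkpoint_config so that the weights are saved at a sensible interval, and leave the other settings at their defaults.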


Below is what I changed after inheriting default_runtime.py; it simply adjusts the seven items introduced above:


_base_ = [
    '../_base_/default_runtime.py'
]
......
# optimizer
optimizer = dict(   # use the AdamW optimizer (the default is SGD)
    type='AdamW',
    lr=0.0001,
    weight_decay=0.0001,
    paramwise_cfg=dict(custom_keys={'backbone': dict(lr_mult=0.1, decay_mult=1.0)}))
evaluation = dict(interval=5, metric='bbox')   # evaluate every 5 epochs
optimizer_config = dict(grad_clip=dict(max_norm=0.1, norm_type=2))  # enable gradient clipping (the default is None)
checkpoint_config = dict(interval=20)          # save the weights every 20 epochs
log_config = dict(interval=50,     # print information every 50 training iterations (iterations, not epochs)
                  hooks=[dict(type='TextLoggerHook')])
# learning policy
lr_config = dict(policy='step', step=[100])              # decay the learning rate at epoch 100
runner = dict(type='EpochBasedRunner', max_epochs=150)   # train for 150 epochs
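The paramwise_cfg entry in this config gives the backbone a smaller learning rate than the rest of the model. The sketch below shows the idea of matching custom_keys against parameter names; param_options is a hypothetical helper and the parameter names are made up, so treat it as an approximation of what the optimizer constructor does rather than its actual code.

```python
def param_options(param_name, base_lr, base_wd, custom_keys):
    """Return (lr, weight_decay) for a single parameter name."""
    # Longer keys are checked first so that the most specific match wins.
    for key in sorted(sorted(custom_keys), key=len, reverse=True):
        if key in param_name:
            opts = custom_keys[key]
            return (base_lr * opts.get('lr_mult', 1.0),
                    base_wd * opts.get('decay_mult', 1.0))
    return base_lr, base_wd  # no key matched: use the global settings


custom_keys = {'backbone': dict(lr_mult=0.1, decay_mult=1.0)}

# Hypothetical parameter names, used only for illustration.
for name in ('backbone.layer1.0.conv1.weight', 'bbox_head.fc_cls.weight'):
    print(name, param_options(name, 0.0001, 0.0001, custom_keys))
```

With lr=0.0001 and lr_mult=0.1, backbone parameters are updated with a learning rate ten times smaller than the detection heads, while decay_mult=1.0 leaves their weight decay unchanged.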



References:


1. Customize Runtime Settings


2. MMCV official documentation

