1. Setting visible GPUs for multi-GPU deep learning training
```python
import os

# Order GPU devices by PCI_BUS_ID, starting from 0 (so IDs match nvidia-smi)
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

# Use only device 0; its device name is '/gpu:0'
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Use devices 0 and 1, named '/gpu:0' and '/gpu:1' respectively
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# Use devices 1 and 0, named '/gpu:0' and '/gpu:1' respectively,
# i.e. physical device 1 is used first, then device 0
os.environ["CUDA_VISIBLE_DEVICES"] = "1,0"
```
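A quick way to confirm the mapping took effect is sketched below; note that these environment variables must be set before PyTorch initializes CUDA, i.e. before the first CUDA call:

```python
import os

# Must run before any CUDA initialization, or the setting is ignored
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1,0"

import torch

print(torch.cuda.device_count())      # expected: 2
print(torch.cuda.get_device_name(0))  # physical GPU 1, now visible as cuda:0
```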
2. Setting some cuDNN flags
- torch.backends.cudnn.enabled = True
Official docs: A bool that controls whether cuDNN is enabled.
cuDNN uses non-deterministic algorithms, and it can be disabled entirely with cudnn.enabled = False. When set to True, cuDNN automatically searches for the most efficient algorithm for the current configuration, improving runtime efficiency.
- torch.backends.cudnn.benchmark = True
Official docs: A bool that, if True, causes cuDNN to benchmark multiple convolution algorithms and select the fastest.
This pre-optimizes the model's convolution layers: for each convolution layer, every convolution implementation cuDNN provides is benchmarked, and the fastest one is selected. At the cost of a little extra preprocessing time at model startup, training time can be reduced substantially. However, if the input shape keeps changing, the benchmark is re-run for each new shape and efficiency becomes very poor. cudnn.benchmark defaults to False.
The two flags above can therefore be used together, as shown below:
```python
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = True
```
- torch.backends.cudnn.deterministic = True
Official docs: A bool that, if True, causes cuDNN to only use deterministic convolution algorithms.
With cudnn.benchmark == False and cudnn.deterministic = True, the convolution algorithm returned each time is deterministic, i.e. the default algorithm. Combined with fixing Torch's random seed to a constant value, this should guarantee that the same input produces the same output on every run.
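As a minimal sketch of what "fixing the seed" could look like (which seeds matter depends on your code; the random/numpy lines only apply if you use those libraries):

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 42):
    # Pin down the common sources of randomness for reproducible runs
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Deterministic convolutions: reproducible, at some cost in speed
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
```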
The following sections cover how to use distributed training:
- DataParallel: typically used for single-machine, multi-GPU setups
- DistributedDataParallel: typically used for multi-machine, multi-GPU setups (it also performs better than DataParallel on a single machine with multiple GPUs)
3. Multi-GPU training with data splitting: DataParallel
DataParallel used to be the common choice, and it is fairly simple to use:
```python
# Train on GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    net = net.to(device)
    # All visible GPUs are used by default, so these three lines are equivalent
    net = torch.nn.DataParallel(net)
    # net = torch.nn.DataParallel(net, device_ids=[0, 1, 2, 3]).cuda()
    # net = torch.nn.DataParallel(net, device_ids=range(torch.cuda.device_count()))
# Train on CPU
else:
    device = torch.device('cpu')
    net = net.to(device)
```
The effect: memory utilization on the two GPUs is similar (nvidia-smi screenshot not reproduced here).

If these lines are removed, only one GPU is used; GPU 1 is only touched once GPU 0's memory is exhausted, i.e. device 0 is used first, then device 1 (screenshot not reproduced here).

Even so, computation still happens mainly on GPU 0 (screenshot not reproduced here).
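Incidentally, one practical note on DataParallel: the wrapper keeps the original network under a .module attribute, so a state_dict saved from the wrapped model carries a "module." prefix on every key. A minimal sketch (the checkpoint filename is illustrative) that saves and loads cleanly either way:

```python
# Save the underlying model so the checkpoint keys carry no "module." prefix
to_save = net.module if isinstance(net, torch.nn.DataParallel) else net
torch.save(to_save.state_dict(), "checkpoint.pth")

# Loading then works whether or not the model is currently wrapped
state_dict = torch.load("checkpoint.pth", map_location=device)
target = net.module if isinstance(net, torch.nn.DataParallel) else net
target.load_state_dict(state_dict)
```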
This is why truly parallel multi-GPU training is needed, i.e. the distributed training introduced next.
4. Distributed training: DistributedDataParallel
This section mainly covers the use of DistributedDataParallel in PyTorch.
Using pytorch-encoding:
This is an open-source GPU load-balancing tool; usage is as follows:
```python
from utils.encoding import DataParallelModel, DataParallelCriterion

model = DataParallelModel(model)
criterion = DataParallelCriterion(criterion)
```
GitHub link: https://github.com/zhanghang1989/PyTorch-Encoding
Using DistributedDataParallel:
Test code from online resources — based on what I found, the typical workflow for DistributedDataParallel is as follows:
```python
import os
import argparse

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
# Whether to enable SyncBatchNorm; it is only available in distributed training
parser.add_argument('--syncBN', type=bool, default=True)
# Number of processes to launch; no need to set this manually,
# it is derived from nproc_per_node automatically
parser.add_argument('--world-size', default=2, type=int,
                    help='number of distributed processes')
parser.add_argument('--dist-url', default='tcp://172.16.1.186:2222', type=str,
                    help='url used to set up distributed training')
parser.add_argument('--dist-backend', default='gloo', type=str,
                    help='distributed backend')
parser.add_argument('--dist-rank', default=0, type=int,
                    help='rank of distributed processes')
args = parser.parse_args()

# Init
torch.distributed.init_process_group(backend="nccl")
# dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
#                         world_size=args.world_size, rank=args.dist_rank)

# Bind each process to its own GPU
local_rank = torch.distributed.get_rank()
torch.cuda.set_device(local_rank)
device = torch.device('cuda:%d' % local_rank)

# Move the model to its GPU before wrapping it
model = YourModel()
model = model.to(device)  # model.cuda()
# model = torch.nn.parallel.DistributedDataParallel(model)
model = DistributedDataParallel(model, device_ids=[local_rank],
                                output_device=local_rank)

# Load data: DistributedSampler splits the dataset across processes
train_data = Dataset(root=args.root, resize=args.resize, mode='train')
train_loader = DataLoader(train_data, args.batch_size, pin_memory=True,
                          drop_last=True, sampler=DistributedSampler(train_data))
```
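With this setup, each process computes its own loss, so logging a global average requires an explicit reduction across processes. A minimal sketch, assuming init_process_group has already been called:

```python
def reduce_mean(tensor: torch.Tensor) -> torch.Tensor:
    # Average a scalar tensor across all DDP processes (for logging only;
    # gradients are already averaged by DistributedDataParallel itself)
    rt = tensor.detach().clone()
    dist.all_reduce(rt, op=dist.ReduceOp.SUM)
    rt /= dist.get_world_size()
    return rt
```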
yolov3spp's multi-GPU code:
First, the distributed-training code from yolov3spp (provided by the Bilibili creator 劈里啪啦):
```python
# Import the distributed-training helper functions
from train_utils import init_distributed_mode, torch_distributed_zero_first
from train_utils import get_coco_api_from_dataset  # COCO evaluation helpers

# Number of processes (not threads) to launch; no need to set this manually,
# it is derived from nproc_per_node automatically
parser.add_argument('--world-size', default=2, type=int,
                    help='number of distributed processes')
parser.add_argument('--dist-url', default='env://',
                    help='url used to set up distributed training')

# Init
def init_distributed_mode(args):
    if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
        args.rank = int(os.environ["RANK"])
        args.world_size = int(os.environ['WORLD_SIZE'])
        args.gpu = int(os.environ['LOCAL_RANK'])
    elif 'SLURM_PROCID' in os.environ:
        args.rank = int(os.environ['SLURM_PROCID'])
        args.gpu = args.rank % torch.cuda.device_count()
    else:
        print('Not using distributed mode')
        args.distributed = False
        return

    args.distributed = True
    torch.cuda.set_device(args.gpu)
    args.dist_backend = 'nccl'
    print('| distributed init (rank {}): {}'.format(
        args.rank, args.dist_url), flush=True)
    torch.distributed.init_process_group(backend=args.dist_backend,
                                         init_method=args.dist_url,
                                         world_size=args.world_size,
                                         rank=args.rank)
    torch.distributed.barrier()
    setup_for_distributed(args.rank == 0)

# Arguments: opt holds the training settings, hyp the model hyperparameters
def main(opt, hyp):
    # Initialize each process
    init_distributed_mode(opt)
    ...
    device = torch.device(opt.device)
    model = Darknet(cfg).to(device)
    model.load_state_dict(ckpt["model"], strict=False)  # load weights / pretrained model
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).to(device)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[opt.gpu])
    ...
    # Make sure only the first process in DDP processes the dataset first,
    # so the following processes can use the cache
    with torch_distributed_zero_first(opt.rank):
        train_dataset = LoadImagesAndLabels(train_path, imgsz_train, batch_size,
                                            augment=True,
                                            hyp=hyp,  # augmentation hyperparameters
                                            rect=opt.rect,  # rectangular training
                                            cache_images=opt.cache_images,
                                            single_cls=opt.single_cls,
                                            rank=opt.rank)
        # The validation image size is fixed to img_size (512)
        val_dataset = LoadImagesAndLabels(test_path, imgsz_test, batch_size,
                                          hyp=hyp,
                                          cache_images=opt.cache_images,
                                          single_cls=opt.single_cls,
                                          rank=opt.rank)

    # Assign each rank's process its own subset of training sample indices
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    val_sampler = torch.utils.data.distributed.DistributedSampler(val_dataset)
    # Group the sample indices into lists of batch_size elements
    train_batch_sampler = torch.utils.data.BatchSampler(
        train_sampler, batch_size, drop_last=True)
    ...
    # dataloader
    train_data_loader = torch.utils.data.DataLoader(
        train_dataset, batch_sampler=train_batch_sampler,
        num_workers=nw, pin_memory=True,
        collate_fn=train_dataset.collate_fn)
    val_data_loader = torch.utils.data.DataLoader(
        val_dataset, batch_size=batch_size, sampler=val_sampler,
        num_workers=nw, pin_memory=True,
        collate_fn=val_dataset.collate_fn)
    ...
    # Cache val_data when you have plenty of memory (RAM)
    with torch_distributed_zero_first(opt.rank):
        if os.path.exists("tmp.pk") is False:
            coco = get_coco_api_from_dataset(val_dataset)
            with open("tmp.pk", "wb") as f:
                pickle.dump(coco, f)
        else:
            with open("tmp.pk", "rb") as f:
                coco = pickle.load(f)

    # From here on, use the dataloaders as usual
    for epoch in range(start_epoch, epochs):
        train_sampler.set_epoch(epoch)
        # train
        mloss, lr = train_util.train_one_epoch(model, optimizer, train_data_loader, ...)
        ...
        # evaluate
        result_info = train_util.evaluate(model, val_data_loader,
                                          coco=coco, device=device)
        ...
```
DistributedDataParallel template code:
Here I have simplified the yolov3spp code into the following template:
```python
# Import the distributed-training helpers
import os
import argparse

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data.distributed import DistributedSampler

# Number of processes (not threads) to launch; no need to set this manually,
# it is derived from nproc_per_node automatically
parser = argparse.ArgumentParser()
parser.add_argument('--device', default='cuda', help='device id (i.e. 0 or 0,1 or cpu)')
parser.add_argument('--world-size', default=2, type=int,
                    help='number of distributed processes')
parser.add_argument('--dist-url', default='env://',
                    help='url used to set up distributed training')
args = parser.parse_args()

# Init: read the process info the launcher puts in the environment
args.gpu = int(os.environ['LOCAL_RANK'])
args.dist_backend = 'nccl'
args.world_size = int(os.environ['WORLD_SIZE'])
args.rank = int(os.environ["RANK"])
args.distributed = True

torch.cuda.set_device(args.gpu)
device = torch.device(args.device)

# Initialize the process group
torch.distributed.init_process_group(backend=args.dist_backend,
                                     init_method=args.dist_url,
                                     world_size=args.world_size,
                                     rank=args.rank)
torch.distributed.barrier()
...

# Load weights / pretrained model
model = Darknet(cfg).to(device)
model.load_state_dict(ckpt["model"], strict=False)
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).to(device)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
...

# Custom datasets; pass in args.rank so only the main process prints
train_dataset = LoadImagesAndLabels(train_path, imgsz_train, batch_size,
                                    augment=True,
                                    hyp=hyp,  # augmentation hyperparameters
                                    rect=args.rect,  # rectangular training
                                    cache_images=args.cache_images,
                                    single_cls=args.single_cls,
                                    rank=args.rank)
val_dataset = LoadImagesAndLabels(test_path, imgsz_test, batch_size,
                                  hyp=hyp,
                                  cache_images=args.cache_images,
                                  single_cls=args.single_cls,
                                  rank=args.rank)

# Assign each rank's process its own subset of training sample indices
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
val_sampler = torch.utils.data.distributed.DistributedSampler(val_dataset)
# Group the sample indices into lists of batch_size elements
train_batch_sampler = torch.utils.data.BatchSampler(
    train_sampler, batch_size, drop_last=True)
...

# dataloader
train_data_loader = torch.utils.data.DataLoader(
    train_dataset, batch_sampler=train_batch_sampler,
    num_workers=nw, pin_memory=True,
    collate_fn=train_dataset.collate_fn)
val_data_loader = torch.utils.data.DataLoader(
    val_dataset, batch_size=batch_size, sampler=val_sampler,
    num_workers=nw, pin_memory=True,
    collate_fn=val_dataset.collate_fn)
...

# From here on, use the dataloaders as usual
for epoch in range(start_epoch, epochs):
    train_sampler.set_epoch(epoch)
    # train
    mloss, lr = train_util.train_one_epoch(model, optimizer, train_data_loader, ...)
    ...
    # evaluate
    result_info = train_util.evaluate(model, val_data_loader,
                                      coco=coco, device=device)
    ...
```
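One detail the template leaves implicit: checkpoints (and most logging) should be written by the main process only, otherwise every process writes the same file. Under DDP the underlying network sits in model.module. A minimal sketch:

```python
# Write the checkpoint from rank 0 only; all other processes skip it
if args.rank == 0:
    torch.save({"model": model.module.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch},
               "checkpoint.pth")
```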
Note that this cannot be run directly inside PyCharm; distributed training must be launched from the command line:
```bash
# nproc_per_node is the number of GPUs to use; our server here has only two cards
python -m torch.distributed.launch --nproc_per_node=2 --use_env train_multi_GPU.py

# Launching without --use_env also works in some setups, but then the launcher passes
# --local_rank as a script argument instead of setting the LOCAL_RANK environment
# variable, so the template above (which reads os.environ['LOCAL_RANK']) needs the flag
python -m torch.distributed.launch --nproc_per_node=2 train_multi_GPU.py
```
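On PyTorch 1.10+, torchrun is the recommended successor to torch.distributed.launch; it always exports RANK, LOCAL_RANK, and WORLD_SIZE as environment variables, so the template above runs under it unchanged:

```bash
torchrun --nproc_per_node=2 train_multi_GPU.py
```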
5. Some ways to increase GPU utilization
1) At the start of the main function, add the following (this trades a little extra GPU memory for faster training):
```python
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.enabled = True
```
2) During training, add this before each epoch (periodically frees cached GPU memory; the effect does not feel significant):
```python
torch.cuda.empty_cache()
```
3) Defining the dataset's __len__ for the dataloader (the dataloader can stall intermittently; defining it like this avoids much of that):
```python
def __len__(self):
    return self.images.shape[0]
```
4) Dataloader prefetch settings (data is loaded while the model is training, which raises GPU utilization a little):
```python
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    ...
    pin_memory=True,
)
```
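pin_memory mainly pays off when combined with asynchronous host-to-device copies. Below is a minimal sketch; the num_workers and prefetch_factor values are illustrative and should be tuned per machine (prefetch_factor requires a reasonably recent PyTorch):

```python
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=64,      # illustrative value
    num_workers=4,      # worker processes load batches in the background
    prefetch_factor=2,  # batches preloaded per worker
    pin_memory=True,    # page-locked host memory enables async GPU copies
)

for images, targets in train_loader:
    # non_blocking=True overlaps the copy with GPU compute when memory is pinned
    images = images.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
```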
PS: more tips will be added in future updates…
References:
1) Saying goodbye to nn.DataParallel (和nn.DataParallel说再见): https://zhuanlan.zhihu.com/p/95700549
2) A summary of common PyTorch pitfalls (PyTorch常见的坑汇总): https://cloud.tencent.com/developer/article/1512508
3) A first look at PyTorch distributed training (pytorch分布式训练初探): https://zhuanlan.zhihu.com/p/43424629
4) A guide to multi-GPU training in PyTorch (Pytorch中多GPU训练指北): https://www.cnblogs.com/jfdwd/p/11196439.html
5) torch.backends.cudnn.benchmark: True or False: https://zhuanlan.zhihu.com/p/333632424