Methods for tuning the loss in Caffe

what should I do if...
...my loss diverges? (increases by orders of magnitude, goes to inf or NaN; a solver sketch follows the list below)

  • lower the learning rate
  • raise momentum (with a corresponding learning rate drop)
  • raise weight decay
  • raise the batch size
  • use gradient clipping (limit the L2 norm of the gradient to a particular value at each iteration; shrink it to that norm if greater)
  • try another solver: momentum SGD, ADAM, RMSProp, ...
  • try a smaller initialization (e.g., for a Gaussian init, lower the stdev)
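As a concrete illustration, here is a minimal solver.prototxt sketch showing where these knobs live in Caffe; the net path and all values are placeholders, not recommendations:

    # solver.prototxt (sketch; values are illustrative placeholders)
    net: "train_val.prototxt"   # assumed path to your net definition
    type: "SGD"                 # or "Adam", "RMSProp", ...
    base_lr: 0.001              # lower this first if the loss diverges
    momentum: 0.9               # if raised, drop base_lr correspondingly
    weight_decay: 0.0005        # raise to regularize harder
    clip_gradients: 35          # shrink gradients above this L2 norm each iteration
    lr_policy: "step"           # decay schedule: multiply lr by gamma every stepsize iters
    gamma: 0.1
    stepsize: 10000
    max_iter: 100000
    solver_mode: GPU
    # note: batch size is set by batch_size in the net's data layer, not in the solver

Training then runs with the standard CLI: caffe train --solver=solver.prototxt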

what should I do if...
...my loss doesn’t improve / gets stuck / drops slowly?

  • raise the learning rate
  • (maybe) lower momentum, weight decay, and/or batch size
  • try another solver: momentum SGD, ADAM, RMSProp, ...
  • transfer a pre-trained (e.g. on ImageNet) initialization, if possible
  • use a larger initialization (in particular, make sure you didn’t zero-initialize any multiplicative weights in intermediate layers)
  • use a “smarter” initialization (e.g., for linear layers followed by ReLUs, try the msra initialization in Caffe; see the layer sketch after this list)

  • remove some layers to make the network shallower (at least to start!)
    a strategy for model design: begin with a simple, trainable network, then “deepen” it by adding new layers one by one
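As a sketch of the msra initialization mentioned above: a convolution layer requests it through its weight_filler in the net prototxt. The layer name and shape here are illustrative, not from any particular net:

    layer {
      name: "conv1"                    # illustrative name and shape
      type: "Convolution"
      bottom: "data"
      top: "conv1"
      convolution_param {
        num_output: 64
        kernel_size: 3
        pad: 1
        weight_filler { type: "msra" }            # He/MSRA init, scaled to the ReLU fan-in
        bias_filler { type: "constant" value: 0 } # zero biases are fine; zero weights are not
      }
    }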

  • modify the architecture to improve gradient flow (a batch-norm sketch follows this list):
    batch normalization
    residual learning [ResNet]
    intermediate losses [GoogLeNet]
    other tricks
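For the batch normalization option, the usual pattern in BVLC Caffe is a BatchNorm layer (normalization only) immediately followed by a Scale layer (the learned affine part); the blob and layer names are illustrative:

    layer {
      name: "conv1_bn"
      type: "BatchNorm"
      bottom: "conv1"
      top: "conv1"
      batch_norm_param { use_global_stats: false }  # false while training, true at test time
    }
    layer {
      name: "conv1_scale"
      type: "Scale"
      bottom: "conv1"
      top: "conv1"
      scale_param { bias_term: true }  # learns the gamma (scale) and beta (bias) of batch norm
    }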

be patient! (go outside?)
deep learning can take a long time

  • training AlexNet in 2012: 12 days (although this is down to 1 day by 2015!)
    the loss hovers around the chance value of ln(1000) ≈ 6.908 for the first 1000+ iterations (~1 hour on a 2012 GPU; derivation below)
  • training ResNet-152 in 2015: 1-2 months (on 8 GPUs!)
  • the best configurations (net architectures, solvers) at convergence are often not the ones that train fastest early on
    some tricks to speed up learning can be “greedy” rather than ultimately beneficial
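Why ln(1000)? A quick check of the chance value: a 1000-way ImageNet classifier that predicts the uniform distribution assigns probability 1/1000 to the true class, so its cross-entropy loss is

    L = -\ln\frac{1}{1000} = \ln 1000 \approx 6.908

A loss stuck near this value means the net is still doing no better than chance.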

One more tip: if GPU memory is too small for the batch size you want, consider setting iter_size in the solver to increase the effective batch size.
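iter_size makes Caffe accumulate gradients over several forward/backward passes before each weight update, so the effective batch size is batch_size * iter_size. A minimal sketch, assuming a batch_size of 8 in the net's data layer:

    # in solver.prototxt
    iter_size: 4   # accumulate gradients over 4 passes per update
    # with batch_size: 8 in the data layer, the effective batch size is 8 * 4 = 32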

reference

https://docs.google.com/presentation/d/1HxGdeq8MPktHaPb-rlmYYQ723iWzq9ur6Gjo71YiG0Y/edit#slide=id.g8629ab2c8_0_60
