Paper notes (DL, BP): "Understanding the difficulty of training deep feedforward neural networks" - Alibaba Cloud Developer Community


2021-10-28



Reading the original paper

Original paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf

Paper content and key points

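Limitations of sigmoid with four layers

With the sigmoid function, the test loss and the training loss stay at about 0.5 for many epochs, and only afterwards move to about 0.1, which is barely satisfactory.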

We hypothesize that this behavior is due to the combination of random initialization and the fact that an hidden unit output of 0 corresponds to a saturated sigmoid. Note that deep networks with sigmoids but initialized from unsupervised pre-training (e.g. from RBMs) do not suffer from this saturation behavior.
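To make the phrase "an output of 0 corresponds to a saturated sigmoid" concrete, the following is a minimal NumPy sketch. It is not code from the paper; the function names and the sample pre-activation values are assumptions chosen for illustration. It evaluates the sigmoid and its derivative at a few points to show that the derivative is largest at an output of 0.5 and nearly vanishes when the output approaches 0.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical pre-activation values, chosen only for illustration.
pre_activations = np.array([-10.0, -5.0, 0.0, 5.0])
outputs = sigmoid(pre_activations)
grads = sigmoid_grad(pre_activations)

for z, a, g in zip(pre_activations, outputs, grads):
    print(f"z={z:+6.1f}  sigmoid={a:.5f}  gradient={g:.5f}")
```

A unit whose output sits near 0 passes almost no gradient backward, which is consistent with the long plateau described above.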

Limitations of tanh and softsign with five layers

After switching to the tanh function, training converges well and quickly.
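For reference, the two activations named in this section can be compared directly. The paper defines softsign as x / (1 + |x|) and notes that its tails are polynomial rather than exponential. The sketch below is illustrative only; the function names and sample inputs are assumptions, not taken from the paper.

```python
import numpy as np

def softsign(x):
    """Softsign activation: x / (1 + |x|)."""
    return x / (1.0 + np.abs(x))

def softsign_grad(x):
    """Derivative of softsign: 1 / (1 + |x|)^2 (polynomial decay)."""
    return 1.0 / (1.0 + np.abs(x)) ** 2

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2 (exponential decay)."""
    return 1.0 - np.tanh(x) ** 2

# Hypothetical inputs, chosen only for illustration.
xs = np.array([0.0, 2.0, 5.0, 10.0])
for x in xs:
    print(f"x={x:5.1f}  tanh'={tanh_grad(x):.2e}  softsign'={softsign_grad(x):.2e}")
```

For large |x| the softsign derivative stays orders of magnitude above the tanh derivative, so a softsign unit approaches saturation much more slowly.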

Conclusions

1. The normalization factor may therefore be important when initializing deep networks because of the multiplicative effect through layers, and we suggest the following initialization procedure to approximately satisfy our objectives of maintaining activation variances and back-propagated gradients variance as one moves up or down the network. We call it the normalized initialization.
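The procedure referred to in the quotation is, in the paper, to draw each weight from the uniform distribution U[-sqrt(6)/sqrt(n_j + n_{j+1}), +sqrt(6)/sqrt(n_j + n_{j+1})], where n_j and n_{j+1} are the sizes of the layers on either side of the weight matrix. A minimal NumPy sketch follows; the function name, the layer sizes, and the seed are assumptions made for illustration and are not from the paper.

```python
import numpy as np

def normalized_init(n_in, n_out, rng):
    """Sample a weight matrix from U[-limit, limit] with
    limit = sqrt(6) / sqrt(n_in + n_out)."""
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

# Hypothetical layer sizes and seed, chosen only for illustration.
rng = np.random.default_rng(0)
W = normalized_init(500, 300, rng)

# A uniform U[-a, a] variable has variance a^2 / 3, so the target
# weight variance here is 2 / (n_in + n_out).
print("empirical variance:", W.var())
print("target variance:   ", 2.0 / (500 + 300))
```

The variance 2 / (n_in + n_out) is a compromise between the forward condition (variance 1 / n_in) and the backward condition (variance 1 / n_out); the two coincide when adjacent layers have the same width.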

2. The results show that the distribution becomes more uniform.

Activation values normalized histograms with hyperbolic tangent activation, with standard (top) vs normalized initialization (bottom). Top: 0-peak increases for higher layers.

Several conclusions can be drawn from these error curves:

(1) The more classical neural networks with sigmoid or hyperbolic tangent units and standard initialization fare rather poorly, converging more slowly and apparently towards ultimately poorer local minima.

(2) The softsign networks seem to be more robust to the initialization procedure than the tanh networks, presumably because of their gentler non-linearity.

(3) For tanh networks, the proposed normalized initialization can be quite helpful, presumably because the layer-to-layer transformations maintain magnitudes of activations (flowing upward) and gradients (flowing backward).
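The claim in point (3) about activation magnitudes can be probed with a small forward-pass experiment at initialization. The sketch below is not the paper's experiment: the layer widths, batch size, Gaussian input, and seed are all assumptions, and biases are omitted. It compares the paper's standard initialization, U[-1/sqrt(n), 1/sqrt(n)], with the normalized initialization on a stack of tanh layers and records the standard deviation of the activations at each layer.

```python
import numpy as np

def standard_init(n_in, n_out, rng):
    """Standard heuristic: U[-1/sqrt(n_in), 1/sqrt(n_in)]."""
    limit = 1.0 / np.sqrt(n_in)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def normalized_init(n_in, n_out, rng):
    """Normalized initialization: U[-sqrt(6/(n_in+n_out)), +sqrt(6/(n_in+n_out))]."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def activation_stds(init_fn, widths, x, rng):
    """Push x through freshly initialized tanh layers (no biases)
    and return the activation standard deviation of each layer."""
    stds = []
    h = x
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        h = np.tanh(h @ init_fn(n_in, n_out, rng))
        stds.append(float(h.std()))
    return stds

# Hypothetical setup: Gaussian input and five hidden layers of 1000 units.
rng = np.random.default_rng(0)
widths = [1000] * 6
x = rng.standard_normal((256, widths[0]))

std_standard = activation_stds(standard_init, widths, x, rng)
std_normalized = activation_stds(normalized_init, widths, x, rng)

for i, (s, n) in enumerate(zip(std_standard, std_normalized), start=1):
    print(f"layer {i}: standard={s:.3f}  normalized={n:.3f}")
```

Under these assumptions the activations shrink quickly from layer to layer with the standard initialization, while the normalized initialization keeps them at a much more stable scale. The same reasoning applies to gradients flowing backward, which this forward-only sketch does not measure.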

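3. "Sigmoid 5" denotes a sigmoid network with 5 layers, and "N" denotes normalized initialization; from the results one can conclude that pre-training yields a smaller error.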
