Regularization on GBDT

2017-12-04

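A previous article briefly covered how XGBoost's implementation differs from an ordinary GBDT implementation. This article attempts to summarize the regularization techniques used in GBDT.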

Early Stopping

Early stopping is a common technique for preventing overfitting when a model is trained iteratively. Wikipedia describes it as follows:

In machine learning, early stopping is a form of regularization used to avoid overfitting when training a learner with an iterative method, such as gradient descent.

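In practice, a portion of the samples is held out as a validation set. While the model is iteratively fit to the training set, training stops once the error on the validation set no longer decreases. In other words, early stopping controls the number of iterations (the number of trees).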
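The stopping rule can be sketched in a few lines of plain Python. This is only an illustration, not any library's implementation: the function name `best_round`, its `early_stopping_rounds` argument (named after the XGBoost parameter), and the error values are all made up for the example.

```python
def best_round(val_errors, early_stopping_rounds):
    """Return (best_index, best_error) for a sequence of validation errors.

    Scanning stops once the error has failed to improve for
    `early_stopping_rounds` consecutive rounds.
    """
    best_iter, best_error = -1, float("inf")
    for i, err in enumerate(val_errors):
        if err < best_error:
            best_iter, best_error = i, err
        elif i - best_iter >= early_stopping_rounds:
            break  # no improvement for too long: stop adding trees
    return best_iter, best_error


# Validation error after each boosting round (made-up numbers).
errors = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
result = best_round(errors, early_stopping_rounds=3)
print(result)
```

With these numbers the scan stops at round 5 and keeps round 2 as the best one. The final value 0.50 is never seen, which is the price of stopping early.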

The XGBoost Python documentation on the early stopping parameter is very clear. The API is as follows:

    # code snippet from the xgboost python-package, training.py
    def train(..., evals=(), early_stopping_rounds=None):
        """Train a booster with given parameters.

        Parameters
        ----------
        early_stopping_rounds: int
            Activates early stopping. Validation error needs to decrease at
            least every <early_stopping_rounds> round(s) to continue training.
        """

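Sklearn's GBDT implementation can also do early stopping, but it is more involved. There is no official documentation or code sample for it, so you have to read the source. The user must supply a monitor callback and understand the locals of the internal _fit_stages function, which is unfriendly to newcomers: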

    # code snippet from sklearn.ensemble.gradient_boosting
    class BaseGradientBoosting(six.with_metaclass(ABCMeta, BaseEnsemble,
                                                  _LearntSelectorMixin)):
        """Abstract base class for Gradient Boosting."""
        ...

        def fit(self, X, y, sample_weight=None, monitor=None):
            """Fit the gradient boosting model.

            Parameters
            ----------
            monitor : callable, optional
                The monitor is called after each iteration with the current
                iteration, a reference to the estimator and the local
                variables of ``_fit_stages`` as keyword arguments
                ``callable(i, self, locals())``. If the callable returns
                ``True`` the fitting procedure is stopped. The monitor can be
                used for various things such as computing held-out estimates,
                early stopping, model introspect, and snapshoting.
            """

Readers interested in Sklearn can refer to the article Using Gradient Boosting (with Early Stopping), which contains a reference implementation of the monitor callback.
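To make the callback protocol concrete, here is a minimal sketch of a monitor that follows the `callable(i, self, locals())` contract quoted above. It is not the implementation from the referenced article. The class name `EarlyStoppingMonitor` and the `score_fn` argument are hypothetical: `score_fn` stands for whatever function you write to compute the validation error of the partially fitted estimator, which depends on Sklearn internals and is left out here.

```python
class EarlyStoppingMonitor:
    """Callable with the signature monitor(i, est, local_vars).

    `score_fn` is a hypothetical user-supplied function that returns the
    current validation error of the estimator (lower is better).
    """

    def __init__(self, score_fn, patience):
        self.score_fn = score_fn
        self.patience = patience
        self.best_score = float("inf")
        self.best_iter = -1

    def __call__(self, i, est, local_vars):
        score = self.score_fn(est)
        if score < self.best_score:
            self.best_score, self.best_iter = score, i
            return False
        # Returning True asks the fitting procedure to stop.
        return i - self.best_iter >= self.patience


# Drive the monitor by hand with a stand-in "estimator" (a plain dict),
# so the example runs without Sklearn installed.
errors = [0.50, 0.40, 0.45, 0.46, 0.47]
monitor = EarlyStoppingMonitor(lambda est: errors[est["round"]], patience=2)
stopped_at = None
for i in range(len(errors)):
    if monitor(i, {"round": i}, {}):
        stopped_at = i
        break
print(stopped_at, monitor.best_iter)
```

With Sklearn the object would be passed as `est.fit(X, y, monitor=monitor)`, following the `fit` signature shown above.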

Shrinkage

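Shrinkage multiplies the output of each tree by a factor $ν$ ($0 < ν < 1$):

$$f_m(x) = f_{m-1}(x) + ν \cdot \sum_{j=1}^{J_m} γ_{jm} I(x \in R_{jm})$$

Here $\sum_{j=1}^{J_m} γ_{jm} I(x \in R_{jm})$ is the output of the $m$-th tree and $f_m$ is the ensemble of the first $m$ trees.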

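The ESL book puts it this way:

The parameter $ν$ can be regarded as controlling the learning rate of the boosting procedure.

$ν$ trades off against the number of boosting iterations (the number of trees): a smaller $ν$ generally needs more trees.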

Below is the part of the Sklearn implementation that sets this parameter; XGBoost is similar:

    # code snippet from sklearn.ensemble.gradient_boosting
    class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
        """Gradient Boosting for classification."""

        def __init__(self, ..., learning_rate=0.1, n_estimators=100, ...):
            """
            Parameters
            ----------
            learning_rate : float, optional (default=0.1)
                learning rate shrinks the contribution of each tree by
                `learning_rate`. There is a trade-off between learning_rate
                and n_estimators.

            n_estimators : int (default=100)
                The number of boosting stages to perform. Gradient boosting
                is fairly robust to over-fitting so a large number usually
                results in better performance
            """
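The trade-off between the learning rate and the number of trees can be seen in a toy booster. The sketch below is an assumption-laden illustration, not Sklearn's or XGBoost's code: it uses squared loss, one-dimensional inputs and single-split stumps, and the helper names `fit_stump`, `boost` and `mse` are invented for the example.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on 1-D inputs (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees, nu):
    """Training predictions after n_trees rounds with shrinkage factor nu."""
    predictions = [0.0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        # Shrinkage: only a fraction nu of the tree's output is added.
        predictions = [p + nu * tree(x) for p, x in zip(predictions, xs)]
    return predictions


def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


xs = list(range(10))
ys = [float(x * x) for x in xs]
for nu in (1.0, 0.1):
    for n_trees in (5, 50):
        print(nu, n_trees, round(mse(ys, boost(xs, ys, n_trees, nu)), 3))
```

For the same number of trees, the smaller factor leaves a larger training error, so it needs more rounds to reach the same fit. That slower fit is what makes shrinkage act as a regularizer.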

Subsampling

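Subsampling comes from the idea of bootstrap averaging (bagging). In GBDT, the tree built in each round is fit on a fraction $η$ of the training set, sampled at random without replacement.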

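In fact, both the XGBoost and Sklearn implementations borrow from random forests: in addition to sampling at the sample level, they also sample features. That is, when building a tree, the best split is searched only among a randomly selected subset of feature columns. Below is the part of Sklearn that sets the relevant parameters: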

    # code snippet from sklearn.ensemble.gradient_boosting
    class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
        """Gradient Boosting for classification."""

        def __init__(self, ..., subsample=1.0, max_features=None, ...):
            """
            Parameters
            ----------
            subsample : float, optional (default=1.0)
                The fraction of samples to be used for fitting the individual
                base learners. If smaller than 1.0 this results in Stochastic
                Gradient Boosting. `subsample` interacts with the parameter
                `n_estimators`. Choosing `subsample < 1.0` leads to a
                reduction of variance and an increase in bias.

            max_features : int, float, string or None, optional (default=None)
                The number of features to consider when looking for the best
                split:
            """
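Both kinds of sampling amount to drawing index subsets without replacement. The sketch below shows this with the standard library only. The helper names `sample_rows` and `sample_features` are invented for the example, and for brevity it draws one feature subset per round, whereas a real implementation may redraw features per tree, per level or per split.

```python
import random


def sample_rows(n_samples, eta, rng):
    """Indices of a fraction eta of the rows, drawn without replacement."""
    k = max(1, round(eta * n_samples))
    return rng.sample(range(n_samples), k)


def sample_features(n_features, max_features, rng):
    """Indices of the feature columns allowed in the split search."""
    k = min(n_features, max_features)
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)  # fixed seed so the run is repeatable
for round_index in range(3):
    rows = sample_rows(n_samples=10, eta=0.5, rng=rng)
    cols = sample_features(n_features=6, max_features=2, rng=rng)
    # A tree for this round would be fit on these rows and columns only.
    print(round_index, sorted(rows), cols)
```

In XGBoost the corresponding parameters are `subsample` for rows and `colsample_bytree` for columns.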

Regularized Learning Objective

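Explicitly adding the complexity of the tree model to the optimization objective as a regularization term is a distinctive feature of XGBoost's implementation.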

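$$L^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + Ω(f_t)$$

where

$$Ω(f) = γT + \frac{1}{2} λ \|w\|^2$$

Here $\hat{y}_i^{(t)}$ is the prediction for the $i$-th sample at iteration $t$, $f_t$ is the tree added at iteration $t$, $T$ is the number of leaves in the tree, and $w$ is the vector of leaf weights.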
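To see how $λ$ and $γ$ act in practice, the sketch below uses the closed-form results from the XGBoost paper's second-order derivation, which this article does not reproduce: the optimal weight of a leaf is $-G/(H+λ)$ and a split is worth making only if its gain exceeds $γ$, where $G$ and $H$ are the sums of first and second derivatives of the loss over the samples in the leaf. The function names are invented for the example, and squared loss $\frac{1}{2}(y - \hat{y})^2$ is assumed, so each gradient is $\hat{y} - y$ and each Hessian is 1.

```python
def leaf_weight(grads, hess, lam):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -sum(grads) / (sum(hess) + lam)


def leaf_score(grads, hess, lam):
    """G^2 / (H + lambda), the structure score contribution of one leaf."""
    g_sum, h_sum = sum(grads), sum(hess)
    return g_sum * g_sum / (h_sum + lam)


def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Objective reduction from splitting a leaf; gamma is the per-leaf cost."""
    parent = leaf_score(g_left + g_right, h_left + h_right, lam)
    children = leaf_score(g_left, h_left, lam) + leaf_score(g_right, h_right, lam)
    return 0.5 * (children - parent) - gamma


# Squared loss with current predictions of 0, so g = -y and h = 1.
print(leaf_weight([-1.0, -2.0, -3.0], [1.0, 1.0, 1.0], lam=0.0))  # plain mean
print(leaf_weight([-1.0, -2.0, -3.0], [1.0, 1.0, 1.0], lam=3.0))  # shrunk
print(split_gain([-1.0, -1.0], [1.0, 1.0], [-5.0, -5.0], [1.0, 1.0],
                 lam=0.0, gamma=0.0))
print(split_gain([-1.0, -1.0], [1.0, 1.0], [-5.0, -5.0], [1.0, 1.0],
                 lam=0.0, gamma=10.0))
```

A larger $λ$ pulls leaf weights toward zero, and a larger $γ$ turns a profitable split into a rejected one (negative gain). In XGBoost these are exposed as the `lambda` and `gamma` parameters.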

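My personal view is that adding the tree model's complexity to the optimization objective as a regularization term is more direct than controlling the complexity of each round's tree through parameters yourself. This may be an important reason why XGBoost performs better than ordinary GBDT implementations. Unfortunately, Sklearn has no corresponding implementation for now.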

Dropout

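Dropout is a very common regularization technique in deep learning, so it is natural to ask whether it can be applied to GBDT models. The AISTATS 2015 paper DART: Dropouts meet Multiple Additive Regression Trees makes an attempt at this.

The paper points out that GBDT suffers from a problem called over-specialization: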

Trees added at later iterations tend to impact the prediction of only a few instances, and they make negligible contribution towards the prediction of all the remaining instances. We call this issue of subsequent trees affecting the prediction of only a small fraction of the training instances over-specialization.

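In other words, the trees from early iterations contribute most of the predicted value, while later trees concentrate on correcting the errors of a small fraction of the samples. Shrinkage can reduce over-specialization, but not very well. The authors instead use dropout to balance the contributions of all trees to the prediction, with the effect shown in the following figure:

[Figure from the DART paper]

The procedure is as follows: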

DART diverges from MART at two places. First, when computing the gradient that the next tree will fit, only a random subset of the existing ensemble is considered. The second place at which DART diverges from MART is when adding the new tree to the ensemble where DART performs a normalization step.

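Put simply, each newly added tree is fit not to the residual of the ensemble of all previous trees, but to the residual of an ensemble of randomly selected trees. The output of the new tree is then normalized.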
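One round of this procedure might look like the sketch below. It is a simplified reading of the DART paper for squared loss, not the paper's or any library's code. The normalization used here (new tree scaled by $1/(k+1)$, each of the $k$ dropped trees scaled by $k/(k+1)$) follows the paper. Dropping one random tree when the random draw selects none is a choice made for this sketch. The names `dart_round`, `fit_tree` and `fit_mean` are invented, and `fit_mean` is a deliberately trivial base learner.

```python
import random


def dart_round(trees, weights, xs, ys, fit_tree, drop_rate, rng):
    """Run one DART round in place and return the dropped tree indices."""
    # 1. Each existing tree is dropped with probability drop_rate.
    dropped = {i for i in range(len(trees)) if rng.random() < drop_rate}
    if trees and not dropped:
        dropped = {rng.randrange(len(trees))}  # sketch: drop at least one
    kept = [i for i in range(len(trees)) if i not in dropped]

    # 2. The new tree fits the residuals of the kept trees only.
    def partial_prediction(x):
        return sum(weights[i] * trees[i](x) for i in kept)

    residuals = [y - partial_prediction(x) for x, y in zip(xs, ys)]
    new_tree = fit_tree(xs, residuals)

    # 3. Normalization: rescale the new tree and the dropped trees so that
    #    together they do not overshoot the target.
    k = len(dropped)
    for i in dropped:
        weights[i] *= k / (k + 1)
    trees.append(new_tree)
    weights.append(1 / (k + 1))
    return dropped


def fit_mean(xs, residuals):
    """Trivial base learner: predicts the mean residual everywhere."""
    mean = sum(residuals) / len(residuals)
    return lambda x: mean


rng = random.Random(0)
xs, ys = [0, 1, 2], [4.0, 4.0, 4.0]
trees, weights = [], []
dart_round(trees, weights, xs, ys, fit_mean, drop_rate=1.0, rng=rng)
dart_round(trees, weights, xs, ys, fit_mean, drop_rate=1.0, rng=rng)
prediction = sum(w * t(0) for w, t in zip(weights, trees))
print(weights, prediction)
```

In the second round the first tree is dropped, so the new tree relearns the full target. After normalization both trees carry weight 0.5 and the ensemble still predicts 4.0, which shows how the normalization step spreads the contribution across trees instead of letting the sum overshoot.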

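How much this new approach improves GBDT in practice remains to be explored.

This article is reposted from the cnblogs (博客园) blog 知识天地. Original link: Regularization on GBDT. To repost it, please contact the original author.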
