Decision Trees
A decision tree is a machine learning model that makes predictions by iteratively asking questions to partition the data until it reaches an answer.
Node: feature
Branch: decision
Leaf: outcome
Pain point:
Overfitting, i.e. the model performs well on the training dataset but poorly on the validation/test dataset.
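A minimal sketch with scikit-learn (the iris dataset and an unconstrained tree are assumptions for illustration) showing how a single tree overfits: near-perfect accuracy on the data it was trained on, lower accuracy on held-out data.

```python
# Minimal sketch, assuming scikit-learn is available; dataset and parameters
# are illustrative, not from the original notes.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree keeps splitting until every leaf is pure,
# which tends to memorize the training data (overfitting).
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))  # typically ~1.0
print("test accuracy: ", tree.score(X_test, y_test))    # typically lower
```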
Ensemble Learning
A model that makes predictions based on a number of different models.
Advantages: by combining different models, ensemble learning is more flexible (lower bias) and less sensitive to the data (lower variance).
The two most popular ensemble methods (see the sketch after this list):
- Bagging (parallel training), e.g. Random Forest: train a bunch of models in parallel, each on a random subset of the data, and average their results.
- Boosting (sequential training), e.g. GBDT: train a bunch of models sequentially, each learning from the mistakes of the previous model.
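A minimal sketch contrasting the two styles on the same dataset, using scikit-learn's RandomForestClassifier (bagging) and GradientBoostingClassifier (boosting); the synthetic data and parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Bagging: trees are built independently on random subsets and averaged.
bagging = RandomForestClassifier(n_estimators=200, random_state=0)

# Boosting: trees are built one after another, each correcting the errors
# left by the ensemble built so far.
boosting = GradientBoostingClassifier(n_estimators=200, random_state=0)

for name, model in [("random forest", bagging), ("gbdt", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())
```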
GBDT
Weak learner: a model that performs only slightly better than random chance.
At each step, the focus is on constructing a new weak learner to handle the remaining difficult observations.
The algorithm GBDT borrows from: AdaBoost
AdaBoost's weak learners: decision stumps (decision trees with only a single split). Hard-to-classify instances are given larger weights, easy ones smaller weights.
The final result is a weighted average of the outputs of all individual learners.
How gradient boosting differs: it optimizes a loss function instead of using the weighted-average scheme.
It uses a loss function to measure the error and converge on a final output value; the loss function is minimized using gradient descent.
GBDT's weak learner: decision trees
Advantage: relatively high accuracy
Disadvantage: training is slow because the trees are built sequentially.
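A toy sketch of this sequential idea for regression with squared-error loss: with MSE, the negative gradient of the loss is simply the residual, so each new small tree is fit to whatever error the ensemble still makes. The data, tree depth, and learning rate below are assumptions for illustration, not the full GBDT algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_estimators = 100
trees = []

# Start from a constant prediction (the mean), then repeatedly fit a small
# tree to the remaining residuals and add a shrunken version of it.
prediction = np.full_like(y, y.mean())
for _ in range(n_estimators):
    residuals = y - prediction            # negative gradient of MSE
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("final training MSE:", np.mean((y - prediction) ** 2))
```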
Measuring the residuals: a loss function (written out below)
- MSE: regression
- Log loss: classification
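Standard definitions of these two losses, written out here for reference (not from the original notes), where ŷ_i is the predicted value and p̂_i the predicted probability of the positive class; for MSE the negative gradient with respect to the prediction is proportional to the residual y_i − ŷ_i, which is why each new tree fits the residuals:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

$$\text{log loss} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log\hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\right]$$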
Hyperparameters
- Learning rate: controls how much each new tree modifies the existing model.
- n_estimators: the number of trees used; too many trees easily leads to overfitting.
GBDT is very sensitive to its hyperparameters; a random forest, by contrast, does not overfit as more trees are added, because its trees are trained independently in parallel and their results are averaged.
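A sketch of this sensitivity on an assumed synthetic regression task: staged_predict lets us watch test error as trees are added, and past some number of trees the test error stops improving, which is the n_estimators effect described above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbdt = GradientBoostingRegressor(learning_rate=0.1, n_estimators=500, random_state=0)
gbdt.fit(X_train, y_train)

# Test MSE after each additional tree.
test_mse = [np.mean((y_test - pred) ** 2) for pred in gbdt.staged_predict(X_test)]
best_n = int(np.argmin(test_mse)) + 1
print("best number of trees on the test set:", best_n)
```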
Ways to reduce GBDT overfitting (see the sketch after this list):
- Stochastic gradient boosting: subsampling, i.e. train each tree on a random subset of the data
- A small learning rate: 0.1-0.3
- Regularization
- Tree constraints:
  - Number of trees
  - Tree depth: 4-8
  - Minimum loss improvement required for a split
  - Number of observations (samples) required for a split
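A sketch of how these knobs map onto scikit-learn's GradientBoostingClassifier; the specific values are illustrative assumptions (except the 4-8 depth range and the ~0.1 learning rate noted above).

```python
from sklearn.ensemble import GradientBoostingClassifier

gbdt = GradientBoostingClassifier(
    subsample=0.8,                # stochastic gradient boosting: each tree sees 80% of the rows
    learning_rate=0.1,            # small learning rate (shrinkage)
    n_estimators=300,             # number of trees
    max_depth=4,                  # tree depth constraint
    min_impurity_decrease=1e-3,   # minimum improvement required to make a split
    min_samples_split=20,         # minimum number of observations required to split
    random_state=0,
)
# gbdt.fit(X_train, y_train) would then train the constrained ensemble.
```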