Model Fusion
Generally speaking, fusing several different models can improve machine-learning performance, and this approach is widely used in machine-learning competitions. Common ensemble-learning / model-fusion methods include simple Voting/Averaging (for classification and regression respectively), Stacking, Boosting, and Bagging.
1 Voting
Model fusion is not as sophisticated as it may sound; let's start with the simplest method, Voting, which is itself a form of model fusion. Suppose we have a binary classification problem and 3 base models: we take a vote, and the class that receives the most votes becomes the final prediction.
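As a concrete illustration, here is a minimal hard-voting sketch; the dataset, the three base models, and the train/test split are assumptions for demonstration, not part of the original text:

```python
# Minimal hard-voting sketch (illustrative dataset and base models).
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Three different base models; "hard" voting picks the majority class label.
voter = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('dt', DecisionTreeClassifier(max_depth=5)),
                ('svc', SVC())],
    voting='hard')
voter.fit(X_tr, y_tr)
print("voting accuracy:", accuracy_score(y_te, voter.predict(X_te)))
```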
2 Averaging
For regression problems, a simple and direct idea is to average the predictions. A slight improvement is to use a weighted average.
The weights can be determined by ranking. For example, with three base models A, B and C ranked by performance as 1, 2 and 3, the weights assigned to the three models are 3/6, 2/6 and 1/6 respectively.
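The rank-based weighting described above can be applied directly with NumPy; in this minimal sketch the three prediction arrays are placeholders standing in for the outputs of models A, B and C:

```python
import numpy as np

# Hypothetical regression predictions from models A, B, C on the same test samples.
pred_a = np.array([3.1, 2.8, 5.0])   # best-ranked model   -> weight 3/6
pred_b = np.array([3.4, 2.5, 4.6])   # second-ranked model -> weight 2/6
pred_c = np.array([2.9, 3.0, 4.2])   # third-ranked model  -> weight 1/6

weights = np.array([3, 2, 1]) / 6.0
blended = weights[0] * pred_a + weights[1] * pred_b + weights[2] * pred_c
print(blended)
```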
Two points to note:
- The more models take part in the vote, the better the result tends to be, but only if the models are mutually independent and their outputs are uncorrelated. Fusing very similar models brings little benefit; the more the models differ from one another, the better the fused result, and this holds regardless of the fusion method. Note that the difference meant here is a difference in the models' correlation, not a difference in accuracy. For regression problems, averaging the predictions of several models usually reduces overfitting and gives a smoother fitted boundary, whereas a single model's boundary can be quite rough.
- Building on the fusion methods above, an improvement is to give each voter/averaged model a different weight, changing how much it influences the final result: models with lower accuracy get smaller weights, and models with higher accuracy get larger weights (see the sketch below).
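A minimal sketch of accuracy-weighted soft voting, where each model's weight is derived from its validation accuracy; the dataset, the three models, and the weighting rule are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

models = [LogisticRegression(max_iter=1000),
          RandomForestClassifier(n_estimators=100, random_state=0),
          GaussianNB()]
accs = []
for m in models:
    m.fit(X_tr, y_tr)
    accs.append(accuracy_score(y_val, m.predict(X_val)))

# Weight each model's predicted probabilities by its validation accuracy.
weights = np.array(accs) / np.sum(accs)
proba = sum(w * m.predict_proba(X_val) for w, m in zip(weights, models))
y_pred = proba.argmax(axis=1)
print("weighted soft-voting accuracy:", accuracy_score(y_val, y_pred))
```

Note that for an honest evaluation the weights should be computed on data that is not reused for the final score; here the same validation split is used only to keep the sketch short.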
3 Bagging
Bagging draws samples with replacement, builds a sub-model on each bootstrap sample, trains these sub-models, repeats the process many times, and finally fuses the results. Roughly, there are two steps:
- Repeat K times:
  - draw a bootstrap sample (with replacement) and build a model on it
  - train the sub-model
- Fuse the models:
  - classification: voting
  - regression: averaging
We do not need to implement Bagging ourselves: random forest is a classic example built on the Bagging algorithm, with decision trees as the base classifiers.
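A minimal sketch of Bagging with decision-tree base classifiers using scikit-learn, compared with a random forest; the dataset and parameter values are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# K = 100 bootstrap samples, one decision tree per sample, majority vote to fuse.
# (In scikit-learn versions before 1.2 the parameter is named base_estimator.)
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=100, bootstrap=True, random_state=42)
print("bagging       CV acc:", cross_val_score(bagging, X, y, cv=5).mean())

# Random forest = Bagging of decision trees plus random feature subsetting per split.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
print("random forest CV acc:", cross_val_score(rf, X, y, cv=5).mean())
```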
4 Boosting
Bagging can be trained in parallel, whereas Boosting is an iterative approach: each round pays more attention to the samples misclassified in the previous round by assigning them larger weights, so that the next iteration finds it easier to identify the previously misclassified samples. Finally, the weak classifiers are combined by a weighted sum.
Its basic working mechanism is as follows:
1. Train a base learner on the initial training set;
2. Adjust the distribution of the training samples according to the base learner's performance, so that the samples it got wrong receive more attention later;
3. Train the next base learner on the adjusted sample distribution;
4. Repeat the steps above until a stopping condition is met.
Note: the weak classifiers are called base learners only when they are all of the same type (a homogeneous ensemble); in a heterogeneous ensemble they are called individual learners. Since this distinction is not the focus of this article, we do not make it here.
Common Boosting methods include AdaBoost, GBDT, XGBoost, and others.
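A minimal AdaBoost sketch following the workflow above: each round re-weights the samples the previous learner got wrong, and the weak learners are combined by a weighted sum. The dataset and the depth-1 trees are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Depth-1 trees ("stumps") are classic weak learners; AdaBoost re-weights
# misclassified samples each round and sums the stumps with learned weights.
# (In scikit-learn versions before 1.2 the parameter is named base_estimator.)
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=200, learning_rate=0.5, random_state=0)
ada.fit(X_tr, y_tr)
print("AdaBoost accuracy:", accuracy_score(y_te, ada.predict(X_te)))
```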
5 Stacking
1. First, split the training set into five folds.
2. For each base model, train on four of the folds, then predict on the remaining held-out fold and on the test set. Rotate which folds are used for training and which for validation, and repeat until predictions have been obtained for the entire training set.
3. Carry out step 2 for each of the five base models. This yields 5 trained models and, via cross-validation, each model's predictions on the training set: P1, P2, P3, P4, P5.
4. Use the five models to predict on the test set, obtaining the test-set predictions T1, T2, T3, T4, T5.
5. Use P1–P5 and T1–T5 as the training set and test set of the next layer, i.e. as the training data and test data of the level-2 model (model 6).
A code example follows:
""" @author: quincy qiang @license: Apache Licence @file: 04_stakcing_template.py @time: 2019/12/12 @software: PyCharm """ import numpy as np import pandas as pd import lightgbm as lgb import xgboost as xgb from sklearn.linear_model import BayesianRidge from sklearn.model_selection import KFold, RepeatedKFold from sklearn.preprocessing import OneHotEncoder, LabelEncoder from scipy import sparse import warnings import time import sys import os import re import datetime import matplotlib.pyplot as plt import seaborn as sns import plotly.offline as py py.init_notebook_mode(connected=True) import plotly.graph_objs as go import plotly.tools as tls from sklearn.metrics import mean_squared_error from sklearn.metrics import log_loss from gen_feas import load_data train, test, no_features, features = load_data() X_train = train[features].values y_train = train['target'] target = y_train X_test = test[features].values ## lgb param = { 'boosting_type': 'gbdt', 'objective': 'binary', 'metric': {'auc'}, 'max_depth': 4, 'min_child_weight': 6, 'num_leaves': 16, 'learning_rate': 0.02, # 0.05 'feature_fraction': 0.7, 'bagging_fraction': 0.7, 'bagging_freq': 5, 'verbose': -1 } folds = KFold(n_splits=5, shuffle=True, random_state=2018) oof_lgb = np.zeros(len(train)) predictions_lgb = np.zeros(len(test)) for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)): print("fold n°{}".format(fold_ + 1)) trn_data = lgb.Dataset(X_train[trn_idx], y_train[trn_idx]) val_data = lgb.Dataset(X_train[val_idx], y_train[val_idx]) num_round = 10000 clf = lgb.train(param, trn_data, num_round, valid_sets=[trn_data, val_data], verbose_eval=200, early_stopping_rounds=100) oof_lgb[val_idx] = clf.predict(X_train[val_idx], num_iteration=clf.best_iteration) predictions_lgb += clf.predict(X_test, num_iteration=clf.best_iteration) / folds.n_splits print("CV score: {:<8.8f}".format(mean_squared_error(oof_lgb, target))) #### xgb xgb_params = {'booster': 'gbtree', 'objective': 'binary:logistic', 'eta': 0.02, 'max_depth': 4, 'min_child_weight': 6, 'colsample_bytree': 0.7, 'subsample': 0.7, 'silent': 1, 'eval_metric': ['auc']} folds = KFold(n_splits=5, shuffle=True, random_state=2018) oof_xgb = np.zeros(len(train)) predictions_xgb = np.zeros(len(test)) for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)): print("fold n°{}".format(fold_ + 1)) trn_data = xgb.DMatrix(X_train[trn_idx], y_train[trn_idx]) val_data = xgb.DMatrix(X_train[val_idx], y_train[val_idx]) watchlist = [(trn_data, 'train'), (val_data, 'valid_data')] clf = xgb.train(dtrain=trn_data, num_boost_round=20000, evals=watchlist, early_stopping_rounds=200, verbose_eval=100, params=xgb_params) oof_xgb[val_idx] = clf.predict(xgb.DMatrix(X_train[val_idx]), ntree_limit=clf.best_ntree_limit) predictions_xgb += clf.predict(xgb.DMatrix(X_test), ntree_limit=clf.best_ntree_limit) / folds.n_splits print("CV score: {:<8.8f}".format(mean_squared_error(oof_xgb, target))) # 将lgb和xgb的结果进行stacking train_stack = np.vstack([oof_lgb, oof_xgb]).transpose() test_stack = np.vstack([predictions_lgb, predictions_xgb]).transpose() folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=4590) oof_stack = np.zeros(train_stack.shape[0]) predictions = np.zeros(test_stack.shape[0]) for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack, target)): print("fold {}".format(fold_)) trn_data, trn_y = train_stack[trn_idx], target.iloc[trn_idx].values val_data, val_y = train_stack[val_idx], target.iloc[val_idx].values clf_3 = BayesianRidge() 
clf_3.fit(trn_data, trn_y) oof_stack[val_idx] = clf_3.predict(val_data) predictions += clf_3.predict(test_stack) / 10 mean_squared_error(target.values, oof_stack) from pandas import DataFrame result = DataFrame() result['id'] = test['id'] result['target'] = predictions result.to_csv('result/stacking.csv', index=False, sep=",", float_format='%.6f')
6 Blending
Step 1: Split the original training data into a training set and a validation set.
Step 2: Train T different base models on the training set.
Step 3: Use the T base models to predict on the validation set; these predictions become the new training data.
Step 4: Train a meta-model on the new training data.
Step 5: Use the T base models to predict on the test data; these predictions become the new test data.
Step 6: Use the meta-model to predict on the new test data to obtain the final result.
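A minimal sketch of this six-step procedure; the dataset, the T=2 base models, and the logistic-regression meta-model are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=3000, n_features=20, random_state=1)
X_full, X_test, y_full, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Step 1: split the original training data into a training set and a validation set.
X_train, X_val, y_train, y_val = train_test_split(X_full, y_full, test_size=0.3, random_state=1)

# Step 2: train T different base models on the training set.
base_models = [RandomForestClassifier(n_estimators=100, random_state=1),
               GradientBoostingClassifier(random_state=1)]
for m in base_models:
    m.fit(X_train, y_train)

# Steps 3 & 5: base-model predictions on the validation set become the new training
# data; predictions on the test set become the new test data.
val_meta = np.column_stack([m.predict_proba(X_val)[:, 1] for m in base_models])
test_meta = np.column_stack([m.predict_proba(X_test)[:, 1] for m in base_models])

# Step 4: train a meta-model on the new training data.
meta = LogisticRegression()
meta.fit(val_meta, y_val)

# Step 6: the meta-model predicts on the new test data to give the final result.
print("blending accuracy:", accuracy_score(y_test, meta.predict(test_meta)))
```

Unlike Stacking, the meta-features come from a single hold-out split rather than from out-of-fold cross-validation, which is simpler but uses less of the training data for the meta-model.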
Hyperparameter Optimization
Two recommended tools: Optuna and BayesianOptimization.
Recommendation 1: Optuna
```python
import numpy as np
import optuna
import lightgbm as lgb
import sklearn.datasets
import sklearn.metrics
from sklearn.model_selection import train_test_split


# FYI: Objective functions can take additional arguments
# (https://optuna.readthedocs.io/en/stable/faq.html#objective-func-additional-args).
def objective(trial):
    data, target = sklearn.datasets.load_breast_cancer(return_X_y=True)
    train_x, valid_x, train_y, valid_y = train_test_split(data, target, test_size=0.25)
    dtrain = lgb.Dataset(train_x, label=train_y)

    param = {
        "objective": "binary",
        "metric": "binary_logloss",
        "verbosity": -1,
        "boosting_type": "gbdt",
        "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 2, 256),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.4, 1.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 1.0),
        "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }

    gbm = lgb.train(param, dtrain)
    preds = gbm.predict(valid_x)
    pred_labels = np.rint(preds)
    accuracy = sklearn.metrics.accuracy_score(valid_y, pred_labels)
    return accuracy


if __name__ == "__main__":
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=100)

    print("Number of finished trials: {}".format(len(study.trials)))
    print("Best trial:")
    trial = study.best_trial

    print("  Value: {}".format(trial.value))
    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))
```
Recommendation 2: BayesianOptimization
Data source: https://www.kaggle.com/c/home-credit-default-risk
```python
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings("ignore")

import lightgbm as lgb
from bayes_opt import BayesianOptimization
from sklearn.preprocessing import LabelEncoder

application_train = pd.read_csv('../input/application_train.csv')


def label_encoder(input_df, encoder_dict=None):
    """Process a dataframe into a form useable by LightGBM."""
    # Label-encode categorical columns
    categorical_feats = input_df.columns[input_df.dtypes == 'object']
    for feat in categorical_feats:
        encoder = LabelEncoder()
        input_df[feat] = encoder.fit_transform(input_df[feat].fillna('NULL'))
    return input_df, categorical_feats.tolist(), encoder_dict


application_train, categorical_feats, encoder_dict = label_encoder(application_train)
X = application_train.drop('TARGET', axis=1)
y = application_train.TARGET

# The original snippet used train_data, n_folds and random_seed without defining them;
# the values below are assumed, typical settings.
train_data = lgb.Dataset(data=X, label=y, categorical_feature=categorical_feats, free_raw_data=False)
n_folds = 5
random_seed = 6


# Step 1: define the objective over the parameters to be optimized
def lgb_eval(num_leaves, feature_fraction, bagging_fraction, max_depth,
             lambda_l1, lambda_l2, min_split_gain, min_child_weight):
    params = {'application': 'binary', 'num_iterations': 4000,
              'learning_rate': 0.05, 'early_stopping_round': 100, 'metric': 'auc'}
    params["num_leaves"] = round(num_leaves)
    params['feature_fraction'] = max(min(feature_fraction, 1), 0)
    params['bagging_fraction'] = max(min(bagging_fraction, 1), 0)
    params['max_depth'] = round(max_depth)
    params['lambda_l1'] = max(lambda_l1, 0)
    params['lambda_l2'] = max(lambda_l2, 0)
    params['min_split_gain'] = min_split_gain
    params['min_child_weight'] = min_child_weight
    cv_result = lgb.cv(params, train_data, nfold=n_folds, seed=random_seed,
                       stratified=True, verbose_eval=200, metrics=['auc'])
    return max(cv_result['auc-mean'])


# Step 2: set the search range of each hyperparameter
lgbBO = BayesianOptimization(lgb_eval, {'num_leaves': (24, 45),
                                        'feature_fraction': (0.1, 0.9),
                                        'bagging_fraction': (0.8, 1),
                                        'max_depth': (5, 8.99),
                                        'lambda_l1': (0, 5),
                                        'lambda_l2': (0, 3),
                                        'min_split_gain': (0.001, 0.1),
                                        'min_child_weight': (5, 50)}, random_state=0)

# Step 3: run the optimization
# lgbBO.maximize(init_points=init_round, n_iter=opt_round)

# Step 4: read out the best parameters
# lgbBO.res['max']['max_params']   # in recent bayes_opt versions: lgbBO.max['params']
```
References
- Ensemble Learning: the Blending Algorithm — https://blog.csdn.net/qq_36816848/article/details/116674754
- Blending & Stacking Illustrated — https://blog.csdn.net/sinat_35821976/article/details/83622594