[DSW Gallery] XGBoost: How to Solve Regression Problems with XGBoost - Alibaba Cloud Developer Community

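Try It Directly

Open "XGBoost: How to Solve Regression Problems with XGBoost" and click "Open in DSW" in the upper-right corner.

Predicting House Prices with XGBoost

This article uses a dataset describing various attributes of houses, takes the sale price as the label, and trains XGBoost to predict it.

The DSW Sample Notebooks include another notebook that runs a regression analysis on the same dataset; the difference is that it uses ordinary linear regression.

If you are interested, take a look at the linear regression notebook. Link

In the end, XGBoost achieves the higher score (92% vs. 86%).

Preparation

All packages this article depends on are pre-installed in the DSW image. If your environment lacks any of them, install it with pip install xxx.

First, import the Python libraries we need.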

import numpy as np  
import pandas as pd  
%matplotlib inline
import matplotlib.pyplot as plt   
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn  
from scipy import stats
from scipy.stats import norm, skew  
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))

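Loading the Data

Read the data with Pandas and inspect the raw records. The train.csv file was downloaded and prepared in advance. This article does not ship the test samples; you can download the matching test.csv file online.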

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

train.head(5)

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60	RL	65	8450	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	2	2008	WD	Normal	208500
1	2	20	RL	80	9600	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	5	2007	WD	Normal	181500
2	3	60	RL	68	11250	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	NaN	NaN	0	9	2008	WD	Normal	223500
3	4	70	RL	60	9550	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	NaN	NaN	0	2	2006	WD	Abnorml	140000
4	5	60	RL	84	14260	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	NaN	NaN	0	12	2008	WD	Normal	250000
5 rows × 81 columns

test.head(5)

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	ScreenPorch	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition
0	1461	20	RH	80	11622	Pave	NaN	Reg	Lvl	AllPub	...	120	0	NaN	MnPrv	NaN	0	6	2010	WD	Normal
1	1462	20	RL	81	14267	Pave	NaN	IR1	Lvl	AllPub	...	0	0	NaN	NaN	Gar2	12500	6	2010	WD	Normal
2	1463	60	RL	74	13830	Pave	NaN	IR1	Lvl	AllPub	...	0	0	NaN	MnPrv	NaN	0	3	2010	WD	Normal
3	1464	60	RL	78	9978	Pave	NaN	IR1	Lvl	AllPub	...	0	0	NaN	NaN	NaN	0	6	2010	WD	Normal
4	1465	120	RL	43	5005	Pave	NaN	IR1	HLS	AllPub	...	144	0	NaN	NaN	NaN	0	1	2010	WD	Normal
5 rows × 80 columns

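Data Cleaning and Preprocessing

Raw data usually has all sorts of problems that get in the way of analysis and training, so it first goes through a cleaning and preprocessing stage: deduplication, missing-value handling, outlier handling, and so on.

As we saw above, the raw training data has 81 columns and 1,460 records in total. The Id column carries no information for training, so we drop it first.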

print("The train data size before dropping Id feature is : {} ".format(train.shape))
print("The test data size before dropping Id feature is : {} ".format(test.shape))
#Save the 'Id' column
train_ID = train['Id']
test_ID = test['Id']
#Now drop the 'Id' column since it's unnecessary for the prediction process.
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)
#check again the data size after dropping the 'Id' variable
print("\nThe train data size after dropping Id feature is : {} ".format(train.shape)) 
print("The test data size after dropping Id feature is : {} ".format(test.shape))

The train data size before dropping Id feature is : (1460, 81) 
The test data size before dropping Id feature is : (1459, 80) 
The train data size after dropping Id feature is : (1460, 80) 
The test data size after dropping Id feature is : (1459, 79)

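Feature Engineering

The notebook linked at the beginning of this article already explains the feature engineering in detail, so we do not repeat it here.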
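One detail of the code that follows is worth a quick illustration: the label is transformed with np.log1p before training, so anything the model predicts is on the log scale and has to be mapped back with np.expm1 to get a price. The sketch below uses the five SalePrice values shown in train.head(5) above; the variable names are illustrative only.

```python
import numpy as np

# The five SalePrice values shown in train.head(5).
prices = np.array([208500.0, 181500.0, 223500.0, 140000.0, 250000.0])

# Forward transform, as applied to the label before training.
log_prices = np.log1p(prices)

# Inverse transform, to be applied to predictions made on the log scale.
recovered = np.expm1(log_prices)

print(log_prices.round(3))
print(np.allclose(recovered, prices))
```

The same inverse step would apply to model_xgb.predict(...) output if you want predictions in the original price unit.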

# Drop these two outliers
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)
# Smooth the label with np.log1p so that its distribution is closer to normal
train["SalePrice"] = np.log1p(train["SalePrice"])
ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
# Collect the features that have missing values; they are handled one by one below
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
corrmat = train.corr()
plt.subplots(figsize=(12,9))
sns.heatmap(corrmat, vmax=0.9, square=True)

<AxesSubplot:>

Handling Missing Values

all_data["PoolQC"] = all_data["PoolQC"].fillna("None")
all_data["MiscFeature"] = all_data["MiscFeature"].fillna("None")
all_data["Alley"] = all_data["Alley"].fillna("None")
all_data["Fence"] = all_data["Fence"].fillna("None")
all_data["FireplaceQu"] = all_data["FireplaceQu"].fillna("None")
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[col] = all_data[col].fillna('None')
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    all_data[col] = all_data[col].fillna(0)
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    all_data[col] = all_data[col].fillna('None')
all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None")
all_data["MasVnrArea"] = all_data["MasVnrArea"].fillna(0)
all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])
all_data = all_data.drop(['Utilities'], axis=1)
all_data["Functional"] = all_data["Functional"].fillna("Typ")
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])
all_data['MSSubClass'] = all_data['MSSubClass'].fillna("None")
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
# Type conversion
all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)
all_data['OverallCond'] = all_data['OverallCond'].astype(str)
all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)
# encoding
from sklearn.preprocessing import LabelEncoder
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold')
for c in cols:
    lbl = LabelEncoder() 
    lbl.fit(list(all_data[c].values)) 
    all_data[c] = lbl.transform(list(all_data[c].values))
print('Shape all_data: {}'.format(all_data.shape))
# Create new features based on domain knowledge
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
# Total number of bathrooms
all_data['TotalBath'] = all_data[['BsmtFullBath','BsmtHalfBath','FullBath','HalfBath']].sum(axis=1)
# Total porch area
all_data['TotalPorchSF'] = all_data[['OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch','WoodDeckSF']].sum(axis=1)
# Compute the skewness of each numeric feature
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkewness of the numeric features: \n")
skewness = pd.DataFrame({'Skew' :skewed_feats})
# Intended to keep the features whose absolute skewness exceeds 0.75.
# Note: masking a DataFrame with a boolean DataFrame keeps every row (non-matching
# values become NaN), so every numeric feature is still transformed below.
skewness = skewness[abs(skewness) > 0.75]
print("{} features need to be transformed".format(skewness.shape[0]))
from scipy.special import boxcox1p
skewed_features = skewness.index
lam = 0.15
for feat in skewed_features:
    all_data[feat] = boxcox1p(all_data[feat], lam)
all_data = pd.get_dummies(all_data)
# Build the final datasets
train = all_data[:ntrain]
test = all_data[ntrain:]

Shape all_data: (2917, 78)
Skewness of the numeric features: 
61 features need to be transformed
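For readers unfamiliar with boxcox1p, here is a minimal sketch of the transform it computes, written in plain Python so the formula is visible. The helper name boxcox1p_sketch and the sample values are illustrative assumptions; the notebook itself calls scipy.special.boxcox1p with lam = 0.15.

```python
import math

def boxcox1p_sketch(x, lam):
    """Box-Cox transform of 1 + x: ((1 + x)**lam - 1) / lam, or log(1 + x) when lam == 0."""
    if lam == 0:
        return math.log1p(x)
    return ((1.0 + x) ** lam - 1.0) / lam

lam = 0.15  # same value as in the notebook code
for value in [0.0, 10.0, 1000.0, 100000.0]:
    # Large values are compressed far more strongly than small ones.
    print(value, round(boxcox1p_sketch(value, lam), 3))
```

With lam between 0 and 1 the transform sits between a logarithm and the identity, which is why it reduces right skew without being as aggressive as log1p.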


Modeling

Import the required packages

from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb

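Here we evaluate the model with cross-validation, so that the reported score is not inflated by overfitting to a single train/validation split.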

def get_accuracy(model,train,y_train):
    n_folds=7
    kf1 = KFold(n_folds, shuffle=True, random_state=42)
    kf_cv_scores = cross_val_score(model,train,y_train,cv=kf1)
    return kf_cv_scores
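To make the splitting step concrete, the following is a rough sketch of what a shuffled 7-fold split does with the row indices. It is not scikit-learn's implementation (KFold uses a different random number generator, so the actual fold membership will differ); the helper kfold_indices is a hypothetical name, and 1458 is the 1460 training rows minus the two dropped outliers.

```python
import numpy as np

def kfold_indices(n_samples, n_folds, seed):
    """Shuffle the row indices once, then cut them into n_folds validation folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    return np.array_split(shuffled, n_folds)

n_samples, n_folds = 1458, 7
folds = kfold_indices(n_samples, n_folds, seed=42)

for i, val_idx in enumerate(folds):
    # Every fold serves as the validation set once; the other folds form the training set.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(f"fold {i}: train={len(train_idx)}, validation={len(val_idx)}")
```

Each row is used for validation exactly once, so the seven scores returned by cross_val_score are all computed on data the corresponding model did not see during fitting.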

XGBoost :

model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=3, 
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, 
                             random_state =7, nthread = -1)

def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))
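The function is named rmsle even though it computes a plain RMSE. That is consistent because y_train was already transformed with np.log1p: RMSE on log1p values is the root mean squared logarithmic error on the original prices. A small sketch with made-up prices (all values below are hypothetical):

```python
import numpy as np

def rmse(y, y_pred):
    """Root mean squared error, computed with NumPy only."""
    return np.sqrt(np.mean((y - y_pred) ** 2))

# Hypothetical true and predicted prices in the original unit.
price_true = np.array([200000.0, 150000.0, 320000.0])
price_pred = np.array([210000.0, 140000.0, 300000.0])

# RMSE on the log1p scale, which is what rmsle() sees in the notebook ...
rmse_on_logs = rmse(np.log1p(price_true), np.log1p(price_pred))

# ... equals the RMSLE definition written out on the raw prices.
rmsle_on_prices = np.sqrt(np.mean((np.log1p(price_pred) - np.log1p(price_true)) ** 2))

print(np.isclose(rmse_on_logs, rmsle_on_prices))
```

Because the error is measured on the log scale, it penalizes relative rather than absolute deviations: being off by 10,000 matters more for a cheap house than for an expensive one.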

Start training

model_xgb.fit(train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.4603,
             enable_categorical=False, gamma=0.0468, gpu_id=-1,
             importance_type=None, interaction_constraints='',
             learning_rate=0.05, max_delta_step=0, max_depth=3,
             min_child_weight=1.7817, missing=nan, monotone_constraints='()',
             n_estimators=2200, n_jobs=8, nthread=-1, num_parallel_tree=1,
             predictor='auto', random_state=7, reg_alpha=0.464,
             reg_lambda=0.8571, scale_pos_weight=1, subsample=0.5213,
             tree_method='exact', validate_parameters=1, verbosity=None)

kf_cv_scores = get_accuracy(model_xgb,train,y_train)
print("Average score: %.2f" % kf_cv_scores.mean())

Average score: 0.92
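A note on what this number measures: get_accuracy calls cross_val_score without a scoring argument, so scikit-learn falls back to the estimator's own score method, which for regressors is the coefficient of determination R² rather than a classification-style accuracy. The sketch below computes R² by hand on made-up log-scale values; r2_score_sketch and the arrays are illustrative assumptions.

```python
import numpy as np

def r2_score_sketch(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([12.25, 12.11, 12.32, 11.85, 12.43])  # hypothetical log1p prices
y_pred = np.array([12.20, 12.15, 12.30, 11.95, 12.40])  # hypothetical predictions

perfect = r2_score_sketch(y_true, y_true)                                  # exact predictions
baseline = r2_score_sketch(y_true, np.full_like(y_true, y_true.mean()))   # always predict the mean
model = r2_score_sketch(y_true, y_pred)

print(perfect, baseline, round(model, 3))
```

A score of 0.92 therefore means the model explains about 92% of the variance of the log-transformed sale price across the validation folds.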

Visualize the structure of one of the trees

tree_struct = xgb.to_graphviz(model_xgb,num_trees=111)
tree_struct

<graphviz.sources.Source at 0x7f08f7fcbeb8>

Appendix

Introduction to XGBoost Parameters

XGBoost hyperparameters

1.1 General hyperparameters

1.1.1 booster
1.1.2 verbosity
1.1.3 nthread

1.2 Booster hyperparameters

1.2.1 eta
1.2.2 gamma
1.2.3 max_depth
1.2.4 min_child_weight
1.2.5 max_delta_step
1.2.6 subsample
1.2.7 colsample_bytree, colsample_bylevel, colsample_bynode
1.2.8 lambda
1.2.9 alpha
1.2.10 tree_method
1.2.11 scale_pos_weight
1.2.12 max_leaves

1.3 Learning task hyperparameters

1.3.1 objective
1.3.2 eval_metric
1.3.3 seed

1.1.1 booster

booster[default = gbtree]

The booster parameter selects which kind of base learner XGBoost uses.

gbtree and dart - use trees as base learners
gblinear - uses a linear model as the base learner

1.1.2 verbosity

verbosity[default = 1]

This parameter controls how verbose XGBoost's log output is: 0 (silent), 1 (warning), 2 (info), 3 (debug).

1.1.3 nthread

nthread [default = maximum number of threads available if not set]

This parameter controls XGBoost's parallelism, that is, how many threads are started to train the model.

1.2.1 eta

eta [default=0.3, alias: learning_rate]

This is the learning rate, in the range [0, 1]. Values between 0.01 and 0.2 are usually reasonable.
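To show what the learning rate does, here is a deliberately simplified sketch of gradient boosting for squared error with one-split trees (stumps). It is plain gradient boosting, not XGBoost's regularized algorithm, and the data, fit_stump, and boost are assumptions made for illustration. The only place eta appears is where each new tree's output is scaled before being added to the running prediction.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the threshold on x that best splits the residuals into two constant leaves."""
    best = None
    for t in np.unique(x)[:-1]:
        left, right = residual[x <= t], residual[x > t]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda q: np.where(q <= t, left_value, right_value)

def boost(x, y, eta, n_rounds):
    """Return the training MSE after each boosting round."""
    pred = np.full_like(y, y.mean())
    history = []
    for _ in range(n_rounds):
        stump = fit_stump(x, y - pred)   # fit the current residuals
        pred = pred + eta * stump(x)     # shrink the new tree's contribution by eta
        history.append(np.mean((y - pred) ** 2))
    return history

x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x)

for eta in (0.05, 0.3, 1.0):
    print(eta, round(boost(x, y, eta, n_rounds=20)[-1], 4))
```

A smaller eta makes each round contribute less, so more rounds are needed to reach the same training error; this is the usual trade-off behind pairing a low learning_rate (0.05 in this article) with a large n_estimators (2200).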

1.2.2 gamma

gamma [default=0, alias: min_split_loss]

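This parameter is the minimum gain required to split a node while XGBoost grows a tree. In other words, if the best gain available at the current node is below this value, the node is not split any further.
The default is 0.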
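As a rough illustration, the split gain can be written as in the XGBoost paper, with gamma subtracted as a fixed penalty per split. The function below is a sketch under that formula; the gradient and hessian sums are made-up numbers, and the library's internal scaling of the gain may differ from this textbook form.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Regularized split gain: 0.5 * (left + right - parent score) - gamma."""
    def score(g, h):
        return g * g / (h + reg_lambda)
    gain = 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right))
    return gain - gamma

# Hypothetical gradient/hessian sums for the two candidate children.
raw = split_gain(-4.0, 10.0, 3.0, 8.0)
penalized = split_gain(-4.0, 10.0, 3.0, 8.0, gamma=0.0468)
rejected = split_gain(-4.0, 10.0, 3.0, 8.0, gamma=2.0)

print(round(raw, 4), round(penalized, 4), rejected > 0)
```

A split is worth making only while the penalized gain stays positive, so raising gamma prunes weak splits and yields smaller trees.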

1.2.3 max_depth

max_depth [default=6]

Maximum depth of a tree. The default is 6.

1.2.4 min_child_weight

min_child_weight [default=1]

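This parameter is the minimum sum of instance weights (hessian) required in each child node. If it is too large, the model underfits; if it is too small, the model overfits.
The default is 1.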
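The "weight" here is the sum of second derivatives (hessians) of the loss over the rows in a node. The sketch below, with hypothetical values, shows why this amounts to a row count for squared-error regression but not for logistic loss.

```python
import math

def hessian_squared_error(pred, label):
    """Second derivative of 0.5 * (pred - label)**2 with respect to pred."""
    return 1.0

def hessian_logistic(margin, label):
    """Second derivative of the logistic loss with respect to the raw margin."""
    p = 1.0 / (1.0 + math.exp(-margin))
    return p * (1.0 - p)

leaf_preds = [12.1, 11.9, 12.4]   # hypothetical predictions for three rows in a node
cover_regression = sum(hessian_squared_error(p, 0.0) for p in leaf_preds)

leaf_margins = [4.0, 5.0, 6.0]    # hypothetical confident classifier margins
cover_logistic = sum(hessian_logistic(m, 1) for m in leaf_margins)

min_child_weight = 1.7817         # value used for model_xgb in this article
print(cover_regression >= min_child_weight)  # three rows are enough under squared error
print(cover_logistic >= min_child_weight)    # confident predictions carry little weight
```

For the regression model in this article, min_child_weight=1.7817 therefore behaves like "at least two rows per child".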

1.2.6 subsample

subsample [default=1]

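This parameter is the fraction of training samples used to grow each tree, in the range (0, 1]. For example, 0.5 means XGBoost randomly draws half of the samples before growing a tree, which speeds up training and also helps against overfitting.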

1.2.7 colsample_bytree, colsample_bylevel, colsample_bynode

colsample_bytree, colsample_bylevel, colsample_bynode [default=1]

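These parameters control column (feature) subsampling in XGBoost. Each takes values in (0, 1] and defaults to 1.
colsample_bytree: fraction of columns sampled for each tree
colsample_bylevel: fraction of columns sampled for each level
colsample_bynode: fraction of columns sampled for each node (split)
The three parameters take effect cumulatively. For example, with {'colsample_bytree':0.5, 'colsample_bylevel':0.5, 'colsample_bynode':0.5} and 64 features in total, only 8 features remain as candidates at each split.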
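The arithmetic behind the 64-feature example can be spelled out in a few lines; this is only the counting, not XGBoost's sampling code.

```python
n_features = 64
params = {'colsample_bytree': 0.5, 'colsample_bylevel': 0.5, 'colsample_bynode': 0.5}

remaining = n_features
for name in ('colsample_bytree', 'colsample_bylevel', 'colsample_bynode'):
    # Each stage samples from the columns that survived the previous stage.
    remaining = int(remaining * params[name])
    print(name, remaining)
```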

1.2.8 lambda

lambda [default=1, alias: reg_lambda]

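This is the lambda coefficient of the L2 regularization term, a constant. The default is 1.
When the training data is high-dimensional, setting this parameter limits the complexity of the model, which can also make inference more efficient.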

1.2.9 alpha

alpha [default=0, alias: reg_alpha]

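This is the coefficient of the L1 regularization term, a constant. The default is 0.
When the training data is high-dimensional, setting this parameter limits the complexity of the model, which can also make inference more efficient.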
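To see how lambda and alpha act on a tree, consider the value assigned to a single leaf. The sketch below uses the standard closed form: the gradient sum is soft-thresholded by alpha (L1) and then divided by the hessian sum plus lambda (L2). The function name and the input sums are illustrative assumptions.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, reg_alpha=0.0):
    """Leaf value: soft-threshold the gradient sum by alpha, then divide by H + lambda."""
    if grad_sum > reg_alpha:
        shrunk = grad_sum - reg_alpha
    elif grad_sum < -reg_alpha:
        shrunk = grad_sum + reg_alpha
    else:
        shrunk = 0.0
    return -shrunk / (hess_sum + reg_lambda)

# Hypothetical gradient and hessian sums for one leaf.
unregularized = leaf_weight(-4.0, 10.0, reg_lambda=0.0, reg_alpha=0.0)
l2_only = leaf_weight(-4.0, 10.0, reg_lambda=1.0, reg_alpha=0.0)
both = leaf_weight(-4.0, 10.0, reg_lambda=0.8571, reg_alpha=0.4640)
zeroed = leaf_weight(0.3, 10.0, reg_lambda=0.8571, reg_alpha=0.4640)

print(unregularized, round(l2_only, 4), round(both, 4), zeroed)
```

Lambda shrinks every leaf value toward zero, while alpha can set a weakly supported leaf to exactly zero.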

1.2.10 tree_method

tree_method string [default= auto]

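This parameter sets the tree construction algorithm XGBoost uses during training.
Available values: auto, exact, approx, hist, gpu_hist

auto: choose automatically

For small datasets, the exact greedy algorithm (exact) is chosen
For large datasets, the approximate algorithm based on weighted quantiles (approx) is chosen

exact: exact greedy algorithm
approx: approximate greedy algorithm based on weighted quantiles
hist: histogram-based approximate algorithm
gpu_hist: GPU implementation of the hist algorithm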

1.2.11 scale_pos_weight
scale_pos_weight [default=1]

This parameter is mainly used for classification problems with imbalanced classes. A typical value is sum(negative instances) / sum(positive instances).
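A minimal sketch of that formula on a made-up label list:

```python
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # hypothetical imbalanced binary labels

n_pos = sum(1 for y in labels if y == 1)
n_neg = sum(1 for y in labels if y == 0)

# Ratio of negative to positive instances, as suggested by the formula above.
scale_pos_weight = n_neg / n_pos
print(scale_pos_weight)
```

The parameter is not relevant to the house-price regression in this article, which is why it keeps its default of 1 in the fitted model shown earlier.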

1.2.12 max_leaves
max_leaves [default=0]

Maximum number of leaf nodes.
Only takes effect when grow_policy=lossguide is set.

1.3.1 objective

objective [default=reg:squarederror]
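This parameter sets the loss function XGBoost uses:

reg:squarederror : regression with squared error, the usual choice for regression problems
reg:squaredlogerror: regression with squared log error
reg:logistic : logistic regression
binary:logistic : binary classification, outputs a probability
binary:logitraw: binary classification, outputs the raw score before the logistic transformation
binary:hinge : binary classification with hinge loss, outputs the class (0 or 1)
multi:softmax : multiclass classification; when using this loss function you also need to set num_class (number of classes)
multi:softprob : multiclass classification, but the result is a matrix with one probability per class for each row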
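The difference between the two multiclass objectives is easiest to see on a toy example. The sketch below applies a softmax to made-up per-class scores with NumPy; it mimics the shape of the outputs, not XGBoost's prediction code.

```python
import numpy as np

def softprob(raw_scores):
    """Row-wise softmax: turn per-class scores into probabilities that sum to 1."""
    shifted = raw_scores - raw_scores.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical per-class scores: 2 rows, 3 classes.
raw_scores = np.array([[2.0, 0.5, 0.1],
                       [0.2, 0.1, 3.0]])

probs = softprob(raw_scores)      # multi:softprob style output: one probability per class
classes = probs.argmax(axis=1)    # multi:softmax style output: the winning class per row

print(probs.round(3))
print(classes)
```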

1.3.2 eval_metric

eval_metric [default according to objective]
Evaluation metrics. Several can be set at once; they define how the model is measured.

rmse : root mean square error
mae : mean absolute error
logloss : negative log-likelihood
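error : binary classification error rate, #(wrong cases) / #(all cases); instances with a predicted value above the 0.5 threshold are treated as positive
merror : multiclass classification error rate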
mlogloss : Multiclass logloss
auc: Area under the curve
aucpr : Area under the PR curve
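A few of these metrics are simple enough to compute by hand. The sketch below does so on made-up labels and probabilities, using the 0.5 threshold for error; it is meant to clarify the definitions, not to replace XGBoost's built-in evaluation.

```python
import math

y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.7, 0.6]   # hypothetical predicted probabilities

n = len(y_true)
pairs = list(zip(y_true, y_prob))

rmse = math.sqrt(sum((p - y) ** 2 for y, p in pairs) / n)
mae = sum(abs(p - y) for y, p in pairs) / n
logloss = -sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in pairs) / n
# A prediction counts as positive when its probability exceeds 0.5.
error = sum((p > 0.5) != (y == 1) for y, p in pairs) / n

print(round(rmse, 4), round(mae, 4), round(logloss, 4), error)
```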
