Run it directly
Open XGBoost: How to Use XGBoost to Solve a Regression Problem and click "Open in DSW" in the upper-right corner.
Predicting house prices with XGBoost
This tutorial uses XGBoost to predict house prices, based on a dataset of house attributes with the sale price as the label.
Among the DSW sample notebooks there is another notebook that performs the same house-price regression on the same dataset; the difference is that it uses an ordinary linear regression algorithm.
If you are interested, take a look at the linear regression notebook: link
The end result is that XGBoost achieves the higher accuracy (92% vs. 86%).
Preparation
All of the packages this tutorial depends on are pre-installed in the DSW image. If any of them is missing from your environment, you can install it with pip install xxx.
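For example, the packages imported in this notebook could be installed in one go (the package list is inferred from the imports below, not prescribed by DSW):

!pip install numpy pandas matplotlib seaborn scipy scikit-learn xgboost graphviz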
First, import the Python libraries we need.
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')

import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn  # silence warnings (e.g. from sklearn and seaborn)

from scipy import stats
from scipy.stats import norm, skew

pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))
Loading the data
Read in the data with Pandas and inspect the raw records. The train.csv file was downloaded from the internet and prepared in advance. This tutorial does not make use of test samples; the corresponding test.csv file can also be downloaded online.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.head(5)
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 60 | RL | 65 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |

5 rows × 81 columns
test.head(5)
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1461 | 20 | RH | 80 | 11622 | Pave | NaN | Reg | Lvl | AllPub | ... | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
| 1 | 1462 | 20 | RL | 81 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal |
| 2 | 1463 | 60 | RL | 74 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |
| 3 | 1464 | 60 | RL | 78 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
| 4 | 1465 | 120 | RL | 43 | 5005 | Pave | NaN | IR1 | HLS | AllPub | ... | 144 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |

5 rows × 80 columns
Data cleaning and preprocessing
Raw data usually comes with all kinds of problems that get in the way of analysis and training, so it has to go through a cleaning and preprocessing stage: deduplication, missing values, outliers, and so on.
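For example, here is a minimal sketch of two such checks with pandas, using the train frame loaded above:

# count exact duplicate rows
print(train.duplicated().sum())
# columns with the highest fraction of missing values
print(train.isnull().mean().sort_values(ascending=False).head(10))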
As we saw above, the raw data has 81 feature columns and 1460 records in total. The Id column is of no use for training, so we drop it first.
print("The train data size before dropping Id feature is : {} ".format(train.shape)) print("The test data size before dropping Id feature is : {} ".format(test.shape)) #Save the 'Id' column train_ID = train['Id'] test_ID = test['Id'] #Now drop the 'Id' colum since it's unnecessary for the prediction process. train.drop("Id", axis = 1, inplace = True) test.drop("Id", axis = 1, inplace = True) #check again the data size after dropping the 'Id' variable print("\nThe train data size after dropping Id feature is : {} ".format(train.shape)) print("The test data size after dropping Id feature is : {} ".format(test.shape))
The train data size before dropping Id feature is : (1460, 81)
The test data size before dropping Id feature is : (1459, 80)

The train data size after dropping Id feature is : (1460, 80)
The test data size after dropping Id feature is : (1459, 79)
Feature engineering
The notebook linked at the beginning of this article explains the feature engineering in detail, so we will not repeat it here.
# Drop the two outliers
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)

# Smooth the label with np.log1p to bring it closer to a standard normal distribution
train["SalePrice"] = np.log1p(train["SalePrice"])

ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)

# Collect the features with missing values; they are handled one by one below
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})

corrmat = train.corr()
plt.subplots(figsize=(12,9))
sns.heatmap(corrmat, vmax=0.9, square=True)
<AxesSubplot:>
Handling missing values
all_data["PoolQC"] = all_data["PoolQC"].fillna("None") all_data["MiscFeature"] = all_data["MiscFeature"].fillna("None") all_data["Alley"] = all_data["Alley"].fillna("None") all_data["Fence"] = all_data["Fence"].fillna("None") all_data["FireplaceQu"] = all_data["FireplaceQu"].fillna("None") all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform( lambda x: x.fillna(x.median())) for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'): all_data[col] = all_data[col].fillna('None') for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'): all_data[col] = all_data[col].fillna(0) for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'): all_data[col] = all_data[col].fillna(0) for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'): all_data[col] = all_data[col].fillna('None') all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None") all_data["MasVnrArea"] = all_data["MasVnrArea"].fillna(0) all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0]) all_data = all_data.drop(['Utilities'], axis=1) all_data["Functional"] = all_data["Functional"].fillna("Typ") all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0]) all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0]) all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0]) all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0]) all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0]) all_data['MSSubClass'] = all_data['MSSubClass'].fillna("None") all_data_na = (all_data.isnull().sum() / len(all_data)) * 100 all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False) missing_data = pd.DataFrame({'Missing Ratio' :all_data_na}) # 类型转换 all_data['MSSubClass'] = all_data['MSSubClass'].apply(str) all_data['OverallCond'] = all_data['OverallCond'].astype(str) all_data['YrSold'] = all_data['YrSold'].astype(str) all_data['MoSold'] = all_data['MoSold'].astype(str) # encoding from sklearn.preprocessing import LabelEncoder cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope', 'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 'YrSold', 'MoSold') for c in cols: lbl = LabelEncoder() lbl.fit(list(all_data[c].values)) all_data[c] = lbl.transform(list(all_data[c].values)) print('Shape all_data: {}'.format(all_data.shape)) # 根据相关的行业知识,创建新的feature all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF'] # 一共有多少个洗漱间 all_data['TotalBath'] = all_data[['BsmtFullBath','BsmtHalfBath','FullBath','HalfBath']].sum(axis=1) # 门廊的面积 all_data['TotalPorchSF'] = all_data[['OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch','WoodDeckSF']].sum(axis=1) # 计算feature的偏度 numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False) print("\n数值类型feature的偏度: \n") skewness = pd.DataFrame({'Skew' :skewed_feats}) # 对偏度大于0,75的进行平滑处理 skewness = skewness[abs(skewness) > 0.75] print("一共有 {} 个feature需要处理".format(skewness.shape[0])) from scipy.special import boxcox1p skewed_features = skewness.index lam = 
0.15 for feat in skewed_features: all_data[feat] = boxcox1p(all_data[feat], lam) all_data = pd.get_dummies(all_data) # 产生最终的数据集 train = all_data[:ntrain] test = all_data[ntrain:]
Shape all_data: (2917, 78)

Skewness of numeric features:

There are 61 features to process
- MiscFeature: as in the code above, a missing value means the house simply lacks the feature, so it is filled with "None"
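To see what the Box-Cox step above is doing, here is a minimal, self-contained sketch (the exponential sample is a synthetic assumption) of how boxcox1p with lam = 0.15 shrinks the skewness of a right-skewed variable:

import numpy as np
from scipy.stats import skew
from scipy.special import boxcox1p

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=10_000)           # heavily right-skewed sample
print("skew before: %.2f" % skew(x))                  # close to 2 for an exponential
print("skew after : %.2f" % skew(boxcox1p(x, 0.15)))  # much closer to 0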
Modeling
Import the relevant packages.
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
Here we use cross-validation to mitigate overfitting. Note that cross_val_score uses the estimator's default scorer, which for a regressor is R², so the score reported below is a cross-validated R² value.
def get_accuracy(model, train, y_train):
    n_folds = 7
    kf1 = KFold(n_folds, shuffle=True, random_state=42)
    kf_cv_scores = cross_val_score(model, train, y_train, cv=kf1)
    return kf_cv_scores
- XGBoost:
model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468,
                             learning_rate=0.05, max_depth=3,
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, random_state=7, nthread=-1)
def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))
Start training
model_xgb.fit(train, y_train)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.4603,
             enable_categorical=False, gamma=0.0468, gpu_id=-1,
             importance_type=None, interaction_constraints='',
             learning_rate=0.05, max_delta_step=0, max_depth=3,
             min_child_weight=1.7817, missing=nan, monotone_constraints='()',
             n_estimators=2200, n_jobs=8, nthread=-1, num_parallel_tree=1,
             predictor='auto', random_state=7, reg_alpha=0.464,
             reg_lambda=0.8571, scale_pos_weight=1, subsample=0.5213,
             tree_method='exact', validate_parameters=1, verbosity=None)
kf_cv_scores = get_accuracy(model_xgb, train, y_train)
print("Average score: %.2f" % kf_cv_scores.mean())
Average score: 0.92
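The rmsle helper defined earlier is not used above; here is a minimal sketch of how it could be applied, reusing model_xgb, train, and y_train from this notebook (since SalePrice was log1p-transformed, the RMSE in log space is effectively an RMSLE of the prices):

xgb_train_pred = model_xgb.predict(train)
print("RMSLE on the training set: %.4f" % rmsle(y_train, xgb_train_pred))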
Print the structure of one of the trees
tree_struct = xgb.to_graphviz(model_xgb, num_trees=111)
tree_struct
<graphviz.sources.Source at 0x7f08f7fcbeb8>
Appendix
- 1.1 General hyperparameters
- 1.2 Booster hyperparameters
- 1.2.1 eta
- 1.2.2 gamma
- 1.2.3 max_depth
- 1.2.4 min_child_weight
- 1.2.5 max_delta_step
- 1.2.6 subsample
- 1.2.7 colsample_bytree, colsample_bylevel, colsample_bynode
- 1.2.8 lambda
- 1.2.9 alpha
- 1.2.10 tree_method
- 1.2.11 scale_pos_weight
- 1.2.12 max_leaves
- 1.3 Learning-task hyperparameters
- 1.3.1 objective
- 1.3.2 eval_metric
- 1.3.3 seed
- booster [default=gbtree]
- The booster parameter selects the algorithm used for the base learners in XGBoost
- gbtree and dart - use trees as the base learners
- gblinear - uses a linear model as the base learner
- verbosity [default=1]
- Controls the level of the logs XGBoost prints: 0 (silent), 1 (warning), 2 (info), 3 (debug).
- nthread [default = maximum number of threads available if not set]
- Controls XGBoost's degree of parallelism, i.e. how many threads in total are started to train the model; a configuration sketch follows this list
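A minimal sketch of setting these general parameters on the scikit-learn wrapper (the specific values are illustrative assumptions, not recommendations):

import xgboost as xgb

model = xgb.XGBRegressor(
    booster='gbtree',  # trees as base learners (the default)
    verbosity=1,       # print warnings only
    nthread=-1,        # use all available threads
)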
- eta [default=0.3, alias: learning_rate]
- The learning rate. Its range is [0, 1]; values between 0.01 and 0.2 are generally reasonable.
- gamma [default=0, alias: min_split_loss]
- The minimum gain a split must achieve when a tree node is split during training. In other words, if even the best gain at the current node is below this value, the node will not be split further.
- The default value is 0.
- max_depth [default=6]
- The maximum depth of a tree; the default value is 6.
- min_child_weight [default=1]
- The minimum sum of instance weights required in each leaf node. If this value is too large, the model will under-fit; if it is too small, the model will over-fit.
- The default value is 1.
- subsample [default=1]
- This parameter improves XGBoost's training efficiency; its range is [0, 1]. Its value is the fraction of all training samples that is randomly selected for training; see the sketch after this item.
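A minimal sketch of passing the booster parameters above through XGBoost's native xgb.train API, reusing train and y_train from this notebook (the specific values are illustrative assumptions):

params = {
    'eta': 0.05,            # learning rate
    'gamma': 0.05,          # minimum gain required to split
    'max_depth': 4,         # maximum tree depth
    'min_child_weight': 2,  # minimum child weight sum
    'subsample': 0.8,       # fraction of rows sampled per tree
}
dtrain = xgb.DMatrix(train, label=y_train)
booster = xgb.train(params, dtrain, num_boost_round=200)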
- colsample_bytree, colsample_bylevel, colsample_bynode [default=1]
- These are the parameters through which XGBoost implements column (feature) sampling; they all take values in [0, 1] and default to 1
- colsample_bytree - the fraction of columns sampled for each tree
- colsample_bylevel - the fraction of columns sampled at each tree level
- colsample_bynode - the fraction of columns sampled at each split node
- The three parameters take effect cumulatively. For example, with {'colsample_bytree':0.5, 'colsample_bylevel':0.5, 'colsample_bynode':0.5} and 64 features in total, only 8 features remain available at each split, as the arithmetic sketch below shows.
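A one-line check of the arithmetic in the example above:

n_features = 64
remaining = n_features * 0.5 * 0.5 * 0.5  # bytree * bylevel * bynode
print(remaining)  # 8.0 features left to choose from at each split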
- lambda [default=1, alias: reg_lambda]
- The lambda constant in the L2 regularization term. The default value is 1.
- If the training data has a high dimensionality, setting this parameter limits the complexity of the model.
- alpha [default=0, alias: reg_alpha]
- The constant of XGBoost's L1 regularization term; the default value is 0.
- If the training data has a high dimensionality, setting this parameter limits the complexity of the model and can make inference more efficient; both terms are sketched below.
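A minimal sketch of setting both regularization constants on the scikit-learn wrapper (the values are illustrative assumptions):

model = xgb.XGBRegressor(
    reg_lambda=1.0,  # L2 regularization term (lambda)
    reg_alpha=0.5,   # L1 regularization term (alpha)
)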
- tree_method, string [default=auto]
- This parameter sets the strategy XGBoost uses to construct trees during training
- The possible values are: auto, exact, approx, hist, gpu_hist
- auto: choose automatically
- For small datasets, the exact greedy strategy (exact) is chosen
- For large datasets, the weighted quantile strategy (approx) is chosen
- exact: the exact greedy strategy
- approx: the weighted quantile strategy
- hist: the histogram-based strategy
- gpu_hist: the GPU implementation of hist (a selection sketch follows)
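A minimal sketch of picking a tree construction strategy explicitly (hist is chosen here purely as an illustration; it is typically faster on larger datasets):

model = xgb.XGBRegressor(tree_method='hist')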
- scale_pos_weight [default=1]
- This parameter is generally used for classification problems with imbalanced classes, and is usually set according to the formula sum(negative instances) / sum(positive instances); see the sketch below.
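A minimal sketch of computing that ratio from a hypothetical binary label vector y:

import numpy as np

y = np.array([0, 0, 0, 0, 0, 0, 1, 1])              # hypothetical imbalanced labels
scale_pos_weight = (y == 0).sum() / (y == 1).sum()  # negatives / positives
print(scale_pos_weight)  # 3.0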
- max_leaves [default=0]
- The maximum number of leaf nodes
- Only takes effect when grow_policy=lossguide is set
- objective [default=reg:squarederror]
- This parameter sets the loss function XGBoost uses:
- reg:squarederror: for ordinary regression problems
- reg:squaredlogerror: regression with squared log error
- reg:logistic: logistic regression
- binary:logistic: binary classification
- binary:logitraw: binary classification, outputting the raw score for each class
- binary:hinge: binary classification, outputting the class label
- multi:softmax: multi-class classification; with this loss you also need to set the num_class (number of classes) parameter, as in the sketch below
- multi:softprob: multi-class classification, but the result is a matrix of per-class probabilities
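A minimal sketch of a multi-class configuration through the native API (the 3-class data here is a synthetic assumption for illustration):

import numpy as np
import xgboost as xgb

X = np.random.rand(90, 4)     # hypothetical feature matrix
y = np.repeat([0, 1, 2], 30)  # hypothetical labels for 3 classes
dtrain = xgb.DMatrix(X, label=y)
params = {'objective': 'multi:softmax', 'num_class': 3}
booster = xgb.train(params, dtrain, num_boost_round=20)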
- eval_metric [default according to objective]
- The evaluation metric; more than one can be set, to measure the model during training (see the sketch after this list):
- rmse: root mean square error
- mae: mean absolute error
- logloss: negative log-likelihood
- error: binary classification error rate, computed with 0.5 as the threshold for deciding between the positive and negative class
- merror: multi-class classification error rate
- mlogloss: multi-class logloss
- auc: area under the curve
- aucpr: area under the PR curve
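A minimal sketch of tracking several metrics at once through the native API (the binary data is a synthetic assumption for illustration):

import numpy as np
import xgboost as xgb

X = np.random.rand(100, 4)                   # hypothetical feature matrix
y = (np.random.rand(100) > 0.5).astype(int)  # hypothetical binary labels
dtrain = xgb.DMatrix(X, label=y)
params = {
    'objective': 'binary:logistic',
    'eval_metric': ['logloss', 'auc'],  # several metrics can be listed
}
# per-round train-logloss and train-auc are printed during training
booster = xgb.train(params, dtrain, evals=[(dtrain, 'train')], num_boost_round=10)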