Machine Learning: Classification Prediction and Ensemble Learning (Part 2) - Alibaba Cloud Developer Community

Machine Learning: Classification Prediction and Ensemble Learning (Part 1): https://developer.aliyun.com/article/1507851?spm=a2c6h.13148508.setting.25.1b484f0eMnwKQL

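2. Convert all text columns to numeric codes

The training and test data are merged before encoding, so that a given category string maps to the same integer code in both sets.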

# merged_data = train_data.append(test_data)    # old way of merging the training and test sets
# FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
# DataFrame.append is deprecated (and was removed in pandas 2.0). Use the module-level function pandas.concat instead, e.g. df = pd.concat([df1, df2])
# By default, concat() stacks the DataFrame objects vertically (row-wise)
merged_data = pd.concat([train_data, test_data])  # merge the training and test sets
for column in merged_data.columns:
    if merged_data[column].dtype == 'object':
        # pd.Categorical(...).codes replaces each distinct string with an integer code
        merged_data[column] = pd.Categorical(merged_data[column]).codes

# Split back by position: the first n_train rows are the training set
n_train = train_data.shape[0]
train_data = merged_data[0:n_train]
test_data = merged_data[n_train:]

print("Training data shape:", train_data.shape)  # .shape returns a tuple
print("Test data shape:", test_data.shape)
print(test_data.head())  # head() shows the first 5 rows by default

Figure 17: Converting all text columns to numeric codes
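To see why the two sets are merged before encoding, the following is a minimal sketch on made-up data (the frames `train` and `test` and their values are illustrative, not the article's dataset). `pd.Categorical` assigns codes by sorted category order, so encoding each frame separately can give the same string different codes.

```python
import pandas as pd

# Toy frames; column names and values are assumptions for illustration only
train = pd.DataFrame({"workclass": ["Private", "State-gov"], "age": [39, 50]})
test = pd.DataFrame({"workclass": ["Local-gov", "Private"], "age": [28, 45]})

# Encoding separately: "Private" becomes 0 in train but 1 in test
sep_train = pd.Categorical(train["workclass"]).codes
sep_test = pd.Categorical(test["workclass"]).codes
print(sep_train.tolist(), sep_test.tolist())  # [0, 1] [0, 1]

# Encoding after merging: one shared mapping for both sets
# (Local-gov -> 0, Private -> 1, State-gov -> 2)
merged = pd.concat([train, test], ignore_index=True)
merged["workclass"] = pd.Categorical(merged["workclass"]).codes
train_enc = merged.iloc[:len(train)]
test_enc = merged.iloc[len(train):]
print(merged["workclass"].tolist())  # [1, 2, 0, 1]
```

With the shared mapping, "Private" is 1 in both `train_enc` and `test_enc`, which is what the model needs at prediction time.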

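(4) Model training

Train the model and select the best hyperparameters.

1. Preparation

Prepare the training features and labels, and the test features and labels.

Preset the hyperparameters.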

X_train = train_data.iloc[:, :-1]
y_train = train_data['wage_class']

X_test = test_data.iloc[:, :-1]
y_test = test_data['wage_class']

cv_params = {'max_depth': [3, 5, 7], 'min_child_weight': [1, 3, 5]}
ind_params = {'learning_rate': 0.1, 'n_estimators': 1000, 'seed': 0,
              'subsample': 0.8, 'colsample_bytree': 0.8,
              'objective': 'binary:logistic'}

Figure 18: Preparation

2. Train with XGBoost and select the best model parameters

First fix learning_rate and subsample, then search over the other two hyperparameters: max_depth and min_child_weight.

from xgboost import XGBClassifier  # XGBoost classifier with a scikit-learn style interface
from sklearn.model_selection import GridSearchCV  # hyperparameter tuning: grid search + cross-validation

print("Training the model and selecting the best parameters......")
# Use 5-fold cross-validation to select the best model
optimized_GBM = GridSearchCV(XGBClassifier(**ind_params), cv_params, scoring='accuracy', cv=5, n_jobs=-1, verbose=10)  # grid search over cv_params
# fit() trains one model per parameter combination and fold
optimized_GBM.fit(X_train, y_train)

print("Best parameters:", optimized_GBM.best_params_)
# cv_results_ holds the detailed cross-validation results for every parameter combination
means = optimized_GBM.cv_results_['mean_test_score']  # mean accuracy over the 5 held-out validation folds
stds = optimized_GBM.cv_results_['std_test_score']  # standard deviation of that accuracy across the folds

for mean, std, params in zip(means, stds, optimized_GBM.cv_results_['params']):
    print("%0.5f (+/-%0.05f) for %r" % (mean, std * 2, params))  # mean accuracy, +/- two standard deviations

Figure 19: Training with XGBoost and selecting the best model parameters
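As a rough illustration of what the grid search above iterates over, the candidate list can be enumerated by hand. This is only a sketch of the enumeration, not of GridSearchCV's internals; the names `grid` and `total_fits` are illustrative.

```python
from itertools import product

cv_params = {'max_depth': [3, 5, 7], 'min_child_weight': [1, 3, 5]}

# Every combination of the listed values is one candidate
names = list(cv_params)
grid = [dict(zip(names, values))
        for values in product(*(cv_params[name] for name in names))]

n_folds = 5
total_fits = len(grid) * n_folds  # one fit per candidate per fold

print(len(grid))    # 9
print(total_fits)   # 45
print(grid[0])      # {'max_depth': 3, 'min_child_weight': 1}
```

With the default `refit=True`, GridSearchCV additionally refits the best candidate once on the whole training set, which is the model later used by `predict`.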

3. Evaluate model performance

Predict on the test data.

Compute precision, recall, and F1 score for each class.

from sklearn.metrics import classification_report
# predict() is a method of the fitted estimator; it returns the predicted class label for each sample.
y_pred = optimized_GBM.predict(X_test)  # predict on the test set with the refit best model
print(classification_report(y_test, y_pred))

Figure 20: Evaluating model performance
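classification_report prints precision, recall, and F1 per class. As a reminder of what those numbers mean, here is a small pure-Python sketch on made-up labels; the helper `precision_recall_f1` and both label lists are illustrative and not part of the article's code.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up labels: 0 = lower wage class, 1 = higher wage class
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

p1, r1, f1_1 = precision_recall_f1(y_true, y_pred, positive=1)
p0, r0, f1_0 = precision_recall_f1(y_true, y_pred, positive=0)
print("class 1: %.3f %.3f %.3f" % (p1, r1, f1_1))  # class 1: 0.667 0.500 0.571
print("class 0: %.3f %.3f %.3f" % (p0, r0, f1_0))  # class 0: 0.600 0.750 0.667
```

Each class is scored in turn as the positive class, which is why the report has one row per class.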

4. Tune the hyperparameters again

With the best values found above, {'max_depth': 3, 'min_child_weight': 5}, held fixed, tune learning_rate and subsample and select the best combination.

cv_params = {'learning_rate': [0.1, 0.05, 0.01], 'subsample': [0.7, 0.8, 0.9]}
ind_params = {'max_depth': 3, 'n_estimators': 1000, 'seed': 0,'min_child_weight': 5, 'colsample_bytree': 0.8, 'objective': 'binary:logistic'}

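print("Training the model and selecting the best parameters......")
# Use 5-fold cross-validation to select the best model
optimized_GBM = GridSearchCV(XGBClassifier(**ind_params), cv_params, scoring='accuracy', cv=5, n_jobs=-1, verbose=10)  # grid search over cv_params
optimized_GBM.fit(X_train, y_train)  # trains one model per parameter combination and fold

print("Best parameters:", optimized_GBM.best_params_)
# cv_results_ holds the detailed cross-validation results for every parameter combination
means = optimized_GBM.cv_results_['mean_test_score']  # mean accuracy over the held-out validation folds
stds = optimized_GBM.cv_results_['std_test_score']  # standard deviation of that accuracy across the folds

for mean, std, params in zip(means, stds, optimized_GBM.cv_results_['params']):
    print("%0.5f (+/-%0.05f) for %r" % (mean, std * 2, params))  # mean accuracy, +/- two standard deviations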

Figure 21: Tuning the hyperparameters again
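Tuning in two stages, as done in steps 2 and 4, trades exhaustiveness for cost. A rough count of the candidates, assuming the same four value lists as in the article, shows the difference; the helper `n_combinations` is illustrative only.

```python
from itertools import product

stage1 = {'max_depth': [3, 5, 7], 'min_child_weight': [1, 3, 5]}
stage2 = {'learning_rate': [0.1, 0.05, 0.01], 'subsample': [0.7, 0.8, 0.9]}

def n_combinations(grid):
    """Number of candidates in a full grid over the given value lists."""
    return len(list(product(*grid.values())))

two_stage = n_combinations(stage1) + n_combinations(stage2)  # 9 + 9
joint = n_combinations({**stage1, **stage2})                 # 3 * 3 * 3 * 3

n_folds = 5
print(two_stage * n_folds)  # 90 cross-validation fits
print(joint * n_folds)      # 405 cross-validation fits
```

The two-stage search is much cheaper, but it can miss combinations in which the best max_depth or min_child_weight changes once learning_rate or subsample changes.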

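5. Find the best point at which to stop training

Build the XGBoost model with the best parameters selected above: {'max_depth': 3, 'min_child_weight': 5, 'learning_rate': 0.05, 'subsample': 0.8}.

When training an XGBoost model, too many boosting iterations lead to overfitting. As the number of iterations grows, the error on the test set first decreases; once the model starts to overfit (overtrain), the test error begins to rise or fluctuate.

Setting early_stopping_rounds specifies when to stop: if the error on the evaluation set has not decreased within early_stopping_rounds consecutive iterations, training stops.

The best_iteration attribute gives the best iteration, as a zero-based index.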

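# Build the XGBoost model with the selected parameters
ind_params = {'max_depth': 3, 'min_child_weight': 5, 'learning_rate': 0.05, 'subsample': 0.8, 'n_estimators': 1000, 'seed': 0, 'colsample_bytree': 0.8, 'objective': 'binary:logistic'}
eval_set = [(X_test, y_test)]

# early_stopping_rounds: stop training if the error on eval_set has not decreased for this many consecutive rounds.
# Since XGBoost 1.6, early_stopping_rounds and eval_metric are passed to the constructor;
# passing them to fit() is deprecated and no longer accepted by recent releases.
model = XGBClassifier(**ind_params, early_stopping_rounds=100, eval_metric="error")
result = model.fit(X_train, y_train, eval_set=eval_set, verbose=20)  # fit() returns the fitted model itself
print("Best iteration:", result.best_iteration)  # zero-based index of the best boosting round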

Figure 22: Finding the best point at which to stop training
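The stopping rule described above can be sketched in a few lines of plain Python. This is a simplified illustration of the patience logic, not XGBoost's implementation; the function name and the error values are made up.

```python
def train_with_early_stopping(eval_errors, patience):
    """Return (best_iteration, last_iteration_run) for per-round evaluation errors.

    Training stops once `patience` rounds have passed without a new best error.
    """
    best_iteration = 0
    best_error = float("inf")
    for iteration, error in enumerate(eval_errors):
        if error < best_error:
            best_error = error
            best_iteration = iteration
        elif iteration - best_iteration >= patience:
            return best_iteration, iteration
    return best_iteration, len(eval_errors) - 1

# Made-up error curve: improves, then fluctuates
errors = [0.20, 0.18, 0.17, 0.16, 0.165, 0.17, 0.168, 0.175, 0.15]
best, stopped_at = train_with_early_stopping(errors, patience=3)
print(best, stopped_at)  # 3 6
```

Training stops at round 6, so the later value 0.15 is never seen; a larger patience (the article uses 100) makes this less likely. Because the index is zero-based, the best model here uses `best + 1` = 4 rounds. Note also that choosing the stopping point on the test set means the test score is no longer fully independent; a separate validation split is a common alternative.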

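6. Evaluate the final model

# predict() is a method of the fitted model; it returns the predicted class label for each sample.
# best_iteration is a zero-based index, so best_iteration + 1 boosting rounds are used.
# iteration_range replaces ntree_limit, which was deprecated in XGBoost 1.4 and removed in later releases.
y_pred = model.predict(X_test, iteration_range=(0, result.best_iteration + 1))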
print(classification_report(y_test, y_pred))

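Figure 23: Evaluating the final model

(5) Feature analysis

Examine the correlations among features and adjust some of them.

1. Inspect the correlations among features

seaborn.heatmap draws a heatmap of the correlation between every pair of features (columns) in the dataset.

The correlation between two features lies between -1.0 and 1.0. A value of 1.0 means the strongest positive correlation (for example, a feature with itself is always 1.0); -1.0 means the strongest negative correlation (the two features move in exactly opposite directions).

In the figure, look at the color of each cell, which corresponds to the pair of features given by its row and column. With cmap='Greens', as used in the code, darker green means a value closer to 1.0 (positive correlation) and a color closer to white means a lower value (toward negative correlation). Because annot=True also prints the value in each cell, the number is the more reliable thing to read.

When two different features are strongly correlated (whether positively or negatively), consider keeping only one of them for training: the other moves almost identically, or exactly oppositely, to the selected one, so its effect on the classification result is essentially the same.

In this example, sex and relationship show a strong negative correlation, and education and education_num show a fairly strong positive correlation, so one feature from each pair can be kept.

Removing some strongly correlated features has almost no effect on the modeling result, but should reduce the amount of computation.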

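import matplotlib.pyplot as plt
import seaborn as sns

# Draw a heatmap of the correlation between every pair of features (columns)
# Steps: set the style -> create the figure -> compute the data (train_data.corr()) -> plot -> show
# style must be one of white, dark, whitegrid, darkgrid, ticks
sns.set(style='white')
# Create the figure; figsize sets its size in inches
plt.figure(figsize=(10, 8))
# Compute the correlation matrix and draw the heatmap
corr = train_data.corr()
# annot=True writes the numeric value in each cell
# cmap: a matplotlib colormap name or object
# fmt: format string for the annotations
sns.heatmap(corr, annot=True, cmap='Greens', fmt=".2f")  # heatmap plots rectangular data as a color-encoded matrix
# Save the image
plt.savefig("three.png")
# Show the figure
plt.show()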

Figure 24: Inspecting the correlations among features

Figure 25: three.png
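Each cell of the heatmap is a Pearson correlation coefficient, which is what DataFrame.corr() computes by default. A minimal pure-Python sketch of that formula, on made-up columns, shows the two extremes; the function `pearson` and the sample lists are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

x = [1, 2, 3, 4]
same_trend = [2, 4, 6, 8]      # rises whenever x rises
opposite_trend = [8, 6, 4, 2]  # falls whenever x rises

r_pos = pearson(x, same_trend)
r_neg = pearson(x, opposite_trend)
print(round(r_pos, 6), round(r_neg, 6))  # 1.0 -1.0
```

Real features rarely reach exactly 1.0 or -1.0; values close to either end are the ones worth a second look.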

2. Remove strongly correlated, redundant features

In this example, education_num is removed and education is kept; relationship is removed and sex is kept.

from xgboost import XGBClassifier
from sklearn.metrics import classification_report

# Set the hyperparameters
ind_params = {'max_depth': 3, 'min_child_weight': 5, 'learning_rate': 0.05, 'subsample': 0.8, 'n_estimators': 1000, 'seed': 0, 'colsample_bytree': 0.8, 'objective': 'binary:logistic'}

# Remove the strongly correlated columns
# drop(columns=...) returns a new DataFrame without the named columns
X_train_reduced = X_train.drop(columns=['education_num', 'relationship'])
X_test_reduced = X_test.drop(columns=['education_num', 'relationship'])
eval_set = [(X_test_reduced, y_test)]

# Train the model (since XGBoost 1.6, early_stopping_rounds and eval_metric go to the constructor, not fit())
model = XGBClassifier(**ind_params, early_stopping_rounds=100, eval_metric="error")
result = model.fit(X_train_reduced, y_train, eval_set=eval_set, verbose=50)
print("Best iteration:", result.best_iteration)

# Predict and evaluate
# predict() returns the predicted class label for each sample; best_iteration is zero-based, hence the + 1
y_pred = model.predict(X_test_reduced, iteration_range=(0, result.best_iteration + 1))
print(classification_report(y_test, y_pred))

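Figure 26: Removing strongly correlated, redundant features

Figure 27: Result after removing strongly correlated, redundant features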
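The article picks the columns to drop by reading the heatmap. As a possible alternative, the strongly correlated pairs can be listed programmatically. The sketch below uses a made-up frame and an assumed threshold of 0.9; neither comes from the article.

```python
import pandas as pd

# Toy data: b is exactly 2 * a, c is only loosely related to a
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})

corr = df.corr()
threshold = 0.9  # assumed cut-off for "strongly correlated"
cols = list(corr.columns)

# Walk the upper triangle so that each pair is reported once
pairs = [
    (cols[i], cols[j], corr.iloc[i, j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > threshold
]

# Keep the first column of each pair, drop the second
to_drop = sorted({second for _, second, _ in pairs})
reduced = df.drop(columns=to_drop)
print([(first, second) for first, second, _ in pairs])  # [('a', 'b')]
print(list(reduced.columns))                            # ['a', 'c']
```

Which member of a pair to keep is still a judgment call; this sketch simply keeps whichever column comes first.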

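3. Bin the age feature

age is a continuous natural-number value, so to some extent an age range may be more meaningful than the exact age.

numpy.digitize assigns each value to one of the specified intervals and returns the index of that interval.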

import numpy as np

# Define the age bin edges
age_bins = [20, 30, 40, 50, 60, 70]    # bin 0: below 20, bin 1: 20 to 29, ...... bin 6: 70 and above

# Convert 'age' to bin indices
# np.digitize returns, for each value, the index of the bin it falls into
X_train_reduced['age'] = np.digitize(X_train['age'], bins=age_bins)
X_test_reduced['age'] = np.digitize(X_test['age'], bins=age_bins)
print(X_train_reduced['age'].unique())  # the distinct bin indices that occur
eval_set = [(X_test_reduced, y_test)]

# Set the hyperparameters
ind_params = {'max_depth': 3, 'min_child_weight': 5, 'learning_rate': 0.05, 'subsample': 0.8, 'n_estimators': 1000, 'seed': 0, 'colsample_bytree': 0.8, 'objective': 'binary:logistic'}

# Train the model (since XGBoost 1.6, early_stopping_rounds and eval_metric go to the constructor, not fit())
model = XGBClassifier(**ind_params, early_stopping_rounds=100, eval_metric="error")
result = model.fit(X_train_reduced, y_train, eval_set=eval_set, verbose=20)
print("Best iteration:", result.best_iteration)

# Predict and evaluate
# predict() returns the predicted class label for each sample; best_iteration is zero-based, hence the + 1
y_pred = model.predict(X_test_reduced, iteration_range=(0, result.best_iteration + 1))
print(classification_report(y_test, y_pred))

Figure 28: Binning the age feature

Figure 29: Result after binning the age feature
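The behavior at the bin edges is easy to get wrong, so here is a small check on hand-picked ages (the sample values are illustrative). With the default right=False, a value equal to an edge goes into the higher bin, which matches bisect.bisect_right from the standard library.

```python
from bisect import bisect_right

import numpy as np

age_bins = [20, 30, 40, 50, 60, 70]
ages = [18, 20, 29, 30, 69, 70, 90]

# Index i means age_bins[i-1] <= age < age_bins[i]; 0 is below the first edge
binned = np.digitize(ages, bins=age_bins)
print(binned.tolist())  # [0, 1, 1, 2, 5, 6, 6]

# Same result with the standard library, one value at a time
by_bisect = [bisect_right(age_bins, age) for age in ages]
print(by_bisect)  # [0, 1, 1, 2, 5, 6, 6]
```

Six edges therefore produce seven bin indices, 0 through 6.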

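Problems and solutions

Problem 1: No module named 'xgboost'

Solution: %pip install xgboost

Figure 30: Solution

Problem 2: Unsure how to use the crosstab (cross-tabulation) function

Figure 31: Problem 2

Solution: two approaches both work

Figure 32: Solution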
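The two approaches shown in Figure 32 are not reproduced in the text. As a generic reminder of how pandas.crosstab is called, here is a minimal sketch on made-up rows; the column names follow the dataset's naming, but the values are invented.

```python
import pandas as pd

# Toy rows for illustration only
df = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female", "Male"],
    "wage_class": ["<=50K", "<=50K", ">50K", "<=50K", ">50K"],
})

# Counts: rows are the values of the first argument, columns those of the second
counts = pd.crosstab(df["sex"], df["wage_class"])
print(counts)

# Row-wise proportions instead of counts
shares = pd.crosstab(df["sex"], df["wage_class"], normalize="index")
print(shares.round(3))
```

Here `counts.loc["Male", ">50K"]` is 2, and the corresponding share is 2 out of the 3 male rows.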
