机器学习集成学习进阶Xgboost算法案例分析 2-阿里云开发者社区

机器学习集成学习进阶Xgboost算法案例分析 2

2023-09-21 103

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

简介： 机器学习集成学习进阶Xgboost算法案例分析

4 otto案例介绍

– Otto Group Product Classification Challenge【xgboost实现】

4.1 背景介绍

奥托集团是世界上最大的电子商务公司之一，在20多个国家设有子公司。该公司每天都在世界各地销售数百万种产品,所以对其产品根据性能合理的分类非常重要。

不过,在实际工作中,工作人员发现,许多相同的产品得到了不同的分类。本案例要求,你对奥拓集团的产品进行正确的分分类。尽可能的提供分类的准确性。

链接：https://www.kaggle.com/c/otto-group-product-classification-challenge/overview

4.22 思路分析

1.数据获取

2.数据基本处理

2.1 截取部分数据

2.2 把标签纸转换为数字

2.3 分割数据(使用StratifiedShuffleSplit)

2.4 数据标准化

2.5 数据pca降维

3.模型训练

3.1 基本模型训练

3.2 模型调优

3.2.1 调优参数:

n_estimator,

max_depth,

min_child_weights,

subsamples,

consample_bytrees,

etas

3.2.2 确定最后最优参数

4.3 部分代码实现

2.数据基本处理

2.1 截取部分数据

2.2 把标签值转换为数字

2.3 分割数据(使用StratifiedShuffleSplit)

# 使用StratifiedShuffleSplit对数据集进行分割
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
for train_index, test_index in sss.split(X_resampled.values, y_resampled):
    print(len(train_index))
    print(len(test_index))
    x_train = X_resampled.values[train_index]
    x_val = X_resampled.values[test_index]
    y_train = y_resampled[train_index]
    y_val = y_resampled[test_index]
# 分割数据图形可视化
import seaborn as sns
sns.countplot(y_val)
plt.show()

2.4 数据标准化

from sklearn.preprocessing import StandardScaler

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_val_scaled = scaler.transform(x_val)

2.5 数据pca降维

print(x_train_scaled.shape)
# (13888, 93)
from sklearn.decomposition import PCA
pca = PCA(n_components=0.9)
x_train_pca = pca.fit_transform(x_train_scaled)
x_val_pca = pca.transform(x_val_scaled)
print(x_train_pca.shape, x_val_pca.shape)
(13888, 65) (3473, 65)

从上面输出的数据可以看出,只选择65个元素,就可以表达出特征中90%的信息

# 降维数据可视化
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel("元素数量")
plt.ylabel("可表达信息的百分占比")
plt.show()

3.模型训练

3.1 基本模型训练

from xgboost import XGBClassifier

xgb = XGBClassifier()

xgb.fit(x_train_pca, y_train)

from xgboost import XGBClassifier
xgb = XGBClassifier()
xgb.fit(x_train_pca, y_train)
# 改变预测值的输出模式,让输出结果为百分占比,降低logloss值
y_pre_proba = xgb.predict_proba(x_val_pca)
# logloss进行模型评估
from sklearn.metrics import log_loss
log_loss(y_val, y_pre_proba, eps=1e-15, normalize=True)
xgb.get_params

3.2 模型调优

3.2.1 调优参数:

1） n_estimator

scores_ne = []
n_estimators = [100,200,400,450,500,550,600,700]
for nes in n_estimators:
    print("n_estimators:", nes)
    xgb = XGBClassifier(max_depth=3, 
                        learning_rate=0.1, 
                        n_estimators=nes, 
                        objective="multi:softprob", 
                        n_jobs=-1, 
                        nthread=4, 
                        min_child_weight=1, 
                        subsample=1, 
                        colsample_bytree=1,
                        seed=42)
    xgb.fit(x_train_pca, y_train)
    y_pre = xgb.predict_proba(x_val_pca)
    score = log_loss(y_val, y_pre)
    scores_ne.append(score)
    print("测试数据的logloss值为:{}".format(score))
# 数据变化可视化
plt.plot(n_estimators, scores_ne, "o-")
plt.ylabel("log_loss")
plt.xlabel("n_estimators")
print("n_estimators的最优值为:{}".format(n_estimators[np.argmin(scores_ne)]))

2）max_depth

scores_md = []
max_depths = [1,3,5,6,7]
for md in max_depths:  # 修改
    xgb = XGBClassifier(max_depth=md, # 修改
                        learning_rate=0.1, 
                        n_estimators=n_estimators[np.argmin(scores_ne)],   # 修改 
                        objective="multi:softprob", 
                        n_jobs=-1, 
                        nthread=4, 
                        min_child_weight=1, 
                        subsample=1, 
                        colsample_bytree=1,
                        seed=42)
    xgb.fit(x_train_pca, y_train)
    y_pre = xgb.predict_proba(x_val_pca)
    score = log_loss(y_val, y_pre)
    scores_md.append(score)  # 修改
    print("测试数据的logloss值为:{}".format(log_loss(y_val, y_pre)))
# 数据变化可视化
plt.plot(max_depths, scores_md, "o-")  # 修改
plt.ylabel("log_loss")
plt.xlabel("max_depths")  # 修改
print("max_depths的最优值为:{}".format(max_depths[np.argmin(scores_md)]))  # 修改

3） min_child_weights,

依据上面模式进行调整

4） subsamples,

5） consample_bytrees,

6） etas

3.2.2 确定最后最优参数

xgb = XGBClassifier(learning_rate =0.1, 
                    n_estimators=550, 
                    max_depth=3, 
                    min_child_weight=3, 
                    subsample=0.7, 
                    colsample_bytree=0.7, 
                    nthread=4, 
                    seed=42, 
                    objective='multi:softprob')
xgb.fit(x_train_scaled, y_train)
y_pre = xgb.predict_proba(x_val_scaled)
print("测试数据的logloss值为 : {}".format(log_loss(y_val, y_pre, eps=1e-15, normalize=True)))

机器学习集成学习进阶Xgboost算法案例分析 2

4 otto案例介绍

4.1 背景介绍

4.22 思路分析

4.3 部分代码实现

热门文章

最新文章

相关课程

相关电子书

相关实验场景

热门

活动广场

任务中心

开发者评测

高校计划

乘风者计划

训练营

阿里云MVP

话题

直播

下载

镜像站

技术资料

插件

机器学习集成学习进阶Xgboost算法案例分析 2

4 otto案例介绍

4.1 背景介绍

4.22 思路分析

4.3 部分代码实现

热门文章

最新文章

相关课程

相关电子书

相关实验场景