ML | XGBoost: Translation of an excellent article on XGBoost parameter tuning: "Complete Guide to Parameter Tuning in XGBoost (with codes in Python)" (Part 3)

3. Parameter Tuning with Example


We will take the data set from the Data Hackathon 3.x AV hackathon, the same one used in the GBM article. The details of the problem can be found on the competition page. You can download the data set from here. I have performed the following steps (a small pandas sketch of the missing-flag pattern follows the list):


City variable dropped because of too many categories

DOB converted to Age | DOB dropped

EMI_Loan_Submitted_Missing created which is 1 if EMI_Loan_Submitted was missing else 0 | Original variable EMI_Loan_Submitted dropped

EmployerName dropped because of too many categories

Existing_EMI imputed with 0 (median) since only 111 values were missing

Interest_Rate_Missing created which is 1 if Interest_Rate was missing else 0 | Original variable Interest_Rate dropped

Lead_Creation_Date dropped because it made little intuitive impact on the outcome

Loan_Amount_Applied, Loan_Tenure_Applied imputed with median values

Loan_Amount_Submitted_Missing created which is 1 if Loan_Amount_Submitted was missing else 0 | Original variable Loan_Amount_Submitted dropped

Loan_Tenure_Submitted_Missing created which is 1 if Loan_Tenure_Submitted was missing else 0 | Original variable Loan_Tenure_Submitted dropped

LoggedIn, Salary_Account dropped

Processing_Fee_Missing created which is 1 if Processing_Fee was missing else 0 | Original variable Processing_Fee dropped

Source – top 2 kept as is and all others combined into a different category

Numerical and one-hot encoding performed
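
To make the missing-flag steps above concrete, here is a small pandas sketch. It is only an illustration using the column names from the list (the raw file name Train.csv is hypothetical); the actual steps live in the data_preparation notebook mentioned below.

#Illustrative sketch of the preprocessing above (not the original notebook)
import pandas as pd

data = pd.read_csv('Train.csv')   #hypothetical raw file name

#Missing-indicator pattern: 1 if the value was missing else 0, then drop the original
for col in ['EMI_Loan_Submitted', 'Interest_Rate', 'Loan_Amount_Submitted',
            'Loan_Tenure_Submitted', 'Processing_Fee']:
    data[col + '_Missing'] = data[col].isnull().astype(int)
    data.drop(col, axis=1, inplace=True)

#Simple imputations
data['Existing_EMI'].fillna(0, inplace=True)
data['Loan_Amount_Applied'].fillna(data['Loan_Amount_Applied'].median(), inplace=True)
data['Loan_Tenure_Applied'].fillna(data['Loan_Tenure_Applied'].median(), inplace=True)

#Drop high-cardinality or uninformative columns (Age creation from DOB omitted here)
data.drop(['City', 'DOB', 'EmployerName', 'Lead_Creation_Date',
           'LoggedIn', 'Salary_Account'], axis=1, inplace=True)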

For those who have the original data from the competition, you can check out these steps in the data_preparation iPython notebook in the repository.


Let's start by importing the required libraries and loading the data:


#Import libraries:
import pandas as pd
import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import cross_validation, metrics   #Additional scikit-learn functions
from sklearn.grid_search import GridSearchCV   #Performing grid search
#Note: in newer scikit-learn releases, cross_validation and grid_search were
#merged into sklearn.model_selection

import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 4

train = pd.read_csv('train_modified.csv')
target = 'Disbursed'
IDcol = 'ID'

Note that I have imported 2 forms of XGBoost:

xgb – this is the direct xgboost library. I will use a specific function "cv" from this library.

XGBClassifier – this is an sklearn wrapper for XGBoost. This allows us to use sklearn's Grid Search with parallel processing in the same way we did for GBM.

Before proceeding further, let's define a function which will help us create XGBoost models and perform cross-validation. The best part is that you can take this function as it is and use it later for your own models.


def modelfit(alg, dtrain, predictors, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):

    if useTrainCV:
        #Use xgboost's cv with early stopping to find the optimal number of trees
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
            metrics='auc', early_stopping_rounds=early_stopping_rounds, show_progress=False)
        alg.set_params(n_estimators=cvresult.shape[0])

    #Fit the algorithm on the data
    alg.fit(dtrain[predictors], dtrain['Disbursed'], eval_metric='auc')

    #Predict training set:
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:,1]

    #Print model report:
    print "\nModel Report"
    print "Accuracy : %.4g" % metrics.accuracy_score(dtrain['Disbursed'].values, dtrain_predictions)
    print "AUC Score (Train): %f" % metrics.roc_auc_score(dtrain['Disbursed'], dtrain_predprob)

    #Plot feature importances (in newer xgboost, booster() is get_booster() and
    #show_progress above is verbose_eval)
    feat_imp = pd.Series(alg.booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')

This code is slightly different from what I used for GBM. The focus of this article is to cover the concepts and not coding. Please feel free to drop a note in the comments if you find any challenges in understanding any part of it. Note that xgboost's sklearn wrapper doesn't have a "feature_importances" metric but a get_fscore() function which does the same job (recent versions of the wrapper do expose a feature_importances_ attribute).
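
If you just want the importance plot outside of modelfit, a minimal sketch using xgboost's built-in helper (assuming a fitted model such as xgb1 from the next section; in newer xgboost releases booster() is get_booster()):

#Sketch: plot feature importances from a fitted model (illustrative)
import xgboost as xgb
import matplotlib.pylab as plt

xgb.plot_importance(xgb1)   #bar chart of the F-scores returned by get_fscore()
plt.show()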



General Approach for Parameter Tuning


We will use an approach similar to that of GBM here. The various steps to be performed are:


Choose a relatively high learning rate. Generally a learning rate of 0.1 works, but somewhere between 0.05 and 0.3 should work for different problems. Determine the optimum number of trees for this learning rate. XGBoost has a very useful function called "cv" which performs cross-validation at each boosting iteration and thus returns the optimum number of trees required (a minimal sketch of this appears after the list).

Tune tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree) for the decided learning rate and number of trees. Note that we can choose different parameters to define a tree and I'll take up an example here.

Tune regularization parameters (lambda, alpha) for xgboost which can help reduce model complexity and enhance performance.

Lower the learning rate and decide the optimal parameters.
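
As a pointer for the first step, here is a minimal sketch of how xgb.cv reports the optimum number of boosting rounds with early stopping. It assumes a DMatrix named xgtrain built from the training data, as in the modelfit function above, and uses xgboost's native parameter names.

#Sketch: find the optimal number of trees for a fixed learning rate (illustrative)
params = {'eta': 0.1, 'max_depth': 5, 'objective': 'binary:logistic'}
cvresult = xgb.cv(params, xgtrain, num_boost_round=1000, nfold=5,
                  metrics='auc', early_stopping_rounds=50)
print cvresult.shape[0]   #rows returned = optimum number of boosting rounds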

Let us look at a more detailed step-by-step approach.



Step 1: Fix learning rate and number of estimators for tuning tree-based parameters


In order to decide on boosting parameters, we need to set some initial values of the other parameters. Let's take the following values:


max_depth = 5 : This should be between 3-10. I've started with 5 but you can choose a different number as well. 4-6 can be good starting points.

min_child_weight = 1 : A smaller value is chosen because it is a highly imbalanced class problem and leaf nodes can have smaller size groups.

gamma = 0 : A smaller value like 0.1-0.2 can also be chosen for starting. This will anyway be tuned later.

subsample, colsample_bytree = 0.8 : This is a commonly used start value. Typical values range between 0.5-0.9.

scale_pos_weight = 1 : Because of high class imbalance.

Please note that all the above are just initial estimates and will be tuned later. Let's take the default learning rate of 0.1 here and check the optimum number of trees using the cv function of xgboost. The function defined above will do it for us.


#Choose all predictors except target & IDcols
predictors = [x for x in train.columns if x not in [target, IDcol]]
xgb1 = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    nthread=4,
    scale_pos_weight=1,
    seed=27)
modelfit(xgb1, train, predictors)

As you can see, here we got 140 as the optimal number of estimators for a 0.1 learning rate. Note that this value might be too high for you depending on the power of your system. In that case you can increase the learning rate and re-run the command to get a reduced number of estimators.
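
A quick way to do that with the helper above (an illustrative sketch, not part of the original code):

#Sketch: raise the learning rate so xgb.cv settles on fewer boosting rounds
xgb1.set_params(learning_rate=0.3, n_estimators=1000)
modelfit(xgb1, train, predictors)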


Note: You will see the test AUC as "AUC Score (Test)" in the outputs here. But this would not appear if you try to run the command on your system, as the data is not made public. It's provided here just for reference. The part of the code which generates this output has been removed here.



Step 2: Tune max_depth and min_child_weight


We tune these first as they will have the highest impact on the model outcome. To start with, let's set wider ranges and then we will perform another iteration for smaller ranges.


Important Note: I'll be doing some heavy-duty grid searches in this section, which can take 15-30 mins or even more time to run depending on your system. You can vary the number of values you are testing based on what your system can handle.


param_test1 = {
    'max_depth': range(3,10,2),
    'min_child_weight': range(1,6,2)
}
gsearch1 = GridSearchCV(estimator = XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=5,
    min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
    objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test1, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch1.fit(train[predictors], train[target])
#In newer scikit-learn, grid_scores_ has been replaced by cv_results_ and iid was removed
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

Here, we have run 12 combinations with wider intervals between values. The ideal values are 5 for max_depth and 5 for min_child_weight. Let's go one step deeper and look for optimum values. We'll search for values 1 above and below the optimum values because we took an interval of two.


param_test2 = {
    'max_depth': [4,5,6],
    'min_child_weight': [4,5,6]
}
gsearch2 = GridSearchCV(estimator = XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=5,
    min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
    objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test2, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch2.fit(train[predictors], train[target])
gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_

Here, we get the optimum values as 4 for max_depth and 6 for min_child_weight. Also, we can see the CV score increasing slightly. Note that as the model performance increases, it becomes exponentially difficult to achieve even marginal gains in performance. You would have noticed that here we got 6 as the optimum value for min_child_weight, but we haven't tried values higher than 6. We can do that as follows:


param_test2b = {
    'min_child_weight': [6,8,10,12]
}
gsearch2b = GridSearchCV(estimator = XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=4,
    min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
    objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test2b, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch2b.fit(train[predictors], train[target])
#Re-fit the best estimator from this search with the cv helper
#(the original code referenced gsearch3, which is not defined at this point)
modelfit(gsearch2b.best_estimator_, train, predictors)
gsearch2b.grid_scores_, gsearch2b.best_params_, gsearch2b.best_score_

We see 6 as the optimal value.
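
Before moving on to tune gamma and the sampling parameters, you can lock in the values found so far. A small sketch using the modelfit helper above (an illustration added here, not part of the original article):

#Sketch: re-fit with the tuned tree parameters (max_depth=4, min_child_weight=6)
xgb_tuned = XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=4,
    min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
    objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27)
modelfit(xgb_tuned, train, predictors)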

