ML with XGBoost: binary classification on the Pima Indians Diabetes dataset (predicting diabetes onset within 5 years)

Overview: use an XGBoost model to perform binary classification on the Pima Indians Diabetes dataset, predicting whether a patient will develop diabetes within 5 years.

Output results

X_train contents:

[[  3.    102.     44.    ...  30.8     0.4    26.   ]
 [  1.     77.     56.    ...  33.3     1.251  24.   ]
 [  9.    124.     70.    ...  35.4     0.282  34.   ]
 ...
 [  0.     57.     60.    ...  21.7     0.735  67.   ]
 [  1.    105.     58.    ...  24.3     0.187  21.   ]
 [  8.    179.     72.    ...  32.7     0.719  36.   ]]

y_train contents:

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1.
 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0.
 1. 0. 0. 1. 1. 1. 0. 0. 0. 1. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 0. 1. 1.
 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0.
 0. 1. 1. 0. 0. 1. 0. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0.
 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 1. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 1. 1. 1. 1. 1. 0. 1.
 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1. 1. 1. 0. 1. 0. 1. 1. 0. 0.
 0. 0. 1. 1. 0. 1. 1. 1. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 1. 1. 1. 1.
 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 0.
 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0.
 0. 1. 1. 0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 1.
 1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 1. 1. 1. 0. 1. 0. 0. 1. 0. 0. 1. 0. 1.
 1. 0. 1. 0. 0. 1. 1. 1. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.
 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1.
 0. 1. 0. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 1.
 1. 0. 0. 0. 1. 1. 1. 0. 0. 0. 1. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0.
 1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 1. 1.
 0. 1. 0. 0. 0. 1. 1. 0. 0. 1.]
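
The dumps above are consistent with the usual way this dataset is prepared. A minimal sketch of how X_train and y_train may have been produced (the CSV file name and the split settings are assumptions for illustration, not taken from the original):

import numpy as np
from sklearn.model_selection import train_test_split

# 8 feature columns (pregnancies, glucose, blood pressure, ..., age); label in column 8
dataset = np.loadtxt('pima-indians-diabetes.csv', delimiter=',')
X, y = dataset[:, 0:8], dataset[:, 8]

# Assumed split settings, for illustration only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)
print(X_train)
print(y_train)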



Design approach

(Figure: design flowchart)
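
The flow presumably follows the standard pattern for this dataset: load the data, split it, fit an XGBClassifier, predict on the held-out set, and score accuracy. An end-to-end sketch under that assumption (hyperparameters left at their defaults):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

dataset = np.loadtxt('pima-indians-diabetes.csv', delimiter=',')
X, y = dataset[:, 0:8], dataset[:, 8]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)

model = XGBClassifier()            # defaults: max_depth=3, learning_rate=0.1, n_estimators=100
model.fit(X_train, y_train)

y_pred = model.predict(X_test)     # hard 0/1 labels
print("Accuracy: %.2f%%" % (accuracy_score(y_test, y_pred) * 100.0))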

 

Core code

class XGBClassifier Found at: xgboost.sklearn

class XGBClassifier(XGBModel, XGBClassifierBase):

   # pylint: disable=missing-docstring,too-many-arguments,invalid-name

   __doc__ = "Implementation of the scikit-learn API for XGBoost classification.\n\n" \
       + '\n'.join(XGBModel.__doc__.split('\n')[2:])

   def __init__(self, max_depth=3, learning_rate=0.1,

       n_estimators=100, silent=True,

       objective="binary:logistic", booster='gbtree',

       n_jobs=1, nthread=None, gamma=0, min_child_weight=1,

       max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1,

       reg_alpha=0, reg_lambda=1, scale_pos_weight=1,

       base_score=0.5, random_state=0, seed=None, missing=None, **kwargs):

       super(XGBClassifier, self).__init__(max_depth, learning_rate, n_estimators, silent,

        objective, booster, n_jobs, nthread, gamma, min_child_weight, max_delta_step, subsample,

        colsample_bytree, colsample_bylevel, reg_alpha, reg_lambda, scale_pos_weight,

        base_score, random_state, seed, missing, **kwargs)
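
The constructor just forwards every hyperparameter to XGBModel; note that the default objective="binary:logistic" already matches this article's binary task. A minimal instantiation sketch (the values shown are simply the defaults from the signature above, spelled out for clarity):

from xgboost import XGBClassifier

model = XGBClassifier(
    max_depth=3,                    # maximum depth of each boosted tree
    learning_rate=0.1,              # shrinkage applied to each tree's contribution
    n_estimators=100,               # number of boosting rounds (trees)
    objective='binary:logistic')    # predicts P(y=1) for binary classification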

 

   def fit(self, X, y, sample_weight=None, eval_set=None, eval_metric=None,
       early_stopping_rounds=None, verbose=True, xgb_model=None,
       sample_weight_eval_set=None, callbacks=None):
       # pylint: disable = attribute-defined-outside-init,arguments-differ

       """

       Fit gradient boosting classifier

       Parameters

       ----------

       X : array_like

           Feature matrix

       y : array_like

           Labels

       sample_weight : array_like

           Weight for each instance

       eval_set : list, optional

           A list of (X, y) pairs to use as a validation set for

           early-stopping

       sample_weight_eval_set : list, optional

           A list of the form [L_1, L_2, ..., L_n], where each L_i is a list of

           instance weights on the i-th validation set.

       eval_metric : str, callable, optional

           If a str, should be a built-in evaluation metric to use. See

           doc/parameter.rst. If callable, a custom evaluation metric. The call

           signature is func(y_predicted, y_true) where y_true will be a

           DMatrix object such that you may need to call the get_label

           method. It must return a str, value pair where the str is a name

           for the evaluation and value is the value of the evaluation

           function. This objective is always minimized.

       early_stopping_rounds : int, optional

           Activates early stopping. Validation error needs to decrease at

           least every <early_stopping_rounds> round(s) to continue training.

           Requires at least one item in evals. If there's more than one,

           will use the last. If early stopping occurs, the model will have

           three additional fields: bst.best_score, bst.best_iteration and

           bst.best_ntree_limit (bst.best_ntree_limit is the ntree_limit parameter

           default value in predict method if not any other value is specified).

           (Use bst.best_ntree_limit to get the correct value if num_parallel_tree

           and/or num_class appears in the parameters)

       verbose : bool

           If `verbose` and an evaluation set is used, writes the evaluation

           metric measured on the validation set to stderr.

       xgb_model : str

           file name of stored xgb model or 'Booster' instance Xgb model to be

           loaded before training (allows training continuation).

       callbacks : list of callback functions

           List of callback functions that are applied at end of each iteration.

           It is possible to use predefined callbacks by using :ref:`callback_api`.

           Example:

           .. code-block:: python

               [xgb.callback.reset_learning_rate(custom_rates)]

       """

       evals_result = {}

       self.classes_ = np.unique(y)

       self.n_classes_ = len(self.classes_)

       xgb_options = self.get_xgb_params()

       if callable(self.objective):
           obj = _objective_decorator(self.objective)
           # Use default value. Is it really not used?
           xgb_options["objective"] = "binary:logistic"
       else:
           obj = None
       if self.n_classes_ > 2:
           # Switch to using a multiclass objective in the underlying XGB instance
           xgb_options["objective"] = "multi:softprob"
           xgb_options['num_class'] = self.n_classes_

       feval = eval_metric if callable(eval_metric) else None

       if eval_metric is not None:

           if callable(eval_metric):

               eval_metric = None

           else:

               xgb_options.update({"eval_metric":eval_metric})

       self._le = XGBLabelEncoder().fit(y)

       training_labels = self._le.transform(y)

       if eval_set is not None:

           if sample_weight_eval_set is None:

               sample_weight_eval_set = [None] * len(eval_set)

           evals = list(

               DMatrix(eval_set[i][0], label=self._le.transform(eval_set[i][1]),

                   missing=self.missing, weight=sample_weight_eval_set[i],

                   nthread=self.n_jobs) for

               i in range(len(eval_set)))

           nevals = len(evals)

           eval_names = ["validation_{}".format(i) for i in range(nevals)]

           evals = list(zip(evals, eval_names))

       else:

           evals = ()

       self._features_count = X.shape[1]

       if sample_weight is not None:

           train_dmatrix = DMatrix(X, label=training_labels, weight=sample_weight,

               missing=self.missing, nthread=self.n_jobs)

       else:

           train_dmatrix = DMatrix(X, label=training_labels,

               missing=self.missing, nthread=self.n_jobs)

       self._Booster = train(xgb_options, train_dmatrix, self.n_estimators,

           evals=evals,

           early_stopping_rounds=early_stopping_rounds,

           evals_result=evals_result, obj=obj, feval=feval,

           verbose_eval=verbose, xgb_model=xgb_model,

           callbacks=callbacks)

       self.objective = xgb_options["objective"]

       if evals_result:

           for val in evals_result.items():

               evals_result_key = list(val[1].keys())[0]

               evals_result[val[0]][evals_result_key] = val[1][evals_result_key]

         

           self.evals_result_ = evals_result

       if early_stopping_rounds is not None:

           self.best_score = self._Booster.best_score

           self.best_iteration = self._Booster.best_iteration

           self.best_ntree_limit = self._Booster.best_ntree_limit

       return self
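
A usage sketch for fit() exercising the documented eval_set, eval_metric, and early_stopping_rounds parameters (variable names assume the train/test split sketched earlier; the values are illustrative):

model = XGBClassifier(n_estimators=500)
model.fit(X_train, y_train,
          eval_set=[(X_test, y_test)],     # watched as validation_0
          eval_metric='logloss',           # built-in metric, minimized
          early_stopping_rounds=10,        # stop if logloss stalls for 10 rounds
          verbose=False)
# Set only because early stopping was activated (see the code above):
print(model.best_iteration, model.best_score, model.best_ntree_limit)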

 

   def predict(self, data, output_margin=False, ntree_limit=None, validate_features=True):

       """

       Predict with `data`.

       .. note:: This function is not thread safe.

         For each booster object, predict can only be called from one thread.

          If you want to run prediction using multiple threads, call ``xgb.copy()`` to make copies

         of model object and then call ``predict()``.

       .. note:: Using ``predict()`` with DART booster

         If the booster object is DART type, ``predict()`` will perform dropouts, i.e. only

         some of the trees will be evaluated. This will produce incorrect results if ``data`` is

         not the training data. To obtain correct results on test sets, set ``ntree_limit`` to

         a nonzero value, e.g.

         .. code-block:: python

           preds = bst.predict(dtest, ntree_limit=num_round)

       Parameters

       ----------

       data : DMatrix

           The dmatrix storing the input.

       output_margin : bool

           Whether to output the raw untransformed margin value.

       ntree_limit : int

           Limit number of trees in the prediction; defaults to best_ntree_limit if defined

           (i.e. it has been trained with early stopping), otherwise 0 (use all trees).

       validate_features : bool

           When this is True, validate that the Booster's and data's feature_names are identical.

           Otherwise, it is assumed that the feature_names are the same.

       Returns

       -------

       prediction : numpy array

       """

       test_dmatrix = DMatrix(data, missing=self.missing, nthread=self.n_jobs)

       if ntree_limit is None:

           ntree_limit = getattr(self, "best_ntree_limit", 0)

       class_probs = self.get_booster().predict(test_dmatrix,

           output_margin=output_margin,

           ntree_limit=ntree_limit,

           validate_features=validate_features)

       if output_margin:

           # If output_margin is active, simply return the scores

           return class_probs

       if len(class_probs.shape) > 1:

           column_indexes = np.argmax(class_probs, axis=1)

       else:

           column_indexes = np.repeat(0, class_probs.shape[0])

           column_indexes[class_probs > 0.5] = 1

       return self._le.inverse_transform(column_indexes)
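
As the body above shows, for a binary objective predict() thresholds the predicted probability at 0.5 and maps the winning index back through the internal label encoder, so it returns hard class labels in the original encoding rather than probabilities. A quick check (continuing the earlier sketch):

y_pred = model.predict(X_test)
print(y_pred[:10])    # hard labels, e.g. [0. 1. 0. ...], in y's original dtype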

 

   def predict_proba(self, data, ntree_limit=None, validate_features=True):

       """

       Predict the probability of each `data` example being of a given class.

       .. note:: This function is not thread safe

           For each booster object, predict can only be called from one thread.

            If you want to run prediction using multiple threads, call ``xgb.copy()`` to make
            copies of the model object and then call ``predict()``.

       Parameters

       ----------

       data : DMatrix

           The dmatrix storing the input.

       ntree_limit : int

           Limit number of trees in the prediction; defaults to best_ntree_limit if defined

           (i.e. it has been trained with early stopping), otherwise 0 (use all trees).

       validate_features : bool

           When this is True, validate that the Booster's and data's feature_names are identical.

           Otherwise, it is assumed that the feature_names are the same.

       Returns

       -------

       prediction : numpy array

           a numpy array with the probability of each data example being of a given class.

       """

       test_dmatrix = DMatrix(data, missing=self.missing, nthread=self.n_jobs)

       if ntree_limit is None:

           ntree_limit = getattr(self, "best_ntree_limit", 0)

       class_probs = self.get_booster().predict(test_dmatrix,

           ntree_limit=ntree_limit,

           validate_features=validate_features)

       if self.objective == "multi:softprob":

           return class_probs

       else:

           classone_probs = class_probs

           classzero_probs = 1.0 - classone_probs

           return np.vstack((classzero_probs, classone_probs)).transpose()
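
For the binary objective the booster returns only P(y=1), and the method stacks it with its complement, so the result always has one column per class (continuing the earlier sketch):

proba = model.predict_proba(X_test)
print(proba.shape)    # (n_samples, 2): column 0 = P(class 0), column 1 = P(class 1)
print(proba[:3])      # each row sums to 1.0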

 

   def evals_result(self):

       """Return the evaluation results.

       If **eval_set** is passed to the `fit` function, you can call

       ``evals_result()`` to get evaluation results for all passed **eval_sets**.

       When **eval_metric** is also passed to the `fit` function, the

       **evals_result** will contain the **eval_metrics** passed to the `fit` function.

       Returns

       -------

       evals_result : dictionary

       Example

       -------

       .. code-block:: python

           param_dist = {'objective':'binary:logistic', 'n_estimators':2}

           clf = xgb.XGBClassifier(**param_dist)

           clf.fit(X_train, y_train,

                   eval_set=[(X_train, y_train), (X_test, y_test)],

                   eval_metric='logloss',

                   verbose=True)

           evals_result = clf.evals_result()

       The variable **evals_result** will contain

       .. code-block:: python

           {'validation_0': {'logloss': ['0.604835', '0.531479']},

           'validation_1': {'logloss': ['0.41965', '0.17686']}}

       """

       if self.evals_result_:

           evals_result = self.evals_result_

       else:

           raise XGBoostError('No results.')

       return evals_result

