ML - XGBoost: Binary classification on the Pima Indians Diabetes dataset using an XGBoost model (7f-CrVa + grid-search hyperparameter tuning)

Output results

[Figures: three screenshots of the program output]

Design approach

[Figure: design approach]

Core code

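from sklearn.model_selection import GridSearchCV, StratifiedKFold

# model, learning_rate (the candidate values), X and Y are assumed to be
# defined elsewhere; they are not shown in this excerpt.
param_grid = dict(learning_rate=learning_rate)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
# Rank candidates by negated log loss (higher is better), using all CPU cores.
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, Y)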
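The `scoring="neg_log_loss"` argument in the core code asks the search to rank candidates by log loss with the sign flipped, so that a larger value (closer to 0) means a better model. The sketch below shows that metric for binary labels in plain Python. It is only an illustration under assumptions: the helper name `neg_log_loss`, the clipping constant `eps`, and the sample probabilities are made up here and are not taken from the original script or from sklearn's implementation.

```python
import math


def neg_log_loss(y_true, y_prob, eps=1e-15):
    """Negated mean binary log loss (hypothetical helper); higher is better."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


# Illustrative predicted probabilities for the positive class.
confident = neg_log_loss([1, 0, 1], [0.9, 0.1, 0.8])
hesitant = neg_log_loss([1, 0, 1], [0.6, 0.4, 0.5])
print(confident > hesitant)
```

Because the score is negated, the "best" candidate in a search scored this way is the one with the largest (least negative) mean value across folds.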


class GridSearchCV(BaseSearchCV):
    """Exhaustive search over specified parameter values for an estimator.

    Important members are fit, predict.

    GridSearchCV implements a "fit" and a "score" method.
    It also implements "predict", "predict_proba", "decision_function",
    "transform" and "inverse_transform" if they are implemented in the
    estimator used.

    The parameters of the estimator used to apply these methods are optimized
    by cross-validated grid-search over a parameter grid.

    Read more in the :ref:`User Guide <grid_search>`.

    Parameters
    ----------
    estimator : estimator object.
        This is assumed to implement the scikit-learn estimator interface.
        Either estimator needs to provide a ``score`` function,
        or ``scoring`` must be passed.

    param_grid : dict or list of dictionaries
        Dictionary with parameters names (string) as keys and lists of
        parameter settings to try as values, or a list of such
        dictionaries, in which case the grids spanned by each dictionary
        in the list are explored. This enables searching over any sequence
        of parameter settings.

    scoring : string, callable, list/tuple, dict or None, default: None
        A single string (see :ref:`scoring_parameter`) or a callable
        (see :ref:`scoring`) to evaluate the predictions on the test set.

        For evaluating multiple metrics, either give a list of (unique) strings
        or a dict with names as keys and callables as values.

        NOTE that when using custom scorers, each scorer should return a single
        value. Metric functions returning a list/array of values can be wrapped
        into multiple scorers that return one value each.

        See :ref:`multimetric_grid_search` for an example.

        If None, the estimator's default scorer (if available) is used.

    fit_params : dict, optional
        Parameters to pass to the fit method.

        .. deprecated:: 0.19
           ``fit_params`` as a constructor argument was deprecated in version
           0.19 and will be removed in version 0.21. Pass fit parameters to
           the ``fit`` method instead.

    n_jobs : int, default=1
        Number of jobs to run in parallel.

    pre_dispatch : int, or string, optional
        Controls the number of jobs that get dispatched during parallel
        execution. Reducing this number can be useful to avoid an
        explosion of memory consumption when more jobs get dispatched
        than CPUs can process. This parameter can be:

            - None, in which case all the jobs are immediately
              created and spawned. Use this for lightweight and
              fast-running jobs, to avoid delays due to on-demand
              spawning of the jobs

            - An int, giving the exact number of total jobs that are
              spawned

            - A string, giving an expression as a function of n_jobs,
              as in '2*n_jobs'

    iid : boolean, default=True
        If True, the data is assumed to be identically distributed across
        the folds, and the loss minimized is the total loss per sample,
        and not the mean loss across the folds.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 3-fold cross validation,
          - integer, to specify the number of folds in a `(Stratified)KFold`,
          - An object to be used as a cross-validation generator.
          - An iterable yielding train, test splits.

        For integer/None inputs, if the estimator is a classifier and ``y`` is
        either binary or multiclass, :class:`StratifiedKFold` is used. In all
        other cases, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validation strategies that can be used here.

    refit : boolean, or string, default=True
        Refit an estimator using the best found parameters on the whole
        dataset.

        For multiple metric evaluation, this needs to be a string denoting the
        scorer is used to find the best parameters for refitting the estimator
        at the end.

        The refitted estimator is made available at the ``best_estimator_``
        attribute and permits using ``predict`` directly on this
        ``GridSearchCV`` instance.

        Also for multiple metric evaluation, the attributes ``best_index_``,
        ``best_score_`` and ``best_parameters_`` will only be available if
        ``refit`` is set and all of them will be determined w.r.t this specific
        scorer.

        See ``scoring`` parameter to know more about multiple metric
        evaluation.

    verbose : integer
        Controls the verbosity: the higher, the more messages.

    error_score : 'raise' (default) or numeric
        Value to assign to the score if an error occurs in estimator fitting.
        If set to 'raise', the error is raised. If a numeric value is given,
        FitFailedWarning is raised. This parameter does not affect the refit
        step, which will always raise the error.

    return_train_score : boolean, optional
        If ``False``, the ``cv_results_`` attribute will not include training
        scores.

        Current default is ``'warn'``, which behaves as ``True`` in addition
        to raising a warning when a training score is looked up.
        That default will be changed to ``False`` in 0.21.
        Computing training scores is used to get insights on how different
        parameter settings impact the overfitting/underfitting trade-off.
        However computing the scores on the training set can be computationally
        expensive and is not strictly required to select the parameters that
        yield the best generalization performance.


    Examples
    --------
    >>> from sklearn import svm, datasets
    >>> from sklearn.model_selection import GridSearchCV
    >>> iris = datasets.load_iris()
    >>> parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
    >>> svc = svm.SVC()
    >>> clf = GridSearchCV(svc, parameters)
    >>> clf.fit(iris.data, iris.target)
    ...                             # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
    GridSearchCV(cv=None, error_score=...,
           estimator=SVC(C=1.0, cache_size=..., class_weight=..., coef0=...,
                         decision_function_shape='ovr', degree=..., gamma=...,
                         kernel='rbf', max_iter=-1, probability=False,
                         random_state=None, shrinking=True, tol=...,
                         verbose=False),
           fit_params=None, iid=..., n_jobs=1,
           param_grid=..., pre_dispatch=..., refit=..., return_train_score=...,
           scoring=..., verbose=...)
    >>> sorted(clf.cv_results_.keys())
    ...                             # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
    ['mean_fit_time', 'mean_score_time', 'mean_test_score',...
     'mean_train_score', 'param_C', 'param_kernel', 'params',...
     'rank_test_score', 'split0_test_score',...
     'split0_train_score', 'split1_test_score', 'split1_train_score',...
     'split2_test_score', 'split2_train_score',...
     'std_fit_time', 'std_score_time', 'std_test_score', 'std_train_score'...]

    Attributes
    ----------
    cv_results_ : dict of numpy (masked) ndarrays
        A dict with keys as column headers and values as columns, that can be
        imported into a pandas ``DataFrame``.

        For instance the below given table

        +------------+-----------+------------+-----------------+---+---------+
        |param_kernel|param_gamma|param_degree|split0_test_score|...|rank_t...|
        +============+===========+============+=================+===+=========+
        |  'poly'    |     --    |      2     |        0.8      |...|    2    |
        +------------+-----------+------------+-----------------+---+---------+
        |  'poly'    |     --    |      3     |        0.7      |...|    4    |
        +------------+-----------+------------+-----------------+---+---------+
        |  'rbf'     |     0.1   |     --     |        0.8      |...|    3    |
        +------------+-----------+------------+-----------------+---+---------+
        |  'rbf'     |     0.2   |     --     |        0.9      |...|    1    |
        +------------+-----------+------------+-----------------+---+---------+

        will be represented by a ``cv_results_`` dict of::

            {
            'param_kernel': masked_array(data = ['poly', 'poly', 'rbf', 'rbf'],
                                         mask = [False False False False]...)
            'param_gamma': masked_array(data = [-- -- 0.1 0.2],
                                        mask = [ True  True False False]...),
            'param_degree': masked_array(data = [2.0 3.0 -- --],
                                         mask = [False False  True  True]...),
            'split0_test_score'  : [0.8, 0.7, 0.8, 0.9],
            'split1_test_score'  : [0.82, 0.5, 0.7, 0.78],
            'mean_test_score'    : [0.81, 0.60, 0.75, 0.82],
            'std_test_score'     : [0.02, 0.01, 0.03, 0.03],
            'rank_test_score'    : [2, 4, 3, 1],
            'split0_train_score' : [0.8, 0.9, 0.7],
            'split1_train_score' : [0.82, 0.5, 0.7],
            'mean_train_score'   : [0.81, 0.7, 0.7],
            'std_train_score'    : [0.03, 0.03, 0.04],
            'mean_fit_time'      : [0.73, 0.63, 0.43, 0.49],
            'std_fit_time'       : [0.01, 0.02, 0.01, 0.01],
            'mean_score_time'    : [0.007, 0.06, 0.04, 0.04],
            'std_score_time'     : [0.001, 0.002, 0.003, 0.005],
            'params'             : [{'kernel': 'poly', 'degree': 2}, ...],
            }

        NOTE

        The key ``'params'`` is used to store a list of parameter
        settings dicts for all the parameter candidates.

        The ``mean_fit_time``, ``std_fit_time``, ``mean_score_time`` and
        ``std_score_time`` are all in seconds.

        For multi-metric evaluation, the scores for all the scorers are
        available in the ``cv_results_`` dict at the keys ending with that
        scorer's name (``'_<scorer_name>'``) instead of ``'_score'`` shown
        above. ('split0_test_precision', 'mean_train_precision' etc.)

    best_estimator_ : estimator or dict
        Estimator that was chosen by the search, i.e. estimator
        which gave highest score (or smallest loss if specified)
        on the left out data. Not available if ``refit=False``.

        See ``refit`` parameter for more information on allowed values.

    best_score_ : float
        Mean cross-validated score of the best_estimator

        For multi-metric evaluation, this is present only if ``refit`` is
        specified.

    best_params_ : dict
        Parameter setting that gave the best results on the hold out data.

        For multi-metric evaluation, this is present only if ``refit`` is
        specified.

    best_index_ : int
        The index (of the ``cv_results_`` arrays) which corresponds to the best
        candidate parameter setting.

        The dict at ``search.cv_results_['params'][search.best_index_]`` gives
        the parameter setting for the best model, that gives the highest
        mean score (``search.best_score_``).

        For multi-metric evaluation, this is present only if ``refit`` is
        specified.

    scorer_ : function or a dict
        Scorer function used on the held out data to choose the best
        parameters for the model.

        For multi-metric evaluation, this attribute holds the validated
        ``scoring`` dict which maps the scorer key to the scorer callable.

    n_splits_ : int
        The number of cross-validation splits (folds/iterations).

    Notes
    -----
    The parameters selected are those that maximize the score of the left out
    data, unless an explicit score is passed in which case it is used instead.

    If `n_jobs` was set to a value higher than one, the data is copied for each
    point in the grid (and not `n_jobs` times). This is done for efficiency
    reasons if individual jobs take very little time, but may raise errors if
    the dataset is large and not enough memory is available.  A workaround in
    this case is to set `pre_dispatch`. Then, the memory is copied only
    `pre_dispatch` many times. A reasonable value for `pre_dispatch` is `2 *
    n_jobs`.

    See Also
    --------
    :class:`ParameterGrid`:
        generates all the combinations of a hyperparameter grid.

    :func:`sklearn.model_selection.train_test_split`:
        utility function to split the data into a development set usable
        for fitting a GridSearchCV instance and an evaluation set for
        its final evaluation.

    :func:`sklearn.metrics.make_scorer`:
        Make a scorer from a performance metric or loss function.

    """

    def __init__(self, estimator, param_grid, scoring=None, fit_params=None,
                 n_jobs=1, iid=True, refit=True, cv=None, verbose=0,
                 pre_dispatch='2*n_jobs', error_score='raise',
                 return_train_score="warn"):
        super(GridSearchCV, self).__init__(
            estimator=estimator, scoring=scoring, fit_params=fit_params,
            n_jobs=n_jobs, iid=iid, refit=refit, cv=cv, verbose=verbose,
            pre_dispatch=pre_dispatch, error_score=error_score,
            return_train_score=return_train_score)
        self.param_grid = param_grid
        _check_param_grid(param_grid)

    def _get_param_iterator(self):
        """Return ParameterGrid instance for the given param_grid"""
        return ParameterGrid(self.param_grid)
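
The class source describes two steps that are easy to picture in plain Python: expanding `param_grid` into every parameter combination, and choosing the candidate with the highest mean score across folds. The sketch below is a simplified illustration, not sklearn's actual implementation. The helper names `expand_grid` and `pick_best` are hypothetical, ties are broken by candidate order (an assumption made for brevity), and the fold scores are the illustrative `split0_test_score` / `split1_test_score` values from the docstring.

```python
from itertools import product
from statistics import mean


def expand_grid(param_grid):
    """Yield one dict per parameter combination (hypothetical helper)."""
    # Accept a single dict or a list of dicts, as the docstring describes.
    grids = [param_grid] if isinstance(param_grid, dict) else list(param_grid)
    for grid in grids:
        keys = sorted(grid)  # fixed key order keeps the output deterministic
        for values in product(*(grid[k] for k in keys)):
            yield dict(zip(keys, values))


def pick_best(fold_scores):
    """Return (means, ranks, best_index) from per-candidate fold scores.

    Simplification: ties are broken by candidate order.
    """
    means = [mean(scores) for scores in fold_scores]
    # Candidate indices sorted from highest to lowest mean score.
    order = sorted(range(len(means)), key=lambda i: means[i], reverse=True)
    ranks = [0] * len(means)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return means, ranks, order[0]


# The grid from the docstring's Examples section: 2 kernels x 2 values of C.
candidates = list(expand_grid({'kernel': ('linear', 'rbf'), 'C': [1, 10]}))

# Scores of four candidates on two folds (split0, split1), copied from the
# illustrative cv_results_ table in the docstring.
fold_scores = [[0.8, 0.82], [0.7, 0.5], [0.8, 0.7], [0.9, 0.78]]
means, ranks, best_index = pick_best(fold_scores)
print(len(candidates), ranks, best_index)
```

With these illustrative scores the ranking matches the `rank_test_score` column shown in the docstring, and `best_index` points at the last candidate.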

