ML with sklearn: A Detailed Guide to the Code and Usage of sklearn's make_pipeline, RobustScaler, KFold, and cross_val_score Functions (Part 2)


Using RobustScaler


lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.5, random_state=1))

ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.5, l1_ratio=.9, random_state=3))
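
The two pipelines above are shown without their imports or any data. The following is a minimal sketch of how they might be run end to end. The synthetic data from `make_regression`, the injected outlier rows, and the 5-fold shuffled `KFold` are assumptions made for illustration and are not part of the original article. It also checks that `RobustScaler` centers each feature on its median, which is why it is less sensitive to outliers than mean-based scaling.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic regression data (an assumption, for illustration only).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X[:5] *= 50.0  # inject a few outlier rows

# RobustScaler centers each feature on its median (and scales by its IQR).
scaler = RobustScaler().fit(X)
print(np.allclose(scaler.center_, np.median(X, axis=0)))

# The same two pipelines as in the article.
lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.5, random_state=1))
enet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.5, l1_ratio=0.9, random_state=3))

# The scaler is refit inside every training fold, so no test-fold statistics leak in.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
for name, model in [("lasso", lasso), ("enet", enet)]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    print(name, scores.mean())
```

Wrapping the scaler and the model in one pipeline is the design choice that matters here: `cross_val_score` clones and fits the whole pipeline per fold, rather than scaling the full dataset once up front.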




Code walkthrough and usage of sklearn's KFold


KFold source code explained


class KFold Found at: sklearn.model_selection._split

class KFold(_BaseKFold):

   """K-Folds cross-validator

   Provides train/test indices to split data in train/test sets. Split  dataset into k consecutive folds (without shuffling by default).

   Each fold is then used once as a validation while the k - 1 remaining  folds form the training set.

   Read more in the :ref:`User Guide <cross_validation>`.

   Parameters

   ----------

   n_splits : int, default=3

   Number of folds. Must be at least 2.

 

   shuffle : boolean, optional

   Whether to shuffle the data before splitting into batches.

 

   random_state : int, RandomState instance or None, optional,

    default=None

   If int, random_state is the seed used by the random number  generator;

   If RandomState instance, random_state is the random number  generator;

   If None, the random number generator is the RandomState  instance used  by `np.random`. Used when ``shuffle`` == True.


   Examples

   --------

   >>> from sklearn.model_selection import KFold

   >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])

   >>> y = np.array([1, 2, 3, 4])

   >>> kf = KFold(n_splits=2)

   >>> kf.get_n_splits(X)

   2

   >>> print(kf)  # doctest: +NORMALIZE_WHITESPACE

   KFold(n_splits=2, random_state=None, shuffle=False)

   >>> for train_index, test_index in kf.split(X):

   ...    print("TRAIN:", train_index, "TEST:", test_index)

   ...    X_train, X_test = X[train_index], X[test_index]

   ...    y_train, y_test = y[train_index], y[test_index]

   TRAIN: [2 3] TEST: [0 1]

   TRAIN: [0 1] TEST: [2 3]

 

   Notes

   -----

   The first ``n_samples % n_splits`` folds have size

   ``n_samples // n_splits + 1``, other folds have size

   ``n_samples // n_splits``, where ``n_samples`` is the number of

    samples.

 

   See also

   --------

   StratifiedKFold

   Takes group information into account to avoid building folds with  imbalanced class distributions (for binary or multiclass  classification tasks).

   GroupKFold: K-fold iterator variant with non-overlapping groups.

   RepeatedKFold: Repeats K-Fold n times.

   """

   def __init__(self, n_splits=3, shuffle=False,

       random_state=None):

       super(KFold, self).__init__(n_splits, shuffle, random_state)

 

   def _iter_test_indices(self, X, y=None, groups=None):

       n_samples = _num_samples(X)

       indices = np.arange(n_samples)

       if self.shuffle:

           check_random_state(self.random_state).shuffle(indices)

       n_splits = self.n_splits

       fold_sizes = (n_samples // n_splits) * np.ones(n_splits, dtype=np.int)

       fold_sizes[:n_samples % n_splits] += 1

       current = 0

       for fold_size in fold_sizes:

           start, stop = current, current + fold_size

           yield indices[start:stop]

           current = stop  
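
The fold-size arithmetic in `_iter_test_indices` can be checked in isolation. The sketch below uses a hypothetical helper, `fold_sizes`, that is not part of scikit-learn and only mirrors the two lines that compute the sizes. It uses the built-in `int` rather than `np.int`, because the `np.int` alias seen in the quoted source was removed in NumPy 1.24.

```python
import numpy as np


def fold_sizes(n_samples, n_splits):
    """Hypothetical helper: size of each fold, as computed in _iter_test_indices."""
    # Every fold gets n_samples // n_splits samples...
    sizes = np.full(n_splits, n_samples // n_splits, dtype=int)
    # ...and the first n_samples % n_splits folds get one extra sample.
    sizes[: n_samples % n_splits] += 1
    return sizes


print(fold_sizes(10, 3))  # [4 3 3]
print(fold_sizes(4, 2))   # [2 2]
```

This matches the Notes section of the docstring: with 10 samples and 3 splits, the first `10 % 3 = 1` fold has `10 // 3 + 1 = 4` samples and the others have 3.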



Using KFold


   Examples

   --------

   >>> import numpy as np

   >>> from sklearn.model_selection import KFold

   >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])

   >>> y = np.array([1, 2, 3, 4])

   >>> kf = KFold(n_splits=2)

   >>> kf.get_n_splits(X)

   2

   >>> print(kf)  # doctest: +NORMALIZE_WHITESPACE

   KFold(n_splits=2, random_state=None, shuffle=False)

   >>> for train_index, test_index in kf.split(X):

   ...    print("TRAIN:", train_index, "TEST:", test_index)

   ...    X_train, X_test = X[train_index], X[test_index]

   ...    y_train, y_test = y[train_index], y[test_index]

   TRAIN: [2 3] TEST: [0 1]

   TRAIN: [0 1] TEST: [2 3]
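
The example above uses the default `shuffle=False`, so the folds are consecutive blocks. The sketch below shows the shuffled variant on a toy array (the array and the seed are assumptions for illustration). Each sample still lands in exactly one test fold, and the fold sizes still follow the rule from the Notes section. Note also that the docstring quoted earlier comes from an older release where `n_splits` defaulted to 3; since scikit-learn 0.22 the default is 5, so passing `n_splits` explicitly avoids surprises.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features

# shuffle=True permutes the sample indices once before slicing them into folds;
# random_state makes the permutation reproducible.
kf = KFold(n_splits=3, shuffle=True, random_state=0)

test_sizes = []
seen = []
for train_index, test_index in kf.split(X):
    test_sizes.append(len(test_index))
    seen.extend(test_index.tolist())
    print("TRAIN:", train_index, "TEST:", test_index)

print(test_sizes)                       # [4, 3, 3]
print(sorted(seen) == list(range(10)))  # True
```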






Code walkthrough and usage of sklearn's cross_val_score


cross_val_score source code explained


def cross_val_score Found at: sklearn.model_selection._validation

def cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None,  n_jobs=1, verbose=0, fit_params=None,   pre_dispatch='2*n_jobs'):

   """Evaluate a score by cross-validation

   Read more in the :ref:`User Guide <cross_validation>`.


 Parameters

   ----------

   estimator : estimator object implementing 'fit' The object to use to fit the data.

 

   X : array-like

   The data to fit. Can be for example a list, or an array.

 

   y : array-like, optional, default: None

   The target variable to try to predict in the case of  supervised learning.

 

   groups : array-like, with shape (n_samples,), optional

   Group labels for the samples used while splitting the dataset into  train/test set.

 

   scoring : string, callable or None, optional, default: None

   A string (see model evaluation documentation) or a scorer callable object / function with signature  ``scorer(estimator, X, y)``.

 

   cv : int, cross-validation generator or an iterable, optional

   Determines the cross-validation splitting strategy.

   Possible inputs for cv are:

   - None, to use the default 3-fold cross validation,

   - integer, to specify the number of folds in a `(Stratified)KFold`,

   - An object to be used as a cross-validation generator.

   - An iterable yielding train, test splits.

   For integer/None inputs, if the estimator is a classifier and ``y`` is  either binary or multiclass, :class:`StratifiedKFold` is used. In all  other cases, :class:`KFold` is used.

 

   Refer :ref:`User Guide <cross_validation>` for the various cross-validation strategies that can be used here.

 

   n_jobs : integer, optional

   The number of CPUs to use to do the computation. -1 means   'all CPUs'.

 

   verbose : integer, optional

   The verbosity level.

 

   fit_params : dict, optional

   Parameters to pass to the fit method of the estimator.

 

   pre_dispatch : int, or string, optional

   Controls the number of jobs that get dispatched during parallel  execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched  than CPUs can process. This parameter can be:

 

   - None, in which case all the jobs are immediately  created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs

   - An int, giving the exact number of total jobs that are spawned

   - A string, giving an expression as a function of n_jobs, as in '2*n_jobs'

 

   Returns

   -------

   scores : array of float, shape=(len(list(cv)),)

   Array of scores of the estimator for each run of the cross validation.


   Examples

   --------

   >>> from sklearn import datasets, linear_model

   >>> from sklearn.model_selection import cross_val_score

   >>> diabetes = datasets.load_diabetes()

   >>> X = diabetes.data[:150]

   >>> y = diabetes.target[:150]

   >>> lasso = linear_model.Lasso()

   >>> print(cross_val_score(lasso, X, y))  # doctest: +ELLIPSIS

   [ 0.33150734  0.08022311  0.03531764]

 

   See Also

   ---------

   :func:`sklearn.model_selection.cross_validate`:

   To run cross-validation on multiple metrics and also to return  train scores, fit times and score times.

 

   :func:`sklearn.metrics.make_scorer`:

   Make a scorer from a performance metric or loss function.

 

   """

   # To ensure multimetric format is not supported

   scorer = check_scoring(estimator, scoring=scoring)

   cv_results = cross_validate(estimator=estimator, X=X, y=y, groups=groups,

       scoring={'score':scorer}, cv=cv,

       return_train_score=False,

       n_jobs=n_jobs, verbose=verbose,

       fit_params=fit_params,

       pre_dispatch=pre_dispatch)

   return cv_results['test_score']
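
As the source above shows, `cross_val_score` is a thin wrapper: it calls `cross_validate` with a single scorer and returns only the `test_score` entry. A minimal sketch of that relationship follows, assuming the full diabetes dataset, an explicit `cv=3`, and the `r2` scorer (choices made here for illustration).

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score, cross_validate

X, y = load_diabetes(return_X_y=True)
model = Lasso()

# cross_val_score returns one test score per fold.
scores = cross_val_score(model, X, y, cv=3, scoring="r2")

# cross_validate returns a dict that also holds fit times, score times
# and, when requested, training scores.
results = cross_validate(model, X, y, cv=3, scoring="r2", return_train_score=True)

print(np.allclose(scores, results["test_score"]))  # True
print(sorted(results.keys()))
```

Use `cross_validate` when several metrics or the training scores are needed; `cross_val_score` is enough when one array of test scores is all that is wanted.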

Available values for the scoring parameter


https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter


| Scoring | Function | Comment |
|---|---|---|
| Classification | | |
| 'accuracy' | metrics.accuracy_score | |
| 'balanced_accuracy' | metrics.balanced_accuracy_score | |
| 'average_precision' | metrics.average_precision_score | |
| 'neg_brier_score' | metrics.brier_score_loss | |
| 'f1' | metrics.f1_score | for binary targets |
| 'f1_micro' | metrics.f1_score | micro-averaged |
| 'f1_macro' | metrics.f1_score | macro-averaged |
| 'f1_weighted' | metrics.f1_score | weighted average |
| 'f1_samples' | metrics.f1_score | by multilabel sample |
| 'neg_log_loss' | metrics.log_loss | requires predict_proba support |
| 'precision' etc. | metrics.precision_score | suffixes apply as with 'f1' |
| 'recall' etc. | metrics.recall_score | suffixes apply as with 'f1' |
| 'jaccard' etc. | metrics.jaccard_score | suffixes apply as with 'f1' |
| 'roc_auc' | metrics.roc_auc_score | |
| 'roc_auc_ovr' | metrics.roc_auc_score | |
| 'roc_auc_ovo' | metrics.roc_auc_score | |
| 'roc_auc_ovr_weighted' | metrics.roc_auc_score | |
| 'roc_auc_ovo_weighted' | metrics.roc_auc_score | |
| Clustering | | |
| 'adjusted_mutual_info_score' | metrics.adjusted_mutual_info_score | |
| 'adjusted_rand_score' | metrics.adjusted_rand_score | |
| 'completeness_score' | metrics.completeness_score | |
| 'fowlkes_mallows_score' | metrics.fowlkes_mallows_score | |
| 'homogeneity_score' | metrics.homogeneity_score | |
| 'mutual_info_score' | metrics.mutual_info_score | |
| 'normalized_mutual_info_score' | metrics.normalized_mutual_info_score | |
| 'v_measure_score' | metrics.v_measure_score | |
| Regression | | |
| 'explained_variance' | metrics.explained_variance_score | |
| 'max_error' | metrics.max_error | |
| 'neg_mean_absolute_error' | metrics.mean_absolute_error | |
| 'neg_mean_squared_error' | metrics.mean_squared_error | |
| 'neg_root_mean_squared_error' | metrics.mean_squared_error | |
| 'neg_mean_squared_log_error' | metrics.mean_squared_log_error | |
| 'neg_median_absolute_error' | metrics.median_absolute_error | |
| 'r2' | metrics.r2_score | |
| 'neg_mean_poisson_deviance' | metrics.mean_poisson_deviance | |
| 'neg_mean_gamma_deviance' | metrics.mean_gamma_deviance | |
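
The `neg_` prefix on the loss-based names reflects the convention that a higher score is always better, so losses are returned negated. The sketch below, assuming the diabetes dataset and a `Lasso(alpha=0.1)` model chosen only for illustration, shows how to turn `neg_mean_squared_error` back into an RMSE and how a scorer built with `make_scorer` lines up with the built-in string name.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
model = Lasso(alpha=0.1)

# Built-in string scorer: values are negated MSE, so they are <= 0.
neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
rmse = np.sqrt(-neg_mse)  # flip the sign back before taking the root
print(rmse.mean())

# A custom scorer from a loss function: greater_is_better=False negates it.
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)
neg_mae = cross_val_score(model, X, y, cv=5, scoring=mae_scorer)

# It agrees with the built-in 'neg_mean_absolute_error' name.
builtin_neg_mae = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print(np.allclose(neg_mae, builtin_neg_mae))  # True
```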



Using cross_val_score


1. Regression: the diabetes dataset


   >>> from sklearn import datasets, linear_model

   >>> from sklearn.model_selection import cross_val_score

   >>> diabetes = datasets.load_diabetes()

   >>> X = diabetes.data[:150]

   >>> y = diabetes.target[:150]

   >>> lasso = linear_model.Lasso()

   >>> print(cross_val_score(lasso, X, y))  # doctest: +ELLIPSIS

   [ 0.33150734  0.08022311  0.03531764]
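
The output above has three values because it comes from a release in which `cv=None` meant 3-fold cross-validation; since scikit-learn 0.22 the default is 5-fold. A hedged sketch of how to keep the number of folds fixed across versions is to pass `cv` explicitly. For a regressor such as `Lasso`, an integer `cv` is equivalent to an unshuffled `KFold` with that many splits. The exact scores may still differ from those printed above depending on the library version.

```python
from sklearn import datasets, linear_model
from sklearn.model_selection import KFold, cross_val_score

diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
lasso = linear_model.Lasso()

# An integer cv with a regressor uses KFold(n_splits=cv) without shuffling...
scores_int = cross_val_score(lasso, X, y, cv=3)

# ...so passing the splitter explicitly gives the same three scores.
scores_kfold = cross_val_score(lasso, X, y, cv=KFold(n_splits=3))

print(scores_int)
print(scores_kfold)
```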


2. Classification: the iris dataset


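from sklearn import datasets  # built-in datasets
from sklearn.model_selection import train_test_split, cross_val_score  # data splitting, cross-validation
from sklearn.neighbors import KNeighborsClassifier  # a simple supervised classifier whose main hyperparameter is K (unlike K-means, which is unsupervised clustering)
import matplotlib.pyplot as plt

iris = datasets.load_iris()  # load the iris dataset bundled with sklearn
X = iris.data    # feature matrix
y = iris.target  # class label of each sample

# Hold out 1/3 of the data as a test set; cross-validation runs on the training set only.
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=1/3, random_state=3)

k_range = range(1, 31)
cv_scores = []  # mean cross-validation accuracy for each K

for n in k_range:
    # With a single hyperparameter a plain loop is enough; for several hyperparameters use GridSearchCV.
    knn = KNeighborsClassifier(n_neighbors=n)
    # cv=10: 10-fold cross-validation; scoring='accuracy' is also what a classifier uses by default.
    scores = cross_val_score(knn, train_X, train_y, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

plt.plot(k_range, cv_scores)
plt.xlabel('K')
plt.ylabel('Accuracy')  # pick the best K from the plot
plt.show()

best_knn = KNeighborsClassifier(n_neighbors=3)  # K=3 is the value the author read off the plot
best_knn.fit(train_X, train_y)  # train on the full training set
print(best_knn.score(test_X, test_y))  # accuracy on the held-out test set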
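
The comment in the loop above mentions `GridSearchCV` for the case of several hyperparameters. A minimal sketch of that alternative follows. The second hyperparameter, `weights`, is an assumption added here only to show a two-dimensional grid; the split and the range of K match the example above. The search picks the best combination from the cross-validation scores instead of reading it off a plot.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
train_X, test_X, train_y, test_y = train_test_split(
    iris.data, iris.target, test_size=1/3, random_state=3)

# Every combination in the grid is scored with 10-fold cross-validation.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={
        "n_neighbors": list(range(1, 31)),
        "weights": ["uniform", "distance"],  # assumed extra hyperparameter
    },
    cv=10,
    scoring="accuracy",
)
search.fit(train_X, train_y)  # refits the best model on the whole training set

print(search.best_params_)
print(search.best_score_)              # mean CV accuracy of the best combination
print(search.score(test_X, test_y))    # accuracy on the held-out test set
```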





