III. Model Training and Evaluation
1. Dataset splitting
from sklearn.model_selection import train_test_split, cross_val_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=2023)
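A small aside: when the two classes are not perfectly balanced, a stratified split keeps the class ratio the same in both partitions. A minimal sketch, assuming the same X and y as above (this variant is not used in the rest of the section, which keeps the plain split):

# Hypothetical alternative: preserve the class ratio of y in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=2023, stratify=y
)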
2. Data standardization
- Import the StandardScaler class and instantiate a StandardScaler object scaler. Calling fit makes StandardScaler estimate, for each feature, the parameters μ (sample mean) and σ (standard deviation) from the training data.
- Calling transform then standardizes the training and test data using the estimated μ and σ; the return value is the standardized data.
from sklearn.preprocessing import StandardScaler
help(StandardScaler)
Help on class StandardScaler in module sklearn.preprocessing._data:

class StandardScaler(sklearn.base._OneToOneFeatureMixin, sklearn.base.TransformerMixin, sklearn.base.BaseEstimator)
 |  StandardScaler(*, copy=True, with_mean=True, with_std=True)
 |
 |  Standardize features by removing the mean and scaling to unit variance.
 |
 |  The standard score of a sample `x` is calculated as:
 |
 |      z = (x - u) / s
 |
 |  where `u` is the mean of the training samples or zero if `with_mean=False`,
 |  and `s` is the standard deviation of the training samples or one if
 |  `with_std=False`.
 |
 |  Centering and scaling happen independently on each feature by computing
 |  the relevant statistics on the samples in the training set. Mean and
 |  standard deviation are then stored to be used on later data using
 |  :meth:`transform`.
 |
 |  Key attributes: `mean_`, `var_`, `scale_` (per-feature statistics learned
 |  by `fit`). Key methods: `fit`, `transform`, `fit_transform`,
 |  `inverse_transform`, `partial_fit`.
 |
 |  ... (remainder of help output omitted)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on the training data, then transform it
X_test = scaler.transform(X_test)        # reuse the training-set mu and sigma on the test data
print(X_train[0])
[-0.7710306 1.41036889 1.08508956 1.25031642 1.39864376 1.39096463 -0.72288062 0.93078432 -0.70710678 1.36833491 -0.73479518 1.39096463 0.88551735 1.53202723 -0.72288062]
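To make the fit/transform mechanics above concrete, here is a minimal self-contained toy check with made-up numbers (not data from this project): fit estimates mean_ and scale_ per feature, and transform applies z = (x - μ) / σ.

import numpy as np
from sklearn.preprocessing import StandardScaler

toy = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
sc = StandardScaler().fit(toy)                       # estimates mu (mean_) and sigma (scale_) per column
z = sc.transform(toy)                                # applies z = (x - mu) / sigma
assert np.allclose(z, (toy - sc.mean_) / sc.scale_)  # transform is exactly the z-score formula
print(sc.mean_)   # [ 3. 30.]
print(sc.scale_)  # [ 1.63299316 16.32993162]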
3. Random forest training
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_prdrf = rf.predict(X_test)
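After fitting, the forest also exposes per-feature importance scores, which can help interpret the model. A quick, minimal way to inspect them, assuming rf is the fitted model above (the ranking code itself is illustrative, but feature_importances_ is the standard scikit-learn attribute):

import numpy as np

importances = rf.feature_importances_  # impurity-based importance, one score per feature
order = np.argsort(importances)[::-1]  # feature indices, most important first
for i in order[:5]:
    print(f"feature {i}: {importances[i]:.3f}")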
4. Model evaluation
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, y_prdrf))
cvs_rf = round(cross_val_score(rf, X, y, scoring="accuracy", cv=10).mean(), 2)
print("Cross validation score for Random Forest Classifier model is:", cvs_rf)
              precision    recall  f1-score   support

           0       0.95      0.99      0.97        79
           1       0.98      0.93      0.95        56

    accuracy                           0.96       135
   macro avg       0.97      0.96      0.96       135
weighted avg       0.96      0.96      0.96       135

Cross validation score for Random Forest Classifier model is: 0.96
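One caveat worth noting: cross_val_score above runs on the raw, unscaled X, while the held-out evaluation used standardized data. Random forests are insensitive to feature scaling, so this is harmless here, but for scale-sensitive models the scaler should be refit inside each fold. A minimal sketch using scikit-learn's Pipeline (not part of the original code):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Scaling is refit on each training fold, so no statistics leak from the validation fold.
pipe = Pipeline([("scaler", StandardScaler()), ("rf", RandomForestClassifier())])
print(cross_val_score(pipe, X, y, scoring="accuracy", cv=10).mean())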
5. Plotting the confusion matrix
import matplotlib.pyplot as plt
import seaborn as sns

sns.heatmap(confusion_matrix(y_test, y_prdrf), annot=True, cmap='viridis')
plt.xlabel("Predicted")
plt.ylabel("Truth")
plt.title("Confusion matrix- Random Forest Classifier")
plt.show()
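To connect the heatmap back to the classification report, the class-1 precision and recall can be read directly off the matrix. A minimal check (cm and the unpacked counts are illustrative names, not from the original code):

cm = confusion_matrix(y_test, y_prdrf)
tn, fp, fn, tp = cm.ravel()          # binary case: rows are truth, columns are predictions
print("precision:", tp / (tp + fp))  # matches the class-1 precision in the report (0.98)
print("recall:", tp / (tp + fn))     # matches the class-1 recall in the report (0.93)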
As the matrix shows, the classifier is quite accurate: only a handful of samples fall off the diagonal, consistent with the 0.96 accuracy and cross-validation score above.