Standardization and normalization: the mathematical principles and their code implementation
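Reference article: ML / FE: data processing, the three transformations of feature engineering (standardization [four data types: numeric, categorical, string, datetime], normalization, vectorization): introduction, code implementation, and case studies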
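The StandardScaler class: introduction and usage
Note: when standardizing training and test data with sklearn.preprocessing, use two different methods, so that the test data is scaled with the statistics learned from the training data:
Training data: use fit_transform()
Test data: use transform()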
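The split between the two methods can be sketched as follows. This is a minimal illustration with made-up toy arrays (X_train and X_test are hypothetical names, not from the original article).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration only.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
# fit_transform learns mean_ and scale_ from the training data, then scales it.
X_train_std = scaler.fit_transform(X_train)
# transform reuses the training statistics; nothing is re-fitted on the test data.
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis=0))  # approximately 0 for each feature
print(X_test_std)
```

Calling fit_transform on the test data instead would compute a separate mean and standard deviation from the test set, so the two sets would no longer be on the same scale.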
Introduction to StandardScaler
"""Standardize features by removing the mean and scaling to unit variance
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using the `transform` method.
Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
For instance many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
This scaler can also be applied to sparse CSR or CSC matrices by passing `with_mean=False` to avoid breaking the sparsity structure of the data.
Read more in the :ref:`User Guide <preprocessing_scaler>`.
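Key points of the docstring above:
StandardScaler standardizes each feature independently by removing its mean and scaling it to unit variance, using statistics computed on the training set.
The per-feature mean and standard deviation are stored on the fitted scaler, so that the transform method can apply them to later data.
PS: because these statistics are stored, scaled data can also be restored to its original values.
Many estimators (for example SVMs with an RBF kernel, or linear models with L1/L2 regularization) assume features centered at 0 with variances of the same order of magnitude; a feature whose variance is orders of magnitude larger can dominate the objective function.
For sparse CSR or CSC matrices, pass with_mean=False so that the sparsity structure of the data is preserved.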
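The arithmetic behind the description is z = (x - mu) / sigma, applied per feature, where mu is the feature mean and sigma its population standard deviation (NumPy's default ddof=0). The sketch below recomputes this by hand and compares it with StandardScaler; the variable names are chosen for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]])

mu = X.mean(axis=0)          # per-feature mean
sigma = X.std(axis=0)        # per-feature population standard deviation (ddof=0)
Z_manual = (X - mu) / sigma  # z = (x - mu) / sigma

Z_sklearn = StandardScaler().fit_transform(X)
print(np.allclose(Z_manual, Z_sklearn))  # True
```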
Parameters
----------
copy : boolean, optional, default True
If False, try to avoid a copy and do inplace scaling instead.
This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.
with_mean : boolean, True by default
If True, center the data before scaling.
This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
with_std : boolean, True by default
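If True, scale the data to unit variance (or equivalently, unit standard deviation).
Parameter notes:
copy: default True. If False, scaling is attempted in place, but a copy may still be returned, e.g. when the input is not a NumPy array or a scipy.sparse CSR matrix.
with_mean: default True. Centers the data before scaling. This raises an exception on sparse matrices, because centering them requires building a dense matrix that is often too large to fit in memory.
with_std: default True. Scales the data to unit variance (equivalently, unit standard deviation).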
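The with_mean restriction on sparse input can be seen in a short sketch. The small matrix below is made up for illustration; the point is only that the default settings raise on sparse data, while with_mean=False scales each column and leaves the zeros untouched.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Toy sparse matrix with two non-zero entries (illustrative data).
X_sparse = sparse.csr_matrix(np.array([[0.0, 2.0], [0.0, 0.0], [4.0, 0.0]]))

# Centering would turn zeros into non-zeros, so the default (with_mean=True) raises.
try:
    StandardScaler().fit(X_sparse)
    raised = False
except ValueError:
    raised = True
print(raised)

# with_mean=False only divides each column by its standard deviation.
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X_sparse)
print(sparse.issparse(X_scaled), X_scaled.nnz)
```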
Attributes
----------
scale_ : ndarray, shape (n_features,)
Per feature relative scaling of the data.
.. versionadded:: 0.17
*scale_*
mean_ : array of floats with shape [n_features]
The mean value for each feature in the training set.
var_ : array of floats with shape [n_features]
The variance for each feature in the training set. Used to compute `scale_`
n_samples_seen_ : int
The number of samples processed by the estimator. Will be reset on new calls to fit, but increments across ``partial_fit`` calls.
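Attribute notes:
scale_: ndarray of shape (n_features,). The per-feature scaling factor, which is also the standard deviation of each feature (added in version 0.17).
mean_: array of floats with shape [n_features]. The mean of each feature in the training set.
var_: array of floats with shape [n_features]. The variance of each feature in the training set; used to compute scale_.
n_samples_seen_: int. The number of samples processed by the estimator. It is reset by each new call to fit, but accumulates across partial_fit calls.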
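A quick way to inspect these attributes, and to see how n_samples_seen_ accumulates across partial_fit calls, is sketched below. The two-batch split is an arbitrary choice made for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]])

scaler = StandardScaler().fit(X)
print(scaler.mean_)            # per-feature mean
print(scaler.var_)             # per-feature variance
print(scaler.scale_)           # sqrt(var_), the divisor used by transform
print(scaler.n_samples_seen_)  # number of rows seen by fit

# partial_fit accumulates statistics across batches instead of resetting them.
inc = StandardScaler()
inc.partial_fit(X[:2])
inc.partial_fit(X[2:])
print(inc.n_samples_seen_, np.allclose(inc.mean_, scaler.mean_))
```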
See also
--------
scale: Equivalent function without the estimator API.
:class:`sklearn.decomposition.PCA`
Further removes the linear correlation across features with 'whiten=True'.
Notes
-----
For a comparison of the different scalers, transformers, and normalizers,
see :ref:`examples/preprocessing/plot_all_scaling.py
<sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.
Notes on the references above:
scale: the equivalent function without the estimator API.
sklearn.decomposition.PCA: with whiten=True, additionally removes the linear correlation between features.
A comparison of the different scalers, transformers, and normalizers is given in examples/preprocessing/plot_all_scaling.py.
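To illustrate the "equivalent function without the estimator API" remark, the sketch below compares sklearn.preprocessing.scale with StandardScaler on a made-up array.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, scale

X = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])

X_func = scale(X)                          # one-shot function: no fitted object is kept
X_est = StandardScaler().fit_transform(X)  # estimator: statistics stay on the object

print(np.allclose(X_func, X_est))  # True
```

The function is convenient for a single array, but it keeps no mean or standard deviation, so it cannot apply the training statistics to a separate test set.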
StandardScaler: a worked example
from sklearn.preprocessing import StandardScaler
data = [[0, 0], [0, 0], [1, 1], [1, 1]]
scaler = StandardScaler()
print(scaler.fit(data))
# StandardScaler(copy=True, with_mean=True, with_std=True)
# (output of older scikit-learn versions; recent versions print StandardScaler())
print(scaler.mean_) # [ 0.5 0.5]
print(scaler.transform(data))
# [[-1. -1.]
# [-1. -1.]
# [ 1. 1.]
# [ 1. 1.]]
print(scaler.transform([[2, 2]])) #[[ 3. 3.]]
The fit_transform method
Introduction to fit_transform
"""Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters
----------
X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns
-------
X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
In short: X is the training set with shape [n_samples, n_features], y holds the optional target values with shape [n_samples], and the return value X_new is the transformed array with shape [n_samples, n_features_new].
# non-optimized default implementation; override when a better method is possible for a given clustering algorithm
Usage of fit_transform
# fit_transform, found at: sklearn.base
def fit_transform(self, X, y=None, **fit_params):
    """Fit to data, then transform it."""
    # non-optimized default implementation; override when a better
    # method is possible for a given clustering algorithm
    if y is None:
        # fit method of arity 1 (unsupervised transformation)
        return self.fit(X, **fit_params).transform(X)
    else:
        # fit method of arity 2 (supervised transformation)
        return self.fit(X, y, **fit_params).transform(X)
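Because this default lives in TransformerMixin, any class that defines fit and transform gets fit_transform for free. The sketch below uses a hypothetical MeanCenterer class, made up for this illustration and not part of scikit-learn, to show the inherited behavior.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer for illustration: subtracts the column means."""

    def fit(self, X, y=None):
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self  # returning self is what lets fit(...).transform(...) chain

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.mean_

X = np.array([[1.0, 2.0], [3.0, 6.0]])

out = MeanCenterer().fit_transform(X)       # inherited from TransformerMixin
chained = MeanCenterer().fit(X).transform(X)
print(out)
```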
The transform method: introduction and usage
Introduction to transform
"""Perform standardization by centering and scaling
Parameters
----------
X : array-like, shape [n_samples, n_features]
The data used to scale along the features axis.
y : (ignored)
.. deprecated:: 0.19
This parameter will be removed in 0.21.
copy : bool, optional (default: None)
Copy the input X or not.
"""
In short: transform standardizes X by centering and scaling it along the features axis; y is ignored (deprecated in 0.19, to be removed in 0.21), and copy controls whether the input X is copied.
Usage of transform
# transform, found at: sklearn.preprocessing.data
def transform(self, X, y='deprecated', copy=None):
    if not isinstance(y, string_types) or y != 'deprecated':
        warnings.warn("The parameter y on transform() is "
                      "deprecated since 0.19 and will be removed in 0.21",
                      DeprecationWarning)

    check_is_fitted(self, 'scale_')

    copy = copy if copy is not None else self.copy
    X = check_array(X, accept_sparse='csr', copy=copy, warn_on_dtype=True,
                    estimator=self, dtype=FLOAT_DTYPES)

    if sparse.issparse(X):
        if self.with_mean:
            raise ValueError(
                "Cannot center sparse matrices: pass `with_mean=False` "
                "instead. See docstring for motivation and alternatives.")
        if self.scale_ is not None:
            inplace_column_scale(X, 1 / self.scale_)
    else:
        if self.with_mean:
            X -= self.mean_
        if self.with_std:
            X /= self.scale_
    return X
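Two steps of the source above can be checked from the outside: the check_is_fitted guard, and the dense branch that subtracts mean_ and divides by scale_. The sketch below assumes a current scikit-learn install and uses illustrative variable names.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]])
scaler = StandardScaler()

# Calling transform before fit trips the check_is_fitted guard.
try:
    scaler.transform(X)
    guard_triggered = False
except NotFittedError:
    guard_triggered = True

scaler.fit(X)
X_new = np.array([[2.0, 2.0]])

# Dense branch by hand: X -= mean_, then X /= scale_.
manual = (X_new - scaler.mean_) / scaler.scale_
print(guard_triggered, manual, scaler.transform(X_new))
```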
The inverse_transform method: introduction and usage
Introduction to inverse_transform
"""Scale back the data to the original representation
Parameters
----------
X : array-like, shape [n_samples, n_features]
The data used to scale along the features axis.
copy : bool, optional (default: None)
Copy the input X or not.
Returns
-------
X_tr : array-like, shape [n_samples, n_features]
Transformed array.
"""
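In short: inverse_transform scales the data back to its original representation. X is the array-like data of shape [n_samples, n_features] to scale back, copy controls whether the input X is copied, and the return value X_tr is the transformed array of the same shape.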
Usage of inverse_transform
# inverse_transform, found at: sklearn.preprocessing.data
def inverse_transform(self, X, copy=None):
    check_is_fitted(self, 'scale_')

    copy = copy if copy is not None else self.copy
    if sparse.issparse(X):
        if self.with_mean:
            raise ValueError(
                "Cannot uncenter sparse matrices: pass `with_mean=False` "
                "instead See docstring for motivation and alternatives.")
        if not sparse.isspmatrix_csr(X):
            X = X.tocsr()
            copy = False
        if copy:
            X = X.copy()
        if self.scale_ is not None:
            inplace_column_scale(X, self.scale_)
    else:
        X = np.asarray(X)
        if copy:
            X = X.copy()
        if self.with_std:
            X *= self.scale_
        if self.with_mean:
            X += self.mean_
    return X
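The dense branch above multiplies by scale_ and then adds mean_, which undoes transform. A round trip on made-up data can be sketched as follows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration only.
X = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 80.0]])

scaler = StandardScaler()
Z = scaler.fit_transform(X)

X_back = scaler.inverse_transform(Z)          # library round trip
manual_back = Z * scaler.scale_ + scaler.mean_  # same arithmetic by hand

print(np.allclose(X_back, X), np.allclose(manual_back, X))
```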