ML之FE: Five Commonly Used Dataset-Splitting Methods in Feature Engineering (Special Split Types, Such as Time-Series Splitting): Explanation and Code

Overview: this installment covers the five dataset-splitting methods commonly used in feature engineering, focusing on the special split types such as time-series splitting, together with the relevant code.

Special Types of Data Splitting


5.1 Time-Series Splitting: TimeSeriesSplit


class TimeSeriesSplit (found at sklearn.model_selection._split)

class TimeSeriesSplit(_BaseKFold):

   """Time Series cross-validator .. versionadded:: 0.18

 

   Provides train/test indices to split time series data samples that are observed at fixed time intervals, in train/test sets. In each split, test indices must be higher than before, and thus shuffling in the cross validator is inappropriate. This cross-validation object is a variation of :class:`KFold`. In the kth split, it returns the first k folds as the train set and the (k+1)th fold as the test set.

   Note that unlike standard cross-validation methods, successive training sets are supersets of those that come before them.

   Read more in the :ref:`User Guide <cross_validation>`.

 

   Parameters

   ----------

   n_splits : int, default=5. Number of splits. Must be at least 2. .. versionchanged:: 0.22: ``n_splits`` default value changed from 3 to 5.

   max_train_size : int, default=None. Maximum size for a single training set.


   Examples
   --------
   >>> import numpy as np
   >>> from sklearn.model_selection import TimeSeriesSplit
   >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
   >>> y = np.array([1, 2, 3, 4, 5, 6])
   >>> tscv = TimeSeriesSplit()
   >>> print(tscv)
   TimeSeriesSplit(max_train_size=None, n_splits=5)
   >>> for train_index, test_index in tscv.split(X):
   ...     print("TRAIN:", train_index, "TEST:", test_index)
   ...     X_train, X_test = X[train_index], X[test_index]
   ...     y_train, y_test = y[train_index], y[test_index]
   TRAIN: [0] TEST: [1]
   TRAIN: [0 1] TEST: [2]
   TRAIN: [0 1 2] TEST: [3]
   TRAIN: [0 1 2 3] TEST: [4]
   TRAIN: [0 1 2 3 4] TEST: [5]

   Notes
   -----
   The training set has size ``i * n_samples // (n_splits + 1) + n_samples % (n_splits + 1)`` in the ``i``th split, with a test set of size ``n_samples // (n_splits + 1)``, where ``n_samples`` is the number of samples.
   """

    @_deprecate_positional_args
    def __init__(self, n_splits=5, *, max_train_size=None):
        super().__init__(n_splits, shuffle=False, random_state=None)
        self.max_train_size = max_train_size

    def split(self, X, y=None, groups=None):
        """Generate indices to split data into training and test set.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features). Training data, where n_samples is the number of samples and n_features is the number of features.
        y : array-like of shape (n_samples,). Always ignored, exists for compatibility.
        groups : array-like of shape (n_samples,). Always ignored, exists for compatibility.

        Yields
        ------
        train : ndarray. The training set indices for that split.
        test : ndarray. The testing set indices for that split.
        """
        X, y, groups = indexable(X, y, groups)
        n_samples = _num_samples(X)
        n_splits = self.n_splits
        n_folds = n_splits + 1
        if n_folds > n_samples:
            raise ValueError(
                ("Cannot have number of folds ={0} greater than the "
                 "number of samples: {1}.").format(n_folds, n_samples))
        indices = np.arange(n_samples)
        # Every test fold has the same length; any remainder enlarges the first train set.
        test_size = n_samples // n_folds
        test_starts = range(test_size + n_samples % n_folds, n_samples, test_size)
        for test_start in test_starts:
            if self.max_train_size and self.max_train_size < test_start:
                # Rolling window: keep only the most recent max_train_size samples.
                yield (indices[test_start - self.max_train_size:test_start],
                       indices[test_start:test_start + test_size])
            else:
                # Expanding window: train on everything observed before the test fold.
                yield indices[:test_start], indices[test_start:test_start + test_size]
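When `max_train_size` is set, the first branch above turns the expanding window into a rolling one: the training set is capped at the most recent `max_train_size` samples. A small follow-on example with the same `X` as before (the expected output is traced by hand from the `split` logic shown above):

   >>> tscv = TimeSeriesSplit(n_splits=5, max_train_size=2)
   >>> for train_index, test_index in tscv.split(X):
   ...     print("TRAIN:", train_index, "TEST:", test_index)
   TRAIN: [0] TEST: [1]
   TRAIN: [0 1] TEST: [2]
   TRAIN: [1 2] TEST: [3]
   TRAIN: [2 3] TEST: [4]
   TRAIN: [3 4] TEST: [5]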

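The size formula from the Notes can also be checked numerically. A minimal sketch, assuming `n_samples` is divisible by `n_splits + 1` so the two integer divisions in the formula agree:

   import numpy as np
   from sklearn.model_selection import TimeSeriesSplit

   n_samples, n_splits = 12, 3
   X = np.arange(n_samples).reshape(-1, 1)

   for i, (train, test) in enumerate(TimeSeriesSplit(n_splits=n_splits).split(X), start=1):
       # Notes formula: i * n_samples // (n_splits + 1) + n_samples % (n_splits + 1)
       assert len(train) == i * n_samples // (n_splits + 1) + n_samples % (n_splits + 1)
       assert len(test) == n_samples // (n_splits + 1)

With 12 samples and 3 splits this yields train sizes 3, 6, 9 and a constant test size of 3.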
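In practice the validator is usually handed to a model-selection utility rather than iterated by hand. A hedged usage sketch (the `Ridge` estimator and the random data are illustrative placeholders, not part of the scikit-learn docstring):

   import numpy as np
   from sklearn.linear_model import Ridge
   from sklearn.model_selection import TimeSeriesSplit, cross_val_score

   rng = np.random.RandomState(0)
   X = rng.rand(100, 3)  # 100 time-ordered samples, 3 features
   y = rng.rand(100)

   # Each fold trains only on the past and evaluates on the future,
   # so no shuffling occurs and no future information leaks into training.
   scores = cross_val_score(Ridge(), X, y, cv=TimeSeriesSplit(n_splits=5))
   print(scores)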

