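1. Feature Processing
This section continues from the previous post and covers feature engineering. In data competitions, feature processing is often one of the most important parts, which fits the 80/20 rule: writing code and tuning models usually takes less time than processing the features.
For time-series feature processing, we will learn to use the tool tsfresh (Time Series Feature extraction based on scalable hypothesis tests).
tsfresh is an open-source Python package for extracting features from time-series data. It computes a large number of features automatically (the default settings produce 787 columns for the single series in the run below), which makes it something of a Swiss Army knife for time-series feature extraction.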
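- Data preprocessing
Formatting the time-series data and adding the time-step feature time
- Feature engineering
Time-series feature construction
Feature selection
Using tsfresh for time-series feature processing
Importing the packages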
# Import packages
import pandas as pd
import numpy as np
import tsfresh as tsf
from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute

data_train = pd.read_csv("train.csv")
data_test_A = pd.read_csv("testA.csv")
print(data_train.shape)
print(data_test_A.shape)

(100000, 3)
(20000, 2)
data_train.head()
index | id | heartbeat_signals | label |
0 | 0 | 0.9912297987616655,0.9435330436439665,0.764677... | 0.0 |
1 | 1 | 0.9714822034884503,0.9289687459588268,0.572932... | 0.0 |
2 | 2 | 1.0,0.9591487564065292,0.7013782792997189,0.23... | 2.0 |
3 | 3 | 0.9757952826275774,0.9340884687738161,0.659636... | 0.0 |
4 | 4 | 0.0,0.055816398940721094,0.26129357194994196,0... | 2.0 |
data_test_A.head()
index | id | heartbeat_signals |
0 | 100000 | 0.9915713654170097,1.0,0.6318163407681274,0.13... |
1 | 100001 | 0.6075533139615096,0.5417083883163654,0.340694... |
2 | 100002 | 0.9752726292239277,0.6710965234906665,0.686758... |
3 | 100003 | 0.9956348033996116,0.9170249621481004,0.521096... |
4 | 100004 | 1.0,0.8879490481178918,0.745564725322326,0.531... |
Data preprocessing
Split the training set's heartbeat_signals column on commas, then stack the result into one long column.
# Reshape the heartbeat signals from wide to long format and add a time-step feature, time, to each signal
train_heartbeat_df = data_train["heartbeat_signals"].str.split(",", expand=True).stack()
train_heartbeat_df = train_heartbeat_df.reset_index()
train_heartbeat_df = train_heartbeat_df.set_index("level_0")
train_heartbeat_df.index.name = None
train_heartbeat_df.rename(columns={"level_1": "time", 0: "heartbeat_signals"}, inplace=True)
train_heartbeat_df["heartbeat_signals"] = train_heartbeat_df["heartbeat_signals"].astype(float)

train_heartbeat_df
index | time | heartbeat_signals |
0 | 0 | 0.991230 |
0 | 1 | 0.943533 |
0 | 2 | 0.764677 |
0 | 3 | 0.618571 |
0 | 4 | 0.379632 |
... | ... | ... |
99999 | 200 | 0.000000 |
99999 | 201 | 0.000000 |
99999 | 202 | 0.000000 |
99999 | 203 | 0.000000 |
99999 | 204 | 0.000000 |
20500000 rows × 2 columns
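To make explicit what the split/stack step does, here is a minimal sketch of the same reshaping written in plain Python. The two toy rows and the helper name to_long_format are my own assumptions for illustration and are not part of the competition data or of pandas.

```python
# Toy stand-in for data_train: each row has an id and a comma-separated signal string.
# The values are made up for illustration.
rows = [
    {"id": 0, "heartbeat_signals": "0.99,0.94,0.76"},
    {"id": 1, "heartbeat_signals": "0.97,0.93,0.57"},
]


def to_long_format(rows):
    """Return one (id, time, value) record per time step of every signal."""
    long_rows = []
    for row in rows:
        values = row["heartbeat_signals"].split(",")
        # enumerate() plays the role of the "time" column created by stack()
        for time_step, value in enumerate(values):
            long_rows.append((row["id"], time_step, float(value)))
    return long_rows


long_rows = to_long_format(rows)
for record in long_rows:
    print(record)
```

Each input row becomes as many output rows as it has time steps, which is why 100000 samples of 205 steps turn into 20500000 rows.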
# Join the processed heartbeat signals back onto the training data and store the label column separately
data_train_label = data_train["label"]
data_train = data_train.drop("label", axis=1)
data_train = data_train.drop("heartbeat_signals", axis=1)
data_train = data_train.join(train_heartbeat_df)

data_train
index | id | time | heartbeat_signals |
0 | 0 | 0 | 0.991230 |
0 | 0 | 1 | 0.943533 |
0 | 0 | 2 | 0.764677 |
0 | 0 | 3 | 0.618571 |
0 | 0 | 4 | 0.379632 |
... | ... | ... | ... |
99999 | 99999 | 200 | 0.000000 |
99999 | 99999 | 201 | 0.000000 |
99999 | 99999 | 202 | 0.000000 |
99999 | 99999 | 203 | 0.000000 |
99999 | 99999 | 204 | 0.000000 |
20500000 rows × 3 columns
data_train[data_train["id"]==1]
index | id | time | heartbeat_signals |
1 | 1 | 0 | 0.971482 |
1 | 1 | 1 | 0.928969 |
1 | 1 | 2 | 0.572933 |
1 | 1 | 3 | 0.178457 |
1 | 1 | 4 | 0.122962 |
... | ... | ... | ... |
1 | 1 | 200 | 0.000000 |
1 | 1 | 201 | 0.000000 |
1 | 1 | 202 | 0.000000 |
1 | 1 | 203 | 0.000000 |
1 | 1 | 204 | 0.000000 |
205 rows × 3 columns
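As you can see, each sample's ECG feature consists of heartbeat signals over 205 time steps.
The dataset is too large to process comfortably, and tsfresh extracts more than seven hundred features per series, so here I use only a small part of the data.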
data_train.head()
index | id | time | heartbeat_signals |
0 | 0 | 0 | 0.991230 |
0 | 0 | 1 | 0.943533 |
0 | 0 | 2 | 0.764677 |
0 | 0 | 3 | 0.618571 |
0 | 0 | 4 | 0.379632 |
data_train.shape
(20500000, 3)
# Keep only the first 30 rows, i.e. the first 30 time steps of sample id 0
portion_data_train = data_train.iloc[:30, :]
portion_data_train
index | id | time | heartbeat_signals |
0 | 0 | 0 | 0.991230 |
0 | 0 | 1 | 0.943533 |
0 | 0 | 2 | 0.764677 |
0 | 0 | 3 | 0.618571 |
0 | 0 | 4 | 0.379632 |
0 | 0 | 5 | 0.190822 |
0 | 0 | 6 | 0.040237 |
0 | 0 | 7 | 0.025995 |
0 | 0 | 8 | 0.031709 |
0 | 0 | 9 | 0.065524 |
0 | 0 | 10 | 0.125531 |
0 | 0 | 11 | 0.146747 |
0 | 0 | 12 | 0.167656 |
0 | 0 | 13 | 0.193374 |
0 | 0 | 14 | 0.226135 |
0 | 0 | 15 | 0.221143 |
0 | 0 | 16 | 0.236067 |
0 | 0 | 17 | 0.221143 |
0 | 0 | 18 | 0.221143 |
0 | 0 | 19 | 0.211107 |
0 | 0 | 20 | 0.208587 |
0 | 0 | 21 | 0.193374 |
0 | 0 | 22 | 0.195920 |
0 | 0 | 23 | 0.198462 |
0 | 0 | 24 | 0.185706 |
0 | 0 | 25 | 0.195920 |
0 | 0 | 26 | 0.183142 |
0 | 0 | 27 | 0.193374 |
0 | 0 | 28 | 0.190822 |
0 | 0 | 29 | 0.208587 |
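Using tsfresh for time-series feature processing
The package also includes methods for feature-importance evaluation and feature selection, so tsfresh is a good choice for feature extraction in both classification and regression problems on time-series data.
from tsfresh import extract_features

# Feature extraction
train_features = extract_features(portion_data_train, column_id='id', column_sort='time')
train_features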
Feature Extraction: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:09<00:00, 9.45s/it]
index | heartbeat_signals__variance_larger_than_standard_deviation | heartbeat_signals__has_duplicate_max | heartbeat_signals__has_duplicate_min | heartbeat_signals__has_duplicate | heartbeat_signals__sum_values | heartbeat_signals__abs_energy | heartbeat_signals__mean_abs_change | heartbeat_signals__mean_change | heartbeat_signals__mean_second_derivative_central | heartbeat_signals__median | ... | heartbeat_signals__permutation_entropy__dimension_5__tau_1 | heartbeat_signals__permutation_entropy__dimension_6__tau_1 | heartbeat_signals__permutation_entropy__dimension_7__tau_1 | heartbeat_signals__query_similarity_count__query_None__threshold_0.0 | heartbeat_signals__matrix_profile__feature_"min"__threshold_0.98 | heartbeat_signals__matrix_profile__feature_"max"__threshold_0.98 | heartbeat_signals__matrix_profile__feature_"mean"__threshold_0.98 | heartbeat_signals__matrix_profile__feature_"median"__threshold_0.98 | heartbeat_signals__matrix_profile__feature_"25"__threshold_0.98 | heartbeat_signals__matrix_profile__feature_"75"__threshold_0.98 |
0 | 0.0 | 0.0 | 0.0 | 1.0 | 7.975869 | 3.811619 | 0.044806 | -0.026988 | 0.001169 | 0.19592 | ... | 2.778225 | 2.955209 | 3.062529 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 rows × 787 columns
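To give a feel for what a few of these column names mean, the sketch below recomputes five of the simpler features by hand from the 30 values of sample id 0 printed earlier. The formulas are my reading of the feature names (sum, sum of squares, mean absolute difference, mean difference, median) and the helper names are my own, so treat it as an illustration, not as the tsfresh implementation.

```python
from statistics import median

# The 30 displayed values of sample id 0 (rounded to 6 decimals in the printout).
signal = [
    0.991230, 0.943533, 0.764677, 0.618571, 0.379632, 0.190822,
    0.040237, 0.025995, 0.031709, 0.065524, 0.125531, 0.146747,
    0.167656, 0.193374, 0.226135, 0.221143, 0.236067, 0.221143,
    0.221143, 0.211107, 0.208587, 0.193374, 0.195920, 0.198462,
    0.185706, 0.195920, 0.183142, 0.193374, 0.190822, 0.208587,
]


def abs_energy(x):
    """Sum of squared values."""
    return sum(v * v for v in x)


def mean_abs_change(x):
    """Mean absolute difference between consecutive values."""
    return sum(abs(b - a) for a, b in zip(x, x[1:])) / (len(x) - 1)


def mean_change(x):
    """Mean difference between consecutive values; the sum telescopes to last minus first."""
    return (x[-1] - x[0]) / (len(x) - 1)


features = {
    "sum_values": sum(signal),
    "abs_energy": abs_energy(signal),
    "mean_abs_change": mean_abs_change(signal),
    "mean_change": mean_change(signal),
    "median": median(signal),
}

for name, value in features.items():
    print(f"{name}: {value:.6f}")
```

Because the inputs are the rounded values from the printout, the results should agree with the corresponding columns in the row above only up to rounding.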
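Feature selection
train_features contains 787 common time-series features of heartbeat_signals (see the official documentation for an explanation of each one). Some of these features may be NaN because the current data does not support computing them. Use impute as follows to replace the NaN values:
from tsfresh.utilities.dataframe_functions import impute
import warnings
warnings.filterwarnings("ignore")

# Replace the NaN values in the extracted features
train_features = impute(train_features)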
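In the extraction output above, the trailing columns (the query_similarity_count and matrix_profile features) are all NaN; these are the values that impute fills in.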
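As I understand the tsfresh documentation, impute works column by column: NaN is replaced by the column median, positive and negative infinity by the column maximum and minimum, and a column with no finite values at all is filled with zeros. The sketch below mimics that rule for a single column in plain Python; impute_column is a hypothetical helper of mine, not a tsfresh function.

```python
import math
from statistics import median


def impute_column(values):
    """Replace NaN/inf in one column using the column's own finite values."""
    finite = [v for v in values if math.isfinite(v)]
    if not finite:
        # No finite value to derive a replacement from: fill with zeros
        return [0.0] * len(values)
    col_median, col_min, col_max = median(finite), min(finite), max(finite)
    result = []
    for v in values:
        if math.isnan(v):
            result.append(col_median)
        elif v == math.inf:
            result.append(col_max)
        elif v == -math.inf:
            result.append(col_min)
        else:
            result.append(v)
    return result


mixed = impute_column([1.0, math.nan, 3.0, math.inf, -math.inf])
all_nan = impute_column([math.nan, math.nan])
print(mixed)
print(all_nan)
```

With a single extracted row, as in this post, an all-NaN column has no finite value to borrow from, so under this rule it would end up as zero.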
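Next, select features according to their relevance to the response variable. This takes two steps: first compute the relevance of each feature to the response variable individually, then apply the Benjamini-Yekutieli procedure to decide which features to keep.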
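The second step can be sketched in a few lines. Given one p-value per feature, the Benjamini-Yekutieli procedure sorts them, compares the k-th smallest against k·α/(m·c(m)) with c(m) = 1 + 1/2 + ... + 1/m, and keeps every feature up to the largest k that passes. The function name, the example p-values, and alpha = 0.05 are assumptions of mine for illustration; this is not the tsfresh code.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected (feature kept)."""
    m = len(p_values)
    if m == 0:
        return []
    # Correction factor that makes the procedure valid under arbitrary dependence
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the p-values in ascending order
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            largest_k = rank
    keep = [False] * m
    for idx in order[:largest_k]:
        keep[idx] = True
    return keep


# Hypothetical p-values for five features
p_values = [0.6, 0.001, 0.041, 0.008, 0.039]
selected = benjamini_yekutieli(p_values, alpha=0.05)
print(selected)
```

In this example only the features with p-values 0.001 and 0.008 survive, although 0.039 and 0.041 would pass a plain 0.05 threshold; the correction is what keeps the expected share of falsely selected features under control.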
from tsfresh import select_features

# Select features according to their relevance to the label
train_features_filtered = select_features(train_features, data_train_label[:2])

train_features_filtered

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-48-f2da600384d2> in <module>
      2 
      3 # Select features according to their relevance to the label
----> 4 train_features_filtered = select_features(train_features, data_train_label[:2])
      5 
      6 train_features_filtered

E:\My_ruan_jian\anaconda3\lib\site-packages\tsfresh\feature_selection\selection.py in select_features(X, y, test_for_binary_target_binary_feature, test_for_binary_target_real_feature, test_for_real_target_binary_feature, test_for_real_target_real_feature, fdr_level, hypotheses_independent, n_jobs, show_warnings, chunksize, ml_task, multiclass, n_significant)
    151     )
    152     assert len(y) > 1, "y must contain at least two samples."
--> 153     assert len(X) == len(y), "X and y must contain the same number of samples."
    154     assert (
    155         len(set(y)) > 1

AssertionError: X and y must contain the same number of samples.
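This raises an error because the two inputs do not match: train_features has a single row (the 30 retained rows all belong to sample id 0), while data_train_label[:2] contains two labels. The mismatch comes from my shrinking the training set so that the code would finish; running on the full data takes too long and uses a lot of memory. The modified part is the portion_data_train slice above.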
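One way to avoid this kind of mismatch is to subset by sample id instead of by row count, and to slice the labels with the same count. The sketch below uses a made-up toy frame (3 samples, 4 time steps each) and hypothetical variable names; it only illustrates the bookkeeping and does not call tsfresh.

```python
import pandas as pd

# Toy long-format data: 3 samples with 4 time steps each (values are made up).
toy = pd.DataFrame({
    "id": [i for i in range(3) for _ in range(4)],
    "time": [t for _ in range(3) for t in range(4)],
    "heartbeat_signals": [0.1 * k for k in range(12)],
})
toy_labels = pd.Series([0.0, 2.0, 0.0])  # one label per sample id

n_samples = 2

# Row-based slicing can cut a series short: sample 1 keeps only 1 of its 4 steps.
by_rows = toy.iloc[:5, :]

# Id-based filtering keeps every time step of the first n_samples series,
# and the labels are sliced with the same count so the lengths agree.
by_ids = toy[toy["id"] < n_samples]
labels_subset = toy_labels.iloc[:n_samples]

print(by_rows.groupby("id").size().tolist())
print(by_ids.groupby("id").size().tolist())
print(by_ids["id"].nunique(), len(labels_subset))
```

Since extract_features returns one row per id, a frame subset this way would yield as many feature rows as there are labels in labels_subset. The truncated traceback also shows a further check on the number of distinct values in y, so the chosen samples presumably need to cover more than one label.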
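If you find any errors in this post, please contact the author so they can be corrected promptly.
The road ahead is long and far; I shall search high and low.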