A Detailed Explanation of tsfresh Features in Python


tsfresh is an open-source Python package for extracting features from time-series data. It can compute more than 64 kinds of features, making it something of a Swiss Army knife for time-series feature extraction. I have been using it a lot recently for a project, and since there is no Chinese documentation yet and some of the feature definitions are hard to understand, I am collecting here the ones I have figured out so far. For the ones I have not understood yet I only list the name, and I will add notes once I do.
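
Before going through the individual calculators, here is a minimal sketch of how they are usually consumed through the top-level extract_features API. The DataFrame layout, values and the choice of MinimalFCParameters below are illustrative; only the tsfresh functions themselves come from the library.

```python
import pandas as pd
from tsfresh import extract_features
from tsfresh.feature_extraction import MinimalFCParameters

# Long-format input: one row per observation, with an id column and a sort column.
df = pd.DataFrame({
    "id":    [1] * 5 + [2] * 5,
    "time":  list(range(5)) * 2,
    "value": [1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
})

features = extract_features(
    df,
    column_id="id",
    column_sort="time",
    default_fc_parameters=MinimalFCParameters(),  # restrict to a small feature set
)
print(features.head())
```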


tsfresh.feature_extraction.feature_calculators.abs_energy(x)

The sum of the squared values of the time series (its absolute energy):

$$E = \sum_{i=1,...,n}x_i^2$$


Parameters: x (pandas.Series) – the time series to calculate the feature of

Returns: the value of this feature

Return type: float

This function is of type: simple
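
As a quick sanity check of the formula, the calculator can also be called directly on an array (the values below are made up):

```python
import numpy as np
from tsfresh.feature_extraction import feature_calculators as fc

x = np.array([1.0, 2.0, -3.0, 4.0])
print(fc.abs_energy(x))   # 1 + 4 + 9 + 16 = 30.0
print(np.sum(x ** 2))     # the same sum of squares computed by hand
```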


tsfresh.feature_extraction.feature_calculators.absolute_sum_of_changes(x)

Returns the sum over the absolute values of consecutive changes in the series x:

$$\sum_{i=1,...,n-1} | x_{i+1} - x_i|$$


Parameters: x (pandas.Series) – the time series to calculate the feature of

Returns: the value of this feature

Return type: float

This function is of type: simple


tsfresh.feature_extraction.feature_calculators.agg_autocorrelation(x, param)

Computes the autocorrelation of the series aggregated by the function f_agg (for example the variance or the mean). To some extent this measures the periodicity of the data: l denotes the lag, and if the value computed for some lag l is relatively large, the series tends to have a period of l.

$$\frac{1}{n-1} \sum_{l=1,\ldots,n} \frac{1}{(n-l)\sigma^2} \sum_{t=1}^{n-l}(X_t -\mu)(X_{t+l} -\mu)$$


where n is the length of the time series $X_i$, $\sigma^2$ is its variance, and $\mu$ is its mean.

Parameters: x (pandas.Series) – the time series to calculate the feature of

param (list) – contains dictionaries {"f_agg": s, "maxlag": n} with s a string (the name of a numpy aggregation function such as "mean", "var", "std" or "median") and n an int (the maximal lag to consider)

Returns: the different feature values

Return type: pandas.Series

This function is of type: combiner
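
A hedged sketch of how param is passed for this combiner; the sine-wave data and the maxlag value are purely illustrative:

```python
import numpy as np
from tsfresh.feature_extraction import feature_calculators as fc

# A noisy sine wave with period 20, so the autocorrelation is strong near lag 20.
rng = np.random.default_rng(0)
t = np.arange(200)
x = np.sin(2 * np.pi * t / 20) + 0.1 * rng.normal(size=t.size)

# param is a list of dicts; here the autocorrelations for lags 1..40 are
# aggregated with their mean.
result = fc.agg_autocorrelation(x, param=[{"f_agg": "mean", "maxlag": 40}])
print(result)  # a list of (parameter-name, value) pairs
```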


tsfresh.feature_extraction.feature_calculators.agg_linear_trend(x, param)

The series is first aggregated over chunks (max, min, mean, median); a linear regression is then fitted to the aggregated values, and the requested attribute of the fit is returned: pvalue, rvalue (correlation coefficient), intercept, slope, or stderr (standard error of the fit).

Parameters: x (pandas.Series) – the time series to calculate the feature of

param (list) – contains dictionaries {"attr": x, "chunk_len": l, "f_agg": f} with x and f strings and l an int

Returns: the different feature values

Return type: pandas.Series
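
A small sketch of the param format for this combiner; the chunk length, aggregation function and regression attribute are arbitrary choices:

```python
import numpy as np
from tsfresh.feature_extraction import feature_calculators as fc

rng = np.random.default_rng(0)
x = np.arange(50, dtype=float) + rng.normal(size=50)  # a roughly linear series

# Aggregate the series in chunks of length 5 by their max, fit a line through
# the ten aggregated values, and report the slope of that line.
result = list(fc.agg_linear_trend(
    x, param=[{"attr": "slope", "chunk_len": 5, "f_agg": "max"}]
))
print(result)
```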


tsfresh.feature_extraction.feature_calculators.approximate_entropy(x, m, r)

Approximate entropy, used to quantify the periodicity, unpredictability and volatility of a time series.


tsfresh.feature_extraction.feature_calculators.ar_coefficient(x, param)

Coefficients of an autoregressive (AR) model fitted to the series.


tsfresh.feature_extraction.feature_calculators.augmented_dickey_fuller(x, param)

tsfresh.feature_extraction.feature_calculators.autocorrelation(x, lag)

The autocorrelation of the series at the given lag:

$$\frac{1}{(n-l)\sigma^2} \sum_{t=1}^{n-l}(X_t - \mu)(X_{t+l}-\mu)$$


tsfresh.feature_extraction.feature_calculators.binned_entropy(x, max_bins)

Bins the values of the series into max_bins equal-width bins, assigns each value to its bin, and then computes the entropy of the resulting distribution:

$$\sum_{k=0}^{min(max\_bins, len(x))} p_k log(p_k) \cdot \mathbf{1}_{(p_k > 0)}$$


where $p_k$ is the fraction of samples that fall into the k-th bin.

This feature measures how evenly the values of the series are distributed.

Parameters: x (pandas.Series) – the time series to calculate the feature of

max_bins (int) – the number of bins

Returns: the value of this feature

Return type: float

This function is of type: simple
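
A quick illustration of the "evenness" interpretation, with synthetic data and an arbitrary number of bins:

```python
import numpy as np
from tsfresh.feature_extraction import feature_calculators as fc

rng = np.random.default_rng(0)
uniform = rng.uniform(size=1000)  # values spread evenly over their range
peaked = rng.normal(size=1000)    # values concentrated around the mean

# The more evenly distributed series yields the larger binned entropy.
print(fc.binned_entropy(uniform, max_bins=10))
print(fc.binned_entropy(peaked, max_bins=10))
```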


tsfresh.feature_extraction.feature_calculators.c3(x, lag)

$$\frac{1}{n-2lag} \sum_{i=0}^{n-2lag} x_{i + 2 \cdot lag}^2 \cdot x_{i + lag} \cdot x_{i}$$


which is equivalent to

$$\mathbb{E}[L^2(X)^2 \cdot L(X) \cdot X]$$

where $L$ is the lag operator.


This is a measure of the non-linearity of the time series.
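
A small sketch contrasting an i.i.d. Gaussian noise series (c3 should be near zero) with a non-linearly transformed one; the data and the lag are illustrative:

```python
import numpy as np
from tsfresh.feature_extraction import feature_calculators as fc

rng = np.random.default_rng(0)
noise = rng.normal(size=500)               # i.i.d. Gaussian noise
nonlinear = np.sin(np.cumsum(noise)) ** 3  # a crude non-linear transform of a random walk

print(fc.c3(noise, lag=1))      # close to zero
print(fc.c3(nonlinear, lag=1))  # typically clearly non-zero
```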


tsfresh.feature_extraction.feature_calculators.change_quantiles(x, ql, qh, isabs, f_agg)

A corridor is first defined by the two quantiles ql and qh of x; the consecutive changes of the series inside this corridor are then aggregated (e.g. by their mean), optionally taking absolute values.


Parameters:

x (pandas.Series) – the time series to calculate the feature of

ql (float) – the lower quantile of the corridor

qh (float) – the upper quantile of the corridor

isabs (bool) – whether to take absolute differences

f_agg (str) – name of a numpy aggregation function (e.g. mean, var, std, median)
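
A minimal usage sketch; the corridor quantiles and the aggregation function are arbitrary choices:

```python
import numpy as np
from tsfresh.feature_extraction import feature_calculators as fc

rng = np.random.default_rng(0)
x = rng.normal(size=200).cumsum()  # a random walk

# Mean absolute change of x, restricted to the corridor between the 20% and
# 80% quantiles of x.
print(fc.change_quantiles(x, ql=0.2, qh=0.8, isabs=True, f_agg="mean"))
```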


tsfresh.feature_extraction.feature_calculators.cid_ce(x, normalize)

Used to estimate the complexity of a time series; more complex series have more peaks and valleys.

$$\sqrt{ \sum_{i=0}^{n-2lag} ( x_{i} - x_{i+1})^2 }$$
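
A small illustration of the complexity interpretation: the noisier signal has more peaks and valleys and therefore a larger cid_ce value (the signals are made up):

```python
import numpy as np
from tsfresh.feature_extraction import feature_calculators as fc

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 500)
smooth = np.sin(t)
rough = np.sin(t) + 0.5 * rng.normal(size=t.size)

# normalize=True z-normalizes the series before computing the statistic.
print(fc.cid_ce(smooth, normalize=True))
print(fc.cid_ce(rough, normalize=True))
```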


tsfresh.feature_extraction.feature_calculators.count_above_mean(x)

The number of values in x that are greater than the mean of x.


tsfresh.feature_extraction.feature_calculators.count_below_mean(x)

The number of values in x that are smaller than the mean of x.


tsfresh.feature_extraction.feature_calculators.cwt_coefficients(x, param)

Calculates a continuous wavelet transform of the series using the Ricker (Mexican hat) wavelet, defined as

$$\frac{2}{\sqrt{3a} \pi^{\frac{1}{4}}} (1 - \frac{x^2}{a^2}) exp(-\frac{x^2}{2a^2})$$

where a is the width parameter of the wavelet.


tsfresh.feature_extraction.feature_calculators.energy_ratio_by_chunks(x, param)

Calculates the sum of squares of chunk i out of N chunks expressed as a ratio with the sum of squares over the whole series.


Takes as input parameters the number num_segments of segments to divide the series into and segment_focus which is the segment number (starting at zero) to return a feature on.


If the length of the time series is not a multiple of the number of segments, the remaining data points are distributed over the bins starting from the first. For example, if the time series has 8 entries and is split into 3 segments, the first two bins contain 3 values each and the last one 2, i.e. [ 0., 1., 2.], [ 3., 4., 5.] and [ 6., 7.].


Note that the answer for num_segments = 1 is a trivial “1” but we handle this scenario in case somebody calls it. Sum of the ratios should be 1.0.


Parameters:

x (numpy.ndarray) – the time series to calculate the feature of

param – contains dictionaries {“num_segments”: N, “segment_focus”: i} with N, i both ints


Returns:

the feature values


Return type:

list of tuples (index, data)
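
A sketch using the 8-point example from the text, which is split into the 3 chunks [0, 1, 2], [3, 4, 5], [6, 7]:

```python
import numpy as np
from tsfresh.feature_extraction import feature_calculators as fc

x = np.arange(8, dtype=float)

# Ratio of the sum of squares of chunk 0 (and of chunk 2) to the total
# sum of squares of the whole series.
result = fc.energy_ratio_by_chunks(x, param=[
    {"num_segments": 3, "segment_focus": 0},
    {"num_segments": 3, "segment_focus": 2},
])
print(result)  # (parameter-name, ratio) pairs; the ratios over all 3 chunks sum to 1.0
```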


tsfresh.feature_extraction.feature_calculators.fft_aggregated(x, param)

Returns the spectral centroid (mean), variance, skew, and kurtosis of the absolute fourier transform spectrum.


Parameters:

x (numpy.ndarray) – the time series to calculate the feature of

param (list) – contains dictionaries {"aggtype": s} where s is a str in ["centroid", "variance", "skew", "kurtosis"]


Returns:

the different feature values


Return type:

pandas.Series


This function is of type: combiner


tsfresh.feature_extraction.feature_calculators.fft_coefficient(x, param)

Calculates the Fourier coefficients of the one-dimensional discrete Fourier transform for real input, using the fast Fourier transform algorithm.


$$A_k = \sum_{m=0}^{n-1} a_m \exp \left \{ -2 \pi i \frac{m k}{n} \right \}, \qquad k = 0,\ldots , n-1.$$


The resulting coefficients are complex; this feature calculator can return the real part (attr=="real"), the imaginary part (attr=="imag"), the absolute value (attr=="abs") and the angle in degrees (attr=="angle").


Parameters:

x (numpy.ndarray) – the time series to calculate the feature of

param (list) – contains dictionaries {"coeff": x, "attr": s} with x an int, x >= 0, and s a str in ["real", "imag", "abs", "angle"]


Returns:

the different feature values


Return type:

pandas.Series


This function is of type: combiner
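
A sketch with a pure sine that completes exactly 4 cycles in the window, so the magnitude of coefficient k = 4 should dominate (the signal and the requested coefficients are chosen purely for illustration):

```python
import numpy as np
from tsfresh.feature_extraction import feature_calculators as fc

t = np.arange(128)
x = np.sin(2 * np.pi * 4 * t / 128)  # exactly 4 cycles over the 128-sample window

result = list(fc.fft_coefficient(x, param=[{"coeff": 4, "attr": "abs"},
                                           {"coeff": 5, "attr": "abs"}]))
print(result)  # |A_4| is large, |A_5| is close to zero
```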


tsfresh.feature_extraction.feature_calculators.first_location_of_maximum(x)

The position of the first occurrence of the maximum value of x.


tsfresh.feature_extraction.feature_calculators.first_location_of_minimum(x)

The position of the first occurrence of the minimum value of x.


tsfresh.feature_extraction.feature_calculators.friedrich_coefficients(x, param)

Coefficients of polynomial h(x), which has been fitted to the deterministic dynamics of Langevin model

$$\dot{x}(t) = h(x(t)) + \mathcal{N}(0,R)$$


as described by [1].


For short time-series this method is highly dependent on the parameters.


References


[1] Friedrich et al. (2000): Physics Letters A 271, p. 217-222

Extracting model equations from experimental data


Parameters:

x (numpy.ndarray) – the time series to calculate the feature of

param (list) – contains dictionaries {"m": x, "r": y, "coeff": z} with x a positive integer (the order of the polynomial to fit for estimating fixed points of the dynamics), y a positive float (the number of quantiles to use for averaging) and z a positive integer (the index of the returned coefficient)


Returns:

the different feature values


Return type:

pandas.Series


tsfresh.feature_extraction.feature_calculators.has_duplicate(x)

Whether x contains any duplicate values.


tsfresh.feature_extraction.feature_calculators.has_duplicate_max(x)

Whether the maximum value of x occurs more than once.


tsfresh.feature_extraction.feature_calculators.has_duplicate_min(x)

Whether the minimum value of x occurs more than once.


tsfresh.feature_extraction.feature_calculators.index_mass_quantile(x, param)

tsfresh.feature_extraction.feature_calculators.kurtosis(x)

tsfresh.feature_extraction.feature_calculators.large_standard_deviation(x, r)

Whether the standard deviation of x is larger than r times its range (maximum minus minimum):

$$std(x) > r * (max(X)-min(X))$$


tsfresh.feature_extraction.feature_calculators.last_location_of_maximum(x)

The position of the last occurrence of the maximum value of x.


tsfresh.feature_extraction.feature_calculators.last_location_of_minimum(x)

The position of the last occurrence of the minimum value of x.


tsfresh.feature_extraction.feature_calculators.length(x)

The length of x.


tsfresh.feature_extraction.feature_calculators.linear_trend(x, param)

Calculate a linear least-squares regression for the values of the time series versus the sequence from 0 to length of the time series minus one. This feature assumes the signal to be uniformly sampled. It will not use the time stamps to fit the model. The parameters control which of the characteristics are returned.


Possible extracted attributes are “pvalue”, “rvalue”, “intercept”, “slope”, “stderr”, see the documentation of linregress for more information.


Parameters:

x (numpy.ndarray) – the time series to calculate the feature of

param (list) – contains dictionaries {"attr": x} with x a string, the attribute name of the regression model


Returns:

the different feature values


Return type:

pandas.Series


This function is of type: combiner
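
A minimal sketch for linear_trend; the synthetic series below has a true slope of about 0.5:

```python
import numpy as np
from tsfresh.feature_extraction import feature_calculators as fc

rng = np.random.default_rng(0)
x = 0.5 * np.arange(100) + rng.normal(size=100)  # noisy line with slope 0.5

# Regress x against 0..n-1 and extract the requested attributes.
result = list(fc.linear_trend(x, param=[{"attr": "slope"}, {"attr": "intercept"}]))
print(result)  # the fitted slope should come out near 0.5
```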


tsfresh.feature_extraction.feature_calculators.longest_strike_above_mean(x)

The length of the longest consecutive subsequence of x that is above the mean of x.


tsfresh.feature_extraction.feature_calculators.longest_strike_below_mean(x)

The length of the longest consecutive subsequence of x that is below the mean of x.


tsfresh.feature_extraction.feature_calculators.max_langevin_fixed_point(x, r, m)

Largest fixed point of the dynamics, $\mathrm{argmax}_x \{h(x)=0\}$, estimated from the polynomial h(x) that has been fitted to the deterministic dynamics of the Langevin model

$$\dot{x}(t) = h(x(t)) + R \mathcal{N}(0,1)$$


as described by


Friedrich et al. (2000): Physics Letters A 271, p. 217-222 Extracting model equations from experimental data

For short time-series this method is highly dependent on the parameters.


Parameters:

x (numpy.ndarray) – the time series to calculate the feature of

m (int) – order of the polynomial to fit for estimating fixed points of the dynamics

r (float) – number of quantiles to use for averaging


Returns:

Largest fixed point of deterministic dynamics


Return type:

float


tsfresh.feature_extraction.feature_calculators.maximum(x)

The maximum value of x.


tsfresh.feature_extraction.feature_calculators.mean(x)

The mean of x.


tsfresh.feature_extraction.feature_calculators.mean_abs_change(x)

The mean of the absolute values of consecutive changes in x:

$$\frac{1}{n} \sum_{i=1,\ldots, n-1} | x_{i+1} - x_{i}|$$


tsfresh.feature_extraction.feature_calculators.mean_change(x)

The mean of consecutive changes in x:

$$\frac{1}{n} \sum_{i=1,\ldots, n-1} (x_{i+1} - x_{i})$$


tsfresh.feature_extraction.feature_calculators.mean_second_derivative_central(x)

The mean of a central approximation of the second derivative:

$$\frac{1}{n} \sum_{i=1,\ldots, n-1} \frac{1}{2} (x_{i+2} - 2 \cdot x_{i+1} + x_i)$$


tsfresh.feature_extraction.feature_calculators.median(x)

The median of x.


tsfresh.feature_extraction.feature_calculators.minimum(x)

The minimum value of x.


tsfresh.feature_extraction.feature_calculators.number_crossing_m(x, m)

Calculates the number of crossings of x on m. A crossing is defined as two sequential values where the first value is lower than m and the next is greater, or vice-versa. If you set m to zero, you will get the number of zero crossings.


Parameters:

x (numpy.ndarray) – the time series to calculate the feature of

m (float) – the threshold for the crossing


Returns:

the value of this feature


Return type:

int
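
A tiny worked example of the crossing count (the values are made up):

```python
import numpy as np
from tsfresh.feature_extraction import feature_calculators as fc

x = np.array([1.0, -1.0, 2.0, -2.0, 3.0])

print(fc.number_crossing_m(x, m=0))    # 4 sign changes around zero -> 4
print(fc.number_crossing_m(x, m=2.5))  # only the step from -2 to 3 crosses 2.5 -> 1
```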


tsfresh.feature_extraction.feature_calculators.number_cwt_peaks(x, n)

This feature calculator searches for different peaks in x. To do so, x is smoothed by a Ricker wavelet for widths ranging from 1 to n. It returns the number of peaks that occur at enough width scales and with a sufficiently high signal-to-noise ratio (SNR).


Parameters:

x (numpy.ndarray) – the time series to calculate the feature of

n (int) – maximum width to consider


Returns:

the value of this feature


Return type:

int


tsfresh.feature_extraction.feature_calculators.number_peaks(x, n)

The number of peaks of at least support n in the series x.


tsfresh.feature_extraction.feature_calculators.partial_autocorrelation(x, param)

The partial autocorrelation of x at the given lag k:

$$\alpha_k = \frac{ Cov(x_t, x_{t-k} | x_{t-1}, \ldots, x_{t-k+1})} {\sqrt{ Var(x_t | x_{t-1}, \ldots, x_{t-k+1}) Var(x_{t-k} | x_{t-1}, \ldots, x_{t-k+1} )}}$$
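
A sketch with a simulated AR(1) process, for which the partial autocorrelation should be large at lag 1 and close to zero at higher lags (the AR coefficient 0.7 is an arbitrary choice):

```python
import numpy as np
from tsfresh.feature_extraction import feature_calculators as fc

rng = np.random.default_rng(0)
x = np.zeros(500)
for t in range(1, x.size):
    x[t] = 0.7 * x[t - 1] + rng.normal()  # AR(1) with coefficient 0.7

result = fc.partial_autocorrelation(x, param=[{"lag": 1}, {"lag": 2}, {"lag": 3}])
print(result)  # list of (parameter-name, value) pairs
```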


tsfresh.feature_extraction.feature_calculators.percentage_of_reoccurring_datapoints_to_all_datapoints(x)

len(different values occurring more than once) / len(different values)

i.e. the number of distinct values that occur more than once divided by the total number of distinct values.


tsfresh.feature_extraction.feature_calculators.percentage_of_reoccurring_values_to_all_values(x)

The number of values that occur more than once divided by the total number of values.


tsfresh.feature_extraction.feature_calculators.quantile(x, q)

Returns the q quantile of x, i.e. the value such that a fraction q of the ordered values of x lies below it.


tsfresh.feature_extraction.feature_calculators.range_count(x, min, max)

The number of values in x that lie between min and max.


tsfresh.feature_extraction.feature_calculators.ratio_beyond_r_sigma(x, r)

The fraction of values that are more than r times the standard deviation away from the mean of x.


tsfresh.feature_extraction.feature_calculators.ratio_value_number_to_time_series_length(x)

The number of unique values of x divided by its original length, i.e. len(set(x)) / len(x).


tsfresh.feature_extraction.feature_calculators.sample_entropy(x)


tsfresh.feature_extraction.feature_calculators.set_property(key, value)

tsfresh.feature_extraction.feature_calculators.skewness(x)

tsfresh.feature_extraction.feature_calculators.spkt_welch_density(x, param)

tsfresh.feature_extraction.feature_calculators.standard_deviation(x)

The standard deviation of x.


tsfresh.feature_extraction.feature_calculators.sum_of_reoccurring_data_points(x)

The sum of all data points that occur more than once in x.


tsfresh.feature_extraction.feature_calculators.sum_of_reoccurring_values(x)

The sum of all values that occur more than once in x.


tsfresh.feature_extraction.feature_calculators.sum_values(x)

The sum of all values in x.


tsfresh.feature_extraction.feature_calculators.symmetry_looking(x, param)

Checks whether the distribution of x looks symmetric, i.e. whether

$$| mean(X)-median(X)| < r * (max(X)-min(X))$$


tsfresh.feature_extraction.feature_calculators.time_reversal_asymmetry_statistic(x, lag)

$$\frac{1}{n-2lag} \sum_{i=0}^{n-2lag} x_{i + 2 \cdot lag}^2 \cdot x_{i + lag} - x_{i + lag} \cdot x_{i}^2$$


which is equivalent to

$$\mathbb{E}[L^2(X)^2 \cdot L(X) - L(X) \cdot X^2]$$

where $L$ is the lag operator.


tsfresh.feature_extraction.feature_calculators.value_count(x, value)

The number of occurrences of value in x.


tsfresh.feature_extraction.feature_calculators.variance(x)

The variance of x.


tsfresh.feature_extraction.feature_calculators.variance_larger_than_standard_deviation(x)

Whether the variance of x is larger than its standard deviation.
