ML/FE: Automatic feature generation (feature derivation) with featuretools from a single CSV dataset, automatically split into two DataFrame tables


Design approach

1. Define the dataset

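import numpy as np
import pandas as pd

# Sample records. np.nan, pd.NaT and '' are mixed in on purpose to show how each kind of missing value prints.
contents = {
    "name": ['Bob', 'LiSa', 'Mary', 'Alan'],
    "ID": [1, 2, 3, 4],
    "age": [np.nan, 28, 38, ''],  # np.nan prints as NaN; '' prints as a blank
    "born": [pd.NaT, pd.Timestamp("1990-01-01"), pd.Timestamp("1980-01-01"), ''],  # pd.NaT prints as NaT
    "sex": ['男', '女', '女', '男'],  # '男' = male, '女' = female
    "hobbey": ['打篮球', '打羽毛球', '打乒乓球', ''],  # basketball, badminton, table tennis, missing
    "money": [200.0, 240.0, 290.0, 300.0],
    "weight": [140.5, 120.8, 169.4, 155.6],
}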
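The article does not show how the single table is split into the two DataFrames. The following is a minimal sketch of one way it could be done, assuming the split is by column type: numeric columns go to `nums_df` (keyed by `name`) and the remaining columns go to `cats_df` (keyed by `ID`). The cleaning steps and the split rule are assumptions made for illustration, so the column order and row count of `cats_df` differ from the printed output further down.

```python
import numpy as np
import pandas as pd

contents = {
    "name": ["Bob", "LiSa", "Mary", "Alan"],
    "ID": [1, 2, 3, 4],
    "age": [np.nan, 28, 38, ""],
    "born": [pd.NaT, pd.Timestamp("1990-01-01"), pd.Timestamp("1980-01-01"), ""],
    "sex": ["男", "女", "女", "男"],
    "hobbey": ["打篮球", "打羽毛球", "打乒乓球", ""],
    "money": [200.0, 240.0, 290.0, 300.0],
    "weight": [140.5, 120.8, 169.4, 155.6],
}
df = pd.DataFrame(contents)

# Normalise the different "missing" markers: '' becomes NaN (or NaT for dates).
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df["born"] = pd.to_datetime(df["born"], errors="coerce")
df["hobbey"] = df["hobbey"].mask(df["hobbey"] == "")

# Hypothetical split rule: numeric columns vs. everything else.
num_cols = df.select_dtypes(include="number").columns.tolist()
other_cols = [c for c in df.columns if c not in num_cols and c != "name"]

nums_df = df[["name"] + num_cols]   # name, ID, age, money, weight
cats_df = df[["ID"] + other_cols]   # ID, born, sex, hobbey

print(nums_df)
print(cats_df)
```

Both tables keep the `ID` column, which is what later lets the two tables be related to each other.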

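2. DFS design

(1) Specify a dictionary containing every entity (table) in the dataset.
(2) Specify how the entities are related: when two entities have a one-to-many relationship, the entity on the "one" side is called the parent entity, and the entity on the "many" side is its child.
(3) Run Deep Feature Synthesis (DFS): the minimal input to DFS is a set of entities, a list of relationships, and the "target_entity" for which features are computed. The output of DFS is a feature matrix and the corresponding list of feature definitions.
For example, building a feature matrix for each customer in a dataset yields dozens of new features that describe that customer's behavior.
(4) Change the target entity: one reason DFS is so powerful is that it can build a feature matrix for any entity in the data. For example, to build features for sessions instead, only the target entity needs to change.
(5) Understand the feature output: in general, Featuretools refers to generated features by their feature names.
To make features easier to understand, Featuretools provides two additional tools, featuretools.graph_feature() and featuretools.describe_feature(),
which help explain what a feature is and which steps Featuretools took to generate it.
(6) Feature lineage graphs:
A feature lineage graph visually walks through the feature generation process. Starting from the base data, it shows step by step the primitives applied and the intermediate features generated to create the final feature.
(7) Feature descriptions: Featuretools can also automatically generate an English-sentence description of a feature. These descriptions help explain what a feature is, and they can be improved further by including manually defined customizations.
For details on customizing the automatically generated descriptions, see "Generating Feature Descriptions" in the Featuretools documentation.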
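To make steps (2) to (4) concrete, here is a rough pandas emulation of what DFS computes for this dataset. It is not the Featuretools API and not the author's code: it only mimics two ideas, aggregating child rows up to the parent, and copying parent features back onto each child row. The table contents are taken from the printed output below, where `cats` (the parent) holds IDs 4, 1 and 2, and `nums` (the child) holds all four people. The variable names `agg`, `parent` and `direct` are hypothetical.

```python
import numpy as np
import pandas as pd

# Child table: one row per person.
nums = pd.DataFrame({
    "name": ["Bob", "LiSa", "Mary", "Alan"],
    "ID": [1, 2, 3, 4],
    "age": [np.nan, 28.0, 38.0, np.nan],
    "money": [200.0, 240.0, 290.0, 300.0],
})
# Parent table: note that ID 3 (Mary) has no row here.
cats = pd.DataFrame({"ID": [4, 1, 2], "sex": ["男", "男", "女"]})

# Aggregation step: summarise the child rows of each parent ID.
agg = nums.groupby("ID").agg(**{
    "COUNT(nums)": ("name", "count"),
    "MAX(nums.money)": ("money", "max"),
    "MEAN(nums.age)": ("age", "mean"),
    "SUM(nums.age)": ("age", "sum"),
})

# Target = cats: parent columns plus the aggregated child features.
parent = cats.set_index("ID").join(agg, how="left")

# Target = nums: each child row looks up its parent's features ("cats." prefix).
direct = nums.set_index("name").join(parent.add_prefix("cats."), on="ID", how="left")

print(parent)
print(direct)
```

Switching the target from `cats` to `nums` only changes which table the rows come from. Mary's `cats.*` columns come out as NaN because no parent row with ID 3 exists, which matches the empty `cats.*` cells for Mary in feature_matrix_nums.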

Output

   name  ID  age       born sex hobbey  money  weight
0   Bob   1  NaN        NaT   男    打篮球  200.0   140.5
1  LiSa   2   28 1990-01-01   女   打羽毛球  240.0   120.8
2  Mary   3   38 1980-01-01   女   打乒乓球  290.0   169.4
3  Alan   4             NaT   男         300.0   155.6
-------------------------------------------
nums_df: ----------------------------------
   name  ID   age  money  weight
0   Bob   1   NaN  200.0   140.5
1  LiSa   2  28.0  240.0   120.8
2  Mary   3  38.0  290.0   169.4
3  Alan   4   NaN  300.0   155.6
cats_df: ----------------------------------
   ID hobbey sex        born
0   4    NaN   男         NaN
1   1    打篮球   男         NaN
2   2   打羽毛球   女  1990-01-01
---------------------------------DFS design: -----------------------------------
feature_matrix_nums
       ID   age  money  weight cats.hobbey cats.sex  cats.COUNT(nums)  \
name
Bob    1   NaN  200.0   140.5         打篮球        男               1.0
LiSa   2  28.0  240.0   120.8        打羽毛球        女               1.0
Mary   3  38.0  290.0   169.4         NaN      NaN               NaN

      cats.MAX(nums.age)  cats.MAX(nums.money)  cats.MAX(nums.weight)  \
name
Bob                  NaN                 200.0                  140.5
LiSa                28.0                 240.0                  120.8
Mary                 NaN                   NaN                    NaN

      cats.MEAN(nums.age)  cats.MEAN(nums.money)  cats.MEAN(nums.weight)  \
name
Bob                   NaN                  200.0                   140.5
LiSa                 28.0                  240.0                   120.8
Mary                  NaN                    NaN                     NaN

      cats.MIN(nums.age)  cats.MIN(nums.money)  cats.MIN(nums.weight)  \
name
Bob                  NaN                 200.0                  140.5
LiSa                28.0                 240.0                  120.8
Mary                 NaN                   NaN                    NaN

      cats.SKEW(nums.age)  cats.SKEW(nums.money)  cats.SKEW(nums.weight)  \
name
Bob                   NaN                    NaN                     NaN
LiSa                  NaN                    NaN                     NaN
Mary                  NaN                    NaN                     NaN

      cats.STD(nums.age)  cats.STD(nums.money)  cats.STD(nums.weight)  \
name
Bob                  NaN                   NaN                    NaN
LiSa                 NaN                   NaN                    NaN
Mary                 NaN                   NaN                    NaN

      cats.SUM(nums.age)  cats.SUM(nums.money)  cats.SUM(nums.weight)  \
name
Bob                  0.0                 200.0                  140.5
LiSa                28.0                 240.0                  120.8
Mary                 NaN                   NaN                    NaN

      cats.DAY(born)  cats.MONTH(born)  cats.WEEKDAY(born)  cats.YEAR(born)
name
Bob              NaN               NaN                 NaN              NaN
LiSa             1.0               1.0                 0.0           1990.0
Mary             NaN               NaN                 NaN              NaN
features_defs_nums: 29 [<Feature: ID>, <Feature: age>, <Feature: money>, <Feature: weight>, <Feature: cats.hobbey>, <Feature: cats.sex>, <Feature: cats.COUNT(nums)>, <Feature: cats.MAX(nums.age)>, <Feature: cats.MAX(nums.money)>, <Feature: cats.MAX(nums.weight)>, <Feature: cats.MEAN(nums.age)>, <Feature: cats.MEAN(nums.money)>, <Feature: cats.MEAN(nums.weight)>, <Feature: cats.MIN(nums.age)>, <Feature: cats.MIN(nums.money)>, <Feature: cats.MIN(nums.weight)>, <Feature: cats.SKEW(nums.age)>, <Feature: cats.SKEW(nums.money)>, <Feature: cats.SKEW(nums.weight)>, <Feature: cats.STD(nums.age)>, <Feature: cats.STD(nums.money)>, <Feature: cats.STD(nums.weight)>, <Feature: cats.SUM(nums.age)>, <Feature: cats.SUM(nums.money)>, <Feature: cats.SUM(nums.weight)>, <Feature: cats.DAY(born)>, <Feature: cats.MONTH(born)>, <Feature: cats.WEEKDAY(born)>, <Feature: cats.YEAR(born)>]
feature_matrix_cats_df
    hobbey sex  COUNT(nums)  MAX(nums.age)  MAX(nums.money)  MAX(nums.weight)  \
ID
4     NaN   男            1            NaN            300.0             155.6
1     打篮球   男            1            NaN            200.0             140.5
2    打羽毛球   女            1           28.0            240.0             120.8

    MEAN(nums.age)  MEAN(nums.money)  MEAN(nums.weight)  MIN(nums.age)  \
ID
4              NaN             300.0              155.6            NaN
1              NaN             200.0              140.5            NaN
2             28.0             240.0              120.8           28.0

    MIN(nums.money)  MIN(nums.weight)  SKEW(nums.age)  SKEW(nums.money)  \
ID
4             300.0             155.6             NaN               NaN
1             200.0             140.5             NaN               NaN
2             240.0             120.8             NaN               NaN

    SKEW(nums.weight)  STD(nums.age)  STD(nums.money)  STD(nums.weight)  \
ID
4                 NaN            NaN              NaN               NaN
1                 NaN            NaN              NaN               NaN
2                 NaN            NaN              NaN               NaN

    SUM(nums.age)  SUM(nums.money)  SUM(nums.weight)  DAY(born)  MONTH(born)  \
ID
4             0.0            300.0             155.6        NaN          NaN
1             0.0            200.0             140.5        NaN          NaN
2            28.0            240.0             120.8        1.0          1.0

    WEEKDAY(born)  YEAR(born)
ID
4             NaN         NaN
1             NaN         NaN
2             0.0      1990.0
features_defs_cats_df: 25 [<Feature: hobbey>, <Feature: sex>, <Feature: COUNT(nums)>, <Feature: MAX(nums.age)>, <Feature: MAX(nums.money)>, <Feature: MAX(nums.weight)>, <Feature: MEAN(nums.age)>, <Feature: MEAN(nums.money)>, <Feature: MEAN(nums.weight)>, <Feature: MIN(nums.age)>, <Feature: MIN(nums.money)>, <Feature: MIN(nums.weight)>, <Feature: SKEW(nums.age)>, <Feature: SKEW(nums.money)>, <Feature: SKEW(nums.weight)>, <Feature: STD(nums.age)>, <Feature: STD(nums.money)>, <Feature: STD(nums.weight)>, <Feature: SUM(nums.age)>, <Feature: SUM(nums.money)>, <Feature: SUM(nums.weight)>, <Feature: DAY(born)>, <Feature: MONTH(born)>, <Feature: WEEKDAY(born)>, <Feature: YEAR(born)>]
<Feature: SUM(nums.age)>
The sum of the "age" of all instances of "nums" for each "ID" in "cats".
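In the printed matrices, SUM(nums.age) is 0.0 for IDs whose only age is missing, while MAX, MEAN and MIN are NaN, and every STD and SKEW value is NaN. A plausible explanation, assuming the aggregation primitives behave like the corresponding pandas reductions (the article does not state this), is that each parent ID has exactly one child row. The sketch below reproduces the pattern with plain pandas on one-element series.

```python
import numpy as np
import pandas as pd

ages = pd.Series([np.nan])    # the only age value for ID 1 (Bob) is missing
money = pd.Series([200.0])    # the only money value for ID 1

summary = {
    "SUM(age)": ages.sum(),       # NaN is skipped, and the sum of nothing is 0.0
    "MEAN(age)": ages.mean(),     # no valid values to average, so NaN
    "MAX(age)": ages.max(),       # no valid values to compare, so NaN
    "STD(money)": money.std(),    # sample standard deviation needs at least 2 values
    "SKEW(money)": money.skew(),  # skewness needs at least 3 values
}

for feature, value in summary.items():
    print(feature, value)
```

With more than one child row per parent, STD and SKEW would start producing numbers. On a dataset like this one they carry no information.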

feature_matrix_cats_df.csv


ID	hobbey	sex	COUNT(nums)	MAX(nums.age)	MAX(nums.money)	MAX(nums.weight)	MEAN(nums.age)	MEAN(nums.money)	MEAN(nums.weight)	MIN(nums.age)	MIN(nums.money)	MIN(nums.weight)	SKEW(nums.age)	SKEW(nums.money)	SKEW(nums.weight)	STD(nums.age)	STD(nums.money)	STD(nums.weight)	SUM(nums.age)	SUM(nums.money)	SUM(nums.weight)	DAY(born)	MONTH(born)	WEEKDAY(born)	YEAR(born)
4		男	1		300	155.6		300	155.6		300	155.6							0	300	155.6
1	打篮球	男	1		200	140.5		200	140.5		200	140.5							0	200	140.5
2	打羽毛球	女	1	28	240	120.8	28	240	120.8	28	240	120.8							28	240	120.8	1	1	0	1990


Field descriptions:

<Feature: hobbey> ： The "hobbey".
<Feature: sex> ： The "sex".
<Feature: COUNT(nums)> ： The number of all instances of "nums" for each "ID" in "cats".
<Feature: MAX(nums.age)> ： The maximum of the "age" of all instances of "nums" for each "ID" in "cats".
<Feature: MAX(nums.money)> ： The maximum of the "money" of all instances of "nums" for each "ID" in "cats".
<Feature: MAX(nums.weight)> ： The maximum of the "weight" of all instances of "nums" for each "ID" in "cats".
<Feature: MEAN(nums.age)> ： The average of the "age" of all instances of "nums" for each "ID" in "cats".
<Feature: MEAN(nums.money)> ： The average of the "money" of all instances of "nums" for each "ID" in "cats".
<Feature: MEAN(nums.weight)> ： The average of the "weight" of all instances of "nums" for each "ID" in "cats".
<Feature: MIN(nums.age)> ： The minimum of the "age" of all instances of "nums" for each "ID" in "cats".
<Feature: MIN(nums.money)> ： The minimum of the "money" of all instances of "nums" for each "ID" in "cats".
<Feature: MIN(nums.weight)> ： The minimum of the "weight" of all instances of "nums" for each "ID" in "cats".
<Feature: SKEW(nums.age)> ： The skewness of the "age" of all instances of "nums" for each "ID" in "cats".
<Feature: SKEW(nums.money)> ： The skewness of the "money" of all instances of "nums" for each "ID" in "cats".
<Feature: SKEW(nums.weight)> ： The skewness of the "weight" of all instances of "nums" for each "ID" in "cats".
<Feature: STD(nums.age)> ： The standard deviation of the "age" of all instances of "nums" for each "ID" in "cats".
<Feature: STD(nums.money)> ： The standard deviation of the "money" of all instances of "nums" for each "ID" in "cats".
<Feature: STD(nums.weight)> ： The standard deviation of the "weight" of all instances of "nums" for each "ID" in "cats".
<Feature: SUM(nums.age)> ： The sum of the "age" of all instances of "nums" for each "ID" in "cats".
<Feature: SUM(nums.money)> ： The sum of the "money" of all instances of "nums" for each "ID" in "cats".
<Feature: SUM(nums.weight)> ： The sum of the "weight" of all instances of "nums" for each "ID" in "cats".
<Feature: DAY(born)> ： The day of the month of the "born".
<Feature: MONTH(born)> ： The month of the "born".
<Feature: WEEKDAY(born)> ： The day of the week of the "born".
<Feature: YEAR(born)> ： The year of the "born".
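The last four features in the list above are computed from the `born` column alone rather than from related rows. As a rough illustration (a pandas emulation, not Featuretools internals), the same values can be derived with the `.dt` accessor. The weekday convention assumed here is Monday = 0, which agrees with the 0.0 printed for 1990-01-01.

```python
import pandas as pd

born = pd.Series(
    [pd.NaT, pd.Timestamp("1990-01-01"), pd.Timestamp("1980-01-01")],
    name="born",
)

# One derived column per date part. Rows with a missing date stay missing,
# which is also why the values print as floats (1.0, 1990.0).
features = pd.DataFrame({
    "DAY(born)": born.dt.day,
    "MONTH(born)": born.dt.month,
    "WEEKDAY(born)": born.dt.weekday,  # Monday = 0 ... Sunday = 6
    "YEAR(born)": born.dt.year,
})

print(features)
```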

feature_matrix_nums.csv


name	ID	age	money	weight	cats.hobbey	cats.sex	cats.COUNT(nums)	cats.MAX(nums.age)	cats.MAX(nums.money)	cats.MAX(nums.weight)	cats.MEAN(nums.age)	cats.MEAN(nums.money)	cats.MEAN(nums.weight)	cats.MIN(nums.age)	cats.MIN(nums.money)	cats.MIN(nums.weight)	cats.SKEW(nums.age)	cats.SKEW(nums.money)	cats.SKEW(nums.weight)	cats.STD(nums.age)	cats.STD(nums.money)	cats.STD(nums.weight)	cats.SUM(nums.age)	cats.SUM(nums.money)	cats.SUM(nums.weight)	cats.DAY(born)	cats.MONTH(born)	cats.WEEKDAY(born)	cats.YEAR(born)
Bob	1		200	140.5	打篮球	男	1		200	140.5		200	140.5		200	140.5							0	200	140.5
LiSa	2	28	240	120.8	打羽毛球	女	1	28	240	120.8	28	240	120.8	28	240	120.8							28	240	120.8	1	1	0	1990
Mary	3	38	290	169.4
Alan	4		300	155.6		男	1		300	155.6		300	155.6		300	155.6							0	300	155.6


Field descriptions:

<Feature: ID> ： The "ID".
<Feature: age> ： The "age".
<Feature: money> ： The "money".
<Feature: weight> ： The "weight".
<Feature: cats.sex> ： The "sex" for the instance of "cats" associated with this instance of "nums".
<Feature: cats.hobbey> ： The "hobbey" for the instance of "cats" associated with this instance of "nums".
<Feature: cats.COUNT(nums)> ： The number of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
<Feature: cats.MAX(nums.age)> ： The maximum of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
<Feature: cats.MAX(nums.money)> ： The maximum of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
<Feature: cats.MAX(nums.weight)> ： The maximum of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
<Feature: cats.MEAN(nums.age)> ： The average of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
<Feature: cats.MEAN(nums.money)> ： The average of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
<Feature: cats.MEAN(nums.weight)> ： The average of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
<Feature: cats.MIN(nums.age)> ： The minimum of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
<Feature: cats.MIN(nums.money)> ： The minimum of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
<Feature: cats.MIN(nums.weight)> ： The minimum of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
<Feature: cats.SKEW(nums.age)> ： The skewness of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
<Feature: cats.SKEW(nums.money)> ： The skewness of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
<Feature: cats.SKEW(nums.weight)> ： The skewness of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
<Feature: cats.STD(nums.age)> ： The standard deviation of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
<Feature: cats.STD(nums.money)> ： The standard deviation of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
<Feature: cats.STD(nums.weight)> ： The standard deviation of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
<Feature: cats.SUM(nums.age)> ： The sum of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
<Feature: cats.SUM(nums.money)> ： The sum of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
<Feature: cats.SUM(nums.weight)> ： The sum of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
<Feature: cats.DAY(born)> ： The day of the month of the "born" for the instance of "cats" associated with this instance of "nums".
<Feature: cats.MONTH(born)> ： The month of the "born" for the instance of "cats" associated with this instance of "nums".
<Feature: cats.WEEKDAY(born)> ： The day of the week of the "born" for the instance of "cats" associated with this instance of "nums".
<Feature: cats.YEAR(born)> ： The year of the "born" for the instance of "cats" associated with this instance of "nums".
