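Contents
Automated feature generation / feature derivation with the featuretools library on a single CSV dataset (automatically split into two DataFrame tables)
Recommended articles
Py: featuretools: a detailed guide to the featuretools library (introduction, installation, and usage)
ML/FE: automated feature generation / feature derivation with the featuretools library on a single CSV dataset (automatically split into two DataFrame tables)
ML/FE: automated feature generation / feature derivation with the featuretools library on a single CSV dataset (automatically split into two DataFrame tables), implementation
Automated feature generation / feature derivation with the featuretools library on a single CSV dataset (automatically split into two DataFrame tables)
Design approach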
1. Define the dataset
import numpy as np
import pandas as pd

# Raw records; '' marks a value that was left blank.
contents = {
    "name": ['Bob', 'LiSa', 'Mary', 'Alan'],
    "ID": [1, 2, 3, 4],
    "age": [np.nan, 28, 38, ''],  # np.nan prints as NaN
    "born": [pd.NaT, pd.Timestamp("1990-01-01"), pd.Timestamp("1980-01-01"), ''],  # pd.NaT prints as NaT
    "sex": ['男', '女', '女', '男'],  # male / female
    "hobbey": ['打篮球', '打羽毛球', '打乒乓球', ''],  # basketball, badminton, table tennis, blank
    "money": [200.0, 240.0, 290.0, 300.0],
    "weight": [140.5, 120.8, 169.4, 155.6],
}
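The article does not show the code that splits the single table into `nums_df` and `cats_df`. The following is a minimal sketch of one way to do it with plain pandas, assuming the split is by column dtype after blank strings are coerced to missing values. The helper names `num_cols` and `other_cols` are illustrative, and the sketch keeps all four rows in `cats_df`, whereas the article's `cats_df` output shows only three.

```python
import numpy as np
import pandas as pd

contents = {
    "name": ["Bob", "LiSa", "Mary", "Alan"],
    "ID": [1, 2, 3, 4],
    "age": [np.nan, 28, 38, ""],
    "born": [pd.NaT, pd.Timestamp("1990-01-01"), pd.Timestamp("1980-01-01"), ""],
    "sex": ["男", "女", "女", "男"],
    "hobbey": ["打篮球", "打羽毛球", "打乒乓球", ""],
    "money": [200.0, 240.0, 290.0, 300.0],
    "weight": [140.5, 120.8, 169.4, 155.6],
}
df = pd.DataFrame(contents)

# Coerce blank strings to proper missing values so dtypes can be inferred.
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df["born"] = pd.to_datetime(df["born"], errors="coerce")
df["hobbey"] = df["hobbey"].mask(df["hobbey"] == "")

# Numeric table: the row label plus every numeric column (ID included).
num_cols = df.select_dtypes(include="number").columns.tolist()
nums_df = df[["name"] + num_cols]

# Categorical/datetime table: everything else, keyed by ID.
other_cols = [c for c in df.columns if c not in num_cols and c != "name"]
cats_df = df[["ID"] + other_cols]

print(nums_df)
print(cats_df)
```

Both tables share the `ID` column, which is what later allows them to be related as parent (`cats`) and child (`nums`).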
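2. DFS design
- (1) Specify a dictionary containing all the entities in the dataset.
- (2) Specify how the entities are related: when two entities have a one-to-many relationship, the entity on the "one" side is called the "parent entity".
- (3) Run Deep Feature Synthesis (DFS): the minimal input to DFS is a set of entities, a set of relationships, and the "target_entity" for which features are computed. The output of DFS is a feature matrix and the corresponding list of feature definitions. Let's first create a feature matrix for each customer in the data; we then have dozens of new features describing each customer's behavior.
- (4) Change the target entity: one reason DFS is so powerful is that it can create a feature matrix for any entity in the data, for example when we want to build features for sessions.
- (5) Understand the feature output: in general, Featuretools refers to generated features by their feature names. To make features easier to understand, Featuretools provides two additional tools, featuretools.graph_feature() and featuretools.describe_feature(), which help explain what a feature is and the steps Featuretools took to generate it.
- (6) Feature lineage graphs: a lineage graph visually walks through the feature generation process. Starting from the base data, it shows step by step the primitives that were applied and the intermediate features that were generated to create the final feature.
- (7) Feature descriptions: Featuretools can also automatically generate an English-sentence description of a feature. Feature descriptions help explain what a feature is, and they can be further improved by including manually defined customizations. For details on customizing the automatically generated descriptions, see "Generating Feature Descriptions".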
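To make the idea behind steps (2) to (4) concrete, here is a plain-pandas sketch of what DFS computes when `cats` is the target: aggregations over the child rows in `nums` for each parent `ID`, plus date-part transforms of `born`. It deliberately does not call featuretools. The two small tables are re-typed from the output shown in this article, and the column naming only imitates the Featuretools feature names.

```python
import numpy as np
import pandas as pd

# Child table (many side) and parent table (one side), re-typed from the article's output.
nums_df = pd.DataFrame({
    "name": ["Bob", "LiSa", "Mary", "Alan"],
    "ID": [1, 2, 3, 4],
    "age": [np.nan, 28.0, 38.0, np.nan],
    "money": [200.0, 240.0, 290.0, 300.0],
    "weight": [140.5, 120.8, 169.4, 155.6],
})
cats_df = pd.DataFrame({
    "ID": [4, 1, 2],
    "hobbey": [np.nan, "打篮球", "打羽毛球"],
    "sex": ["男", "男", "女"],
    "born": pd.to_datetime([None, None, "1990-01-01"]),
}).set_index("ID")

# Aggregation features: summarise the child rows of each parent ID.
grouped = nums_df.groupby("ID")
agg = grouped[["age", "money", "weight"]].agg(["max", "mean", "min", "sum"])
agg.columns = [f"{fn.upper()}(nums.{col})" for col, fn in agg.columns]
count = grouped.size().rename("COUNT(nums)")

# Transform features: date parts of a column on the target table itself.
born = cats_df["born"]
transforms = pd.DataFrame({
    "DAY(born)": born.dt.day,
    "MONTH(born)": born.dt.month,
    "WEEKDAY(born)": born.dt.weekday,
    "YEAR(born)": born.dt.year,
})

# One row per parent ID, like a feature matrix whose target is "cats".
feature_matrix_cats = cats_df[["hobbey", "sex"]].join(count).join(agg).join(transforms)
print(feature_matrix_cats.T)
```

Changing the target, as in step (4), amounts to choosing which table supplies the rows of the resulting matrix.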
Output
   name  ID  age        born sex hobbey  money  weight
0   Bob   1  NaN         NaT   男    打篮球  200.0   140.5
1  LiSa   2   28  1990-01-01   女   打羽毛球  240.0   120.8
2  Mary   3   38  1980-01-01   女   打乒乓球  290.0   169.4
3  Alan   4              NaT   男          300.0   155.6
-------------------------------------------
nums_df:----------------------------------
   name  ID   age  money  weight
0   Bob   1   NaN  200.0   140.5
1  LiSa   2  28.0  240.0   120.8
2  Mary   3  38.0  290.0   169.4
3  Alan   4   NaN  300.0   155.6
cats_df:----------------------------------
   ID hobbey sex        born
0   4    NaN   男         NaN
1   1    打篮球   男         NaN
2   2   打羽毛球   女  1990-01-01
---------------------------------DFS design:-----------------------------------
feature_matrix_nums
      ID   age  money  weight cats.hobbey cats.sex  cats.COUNT(nums)  \
name
Bob    1   NaN  200.0   140.5         打篮球        男               1.0
LiSa   2  28.0  240.0   120.8        打羽毛球        女               1.0
Mary   3  38.0  290.0   169.4         NaN      NaN               NaN

      cats.MAX(nums.age)  cats.MAX(nums.money)  cats.MAX(nums.weight)  \
name
Bob                  NaN                 200.0                  140.5
LiSa                28.0                 240.0                  120.8
Mary                 NaN                   NaN                    NaN

      cats.MEAN(nums.age)  cats.MEAN(nums.money)  cats.MEAN(nums.weight)  \
name
Bob                   NaN                  200.0                   140.5
LiSa                 28.0                  240.0                   120.8
Mary                  NaN                    NaN                     NaN

      cats.MIN(nums.age)  cats.MIN(nums.money)  cats.MIN(nums.weight)  \
name
Bob                  NaN                 200.0                  140.5
LiSa                28.0                 240.0                  120.8
Mary                 NaN                   NaN                    NaN

      cats.SKEW(nums.age)  cats.SKEW(nums.money)  cats.SKEW(nums.weight)  \
name
Bob                   NaN                    NaN                     NaN
LiSa                  NaN                    NaN                     NaN
Mary                  NaN                    NaN                     NaN

      cats.STD(nums.age)  cats.STD(nums.money)  cats.STD(nums.weight)  \
name
Bob                  NaN                   NaN                    NaN
LiSa                 NaN                   NaN                    NaN
Mary                 NaN                   NaN                    NaN

      cats.SUM(nums.age)  cats.SUM(nums.money)  cats.SUM(nums.weight)  \
name
Bob                  0.0                 200.0                  140.5
LiSa                28.0                 240.0                  120.8
Mary                 NaN                   NaN                    NaN

      cats.DAY(born)  cats.MONTH(born)  cats.WEEKDAY(born)  cats.YEAR(born)
name
Bob              NaN               NaN                 NaN              NaN
LiSa             1.0               1.0                 0.0           1990.0
Mary             NaN               NaN                 NaN              NaN

features_defs_nums: 29 [<Feature: ID>, <Feature: age>, <Feature: money>, <Feature: weight>, <Feature: cats.hobbey>, <Feature: cats.sex>, <Feature: cats.COUNT(nums)>, <Feature: cats.MAX(nums.age)>, <Feature: cats.MAX(nums.money)>, <Feature: cats.MAX(nums.weight)>, <Feature: cats.MEAN(nums.age)>, <Feature: cats.MEAN(nums.money)>, <Feature: cats.MEAN(nums.weight)>, <Feature: cats.MIN(nums.age)>, <Feature: cats.MIN(nums.money)>, <Feature: cats.MIN(nums.weight)>, <Feature: cats.SKEW(nums.age)>, <Feature: cats.SKEW(nums.money)>, <Feature: cats.SKEW(nums.weight)>, <Feature: cats.STD(nums.age)>, <Feature: cats.STD(nums.money)>, <Feature: cats.STD(nums.weight)>, <Feature: cats.SUM(nums.age)>, <Feature: cats.SUM(nums.money)>, <Feature: cats.SUM(nums.weight)>, <Feature: cats.DAY(born)>, <Feature: cats.MONTH(born)>, <Feature: cats.WEEKDAY(born)>, <Feature: cats.YEAR(born)>]
feature_matrix_cats_df
   hobbey sex  COUNT(nums)  MAX(nums.age)  MAX(nums.money)  MAX(nums.weight)  \
ID
4     NaN   男            1            NaN            300.0             155.6
1     打篮球   男            1            NaN            200.0             140.5
2    打羽毛球   女            1           28.0            240.0             120.8

    MEAN(nums.age)  MEAN(nums.money)  MEAN(nums.weight)  MIN(nums.age)  \
ID
4              NaN             300.0              155.6            NaN
1              NaN             200.0              140.5            NaN
2             28.0             240.0              120.8           28.0

    MIN(nums.money)  MIN(nums.weight)  SKEW(nums.age)  SKEW(nums.money)  \
ID
4             300.0             155.6             NaN               NaN
1             200.0             140.5             NaN               NaN
2             240.0             120.8             NaN               NaN

    SKEW(nums.weight)  STD(nums.age)  STD(nums.money)  STD(nums.weight)  \
ID
4                 NaN            NaN              NaN               NaN
1                 NaN            NaN              NaN               NaN
2                 NaN            NaN              NaN               NaN

    SUM(nums.age)  SUM(nums.money)  SUM(nums.weight)  DAY(born)  MONTH(born)  \
ID
4             0.0            300.0             155.6        NaN          NaN
1             0.0            200.0             140.5        NaN          NaN
2            28.0            240.0             120.8        1.0          1.0

    WEEKDAY(born)  YEAR(born)
ID
4             NaN         NaN
1             NaN         NaN
2             0.0      1990.0

features_defs_cats_df: 25 [<Feature: hobbey>, <Feature: sex>, <Feature: COUNT(nums)>, <Feature: MAX(nums.age)>, <Feature: MAX(nums.money)>, <Feature: MAX(nums.weight)>, <Feature: MEAN(nums.age)>, <Feature: MEAN(nums.money)>, <Feature: MEAN(nums.weight)>, <Feature: MIN(nums.age)>, <Feature: MIN(nums.money)>, <Feature: MIN(nums.weight)>, <Feature: SKEW(nums.age)>, <Feature: SKEW(nums.money)>, <Feature: SKEW(nums.weight)>, <Feature: STD(nums.age)>, <Feature: STD(nums.money)>, <Feature: STD(nums.weight)>, <Feature: SUM(nums.age)>, <Feature: SUM(nums.money)>, <Feature: SUM(nums.weight)>, <Feature: DAY(born)>, <Feature: MONTH(born)>, <Feature: WEEKDAY(born)>, <Feature: YEAR(born)>]
<Feature: SUM(nums.age)>
The sum of the "age" of all instances of "nums" for each "ID" in "cats".
feature_matrix_cats_df.csv
features_defs_cats_df: 25
[<Feature: hobbey>, <Feature: sex>, <Feature: COUNT(nums)>, <Feature: MAX(nums.age)>, <Feature: MAX(nums.money)>, <Feature: MAX(nums.weight)>, <Feature: MEAN(nums.age)>, <Feature: MEAN(nums.money)>, <Feature: MEAN(nums.weight)>, <Feature: MIN(nums.age)>, <Feature: MIN(nums.money)>, <Feature: MIN(nums.weight)>, <Feature: SKEW(nums.age)>, <Feature: SKEW(nums.money)>, <Feature: SKEW(nums.weight)>, <Feature: STD(nums.age)>, <Feature: STD(nums.money)>, <Feature: STD(nums.weight)>, <Feature: SUM(nums.age)>, <Feature: SUM(nums.money)>, <Feature: SUM(nums.weight)>, <Feature: DAY(born)>, <Feature: MONTH(born)>, <Feature: WEEKDAY(born)>, <Feature: YEAR(born)>]
Blank cells are missing values (shown as NaN in the console output). The table is split into column groups, with the ID column repeated in each group.
ID | hobbey | sex | COUNT(nums) |
4 |  | 男 | 1 |
1 | 打篮球 | 男 | 1 |
2 | 打羽毛球 | 女 | 1 |

ID | MAX(nums.age) | MAX(nums.money) | MAX(nums.weight) | MEAN(nums.age) | MEAN(nums.money) | MEAN(nums.weight) | MIN(nums.age) | MIN(nums.money) | MIN(nums.weight) |
4 |  | 300 | 155.6 |  | 300 | 155.6 |  | 300 | 155.6 |
1 |  | 200 | 140.5 |  | 200 | 140.5 |  | 200 | 140.5 |
2 | 28 | 240 | 120.8 | 28 | 240 | 120.8 | 28 | 240 | 120.8 |

ID | SKEW(nums.age) | SKEW(nums.money) | SKEW(nums.weight) | STD(nums.age) | STD(nums.money) | STD(nums.weight) | SUM(nums.age) | SUM(nums.money) | SUM(nums.weight) |
4 |  |  |  |  |  |  | 0 | 300 | 155.6 |
1 |  |  |  |  |  |  | 0 | 200 | 140.5 |
2 |  |  |  |  |  |  | 28 | 240 | 120.8 |

ID | DAY(born) | MONTH(born) | WEEKDAY(born) | YEAR(born) |
4 |  |  |  |  |
1 |  |  |  |  |
2 | 1 | 1 | 0 | 1990 |
Field descriptions:
- <Feature: hobbey> : The "hobbey".
- <Feature: sex> : The "sex".
- <Feature: COUNT(nums)> : The number of all instances of "nums" for each "ID" in "cats".
- <Feature: MAX(nums.age)> : The maximum of the "age" of all instances of "nums" for each "ID" in "cats".
- <Feature: MAX(nums.money)> : The maximum of the "money" of all instances of "nums" for each "ID" in "cats".
- <Feature: MAX(nums.weight)> : The maximum of the "weight" of all instances of "nums" for each "ID" in "cats".
- <Feature: MEAN(nums.age)> : The average of the "age" of all instances of "nums" for each "ID" in "cats".
- <Feature: MEAN(nums.money)> : The average of the "money" of all instances of "nums" for each "ID" in "cats".
- <Feature: MEAN(nums.weight)> : The average of the "weight" of all instances of "nums" for each "ID" in "cats".
- <Feature: MIN(nums.age)> : The minimum of the "age" of all instances of "nums" for each "ID" in "cats".
- <Feature: MIN(nums.money)> : The minimum of the "money" of all instances of "nums" for each "ID" in "cats".
- <Feature: MIN(nums.weight)> : The minimum of the "weight" of all instances of "nums" for each "ID" in "cats".
- <Feature: SKEW(nums.age)> : The skewness of the "age" of all instances of "nums" for each "ID" in "cats".
- <Feature: SKEW(nums.money)> : The skewness of the "money" of all instances of "nums" for each "ID" in "cats".
- <Feature: SKEW(nums.weight)> : The skewness of the "weight" of all instances of "nums" for each "ID" in "cats".
- <Feature: STD(nums.age)> : The standard deviation of the "age" of all instances of "nums" for each "ID" in "cats".
- <Feature: STD(nums.money)> : The standard deviation of the "money" of all instances of "nums" for each "ID" in "cats".
- <Feature: STD(nums.weight)> : The standard deviation of the "weight" of all instances of "nums" for each "ID" in "cats".
- <Feature: SUM(nums.age)> : The sum of the "age" of all instances of "nums" for each "ID" in "cats".
- <Feature: SUM(nums.money)> : The sum of the "money" of all instances of "nums" for each "ID" in "cats".
- <Feature: SUM(nums.weight)> : The sum of the "weight" of all instances of "nums" for each "ID" in "cats".
- <Feature: DAY(born)> : The day of the month of the "born".
- <Feature: MONTH(born)> : The month of the "born".
- <Feature: WEEKDAY(born)> : The day of the week of the "born".
- <Feature: YEAR(born)> : The year of the "born".
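The pattern in the table above (SUM(nums.age) is 0 while MAX, MEAN and MIN are missing, and every STD and SKEW is missing) is consistent with how pandas reductions behave when each `ID` has exactly one child row. The sketch below reproduces that behavior with plain pandas Series. It is an illustration of the arithmetic, not a statement about how Featuretools implements its primitives, and the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# The single "age" value of a parent whose only child row has a missing age.
one_missing = pd.Series([np.nan])
# The single "money" value of a parent with exactly one child row.
one_value = pd.Series([200.0])

# NaN is skipped, and the sum of nothing is 0.0.
sum_missing = one_missing.sum()
# With no valid values left, max (like mean and min) is NaN.
max_missing = one_missing.max()
# Sample standard deviation (ddof=1) needs at least two values.
std_single = one_value.std()
# pandas skewness needs at least three values.
skew_single = one_value.skew()

print(sum_missing, max_missing, std_single, skew_single)
```

With COUNT(nums) equal to 1 for every parent, the STD and SKEW columns carry no information for this dataset and could be dropped before modeling.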
feature_matrix_nums.csv
features_defs_nums: 29
[<Feature: ID>, <Feature: age>, <Feature: money>, <Feature: weight>, <Feature: cats.hobbey>, <Feature: cats.sex>, <Feature: cats.COUNT(nums)>, <Feature: cats.MAX(nums.age)>, <Feature: cats.MAX(nums.money)>, <Feature: cats.MAX(nums.weight)>, <Feature: cats.MEAN(nums.age)>, <Feature: cats.MEAN(nums.money)>, <Feature: cats.MEAN(nums.weight)>, <Feature: cats.MIN(nums.age)>, <Feature: cats.MIN(nums.money)>, <Feature: cats.MIN(nums.weight)>, <Feature: cats.SKEW(nums.age)>, <Feature: cats.SKEW(nums.money)>, <Feature: cats.SKEW(nums.weight)>, <Feature: cats.STD(nums.age)>, <Feature: cats.STD(nums.money)>, <Feature: cats.STD(nums.weight)>, <Feature: cats.SUM(nums.age)>, <Feature: cats.SUM(nums.money)>, <Feature: cats.SUM(nums.weight)>, <Feature: cats.DAY(born)>, <Feature: cats.MONTH(born)>, <Feature: cats.WEEKDAY(born)>, <Feature: cats.YEAR(born)>]
Blank cells are missing values (shown as NaN in the console output). The table is split into column groups, with the name column repeated in each group.
name | ID | age | money | weight |
Bob | 1 |  | 200 | 140.5 |
LiSa | 2 | 28 | 240 | 120.8 |
Mary | 3 | 38 | 290 | 169.4 |
Alan | 4 |  | 300 | 155.6 |

name | cats.hobbey | cats.sex | cats.COUNT(nums) |
Bob | 打篮球 | 男 | 1 |
LiSa | 打羽毛球 | 女 | 1 |
Mary |  |  |  |
Alan |  | 男 | 1 |

name | cats.MAX(nums.age) | cats.MAX(nums.money) | cats.MAX(nums.weight) | cats.MEAN(nums.age) | cats.MEAN(nums.money) | cats.MEAN(nums.weight) | cats.MIN(nums.age) | cats.MIN(nums.money) | cats.MIN(nums.weight) |
Bob |  | 200 | 140.5 |  | 200 | 140.5 |  | 200 | 140.5 |
LiSa | 28 | 240 | 120.8 | 28 | 240 | 120.8 | 28 | 240 | 120.8 |
Mary |  |  |  |  |  |  |  |  |  |
Alan |  | 300 | 155.6 |  | 300 | 155.6 |  | 300 | 155.6 |

name | cats.SKEW(nums.age) | cats.SKEW(nums.money) | cats.SKEW(nums.weight) | cats.STD(nums.age) | cats.STD(nums.money) | cats.STD(nums.weight) | cats.SUM(nums.age) | cats.SUM(nums.money) | cats.SUM(nums.weight) |
Bob |  |  |  |  |  |  | 0 | 200 | 140.5 |
LiSa |  |  |  |  |  |  | 28 | 240 | 120.8 |
Mary |  |  |  |  |  |  |  |  |  |
Alan |  |  |  |  |  |  | 0 | 300 | 155.6 |

name | cats.DAY(born) | cats.MONTH(born) | cats.WEEKDAY(born) | cats.YEAR(born) |
Bob |  |  |  |  |
LiSa | 1 | 1 | 0 | 1990 |
Mary |  |  |  |  |
Alan |  |  |  |  |
Field descriptions:
- <Feature: ID> : The "ID".
- <Feature: age> : The "age".
- <Feature: money> : The "money".
- <Feature: weight> : The "weight".
- <Feature: cats.sex> : The "sex" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.hobbey> : The "hobbey" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.COUNT(nums)> : The number of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.MAX(nums.age)> : The maximum of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.MAX(nums.money)> : The maximum of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.MAX(nums.weight)> : The maximum of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.MEAN(nums.age)> : The average of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.MEAN(nums.money)> : The average of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.MEAN(nums.weight)> : The average of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.MIN(nums.age)> : The minimum of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.MIN(nums.money)> : The minimum of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.MIN(nums.weight)> : The minimum of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.SKEW(nums.age)> : The skewness of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.SKEW(nums.money)> : The skewness of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.SKEW(nums.weight)> : The skewness of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.STD(nums.age)> : The standard deviation of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.STD(nums.money)> : The standard deviation of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.STD(nums.weight)> : The standard deviation of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.SUM(nums.age)> : The sum of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.SUM(nums.money)> : The sum of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.SUM(nums.weight)> : The sum of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.DAY(born)> : The day of the month of the "born" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.MONTH(born)> : The month of the "born" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.WEEKDAY(born)> : The day of the week of the "born" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.YEAR(born)> : The year of the "born" for the instance of "cats" associated with this instance of "nums".
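Every `cats.`-prefixed feature above is a parent-level value copied onto the child rows that reference that parent. A plain-pandas way to picture this is a left join from `nums` to the parent table on `ID`. The sketch below uses a two-column excerpt of the parent features, re-typed from the tables in this article. It is only an illustration of the join, not Featuretools code, and `parent_features` is a name chosen for the example.

```python
import numpy as np
import pandas as pd

# Excerpt of the parent-level ("cats") features, indexed by ID.
parent_features = pd.DataFrame(
    {"sex": ["男", "男", "女"], "MAX(nums.money)": [300.0, 200.0, 240.0]},
    index=pd.Index([4, 1, 2], name="ID"),
)

# Child table ("nums"); Mary's ID 3 has no matching parent row.
nums_df = pd.DataFrame({
    "name": ["Bob", "LiSa", "Mary", "Alan"],
    "ID": [1, 2, 3, 4],
    "money": [200.0, 240.0, 290.0, 300.0],
})

# Prefix the parent columns and left-join them onto the child rows by ID.
direct = parent_features.add_prefix("cats.")
feature_matrix_nums = nums_df.join(direct, on="ID").set_index("name")
print(feature_matrix_nums)
```

Because ID 3 is absent from `cats`, Mary's `cats.` columns come out missing, which matches the empty row for Mary in feature_matrix_nums.csv.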