ML-FE: Automatic feature generation/derivation with featuretools on a single CSV dataset (automatically split into two DataFrame tables)



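Contents

Automatic feature generation/derivation with featuretools on a single CSV dataset (automatically split into two DataFrame tables)

Design approach

1. Define the dataset

2. DFS design

Output

feature_matrix_cats_df.csv

feature_matrix_nums.csv

Recommended articles

Py-featuretools: Introduction, installation, and detailed usage guide for the featuretools library

ML-FE: Automatic feature generation/derivation with featuretools on a single CSV dataset (automatically split into two DataFrame tables)

ML-FE: Automatic feature generation/derivation with featuretools on a single CSV dataset (automatically split into two DataFrame tables), implementation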

Automatic feature generation/derivation with featuretools on a single CSV dataset (automatically split into two DataFrame tables)

Design approach

1. Define the dataset

import numpy as np
import pandas as pd

# Raw records; np.nan, pd.NaT and '' are mixed in on purpose as missing values
contents = {
    "name":   ['Bob', 'LiSa', 'Mary', 'Alan'],
    "ID":     [1, 2, 3, 4],
    "age":    [np.nan, 28, 38, ''],                  # np.nan prints as NaN
    "born":   [pd.NaT, pd.Timestamp("1990-01-01"),
               pd.Timestamp("1980-01-01"), ''],      # pd.NaT prints as NaT
    "sex":    ['男', '女', '女', '男'],               # '男' = male, '女' = female
    "hobbey": ['打篮球', '打羽毛球', '打乒乓球', ''],   # basketball, badminton, table tennis
    "money":  [200.0, 240.0, 290.0, 300.0],
    "weight": [140.5, 120.8, 169.4, 155.6],
}
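A minimal sketch of how such a dictionary could be cleaned and split into the two tables used later: a numeric table (`nums_df`) and a categorical/datetime table (`cats_df`). The cleaning rules (treating `''` as missing, splitting by dtype) are assumptions made for illustration. The article does not show its own splitting code, and this sketch keeps all four rows in both tables.

```python
import numpy as np
import pandas as pd

contents = {
    "name":   ["Bob", "LiSa", "Mary", "Alan"],
    "ID":     [1, 2, 3, 4],
    "age":    [np.nan, 28, 38, ""],
    "born":   [pd.NaT, pd.Timestamp("1990-01-01"), pd.Timestamp("1980-01-01"), ""],
    "sex":    ["男", "女", "女", "男"],
    "hobbey": ["打篮球", "打羽毛球", "打乒乓球", ""],
    "money":  [200.0, 240.0, 290.0, 300.0],
    "weight": [140.5, 120.8, 169.4, 155.6],
}
df = pd.DataFrame(contents)

# Assumption: empty strings stand for missing values
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df["born"] = pd.to_datetime(df["born"], errors="coerce")
df["hobbey"] = df["hobbey"].replace("", np.nan)

# Numeric columns go to one table, everything else to the other;
# "ID" is kept in both so the tables can be related later
num_cols = df.select_dtypes(include="number").columns.tolist()
other_cols = [c for c in df.columns if c not in num_cols and c != "name"]
nums_df = df[["name"] + num_cols]
cats_df = df[["ID"] + other_cols]

print(nums_df)
print(cats_df)
```

Keeping `ID` in both tables is what makes the one-to-many relationship between them expressible.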

2. DFS design

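  • (1) Specify a dictionary containing all the entities in the dataset.
  • (2) Specify how the entities are related: when two entities have a one-to-many relationship, the entity on the "one" side is called the "parent entity".
  • (3) Run Deep Feature Synthesis (DFS): the minimum input to DFS is a set of entities, a list of relationships, and the "target_entity" for which features are computed. The output of DFS is a feature matrix and the corresponding list of feature definitions.
    For example, building a feature matrix for each customer in a dataset yields dozens of new features that describe that customer's behavior.
  • (4) Change the target entity: one reason DFS is so powerful is that it can build a feature matrix for any entity in the data. For example, to build features for sessions instead, only the target entity needs to change.
  • (5) Understand the feature output: in general, Featuretools refers to generated features by their feature names.
    To make features easier to understand, Featuretools provides two additional tools, featuretools.graph_feature() and featuretools.describe_feature(),
    which help explain what a feature is and which steps Featuretools took to generate it.
  • (6) Feature lineage graphs:
    a feature lineage graph visually walks through the feature generation process. Starting from the base data, it shows step by step the primitives applied and the intermediate features generated to create the final feature.
  • (7) Feature descriptions: Featuretools can also automatically generate English sentence descriptions of features. These descriptions help explain what a feature is, and can be improved further by supplying manually defined custom definitions.
    For details on customizing the automatically generated descriptions, see "Generating Feature Descriptions" in the Featuretools documentation.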
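The aggregation and transform steps that DFS applies when `cats` is the target can be mimicked with plain pandas. The sketch below is not Featuretools' implementation. It rebuilds a few of the same columns by hand, and the two small frames are hypothetical stand-ins for the `nums`/`cats` split (three parent IDs, one child row each).

```python
import numpy as np
import pandas as pd

# Child table: one row per person, numeric columns only (hypothetical stand-in for nums)
nums_df = pd.DataFrame({
    "name":   ["Bob", "LiSa", "Alan"],
    "ID":     [1, 2, 4],
    "age":    [np.nan, 28.0, np.nan],
    "money":  [200.0, 240.0, 300.0],
    "weight": [140.5, 120.8, 155.6],
})
# Parent table keyed by ID (hypothetical stand-in for cats)
cats_df = pd.DataFrame({
    "ID":   [4, 1, 2],
    "sex":  ["男", "男", "女"],
    "born": pd.to_datetime([None, None, "1990-01-01"]),
}).set_index("ID")

# Aggregation primitives: summarise the child rows belonging to each parent ID
grouped = nums_df.groupby("ID")
agg = grouped[["age", "money", "weight"]].agg(["max", "mean", "min", "sum"])
agg.columns = [f"{func.upper()}(nums.{col})" for col, func in agg.columns]
agg.insert(0, "COUNT(nums)", grouped.size())

# Transform primitives: derive new columns from a single parent column
dates = pd.DataFrame({
    "DAY(born)": cats_df["born"].dt.day,
    "MONTH(born)": cats_df["born"].dt.month,
    "WEEKDAY(born)": cats_df["born"].dt.weekday,
    "YEAR(born)": cats_df["born"].dt.year,
})

# One row per parent ID: original columns, aggregations, then date parts
feature_matrix_cats = cats_df[["sex"]].join(agg).join(dates)
print(feature_matrix_cats)
```

Changing the target entity amounts to changing which table ends up with one row per index value; here that table is `cats`.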

 

 

Output

   name  ID  age       born sex hobbey  money  weight
0   Bob   1  NaN        NaT   男    打篮球  200.0   140.5
1  LiSa   2   28 1990-01-01   女   打羽毛球  240.0   120.8
2  Mary   3   38 1980-01-01   女   打乒乓球  290.0   169.4
3  Alan   4             NaT   男         300.0   155.6
-------------------------------------------
nums_df:----------------------------------
   name  ID   age  money  weight
0   Bob   1   NaN  200.0   140.5
1  LiSa   2  28.0  240.0   120.8
2  Mary   3  38.0  290.0   169.4
3  Alan   4   NaN  300.0   155.6
cats_df:----------------------------------
   ID hobbey sex        born
0   4    NaN   男         NaN
1   1    打篮球   男         NaN
2   2   打羽毛球   女  1990-01-01
---------------------------------DFS design:-----------------------------------
feature_matrix_nums
       ID   age  money  weight cats.hobbey cats.sex  cats.COUNT(nums)  \
name
Bob    1   NaN  200.0   140.5         打篮球        男               1.0
LiSa   2  28.0  240.0   120.8        打羽毛球        女               1.0
Mary   3  38.0  290.0   169.4         NaN      NaN               NaN

      cats.MAX(nums.age)  cats.MAX(nums.money)  cats.MAX(nums.weight)  \
name
Bob                  NaN                 200.0                  140.5
LiSa                28.0                 240.0                  120.8
Mary                 NaN                   NaN                    NaN

      cats.MEAN(nums.age)  cats.MEAN(nums.money)  cats.MEAN(nums.weight)  \
name
Bob                   NaN                  200.0                   140.5
LiSa                 28.0                  240.0                   120.8
Mary                  NaN                    NaN                     NaN

      cats.MIN(nums.age)  cats.MIN(nums.money)  cats.MIN(nums.weight)  \
name
Bob                  NaN                 200.0                  140.5
LiSa                28.0                 240.0                  120.8
Mary                 NaN                   NaN                    NaN

      cats.SKEW(nums.age)  cats.SKEW(nums.money)  cats.SKEW(nums.weight)  \
name
Bob                   NaN                    NaN                     NaN
LiSa                  NaN                    NaN                     NaN
Mary                  NaN                    NaN                     NaN

      cats.STD(nums.age)  cats.STD(nums.money)  cats.STD(nums.weight)  \
name
Bob                  NaN                   NaN                    NaN
LiSa                 NaN                   NaN                    NaN
Mary                 NaN                   NaN                    NaN

      cats.SUM(nums.age)  cats.SUM(nums.money)  cats.SUM(nums.weight)  \
name
Bob                  0.0                 200.0                  140.5
LiSa                28.0                 240.0                  120.8
Mary                 NaN                   NaN                    NaN

      cats.DAY(born)  cats.MONTH(born)  cats.WEEKDAY(born)  cats.YEAR(born)
name
Bob              NaN               NaN                 NaN              NaN
LiSa             1.0               1.0                 0.0           1990.0
Mary             NaN               NaN                 NaN              NaN
features_defs_nums: 29 [<Feature: ID>, <Feature: age>, <Feature: money>, <Feature: weight>, <Feature: cats.hobbey>, <Feature: cats.sex>, <Feature: cats.COUNT(nums)>, <Feature: cats.MAX(nums.age)>, <Feature: cats.MAX(nums.money)>, <Feature: cats.MAX(nums.weight)>, <Feature: cats.MEAN(nums.age)>, <Feature: cats.MEAN(nums.money)>, <Feature: cats.MEAN(nums.weight)>, <Feature: cats.MIN(nums.age)>, <Feature: cats.MIN(nums.money)>, <Feature: cats.MIN(nums.weight)>, <Feature: cats.SKEW(nums.age)>, <Feature: cats.SKEW(nums.money)>, <Feature: cats.SKEW(nums.weight)>, <Feature: cats.STD(nums.age)>, <Feature: cats.STD(nums.money)>, <Feature: cats.STD(nums.weight)>, <Feature: cats.SUM(nums.age)>, <Feature: cats.SUM(nums.money)>, <Feature: cats.SUM(nums.weight)>, <Feature: cats.DAY(born)>, <Feature: cats.MONTH(born)>, <Feature: cats.WEEKDAY(born)>, <Feature: cats.YEAR(born)>]
feature_matrix_cats_df
    hobbey sex  COUNT(nums)  MAX(nums.age)  MAX(nums.money)  MAX(nums.weight)  \
ID
4     NaN   男            1            NaN            300.0             155.6
1     打篮球   男            1            NaN            200.0             140.5
2    打羽毛球   女            1           28.0            240.0             120.8

    MEAN(nums.age)  MEAN(nums.money)  MEAN(nums.weight)  MIN(nums.age)  \
ID
4              NaN             300.0              155.6            NaN
1              NaN             200.0              140.5            NaN
2             28.0             240.0              120.8           28.0

    MIN(nums.money)  MIN(nums.weight)  SKEW(nums.age)  SKEW(nums.money)  \
ID
4             300.0             155.6             NaN               NaN
1             200.0             140.5             NaN               NaN
2             240.0             120.8             NaN               NaN

    SKEW(nums.weight)  STD(nums.age)  STD(nums.money)  STD(nums.weight)  \
ID
4                 NaN            NaN              NaN               NaN
1                 NaN            NaN              NaN               NaN
2                 NaN            NaN              NaN               NaN

    SUM(nums.age)  SUM(nums.money)  SUM(nums.weight)  DAY(born)  MONTH(born)  \
ID
4             0.0            300.0             155.6        NaN          NaN
1             0.0            200.0             140.5        NaN          NaN
2            28.0            240.0             120.8        1.0          1.0

    WEEKDAY(born)  YEAR(born)
ID
4             NaN         NaN
1             NaN         NaN
2             0.0      1990.0
features_defs_cats_df: 25 [<Feature: hobbey>, <Feature: sex>, <Feature: COUNT(nums)>, <Feature: MAX(nums.age)>, <Feature: MAX(nums.money)>, <Feature: MAX(nums.weight)>, <Feature: MEAN(nums.age)>, <Feature: MEAN(nums.money)>, <Feature: MEAN(nums.weight)>, <Feature: MIN(nums.age)>, <Feature: MIN(nums.money)>, <Feature: MIN(nums.weight)>, <Feature: SKEW(nums.age)>, <Feature: SKEW(nums.money)>, <Feature: SKEW(nums.weight)>, <Feature: STD(nums.age)>, <Feature: STD(nums.money)>, <Feature: STD(nums.weight)>, <Feature: SUM(nums.age)>, <Feature: SUM(nums.money)>, <Feature: SUM(nums.weight)>, <Feature: DAY(born)>, <Feature: MONTH(born)>, <Feature: WEEKDAY(born)>, <Feature: YEAR(born)>]
<Feature: SUM(nums.age)>
The sum of the "age" of all instances of "nums" for each "ID" in "cats".
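The mix of 0.0 and NaN in the aggregated columns follows from every parent ID having exactly one child row. A small sketch with plain pandas reductions reproduces the same pattern. It assumes the aggregation primitives behave like pandas' default reductions, which the printed values are consistent with.

```python
import numpy as np
import pandas as pd

single_missing = pd.Series([np.nan])  # e.g. the only "age" value for ID 1
single_value = pd.Series([200.0])     # e.g. the only "money" value for ID 1

print(single_missing.sum())  # 0.0: NaN is skipped, and an empty sum is 0
print(single_missing.max())  # nan: there is no valid value to take a maximum of
print(single_value.std())    # nan: sample standard deviation needs at least 2 values
print(single_value.skew())   # nan: skewness needs at least 3 values
```

This is why SUM(nums.age) is 0.0 where MAX, MEAN and MIN of the same column are NaN, and why every STD and SKEW column is NaN.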

 

 

feature_matrix_cats_df.csv

features_defs_cats_df: 25

[<Feature: hobbey>, <Feature: sex>, <Feature: COUNT(nums)>, <Feature: MAX(nums.age)>, <Feature: MAX(nums.money)>, <Feature: MAX(nums.weight)>, <Feature: MEAN(nums.age)>, <Feature: MEAN(nums.money)>, <Feature: MEAN(nums.weight)>, <Feature: MIN(nums.age)>, <Feature: MIN(nums.money)>, <Feature: MIN(nums.weight)>, <Feature: SKEW(nums.age)>, <Feature: SKEW(nums.money)>, <Feature: SKEW(nums.weight)>, <Feature: STD(nums.age)>, <Feature: STD(nums.money)>, <Feature: STD(nums.weight)>, <Feature: SUM(nums.age)>, <Feature: SUM(nums.money)>, <Feature: SUM(nums.weight)>, <Feature: DAY(born)>, <Feature: MONTH(born)>, <Feature: WEEKDAY(born)>, <Feature: YEAR(born)>]

ID,hobbey,sex,COUNT(nums),MAX(nums.age),MAX(nums.money),MAX(nums.weight),MEAN(nums.age),MEAN(nums.money),MEAN(nums.weight),MIN(nums.age),MIN(nums.money),MIN(nums.weight),SKEW(nums.age),SKEW(nums.money),SKEW(nums.weight),STD(nums.age),STD(nums.money),STD(nums.weight),SUM(nums.age),SUM(nums.money),SUM(nums.weight),DAY(born),MONTH(born),WEEKDAY(born),YEAR(born)
4,,男,1,,300,155.6,,300,155.6,,300,155.6,,,,,,,0,300,155.6,,,,
1,打篮球,男,1,,200,140.5,,200,140.5,,200,140.5,,,,,,,0,200,140.5,,,,
2,打羽毛球,女,1,28,240,120.8,28,240,120.8,28,240,120.8,,,,,,,28,240,120.8,1,1,0,1990

Field descriptions

  1. <Feature: hobbey> : The "hobbey".
  2. <Feature: sex> : The "sex".
  3. <Feature: COUNT(nums)> : The number of all instances of "nums" for each "ID" in "cats".
  4. <Feature: MAX(nums.age)> : The maximum of the "age" of all instances of "nums" for each "ID" in "cats".
  5. <Feature: MAX(nums.money)> : The maximum of the "money" of all instances of "nums" for each "ID" in "cats".
  6. <Feature: MAX(nums.weight)> : The maximum of the "weight" of all instances of "nums" for each "ID" in "cats".
  7. <Feature: MEAN(nums.age)> : The average of the "age" of all instances of "nums" for each "ID" in "cats".
  8. <Feature: MEAN(nums.money)> : The average of the "money" of all instances of "nums" for each "ID" in "cats".
  9. <Feature: MEAN(nums.weight)> : The average of the "weight" of all instances of "nums" for each "ID" in "cats".
  10. <Feature: MIN(nums.age)> : The minimum of the "age" of all instances of "nums" for each "ID" in "cats".
  11. <Feature: MIN(nums.money)> : The minimum of the "money" of all instances of "nums" for each "ID" in "cats".
  12. <Feature: MIN(nums.weight)> : The minimum of the "weight" of all instances of "nums" for each "ID" in "cats".
  13. <Feature: SKEW(nums.age)> : The skewness of the "age" of all instances of "nums" for each "ID" in "cats".
  14. <Feature: SKEW(nums.money)> : The skewness of the "money" of all instances of "nums" for each "ID" in "cats".
  15. <Feature: SKEW(nums.weight)> : The skewness of the "weight" of all instances of "nums" for each "ID" in "cats".
  16. <Feature: STD(nums.age)> : The standard deviation of the "age" of all instances of "nums" for each "ID" in "cats".
  17. <Feature: STD(nums.money)> : The standard deviation of the "money" of all instances of "nums" for each "ID" in "cats".
  18. <Feature: STD(nums.weight)> : The standard deviation of the "weight" of all instances of "nums" for each "ID" in "cats".
  19. <Feature: SUM(nums.age)> : The sum of the "age" of all instances of "nums" for each "ID" in "cats".
  20. <Feature: SUM(nums.money)> : The sum of the "money" of all instances of "nums" for each "ID" in "cats".
  21. <Feature: SUM(nums.weight)> : The sum of the "weight" of all instances of "nums" for each "ID" in "cats".
  22. <Feature: DAY(born)> : The day of the month of the "born".
  23. <Feature: MONTH(born)> : The month of the "born".
  24. <Feature: WEEKDAY(born)> : The day of the week of the "born".
  25. <Feature: YEAR(born)> : The year of the "born".
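The feature counts (25 for `cats`, 29 for `nums`) can be checked by enumerating the names. The sketch below rebuilds the name lists from the primitives visible in the output. The grouping into identity, aggregation, and transform features is read off the feature names, and the variable names are illustrative only.

```python
agg_primitives = ["MAX", "MEAN", "MIN", "SKEW", "STD", "SUM"]
trans_primitives = ["DAY", "MONTH", "WEEKDAY", "YEAR"]
numeric_cols = ["age", "money", "weight"]

cats_features = (
    ["hobbey", "sex"]                                                   # 2 original columns
    + ["COUNT(nums)"]                                                   # 1 row count
    + [f"{p}(nums.{c})" for p in agg_primitives for c in numeric_cols]  # 6 x 3 = 18
    + [f"{p}(born)" for p in trans_primitives]                          # 4 date parts
)

# Every cats feature is also exposed on nums with a "cats." prefix
nums_features = ["ID", "age", "money", "weight"] + [f"cats.{f}" for f in cats_features]

print(len(cats_features), len(nums_features))  # 25 29
```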

 

 

feature_matrix_nums.csv

features_defs_nums: 29

[<Feature: ID>, <Feature: age>, <Feature: money>, <Feature: weight>, <Feature: cats.hobbey>, <Feature: cats.sex>, <Feature: cats.COUNT(nums)>, <Feature: cats.MAX(nums.age)>, <Feature: cats.MAX(nums.money)>, <Feature: cats.MAX(nums.weight)>, <Feature: cats.MEAN(nums.age)>, <Feature: cats.MEAN(nums.money)>, <Feature: cats.MEAN(nums.weight)>, <Feature: cats.MIN(nums.age)>, <Feature: cats.MIN(nums.money)>, <Feature: cats.MIN(nums.weight)>, <Feature: cats.SKEW(nums.age)>, <Feature: cats.SKEW(nums.money)>, <Feature: cats.SKEW(nums.weight)>, <Feature: cats.STD(nums.age)>, <Feature: cats.STD(nums.money)>, <Feature: cats.STD(nums.weight)>, <Feature: cats.SUM(nums.age)>, <Feature: cats.SUM(nums.money)>, <Feature: cats.SUM(nums.weight)>, <Feature: cats.DAY(born)>, <Feature: cats.MONTH(born)>, <Feature: cats.WEEKDAY(born)>, <Feature: cats.YEAR(born)>]

name,ID,age,money,weight,cats.hobbey,cats.sex,cats.COUNT(nums),cats.MAX(nums.age),cats.MAX(nums.money),cats.MAX(nums.weight),cats.MEAN(nums.age),cats.MEAN(nums.money),cats.MEAN(nums.weight),cats.MIN(nums.age),cats.MIN(nums.money),cats.MIN(nums.weight),cats.SKEW(nums.age),cats.SKEW(nums.money),cats.SKEW(nums.weight),cats.STD(nums.age),cats.STD(nums.money),cats.STD(nums.weight),cats.SUM(nums.age),cats.SUM(nums.money),cats.SUM(nums.weight),cats.DAY(born),cats.MONTH(born),cats.WEEKDAY(born),cats.YEAR(born)
Bob,1,,200,140.5,打篮球,男,1,,200,140.5,,200,140.5,,200,140.5,,,,,,,0,200,140.5,,,,
LiSa,2,28,240,120.8,打羽毛球,女,1,28,240,120.8,28,240,120.8,28,240,120.8,,,,,,,28,240,120.8,1,1,0,1990
Mary,3,38,290,169.4,,,,,,,,,,,,,,,,,,,,,,,,,
Alan,4,,300,155.6,,男,1,,300,155.6,,300,155.6,,300,155.6,,,,,,,0,300,155.6,,,,

 

Field descriptions

  1. <Feature: ID> : The "ID".
  2. <Feature: age> : The "age".
  3. <Feature: money> : The "money".
  4. <Feature: weight> : The "weight".
  5. <Feature: cats.sex> : The "sex" for the instance of "cats" associated with this instance of "nums".
  6. <Feature: cats.hobbey> : The "hobbey" for the instance of "cats" associated with this instance of "nums".
  7. <Feature: cats.COUNT(nums)> : The number of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  8. <Feature: cats.MAX(nums.age)> : The maximum of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  9. <Feature: cats.MAX(nums.money)> : The maximum of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  10. <Feature: cats.MAX(nums.weight)> : The maximum of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  11. <Feature: cats.MEAN(nums.age)> : The average of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  12. <Feature: cats.MEAN(nums.money)> : The average of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  13. <Feature: cats.MEAN(nums.weight)> : The average of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  14. <Feature: cats.MIN(nums.age)> : The minimum of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  15. <Feature: cats.MIN(nums.money)> : The minimum of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  16. <Feature: cats.MIN(nums.weight)> : The minimum of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  17. <Feature: cats.SKEW(nums.age)> : The skewness of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  18. <Feature: cats.SKEW(nums.money)> : The skewness of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  19. <Feature: cats.SKEW(nums.weight)> : The skewness of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  20. <Feature: cats.STD(nums.age)> : The standard deviation of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  21. <Feature: cats.STD(nums.money)> : The standard deviation of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  22. <Feature: cats.STD(nums.weight)> : The standard deviation of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  23. <Feature: cats.SUM(nums.age)> : The sum of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  24. <Feature: cats.SUM(nums.money)> : The sum of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  25. <Feature: cats.SUM(nums.weight)> : The sum of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  26. <Feature: cats.DAY(born)> : The day of the month of the "born" for the instance of "cats" associated with this instance of "nums".
  27. <Feature: cats.MONTH(born)> : The month of the "born" for the instance of "cats" associated with this instance of "nums".
  28. <Feature: cats.WEEKDAY(born)> : The day of the week of the "born" for the instance of "cats" associated with this instance of "nums".
  29. <Feature: cats.YEAR(born)> : The year of the "born" for the instance of "cats" associated with this instance of "nums".
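The phrase "for the instance of "cats" associated with this instance of "nums"" describes a lookup from each child row to its parent row. A rough pandas equivalent is a left join on `ID`. The sketch below uses a trimmed, hypothetical copy of the two tables and is not how Featuretools computes these columns internally.

```python
import pandas as pd

# Child rows, indexed by name (trimmed to one numeric column for brevity)
nums_df = pd.DataFrame({
    "name":  ["Bob", "LiSa", "Mary", "Alan"],
    "ID":    [1, 2, 3, 4],
    "money": [200.0, 240.0, 290.0, 300.0],
}).set_index("name")

# Parent-level features, indexed by ID; note there is no row for ID 3
cats_features = pd.DataFrame({
    "ID":              [4, 1, 2],
    "sex":             ["男", "男", "女"],
    "MAX(nums.money)": [300.0, 200.0, 240.0],
}).set_index("ID")

# Prefix the parent features and attach them to each child row by ID
direct = cats_features.add_prefix("cats.")
feature_matrix_nums = nums_df.join(direct, on="ID")  # left join keeps all child rows
print(feature_matrix_nums)
```

Mary (ID 3) has no matching parent row, so all of her `cats.*` columns come out as NaN, matching the empty cells in her row of the CSV.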

