ML/FE: Automated feature generation (feature derivation) with featuretools from a single CSV dataset (automatically split into two DataFrame tables)



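Contents

Automated feature generation (feature derivation) with featuretools from a single CSV dataset (automatically split into two DataFrame tables)

Design approach

1. Define the dataset

2. DFS design

Output

feature_matrix_cats_df.csv

feature_matrix_nums.csv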


 

 

Recommended reading

Py/featuretools: Introduction, installation, and usage guide for the featuretools library

ML/FE: Automated feature generation (feature derivation) with featuretools from a single CSV dataset (automatically split into two DataFrame tables): implementation

 

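Automated feature generation (feature derivation) with featuretools from a single CSV dataset (automatically split into two DataFrame tables)

Design approach

1. Define the dataset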

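import numpy as np
import pandas as pd

# Sample data that deliberately mixes missing-value markers (np.nan, pd.NaT, '').
# sex: '男' = male, '女' = female; hobbey: basketball, badminton, table tennis.
contents = {
    "name":   ['Bob', 'LiSa', 'Mary', 'Alan'],
    "ID":     [1, 2, 3, 4],
    "age":    [np.nan, 28, 38, ''],        # np.nan prints as NaN, '' prints as blank
    "born":   [pd.NaT, pd.Timestamp("1990-01-01"), pd.Timestamp("1980-01-01"), ''],  # pd.NaT prints as NaT
    "sex":    ['男', '女', '女', '男'],
    "hobbey": ['打篮球', '打羽毛球', '打乒乓球', ''],
    "money":  [200.0, 240.0, 290.0, 300.0],
    "weight": [140.5, 120.8, 169.4, 155.6],
}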
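The post does not show how the single table is split into the two DataFrames, so the following is only a minimal sketch of one way to do it with pandas. It assumes the split is made by column type (numeric columns go to nums_df, all other columns go to cats_df) and that both tables keep ID as the join key. The helper name split_by_dtype is hypothetical, and the sketch does not reproduce the row order or the missing ID 3 seen in the printed cats_df.

```python
import numpy as np
import pandas as pd

contents = {
    "name": ["Bob", "LiSa", "Mary", "Alan"],
    "ID": [1, 2, 3, 4],
    "age": [np.nan, 28, 38, ""],
    "born": [pd.NaT, pd.Timestamp("1990-01-01"), pd.Timestamp("1980-01-01"), ""],
    "sex": ["男", "女", "女", "男"],
    "hobbey": ["打篮球", "打羽毛球", "打乒乓球", ""],
    "money": [200.0, 240.0, 290.0, 300.0],
    "weight": [140.5, 120.8, 169.4, 155.6],
}
df = pd.DataFrame(contents)

# Normalize the mixed missing markers ('' / NaN / NaT) before splitting.
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df["born"] = pd.to_datetime(df["born"], errors="coerce")
df["hobbey"] = df["hobbey"].mask(df["hobbey"] == "")


def split_by_dtype(frame, key="ID", label="name"):
    """Hypothetical helper: numeric columns vs. everything else, both keyed by `key`."""
    num_cols = [c for c in frame.select_dtypes(include="number").columns if c != key]
    other_cols = [c for c in frame.columns if c not in num_cols and c not in (key, label)]
    nums = frame[[label, key] + num_cols]
    cats = frame[[key] + other_cols]
    return nums, cats


nums_df, cats_df = split_by_dtype(df)
print(nums_df)
print(cats_df)
```

Coercing the columns first matters: without it, age and born stay as object columns and would not be recognized as numeric or datetime data.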

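2. DFS design

  • (1) Specify a dictionary containing all the entities in the dataset.
  • (2) Specify how the entities are related: when two entities have a one-to-many relationship, the entity on the "one" side is called the "parent entity".
  • (3) Run Deep Feature Synthesis (DFS): the minimal input to DFS is a set of entities, a set of relationships, and the "target_entity" for which features are computed. The output of DFS is a feature matrix and the corresponding list of feature definitions.
    For instance, creating a feature matrix for every customer in the data yields dozens of new features that describe each customer's behavior.
  • (4) Change the target entity: one reason DFS is so powerful is that it can create a feature matrix for any entity in the data, for example for sessions instead of customers.
  • (5) Understand the feature output: in general, Featuretools refers to generated features by their feature names.
    To make features easier to understand, Featuretools provides two additional tools, featuretools.graph_feature() and featuretools.describe_feature(),
    which help explain what a feature is and the steps Featuretools took to generate it.
  • (6) Feature lineage graphs:
    A feature lineage graph visually walks through the feature generation process. Starting from the base data, it shows step by step the primitives applied and the intermediate features generated to create the final feature.
  • (7) Feature descriptions: Featuretools can also automatically generate an English sentence describing a feature. Feature descriptions help explain what a feature is, and can be further improved by including manually defined customizations.
    For details on customizing the automatically generated descriptions, see "Generating Feature Descriptions" in the Featuretools documentation.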
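Steps (1) to (3) can be illustrated without Featuretools itself. The sketch below is a pandas-only emulation, not the Featuretools API: it treats cats as the parent table and nums as the child table (joined on ID) and computes the same kind of per-parent aggregates that appear in the output (COUNT, MAX, MEAN, MIN, SUM, STD, SKEW). The tables are trimmed versions of the example data, and the column-naming scheme is an assumption chosen to mimic Featuretools feature names.

```python
import numpy as np
import pandas as pd

# Child table: one row per person, with ID as the foreign key to the parent.
nums_df = pd.DataFrame({
    "name": ["Bob", "LiSa", "Mary", "Alan"],
    "ID": [1, 2, 3, 4],
    "age": [np.nan, 28.0, 38.0, np.nan],
    "money": [200.0, 240.0, 290.0, 300.0],
})
# Parent table: one row per ID (ID 3 is absent, as in the printed cats_df).
cats_df = pd.DataFrame({"ID": [4, 1, 2], "sex": ["男", "男", "女"]})

# Aggregation step: summarize the child rows that belong to each parent ID.
grouped = nums_df.groupby("ID")
agg = grouped[["age", "money"]].agg(["max", "mean", "min", "sum", "std", "skew"])
agg.columns = [f"{func.upper()}(nums.{col})" for col, func in agg.columns]
agg.insert(0, "COUNT(nums)", grouped.size())

# Feature matrix for the parent ("cats") target: parent columns plus aggregates.
feature_matrix_cats = cats_df.set_index("ID").join(agg)
print(feature_matrix_cats.T)
```

Because every parent has exactly one child row in this data, MAX, MEAN and MIN simply echo the child value, while STD and SKEW are NaN. In this sketch SUM(nums.age) is 0.0 for a missing age because pandas sums an all-NaN group to 0 by default, which is consistent with the 0.0 values in the Output section.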

 

 

Output

   name  ID  age       born sex hobbey  money  weight
0   Bob   1  NaN        NaT   男    打篮球  200.0   140.5
1  LiSa   2   28 1990-01-01   女   打羽毛球  240.0   120.8
2  Mary   3   38 1980-01-01   女   打乒乓球  290.0   169.4
3  Alan   4             NaT   男         300.0   155.6
-------------------------------------------
nums_df:----------------------------------
   name  ID   age  money  weight
0   Bob   1   NaN  200.0   140.5
1  LiSa   2  28.0  240.0   120.8
2  Mary   3  38.0  290.0   169.4
3  Alan   4   NaN  300.0   155.6
cats_df:----------------------------------
   ID hobbey sex        born
0   4    NaN   男         NaN
1   1    打篮球   男         NaN
2   2   打羽毛球   女  1990-01-01
---------------------------------DFS design:-----------------------------------
feature_matrix_nums 
       ID   age  money  weight cats.hobbey cats.sex  cats.COUNT(nums)  \
name                                                                   
Bob    1   NaN  200.0   140.5         打篮球        男               1.0
LiSa   2  28.0  240.0   120.8        打羽毛球        女               1.0
Mary   3  38.0  290.0   169.4         NaN      NaN               NaN   

      cats.MAX(nums.age)  cats.MAX(nums.money)  cats.MAX(nums.weight)  \
name                                                                    
Bob                  NaN                 200.0                  140.5
LiSa                28.0                 240.0                  120.8
Mary                 NaN                   NaN                    NaN   

      cats.MEAN(nums.age)  cats.MEAN(nums.money)  cats.MEAN(nums.weight)  \
name                                                                       
Bob                   NaN                  200.0                   140.5
LiSa                 28.0                  240.0                   120.8
Mary                  NaN                    NaN                     NaN   

      cats.MIN(nums.age)  cats.MIN(nums.money)  cats.MIN(nums.weight)  \
name                                                                    
Bob                  NaN                 200.0                  140.5
LiSa                28.0                 240.0                  120.8
Mary                 NaN                   NaN                    NaN   

      cats.SKEW(nums.age)  cats.SKEW(nums.money)  cats.SKEW(nums.weight)  \
name                                                                       
Bob                   NaN                    NaN                     NaN   
LiSa                  NaN                    NaN                     NaN   
Mary                  NaN                    NaN                     NaN   

      cats.STD(nums.age)  cats.STD(nums.money)  cats.STD(nums.weight)  \
name                                                                    
Bob                  NaN                   NaN                    NaN   
LiSa                 NaN                   NaN                    NaN   
Mary                 NaN                   NaN                    NaN   

      cats.SUM(nums.age)  cats.SUM(nums.money)  cats.SUM(nums.weight)  \
name                                                                    
Bob                  0.0                 200.0                  140.5
LiSa                28.0                 240.0                  120.8
Mary                 NaN                   NaN                    NaN   

      cats.DAY(born)  cats.MONTH(born)  cats.WEEKDAY(born)  cats.YEAR(born)  
name                                                                         
Bob              NaN               NaN                 NaN              NaN  
LiSa             1.0               1.0                 0.0           1990.0
Mary             NaN               NaN                 NaN              NaN  
features_defs_nums: 29 [<Feature: ID>, <Feature: age>, <Feature: money>, <Feature: weight>, <Feature: cats.hobbey>, <Feature: cats.sex>, <Feature: cats.COUNT(nums)>, <Feature: cats.MAX(nums.age)>, <Feature: cats.MAX(nums.money)>, <Feature: cats.MAX(nums.weight)>, <Feature: cats.MEAN(nums.age)>, <Feature: cats.MEAN(nums.money)>, <Feature: cats.MEAN(nums.weight)>, <Feature: cats.MIN(nums.age)>, <Feature: cats.MIN(nums.money)>, <Feature: cats.MIN(nums.weight)>, <Feature: cats.SKEW(nums.age)>, <Feature: cats.SKEW(nums.money)>, <Feature: cats.SKEW(nums.weight)>, <Feature: cats.STD(nums.age)>, <Feature: cats.STD(nums.money)>, <Feature: cats.STD(nums.weight)>, <Feature: cats.SUM(nums.age)>, <Feature: cats.SUM(nums.money)>, <Feature: cats.SUM(nums.weight)>, <Feature: cats.DAY(born)>, <Feature: cats.MONTH(born)>, <Feature: cats.WEEKDAY(born)>, <Feature: cats.YEAR(born)>]
feature_matrix_cats_df 
    hobbey sex  COUNT(nums)  MAX(nums.age)  MAX(nums.money)  MAX(nums.weight)  \
ID                                                                             
4     NaN   男            1            NaN            300.0             155.6
1     打篮球   男            1            NaN            200.0             140.5
2    打羽毛球   女            1           28.0            240.0             120.8

    MEAN(nums.age)  MEAN(nums.money)  MEAN(nums.weight)  MIN(nums.age)  \
ID                                                                       
4              NaN             300.0              155.6            NaN   
1              NaN             200.0              140.5            NaN   
2             28.0             240.0              120.8           28.0

    MIN(nums.money)  MIN(nums.weight)  SKEW(nums.age)  SKEW(nums.money)  \
ID                                                                        
4             300.0             155.6             NaN               NaN   
1             200.0             140.5             NaN               NaN   
2             240.0             120.8             NaN               NaN   

    SKEW(nums.weight)  STD(nums.age)  STD(nums.money)  STD(nums.weight)  \
ID                                                                        
4                 NaN            NaN              NaN               NaN   
1                 NaN            NaN              NaN               NaN   
2                 NaN            NaN              NaN               NaN   

    SUM(nums.age)  SUM(nums.money)  SUM(nums.weight)  DAY(born)  MONTH(born)  \
ID                                                                             
4             0.0            300.0             155.6        NaN          NaN   
1             0.0            200.0             140.5        NaN          NaN   
2            28.0            240.0             120.8        1.0          1.0

    WEEKDAY(born)  YEAR(born)  
ID                             
4             NaN         NaN  
1             NaN         NaN  
2             0.0      1990.0
features_defs_cats_df: 25 [<Feature: hobbey>, <Feature: sex>, <Feature: COUNT(nums)>, <Feature: MAX(nums.age)>, <Feature: MAX(nums.money)>, <Feature: MAX(nums.weight)>, <Feature: MEAN(nums.age)>, <Feature: MEAN(nums.money)>, <Feature: MEAN(nums.weight)>, <Feature: MIN(nums.age)>, <Feature: MIN(nums.money)>, <Feature: MIN(nums.weight)>, <Feature: SKEW(nums.age)>, <Feature: SKEW(nums.money)>, <Feature: SKEW(nums.weight)>, <Feature: STD(nums.age)>, <Feature: STD(nums.money)>, <Feature: STD(nums.weight)>, <Feature: SUM(nums.age)>, <Feature: SUM(nums.money)>, <Feature: SUM(nums.weight)>, <Feature: DAY(born)>, <Feature: MONTH(born)>, <Feature: WEEKDAY(born)>, <Feature: YEAR(born)>]
<Feature: SUM(nums.age)>
The sum of the "age" of all instances of "nums" for each "ID" in "cats".

 

 

feature_matrix_cats_df.csv


(Blank cells in the CSV are shown as NaN.)

ID  hobbey    sex  COUNT(nums)
4   NaN       男   1
1   打篮球    男   1
2   打羽毛球  女   1

ID  MAX(nums.age)  MAX(nums.money)  MAX(nums.weight)
4   NaN            300              155.6
1   NaN            200              140.5
2   28             240              120.8

ID  MEAN(nums.age)  MEAN(nums.money)  MEAN(nums.weight)
4   NaN             300               155.6
1   NaN             200               140.5
2   28              240               120.8

ID  MIN(nums.age)  MIN(nums.money)  MIN(nums.weight)
4   NaN            300              155.6
1   NaN            200              140.5
2   28             240              120.8

ID  SKEW(nums.age)  SKEW(nums.money)  SKEW(nums.weight)
4   NaN             NaN               NaN
1   NaN             NaN               NaN
2   NaN             NaN               NaN

ID  STD(nums.age)  STD(nums.money)  STD(nums.weight)
4   NaN            NaN              NaN
1   NaN            NaN              NaN
2   NaN            NaN              NaN

ID  SUM(nums.age)  SUM(nums.money)  SUM(nums.weight)
4   0              300              155.6
1   0              200              140.5
2   28             240              120.8

ID  DAY(born)  MONTH(born)  WEEKDAY(born)  YEAR(born)
4   NaN        NaN          NaN            NaN
1   NaN        NaN          NaN            NaN
2   1          1            0              1990

Field descriptions

  1. <Feature: hobbey> : The "hobbey".
  2. <Feature: sex> : The "sex".
  3. <Feature: COUNT(nums)> : The number of all instances of "nums" for each "ID" in "cats".
  4. <Feature: MAX(nums.age)> : The maximum of the "age" of all instances of "nums" for each "ID" in "cats".
  5. <Feature: MAX(nums.money)> : The maximum of the "money" of all instances of "nums" for each "ID" in "cats".
  6. <Feature: MAX(nums.weight)> : The maximum of the "weight" of all instances of "nums" for each "ID" in "cats".
  7. <Feature: MEAN(nums.age)> : The average of the "age" of all instances of "nums" for each "ID" in "cats".
  8. <Feature: MEAN(nums.money)> : The average of the "money" of all instances of "nums" for each "ID" in "cats".
  9. <Feature: MEAN(nums.weight)> : The average of the "weight" of all instances of "nums" for each "ID" in "cats".
  10. <Feature: MIN(nums.age)> : The minimum of the "age" of all instances of "nums" for each "ID" in "cats".
  11. <Feature: MIN(nums.money)> : The minimum of the "money" of all instances of "nums" for each "ID" in "cats".
  12. <Feature: MIN(nums.weight)> : The minimum of the "weight" of all instances of "nums" for each "ID" in "cats".
  13. <Feature: SKEW(nums.age)> : The skewness of the "age" of all instances of "nums" for each "ID" in "cats".
  14. <Feature: SKEW(nums.money)> : The skewness of the "money" of all instances of "nums" for each "ID" in "cats".
  15. <Feature: SKEW(nums.weight)> : The skewness of the "weight" of all instances of "nums" for each "ID" in "cats".
  16. <Feature: STD(nums.age)> : The standard deviation of the "age" of all instances of "nums" for each "ID" in "cats".
  17. <Feature: STD(nums.money)> : The standard deviation of the "money" of all instances of "nums" for each "ID" in "cats".
  18. <Feature: STD(nums.weight)> : The standard deviation of the "weight" of all instances of "nums" for each "ID" in "cats".
  19. <Feature: SUM(nums.age)> : The sum of the "age" of all instances of "nums" for each "ID" in "cats".
  20. <Feature: SUM(nums.money)> : The sum of the "money" of all instances of "nums" for each "ID" in "cats".
  21. <Feature: SUM(nums.weight)> : The sum of the "weight" of all instances of "nums" for each "ID" in "cats".
  22. <Feature: DAY(born)> : The day of the month of the "born".
  23. <Feature: MONTH(born)> : The month of the "born".
  24. <Feature: WEEKDAY(born)> : The day of the week of the "born".
  25. <Feature: YEAR(born)> : The year of the "born".
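The last four features (DAY, MONTH, WEEKDAY, YEAR) are transform features computed row by row from the born column. As a rough illustration of what they contain, the sketch below derives the same values with the pandas .dt accessor; this is an emulation for explanation only, not how Featuretools implements its primitives, and the small born series is rebuilt by hand from the example data.

```python
import pandas as pd

# born per ID, as in the cats table: only ID 2 has a known birth date.
born = pd.Series(
    pd.to_datetime(["1990-01-01", None, None]),
    index=pd.Index([2, 1, 4], name="ID"),
    name="born",
)

date_features = pd.DataFrame({
    "DAY(born)": born.dt.day,          # day of the month
    "MONTH(born)": born.dt.month,      # month number
    "WEEKDAY(born)": born.dt.weekday,  # Monday = 0 ... Sunday = 6
    "YEAR(born)": born.dt.year,
})
print(date_features)
```

1990-01-01 fell on a Monday, which is why WEEKDAY(born) is 0 for ID 2; rows whose born value is missing yield NaN in every derived column.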

 

 

feature_matrix_nums.csv


(Blank cells in the CSV are shown as NaN.)

name  ID  age  money  weight
Bob   1   NaN  200    140.5
LiSa  2   28   240    120.8
Mary  3   38   290    169.4
Alan  4   NaN  300    155.6

name  cats.hobbey  cats.sex  cats.COUNT(nums)
Bob   打篮球       男        1
LiSa  打羽毛球     女        1
Mary  NaN          NaN       NaN
Alan  NaN          男        1

name  cats.MAX(nums.age)  cats.MAX(nums.money)  cats.MAX(nums.weight)
Bob   NaN                 200                   140.5
LiSa  28                  240                   120.8
Mary  NaN                 NaN                   NaN
Alan  NaN                 300                   155.6

name  cats.MEAN(nums.age)  cats.MEAN(nums.money)  cats.MEAN(nums.weight)
Bob   NaN                  200                    140.5
LiSa  28                   240                    120.8
Mary  NaN                  NaN                    NaN
Alan  NaN                  300                    155.6

name  cats.MIN(nums.age)  cats.MIN(nums.money)  cats.MIN(nums.weight)
Bob   NaN                 200                   140.5
LiSa  28                  240                   120.8
Mary  NaN                 NaN                   NaN
Alan  NaN                 300                   155.6

name  cats.SKEW(nums.age)  cats.SKEW(nums.money)  cats.SKEW(nums.weight)
Bob   NaN                  NaN                    NaN
LiSa  NaN                  NaN                    NaN
Mary  NaN                  NaN                    NaN
Alan  NaN                  NaN                    NaN

name  cats.STD(nums.age)  cats.STD(nums.money)  cats.STD(nums.weight)
Bob   NaN                 NaN                   NaN
LiSa  NaN                 NaN                   NaN
Mary  NaN                 NaN                   NaN
Alan  NaN                 NaN                   NaN

name  cats.SUM(nums.age)  cats.SUM(nums.money)  cats.SUM(nums.weight)
Bob   0                   200                   140.5
LiSa  28                  240                   120.8
Mary  NaN                 NaN                   NaN
Alan  0                   300                   155.6

name  cats.DAY(born)  cats.MONTH(born)  cats.WEEKDAY(born)  cats.YEAR(born)
Bob   NaN             NaN               NaN                 NaN
LiSa  1               1                 0                   1990
Mary  NaN             NaN               NaN                 NaN
Alan  NaN             NaN               NaN                 NaN

 

Field descriptions

  1. <Feature: ID> : The "ID".
  2. <Feature: age> : The "age".
  3. <Feature: money> : The "money".
  4. <Feature: weight> : The "weight".
  5. <Feature: cats.sex> : The "sex" for the instance of "cats" associated with this instance of "nums".
  6. <Feature: cats.hobbey> : The "hobbey" for the instance of "cats" associated with this instance of "nums".
  7. <Feature: cats.COUNT(nums)> : The number of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  8. <Feature: cats.MAX(nums.age)> : The maximum of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  9. <Feature: cats.MAX(nums.money)> : The maximum of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  10. <Feature: cats.MAX(nums.weight)> : The maximum of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  11. <Feature: cats.MEAN(nums.age)> : The average of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  12. <Feature: cats.MEAN(nums.money)> : The average of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  13. <Feature: cats.MEAN(nums.weight)> : The average of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  14. <Feature: cats.MIN(nums.age)> : The minimum of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  15. <Feature: cats.MIN(nums.money)> : The minimum of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  16. <Feature: cats.MIN(nums.weight)> : The minimum of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  17. <Feature: cats.SKEW(nums.age)> : The skewness of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  18. <Feature: cats.SKEW(nums.money)> : The skewness of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  19. <Feature: cats.SKEW(nums.weight)> : The skewness of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  20. <Feature: cats.STD(nums.age)> : The standard deviation of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  21. <Feature: cats.STD(nums.money)> : The standard deviation of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  22. <Feature: cats.STD(nums.weight)> : The standard deviation of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  23. <Feature: cats.SUM(nums.age)> : The sum of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  24. <Feature: cats.SUM(nums.money)> : The sum of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  25. <Feature: cats.SUM(nums.weight)> : The sum of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
  26. <Feature: cats.DAY(born)> : The day of the month of the "born" for the instance of "cats" associated with this instance of "nums".
  27. <Feature: cats.MONTH(born)> : The month of the "born" for the instance of "cats" associated with this instance of "nums".
  28. <Feature: cats.WEEKDAY(born)> : The day of the week of the "born" for the instance of "cats" associated with this instance of "nums".
  29. <Feature: cats.YEAR(born)> : The year of the "born" for the instance of "cats" associated with this instance of "nums".
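Every feature prefixed with cats. is a value looked up from the parent row that a nums row points to through ID. A minimal pandas sketch of that lookup follows; it is an emulation for explanation, not the Featuretools API, and the small cats_features table is a hand-trimmed stand-in for the parent-level feature matrix.

```python
import numpy as np
import pandas as pd

# Child table (the "nums" target), keyed by name with ID as the foreign key.
nums_df = pd.DataFrame({
    "name": ["Bob", "LiSa", "Mary", "Alan"],
    "ID": [1, 2, 3, 4],
    "money": [200.0, 240.0, 290.0, 300.0],
})
# Parent-level features (ID 3 is absent, as in the printed cats_df).
cats_features = pd.DataFrame({
    "ID": [4, 1, 2],
    "hobbey": [np.nan, "打篮球", "打羽毛球"],
    "MAX(nums.money)": [300.0, 200.0, 240.0],
})

# Look up the parent's values for each child row and prefix them
# with the parent table name.
parent = cats_features.set_index("ID").add_prefix("cats.")
feature_matrix_nums = nums_df.join(parent, on="ID").set_index("name")
print(feature_matrix_nums)
```

Mary (ID 3) has no matching parent row, so all of her cats. columns are NaN, which matches the empty cells in her row of feature_matrix_nums.csv.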

