Data Mining - Used Car Price Prediction, Task02: Data Analysis

2022-10-19

Task02: Data Analysis

1. Exploratory Data Analysis (EDA)

Reading the files

import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
data_train = pd.read_csv('used_car_train_20200313.csv',sep=' ')
data_testA = pd.read_csv('used_car_testA_20200313.csv',sep=' ')
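
The sep=' ' argument matters because the competition files are space-separated rather than comma-separated. The following is a minimal sketch of the difference, using a two-row in-memory sample instead of the real CSV. The column subset is chosen only for illustration, and the values are taken from the first rows shown later in this post.

```python
import io
import pandas as pd

# Two-row, space-separated sample that mimics the layout of the training file.
# Only four of the 31 columns are included, purely for illustration.
sample_text = (
    "SaleID name regDate model\n"
    "0 736 20040402 30.0\n"
    "1 2262 20030301 40.0\n"
)

# With sep=' ' every field lands in its own column.
df = pd.read_csv(io.StringIO(sample_text), sep=' ')

# With the default comma separator the whole line becomes a single column.
df_wrong = pd.read_csv(io.StringIO(sample_text))

print(df.shape)        # (2, 4)
print(df_wrong.shape)  # (2, 1)
```

If shape reports a single column after loading, the separator is the first thing to check.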

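2. Overview of the Data

Theory

1. [Data overview]

data.describe() - reports, for each feature, the number of non-NaN values, the mean, standard deviation, minimum, maximum and quartiles. The result is returned as a DataFrame. By default only numeric columns are included.

data.info() - reports the dtype and the number of non-null values for each feature.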
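
As a small illustration of the two calls, here is a sketch on a five-row toy frame. The power values are the ones from the first five training rows shown later in this post. The NaN in bodyType is inserted by hand (an assumption for the example) so that the non-NaN count differs between columns.

```python
import numpy as np
import pandas as pd

# Toy frame: `power` comes from the first five training rows,
# the NaN in `bodyType` is added artificially for the example.
toy = pd.DataFrame({
    "power": [60, 0, 163, 193, 68],
    "bodyType": [1.0, 2.0, np.nan, 0.0, 1.0],
})

summary = toy.describe()      # a DataFrame, one column per numeric feature
print(summary.index.tolist()) # names of the statistics
print(summary.loc["count"])   # non-NaN count per column

toy.info()                    # prints dtypes and non-null counts
```

Comparing the count row against the number of rows is a quick way to spot columns with missing values before calling isnull().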

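2. [Detecting missing values and anomalies]

data.isnull() - checks every cell for NaN and returns a table of the same shape holding True or False for each position.

data.isnull().sum() - counts the NaN values in each column (each feature) and returns the number of missing values per feature.

data[feature].value_counts() - counts how often each distinct value of feature occurs.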
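
The three calls can be tried on a toy frame first. This is a sketch with made-up values (the column names match the dataset, the contents do not). It also shows that value_counts() leaves NaN out unless dropna=False is passed.

```python
import numpy as np
import pandas as pd

# Made-up values, used only to demonstrate the calls.
toy = pd.DataFrame({
    "gearbox": [0.0, 0.0, np.nan, 1.0, 0.0],
    "fuelType": [0.0, np.nan, np.nan, 0.0, 1.0],
})

mask = toy.isnull()                       # True where a cell is NaN
missing_per_column = toy.isnull().sum()   # NaN count per feature

gearbox_counts = toy["gearbox"].value_counts()                  # NaN excluded
gearbox_counts_all = toy["gearbox"].value_counts(dropna=False)  # NaN included

print(missing_per_column)
print(gearbox_counts)
```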

Practice

Inspecting the data

data_train.head()
SaleID    name    regDate    model    brand    bodyType    fuelType    gearbox    power    kilometer    ...    v_5    v_6    v_7    v_8    v_9    v_10    v_11    v_12    v_13    v_14
0    0    736    20040402    30.0    6    1.0    0.0    0.0    60    12.5    ...    0.235676    0.101988    0.129549    0.022816    0.097462    -2.881803    2.804097    -2.420821    0.795292    0.914762
1    1    2262    20030301    40.0    1    2.0    0.0    0.0    0    15.0    ...    0.264777    0.121004    0.135731    0.026597    0.020582    -4.900482    2.096338    -1.030483    -1.722674    0.245522
2    2    14874    20040403    115.0    15    1.0    0.0    0.0    163    12.5    ...    0.251410    0.114912    0.165147    0.062173    0.027075    -4.846749    1.803559    1.565330    -0.832687    -0.229963
3    3    71865    19960908    109.0    10    0.0    0.0    1.0    193    15.0    ...    0.274293    0.110300    0.121964    0.033395    0.000000    -4.509599    1.285940    -0.501868    -2.438353    -0.478699
4    4    111080    20120103    110.0    5    1.0    0.0    0.0    68    5.0    ...    0.228036    0.073205    0.091880    0.078819    0.121534    -1.896240    0.910783    0.931110    2.834518    1.923482
5 rows × 31 columns

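Feature meanings

name - car code

regDate - registration date of the car

model - model code

brand - brand

bodyType - body type

fuelType - fuel type

gearbox - gearbox

power - engine power

kilometer - kilometers driven

notRepairedDamage - the car has damage that has not yet been repaired

regionCode - code of the region where the car can be viewed

seller - seller

offerType - offer type

creatDate - date the ad was posted

price - car price

v_0, v_1, v_2, v_3, v_4, v_5, v_6, v_7, v_8, v_9, v_10, v_11, v_12, v_13, v_14 - embedding vectors derived from a large amount of information such as reviews and tags of the car [hand-constructed anonymous features]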

Detailed statistics

data_train.describe()
SaleID    name    regDate    model    brand    bodyType    fuelType    gearbox    power    kilometer    ...    v_5    v_6    v_7    v_8    v_9    v_10    v_11    v_12    v_13    v_14
count    150000.000000    150000.000000    1.500000e+05    149999.000000    150000.000000    145494.000000    141320.000000    144019.000000    150000.000000    150000.000000    ...    150000.000000    150000.000000    150000.000000    150000.000000    150000.000000    150000.000000    150000.000000    150000.000000    150000.000000    150000.000000
mean    74999.500000    68349.172873    2.003417e+07    47.129021    8.052733    1.792369    0.375842    0.224943    119.316547    12.597160    ...    0.248204    0.044923    0.124692    0.058144    0.061996    -0.001000    0.009035    0.004813    0.000313    -0.000688
std    43301.414527    61103.875095    5.364988e+04    49.536040    7.864956    1.760640    0.548677    0.417546    177.168419    3.919576    ...    0.045804    0.051743    0.201410    0.029186    0.035692    3.772386    3.286071    2.517478    1.288988    1.038685
min    0.000000    0.000000    1.991000e+07    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.500000    ...    0.000000    0.000000    0.000000    0.000000    0.000000    -9.168192    -5.558207    -9.639552    -4.153899    -6.546556
25%    37499.750000    11156.000000    1.999091e+07    10.000000    1.000000    0.000000    0.000000    0.000000    75.000000    12.500000    ...    0.243615    0.000038    0.062474    0.035334    0.033930    -3.722303    -1.951543    -1.871846    -1.057789    -0.437034
50%    74999.500000    51638.000000    2.003091e+07    30.000000    6.000000    1.000000    0.000000    0.000000    110.000000    15.000000    ...    0.257798    0.000812    0.095866    0.057014    0.058484    1.624076    -0.358053    -0.130753    -0.036245    0.141246
75%    112499.250000    118841.250000    2.007111e+07    66.000000    13.000000    3.000000    1.000000    0.000000    150.000000    15.000000    ...    0.265297    0.102009    0.125243    0.079382    0.087491    2.844357    1.255022    1.776933    0.942813    0.680378
max    149999.000000    196812.000000    2.015121e+07    247.000000    39.000000    7.000000    6.000000    1.000000    19312.000000    15.000000    ...    0.291838    0.151420    1.404936    0.160791    0.222787    12.357011    18.819042    13.847792    11.147669    8.658418
8 rows × 30 columns
print(data_train.shape,data_testA.shape)
(150000, 31) (50000, 30)
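
One thing the describe() table hints at is that power has a maximum of 19312 while its 75th percentile is only 150. A common way to quantify such a gap is Tukey's box-plot rule, which is brought in here as an illustration and is not part of the original walkthrough. The sketch below applies it to the quartiles copied from the table; the helper name tukey_upper_fence is hypothetical.

```python
def tukey_upper_fence(q1: float, q3: float, k: float = 1.5) -> float:
    """Return Q3 + k * IQR, the upper fence of Tukey's box-plot rule."""
    return q3 + k * (q3 - q1)


# Quartiles and maximum of `power`, copied from the describe() output.
power_q1, power_q3, power_max = 75.0, 150.0, 19312.0

fence = tukey_upper_fence(power_q1, power_q3)
print(fence)              # 262.5
print(power_max > fence)  # True: the maximum lies far above the fence
```

A value above the fence is only a candidate outlier; whether it is an error or a genuinely powerful car still has to be judged from the data.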

Checking for missing values

print(data_train.isnull().sum())
print(data_testA.isnull().sum())
SaleID                  0
name                    0
regDate                 0
model                   1
brand                   0
bodyType             4506
fuelType             8680
gearbox              5981
power                   0
kilometer               0
notRepairedDamage       0
regionCode              0
seller                  0
offerType               0
creatDate               0
price                   0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64
SaleID                  0
name                    0
regDate                 0
model                   0
brand                   0
bodyType             1413
fuelType             2893
gearbox              1910
power                   0
kilometer               0
notRepairedDamage       0
regionCode              0
seller                  0
offerType               0
creatDate               0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64

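The counts show that in the training data model (1), bodyType (4506), fuelType (8680) and gearbox (5981) have missing values. In test set A, bodyType (1413), fuelType (2893) and gearbox (1910) have missing values.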
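
Raw counts are easier to compare across the two files once they are turned into percentages. A minimal sketch, using only the counts and row totals printed above (150000 training rows, 50000 test rows):

```python
import pandas as pd

n_train, n_test = 150000, 50000

# Missing-value counts copied from the isnull().sum() output.
train_missing = pd.Series({"model": 1, "bodyType": 4506, "fuelType": 8680, "gearbox": 5981})
test_missing = pd.Series({"bodyType": 1413, "fuelType": 2893, "gearbox": 1910})

# Share of missing values in percent, largest first.
train_ratio = (train_missing / n_train * 100).round(2).sort_values(ascending=False)
test_ratio = (test_missing / n_test * 100).round(2).sort_values(ascending=False)

print(train_ratio)
print(test_ratio)
```

On the real frames the same figures can be obtained directly with data_train.isnull().mean() * 100.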

Checking for anomalies

data_train.info()

The output is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             150000 non-null  int64  
 1   name               150000 non-null  int64  
 2   regDate            150000 non-null  int64  
 3   model              149999 non-null  float64
 4   brand              150000 non-null  int64  
 5   bodyType           145494 non-null  float64
 6   fuelType           141320 non-null  float64
 7   gearbox            144019 non-null  float64
 8   power              150000 non-null  int64  
 9   kilometer          150000 non-null  float64
 10  notRepairedDamage  150000 non-null  object 
 11  regionCode         150000 non-null  int64  
 12  seller             150000 non-null  int64  
 13  offerType          150000 non-null  int64  
 14  creatDate          150000 non-null  int64  
 15  price              150000 non-null  int64  
 16  v_0                150000 non-null  float64
 17  v_1                150000 non-null  float64
 18  v_2                150000 non-null  float64
 19  v_3                150000 non-null  float64
 20  v_4                150000 non-null  float64
 21  v_5                150000 non-null  float64
 22  v_6                150000 non-null  float64
 23  v_7                150000 non-null  float64
 24  v_8                150000 non-null  float64
 25  v_9                150000 non-null  float64
 26  v_10               150000 non-null  float64
 27  v_11               150000 non-null  float64
 28  v_12               150000 non-null  float64
 29  v_13               150000 non-null  float64
 30  v_14               150000 non-null  float64
dtypes: float64(20), int64(10), object(1)
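
In this output notRepairedDamage is the only column with dtype object, which usually means that at least one entry is not a plain number. The sketch below shows one way to find such entries. The sample values, including the '-' placeholder, are an assumption made for illustration and are not read from the real file.

```python
import pandas as pd

# Hypothetical sample: the '-' placeholder is an assumption for this example.
col = pd.Series(["0.0", "-", "1.0", "0.0", "-"], name="notRepairedDamage")

print(col.value_counts())

# Entries that cannot be parsed as numbers become NaN with errors="coerce".
as_number = pd.to_numeric(col, errors="coerce")
non_numeric = col[as_number.isnull()].unique().tolist()

print(non_numeric)              # ['-']
print(as_number.isnull().sum()) # 2
```

Running value_counts() on the real column is the quickest way to see which values are actually present before deciding how to treat them.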

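Code examples

Load the data science and visualization libraries

Load the data

Get an overview of the data

Detect missing values and anomalies

Understand the distribution of the target value

Split the features into categorical and numeric features, and look at the unique-value distribution of the categorical features

Analyze the numeric features

Analyze the categorical features

Generate a data report with pandas_profiling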
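
The step of splitting features into categorical and numeric ones can be sketched as follows. The values are the first five training rows shown earlier; which columns count as categorical is an assumption made here for illustration, since the post does not list them.

```python
import pandas as pd

# First five training rows, restricted to five columns.
toy = pd.DataFrame({
    "brand": [6, 1, 15, 10, 5],
    "bodyType": [1.0, 2.0, 1.0, 0.0, 1.0],
    "gearbox": [0.0, 0.0, 0.0, 1.0, 0.0],
    "power": [60, 0, 163, 193, 68],
    "kilometer": [12.5, 15.0, 12.5, 15.0, 5.0],
})

# Assumed split: coded columns are treated as categorical, the rest as numeric.
categorical_features = ["brand", "bodyType", "gearbox"]
numeric_features = [c for c in toy.columns if c not in categorical_features]

# Number of distinct values per categorical feature.
unique_counts = toy[categorical_features].nunique()
print(unique_counts)

# Value distribution of each categorical feature.
for feature in categorical_features:
    print(toy[feature].value_counts())
```

The split is written out by hand because the coded categorical columns are stored as numbers, so dtype alone does not separate them from true numeric features.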
