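Study notes for the Developer Academy course [Essential Foundations for Artificial Intelligence: Probability Theory and Mathematical Statistics: Case Study: Missing Value Imputation]. The notes follow the course closely so that readers can pick up the material quickly.
Course URL: https://developer.aliyun.com/learning/course/545/detail/7439
Case study: missing value imputation
Contents:
1. Handling missing values
1. Handling missing values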
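Method 1: drop the missing values.
Method 2: fill the missing values with a summary statistic of the data, such as the median, mean, or mode.
Method 3: build a regression model, treating the other variables as features and the variable with missing entries as the prediction target. Rows where the value is present form the training set; rows where it is missing are the ones to predict.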
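The three approaches can be sketched on a small made-up table as follows. This is a minimal illustration, assuming pandas and NumPy are available. The DataFrame `toy` and its columns `x` and `y` are hypothetical, and a straight-line fit with `np.polyfit` stands in for whatever regression model one would actually build.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: y follows 2*x, with two missing entries.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.0, 4.0, np.nan, 8.0, np.nan, 12.0],
})

# Method 1: drop rows that contain a missing value.
dropped = toy.dropna()

# Method 2: fill with a summary statistic (here the median of the observed y).
median_filled = toy["y"].fillna(toy["y"].median())

# Method 3: fit a regression on the complete rows, then predict the missing ones.
known = toy[toy["y"].notna()]
missing = toy[toy["y"].isna()]
slope, intercept = np.polyfit(known["x"], known["y"], deg=1)
regression_filled = toy["y"].copy()
regression_filled.loc[missing.index] = slope * missing["x"] + intercept
```

On data like this, where `y` depends strongly on `x`, the regression fill tracks the trend, while the median fill puts the same value into every gap.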
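First set the plotting style with sns.set(style="ticks"); when you run into an unfamiliar option, look it up on the official website. The library below is used to display missing values: black represents present values and white represents missing values.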
missingno (PyPI v0.4.0, Python 3.4+, status: stable, license: MIT)
Messy datasets? Missing values? missingno provides a small toolset of flexible and easy-to-use missing data visualizations and utilities that allows you to get a quick visual summary of the completeness (or lack thereof) of your dataset. Just pip install missingno to get started.
Quickstart
This quickstart uses a sample of the NYPD Motor Vehicle Collisions Dataset. To get the data yourself, run the following on your command line:
$ pip install quilt
$ quilt install ResidentMario/missingno_data
Then to load the data into memory:
>>> import numpy as np
>>> from quilt.data.ResidentMario import missingno_data
>>> collisions = missingno_data.nyc_injurious_collisions()
>>> collisions = collisions.replace("nan", np.nan)
The rest of this walkthrough will draw from this collisions dataset. I additionally define nullity to mean whether a particular variable is filled in or not.
Matrix
The msno.matrix nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.
>>> import missingno as msno
>>> %matplotlib inline
>>> msno.matrix(collisions.sample(250))
At a glance, date, time, the distribution of injuries, and the contribution factor of the first vehicle appear to be completely populated, while geographic information seems mostly complete, but spottier.
The sparkline at right summarizes the general shape of the data completeness and points out the maximum and minimum rows.
This visualization will comfortably accommodate up to 50 labelled variables. Past that range labels begin to overlap or become unreadable, and by default large displays omit them.
If you are working with time-series data, you can specify a periodicity using the freq keyword parameter:
>>> null_pattern = (np.random.random(1000).reshape((50, 20)) > 0.5).astype(bool)
>>> null_pattern = pd.DataFrame(null_pattern).replace({False: None})
>>> msno.matrix(null_pattern.set_index(pd.period_range('1/1/2011', '2/1/2015', freq='M')), freq='BQ')
Handling missing values
In [39]: # missing values?
sns.set(style="ticks")
msno.matrix(data)
Out[39]: <matplotlib.axes._subplots.AxesSubplot at 0x218b0055f60>
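For columns with only a few missing values, simply drop the rows where they are missing.
The normalized-losses column has a large share of missing values.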
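How severe the missingness is in each column can be quantified by counting the missing entries. A minimal sketch, assuming pandas; the frame `toy` below is made up for illustration and is not the course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the course's `data` frame.
toy = pd.DataFrame({
    "normalized-losses": [np.nan, 164.0, np.nan, 158.0, np.nan],
    "price": [13495.0, 16500.0, np.nan, 13950.0, 17450.0],
    "symboling": [3, 1, 2, 2, 2],
})

# Number and share of missing values in each column, worst first.
missing_count = toy.isnull().sum().sort_values(ascending=False)
missing_share = toy.isnull().mean().sort_values(ascending=False)
```

Columns near the top of `missing_share` are candidates for imputation; columns with only a handful of gaps can usually be handled by dropping rows.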
In [40]: # missing values in normalized-losses
data[pd.isnull(data['normalized-losses'])].head()
Out[40]:
In [41]: sns.set(style="ticks")
plt.figure(figsize=(12, 5))
c = '#366DE8'
# ECDF
plt.subplot(121)
cdf = ECDF(data['normalized-losses'])
plt.plot(cdf.x, cdf.y, label="statmodels", color=c);
plt.xlabel('normalized losses'); plt.ylabel('ECDF');
# overall distribution
plt.subplot(122)
plt.hist(data['normalized-losses'].dropna(),
         bins=int(np.sqrt(len(data['normalized-losses']))),
         color=c);
We can see that 80% of the normalized losses are below 200, and the vast majority are below 125.
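A statement that a given share of the values lies below a threshold is read off the empirical CDF, which at a value v is simply the fraction of observations that are at most v. A minimal NumPy sketch on made-up numbers (not the course data); `ecdf_at` is a hypothetical helper name.

```python
import numpy as np

def ecdf_at(values, v):
    """Fraction of the non-missing observations that are <= v."""
    values = np.asarray(values, dtype=float)
    observed = values[~np.isnan(values)]  # ignore missing entries
    return np.mean(observed <= v)

# Hypothetical sample with one missing value.
sample = [65.0, 90.0, 101.0, 118.0, 122.0, 150.0, 161.0, 197.0, 231.0, np.nan]

share_below_125 = ecdf_at(sample, 125)
share_below_200 = ecdf_at(sample, 200)
```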
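A basic idea is to fill with the median, but it is worth asking which factors this feature might depend on. It is probably related to the insurance risk rating, so filling within groups should be more accurate.
First, look at the summary statistics for each insurance rating:
In [42]: data.groupby('symboling')['normalized-losses'].describe()
Out[42]: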
In [43]: # replacing
data = data.dropna(subset=['price', 'bore', 'stroke', 'peak-rpm', 'horsepower', 'num-of-doors'])
data['normalized-losses'] = data.groupby('symboling')['normalized-losses'].transform(lambda x: x.fillna(x.mean()))
print('In total:', data.shape)
data.head()
In total: (193, 26)
Out[43]:
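To see what the group-wise fill does, here is a small sketch on made-up data, assuming pandas; the values are hypothetical. Each missing entry receives the mean of its own symboling group rather than the overall mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two symboling groups and one gap in each.
toy = pd.DataFrame({
    "symboling": [0, 0, 0, 3, 3, 3],
    "normalized-losses": [80.0, np.nan, 100.0, 180.0, 200.0, np.nan],
})

# Fill each gap with the mean of its own symboling group.
group_filled = toy.groupby("symboling")["normalized-losses"].transform(
    lambda x: x.fillna(x.mean())
)

# For comparison: a single global mean ignores the grouping.
global_filled = toy["normalized-losses"].fillna(toy["normalized-losses"].mean())
```

A group with no observed values at all would stay missing after this step, since its mean is itself undefined.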