# [雪峰磁针石博客]数据分析工具pandas快速入门教程5-处理缺失数据

+关注继续查看

## 介绍

• 什么是缺失值
• 如何创建缺失值
• 如何重新编码并使用缺失值进行计算

### 什么是缺失值

NaN不等于0或空字符串''。


In [1]: from numpy import NaN, NAN, nan

In [2]: print(NaN == True, NaN == False, NaN == 0, NaN == '', sep='|')
False|False|False|False

In [3]: print(NaN == NaN, NaN == nan, NaN == NAN, nan == NAN, sep='|')
False|False|False|False

In [4]: import pandas as pd

In [5]: print(pd.isnull(NaN), pd.isnull(nan), pd.isnull(NAN), sep='|')
True|True|True

In [6]: print(pd.notnull(NaN), pd.notnull(99), pd.notnull("https://china-testing.github.io"), sep='|')
False|True|True


### 缺失值的来源

• 加载数据

5-1.py


import pandas as pd

visited_file = 'data/survey_visited.csv'
print(pd.read_csv(visited_file, na_values=[''], keep_default_na=False))


$python3 5-1.py ident site dated 0 619 DR-1 1927-02-08 1 622 DR-1 1927-02-10 2 734 DR-3 1939-01-07 3 735 DR-3 1930-01-12 4 751 DR-3 1930-02-26 5 752 DR-3 NaN 6 837 MSK-4 1932-01-14 7 844 DR-1 1932-03-22 ident site dated 0 619 DR-1 1927-02-08 1 622 DR-1 1927-02-10 2 734 DR-3 1939-01-07 3 735 DR-3 1930-01-12 4 751 DR-3 1930-02-26 5 752 DR-3 6 837 MSK-4 1932-01-14 7 844 DR-1 1932-03-22 ident site dated 0 619 DR-1 1927-02-08 1 622 DR-1 1927-02-10 2 734 DR-3 1939-01-07 3 735 DR-3 1930-01-12 4 751 DR-3 1930-02-26 5 752 DR-3 NaN 6 837 MSK-4 1932-01-14 7 844 DR-1 1932-03-22  • 合并数据  import pandas as pd visited = pd.read_csv('data/survey_visited.csv') survey = pd.read_csv('data/survey_survey.csv') print(visited) print(survey) vs = visited.merge(survey, left_on='ident', right_on='taken') print(vs) 执行结果 $ python3 5-2.py
ident   site       dated
0    619   DR-1  1927-02-08
1    622   DR-1  1927-02-10
2    734   DR-3  1939-01-07
3    735   DR-3  1930-01-12
4    751   DR-3  1930-02-26
5    752   DR-3         NaN
6    837  MSK-4  1932-01-14
7    844   DR-1  1932-03-22
1     619   dyer   sal     0.13
3     622   dyer   sal     0.09
5     734   lake   sal     0.05
6     734     pb  temp   -21.50
8     735    NaN   sal     0.06
9     735    NaN  temp   -26.00
11    751     pb  temp   -18.50
12    751   lake   sal     0.10
14    752   lake   sal     0.09
15    752   lake  temp   -16.00
16    752    roe   sal    41.60
18    837   lake   sal     0.21
19    837    roe   sal    22.50
ident   site       dated  taken person quant  reading
0     619   DR-1  1927-02-08    619   dyer   rad     9.82
1     619   DR-1  1927-02-08    619   dyer   sal     0.13
2     622   DR-1  1927-02-10    622   dyer   rad     7.80
3     622   DR-1  1927-02-10    622   dyer   sal     0.09
4     734   DR-3  1939-01-07    734     pb   rad     8.41
5     734   DR-3  1939-01-07    734   lake   sal     0.05
6     734   DR-3  1939-01-07    734     pb  temp   -21.50
7     735   DR-3  1930-01-12    735     pb   rad     7.22
8     735   DR-3  1930-01-12    735    NaN   sal     0.06
9     735   DR-3  1930-01-12    735    NaN  temp   -26.00
10    751   DR-3  1930-02-26    751     pb   rad     4.35
11    751   DR-3  1930-02-26    751     pb  temp   -18.50
12    751   DR-3  1930-02-26    751   lake   sal     0.10
13    752   DR-3         NaN    752   lake   rad     2.19
14    752   DR-3         NaN    752   lake   sal     0.09
15    752   DR-3         NaN    752   lake  temp   -16.00
16    752   DR-3         NaN    752    roe   sal    41.60
17    837  MSK-4  1932-01-14    837   lake   rad     1.46
18    837  MSK-4  1932-01-14    837   lake   sal     0.21
19    837  MSK-4  1932-01-14    837    roe   sal    22.50
20    844   DR-1  1932-03-22    844    roe   rad    11.25

• 用户输入

import pandas as pd
from numpy import NaN, NAN, nan

num_legs = pd.Series({'goat': 4, 'amoeba': nan})
print(num_legs)
scientists = pd.DataFrame({'Name': ['Rosaline Franklin', 'William Gosset'],
'Occupation': ['Chemist', 'Statistician'],
'Born': ['1920-07-25', '1876-06-13'],
'Died': ['1958-04-16', '1937-10-16'],
'missing': [NaN, nan]})
print(scientists)
scientists['missing'] = nan
print(scientists)


\$ python3 5-3.py
amoeba    NaN
goat      4.0
dtype: float64
Born        Died               Name    Occupation  missing
0  1920-07-25  1958-04-16  Rosaline Franklin       Chemist      NaN
1  1876-06-13  1937-10-16     William Gosset  Statistician      NaN
Born        Died               Name    Occupation  missing
0  1920-07-25  1958-04-16  Rosaline Franklin       Chemist      NaN
1  1876-06-13  1937-10-16     William Gosset  Statistician      NaN

• 重新索引

5-4.py


import pandas as pd
from numpy import NaN, NAN, nan

life_exp = gapminder.groupby(['year'])['lifeExp'].mean()
print(life_exp)
print(life_exp.reindex(range(2000, 2010)))


year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64
year
2000          NaN
2001          NaN
2002    65.694923
2003          NaN
2004          NaN
2005          NaN
2006          NaN
2007    67.007423
2008          NaN
2009          NaN
Name: lifeExp, dtype: float64


### 处理缺失数据

• 统计缺失数据

5-5.py


import pandas as pd
from numpy import NaN, NAN, nan
import numpy as np

print(ebola.count())
num_rows = ebola.shape[0]
print("num_rows")
print(num_rows)
num_missing = num_rows - ebola.count()
print("num_missing:")
print(num_missing)
print(np.count_nonzero(ebola.isnull()))
print(np.count_nonzero(ebola['Cases_Guinea'].isnull()))
print(ebola.Cases_Guinea.value_counts(dropna=False).head())


Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone  \
0    1/5/2015  289        2776.0            NaN            10030.0
1    1/4/2015  288        2775.0            NaN             9780.0
2    1/3/2015  287        2769.0         8166.0             9722.0
3    1/2/2015  286           NaN         8157.0                NaN
4  12/31/2014  284        2730.0         8115.0             9633.0

Cases_Nigeria  Cases_Senegal  Cases_UnitedStates  Cases_Spain  Cases_Mali  \
0            NaN            NaN                 NaN          NaN         NaN
1            NaN            NaN                 NaN          NaN         NaN
2            NaN            NaN                 NaN          NaN         NaN
3            NaN            NaN                 NaN          NaN         NaN
4            NaN            NaN                 NaN          NaN         NaN

Deaths_Guinea  Deaths_Liberia  Deaths_SierraLeone  Deaths_Nigeria  \
0         1786.0             NaN              2977.0             NaN
1         1781.0             NaN              2943.0             NaN
2         1767.0          3496.0              2915.0             NaN
3            NaN          3496.0                 NaN             NaN
4         1739.0          3471.0              2827.0             NaN

Deaths_Senegal  Deaths_UnitedStates  Deaths_Spain  Deaths_Mali
0             NaN                  NaN           NaN          NaN
1             NaN                  NaN           NaN          NaN
2             NaN                  NaN           NaN          NaN
3             NaN                  NaN           NaN          NaN
4             NaN                  NaN           NaN          NaN
Date                   122
Day                    122
Cases_Guinea            93
Cases_Liberia           83
Cases_SierraLeone       87
Cases_Nigeria           38
Cases_Senegal           25
Cases_UnitedStates      18
Cases_Spain             16
Cases_Mali              12
Deaths_Guinea           92
Deaths_Liberia          81
Deaths_SierraLeone      87
Deaths_Nigeria          38
Deaths_Senegal          22
Deaths_UnitedStates     18
Deaths_Spain            16
Deaths_Mali             12
dtype: int64
num_rows
122
num_missing:
Date                     0
Day                      0
Cases_Guinea            29
Cases_Liberia           39
Cases_SierraLeone       35
Cases_Nigeria           84
Cases_Senegal           97
Cases_UnitedStates     104
Cases_Spain            106
Cases_Mali             110
Deaths_Guinea           30
Deaths_Liberia          41
Deaths_SierraLeone      35
Deaths_Nigeria          84
Deaths_Senegal         100
Deaths_UnitedStates    104
Deaths_Spain           106
Deaths_Mali            110
dtype: int64
1214
29
NaN       29
86.0      3
495.0     2
112.0     2
390.0     2
Name: Cases_Guinea, dtype: int64

• 处理缺失数据

5-6.py


import pandas as pd
from numpy import NaN, NAN, nan
import numpy as np

print(ebola.iloc[0:10, 0:5])
print(ebola.fillna(0).iloc[0:10, 0:5])
# 前向填充
print(ebola.fillna(method='ffill').iloc[0:10, 0:5])
# 后向填充
print(ebola.fillna(method='bfill').iloc[0:10, 0:5])

print(ebola.interpolate().iloc[0:10, 0:5])

print(ebola.shape)
ebola_dropna = ebola.dropna()
print(ebola_dropna.shape)
print(ebola_dropna)

ebola['Cases_multiple'] = ebola['Cases_Guinea'] + ebola['Cases_Liberia'] + \
ebola['Cases_SierraLeone']

ebola_subset = ebola.loc[:, ['Cases_Guinea', 'Cases_Liberia',
'Cases_SierraLeone', 'Cases_multiple']]
print(ebola.Cases_Guinea.sum(skipna = True))
print(ebola.Cases_Guinea.sum(skipna = False))



Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone
0    1/5/2015  289        2776.0            NaN            10030.0
1    1/4/2015  288        2775.0            NaN             9780.0
2    1/3/2015  287        2769.0         8166.0             9722.0
3    1/2/2015  286           NaN         8157.0                NaN
4  12/31/2014  284        2730.0         8115.0             9633.0
5  12/28/2014  281        2706.0         8018.0             9446.0
6  12/27/2014  280        2695.0            NaN             9409.0
7  12/24/2014  277        2630.0         7977.0             9203.0
8  12/21/2014  273        2597.0            NaN             9004.0
9  12/20/2014  272        2571.0         7862.0             8939.0
Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone
0    1/5/2015  289        2776.0            0.0            10030.0
1    1/4/2015  288        2775.0            0.0             9780.0
2    1/3/2015  287        2769.0         8166.0             9722.0
3    1/2/2015  286           0.0         8157.0                0.0
4  12/31/2014  284        2730.0         8115.0             9633.0
5  12/28/2014  281        2706.0         8018.0             9446.0
6  12/27/2014  280        2695.0            0.0             9409.0
7  12/24/2014  277        2630.0         7977.0             9203.0
8  12/21/2014  273        2597.0            0.0             9004.0
9  12/20/2014  272        2571.0         7862.0             8939.0
Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone
0    1/5/2015  289        2776.0            NaN            10030.0
1    1/4/2015  288        2775.0            NaN             9780.0
2    1/3/2015  287        2769.0         8166.0             9722.0
3    1/2/2015  286        2769.0         8157.0             9722.0
4  12/31/2014  284        2730.0         8115.0             9633.0
5  12/28/2014  281        2706.0         8018.0             9446.0
6  12/27/2014  280        2695.0         8018.0             9409.0
7  12/24/2014  277        2630.0         7977.0             9203.0
8  12/21/2014  273        2597.0         7977.0             9004.0
9  12/20/2014  272        2571.0         7862.0             8939.0
Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone
0    1/5/2015  289        2776.0         8166.0            10030.0
1    1/4/2015  288        2775.0         8166.0             9780.0
2    1/3/2015  287        2769.0         8166.0             9722.0
3    1/2/2015  286        2730.0         8157.0             9633.0
4  12/31/2014  284        2730.0         8115.0             9633.0
5  12/28/2014  281        2706.0         8018.0             9446.0
6  12/27/2014  280        2695.0         7977.0             9409.0
7  12/24/2014  277        2630.0         7977.0             9203.0
8  12/21/2014  273        2597.0         7862.0             9004.0
9  12/20/2014  272        2571.0         7862.0             8939.0
Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone
0    1/5/2015  289        2776.0            NaN            10030.0
1    1/4/2015  288        2775.0            NaN             9780.0
2    1/3/2015  287        2769.0         8166.0             9722.0
3    1/2/2015  286        2749.5         8157.0             9677.5
4  12/31/2014  284        2730.0         8115.0             9633.0
5  12/28/2014  281        2706.0         8018.0             9446.0
6  12/27/2014  280        2695.0         7997.5             9409.0
7  12/24/2014  277        2630.0         7977.0             9203.0
8  12/21/2014  273        2597.0         7919.5             9004.0
9  12/20/2014  272        2571.0         7862.0             8939.0
(122, 18)
(1, 18)
Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone  \
19  11/18/2014  241        2047.0         7082.0             6190.0

Cases_Nigeria  Cases_Senegal  Cases_UnitedStates  Cases_Spain  Cases_Mali  \
19           20.0            1.0                 4.0          1.0         6.0

Deaths_Guinea  Deaths_Liberia  Deaths_SierraLeone  Deaths_Nigeria  \
19         1214.0          2963.0              1267.0             8.0

Deaths_Senegal  Deaths_UnitedStates  Deaths_Spain  Deaths_Mali
19             0.0                  1.0           0.0          6.0
Cases_Guinea  Cases_Liberia  Cases_SierraLeone  Cases_multiple
0        2776.0            NaN            10030.0             NaN
1        2775.0            NaN             9780.0             NaN
2        2769.0         8166.0             9722.0         20657.0
3           NaN         8157.0                NaN             NaN
4        2730.0         8115.0             9633.0         20478.0
5        2706.0         8018.0             9446.0         20170.0
6        2695.0            NaN             9409.0             NaN
7        2630.0         7977.0             9203.0         19810.0
8        2597.0            NaN             9004.0             NaN
9        2571.0         7862.0             8939.0         19372.0
84729.0
nan


## 参考资料

Python系列直播——深入Python与日志服务，玩转大规模数据分析处理实战
Python系列直播——深入Python与日志服务，玩转大规模数据分析处理实战
4843 0
Python爬虫入门教程 31-100 36氪(36kr)数据抓取 scrapy
1. 36氪(36kr)数据----写在前面 今天抓取一个新闻媒体，36kr的文章内容，也是为后面的数据分析做相应的准备的，预计在12月底，爬虫大概写到50篇案例的时刻，将会迎来一个新的内容，系统的数据分析博文，记得关注哦~ 36kr 让一部分人先看到未来，而你今天要做的事情确实要抓取它的过去。
9321 0

10594 0
Virgin Hyperloop One如何使用Koalas将处理时间从几小时降到几分钟--无缝的将pandas切换成Apache Spark指南
Koalas项目基于Apache Spark实现了pandas DataFrame API，从而使数据科学家能够更有效率的处理大数据。一份代码可以同时在pandas(用于测试，小数据集)和Spark(用于分布式datasets)两个平台上运行。
1145 0

761 0

23 0

2132 0
+关注
python人工智能命理

170

1

《2021云上架构与运维峰会演讲合集》

《零基础CSS入门教程》

《零基础HTML入门教程》