good data decides good analyse
数据清洗,是数据分析中不可缺少的一个环节,其处理的好坏在很大程度上影响着数据分析的结果。而且以前听老师说过数据清洗占整个的数据分析的一半时间以上(汗。。。数据清洗也是一个大学问啊)。
查看空缺值
首先读入文件:
import pandas as pd
test = pd.read_excel('C:/Users/luopan/Desktop/test.xlsx',sheetname='Sheet1')
test
![3629157-cb9f5dd5f45cb6d3.jpg](http://upload-images.jianshu.io/upload_images/3629157-cb9f5dd5f45cb6d3.jpg?x-oss-process=image/resize,w_1400/format,webp)
我们可以看出有一个nan,李四的数学成绩也是不符合常理的。我们通过isnull函数查看数据的空缺值:
test.isnull()
![3629157-f9b7b5d4f715cade.jpg](http://upload-images.jianshu.io/upload_images/3629157-f9b7b5d4f715cade.jpg?x-oss-process=image/resize,w_1400/format,webp)
通过下面命令计算每列数据的空缺值:
test.isnull().sum()
![3629157-2fb642d68a699abf.jpg](http://upload-images.jianshu.io/upload_images/3629157-2fb642d68a699abf.jpg?x-oss-process=image/resize,w_1400/format,webp)
对于不符合常理的数据也可进行设置为空缺值:
test1 = pd.read_excel('C:/Users/luopan/Desktop/test.xlsx',sheetname='Sheet1',na_values=['750'])
test1
![3629157-d74418fae733e911.jpg](http://upload-images.jianshu.io/upload_images/3629157-d74418fae733e911.jpg?x-oss-process=image/resize,w_1400/format,webp)
过滤缺失值
test1.数学[test1.数学.notnull()]
![img_ec683d15e728a422d752b8a66f3d9d00.png](https://yqfile.alicdn.com/img_ec683d15e728a422d752b8a66f3d9d00.png?x-oss-process=image/resize,w_1400/format,webp)
去掉缺失值
test1.dropna()
![3629157-f1e88381fa822923.jpg](http://upload-images.jianshu.io/upload_images/3629157-f1e88381fa822923.jpg?x-oss-process=image/resize,w_1400/format,webp)
test1.dropna(how='all')
![img_ef987fb8aa1cffcd2e20c58de63ea8fd.png](https://yqfile.alicdn.com/img_ef987fb8aa1cffcd2e20c58de63ea8fd.png?x-oss-process=image/resize,w_1400/format,webp)
加入all参数的意思:行全为nan才会drop掉。
填充缺失值
前一个值填充:
test1.fillna(method='ffill')
![3629157-16041e03f71db81e.jpg](http://upload-images.jianshu.io/upload_images/3629157-16041e03f71db81e.jpg?x-oss-process=image/resize,w_1400/format,webp)
后一个值填充:
test1.fillna(method='bfill')
用列的均值填充:
test1.fillna(test1.mean())
![3629157-2ef7a71deab268ef.jpg](http://upload-images.jianshu.io/upload_images/3629157-2ef7a71deab268ef.jpg?x-oss-process=image/resize,w_1400/format,webp)
等距填充:
test1.interpolate()