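hands-on-data-analysis Unit 2 - Data Cleaning and Feature Processing
1. Inspecting and Handling Missing Values
First, import the required modules.
# Load the required libraries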
import numpy as np
import pandas as pd
1.1 Inspecting missing values
Next, inspect the missing values in the DataFrame df:
df.info()
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
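Raw counts are easier to compare across columns when expressed as proportions. The following is a minimal sketch, assuming a small made-up frame (`toy`) in place of the real data; the names `toy` and `missing_summary` are hypothetical and do not come from the original notes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count and share of missing values per column, most incomplete column first
missing_summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "ratio": toy.isnull().mean(),
}).sort_values("ratio", ascending=False)

print(missing_summary)
```

The mean of a boolean mask is the fraction of True entries, so `isnull().mean()` gives the missing ratio directly.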
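1.2 Handling missing values
After a numeric column is read in, its missing values are stored as NaN, which is a floating-point value, so it is best to test for NaN with isnull() or np.isnan() rather than by comparing against np.nan.
isnull() can be used to select the rows with missing values: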
df[df['Age'].isnull()]
df.tail(5)
np.isnan() can also select the rows with missing values, but only for numeric columns:
df[np.isnan(df['Age'])]
df.tail(5)
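However, np.nan cannot be meaningfully compared with any value using >, ==, != and the like: every comparison returns False except !=, which always returns True, even against np.nan itself: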
np.nan != np.nan
True
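The practical consequence of these comparison rules is that a mask built with `==` never finds missing entries. The sketch below illustrates this on made-up Series (`s` and `cabin` are hypothetical names), and also shows that `np.isnan` rejects object-dtype input while `isnull()` accepts it.

```python
import numpy as np
import pandas as pd

# Every ordering/equality comparison with NaN is False; only != is True
print(np.nan == np.nan)   # False
print(np.nan > 0)         # False
print(np.nan != np.nan)   # True

# A mask built with == therefore never matches the missing entry
s = pd.Series([22.0, np.nan, 26.0])
eq_mask = (s == np.nan)
na_mask = s.isnull()
print(eq_mask.sum(), na_mask.sum())

# np.isnan only accepts numeric input; isnull() also handles object columns
cabin = pd.Series(["C85", np.nan], dtype=object)
try:
    np.isnan(cabin)
    isnan_ok = True
except TypeError:
    isnan_ok = False
print(isnan_ok, cabin.isnull().tolist())
```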
df.dropna(inplace=True) drops every row that contains at least one NaN; inplace=True modifies the original DataFrame instead of returning a new one.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
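Because Cabin has only 204 non-null values out of 891, a plain dropna() can keep at most 204 rows of this data. The sketch below shows, on a made-up frame (`toy` is a hypothetical name), how the `subset` and `thresh` parameters of `dropna` narrow what counts as a row worth dropping.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Default: drop a row if any column is NaN
drop_any = toy.dropna()

# Only consider the Age column when deciding which rows to drop
drop_age = toy.dropna(subset=["Age"])

# Keep rows that have at least 2 non-missing values
drop_thresh = toy.dropna(thresh=2)

print(len(toy), len(drop_any), len(drop_age), len(drop_thresh))
```

None of these calls uses inplace=True, so `toy` itself is left unchanged and the variants can be compared side by side.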
df.fillna(0, inplace=True) fills every NaN with 0; inplace=True modifies the original DataFrame.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
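Filling everything with 0 is rarely meaningful for a column such as Age. As one possible alternative (an assumption, not the approach taken in these notes), `fillna` also accepts a dict of per-column fill values; the sketch uses the median for a numeric column and the most frequent value for a text column, on a made-up frame.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 38.0],
    "Embarked": ["S", np.nan, "S", "C"],
})

# One fill value per column: median for numbers, mode for categories
fill_values = {
    "Age": toy["Age"].median(),
    "Embarked": toy["Embarked"].mode()[0],
}

# Without inplace=True, fillna returns a new frame and leaves toy untouched
filled = toy.fillna(fill_values)
print(filled)
```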
2. Duplicate values
Duplicate rows can be found with df.duplicated():
df[df.duplicated()]
drop_duplicates() can be used to remove the duplicate rows:
df = df.drop_duplicates()
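By default both functions compare entire rows and treat the first occurrence as the original. The sketch below, on a made-up frame (`toy` is a hypothetical name), shows how the `subset` and `keep` parameters change what is flagged.

```python
import pandas as pd

# Toy data: rows 0 and 2 are identical; rows 0, 2 and 3 share a Name
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Ann"],
    "Ticket": ["A1", "B2", "A1", "A9"],
})

# Whole-row comparison: only the exact repeat is flagged
full_dup = toy.duplicated()

# Compare on Name only: later occurrences of the same name are flagged
name_dup = toy.duplicated(subset=["Name"])

# keep=False flags every member of a duplicate group, including the first
all_copies = toy.duplicated(subset=["Name"], keep=False)

# Keep the last occurrence per Name instead of the first
deduped = toy.drop_duplicates(subset=["Name"], keep="last")

print(full_dup.tolist(), name_dup.tolist(), all_copies.tolist())
print(deduped)
```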
3. Binning (discretization)
3.1. Equal-width binning
# Split the continuous variable Age into 5 equal-width age bands, labeled with the categories 1 to 5
df['AgeBand'] = pd.cut(df['Age'], 5,labels = [1,2,3,4,5])
3.2. Binning with custom edges
df['AgeBand'] = pd.cut(df['Age'],[0,5,15,30,50,80],labels = [1,2,3,4,5])
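3.3. Quantile binning
# Bin Age at the 10%, 30%, 50%, 70% and 90% quantiles; ages above the 90% quantile fall outside the last edge and become NaN
df['AgeBand'] = pd.qcut(df['Age'],[0,0.1,0.3,0.5,0.7,0.9],labels = [1,2,3,4,5])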
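The difference between the binning strategies is easiest to see by counting how many values land in each bin. The sketch below uses a made-up, skewed Series (`ages` is a hypothetical name): `pd.cut` with an integer gives equal-width bins with uneven counts, `pd.qcut` with an integer gives bins with equal counts, and a quantile list that stops at 0.9 leaves the top values unassigned.

```python
import pandas as pd

# Made-up, right-skewed ages for illustration
ages = pd.Series([1, 4, 9, 16, 25, 36, 49, 64, 70, 80], dtype=float)
labels = [1, 2, 3, 4, 5]

# Equal-width bins: same interval length, uneven counts
equal_width = pd.cut(ages, 5, labels=labels)

# Equal-frequency bins: edges at the 20/40/60/80% quantiles
equal_freq = pd.qcut(ages, 5, labels=labels)

# Quantile edges that stop at 0.9: values above that quantile become NaN
partial = pd.qcut(ages, [0, 0.1, 0.3, 0.5, 0.7, 0.9], labels=labels)

print(equal_width.value_counts().sort_index().tolist())
print(equal_freq.value_counts().sort_index().tolist())
print(partial.isnull().sum())
```

If every value should receive a label, one option is to end the quantile list at 1 and supply one label per interval.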
4. Converting text variables
4.1. Viewing the values and categories of a text variable
# Method 1: value_counts
df['Sex'].value_counts()
# Method 2: unique
df['Sex'].unique()
df['Sex'].nunique()
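These three calls treat missing values differently, which matters for a column such as Embarked that contains NaN. A small sketch on a made-up Series (`embarked` is a hypothetical name): `value_counts` and `nunique` skip NaN by default, while `unique` includes it.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing entry
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S"])

# NaN is excluded unless dropna=False is passed
counts = embarked.value_counts()
counts_with_nan = embarked.value_counts(dropna=False)

# normalize=True returns shares of the non-missing values instead of counts
shares = embarked.value_counts(normalize=True)

# unique() lists NaN as a value, nunique() does not count it
n_unique_listed = len(embarked.unique())
n_unique_counted = embarked.nunique()

print(counts["S"], counts.sum(), counts_with_nan.sum())
print(n_unique_listed, n_unique_counted)
```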
4.2 Converting text
# Convert the text categories to numbers such as 1, 2, 3, 4, 5
# Method 1: replace
df['Sex_num'] = df['Sex'].replace(['male','female'],[1,2])
# Method 2: map
df['Sex_num'] = df['Sex'].map({'male': 1, 'female': 2})
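The two methods agree when every category is listed, but they differ on values that are not in the mapping. The sketch below uses a made-up Series with an extra category (`sex` and `mapping` are hypothetical names): `map` turns unlisted values into NaN, while `replace` leaves them as they were.

```python
import pandas as pd

# Made-up column containing a category that the mapping does not cover
sex = pd.Series(["male", "female", "unknown"])
mapping = {"male": 1, "female": 2}

# map: values missing from the dict become NaN
mapped = sex.map(mapping)

# replace: values missing from the dict are left untouched
replaced = sex.replace(mapping)

print(mapped.tolist())
print(replaced.tolist())
```

Checking `mapped.isnull().sum()` after a `map` call is a cheap way to catch categories that were overlooked.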
# Method 3: LabelEncoder from sklearn.preprocessing
from sklearn.preprocessing import LabelEncoder
for feat in ['Cabin', 'Ticket']:
    lbl = LabelEncoder()
    print(f"feat is {feat}")
    # Manual encoding: pair each unique value with an integer, in order of first appearance
    label_dict = dict(zip(df[feat].unique(), range(df[feat].nunique())))
    print(f"label_dict is {label_dict}")
    df[feat + "_labelEncode"] = df[feat].map(label_dict)
    # LabelEncoder overwrites the manual result; astype(str) turns NaN into the string 'nan'
    df[feat + "_labelEncode"] = lbl.fit_transform(df[feat].astype(str))
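For comparison, pandas offers two built-in ways to produce integer codes without scikit-learn. These are alternatives named here for illustration, not the method used in these notes. The sketch runs on a made-up Series (`ticket` is a hypothetical name): `pd.factorize` numbers values in order of first appearance, category codes number them in sorted order, and both mark missing values with -1.

```python
import numpy as np
import pandas as pd

# Made-up column with a repeated value and one missing entry
ticket = pd.Series(["B2", "A1", np.nan, "B2", "C3"])

# factorize: codes follow order of first appearance; NaN gets -1
codes_seen, uniques_seen = pd.factorize(ticket)

# category codes: codes follow sorted category order; NaN gets -1
codes_sorted = ticket.astype("category").cat.codes

print(list(codes_seen), list(uniques_seen))
print(codes_sorted.tolist())
```

Keeping missing values as -1 avoids creating an artificial 'nan' category, which is what happens when the column is cast to str before encoding.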
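5. One-hot encoding
# One-hot encode with pd.get_dummies
for feat in ["Age", "Embarked"]:
    # Build one indicator column per distinct value, prefixed with the feature name
    x = pd.get_dummies(df[feat], prefix=feat)
    df = pd.concat([df, x], axis=1)
df.head()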
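`pd.get_dummies` can also encode selected columns of a whole frame in one call, which replaces the loop and the concat. The sketch below is a minimal illustration on a made-up frame (`toy` is a hypothetical name); unlike the loop above, the `columns` parameter removes the source column from the result, and `dummy_na=True` adds an indicator for missing values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 8.05],
    "Embarked": ["S", "C", np.nan],
})

# Encode only the listed columns; other columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["Embarked"], dummy_na=True)

print(encoded.columns.tolist())
print(encoded)
```

Since a continuous column such as Age produces one indicator column per distinct value, encoding a binned version of it may be more practical.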
References
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html
Project repository: