Load the dataset and print the first five rows:
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats,integrate
import matplotlib.pyplot as plt
%matplotlib inline
data=pd.read_csv("Desktop/titanic_train.csv")
print(data.head())
   PassengerId  Survived  Pclass  \
0            1         0       3
1            2         1       1
2            3         1       3
3            4         1       1
4            5         0       3

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
2                             Heikkinen, Miss. Laina  female  26.0      0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
4                           Allen, Mr. William Henry    male  35.0      0

   Parch            Ticket     Fare Cabin Embarked
0      0         A/5 21171   7.2500   NaN        S
1      0          PC 17599  71.2833   C85        C
2      0  STON/O2. 3101282   7.9250   NaN        S
3      0            113803  53.1000  C123        S
4      0            373450   8.0500   NaN        S
View summary statistics for the raw data:
print(data.describe())
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008
std     257.353842    0.486592    0.836071   14.526497    1.102743
min       1.000000    0.000000    1.000000    0.420000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000
50%     446.000000    0.000000    3.000000   28.000000    0.000000
75%     668.500000    1.000000    3.000000   38.000000    1.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000

            Parch        Fare
count  891.000000  891.000000
mean     0.381594   32.204208
std      0.806057   49.693429
min      0.000000    0.000000
25%      0.000000    7.910400
50%      0.000000   14.454200
75%      0.000000   31.000000
max      6.000000  512.329200
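The count row shows that Age has only 714 non-null values out of 891 rows, so this column contains missing values.
There are many ways to fill in missing values; the most common are mean imputation, median imputation, and so on.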
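As a small illustration of the two simple strategies just mentioned, here is a minimal sketch on a toy column. The values in `ages` are made up for the example and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy age column with two missing entries (illustrative values only)
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 2.0])

n_missing = ages.isna().sum()                # number of missing entries
mean_filled = ages.fillna(ages.mean())       # fill with the mean of the observed values
median_filled = ages.fillna(ages.median())   # fill with the median of the observed values

print(n_missing)
print(mean_filled.tolist())
print(median_filled.tolist())
```

Both `mean()` and `median()` skip NaN by default, so the fill value is computed from the observed entries only.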
First, look at the original distribution of the age data:
sns.distplot(data["Age"].dropna(),kde=True,bins=50,fit=stats.gamma)#dropna() removes the missing (NaN) ages before plotting; note that distplot is deprecated since seaborn 0.11
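First, try mean imputation:
age_mean_filled=data["Age"].fillna(data["Age"].mean())#fill a separate copy so that data["Age"] keeps its NaN values for the model-based imputation later
sns.distplot(age_mean_filled,kde=True,bins=50,fit=stats.gamma)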
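The distribution is visibly distorted: every missing value now sits at the mean, which produces a spike there. A model can still be trained on mean-filled data, but a better imputation method keeps the distribution closer to the real one.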
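The distortion can also be checked numerically. The sketch below uses synthetic, gamma-distributed "ages" (an assumption made only for this illustration, not the Titanic data), hides 20% of them, and compares the spread before and after mean imputation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, right-skewed "ages" (illustrative only)
ages = pd.Series(rng.gamma(shape=4.0, scale=7.5, size=1000))

# Hide 20% of the values at random positions
ages_missing = ages.copy()
hidden = rng.choice(1000, size=200, replace=False)
ages_missing.iloc[hidden] = np.nan

fill_value = ages_missing.mean()
mean_filled = ages_missing.fillna(fill_value)

std_observed = ages_missing.std()                    # spread of the observed values
std_filled = mean_filled.std()                       # spread after mean imputation
share_at_mean = (mean_filled == fill_value).mean()   # fraction of rows sitting exactly at the mean

print(std_observed, std_filled, share_at_mean)
```

Because all filled rows land on a single value, the standard deviation shrinks and a fifth of the rows pile up at the mean, which is the spike seen in the plot.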
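So we pick some features that are related to age, such as fare (young people tend to be poorer, while wealthy people tend to be older), the number of parents/children aboard (Parch), the number of siblings/spouses aboard (SibSp), and passenger class (Pclass).
Before training the actual model, we use linear regression to roughly predict the ages of the passengers whose age is missing.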
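age_df=data[["Age","Fare","Parch","SibSp","Pclass"]].copy()#select the target (Age) and the age-related features
#Note: row 5 has no recorded age in the raw data; the 27.525206 printed below is presumably an imputed value left over from an earlier run of the notebook
print(age_df.head(10))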
Age Fare Parch SibSp Pclass
0 22.000000 7.2500 0 1 3
1 38.000000 71.2833 0 1 1
2 26.000000 7.9250 0 0 3
3 35.000000 53.1000 0 1 1
4 35.000000 8.0500 0 0 3
5 27.525206 8.4583 0 0 3
6 54.000000 51.8625 0 0 1
7 2.000000 21.0750 1 3 3
8 27.000000 11.1333 2 0 3
9 14.000000 30.0708 0 1 2
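Fare is on a much larger scale than the other features, so standardize it first:
#standardization
from sklearn import preprocessing
#.copy() makes an independent copy of the selected columns; without it, adding a new column
#triggers the pandas warning "A value is trying to be set on a copy of a slice from a DataFrame."
age_df=data[["Age","Fare","Parch","SibSp","Pclass"]].copy()
#StandardScaler expects 2-D input, so pass a one-column DataFrame (double brackets) and flatten the result:
#scaler=preprocessing.StandardScaler()
#age_df["Fare_scaled"]=scaler.fit_transform(age_df[["Fare"]]).ravel()
age_df["Fare_scaled"] = preprocessing.scale(age_df.loc[:,"Fare"])
print(age_df.head(10))
del age_df["Fare"]#drop the unscaled column after printing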
Age Fare Parch SibSp Pclass Fare_scaled
0 22.000000 7.2500 0 1 3 -0.502445
1 38.000000 71.2833 0 1 1 0.786845
2 26.000000 7.9250 0 0 3 -0.488854
3 35.000000 53.1000 0 1 1 0.420730
4 35.000000 8.0500 0 0 3 -0.486337
5 27.525206 8.4583 0 0 3 -0.478116
6 54.000000 51.8625 0 0 1 0.395814
7 2.000000 21.0750 1 3 3 -0.224083
8 27.000000 11.1333 2 0 3 -0.424256
9 14.000000 30.0708 0 1 2 -0.042956
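The shape problem with `StandardScaler` comes from passing a 1-D column where scikit-learn expects a 2-D array of shape (n_samples, n_features). A minimal sketch of one way around it, using a toy `df` that holds the five fares printed earlier:

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Toy frame with the first five fares from the printed output
df = pd.DataFrame({"Fare": [7.25, 71.2833, 7.925, 53.1, 8.05]})

scaler = preprocessing.StandardScaler()
scaled_2d = scaler.fit_transform(df[["Fare"]])   # double brackets give a 2-D input of shape (5, 1)
df["Fare_scaled"] = scaled_2d.ravel()            # flatten back to 1-D before assigning

# Same result as the preprocessing.scale shortcut on the 1-D column
same = np.allclose(df["Fare_scaled"], preprocessing.scale(df["Fare"]))
print(scaled_2d.shape, same)
```

Keeping the fitted `scaler` object has one practical advantage over `preprocessing.scale`: the same mean and standard deviation can later be applied to other data with `scaler.transform`.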
Preprocessing is done; split the dataset:
#split the rows by whether Age is missing
known_age=age_df[age_df.Age.notnull()]
unknown_age=age_df[age_df.Age.isnull()]
#print(known_age.head())
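x=known_age.iloc[:,1:]#features of the rows with a known age
y=known_age.iloc[:,0]#known ages (regression target)
x_test=unknown_age.iloc[:,1:]#features of the rows whose age must be predicted
y_pred=unknown_age.iloc[:,0]#placeholder (all NaN); overwritten by the model's predictions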
Train the model, predict, and fill in the missing values:
from sklearn import linear_model
lr = linear_model.LinearRegression()
model=lr.fit(x,y)
y_pred=lr.predict(x_test)
data.loc[data.Age.isnull(),"Age"]=y_pred
sns.distplot(data["Age"].dropna(),kde=True,bins=50,fit=stats.gamma)
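The result is shown in the figure above. Compared with mean imputation, filling the missing values with model predictions gives a distribution closer to that of the observed ages.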
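The split, fit, predict, and fill steps can be wrapped into one reusable helper. The function name `impute_with_regression` and the synthetic `demo` frame are hypothetical, introduced only for this sketch; it assumes the feature columns themselves contain no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def impute_with_regression(df, target, features):
    """Return a copy of df whose missing `target` values are replaced by
    linear-regression predictions computed from `features`."""
    out = df.copy()
    missing = out[target].isna()
    # Only fit when there is something to predict and something to learn from
    if missing.any() and (~missing).any():
        model = LinearRegression()
        model.fit(out.loc[~missing, features], out.loc[~missing, target])
        out.loc[missing, target] = model.predict(out.loc[missing, features])
    return out

# Synthetic demo data (not the Titanic data): age loosely depends on class
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=200),
    "SibSp": rng.integers(0, 3, size=200),
})
demo["Age"] = 50 - 8 * demo["Pclass"] + rng.normal(0, 5, size=200)
demo.loc[demo.sample(40, random_state=0).index, "Age"] = np.nan

filled = impute_with_regression(demo, "Age", ["Pclass", "SibSp"])
print(demo["Age"].isna().sum(), filled["Age"].isna().sum())
```

The helper works on a copy, so the original frame keeps its NaN values and the imputation can be rerun or compared against other strategies.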