Decision Tree Classification in Practice: Titanic Survival Prediction
Import the dataset and inspect basic information
import pandas as pd
titanic = pd.read_csv("../data/titanic.txt")
titanic.head()
# print the column names of the dataset
titanic.columns
Index(['row.names', 'pclass', 'survived', 'name', 'age', 'embarked', 'home.dest', 'room', 'ticket', 'boat', 'sex'], dtype='object')
Meaning of the data fields:
The dataset contains 11 fields; the name and meaning of each are as follows:
row.names: row number
pclass: passenger class
survived: whether the passenger survived
name: passenger name
age: age
embarked: port of embarkation
home.dest: home/destination
room: cabin/room number
ticket: ticket number
boat: lifeboat number
sex: sex
Attribute selection: analysis shows that some attributes (e.g. name) have no bearing on survival.
Select features and preprocess them
# We select the three main features "pclass", "age", and "sex" for model training
x = titanic[["pclass","age","sex"]]
y = titanic[["survived"]]
Fill in missing values
x.isnull().any()
pclass False age True sex False dtype: bool
# check for missing values
x.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 3 columns):
pclass    1313 non-null object
age       633 non-null float64
sex       1313 non-null object
dtypes: float64(1), object(2)
memory usage: 30.9+ KB
# About half of the age values are missing; dropping those rows would lose too much data,
# so instead of dropping them we fill the gaps with the mean of all known ages
x["age"].fillna(x["age"].mean(), inplace=True)
D:\anaconda3\lib\site-packages\pandas\core\generic.py:5430: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy self._update_inplace(new_data)
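As an aside, the SettingWithCopyWarning above can be avoided by assigning the filled column back instead of calling `fillna(..., inplace=True)` on a slice. A minimal sketch on a toy frame (the column names mirror the text, but the values are invented):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the titanic slice; the values are invented.
df = pd.DataFrame({
    "pclass": ["1st", "2nd", "3rd", "3rd"],
    "age": [29.0, np.nan, 24.0, np.nan],
    "sex": ["female", "male", "male", "female"],
})

# Assigning back (rather than inplace=True on a slice) avoids the warning.
df["age"] = df["age"].fillna(df["age"].mean())

print(df["age"].tolist())  # NaNs replaced by the mean of 29.0 and 24.0, i.e. 26.5
```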
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25)
x_train[:10]
Feature processing: vectorize the features
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)  # sparse=False returns a dense array rather than a sparse matrix
# vectorize the non-numeric features
x_train = vec.fit_transform(x_train.to_dict(orient="records"))
x_train[:5]
array([[31.19418104,  0.        ,  0.        ,  1.        ,  1.        ,  0.        ],
       [46.        ,  1.        ,  0.        ,  0.        ,  0.        ,  1.        ],
       [35.        ,  1.        ,  0.        ,  0.        ,  1.        ,  0.        ],
       [46.        ,  1.        ,  0.        ,  0.        ,  0.        ,  1.        ],
       [18.        ,  0.        ,  1.        ,  0.        ,  0.        ,  1.        ]])
x_train.shape
(984, 6)
# use transform (not fit_transform) so the test set reuses the feature mapping learned on the training set
x_test = vec.transform(x_test.to_dict(orient="records"))
x_test.shape
(329, 6)
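To see what the vectorizer actually does, here is a small self-contained sketch on two toy rows (not the real dataset): string-valued features are one-hot encoded as `feature=value` columns, while numeric features pass through unchanged, which is why three input columns become six output columns above.

```python
from sklearn.feature_extraction import DictVectorizer

# Two invented passenger records in dict form.
rows = [
    {"pclass": "1st", "age": 29.0, "sex": "female"},
    {"pclass": "3rd", "age": 24.0, "sex": "male"},
]

vec_demo = DictVectorizer(sparse=False)
mat = vec_demo.fit_transform(rows)

# Numeric "age" keeps one column; each (feature, value) pair gets its own column.
print(vec_demo.feature_names_)  # ['age', 'pclass=1st', 'pclass=3rd', 'sex=female', 'sex=male']
print(mat.shape)                # (2, 5)
```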
Create the decision tree model, train it, and predict
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()
dt.fit(x_train,y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
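All hyperparameters above are left at their defaults; in particular `max_depth=None` lets the tree grow until every leaf is pure, which can overfit. A hedged sketch on synthetic data (not the Titanic set) just to show the effect of limiting depth:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem, purely for illustration.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# The unrestricted tree memorizes the training set; the depth-limited one cannot.
print(deep.score(X_tr, y_tr), shallow.score(X_tr, y_tr))
```

Whether a shallower tree also generalizes better depends on the data; `max_depth` is typically chosen by cross-validation.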
y_pre = dt.predict(x_test)
y_pre[:10],y_test[:10]
(array([0, 0, 1, 0, 1, 0, 0, 0, 0, 0], dtype=int64),
       survived
 908          0
 822          0
 657          1
 856          0
 212          1
 641          1
 305          0
 778          1
 818          1
 1179         0)
dt.score(x_test,y_test)  # score is also called accuracy; it only gives a macro-level view of how accurate the model is
0.7872340425531915
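The score above is simply the fraction of test rows predicted correctly, so it can be reproduced by hand. A small sketch on invented labels (not the real predictions):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented true labels and predictions, 6 of 8 correct.
y_true = np.array([0, 0, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 0, 1, 0, 0, 0, 1, 0])

manual = (y_true == y_pred).mean()  # correct / total
print(manual, accuracy_score(y_true, y_pred))  # both 0.75
```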
Performance evaluation report
from sklearn.metrics import classification_report
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, y_pre, target_names=["died","survived"]))
             precision    recall  f1-score   support

       died       0.92      0.78      0.84       244
   survived       0.56      0.81      0.66        85

avg / total       0.83      0.79      0.80       329
Metrics in the performance report:
With two classes A and B, a prediction falls into one of four cases: True A, True B, False A, False B.
1. Accuracy (score): the probability that the model's prediction is correct: score = (True A + True B) / (True A + True B + False A + False B)
2. Precision: for each class, the fraction of samples predicted as that class that actually belong to it: precision_A = True A / (True A + False A)
3. Recall: for each class, the fraction of samples truly in that class that were predicted correctly: recall_A = True A / (True A + False B)
4. F1 score: F1_A = 2/(1/precision_A + 1/recall_A) = 2*(precision_A*recall_A)/(precision_A + recall_A), the harmonic mean of precision and recall; besides averaging the two, it gives a higher rating to models whose precision and recall are close to each other.
[Note] If precision and recall differ too much, the model is of little practical value.
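The formulas above can be checked numerically. A small sketch with invented binary labels (class 1 playing the role of class A), compared against sklearn's implementations:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Invented labels: class 1 stands in for class A.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # True A
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # False A
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # False B

precision = tp / (tp + fp)                            # True A / (True A + False A)
recall = tp / (tp + fn)                               # True A / (True A + False B)
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean

print(precision, recall, f1)  # 0.75 0.75 0.75
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```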