Background
The sinking of the Titanic is one of the most well-known shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1,502 of the 2,224 passengers and crew on board. This sensational tragedy shocked the international community and led to better safety regulations for ships. One reason the disaster was so deadly is that there were not enough lifeboats for the passengers and crew. Although some element of luck was involved in surviving the sinking, some groups of people were more likely to survive than others. So which factors ultimately determined whether a passenger survived?
Data Description
The dataset consists of three files: the training set, the test set, and the ground-truth labels for the test set.
Data description:
| Variable | Description | Data type |
| --- | --- | --- |
| PassengerId | Passenger ID | numeric |
| Survived | Survived or not | categorical |
| Pclass | Ticket class | categorical |
| Name | Passenger name | string |
| Sex | Sex | categorical |
| Age | Age | categorical |
| SibSp | Number of siblings/spouses aboard | numeric |
| Parch | Number of parents/children aboard | numeric |
| Ticket | Ticket number | string |
| Fare | Fare | numeric |
| Cabin | Cabin number | string |
| Embarked | Port of embarkation | categorical |
Note: the data types listed above are the types after preprocessing.
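The table above describes the columns after preprocessing; the raw Kaggle files can be inspected directly to see how pandas reads them. A minimal sketch, assuming the file names used in section 1.1 below (Age, for example, is numeric in the raw file and only becomes categorical through the feature-engineering step, presumably by binning, which is not shown here):

import pandas as pd

train = pd.read_csv('train.csv')

# column names, non-null counts and raw pandas dtypes
train.info()

# a numeric column that the preprocessing later treats as categorical
print(train['Age'].describe())
# an originally string-valued categorical column
print(train['Sex'].value_counts())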
III. Modeling and Model Evaluation
1. Data splitting
Split the feature-engineered data back into the original training set and test set.
1.1 Loading the data
import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
truth = pd.read_csv('gender_submission.csv')
# the combined data produced by the feature-engineering step
train_and_test = pd.read_csv('经过特征工程处理后的数据.csv')
PassengerId = test['PassengerId']
1.2 Splitting into training and test sets
# split point: the first test PassengerId minus 1, i.e. the number of training rows
index = PassengerId[0] - 1
train_and_test_drop = train_and_test.drop(['PassengerId', 'Name', 'Ticket'], axis=1)
train_data = train_and_test_drop[:index]
test_data = train_and_test_drop[index:]

train_X = train_data.drop(['Survived'], axis=1)
train_y = train_data['Survived']
test_X = test_data.drop(['Survived'], axis=1)
test_y = truth['Survived']
train_X.shape, train_y.shape, test_X.shape
Note: all models below are built with their default parameters; no extensive hyperparameter tuning, cross-validation, or complex models are involved. The main goal is to compare how different models differ under default parameters.
2. Modeling and model evaluation
This chapter covers the modeling and model-evaluation part. For simplicity, ready-made functions from sklearn are called directly, and all models use their default parameters; no extensive hyperparameter tuning or algorithm optimization is involved. Due to limited scope, only some common baseline models and ensemble models are listed here; readers can look up and add other models on their own.
from sklearn.linear_model import LogisticRegression        # logistic regression
from sklearn.ensemble import RandomForestClassifier        # random forest
from sklearn.svm import SVC                                 # support vector machine
from sklearn.neighbors import KNeighborsClassifier          # K-nearest neighbors
from sklearn.tree import DecisionTreeClassifier             # decision tree
from sklearn.ensemble import GradientBoostingClassifier     # gradient boosting (GBDT)
import lightgbm as lgb                                       # LightGBM
from xgboost.sklearn import XGBClassifier                    # XGBoost
from sklearn.ensemble import ExtraTreesClassifier            # extremely randomized trees
from sklearn.ensemble import AdaBoostClassifier              # AdaBoost
from sklearn.ensemble import BaggingClassifier               # Bagging
from sklearn.metrics import roc_auc_score                    # ROC AUC (not accuracy) for scoring predictions
import warnings
warnings.filterwarnings("ignore")
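Sections 2.1 to 2.12 all follow the same fit → predict → score pattern. As a hedged sketch (the helper name evaluate_model is mine and does not appear in the original), the repetition could be wrapped in one small function:

def evaluate_model(model, name, train_X, train_y, test_X, test_y):
    """Fit a classifier with default parameters and report its ROC AUC on the test set."""
    model.fit(train_X, train_y)
    pred = model.predict(test_X)
    score = roc_auc_score(test_y, pred)
    print(f"{name} ROC AUC:", score)
    return score

# example usage, equivalent to section 2.1 below:
# evaluate_model(LogisticRegression(), "Logistic regression", train_X, train_y, test_X, test_y)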
2.1 Logistic regression
lr = LogisticRegression()
lr.fit(train_X, train_y)
pred_lr = lr.predict(test_X)
accuracy_lr = roc_auc_score(test_y, pred_lr)
print("Logistic regression ROC AUC:", accuracy_lr)
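Note that roc_auc_score is applied here to the hard 0/1 labels from predict(). ROC AUC is usually computed on predicted probabilities, which measures how well the model ranks survivors above non-survivors; a minimal sketch of that variant (an addition, not part of the original workflow):

# AUC on the predicted probability of the positive class instead of 0/1 labels
proba_lr = lr.predict_proba(test_X)[:, 1]
print("Logistic regression ROC AUC (probabilities):", roc_auc_score(test_y, proba_lr))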
2.2 Random forest - RF
rfc = RandomForestClassifier()
rfc.fit(train_X, train_y)
pred_rfc = rfc.predict(test_X)
accuracy_rfc = roc_auc_score(test_y, pred_rfc)
print("Random forest ROC AUC:", accuracy_rfc)
2.3 Support vector machine - SVM
svm = SVC()
svm.fit(train_X, train_y)
pred_svm = svm.predict(test_X)
accuracy_svm = roc_auc_score(test_y, pred_svm)
print("SVM ROC AUC:", accuracy_svm)
2.4 K-nearest neighbors - KNN
knn = KNeighborsClassifier()
knn.fit(train_X, train_y)
pred_knn = knn.predict(test_X)
accuracy_knn = roc_auc_score(test_y, pred_knn)
print("KNN ROC AUC:", accuracy_knn)
2.5 Decision tree
dtree = DecisionTreeClassifier()
dtree.fit(train_X, train_y)
pred_dtree = dtree.predict(test_X)
accuracy_dtree = roc_auc_score(test_y, pred_dtree)
print("Decision tree ROC AUC:", accuracy_dtree)
2.6 Gradient boosting decision tree - GBDT
gbdt = GradientBoostingClassifier()
gbdt.fit(train_X, train_y)
pred_gbdt = gbdt.predict(test_X)
accuracy_gbdt = roc_auc_score(test_y, pred_gbdt)
print("GBDT ROC AUC:", accuracy_gbdt)
2.7 LightGBM
lgb_train = lgb.Dataset(train_X, train_y)
lgb_eval = lgb.Dataset(test_X, test_y, reference=lgb_train)
gbm = lgb.train(params={}, train_set=lgb_train, valid_sets=lgb_eval)
pred_lgb = gbm.predict(test_X, num_iteration=gbm.best_iteration)
accuracy_lgb = roc_auc_score(test_y, pred_lgb)
print("LightGBM ROC AUC:", accuracy_lgb)
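lgb.train is LightGBM's native interface; with an empty params dict it falls back to the library defaults (a regression objective), so predict() returns continuous scores that roc_auc_score can rank directly. For consistency with the other sklearn-style models, LightGBM also ships a scikit-learn wrapper; a minimal sketch of that alternative (not the approach used above):

from lightgbm import LGBMClassifier

lgbc = LGBMClassifier()                          # sklearn-style LightGBM classifier, default parameters
lgbc.fit(train_X, train_y)
pred_lgbc = lgbc.predict_proba(test_X)[:, 1]     # predicted probability of survival
print("LightGBM (sklearn API) ROC AUC:", roc_auc_score(test_y, pred_lgbc))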
2.8 XGBoost
xgbc = XGBClassifier()
xgbc.fit(train_X, train_y)
pred_xgbc = xgbc.predict(test_X)
accuracy_xgbc = roc_auc_score(test_y, pred_xgbc)
print("XGBoost ROC AUC:", accuracy_xgbc)
2.9 Extremely randomized trees (Extra-Trees)
etree = ExtraTreesClassifier()
etree.fit(train_X, train_y)
pred_etree = etree.predict(test_X)
accuracy_etree = roc_auc_score(test_y, pred_etree)
print("Extra-Trees ROC AUC:", accuracy_etree)
2.10 AdaBoost
abc = AdaBoostClassifier()
abc.fit(train_X, train_y)
pred_abc = abc.predict(test_X)
accuracy_abc = roc_auc_score(test_y, pred_abc)
print("AdaBoost ROC AUC:", accuracy_abc)
2.11 Bagging-based K-nearest neighbors
bag_knn = BaggingClassifier(KNeighborsClassifier())
bag_knn.fit(train_X, train_y)
pred_bag_knn = bag_knn.predict(test_X)
accuracy_bag_knn = roc_auc_score(test_y, pred_bag_knn)
print("Bagging-KNN ROC AUC:", accuracy_bag_knn)
2.12 Bagging-based decision tree
bag_dt = BaggingClassifier(DecisionTreeClassifier())
bag_dt.fit(train_X, train_y)
pred_bag_dt = bag_dt.predict(test_X)
accuracy_bag_dt = roc_auc_score(test_y, pred_bag_dt)
print("Bagging decision tree ROC AUC:", accuracy_bag_dt)
3. Summary
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(rc={'figure.figsize': (15, 6)})  # set the figure size

accuracys = [accuracy_lr, accuracy_rfc, accuracy_svm, accuracy_knn,
             accuracy_dtree, accuracy_gbdt, accuracy_lgb, accuracy_xgbc,
             accuracy_etree, accuracy_abc, accuracy_bag_knn, accuracy_bag_dt]
models = ['Logistic', 'RF', 'SVM', 'KNN', 'Dtree', 'GBDT', 'LightGBM',
          'XGBoost', 'Etree', 'Adaboost', 'Bagging-KNN', 'Bagging-Dtree']

bar = sns.barplot(x=models, y=accuracys)

# add the value label above each bar
for x, y in enumerate(accuracys):
    plt.text(x, y, '%s' % round(y, 3), ha='center')
plt.xlabel("Model")
plt.ylabel("ROC AUC")
plt.show()
As the bar chart shows, with all models at their default parameters, logistic regression achieves the highest score (the ROC AUC computed above), at 0.911, followed by LightGBM, which is also above 0.9. The models scoring above 0.8 are RF, GBDT, XGBoost, Extra-Trees, AdaBoost, and the Bagging-based decision tree; the remaining models score noticeably lower.
Since none of the models in this article were tuned or otherwise optimized, the comparison above only shows how the models behave under default parameters and does not represent the upper bound of what each model can achieve: a model that scores poorly with default parameters may improve substantially after hyperparameter tuning and algorithmic optimization.
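As a hedged illustration of that last point, here is a minimal grid-search sketch for the random forest; the parameter grid is an arbitrary example of my own, not tuned values from the original:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [4, 6, 8, None],
    'min_samples_leaf': [1, 2, 4],
}

# 5-fold cross-validated grid search on the training set, scored by ROC AUC
grid = GridSearchCV(RandomForestClassifier(), param_grid, scoring='roc_auc', cv=5)
grid.fit(train_X, train_y)

print("Best CV ROC AUC:", grid.best_score_)
print("Best parameters:", grid.best_params_)
print("Test ROC AUC:", roc_auc_score(test_y, grid.predict(test_X)))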
If anything in this article falls short, feel free to leave a comment.