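Project goal

**Exploring the factors that affect life expectancy**

The World Health Organization (WHO) has compiled a dataset tracking the health status of all countries over a period of time, including statistics such as life expectancy and adult mortality. Using this dataset, we explore the relationships between the variables and ask which factors have the greatest influence on life expectancy.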

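Try to answer the following questions:

- Do the predictors initially selected really affect life expectancy? Which predictor variables actually influence it?

- Should countries with a life expectancy below 65 increase their healthcare expenditure to improve average lifespan?

- How do infant and adult mortality affect life expectancy?

- Is life expectancy positively or negatively correlated with eating habits, lifestyle, exercise, smoking, alcohol consumption and so on?

- What effect does education have on human lifespan?

- Is life expectancy positively or negatively correlated with alcohol consumption?

- Does life expectancy tend to be lower in densely populated countries?

- What effect does immunization coverage have on life expectancy?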

Dataset: 数据/探索影响预期寿命的因素/Life Expectancy Data.csv

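This project analyzes data for 193 countries from 2000 to 2015. The individual data files were merged into a single dataset. An initial visual inspection showed that some values are missing; because the data comes from the WHO, we found no obvious errors beyond that. In R, the missmap command was used to inspect the missing data. The result shows that the gaps are concentrated in population, hepatitis B and GDP, and come mostly from less well-known countries such as Vanuatu, Tonga, Togo and Cabo Verde. Complete data for these countries is hard to find, so they were excluded from the final modeling dataset. The merged file (the final dataset) has 22 columns and 2938 rows: the country name, the life expectancy target and 20 predictor variables. The predictors fall into several broad categories: immunization-related factors, mortality factors, economic factors and social factors.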

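About the dataset:

- Year: year
- Status: development status (Developed / Developing)
- Life expectancy: life expectancy
- Adult Mortality: adult mortality
- infant deaths: infant deaths
- Alcohol: alcohol consumption
- percentage expenditure: health expenditure percentage
- Hepatitis B: hepatitis B
- Measles: measles
- under-five deaths: deaths under age five
- Polio: polio
- Total expenditure: total expenditure
- Diphtheria: diphtheria
- Population: population
- thinness 1-19 years: thinness, ages 1-19
- thinness 5-9 years: thinness, ages 5-9
- Income composition of resources: income composition of resources
- Schooling: schooling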

Sample of the data

Importing the data

First import the packages and the data the project needs.

# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix,accuracy_score
import warnings
warnings.filterwarnings('ignore')
plt.rcParams['font.sans-serif'] = ['SimHei'] # font with CJK glyphs, only needed for Chinese plot labels
plt.rcParams['axes.unicode_minus'] = False   # render minus signs correctly with that font
# Load the data
data = pd.read_csv('Life Expectancy Data.csv')
pd.set_option('display.max_columns',None)
data.head(10)

Output

Country	Year	Status	Life expectancy	Adult Mortality	infant deaths	Alcohol	percentage expenditure	Hepatitis B	Measles	BMI	under-five deaths	Polio	Total expenditure	Diphtheria	HIV/AIDS	GDP	Population	thinness 1-19 years	thinness 5-9 years	Income composition of resources	Schooling
0	Afghanistan	2015	Developing	65.0	263.0	62	0.01	71.279624	65.0	1154	19.1	83	6.0	8.16	65.0	0.1	584.259210	33736494.0	17.2	17.3	0.479	10.1
1	Afghanistan	2014	Developing	59.9	271.0	64	0.01	73.523582	62.0	492	18.6	86	58.0	8.18	62.0	0.1	612.696514	327582.0	17.5	17.5	0.476	10.0
2	Afghanistan	2013	Developing	59.9	268.0	66	0.01	73.219243	64.0	430	18.1	89	62.0	8.13	64.0	0.1	631.744976	31731688.0	17.7	17.7	0.470	9.9
3	Afghanistan	2012	Developing	59.5	272.0	69	0.01	78.184215	67.0	2787	17.6	93	67.0	8.52	67.0	0.1	669.959000	3696958.0	17.9	18.0	0.463	9.8
4	Afghanistan	2011	Developing	59.2	275.0	71	0.01	7.097109	68.0	3013	17.2	97	68.0	7.87	68.0	0.1	63.537231	2978599.0	18.2	18.2	0.454	9.5
5	Afghanistan	2010	Developing	58.8	279.0	74	0.01	79.679367	66.0	1989	16.7	102	66.0	9.20	66.0	0.1	553.328940	2883167.0	18.4	18.4	0.448	9.2
6	Afghanistan	2009	Developing	58.6	281.0	77	0.01	56.762217	63.0	2861	16.2	106	63.0	9.42	63.0	0.1	445.893298	284331.0	18.6	18.7	0.434	8.9
7	Afghanistan	2008	Developing	58.1	287.0	80	0.03	25.873925	64.0	1599	15.7	110	64.0	8.33	64.0	0.1	373.361116	2729431.0	18.8	18.9	0.433	8.7
8	Afghanistan	2007	Developing	57.5	295.0	82	0.02	10.910156	63.0	1141	15.2	113	63.0	6.73	63.0	0.1	369.835796	26616792.0	19.0	19.1	0.415	8.4
9	Afghanistan	2006	Developing	57.3	295.0	84	0.03	17.171518	64.0	1990	14.7	116	58.0	7.43	58.0	0.1	272.563770	2589345.0	19.2	19.3	0.405	8.1

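Basic information about the data

# Show column types and non-null counts
data.info()

Describe the numeric columns

# Summary statistics for the numeric columns
data.describe()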

Year	Life expectancy	Adult Mortality	infant deaths	Alcohol	percentage expenditure	Hepatitis B	Measles	BMI	under-five deaths	Polio	Total expenditure	Diphtheria	HIV/AIDS	GDP	Population	thinness 1-19 years	thinness 5-9 years	Income composition of resources	Schooling
count	2938.000000	2928.000000	2928.000000	2938.000000	2744.000000	2938.000000	2385.000000	2938.000000	2904.000000	2938.000000	2919.000000	2712.00000	2919.000000	2938.000000	2490.000000	2.286000e+03	2904.000000	2904.000000	2771.000000	2775.000000
mean	2007.518720	69.224932	164.796448	30.303948	4.602861	738.251295	80.940461	2419.592240	38.321247	42.035739	82.550188	5.93819	82.324084	1.742103	7483.158469	1.275338e+07	4.839704	4.870317	0.627551	11.992793
std	4.613841	9.523867	124.292079	117.926501	4.052413	1987.914858	25.070016	11467.272489	20.044034	160.445548	23.428046	2.49832	23.716912	5.077785	14270.169342	6.101210e+07	4.420195	4.508882	0.210904	3.358920
min	2000.000000	36.300000	1.000000	0.000000	0.010000	0.000000	1.000000	0.000000	1.000000	0.000000	3.000000	0.37000	2.000000	0.100000	1.681350	3.400000e+01	0.100000	0.100000	0.000000	0.000000
25%	2004.000000	63.100000	74.000000	0.000000	0.877500	4.685343	77.000000	0.000000	19.300000	0.000000	78.000000	4.26000	78.000000	0.100000	463.935626	1.957932e+05	1.600000	1.500000	0.493000	10.100000
50%	2008.000000	72.100000	144.000000	3.000000	3.755000	64.912906	92.000000	17.000000	43.500000	4.000000	93.000000	5.75500	93.000000	0.100000	1766.947595	1.386542e+06	3.300000	3.300000	0.677000	12.300000
75%	2012.000000	75.700000	228.000000	22.000000	7.702500	441.534144	97.000000	360.250000	56.200000	28.000000	97.000000	7.49250	97.000000	0.800000	5910.806335	7.420359e+06	7.200000	7.200000	0.779000	14.300000
max	2015.000000	89.000000	723.000000	1800.000000	17.870000	19479.911610	99.000000	212183.000000	87.300000	2500.000000	99.000000	17.60000	99.000000	50.600000	119172.741800	1.293859e+09	27.700000	28.600000	0.948000	20.700000

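Describe the non-numeric columns

# Summary statistics for the non-numeric columns
# ('object' selects the string columns; the np.object alias was removed in NumPy 1.24)
data.describe(include='object')

Check the size of the data

# Number of rows and columns
data.shape

(2938, 22)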

Data preprocessing

Check for missing values

# Count missing values per column
data.isnull().sum()

The raw data contains many missing values.
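To see where the gaps concentrate, the raw counts can be turned into per-column shares. The following is a minimal sketch on a small made-up frame rather than the real CSV; `missing_share` is a hypothetical helper name, not something from the original notebook.

```python
import numpy as np
import pandas as pd

def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return df.isnull().mean().sort_values(ascending=False)

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    "GDP": [584.2, np.nan, 631.7, np.nan],
    "Population": [33736494.0, np.nan, np.nan, np.nan],
    "Schooling": [10.1, 10.0, 9.9, 9.8],
})

share = missing_share(toy)
print(share)
```

On the real data the same call would be `missing_share(data)`, run before any rows are dropped.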

Drop missing values

# Drop every row that has at least one missing value, then confirm none remain
data.dropna(inplace=True)
data.isnull().sum()
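Because `dropna()` discards a row as soon as a single column is missing, it can be worth checking how many rows and how many countries survive. A minimal sketch, assuming a toy frame with made-up values; `dropna_report` is a hypothetical helper, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def dropna_report(df: pd.DataFrame, key: str = "Country") -> dict:
    """Summarise how many rows and distinct keys survive a full dropna()."""
    kept = df.dropna()
    return {
        "rows_before": len(df),
        "rows_after": len(kept),
        "keys_before": df[key].nunique(),
        "keys_after": kept[key].nunique(),
    }

# Hypothetical toy data: country B has no complete row at all.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C"],
    "GDP": [1.0, 2.0, np.nan, np.nan, 3.0],
    "Schooling": [9.0, np.nan, 8.0, 8.5, 12.0],
})

report = dropna_report(toy)
print(report)
```

The report has to be computed on the frame before it is modified in place, since `dropna(inplace=True)` leaves nothing to compare against.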

Check for duplicate rows: True means duplicates exist, False means there are none

# True if any row duplicates an earlier one
data.duplicated().any()

False, so there are no duplicate rows.

# Alternatively, drop duplicates directly: data.drop_duplicates(inplace=True)

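Data visualization

Should countries with a life expectancy below 65 increase their healthcare expenditure to improve average lifespan?

# Scatter plot of health expenditure against life expectancy for rows with life expectancy below 65
data1 = data[data['Life expectancy ']<65]
plt.figure(figsize=(12,8))
plt.scatter(x=data1['percentage expenditure'],y=data1['Life expectancy '])
plt.xlabel('percentage expenditure')
plt.ylabel('Life expectancy')
plt.title('Health expenditure percentage vs. life expectancy (below 65)')
plt.show()

The plot shows that most countries with a life expectancy below 65 spend very little on health, and that life expectancy tends to rise as the expenditure percentage increases, i.e. the two are positively correlated. This suggests that countries with a life expectancy below 65 could improve average lifespan by raising healthcare expenditure, although a correlation alone does not establish that spending is the cause.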
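A trend read off a scatter plot can be backed with a number. The sketch below computes Pearson and rank (Spearman-style) correlations on the below-65 subset. It runs on synthetic data generated here for illustration, so the printed values say nothing about the real dataset; only the column names follow the CSV, including the trailing space in 'Life expectancy '.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data with a built-in positive trend plus noise.
rng = np.random.default_rng(0)
n = 200
spend = rng.uniform(0, 500, n)
life = 50 + 0.02 * spend + rng.normal(0, 3, n)
toy = pd.DataFrame({"percentage expenditure": spend, "Life expectancy ": life})

# Same filter as in the analysis: keep rows with life expectancy below 65.
low = toy[toy["Life expectancy "] < 65]

# Pearson measures linear association.
pearson = low["percentage expenditure"].corr(low["Life expectancy "])
# Rank correlation: Pearson applied to the ranks, robust to skewed spending values.
spearman = low["percentage expenditure"].rank().corr(low["Life expectancy "].rank())

print(f"n={len(low)}, Pearson={pearson:.2f}, Spearman={spearman:.2f}")
```

On the real data, replacing `toy` with `data` would give the coefficients that correspond to the plot.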

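# Correlation
"""
Draw a correlation heatmap with seaborn's heatmap:
vmax   upper limit of the color scale
square whether each cell is drawn as a square
annot  whether to print the coefficient in each cell
cbar   whether to show the color bar
cmap   color theme
"""
fig = plt.figure(figsize=(18,18))
# numeric_only=True skips the text columns (Country, Status); pandas 2.x raises an error without it
sns.heatmap(data.corr(numeric_only=True),vmax=1,annot=True,linewidths=0.5,cbar=False,cmap='YlGnBu',annot_kws={'fontsize':18})
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.title('Correlation coefficients between the factors',fontsize=20)
plt.show()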

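Is life expectancy positively or negatively correlated with eating habits, lifestyle, exercise, smoking, alcohol consumption and so on?

'''

The heatmap above shows a correlation coefficient of 0.4 between life expectancy and alcohol, a weak positive correlation.

The correlation between life expectancy and thinness (both 1-19 and 5-9 years) is -0.46, a negative correlation.

The correlations between life expectancy and income composition and schooling are 0.72 and 0.73, fairly strong positive correlations.

'''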
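Reading individual coefficients off a large heatmap is error-prone; sorting the target's column of the correlation matrix gives the same information as a ranked list. A minimal sketch on a made-up six-row frame (the values are illustrative only), assuming pandas 1.5 or later for `numeric_only`:

```python
import pandas as pd

# Hypothetical toy data; column names follow the dataset, values are made up.
toy = pd.DataFrame({
    "Life expectancy ": [59.2, 59.5, 59.9, 65.0, 72.1, 75.7],
    "Schooling": [9.5, 9.8, 9.9, 10.1, 12.3, 14.3],
    "Adult Mortality": [275.0, 272.0, 268.0, 263.0, 144.0, 74.0],
    "Country": ["A", "A", "A", "A", "B", "C"],
})

target = "Life expectancy "
corr_with_target = (
    toy.corr(numeric_only=True)[target]  # correlations of every numeric column with the target
    .drop(target)                        # the target's correlation with itself is always 1
    .sort_values()                       # most negative first, most positive last
)
print(corr_with_target)
```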

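What effect does education have on human lifespan?

# Schooling plotted against life expectancy, adult mortality and the two thinness measures
plt.figure(figsize=(12,8))
plt.subplot(2,2,1)
plt.scatter(x=data['Schooling'],y=data['Life expectancy '])
plt.title('Schooling vs. life expectancy')
plt.subplot(2,2,2)
plt.scatter(x=data['Schooling'],y=data['Adult Mortality'])
plt.title('Schooling vs. adult mortality')
plt.subplot(2,2,3)
plt.scatter(x=data['Schooling'],y=data[' thinness  1-19 years'])
plt.title('Schooling vs. thinness 1-19 years')
plt.subplot(2,2,4)
plt.scatter(x=data['Schooling'],y=data[' thinness 5-9 years'])
plt.title('Schooling vs. thinness 5-9 years')
plt.show()

'''

The plots show a positive relationship between schooling and life expectancy: countries with more schooling have higher life expectancy.

Schooling is negatively related to adult mortality and to thinness in both age groups, so countries with less schooling tend to have higher adult mortality and more thinness.

'''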

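How do infant and adult mortality affect life expectancy?

# Adult mortality and infant deaths plotted against life expectancy
plt.figure(figsize=(12,8))
plt.subplot(1,2,1)
plt.scatter(x=data['Adult Mortality'],y=data['Life expectancy '])
plt.title('Adult mortality vs. life expectancy')
plt.subplot(1,2,2)
plt.scatter(x=data['infant deaths'],y=data['Life expectancy '])
plt.title('Infant deaths vs. life expectancy')
plt.show()

'''

Adult mortality shows a fairly strong negative relationship with life expectancy: the higher a country's adult mortality, the lower its life expectancy.

Infant deaths show only a weak negative relationship with life expectancy, so in this view they account for little of the variation.

'''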

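Is life expectancy positively or negatively correlated with alcohol consumption?

# Alcohol consumption plotted against life expectancy
plt.figure(figsize=(12,8))
plt.scatter(x=data['Alcohol'],y=data['Life expectancy '])
plt.title('Alcohol consumption vs. life expectancy')
plt.show()

'''

The plot shows a weak positive correlation between life expectancy and alcohol consumption.

'''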

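Does life expectancy tend to be lower in densely populated countries?

# Population plotted against life expectancy, as a scatter plot and as a line plot
plt.figure(figsize=(12,8))
plt.subplot(1,2,1)
plt.scatter(x=data['Population'],y=data['Life expectancy '])
plt.title('Population vs. life expectancy')
plt.subplot(1,2,2)
plt.plot(data['Population'],data['Life expectancy '])
plt.title('Population vs. life expectancy (line plot)')
plt.show()

'''

The scatter plot shows no clear relationship between population size and life expectancy, and the line plot reveals no pattern either, so the data gives no sign that life expectancy falls in more populous countries. Note that the Population column is total population, not population density, so it is only a rough proxy for the question asked.

'''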

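Comparing life expectancy, infant deaths, measles, under-five deaths, polio and hepatitis B across countries by Status

# Group by Status and take the mean of each numeric factor
# (numeric_only=True skips the Country column; pandas 2.x raises an error without it)
group1 = data.groupby('Status').mean(numeric_only=True)
group1
# One bar chart per factor, comparing Developed and Developing countries
plt.figure(figsize=(20,15))
plt.subplot(3,2,1)
group1['Life expectancy '].plot(kind='bar',title='Life expectancy')
plt.subplot(3,2,2)
group1['infant deaths'].plot(kind='bar',title='infant deaths')
plt.subplot(3,2,3)
group1['Measles '].plot(kind='bar',title='Measles')
plt.subplot(3,2,4)
group1['under-five deaths '].plot(kind='bar',title='under-five deaths')
plt.subplot(3,2,5)
group1['Polio'].plot(kind='bar',title='Polio')
plt.subplot(3,2,6)
group1['Hepatitis B'].plot(kind='bar',title='Hepatitis B')
plt.show()

'''

The charts show that life expectancy is higher in developed countries than in developing ones, and that infant deaths and under-five deaths are far higher in developing countries, whereas the polio and hepatitis B values differ only slightly between the two groups.

'''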

Feature engineering

Bin life expectancy into grades 1-6

# Bin life expectancy into grades 1-6; right=False makes each bin [left, right)
data1 = data.copy()
data1['grade'] = pd.cut(data1['Life expectancy '],bins=[35,45,55,65,75,85,90],labels=['1','2','3','4','5','6'],right=False)
data1.head()

Country	Year	Status	Life expectancy	Adult Mortality	infant deaths	Alcohol	percentage expenditure	Hepatitis B	Measles	BMI	under-five deaths	Polio	Total expenditure	Diphtheria	HIV/AIDS	GDP	Population	thinness 1-19 years	thinness 5-9 years	Income composition of resources	Schooling	grade
0	Afghanistan	2015	Developing	65.0	263.0	62	0.01	71.279624	65.0	1154	19.1	83	6.0	8.16	65.0	0.1	584.259210	33736494.0	17.2	17.3	0.479	10.1	4
1	Afghanistan	2014	Developing	59.9	271.0	64	0.01	73.523582	62.0	492	18.6	86	58.0	8.18	62.0	0.1	612.696514	327582.0	17.5	17.5	0.476	10.0	3
2	Afghanistan	2013	Developing	59.9	268.0	66	0.01	73.219243	64.0	430	18.1	89	62.0	8.13	64.0	0.1	631.744976	31731688.0	17.7	17.7	0.470	9.9	3
3	Afghanistan	2012	Developing	59.5	272.0	69	0.01	78.184215	67.0	2787	17.6	93	67.0	8.52	67.0	0.1	669.959000	3696958.0	17.9	18.0	0.463	9.8	3
4	Afghanistan	2011	Developing	59.2	275.0	71	0.01	7.097109	68.0	3013	17.2	97	68.0	7.87	68.0	0.1	63.537231	2978599.0	18.2	18.2	0.454	9.5	3
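The `right=False` argument decides which grade a boundary value such as 65.0 falls into, which is why the first row above gets grade 4 rather than 3. A small sketch with the same bins and labels, applied to a handful of hand-picked values (the variable names are illustrative):

```python
import pandas as pd

bins = [35, 45, 55, 65, 75, 85, 90]
labels = ["1", "2", "3", "4", "5", "6"]
life = pd.Series([36.3, 59.9, 65.0, 74.9, 75.0, 89.0])

# right=False: each bin is [left, right), so 65.0 belongs to [65, 75).
grades_left_closed = pd.cut(life, bins=bins, labels=labels, right=False)

# right=True (the pandas default): each bin is (left, right], so 65.0 belongs to (55, 65].
grades_right_closed = pd.cut(life, bins=bins, labels=labels, right=True)

print(list(grades_left_closed))
print(list(grades_right_closed))
```

With `right=False` the upper edge 90 itself would fall outside every bin and become NaN; that is harmless here because the largest value in the data is 89.0.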

Encode the Status column as 0 and 1

# Encode Status: 0 for Developing, 1 for Developed
data1['Status'] = data1['Status'].apply(lambda x:0 if x == 'Developing' else 1)
data1.head()

Country	Year	Status	Life expectancy	Adult Mortality	infant deaths	Alcohol	percentage expenditure	Hepatitis B	Measles	BMI	under-five deaths	Polio	Total expenditure	Diphtheria	HIV/AIDS	GDP	Population	thinness 1-19 years	thinness 5-9 years	Income composition of resources	Schooling	grade
0	Afghanistan	2015	0	65.0	263.0	62	0.01	71.279624	65.0	1154	19.1	83	6.0	8.16	65.0	0.1	584.259210	33736494.0	17.2	17.3	0.479	10.1	4
1	Afghanistan	2014	0	59.9	271.0	64	0.01	73.523582	62.0	492	18.6	86	58.0	8.18	62.0	0.1	612.696514	327582.0	17.5	17.5	0.476	10.0	3
2	Afghanistan	2013	0	59.9	268.0	66	0.01	73.219243	64.0	430	18.1	89	62.0	8.13	64.0	0.1	631.744976	31731688.0	17.7	17.7	0.470	9.9	3
3	Afghanistan	2012	0	59.5	272.0	69	0.01	78.184215	67.0	2787	17.6	93	67.0	8.52	67.0	0.1	669.959000	3696958.0	17.9	18.0	0.463	9.8	3
4	Afghanistan	2011	0	59.2	275.0	71	0.01	7.097109	68.0	3013	17.2	97	68.0	7.87	68.0	0.1	63.537231	2978599.0	18.2	18.2	0.454	9.5	3

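Modeling

Splitting the dataset

# Features: everything except the target, the grade derived from it and the country name
X = data1.drop(['Life expectancy ','grade','Country'],axis=1)
y = data1['grade']
# Hold out 20% of the rows as a test set
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)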
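The grades are unevenly populated, so a purely random split can leave the rare grades almost absent from the test set. One option, shown here as a sketch and not as what the original notebook does, is to pass `stratify` so that each grade keeps its share in both parts. The data below is synthetic and deliberately imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and imbalanced string labels, standing in for X and y.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array(["3"] * 20 + ["4"] * 70 + ["6"] * 10)

# stratify=y_toy keeps the 20/70/10 class proportions in both train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

labels, counts = np.unique(y_te, return_counts=True)
test_counts = dict(zip(labels.tolist(), counts.tolist()))
print(test_counts)
```

Stratification requires at least two samples in every class, so a grade with a single row would have to be merged with a neighbour or dropped first.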

Building a decision tree model

# Build a decision tree model
from io import StringIO
from sklearn import tree
import pydotplus
tree_clf = DecisionTreeClassifier(max_depth=3)
tree_clf.fit(x_train,y_train)
y_pred = tree_clf.predict(x_test)
print('Accuracy:',accuracy_score(y_pred=y_pred,y_true=y_test))
print('Confusion matrix:',confusion_matrix(y_test,y_pred))
# Export the fitted tree as Graphviz DOT text and render it to a PNG (needs Graphviz installed)
dot_data = StringIO()
tree.export_graphviz(
    tree_clf,
    out_file=dot_data,
    feature_names=list(X.columns),
    class_names=['1','2','3','4','5','6'],
    rounded=True,
    filled=True
)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('tree1.png')

Accuracy: 0.7515151515151515
Confusion matrix: [[  0   0   1   0   0   0]
 [  0  10  13   1   0   0]
 [  0   4  48   3   0   0]
 [  0   0  17 141   0   0]
 [  0   0   0  39  49   0]
 [  0   0   0   0   4   0]]
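The accuracy figure can be recovered from the confusion matrix itself, and the same matrix also gives per-grade recall, which accuracy hides. The sketch below copies the decision tree matrix reported above (rows are true grades, columns are predicted grades, following scikit-learn's convention); the variable names are illustrative.

```python
import numpy as np

# Decision tree confusion matrix from the output above.
cm = np.array([
    [0, 0, 1, 0, 0, 0],
    [0, 10, 13, 1, 0, 0],
    [0, 4, 48, 3, 0, 0],
    [0, 0, 17, 141, 0, 0],
    [0, 0, 0, 39, 49, 0],
    [0, 0, 0, 0, 4, 0],
])

# Accuracy: correctly classified samples (the diagonal) over all samples.
accuracy = np.trace(cm) / cm.sum()
# Recall per grade: correct predictions over the number of true samples in that row.
recall = np.diag(cm) / cm.sum(axis=1)

print(f"accuracy = {accuracy:.4f}")
print("recall per grade:", np.round(recall, 2))
```

The diagonal sums to 248 out of 330 test rows, which matches the reported 0.7515; the first and last grades have zero recall, so the depth-3 tree never predicts them correctly.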

Building a GBDT model

# Build a gradient boosting (GBDT) model
gbst = GradientBoostingClassifier()
gbst.fit(x_train,y_train)
y_pred = gbst.predict(x_test)
print('Accuracy',accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))

Accuracy 0.8939393939393939
[[  0   1   0   0   0   0]
 [  0  17   7   0   0   0]
 [  0   1  51   3   0   0]
 [  0   0   3 153   2   0]
 [  0   0   0  17  71   0]
 [  0   0   0   0   1   3]]

Building a random forest model

# Build a random forest model
rfc = RandomForestClassifier()
rfc.fit(x_train,y_train)
y_pred = rfc.predict(x_test)
print('Accuracy:',accuracy_score(y_pred=y_pred,y_true=y_test))
print('Confusion matrix:',confusion_matrix(y_test,y_pred))
# Print the feature importance scores, most important first
feat_labels = x_train.columns
importances = rfc.feature_importances_
indices = np.argsort(importances)[::-1]
for rank,j in enumerate(indices,start=1):
    print(rank, feat_labels[j], importances[j])

Accuracy: 0.8878787878787879
Confusion matrix: [[  0   1   0   0   0   0]
 [  0  18   6   0   0   0]
 [  0   2  51   2   0   0]
 [  0   0   3 155   0   0]
 [  0   0   0  20  68   0]
 [  0   0   0   0   3   1]]
1 Adult Mortality 0.2072291122189003
2 Income composition of resources 0.15119291250758393
3  HIV/AIDS 0.08028919893081636
4  thinness 5-9 years 0.07166453731213923
5 Schooling 0.06714860274427781
6  thinness  1-19 years 0.052012459499437905
7 percentage expenditure 0.049963594335953106
8 Alcohol 0.04191949699555478
9 GDP 0.04132651013754628
10  BMI  0.04056282954843009
11 Total expenditure 0.03431848502937779
12 under-five deaths  0.023643667064835658
13 infant deaths 0.022416087868431765
14 Population 0.020849863426314803
15 Year 0.019126681509410066
16 Hepatitis B 0.0189391761817397
17 Diphtheria  0.018302310953593165
18 Polio 0.01756546236252764
19 Measles  0.015518045355663208

The listing covers the 19 highest-ranked of the 20 features; the remaining one, Status, ranks last.

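The random forest's feature importances show that adult mortality, income composition of resources, HIV/AIDS, thinness at ages 5-9 and schooling score highest, which suggests these are the factors most strongly associated with life expectancy in this model. On that reading, lowering adult mortality, raising incomes, increasing healthcare spending and strengthening schooling are plausible ways to raise life expectancy, although feature importance measures predictive value rather than cause and effect.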
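Impurity-based importances from a random forest are computed on the training data and can favour features with many distinct values. As a cross-check, and not as part of the original analysis, scikit-learn's `permutation_importance` measures how much the held-out score drops when one feature is shuffled. The sketch below uses synthetic data with one informative and one pure-noise feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends only on `signal`; `noise` is unrelated.
rng = np.random.default_rng(0)
n = 400
signal = rng.normal(size=n)
noise = rng.normal(size=n)
X_toy = np.column_stack([signal, noise])
y_toy = (signal > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the test set and record the mean drop in accuracy.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for name, mean in zip(["signal", "noise"], result.importances_mean):
    print(f"{name}: {mean:.3f}")
```

Applied to the fitted `rfc` with `x_test` and `y_test`, the same call would show whether the ranking above holds on unseen rows.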

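Comparing the accuracy of the three models (decision tree 0.75, GBDT 0.89, random forest 0.89), the random forest or GBDT model is the better choice for predicting life expectancy.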
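A single train/test split gives one accuracy number per model, and a gap of less than one percentage point between GBDT and the random forest may be within split-to-split noise. Cross-validation gives a mean and a spread instead. The sketch below is an illustration on data from `make_classification`, not a rerun of the article's experiment; the dictionary names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for X and y.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "GBDT": GradientBoostingClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}

# Five-fold cross-validated accuracy for each model.
cv_scores = {
    name: cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")
    for name, model in models.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Fixing `random_state` on each model also makes the comparison repeatable from run to run.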

