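#I'm a data newbie (really - - ). I only started using DataCamp a few days ago and find it quite fun. Nearly all the exercises go from easy to hard, and most can be completed as long as you can follow the gist of the English.
#If I have time, I plan to write up some notes on what I learn.
#If anyone else is learning on DataCamp, feel free to leave a comment so we can study together.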
#title
Dr. Semmelweis and the discovery of handwashing
##summary
Reanalyse the data behind one of the most important discoveries of modern medicine: Handwashing.
##skill
pandas foundations
The whole story is set against Ignaz Semmelweis's discovery in 1847:
In 1847 the Hungarian physician Ignaz Semmelweis makes a breakthrough discovery: He discovers handwashing. Contaminated hands were a major cause of childbed fever and by enforcing handwashing at his hospital he saved hundreds of lives.
The project is split into 9 parts:
- 1.Meet Dr. Ignaz Semmelweis
- 2.The alarming number of deaths
- 3.Death at the clinics
- 4.The handwashing begins
- 5.The effect of handwashing
- 6.The effect of handwashing highlighted
- 7.More handwashing, fewer deaths?
- 8.A Bootstrap analysis of Semmelweis handwashing data
- 9.The fate of Dr. Semmelweis
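Below I annotate each part. Overall it is pretty basic (really = =), but I still hope it helps others who, like me, are outsiders to data analysis with Python.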
1. Meet Dr. Ignaz Semmelweis
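The professor noticed that mothers who had just given birth often died of an illness called childbed fever, so he investigated and gathered some data:
# importing modules
# ... YOUR CODE FOR TASK 1 ...
import pandas as pd  # import pandas under the short name pd
import csv  # import csv (not strictly needed here, pd.read_csv does the parsing)
# Read datasets/yearly_deaths_by_clinic.csv into yearly
yearly = pd.read_csv('datasets/yearly_deaths_by_clinic.csv')  # load the CSV file into the variable yearly
# Print out yearly
# ... YOUR CODE FOR TASK 1 ...
print(yearly)  # print yearly to check its contents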
output:
    year  births  deaths    clinic
0   1841    3036     237  clinic 1
1   1842    3287     518  clinic 1
2   1843    3060     274  clinic 1
3   1844    3157     260  clinic 1
4   1845    3492     241  clinic 1
5   1846    4010     459  clinic 1
6   1841    2442      86  clinic 2
7   1842    2659     202  clinic 2
8   1843    2739     164  clinic 2
9   1844    2956      68  clinic 2
10  1845    3241      66  clinic 2
11  1846    3754     105  clinic 2
2. The alarming number of deaths
Looking at the output above, the professor felt that things were not that simple:
# Calculate proportion of deaths per no. births
# ... YOUR CODE FOR TASK 2 ...
yearly["proportion_deaths"] = yearly['deaths'] / yearly['births']  # add a proportion_deaths (death rate) column
# Extract clinic 1 data into yearly1 and clinic 2 data into yearly2
yearly1 = yearly.loc[yearly['clinic'] == 'clinic 1']  # select the clinic 1 rows using .loc
yearly2 = yearly.loc[yearly['clinic'] == 'clinic 2']  # select the clinic 2 rows
# Print out yearly1
# ... YOUR CODE FOR TASK 2 ...
print(yearly1)
output:
   year  births  deaths    clinic  proportion_deaths
0  1841    3036     237  clinic 1           0.078063
1  1842    3287     518  clinic 1           0.157591
2  1843    3060     274  clinic 1           0.089542
3  1844    3157     260  clinic 1           0.082357
4  1845    3492     241  clinic 1           0.069015
5  1846    4010     459  clinic 1           0.114464
Reference for loc: https://www.cnblogs.com/to-creat/p/7724562.html
To select the rows whose column value equals a scalar some_value, use ==:
df.loc[df['column_name'] == some_value]
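To see what that one-liner is doing, here is a small sketch on a made-up DataFrame (the values are hypothetical, not the project's data). The comparison first builds a boolean Series, and .loc then keeps only the rows where it is True:

```python
import pandas as pd

# Hypothetical toy data, not the project's dataset
df = pd.DataFrame({
    "clinic": ["clinic 1", "clinic 2", "clinic 1"],
    "deaths": [10, 3, 7],
})

# The comparison yields a boolean Series, one True/False per row
mask = df["clinic"] == "clinic 1"

# .loc keeps only the rows where the mask is True (original index labels are preserved)
clinic1 = df.loc[mask]

print(mask.tolist())
print(clinic1)
```

Note that the selected rows keep their original index labels, which is why yearly2 above would start at index 6 rather than 0.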
3. Death at the clinics
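Once plotted, it becomes much more intuitive and obvious:
# This makes plots appear in the notebook
# (IPython magic command; keep it on a line of its own, without a trailing comment)
%matplotlib inline
# Plot yearly proportion of deaths at the two clinics
# ... YOUR CODE FOR TASK 3 ...
# plot with year on the x axis and death rate on the y axis; label sets the legend entry.
# The returned axes are stored in ax so that yearly1 and yearly2 can share one figure.
ax = yearly1.plot(x="year", y="proportion_deaths", label="clinic1")
yearly2.plot(x="year", y="proportion_deaths", label="clinic2", ax=ax)  # ax=ax draws on the same axes
ax.set_ylabel("Proportion deaths")  # set the y-axis label with set_ylabel("name")
output:
Reference for plot: https://blog.csdn.net/sinat_24395003/article/details/60364345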
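The trick of reusing ax can be tried outside the notebook too. The following is only a sketch with made-up numbers; the Agg backend line is my own assumption so that it runs as a plain script without a display (in a notebook, %matplotlib inline takes its place):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the script runs without a display
import pandas as pd

# Hypothetical series, not the project's data
a = pd.DataFrame({"year": [1, 2, 3], "rate": [0.5, 0.7, 0.6]})
b = pd.DataFrame({"year": [1, 2, 3], "rate": [0.2, 0.3, 0.1]})

# The first call creates the axes and returns them
ax = a.plot(x="year", y="rate", label="series a")
# Passing ax=ax makes the second call draw onto the same axes
b.plot(x="year", y="rate", label="series b", ax=ax)
ax.set_ylabel("rate")

n_lines = len(ax.get_lines())
print(n_lines)
```

Without ax=ax, the second plot call would open a new figure of its own and the two curves could not be compared directly.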
4. The handwashing begins
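From the analysis above we can tell that the death rate at clinic 1 is higher than at clinic 2. Why is that (scratches head)? It turns out that the students delivering babies at clinic 1 were also working on corpses. So the professor gave an order: from now on, everyone must wash their hands after working with corpses!!! He then collected data from 1841 to 1849:
# Read datasets/monthly_deaths.csv into monthly
# parse_dates tells pandas to treat the date column as datetimes (not strings),
# so the values are ordered in time and can be compared as earlier or later
monthly = pd.read_csv("datasets/monthly_deaths.csv", parse_dates=["date"])
# Calculate proportion of deaths per no. births
# ... YOUR CODE FOR TASK 4 ...
monthly["proportion_deaths"] = monthly["deaths"] / monthly["births"]
# Print out the first rows in monthly
# ... YOUR CODE FOR TASK 4 ...
print(monthly.head(3))  # print part of monthly (the first three rows)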
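To check what parse_dates actually changes, here is a minimal sketch that reads the same small CSV text twice. The CSV content is made up for illustration and only assumes the same column names as the project file:

```python
import io
import pandas as pd

# Hypothetical CSV text with the same column names as monthly_deaths.csv
csv_text = (
    "date,births,deaths\n"
    "1847-05-01,100,5\n"
    "1847-06-01,100,2\n"
    "1847-07-01,100,1\n"
)

as_text = pd.read_csv(io.StringIO(csv_text))                         # date stays text
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])  # date becomes datetime

text_is_datetime = pd.api.types.is_datetime64_any_dtype(as_text["date"])
dates_is_datetime = pd.api.types.is_datetime64_any_dtype(as_dates["date"])
print(text_is_datetime, dates_is_datetime)

# With real datetimes, comparing against a timestamp works row by row
is_before = (as_dates["date"] < pd.to_datetime("1847-06-01")).tolist()
print(is_before)
```

This is the property that the later before/after split relies on.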
5. The effect of handwashing
So did handwashing have any effect?
# Plot monthly proportion of deaths
# ... YOUR CODE FOR TASK 5 ...
ax=monthly.plot(x="date",y="proportion_deaths",label="deaths after handwashing")
ax.set_ylabel("Proportion deaths")
output:
6. The effect of handwashing highlighted
Whoa, the line really does seem to drop, though it is not very obvious:
# Date when handwashing was made mandatory
import pandas as pd
handwashing_start = pd.to_datetime('1847-06-01')  # mark the date when the "handwashing incident" began
# Split monthly into before and after handwashing_start
before_washing = monthly.loc[monthly['date'] < handwashing_start]  # split the timeline into before and after handwashing (this is where parse_dates pays off)
after_washing = monthly.loc[monthly['date'] >= handwashing_start]
# Plot monthly proportion of deaths before and after handwashing
# ... YOUR CODE FOR TASK 6 ...
ax = before_washing.plot(x='date', y='proportion_deaths', label='before washing')
after_washing.plot(x='date', y='proportion_deaths', label='after washing', ax=ax)
ax.set_ylabel("Proportion deaths")  # set_ylabel is a method and must be called, not assigned to
output:
7. More handwashing, fewer deaths?
Now that is impressive, nice and clear. But is handwashing really related to the drop in the death rate? Let's compute the means and see:
# Difference in mean monthly proportion of deaths due to handwashing
before_proportion = before_washing['proportion_deaths']
after_proportion = after_washing['proportion_deaths']
mean_diff = after_proportion.mean()-before_proportion.mean()
mean_diff
output:
-0.08395660751183336
8. A Bootstrap analysis of Semmelweis handwashing data
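We can see that the death rate really did fall by about 8 percentage points, so handwashing seems to work. But the data scientist felt the story was not over yet, and went on to use a bootstrap analysis (自助法 in Chinese?):
Reference: https://www.zhihu.com/question/38429969
https://en.wikipedia.org/wiki/Bootstrapping_%28statistics%29 (too long, didn't read)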
# A bootstrap analysis of the reduction of deaths due to handwashing
boot_mean_diff = []  # start with an empty list
for i in range(3000):  # repeat the experiment 3000 times
    # frac=1 -> draw as many values as the original series; replace=True -> draw with replacement
    boot_before = before_proportion.sample(frac=1, replace=True)
    boot_after = after_proportion.sample(frac=1, replace=True)
    # compute one difference in means and append it to boot_mean_diff
    boot_mean_diff.append(boot_after.mean() - boot_before.mean())
# Calculating a 95% confidence interval from boot_mean_diff
confidence_interval = pd.Series(boot_mean_diff).quantile([0.025, 0.975])
confidence_interval
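As a side note, the same idea can be wrapped in a function with a fixed random seed, so that the interval comes out the same on every run. This is only a sketch of my own, not the project's solution: the function name, the seed parameter, and the demo numbers are all hypothetical, and it resamples with NumPy instead of Series.sample:

```python
import numpy as np
import pandas as pd

def bootstrap_mean_diff_ci(before, after, n_boot=3000, seed=0):
    """Percentile 95% interval for mean(after) - mean(before) (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # fixed seed -> reproducible resamples
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Draw as many values as the original sample, with replacement
        boot_before = rng.choice(before, size=before.size, replace=True)
        boot_after = rng.choice(after, size=after.size, replace=True)
        diffs[i] = boot_after.mean() - boot_before.mean()
    lower, upper = np.quantile(diffs, [0.025, 0.975])
    return lower, upper

# Made-up proportions for illustration only
before_demo = pd.Series([0.08, 0.12, 0.10, 0.15, 0.09])
after_demo = pd.Series([0.02, 0.03, 0.01, 0.04, 0.02])

lower, upper = bootstrap_mean_diff_ci(before_demo, after_demo)
print(lower, upper)
```

With the project's data one would pass before_proportion and after_proportion instead of the demo series.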
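One thing in the bootstrap loop that I did not quite understand: in each of the 3000 experiments, before_proportion and after_proportion are only shuffled before taking the means and subtracting, so why is the result of boot_after.mean() - boot_before.mean() different every time? (Shouldn't it be the same? All the samples are right there, and the mean has nothing to do with the order they are arranged in.)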
print(boot_mean_diff[0:20])
[-0.07787261202620424, -0.07424799825364967, -0.09358312005955502, -0.08894556209810614, -0.08087685009098905, -0.06429190709356139, -0.08068023440789948, -0.07240951438539092, -0.06750565112365006, -0.0676633601804324, -0.08713968457785505, -0.08382681590118775, -0.08280812612089627, -0.08059191110129257, -0.09227479693648963, -0.07786725171910112, -0.08150749269654012, -0.08903607701866195, -0.061659787819670464, -0.0809971940784796]
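Answering my own question (shoegazing - - ):
Because replace=True, every sample is drawn from the full set (always the original data), so a value may be picked more than once, or not at all.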
test:
import pandas as pd
a = pd.read_csv('C:/Users/chenchutong/Desktop/1.csv')  # picked a random slice of some data for the experiment
b=a['births']
print(b)
print('---------')
print(b.sample(frac=1,replace=True))
output:
0    254
1    239
2    277
3    255
Name: births, dtype: int64
---------
0    254
1    239
1    239
2    277
Name: births, dtype: int64
We can see that when sampling with replacement, the sampled values may repeat, and that is what makes the mean differ.
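To make the contrast explicit, here is a small sketch using the four births values from the test output above. The random_state argument is my own addition, only there to make the run repeatable:

```python
import pandas as pd

b = pd.Series([254, 239, 277, 255])  # the births values from the test output

# replace=False with frac=1 is a pure shuffle: same values, new order
shuffled = b.sample(frac=1, replace=False, random_state=0)
# replace=True draws 4 values independently, so some may repeat and some may be missing
resampled = b.sample(frac=1, replace=True, random_state=0)

print(b.mean(), shuffled.mean())  # a pure shuffle cannot change the mean
print(sorted(shuffled.tolist()) == sorted(b.tolist()))
print(resampled.tolist(), resampled.mean())
```

So my original intuition was right for a pure shuffle (replace=False), where the mean stays at 256.25 no matter the order. It is the replacement that lets each bootstrap sample have a different mean.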
9. The fate of Dr. Semmelweis
So handwashing reduced the proportion of deaths by between 6.7 and 10 percentage points, according to a 95% confidence interval. All in all, it would seem that Semmelweis had solid evidence that handwashing was a simple but highly effective procedure that could save many lives.
The tragedy is that, despite the evidence, Semmelweis' theory — that childbed fever was caused by some "substance" (what we today know as bacteria) from autopsy room corpses — was ridiculed by contemporary scientists. The medical community largely rejected his discovery and in 1849 he was forced to leave the Vienna General Hospital for good.
One reason for this was that statistics and statistical arguments were uncommon in medical science in the 1800s. Semmelweis only published his data as long tables of raw data, but he didn't show any graphs or confidence intervals. If he had had access to the analysis we've just put together, he might have been more successful in getting the Viennese doctors to wash their hands.
# The data Semmelweis collected points to that:
doctors_should_wash_their_hands = True
that's all thank you~~~
dataset:https://github.com/chenchutong/DESOLATION_ROW