UCB Data100: Principles and Techniques of Data Science: Chapters 6 to 10 (1) https://developer.aliyun.com/article/1427168
7.4 Variable Types Should Inform Plot Choice
7.5 Bar Plots
import pandas as pd
import numpy as np

wb = pd.read_csv("data/world_bank.csv", index_col=0)
wb.head()
Continent | Country | Primary completion rate: Male: % of relevant age group: 2015 | Primary completion rate: Female: % of relevant age group: 2015 | Lower secondary completion rate: Male: % of relevant age group: 2015 | Lower secondary completion rate: Female: % of relevant age group: 2015 | Youth literacy rate: Male: % of ages 15-24: 2005-14 | Youth literacy rate: Female: % of ages 15-24: 2005-14 | Adult literacy rate: Male: % ages 15 and older: 2005-14 | Adult literacy rate: Female: % ages 15 and older: 2005-14 | … | Access to improved sanitation facilities: % of population: 1990 | Access to improved sanitation facilities: % of population: 2015 | Child immunization rate: Measles: % of children ages 12-23 months: 2015 | Child immunization rate: DTP3: % of children ages 12-23 months: 2015 | Children with acute respiratory infection taken to health provider: % of children under age 5 with ARI: 2009-2016 | Children with diarrhea who received oral rehydration and continuous feeding: % of children under age 5 with diarrhea: 2009-2016 | Children sleeping under treated bed nets: % of children under age 5: 2009-2016 | Children with fever receiving antimalarial drugs: % of children under age 5 with fever: 2009-2016 | Tuberculosis: Treatment success rate: % of new cases: 2014 | Tuberculosis: Cases detection rate: % of new estimated cases: 2015 | |
0 | Africa | Algeria | 106.0 | 105.0 | 68.0 | 85.0 | 96.0 | 92.0 | 83.0 | 68.0 | … | 80.0 | 88.0 | 95.0 | 95.0 | 66.0 | 42.0 | NaN | NaN | 88.0 | 80.0 |
1 | Africa | Angola | NaN | NaN | NaN | NaN | 79.0 | 67.0 | 82.0 | 60.0 | … | 22.0 | 52.0 | 55.0 | 64.0 | NaN | NaN | 25.9 | 28.3 | 34.0 | 64.0 |
2 | Africa | Benin | 83.0 | 73.0 | 50.0 | 37.0 | 55.0 | 31.0 | 41.0 | 18.0 | … | 7.0 | 20.0 | 75.0 | 79.0 | 23.0 | 33.0 | 72.7 | 25.9 | 89.0 | 61.0 |
3 | Africa | Botswana | 98.0 | 101.0 | 86.0 | 87.0 | 96.0 | 99.0 | 87.0 | 89.0 | … | 39.0 | 63.0 | 97.0 | 95.0 | NaN | NaN | NaN | NaN | 77.0 | 62.0 |
5 | Africa | Burundi | 58.0 | 66.0 | 35.0 | 30.0 | 90.0 | 88.0 | 89.0 | 85.0 | … | 42.0 | 48.0 | 93.0 | 94.0 | 55.0 | 43.0 | 53.8 | 25.4 | 91.0 | 51.0 |
5 rows × 47 columns
We can use a bar plot to visualize the distribution of the Continent column.
7.5.1 Plotting in Pandas
wb['Continent'].value_counts().plot(kind = 'bar');
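Recall that .value_counts() returns a Series containing the total count of each unique value. We call .plot(kind = 'bar') on this result to visualize the counts as a bar plot. The plotting methods in pandas are the least preferred and are not supported in Data 100, because their functionality is limited. Instead, future examples will focus on other libraries built specifically for visualizing data. The best known of these is matplotlib.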
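As a minimal sketch of what .value_counts() hands to the plotting call, the snippet below uses a small made-up table (the name toy and its values are hypothetical and are not taken from world_bank.csv):

```python
import pandas as pd

# Hypothetical toy data standing in for the Continent column of wb
toy = pd.DataFrame(
    {"Continent": ["Africa", "Asia", "Africa", "Europe", "Asia", "Africa"]}
)

# value_counts() returns a Series: the index holds the unique values,
# the values hold their counts, sorted from most to least frequent
counts = toy["Continent"].value_counts()
print(counts)
```

The index of this Series supplies the bar labels and its values supply the bar heights.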
7.5.2 Plotting in Matplotlib
import matplotlib.pyplot as plt # matplotlib is typically given the alias plt

continent = wb['Continent'].value_counts()
plt.bar(continent.index, continent)
plt.xlabel('Continent')
plt.ylabel('Count');
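Although more code is needed to achieve the same result, matplotlib is often preferred over pandas for plotting, since the pandas plotting methods are more limited. Note, however, that we had to label the axes ourselves with plt.xlabel and plt.ylabel, because matplotlib does not support automatic axis labeling. To get around these inconveniences, we can use a more efficient plotting library: seaborn.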
7.5.3 Plotting in Seaborn
import seaborn as sns # seaborn is typically given the alias sns

sns.countplot(data = wb, x = 'Continent');
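sns.countplot both counts and visualizes the number of occurrences of each unique value in a given column. This column is specified by the x argument to sns.countplot, while the DataFrame is specified by the data argument. In contrast to matplotlib, the general structure of a seaborn call involves passing in an entire DataFrame and then specifying which column(s) to plot. For most visualizations, seaborn is more concise and aesthetically pleasing than matplotlib. However, the color scheme of this particular bar plot is arbitrary: it encodes no additional information about the categories themselves. This is not always the case; color may signify meaningful detail in other visualizations. We will explore this in more depth in the next lecture.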
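When designing a bar plot, there are several questions to consider:
- What colors should we use?
- How wide should the bars be?
- Should there be a legend?
- Should the bars and axes have dark borders?
Some ways to improve a plot include:
- Introducing a different color for each bar
- Including a legend
- Including a title
- Labeling the y-axis
- Using a color-blind friendly palette
- Re-orienting the labels
- Increasing the font size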
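Several of the improvements listed above can be sketched in matplotlib as follows. This is only one possible approach. The counts are made-up placeholder numbers (not the real wb counts), and the hex codes are the Okabe-Ito colors, a widely used color-blind friendly palette chosen here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical placeholder counts; in the notes these would come from
# wb['Continent'].value_counts()
labels = ["Africa", "Europe", "Asia", "N. America", "Oceania", "S. America"]
counts = [5, 4, 3, 2, 2, 1]

# Okabe-Ito colors, a commonly used color-blind friendly palette
palette = ["#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00"]

plt.rcParams.update({"font.size": 14})             # increase the font size
fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.bar(labels, counts, color=palette)       # a different color for each bar
ax.set_title("Number of countries per continent")  # include a title
ax.set_xlabel("Continent")
ax.set_ylabel("Count")                             # label the y-axis
ax.tick_params(axis="x", labelrotation=45)         # re-orient the tick labels
ax.legend(bars, labels, title="Continent")         # include a legend
fig.tight_layout()
```

Here the legend repeats what the x-axis labels already say. It earns its place mainly when color encodes a variable that is not otherwise shown on an axis.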
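7.6 Distributions of Quantitative Variables
Revisiting our example with the wb DataFrame, let's plot the distribution of gross national income per capita.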
sns.countplot(data = wb, x = 'Gross national income per capita, Atlas method: $: 2016');
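A count plot draws one bar for each unique value, so it is poorly suited to a quantitative variable that takes many distinct values. To visualize the distribution of a quantitative variable, we can instead use:
- Histograms
- Box plots
- Violin plots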
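To see why the count plot above is a poor fit, the sketch below counts how many bars it would need for a continuous variable. The data is synthetic (a hypothetical income Series drawn from a lognormal distribution), not the wb column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical continuous variable (synthetic, not the wb data)
income = pd.Series(rng.lognormal(mean=8.5, sigma=1.0, size=200))

# A count plot needs one bar per unique value
n_bars_countplot = income.nunique()

# A histogram groups the same values into a small number of bins instead
counts, edges = np.histogram(income, bins=10)

print(f"count plot bars: {n_bars_countplot}, histogram bars: {len(counts)}")
```

With continuous data, nearly every observation is its own unique value, so the count plot ends up with roughly one bar per row, while the histogram summarizes the same data in ten bars.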
7.7 Box Plots and Violin Plots
sns.boxplot(data=wb, y='Gross national income per capita, Atlas method: $: 2016');
sns.violinplot(data=wb, y="Gross national income per capita, Atlas method: $: 2016");
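Quartiles split the data into four parts, each containing 25% of the data points. We say:
- The first quartile (Q1) is the 25th percentile: 25% of the data lies below the first quartile.
- The second quartile (Q2) is the 50th percentile, also known as the median: 50% of the data lies below the second quartile.
- The third quartile (Q3) is the 75th percentile: 75% of the data lies below the third quartile.
This means that the middle 50% of the data lies between the first and third quartiles. This is demonstrated in the histogram below, where the three quartiles are marked with red vertical lines.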
gdp = wb['Gross domestic product: % growth : 2016']
gdp = gdp[~gdp.isna()]
q1, q2, q3 = np.percentile(gdp, [25, 50, 75])

wb_quartiles = wb.copy()
wb_quartiles['category'] = None
wb_quartiles.loc[(wb_quartiles['Gross domestic product: % growth : 2016'] < q1) | (wb_quartiles['Gross domestic product: % growth : 2016'] > q3), 'category'] = 'Outside of the middle 50%'
wb_quartiles.loc[(wb_quartiles['Gross domestic product: % growth : 2016'] > q1) & (wb_quartiles['Gross domestic product: % growth : 2016'] < q3), 'category'] = 'In the middle 50%'

sns.histplot(wb_quartiles, x="Gross domestic product: % growth : 2016", hue="category")
sns.rugplot([q1, q2, q3], c="firebrick", lw=6, height=0.1);
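The claim that about half of the data falls between Q1 and Q3 can also be checked numerically. The following is a small sketch on synthetic data (the array values is a hypothetical stand-in for a growth column, not the wb data):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for a growth-rate column
values = rng.normal(loc=3.0, scale=2.0, size=1001)

# Same call as in the notes: 25th, 50th and 75th percentiles
q1, q2, q3 = np.percentile(values, [25, 50, 75])

# Fraction of the data lying between Q1 and Q3 (inclusive)
middle = np.mean((values >= q1) & (values <= q3))

print(f"Q1={q1:.2f}, median={q2:.2f}, Q3={q3:.2f}, middle fraction={middle:.3f}")
```

The fraction comes out very close to 0.5, as the definition of the quartiles suggests.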
In a box plot, the lower edge of the box is at Q1 and the upper edge is at Q3. The horizontal line inside the box corresponds to Q2 (the median).
sns.boxplot(data=wb, y='Gross domestic product: % growth : 2016');
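The whiskers of a box plot are bounded by the two points Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, where IQR = Q3 − Q1 is the interquartile range. These are the lower and upper limits of "normal" data (the points that are not outliers). By default, seaborn draws each whisker at the most extreme data point that still falls within these limits and plots anything beyond them as an individual outlier.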
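The whisker rule can be worked through by hand on a small array. This is a sketch with made-up numbers (the array data is hypothetical), using the same np.percentile call as earlier in this section:

```python
import numpy as np

# Toy data with one unusually large value
data = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 12.0])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                      # interquartile range

lower_fence = q1 - 1.5 * iqr       # lower limit of "normal" data
upper_fence = q3 + 1.5 * iqr       # upper limit of "normal" data

# Points beyond the fences are outliers
outliers = data[(data < lower_fence) | (data > upper_fence)]

# Whiskers end at the most extreme points that are still inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

print(q1, q3, iqr, lower_fence, upper_fence, outliers, whisker_low, whisker_high)
```

For this array, Q1 = 2.5 and Q3 = 4.5, so IQR = 2.0 and the limits are -0.5 and 7.5. The value 12.0 is the only outlier, and the whiskers end at 1.0 and 5.0.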
sns.violinplot(data=wb, y='Gross domestic product: % growth : 2016');
7.8 Side-by-Side Box and Violin Plots
With seaborn, we can easily create side-by-side plots by specifying both an x and a y column.
sns.boxplot(data=wb, x="Continent", y='Gross domestic product: % growth : 2016');
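A side-by-side box plot draws one box per category, so the numbers behind it are per-group quartiles. A rough sketch of that summary with pandas, using a hypothetical toy table (the column names group and value are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: two categories with five values each
toy = pd.DataFrame({
    "group": ["A"] * 5 + ["B"] * 5,
    "value": [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
})

# Q1, median and Q3 for each category: one row per box in a side-by-side plot
summary = toy.groupby("group")["value"].quantile([0.25, 0.5, 0.75]).unstack()
print(summary)
```

Each row of summary corresponds to one box: group A has quartiles 2, 3 and 4, and group B has quartiles 20, 30 and 40.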
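7.9 Plotting Histograms
You are likely familiar with histograms from Data 8. A histogram collects continuous data into bins, then plots this binned data. Each bin reflects the density of data points whose values lie between the left and right ends of the bin, so the area of each bar is the proportion of data points that fall in that bin.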
# The `edgecolor` argument controls the color of the bin edges
gni = wb["Gross national income per capita, Atlas method: $: 2016"]
plt.hist(gni, density=True, edgecolor="white")

# Add labels
plt.xlabel("Gross national income per capita")
plt.ylabel("Density")
plt.title("Distribution of gross national income per capita");
sns.histplot(data=wb, x="Gross national income per capita, Atlas method: $: 2016", stat="density")
plt.title("Distribution of gross national income per capita");
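To make the density scale concrete, the sketch below computes bar heights by hand as count / (n × bin width) and compares them with NumPy's own density histogram. The array x and the bin edges are made-up values, chosen so that the bins have unequal widths.

```python
import numpy as np

# Toy data and hand-picked bin edges (bins may have unequal widths)
x = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 7.0, 9.0])
edges = np.array([0.0, 2.5, 5.0, 10.0])

# Step 1: count how many points fall in each bin
counts, _ = np.histogram(x, bins=edges)

# Step 2: density = proportion of points in the bin divided by the bin width
widths = np.diff(edges)
density_manual = counts / (counts.sum() * widths)

# NumPy's built-in density computation for comparison
density_numpy, _ = np.histogram(x, bins=edges, density=True)

print(counts, density_manual, density_numpy)
```

The counts are 3, 3 and 2, giving densities 0.15, 0.15 and 0.05. The last bin is twice as wide as the others, so its bar is shorter than its count alone would suggest.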
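7.9.1 Overlaid Histograms
Note that the resulting plot includes a legend describing which color corresponds to each hemisphere. A legend should always be included whenever color is used to encode information in a visualization!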
# Create a new variable to store the hemisphere in which each country is located
north = ["Asia", "Europe", "N. America"]
south = ["Africa", "Oceania", "S. America"]
wb.loc[wb["Continent"].isin(north), "Hemisphere"] = "Northern"
wb.loc[wb["Continent"].isin(south), "Hemisphere"] = "Southern"
sns.histplot(data=wb, x="Gross national income per capita, Atlas method: $: 2016", hue="Hemisphere", stat="density")
plt.title("Distribution of gross national income per capita");
densities, bins, _ = plt.hist(gni, density=True, edgecolor="white", bins=5)
plt.xlabel("Gross national income per capita")
plt.ylabel("Density")

print(f"First bin has width {bins[1]-bins[0]} and height {densities[0]}")
print(f"This corresponds to {bins[1]-bins[0]} * {densities[0]} = {(bins[1]-bins[0])*densities[0]*100}% of the data")
First bin has width 16410.0 and height 4.7741589911386953e-05
This corresponds to 16410.0 * 4.7741589911386953e-05 = 78.343949044586% of the data
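The arithmetic in the printed output above can be reproduced directly: width times height gives the share of the data in the first bin. The second half of this sketch checks, on synthetic data (a hypothetical exponential sample, not the wb column), that the areas of all bars in a density histogram add up to 1.

```python
import numpy as np

# Numbers copied from the printed output above
width = 16410.0
height = 4.7741589911386953e-05
first_bin_share = width * height   # area of the first bar = share of the data

# Synthetic check: the bar areas of a density histogram sum to 1
rng = np.random.default_rng(1)
sample = rng.exponential(scale=10_000, size=500)
densities, bins = np.histogram(sample, bins=5, density=True)
total_area = np.sum(densities * np.diff(bins))

print(f"first bin share: {first_bin_share:.4f}, total area: {total_area:.4f}")
```

The first bin therefore holds about 78% of the countries, which is why the density values on the y-axis look so small: they are proportions per dollar of income, spread over a bin more than 16,000 dollars wide.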
UCB Data100: Principles and Techniques of Data Science: Chapters 6 to 10 (3) https://developer.aliyun.com/article/1427170