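Many readers who want to learn Python and move into data analysis forget exactly how to draw a particular chart while studying or working, so Xingge (行哥) is sharing matplotlib and seaborn cheat sheets for readers to keep at hand while plotting.
All the data can be obtained by replying 【一行04】 in the backend of Xingge's WeChat official account 【一行数据】.
seaborn cheat sheet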
Some readers have also said that when visualizing data they are not sure which chart to use. The top 50 visualization examples collected here give you a catalogue to choose from.
Table of contents
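0 Setup
1. Correlation
1.1 Scatter plot
1.2 Bubble plot
1.3 Scatter plot with line of best fit
1.4 Jittered strip plot
1.5 Counts plot
1.6 Marginal histogram
1.7 Marginal box plot
1.8 Correlogram
1.9 Pairwise plot
2. Deviation
2.1 Diverging bars
2.2 Diverging texts
2.3 Diverging dot plot
2.4 Diverging lollipop chart with markers
2.5 Area chart
3. Ranking
3.1 Ordered bar chart
3.2 Lollipop chart
3.3 Dot plot
3.4 Slope chart
3.5 Dumbbell plot
4. Distribution
4.1 Histogram for a continuous variable
4.2 Histogram for a categorical variable
4.3 Density plot
4.4 Density curves with histogram
4.5 Joy Plot
4.6 Distributed dot plot
4.7 Box plot
4.8 Dot + box plot
4.9 Violin plot
4.10 Population pyramid
4.11 Categorical plots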
0 Setup
# !pip install brewer2mpl
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings; warnings.filterwarnings(action='once')

large = 22; med = 16; small = 12
params = {'legend.fontsize': med,
          'figure.figsize': (16, 10),
          'axes.labelsize': med,
          'axes.titlesize': med,
          'xtick.labelsize': med,
          'ytick.labelsize': med,
          'figure.titlesize': large}
plt.rcParams.update(params)
# Style name as of Matplotlib 3.0; Matplotlib 3.6+ calls it 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
sns.set_style("white")
# Jupyter magic; drop this line when running as a plain script
%matplotlib inline

# mac font (only needed for CJK labels)
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
# windows font
# plt.rcParams['font.sans-serif'] = ['SimHei']

# Version
print(mpl.__version__)  #> 3.0.0
print(sns.__version__)  #> 0.9.0
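1. Correlation
Correlation plots are used to visualize the relationship between two or more variables, that is, how one variable changes with respect to another.
1.1 Scatter plot
A scatter plot is the classic, fundamental plot for studying the relationship between two variables. If the data contains multiple groups, you may want to draw each group in a different color. In matplotlib, you can do this conveniently with plt.scatter().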
# Import dataset
midwest = pd.read_csv("data/midwest_filter.csv")

# Prepare Data
# Create as many colors as there are unique midwest['category']
categories = np.unique(midwest['category'])
colors = [plt.cm.tab10(i/float(len(categories)-1)) for i in range(len(categories))]
print(colors)

# Draw Plot for Each Category
plt.figure(figsize=(16, 10), dpi=80, facecolor='w', edgecolor='k')

for i, category in enumerate(categories):
    # Wrap the RGBA tuple in a list so it is read as one color, not as four values
    plt.scatter('area', 'poptotal',
                data=midwest.loc[midwest.category == category, :],
                s=30, c=[colors[i]], label=str(category))

# Decorations
plt.gca().set(xlim=(0.0, 0.1), ylim=(0, 90000), xlabel='Area', ylabel='Population')
plt.xticks(fontsize=16); plt.yticks(fontsize=16)
plt.title("Scatter plot: Midwest area vs. population", fontsize=22)
plt.legend(fontsize=12)
plt.show()
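1.2 Bubble plot
Sometimes you want to show a group of points inside a boundary to emphasize their importance. In this example, you select the records that should be encircled from the dataframe and pass them to the encircle() function defined in the code below.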
from matplotlib import patches
from scipy.spatial import ConvexHull
import warnings; warnings.simplefilter('ignore')
sns.set_style("white")
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']

# Step 1: Prepare Data
midwest = pd.read_csv("data/midwest_filter.csv")

# As many colors as there are unique midwest['category']
categories = np.unique(midwest['category'])
colors = [plt.cm.tab10(i/float(len(categories)-1)) for i in range(len(categories))]

# Step 2: Draw Scatterplot with unique color for each category
fig = plt.figure(figsize=(16, 10), dpi=80, facecolor='w', edgecolor='k')

for i, category in enumerate(categories):
    # 'dot_size' is a column of the dataframe; it sets the bubble size
    plt.scatter('area', 'poptotal',
                data=midwest.loc[midwest.category == category, :],
                s='dot_size', c=[colors[i]], label=str(category),
                edgecolors='black', linewidths=.5)

# Step 3: Encircling
# https://stackoverflow.com/questions/44575681/how-do-i-encircle-different-data-sets-in-scatter-plot
def encircle(x, y, ax=None, **kw):
    """Draw the convex hull of the points (x, y) as a polygon on ax."""
    if not ax:
        ax = plt.gca()
    p = np.c_[x, y]
    hull = ConvexHull(p)
    poly = plt.Polygon(p[hull.vertices, :], **kw)
    ax.add_patch(poly)

# Select data to be encircled
midwest_encircle_data = midwest.loc[midwest.state == 'IN', :]

# Draw polygon surrounding vertices
encircle(midwest_encircle_data.area, midwest_encircle_data.poptotal, ec="k", fc="gold", alpha=0.1)
encircle(midwest_encircle_data.area, midwest_encircle_data.poptotal, ec="firebrick", fc="none", linewidth=1.5)

# Step 4: Decorations
plt.gca().set(xlim=(0.0, 0.1), ylim=(0, 90000), xlabel='Area', ylabel='Population')
plt.xticks(fontsize=12); plt.yticks(fontsize=12)
plt.title("Bubble plot", fontsize=22)
plt.legend(fontsize=12)
plt.show()
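1.3 Scatter plot with line of best fit
If you want to understand how two variables change with respect to each other, fitting a line of best fit is the way to go.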
# Import Data
df = pd.read_csv("data/mpg_ggplot2.csv")
df_select = df.loc[df.cyl.isin([4, 8]), :]

# Plot
sns.set_style("white")
gridobj = sns.lmplot(x="displ", y="hwy", hue="cyl", data=df_select,
                     height=7, aspect=1.6, robust=True, palette='tab10',
                     scatter_kws=dict(s=60, linewidths=.7, edgecolors='black'))

# Decorations
gridobj.set(xlim=(0.5, 7.5), ylim=(0, 50))
plt.title("Scatterplot with line of best fit grouped by number of cylinders", fontsize=20)
plt.show()
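1.4 Jittered strip plot
Often several data points have exactly the same X and Y values. They are then drawn on top of each other and hide one another. To avoid this, jitter the points slightly so that you can see each of them.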
# Import Data
df = pd.read_csv("data/mpg_ggplot2.csv")

# Draw Stripplot
fig, ax = plt.subplots(figsize=(16, 10), dpi=80)
# Pass x and y by keyword; newer seaborn versions no longer accept them positionally
sns.stripplot(x=df.cty, y=df.hwy, jitter=0.25, size=8, ax=ax, linewidth=.5)

# Decorations
plt.title('Use jittered plots to avoid overlapping of points', fontsize=22)
plt.show()
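To make the idea of jitter concrete, here is a minimal sketch in plain Python. The helper name `jitter`, the sample values and the fixed seed are assumptions for illustration only; seaborn applies its own jitter internally and its implementation may differ.

```python
import random

def jitter(values, width=0.25, seed=0):
    """Return a copy of values with uniform noise in [-width, +width] added."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    return [v + rng.uniform(-width, width) for v in values]

# Hypothetical city-mileage values with exact duplicates
points = [18, 18, 18, 21, 21]
jittered = jitter(points)
```

Each duplicate is nudged by at most `width`, so overlapping points spread out while staying close to their true position.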
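1.5 Counts plot
Another way to avoid overlapping points is to scale the size of each dot by the number of points that fall on that spot. The larger the dot, the more points are concentrated there.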
# Import Data
df = pd.read_csv("data/mpg_ggplot2.csv")
# Count how many rows share each (hwy, cty) pair
df_counts = df.groupby(['hwy', 'cty']).size().reset_index(name='counts')

# Draw Stripplot
fig, ax = plt.subplots(figsize=(16, 10), dpi=80)
sns.stripplot(x=df_counts.cty, y=df_counts.hwy, size=df_counts.counts*2, ax=ax)

# Decorations
plt.title('Counts Plot - Size of circle is bigger as more points overlap', fontsize=22)
plt.show()
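The counting step can be sketched without pandas. The pairs below are made-up values, not rows from mpg_ggplot2.csv, and the size factor of 2 simply mirrors the example above.

```python
from collections import Counter

# Hypothetical (cty, hwy) pairs; several are exact duplicates
pairs = [(18, 29), (18, 29), (21, 29), (18, 29), (21, 29), (16, 25)]

# Number of points that land on each coordinate
counts = Counter(pairs)

# Marker size grows with the number of overlapping points
sizes = {pair: n * 2 for pair, n in counts.items()}
```

This is the same bookkeeping that `groupby(...).size()` performs on the dataframe: one row per distinct coordinate plus a count.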
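1.6 Marginal histogram
A marginal histogram has histograms along the X and Y axis variables. It is used to visualize the relationship between X and Y together with the univariate distributions of X and Y. This plot is often used in exploratory data analysis (EDA).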
# Import Data
df = pd.read_csv("data/mpg_ggplot2.csv")

# Create Fig and gridspec
fig = plt.figure(figsize=(16, 10), dpi=80)
grid = plt.GridSpec(4, 4, hspace=0.5, wspace=0.2)

# Define the axes
ax_main = fig.add_subplot(grid[:-1, :-1])
ax_right = fig.add_subplot(grid[:-1, -1], xticklabels=[], yticklabels=[])
ax_bottom = fig.add_subplot(grid[-1, 0:-1], xticklabels=[], yticklabels=[])

# Scatterplot on main ax
ax_main.scatter('displ', 'hwy', s=df.cty*4,
                c=df.manufacturer.astype('category').cat.codes,
                alpha=.9, data=df, cmap="tab10", edgecolors='gray', linewidths=.5)

# Histogram at the bottom (displ), flipped so it hangs below the main plot
ax_bottom.hist(df.displ, 40, histtype='stepfilled', orientation='vertical', color='deeppink')
ax_bottom.invert_yaxis()

# Histogram on the right (hwy)
ax_right.hist(df.hwy, 40, histtype='stepfilled', orientation='horizontal', color='deeppink')

# Decorations
ax_main.set(title='Scatterplot with Histograms \n displ vs hwy', xlabel='displ', ylabel='hwy')
ax_main.title.set_fontsize(20)
for item in ([ax_main.xaxis.label, ax_main.yaxis.label] + ax_main.get_xticklabels() + ax_main.get_yticklabels()):
    item.set_fontsize(14)

xlabels = ax_main.get_xticks().tolist()
ax_main.set_xticklabels(xlabels)
plt.show()
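1.7 Marginal box plot
A marginal box plot serves a similar purpose to a marginal histogram. The box plots, however, help to pinpoint the median and the 25th and 75th percentiles of X and Y.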
# Import Data
df = pd.read_csv("data/mpg_ggplot2.csv")

# Create Fig and gridspec
fig = plt.figure(figsize=(16, 10), dpi=80)
grid = plt.GridSpec(4, 4, hspace=0.5, wspace=0.2)

# Define the axes
ax_main = fig.add_subplot(grid[:-1, :-1])
ax_right = fig.add_subplot(grid[:-1, -1], xticklabels=[], yticklabels=[])
ax_bottom = fig.add_subplot(grid[-1, 0:-1], xticklabels=[], yticklabels=[])

# Scatterplot on main ax
ax_main.scatter('displ', 'hwy', s=df.cty*5,
                c=df.manufacturer.astype('category').cat.codes,
                alpha=.9, data=df, cmap="Set1", edgecolors='black', linewidths=.5)

# Add a box plot in each margin (vertical for hwy, horizontal for displ)
sns.boxplot(y=df.hwy, ax=ax_right, orient="v")
sns.boxplot(x=df.displ, ax=ax_bottom, orient="h")

# Decorations ------------------
# Remove axis names for the box plots
ax_bottom.set(xlabel='')
ax_right.set(ylabel='')

# Main Title, Xlabel and YLabel
ax_main.set(title='Scatterplot with Boxplots \n displ vs hwy', xlabel='displ', ylabel='hwy')

# Set font size of different components
ax_main.title.set_fontsize(20)
for item in ([ax_main.xaxis.label, ax_main.yaxis.label] + ax_main.get_xticklabels() + ax_main.get_yticklabels()):
    item.set_fontsize(14)

plt.show()
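1.8 Correlogram
A correlogram is used to see, at a glance, the correlation metric between all possible pairs of numeric variables in a given dataframe (or 2D array).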
# Import Dataset
df = pd.read_csv("data/mtcars.csv")

# Correlate the numeric columns only; recent pandas raises on text columns in corr()
corr = df.select_dtypes('number').corr()

# Plot
plt.figure(figsize=(12, 10), dpi=80)
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns,
            cmap='RdYlGn', center=0, annot=True)

# Decorations
plt.title('Correlogram of mtcars', fontsize=22)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
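Each cell of the heatmap is a correlation coefficient; by default pandas' `corr()` computes Pearson's r. As a rough sketch of what that number means, the helper below (the name `pearson` and the sample lists are assumptions for illustration) computes it by hand.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations from the means
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

a = [1, 2, 3, 4]
r_pos = pearson(a, [2, 4, 6, 8])   # perfectly increasing together
r_neg = pearson(a, [8, 6, 4, 2])   # one rises while the other falls
```

Values near +1 show up at the green end of the 'RdYlGn' colormap, values near -1 at the red end, and `center=0` puts uncorrelated pairs in the middle.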
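1.9 Pairwise plot
The pairwise plot is a favorite in exploratory analysis for understanding the relationship between all possible pairs of numeric variables. It is a must-have tool for bivariate analysis.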
# Load Dataset
df = sns.load_dataset('iris')

# Plot
plt.figure(figsize=(10, 8), dpi=80)
sns.pairplot(df, kind="scatter", hue="species",
             plot_kws=dict(s=80, edgecolor="white", linewidth=2.5))
plt.show()

# Load Dataset
df = sns.load_dataset('iris')

# Plot
plt.figure(figsize=(10, 8), dpi=80)
sns.pairplot(df, kind="reg", hue="species")
plt.show()
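2. Deviation
2.1 Diverging bars
If you want to see how items vary on a single metric and visualize the order and size of that variation, diverging bars are a great tool. They help you quickly tell apart the performance of groups in your data, are quite intuitive, and convey the point at once.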
# Prepare Data
df = pd.read_csv("data/mtcars.csv")
x = df['mpg']
# Standardize mpg: 0 is the average car, negative is below average
df['mpg_z'] = (x - x.mean())/x.std()
df['colors'] = ['red' if x < 0 else 'green' for x in df['mpg_z']]
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)

# Draw plot
plt.figure(figsize=(14, 10), dpi=80)
plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z, color=df.colors, alpha=0.4, linewidth=5)

# Decorations
plt.gca().set(ylabel='$Model$', xlabel='$Mileage$')
plt.yticks(df.index, df.cars, fontsize=12)
plt.title('Diverging Bars of Car Mileage', fontdict={'size': 20})
plt.grid(linestyle='--', alpha=0.5)
plt.show()
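The bars diverge around zero because the mileage is standardized first. A minimal sketch of that step in plain Python follows; the helper name `z_scores` and the three sample values are assumptions, not data from mtcars.csv.

```python
import statistics

def z_scores(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std (n - 1), like pandas' default
    return [(v - mean) / std for v in values]

mpg = [10.0, 20.0, 30.0]  # hypothetical mileages
z = z_scores(mpg)
# Below-average items go red, the rest green
colors = ['red' if v < 0 else 'green' for v in z]
```

After standardizing, the sign tells you which side of the average an item sits on and the magnitude tells you how far, in standard deviations.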
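2.2 Diverging texts
Diverging texts are similar to diverging bars. They are the better choice if you want to show the value of each item inside the chart in a neat, presentable way.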
# Prepare Data
df = pd.read_csv("data/mtcars.csv")
x = df['mpg']
df['mpg_z'] = (x - x.mean())/x.std()
df['colors'] = ['red' if x < 0 else 'green' for x in df['mpg_z']]
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)

# Draw plot
plt.figure(figsize=(14, 14), dpi=80)
plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z)
# Write each value at the end of its bar, on the outer side
for x, y, tex in zip(df.mpg_z, df.index, df.mpg_z):
    t = plt.text(x, y, round(tex, 2),
                 horizontalalignment='right' if x < 0 else 'left',
                 verticalalignment='center',
                 fontdict={'color': 'red' if x < 0 else 'green', 'size': 14})

# Decorations
plt.yticks(df.index, df.cars, fontsize=12)
plt.title('Diverging Text Bars of Car Mileage', fontdict={'size': 20})
plt.grid(linestyle='--', alpha=0.5)
plt.xlim(-2.5, 2.5)
plt.show()
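2.3 Diverging dot plot
The diverging dot plot is also similar to diverging bars. Compared with diverging bars, however, the absence of bars reduces the contrast and disparity between the groups.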
# Prepare Data
df = pd.read_csv("data/mtcars.csv")
x = df['mpg']
df['mpg_z'] = (x - x.mean())/x.std()
df['colors'] = ['red' if x < 0 else 'darkgreen' for x in df['mpg_z']]
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)

# Draw plot
plt.figure(figsize=(14, 16), dpi=80)
plt.scatter(df.mpg_z, df.index, s=450, alpha=.6, color=df.colors)
# Print the rounded value inside each dot
for x, y, tex in zip(df.mpg_z, df.index, df.mpg_z):
    t = plt.text(x, y, round(tex, 1), horizontalalignment='center',
                 verticalalignment='center', fontdict={'color': 'white'})

# Decorations
# Lighten borders
plt.gca().spines["top"].set_alpha(.3)
plt.gca().spines["bottom"].set_alpha(.3)
plt.gca().spines["right"].set_alpha(.3)
plt.gca().spines["left"].set_alpha(.3)

plt.yticks(df.index, df.cars)
plt.title('Diverging Dotplot of Car Mileage', fontdict={'size': 20})
plt.xlabel('$Mileage$')
plt.grid(linestyle='--', alpha=0.5)
plt.xlim(-2.5, 2.5)
plt.show()
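2.4 Diverging lollipop chart with markers
A lollipop chart with markers gives you a flexible way to visualize divergence: you put the emphasis on the significant data points you want to draw attention to and annotate the chart with the reasoning.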
# Prepare Data
df = pd.read_csv("data/mtcars.csv")
x = df['mpg']
df['mpg_z'] = (x - x.mean())/x.std()
df['colors'] = 'black'

# color fiat differently
df.loc[df.cars == 'Fiat X1-9', 'colors'] = 'darkorange'
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)

# Draw plot
import matplotlib.patches as patches

plt.figure(figsize=(14, 16), dpi=80)
plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z, color=df.colors, alpha=0.4, linewidth=1)
plt.scatter(df.mpg_z, df.index, color=df.colors,
            s=[600 if x == 'Fiat X1-9' else 300 for x in df.cars], alpha=0.6)
plt.yticks(df.index, df.cars)
plt.xticks(fontsize=12)

# Annotate
plt.annotate('Mercedes Models', xy=(0.0, 11.0), xytext=(1.0, 11), xycoords='data',
             fontsize=15, ha='center', va='center',
             bbox=dict(boxstyle='square', fc='firebrick'),
             arrowprops=dict(arrowstyle='-[, widthB=2.0, lengthB=1.5', lw=2.0, color='steelblue'),
             color='white')

# Add Patches
p1 = patches.Rectangle((-2.0, -1), width=.3, height=3, alpha=.2, facecolor='red')
p2 = patches.Rectangle((1.5, 27), width=.8, height=5, alpha=.2, facecolor='green')
plt.gca().add_patch(p1)
plt.gca().add_patch(p2)

# Decorate
plt.title('Diverging Lollipop of Car Mileage', fontdict={'size': 20})
plt.grid(linestyle='--', alpha=0.5)
plt.show()
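2.5 Area chart
By coloring the area between the axis and the line, an area chart emphasizes not only the peaks and troughs but also how long the highs and lows last. The longer a high lasts, the larger the area under the line.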
import numpy as np
import pandas as pd

# Prepare Data
df = pd.read_csv("data/economics.csv", parse_dates=['date']).head(100)
x = np.arange(df.shape[0])
# Month-over-month percent change of the personal savings rate
y_returns = (df.psavert.diff().fillna(0)/df.psavert.shift(1)).fillna(0) * 100

# Plot
plt.figure(figsize=(16, 10), dpi=80)
plt.fill_between(x[1:], y_returns[1:], 0, where=y_returns[1:] >= 0,
                 facecolor='green', interpolate=True, alpha=0.7)
plt.fill_between(x[1:], y_returns[1:], 0, where=y_returns[1:] <= 0,
                 facecolor='red', interpolate=True, alpha=0.7)

# Annotate
plt.annotate('Peak \n1975', xy=(94.0, 21.0), xytext=(88.0, 28),
             bbox=dict(boxstyle='square', fc='firebrick'),
             arrowprops=dict(facecolor='steelblue', shrink=0.05),
             fontsize=15, color='white')

# Decorations
xtickvals = [str(m)[:3].upper()+"-"+str(y) for y, m in zip(df.date.dt.year, df.date.dt.month_name())]
plt.gca().set_xticks(x[::6])
plt.gca().set_xticklabels(xtickvals[::6], rotation=90,
                          fontdict={'horizontalalignment': 'center', 'verticalalignment': 'center_baseline'})
plt.ylim(-35, 35)
plt.xlim(1, 100)
plt.title("Month Economics Return %", fontsize=22)
plt.ylabel('Monthly returns %')
plt.grid(alpha=0.5)
plt.show()
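The series being filled is a month-over-month percent change. As a small sketch of that calculation without pandas (the helper name `pct_change` and the sample rates are assumptions for illustration):

```python
def pct_change(values):
    """Percent change from the previous element; the first element gets 0."""
    out = [0.0]
    for prev, cur in zip(values, values[1:]):
        out.append((cur - prev) / prev * 100)
    return out

psavert = [10.0, 12.0, 9.0]  # hypothetical savings rates
returns = pct_change(psavert)
# Positive months would be filled green, negative months red
is_gain = [r >= 0 for r in returns]
```

This mirrors the `diff() / shift(1)` expression in the example: the difference to the previous month divided by the previous month's value.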
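3. Ranking
3.1 Ordered bar chart
An ordered bar chart conveys the rank order of the items effectively. Adding the value of the metric above each bar also lets the reader read the exact number from the chart itself.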
# Prepare Data
df_raw = pd.read_csv("data/mpg_ggplot2.csv")
# Mean city mileage per manufacturer
df = df_raw.groupby('manufacturer')[['cty']].mean()
df.sort_values('cty', inplace=True)
df.reset_index(inplace=True)

# Draw plot
import matplotlib.patches as patches

fig, ax = plt.subplots(figsize=(16, 10), facecolor='white', dpi=80)
ax.vlines(x=df.index, ymin=0, ymax=df.cty, color='firebrick', alpha=0.7, linewidth=20)

# Annotate Text
for i, cty in enumerate(df.cty):
    ax.text(i, cty+0.5, round(cty, 1), horizontalalignment='center')

# Title, Label, Ticks and Ylim
ax.set_title('Bar Chart for City Mileage', fontdict={'size': 22})
ax.set(ylabel='Miles Per Gallon', ylim=(0, 30))
plt.xticks(df.index, df.manufacturer.str.upper(), rotation=60, horizontalalignment='right', fontsize=12)

# Add patches to color the X axis labels
p1 = patches.Rectangle((.57, -0.005), width=.33, height=.13, alpha=.1, facecolor='green', transform=fig.transFigure)
p2 = patches.Rectangle((.124, -0.005), width=.446, height=.13, alpha=.1, facecolor='red', transform=fig.transFigure)
fig.add_artist(p1)
fig.add_artist(p2)
plt.show()
3.2 Lollipop chart
A lollipop chart serves the same purpose as an ordered bar chart in a visually pleasing way.
# Prepare Data
df_raw = pd.read_csv("data/mpg_ggplot2.csv")
# Mean city mileage per manufacturer
df = df_raw.groupby('manufacturer')[['cty']].mean()
df.sort_values('cty', inplace=True)
df.reset_index(inplace=True)

# Draw plot
fig, ax = plt.subplots(figsize=(16, 10), dpi=80)
ax.vlines(x=df.index, ymin=0, ymax=df.cty, color='firebrick', alpha=0.7, linewidth=2)
ax.scatter(x=df.index, y=df.cty, s=75, color='firebrick', alpha=0.7)

# Title, Label, Ticks and Ylim
ax.set_title('Lollipop Chart for City Mileage', fontdict={'size': 22})
ax.set_ylabel('Miles Per Gallon')
ax.set_xticks(df.index)
ax.set_xticklabels(df.manufacturer.str.upper(), rotation=60,
                   fontdict={'horizontalalignment': 'right', 'size': 12})
ax.set_ylim(0, 30)

# Annotate
for row in df.itertuples():
    ax.text(row.Index, row.cty+.5, s=round(row.cty, 2),
            horizontalalignment='center', verticalalignment='bottom', fontsize=14)

plt.show()
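3.3 Dot plot
The dot plot conveys the rank order of the items. Because the dots are aligned along the horizontal axis, you can also see more easily how far apart they are.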
# Prepare Data
df_raw = pd.read_csv("data/mpg_ggplot2.csv")
# Mean city mileage per manufacturer
df = df_raw.groupby('manufacturer')[['cty']].mean()
df.sort_values('cty', inplace=True)
df.reset_index(inplace=True)

# Draw plot
fig, ax = plt.subplots(figsize=(16, 10), dpi=80)
ax.hlines(y=df.index, xmin=11, xmax=26, color='gray', alpha=0.7, linewidth=1, linestyles='dashdot')
ax.scatter(y=df.index, x=df.cty, s=75, color='firebrick', alpha=0.7)

# Title, Label, Ticks and Xlim
ax.set_title('Dot Plot for City Mileage', fontdict={'size': 22})
ax.set_xlabel('Miles Per Gallon')
ax.set_yticks(df.index)
ax.set_yticklabels(df.manufacturer.str.title(), fontdict={'horizontalalignment': 'right'})
ax.set_xlim(10, 27)
plt.show()
3.4 Slope chart
A slope chart is best suited for comparing the "before" and "after" positions of a given person or item.
import matplotlib.lines as mlines

# Import Data
df = pd.read_csv("data/gdppercap.csv")

left_label = [str(c) + ', ' + str(round(y)) for c, y in zip(df.continent, df['1952'])]
right_label = [str(c) + ', ' + str(round(y)) for c, y in zip(df.continent, df['1957'])]
klass = ['red' if (y1-y2) < 0 else 'green' for y1, y2 in zip(df['1952'], df['1957'])]

# draw line
# https://stackoverflow.com/questions/36470343/how-to-draw-a-line-with-matplotlib/36479941
def newline(p1, p2, color='black'):
    ax = plt.gca()
    # Red when the value falls from p1 to p2, green otherwise
    l = mlines.Line2D([p1[0], p2[0]], [p1[1], p2[1]],
                      color='red' if p1[1]-p2[1] > 0 else 'green',
                      marker='o', markersize=6)
    ax.add_line(l)
    return l

fig, ax = plt.subplots(1, 1, figsize=(14, 14), dpi=80)

# Vertical Lines
ax.vlines(x=1, ymin=500, ymax=13000, color='black', alpha=0.7, linewidth=1, linestyles='dotted')
ax.vlines(x=3, ymin=500, ymax=13000, color='black', alpha=0.7, linewidth=1, linestyles='dotted')

# Points
ax.scatter(y=df['1952'], x=np.repeat(1, df.shape[0]), s=10, color='black', alpha=0.7)
ax.scatter(y=df['1957'], x=np.repeat(3, df.shape[0]), s=10, color='black', alpha=0.7)

# Line Segments and Annotation
for p1, p2, c in zip(df['1952'], df['1957'], df['continent']):
    newline([1, p1], [3, p2])
    ax.text(1-0.05, p1, c + ', ' + str(round(p1)), horizontalalignment='right',
            verticalalignment='center', fontdict={'size': 14})
    ax.text(3+0.05, p2, c + ', ' + str(round(p2)), horizontalalignment='left',
            verticalalignment='center', fontdict={'size': 14})

# 'Before' and 'After' Annotations
ax.text(1-0.05, 13000, 'BEFORE', horizontalalignment='right',
        verticalalignment='center', fontdict={'size': 18, 'weight': 700})
ax.text(3+0.05, 13000, 'AFTER', horizontalalignment='left',
        verticalalignment='center', fontdict={'size': 18, 'weight': 700})

# Decoration
ax.set_title("Slopechart: Comparing GDP Per Capita between 1952 vs 1957", fontdict={'size': 22})
ax.set(xlim=(0, 4), ylim=(0, 14000), ylabel='Mean GDP Per Capita')
ax.set_xticks([1, 3])
ax.set_xticklabels(["1952", "1957"])
plt.yticks(np.arange(500, 13000, 2000), fontsize=12)

# Lighten borders
plt.gca().spines["top"].set_alpha(.0)
plt.gca().spines["bottom"].set_alpha(.0)
plt.gca().spines["right"].set_alpha(.0)
plt.gca().spines["left"].set_alpha(.0)
plt.show()
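3.5 Dumbbell plot
A dumbbell plot conveys the "before" and "after" positions of the items together with their rank order. It is very useful if you want to visualize the effect of a particular project or initiative on different objects.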
import matplotlib.lines as mlines

# Import Data
df = pd.read_csv("data/health.csv")
df.sort_values('pct_2014', inplace=True)
df.reset_index(inplace=True)

# Func to draw line segment
def newline(p1, p2, color='black'):
    ax = plt.gca()
    l = mlines.Line2D([p1[0], p2[0]], [p1[1], p2[1]], color='skyblue')
    ax.add_line(l)
    return l

# Figure and Axes
fig, ax = plt.subplots(1, 1, figsize=(14, 14), facecolor='#f7f7f7', dpi=80)

# Vertical Lines
ax.vlines(x=.05, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')
ax.vlines(x=.10, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')
ax.vlines(x=.15, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')
ax.vlines(x=.20, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')

# Points
ax.scatter(y=df['index'], x=df['pct_2013'], s=50, color='#0e668b', alpha=0.7)
ax.scatter(y=df['index'], x=df['pct_2014'], s=50, color='#a3c4dc', alpha=0.7)

# Line Segments
for i, p1, p2 in zip(df['index'], df['pct_2013'], df['pct_2014']):
    newline([p1, i], [p2, i])

# Decoration
ax.set_facecolor('#f7f7f7')
ax.set_title("Dumbell Chart: Pct Change - 2013 vs 2014", fontdict={'size': 22})
ax.set(xlim=(0, .25), ylim=(-1, 27), ylabel='Mean GDP Per Capita')
ax.set_xticks([.05, .1, .15, .20])
# Labels match the tick positions set above
ax.set_xticklabels(['5%', '10%', '15%', '20%'])
plt.show()
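4. Distribution
4.1 Histogram for a continuous variable
A histogram shows the frequency distribution of a given variable. The plot below groups the frequency bars by a categorical variable, which gives more insight into the continuous variable and the categorical variable in tandem.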
# Import Data
df = pd.read_csv("data/mpg_ggplot2.csv")

# Prepare data
x_var = 'displ'
groupby_var = 'class'
df_agg = df.loc[:, [x_var, groupby_var]].groupby(groupby_var)
# One list of x values per group
vals = [group[x_var].values.tolist() for _, group in df_agg]

# Draw
plt.figure(figsize=(16, 9), dpi=80)
colors = [plt.cm.Spectral(i/float(len(vals)-1)) for i in range(len(vals))]
n, bins, patches = plt.hist(vals, 30, stacked=True, density=False, color=colors[:len(vals)])

# Decoration
plt.legend(np.unique(df[groupby_var]).tolist())
plt.title(f"Stacked Histogram of ${x_var}$ colored by ${groupby_var}$", fontsize=22)
plt.xlabel(x_var)
plt.ylabel("Frequency")
plt.ylim(0, 25)
plt.xticks(ticks=bins[::3], labels=[round(b, 1) for b in bins[::3]])
plt.show()
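4.2 Histogram for a categorical variable
The histogram of a categorical variable shows the frequency distribution of that variable. By coloring the bars, you can relate the distribution to another categorical variable represented by the colors.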
# Import Data
df = pd.read_csv("data/mpg_ggplot2.csv")

# Prepare data
x_var = 'manufacturer'
groupby_var = 'class'
df_agg = df.loc[:, [x_var, groupby_var]].groupby(groupby_var)
# One list of x values per group
vals = [group[x_var].values.tolist() for _, group in df_agg]

# Draw
plt.figure(figsize=(16, 9), dpi=80)
colors = [plt.cm.Spectral(i/float(len(vals)-1)) for i in range(len(vals))]
n, bins, patches = plt.hist(vals, len(df[x_var].unique()), stacked=True, density=False, color=colors[:len(vals)])

# Decoration
plt.legend(np.unique(df[groupby_var]).tolist())
plt.title(f"Stacked Histogram of ${x_var}$ colored by ${groupby_var}$", fontsize=22)
plt.xlabel(x_var)
plt.ylabel("Frequency")
plt.ylim(0, 40)
# bins has one more entry than there are labels; newer Matplotlib requires equal lengths (use bins[:-1] there)
plt.xticks(ticks=bins, labels=np.unique(df[x_var]).tolist(), rotation=90, horizontalalignment='left')
plt.show()
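4.3 Density plot
Density plots are a commonly used tool for visualizing the distribution of a continuous variable. By grouping them by the "response" variable, you can inspect the relationship between X and Y. The case below is for illustration only and shows how the distribution of city mileage varies with the number of cylinders.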
# Import Data
df = pd.read_csv("data/mpg_ggplot2.csv")

# Draw Plot
plt.figure(figsize=(16, 10), dpi=80)
# shade=True is the seaborn 0.9 spelling; seaborn 0.11+ prefers fill=True
sns.kdeplot(df.loc[df['cyl'] == 4, "cty"], shade=True, color="g", label="Cyl=4", alpha=.7)
sns.kdeplot(df.loc[df['cyl'] == 5, "cty"], shade=True, color="deeppink", label="Cyl=5", alpha=.7)
sns.kdeplot(df.loc[df['cyl'] == 6, "cty"], shade=True, color="dodgerblue", label="Cyl=6", alpha=.7)
sns.kdeplot(df.loc[df['cyl'] == 8, "cty"], shade=True, color="orange", label="Cyl=8", alpha=.7)

# Decoration
plt.title('Density Plot of City Mileage by n_Cylinders', fontsize=22)
plt.legend()
plt.show()
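4.4 Density curves with histogram
A density curve with a histogram brings together the information conveyed by the two plots, so you get both in a single figure instead of two.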
# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Draw Plot
plt.figure(figsize=(13, 10), dpi=80)
# distplot is the seaborn 0.9 API; it is deprecated since seaborn 0.11
sns.distplot(df.loc[df['class'] == 'compact', "cty"], color="dodgerblue", label="Compact",
             hist_kws={'alpha': .7}, kde_kws={'linewidth': 3})
sns.distplot(df.loc[df['class'] == 'suv', "cty"], color="orange", label="SUV",
             hist_kws={'alpha': .7}, kde_kws={'linewidth': 3})
sns.distplot(df.loc[df['class'] == 'minivan', "cty"], color="g", label="minivan",
             hist_kws={'alpha': .7}, kde_kws={'linewidth': 3})
plt.ylim(0, 0.35)

# Decoration
plt.title('Density Plot of City Mileage by Vehicle Type', fontsize=22)
plt.legend()
plt.show()
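4.5 Joy Plot
A Joy Plot lets the density curves of different groups overlap, which is a great way to visualize how a large number of groups are distributed relative to each other. It is pleasing to the eye and conveys just the right information clearly.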
# !pip install joypy
import joypy

# Import Data
mpg = pd.read_csv("data/mpg_ggplot2.csv")

# Draw Plot
plt.figure(figsize=(16, 10), dpi=80)
fig, axes = joypy.joyplot(mpg, column=['hwy', 'cty'], by="class", ylim='own', figsize=(14, 10))

# Decoration
plt.title('Joy Plot of City and Highway Mileage by Class', fontsize=22)
plt.show()
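4.6 Distributed dot plot
A distributed dot plot shows the univariate distribution of points segmented by group. The darker the points, the higher the concentration of data points in that region. Coloring the median differently makes the real position of each group apparent at once.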
import matplotlib.patches as mpatches

# Prepare Data
df_raw = pd.read_csv("data/mpg_ggplot2.csv")
cyl_colors = {4: 'tab:red', 5: 'tab:green', 6: 'tab:blue', 8: 'tab:orange'}
df_raw['cyl_color'] = df_raw.cyl.map(cyl_colors)

# Mean and Median city mileage by make
df = df_raw.groupby('manufacturer')[['cty']].mean()
df.sort_values('cty', ascending=False, inplace=True)
df.reset_index(inplace=True)
df_median = df_raw.groupby('manufacturer')[['cty']].median()

# Draw horizontal lines
fig, ax = plt.subplots(figsize=(16, 10), dpi=80)
ax.hlines(y=df.index, xmin=0, xmax=40, color='gray', alpha=0.5, linewidth=.5, linestyles='dashdot')

# Draw the Dots
for i, make in enumerate(df.manufacturer):
    df_make = df_raw.loc[df_raw.manufacturer == make, :]
    # All observations of this make, semi-transparent so dense areas look darker
    ax.scatter(y=np.repeat(i, df_make.shape[0]), x='cty', data=df_make,
               s=75, edgecolors='gray', c='w', alpha=0.5)
    # Median of this make in red
    ax.scatter(y=i, x='cty', data=df_median.loc[df_median.index == make, :], s=75, c='firebrick')

# Annotate
ax.text(33, 13, r"$red \; dots \; are \; the \: median$", fontdict={'size': 12}, color='firebrick')

# Decorations
red_patch = plt.plot([], [], marker="o", ms=10, ls="", mec=None, color='firebrick', label="Median")
plt.legend(handles=red_patch)
ax.set_title('Distribution of City Mileage by Make', fontdict={'size': 22})
ax.set_xlabel('Miles Per Gallon (City)', alpha=0.7)
ax.set_yticks(df.index)
ax.set_yticklabels(df.manufacturer.str.title(), fontdict={'horizontalalignment': 'right'}, alpha=0.7)
ax.set_xlim(1, 40)
plt.xticks(alpha=0.7)
plt.gca().spines["top"].set_visible(False)
plt.gca().spines["bottom"].set_visible(False)
plt.gca().spines["right"].set_visible(False)
plt.gca().spines["left"].set_visible(False)
plt.grid(axis='both', alpha=.4, linewidth=.1)
plt.show()
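4.7 Box plot
Box plots are a great way to visualize a distribution while keeping the median, the 25th and 75th percentiles, and the outliers in view. Be careful when interpreting the size of the boxes, though: it can give a distorted impression of how many points the group contains. Manually printing the number of observations in each box helps overcome this drawback.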
# Import Data
df = pd.read_csv("data/mpg_ggplot2.csv")

# Draw Plot
plt.figure(figsize=(13, 10), dpi=80)
sns.boxplot(x='class', y='hwy', data=df, notch=False)

# Add N Obs inside boxplot (optional)
def add_n_obs(df, group_col, y):
    medians_dict = {grp[0]: grp[1][y].median() for grp in df.groupby(group_col)}
    xticklabels = [x.get_text() for x in plt.gca().get_xticklabels()]
    # Look counts up by label so they follow the order of the boxes on the axis
    n_obs = df.groupby(group_col)[y].size()
    for x, xticklabel in enumerate(xticklabels):
        plt.text(x, medians_dict[xticklabel]*1.01, "#obs : "+str(n_obs[xticklabel]),
                 horizontalalignment='center', fontdict={'size': 14}, color='white')

add_n_obs(df, group_col='class', y='hwy')

# Decoration
plt.title('Box Plot of Highway Mileage by Vehicle Class', fontsize=22)
plt.ylim(10, 40)
plt.show()
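For readers who want to see the numbers a box encodes, here is a rough sketch using only the standard library. The sample values are made up, and the 'inclusive' interpolation is chosen as an assumption; plotting libraries may use a slightly different quartile convention.

```python
import statistics

hwy = [20, 22, 24, 26, 28, 30, 32]  # hypothetical highway mileages

# Lower quartile, median and upper quartile: the three lines of the box
q1, median, q3 = statistics.quantiles(hwy, n=4, method='inclusive')

# Number of observations, as printed inside each box in the example
n_obs = len(hwy)
label = "#obs : " + str(n_obs)
```

The box spans q1 to q3 with a line at the median; the observation count is the piece the box itself cannot show, which is why the example adds it as text.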
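4.8 Dot + box plot
A dot + box plot conveys similar information to a box plot split into groups. In addition, the dots give a sense of how many data points lie within each group.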
# Import Data
df = pd.read_csv("data/mpg_ggplot2.csv")

# Draw Plot
plt.figure(figsize=(13, 10), dpi=80)
sns.boxplot(x='class', y='hwy', data=df, hue='cyl')
sns.stripplot(x='class', y='hwy', data=df, color='black', size=3, jitter=1)

# Separate the vehicle classes with faint vertical lines
for i in range(len(df['class'].unique())-1):
    plt.vlines(i+.5, 10, 45, linestyles='solid', colors='gray', alpha=0.2)

# Decoration
plt.title('Box Plot of Highway Mileage by Vehicle Class', fontsize=22)
plt.legend(title='Cylinders')
plt.show()
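4.9 Violin plot
A violin plot is a visually pleasing alternative to the box plot. The shape or area of the violin depends on the number of observations it holds. Violin plots can be harder to read, however, and are not commonly used in professional settings.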
# Import Data
df = pd.read_csv("data/mpg_ggplot2.csv")

# Draw Plot
plt.figure(figsize=(13, 10), dpi=80)
sns.violinplot(x='class', y='hwy', data=df, scale='width', inner='quartile')

# Decoration
plt.title('Violin Plot of Highway Mileage by Vehicle Class', fontsize=22)
plt.show()
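4.10 Population pyramid
A population pyramid can be used to show the distribution of groups ordered by volume. It can also show the stage-by-stage filtering of a population, as it does below for the number of people who pass through each stage of a marketing funnel.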
# Read data
df = pd.read_csv("data/email_campaign_funnel.csv")

# Draw Plot
plt.figure(figsize=(13, 10), dpi=80)
group_col = 'Gender'
order_of_bars = df.Stage.unique()[::-1]
colors = [plt.cm.Spectral(i/float(len(df[group_col].unique())-1)) for i in range(len(df[group_col].unique()))]

# One horizontal bar series per group
for c, group in zip(colors, df[group_col].unique()):
    sns.barplot(x='Users', y='Stage', data=df.loc[df[group_col] == group, :],
                order=order_of_bars, color=c, label=group)

# Decorations
plt.xlabel("$Users$")
plt.ylabel("Stage of Purchase")
plt.yticks(fontsize=12)
plt.title("Population Pyramid of the Marketing Funnel", fontsize=22)
plt.legend()
plt.show()
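4.11 Categorical plots
The categorical plots provided by the seaborn library can be used to visualize the count distribution of two or more categorical variables in relation to each other.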
# Load Dataset
titanic = sns.load_dataset("titanic")

# Plot
g = sns.catplot(x="alive", col="deck", col_wrap=4,
                data=titanic[titanic.deck.notnull()],
                kind="count", height=3.5, aspect=.8, palette='tab20')

# catplot returns a FacetGrid; its figure is available as g.fig
g.fig.suptitle('sf')
plt.show()
# Load Dataset
titanic = sns.load_dataset("titanic")

# Plot
sns.catplot(x="age", y="embark_town", hue="sex", col="class",
            data=titanic[titanic.embark_town.notnull()],
            orient="h", height=5, aspect=1, palette="tab10",
            kind="violin", dodge=True, cut=0, bw=.2)