seaborn from Beginner to Expert 03 - Plotting Functions 03 - Distributional Plots



Relational - Distributional - Categorical


relational plots

distributional plots

categorical plots


Distribution plots - Visualizing distributions of data

An early step in any effort to analyze or model data should be to understand how the variables are distributed. Techniques for distribution visualization can provide quick answers to many important questions. What range do the observations cover? What is their central tendency? Are they heavily skewed in one direction? Is there evidence for bimodality? Are there significant outliers? Do the answers to these questions vary across subsets defined by other variables?


The distributions module contains several functions designed to answer questions such as these. The axes-level functions are histplot(), kdeplot(), ecdfplot(), and rugplot(). They are grouped together within the figure-level displot(), jointplot(), and pairplot() functions…


There are several different approaches to visualizing a distribution, and each has its relative advantages and drawbacks. It is important to understand these factors so that you can choose the best approach for your particular aim.



Figure-level interface: displot / jointplot / pairplot

Reference

displot

jointplot

pairplot


Axes-level interface: histplot / kdeplot / ecdfplot / rugplot

histplot

kdeplot

ecdfplot

rugplot


Importing libraries and inspecting the tips and diamonds data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as  mpl
import seaborn as sns
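sns.set_theme(style="darkgrid")
# Use a font that contains CJK glyphs so Chinese labels render, and keep the minus sign displayable.
mpl.rcParams['font.sans-serif']=['SimHei']
mpl.rcParams['axes.unicode_minus']=False
# Load the example dataset and cache it in a local directory.
tips = sns.load_dataset("tips",cache=True,data_home=r"./seaborn-data")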
tips.head()



diamonds = sns.load_dataset("diamonds",cache=True,data_home=r"./seaborn-data")
print(diamonds.head())


titanic = sns.load_dataset("titanic",cache=True,data_home=r"./seaborn-data")
print(titanic.info())
print(titanic.head())

Output:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   survived     891 non-null    int64
 1   pclass       891 non-null    int64
 2   sex          891 non-null    object
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64
 5   parch        891 non-null    int64
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object
 8   class        891 non-null    category
 9   who          891 non-null    object
 10  adult_male   891 non-null    bool
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object
 13  alive        891 non-null    object
 14  alone        891 non-null    bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
None
   survived  pclass     sex   age  sibsp  parch  ...    who adult_male deck  embark_town  alive  alone   
0         0       3    male  22.0      1      0  ...    man       True  NaN  Southampton     no  False   
1         1       1  female  38.0      1      0  ...  woman      False    C    Cherbourg    yes  False   
2         1       3  female  26.0      0      0  ...  woman      False  NaN  Southampton    yes   True   
3         1       1  female  35.0      1      0  ...  woman      False    C  Southampton    yes  False   
4         0       3    male  35.0      0      0  ...    man       True  NaN  Southampton     no   True  

penguins = sns.load_dataset("penguins",cache=True,data_home=r"./seaborn-data")
print(penguins.head())



Histograms: histplot

Case 1 - Univariate histogram with histplot

Perhaps the most common approach to visualizing a distribution is the histogram. This is the default approach in displot(), which uses the same underlying code as histplot(). A histogram is a bar plot where the axis representing the data variable is divided into a set of discrete bins and the count of observations falling within each bin is shown using the height of the corresponding bar:



sns.displot(penguins, x="flipper_length_mm")



This plot immediately affords a few insights about the flipper_length_mm variable. For instance, we can see that the most common flipper length is about 195 mm, but the distribution appears bimodal, so this one number does not represent the data well.



Case 2 - histplot: setting the number, size, and width of bins

The size of the bins is an important parameter, and using the wrong bin size can mislead by obscuring important features of the data or by creating apparent features out of random variability. By default, displot()/histplot() choose a default bin size based on the variance of the data and the number of observations. But you should not be over-reliant on such automatic approaches, because they depend on particular assumptions about the structure of your data. It is always advisable to check that your impressions of the distribution are consistent across different bin sizes. To choose the size directly, set the binwidth parameter:



sns.displot(penguins, x="flipper_length_mm", binwidth=3)
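To make the effect of the bin width concrete, the sketch below bins a handful of made-up flipper lengths in plain Python. The data values and the helper name histogram_counts are illustrative assumptions, and seaborn's own bin-edge logic may place the edges slightly differently.

```python
import math

def histogram_counts(values, binwidth):
    """Count observations in equal-width bins starting at min(values).

    Hypothetical helper for illustration; not seaborn's implementation.
    """
    start = min(values)
    n_bins = max(1, math.ceil((max(values) - start) / binwidth))
    counts = [0] * n_bins
    for v in values:
        # The last bin is closed on the right so the maximum is included.
        idx = min(int((v - start) // binwidth), n_bins - 1)
        counts[idx] += 1
    edges = [start + i * binwidth for i in range(n_bins + 1)]
    return edges, counts

# Made-up flipper lengths (mm) forming two clusters.
flippers = [181, 186, 190, 191, 193, 195, 196, 198, 210, 213, 215, 217, 220, 222]

narrow_edges, narrow_counts = histogram_counts(flippers, binwidth=3)
wide_edges, wide_counts = histogram_counts(flippers, binwidth=20)

print(narrow_counts)  # many small bins: the gap between the clusters shows up as empty bins
print(wide_counts)    # few wide bins: the gap is no longer visible
```

The same observations give two quite different pictures, which is why it helps to compare several bin sizes before drawing conclusions.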


In other circumstances, it may make more sense to specify the number of bins, rather than their size:

sns.displot(penguins, x="flipper_length_mm", bins=20)


One example of a situation where defaults fail is when the variable takes a relatively small number of integer values. In that case, the default bin width may be too small, creating awkward gaps in the distribution:


sns.displot(tips, x="size")



One approach would be to specify the precise bin breaks by passing an array to bins:

sns.displot(tips, x="size", bins=[1, 2, 3, 4, 5, 6, 7])


This can also be accomplished by setting discrete=True, which chooses bin breaks that represent the unique values in a dataset with bars that are centered on their corresponding value.



sns.displot(tips, x="size", discrete=True)
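The gap problem and the discrete fix can be reproduced by hand. The sketch below uses made-up party sizes and an arbitrary choice of ten equal-width bins; it illustrates the idea only and is not seaborn's internal code.

```python
from collections import Counter

# Made-up party sizes standing in for the integer-valued "size" column.
sizes = [1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 6]

# Ten equal-width bins over [1, 6] are each 0.5 wide, so some of them
# cannot contain any integer and stay empty.
n_bins = 10
lo, hi = min(sizes), max(sizes)
width = (hi - lo) / n_bins
counts = [0] * n_bins
for s in sizes:
    idx = min(int((s - lo) / width), n_bins - 1)
    counts[idx] += 1
print(counts)

# Discrete binning: one unit-wide bin per integer, centered on the value,
# which corresponds to edges at 0.5, 1.5, ..., 6.5.
discrete_edges = [v - 0.5 for v in range(lo, hi + 2)]
discrete_counts = Counter(sizes)
print(discrete_edges)
print(sorted(discrete_counts.items()))
```

With unit-wide, centered bins every bar corresponds to exactly one observed value, so no artificial gaps appear.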



It’s also possible to visualize the distribution of a categorical variable using the logic of a histogram. Discrete bins are automatically set for categorical variables, but it may also be helpful to “shrink” the bars slightly to emphasize the categorical nature of the axis:


sns.displot(tips, x="day", shrink=.8)



Case 3 - histplot: conditioning on other variables

Once you understand the distribution of a variable, the next step is often to ask whether features of that distribution differ across other variables in the dataset. For example, what accounts for the bimodal distribution of flipper lengths that we saw above? displot() and histplot() provide support for conditional subsetting via the hue semantic. Assigning a variable to hue will draw a separate histogram for each of its unique values and distinguish them by color:


sns.displot(penguins, x="flipper_length_mm", hue="species")



Case 4 - histplot: converting the histogram to a step plot

By default, the different histograms are “layered” on top of each other and, in some cases, they may be difficult to distinguish. One option is to change the visual representation of the histogram from a bar plot to a “step” plot:



# sns.displot(penguins, x="flipper_length_mm", hue="species")
sns.displot(penguins, x="flipper_length_mm", hue="species", element="step")



Case 5 - histplot: stacked histogram (stack)

sns.displot(penguins, x="flipper_length_mm", hue="species", multiple="stack")



The stacked histogram emphasizes the part-whole relationship between the variables, but it can obscure other features (for example, it is difficult to determine the mode of the Adelie distribution). Another option is to “dodge” the bars, which moves them horizontally and reduces their width. This ensures that there are no overlaps and that the bars remain comparable in terms of height. But it only works well when the categorical variable has a small number of levels:


sns.displot(penguins, x="flipper_length_mm", hue="sex", multiple="dodge")



Because displot() is a figure-level function and is drawn onto a FacetGrid, it is also possible to draw each individual distribution in a separate subplot by assigning the second variable to col or row rather than (or in addition to) hue. This represents the distribution of each subset well, but it makes it more difficult to draw direct comparisons:



sns.displot(penguins, x="flipper_length_mm", col="sex")



None of these approaches are perfect, and we will soon see some alternatives to a histogram that are better-suited to the task of comparison.



Case 6 - Histogram: normalized histogram statistics

Before we do, another point to note is that, when the subsets have unequal numbers of observations, comparing their distributions in terms of counts may not be ideal. One solution is to normalize the counts using the stat parameter:


By default, however, the normalization is applied to the entire distribution, so this simply rescales the height of the bars. By setting common_norm=False, each subset will be normalized independently:



sns.displot(penguins, x="flipper_length_mm", hue="species",)
# sns.displot(penguins, x="flipper_length_mm", hue="species", stat="density")
sns.displot(penguins, x="flipper_length_mm", hue="species", stat="density", common_norm=False)
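The difference between common and independent normalization comes down to which sample size is used as the divisor. The following sketch uses two made-up groups of unequal size and a hypothetical bin_counts helper; it shows the arithmetic only and is not how seaborn is implemented.

```python
# Made-up measurements for two groups of unequal size.
groups = {
    "A": [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 1.2, 2.2],  # 8 observations
    "B": [3.0, 3.5],                                  # 2 observations
}
edges = [1.0, 2.0, 3.0, 4.0]  # shared bins for both groups
binwidth = 1.0

def bin_counts(values):
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = min(int((v - edges[0]) // binwidth), len(counts) - 1)
        counts[idx] += 1
    return counts

total_n = sum(len(values) for values in groups.values())
areas = {}
for name, values in groups.items():
    counts = bin_counts(values)
    # Common normalization: divide by the size of the whole dataset.
    common = [c / (total_n * binwidth) for c in counts]
    # Independent normalization: divide by the size of this subset only.
    independent = [c / (len(values) * binwidth) for c in counts]
    areas[name] = (sum(h * binwidth for h in common),
                   sum(h * binwidth for h in independent))
    print(name, areas[name])
```

Under common normalization each group's area equals its share of the data, so the small group stays small; under independent normalization every group has area 1 and the shapes become directly comparable.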



Density normalization scales the bars so that their areas sum to 1. As a result, the density axis is not directly interpretable. Another option is to normalize the bars so that their heights sum to 1. This makes most sense when the variable is discrete, but it is an option for all histograms:


sns.displot(penguins, x="flipper_length_mm", hue="species", stat="density")
sns.displot(penguins, x="flipper_length_mm", hue="species", stat="probability")
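As a small worked example of the two normalizations, the sketch below starts from made-up bin counts and an assumed bin width of 5 units. Density divides each count by the sample size times the bin width, while probability divides by the sample size only.

```python
counts = [2, 5, 8, 4, 1]   # made-up bin counts
binwidth = 5.0             # assumed: every bin is 5 units wide
n = sum(counts)

# stat="density": bar areas (height * width) sum to 1.
density = [c / (n * binwidth) for c in counts]
# stat="probability": bar heights sum to 1.
probability = [c / n for c in counts]

total_area = sum(h * binwidth for h in density)
total_height = sum(probability)
print(density)
print(probability)
print(total_area, total_height)
```

The density heights depend on the units of the variable (they shrink as the bins get wider), which is why the density axis is hard to read directly, whereas the probability heights are simply the fraction of observations in each bin.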



Kernel density estimation plots

A histogram aims to approximate the underlying probability density function that generated the data by binning and counting observations. Kernel density estimation (KDE) presents a different solution to the same problem. Rather than using discrete bins, a KDE plot smooths the observations with a Gaussian kernel, producing a continuous density estimate:



Case 1 - Kernel density estimate plot

sns.displot(penguins, x="flipper_length_mm", kind="kde")
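What "smoothing with a Gaussian kernel" means can be written out in a few lines: the estimate at a point is the average of Gaussian bumps centered on the observations. The sketch below is a minimal plain-Python version with made-up data and an arbitrarily chosen bandwidth of 4; the function name gaussian_kde here is a local helper, not the library routine seaborn relies on.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density as an average of Gaussian bumps."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density

# Made-up flipper lengths (mm) forming two clusters.
data = [181, 186, 190, 191, 193, 195, 196, 198, 210, 213, 215, 217, 220, 222]
f = gaussian_kde(data, bandwidth=4.0)

# Riemann sum over a wide grid: the estimate should integrate to about 1.
step = 0.1
grid = [150 + i * step for i in range(1001)]  # 150 .. 250
area = sum(f(x) for x in grid) * step
print(round(area, 3))

# The estimate is high inside a cluster and low in the gap between clusters.
print(f(193) > f(204))
```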




Case 2 - KDE plot: choosing the smoothing bandwidth

Much like with the bin size in the histogram, the ability of the KDE to accurately represent the data depends on the choice of smoothing bandwidth. An over-smoothed estimate might erase meaningful features, but an under-smoothed estimate can obscure the true shape within random noise. The easiest way to check the robustness of the estimate is to adjust the default bandwidth:



If the curve is still not smooth enough, increase bw_adjust, which multiplies the default bandwidth by a scaling factor.


sns.displot(penguins, x="flipper_length_mm", kind="kde", bw_adjust=.25)
sns.displot(penguins, x="flipper_length_mm", kind="kde", bw_adjust=1.0)
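The sketch below illustrates how a multiplicative bandwidth factor changes the estimate. It assumes the reference bandwidth follows Scott's rule of thumb for one-dimensional data (sample standard deviation times n to the power -1/5), uses made-up data, and counts the local maxima of a hand-written Gaussian KDE on a grid. The helper names are hypothetical.

```python
import math
import statistics

# Made-up flipper lengths (mm) forming two clusters.
data = [181, 186, 190, 191, 193, 195, 196, 198, 210, 213, 215, 217, 220, 222]

# Assumed reference bandwidth: Scott's rule of thumb for 1D data.
base_bw = statistics.stdev(data) * len(data) ** (-1 / 5)

def kde_on_grid(bandwidth, grid):
    return [
        sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
        / (len(data) * bandwidth * math.sqrt(2 * math.pi))
        for x in grid
    ]

def count_modes(curve):
    # A mode is a grid point strictly higher than both of its neighbours.
    return sum(
        1 for i in range(1, len(curve) - 1)
        if curve[i - 1] < curve[i] > curve[i + 1]
    )

grid = [170 + 0.5 * i for i in range(131)]  # 170 .. 235
modes = {}
for bw_adjust in (0.25, 1.0, 2.0):
    bandwidth = base_bw * bw_adjust
    modes[bw_adjust] = count_modes(kde_on_grid(bandwidth, grid))
    print(f"bw_adjust={bw_adjust}: bandwidth={bandwidth:.2f}, modes={modes[bw_adjust]}")
```

A small factor yields a bumpy curve with many local peaks, while a large factor merges them; checking a few values in both directions is a quick robustness test.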



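Case 3 - KDE plot: the hue and fill parameters

As with histograms, if you assign a hue variable, a separate density estimate is computed for each level of that variable: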


sns.displot(penguins, x="flipper_length_mm", hue="species", kind="kde")
sns.displot(penguins, x="flipper_length_mm", hue="species", kind="kde", fill=True)



Case 4 - KDE plot: kernel density estimation pitfalls

KDE plots have many advantages. Important features of the data are easy to discern (central tendency, bimodality, skew), and they afford easy comparisons between subsets. But there are also situations where KDE poorly represents the underlying data. This is because the logic of KDE assumes that the underlying distribution is smooth and unbounded. One way this assumption can fail is when a variable reflects a quantity that is naturally bounded. If there are observations lying close to the bound (for example, small values of a variable that cannot be negative), the KDE curve may extend to unrealistic values:



sns.displot(tips, x="total_bill", kind="hist")
sns.displot(tips, x="total_bill", kind="kde")



This can be partially avoided with the cut parameter, which specifies how far the curve should extend beyond the extreme datapoints. But this influences only where the curve is drawn; the density estimate will still smooth over the range where no data can exist, causing it to be artificially low at the extremes of the distribution:



sns.displot(tips, x="total_bill", kind="kde")
sns.displot(tips, x="total_bill", kind="kde", cut=0)
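The following sketch illustrates why cut changes only the drawn range. It uses made-up non-negative values and an assumed bandwidth of 2, and computes both the range over which the curve would be evaluated and the share of the Gaussian KDE's mass that falls below zero. The helper names are hypothetical.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Made-up non-negative bills, several of them close to zero.
bills = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, 9.0, 12.0, 20.0]
bandwidth = 2.0  # assumed kernel bandwidth

def support(cut):
    # The curve is evaluated from `cut` bandwidths below the minimum
    # to `cut` bandwidths above the maximum.
    return min(bills) - cut * bandwidth, max(bills) + cut * bandwidth

# Share of the estimated probability mass that lies below zero. It depends
# only on the data and the bandwidth, not on where the curve is drawn.
mass_below_zero = sum(normal_cdf((0.0 - b) / bandwidth) for b in bills) / len(bills)

print(support(cut=3))  # extends into negative values
print(support(cut=0))  # stops at the data limits
print(round(mass_below_zero, 3))
```

Setting cut to 0 hides the negative part of the curve, but the mass that leaked below zero is still missing from the visible range, which is why the estimate looks too low near the boundary.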



The KDE approach also fails for discrete data or when data are naturally continuous but specific values are over-represented. The important thing to keep in mind is that the KDE will always show you a smooth curve, even when the data themselves are not smooth. For example, consider this distribution of diamond weights:


While the KDE suggests that there are peaks around specific values, the histogram reveals a much more jagged distribution:



sns.displot(diamonds, x="carat", kind="kde")
sns.displot(diamonds, x="carat")



As a compromise, it is possible to combine these two approaches. While in histogram mode, displot() (as with histplot()) has the option of including the smoothed KDE curve (note kde=True, not kind="kde"):

sns.displot(diamonds, x="carat", kde=True)



Empirical cumulative distributions

A third option for visualizing distributions computes the “empirical cumulative distribution function” (ECDF). This plot draws a monotonically-increasing curve through each datapoint such that the height of the curve reflects the proportion of observations with a smaller value:



Case 1 - ECDF plot

sns.displot(penguins, x="flipper_length_mm", kind="kde")
sns.displot(penguins, x="flipper_length_mm", kind="ecdf")
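The ECDF is simple enough to compute by hand, which is part of its appeal. The sketch below uses five made-up flipper lengths and a hypothetical ecdf helper; each point's height is the proportion of observations less than or equal to it.

```python
def ecdf(values):
    """Return the sorted values and the proportion of observations <= each value."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

# Made-up flipper lengths (mm).
flippers = [195, 181, 210, 190, 186]
xs, proportions = ecdf(flippers)
for x, p in zip(xs, proportions):
    print(x, p)
```

No bin size or bandwidth appears anywhere in the computation: every observation contributes one step of height 1/n.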



Case 2 - ECDF plot: advantages when comparing multiple distributions

The ECDF plot has two key advantages. Unlike the histogram or KDE, it directly represents each datapoint. That means there is no bin size or smoothing parameter to consider. Additionally, because the curve is monotonically increasing, it is well-suited for comparing multiple distributions:



sns.displot(penguins, x="flipper_length_mm", hue="species", kind="ecdf")



The major downside to the ECDF plot is that it represents the shape of the distribution less intuitively than a histogram or density curve. Consider how the bimodality of flipper lengths is immediately apparent in the histogram, but to see it in the ECDF plot, you must look for varying slopes. Nevertheless, with practice, you can learn to answer all of the important questions about a distribution by examining the ECDF, and doing so can be a powerful approach.



Visualizing bivariate distributions

All of the examples so far have considered univariate distributions: distributions of a single variable, perhaps conditional on a second variable assigned to hue. Assigning a second variable to y, however, will plot a bivariate distribution:



Case 1 - Bivariate histogram and KDE plot

A bivariate histogram bins the data within rectangles that tile the plot and then shows the count of observations within each rectangle with the fill color (analogous to a heatmap()). Similarly, a bivariate KDE plot smoothes the (x, y) observations with a 2D Gaussian. The default representation then shows the contours of the 2D density:



sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm")
sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", kind="kde")
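The rectangle binning behind a bivariate histogram can be sketched with a counter keyed by cell indices. The points, the cell size, and the grid origin below are made-up assumptions for illustration, not values seaborn would choose.

```python
from collections import Counter

# Made-up (bill length, bill depth) pairs in mm.
points = [(39.1, 18.7), (39.5, 17.4), (40.3, 18.0), (46.5, 17.9),
          (50.0, 15.2), (49.1, 14.8), (48.7, 14.1), (45.4, 14.6)]
x_width, y_width = 5.0, 2.0   # rectangle size, in the spirit of binwidth=(5, 2)
x0, y0 = 35.0, 13.0           # assumed lower-left corner of the grid

# Map each point to the (column, row) index of the rectangle that contains it.
cells = Counter(
    (int((x - x0) // x_width), int((y - y0) // y_width))
    for x, y in points
)

for (ix, iy), count in sorted(cells.items()):
    print(f"x in [{x0 + ix * x_width}, {x0 + (ix + 1) * x_width}), "
          f"y in [{y0 + iy * y_width}, {y0 + (iy + 1) * y_width}): {count}")
```

Each count is what the heatmap encodes as fill color for the corresponding rectangle.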



Case 2 - Bivariate histogram and KDE plot: the hue parameter

sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm",hue="species")
sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm",hue="species", kind="kde")



The contour approach of the bivariate KDE plot lends itself better to evaluating overlap, although a plot with too many contours can get busy.


Case 3 - Bivariate histogram and KDE plot: bin size and colorbar

To aid interpretation of the heatmap, add a colorbar to show the mapping between counts and color intensity:



sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", binwidth=(2, .5))
sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", binwidth=(2, .5), cbar=True)



The meaning of the bivariate density contours is less straightforward. Because the density is not directly interpretable, the contours are drawn at iso-proportions of the density, meaning that each curve shows a level set such that some proportion p of the density lies below it. The p values are evenly spaced, with the lowest level controlled by the thresh parameter and the number controlled by levels:

The levels parameter also accepts a list of values, for more control:


sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", kind="kde", thresh=.2, levels=4)
sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", kind="kde", levels=[.01, .05, .1, .8])
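To show how a proportion p translates into a contour height, the sketch below sorts made-up density heights from an evenly spaced grid and accumulates them from the lowest upward until a share p of the total mass is reached. The helper iso_proportion_level is a hypothetical illustration of the idea, not seaborn's code, and the heights are unnormalized integers to keep the arithmetic exact.

```python
def iso_proportion_level(density_values, p):
    """Density height such that about a proportion p of the total mass
    lies in grid cells whose density is at or below that height."""
    total = sum(density_values)
    cumulative = 0
    for value in sorted(density_values):   # lowest density first
        cumulative += value
        if cumulative / total >= p:
            return value
    return max(density_values)

# Made-up, unnormalized density heights on an evenly spaced grid
# (every cell has the same area, so heights are proportional to mass).
grid_density = [1, 2, 2, 5, 10, 20, 25, 20, 10, 5]

levels = {p: iso_proportion_level(grid_density, p) for p in (0.05, 0.2, 0.5)}
for p, height in levels.items():
    print(p, height)
```

A contour drawn at the height found for p=0.2, for example, encloses the region holding roughly the remaining 80% of the mass, which is how a threshold such as thresh=.2 can be read.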



Distribution visualization: pairplot and jointplot

Case 1 - Plotting joint and marginal distributions

jointplot() augments a bivariate relational or distribution plot with the marginal distributions of the two variables. By default, jointplot() represents the bivariate distribution using scatterplot() and the marginal distributions using histplot():


sns.jointplot(data=penguins, x="bill_length_mm", y="bill_depth_mm",)
sns.jointplot(data=penguins, x="bill_length_mm", y="bill_depth_mm",hue="species",)


sns.jointplot(data=penguins, x="bill_length_mm", y="bill_depth_mm",hue="species",)
sns.jointplot(data=penguins,x="bill_length_mm", y="bill_depth_mm", hue="species",kind="kde")



jointplot() is a convenient interface to the JointGrid class, which offers more flexibility when used directly:

g = sns.JointGrid(data=penguins, x="bill_length_mm", y="bill_depth_mm")
g.plot_joint(sns.histplot)
g.plot_marginals(sns.boxplot)



Case 2 - Plotting joint and marginal distributions: rug plot (rugplot)

A less-obtrusive way to show marginal distributions uses a “rug” plot, which adds a small tick on the edge of the plot to represent each individual observation. This is built into displot():



sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm",kind="kde")
sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm",kind="kde", rug=True)



And the axes-level rugplot() function can be used to add rugs on the side of any other kind of plot:



g=sns.relplot(data=penguins, x="bill_length_mm", y="bill_depth_mm")
sns.rugplot(data=penguins, x="bill_length_mm", y="bill_depth_mm",ax=g.ax)
# sns.relplot(data=penguins, x="bill_length_mm", y="bill_depth_mm")
# sns.rugplot(data=penguins, x="bill_length_mm", y="bill_depth_mm")



Case 3 - Plotting many distributions with pairplot

The pairplot() function offers a similar blend of joint and marginal distributions. Rather than focusing on a single relationship, however, pairplot() uses a “small-multiple” approach to visualize the univariate distribution of all variables in a dataset along with all of their pairwise relationships:



sns.pairplot(penguins)



As with jointplot()/JointGrid, using the underlying PairGrid directly will afford more flexibility with only a bit more typing:



g = sns.PairGrid(penguins)
g.map_upper(sns.histplot)
g.map_lower(sns.kdeplot, fill=True)
g.map_diag(sns.histplot, kde=True)



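Summary

This article is part 3 of the "seaborn from beginner to expert" series. It covers seaborn's plotting functions, specifically distribution plots. Useful reference documents are listed near the top of the post, and readers are encouraged to consult the reference links. The goal of the series is to take the reader all the way through seaborn, from beginner to expert.


References

seaborn official site

seaborn official introduction

Getting started with seaborn visualization

[Treasure-level] The most complete and detailed Seaborn tutorial online: an essential data analysis handbook (20,000-character summary)

Summary of common Seaborn plots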
