seaborn from Beginner to Mastery 03 - Plotting Functions 04 - Regression Fit Plots (Estimating regression fits)



Regression fit plots - Estimating regression fits

Many datasets contain multiple quantitative variables, and the goal of an analysis is often to relate those variables to each other. We previously discussed functions that can accomplish this by showing the joint distribution of two variables. It can be very helpful, though, to use statistical models to estimate a simple relationship between two noisy sets of observations. The functions discussed in this chapter do so through the common framework of linear regression.

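seaborn's regression plots are primarily a visual guide for exploratory analysis. seaborn is not itself a package for statistical analysis, and quantitative measures of fit are better obtained from a library such as statsmodels. The goal of seaborn is to make exploring a dataset through visualization quick and easy, as doing so is just as important as (if not more important than) exploring a dataset through tables of statistics.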


Functions for drawing linear regression models

The two functions that can be used to visualize a linear fit are regplot() and lmplot().

In the simplest invocation, both functions draw a scatterplot of two variables, x and y, then fit the regression model y ~ x and plot the resulting regression line with a 95% confidence interval for that regression.

These functions draw similar plots, but regplot() is an axes-level function and lmplot() is a figure-level function. Additionally, regplot() accepts the x and y variables in a variety of formats, including simple numpy arrays, pandas.Series objects, or references to variables in a pandas.DataFrame object passed to data. In contrast, lmplot() has data as a required parameter, and the x and y variables must be specified as strings. Finally, only lmplot() has hue as a parameter.
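
The difference in accepted inputs and return types can be sketched as follows. This is a minimal illustration on synthetic data; the column names x and y and the data itself are assumptions made for the example, not part of the original post.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data (an assumption for illustration): y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 50, size=100)
y = 1.0 + 0.15 * x + rng.normal(scale=1.0, size=100)
df = pd.DataFrame({"x": x, "y": y})

# regplot() is axes-level: it accepts plain arrays and draws onto the Axes it is given.
fig, ax = plt.subplots()
ax_out = sns.regplot(x=x, y=y, ax=ax)

# It also accepts column names when a DataFrame is passed as data.
fig2, ax2 = plt.subplots()
sns.regplot(data=df, x="x", y="y", ax=ax2)

# lmplot() is figure-level: it needs data plus column names and returns a FacetGrid.
grid = sns.lmplot(data=df, x="x", y="y")

print(type(ax_out).__name__, type(grid).__name__)
```

Because regplot() draws on an existing Axes, it can be placed into any matplotlib layout, while lmplot() creates and manages its own figure.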


References

regplot

lmplot

Importing libraries and loading the tips, diamonds, anscombe, and penguins datasets

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

sns.set_theme(style="darkgrid")
# Use a font with CJK glyphs and keep the minus sign rendering correctly with it
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
# Load the tips dataset, caching it under ./seaborn-data
tips = sns.load_dataset("tips", cache=True, data_home=r"./seaborn-data")
print(tips.head())



diamonds = sns.load_dataset("diamonds",cache=True,data_home=r"./seaborn-data")
print(diamonds.head())


anscombe = sns.load_dataset("anscombe",cache=True,data_home=r"./seaborn-data")
print(anscombe.head())



penguins = sns.load_dataset("penguins",cache=True,data_home=r"./seaborn-data")
print(penguins.head())



Example 1 - Regression fit

sns.regplot(x="total_bill", y="tip", data=tips)
sns.lmplot(x="total_bill", y="tip", data=tips)



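It’s possible to fit a linear regression when one of the variables takes discrete values; however, the simple scatterplot produced by this kind of dataset is often not optimal. One option is to add a little random noise (“jitter”) to the discrete values with x_jitter so that their distribution is easier to see. The jitter is applied only to the scatterplot data and does not influence the regression fit itself: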

sns.lmplot(x="size", y="tip", data=tips);
sns.lmplot(x="size", y="tip", data=tips, x_jitter=.05);
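
What jitter does can be sketched with plain numpy. The data below are synthetic (a hypothetical party size from 1 to 6 and a noisy tip), and the code only illustrates the idea described in the seaborn documentation, where noise is added to a copy of the data used for drawing; it is not seaborn's internal implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical discrete predictor (e.g. a party size from 1 to 6) and a noisy response.
size = rng.integers(1, 7, size=200)
tip = 1.0 + 0.7 * size + rng.normal(scale=0.8, size=200)

# Fit the regression on the original, unjittered values.
slope, intercept = np.polyfit(size, tip, 1)

# Jitter is applied only to the copy used for drawing the points.
jitter = 0.05
size_for_plot = size + rng.uniform(-jitter, jitter, size=size.shape)

print(f"fit from raw x: slope={slope:.3f}, intercept={intercept:.3f}")
print("unique raw x values:", np.unique(size).size)
print("unique jittered x values:", np.unique(size_for_plot).size)
```

The jittered copy spreads overlapping points apart horizontally, while the fitted line still comes from the raw values.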



Example 2 - Fitting different kinds of models: Anscombe's quartet

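The scatter_kws parameter passes extra keyword arguments to the scatter plot, controlling properties such as the color, transparency, and size of the points.

The ci parameter sets the size of the confidence interval for the regression estimate, which is drawn as a translucent band around the regression line. The interval is estimated with a bootstrap; for large datasets it may be advisable to skip that computation by setting ci to None.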


sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'I'"),ci=None, scatter_kws={"s": 80})
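
To make the cost of the ci computation concrete, here is a rough sketch of a bootstrap confidence band for a regression line using only numpy. The data are synthetic and the procedure (resampling pairs, then taking percentiles of the refitted lines) is a common textbook approach assumed here for illustration; seaborn's own implementation may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (an assumption for illustration).
x = rng.uniform(0, 10, size=60)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=60)

n_boot = 1000                      # number of bootstrap resamples
grid = np.linspace(x.min(), x.max(), 100)
boot_lines = np.empty((n_boot, grid.size))

for i in range(n_boot):
    # Resample (x, y) pairs with replacement and refit the line.
    idx = rng.integers(0, x.size, size=x.size)
    slope, intercept = np.polyfit(x[idx], y[idx], 1)
    boot_lines[i] = intercept + slope * grid

# A 95% band is the 2.5th to 97.5th percentile of the refitted lines at each grid point.
lower, upper = np.percentile(boot_lines, [2.5, 97.5], axis=0)
print("mean band width:", float(np.mean(upper - lower)))
```

Each resample requires a full refit, which is why the band becomes expensive for large datasets and why ci=None is a reasonable shortcut during exploration.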



The linear relationship in the second dataset is the same, but the plot clearly shows that this is not a good model:

sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'II'"),ci=None, scatter_kws={"s": 80})



In the presence of this kind of higher-order relationship, lmplot() and regplot() can fit a polynomial regression model to explore simple kinds of nonlinear trends in the dataset:


The order parameter:

If order is greater than 1, numpy.polyfit is used to estimate a polynomial regression.

Reference: https://blog.csdn.net/lishiyang0902/article/details/127652317


# sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'II'"),ci=None, scatter_kws={"s": 80})
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'II'"),order=2, ci=None, scatter_kws={"s": 80})
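
The effect of order=2 on this dataset can be checked numerically with numpy.polyfit directly. The sketch below hard-codes the eleven points of Anscombe dataset II so that it runs without downloading anything; the helper name fit_and_score is made up for this example.

```python
import numpy as np

# Anscombe dataset II, hard-coded so the sketch runs without downloading anything.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

def fit_and_score(order):
    """Fit a polynomial of the given order; return (coefficients, residual sum of squares)."""
    coefs = np.polyfit(x, y, order)
    residuals = y - np.polyval(coefs, x)
    return coefs, float(np.sum(residuals ** 2))

linear_coefs, linear_rss = fit_and_score(1)
quad_coefs, quad_rss = fit_and_score(2)

print("order=1 coefficients:", np.round(linear_coefs, 3), "RSS:", round(linear_rss, 3))
print("order=2 coefficients:", np.round(quad_coefs, 3), "RSS:", round(quad_rss, 5))
```

The quadratic fit leaves a far smaller residual sum of squares than the straight line, which matches what the order=2 plot shows visually.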



A different problem is posed by “outlier” observations that deviate for some reason other than the main relationship under study (first call below).

In the presence of outliers, it can be useful to fit a robust regression, which uses a different loss function to downweight relatively large residuals (second call below):


The robust parameter:

If True, use statsmodels to estimate a robust regression. This will de-weight outliers. Note that this is substantially more computationally intensive than standard linear regression, so you may wish to decrease the number of bootstrap resamples (n_boot) or set ci to None.


sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'III'"),ci=None, scatter_kws={"s": 80})
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'III'"),robust=True, ci=None, scatter_kws={"s": 80})
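
The idea of downweighting large residuals can be sketched with iteratively reweighted least squares and Huber weights in plain numpy. This is an illustrative stand-in, not the statsmodels routine that seaborn calls; the function name huber_line, the tuning constant c=1.345, and the fixed iteration count are assumptions made for the example. The data are Anscombe dataset III, hard-coded.

```python
import numpy as np

# Anscombe dataset III, hard-coded: ten points on a line plus one outlier at x = 13.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

def huber_line(x, y, c=1.345, n_iter=50):
    """Fit y ~ x by iteratively reweighted least squares with Huber weights."""
    slope, intercept = np.polyfit(x, y, 1)            # start from ordinary least squares
    for _ in range(n_iter):
        r = y - (intercept + slope * x)
        # Robust scale estimate from the median absolute deviation.
        scale = max(np.median(np.abs(r - np.median(r))) / 0.6745, 1e-8)
        # Huber weights: 1 for small residuals, shrinking as |r| grows.
        w = np.minimum(1.0, c * scale / np.maximum(np.abs(r), 1e-12))
        # np.polyfit multiplies residuals by w before squaring, hence the sqrt.
        slope, intercept = np.polyfit(x, y, 1, w=np.sqrt(w))
    return slope, intercept

ols_slope, ols_intercept = np.polyfit(x, y, 1)
rob_slope, rob_intercept = huber_line(x, y)

print(f"ordinary least squares: slope={ols_slope:.3f}, intercept={ols_intercept:.3f}")
print(f"Huber IRLS sketch:      slope={rob_slope:.3f}, intercept={rob_intercept:.3f}")
```

The ten non-outlying points lie close to a line with slope 0.345, so a robust fit should land much nearer that value than the ordinary fit, which the single outlier pulls upward.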



Example 3 - Linear fit when y is binary: the logistic parameter

tips["big_tip"] = (tips.tip / tips.total_bill) > .15
print(tips.head())

When the y variable is binary, simple linear regression also “works” but provides implausible predictions, such as estimated probabilities below 0 or above 1 (first call below).

The solution in this case is to fit a logistic regression, such that the regression line shows the estimated probability of y = 1 for a given value of x (second call below). With logistic=True, seaborn uses statsmodels to estimate the model, so statsmodels must be installed:

sns.lmplot(x="total_bill", y="big_tip", data=tips, y_jitter=.03)
sns.lmplot(x="total_bill", y="big_tip", data=tips,logistic=True, y_jitter=.03)
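
Why the straight line is implausible for a binary outcome can be shown with a small numpy sketch. The data are synthetic, and the logistic model is fitted here with a hand-written Newton iteration purely for illustration; seaborn itself delegates this fit to statsmodels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary outcome (an assumption for illustration): P(y = 1) rises with x.
x = np.linspace(0, 10, 200)
true_p = 1.0 / (1.0 + np.exp(-1.5 * (x - 5.0)))
y = (rng.random(x.size) < true_p).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Ordinary least squares on a 0/1 outcome: a straight line with no bounds.
ols_slope, ols_intercept = np.polyfit(x, y, 1)

# Logistic regression fitted with Newton's method (iteratively reweighted least squares).
X = np.column_stack([np.ones_like(x), x])   # design matrix with an intercept column
beta = np.zeros(2)
for _ in range(25):
    p = sigmoid(X @ beta)
    gradient = X.T @ (y - p)
    hessian = X.T @ (X * (p * (1.0 - p))[:, None])
    beta = beta + np.linalg.solve(hessian, gradient)

# Compare predictions slightly outside the observed range of x.
x_new = np.array([-2.0, 5.0, 12.0])
X_new = np.column_stack([np.ones_like(x_new), x_new])
linear_pred = ols_intercept + ols_slope * x_new
logistic_pred = sigmoid(X_new @ beta)

print("linear predictions:  ", np.round(linear_pred, 3))
print("logistic predictions:", np.round(logistic_pred, 3))
```

The linear predictions leave the [0, 1] range at the extremes, while the logistic predictions always remain valid probabilities.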



Example 4 - The residplot() function

Reference: http://seaborn.pydata.org/generated/seaborn.residplot.html#seaborn.residplot


The residplot() function can be a useful tool for checking whether the simple regression model is appropriate for a dataset. It fits and removes a simple linear regression and then plots the residual values for each observation. Ideally, these values should be randomly scattered around y = 0 (left panel below, dataset I).

If there is structure in the residuals, it suggests that simple linear regression is not appropriate (right panel below, dataset II):


fig, axes = plt.subplots(1, 2, figsize=(16, 8))
sns.residplot(x="x", y="y", data=anscombe.query("dataset == 'I'"),scatter_kws={"s": 80},ax=axes[0])
sns.residplot(x="x", y="y", data=anscombe.query("dataset == 'II'"),scatter_kws={"s": 80},ax=axes[1])
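
The "structure in the residuals" can also be quantified. The sketch below computes the residuals of a straight-line fit for Anscombe datasets I and II (hard-coded) and correlates them with a centered quadratic term; this curvature check is a crude diagnostic chosen for illustration and is not something residplot() computes.

```python
import numpy as np

# Anscombe datasets I and II share the same x values; hard-coded so the sketch runs offline.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

def linear_residuals(x, y):
    """Fit y ~ x by least squares and return the residuals."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (intercept + slope * x)

res1 = linear_residuals(x, y1)
res2 = linear_residuals(x, y2)

# A crude check for leftover curvature: correlate residuals with a centered quadratic term.
curvature = (x - x.mean()) ** 2
corr1 = np.corrcoef(curvature, res1)[0, 1]
corr2 = np.corrcoef(curvature, res2)[0, 1]

print(f"dataset I:  correlation with curvature = {corr1:.3f}")
print(f"dataset II: correlation with curvature = {corr2:.3f}")
```

For dataset II the residuals are almost perfectly explained by the quadratic term, which is the numerical counterpart of the arch visible in the residual plot.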



Example 5 - Regression fits conditioned on other variables

The plots above show many ways to explore the relationship between a pair of variables. Often, however, a more interesting question is “how does the relationship between these two variables change as a function of a third variable?” This is where the main differences between regplot() and lmplot() appear. While regplot() always shows a single relationship, lmplot() combines regplot() with FacetGrid to show multiple fits using hue mapping or faceting.

The best way to separate out a relationship is to plot both levels on the same axes and to use color to distinguish them (the hue parameter).

Unlike relplot(), lmplot() cannot map a distinct variable to the style properties of the scatter plot, but you can redundantly code the hue variable with marker shape, for example markers=["o", "x"] together with palette="Set1".

To add another variable, you can draw multiple “facets”, with each level of the variable appearing in the rows or columns of the grid (the col and row parameters):

sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips)
sns.lmplot(x="total_bill", y="tip", hue="smoker",markers=["o", "x"], palette="Set1",col="time", row="sex", data=tips, height=3)
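
The grid that lmplot() builds from hue, col, and row can be inspected programmatically. The sketch below uses a synthetic DataFrame; the column names bill, tip, group, session, and table are hypothetical stand-ins chosen for this example so that it runs without downloading the tips dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(3)
n = 160

# Synthetic data with three categorical variables, each with two levels.
df = pd.DataFrame({
    "bill": rng.uniform(5, 50, size=n),
    "group": rng.choice(["a", "b"], size=n),
    "session": rng.choice(["lunch", "dinner"], size=n),
    "table": rng.choice(["inside", "outside"], size=n),
})
# The slope of tip against bill differs between the two groups.
slope = np.where(df["group"] == "a", 0.10, 0.20)
df["tip"] = 1.0 + slope * df["bill"] + rng.normal(scale=0.8, size=n)

# hue only: both fits share a single set of axes, distinguished by color and marker.
g_hue = sns.lmplot(data=df, x="bill", y="tip", hue="group",
                   markers=["o", "x"], ci=None)

# hue plus col and row: one facet per combination of the two extra variables.
g_facets = sns.lmplot(data=df, x="bill", y="tip", hue="group",
                      col="session", row="table", height=3, ci=None)

print("hue only:", g_hue.axes.shape)
print("hue + col + row:", g_facets.axes.shape)
```

Each of the two extra variables has two levels, so the faceted call produces a two-by-two grid of axes, each containing one fit per hue level.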



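Example 6 - Plotting a regression in other contexts

A few other seaborn functions use regplot() in the context of a larger, more complex plot: jointplot() shows the regression on the joint axes when kind="reg" is passed, and pairplot() with kind="reg" shows the linear relationship between different pairings of variables.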

sns.jointplot(x="total_bill", y="tip", data=tips, kind="reg")


sns.pairplot(tips, x_vars=["total_bill", "size"], y_vars=["tip"],hue="smoker", height=5, aspect=.8, kind="reg")



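Summary

This article is the third installment of the "seaborn from Beginner to Mastery" series. It covers regression fit plots, one part of seaborn's plotting functionality. Useful reference documents are linked near the top of the post and listed below, and readers are encouraged to consult them. The goal of the series is to take the reader through seaborn from beginner to mastery.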


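References

Official seaborn site

Official seaborn introduction

Getting started with seaborn visualization

The most complete Seaborn tutorial on the web: an essential data analysis handbook (20,000-word summary)

Summary of common Seaborn plots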
