1 Probability in data processing

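Python is widely used for numerical computing, and it has a large number of data-processing libraries for artificial intelligence, day-to-day data work, and more.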


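Among open-source Python libraries alone there are many to choose from, such as opencv, matplotlib, numpy, and pandas, as well as seaborn, which provides a high-level plotting interface on top of matplotlib.

Python and R, two of the most commonly used tools, both provide complete support.

Data processing cannot be separated from probability, which comes up in many situations. One example is the set of random-variate generators in Python's built-in random module.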

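    random.shuffle(x)

This function shuffles the sequence x in place into a pseudo-random order, a classic technique. Older Python versions also accepted an optional second argument, random, which was deprecated in Python 3.9 and removed in 3.11.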
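As a minimal sketch of the in-place behaviour, the snippet below shuffles a small list with a seeded generator. The seed value and the variable names are illustrative choices, not part of the original text.

```python
import random

# A seeded generator makes the pseudo-random order repeatable between runs.
rng = random.Random(42)

deck = list(range(10))
original = deck.copy()

# shuffle() reorders the list in place and returns None.
result = rng.shuffle(deck)

print(deck)                      # same elements, new order
print(result is None)            # shuffle does not return the list
print(sorted(deck) == original)  # nothing was added or lost
```

Because the list is modified in place, keep a copy first if the original order is still needed.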

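Generator functions in the random module

    uniform within range    # uniformly distributed over a range

Supported distributions (incomplete list)

    uniform            # uniform distribution
    triangular         # triangular distribution
    lognormal          # log-normal distribution
    gamma              # gamma distribution
    beta               # beta distribution
    pareto             # Pareto distribution
    weibull            # Weibull distribution

distributions on the circle

    circular uniform   # uniform distribution on the circle
    von mises          # von Mises distribution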
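The distribution names above map onto functions in the standard random module. The following is a small sketch that draws samples from each one; the parameter values, sample size, and seed are arbitrary choices made for illustration.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the summary is repeatable
n = 10_000

samples = {
    "uniform": [rng.uniform(0.0, 1.0) for _ in range(n)],
    "triangular": [rng.triangular(0.0, 1.0, 0.5) for _ in range(n)],  # low, high, mode
    "lognormal": [rng.lognormvariate(0.0, 0.25) for _ in range(n)],   # mu, sigma of the log
    "gamma": [rng.gammavariate(2.0, 1.0) for _ in range(n)],          # shape, scale
    "beta": [rng.betavariate(2.0, 5.0) for _ in range(n)],
    "pareto": [rng.paretovariate(3.0) for _ in range(n)],             # shape
    "weibull": [rng.weibullvariate(1.0, 1.5) for _ in range(n)],      # scale, shape
    "von mises": [rng.vonmisesvariate(math.pi, 2.0) for _ in range(n)],  # mean angle, kappa
}

# Print a one-line summary per distribution.
for name, xs in samples.items():
    mean = sum(xs) / n
    print(f"{name:>10}: mean={mean:.3f} min={min(xs):.3f} max={max(xs):.3f}")
```

The von Mises draws are angles in radians between 0 and 2*pi, which is why that distribution is listed under "distributions on the circle".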

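I ended up using several of these libraries by coincidence, so here is a brief comparison. It is not comprehensive, and corrections from experts in either camp are welcome.

The axioms of probability are not repeated here; they are easy to find in textbooks at a library or an online bookstore.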

2 A brief comparison of probability density functions

2.1 R

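In R, pt() returns the cumulative probability of the t distribution for a given t value and df, and qt() returns the t quantile for a given probability p and df. Their rough counterparts in Excel are TDIST() and TINV(), although TDIST() returns tail probabilities and TINV() is two-tailed, so the values are not directly interchangeable.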

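R provides four functions for the t distribution, dt, pt, qt, and rt, which parallel dnorm, pnorm, qnorm, and rnorm for the normal distribution. The functions and their arguments are described below.

* dt() returns the probability density function (density) of the t distribution.
* pt() returns the distribution function (probability) of the t distribution.
* qt() returns the quantile, that is, the lower percentile for a given probability p.
* rt() returns n random deviates drawn from the t distribution.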

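x, q: vector of quantiles.
p: vector of probabilities.
n: number of observations. If length(n) > 1, the length is taken to be the number required.
df: degrees of freedom (> 0, may be non-integer). df = Inf is allowed.
ncp: non-centrality parameter. Currently, except for rt(), it is supported only for abs(ncp) <= 37.62. If omitted, the central t distribution is used.
log, log.p: logical; if TRUE, probabilities p are given as log(p).
lower.tail: logical; if TRUE (default), probabilities are P[X ≤ x], otherwise P[X > x].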
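For comparison on the Python side, the same four operations can be sketched with SciPy. This assumes the scipy package is installed; SciPy is not mentioned in the original text, and the degrees of freedom and input values are illustrative.

```python
from scipy import stats

df = 10                 # degrees of freedom, chosen for illustration
t_dist = stats.t(df)    # a "frozen" central t distribution

density = t_dist.pdf(0.0)       # counterpart of dt(0, df)
lower = t_dist.cdf(2.228)       # counterpart of pt(2.228, df)
upper = t_dist.sf(2.228)        # counterpart of pt(2.228, df, lower.tail = FALSE)
quantile = t_dist.ppf(0.975)    # counterpart of qt(0.975, df)
draws = t_dist.rvs(size=5, random_state=0)  # counterpart of rt(5, df)

print(f"density at 0: {density:.4f}")
print(f"P[X <= 2.228] = {lower:.4f}, P[X > 2.228] = {upper:.4f}")
print(f"97.5% quantile: {quantile:.4f}")
print(draws)
```

The cdf/sf pair plays the role of R's lower.tail switch: the two values always sum to 1.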

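2.2 A high-level Python framework built on matplotlib

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.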

The most commonly used functions are listed below.

2.2.0 Relational plots

relplot(*[, x, y, hue, size, style, data, ...])
    Figure-level interface for drawing relational plots onto a FacetGrid.

scatterplot(*[, x, y, hue, style, size, ...])
    Draw a scatter plot with possibility of several semantic groupings.

lineplot(*[, x, y, hue, size, style, data, ...])
    Draw a line plot with possibility of several semantic groupings.
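The difference between an axes-level function and the figure-level relplot can be sketched as follows. The data frame is synthetic, its column names are hypothetical, and the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: y depends linearly on x, split into two groups.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(size=200),
    "group": np.tile(["a", "b"], 100),
})
df["y"] = 2.0 * df["x"] + rng.normal(scale=0.5, size=200)

# Axes-level: draws onto a single matplotlib Axes and returns it.
ax = sns.scatterplot(data=df, x="x", y="y", hue="group")

# Figure-level: builds a FacetGrid with one panel per value of "group".
grid = sns.relplot(data=df, x="x", y="y", hue="group", col="group", kind="scatter")
print(grid.axes.shape)
```

The hue argument is one of the "semantic groupings" mentioned above: it maps a column to colour without changing the x and y positions.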

2.2.1 Distribution, categorical, regression, matrix, and other plots

Distribution plots

displot([data, x, y, hue, row, col, ...])
    Figure-level interface for drawing distribution plots onto a FacetGrid.

histplot([data, x, y, hue, weights, stat, ...])
    Plot univariate or bivariate histograms to show distributions of datasets.

kdeplot([x, y, shade, vertical, kernel, bw, ...])
    Plot univariate or bivariate distributions using kernel density estimation.

ecdfplot([data, x, y, hue, weights, stat, ...])
    Plot empirical cumulative distribution functions.

rugplot([x, height, axis, ax, data, y, hue, ...])
    Plot marginal distributions by drawing ticks along the x and y axes.

distplot([a, bins, hist, kde, rug, fit, ...])
    DEPRECATED: Flexibly plot a univariate distribution of observations.

Categorical plots

catplot(*[, x, y, hue, data, row, col, ...])
    Figure-level interface for drawing categorical plots onto a FacetGrid.

stripplot(*[, x, y, hue, data, order, ...])
    Draw a scatterplot where one variable is categorical.

swarmplot(*[, x, y, hue, data, order, ...])
    Draw a categorical scatterplot with non-overlapping points.

boxenplot(*[, x, y, hue, data, order, ...])
    Draw an enhanced box plot for larger datasets.

pointplot(*[, x, y, hue, data, order, ...])
    Show point estimates and confidence intervals using scatter plot glyphs.

barplot(*[, x, y, hue, data, order, ...])
    Show point estimates and confidence intervals as rectangular bars.

countplot(*[, x, y, hue, data, order, ...])
    Show the counts of observations in each categorical bin using bars.

Regression plots

lmplot(*[, x, y, data, hue, col, row, ...])
    Plot data and regression model fits across a FacetGrid.

regplot(*[, x, y, data, x_estimator, ...])
    Plot data and a linear regression model fit.

residplot(*[, x, y, lowess, ...])
    Plot the residuals of a linear regression.
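The sketch below fits a line to synthetic data with a known slope of 2 and then reads the fitted line back from the plot. The data, seed, and variable names are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=100)  # true slope is 2

fig, (ax_fit, ax_resid) = plt.subplots(1, 2, figsize=(9, 3.5))
sns.regplot(x=x, y=y, ax=ax_fit)        # scatter plus fitted line and band
sns.residplot(x=x, y=y, ax=ax_resid)    # residuals around zero

# Recover the slope from the end points of the drawn regression line.
line_x, line_y = ax_fit.lines[0].get_data()
slope = (line_y[-1] - line_y[0]) / (line_x[-1] - line_x[0])
print(f"fitted slope is about {slope:.2f}")
```

seaborn draws the fit but does not return its coefficients; for the numbers themselves a statistics library is the usual route.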

Matrix plots

heatmap(data, *[, vmin, vmax, cmap, center, ...])
    Plot rectangular data as a color-encoded matrix.

clustermap(data, *[, pivot_kws, method, ...])
    Plot a matrix dataset as a hierarchically-clustered heatmap.
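A correlation matrix is a common example of rectangular data for heatmap. The following sketch uses synthetic columns named a to d, which are hypothetical, and the matplotlib "coolwarm" colormap.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(4)
data = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))
data["b"] += data["a"]  # make columns a and b visibly correlated

corr = data.corr()      # 4 x 4 matrix of pairwise correlations

# vmin, vmax and center pin the colour scale so that 0 sits in the middle.
ax = sns.heatmap(corr, vmin=-1, vmax=1, center=0, cmap="coolwarm", annot=True)
print(corr.round(2))
```

With annot=True each cell also shows its numeric value, which helps when the colour differences are small.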

Multi-plot grids

Facet grids

FacetGrid(data, *[, row, col, hue, ...])
    Multi-plot grid for plotting conditional relationships.

FacetGrid.map(self, func, *args, **kwargs)
    Apply a plotting function to each facet's subset of the data.

FacetGrid.map_dataframe(self, func, *args, ...)
    Like .map but passes args as strings and inserts data in kwargs.

Pair grids

pairplot(data, *[, hue, hue_order, palette, ...])
    Plot pairwise relationships in a dataset.

PairGrid(data, *[, hue, hue_order, palette, ...])
    Subplot grid for plotting pairwise relationships in a dataset.

PairGrid.map(self, func, **kwargs)
    Plot with the same function in every subplot.

PairGrid.map_diag(self, func, **kwargs)
    Plot with a univariate function on each diagonal subplot.

PairGrid.map_offdiag(self, func, **kwargs)
    Plot with a bivariate function on the off-diagonal subplots.

PairGrid.map_lower(self, func, **kwargs)
    Plot with a bivariate function on the lower diagonal subplots.

PairGrid.map_upper(self, func, **kwargs)
    Plot with a bivariate function on the upper diagonal subplots.

Joint grids

jointplot(*[, x, y, data, kind, color, ...])
    Draw a plot of two variables with bivariate and univariate graphs.

JointGrid(*[, x, y, data, height, ratio, ...])
    Grid for drawing a bivariate plot with marginal univariate plots.

JointGrid.plot(self, joint_func, ...)
    Draw the plot by passing functions for joint and marginal axes.

JointGrid.plot_joint(self, func, **kwargs)
    Draw a bivariate plot on the joint axes of the grid.

JointGrid.plot_marginals(self, func, **kwargs)
    Draw univariate plots on each marginal axes.

2.3 Other topics

Themes

set_theme([context, style, palette, font, ...])
    Set multiple theme parameters in one step.

axes_style([style, rc])
    Return a parameter dict for the aesthetic style of the plots.

set_style([style, rc])
    Set the aesthetic style of the plots.

plotting_context([context, font_scale, rc])
    Return a parameter dict to scale elements of the figure.

set_context([context, font_scale, rc])
    Set the plotting context parameters.

set_color_codes([palette])
    Change how matplotlib color shorthands are interpreted.

reset_defaults()
    Restore all RC params to default settings.

reset_orig()
    Restore all RC params to original settings (respects custom rc).

set(*args, **kwargs)
    Alias for set_theme(), which is the preferred interface.
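The theme functions work by changing matplotlib's rcParams. The sketch below watches one parameter, axes.grid, to show the difference between setting a theme globally, reading a style dict, and applying a style temporarily. The style names are seaborn's built-in ones; the variable names are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

# Global change: every later figure picks up this style and context.
sns.set_theme(style="whitegrid", context="talk")
grid_after_theme = plt.rcParams["axes.grid"]

# axes_style() only returns the parameters; nothing is applied yet.
dark = sns.axes_style("darkgrid")

# Used as a context manager, the style applies inside the block only.
with sns.axes_style("white"):
    fig, ax = plt.subplots()
    grid_inside = plt.rcParams["axes.grid"]

# Back to matplotlib's own defaults.
sns.reset_defaults()
grid_after_reset = plt.rcParams["axes.grid"]

print(grid_after_theme, dark["axes.grid"], grid_inside, grid_after_reset)
```

The temporary form is handy when only one figure in a script needs a different look.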

Color palettes

set_palette(palette[, n_colors, desat, ...])
    Set the matplotlib color cycle using a seaborn palette.

color_palette([palette, n_colors, desat, ...])
    Return a list of colors or continuous colormap defining a palette.

husl_palette([n_colors, h, s, l, as_cmap])
    Get a set of evenly spaced colors in HUSL hue space.

hls_palette([n_colors, h, l, s, as_cmap])
    Get a set of evenly spaced colors in HLS hue space.

cubehelix_palette([n_colors, start, rot, ...])
    Make a sequential palette from the cubehelix system.

dark_palette(color[, n_colors, reverse, ...])
    Make a sequential palette that blends from dark to color.

light_palette(color[, n_colors, reverse, ...])
    Make a sequential palette that blends from light to color.

diverging_palette(h_neg, h_pos[, s, l, sep, ...])
    Make a diverging palette between two HUSL colors.

blend_palette(colors[, n_colors, as_cmap, input])
    Make a palette that blends between a list of colors.

xkcd_palette(colors)
    Make a palette with color names from the xkcd color survey.

crayon_palette(colors)
    Make a palette with color names from Crayola crayons.

mpl_palette(name[, n_colors, as_cmap])
    Return discrete colors from a matplotlib palette.
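A few of the palette builders can be tried without drawing anything, since they simply return colors. In the sketch below the palette names, color names, and sizes are arbitrary examples.

```python
import seaborn as sns

deep = sns.color_palette("deep", 4)        # four colors from a named palette
husl = sns.husl_palette(6)                 # six evenly spaced hues
blend = sns.blend_palette(["navy", "white", "firebrick"], n_colors=5)
diverging = sns.diverging_palette(220, 20, n=7)           # two hues, seven steps
light_cmap = sns.light_palette("seagreen", as_cmap=True)  # continuous colormap

print(deep.as_hex())        # hex strings, convenient for other tools
print(len(husl), len(blend), len(diverging))
print(light_cmap(0.5))      # an RGBA tuple sampled from the colormap
```

A list-like palette suits categorical data, while as_cmap=True gives a continuous colormap for functions such as heatmap.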

Palette widgets

choose_colorbrewer_palette(data_type[, as_cmap])
    Select a palette from the ColorBrewer set.

choose_cubehelix_palette([as_cmap])
    Launch an interactive widget to create a sequential cubehelix palette.

choose_light_palette([input, as_cmap])
    Launch an interactive widget to create a light sequential palette.

choose_dark_palette([input, as_cmap])
    Launch an interactive widget to create a dark sequential palette.

choose_diverging_palette([as_cmap])
    Launch an interactive widget to choose a diverging color palette.

Utility functions

load_dataset(name[, cache, data_home])
    Load an example dataset from the online repository (requires internet).

get_dataset_names()
    Report available example datasets, useful for reporting issues.

get_data_home([data_home])
    Return a path to the cache directory for example datasets.

despine([fig, ax, top, right, left, bottom, ...])
    Remove the top and right spines from plot(s).

desaturate(color, prop)
    Decrease the saturation channel of a color by some percent.

saturate(color)
    Return a fully saturated color with the same hue.

set_hls_values(color[, h, l, s])
    Independently manipulate the h, l, or s channels of a color.
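The color helpers and despine can be sketched together. The example below starts from pure red so the effect on the RGB values is easy to follow; the plotted points are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

# Halve the saturation of pure red, then restore full saturation.
muted = sns.desaturate("red", 0.5)
vivid = sns.saturate(muted)

# Change only the lightness channel, leaving hue and saturation alone.
lighter = sns.set_hls_values("red", l=0.8)

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4], color=muted)
sns.despine(ax=ax)  # hides the top and right spines by default

print(muted, vivid, lighter)
print(ax.spines["top"].get_visible(), ax.spines["left"].get_visible())
```

load_dataset is left out on purpose because it needs a network connection.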

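3 Summary

R is used widely in teaching, and it can plot the classic probability functions and distributions directly.

Seaborn can be used with only a basic grounding in probability. It is quick to pick up and has plenty of documentation and examples.
The rest of the framework leans mostly toward presentation.

Combined with interactive web plotting libraries for Python such as plotly and Bokeh, it can meet the needs of small and medium-sized projects.

Bokeh is an interactive visualization library for Python that targets modern web browsers for presentation.

Its goal is to provide elegant, concise construction of versatile graphics, and to deliver high-performance interactivity over large or streaming datasets.

It can help anyone who wants to create interactive plots, dashboards, and data applications quickly and easily.

This post is just a brief record.

References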

https://www.pydata.org
