Statistical Measures with R


Refer to R Tutorial and Exercise Solution

Mean

The mean of an observation variable is a numerical measure of the central location of the data values. It is the sum of the data values divided by the number of values.

Hence, for a data sample of size n, its sample mean is defined as follows:
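Written in the usual notation, with the data values denoted x_1, ..., x_n (symbols chosen here for illustration), the definition reads:

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```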

> duration = faithful$eruptions     # the eruption durations  
> mean(duration)                    # apply the mean function  
[1] 3.4878

 

Median

The median of an observation variable is the value at the middle when the data is sorted in ascending order. It is an ordinal measure of the central location of the data values.

> duration = faithful$eruptions     # the eruption durations  
> median(duration)                  # apply the median function  
[1] 4
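As a rough check of the definition, the median can be reproduced by sorting the data by hand. This is only a sketch: the helper name `manual_median` is made up for illustration and is not part of R.

```r
# Hypothetical helper (not part of R): median via explicit sorting.
manual_median <- function(x) {
  s <- sort(x)                      # ascending order
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                  # odd count: the single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2   # even count: average the two middle values
  }
}

duration <- faithful$eruptions
all.equal(manual_median(duration), median(duration))   # TRUE
```

With an even number of observations there is no single middle value, so the two middle values are averaged, which is also what `median` does.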

 

 

Quartile (the median is the second quartile)

There are several quartiles of an observation variable.

The first quartile, or lower quartile, is the value that cuts off the first 25% of the data when it is sorted in ascending order.

The second quartile, or median, is the value that cuts off the first 50%.

The third quartile, or upper quartile, is the value that cuts off the first 75%.

> duration = faithful$eruptions     # the eruption durations  
> quantile(duration)                # apply the quantile function  
    0%    25%    50%    75%   100%  
1.6000 2.1627 4.0000 4.4543 5.1000

 

Percentile

The nth percentile of an observation variable is the value that cuts off the first n percent of the data values when it is sorted in ascending order.

Find the 32nd, 57th and 98th percentiles

> duration = faithful$eruptions     # the eruption durations  
> quantile(duration, c(.32, .57, .98))  
   32%    57%    98%  
2.3952 4.1330 4.9330

 

Range

The range of an observation variable is the difference of its largest and smallest data values. It is a measure of how far apart the entire data spreads in value.

> duration = faithful$eruptions     # the eruption durations  
> max(duration) - min(duration)     # apply the max and min functions  
[1] 3.5
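Note that base R's own `range` function returns the smallest and largest values as a pair, not their difference. A short sketch of getting the same spread from it with `diff`:

```r
duration <- faithful$eruptions

r <- range(duration)   # c(smallest value, largest value)
r
diff(r)                # largest minus smallest, same as max(duration) - min(duration)
```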

 

Interquartile Range

The interquartile range of an observation variable is the difference of its upper and lower quartiles. It is a measure of how far apart the middle portion of data spreads in value.

 

> duration = faithful$eruptions     # the eruption durations  
> IQR(duration)                     # apply the IQR function  
[1] 2.2915
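To connect this with the quartiles computed earlier, the same number can be obtained by subtracting the lower quartile from the upper one. A minimal sketch using the default settings of `quantile`:

```r
duration <- faithful$eruptions

q <- quantile(duration, c(0.25, 0.75))   # lower and upper quartiles
iqr_manual <- unname(q[2] - q[1])        # upper minus lower, names dropped
all.equal(iqr_manual, IQR(duration))     # TRUE
```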

 

Box Plot

The box plot of an observation variable is a graphical representation based on its quartiles, as well as its smallest and largest values. It attempts to provide a visual shape of the data distribution.

> duration = faithful$eruptions       # the eruption durations  
> boxplot(duration, horizontal=TRUE)  # horizontal box plot

The box plot of the eruption duration is:

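The plot is a graphical representation of the quartiles: the three lines of the box mark the first, second, and third quartiles, and the thickest line is the second quartile, that is, the median.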

    0%    25%    50%    75%   100%  
1.6000 2.1627 4.0000 4.4543 5.1000

The plot shows how the data are distributed...
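The numbers behind a box plot can also be inspected without drawing anything, using `boxplot.stats` from base R. This is a sketch only. Its "hinges" are computed slightly differently from the default `quantile` method, so they may not match the quartiles above to every digit.

```r
duration <- faithful$eruptions

bp <- boxplot.stats(duration)
bp$stats   # lower whisker end, lower hinge, median, upper hinge, upper whisker end
bp$out     # values beyond the whiskers, which a box plot draws as individual points
```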

 

Variance

The variance is a numerical measure of how the data values are dispersed around the mean. In particular, the sample variance is defined as:
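Assuming the usual convention of dividing by n - 1 rather than n, which is also what R's `var` function does, the sample variance of the values x_1, ..., x_n with sample mean \bar{x} is:

```latex
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
```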

 

> duration = faithful$eruptions    # the eruption durations  
> var(duration)                    # apply the var function  
[1] 1.3027
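As a quick check of the definition, the sample variance can be computed step by step and compared with `var`. This is a minimal sketch; the variable name `manual_var` is chosen here for illustration.

```r
duration <- faithful$eruptions
n <- length(duration)

# Sum of squared deviations from the mean, divided by n - 1.
manual_var <- sum((duration - mean(duration))^2) / (n - 1)
all.equal(manual_var, var(duration))   # TRUE
```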

 

Standard Deviation

The standard deviation of an observation variable is the square root of its variance.

> duration = faithful$eruptions    # the eruption durations  
> sd(duration)                     # apply the sd function  
[1] 1.1414

 

Covariance

The covariance of two variables x and y in a data sample measures how the two are linearly related. A positive covariance indicates a positive linear relationship between the variables, and a negative covariance indicates the opposite.

The sample covariance is defined in terms of the sample means as:
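With \bar{x} and \bar{y} denoting the sample means, the standard form of this definition (with the n - 1 divisor that R's `cov` also uses) is:

```latex
s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```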

> duration = faithful$eruptions   # the eruption durations  
> waiting = faithful$waiting      # the waiting period  
> cov(duration, waiting)          # apply the cov function  
[1] 13.978

 

Correlation Coefficient

The correlation coefficient of two variables in a data sample is their covariance divided by the product of their individual standard deviations. It is a normalized measurement of how the two are linearly related.

Formally, the sample correlation coefficient is defined by the following formula, where sx and sy are the sample standard deviations, and sxy is the sample covariance.
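Using the symbols named above (the name r_{xy} for the coefficient is a notation chosen here), the formula is:

```latex
r_{xy} = \frac{s_{xy}}{s_x \, s_y}
```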

If the correlation coefficient is close to 1, the variables are positively linearly related and the scatter plot falls almost along a straight line with positive slope.

If it is close to -1, the variables are negatively linearly related and the scatter plot falls almost along a straight line with negative slope.

If it is close to zero, the linear relationship between the variables is weak.

> duration = faithful$eruptions   # the eruption durations  
> waiting = faithful$waiting      # the waiting period  
> cor(duration, waiting)          # apply the cor function  
[1] 0.90081

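This shows that eruption duration and waiting time are strongly positively correlated: the longer the wait, the longer the eruption...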

 

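Covariance and the Correlation Coefficient

1. Covariance is a statistical measure of the risk of one investment in a portfolio relative to another. Put simply, it describes how the returns of two items in the portfolio move together. A positive value means that when one return rises the other rises too, so the returns move in the same direction. A negative value means that one rises while the other falls, so the returns move in opposite directions. For returns measured on the same scale, the larger the absolute value of the covariance, the closer the relationship between the two assets' returns; the smaller the absolute value, the weaker the relationship.
2. Because covariance is hard to interpret, it is divided by the product of the standard deviations of the two investments' returns, which gives a number with the same properties as covariance but without units. This number is the correlation coefficient: correlation coefficient = covariance / (product of the two items' standard deviations).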
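The relation in point 2 can be checked on the eruption data used throughout this article. This is a minimal sketch; the name `manual_cor` is chosen here for illustration.

```r
duration <- faithful$eruptions   # the eruption durations
waiting  <- faithful$waiting     # the waiting period

# Covariance divided by the product of the two standard deviations.
manual_cor <- cov(duration, waiting) / (sd(duration) * sd(waiting))
all.equal(manual_cor, cor(duration, waiting))   # TRUE
```

Unlike the covariance, the result does not change if either variable is rescaled, for example by converting minutes to seconds.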

 

Central Moment

The kth central moment (or moment about the mean) of a data sample is:
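In the usual 1/n convention, which is consistent with the value shown in the example for this section, and with \bar{x} denoting the sample mean (the symbol m_k is a notation chosen here):

```latex
m_k = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^k
```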

For example, the second central moment of a population is its variance.

> library(moments)                  # load the moments package  
> duration = faithful$eruptions     # the eruption durations  
> moment(duration, order=3, central=TRUE)  
[1] -0.6149
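For readers without the moments package, a central moment is easy to sketch in base R. The helper `central_moment` below is hypothetical (it is not a function from the package) and assumes the 1/n convention.

```r
# Hypothetical helper (not from the moments package): k-th central moment
# computed with the 1/n convention.
central_moment <- function(x, k) mean((x - mean(x))^k)

duration <- faithful$eruptions
n <- length(duration)

central_moment(duration, 3)      # third central moment
central_moment(duration, 2)      # second central moment (1/n form)
var(duration) * (n - 1) / n      # same value, rescaled from the n - 1 sample variance
```

The last two lines show why the second central moment and the output of `var` differ slightly on a sample: `var` divides by n - 1, while the central moment divides by n.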

 

Skewness

The skewness of a data population is defined by the following formula, where μ2 and μ3 are the second and third central moments.
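In terms of those two moments, the standard definition (the symbol \gamma_1 is a notation chosen here) is:

```latex
\gamma_1 = \frac{\mu_3}{\mu_2^{3/2}}
```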

Intuitively, the skewness is a measure of symmetry.

Negative skewness indicates that the mean of the data values is less than the median, and the data distribution is left-skewed;

Positive skewness indicates that the mean of the data values is larger than the median, and the data distribution is right-skewed. Of course, this rule applies only to unimodal distributions whose histograms have a single peak.

> library(moments)                  # load the moments package  
> duration = faithful$eruptions     # the eruption durations  
> skewness(duration)                # apply the skewness function  
[1] -0.41584
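The same figure can be approximated in base R from the second and third central moments. This is a sketch that assumes the 1/n convention for the moments; the variable names are chosen here for illustration.

```r
duration <- faithful$eruptions

d  <- duration - mean(duration)   # deviations from the mean
m2 <- mean(d^2)                   # second central moment
m3 <- mean(d^3)                   # third central moment

skew <- m3 / m2^(3/2)
skew
mean(duration) < median(duration)   # TRUE here (3.4878 < 4), in line with the negative sign
```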

 

Kurtosis

The kurtosis of a univariate population is defined by the following formula, where μ2 and μ4 are the second and fourth central moments.
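Assuming the excess-kurtosis convention, which is what the "- 3" in the example for this section and the zero value for the normal distribution imply (the symbol \gamma_2 is a notation chosen here):

```latex
\gamma_2 = \frac{\mu_4}{\mu_2^{2}} - 3
```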

Intuitively, the kurtosis is a measure of the peakedness of the data distribution.

Negative kurtosis indicates a flat distribution, which is said to be platykurtic.

Positive kurtosis indicates a peaked distribution, which is said to be leptokurtic.

Finally, the normal distribution has zero kurtosis, and is said to be mesokurtic.

> library(moments)                  # load the moments package  
> duration = faithful$eruptions     # the eruption durations  
> kurtosis(duration) - 3            # apply the kurtosis function  
[1] -1.5006
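A base-R sketch of the same calculation, again assuming 1/n central moments and subtracting 3 so that a normal distribution scores zero. The variable names are chosen here for illustration.

```r
duration <- faithful$eruptions

d  <- duration - mean(duration)   # deviations from the mean
m2 <- mean(d^2)                   # second central moment
m4 <- mean(d^4)                   # fourth central moment

excess_kurtosis <- m4 / m2^2 - 3
excess_kurtosis
```

The negative result marks the eruption durations as platykurtic under the rule stated in this section.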


This article is excerpted from cnblogs (博客园); originally published 2012-02-15.

策略搜索即搜索策略空间,而无需直接计算值函数。策略空间的维数通常低于状态空间,并且通常可以更有效地搜索。本部分首先讨论在初始状态分布下估计策略价值的方法。然后讨论不使用策略梯度估计的搜索方法和策略梯度方法。接着介绍Actor-Critic方法用值函数的估计来指导优化。