Statistical Measures with R

Introduction:

Refer to R Tutorial and Exercise Solution

Mean

The mean of an observation variable is a numerical measure of the central location of the data values. It is the sum of its data values divided by the data count.

Hence, for a data sample of size n with values x_1, ..., x_n, its sample mean is defined as follows:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

> duration = faithful$eruptions     # the eruption durations  
> mean(duration)                    # apply the mean function  
[1] 3.4878
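
As a quick check of the definition, the mean can also be computed by hand as the sum divided by the count. This is only a sketch; the variable name manual_mean is made up for illustration.

```r
# Sample mean computed directly from its definition: sum / count.
duration <- faithful$eruptions              # built-in Old Faithful data set
manual_mean <- sum(duration) / length(duration)

print(manual_mean)
# Should agree with the built-in function up to floating-point error.
print(all.equal(manual_mean, mean(duration)))
```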

 

Median

The median of an observation variable is the value at the middle when the data is sorted in ascending order. It is an ordinal measure of the central location of the data values.

> duration = faithful$eruptions     # the eruption durations  
> median(duration)                  # apply the median function  
[1] 4
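
The "value at the middle" can be spelled out explicitly. The sketch below sorts the data and takes the middle value, or the average of the two middle values when the count is even, which is the usual convention. The helper name manual_median is hypothetical.

```r
# Median from the sorted data: middle value (odd n) or
# average of the two middle values (even n).
manual_median <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2
  }
}

duration <- faithful$eruptions
print(manual_median(duration))
print(all.equal(manual_median(duration), median(duration)))
```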

 

 

Quartile (the median is the second quartile)

There are several quartiles of an observation variable.

The first quartile, or lower quartile, is the value that cuts off the first 25% of the data when it is sorted in ascending order.

The second quartile, or median, is the value that cuts off the first 50%.

The third quartile, or upper quartile, is the value that cuts off the first 75%.

> duration = faithful$eruptions     # the eruption durations  
> quantile(duration)                # apply the quantile function  
    0%    25%    50%    75%   100%  
1.6000 2.1627 4.0000 4.4543 5.1000
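
If only the three quartiles are needed, they can be requested explicitly through the probs argument and picked out by name. A small sketch; the variable names are illustrative only.

```r
# Request just the three quartiles and extract them by name.
duration <- faithful$eruptions
q <- quantile(duration, probs = c(0.25, 0.50, 0.75))

lower_quartile <- q[["25%"]]
median_value   <- q[["50%"]]   # the second quartile is the median
upper_quartile <- q[["75%"]]

print(c(lower_quartile, median_value, upper_quartile))
```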

 

Percentile

The nth percentile of an observation variable is the value that cuts off the first n percent of the data values when it is sorted in ascending order.

Find the 32nd, 57th and 98th percentiles

> duration = faithful$eruptions     # the eruption durations  
> quantile(duration, c(.32, .57, .98))  
   32%    57%    98%  
2.3952 4.1330 4.9330
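
The percentiles above usually fall between two data points, so some interpolation rule is needed. By default quantile() uses its "type 7" rule, which interpolates linearly between neighbouring sorted values. The sketch below reimplements that rule for a single probability; the helper name percentile_type7 is hypothetical, and the built-in function remains the reference.

```r
# Linear-interpolation percentile (the rule R documents as type 7).
percentile_type7 <- function(x, p) {
  s  <- sort(x)
  n  <- length(s)
  h  <- (n - 1) * p + 1        # fractional position in the sorted data
  lo <- floor(h)
  hi <- min(lo + 1, n)         # guard for p = 1
  s[lo] + (h - lo) * (s[hi] - s[lo])
}

duration <- faithful$eruptions
probs <- c(0.32, 0.57, 0.98)
manual <- sapply(probs, function(p) percentile_type7(duration, p))
print(manual)
print(all.equal(manual, unname(quantile(duration, probs))))
```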

 

Range

The range of an observation variable is the difference of its largest and smallest data values. It is a measure of how far apart the entire data spreads in value.

> duration = faithful$eruptions     # the eruption durations  
> max(duration) - min(duration)     # apply the max and min functions  
[1] 3.5
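
An equivalent way to get the same number, shown here only as an alternative sketch, is to take the difference of the pair returned by range().

```r
# range() returns c(min, max); diff() subtracts the first from the second.
duration <- faithful$eruptions
spread <- diff(range(duration))
print(spread)
```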

 

Interquartile Range

The interquartile range of an observation variable is the difference of its upper and lower quartiles. It is a measure of how far apart the middle portion of data spreads in value.

 

> duration = faithful$eruptions     # the eruption durations  
> IQR(duration)                     # apply the IQR function  
[1] 2.2915
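
To connect this back to the definition, the interquartile range can be recomputed as the upper quartile minus the lower quartile. A minimal sketch; manual_iqr is an illustrative name.

```r
# IQR as the difference between the 75% and 25% quantiles.
duration <- faithful$eruptions
q <- quantile(duration, c(0.25, 0.75))
manual_iqr <- unname(q[2] - q[1])

print(manual_iqr)
print(all.equal(manual_iqr, IQR(duration)))
```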

 

Box Plot

The box plot of an observation variable is a graphical representation based on its quartiles, as well as its smallest and largest values. It attempts to provide a visual shape of the data distribution.

> duration = faithful$eruptions       # the eruption durations  
> boxplot(duration, horizontal=TRUE)  # horizontal box plot

The box plot of the eruption duration is:

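The plot is a graphical view of the quartiles: the three lines of the box mark the first, second and third quartiles, and the thickest line is the second quartile, that is, the median.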

    0%    25%    50%    75%   100%  
1.6000 2.1627 4.0000 4.4543 5.1000

The plot shows at a glance how the data are distributed.
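
The numbers behind a box plot can be inspected without drawing it. The sketch below uses boxplot.stats() from the grDevices package, which is loaded by default. Note that the box edges are "hinges", which for some sample sizes differ slightly from the quartiles reported by quantile().

```r
# Statistics used to draw a box plot, without plotting anything.
duration <- faithful$eruptions
bp <- boxplot.stats(duration)

# Five numbers: lower whisker, lower hinge, median, upper hinge, upper whisker.
print(bp$stats)
# Points lying beyond the whiskers, if there are any.
print(bp$out)
```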

 

Variance

The variance is a numerical measure of how the data values are dispersed around the mean. In particular, the sample variance is defined as:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

 

> duration = faithful$eruptions    # the eruption durations  
> var(duration)                    # apply the var function  
[1] 1.3027
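
As a rough check, the sample variance can be recomputed from the usual definition: the sum of squared deviations from the mean, divided by n - 1. This is a sketch; manual_var is an illustrative name.

```r
# Sample variance from its definition (denominator n - 1).
duration <- faithful$eruptions
n <- length(duration)
manual_var <- sum((duration - mean(duration))^2) / (n - 1)

print(manual_var)
print(all.equal(manual_var, var(duration)))
```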

 

Standard Deviation

The standard deviation of an observation variable is the square root of its variance.

> duration = faithful$eruptions    # the eruption durations  
> sd(duration)                     # apply the sd function  
[1] 1.1414
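
The relationship to the variance can be confirmed directly. A one-line sketch, with manual_sd as an illustrative name:

```r
# Standard deviation as the square root of the sample variance.
duration <- faithful$eruptions
manual_sd <- sqrt(var(duration))

print(manual_sd)
print(all.equal(manual_sd, sd(duration)))
```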

 

Covariance

The covariance of two variables x and y in a data sample measures how the two are linearly related. A positive covariance would indicate a positive linear relationship between the variables, and a negative covariance would indicate the opposite.

The sample covariance is defined in terms of the sample means as:

$$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

> duration = faithful$eruptions   # the eruption durations  
> waiting = faithful$waiting      # the waiting period  
> cov(duration, waiting)          # apply the cov function  
[1] 13.978
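
The sample covariance can likewise be rebuilt from the sample means: multiply the paired deviations, sum them, and divide by n - 1. A minimal sketch, with manual_cov as an illustrative name.

```r
# Sample covariance from paired deviations about the means.
duration <- faithful$eruptions
waiting  <- faithful$waiting
n <- length(duration)

manual_cov <- sum((duration - mean(duration)) *
                  (waiting  - mean(waiting))) / (n - 1)

print(manual_cov)
print(all.equal(manual_cov, cov(duration, waiting)))
```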

 

Correlation Coefficient

The correlation coefficient of two variables in a data sample is their covariance divided by the product of their individual standard deviations. It is a normalized measurement of how the two are linearly related.

Formally, the sample correlation coefficient is defined by the following formula, where s_x and s_y are the sample standard deviations, and s_xy is the sample covariance.

$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$

If the correlation coefficient is close to 1, it indicates that the variables are positively linearly related and the scatter plot falls almost along a straight line with positive slope.

If it is close to -1, it indicates that the variables are negatively linearly related and the scatter plot falls almost along a straight line with negative slope.

If it is close to zero, it indicates a weak linear relationship between the variables.

> duration = faithful$eruptions   # the eruption durations  
> waiting = faithful$waiting      # the waiting period  
> cor(duration, waiting)          # apply the cor function  
[1] 0.90081

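This shows that eruption duration and waiting time are strongly positively correlated: the longer the wait, the longer the eruption tends to be.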
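
The same coefficient can be assembled from its parts, the covariance divided by the product of the two standard deviations. A sketch; manual_cor is an illustrative name.

```r
# Correlation coefficient as covariance / (sd_x * sd_y).
duration <- faithful$eruptions
waiting  <- faithful$waiting

manual_cor <- cov(duration, waiting) / (sd(duration) * sd(waiting))

print(manual_cor)
print(all.equal(manual_cor, cor(duration, waiting)))
```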

 

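Covariance and Correlation Coefficient

1. In finance, covariance is a statistic that measures the risk of one investment in a portfolio relative to another. Put simply, it describes how the returns of two investments in the portfolio move together. A positive value means that when the return of one rises, the return of the other rises too, so the returns move in the same direction. A negative value means that one rises while the other falls, so the returns move in opposite directions. The larger the absolute value of the covariance, the more closely the returns of the two assets are related; the smaller the absolute value, the more loosely they are related.
2. Because covariance is hard to interpret, it is divided by the product of the standard deviations of the two investments' returns, which yields a number with the same properties as covariance but without units. This number is the correlation coefficient. The formula is: correlation coefficient = covariance / product of the two investments' standard deviations.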
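
One way to see why the unit-free version is easier to read: changing the units of a variable rescales the covariance but leaves the correlation untouched. The sketch below reuses the Old Faithful data rather than financial returns (an assumption made purely for illustration) and converts the eruption duration from minutes to seconds.

```r
# Covariance depends on the units of measurement; correlation does not.
duration <- faithful$eruptions          # minutes
waiting  <- faithful$waiting            # minutes
duration_sec <- duration * 60           # same data, expressed in seconds

cov_min <- cov(duration, waiting)
cov_sec <- cov(duration_sec, waiting)   # 60 times larger
cor_min <- cor(duration, waiting)
cor_sec <- cor(duration_sec, waiting)   # unchanged

print(c(cov_min = cov_min, cov_sec = cov_sec))
print(c(cor_min = cor_min, cor_sec = cor_sec))
```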

 

Central Moment

The kth central moment (or moment about the mean) of a data sample is:

$$m_k = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^k$$

For example, the second central moment of a population is its variance.

> library(moments)                  # load the moments package  
> duration = faithful$eruptions     # the eruption durations  
> moment(duration, order=3, central=TRUE)  
[1] -0.6149
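
If the moments package is not installed, a central moment is easy to compute in base R as the average of the kth powers of the deviations from the mean (dividing by n, not n - 1). The helper name central_moment below is hypothetical.

```r
# k-th central moment in base R: mean of (x - mean(x))^k.
central_moment <- function(x, k) mean((x - mean(x))^k)

duration <- faithful$eruptions
n  <- length(duration)
m2 <- central_moment(duration, 2)
m3 <- central_moment(duration, 3)

print(m3)
# The second central moment equals the sample variance rescaled by (n - 1) / n.
print(all.equal(m2, var(duration) * (n - 1) / n))
```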

 

Skewness

The skewness of a data population is defined by the following formula, where μ2 and μ3 are the second and third central moments.

$$\gamma_1 = \frac{\mu_3}{\mu_2^{3/2}}$$

Intuitively, the skewness is a measure of symmetry.

Negative skewness indicates that the mean of the data values is less than the median, and the data distribution is left-skewed.

Positive skewness indicates that the mean of the data values is larger than the median, and the data distribution is right-skewed. Of course, this rule applies only to unimodal distributions whose histograms have a single peak.

> library(moments)                  # load the moments package  
> duration = faithful$eruptions     # the eruption durations  
> skewness(duration)                # apply the skewness function  
[1] -0.41584
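
The same value can be obtained in base R from the second and third central moments, as a sketch of the definition. The names central_moment and manual_skew are illustrative only.

```r
# Skewness as m3 / m2^(3/2), using central moments with denominator n.
central_moment <- function(x, k) mean((x - mean(x))^k)

duration <- faithful$eruptions
manual_skew <- central_moment(duration, 3) /
               central_moment(duration, 2)^(3 / 2)

print(manual_skew)
# For comparison with the sign of the skewness: mean versus median.
print(c(mean = mean(duration), median = median(duration)))
```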

 

Kurtosis

The kurtosis of a univariate population is defined by the following formula, where μ2 and μ4 are the second and fourth central moments.

$$\gamma_2 = \frac{\mu_4}{\mu_2^{2}} - 3$$

Intuitively, the kurtosis is a measure of the peakedness of the data distribution.

Negative kurtosis indicates a flat distribution, which is said to be platykurtic.

Positive kurtosis indicates a peaked distribution, which is said to be leptokurtic.

Finally, because of the subtracted 3, the normal distribution has zero kurtosis, and is said to be mesokurtic. The kurtosis function in the moments package does not subtract 3, so the example below does so explicitly.

> library(moments)                  # load the moments package  
> duration = faithful$eruptions     # the eruption durations  
> kurtosis(duration) - 3            # apply the kurtosis function  
[1] -1.5006
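
For reference, the same quantity can be computed in base R from the second and fourth central moments, subtracting 3 so that a normal distribution scores zero. A sketch; central_moment and excess_kurtosis are illustrative names.

```r
# Kurtosis as m4 / m2^2 - 3, using central moments with denominator n.
central_moment <- function(x, k) mean((x - mean(x))^k)

duration <- faithful$eruptions
excess_kurtosis <- central_moment(duration, 4) /
                   central_moment(duration, 2)^2 - 3

print(excess_kurtosis)
```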


This article is excerpted from cnblogs (博客园); the original was published on 2012-02-15.
