如何在R中产生一些规则或不规则的测试数据?
产生连续分布的向量
例子 :
1. 使用冒号. 产生连续整数向量
2. 使用seq函数, 产生连续值, 可以指定步长.
3. 使用scan让用户输入
4. 使用rep重复一个向量值数次, 注意each和times参数的差别.
5. 使用sequence函数产生一系列连续整数序列.
6. 使用gl 产生因子
7. expand.grid()创建数据框(data.frame)
以下一共产生2*2*3行 的数据框
最后两个函数序列可以用 来求统计假设检验中P值或临界值。
分布名称 函数
[参考]
2. help('gl')
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> 1:10-2 # 注意:号优先级高于减号运算符
[1] -1 0 1 2 3 4 5 6 7 8
> 1:(10-2)
[1] 1 2 3 4 5 6 7 8
> a <- 1:10
> a
[1] 1 2 3 4 5 6 7 8 9 10
2. 使用seq函数, 产生连续值, 可以指定步长.
> seq(1,10)
[1] 1 2 3 4 5 6 7 8 9 10
> seq(1,10,0.5) # from=1 to=10 by=0.5
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
[16] 8.5 9.0 9.5 10.0
> seq(1,10,1)
[1] 1 2 3 4 5 6 7 8 9 10
> seq(from=1,to=10,by=1) # by 指定步长
[1] 1 2 3 4 5 6 7 8 9 10
> seq(from=1,to=10,length.out=1)
[1] 1
> seq(from=1,to=10,length.out=100)
[1] 1.000000 1.090909 1.181818 1.272727 1.363636 1.454545 1.545455
[8] 1.636364 1.727273 1.818182 1.909091 2.000000 2.090909 2.181818
[15] 2.272727 2.363636 2.454545 2.545455 2.636364 2.727273 2.818182
[22] 2.909091 3.000000 3.090909 3.181818 3.272727 3.363636 3.454545
[29] 3.545455 3.636364 3.727273 3.818182 3.909091 4.000000 4.090909
[36] 4.181818 4.272727 4.363636 4.454545 4.545455 4.636364 4.727273
[43] 4.818182 4.909091 5.000000 5.090909 5.181818 5.272727 5.363636
[50] 5.454545 5.545455 5.636364 5.727273 5.818182 5.909091 6.000000
[57] 6.090909 6.181818 6.272727 6.363636 6.454545 6.545455 6.636364
[64] 6.727273 6.818182 6.909091 7.000000 7.090909 7.181818 7.272727
[71] 7.363636 7.454545 7.545455 7.636364 7.727273 7.818182 7.909091
[78] 8.000000 8.090909 8.181818 8.272727 8.363636 8.454545 8.545455
[85] 8.636364 8.727273 8.818182 8.909091 9.000000 9.090909 9.181818
[92] 9.272727 9.363636 9.454545 9.545455 9.636364 9.727273 9.818182
[99] 9.909091 10.000000
3. 使用scan让用户输入
> scan()
1: 1
2: 2
3: 3
4: 4
5: 100
6:
Read 5 items
[1] 1 2 3 4 100
> a <- scan()
1: 1
2: 10
3: 100
4: 1000
5:
Read 4 items
> a
[1] 1 10 100 1000
4. 使用rep重复一个向量值数次, 注意each和times参数的差别.
> a
[1] 1 10 100 1000
> rep(a,each=2) 每个元素重复2次
[1] 1 1 10 10 100 100 1000 1000
> rep(a,times=2) 每个向量重复2次
[1] 1 10 100 1000 1 10 100 1000
> rep(a,each=4,length=10) # length限制返回向量的长度
[1] 1 1 1 1 10 10 10 10 100 100
> rep(a,times=4,length=10)
[1] 1 10 100 1000 1 10 100 1000 1 10
5. 使用sequence函数产生一系列连续整数序列.
> sequence(c(2,3,4,5)) # 产生从1到2, 从1到3, 从1到4, 从1到5的序列.
[1] 1 2 1 2 3 1 2 3 4 1 2 3 4 5
> sequence(2:5) # 产生从1到2, 从1到3, 从1到4, 从1到5的序列.
[1] 1 2 1 2 3 1 2 3 4 1 2 3 4 5
> sequence(5) 从1到5的序列.
[1] 1 2 3 4 5
> sequence(c(2,5)) # 产生从1到2, 从1到5的序列.
[1] 1 2 1 2 3 4 5
6. 使用gl 产生因子
> gl(n=2, k=3, length=10) # n是level数量, k是每个level的重复次数, length是总长度
[1] 1 1 1 2 2 2 1 1 1 2
Levels: 1 2
> gl(n=2, k=3)
[1] 1 1 1 2 2 2
Levels: 1 2
> gl(n=2, k=3, labels=c("a", "b")) # labels代替数字level
[1] a a a b b b
Levels: a b
> gl(n=2, k=3, labels=c("a", "b", "c")) # labels代替数字level, 如果n<length(lables), 不需要的level不会出现在上面.
[1] a a a b b b
Levels: a b c
> gl(n=2, k=9, labels=c("a", "b", "c"))
[1] a a a a a a a a a b b b b b b b b b
Levels: a b c
> gl(n=2, k=9, labels=c("a", "b", "c"), ordered=TRUE) # 是否排序
[1] a a a a a a a a a b b b b b b b b b
Levels: a < b < c
7. expand.grid()创建数据框(data.frame)
数据框是列长度相同的多列结构 , 每列的类型可以不一致.
3列如下, 完全匹配, (笛卡尔)
以下
一共产生2*2*2行的数据框
> expand.grid(h=c(60,80), w=c(100, 300), sex=c("Male", "Female"))
h w sex
1 60 100 Male
2 80 100 Male
3 60 300 Male
4 80 300 Male
5 60 100 Female
6 80 100 Female
7 60 300 Female
8 80 300 Female
以下一共产生2*2*3行 的数据框
> expand.grid(h=c(60,80), w=c(100, 300), sex=c("Male", "Female", "non"))
h w sex
1 60 100 Male
2 80 100 Male
3 60 300 Male
4 80 300 Male
5 60 100 Female
6 80 100 Female
7 60 300 Female
8 80 300 Female
9 60 100 non
10 80 100 non
11 60 300 non
12 80 300 non
产生规则分布的测试数据 :
在统计学中,产生随机数据是很有用的,R可以产生多种不同分布下的随
机数序列。
这些分布函数的形式为rfunc(n,p1,p2,...),其中func指概率分
布函数,n为生成数据的个数,p1, p2, . . . 是分布的参数数值。
上面的表给出
了每个分布的详情和可能的缺省值(如果没有给出缺省值,则意味着用户必
须指定参数)。
大多数这种统计函数都有相似的形式,只需用d、p或者q去替代r (见下表),比如 :
1. 分布函数的形式为 rfunc(n,p1,p2,...)
2. 密度函数 ( dfunc (x, ...) ,
3. 累计概率密度函数(也即分布函数)( pfunc (x,...) ) ,
4. 分位数函数( qfunc (p, ...) , 0 < p < 1) .
最后两个函数序列可以用 来求统计假设检验中P值或临界值。
例如,显著性水平为5%的正态分布的双
侧临界值是 :
> qnorm(0.025)
[1] -1.959964
> qnorm(0.975)
[1] 1.959964
对于同一个检验的单侧临界值,根据备择假设的形式使用qnorm(0.05)或1 -qnorm(0.95)
一个检验的P 值,比如自由度df = 1的?
2
= 3:84 :
> 1 - pchisq(3.84, 1)
[1] 0.05004352
分布名称 函数
Gaussian (normal) rnorm(n, mean=0, sd=1)
exponential rexp(n, rate=1)
gamma rgamma(n, shape, scale=1)
Poisson rpois(n, lambda)
Weibull rweibull(n, shape, scale=1)
Cauchy rcauchy(n, location=0, scale=1)
beta rbeta(n, shape1, shape2)
`Student' (t) rt(n, df)
Fisher{Snedecor (F ) rf(n, df1, df2)
Pearson (?2) rchisq(n, df)
binomial rbinom(n, size, prob)
multinomial rmultinom(n, size, prob)
geometric rgeom(n, prob)
hypergeometric rhyper(nn, m, n, k)
logistic rlogis(n, location=0, scale=1)
lognormal rlnorm(n, meanlog=0, sdlog=1)
negative binomial rnbinom(n, size, prob)
uniform runif(n, min=0, max=1)
Wilcoxon's statistics rwilcox(nn, m, n), rsignrank(nn, n)
1. help("seq")
Description:
Generate regular sequences. ‘seq’ is a standard generic with a
default method. ‘seq.int’ is a primitive which can be much faster
but has a few restrictions. ‘seq_along’ and ‘seq_len’ are very
fast primitives for two common cases.
Usage:
seq(...)
## Default S3 method:
seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
length.out = NULL, along.with = NULL, ...)
seq.int(from, to, by, length.out, along.with, ...)
seq_along(along.with)
seq_len(length.out)
Arguments:
...: arguments passed to or from methods.
from, to: the starting and (maximal) end values of the sequence. Of
length ‘1’ unless just ‘from’ is supplied as an unnamed
argument.
by: number: increment of the sequence.
length.out: desired length of the sequence. A non-negative number,
which for ‘seq’ and ‘seq.int’ will be rounded up if
fractional.
along.with: take the length from the length of this argument.
....
2. help('gl')
Description:
Generate factors by specifying the pattern of their levels.
Usage:
gl(n, k, length = n*k, labels = seq_len(n), ordered = FALSE)
Arguments:
n: an integer giving the number of levels.
k: an integer giving the number of replications.
length: an integer giving the length of the result.
labels: an optional vector of labels for the resulting factor levels.
ordered: a logical indicating whether the result should be ordered or
not.
Value:
The result has levels from ‘1’ to ‘n’ with each value replicated
in groups of length ‘k’ out to a total length of ‘length’.
‘gl’ is modelled on the _GLIM_ function of the same name.