A Simple Guide to Using aggregate
Computing group means
# Load a built-in dataset
df <- chickwts
# Inspect the first rows
head(df)
> head(df)
  weight      feed
1    179 horsebean
2    160 horsebean
3    136 horsebean
4    227 horsebean
5    217 horsebean
6    168 horsebean
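There are two ways to compute group means with aggregate.
Method 1:
# 1st argument: the numeric variable
# 2nd argument: the grouping variable(s), supplied as a list
# 3rd argument: the summary function (here the mean)
group_mean <- aggregate(df$weight, list(df$feed), mean)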
> group_mean
    Group.1        x
1    casein 323.5833
2 horsebean 160.2000
3   linseed 218.7500
4  meatmeal 276.9091
5   soybean 246.4286
6 sunflower 328.9167
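Note that the columns of the result are named Group.1 and x rather than feed and weight. You can rename them with the colnames function.
colnames(group_mean) <- c("Group", "Mean")
group_mean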
> group_mean
      Group     Mean
1    casein 323.5833
2 horsebean 160.2000
3   linseed 218.7500
4  meatmeal 276.9091
5   soybean 246.4286
6 sunflower 328.9167
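As an alternative to renaming afterwards, the names can be set in the call itself. The sketch below assumes the same chickwts data; the name group_mean_named is made up for this illustration. Naming the list element controls the grouping column, and passing a one-column data frame instead of a bare vector keeps the value column's name.

```r
# Built-in dataset, as in the text above
df <- chickwts

# list(Group = ...) names the grouping column;
# df["weight"] is a one-column data frame, so its column name is kept
group_mean_named <- aggregate(df["weight"], by = list(Group = df$feed), FUN = mean)

names(group_mean_named)
group_mean_named
```

The values are the same as in Method 1; only the column names differ.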
Method 2 (formula interface):
group_mean <- aggregate(weight ~ feed, data = df, mean)
> group_mean
       feed   weight
1    casein 323.5833
2 horsebean 160.2000
3   linseed 218.7500
4  meatmeal 276.9091
5   soybean 246.4286
6 sunflower 328.9167
Counting observations per group
group_count <- aggregate(df$feed, by = list(df$feed), FUN = length)
group_count
> group_count
    Group.1  x
1    casein 12
2 horsebean 10
3   linseed 12
4  meatmeal 11
5   soybean 14
6 sunflower 12
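The same counts can also be obtained through the formula interface. This is a minimal sketch; the names group_count_f and n are chosen here for illustration. Keep in mind that the formula method drops rows with missing values by default (na.action = na.omit), so on data with NAs the two approaches may disagree. chickwts has no missing values.

```r
df <- chickwts

# Count the weight values within each feed group
group_count_f <- aggregate(weight ~ feed, data = df, FUN = length)
colnames(group_count_f)[2] <- "n"
group_count_f

# Cross-check against table(), which counts rows per factor level
table(df$feed)
```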
Computing quantiles per group
# Build a dataset: the daily returns of a fund over one year
# install.packages(c("lubridate", "xts"))  # run once if the packages are missing
library(lubridate)
library(xts)
set.seed(1)
Dates <- seq(dmy("01/01/2014"), dmy("01/01/2015"), by = "day")
Return <- rnorm(length(Dates))
tserie <- xts(Return, Dates)
head(tserie)
> head(tserie)
                 [,1]
2014-01-01 -0.6264538
2014-01-02  0.1836433
2014-01-03 -0.8356286
2014-01-04  1.5952808
2014-01-05  0.3295078
2014-01-06 -0.8204684
We can compute the 5% and 95% quantiles of the returns for each month:
dat <- aggregate(tserie ~ month(index(tserie)), FUN = quantile, probs = c(0.05, 0.95))
colnames(dat)[1] <- "Month"
dat
> dat
   Month     V1.5%    V1.95%
1      1 -1.704122  1.427575
2      2 -1.099533  1.316474
3      3 -1.388600  1.819083
4      4 -1.083452  1.639272
5      5 -1.652789  1.259811
6      6 -1.406464  2.147217
7      7 -1.337666  1.637731
8      8 -1.669366  1.308261
9      9 -1.635192  1.155433
10    10 -1.371251  1.874883
11    11 -1.445358  1.505385
12    12 -2.091900  1.525886
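When FUN returns more than one value, aggregate stores the results in a single matrix column, which is why the header above prints as V1.5% and V1.95%. The sketch below shows this on the chickwts data so that it runs without extra packages; the names q and q_flat are made up for this illustration, and do.call(data.frame, ...) is one possible way to flatten the matrix column, not the only one.

```r
df <- chickwts

# 5% and 95% quantiles of weight within each feed group
q <- aggregate(weight ~ feed, data = df, FUN = quantile, probs = c(0.05, 0.95))

dim(q)         # two columns: feed and the matrix column weight
dim(q$weight)  # one row per group, one column per quantile

# Expand the matrix column into ordinary data frame columns
q_flat <- do.call(data.frame, q)
ncol(q_flat)
```

After flattening, each quantile sits in its own column and can be selected or renamed like any other.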
Aggregating by multiple columns
# Create the dataset
set.seed(1)
cat_var <- sample(c("A", "B", "C"), nrow(df), replace = TRUE)
df_2 <- cbind(df, cat_var)
head(df_2)
> head(df_2)
  weight      feed cat_var
1    179 horsebean       A
2    160 horsebean       C
3    136 horsebean       A
4    227 horsebean       B
5    217 horsebean       A
6    168 horsebean       C
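You can compute statistics by several categorical variables. The two calls below give the same sums: the by version names the result columns Group.1, Group.2 and x, while the formula version keeps feed, cat_var and weight.
aggregate(df_2$weight, by = list(df_2$feed, df_2$cat_var), FUN = sum)
aggregate(weight ~ feed + cat_var, data = df_2, FUN = sum)  # equivalent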
     feed cat_var weight
   casein       A   1005
horsebean       A    532
  linseed       A   1079
 meatmeal       A    242
  soybean       A   1738
sunflower       A    882
   casein       B   1131
horsebean       B    494
  linseed       B    780
 meatmeal       B   2244
  soybean       B   1355
sunflower       B   2109
   casein       C   1747
horsebean       C    576
  linseed       C    766
 meatmeal       C    560
  soybean       C    357
sunflower       C    956
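To check that the two calls really agree, the results can be compared directly. This is a small sketch with made-up names (res_by, res_formula). The exact group sums depend on the random draw of cat_var, which can differ between R versions, so the comparison does not rely on specific numbers.

```r
df <- chickwts
set.seed(1)
df_2 <- cbind(df, cat_var = sample(c("A", "B", "C"), nrow(df), replace = TRUE))

res_by      <- aggregate(df_2$weight, by = list(df_2$feed, df_2$cat_var), FUN = sum)
res_formula <- aggregate(weight ~ feed + cat_var, data = df_2, FUN = sum)

# Same sums in the same row order; only the column names differ
all.equal(res_by$x, res_formula$weight)

# The group sums add up to the overall total
sum(res_formula$weight) == sum(df$weight)
```

Combinations of feed and cat_var that have no observations do not appear in the result, so the number of rows can be smaller than the number of possible combinations.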
# Create a new dataset
set.seed(1)
num_var <- rnorm(nrow(df))
df_3 <- cbind(num_var, df)
head(df_3)
> head(df_3)
     num_var weight      feed
1 -0.6264538    179 horsebean
2  0.1836433    160 horsebean
3 -0.8356286    136 horsebean
4  1.5952808    227 horsebean
5  0.3295078    217 horsebean
6 -0.8204684    168 horsebean
When working with two or more numeric variables, you can combine them with the cbind function:
aggregate(cbind(df_3$num_var, df_3$weight), list(df_3$feed), mean)
  Group.1         V1       V2
   casein  0.4043795 323.5833
horsebean  0.1322028 160.2000
  linseed  0.3491303 218.7500
 meatmeal  0.2125804 276.9091
  soybean -0.2314387 246.4286
sunflower  0.1651836 328.9167
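The V1 and V2 headers appear because cbind on bare vectors carries no column names. One way around this, sketched below with the made-up name res_named, is to use cbind on the left-hand side of a formula, which keeps the variable names.

```r
set.seed(1)
df_3 <- cbind(num_var = rnorm(nrow(chickwts)), chickwts)

# cbind() inside the formula keeps num_var and weight as column names
res_named <- aggregate(cbind(num_var, weight) ~ feed, data = df_3, FUN = mean)

names(res_named)
res_named
```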
Of course, the function can also be applied to several numeric variables and several categorical variables at the same time.
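A minimal sketch of that combined case, assuming a data frame df_4 built here only for illustration (it puts num_var and cat_var from the earlier examples side by side): two numeric variables go on the left of the formula and two grouping variables on the right.

```r
set.seed(1)
# Hypothetical combined dataset: two numeric and two categorical columns
df_4 <- data.frame(
  num_var = rnorm(nrow(chickwts)),
  weight  = chickwts$weight,
  feed    = chickwts$feed,
  cat_var = sample(c("A", "B", "C"), nrow(chickwts), replace = TRUE)
)

# Mean of both numeric variables for every feed / cat_var combination
res_multi <- aggregate(cbind(num_var, weight) ~ feed + cat_var, data = df_4, FUN = mean)
head(res_multi)
```

The result has one row per observed combination of feed and cat_var, with the grouping columns first and one column per numeric variable.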
Reference
https://r-coder.com/aggregate-r/