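This article was first published on the WeChat public account “生信补给站”: https://mp.weixin.qq.com/s/yhMgkST-rVD6SaQS7R-eoA
A Sankey diagram is a specific type of flow diagram in which the width of each branch is proportional to the size of the flow it represents. It is commonly used to visualize data in fields such as energy, material composition, and finance.
It became well known through the "energy efficiency of a steam engine" diagram drawn by Matthew Henry Phineas Riall Sankey in 1898, and has been called the "Sankey diagram" after him ever since.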
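Loading the R packages and data
This article uses the clinical data of the LIHC cohort from TCGA for demonstration; you can prepare your own clinical data in the same format. You can also reply “R-桑基图” to the public account to get the example data and the R code.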
# install.packages("ggalluvial")
library(ggalluvial)
library(ggplot2)
library(dplyr)

# Read the LIHC clinical data
LIHC <- read.csv("TCGA_lihc.csv", header = TRUE)

# Preview the data
head(LIHC)
#>     PATIENT_ID    AGE    SEX AJCC_PATHOLOGIC_TUMOR_STAGE OS_STATUS
#> 1 TCGA-XR-A8TE less50   Male                   STAGE III    LIVING
#> 2 TCGA-5R-AA1D less50 Female                   STAGE III    LIVING
#> 3 TCGA-DD-A1EC less50 Female                     STAGE I    LIVING
#> 4 TCGA-ED-A7PY less50 Female                    STAGE II    LIVING
#> 5 TCGA-RC-A6M5 less50 Female                    STAGE IV    LIVING
#> 6 TCGA-DD-A1EH less50   Male                   STAGE III    LIVING

# Summarize each column
summary(LIHC)
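The data behind a Sankey diagram must carry node information (the categories shown on each axis) and weights (the size of each flow). ggalluvial accepts input in either long or wide format.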
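As a minimal sketch of the two layouts, the snippet below builds a small made-up table (the `toy_wide` data frame and its values are hypothetical, not taken from the LIHC data) and checks it with ggalluvial's `is_alluvia_form()` and `is_lodes_form()` helpers. It assumes a reasonably recent ggalluvial installation.

```r
library(ggalluvial)

# Hypothetical toy counts: one row per combination of categories
toy_wide <- data.frame(
  STAGE = c("I", "I", "II", "II"),
  SEX   = c("Female", "Male", "Female", "Male"),
  count = c(5, 3, 2, 4)
)

# Wide ("alluvia") format: each axis is a column, each row is one flow
wide_ok <- is_alluvia_form(toy_wide, axes = 1:2, weight = "count")

# Long ("lodes") format: one row per flow per axis
toy_long <- to_lodes_form(toy_wide, axes = 1:2,
                          key = "x", value = "stratum", id = "alluvium")
long_ok <- is_lodes_form(toy_long,
                         key = "x", value = "stratum", id = "alluvium",
                         weight = "count")

print(toy_long)
```

Both checks should return `TRUE` here; on real data they are a quick way to confirm the table is in a shape ggalluvial can plot before calling `ggplot()`.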
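Drawing Sankey diagrams
1 Wide-format example
Process the clinical data briefly to get the frequency of each combination of the last four variables, arranged as wide data. For the processing steps, see the earlier posts “数据处理|R-dplyr” (data processing with dplyr) and “数据处理|数据框重铸” (reshaping data frames).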
# Count the frequency of each group
# (the data frame read in above is LIHC; the original call referred to an undefined `data`)
LIHCData <- group_by(LIHC, AGE, SEX, AJCC_PATHOLOGIC_TUMOR_STAGE, OS_STATUS) %>%
  summarise(count = n())

# Inspect the wide-format data
head(LIHCData)
#>   AGE    SEX    AJCC_PATHOLOGIC_TUMOR_STAGE OS_STATUS count
#>   <fct>  <fct>  <fct>                       <fct>     <int>
#> 1 50to70 Female STAGE I                     DECEASED     11
#> 2 50to70 Female STAGE I                     LIVING       16
#> 3 50to70 Female STAGE II                    DECEASED      3
#> 4 50to70 Female STAGE II                    LIVING       11
#> 5 50to70 Female STAGE III                   DECEASED      8
#> 6 50to70 Female STAGE III                   LIVING        9

# Draw the Sankey diagram
ggplot(as.data.frame(LIHCData),
       aes(axis1 = AJCC_PATHOLOGIC_TUMOR_STAGE, axis2 = SEX, axis3 = AGE,
           y = count)) +
  scale_x_discrete(limits = c("AJCC_STAGE", "SEX", "AGE"), expand = c(.1, .05)) +
  geom_alluvium(aes(fill = OS_STATUS)) +
  geom_stratum() +
  # Older ggalluvial versions used label.strata = TRUE here;
  # recent versions expect after_stat(stratum) instead
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  theme_minimal() +
  ggtitle("Patients in the TCGA-LIHC cohort",
          "stratified by demographics and survival")
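- The `axis1`, `axis2`, `axis3` aesthetics set which variables are shown as nodes (the vertical bars, or strata);
- `geom_alluvium` draws the bands that connect the groups across axes; here they are colored by survival status (`OS_STATUS`).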
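For readers who do not have the LIHC file at hand, the same axis and `geom_alluvium` pattern can be tried on R's built-in `Titanic` table. This is only an illustrative stand-in for the clinical data; the variable names (`Class`, `Sex`, `Age`, `Survived`, `Freq`) come from that dataset, and `after_stat()` assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)
library(ggalluvial)

# Built-in contingency table converted to a wide data frame with a Freq column
titanic_wide <- as.data.frame(Titanic)

p <- ggplot(titanic_wide,
            aes(axis1 = Class, axis2 = Sex, axis3 = Age, y = Freq)) +
  # Name the three axes on the x scale
  scale_x_discrete(limits = c("Class", "Sex", "Age"), expand = c(.1, .05)) +
  # Bands between axes, colored by survival
  geom_alluvium(aes(fill = Survived)) +
  # Node bars and their labels
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  theme_minimal()

print(p)
```

Swapping the variable given to `fill` changes which grouping the bands are colored by, while the `axisN` aesthetics control which columns appear as bars and in what order.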
2 Long-format example
ggplot2 usually works with long tables; the `to_lodes_form` function converts wide data into that layout.
# to_lodes_form creates the alluvium and stratum columns;
# the axis names are stored in the column named by `key`
LIHC_long <- to_lodes_form(data.frame(LIHCData),
                           key = "Demographic",
                           axes = 1:3)
head(LIHC_long)
#>   OS_STATUS count alluvium Demographic stratum
#> 1  DECEASED    11        1         AGE  50to70
#> 2    LIVING    16        2         AGE  50to70
#> 3  DECEASED     3        3         AGE  50to70
#> 4    LIVING    11        4         AGE  50to70
#> 5  DECEASED     8        5         AGE  50to70
#> 6    LIVING     9        6         AGE  50to70

# Draw the Sankey diagram
ggplot(data = LIHC_long,
       aes(x = Demographic, stratum = stratum, alluvium = alluvium,
           y = count, label = stratum)) +
  geom_alluvium(aes(fill = OS_STATUS)) +
  geom_stratum() +
  geom_text(stat = "stratum") +
  theme_minimal() +
  ggtitle("Patients in the TCGA-LIHC cohort",
          "stratified by demographics and survival")
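To make the reshaping step more concrete, here is a small sketch of the wide-to-long conversion and back, again using the built-in `Titanic` table as a stand-in for the LIHC counts. The key name `"Demographic"` mirrors the code above; everything else is an illustrative assumption.

```r
library(ggalluvial)

titanic_wide <- as.data.frame(Titanic)   # 32 rows, one per combination

# Wide -> long: the first three columns become axes, so each original row
# is repeated once per axis and tagged with an alluvium id
titanic_long <- to_lodes_form(titanic_wide, key = "Demographic", axes = 1:3)

# Long -> wide: spread the axes back into separate columns
titanic_back <- to_alluvia_form(titanic_long,
                                key = "Demographic",
                                value = "stratum",
                                id = "alluvium")

nrow(titanic_wide)
nrow(titanic_long)
nrow(titanic_back)
```

The long table has one row per flow per axis, which is why its row count is the wide row count multiplied by the number of axes; columns that are not axes (here `Survived` and `Freq`) are carried along unchanged.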
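3 Trends in state changes
`vaccinations` is a dataset bundled with the ggalluvial package. It shows how the response of the same subject changes across different surveys.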
data(vaccinations)
levels(vaccinations$response) <- rev(levels(vaccinations$response))

ggplot(vaccinations,
       aes(x = survey, stratum = response, alluvium = subject,
           y = freq, fill = response, label = response)) +
  scale_x_discrete(expand = c(.1, .1)) +
  geom_flow() +
  geom_stratum(alpha = .5) +
  geom_text(stat = "stratum", size = 3) +
  theme(legend.position = "none") +
  ggtitle("vaccination survey responses at three points in time")
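Before plotting a dataset like this, it can help to confirm that it is already in long format and to look at the totals behind each bar. The sketch below does that with `is_lodes_form()` and base R's `aggregate()`; it assumes the column names `survey`, `response`, `subject`, and `freq` used in the plot above.

```r
library(ggalluvial)

data(vaccinations)

# Each subject should appear at most once per survey for a valid long table
lodes_ok <- is_lodes_form(vaccinations,
                          key = "survey", value = "response", id = "subject")

# Total weight per response within each survey: the heights of the strata
strata_totals <- aggregate(freq ~ survey + response,
                           data = vaccinations, FUN = sum)

lodes_ok
strata_totals
```

Since the bar heights are just these per-survey sums of `freq`, comparing them across surveys gives the same overall picture as the plot, while the flows add which subjects moved between responses.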
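4 More details
vignette(topic = "ggalluvial", package = "ggalluvial")
That concludes this brief introduction to drawing Sankey diagrams with the R package ggalluvial. Now you can try it on your own data 🤭.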