[toc]
Introduction
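cfDNA (cell-free DNA, also called circulating free DNA) refers to DNA fragments that circulate in the bloodstream. These fragments are not contained in any cell, hence the names "cell-free" and "free". cfDNA has many sources: it is released when normal cells and diseased cells (such as tumor cells) die and break down. cfDNA fragments are typically about 160-180 bp long, which matches the length of DNA protected by a nucleosome.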
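cfDNA research matters for non-invasive diagnosis, disease monitoring, early detection, and understanding physiological and pathological states. In oncology in particular, analyzing circulating tumor DNA (ctDNA), the fraction of cfDNA that originates from tumor cells, reveals the genetic makeup of a tumor and can guide cancer diagnosis, treatment selection, and monitoring of treatment response.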
cfDNAPro
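Main features:
- Data characterization: computes the overall, median, and modal fragment size distributions, the peaks and valleys in the fragment size profile, and the periodicity of its oscillations.
- Data visualization: provides functions to plot these data, covering overall and single-group fragment profiles, metrics, modes, and summaries.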
Demo
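1. Fragment length visualization
- Upper panel: the x-axis is fragment length, from 30 bp to 500 bp. The y-axis is the proportion of reads with a given length. The line is not a smoothed curve; it connects the individual data points with straight segments.
- Lower panel: count the reads whose length is at most 30 bp (say N) and normalize that count to a proportion of all reads. Repeat for every fragment length (30 bp, 31 bp, ..., 500 bp) to obtain the cumulative proportion, then draw it as a line plot. As in the non-cumulative plot, the line connects the data points and is not smoothed.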
    library(cfDNAPro)
    library(scales)
    library(ggpubr)
    library(ggplot2)
    library(dplyr)

    # Assumption: the original post never defines data_path. Here it points to
    # the example data bundled with cfDNAPro (as in the package vignette);
    # replace it with the directory that holds your own insert-size files.
    data_path <- examplePath("groups_picard")

    # Define a list for the groups/cohorts.
    grp_list <- list("cohort_1" = "cohort_1",
                     "cohort_2" = "cohort_2",
                     "cohort_3" = "cohort_3",
                     "cohort_4" = "cohort_4")

    # Generate the plots and store them in a list.
    result <- sapply(grp_list, function(x) {
      callSize(path = data_path) %>%
        dplyr::filter(group == as.character(x)) %>%
        plotSingleGroup()
    }, simplify = FALSE)
    #> setting default outfmt to df.
    #> setting default input_type to picard.
    #> setting default outfmt to df.
    #> setting default input_type to picard.
    #> setting default outfmt to df.
    #> setting default input_type to picard.
    #> setting default outfmt to df.
    #> setting default input_type to picard.

    # Arrange the plots in one figure.
    suppressWarnings(
      multiplex <- ggarrange(
        result$cohort_1$prop_plot + theme(axis.title.x = element_blank()),
        result$cohort_4$prop_plot + theme(axis.title = element_blank()),
        result$cohort_1$cdf_plot,
        result$cohort_4$cdf_plot + theme(axis.title.y = element_blank()),
        labels = c("Cohort 1 (n=5)", "Cohort 4 (n=4)"),
        label.x = 0.2,
        ncol = 2,
        nrow = 2))

    multiplex
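The two panels described above boil down to a per-size proportion and its running sum. The following base R sketch illustrates that arithmetic on simulated fragment lengths; the data, the variable names, and the 167 bp centre are assumptions made for illustration, and this is not how cfDNAPro itself is implemented.

```r
# Hypothetical input: simulated fragment lengths, not real cfDNA data.
set.seed(1)
frag_len <- round(rnorm(10000, mean = 167, sd = 20))

# Keep the 30-500 bp window used in the plots.
sizes <- 30:500
frag_len <- frag_len[frag_len >= min(sizes) & frag_len <= max(sizes)]

# Count reads at every size, including sizes with zero reads.
counts <- tabulate(frag_len - min(sizes) + 1, nbins = length(sizes))

# Upper panel: proportion of reads at each fragment size.
prop <- counts / sum(counts)

# Lower panel: cumulative proportion of reads with length <= each size.
cdf <- cumsum(prop)

profile <- data.frame(insert_size = sizes, prop = prop, cdf = cdf)
head(profile[profile$insert_size >= 165, ], 3)
```

Plotting `prop` and `cdf` against `insert_size` with straight line segments would reproduce the shape of the two panels for a single sample.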
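2. Comparing fragment length distributions
- callMetrics: computes the median fragment size distribution of each group.
- Upper panel: the median proportion of reads at each fragment size, per cohort. The y-axis is the proportion of reads and the x-axis is fragment size. The lines are not smoothed curves; they connect the individual data points.
- Lower panel: the median cumulative distribution function (CDF). The y-axis is the cumulative proportion and the x-axis is again fragment size. The curve rises monotonically and shows how reads accumulate as fragment size increases.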
    # Set an order for those groups (i.e. the levels of factors).
    order <- c("cohort_1", "cohort_2", "cohort_3", "cohort_4")

    # Generate plots.
    compare_grps <- callMetrics(data_path) %>%
      plotMetrics(order = order)
    #> setting default input_type to picard.

    # Modify plots.
    p1 <- compare_grps$median_prop_plot +
      ylim(c(0, 0.028)) +
      theme(axis.title.x = element_blank(),
            axis.title.y = element_text(size = 12, face = "bold")) +
      theme(legend.position = c(0.7, 0.5),
            legend.text = element_text(size = 11),
            legend.title = element_blank())

    p2 <- compare_grps$median_cdf_plot +
      scale_y_continuous(labels = scales::number_format(accuracy = 0.001)) +
      theme(axis.title = element_text(size = 12, face = "bold")) +
      theme(legend.position = c(0.7, 0.5),
            legend.text = element_text(size = 11),
            legend.title = element_blank())

    # Finalize plots.
    suppressWarnings(
      median_grps <- ggpubr::ggarrange(p1,
                                       p2,
                                       label.x = 0.3,
                                       ncol = 1,
                                       nrow = 2))

    median_grps
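To make the idea of a "median distribution" concrete, here is a minimal base R sketch of one plausible way to compute it: take the median across samples at each fragment size, for both the proportion and the per-sample CDF. The simulated cohorts and the helper `simulate_prop` are hypothetical, and the sketch does not claim to mirror the internals of `callMetrics`.

```r
# Hypothetical data: three simulated samples per cohort.
set.seed(2)
sizes <- 30:500

# Helper (illustrative only): proportion of reads at each size for one sample.
simulate_prop <- function(center) {
  len <- round(rnorm(5000, mean = center, sd = 20))
  len <- len[len >= min(sizes) & len <= max(sizes)]
  counts <- tabulate(len - min(sizes) + 1, nbins = length(sizes))
  counts / sum(counts)
}

# One column per sample, one row per fragment size.
cohort_a <- sapply(c(165, 167, 169), simulate_prop)
cohort_b <- sapply(c(150, 155, 160), simulate_prop)

# Median proportion across samples at each fragment size.
median_prop_a <- apply(cohort_a, 1, median)
median_prop_b <- apply(cohort_b, 1, median)

# Median CDF: accumulate within each sample first, then take the median per size.
median_cdf_a <- apply(apply(cohort_a, 2, cumsum), 1, median)
median_cdf_b <- apply(apply(cohort_b, 2, cumsum), 1, median)

# Fragment size at which each median profile is highest.
sizes[which.max(median_prop_a)]
sizes[which.max(median_prop_b)]
```

Note that a pointwise median of proportions does not have to sum to exactly 1, which is one reason to compute the median CDF from the per-sample CDFs instead of accumulating the median proportions.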
3. Visualizing modal fragment length
- Bar chart: the modal fragment size is the fragment length that occurs most often in a sample.
    # Set an order for your groups; it determines the group order along the x-axis.
    order <- c("cohort_1", "cohort_2", "cohort_3", "cohort_4")

    # Generate the mode bar chart.
    mode_bin <- callMode(data_path) %>%
      plotMode(order = order, hline = c(167, 111, 81))
    #> setting default mincount as 0.
    #> setting default input_type to picard.

    # Show the plot.
    suppressWarnings(print(mode_bin))
- Stacked bar chart: shows how the modal fragment sizes are distributed within each group.
    # Set an order for your groups; it determines the group order along the x-axis.
    order <- c("cohort_1", "cohort_2", "cohort_3", "cohort_4")

    # Generate the mode stacked bar chart. Use the 'mode_partition' argument to
    # specify how to stratify the modes. If modes other than the ones you
    # specified exist, an 'other' group is added to the plot.
    mode_stacked <- callMode(data_path) %>%
      plotModeSummary(order = order,
                      mode_partition = list(c(166, 167)))
    #> setting default input_type to picard.

    # Modify the plot using ggplot syntax.
    mode_stacked <- mode_stacked + theme(legend.position = "top")

    # Show the plot.
    suppressWarnings(print(mode_stacked))
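The two mode plots rest on two small computations: finding the most frequent fragment length in a sample, and grouping those modes into categories per cohort. The base R sketch below shows both on toy values; the helper `mode_size`, the sample table, and the class labels are hypothetical and only mimic the idea behind `mode_partition = list(c(166, 167))`.

```r
# Helper (illustrative only): the size with the highest count in one sample.
mode_size <- function(counts, sizes) sizes[which.max(counts)]

# Toy profile with a single maximum at 167 bp.
sizes <- 30:500
toy_counts <- dnorm(sizes, mean = 167, sd = 20)
mode_size(toy_counts, sizes)

# Hypothetical modal sizes for five samples in two cohorts.
samples <- data.frame(
  group = c("cohort_1", "cohort_1", "cohort_1", "cohort_2", "cohort_2"),
  mode  = c(167, 166, 145, 167, 111)
)

# Stratify the modes: 166-167 bp versus everything else.
samples$mode_class <- ifelse(samples$mode %in% c(166, 167), "166-167", "other")

# Share of samples in each mode class, per cohort (rows sum to 1).
share <- prop.table(table(samples$group, samples$mode_class), margin = 1)
share
```

Each row of `share` corresponds to one stacked bar: the segments are the fractions of samples in that cohort whose mode falls in each class.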
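4. Comparing fragmentation oscillation patterns
- Inter-peak distance: measure the distances between neighbouring peaks and compare them across cohorts to assess the 10 bp periodic oscillation pattern.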
    # Set an order for your groups; it determines the group order.
    order <- c("cohort_1", "cohort_2", "cohort_4", "cohort_3")

    # Plot and modify inter-peak distances.
    inter_peak_dist <- callPeakDistance(path = data_path, limit = c(50, 135)) %>%
      plotPeakDistance(order = order) +
      labs(y = "Fraction") +
      theme(axis.title = element_text(size = 12, face = "bold"),
            legend.title = element_blank(),
            legend.position = c(0.91, 0.5),
            legend.text = element_text(size = 11))
    #> setting the mincount to 0.
    #> setting the xlim to c(7,13).
    #> setting default outfmt to df.
    #> Setting default mincount to 0.
    #> setting default input_type to picard.

    # Show the plot.
    suppressWarnings(print(inter_peak_dist))
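- Inter-valley distance: whereas the inter-peak plot focuses on where read counts rise, the inter-valley plot focuses on where they dip. The two plots capture different features of the fragment size profile: peaks are local maxima of the frequency, and valleys are local minima.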
    # Set an order for your groups; it determines the group order.
    order <- c("cohort_1", "cohort_2", "cohort_4", "cohort_3")

    # Plot and modify inter-valley distances.
    inter_valley_dist <- callValleyDistance(path = data_path,
                                            limit = c(50, 135)) %>%
      plotValleyDistance(order = order) +
      labs(y = "Fraction") +
      theme(axis.title = element_text(size = 12, face = "bold"),
            legend.title = element_blank(),
            legend.position = c(0.91, 0.5),
            legend.text = element_text(size = 11))
    #> setting the mincount to 0.
    #> setting the xlim to c(7,13).
    #> setting default outfmt to df.
    #> setting the mincount to 0.
    #> setting default input_type to picard.

    # Show the plot.
    suppressWarnings(print(inter_valley_dist))
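Peaks and valleys as defined above are simply local maxima and minima of the size profile, and the inter-peak or inter-valley distance is the gap between neighbours. The sketch below demonstrates this in base R on a synthetic profile with an exact 10 bp oscillation; the profile and the sign-change detection are assumptions for illustration, not the peak-calling method used by `callPeakDistance` or `callValleyDistance`.

```r
# Synthetic profile over 50-135 bp with a 10 bp oscillation (illustrative only).
sizes <- 50:135
prop <- 0.003 + 0.001 * cos(2 * pi * (sizes - 81) / 10)

# First differences: positive where the profile rises, negative where it falls.
d <- diff(prop)

# An interior point is a peak if the profile rises into it and falls after it,
# and a valley if the profile falls into it and rises after it.
is_peak   <- c(FALSE, d[-length(d)] > 0 & d[-1] < 0, FALSE)
is_valley <- c(FALSE, d[-length(d)] < 0 & d[-1] > 0, FALSE)

peaks   <- sizes[is_peak]
valleys <- sizes[is_valley]

# Distances between neighbouring peaks and between neighbouring valleys.
inter_peak   <- diff(peaks)
inter_valley <- diff(valleys)

# Fraction of distances at each value, analogous to the y-axis "Fraction".
prop.table(table(inter_peak))
prop.table(table(inter_valley))
```

On real data the profile is noisy, so distances scatter around 10 bp instead of all being equal; that spread is what the cohort comparison visualizes.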
5. Customizing plots with ggplot2
    library(ggplot2)
    library(cfDNAPro)
    library(dplyr)  # provides %>%

    # Set the path to the example sample.
    exam_path <- examplePath("step6")

    # Calculate peaks and valleys.
    peaks <- callPeakDistance(path = exam_path)
    #> setting default limit to c(35,135).
    #> setting default outfmt to df.
    #> Setting default mincount to 0.
    #> setting default input_type to picard.
    valleys <- callValleyDistance(path = exam_path)
    #> setting default limit to c(35,135).
    #> setting default outfmt to df.
    #> setting the mincount to 0.
    #> setting default input_type to picard.

    # A line plot showing the fragmentation pattern of the example sample.
    exam_plot_all <- callSize(path = exam_path) %>%
      plotSingleGroup(vline = NULL)
    #> setting default outfmt to df.
    #> setting default input_type to picard.

    # Label peaks with dashed red lines and valleys with solid blue lines.
    # The original wrote `$prop`, which only works through partial matching of
    # the list element `prop_plot`; the full name is used here.
    exam_plot_prop <- exam_plot_all$prop_plot +
      coord_cartesian(xlim = c(90, 135), ylim = c(0, 0.0065)) +
      geom_vline(xintercept = peaks$insert_size,
                 colour = "red", linetype = "dashed") +
      geom_vline(xintercept = valleys$insert_size, colour = "blue")

    # Show the plot.
    suppressWarnings(print(exam_plot_prop))
    # Label peaks (blue) and valleys (red) with dots.
    # `prop_plot` is the full name of the element the original abbreviated as `$prop`.
    exam_plot_prop_dot <- exam_plot_all$prop_plot +
      coord_cartesian(xlim = c(90, 135), ylim = c(0, 0.0065)) +
      geom_point(data = peaks,
                 mapping = aes(x = insert_size, y = prop),
                 color = "blue", alpha = 0.5, size = 3) +
      geom_point(data = valleys,
                 mapping = aes(x = insert_size, y = prop),
                 color = "red", alpha = 0.5, size = 3)

    # Show the plot.
    suppressWarnings(print(exam_plot_prop_dot))
If you want to work with cfDNA, data characterization is the first step of the analysis.