CUT&Tag 数据处理和分析教程(8)

简介: CUT&Tag 数据处理和分析教程(8)

SEACR call Peak

SEACR(用于 CUT&RUN 的稀疏富集分析工具包)专为从染色质分析数据中识别峰值和富集区域而设计。这类数据通常背景信号极低(即某些区域完全没有读数覆盖),这在 CUT&Tag 染色质实验中尤为常见。 SEACR 以双末端测序生成的 bedGraph 文件为输入,将峰值定义为连续的碱基对覆盖区域,这些区域不会与 IgG 控制数据中标记的背景信号区域重叠。 SEACR 能有效识别转录因子结合位点形成的窄峰,以及某些组蛋白修饰特有的较宽广区域。由于已根据 大肠杆菌 读数对片段计数进行了归一化,因此在 SEACR 中将归一化选项设为“non”。如果未进行此类归一化,建议选择“norm”选项。

##== linux command ==##
seacr="/fh/fast/gottardo_r/yezheng_working/Software/SEACR/SEACR_1.3.sh"
histControl=$2
mkdir -p $projPath/peakCalling/SEACR

bash $seacr $projPath/alignment/bedgraph/${histName}_bowtie2.fragments.normalized.bedgraph \
     $projPath/alignment/bedgraph/${histControl}_bowtie2.fragments.normalized.bedgraph \
     non stringent $projPath/peakCalling/SEACR/${histName}_seacr_control.peaks

bash $seacr $projPath/alignment/bedgraph/${histName}_bowtie2.fragments.normalized.bedgraph 0.01 non stringent $projPath/peakCalling/SEACR/${histName}_seacr_top0.01.peaks

Peak 数目

##=== R command ===## 
peakN = c()
peakWidth = c()
peakType = c("control", "top0.01")
for(hist in sampleList){
  histInfo = strsplit(hist, "_")[[1]]
  if(histInfo[1] != "IgG"){
    for(type in peakType){
      peakInfo = read.table(paste0(projPath, "/peakCalling/SEACR/", hist, "_seacr_", type, ".peaks.stringent.bed"), header = FALSE, fill = TRUE)  %>% mutate(width = abs(V3-V2))
      peakN = data.frame(peakN = nrow(peakInfo), peakType = type, Histone = histInfo[1], Replicate = histInfo[2]) %>% rbind(peakN, .)
      peakWidth = data.frame(width = peakInfo$width, peakType = type, Histone = histInfo[1], Replicate = histInfo[2])  %>% rbind(peakWidth, .)
    }
  }
}
peakN %>% select(Histone, Replicate, peakType, peakN)

生物学重复性

通过对生物学重复数据集进行峰值识别并比较,确定哪些峰值具有可重复性。按照每个峰值区域的信号强度排序,选取前 1% 的峰值作为高可信度的位点。

##=== R command ===## 
histL = c("K27me3", "K4me3")
repL = paste0("rep", 1:2)
peakType = c("control", "top0.01")
peakOverlap = c()
for(type in peakType){
  for(hist in histL){
    overlap.gr = GRanges()
    for(rep in repL){
      peakInfo = read.table(paste0(projPath, "/peakCalling/SEACR/", hist, "_", rep, "_seacr_", type, ".peaks.stringent.bed"), header = FALSE, fill = TRUE)
      peakInfo.gr = GRanges(peakInfo$V1, IRanges(start = peakInfo$V2, end = peakInfo$V3), strand = "*")
      if(length(overlap.gr) >0){
        overlap.gr = overlap.gr[findOverlaps(overlap.gr, peakInfo.gr)@from]
      }else{
        overlap.gr = peakInfo.gr

      }
    }
    peakOverlap = data.frame(peakReprod = length(overlap.gr), Histone = hist, peakType = type) %>% rbind(peakOverlap, .)
  }
}

peakReprod = left_join(peakN, peakOverlap, by = c("Histone", "peakType")) %>% mutate(peakReprodRate = peakReprod/peakN * 100)
peakReprod %>% select(Histone, Replicate, peakType, peakN, peakReprodNum = peakReprod, peakReprodRate)

FRiPs(峰区片段比例)

通过计算峰区中的片段比例(FRiPs)来衡量信噪比,并将其与 IgG 对照数据集中的 FRiPs 进行对比,以便更直观地展示结果。尽管 CUT&Tag 实验的测序深度通常只有 100-500 万片段,但由于该方法的背景噪声较低,因此 FRiP 分数往往较高。

##=== R command ===## 
library(chromVAR)

bamDir = paste0(projPath, "/alignment/bam")
inPeakData = c()
## overlap with bam file to get count
for(hist in histL){
  for(rep in repL){
    peakRes = read.table(paste0(projPath, "/peakCalling/SEACR/", hist, "_", rep, "_seacr_control.peaks.stringent.bed"), header = FALSE, fill = TRUE)
    peak.gr = GRanges(seqnames = peakRes$V1, IRanges(start = peakRes$V2, end = peakRes$V3), strand = "*")
    bamFile = paste0(bamDir, "/", hist, "_", rep, "_bowtie2.mapped.bam")
    fragment_counts <- getCounts(bamFile, peak.gr, paired = TRUE, by_rg = FALSE, format = "bam")
    inPeakN = counts(fragment_counts)[,1] %>% sum
    inPeakData = rbind(inPeakData, data.frame(inPeakN = inPeakN, Histone = hist, Replicate = rep))
  }
}

frip = left_join(inPeakData, alignResult, by = c("Histone", "Replicate")) %>% mutate(frip = inPeakN/MappedFragNum_hg38 * 100)
frip %>% select(Histone, Replicate, SequencingDepth, MappedFragNum_hg38, AlignmentRate_hg38, FragInPeakNum = inPeakN, FRiPs = frip)

可视化分析

通过可视化展示峰的数量、峰的宽度、峰的可重复性以及 FRiPs(峰区片段比例)。

fig7A = peakN %>% ggplot(aes(x = Histone, y = peakN, fill = Histone)) +
    geom_boxplot() +
    geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
    facet_grid(~peakType) +
    scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.55, option = "magma", alpha = 0.8) +
    scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
    theme_bw(base_size = 18) +
    ylab("Number of Peaks") +
    xlab("")

fig7B = peakWidth %>% ggplot(aes(x = Histone, y = width, fill = Histone)) +
    geom_violin() +
    facet_grid(Replicate~peakType) +
    scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.55, option = "magma", alpha = 0.8) +
    scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
    scale_y_continuous(trans = "log", breaks = c(400, 3000, 22000)) +
    theme_bw(base_size = 18) +
    ylab("Width of Peaks") +
    xlab("")

fig7C = peakReprod %>% ggplot(aes(x = Histone, y = peakReprodRate, fill = Histone, label = round(peakReprodRate, 2))) +
    geom_bar(stat = "identity") +
    geom_text(vjust = 0.1) +
    facet_grid(Replicate~peakType) +
    scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.55, option = "magma", alpha = 0.8) +
    scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
    theme_bw(base_size = 18) +
    ylab("% of Peaks Reproduced") +
    xlab("")

fig7D = frip %>% ggplot(aes(x = Histone, y = frip, fill = Histone, label = round(frip, 2))) +
    geom_boxplot() +
    geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
    scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.55, option = "magma", alpha = 0.8) +
    scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
    theme_bw(base_size = 18) +
    ylab("% of Fragments in Peaks") +
    xlab("")

ggarrange(fig7A, fig7B, fig7C, fig7D, ncol = 2, nrow=2, common.legend = TRUE, legend="bottom")

相关文章
|
10天前
|
数据可视化 数据挖掘 数据处理
CUT&Tag 数据处理和分析教程(7)
CUT&Tag 数据处理和分析教程(7)
CUT&Tag 数据处理和分析教程(7)
|
21天前
|
数据处理
CUT&Tag 数据处理和分析教程(6)
CUT&Tag 数据处理和分析教程(6)
CUT&Tag 数据处理和分析教程(6)
|
2月前
|
数据处理
CUT&Tag 数据处理和分析教程(2)
CUT&Tag 数据处理和分析教程(2)
CUT&Tag 数据处理和分析教程(2)
|
9月前
LangChain 构建问题之定义extract_local_group_size工具如何解决
LangChain 构建问题之定义extract_local_group_size工具如何解决
58 0
|
12月前
|
存储 Serverless 定位技术
深度探索 Elasticsearch 8.X:function_score 参数解读与实战案例分析
深度探索 Elasticsearch 8.X:function_score 参数解读与实战案例分析
236 0
|
12月前
R语言VAR模型的不同类型的脉冲响应分析
R语言VAR模型的不同类型的脉冲响应分析
|
数据挖掘 数据处理
tidyverse|数据分析常规操作-分组汇总(sumamrise+group_by)
tidyverse|数据分析常规操作-分组汇总(sumamrise+group_by)
136 1
|
数据采集 监控 Java
网站流量日志分析--数据预处理-- 点击流模型visit编程实现思路| 学习笔记
快速学习网站流量日志分析--数据预处理--点击流模型 visit 编程实现思路
网站流量日志分析--数据预处理-- 点击流模型visit编程实现思路| 学习笔记
|
缓存 Java 数据挖掘
白话Elasticsearch50-深入聚合数据分析之doc values机制
白话Elasticsearch50-深入聚合数据分析之doc values机制
129 0
|
缓存 自然语言处理 数据挖掘
白话Elasticsearch50-深入聚合数据分析之基于doc values正排索引的聚合内部原理
白话Elasticsearch50-深入聚合数据分析之基于doc values正排索引的聚合内部原理
171 0
AI助理

你好,我是AI助理

可以解答问题、推荐解决方案等