Peak Calling
关于MACS2的使用方法, 我写了如何使用MACS进行peak calling详细地介绍了它的参数,在用MACS2之前尽量去阅读下。
尽管 文章说他按照默认参数分别找narrow peak 和 broad peak, 即"We used MACS2 with default settings to call narrow (or broad when necessary) R-loop peaks",但是 MACS2的默认参数一开始会根据reads在正反链的分布建立双峰模型,确定偏移模型(shifting model),也就是它会认为示例CHTF8和CIRH1A位于正反链的信号视作一个IP结果的信号,因此用默认参数绝对有问题,所以 必须要增加--nomodel
取消建模,且增加--shift 0 --extsize 150
按照实验的条带长度对片段进行延长。
cd analysis
# HKE293 D210N
macs2 callpeak --nomodel --shift 0 --extsize 150 -g hs --keep-dup all -f BAM -t 2-read-align/HKE293-D210N-V5ChIP-Rep*.flt.bam -c 2-read-align/HKE293-D210N-Input-Rep*.flt.bam --outdir 5-peak-calling/narrow -n HKE293-D210 2> log/HKE293-D210.macs2.narrow.log &
macs2 callpeak --nomodel --shift 0 --extsize 150 -g hs --keep-dup all --broad -f BAM -t 2-read-align/HKE293-D210N-V5ChIP-Rep*.flt.bam -c 2-read-align/HKE293-D210N-Input-Rep*.flt.bam --outdir 5-peak-calling/broad -n HKE293-D210 2> log/HKE293-D210.macs2.broad.log &
# HEK293 WKKD
macs2 callpeak --nomodel --shift 0 --extsize 150 -g hs --keep-dup all -f BAM -t 2-read-align/HKE293-WKKD-V5ChIP.flt.bam -c 2-read-align/HKE293-WKKD-Input.flt.bam --outdir 5-peak-calling/narrow/ -n HKE293-WKKD 2> log/HKE293-WKKD.macs2.narrow.log &
macs2 callpeak --nomodel --shift 0 --extsize 150 -g hs --keep-dup all --broad -f BAM -t 2-read-align/HKE293-WKKD-V5ChIP.flt.bam -c 2-read-align/HKE293-WKKD-Input.flt.bam --outdir 5-peak-calling/broad/ -n HKE293-WKKD 2> log/HKE293-WKKD.macs2.broad.log &
可以将建模和不建模的narrowPeak结果在IGV进行比较,你会发现之前的CHTF8和CIRH1A上的两个信号峰在默认参数下就被认为是成一个。
文章中对找到peak进行了一次筛选,标准是大于5倍富集和q-value 小于或等于0.001(broad peak则是0.0001),最后文章写着在D210N有12,906peak,然后剔除WKKD里的peak,还有12,507个。
我找到的HKE293-D210的原始narrowPeak数为17,558, 按照作者的标准筛选后只剩下9,552,。对于HKE2930-D210的原始broadPeak数为42,912,过滤之后只剩2,998了。隐隐觉得peak有点太少了,于是我的标准是4倍变化。
负对照的WKKD无论是narrowPeak还是broadPeak都是0
cd analysis/5-peak-calling/
fc=4
# narrow peak
awk -v fc=$fc '$7 >= fc && $9 >=3' narrow/HKE293-D210_peaks.narrowPeak > HKE293-D210_flt.narrowPeak
# broad peak
awk -v fc=$fc '$7 >= fc && $9 >=4' broad/HKE293-D210_peaks.broadPeak > HKE293-D210_flt.broadPeak
之后可以用过滤后的D210N的narrowPeak和broadPeak的peak长度进行描述性统计分析,然后用箱线图展示其大小分布。
library(data.table)
library(ggplot2)
narrowPeak <- fread(file="HKE293-D210_flt.narrowPeak",
sep="\t", header = F)
broadPeak <- fread(file="HKE293-D210_flt.broadPeak",
sep="\t", header = F)
peak_size <- log10(c(narrowPeak$V3 - narrowPeak$V2, broadPeak$V3 - broadPeak$V2 ))
peak_from <- factor(rep(c('Narrow','Broad'), times=c(nrow(narrowPeak),nrow(broadPeak)) ),
levels=c("Narrow","Broad"))
peak_df <- data.frame(size=peak_size, from=peak_from)
my_clear_theme = theme_bw() +
theme(panel.grid.major = element_line(colour="NA"),
panel.grid.minor = element_line(colour="NA"),
panel.background = element_rect(fill="NA"),
panel.border = element_rect(colour="black", fill=NA),
legend.background = element_blank())
ggplot(peak_df,aes(x=from,y=size,col=from)) +
geom_boxplot(notch = T,outlier.size = .5) +
scale_y_continuous(breaks=c(log10(100),log10(300),log10(400),log10(450),log10(500),log10(1000),log10(2000)),
labels = c(100,300,400,450,500,1000,2000)) +
coord_cartesian(ylim=c(2,3.5)) + labs(col="Peak Size") +
my_clear_theme + xlab("") + ylab("Peak Size (bp)")
ggsave("Fig2A_boxplot.pdf", width=4,height=6,units = "in")
原文说自己的narrow peak 长度的中位数是199bp, broad peak的长度中位数是318 bp,和电镜观察的150–500 bp一致。我过滤后的peak的中位数是208,broad peak的中位数是581。
如果你无脑用MACS2的参数话,就会得到右边的图,peak的长度明显都偏大了。
还可以对Broad peak 和Narrow peak进行比较,看看有多少共同peak和特异性peak。
#!/bin/bash
peak1=${1?first peak}
peak2=${2?second peak}
outdir=${3?output dir}
mkdir -p ${outdir}
bedtools intersect -a $peak1 -b $peak2 > ${outdir}/$(basename ${peak1%%.*}).common.peak
common=$(wc -l ${outdir}/$(basename ${peak1%%.*}).common.peak)
echo "$common"
bedtools subtract -A -a $peak1 -b $peak2 > ${outdir}/$(basename ${peak1}).only.bed
left=$(wc -l ${outdir}/$(basename ${peak1}).only.bed)
echo "$left"
bedtools subtract -A -b $peak1 -a $peak2 > ${outdir}/$(basename ${peak2}).only.bed
right=$(wc -l ${outdir}/$(basename ${peak2}).only.bed)
echo "$right"
运行:bash scripts/09.peak_comparison.sh analysis/5-peak-calling/HKE293-D210_flt.narrowPeak analysis/5-peak-calling/HKE293-D210_flt.broadPeak analysis/5-peak-calling/peak_compare > results/narrow_vs_broad.txt
然后得到的数值就可以丢到R语言画图了,这个结果和Figure S2A一致。
library(VennDiagram)
grid.newpage()
venn.plot <- VennDiagram::draw.pairwise.venn(area1=6401 + 7165,
area2=286 + 7165,
cross.area = 7165,
category = c("Narrow specific","Broad specific"),
scaled = F,
fill=c("#c7d0e7","#f4babf"),
lty='blank',
cat.pos = c(180,0),#360度划分
rotation.degree = 90 #整体旋转90
)
grid.draw(venn.plot)
我们可以对这三类peak对应区间内的信号进行一下比较。统计信号强度的工具是bigwigSummary
,来自于ucscGenomeBrowser工具集。
脚本为scripts/10.peak_signal_summary.sh
#!/bin/bash
bed=${1?bed file}
fwd=${2?forwad bigwig}
rvs=${3?reverse bigwig}
exec 0< $bed
while read region
do
chr=$( echo $region | cut -d ' ' -f 1)
sta=$( echo $region | cut -d ' ' -f 2)
end=$( echo $region | cut -d ' ' -f 3)
(bigWigSummary $fwd $chr $sta $end 1 ; bigWigSummary $rvs $chr $sta $end 1 )| awk -v out=0 '{out=out+$1} END{print out/2}'
done
分别统计三类peak的信号的信号
bash scripts/10.peak_signal_summary.sh analysis/5-peak-calling/peak_compare/HKE293-D210_flt.broadPeak.only.bed analysis/4-normliazed-bw/HKE293-D210N-V5ChIP_fwd.bw analysis/4-normliazed-bw/HKE293-D210N-V5ChIP_rvs.bw > results/HKE293-D210N-V5ChIP.broadSignal
bash scripts/10.peak_signal_summary.sh analysis/5-peak-calling/peak_compare/HKE293-D210_flt.narrowPeak.only.bed analysis/4-normliazed-bw/HKE293-D210N-V5ChIP_fwd.bw analysis/4-normliazed-bw/HKE293-D210N-V5ChIP_rvs.bw > results/HKE293-D210N-V5ChIP.narrowSignal
bash scripts/10.peak_signal_summary.sh analysis/5-peak-calling/peak_compare/HKE293-D210_flt.common.peak analysis/4-normliazed-bw/HKE293-D210N-V5ChIP_fwd.bw analysis/4-normliazed-bw/HKE293-D210N-V5ChIP_rvs.bw > results/HKE293-D210N-V5ChIP.commonSignal
在R语言中绘图展示
# peak signal
commonSignal <- fread("./HKE293-D210N-V5ChIP.commonSignal")
narrowSignal <- fread("./HKE293-D210N-V5ChIP.narrowSignal")
broadSignal <- fread("./HKE293-D210N-V5ChIP.broadSignal")
data <- data.frame(signal=c(commonSignal$V1, narrowSignal$V1, broadSignal$V1),
type=factor(rep(c("common","narrow","broad"),
times=c(nrow(commonSignal),nrow(narrowSignal), nrow(broadSignal)))) )
wilcox.test(data[data$type=="common",1],data[data$type=="broad",1])
wilcox.test(data[data$type=="common",1],data[data$type=="narrow",1])
p <- ggplot(data,aes(x=type,y=signal,col=type)) +
geom_boxplot(outlier.colour = "NA") +
coord_cartesian(ylim=c(0,3.5)) +
theme(panel.grid.major = element_line(colour="NA"),
panel.grid.minor = element_line(colour="NA"),
panel.background = element_rect(fill="NA"),
panel.border = element_rect(colour="black", fill=NA)) +
theme(axis.text.x = element_blank()) +
xlab("") + ylab("R-loop Signal") + labs(col="")
p1 <- p + annotate("segment", x=1,xend=1,y=0.75,yend=2,size=0.5) +
annotate("segment", x=2,xend=2,y=1.25,yend=2,size=0.5) +
annotate("segment", x=1,xend=2,y=2,yend=2,size=0.5) +
annotate("text", x= 1.5, y=2.1, label="***")
p2 <- p1 + annotate("segment", x=3,xend=3,y=0.65,yend=3,size=0.5) +
annotate("segment", x=2,xend=2,y=2.2,yend=3,size=0.5) +
annotate("segment", x=2,xend=3,y=3,yend=3,size=0.5) +
annotate("text", x= 2.5, y=3.2, label="***")
p2
看图说话: 无论是narrow peak 特异区间的信号,还是broad peak 特异区间的信号都显著性低于其共有区间的信号。