From the paper
Removing unwanted variation from large-scale RNA sequencing data with PRPS
The paper provides a large amount of data and code.
Link: GitHub - RMolania/TCGA_PanCancer_UnwantedVariation
This template requires the rmdformats R package.
R Markdown header content:
---
title: "Removing tumour purity, library size and batch effects from the TCGA breast cancer RNA-seq data using RUV-III-PRPS"
author:
- name: Ramyar Molania
  affiliation: Papenfuss Lab, Bioinformatics, WEHI.
  url: https://www.wehi.edu.au/people/tony-papenfuss
date: "15-02-2020"
output:
  rmdformats::readthedown:
    code_folding: hide
    gallery: yes
    highlight: tango
    lightbox: yes
    self_contained: yes
    thumbnails: no
    number_sections: yes
    toc_depth: 3
    use_bookdown: yes
  html_document2:
    df_print: paged
  html_document:
    toc_depth: '3'
    df_print: paged
params:
  update_date: !r paste("Last updated on:", Sys.Date())
editor_options:
  chunk_output_type: console
---
`r params$update_date`
<style type="text/css">
h1.title {
font-size: 28px;
color: DarkRed;
}
h1 { /* Header 1 */
font-size: 24px;
color: DarkBlue;
}
h2 { /* Header 2 */
font-size: 20px;
color: DarkBlue;
}
h3 { /* Header 3 */
font-size: 18px;
color: DarkBlue;
}
h4 { /* Header 4 */
font-size: 16px;
color: DarkBlue;
}
</style>
<style>
p.caption {
font-size: 46em;
font-style: italic;
color: black;
}
</style>
#```{r setup, include=FALSE}
knitr::opts_chunk$set(
  tidy = FALSE,
  fig.width = 10,
  message = FALSE,
  warning = FALSE
)
#```
# Introduction
Effective removal of unwanted variation is essential to derive meaningful biological results from RNA-seq data, particularly when the data come from large and complex studies. We have previously proposed a method, removing unwanted variation III (RUV-III), to normalize gene expression data [(R. Molania, NAR, 2019)](https://academic.oup.com/nar/article/47/12/6073/5494770?login=true). The RUV-III method requires well-designed technical replicates (well-distributed across sources of unwanted variation) and negative control genes to estimate known and unknown sources of unwanted variation and remove them from the data.\
We propose a novel strategy, pseudo-replicates of pseudo-samples (PRPS) [(R. Molania, bioRxiv, 2021)](https://www.biorxiv.org/content/10.1101/2021.11.01.466731v1), for deploying RUV-III to normalize RNA-seq data in situations where technical replicates are not available or are not well-designed. Our approach requires at least one **roughly** known biologically homogeneous subclass of samples that is present across the sources of unwanted variation. For example, in a cancer RNA-seq study where normal tissues are present across all sources of unwanted variation, we can use these samples to create PRPS.\
To create PRPS, we first need to identify the sources of unwanted variation in the data, which we call batches. Then the gene expression measurements of suitable biologically homogeneous sets of samples are averaged within batches, and the results are called pseudo-samples. Since the variation between pseudo-samples of the same biology in different batches is mainly unwanted variation, by defining them as pseudo-replicates and using them in RUV-III as replicates, we can easily and effectively remove the unwanted variation. We refer readers to our paper for more technical details [(R. Molania, bioRxiv, 2021)](https://www.biorxiv.org/content/10.1101/2021.11.01.466731v1).\
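
The pseudo-sample construction described above can be sketched in base R. This is a minimal illustration on simulated data, not the authors' implementation: the matrix `expr`, the labels `batch` and `biology`, and all object names are assumptions made for the example.

```r
# Toy log-expression matrix: 4 genes x 8 samples (simulated values).
set.seed(1)
expr <- matrix(
  rnorm(4 * 8),
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste0("s", 1:8))
)

# Hypothetical annotations: one batch label and one biological label per sample.
batch   <- rep(c("plate1", "plate2"), each = 4)
biology <- rep(c("normal", "tumour"), times = 4)

# Pseudo-samples: average the samples that share the same biology within a batch.
group <- paste(biology, batch, sep = "_")
pseudo_samples <- sapply(
  split(seq_len(ncol(expr)), group),
  function(idx) rowMeans(expr[, idx, drop = FALSE])
)

# Pseudo-replicate sets: pseudo-samples with the same biology across batches.
pseudo_replicate_set <- sub("_.*$", "", colnames(pseudo_samples))
```

Here `pseudo_samples` has one column per biology-by-batch combination, and `pseudo_replicate_set` marks which columns would be treated as replicates of each other. In a real analysis the averaging would typically be done on log-transformed counts, and only combinations with enough samples would be kept.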
Here, we use the TCGA invasive breast cancer (BRCA) RNA-seq data as an example to show how to remove tumour purity, flow cell chemistry, library size and batch effects (plate effects) from the data. We illustrate the value of our approach by comparing it to the standard TCGA normalizations on the TCGA BRCA RNA-seq data. Further, we demonstrate how unwanted variation can compromise several downstream analyses and can lead to wrong biological conclusions. We will also assess the performance of RUV-III with poorly chosen PRPS and in situations where biological labels are only partially known.\
Note that RUV-III with PRPS is not limited to TCGA data: it can be used for any large genomics project involving multiple labs, technicians, platforms, ...\
## Data preparation
The TCGA consortium aligned RNA sequencing reads to the hg38 reference genome using the STAR aligner and quantified the results at the gene level using HTSeq and the GENCODE v22 gene annotation [Ref](https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/). The TCGA RNA-seq data are publicly available in three formats: raw counts, FPKM and FPKM with upper-quartile normalization (FPKM.UQ). All these formats for individual cancer types (33 cancer types, ~ 11000 samples) were downloaded using the R/Bioconductor package (version 2.16.1). The TCGA normalized microarray gene expression data were downloaded from the Broad GDAC [Firehose](https://gdac.broadinstitute.org) repository, data version 2016/01/28. Tissue source sites (TSS) and sequencing-plate batches were extracted from individual TCGA [patient barcodes](https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/), and sample processing times were downloaded from the [MD Anderson Cancer Centre TCGA Batch Effects website](https://bioinformatics.mdanderson.org/public-software/tcga-batch-effects). Pathological features of cancer patients were downloaded from the Broad GDAC Firehose repository (https://gdac.broadinstitute.org). The details of processing the TCGA BRCA RNA-seq samples using two flow cell chemistries were received by personal communication from Dr. K Hoadley. The TCGA survival data reported by [Liu et al.](https://www.cell.com/cell/fulltext/S0092-8674(18)30229-0?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS0092867418302290%3Fshowall%3Dtrue) were used in this paper. The consensus measurement of purity estimation (CPE) values were downloaded from the [Aran et al.](https://www.nature.com/articles/ncomms9971) study.\
We have generated SummarizedExperiment objects for all the TCGA RNA-seq datasets. These datasets can be found here: [TCGA_PanCancerRNAseq](https://zenodo.org/record/6326542#.YimR0C8Rquo). Unwanted variation in all the datasets can be explored using an R Shiny application published in [(R. Molania, bioRxiv, 2021)](https://www.biorxiv.org/content/10.1101/2021.11.01.466731v1.article-metrics).\
All datasets that are required for this vignette can be found here [link](https://doi.org/10.5281/zenodo.6392171)
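
As a rough illustration of how tissue source site and plate identifiers can be read from TCGA barcodes, the sketch below splits aliquot barcodes on hyphens and takes the second field (TSS) and sixth field (plate), following the barcode layout documented by the GDC. The first barcode is the GDC documentation example; the second is purely illustrative, and the object names are assumptions.

```r
# TCGA aliquot barcode layout:
# project-TSS-participant-sample/vial-portion/analyte-plate-centre
barcodes <- c(
  "TCGA-02-0001-01C-01D-0182-01",  # example from the GDC barcode documentation
  "TCGA-A8-A07B-01A-11R-A034-07"   # illustrative barcode for this sketch
)

# Split each barcode into its seven hyphen-separated fields.
fields <- do.call(rbind, strsplit(barcodes, "-", fixed = TRUE))

batch_info <- data.frame(
  barcode = barcodes,
  tss     = fields[, 2],  # tissue source site
  plate   = fields[, 6],  # sequencing plate
  stringsAsFactors = FALSE
)
```

This assumes every barcode is a full aliquot-level barcode with seven fields; shorter patient-level or sample-level barcodes would need to be handled separately.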
# TCGA BRCA gene expression data
## RNA-seq data
We load the TCGA_SummarizedExperiment_HTseq_BRCA.rds file. This is a SummarizedExperiment object that contains:\
**assays:**\
-Raw counts\
-FPKM\
-FPKM.UQ\
**colData:**\
-Batch information\
-Clinical information (collected from different resources)\
**rowData:**\
-Genes' details (GC, chromosome, ...)\
-Several lists of housekeeping genes\
The lists of housekeeping genes might be suitable to use as negative control genes (NCG) for the RUV-III normalization.
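
A minimal sketch of loading and inspecting such an object is shown below. It assumes the file has been downloaded to the working directory and that the SummarizedExperiment package is installed; the path and object names are assumptions, and the code is skipped when either is missing.

```r
# Assumed local path to the file named in the text; adjust as needed.
se_file <- "TCGA_SummarizedExperiment_HTseq_BRCA.rds"

if (file.exists(se_file) &&
    requireNamespace("SummarizedExperiment", quietly = TRUE)) {
  brca_se <- readRDS(se_file)

  # Names of the stored assays (raw counts, FPKM and FPKM.UQ per the text).
  print(SummarizedExperiment::assayNames(brca_se))

  # Sample-level annotation: batch and clinical information.
  sample_info <- as.data.frame(SummarizedExperiment::colData(brca_se))

  # Gene-level annotation: gene details and housekeeping gene lists.
  gene_info <- as.data.frame(SummarizedExperiment::rowData(brca_se))

  print(dim(sample_info))
  print(dim(gene_info))
} else {
  message("File or SummarizedExperiment package not available; skipping.")
}
```

The exact assay and column names inside the object are not listed here, so it is worth printing them before selecting counts or candidate negative control genes.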
Result