Single-cell analysis | Using Seurat with multimodal data

Loading data

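The ability to measure multiple data types simultaneously in the same cell, known as multimodal analysis, is an exciting new frontier in single-cell genomics. For example, CITE-seq measures the transcriptome and the cell-surface proteins of the same cell at once. Other multimodal technologies, such as the 10x multiome kit, pair measurements of the cellular transcriptome with chromatin accessibility (i.e. scRNA-seq + scATAC-seq). Seurat supports seamless storage, analysis, and exploration of a wide range of multimodal single-cell datasets.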

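In this tutorial, we walk through a workflow for creating a multimodal Seurat object and performing an initial analysis. As an example, we show how to cluster a CITE-seq dataset based on the measured cellular transcriptomes and then find the cell-surface proteins enriched in each cluster. Note that Seurat also supports more advanced techniques for analyzing multimodal data, in particular our weighted nearest neighbor (WNN) approach, which clusters cells based on a weighted combination of both modalities.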

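Here, we analyze a dataset of 8,617 cord blood mononuclear cells (CBMCs), in which transcriptomic measurements are paired with abundance estimates for 13 surface proteins, whose levels are quantified with DNA-barcoded antibodies. First, we load two count matrices: one for the RNA measurements and one for the antibody-derived tags (ADT). The data can be downloaded from GEO (accession GSE100866, as in the file names below).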

library(Seurat)
library(ggplot2)
library(patchwork)

# Load in the RNA UMI matrix

# Note that this dataset also contains ~5% of mouse cells, which we can use as negative
# controls for the protein measurements. For this reason, the gene expression matrix has
# HUMAN_ or MOUSE_ prepended to each gene name.
cbmc.rna <- as.sparse(read.csv(file = "/brahms/shared/vignette-data/GSE100866_CBMC_8K_13AB_10X-RNA_umi.csv.gz",
    sep = ",", header = TRUE, row.names = 1))

# To make life a bit easier going forward, we're going to discard all but the top 100 most
# highly expressed mouse genes, and remove the 'HUMAN_' prefix from the human gene names
cbmc.rna <- CollapseSpeciesExpressionMatrix(cbmc.rna)

# Load in the ADT UMI matrix
cbmc.adt <- as.sparse(read.csv(file = "/brahms/shared/vignette-data/GSE100866_CBMC_8K_13AB_10X-ADT_umi.csv.gz",
    sep = ",", header = TRUE, row.names = 1))

# Note that since measurements were made in the same cells, the two matrices have identical
# column names
all.equal(colnames(cbmc.rna), colnames(cbmc.adt))
## [1] TRUE
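
The prefix handling described in the comments above can be illustrated on a toy matrix. The following is a minimal base R sketch, not Seurat's implementation of CollapseSpeciesExpressionMatrix: the helper collapse_species_sketch, its arguments, and the toy gene names and counts are all hypothetical. It keeps the human genes with the prefix stripped, and only the most highly expressed mouse genes as controls.

```r
# Toy count matrix: rows are genes with a species prefix, columns are cells.
counts <- matrix(
  c(5, 0, 3,
    2, 4, 1,
    9, 8, 7,
    0, 2, 0,
    1, 0, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("HUMAN_CD3E", "HUMAN_CD19", "MOUSE_Actb", "MOUSE_Gapdh", "MOUSE_Cd3e"),
    c("cell1", "cell2", "cell3")
  )
)

# Hypothetical helper (an assumption for illustration, not a Seurat function).
collapse_species_sketch <- function(mat, keep_prefix = "HUMAN_",
                                    control_prefix = "MOUSE_", n_controls = 2) {
  is_keep <- startsWith(rownames(mat), keep_prefix)
  is_control <- startsWith(rownames(mat), control_prefix)

  # Rank control-species genes by total counts and keep the top n_controls
  control_totals <- sort(rowSums(mat[is_control, , drop = FALSE]), decreasing = TRUE)
  top_controls <- names(control_totals)[seq_len(min(n_controls, length(control_totals)))]

  # Strip the prefix from the genes of the species of interest
  kept <- mat[is_keep, , drop = FALSE]
  rownames(kept) <- sub(paste0("^", keep_prefix), "", rownames(kept))

  rbind(kept, mat[top_controls, , drop = FALSE])
}

collapsed <- collapse_species_sketch(counts)
rownames(collapsed)
# CD3E, CD19, MOUSE_Actb, MOUSE_Gapdh
```

In the tutorial itself the number of retained mouse genes is 100 rather than 2; the small value here only keeps the toy example readable.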

Building the Seurat object

Now we create a Seurat object and add the ADT data as a second assay.

# creates a Seurat object based on the scRNA-seq data
cbmc <- CreateSeuratObject(counts = cbmc.rna)

# We can see that by default, the cbmc object contains an assay storing RNA measurement
Assays(cbmc)
## [1] "RNA"

# create a new assay to store ADT information
adt_assay <- CreateAssay5Object(counts = cbmc.adt)

# add this assay to the previously created Seurat object
cbmc[["ADT"]] <- adt_assay

# Validate that the object now contains multiple assays
Assays(cbmc)
## [1] "RNA" "ADT"

# Extract a list of features measured in the ADT assay
rownames(cbmc[["ADT"]])
##  [1] "CD3"    "CD4"    "CD8"    "CD45RA" "CD56"   "CD16"   "CD10"   "CD11c" 
##  [9] "CD14"   "CD19"   "CD34"   "CCR5"   "CCR7"


# Note that we can easily switch back and forth between the two assays to specify the default
# for visualization and analysis

# List the current default assay
DefaultAssay(cbmc)
## [1] "RNA"

# Switch the default to ADT
DefaultAssay(cbmc) <- "ADT"
DefaultAssay(cbmc)
## [1] "ADT"

Clustering

The steps below perform a quick clustering of the CBMCs based on the scRNA-seq data.

# Note that all operations below are performed on the RNA assay. Set and verify that the
# default assay is RNA
DefaultAssay(cbmc) <- "RNA"
DefaultAssay(cbmc)
## [1] "RNA"

# perform visualization and clustering steps
cbmc <- NormalizeData(cbmc)
cbmc <- FindVariableFeatures(cbmc)
cbmc <- ScaleData(cbmc)
cbmc <- RunPCA(cbmc, verbose = FALSE)
cbmc <- FindNeighbors(cbmc, dims = 1:30)
cbmc <- FindClusters(cbmc, resolution = 0.8, verbose = FALSE)
cbmc <- RunUMAP(cbmc, dims = 1:30)
DimPlot(cbmc, label = TRUE)

Data visualization

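Now that we have obtained clusters from the scRNA-seq profiles, we can visualize the expression of either protein or RNA molecules in the dataset. Importantly, Seurat provides several ways to switch between modalities and to specify which modality you want to analyze or visualize. This matters because, in some cases, the same feature is present in more than one modality. For example, this dataset contains independent measurements of the B cell marker CD19 at both the protein and the RNA level.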

# Normalize ADT data with the centered log-ratio (CLR) method
DefaultAssay(cbmc) <- "ADT"
cbmc <- NormalizeData(cbmc, normalization.method = "CLR", margin = 2)
DefaultAssay(cbmc) <- "RNA"

# Note that the following command is an alternative but returns the same result
cbmc <- NormalizeData(cbmc, normalization.method = "CLR", margin = 2, assay = "ADT")

# Now, we will visualize CD19 levels for RNA and protein. By setting the default assay, we can
# visualize one or the other
DefaultAssay(cbmc) <- "ADT"
p1 <- FeaturePlot(cbmc, "CD19", cols = c("lightgrey", "darkgreen")) + ggtitle("CD19 protein")
DefaultAssay(cbmc) <- "RNA"
p2 <- FeaturePlot(cbmc, "CD19") + ggtitle("CD19 RNA")

# place plots side-by-side
p1 | p2

# Alternatively, we can use assay keys to specify a modality. Identify the key
# for the RNA and protein assays
Key(cbmc[["RNA"]])
## [1] "rna_"

Key(cbmc[["ADT"]])
## [1] "adt_"

# Now, we can include the key in the feature name, which overrides the default assay
p1 <- FeaturePlot(cbmc, "adt_CD19", cols = c("lightgrey", "darkgreen")) + ggtitle("CD19 protein")
p2 <- FeaturePlot(cbmc, "rna_CD19") + ggtitle("CD19 RNA")
p1 | p2
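
The CLR normalization applied to the ADT assay earlier in this section can be sketched in base R. This is an illustrative sketch under assumptions, not a copy of Seurat's code: the helper clr_sketch and the toy counts are hypothetical, it uses a log1p-based variant of the centered log-ratio, and it assumes that margin = 2 means the transform is applied within each cell (column).

```r
# Toy ADT count matrix: rows are antibodies, columns are cells.
adt <- matrix(
  c(120,   3, 15,
     10, 200,  8,
      0,   5, 90),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("CD3", "CD19", "CD14"), c("cell1", "cell2", "cell3"))
)

# Hypothetical helper: log1p-based centered log-ratio for one vector of counts.
clr_sketch <- function(x) {
  # Geometric-mean-like scale factor computed from the positive counts
  scale <- exp(sum(log1p(x[x > 0])) / length(x))
  log1p(x / scale)
}

# MARGIN = 2 applies the transform to each column, i.e. within each cell.
adt_clr <- apply(adt, MARGIN = 2, FUN = clr_sketch)
round(adt_clr, 2)
```

Because each count is divided by a per-cell scale factor before the log transform, the normalized values express how abundant an antibody is relative to the other antibodies in the same cell, which reduces the effect of cell-to-cell differences in total ADT counts.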

Marker identification

We can use the paired CITE-seq measurements to help annotate the clusters derived from scRNA-seq and to identify both protein and RNA markers.

# as we know that CD19 is a B cell marker, we can identify cluster 6 as expressing CD19 on the
# surface
VlnPlot(cbmc, "adt_CD19")

# we can also identify alternative protein and RNA markers for this cluster through
# differential expression
adt_markers <- FindMarkers(cbmc, ident.1 = 6, assay = "ADT")
rna_markers <- FindMarkers(cbmc, ident.1 = 6, assay = "RNA")

head(adt_markers)
##                p_val avg_log2FC pct.1 pct.2     p_val_adj
## CD19   2.067533e-215  2.5741873     1     1 2.687793e-214
## CD45RA 8.108073e-109  0.5300346     1     1 1.054049e-107
## CD4    1.123162e-107 -1.6707420     1     1 1.460110e-106
## CD14   7.212876e-106 -1.0332070     1     1 9.376739e-105
## CD3     1.639633e-87 -1.5823056     1     1  2.131523e-86
## CCR5    2.552859e-63  0.3753989     1     1  3.318716e-62

head(rna_markers)
##       p_val avg_log2FC pct.1 pct.2 p_val_adj
## IGHM      0   6.660187 0.977 0.044         0
## CD79A     0   6.748356 0.965 0.045         0
## TCL1A     0   7.428099 0.904 0.028         0
## CD79B     0   5.525568 0.944 0.089         0
## IGHD      0   7.811884 0.857 0.015         0
## MS4A1     0   7.523215 0.851 0.016         0
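
By default, FindMarkers compares one cluster against all remaining cells using a Wilcoxon rank-sum test. The base R sketch below illustrates that kind of comparison for a single feature on simulated data. The values are made up, and the difference in means is only a stand-in for an effect size; it is not how Seurat computes avg_log2FC.

```r
set.seed(42)

# Simulated normalized expression of one feature in 60 cells. The first 20 cells
# stand for the cluster of interest and are drawn with a higher mean.
expr <- c(rnorm(20, mean = 3, sd = 0.5), rnorm(40, mean = 1, sd = 0.5))
in_cluster <- rep(c(TRUE, FALSE), times = c(20, 40))

# Wilcoxon rank-sum test: cluster of interest vs. all other cells
test <- wilcox.test(expr[in_cluster], expr[!in_cluster])

# Difference in means as a simple effect size (an assumption for illustration)
effect <- mean(expr[in_cluster]) - mean(expr[!in_cluster])

c(p_value = test$p.value, mean_difference = effect)
```

In a real analysis this test is repeated for every feature and the p-values are adjusted for multiple testing, which is what the p_val_adj column in the tables above reports.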

Additional visualizations

# Draw ADT scatter plots (like biaxial plots for FACS). Note that you can even 'gate' cells if
# desired by using HoverLocator and FeatureLocator
FeatureScatter(cbmc, feature1 = "adt_CD19", feature2 = "adt_CD3")

# view relationship between protein and RNA
FeatureScatter(cbmc, feature1 = "adt_CD3", feature2 = "rna_CD3E")

FeatureScatter(cbmc, feature1 = "adt_CD4", feature2 = "adt_CD8")

# Let's look at the raw (non-normalized) ADT counts. You can see the values are quite high,
# particularly in comparison to RNA values. This is due to the significantly higher protein
# copy number in cells, which significantly reduces 'drop-out' in ADT data
FeatureScatter(cbmc, feature1 = "adt_CD4", feature2 = "adt_CD8", slot = "counts")

Loading 10X multimodal data

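Seurat can also analyze data from multimodal 10X experiments processed with CellRanger v3. As an example, we recreate the plots above using a dataset of 7,900 peripheral blood mononuclear cells (PBMCs).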

pbmc10k.data <- Read10X(data.dir = "/brahms/shared/vignette-data/pbmc10k/filtered_feature_bc_matrix/")
rownames(x = pbmc10k.data[["Antibody Capture"]]) <- gsub(pattern = "_[control_]*TotalSeqB", replacement = "",
    x = rownames(x = pbmc10k.data[["Antibody Capture"]]))

pbmc10k <- CreateSeuratObject(counts = pbmc10k.data[["Gene Expression"]], min.cells = 3, min.features = 200)
pbmc10k <- NormalizeData(pbmc10k)
pbmc10k[["ADT"]] <- CreateAssayObject(pbmc10k.data[["Antibody Capture"]][, colnames(x = pbmc10k)])
pbmc10k <- NormalizeData(pbmc10k, assay = "ADT", normalization.method = "CLR")

plot1 <- FeatureScatter(pbmc10k, feature1 = "adt_CD19", feature2 = "adt_CD3", pt.size = 1)
plot2 <- FeatureScatter(pbmc10k, feature1 = "adt_CD4", feature2 = "adt_CD8a", pt.size = 1)
plot3 <- FeatureScatter(pbmc10k, feature1 = "adt_CD3", feature2 = "CD3E", pt.size = 1)
(plot1 + plot2 + plot3) & NoLegend()

plot <- FeatureScatter(cbmc, feature1 = "adt_CD19", feature2 = "adt_CD3") + NoLegend() + theme(axis.title = element_text(size = 18),
    legend.text = element_text(size = 18))
ggsave(filename = "../output/images/citeseq_plot.jpg", height = 7, width = 12, plot = plot, quality = 50)
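
The gsub call used to clean the antibody names in the 10X example can be tried on its own. The feature names below are hypothetical examples in the style of a 10X "Antibody Capture" matrix, not the actual row names of the dataset.

```r
# Hypothetical feature names (assumed for illustration)
adt_names <- c("CD3_TotalSeqB", "CD19_TotalSeqB", "IgG1_control_TotalSeqB")

# Same pattern as in the tutorial: "[control_]*" is a character class, so it matches
# any run of the characters c, o, n, t, r, l and "_", not only the literal "control_".
cleaned <- gsub(pattern = "_[control_]*TotalSeqB", replacement = "", x = adt_names)
cleaned
# CD3, CD19, IgG1

# A stricter alternative that only accepts an optional literal "_control" before the suffix
stricter <- sub(pattern = "(_control)?_TotalSeqB$", replacement = "", x = adt_names)
identical(cleaned, stricter)
```

For names like these, both patterns give the same result. The stricter version is only worth the extra characters if the feature names might contain other lowercase runs before the suffix.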
