Liger单细胞多组学分析:RNA-seq与ATAC-seq

除了整合差异相对较小的scRNA-seq数据集外,liger也能用于整合数据差异较大、甚至研究的对象完全不同的多组学。本文将以scRNA-seq和scATAC-seq为例,展示liger多组学整合的相关流程。scRNA-seq描述了细胞的转录组特征,而scATAC-seq则描述了细胞基因组开放性的特征,显然这两种组学特征之间存在着某种相关性,大胆一点我们可以认为整体上呈现正相关。通过scRNA-seq和scATAT-seq的整合分析,我们不光能够更加全面地描述细胞的特征,也能从开放性和RNA的角度上去推测可能的基因表达调控模式。尽管现在已经有了一些多组学测序的技术,例如10x的Single Cell Multiome ATAC Gene Expression,但更多的组学特征以及技术上仍有待进一步的突破,因此liger依然给我们提供了一个非常好的窗口帮助我们理解单细胞的生命特征。

对不同的数据集进行整合,数据集之间需要有一些共有的对象来作为feature,并且这些feature在不同的数据集之间应当存在着某种整体上的相关性(例如某一个基因越开放,则通常它的表达量就越高;某一个基因body上非CpG甲基化程度越高,则该基因通常就越处于沉默状态)。对于ATAC-seq等表观组的数据,我们通常使用peak来表征其在某一个位点上的强弱。尽管我们也可以统计各个基因上的peak数量来作为ATAC-seq的表达矩阵,但liger的作者认为这么做的效果可能并不理想,原因有以下三点。

  1. Call peak通常是用全部细胞来做的,而这种方式很容易埋没一些稀有细胞群的信号。
  2. 基因body上的开放性往往是比较弥散的,不像一些调控元件可以call出非常明显的peak,因此peaking calling的算法可能会丢失大量的信息。
  3. Peak本身就类似于显著差异的变化,因此大量的信息被丢失,本身稀疏的数据变得更加稀疏。

Liger作者发现,使用scATAC-seq在基因body和promoter上的reads数就能很容易地表征某个基因的整体开放性水平,promoter区域可以选择TSS上下游2kb的区域。本文的分析使用10x提供的公共测试数据,并使用10x提供的cellranger和cellranger-atac进行比对。scRNA-seq数据为健康供体的10k PBMC数据,scATAC-seq数据也是健康供体的10k PBMC数据。scATAC-seq比对所用的reference来自直接从10x官网上下载的现成的GRCh38参考基因组与注释,该reference所用的基因组注释文件为GENCODE的第28版GRCh38基因组注释,下载该注释文件后我们手工构建一个scRNA-seq的reference,从而最大程度地保证了scRNA-seq和scATAC-seq所用的注释文件的一致性。

  1. ############################################
  2. ## Project: Liger-learning
  3. ## Script Purpose: joint analysis of scRNA-seq and chromatin accessibility
  4. ## Data: 2020.10.24
  5. ## Author: Yiming Sun
  6. ############################################
  7.  
  8. #sleep
  9. ii <- 1
  10. while(1){
  11. cat(paste("round",ii),sep = "n")
  12. ii <- ii 1
  13. Sys.sleep(30)
  14. }
  15.  
  16. # general setting
  17. setwd('/data/User/sunym/project/liger_learning/')
  18. # library
  19. library(Seurat)
  20. library(ggplot2)
  21. library(dplyr)
  22. library(scibet)
  23. library(Matrix)
  24. library(tidyverse)
  25. library(cowplot)
  26. library(viridis)
  27. library(pheatmap)
  28. library(ComplexHeatmap)
  29. library(networkD3)
  30. library(liger)
  31. library(parallel)
  32. #my function
  33. MySuggestLambda <- function(object,lambda,k,knn_k = 20,thresh = 1e-04,max.iters = 100,k2 = 500,ref_dataset = NULL,resolution = 1,parl_num){
  34. res <- matrix(ncol = 2,nrow = length(lambda))
  35. colnames(res) <- c('lambda','alignment')
  36. res[,'lambda'] <- lambda
  37. cl <- makeCluster(parl_num)
  38. clusterExport(cl,c('object','k','knn_k','thresh','max.iters','k2','ref_dataset','resolution'),envir = environment())
  39. clusterExport(cl,c('optimizeALS','quantileAlignSNF','calcAlignment'),envir = .GlobalEnv)
  40. alignment_score <- parLapply(cl = cl,lambda,function(x){
  41. object <- optimizeALS(object,k = k,lambda = x,thresh = thresh,max.iters = max.iters)
  42. object <- quantileAlignSNF(object,knn_k = knn_k,k2 = k2,ref_dataset = ref_dataset,resolution = resolution)
  43. return(calcAlignment(object))
  44. })
  45. stopCluster(cl)
  46. res[,'alignment'] <- (alignment_score %>% unlist())
  47. res <- as.data.frame(res)
  48. res[,1] <- as.numeric(as.character(res[,1]))
  49. res[,2] <- as.numeric(as.character(res[,2]))
  50. p <- ggplot(res,aes(x = lambda,y = alignment)) geom_point() geom_line()
  51. return(list(res,p))
  52. }

scATAC-seq 数据准备

首先我们需要把GENCODE提供的gtf格式的基因组注释文件转换成bed格式的注释文件,再用bedmap等工具去寻找我们测序所得到的reads与基因组上的注释有哪些overlap。gtf文件的前五行为该注释文件的相关信息,需要先去掉。

  1. [[email protected] annotation]$ head -n 5 gencode.v28.annotation.gtf
  2. ##description: evidence-based annotation of the human genome (GRCh38), version 28 (Ensembl 92)
  3. ##provider: GENCODE
  4. ##contact: [email protected]
  5. ##format: gtf
  6. ##date: 2018-03-23

为了方便写成脚本,我就直接在R中调用system函数了,实际过程中我们也可以直接在linux终端进行操作。

  1. # 1.joint analysis of scRNA-seq and chromatin accessibility
  2. # pre-prapare scTATA data
  3. #sort by chromosome start position and end position for fragment, gene and promoter
  4. system('gunzip ./data/10k_pbmc/scATAC/fragments.tsv.gz')
  5. system('tail -n 6 ./data/10k_pbmc/annotation/gencode.v28.annotation.gtf > ./data/10k_pbmc/annotation/genes_split.gtf')
  6. genes <- read.table('./data/10k_pbmc/annotation/genes_split.gtf',sep = 't')

读入gtf文件后我们需要对条目进行筛选和整理。V1列表示所在的染色体,V4和V5分别表示了该片段在染色体上的起始位置,V7表示该片段位于正义链还是反义链,V9列则是对该片段名称的注释。我们筛选V3列为gene的条目,并对V9列进行整理。

Liger单细胞多组学分析:RNA-seq与ATAC-seq-图片1

  1. genes <- genes[genes[,3] == 'gene',]
  2. gene_name <- unlist(strsplit(as.character(genes[,9]),split = '; ',fixed = TRUE))
  3. gene_name <- gene_name[grepl(pattern = 'gene_name',gene_name)]
  4. gene_name <- sub(pattern = 'gene_name ',replacement = '',gene_name)
  5. genes[,9] <- gene_name
  6. genes <- genes[,c(1,4,5,9,6,7)]

Liger单细胞多组学分析:RNA-seq与ATAC-seq-图片2

这样我们就得到了bed类似格式的注释文件。同理我们也可以利用TSS得到promoter的位置信息,不过需要注意正义链和反义链。构建好基因和promoter的两个表格后,将其保存为bed文件即可。

  1. promoter <- as.data.frame(matrix(nrow = dim(genes)[1],ncol = dim(genes)[2]))
  2. promoter[,c(1,4,5,6)] <- genes[,c(1,4,5,6)]
  3. for (i in 1:dim(promoter)[1]) {
  4. if(promoter[i,6] == ' '){
  5. if ((as.numeric(genes[i,2])-2000) >= 0){
  6. promoter[i,2] <- as.numeric(genes[i,2])-2000
  7. }else{
  8. promoter[i,2] <- 0
  9. }
  10. promoter[i,3] <- as.numeric(genes[i,2]) 2000
  11. } else if(promoter[i,6] == '-'){
  12. if ((as.numeric(genes[i,3])-2000) >= 0){
  13. promoter[i,2] <- as.numeric(genes[i,3])-2000
  14. }else{
  15. promoter[i,2] <- 0
  16. }
  17. promoter[i,3] <- as.numeric(genes[i,3]) 2000
  18. }
  19. }
  20. write.table(genes,file = './data/10k_pbmc/annotation/genes.bed',sep = 't',col.names = FALSE,row.names = FALSE,quote = FALSE)
  21. write.table(promoter,file = './data/10k_pbmc/annotation/promoter.bed',sep = 't',col.names = FALSE,row.names = FALSE,quote = FALSE)

在用bedmap做overlap之前需要先对reads和bed文件进行排序,按照染色体、起始位置、结束位置三者的先后顺序进行排序。排序完成后即可用bedmap计算overlap。

  1. system('/data/User/sunym/software/BEDOPS/bin/sort-bed --max-mem 50G ./data/10k_pbmc/scATAC/fragments.tsv > ./data/10k_pbmc/scATAC/atac_fragments.sort.bed')
  2. system('/data/User/sunym/software/BEDOPS/bin/sort-bed --max-mem 50G ./data/10k_pbmc/annotation/genes.bed > ./data/10k_pbmc/annotation/genes.sort.bed')
  3. system('/data/User/sunym/software/BEDOPS/bin/sort-bed --max-mem 50G ./data/10k_pbmc/annotation/promoter.bed > ./data/10k_pbmc/annotation/promoters.sort.bed')
  4. #Use bedmap command to calculate overlapping elements between indexes and fragment output files
  5. system('/data/User/sunym/software/BEDOPS/bin/bedmap --ec --delim "t" --echo --echo-map-id ./data/10k_pbmc/annotation/promoters.sort.bed ./data/10k_pbmc/scATAC/atac_fragments.sort.bed > ./data/10k_pbmc/scATAC/atac_promoters_bc.bed')
  6. system('/data/User/sunym/software/BEDOPS/bin/bedmap --ec --delim "t" --echo --echo-map-id ./data/10k_pbmc/annotation/genes.sort.bed ./data/10k_pbmc/scATAC/atac_fragments.sort.bed > ./data/10k_pbmc/scATAC/atac_genes_bc.bed')

构建liger对象

读入bedmap计算overlap后的输出文件。

  1. #import the bedmap outputs into the R
  2. genes.bc <- read.table(file = "./data/10k_pbmc/scATAC/atac_genes_bc.bed", sep = "t", as.is = c(4,7), header = FALSE)
  3. promoters.bc <- read.table(file = "./data/10k_pbmc/scATAC/atac_promoters_bc.bed", sep = "t", as.is = c(4,7), header = FALSE)

Liger单细胞多组学分析:RNA-seq与ATAC-seq-图片3

可以看到V7列包括了所有与该基因有overlap的细胞的barcode。筛选至少覆盖1500个基因的细胞为有效的细胞,并与peak数据中过滤得到的barcode做overlap,将这部分细胞用于后续的分析。

  1. #load scATAC peak filted barcode data
  2. barcodes <- read.table(file = './data/10k_pbmc/scATAC/filtered_peak_bc_matrix/barcodes.tsv',sep = 't',as.is = TRUE,header = FALSE)$V1
  3. #filter the barcode with bad quality
  4. bc <- genes.bc[,7]
  5. bc_split <- strsplit(bc,";")
  6. bc_split_vec <- unlist(bc_split)
  7. bc_counts <- table(bc_split_vec)
  8. bc_filt <- names(bc_counts)[bc_counts > 1500]
  9. barcodes <- intersect(bc_filt,barcodes)

使用liger自带的makeFeatureMatrix函数对bedmap的输出结果进行处理,得到单个细胞在各个基因及其promoter上的reads数的表达矩阵,用该矩阵和scRNA-seq表达矩阵构建liger对象。

  1. #use LIGER's makeFeatureMatrix function to calculate accessibility counts for gene body and promoter
  2. gene.counts <- makeFeatureMatrix(genes.bc, barcodes)
  3. promoter.counts <- makeFeatureMatrix(promoters.bc, barcodes)
  4. gene.counts <- gene.counts[order(rownames(gene.counts)),]
  5. promoter.counts <- promoter.counts[order(rownames(promoter.counts)),]
  6. PBMCs <- gene.counts promoter.counts
  7. colnames(PBMCs)=paste0("PBMC_",colnames(PBMCs))
  8. #load scRNA-seq data
  9. PBMC.rna <- read10X(sample.dirs = './data/10k_pbmc/scRNA/filtered_feature_bc_matrix/',sample.names = 'rna')
  10. #create a LIGER object
  11. PBMC.data <- list(atac = PBMCs, rna = PBMC.rna)
  12. int.PBMC <- createLiger(PBMC.data)
  13. remove(list = ls()[ls() != 'int.PBMC'])
  14. gc()

对liger对象进行标准化等一系列前期操作,注意只使用RNA-seq数据来挑选variable gene。

  1. #normalize select gene and scale
  2. int.PBMC <- normalize(int.PBMC)
  3. int.PBMC <- selectGenes(int.PBMC, datasets.use = 2)
  4. int.PBMC <- scaleNotCenter(int.PBMC)

非负矩阵分解

选择合适的参数K和λ进行矩阵分解。

  1. #integrate NMF
  2. #suggest lambda
  3. suggestLambda(int.PBMC,lambda.test = seq(1,20,1),k = 25,num.cores = 10)

Liger单细胞多组学分析:RNA-seq与ATAC-seq-图片4

看这趋势似乎还在一路走高,一般建议λ的取值范围为0.25到60。经过后续的降维,我发现λ=10是个不错的选择。

  1. #suggest k
  2. suggestK(object = int.PBMC,k.test = seq(10,30,1),lambda = 10,num.cores = 10,plot.log2 = FALSE)

Liger单细胞多组学分析:RNA-seq与ATAC-seq-图片5

K在25左右就基本达到了饱和,因此我们选择K=25。根据我们后续的降维和聚类,这些值也都可以人为的进行调整,使得它尽可能地符合你的数据集,并让你的结果看起来“干净”。

  1. #lambda = 10,k = 25
  2. int.PBMC <- optimizeALS(int.PBMC,k = 25,lambda = 10,max.iters = 30,thresh = 1e-06)
  3. #Quantile Normalization and Joint Clustering
  4. int.PBMC <- quantile_norm(int.PBMC,knn_k = 20,quantiles = 50,min_cells = 20,do.center = FALSE,
  5. max_sample = 1000,refine.knn = TRUE,eps = 0.9)
  6. int.PBMC <- louvainCluster(int.PBMC, resolution = 0.4)

可以看下总共聚出了多少个cluster。

  1. > table([email protected])
  2.  
  3. 0 1 2 3 4 5 6 7 8 9 10 11
  4. 3360 2552 2484 2074 1976 1683 1629 1004 897 778 776 416
  5. 12 13 14 15 16 17
  6. 398 391 316 304 154 108

可视化

  • UMAP
  1. #Visualization and Downstream Analysis
  2. int.PBMC <- runUMAP(int.PBMC)
  3. all.plots <- plotByDatasetAndCluster(int.PBMC, axis.labels = c('UMAP 1', 'UMAP 2'),return.plots = TRUE)
  4. head(all.plots[[1]]$data)
  5. pdf(file = './res/fig_201112/plot_by_dataset_and_cluster.pdf',width = 8,height = 4)
  6. all.plots[[1]] all.plots[[2]]
  7. dev.off()

Liger单细胞多组学分析:RNA-seq与ATAC-seq-图片6

  • 差异表达分析
  1. #identify marker genes for each cluster
  2. int.PBMC.wilcoxon <- runWilcoxon(int.PBMC, data.use = 'all', compare.method = 'clusters')
  3. #filter the enriched gene yourself
  4. int.PBMC.wilcoxon <- int.PBMC.wilcoxon[int.PBMC.wilcoxon$padj < 0.05,]
  5. int.PBMC.wilcoxon <- int.PBMC.wilcoxon[int.PBMC.wilcoxon$logFC > 3,]
  6. #top 20 markers of cluster 3 and 4
  7. wilcoxon.cluster_3 <- int.PBMC.wilcoxon[int.PBMC.wilcoxon$group == 3, ]
  8. wilcoxon.cluster_3 <- wilcoxon.cluster_3[order(wilcoxon.cluster_3$padj), ]
  9. markers.cluster_3 <- wilcoxon.cluster_3[1:20, ]
  10. wilcoxon.cluster_4 <- int.PBMC.wilcoxon[int.PBMC.wilcoxon$group == 4, ]
  11. wilcoxon.cluster_4 <- wilcoxon.cluster_4[order(wilcoxon.cluster_4$padj), ]
  12. markers.cluster_4 <- wilcoxon.cluster_4[1:20, ]

具体看下cluster_3和cluster_4中有哪些基因(RNA和开放性)被富集到。

  1. > markers.cluster_3
  2. feature group avgExpr logFC statistic auc pval padj pct_in pct_out
  3. 93440 AC092580.4 3 -16.77217 5.369608 27507364 0.6898446 0 0 100 100
  4. 96414 CCL5 3 -11.02601 9.269639 32396409 0.8124547 0 0 100 100
  5. 96562 CD8A 3 -13.57582 6.735737 29898397 0.7498083 0 0 100 100
  6. 97532 CST7 3 -12.67376 7.845358 31058233 0.7788952 0 0 100 100
  7. 98887 CTSW 3 -14.43134 6.331951 28632810 0.7180692 0 0 100 100
  8. 99965 EOMES 3 -16.80980 4.705106 26695598 0.6694867 0 0 100 100
  9. 101646 GZMA 3 -14.08326 7.198335 29369425 0.7365424 0 0 100 100
  10. 101649 GZMK 3 -14.48547 7.763756 30122792 0.7554357 0 0 100 100
  11. 101650 GZMM 3 -13.68788 6.104824 28699408 0.7197393 0 0 100 100
  12. 102068 HOPX 3 -15.09229 5.371568 27605779 0.6923127 0 0 100 100
  13. 102524 IL32 3 -11.36195 6.950100 29875540 0.7492350 0 0 100 100
  14. 103205 KLRB1 3 -14.85939 6.264977 28100056 0.7047085 0 0 100 100
  15. 103213 KLRG1 3 -11.01856 7.183600 31721918 0.7955395 0 0 100 100
  16. 103361 LAG3 3 -18.79355 3.451362 24980309 0.6264698 0 0 100 100
  17. 103916 LYAR 3 -12.56974 5.956998 29655544 0.7437179 0 0 100 100
  18. 104130 MATK 3 -15.48198 5.632058 28084886 0.7043280 0 0 100 100
  19. 105193 NCR3 3 -17.23250 4.670144 26521096 0.6651105 0 0 100 100
  20. 105421 NKG7 3 -11.84455 8.689092 31538326 0.7909353 0 0 100 100
  21. 107166 PRF1 3 -13.84797 7.254678 30034053 0.7532103 0 0 100 100
  22. 107282 PRR5 3 -12.95011 4.999333 28880551 0.7242821 0 0 100 100
  23. > markers.cluster_4
  24. feature group avgExpr logFC statistic auc pval padj pct_in pct_out
  25. 124628 ADAM28 4 -14.686500 6.687493 28617138 0.7494493 0 0 100 100
  26. 125560 ARHGAP24 4 -13.124318 5.737005 28321370 0.7417034 0 0 100 100
  27. 126016 BANK1 4 -9.488431 9.653939 34707167 0.9089400 0 0 100 100
  28. 126063 BCL11A 4 -11.083597 7.253749 31052386 0.8132255 0 0 100 100
  29. 126149 BLK 4 -13.199351 7.485614 30191285 0.7906743 0 0 100 100
  30. 126152 BLNK 4 -16.619583 4.727523 25910558 0.6785671 0 0 100 100
  31. 127204 CD22 4 -11.490107 9.400583 32656222 0.8552281 0 0 100 100
  32. 127206 CD24 4 -17.198765 5.403821 26269448 0.6879660 0 0 100 100
  33. 127229 CD37 4 -7.768716 6.736971 33587469 0.8796164 0 0 100 100
  34. 127236 CD40 4 -15.849347 5.536785 27024648 0.7077438 0 0 100 100
  35. 127244 CD52 4 -9.103628 5.641797 30316353 0.7939497 0 0 100 100
  36. 127257 CD74 4 -6.196057 6.340909 34263853 0.8973301 0 0 100 100
  37. 127258 CD79A 4 -8.407126 12.678446 35987290 0.9424649 0 0 100 100
  38. 127259 CD79B 4 -9.626859 11.530288 34664105 0.9078122 0 0 100 100
  39. 127313 CDCA7L 4 -16.539789 4.886703 26014286 0.6812836 0 0 100 100
  40. 127941 COBLL1 4 -16.594804 4.437667 25557096 0.6693103 0 0 100 100
  41. 130946 FAM129C 4 -13.668283 6.205986 28893095 0.7566762 0 0 100 100
  42. 131192 FAU 4 -8.118877 4.119389 29235631 0.7656468 0 0 100 100
  43. 131273 FCER2 4 -14.723589 6.953663 28740433 0.7526782 0 0 100 100
  44. 131280 FCGR2B 4 -19.050128 3.596260 23898797 0.6258814 0 0 100 100
  45. >

cluster_3中富集到了CD8A,cluster_4中富集到了CD79A,感觉都是挺重要的基因,可以看下这两个基因的表达情况。

  1. #feature plot on umap
  2. # CD8A for cluster 2 and CD79A for cluster 4
  3. CD8A <- plotGene(int.PBMC, "CD8A", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = TRUE)
  4. CD79A <- plotGene(int.PBMC, "CD79A", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = TRUE)
  5. pdf(file = './res/fig_201112/marker_for_cluster_3_and_4.pdf',width = 9,height = 6)
  6. plot_grid(CD8A[[2]],CD79A[[2]],CD8A[[1]],CD79A[[1]], ncol=2)
  7. dev.off()

Liger单细胞多组学分析:RNA-seq与ATAC-seq-图片7

  • metagene

也可以从metagene的角度去探究基因的差异表达。

  1. #gene loading plot
  2. # gene loading plot of factor12
  3. all.plots <- plotGeneLoadings(int.PBMC, return.plots = TRUE,do.spec.plot = FALSE)
  4. pdf(file = './res/fig_201112/gene_loading_factor_18.pdf')
  5. all.plots[[18]]
  6. dev.off()

Liger单细胞多组学分析:RNA-seq与ATAC-seq-图片8

可以看到metagene_18在cluster_4中特异性地高loading,两个数据集中CD79A和CD79B都对metagene_18有很高的贡献,这与我们之前找到的差异表达基因相吻合。可以继续来看下不同数据集各自对metagene_18的贡献,如何导出gene loading plot背后的数据可以参考文章使用liger整合单细胞RNA-seq数据集

  1. # feature plot
  2. TCL1A <- plotGene(int.PBMC, "TCL1A", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = TRUE)
  3. RPS9 <- plotGene(int.PBMC, "RPS9", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = TRUE)
  4. pdf(file = './res/fig_201112/RPS9_TCL1A_plot_for_RNA_and_ATAC.pdf',width = 9,height = 6)
  5. plot_grid(RPS9[[2]],TCL1A[[2]],RPS9[[1]],TCL1A[[1]], ncol=2)
  6. dev.off()

Liger单细胞多组学分析:RNA-seq与ATAC-seq-图片9

可以看到RPS9作为一种核糖体蛋白在各种细胞中均有表达,而在分泌抗体的B细胞中这种表达显然会更高,这与RNA-seq的表达谱相一致。不过在ATAC-seq的表达谱中,B细胞中的RPS9的开放性似乎并不是最高的,说明控制RPS9转录的机制可能不止基因组的开放性这一种。TCL1A的转录组和ATAC-seq谱在都在B细胞中高表达,显示出了开放性与基因表达的整体正相关。

  • Pseudo peak

这也是liger非常有意思的一点,即它可以利用一个数据集的特征去预测另一个数据集的特征。想法也非常简单,数据集1中的细胞A周围有k个数据集2中的细胞,那么A的特征应该和这k个细胞很像,利用简单的平均我们就可以预测出A中数据集2特有的特征。在本例中,ATAC-seq的peak是数据集2(ATAC-seq数据集)特有的特征,那么我可以利用liger预测出数据集1(RNA-seq数据集)中细胞的peak特征,就像我们同时对一个细胞测定了RNA和ATAC一样,尽管这种ATAC是预测出来的。

首先我们需要先读入ATAC-seq的peak的数据,并对barcode进行过滤,使得它完全包含于我们之前使用reads构建的ATAC-seq表达谱中。

  1. # Pseudo single multi-omic data (scRNA-seq and scATAC-seq)
  2. #load scATAC peak data
  3. barcodes <- read.table(file = './data/10k_pbmc/scATAC/filtered_peak_bc_matrix/barcodes.tsv',sep = 't',as.is = TRUE,header = FALSE)$V1
  4. barcodes <- paste('PBMC',barcodes,sep = '_')
  5. peak.names <- read.table(file = './data/10k_pbmc/scATAC/filtered_peak_bc_matrix/peaks.bed', sep = 't', header = FALSE)
  6. peak.names <- paste0(peak.names$V1, ':', peak.names$V2, '-', peak.names$V3)
  7. pmat <- readMM(file = './data/10k_pbmc/scATAC/filtered_peak_bc_matrix/matrix.mtx')
  8. colnames(pmat) <- barcodes
  9. rownames(pmat) <- peak.names
  10. pmat <- pmat[,colnames([email protected]$atac)]

peak的个数还是相当多的,为了简化我们的运算量,我们需要从中挑选出不同的cluster之间差异表达的peak。

  1. > dim(pmat)
  2. [1] 94534 9642

虽然liger的官方教程说可以直接把pmat赋值给ATAC的表达矩阵,然后只用ATAC做差异表达分析即可,但我跑的时候又是各种奇怪的报错,所以变通一下,我们用seurat来构建ATAC的表达矩阵然后来做差异表达分析。

  1. #cell-type-specific accessibility
  2. #create seurat for pmat
  3. pmat_seurat <- CreateSeuratObject(counts = pmat,project = 'pmat')
  4. pmat_seurat <- NormalizeData(pmat_seurat, normalization.method = "LogNormalize", scale.factor = 10000)
  5. pmat_seurat <- ScaleData(pmat_seurat, features = rownames(pmat_seurat))
  6. cell_type <- [email protected]
  7. cell_type <- cell_type[colnames(pmat_seurat)]
  8. pmat_seurat$cell_type <- cell_type
  9. table(pmat_seurat$cell_type)
  10. Idents(pmat_seurat) <- 'cell_type'
  11. #find the DEPeak across the cluster
  12. pmat_marker <- FindAllMarkers(pmat_seurat, only.pos = TRUE, logfc.threshold = 0.25)
  13. pmat_marker <- pmat_marker[pmat_marker$p_val_adj < 0.05,]
  14. peaks.sel <- unique(pmat_marker$gene)

peaks.sel就是我们挑选出的在cluster之间差异表达的peak。我们用这组peak来过滤pmat(用于预测RNA-seq中的peak)以及pmat_seurat(用于画图)。

  1. #filter the pmat seurat object and scale the express
  2. pmat_seurat <- pmat_seurat[peaks.sel,]
  3. pmat_seurat <- NormalizeData(pmat_seurat, normalization.method = "LogNormalize", scale.factor = 10000)
  4. pmat_seurat <- ScaleData(pmat_seurat, features = rownames(pmat_seurat),do.center = FALSE)
  5. #filter the pmat object to impute the RNA-seq cell's ATAC peaks
  6. pmat <- pmat[peaks.sel,]
  7. int.PBMC.ds <- int.PBMC
  8. [email protected][['atac']] <- pmat
  9. int.PBMC.ds <- normalize(int.PBMC.ds)

使用ATAC-seq数据预测RNA-seq数据中的peak。

  1. #predict the accessibility profile for scRNA-seq data
  2. int.PBMC.ds <- imputeKNN(int.PBMC.ds, reference = 'atac',queries = 'rna',knn_k = 20,norm = TRUE,scale = FALSE)

预测后的数据就覆盖了原来的RNA-seq的表达谱。

  1. > [email protected]$rna[1:3,1:3]
  2. 3 x 3 sparse Matrix of class "dgCMatrix"
  3. rna_AAACCCAAGCGCCCAT rna_AAACCCAAGGTTCCGC
  4. chr6:111086981-111088830 . 0.0001584052
  5. chr17:82126355-82127872 0.0007113352 .
  6. chr8:19496551-19497709 . .
  7. rna_AAACCCACAGACAAGC
  8. chr6:111086981-111088830 4.491621e-05
  9. chr17:82126355-82127872 2.747334e-04
  10. chr8:19496551-19497709 1.823305e-04

现在我们就得到了单细胞RNA-seq数据集中细胞的转录组和(预测的)peak的特征,我们就可以去寻找基因表达与开放性元件的调控关系。想法也非常简单,我们可以计算一个基因的表达量与其上下游500kb范围内的peak强度的相关性。如果你希望扩大或者缩小这个上下游的范围,你可以用我修改过后的函数,peak_range以一个碱基为单位。

  1. MyLinkGenesAndPeaks <- function(gene_counts, peak_counts, genes.list = NULL, dist = "spearman",
  2. alpha = 0.05, path_to_coords, peak_range) {
  3. ## check dependency
  4. if (!requireNamespace("GenomicRanges", quietly = TRUE)) {
  5. stop("Package "GenomicRanges" needed for this function to work. Please install it by command:n",
  6. "BiocManager::install('GenomicRanges')",
  7. call. = FALSE
  8. )
  9. }
  10. if (!requireNamespace("IRanges", quietly = TRUE)) {
  11. stop("Package "IRanges" needed for this function to work. Please install it by command:n",
  12. "BiocManager::install('IRanges')",
  13. call. = FALSE
  14. )
  15. }
  16. ### make Granges object for peaks
  17. peak.names <- strsplit(rownames(peak_counts), "[:-]")
  18. chrs <- Reduce(append, lapply(peak.names, function(peak) {
  19. peak[1]
  20. }))
  21. chrs.start <- Reduce(append, lapply(peak.names, function(peak) {
  22. peak[2]
  23. }))
  24. chrs.end <- Reduce(append, lapply(peak.names, function(peak) {
  25. peak[3]
  26. }))
  27. peaks.pos <- GenomicRanges::GRanges(
  28. seqnames = chrs,
  29. ranges = IRanges::IRanges(as.numeric(chrs.start), end = as.numeric(chrs.end))
  30. )
  31. ### make Granges object for genes
  32. gene.names <- read.csv2(path_to_coords, sep = "t", header = FALSE, stringsAsFactors = F)
  33. gene.names <- gene.names[complete.cases(gene.names), ]
  34. genes.coords <- GenomicRanges::GRanges(
  35. seqnames = gene.names$V1,
  36. ranges = IRanges::IRanges(as.numeric(gene.names$V2), end = as.numeric(gene.names$V3))
  37. )
  38. names(genes.coords) <- gene.names$V4
  39. ### construct regnet
  40. gene_counts <- t(gene_counts) # cell x genes
  41. peak_counts <- t(peak_counts) # cell x genes
  42. # find overlap peaks for each gene
  43. if (missing(genes.list)) {
  44. genes.list <- colnames(gene_counts)
  45. }
  46. missing_genes <- !genes.list %in% names(genes.coords)
  47. if (sum(missing_genes)!=0){
  48. print(
  49. paste0(
  50. "Removing ", sum(missing_genes),
  51. " genes not found in given gene coordinates..."
  52. )
  53. )
  54. }
  55. genes.list <- genes.list[!missing_genes]
  56. genes.coords <- genes.coords[genes.list]
  57. print("Calculating correlation for gene-peak pairs...")
  58. each.len <- 0
  59. # assign('each.len', 0, envir = globalenv())
  60. elements <- lapply(seq(length(genes.list)), function(pos) {
  61. gene.use <- genes.list[pos]
  62. # re-scale the window for each gene
  63. gene.loci <- GenomicRanges::trim(suppressWarnings(GenomicRanges::promoters(GenomicRanges::resize(
  64. genes.coords[gene.use],
  65. width = 1, fix = "start"
  66. ),
  67. upstream = peak_range, downstream = peak_range
  68. )))
  69. peaks.use <- S4Vectors::queryHits(GenomicRanges::findOverlaps(peaks.pos, gene.loci))
  70. if ((x <- length(peaks.use)) == 0L) { # if no peaks in window, skip this iteration
  71. return(list(NULL, as.numeric(each.len), NULL))
  72. }
  73. ### compute correlation and p-adj for genes and peaks ###
  74. res <- suppressWarnings(psych::corr.test(
  75. x = gene_counts[, gene.use], y = as.matrix(peak_counts[, peaks.use]),
  76. method = dist, adjust = "holm", ci = FALSE, use = "complete"
  77. ))
  78. pick <- res[["p"]] < alpha # filter by p-value
  79. pick[is.na(pick)] <- FALSE
  80. if (sum(pick) == 0) { # if no peaks are important, skip this iteration
  81. return(list(NULL, as.numeric(each.len), NULL))
  82. }
  83. else {
  84. res.corr <- as.numeric(res[["r"]][pick])
  85. peaks.use <- peaks.use[pick]
  86. }
  87. # each.len <<- each.len length(peaks.use)
  88. assign('each.len', each.len length(peaks.use), envir = parent.frame(2))
  89. return(list(as.numeric(peaks.use), as.numeric(each.len), res.corr))
  90. })
  91. i_index <- Reduce(append, lapply(elements, function(ele) {
  92. ele[[1]]
  93. }))
  94. p_index <- c(0, Reduce(append, lapply(elements, function(ele) {
  95. ele[[2]]
  96. })))
  97. value_list <- Reduce(append, lapply(elements, function(ele) {
  98. ele[[3]]
  99. }))
  100. # make final sparse matrix
  101. regnet <- sparseMatrix(
  102. i = i_index, p = p_index, x = value_list,
  103. dims = c(ncol(peak_counts), length(genes.list)),
  104. dimnames = list(colnames(peak_counts), genes.list)
  105. )
  106. return(regnet)
  107. }

这个相关性的计算还挺漫长的,可以先去吃个饭2333。

  1. #find peaks and genes strongly corrected
  2. gmat = [email protected][['rna']]
  3. pseudo_pmat = [email protected][['rna']]
  4. regnet = MyLinkGenesAndPeaks(gene_counts = gmat, peak_counts = pseudo_pmat, dist = 'spearman', alpha = 0.05, path_to_coords = './data/10k_pbmc/annotation/genes.bed',peak_range = 500000)

对于我们关注的基因TCL1A,我们可以来看下有没有与其表达量高相关的peak。

  1. > which(regnet[,'TCL1A'] > 0.6)
  2. chr14:95456861-95457458
  3. 963
  4. > saveRDS(regnet,file = './res/fig_201112/regnet.rds')

我们成功找到了一个高相关的peak,相关性系数为0.73。为了进一步验证我们方法的可靠性,我们可以画出原始的peak的强度和RNA-seq的表达量。

  1. #convert to seurat and plot
  2. tsne.obj <- CreateDimReducObject(embeddings = [email protected][colnames(pmat_seurat),], key = "UMAP_",global = TRUE)
  3. pmat_seurat[['tsne']] <- tsne.obj
  4. TCL1A <- plotGene(int.PBMC, "TCL1A", axis.labels = c('UMAP_1', 'UMAP_2'), return.plots = TRUE)
  5. peak1 <- FeaturePlot(pmat_seurat,features = 'chr14:95456861-95457458',reduction = 'tsne')
  6. pdf(file = './res/fig_201112/high_related_peak_gene_dimplot.pdf',width = 9,height = 9)
  7. peak1[[1]] / TCL1A[[2]]
  8. dev.off()

Liger单细胞多组学分析:RNA-seq与ATAC-seq-图片10

确实在B细胞群中,RNA的表达和peak的强度呈现明显的正相关,同时我们也注意到另一群细胞中peak的强度很高,但RNA的表达量却几乎没有,这其中的基因表达调控机制或许也值得我们后续的研究和探讨。

完整代码

  1. ############################################
  2. ## Project: Liger-learning
  3. ## Script Purpose: joint analysis of scRNA-seq and chromatin accessibility
  4. ## Data: 2020.10.24
  5. ## Author: Yiming Sun
  6. ############################################
  7.  
  8. #sleep
  9. ii <- 1
  10. while(1){
  11. cat(paste("round",ii),sep = "n")
  12. ii <- ii 1
  13. Sys.sleep(30)
  14. }
  15.  
  16. # general setting
  17. setwd('/data/User/sunym/project/liger_learning/')
  18. # library
  19. library(Seurat)
  20. library(ggplot2)
  21. library(dplyr)
  22. library(scibet)
  23. library(Matrix)
  24. library(tidyverse)
  25. library(cowplot)
  26. library(viridis)
  27. library(pheatmap)
  28. library(ComplexHeatmap)
  29. library(networkD3)
  30. library(liger)
  31. library(parallel)
  32. #my function
  33. MySuggestLambda <- function(object,lambda,k,knn_k = 20,thresh = 1e-04,max.iters = 100,k2 = 500,ref_dataset = NULL,resolution = 1,parl_num){
  34. res <- matrix(ncol = 2,nrow = length(lambda))
  35. colnames(res) <- c('lambda','alignment')
  36. res[,'lambda'] <- lambda
  37. cl <- makeCluster(parl_num)
  38. clusterExport(cl,c('object','k','knn_k','thresh','max.iters','k2','ref_dataset','resolution'),envir = environment())
  39. clusterExport(cl,c('optimizeALS','quantileAlignSNF','calcAlignment'),envir = .GlobalEnv)
  40. alignment_score <- parLapply(cl = cl,lambda,function(x){
  41. object <- optimizeALS(object,k = k,lambda = x,thresh = thresh,max.iters = max.iters)
  42. object <- quantileAlignSNF(object,knn_k = knn_k,k2 = k2,ref_dataset = ref_dataset,resolution = resolution)
  43. return(calcAlignment(object))
  44. })
  45. stopCluster(cl)
  46. res[,'alignment'] <- (alignment_score %>% unlist())
  47. res <- as.data.frame(res)
  48. res[,1] <- as.numeric(as.character(res[,1]))
  49. res[,2] <- as.numeric(as.character(res[,2]))
  50. p <- ggplot(res,aes(x = lambda,y = alignment)) geom_point() geom_line()
  51. return(list(res,p))
  52. }
  53.  
  54. MyLinkGenesAndPeaks <- function(gene_counts, peak_counts, genes.list = NULL, dist = "spearman",
  55. alpha = 0.05, path_to_coords, peak_range) {
  56. ## check dependency
  57. if (!requireNamespace("GenomicRanges", quietly = TRUE)) {
  58. stop("Package "GenomicRanges" needed for this function to work. Please install it by command:n",
  59. "BiocManager::install('GenomicRanges')",
  60. call. = FALSE
  61. )
  62. }
  63. if (!requireNamespace("IRanges", quietly = TRUE)) {
  64. stop("Package "IRanges" needed for this function to work. Please install it by command:n",
  65. "BiocManager::install('IRanges')",
  66. call. = FALSE
  67. )
  68. }
  69. ### make Granges object for peaks
  70. peak.names <- strsplit(rownames(peak_counts), "[:-]")
  71. chrs <- Reduce(append, lapply(peak.names, function(peak) {
  72. peak[1]
  73. }))
  74. chrs.start <- Reduce(append, lapply(peak.names, function(peak) {
  75. peak[2]
  76. }))
  77. chrs.end <- Reduce(append, lapply(peak.names, function(peak) {
  78. peak[3]
  79. }))
  80. peaks.pos <- GenomicRanges::GRanges(
  81. seqnames = chrs,
  82. ranges = IRanges::IRanges(as.numeric(chrs.start), end = as.numeric(chrs.end))
  83. )
  84. ### make Granges object for genes
  85. gene.names <- read.csv2(path_to_coords, sep = "t", header = FALSE, stringsAsFactors = F)
  86. gene.names <- gene.names[complete.cases(gene.names), ]
  87. genes.coords <- GenomicRanges::GRanges(
  88. seqnames = gene.names$V1,
  89. ranges = IRanges::IRanges(as.numeric(gene.names$V2), end = as.numeric(gene.names$V3))
  90. )
  91. names(genes.coords) <- gene.names$V4
  92. ### construct regnet
  93. gene_counts <- t(gene_counts) # cell x genes
  94. peak_counts <- t(peak_counts) # cell x genes
  95. # find overlap peaks for each gene
  96. if (missing(genes.list)) {
  97. genes.list <- colnames(gene_counts)
  98. }
  99. missing_genes <- !genes.list %in% names(genes.coords)
  100. if (sum(missing_genes)!=0){
  101. print(
  102. paste0(
  103. "Removing ", sum(missing_genes),
  104. " genes not found in given gene coordinates..."
  105. )
  106. )
  107. }
  108. genes.list <- genes.list[!missing_genes]
  109. genes.coords <- genes.coords[genes.list]
  110. print("Calculating correlation for gene-peak pairs...")
  111. each.len <- 0
  112. # assign('each.len', 0, envir = globalenv())
  113. elements <- lapply(seq(length(genes.list)), function(pos) {
  114. gene.use <- genes.list[pos]
  115. # re-scale the window for each gene
  116. gene.loci <- GenomicRanges::trim(suppressWarnings(GenomicRanges::promoters(GenomicRanges::resize(
  117. genes.coords[gene.use],
  118. width = 1, fix = "start"
  119. ),
  120. upstream = peak_range, downstream = peak_range
  121. )))
  122. peaks.use <- S4Vectors::queryHits(GenomicRanges::findOverlaps(peaks.pos, gene.loci))
  123. if ((x <- length(peaks.use)) == 0L) { # if no peaks in window, skip this iteration
  124. return(list(NULL, as.numeric(each.len), NULL))
  125. }
  126. ### compute correlation and p-adj for genes and peaks ###
  127. res <- suppressWarnings(psych::corr.test(
  128. x = gene_counts[, gene.use], y = as.matrix(peak_counts[, peaks.use]),
  129. method = dist, adjust = "holm", ci = FALSE, use = "complete"
  130. ))
  131. pick <- res[["p"]] < alpha # filter by p-value
  132. pick[is.na(pick)] <- FALSE
  133. if (sum(pick) == 0) { # if no peaks are important, skip this iteration
  134. return(list(NULL, as.numeric(each.len), NULL))
  135. }
  136. else {
  137. res.corr <- as.numeric(res[["r"]][pick])
  138. peaks.use <- peaks.use[pick]
  139. }
  140. # each.len <<- each.len length(peaks.use)
  141. assign('each.len', each.len length(peaks.use), envir = parent.frame(2))
  142. return(list(as.numeric(peaks.use), as.numeric(each.len), res.corr))
  143. })
  144. i_index <- Reduce(append, lapply(elements, function(ele) {
  145. ele[[1]]
  146. }))
  147. p_index <- c(0, Reduce(append, lapply(elements, function(ele) {
  148. ele[[2]]
  149. })))
  150. value_list <- Reduce(append, lapply(elements, function(ele) {
  151. ele[[3]]
  152. }))
  153. # make final sparse matrix
  154. regnet <- sparseMatrix(
  155. i = i_index, p = p_index, x = value_list,
  156. dims = c(ncol(peak_counts), length(genes.list)),
  157. dimnames = list(colnames(peak_counts), genes.list)
  158. )
  159. return(regnet)
  160. }
  161. # 1.joint analysis of scRNA-seq and chromatin accessibility
  162. # pre-prapare scTATA data
  163. #sort by chromosome start position and end position for fragment, gene and promoter
  164. system('gunzip ./data/10k_pbmc/scATAC/fragments.tsv.gz')
  165. system('tail -n 6 ./data/10k_pbmc/annotation/gencode.v28.annotation.gtf > ./data/10k_pbmc/annotation/genes_split.gtf')
  166. genes <- read.table('./data/10k_pbmc/annotation/genes_split.gtf',sep = 't')
  167. genes <- genes[genes[,3] == 'gene',]
  168. gene_name <- unlist(strsplit(as.character(genes[,9]),split = '; ',fixed = TRUE))
  169. gene_name <- gene_name[grepl(pattern = 'gene_name',gene_name)]
  170. gene_name <- sub(pattern = 'gene_name ',replacement = '',gene_name)
  171. genes[,9] <- gene_name
  172. genes <- genes[,c(1,4,5,9,6,7)]
  173. promoter <- as.data.frame(matrix(nrow = dim(genes)[1],ncol = dim(genes)[2]))
  174. promoter[,c(1,4,5,6)] <- genes[,c(1,4,5,6)]
  175. for (i in 1:dim(promoter)[1]) {
  176. if(promoter[i,6] == ' '){
  177. if ((as.numeric(genes[i,2])-2000) >= 0){
  178. promoter[i,2] <- as.numeric(genes[i,2])-2000
  179. }else{
  180. promoter[i,2] <- 0
  181. }
  182. promoter[i,3] <- as.numeric(genes[i,2]) 2000
  183. } else if(promoter[i,6] == '-'){
  184. if ((as.numeric(genes[i,3])-2000) >= 0){
  185. promoter[i,2] <- as.numeric(genes[i,3])-2000
  186. }else{
  187. promoter[i,2] <- 0
  188. }
  189. promoter[i,3] <- as.numeric(genes[i,3]) 2000
  190. }
  191. }
  192. write.table(genes,file = './data/10k_pbmc/annotation/genes.bed',sep = 't',col.names = FALSE,row.names = FALSE,quote = FALSE)
  193. write.table(promoter,file = './data/10k_pbmc/annotation/promoter.bed',sep = 't',col.names = FALSE,row.names = FALSE,quote = FALSE)
  194. system('/data/User/sunym/software/BEDOPS/bin/sort-bed --max-mem 50G ./data/10k_pbmc/scATAC/fragments.tsv > ./data/10k_pbmc/scATAC/atac_fragments.sort.bed')
  195. system('/data/User/sunym/software/BEDOPS/bin/sort-bed --max-mem 50G ./data/10k_pbmc/annotation/genes.bed > ./data/10k_pbmc/annotation/genes.sort.bed')
  196. system('/data/User/sunym/software/BEDOPS/bin/sort-bed --max-mem 50G ./data/10k_pbmc/annotation/promoter.bed > ./data/10k_pbmc/annotation/promoters.sort.bed')
  197. #Use bedmap command to calculate overlapping elements between indexes and fragment output files
  198. system('/data/User/sunym/software/BEDOPS/bin/bedmap --ec --delim "t" --echo --echo-map-id ./data/10k_pbmc/annotation/promoters.sort.bed ./data/10k_pbmc/scATAC/atac_fragments.sort.bed > ./data/10k_pbmc/scATAC/atac_promoters_bc.bed')
  199. system('/data/User/sunym/software/BEDOPS/bin/bedmap --ec --delim "t" --echo --echo-map-id ./data/10k_pbmc/annotation/genes.sort.bed ./data/10k_pbmc/scATAC/atac_fragments.sort.bed > ./data/10k_pbmc/scATAC/atac_genes_bc.bed')
  200.  
  201.  
  202.  
  203.  
  204. # load data
  205. #load scATAC peak filted barcode data
  206. barcodes <- read.table(file = './data/10k_pbmc/scATAC/filtered_peak_bc_matrix/barcodes.tsv',sep = 't',as.is = TRUE,header = FALSE)$V1
  207. #import the bedmap outputs into the R
  208. genes.bc <- read.table(file = "./data/10k_pbmc/scATAC/atac_genes_bc.bed", sep = "t", as.is = c(4,7), header = FALSE)
  209. promoters.bc <- read.table(file = "./data/10k_pbmc/scATAC/atac_promoters_bc.bed", sep = "t", as.is = c(4,7), header = FALSE)
  210. #filter the barcode with bad quality
  211. bc <- genes.bc[,7]
  212. bc_split <- strsplit(bc,";")
  213. bc_split_vec <- unlist(bc_split)
  214. bc_counts <- table(bc_split_vec)
  215. bc_filt <- names(bc_counts)[bc_counts > 1500]
  216. barcodes <- intersect(bc_filt,barcodes)
  217. #use LIGER's makeFeatureMatrix function to calculate accessibility counts for gene body and promoter
  218. gene.counts <- makeFeatureMatrix(genes.bc, barcodes)
  219. promoter.counts <- makeFeatureMatrix(promoters.bc, barcodes)
  220. gene.counts <- gene.counts[order(rownames(gene.counts)),]
  221. promoter.counts <- promoter.counts[order(rownames(promoter.counts)),]
  222. PBMCs <- gene.counts promoter.counts
  223. colnames(PBMCs)=paste0("PBMC_",colnames(PBMCs))
  224. #load scRNA-seq data
  225. PBMC.rna <- read10X(sample.dirs = './data/10k_pbmc/scRNA/filtered_feature_bc_matrix/',sample.names = 'rna')
  226. #create a LIGER object
  227. PBMC.data <- list(atac = PBMCs, rna = PBMC.rna)
  228. int.PBMC <- createLiger(PBMC.data)
  229. remove(list = ls()[ls() != 'int.PBMC'])
  230. gc()
  231. #normalize select gene and scale
  232. int.PBMC <- normalize(int.PBMC)
  233. int.PBMC <- selectGenes(int.PBMC, datasets.use = 2)
  234. int.PBMC <- scaleNotCenter(int.PBMC)
  235. #integrate NMF
  236. #suggest lambda
  237. suggestLambda(int.PBMC,lambda.test = seq(1,20,1),k = 25,num.cores = 10)
  238. #suggest k
  239. suggestK(object = int.PBMC,k.test = seq(10,30,1),lambda = 10,num.cores = 10,plot.log2 = FALSE)
  240. #lambda = 10,k = 25
  241. int.PBMC <- optimizeALS(int.PBMC,k = 25,lambda = 10,max.iters = 30,thresh = 1e-06)
  242. #Quantile Normalization and Joint Clustering
  243. int.PBMC <- quantile_norm(int.PBMC,knn_k = 20,quantiles = 50,min_cells = 20,do.center = FALSE,
  244. max_sample = 1000,refine.knn = TRUE,eps = 0.9)
  245. int.PBMC <- louvainCluster(int.PBMC, resolution = 0.4)
  246. table([email protected])
  247. #Visualization and Downstream Analysis
  248. int.PBMC <- runUMAP(int.PBMC)
  249. all.plots <- plotByDatasetAndCluster(int.PBMC, axis.labels = c('UMAP 1', 'UMAP 2'),return.plots = TRUE)
  250. head(all.plots[[1]]$data)
  251. pdf(file = './res/fig_201112/plot_by_dataset_and_cluster.pdf',width = 8,height = 4)
  252. all.plots[[1]] all.plots[[2]]
  253. dev.off()
  254. #identify marker genes for each cluster
  255. int.PBMC.wilcoxon <- runWilcoxon(int.PBMC, data.use = 'all', compare.method = 'clusters')
  256. #filter the enriched gene yourself
  257. int.PBMC.wilcoxon <- int.PBMC.wilcoxon[int.PBMC.wilcoxon$padj < 0.05,]
  258. int.PBMC.wilcoxon <- int.PBMC.wilcoxon[int.PBMC.wilcoxon$logFC > 3,]
  259. #top 20 markers of cluster 3 and 4
  260. wilcoxon.cluster_3 <- int.PBMC.wilcoxon[int.PBMC.wilcoxon$group == 3, ]
  261. wilcoxon.cluster_3 <- wilcoxon.cluster_3[order(wilcoxon.cluster_3$padj), ]
  262. markers.cluster_3 <- wilcoxon.cluster_3[1:20, ]
  263. wilcoxon.cluster_4 <- int.PBMC.wilcoxon[int.PBMC.wilcoxon$group == 4, ]
  264. wilcoxon.cluster_4 <- wilcoxon.cluster_4[order(wilcoxon.cluster_4$padj), ]
  265. markers.cluster_4 <- wilcoxon.cluster_4[1:20, ]
  266. #feature plot on umap
  267. # CD8A for cluster 2 and CD79A for cluster 4
  268. CD8A <- plotGene(int.PBMC, "CD8A", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = TRUE)
  269. CD79A <- plotGene(int.PBMC, "CD79A", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = TRUE)
  270. pdf(file = './res/fig_201112/marker_for_cluster_3_and_4.pdf',width = 9,height = 6)
  271. plot_grid(CD8A[[2]],CD79A[[2]],CD8A[[1]],CD79A[[1]], ncol=2)
  272. dev.off()
  273. #gene loading plot
  274. # gene loading plot of factor12
  275. all.plots <- plotGeneLoadings(int.PBMC, return.plots = TRUE,do.spec.plot = FALSE)
  276. pdf(file = './res/fig_201112/gene_loading_factor_18.pdf')
  277. all.plots[[18]]
  278. dev.off()
  279.  
  280. # feature plot
  281. TCL1A <- plotGene(int.PBMC, "TCL1A", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = TRUE)
  282. RPS9 <- plotGene(int.PBMC, "RPS9", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = TRUE)
  283. pdf(file = './res/fig_201112/RPS9_TCL1A_plot_for_RNA_and_ATAC.pdf',width = 9,height = 6)
  284. plot_grid(RPS9[[2]],TCL1A[[2]],RPS9[[1]],TCL1A[[1]], ncol=2)
  285. dev.off()
  286.  
  287.  
  288. # Pseudo single multi-omic data (scRNA-seq and scATAC-seq)
  289. #load scATAC peak data
  290. barcodes <- read.table(file = './data/10k_pbmc/scATAC/filtered_peak_bc_matrix/barcodes.tsv',sep = 't',as.is = TRUE,header = FALSE)$V1
  291. barcodes <- paste('PBMC',barcodes,sep = '_')
  292. peak.names <- read.table(file = './data/10k_pbmc/scATAC/filtered_peak_bc_matrix/peaks.bed', sep = 't', header = FALSE)
  293. peak.names <- paste0(peak.names$V1, ':', peak.names$V2, '-', peak.names$V3)
  294. pmat <- readMM(file = './data/10k_pbmc/scATAC/filtered_peak_bc_matrix/matrix.mtx')
  295. colnames(pmat) <- barcodes
  296. rownames(pmat) <- peak.names
  297. pmat <- pmat[,colnames([email protected]$atac)]
  298. #cell-type-specific accessibility
  299. #create seurat for pmat
  300. pmat_seurat <- CreateSeuratObject(counts = pmat,project = 'pmat')
  301. pmat_seurat <- NormalizeData(pmat_seurat, normalization.method = "LogNormalize", scale.factor = 10000)
  302. pmat_seurat <- ScaleData(pmat_seurat, features = rownames(pmat_seurat))
  303. cell_type <- [email protected]
  304. cell_type <- cell_type[colnames(pmat_seurat)]
  305. pmat_seurat$cell_type <- cell_type
  306. table(pmat_seurat$cell_type)
  307. Idents(pmat_seurat) <- 'cell_type'
  308. #find the DEPeak across the cluster
  309. pmat_marker <- FindAllMarkers(pmat_seurat, only.pos = TRUE, logfc.threshold = 0.25)
  310. pmat_marker <- pmat_marker[pmat_marker$p_val_adj < 0.05,]
  311. peaks.sel <- unique(pmat_marker$gene)
  312. #filter the pmat seurat object and scale the express
  313. pmat_seurat <- pmat_seurat[peaks.sel,]
  314. pmat_seurat <- NormalizeData(pmat_seurat, normalization.method = "LogNormalize", scale.factor = 10000)
  315. pmat_seurat <- ScaleData(pmat_seurat, features = rownames(pmat_seurat),do.center = FALSE)
  316. #filter the pmat object to impute the RNA-seq cell's ATAC peaks
  317. pmat <- pmat[peaks.sel,]
  318. int.PBMC.ds <- int.PBMC
  319. [email protected][['atac']] <- pmat
  320. int.PBMC.ds <- normalize(int.PBMC.ds)
  321. #predict the accessibility profile for scRNA-seq data
  322. int.PBMC.ds <- imputeKNN(int.PBMC.ds, reference = 'atac',queries = 'rna',knn_k = 20,norm = TRUE,scale = FALSE)
  323. #find peaks and genes strongly corrected
  324. gmat = [email protected][['rna']]
  325. pseudo_pmat = [email protected][['rna']]
  326. regnet = MyLinkGenesAndPeaks(gene_counts = gmat, peak_counts = pseudo_pmat, dist = 'spearman', alpha = 0.05, path_to_coords = './data/10k_pbmc/annotation/genes.bed',peak_range = 500000)
  327. which(regnet[,'TCL1A'] > 0.6)
  328. saveRDS(regnet,file = './res/fig_201112/regnet.rds')
  329. #convert to seurat and plot
  330. tsne.obj <- CreateDimReducObject(embeddings = [email protected][colnames(pmat_seurat),], key = "UMAP_",global = TRUE)
  331. pmat_seurat[['tsne']] <- tsne.obj
  332. TCL1A <- plotGene(int.PBMC, "TCL1A", axis.labels = c('UMAP_1', 'UMAP_2'), return.plots = TRUE)
  333. peak1 <- FeaturePlot(pmat_seurat,features = 'chr14:95456861-95457458',reduction = 'tsne')
  334. pdf(file = './res/fig_201112/high_related_peak_gene_dimplot.pdf',width = 9,height = 9)
  335. peak1[[1]] / TCL1A[[2]]
  336. dev.off()

写在最后

在写这篇博客的时候遇到了不少的困难,liger作者给出的教程中基因组注释文件似乎不是特别匹配,导致我最终重复不出他们的结果。所以我决定从头开始,自己收集数据和reference,从而构建出一套完整的流程。可能我对外周血单核细胞的认知也比较浅薄,举例时所选择的基因和peak也不一定恰当,如果读者发现有什么错误之处,希望能在评论区指出,大家一起进步。

参考链接

Joint definition of cell types from single-cell gene expression and chromatin accessibility data (human bone marrow mononuclear cells)

发表评论

匿名网友

拖动滑块以完成验证
加载失败