介绍完liger的数学原理,我们便可以来跑一跑liger的流程,并对liger的数学原理和参数选择有更深刻的理解。我们可以在GitHub上找到liger的码源以及相关的教程,本文也主要翻译自GitHub上的教程,并加入了一些自己的函数和理解。
载入包和数据
本文的教程希望整合控制组以及干扰素刺激组的PBMC单细胞RNA-seq数据,数据来源于Kang et al, 2017,liger的作者则提供了以RData格式存储的经过down sample的表达矩阵,我们可以从这里下载并直接使用。当然,如果你使用10x的数据的话,则也可以使用read10X函数将数据读入。
############################################ ## Project: Liger-learning ## Script Purpose: run pipeline of liger integration and further analysis ## Data: 2020.10.24 ## Author: Yiming Sun ############################################ # general setting setwd('/data/User/sunym/project/liger_learning/') # library library(liger) library(Seurat) library(dplyr) library(tidyverse) library(viridis) library(parallel) #my function MySuggestLambda <- function(object,lambda,k,knn_k = 20,thresh = 1e-04,max.iters = 100,k2 = 500,ref_dataset = NULL,resolution = 1,parl_num){ res <- matrix(ncol = 2,nrow = length(lambda)) colnames(res) <- c('lambda','alignment') res[,'lambda'] <- lambda cl <- makeCluster(parl_num) clusterExport(cl,c('object','k','knn_k','thresh','max.iters','k2','ref_dataset','resolution'),envir = environment()) clusterExport(cl,c('optimizeALS','quantileAlignSNF','calcAlignment'),envir = .GlobalEnv) alignment_score <- parLapply(cl = cl,lambda,function(x){ object <- optimizeALS(object,k = k,lambda = x,thresh = thresh,max.iters = max.iters) object <- quantileAlignSNF(object,knn_k = knn_k,k2 = k2,ref_dataset = ref_dataset,resolution = resolution) return(calcAlignment(object)) }) stopCluster(cl) res[,'alignment'] <- (alignment_score %>% unlist()) res <- as.data.frame(res) res[,1] <- as.numeric(as.character(res[,1])) res[,2] <- as.numeric(as.character(res[,2])) p <- ggplot(res,aes(x = lambda,y = alignment)) geom_point() geom_line() return(list(res,p)) } # 1.Integrating Multiple Single-Cell RNA-seq Datasets #load data ctrl_dge <- readRDS("./data/PBMC_control.RDS") stim_dge <- readRDS("./data/PBMC_interferon-stimulated.RDS")
这里我写了一个函数MySuggestLambda用来优化λ的选择,其实liger包自带有suggestK和suggestLambda函数帮助我们优化K和λ的选择,但是suggestLambda函数似乎有什么bug,在一台服务器上跑完优化后会报一个莫名其妙的错误,而在另一台服务器上则完全正常,我还没有发现bug在哪里所以就自己写了个类似的函数帮助我优化λ的选择,如果你使用liger默认的函数也会报错的话不妨试试我的函数,虽然他内存运用效率特别低,建议配置好50G左右的swap区,因为当R跑多线程时,如果内存不够rsession会被直接挤掉,配置好swap能拯救一下它(血泪教训)。
标准化
在降维和聚类(整合)之前,我们通常都要对表达矩阵进行标准化、挑选variable gene最后再归一化。需要注意的是,liger使用的是非负矩阵分解的方法,因此我们在做归一化的时候注意不要做中心化,因为这会导致负数的出现。
#initialize a liger object ifnb_liger <- createLiger(list(ctrl = ctrl_dge, stim = stim_dge)) #explore liger object dim([email protected]$ctrl) head(colnames([email protected]$ctrl)) head(rownames([email protected]$ctrl)) dim([email protected]$stim) #normalize data ifnb_liger <- normalize(ifnb_liger) #select variable gene ifnb_liger <- selectGenes(ifnb_liger) #scale data but not center ifnb_liger <- scaleNotCenter(ifnb_liger)
可以简单关注一下liger的数据结构,原始的表达矩阵存储在raw.data中,聚类的信息存储在clusters中,降维的信息存储在tsne.coords中。这里就不得不提一下seurat,seurat有一个meta.data的表格来存储各个细胞的各种信息,因此在我们做个性化的比较分析的时候特别有用。liger虽然也有cell.data,但我们似乎不太能够利用liger自带的函数指定cell.data中的变量来进行比较,我想到的解决方法是把liger导出为seurat,在seurat中做一些个性化的比较,而liger则单纯用来做整合、降维以及聚类。
非负矩阵分解
Liger利用非负矩阵分解的方法进行数据集之间的整合,我们需要提供的参数为K和λ。通常,K反映了数据集中真实cluster的个数,而λ则反映了数据集之间的异质性,如果你对你的数据完全没有把握,那么默认参数K=20,λ=5往往也能有不错的表现。当然我们也可以根据KL散度和Alignment score来优化K和λ的选择。
#find the optimized k suggestK(ifnb_liger,k.test = seq(10,30,1),lambda = 5,plot.log2 = FALSE,num.cores = 10)
设置λ=5,调用10个进程,测试K从10到30时所计算得到的KL散度,最终选择20作为最优的K。
使用liger自带的函数优化λ的选择,取K=20,测试lambda的范围从1到10,调用10个进程,最终选择5作为最优的λ。
#find the optimized lambda #use liger function suggestLambda(object = ifnb_liger,k = 20,lambda.test = seq(1,10,1),num.cores = 10)
使用我自己写的函数来优化λ的选择,
#use my function suggest_lambda <- MySuggestLambda(ifnb_liger,lambda = seq(1,10,1),k = 20,parl_num = 10) suggest_lambda[[2]]
官方的函数应该是在Alignment score上做了平均而使得曲线更加平滑了,总之建议使用官方的函数,如果出现报错的话,可以试试我的函数。最后我们选择K=20,λ=5进行非负矩阵分解。
#use lambda 5 k 20 #integrate NMF ifnb_liger <- optimizeALS(ifnb_liger,k = 20,lambda = 5,max.iters = 30,thresh = 1e-06)
聚类
做完非负矩阵分解后就可以进行聚类了,liger聚类的方法也比较有趣,可以参考liger的数学原理。quantile_norm函数会构建SFN图(shared factor neighborhood graph)对细胞进行聚类,在聚类完成后将同一cluster中不同数据集来源的细胞在坐标轴(W V矩阵)上进行分位数归一化。
#Quantile Normalization and Joint Clustering ifnb_liger <- quantile_norm(ifnb_liger,knn_k = 20,quantiles = 50,min_cells = 20,do.center = FALSE, max_sample = 1000,refine.knn = TRUE,eps = 0.9)
这里需要讲两个参数,knn_k是在构建SFN图中选择邻居细胞的个数,do.center是指是否做中心化,一般默认为FALSE,但当我们处理一些比较dense的数据时,如DNA甲基化数据,metagene的loading值通常会很高,此时可以做一下中心化来进行矫正。当然,我们也可以用一些图聚类算法来进一步优化我们的聚类,这一步可做可不做。
# you can use louvain cluster to detect and assign cluster ifnb_liger <- louvainCluster(ifnb_liger, resolution = 0.25)
可视化与下游分析
- UMAP降维图
#Visualization and Downstream Analysis ifnb_liger <- runUMAP(ifnb_liger, distance = 'cosine', n_neighbors = 30, min_dist = 0.3) all.plots <- plotByDatasetAndCluster(ifnb_liger, axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = T) all.plots[[1]] all.plots[[2]]
- Gene loading plot
我们可以找到在各个cluster中高表达的metagene,以及这些metagene在不同的数据集中(W和V矩阵)对其贡献最高的基因。
#find the maximun loading factor on each cluster and the maximun loading gene on the factor gene_loadings <- plotGeneLoadings(ifnb_liger, do.spec.plot = FALSE, return.plots = TRUE) gene_loadings[[4]]
metagene_4在cluster_3中特异性高表达,那么metagene_4在W矩阵中对其贡献最大的基因(Shared),则很有可能是cluster_3的marker基因。类似的,metagene_4在V矩阵中对其贡献最大的基因,则反映了同一细胞类型数据集之间(或处理之间)的差异。我们也可以很容易得到图片背后的数据。
#the metadata of the factor/gene loading plot #UMAP head(gene_loadings[[4]][[1]]$data) #gene loading value on the specific factor head(gene_loadings[[4]][[2]][[1]]$data[order(gene_loadings[[4]][[2]][[1]]$data$loadings,decreasing = TRUE),]) head(gene_loadings[[4]][[2]][[2]]$data[order(gene_loadings[[4]][[2]][[2]]$data$loadings,decreasing = TRUE),]) head(gene_loadings[[4]][[2]][[3]]$data[order(gene_loadings[[4]][[2]][[3]]$data$loadings,decreasing = TRUE),]) # conclusion: gene_loadings object is a list of all the factor, specific the factor you want # conclusion: the factor object listed by gene_loadings has two lists, the first is the UMAP with the factor loading value, the second is the dataset-sepcific matrix W # the dataset specific matrix W is a list contains the all the dataset and shared gene loading value # what a mess!
总之就是list套list,很麻烦,可以对照着图和输出理解一下。
> head(gene_loadings[[4]][[1]]$data) Factor4 tSNE1 tSNE2 ctrlTCAGCGCTGGTCAT-1 0.0018384202 -9.5633178 -1.5025842 ctrlTTATGGCTTCATTC-1 NA -7.4026990 -0.5618219 ctrlACCCACTGCTTAGG-1 0.0007611497 3.7179575 4.9839707 ctrlATGGGTACCCCGTT-1 0.0001820607 -11.0730367 -4.6024990 ctrlTGACTGGACAGTCA-1 NA 0.5273923 -8.4171533 ctrlGTGTAGTGGTTGTG-1 0.0013735353 8.8018236 5.2826267 > #gene loading value on the specific factor > head(gene_loadings[[4]][[2]][[1]]$data[order(gene_loadings[[4]][[2]][[1]]$data$loadings,decreasing = TRUE),]) loadings xpos top_k FAM43A 2.522310 1.0000000 TRUE FGFBP2 2.411532 0.9995883 TRUE SPON2 2.264957 0.9991766 FALSE ZNF638 1.925689 0.9987649 FALSE KARS 1.916834 0.9983532 FALSE PRDX5 1.889175 0.9979415 TRUE > head(gene_loadings[[4]][[2]][[2]]$data[order(gene_loadings[[4]][[2]][[2]]$data$loadings,decreasing = TRUE),]) loadings xpos top_k GNLY 136.98236 1.0000000 TRUE NKG7 119.07990 0.9995883 TRUE CLIC3 117.71097 0.9991766 TRUE GZMB 108.58467 0.9987649 TRUE PRF1 108.14754 0.9983532 TRUE GZMA 93.22887 0.9979415 TRUE > head(gene_loadings[[4]][[2]][[3]]$data[order(gene_loadings[[4]][[2]][[3]]$data$loadings,decreasing = TRUE),]) loadings xpos top_k PRF1 3.273654 1.0000000 TRUE SDF2L1 2.913147 0.9995883 FALSE CD38 2.204181 0.9991766 TRUE IFNG 2.111837 0.9987649 FALSE HSPA5 2.081298 0.9983532 FALSE CIAPIN1 2.002635 0.9979415 FALSE
- 差异表达分析
转录组分析怎么离得开差异表达分析嘛,我们可以用runWilcoxon比较不同的cluster之间的差异表达基因,也可以比较同一个cluster的不同数据集来源之间的差异表达基因。
#differential gene express # for different cluster cluster.results <- runWilcoxon(ifnb_liger, compare.method = "clusters") head(cluster.results) datasets.results <- runWilcoxon(ifnb_liger, compare.method = "datasets") head(datasets.results)
只需要指定compare.method就可以进行不同的比较,不过这个函数其实也只能做这两种比较,如果我想做之前提到的更加个性化的比较,就只能导入到seurat中再进行比较。
> cluster.results <- runWilcoxon(ifnb_liger, compare.method = "clusters") [1] "Performing Wilcoxon test on ALL datasets: ctrl, stim" > head(cluster.results) feature group avgExpr logFC statistic 1 RP11-206L10.2 0 -23.01898 -0.005557170 4259604 2 RP11-206L10.9 0 -23.01914 -0.005979592 4259602 3 LINC00115 0 -23.00609 -0.031260202 4252898 4 NOC2L 0 -21.83167 0.292228474 4340248 5 KLHL17 0 -23.00556 0.003658342 4262141 6 PLEKHN1 0 -23.01194 -0.039663835 4249915 auc pval padj pct_in pct_out 1 0.4998057 0.570575909 0.69419204 100 100 2 0.4998055 0.570110484 0.69419204 100 100 3 0.4990189 0.138680541 0.24813295 100 100 4 0.5092682 0.004585054 0.01435755 100 100 5 0.5001034 0.819517383 0.88363326 100 100 6 0.4986689 0.044537037 0.10099905 100 100 > datasets.results <- runWilcoxon(ifnb_liger, compare.method = "datasets") [1] "Performing Wilcoxon test on ALL datasets: ctrl, stim" > head(datasets.results) feature group avgExpr logFC statistic 1 RP11-206L10.2 0-ctrl -23.02585 -0.01408078 665437.5 2 RP11-206L10.9 0-ctrl -23.02585 -0.01376717 665437.5 3 LINC00115 0-ctrl -22.99983 0.01283157 666564.5 4 NOC2L 0-ctrl -21.87312 -0.08498766 662383.5 5 KLHL17 0-ctrl -22.98625 0.03960509 667718.0 6 PLEKHN1 0-ctrl -23.02585 -0.02853147 664846.0 auc pval padj pct_in pct_out 1 0.4995560 0.30577296 0.6990712 100 100 2 0.4995560 0.30577296 0.6990712 100 100 3 0.5004020 0.59231247 0.8964793 100 100 4 0.4972633 0.61893894 0.9110340 100 100 5 0.5012680 0.09102028 0.4580475 100 100 6 0.4991119 0.14726281 0.5773302 100 100
对差异表达进行筛选。
> # filter the data to find the meaningful DEG for your own experiment > cluster.results <- cluster.results[cluster.results$padj < 0.05,] > cluster.results <- cluster.results[cluster.results$logFC > 3,] > wilcoxon.cluster_3 <- cluster.results[cluster.results$group == 3, ] > wilcoxon.cluster_3 <- wilcoxon.cluster_3[order(wilcoxon.cluster_3$padj), ] > markers <- wilcoxon.cluster_3[1:20, ] > head(markers) feature group avgExpr logFC statistic auc 41861 GNLY 3 -5.282466 16.147323 2379942 0.9647642 46904 CLIC3 3 -12.962070 9.485599 1946458 0.7890417 47130 PRF1 3 -12.260573 9.830082 1970124 0.7986352 49239 GZMB 3 -7.840488 13.697178 2231966 0.9047789 52832 NKG7 3 -6.594620 14.433185 2324832 0.9424243 48310 KLRC1 3 -18.000059 4.916239 1606731 0.6513252 pval padj pct_in pct_out 41861 0.000000e 00 0.000000e 00 100 100 46904 0.000000e 00 0.000000e 00 100 100 47130 0.000000e 00 0.000000e 00 100 100 49239 0.000000e 00 0.000000e 00 100 100 52832 0.000000e 00 0.000000e 00 100 100 48310 1.839405e-289 4.099114e-286 100 100
可以看一下PRF1在不同的cluster之间的表达情况。
# feature plot on UMAP PRF1 <- plotGene(ifnb_liger, "PRF1", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = T) PRF1[[1]] PRF1[[2]]
可以看到PRF1在两个数据集的cluster_3中均显著高表达。也可以用小提琴图来展示。
# vlnplot PRF1 <- plotGeneViolin(ifnb_liger,gene = 'PRF1',return.plots = TRUE) PRF1[[1]] PRF1[[2]]
完整代码
############################################ ## Project: Liger-learning ## Script Purpose: run pipeline of liger integration and further analysis ## Data: 2020.10.24 ## Author: Yiming Sun ############################################ # general setting setwd('/data/User/sunym/project/liger_learning/') # library library(liger) library(Seurat) library(dplyr) library(tidyverse) library(viridis) library(parallel) #my function MySuggestLambda <- function(object,lambda,k,knn_k = 20,thresh = 1e-04,max.iters = 100,k2 = 500,ref_dataset = NULL,resolution = 1,parl_num){ res <- matrix(ncol = 2,nrow = length(lambda)) colnames(res) <- c('lambda','alignment') res[,'lambda'] <- lambda cl <- makeCluster(parl_num) clusterExport(cl,c('object','k','knn_k','thresh','max.iters','k2','ref_dataset','resolution'),envir = environment()) clusterExport(cl,c('optimizeALS','quantileAlignSNF','calcAlignment'),envir = .GlobalEnv) alignment_score <- parLapply(cl = cl,lambda,function(x){ object <- optimizeALS(object,k = k,lambda = x,thresh = thresh,max.iters = max.iters) object <- quantileAlignSNF(object,knn_k = knn_k,k2 = k2,ref_dataset = ref_dataset,resolution = resolution) return(calcAlignment(object)) }) stopCluster(cl) res[,'alignment'] <- (alignment_score %>% unlist()) res <- as.data.frame(res) res[,1] <- as.numeric(as.character(res[,1])) res[,2] <- as.numeric(as.character(res[,2])) p <- ggplot(res,aes(x = lambda,y = alignment)) geom_point() geom_line() return(list(res,p)) } # 1.Integrating Multiple Single-Cell RNA-seq Datasets #load data ctrl_dge <- readRDS("./data/PBMC_control.RDS") stim_dge <- readRDS("./data/PBMC_interferon-stimulated.RDS") #initialize a liger object ifnb_liger <- createLiger(list(ctrl = ctrl_dge, stim = stim_dge)) #explore liger object dim([email protected]$ctrl) head(colnames([email protected]$ctrl)) head(rownames([email protected]$ctrl)) dim([email protected]$stim) #normalize data ifnb_liger <- normalize(ifnb_liger) #select variable gene ifnb_liger <- selectGenes(ifnb_liger) #scale data but not center ifnb_liger <- scaleNotCenter(ifnb_liger) #find the optimized k suggestK(ifnb_liger,k.test = seq(10,30,1),lambda = 5,plot.log2 = FALSE,num.cores = 10) #find the optimized lambda #use liger function suggestLambda(object = ifnb_liger,k = 20,lambda.test = seq(1,10,1),num.cores = 10) #use my function suggest_lambda <- MySuggestLambda(ifnb_liger,lambda = seq(1,10,1),k = 20,parl_num = 10) suggest_lambda[[2]] #use lambda 5 k 20 #integrate NMF ifnb_liger <- optimizeALS(ifnb_liger,k = 20,lambda = 5,max.iters = 30,thresh = 1e-06) #Quantile Normalization and Joint Clustering ifnb_liger <- quantile_norm(ifnb_liger,knn_k = 20,quantiles = 50,min_cells = 20,do.center = FALSE, max_sample = 1000,refine.knn = TRUE,eps = 0.9) # you can use louvain cluster to detect and assign cluster ifnb_liger <- louvainCluster(ifnb_liger, resolution = 0.25) head([email protected]) #Visualization and Downstream Analysis ifnb_liger <- runUMAP(ifnb_liger, distance = 'cosine', n_neighbors = 30, min_dist = 0.3) all.plots <- plotByDatasetAndCluster(ifnb_liger, axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = T) #the place of each cell and the cluster it belongs head(all.plots[[1]]$data) pdf(file = './res/fig_201024/plot_by_dataset_and_cluster.pdf',width = 8,height = 4) all.plots[[1]] all.plots[[2]] dev.off() #find the maximun loading factor on each cluster and the maximun loading gene on the factor gene_loadings <- plotGeneLoadings(ifnb_liger, do.spec.plot = FALSE, return.plots = TRUE) pdf(file = './res/fig_201024/gene_loading_plot.pdf',width = 6.5,height = 6) gene_loadings[[4]] dev.off() #the metadata of the factor/gene loading plot #UMAP head(gene_loadings[[4]][[1]]$data) #gene loading value on the specific factor head(gene_loadings[[4]][[2]][[1]]$data[order(gene_loadings[[4]][[2]][[1]]$data$loadings,decreasing = TRUE),]) head(gene_loadings[[4]][[2]][[2]]$data[order(gene_loadings[[4]][[2]][[2]]$data$loadings,decreasing = TRUE),]) head(gene_loadings[[4]][[2]][[3]]$data[order(gene_loadings[[4]][[2]][[3]]$data$loadings,decreasing = TRUE),]) # conclusion: gene_loadings object is a list of all the factor, specific the factor you want # conclusion: the factor object listed by gene_loadings has two lists, the first is the UMAP with the factor loading value, the second is the dataset-sepcific matrix W # the dataset specific matrix W is a list contains the all the dataset and shared gene loading value # what a mess! #differential gene express # for different cluster cluster.results <- runWilcoxon(ifnb_liger, compare.method = "clusters") head(cluster.results) datasets.results <- runWilcoxon(ifnb_liger, compare.method = "datasets") head(datasets.results) # filter the data to find the meaningful DEG for your own experiment cluster.results <- cluster.results[cluster.results$padj < 0.05,] cluster.results <- cluster.results[cluster.results$logFC > 3,] wilcoxon.cluster_3 <- cluster.results[cluster.results$group == 3, ] wilcoxon.cluster_3 <- wilcoxon.cluster_3[order(wilcoxon.cluster_3$padj), ] markers <- wilcoxon.cluster_3[1:20, ] head(markers) # feature plot on UMAP PRF1 <- plotGene(ifnb_liger, "PRF1", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = T) pdf(file = './res/fig_201024/feature_plot_PRF1.pdf',width = 8,height = 4) PRF1[[1]] PRF1[[2]] dev.off() # vlnplot PRF1 <- plotGeneViolin(ifnb_liger,gene = 'PRF1',return.plots = TRUE) pdf(file = './res/fig_201024/vlnplot_PRF1.pdf',width = 6,height = 3) PRF1[[1]] PRF1[[2]] dev.off() # compare cluster marker express winthin and across dataset IFIT3 <- plotGene(ifnb_liger, "IFIT3", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = TRUE) IFITM3 <- plotGene(ifnb_liger, "IFITM3", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = TRUE) pdf(file = './res/fig_201024/cluster_marker_within_between_dataset.pdf',width = 8,height = 8) plot_grid(IFIT3[[1]],IFIT3[[2]],IFITM3[[1]],IFITM3[[2]], ncol=2) dev.off()