使用DESeq2进行差异表达分析

下载完原始数据,比对以及构建DESeqDataSet对象等一系列使用前准备完成后,我们便可以使用DESeq2进行后续的差异表达分析了。由于DESeq2的原理和内部的处理方法非常复杂,因此本文暂时不涉及原理的解释,后续可能会专门出一期来详解DESeq2的算法原理,感兴趣的同学也可以参考文献:Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

首先导入Salmon比对后的数据,构建DESeqDataSet对象,代码参考DESeq2使用前准备

  1. ############################################
  2. ## Project: DESeq2 learining
  3. ## Script Purpose: learn DESeq2
  4. ## Data: 2021.01.08
  5. ## Author: Yiming Sun
  6. ############################################
  7. #sleep
  8. ii <- 1
  9. while(1){
  10. cat(paste("round",ii),sep = "n")
  11. ii <- ii 1
  12. Sys.sleep(30)
  13. }
  14. #work dir
  15. setwd('/lustre/user/liclab/sunym/R/rstudio/deseq2/')
  16. #load R package
  17. .libPaths('/lustre/user/liclab/sunym/R/rstudio/Rlib/')
  18. library(DESeq2)
  19. library(airway)
  20. library(tximeta)
  21. library(GEOquery)
  22. library(GenomicFeatures)
  23. library(BiocParallel)
  24. library(GenomicAlignments)
  25. library(vsn)
  26. library(dplyr)
  27. library(ggplot2)
  28. library(RColorBrewer)
  29. library(pheatmap)
  30. library(PoiClaClu)
  31. library(glmpca)
  32. library(ggbeeswarm)
  33. library(apeglm)
  34. library(Rmisc)
  35. library(genefilter)
  36. library(AnnotationDbi)
  37. library(org.Hs.eg.db)
  38. library(Gviz)
  39. library(sva)
  40. library(RUVSeq)
  41. library(fission)
  42. #load data
  43. #-----------Transcript quantification by Salmon-----------------#
  44. #metadata
  45. dir <- system.file("extdata", package="airway", mustWork=TRUE)
  46. csvfile <- file.path(dir, "sample_table.csv")
  47. coldata <- read.csv(csvfile, row.names=1, stringsAsFactors=FALSE)
  48. coldata$names <- coldata$Run
  49. dir <- getwd()
  50. dir <- paste(dir,'airway_salmon',sep = '/')
  51. coldata$files <- file.path(dir,coldata$Run,'quant.sf')
  52. file.exists(coldata$files)
  53. #create SummarizedExperiment dataset
  54. makeLinkedTxome(indexDir='/lustre/user/liclab/sunym/R/rstudio/deseq2/Salmon_index/gencode.v35_salmon_0.14.1/', source="GENCODE", organism="Homo sapiens",
  55. release="35", genome="GRCh38", fasta='/lustre/user/liclab/sunym/R/rstudio/deseq2/transcript/gencode.v35.transcripts.fa', gtf='/lustre/user/liclab/sunym/R/rstudio/deseq2/annotation_gencode/gencode.v35.annotation.gtf', write=FALSE)
  56. se <- tximeta(coldata)
  57. #explore se
  58. dim(se)
  59. head(rownames(se))
  60. gse <- summarizeToGene(se)
  61. dim(gse)
  62. head(rownames(gse))
  63. #DESeqDataSet
  64. #start from SummarizedExperiment set
  65. round( colSums(assay(gse))/1e6, 1 )
  66. colData(gse)
  67. gse$dex
  68. dds <- DESeqDataSet(gse, design = ~ cell dex)
  69. dds$dex
  70. dds$dex <- factor(dds$dex,levels = c('untrt','trt'))
  71. dds$dex

探索性分析与可视化

在控制细胞类型的前提下,我们探究dex的处理对细胞转录组的影响。一方面,我们可以从降维的可视化图上,查看不同样本之间的聚类情况;另一方面我们也可以通过差异表达分析来探究药物的处理对哪些基因产生了影响。由于DESeq2使用负二项分布来对细胞的转录组进行建模,因此对于差异表达分析的部分,我们需要使用原始的counts矩阵;对于PCA等降维可视化的方法,由于存在高表达基因主导差异来源的情况,因此我们需要对表达矩阵进行适度的筛选和标准化,从而尽可能使得表达矩阵满足同方差性。

筛选表达矩阵

由于RNA-seq的表达矩阵中存在非常多的零,因此将这些完全没有检测到的基因预先筛选掉能够大大加速我们的计算,当然我们也可以选择其他标准进行筛选。这里,我们筛选掉在所有样本中基因表达小于等于1的基因。

  1. > #1.Exploratory analysis and visualization
  2. > head(assay(dds))
  3. SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520
  4. ENSG00000000003.15 706 462 895 423 1183 1085 801
  5. ENSG00000000005.6 0 0 0 0 0 0 0
  6. ENSG00000000419.13 455 510 604 352 583 774 410
  7. ENSG00000000457.14 310 268 360 222 358 428 298
  8. ENSG00000000460.17 88 74 51 45 105 99 96
  9. ENSG00000000938.13 0 0 2 0 1 0 0
  10. SRR1039521
  11. ENSG00000000003.15 597
  12. ENSG00000000005.6 0
  13. ENSG00000000419.13 499
  14. ENSG00000000457.14 287
  15. ENSG00000000460.17 84
  16. ENSG00000000938.13 0
  17. > #1.1 Pre-filtering the dataset
  18. > #the minimal filter(the features that express)
  19. > nrow(dds)
  20. [1] 60237
  21. > keep <- rownames(dds)[rowSums(counts(dds)) > 1]
  22. > length(keep)
  23. [1] 32920
  24. > dds <- dds[keep,]
  25. > nrow(dds)
  26. [1] 32920
  27. >

标准化

从直觉上来说,一个基因的表达量越高,则它的方差就容易越大。假设基因表达服从泊松分布,则显然其表达量的均值和方差均为lambda,均值越大方差也越大。

  1. #1.2 The variance stabilizing transformation and the rlog
  2. #varience increase along with mean(exampled by poisson distribution)
  3. lambda <- 10^seq(from = -1, to = 2, length = 1000)
  4. cts <- matrix(rpois(1000*200, lambda), ncol = 200)
  5. pdf(file = './res/possion_mean_variance.pdf',width = 8,height = 5)
  6. meanSdPlot(cts, ranks = FALSE)
  7. dev.off()

使用DESeq2进行差异表达分析-图片1

最简单的标准化方法便是取对数。

  1. #poisson processed by log
  2. log.cts.one <- log2(cts 1)
  3. pdf(file = './res/poisson_log_mean_variance.pdf',width = 8,height = 5)
  4. meanSdPlot(log.cts.one, ranks = FALSE)
  5. dev.off()
  6. #conclusion:the simple log transform will increase the noise of the data close to 0

使用DESeq2进行差异表达分析-图片2

虽然高表达量的基因的方差被成功压制,但一部分表达量接近0的基因的方差则不能很好地进行矫正。作者这里提供了两种更为复杂和成熟的算法来对表达矩阵进行标准化,分别是vst和rlog。vst方法在计算上更快,并且对于高表达量的outlier更加不敏感;rlog对于小数据集(n < 30)的效果更好,对于测序深度差异广泛的一组数据的表现也会更好。因此作者建议,大中型的数据集可以选择vst,小型数据可以选择rlog,当然我们可以都做一下,然后选择最优的一种。

  1. > #solution:vst and rlog
  2. > #vst
  3. > vsd <- vst(dds, blind = FALSE)
  4. using 'avgTxLength' from assays(dds), correcting for library size
  5. > head(assay(vsd), 3)
  6. SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520
  7. ENSG00000000003.15 10.171389 9.696483 10.155402 9.858628 10.341690 10.174484 10.343805
  8. ENSG00000000419.13 9.631483 9.871469 9.742678 9.736292 9.700797 9.813558 9.619015
  9. ENSG00000000457.14 9.378150 9.226881 9.282191 9.373374 9.194730 9.314305 9.410820
  10. SRR1039521
  11. ENSG00000000003.15 9.892914
  12. ENSG00000000419.13 9.787441
  13. ENSG00000000457.14 9.398507
  14. > colData(vsd)
  15. DataFrame with 8 rows and 10 columns
  16. SampleName cell dex albut Run avgLength Experiment Sample
  17. SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126 SRX384345 SRS508568
  18. SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346 SRS508567
  19. SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126 SRX384349 SRS508571
  20. SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87 SRX384350 SRS508572
  21. SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120 SRX384353 SRS508575
  22. SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126 SRX384354 SRS508576
  23. SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101 SRX384357 SRS508579
  24. SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98 SRX384358 SRS508580
  25. BioSample names
  26. SRR1039508 SAMN02422669 SRR1039508
  27. SRR1039509 SAMN02422675 SRR1039509
  28. SRR1039512 SAMN02422678 SRR1039512
  29. SRR1039513 SAMN02422670 SRR1039513
  30. SRR1039516 SAMN02422682 SRR1039516
  31. SRR1039517 SAMN02422673 SRR1039517
  32. SRR1039520 SAMN02422683 SRR1039520
  33. SRR1039521 SAMN02422677 SRR1039521
  34. > #rlog
  35. > rld <- rlog(dds, blind = FALSE)
  36. using 'avgTxLength' from assays(dds), correcting for library size
  37. > head(assay(rld), 3)
  38. SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520
  39. ENSG00000000003.15 9.613929 9.047835 9.596064 9.249122 9.805339 9.618000 9.806436
  40. ENSG00000000419.13 8.863465 9.154978 9.000860 8.993053 8.949360 9.086998 8.848148
  41. ENSG00000000457.14 8.354403 8.153928 8.228018 8.347470 8.108486 8.270879 8.396243
  42. SRR1039521
  43. ENSG00000000003.15 9.288931
  44. ENSG00000000419.13 9.055059
  45. ENSG00000000457.14 8.380384
  46. >

在上面的计算中,我们设置blind = FALSE,表示我们认为在design formula中的变量(细胞系与dex处理)并不会改变预期的表达量均值-方差趋势。默认情况下,DESeq2采取无监督的标准化转化,即blind = TRUE。下面我们展示经过测序深度矫正的取对数方法与vst和rlog的效果之间的差异。下图x轴与y轴分别对应经过变换后的两个样本的不同基因的表达强度,vst和rlog会默认矫正测序深度的效应,因此不需要特别指定normalized = TRUE。

  1. #compare log transform with vst and rlog
  2. dds <- estimateSizeFactors(dds)
  3. df <- bind_rows(
  4. as_data_frame(log2(counts(dds, normalized=TRUE)[, 1:2] 1)) %>% mutate(transformation = "log2(x 1)"),
  5. as_data_frame(assay(vsd)[, 1:2]) %>% mutate(transformation = "vst"),
  6. as_data_frame(assay(rld)[, 1:2]) %>% mutate(transformation = "rlog"))
  7. colnames(df)[1:2] <- c("x", "y")
  8. lvls <- c("log2(x 1)", "vst", "rlog")
  9. df$transformation <- factor(df$transformation, levels=lvls)
  10. pdf(file = './res/compare_log_vst_rlog.pdf',width = 12,height = 6)
  11. ggplot(df, aes(x = x, y = y)) geom_hex(bins = 80)
  12. coord_fixed() facet_grid( . ~ transformation)
  13. dev.off()

使用DESeq2进行差异表达分析-图片3

散点偏离对角线y = x的距离,则会贡献PCA的差异,可以看到rlog无论是在基因表达量的范围上,还是基因表达量方差的缩放上,都要一定程度优于另外两种算法,而vst则会一定程度上将低表达基因的表达量放大。取对数的方法虽然简单,但在低表达量的基因中的表现并不好,并不能很好地压制这部分基因的方差。

样本空间距离

我们可以计算样本在基因的高维空间中的欧式距离,从而比较不同细胞之间的整体相似性,从而一定程度上验证我们对于design formula的假设。需要注意的是dist函数要求每一行为一个不同的样本,每一列为不同的纬度。这里我们选择使用vst标准化后的表达矩阵。

  1. #1.3 Sample distances
  2. #Euclidean distance
  3. sampleDists <- dist(t(assay(vsd)))
  4. sampleDists
  5. sampleDistMatrix <- as.matrix( sampleDists )
  6. rownames(sampleDistMatrix) <- paste( vsd$dex, vsd$cell, sep = " - " )
  7. colnames(sampleDistMatrix) <- NULL
  8. colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
  9. pdf(file = './res/sample_distance.pdf',width = 8,height = 6.5)
  10. pheatmap(sampleDistMatrix,
  11. clustering_distance_rows = sampleDists,
  12. clustering_distance_cols = sampleDists,
  13. col = colors,
  14. main = 'sample euclidean distance')
  15. dev.off()

使用DESeq2进行差异表达分析-图片4

可以看到,不同的处理更倾向于聚在一起,即处理带来的差异大于细胞系之间的差异。当然,在不同的处理组中,我们也可以观察到,不同细胞系之间的相似关系也是类似的。另外,我们也可以使用泊松距离来度量样本间的差异程度,需要注意的是泊松距离的计算需要使用原始的表达矩阵。

  1. #Poisson Distance
  2. poisd <- PoissonDistance(t(counts(dds)))
  3. samplePoisDistMatrix <- as.matrix( poisd$dd )
  4. rownames(samplePoisDistMatrix) <- paste( dds$dex, dds$cell, sep=" - " )
  5. colnames(samplePoisDistMatrix) <- NULL
  6. pdf(file = './res/sample_poisson_distance.pdf',width = 8,height = 6.5)
  7. pheatmap(samplePoisDistMatrix,
  8. clustering_distance_rows = poisd$dd,
  9. clustering_distance_cols = poisd$dd,
  10. col = colors,
  11. main = 'sample poisson distance')
  12. dev.off()

使用DESeq2进行差异表达分析-图片5

结果同样是类似的。

PCA主成分分析

PCA是最常见的bulk数据降维聚类方法了,其中PC1解释了样本之间最大的差异,PC2则解释了样本之间次大的差异。PCA通常使用标准化过后的表达矩阵。

  1. #1.4 PCA
  2. #plot by func build in DESeq2
  3. pdf(file = './res/PCA.pdf',width = 8,height = 4)
  4. plotPCA(vsd, intgroup = c("dex", "cell"))
  5. dev.off()

使用DESeq2进行差异表达分析-图片6

这个图不是特别美观和清楚,我们可以使用ggplot自己手动画一个。

  1. #plot by ggplot2
  2. pcaData <- plotPCA(vsd, intgroup = c( "dex", "cell"), returnData = TRUE)
  3. pcaData
  4. percentVar <- round(100 * attr(pcaData, "percentVar"))
  5. percentVar
  6. pdf(file = './res/PCA_by_ggplot.pdf',width = 8,height = 4)
  7. ggplot(pcaData, aes(x = PC1, y = PC2, color = dex, shape = cell))
  8. geom_point(size =3)
  9. xlab(paste0("PC1: ", percentVar[1], "% variance"))
  10. ylab(paste0("PC2: ", percentVar[2], "% variance"))
  11. coord_fixed()
  12. ggtitle("PCA with VST data")
  13. dev.off()

使用DESeq2进行差异表达分析-图片7

从图上可以看到,PC1解释了dex处理之间的差异,而PC2则解释了细胞系之间的差异。显然,细胞系之间的差异虽然小于dex处理所带来的差异,但依然是非常可观的。这也体现出了我们做配对比较的必要性,也就是在控制细胞系的条件下对dex处理和未处理进行比较,这在我们的design formula中已经设置好了。

广义PCA

对于非正态分布的数据,我们可以使用广义PCA(generalized PCA)进行降维可视化。广义PCA通常使用原始的表达矩阵进行计算。

  1. #1.5 PCA plot using Generalized PCA
  2. gpca <- glmpca(counts(dds), L=2)
  3. gpca.dat <- gpca$factors
  4. gpca.dat$dex <- dds$dex
  5. gpca.dat$cell <- dds$cell
  6. pdf(file = './res/glmpca.pdf',width = 8,height = 4)
  7. ggplot(gpca.dat, aes(x = dim1, y = dim2, color = dex, shape = cell))
  8. geom_point(size =3) coord_fixed() ggtitle("glmpca - Generalized PCA")
  9. dev.off()

使用DESeq2进行差异表达分析-图片8

多维缩放

多维缩放的简单想法是将高维空间中的距离关系,尽可能保持不变地映射到低维空间。这在我们没有样本的表达矩阵,而只存在样本之间的相互距离的时候非常有用。这里,我们使用标准化后的表达矩阵所计算出的样本之间的相互距离。

  1. #1.6 MDS plot
  2. #MDS plot using vst data
  3. mds <- as.data.frame(colData(vsd)) %>% cbind(cmdscale(sampleDistMatrix))
  4. pdf(file = './res/MDS_VST_plot.pdf',width = 8,height = 4)
  5. ggplot(mds, aes(x = `1`, y = `2`, color = dex, shape = cell))
  6. geom_point(size = 3) coord_fixed() ggtitle("MDS with VST data")
  7. dev.off()

使用DESeq2进行差异表达分析-图片9

当然,我们也可以使用利用原始表达矩阵所计算出的泊松距离。

  1. #MDS plot using poisson distance
  2. mdsPois <- as.data.frame(colData(dds)) %>% cbind(cmdscale(samplePoisDistMatrix))
  3. pdf(file = './res/MDS_Possion_plot.pdf',width = 8,height = 4)
  4. ggplot(mdsPois, aes(x = `1`, y = `2`, color = dex, shape = cell))
  5. geom_point(size = 3) coord_fixed() ggtitle("MDS with PoissonDistances")
  6. dev.off()

使用DESeq2进行差异表达分析-图片10

差异表达分析

除了从降维可视化的层面上对不同的样本进行相似性的分析,我们也可以利用原始的表达矩阵,计算不同样本之间的差异。在比较不同的组之前,我们需要先估计数据集的大小因子,从而来矫正测序深度带来的批次效应;另外,我们还需要估计各个基因在不同样本间表达的离散度;最后我们根据design formula,使用广义线性模型来拟合和建模矫正过后的表达矩阵。上述的三个步骤都可以使用DESeq函数轻松搞定,并将拟合的参数返回DESeqDataSet对象,我们只需要提取出这些结果就可以进行后续的比较了。

  1. > #--------------------------------------------------------#
  2. > #--------------------------------------------------------#
  3. > #2.Differential expression analysis
  4. > #2.1 Running the differential expression pipeline
  5. > dds <- DESeq(dds)
  6. using pre-existing normalization factors
  7. estimating dispersions
  8. gene-wise dispersion estimates
  9. mean-dispersion relationship
  10. final dispersion estimates
  11. fitting model and testing

构建结果表格

我们可以直接用results函数提取出差异表达的计算结果,如果有多种比较方式,则results默认返回第一种比较的差异表达结果。如果p value出现NA,则说明该基因的表达量全部为0,不适用于任何检验方法,或者出现非常极端的outlier。大家感兴趣的话,可以进一步去研究一下DESeq2筛选outlier的方法,这里不对原理做具体的阐述。

  1. > #2.2 Building the results table
  2. > res <- results(dds)
  3. > res
  4. log2 fold change (MLE): dex trt vs untrt
  5. Wald test p-value: dex trt vs untrt
  6. DataFrame with 32920 rows and 6 columns
  7. baseMean log2FoldChange lfcSE stat
  8. ENSG00000000003.15 742.027385348416 -0.501576985024735 0.129105068551828 -3.88502938460067
  9. ENSG00000000419.13 511.733294147504 0.204616852714061 0.128576587230486 1.59140055838678
  10. ENSG00000000457.14 312.29261469575 0.0251067981341494 0.159146442445286 0.157759091239385
  11. ENSG00000000460.17 79.3907526709186 -0.111406959051003 0.319164050388844 -0.349058607682394
  12. ENSG00000000938.13 0.307079546955295 -1.36851064387886 3.50376299451696 -0.390583109080277
  13. ... ... ... ... ...
  14. ENSG00000288636.1 3.34150352168806 -3.33576637787421 1.57307985327023 -2.12053213378811
  15. ENSG00000288637.1 14.0035268554639 -0.427472592993911 0.611570416082812 -0.69897526393106
  16. ENSG00000288640.1 25.576775477853 0.224597114398936 0.443043875359963 0.506941020720478
  17. ENSG00000288642.1 2.70286053446089 -1.59495281968392 2.20717217902031 -0.722622745449729
  18. ENSG00000288644.1 0.923439930859957 -1.05728198235034 3.42054519810226 -0.309097503794697
  19. pvalue padj
  20. ENSG00000000003.15 0.000102317512305595 0.000973479685222727
  21. ENSG00000000419.13 0.111519458749328 0.294711851453744
  22. ENSG00000000457.14 0.874646635361525 0.95097361100203
  23. ENSG00000000460.17 0.727045310707714 0.879978596275505
  24. ENSG00000000938.13 0.696105412837306 NA
  25. ... ... ...
  26. ENSG00000288636.1 0.0339611945802605 NA
  27. ENSG00000288637.1 0.484567489650358 0.720059819063478
  28. ENSG00000288640.1 0.612196202195972 0.808631021762538
  29. ENSG00000288642.1 0.469911690167384 NA
  30. ENSG00000288644.1 0.757247357961551 NA

当然,我们也可以指定比较的方式。

  1. > resultsNames(dds)
  2. [1] "Intercept" "cell_N061011_vs_N052611" "cell_N080611_vs_N052611"
  3. [4] "cell_N61311_vs_N052611" "dex_trt_vs_untrt"
  4. > res <- results(dds, contrast=c("dex","trt","untrt"))
  5. > mcols(res, use.names = TRUE)
  6. DataFrame with 6 rows and 2 columns
  7. type description
  8. baseMean intermediate mean of normalized counts for all samples
  9. log2FoldChange results log2 fold change (MLE): dex trt vs untrt
  10. lfcSE results standard error: dex trt vs untrt
  11. stat results Wald statistic: dex trt vs untrt
  12. pvalue results Wald test p-value: dex trt vs untrt
  13. padj results BH adjusted p-values
  14. > summary(res)
  15.  
  16. out of 32920 with nonzero total read count
  17. adjusted p-value < 0.1
  18. LFC > 0 (up) : 2401, 7.3%
  19. LFC < 0 (down) : 1972, 6%
  20. outliers [1] : 0, 0%
  21. low counts [2] : 15956, 48%
  22. (mean count < 9)
  23. [1] see 'cooksCutoff' argument of ?results
  24. [2] see 'independentFiltering' argument of ?results
  25.  
  26. >

看到,总共有4种比较方式,这里我们依然设置dex的实验组和对照组之间的差异表达。summary函数可以返回按照我们设置的阈值标准通过的差异表达基因的数量。baseMean表示各个基因在不同样本间的标准化后的表达矩阵的平均值;log2FoldChange则是我们比较关心的差异表达的强度变化;当然,这种表达强度的变化并不是在每个(细胞类型)组的比较都是一致的,因此我们用标准差来描述这种差异的变异情况,也就是lfcSE。当然还有wald检验后得到的p value以及使用BH矫正后的p value都可以用来描述这种表达强度变化的不确定性。

当然我们也可以改变差异表达的阈值来改变最终通过差异表达的基因的数量。下面的代码分别通过的卡假阳性率(padj)以及卡差异表达强度的变化(lfc)的方法,对差异表达基因进行了进一步的筛选。

  1. > #decrease the false discovery rate (FDR)
  2. > res.05 <- results(dds, alpha = 0.05)
  3. > summary(res.05)
  4.  
  5. out of 32920 with nonzero total read count
  6. adjusted p-value < 0.05
  7. LFC > 0 (up) : 2015, 6.1%
  8. LFC < 0 (down) : 1603, 4.9%
  9. outliers [1] : 0, 0%
  10. low counts [2] : 16594, 50%
  11. (mean count < 11)
  12. [1] see 'cooksCutoff' argument of ?results
  13. [2] see 'independentFiltering' argument of ?results
  14.  
  15. > table(res.05$padj < 0.05)
  16.  
  17. FALSE TRUE
  18. 12708 3618
  19. > #raise the fold change
  20. > resLFC1 <- results(dds, lfcThreshold=1)
  21. > summary(resLFC1)
  22.  
  23. out of 32920 with nonzero total read count
  24. adjusted p-value < 0.1
  25. LFC > 1.00 (up) : 143, 0.43%
  26. LFC < -1.00 (down) : 76, 0.23%
  27. outliers [1] : 0, 0%
  28. low counts [2] : 13403, 41%
  29. (mean count < 5)
  30. [1] see 'cooksCutoff' argument of ?results
  31. [2] see 'independentFiltering' argument of ?results
  32.  
  33. > table(resLFC1$padj < 0.1)
  34.  
  35. FALSE TRUE
  36. 19298 219
  37. >

其他比较

当然我们也可以使用results函数中的contrast参数和name参数进行各种我们想要的比较。

  1. > #2.3 Other comparisons
  2. > results(dds, contrast = c("cell", "N061011", "N61311"))
  3. log2 fold change (MLE): cell N061011 vs N61311
  4. Wald test p-value: cell N061011 vs N61311
  5. DataFrame with 32920 rows and 6 columns
  6. baseMean log2FoldChange lfcSE stat
  7. ENSG00000000003.15 742.027385348416 0.268736206423564 0.18342134708472 1.46513048069284
  8. ENSG00000000419.13 511.733294147504 -0.0769346903889134 0.18269336402621 -0.421113765127649
  9. ENSG00000000457.14 312.29261469575 0.183307224602225 0.225831447498455 0.811699285607592
  10. ENSG00000000460.17 79.3907526709186 -0.15753454183667 0.447319749296685 -0.35217434974503
  11. ENSG00000000938.13 0.307079546955295 0 4.9975798104491 0
  12. ... ... ... ... ...
  13. ENSG00000288636.1 3.34150352168806 0.512761341689826 2.2443030878119 0.228472412872607
  14. ENSG00000288637.1 14.0035268554639 1.62456410843762 0.884427160806683 1.83685461101834
  15. ENSG00000288640.1 25.576775477853 -0.0272704904165941 0.632018225775059 -0.0431482658322893
  16. ENSG00000288642.1 2.70286053446089 -3.46055682956256 3.1930099143229 -1.08379144519393
  17. ENSG00000288644.1 0.923439930859957 2.0935591646235 4.93399742969465 0.42431298241537
  18. pvalue padj
  19. ENSG00000000003.15 0.142885322019071 0.517056988815031
  20. ENSG00000000419.13 0.673672010481914 0.923753118447307
  21. ENSG00000000457.14 0.416964204379591 0.814864516519577
  22. ENSG00000000460.17 0.724707512192108 0.943151150723668
  23. ENSG00000000938.13 1 NA
  24. ... ... ...
  25. ENSG00000288636.1 0.819279000336978 NA
  26. ENSG00000288637.1 0.0662313613909477 0.349822008801278
  27. ENSG00000288640.1 0.965583344530636 0.992049518979949
  28. ENSG00000288642.1 0.278457279130544 NA
  29. ENSG00000288644.1 0.6713375724719 NA
  30. >

多重检验

在高通量的测序数据中,我们通常尽可能避免直接使用p value,由于我们有大量的基因数,因此即使选择很小的p value,也有可能带来相当巨大的假阳性率。

  1. > #2.4 Multiple testing
  2. > #the false discovery rate due to multiple test
  3. > sum(res$pvalue < 0.05, na.rm=TRUE)
  4. [1] 5263
  5. > sum(!is.na(res$pvalue))
  6. [1] 32920
  7. > #if p threshold = 0.05
  8. > fdr = (0.05*sum(!is.na(res$pvalue)))/5263
  9. > fdr
  10. [1] 0.3127494
  11. >

从上面的分析可以看到,p value小于0.05的基因数有5263个,全部的筛选后的基因有32920个。那么,假设我们的数据服从null假设,即处理前后没有变化,那么根据p value的定义,我们知道会有5%的基因被错误地分类为阳性结果,也就是假阳性的结果。我们可以看到这部分假阳性的结果占到所有阳性结果的31%,因此我们可以说:直接使用p value来作为判断的标准,难以控制阳性结果中的假阳性率。当然,在这方面我们也有非常多成熟的算法来对结果进行修正,最常见的就是BH矫正,矫正的结果一般称之为FDR或padj。一般来说,BH矫正的阈值直接决定了假阳性率的大小(in brief, this method calculates for each gene an adjusted p value that answers the following question: if one called significant all genes with an adjusted p value less than or equal to this gene’s adjusted p value threshold, what would be the fraction of false positives (the false discovery rate, FDR) among them)。

  1. > #if false discovery fate = 0.1
  2. > sum(res$padj < 0.1, na.rm=TRUE)
  3. [1] 4373
  4. > #get significant genes
  5. > resSig <- subset(res, padj < 0.1)
  6. > #strongest down-regulation
  7. > head(resSig[order(resSig$log2FoldChange,decreasing = FALSE),])
  8. log2 fold change (MLE): dex trt vs untrt
  9. Wald test p-value: dex trt vs untrt
  10. DataFrame with 6 rows and 6 columns
  11. baseMean log2FoldChange lfcSE stat
  12. ENSG00000235174.1 11.559747104067 -7.72171774800333 3.38538757988485 -2.28089622407901
  13. ENSG00000280852.2 16.4764928902195 -5.2675486636711 2.00031644174996 -2.63335767967933
  14. ENSG00000257542.5 19.3345948420855 -5.00272789483913 2.0574374082432 -2.43153345749208
  15. ENSG00000146006.8 55.5996261041168 -4.28395099706049 0.650053973930185 -6.59014661684166
  16. ENSG00000108700.5 14.5099736013949 -4.08037596598935 0.939294244818382 -4.34408705099468
  17. ENSG00000155897.10 22.9776308925677 -4.03144853094491 0.845023966288923 -4.77080969507854
  18. pvalue padj
  19. ENSG00000235174.1 0.0225545884487648 0.0904387185320246
  20. ENSG00000280852.2 0.00845452601041571 0.0415958756498527
  21. ENSG00000257542.5 0.0150350596501083 0.0659567499106379
  22. ENSG00000146006.8 4.39392443858689e-11 1.22596273316099e-09
  23. ENSG00000108700.5 1.39856053242458e-05 0.000164872695427732
  24. ENSG00000155897.10 1.8348683549146e-06 2.62894482878136e-05
  25. > #strongest up-regulation
  26. > head(resSig[order(resSig$log2FoldChange,decreasing = TRUE),])
  27. log2 fold change (MLE): dex trt vs untrt
  28. Wald test p-value: dex trt vs untrt
  29. DataFrame with 6 rows and 6 columns
  30. baseMean log2FoldChange lfcSE stat
  31. ENSG00000179593.16 67.1461478274207 9.50575055732586 1.07612899394557 8.83328170768219
  32. ENSG00000224712.12 35.3836081465072 7.14509353707476 2.16130265497796 3.30591993704261
  33. ENSG00000109906.14 1033.70021615222 6.66901385295929 0.39773436123582 16.7675074193682
  34. ENSG00000250978.5 45.4578052875013 5.91313864050675 0.720209179506691 8.21030723956749
  35. ENSG00000101342.10 39.217785350109 5.72927135957107 1.45533465210653 3.93673809063654
  36. ENSG00000249364.6 14.460887352797 5.67220252036344 1.17631737724728 4.8220001081996
  37. pvalue padj
  38. ENSG00000179593.16 1.01648597959634e-18 6.34131307724195e-17
  39. ENSG00000224712.12 0.000946651328083399 0.00666444712676844
  40. ENSG00000109906.14 4.21836791193269e-63 2.98168305241775e-60
  41. ENSG00000250978.5 2.20623227036585e-16 1.12055461779899e-14
  42. ENSG00000101342.10 8.25966685528947e-05 0.000803884042071891
  43. ENSG00000249364.6 1.42125868688356e-06 2.08205806254687e-05
  44. >

差异表达结果可视化

counts数变化

最简单的,我们可以展示某个差异表达基因在处理前后的counts数的变化。

  1. #---------------------------------------------------#
  2. #---------------------------------------------------#
  3. #3.Plotting results
  4. #3.1 Counts plot -- visualize the count of a particular gene
  5. topGene <- rownames(res)[which.min(res$padj)]
  6. pdf(file = './res/plotcounts_function.pdf',width = 8,height = 6)
  7. plotCounts(dds, gene = topGene, intgroup=c("dex"))
  8. dev.off()

使用DESeq2进行差异表达分析-图片11

当然这个图不是特别好看,我们可以使用ggplot画个更好看的图。

  1. # counts plot using ggplot2
  2. geneCounts <- plotCounts(dds, gene = topGene, intgroup = c("dex","cell"),returnData = TRUE)
  3. geneCounts
  4. p1 <- ggplot(geneCounts, aes(x = dex, y = count, color = cell)) scale_y_log10() geom_beeswarm(cex = 3)
  5. p2 <- ggplot(geneCounts, aes(x = dex, y = count, color = cell, group = cell)) scale_y_log10() geom_point(size = 3) geom_line()
  6. pdf(file = './res/count_plot_using_ggplot.pdf',width = 8,height = 6)
  7. Rmisc::multiplot(p1,p2,cols = 1)
  8. dev.off()

使用DESeq2进行差异表达分析-图片12

从处理前和处理后的不同细胞系的连线图来看,我们也可以清楚的意识到,不同的细胞系对于处理所表现出的影响并不是完全相同的,我们在做差异比较分析的时候应当牢牢记住这一点。

MA plot

MA图也能够帮助我们了解差异表达基因的分布情况。M代表minus,是取对数后表达量的差值,其实就是lfc,A表示average,是基因的平均表达量的分布。注意:plotMA函数在其他的包中也存在,有时会引起冲突出现错误,这时候不妨试一下指定函数的包来源。

  1. # MA plot without shrink
  2. res.noshr <- results(dds, name="dex_trt_vs_untrt")
  3. pdf(file = './res/MA_plot_without_shrink.pdf',width = 8,height = 6)
  4. plotMA(res.noshr, ylim = c(-5, 5))
  5. dev.off()

使用DESeq2进行差异表达分析-图片13

我们会发现,对于低表达量的基因,表达量的一些轻微变化就会引起lfc的巨大变化,但显然这种变化极有可能是噪声造成的,并没有明显的生物学意义,因此我们需要对lfc进行一定的缩放。在lfcShrink函数中内置了三种缩放的方法,一般可以直接采用广义线性模型的方法,他对于噪声的缩放以及真实生物学信号的还原有着非常好的效果。

  1. #3.2 MA plot
  2. resultsNames(dds)
  3. res <- lfcShrink(dds, coef="dex_trt_vs_untrt", type="apeglm")
  4. pdf(file = './res/MA_plot.pdf',width = 8,height = 6)
  5. plotMA(res, ylim = c(-5, 5))
  6. dev.off()

使用DESeq2进行差异表达分析-图片14

当然,我们也可以在图上标注出我们感兴趣的基因。

  1. # show the gene we interested about on the MA plot
  2. pdf(file = './res/MA_plot_with_gene.pdf',width = 8,height = 6)
  3. plotMA(res, ylim = c(-5,5))
  4. topGene <- rownames(res)[which.min(res$padj)]
  5. with(res[topGene, ], {
  6. points(baseMean, log2FoldChange, col="dodgerblue", cex=2, lwd=2)
  7. text(baseMean, log2FoldChange, topGene, pos=2, col="dodgerblue")
  8. })
  9. dev.off()

使用DESeq2进行差异表达分析-图片15

另外,p value的分布情况也值得我们的关注。

  1. # histogram show the distribution of p value
  2. pdf(file = './res/histogram_show_the_distribution_of_p_value.pdf',width = 8,height = 6)
  3. hist(res$pvalue[res$baseMean > 1], breaks = (0:20)/20,col = "grey50", border = "white")
  4. dev.off()

使用DESeq2进行差异表达分析-图片16

基因聚类

我们可以对一群高差异表达的基因进行聚类,观察他们在不同样本间的表达情况。

  1. #3.3 gene cluster
  2. topVarGenes <- head(order(rowVars(assay(vsd)), decreasing = TRUE), 20)
  3. mat <- assay(vsd)[topVarGenes, ]
  4. mat <- mat - rowMeans(mat)
  5. anno <- as.data.frame(colData(vsd)[, c("cell","dex")])
  6. pdf(file = './res/pheatmap_top_20_variable_gene.pdf',width = 8,height = 6)
  7. pheatmap(mat, annotation_col = anno,scale = 'none')
  8. dev.off()

使用DESeq2进行差异表达分析-图片17

独立过滤

从之前的MA图我们便可以了解到,低表达量的基因由于混杂了大量的噪声,因此难以提供可靠的生物学信号,通常我们可以把这部分基因预先筛选掉,从而大大提高我们利用padj发现差异表达基因的能力。

  1. #3.4 independent filtering
  2. #filter the low counts gene to improve FDR
  3. qs <- c(0, quantile(resLFC1$baseMean[resLFC1$baseMean > 0], 0:6/6))
  4. bins <- cut(resLFC1$baseMean, qs)
  5. levels(bins) <- paste0("~", round(signif((qs[-1] qs[-length(qs)])/2, 2)))
  6. fractionSig <- tapply(resLFC1$pvalue, bins, function(p){mean(p < .05, na.rm = TRUE)})
  7. pdf(file = './res/proportion_of_small_p_value.pdf',width = 8,height = 6)
  8. barplot(fractionSig, xlab = "mean normalized count",
  9. ylab = "fraction of small p values")
  10. dev.off()

使用DESeq2进行差异表达分析-图片18

可以看到,对于低表达量的基因来说,满足高显著性(p value < 0.05)的基因的比例非常低,通常可以直接删去。在DESeq2中,results函数中的相关参数会默认执行独立过滤,从而得到尽可能多的,通过padj的基因数。

独立假设权重(Independent Hypothesis Weighting)

p值过滤的一个一般性想法是对假设的权重进行打分,从而优化检验的效力(纯翻译,我没看懂是啥意思,哭了)。我们可以使用IHW包来进行这项操作,感兴趣的同学可以参考这个包的说明文档以及相关的文献

结果注释与导出

前面所有的结果,基因的名字都是用ensembl id来表示的,但我们通常更倾向于使用symbol来命名基因,因此我们需要做一个基因id的转换。在bioconductor上有非常多的生物注释包,我们可以直接利用这些包中存储的相关注释信息来对我们的结果进行相应的注释。

  1. #---------------------------------------------------#
  2. #---------------------------------------------------#
  3. #4.Annotating and exporting results
  4. #rename gene
  5. columns(org.Hs.eg.db)
  6. ens.str <- substr(rownames(res), 1, 15)
  7. res$symbol <- mapIds(org.Hs.eg.db,
  8. keys=ens.str,
  9. column="SYMBOL",
  10. keytype="ENSEMBL",
  11. multiVals="first")
  12. res$entrez <- mapIds(org.Hs.eg.db,
  13. keys=ens.str,
  14. column="ENTREZID",
  15. keytype="ENSEMBL",
  16. multiVals="first")
  17. resOrdered <- res[order(res$pvalue),]
  18. head(resOrdered)

这样,我们就成功为我们的结果表格增加了两个注释列。

在基因组空间上展示表达强度变化

我们可以在基因组的track图上,展示出相应位置上的基因,以及这些基因的表达强度变化(lfc)。如果我们在DESeq2的准备工作中使用了tximeta函数来读取原始测序文件的比对结果并导入DESeq2,那我们的DESeqDataSet对象中则自动包含了每个基因所在的基因组位置信息(gene range),当然,我们也可以尝试进行手动的注释。下面直接使用lfcShrink函数,在对差异表达进行缩放的同时,加入基因组位置的信息。

  1. > #4.1 Plotting fold changes in genomic space
  2. > resGR <- lfcShrink(dds, coef="dex_trt_vs_untrt", type="apeglm", format="GRanges")
  3. using 'apeglm' for LFC shrinkage. If used in published research, please cite:
  4. Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
  5. sequence count data: removing the noise and preserving large differences.
  6. Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
  7. > resGR
  8. GRanges object with 32920 ranges and 5 metadata columns:
  9. seqnames ranges strand | baseMean log2FoldChange
  10. |
  11. ENSG00000000003.15 chrX 100627108-100639991 - | 742.027385348416 -0.462006977773251
  12. ENSG00000000419.13 chr20 50934867-50958555 - | 511.733294147504 0.175874104721422
  13. ENSG00000000457.14 chr1 169849631-169894267 - | 312.29261469575 0.0210504596782739
  14. ENSG00000000460.17 chr1 169662007-169854080 | 79.3907526709186 -0.0525107381932432
  15. ENSG00000000938.13 chr1 27612064-27635185 - | 0.307079546955295 -0.0115833528835126
  16. ... ... ... ... . ... ...
  17. ENSG00000288636.1 chr1 17005068-17011757 - | 3.34150352168806 -0.111244739220307
  18. ENSG00000288637.1 chr4 153152435-153415081 | 14.0035268554639 -0.0856302687213988
  19. ENSG00000288640.1 chr7 112450460-112490941 | 25.576775477853 0.0731990767853117
  20. ENSG00000288642.1 chrX 140783578-140784366 - | 2.70286053446089 -0.0231619748179045
  21. ENSG00000288644.1 chr1 203802094-203802192 | 0.923439930859957 -0.00513130945288652
  22. lfcSE pvalue padj
  23. ENSG00000000003.15 0.129731785360462 0.000102317512305595 0.000973479685222727
  24. ENSG00000000419.13 0.122266659157736 0.111519458749328 0.294711851453744
  25. ENSG00000000457.14 0.141312325328858 0.874646635361525 0.95097361100203
  26. ENSG00000000460.17 0.223920772734305 0.727045310707714 0.879978596275505
  27. ENSG00000000938.13 0.304446602085739 0.696105412837306
  28. ... ... ... ...
  29. ENSG00000288636.1 0.33098831652559 0.0339611945802605
  30. ENSG00000288637.1 0.286103379083077 0.484567489650358 0.720059819063478
  31. ENSG00000288640.1 0.258699729044232 0.612196202195972 0.808631021762538
  32. ENSG00000288642.1 0.30456581058953 0.469911690167384
  33. ENSG00000288644.1 0.304189947166826 0.757247357961551
  34. -------
  35. seqinfo: 25 sequences (1 circular) from hg38 genome
  36. > ens.str <- substr(names(resGR), 1, 15)
  37. > resGR$symbol <- mapIds(org.Hs.eg.db, ens.str, "SYMBOL", "ENSEMBL")
  38. 'select()' returned 1:many mapping between keys and columns
  39. >

接着我们需要用Gviz包,画出相应的GRanges窗口,并加上相应的基因注释以及表达强度变化信息。下面的代码,我们挑选出我们感兴趣的基因上下游一百万个碱基的窗口,找到这个窗口中相应的基因,最后从上到下构建出相应的染色体区域,基因表达强度变化的峰图,以及相应的基因注释框。

  1. # specify the window and filter the symbol
  2. window <- resGR[topGene] 1e6
  3. strand(window) <- "*"
  4. resGRsub <- resGR[resGR %over% window]
  5. naOrDup <- is.na(resGRsub$symbol) | duplicated(resGRsub$symbol)
  6. resGRsub$group <- ifelse(naOrDup, names(resGRsub), resGRsub$symbol)
  7. status <- factor(ifelse(resGRsub$padj < 0.05 & !is.na(resGRsub$padj),"sig", "notsig"))
  8. # ploting the result using Gviz
  9. options(ucscChromosomeNames = FALSE)
  10. g <- GenomeAxisTrack()
  11. a <- AnnotationTrack(resGRsub, name = "gene ranges", feature = status)
  12. d <- DataTrack(resGRsub, data = "log2FoldChange", baseline = 0, type = "h", name = "log2 fold change", strand = " ")
  13. pdf(file = './res/plot_lfc_in_genomic_space.pdf',width = 8,height = 6)
  14. plotTracks(list(g, d, a), groupAnnotation = "group", notsig = "grey", sig = "hotpink")
  15. dev.off()

使用DESeq2进行差异表达分析-图片19

矫正潜在批次效应

假设我们不知道我们的八个样本存在细胞系之间的差异,而这种细胞系对于差异表达的影响又是实实在在存在的,因此这种细胞系的差异带来的影响成为了一种潜在的批次效应。我们可以利用sva包或者RUVSeq包对这些潜在的批次效应进行建模,并在构建DESeqDataSet的过程中将这些批次效应一并写入design formula。

SVA

首先我们得到经过测序深度矫正过后的表达矩阵,筛选出其中在各个样本间平均表达量大于1的基因;接着我们依照dex处理构建一个全模型,一个null模型,并且设定两个替代变量(surrogate variable)。

  1. > #---------------------------------------------------#
  2. > #---------------------------------------------------#
  3. > #5.remove hiddden batch effects
  4. > #5.1 use SVA with DESeq2
  5. > # fillter the gene has normalized counts > 1
  6. > dat <- counts(dds, normalized = TRUE)
  7. > idx <- rowMeans(dat) > 1
  8. > dat <- dat[idx, ]
  9. > # build two model, one account for dex and one for null hypothesis
  10. > mod <- model.matrix(~ dex, colData(dds))
  11. > mod0 <- model.matrix(~ 1, colData(dds))
  12. > # set two surrogate variables
  13. > svseq <- svaseq(dat, mod, mod0, n.sv = 2)
  14. Number of significant surrogate variables is: 2
  15. Iteration (out of 5 ):1 2 3 4 5 > svseq$sv
  16. [,1] [,2]
  17. [1,] -0.2716054 -0.51274255
  18. [2,] -0.2841240 -0.56748944
  19. [3,] -0.1329942 0.28279655
  20. [4,] -0.1925966 0.43526264
  21. [5,] 0.6021702 -0.08935819
  22. [6,] 0.6107734 -0.06471253
  23. [7,] -0.1722518 0.26804837
  24. [8,] -0.1593716 0.24819515
  25. >

我们可以看下返回的替代变量与细胞系之间的关系。

  1. # show the correlation of the sv with the cellline
  2. pdf(file = './res/ralationship_between_sv_and_celline.pdf',width = 8,height = 6)
  3. par(mfrow = c(2, 1), mar = c(3,5,3,1))
  4. for (i in 1:2) {
  5. stripchart(svseq$sv[, i] ~ dds$cell, vertical = TRUE, main = paste0("SV", i))
  6. abline(h = 0)
  7. }
  8. dev.off()

使用DESeq2进行差异表达分析-图片20

可以看到,建模得到的替代变量很好地捕捉到了来自细胞系之间的差异,在后续DESeqDataSet进行广义线性模型建模的时候,我们可以通过控制这两个替代变量从而专注于dex处理所带来的变化。

  1. # use sva to remove the batch effect
  2. ddssva <- dds
  3. ddssva$SV1 <- svseq$sv[,1]
  4. ddssva$SV2 <- svseq$sv[,2]
  5. design(ddssva) <- ~ SV1 SV2 dex

RUVSeq

我们也可以类似地使用RUVSeq包对潜在的批次效应进行建模。需要注意的是RUVSeq需要先根据已知的处理或批次,构建全模型,从而获得差异表达分析的结果;再从中挑选出差异表达不显著的基因作为经验控制基因,进一步用这些基因的表达量进行建模,从而找到相应的unwanted variable。

  1. #5.2 Using RUV with DESeq2
  2. set <- newSeqExpressionSet(counts(dds))
  3. idx <- rowSums(counts(set) > 5) >= 2
  4. set <- set[idx, ]
  5. set <- betweenLaneNormalization(set, which="upper")
  6. res <- results(DESeq(DESeqDataSet(gse, design = ~ dex)))
  7. not.sig <- rownames(res)[which(res$pvalue > .1)]
  8. empirical <- rownames(set)[ rownames(set) %in% not.sig ]
  9. set <- RUVg(set, empirical, k=2)
  10. pData(set)
  11.  
  12. pdf(file = './res/ralationship_between_unwanted_variation_and_celline.pdf',width = 8,height = 6)
  13. par(mfrow = c(2, 1), mar = c(3,5,3,1))
  14. for (i in 1:2) {
  15. stripchart(pData(set)[, i] ~ dds$cell, vertical = TRUE, main = paste0("W", i))
  16. abline(h = 0)
  17. }
  18. dev.off()

使用DESeq2进行差异表达分析-图片21

同样的,我们也可以矫正这些unwanted variable的影响。

  1. # use unwanted factor with DESeq2
  2. ddsruv <- dds
  3. ddsruv$W1 <- set$W_1
  4. ddsruv$W2 <- set$W_2
  5. design(ddsruv) <- ~ W1 W2 dex

当然,我们矫正完潜在的批次效应后应当重新跑一下DESeq函数,从而来估计和修正新的参数和影响。

时序研究

DESeq2也可用于存在一系列时间点的时序实验中,从而探究特定条件下,与基准样本相比,基因表达随时间的变化。这里我们利用fission包中关于酵母分裂过程中一系列时间点的bulk测序数据来简要说明如何利用DESeq2进行相关的时序分析。我们在控制品系和时间的基础上,探究品系与时间的互作项对基因表达的影响。

  1. #---------------------------------------------------#
  2. #---------------------------------------------------#
  3. #6.Time course experiments
  4. # use a yeast fission data, control the effect of strain and time, investigate the strain-specific difference over time
  5. data("fission")
  6. ddsTC <- DESeqDataSet(fission, ~ strain minute strain:minute)

我们使用似然性检验进行差异表达分析,并去除品系之间随时间变化的差异表达,从而关注于在一个或多个时间点,在不同的品系间存在显著差异表达的基因。

  1. > ddsTC <- DESeq(ddsTC, test="LRT", reduced = ~ strain minute)
  2. estimating size factors
  3. estimating dispersions
  4. gene-wise dispersion estimates
  5. mean-dispersion relationship
  6. final dispersion estimates
  7. fitting model and testing
  8. > resTC <- results(ddsTC)
  9. > resTC$symbol <- mcols(ddsTC)$symbol
  10. > head(resTC[order(resTC$padj),], 4)
  11. log2 fold change (MLE): strainmut.minute180
  12. LRT p-value: '~ strain minute strain:minute' vs '~ strain minute'
  13. DataFrame with 4 rows and 7 columns
  14. baseMean log2FoldChange lfcSE stat pvalue
  15. SPBC2F12.09c 174.671161802578 -2.65671953167539 0.752261272510015 97.2833864066372 1.97415107456139e-19
  16. SPAC1002.18 444.504950375698 -0.050932139511341 0.204299484855662 56.9535986228365 5.16955159218189e-11
  17. SPAC1002.19 336.373206596558 -0.392748982380864 0.57349400917308 43.533908577047 2.87980357351979e-08
  18. SPAC1002.17c 261.773132733438 -1.13876476912801 0.60612875682785 39.3158355351572 2.05137078379603e-07
  19. padj symbol
  20. SPBC2F12.09c 1.3345261264035e-15 atf21
  21. SPAC1002.18 1.74730843815748e-07 urg3
  22. SPAC1002.19 6.4891573856646e-05 urg1
  23. SPAC1002.17c 0.000346681662461529 urg2
  24. >

我们可以挑出最显著的差异表达基因,画出其在不同的品系中,随时间变化的趋势。需要注意的是,这里的差异表达是在矫正了起始时间点差异后,在一段时间内,两个品系之间的差异表达(Keep in mind that the interaction terms are the difference between the two groups at a given time after accounting for the difference at time 0)。

  1. fiss <- plotCounts(ddsTC, which.min(resTC$padj), intgroup = c("minute","strain"), returnData = TRUE)
  2. fiss$minute <- as.numeric(as.character(fiss$minute))
  3. pdf(file = './res/show_strain_independent_fiss_gene_change_through_time.pdf',width = 8,height = 4)
  4. ggplot(fiss, aes(x = minute, y = count, color = strain, group = strain))
  5. geom_point()
  6. stat_summary(fun.y=mean, geom="line")
  7. scale_y_log10()
  8. dev.off()

使用DESeq2进行差异表达分析-图片22

在各个时间点的不同品系间的差异表达分析也可以使用results函数中的name参数加以指定。

  1. > # Wald test
  2. > resultsNames(ddsTC)
  3. [1] "Intercept" "strain_mut_vs_wt" "minute_15_vs_0" "minute_30_vs_0"
  4. [5] "minute_60_vs_0" "minute_120_vs_0" "minute_180_vs_0" "strainmut.minute15"
  5. [9] "strainmut.minute30" "strainmut.minute60" "strainmut.minute120" "strainmut.minute180"
  6. > res30 <- results(ddsTC, name="strainmut.minute30", test="Wald")
  7. > res30[which.min(resTC$padj),]
  8. log2 fold change (MLE): strainmut.minute30
  9. Wald test p-value: strainmut.minute30
  10. DataFrame with 1 row and 6 columns
  11. baseMean log2FoldChange lfcSE stat pvalue
  12. SPBC2F12.09c 174.671161802578 -2.60046902875454 0.634342916342929 -4.09946885471125 4.14099389594983e-05
  13. padj
  14. SPBC2F12.09c 0.279931187366209
  15. >

我们也可以根据不同的条件对基因进行聚类。利用coef函数取出不同比较条件下的lfc值,聚类画热图。

  1. # Extract a matrix of model coefficients/standard errors
  2. betas <- coef(ddsTC)
  3. colnames(betas)
  4. # cluster significant genes by their profiles
  5. topGenes <- head(order(resTC$padj),20)
  6. mat <- betas[topGenes, -c(1,2)]
  7. thr <- 3
  8. mat[mat < -thr] <- -thr
  9. mat[mat > thr] <- thr
  10. pdf(file = './res/cluster_gene_by_coef_of_different_conditions.pdf',width = 8,height = 6)
  11. pheatmap(mat, breaks=seq(from=-thr, to=thr, length=101), cluster_col=FALSE)
  12. dev.off()

使用DESeq2进行差异表达分析-图片23

完整代码

最后老规矩,贴一个完整代码。另外,由于这里的命名有些混乱,因此简单总结下DESeq2的基本流程:

SummarizedExperiment类 —- DESeqDataSet函数 —-> DESeqDataSet对象 —- DESeq函数 —-> 差异表达分析后的DESeqDataSet对象 —- results/lfcShrink函数 —-> 提取差异表达基因

  1. ############################################
  2. ## Project: DESeq2 learining
  3. ## Script Purpose: learn DESeq2
  4. ## Data: 2021.01.08
  5. ## Author: Yiming Sun
  6. ############################################
  7. #sleep
  8. ii <- 1
  9. while(1){
  10. cat(paste("round",ii),sep = "n")
  11. ii <- ii 1
  12. Sys.sleep(30)
  13. }
  14. #work dir
  15. setwd('/lustre/user/liclab/sunym/R/rstudio/deseq2/')
  16. #load R package
  17. .libPaths('/lustre/user/liclab/sunym/R/rstudio/Rlib/')
  18. library(DESeq2)
  19. library(airway)
  20. library(tximeta)
  21. library(GEOquery)
  22. library(GenomicFeatures)
  23. library(BiocParallel)
  24. library(GenomicAlignments)
  25. library(vsn)
  26. library(dplyr)
  27. library(ggplot2)
  28. library(RColorBrewer)
  29. library(pheatmap)
  30. library(PoiClaClu)
  31. library(glmpca)
  32. library(ggbeeswarm)
  33. library(apeglm)
  34. library(Rmisc)
  35. library(genefilter)
  36. library(AnnotationDbi)
  37. library(org.Hs.eg.db)
  38. library(Gviz)
  39. library(sva)
  40. library(RUVSeq)
  41. library(fission)
  42. #load data
  43. #-----------Transcript quantification by Salmon-----------------#
  44. #metadata
  45. dir <- system.file("extdata", package="airway", mustWork=TRUE)
  46. csvfile <- file.path(dir, "sample_table.csv")
  47. coldata <- read.csv(csvfile, row.names=1, stringsAsFactors=FALSE)
  48. coldata$names <- coldata$Run
  49. dir <- getwd()
  50. dir <- paste(dir,'airway_salmon',sep = '/')
  51. coldata$files <- file.path(dir,coldata$Run,'quant.sf')
  52. file.exists(coldata$files)
  53. #create SummarizedExperiment dataset
  54. makeLinkedTxome(indexDir='/lustre/user/liclab/sunym/R/rstudio/deseq2/Salmon_index/gencode.v35_salmon_0.14.1/', source="GENCODE", organism="Homo sapiens",
  55. release="35", genome="GRCh38", fasta='/lustre/user/liclab/sunym/R/rstudio/deseq2/transcript/gencode.v35.transcripts.fa', gtf='/lustre/user/liclab/sunym/R/rstudio/deseq2/annotation_gencode/gencode.v35.annotation.gtf', write=FALSE)
  56. se <- tximeta(coldata)
  57. #explore se
  58. dim(se)
  59. head(rownames(se))
  60. gse <- summarizeToGene(se)
  61. dim(gse)
  62. head(rownames(gse))
  63. #DESeqDataSet
  64. #start from SummarizedExperiment set
  65. round( colSums(assay(gse))/1e6, 1 )
  66. colData(gse)
  67. gse$dex
  68. dds <- DESeqDataSet(gse, design = ~ cell dex)
  69. dds$dex
  70. dds$dex <- factor(dds$dex,levels = c('untrt','trt'))
  71. dds$dex
  72. #1.Exploratory analysis and visualization
  73. head(assay(dds))
  74. #1.1 Pre-filtering the dataset
  75. #the minimal filter(the features that express)
  76. nrow(dds)
  77. keep <- rownames(dds)[rowSums(counts(dds)) > 1]
  78. length(keep)
  79. dds <- dds[keep,]
  80. nrow(dds)
  81.  
  82. #1.2 The variance stabilizing transformation and the rlog
  83. #varience increase along with mean(exampled by poisson distribution)
  84. lambda <- 10^seq(from = -1, to = 2, length = 1000)
  85. cts <- matrix(rpois(1000*200, lambda), ncol = 200)
  86. pdf(file = './res/possion_mean_variance.pdf',width = 8,height = 5)
  87. meanSdPlot(cts, ranks = FALSE)
  88. dev.off()
  89. #poisson processed by log
  90. log.cts.one <- log2(cts 1)
  91. pdf(file = './res/poisson_log_mean_variance.pdf',width = 8,height = 5)
  92. meanSdPlot(log.cts.one, ranks = FALSE)
  93. dev.off()
  94. #conclusion:the simple log transform will increase the noise of the data close to 0
  95. #solution:vst and rlog
  96. #vst
  97. vsd <- vst(dds, blind = FALSE)
  98. head(assay(vsd), 3)
  99. colData(vsd)
  100. #rlog
  101. rld <- rlog(dds, blind = FALSE)
  102. head(assay(rld), 3)
  103. #compare log transform with vst and rlog
  104. dds <- estimateSizeFactors(dds)
  105. df <- bind_rows(
  106. as_data_frame(log2(counts(dds, normalized=TRUE)[, 1:2] 1)) %>% mutate(transformation = "log2(x 1)"),
  107. as_data_frame(assay(vsd)[, 1:2]) %>% mutate(transformation = "vst"),
  108. as_data_frame(assay(rld)[, 1:2]) %>% mutate(transformation = "rlog"))
  109. colnames(df)[1:2] <- c("x", "y")
  110. lvls <- c("log2(x 1)", "vst", "rlog")
  111. df$transformation <- factor(df$transformation, levels=lvls)
  112. pdf(file = './res/compare_log_vst_rlog.pdf',width = 12,height = 6)
  113. ggplot(df, aes(x = x, y = y)) geom_hex(bins = 80)
  114. coord_fixed() facet_grid( . ~ transformation)
  115. dev.off()
  116.  
  117. #1.3 Sample distances
  118. #Euclidean distance
  119. sampleDists <- dist(t(assay(vsd)))
  120. sampleDists
  121. sampleDistMatrix <- as.matrix( sampleDists )
  122. rownames(sampleDistMatrix) <- paste( vsd$dex, vsd$cell, sep = " - " )
  123. colnames(sampleDistMatrix) <- NULL
  124. colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
  125. pdf(file = './res/sample_distance.pdf',width = 8,height = 6.5)
  126. pheatmap(sampleDistMatrix,
  127. clustering_distance_rows = sampleDists,
  128. clustering_distance_cols = sampleDists,
  129. col = colors,
  130. main = 'sample euclidean distance')
  131. dev.off()
  132. #Poisson Distance
  133. poisd <- PoissonDistance(t(counts(dds)))
  134. samplePoisDistMatrix <- as.matrix( poisd$dd )
  135. rownames(samplePoisDistMatrix) <- paste( dds$dex, dds$cell, sep=" - " )
  136. colnames(samplePoisDistMatrix) <- NULL
  137. pdf(file = './res/sample_poisson_distance.pdf',width = 8,height = 6.5)
  138. pheatmap(samplePoisDistMatrix,
  139. clustering_distance_rows = poisd$dd,
  140. clustering_distance_cols = poisd$dd,
  141. col = colors,
  142. main = 'sample poisson distance')
  143. dev.off()
  144.  
  145. #1.4 PCA
  146. #plot by func build in DESeq2
  147. pdf(file = './res/PCA.pdf',width = 8,height = 4)
  148. plotPCA(vsd, intgroup = c("dex", "cell"))
  149. dev.off()
  150. #plot by ggplot2
  151. pcaData <- plotPCA(vsd, intgroup = c( "dex", "cell"), returnData = TRUE)
  152. pcaData
  153. percentVar <- round(100 * attr(pcaData, "percentVar"))
  154. percentVar
  155. pdf(file = './res/PCA_by_ggplot.pdf',width = 8,height = 4)
  156. ggplot(pcaData, aes(x = PC1, y = PC2, color = dex, shape = cell))
  157. geom_point(size =3)
  158. xlab(paste0("PC1: ", percentVar[1], "% variance"))
  159. ylab(paste0("PC2: ", percentVar[2], "% variance"))
  160. coord_fixed()
  161. ggtitle("PCA with VST data")
  162. dev.off()
  163.  
  164. #1.5 PCA plot using Generalized PCA
  165. gpca <- glmpca(counts(dds), L=2)
  166. gpca.dat <- gpca$factors
  167. gpca.dat$dex <- dds$dex
  168. gpca.dat$cell <- dds$cell
  169. pdf(file = './res/glmpca.pdf',width = 8,height = 4)
  170. ggplot(gpca.dat, aes(x = dim1, y = dim2, color = dex, shape = cell))
  171. geom_point(size =3) coord_fixed() ggtitle("glmpca - Generalized PCA")
  172. dev.off()
  173. #1.6 MDS plot
  174. #MDS plot using vst data
  175. mds <- as.data.frame(colData(vsd)) %>% cbind(cmdscale(sampleDistMatrix))
  176. pdf(file = './res/MDS_VST_plot.pdf',width = 8,height = 4)
  177. ggplot(mds, aes(x = `1`, y = `2`, color = dex, shape = cell))
  178. geom_point(size = 3) coord_fixed() ggtitle("MDS with VST data")
  179. dev.off()
  180. #MDS plot using poisson distance
  181. mdsPois <- as.data.frame(colData(dds)) %>% cbind(cmdscale(samplePoisDistMatrix))
  182. pdf(file = './res/MDS_Possion_plot.pdf',width = 8,height = 4)
  183. ggplot(mdsPois, aes(x = `1`, y = `2`, color = dex, shape = cell))
  184. geom_point(size = 3) coord_fixed() ggtitle("MDS with PoissonDistances")
  185. dev.off()
  186.  
  187.  
  188.  
  189. #--------------------------------------------------------#
  190. #--------------------------------------------------------#
  191. #2.Differential expression analysis
  192. #2.1 Running the differential expression pipeline
  193. dds <- DESeq(dds)
  194. #2.2 Building the results table
  195. res <- results(dds)
  196. res
  197. resultsNames(dds)
  198. res <- results(dds, contrast=c("dex","trt","untrt"))
  199. mcols(res, use.names = TRUE)
  200. summary(res)
  201. #decrease the false discovery rate (FDR)
  202. res.05 <- results(dds, alpha = 0.05)
  203. summary(res.05)
  204. table(res.05$padj < 0.05)
  205. #raise the fold change
  206. resLFC1 <- results(dds, lfcThreshold=1)
  207. summary(resLFC1)
  208. table(resLFC1$padj < 0.1)
  209. #2.3 Other comparisons
  210. results(dds, contrast = c("cell", "N061011", "N61311"))
  211. #2.4 Multiple testing
  212. #the false discovery rate due to multiple test
  213. sum(res$pvalue < 0.05, na.rm=TRUE)
  214. sum(!is.na(res$pvalue))
  215. #if p threshold = 0.05
  216. fdr = (0.05*sum(!is.na(res$pvalue)))/5263
  217. fdr
  218. #if false discovery fate = 0.1
  219. sum(res$padj < 0.1, na.rm=TRUE)
  220. #get significant genes
  221. resSig <- subset(res, padj < 0.1)
  222. #strongest down-regulation
  223. head(resSig[order(resSig$log2FoldChange,decreasing = FALSE),])
  224. #strongest up-regulation
  225. head(resSig[order(resSig$log2FoldChange,decreasing = TRUE),])
  226.  
  227. #---------------------------------------------------#
  228. #---------------------------------------------------#
  229. #3.Plotting results
  230. #3.1 Counts plot -- visualize the count of a particular gene
  231. topGene <- rownames(res)[which.min(res$padj)]
  232. pdf(file = './res/plotcounts_function.pdf',width = 8,height = 6)
  233. plotCounts(dds, gene = topGene, intgroup=c("dex"))
  234. dev.off()
  235. # counts plot using ggplot2
  236. geneCounts <- plotCounts(dds, gene = topGene, intgroup = c("dex","cell"),returnData = TRUE)
  237. geneCounts
  238. p1 <- ggplot(geneCounts, aes(x = dex, y = count, color = cell)) scale_y_log10() geom_beeswarm(cex = 3)
  239. p2 <- ggplot(geneCounts, aes(x = dex, y = count, color = cell, group = cell)) scale_y_log10() geom_point(size = 3) geom_line()
  240. pdf(file = './res/count_plot_using_ggplot.pdf',width = 8,height = 6)
  241. Rmisc::multiplot(p1,p2,cols = 1)
  242. dev.off()
  243. #3.2 MA plot
  244. resultsNames(dds)
  245. res <- lfcShrink(dds, coef="dex_trt_vs_untrt", type="apeglm")
  246. pdf(file = './res/MA_plot.pdf',width = 8,height = 6)
  247. DESeq2::plotMA(res, ylim = c(-5, 5))
  248. dev.off()
  249.  
  250. # MA plot without shrink
  251. res.noshr <- results(dds, name="dex_trt_vs_untrt")
  252. pdf(file = './res/MA_plot_without_shrink.pdf',width = 8,height = 6)
  253. DESeq2::plotMA(res.noshr, ylim = c(-5, 5))
  254. dev.off()
  255.  
  256. # show the gene we interested about on the MA plot
  257. pdf(file = './res/MA_plot_with_gene.pdf',width = 8,height = 6)
  258. DESeq2::plotMA(res, ylim = c(-5,5))
  259. topGene <- rownames(res)[which.min(res$padj)]
  260. with(res[topGene, ], {
  261. points(baseMean, log2FoldChange, col="dodgerblue", cex=2, lwd=2)
  262. text(baseMean, log2FoldChange, topGene, pos=2, col="dodgerblue")
  263. })
  264. dev.off()
  265.  
  266. # histogram show the distribution of p value
  267. pdf(file = './res/histogram_show_the_distribution_of_p_value.pdf',width = 8,height = 6)
  268. hist(res$pvalue[res$baseMean > 1], breaks = (0:20)/20,col = "grey50", border = "white")
  269. dev.off()
  270.  
  271. #3.3 gene cluster
  272. topVarGenes <- head(order(rowVars(assay(vsd)), decreasing = TRUE), 20)
  273. mat <- assay(vsd)[topVarGenes, ]
  274. mat <- mat - rowMeans(mat)
  275. anno <- as.data.frame(colData(vsd)[, c("cell","dex")])
  276. pdf(file = './res/pheatmap_top_20_variable_gene.pdf',width = 8,height = 6)
  277. pheatmap(mat, annotation_col = anno,scale = 'none')
  278. dev.off()
  279.  
  280. #3.4 independent filtering
  281. #filter the low counts gene to improve FDR
  282. qs <- c(0, quantile(resLFC1$baseMean[resLFC1$baseMean > 0], 0:6/6))
  283. bins <- cut(resLFC1$baseMean, qs)
  284. levels(bins) <- paste0("~", round(signif((qs[-1] qs[-length(qs)])/2, 2)))
  285. fractionSig <- tapply(resLFC1$pvalue, bins, function(p){mean(p < .05, na.rm = TRUE)})
  286. pdf(file = './res/proportion_of_small_p_value.pdf',width = 8,height = 6)
  287. barplot(fractionSig, xlab = "mean normalized count",
  288. ylab = "fraction of small p values")
  289. dev.off()
  290.  
  291. #---------------------------------------------------#
  292. #---------------------------------------------------#
  293. #4.Annotating and exporting results
  294. #rename gene
  295. columns(org.Hs.eg.db)
  296. ens.str <- substr(rownames(res), 1, 15)
  297. res$symbol <- mapIds(org.Hs.eg.db,
  298. keys=ens.str,
  299. column="SYMBOL",
  300. keytype="ENSEMBL",
  301. multiVals="first")
  302. res$entrez <- mapIds(org.Hs.eg.db,
  303. keys=ens.str,
  304. column="ENTREZID",
  305. keytype="ENSEMBL",
  306. multiVals="first")
  307. resOrdered <- res[order(res$pvalue),]
  308. head(resOrdered)
  309.  
  310. #4.1 Plotting fold changes in genomic space
  311. resGR <- lfcShrink(dds, coef="dex_trt_vs_untrt", type="apeglm", format="GRanges")
  312. resGR
  313. ens.str <- substr(names(resGR), 1, 15)
  314. resGR$symbol <- mapIds(org.Hs.eg.db, ens.str, "SYMBOL", "ENSEMBL")
  315. # specify the window and filter the symbol
  316. window <- resGR[topGene] 1e6
  317. strand(window) <- "*"
  318. resGRsub <- resGR[resGR %over% window]
  319. naOrDup <- is.na(resGRsub$symbol) | duplicated(resGRsub$symbol)
  320. resGRsub$group <- ifelse(naOrDup, names(resGRsub), resGRsub$symbol)
  321. status <- factor(ifelse(resGRsub$padj < 0.05 & !is.na(resGRsub$padj),"sig", "notsig"))
  322. # ploting the result using Gviz
  323. options(ucscChromosomeNames = FALSE)
  324. g <- GenomeAxisTrack()
  325. a <- AnnotationTrack(resGRsub, name = "gene ranges", feature = status)
  326. d <- DataTrack(resGRsub, data = "log2FoldChange", baseline = 0, type = "h", name = "log2 fold change", strand = " ")
  327. pdf(file = './res/plot_lfc_in_genomic_space.pdf',width = 8,height = 6)
  328. plotTracks(list(g, d, a), groupAnnotation = "group", notsig = "grey", sig = "hotpink")
  329. dev.off()
  330.  
  331. #---------------------------------------------------#
  332. #---------------------------------------------------#
  333. #5.remove hiddden batch effects
  334. #5.1 use SVA with DESeq2
  335. # fillter the gene has normalized counts > 1
  336. dat <- counts(dds, normalized = TRUE)
  337. idx <- rowMeans(dat) > 1
  338. dat <- dat[idx, ]
  339. # build two model, one account for dex and one for null hypothesis
  340. mod <- model.matrix(~ dex, colData(dds))
  341. mod0 <- model.matrix(~ 1, colData(dds))
  342. # set two surrogate variables
  343. svseq <- svaseq(dat, mod, mod0, n.sv = 2)
  344. svseq$sv
  345. # show the correlation of the sv with the cellline
  346. pdf(file = './res/ralationship_between_sv_and_celline.pdf',width = 8,height = 6)
  347. par(mfrow = c(2, 1), mar = c(3,5,3,1))
  348. for (i in 1:2) {
  349. stripchart(svseq$sv[, i] ~ dds$cell, vertical = TRUE, main = paste0("SV", i))
  350. abline(h = 0)
  351. }
  352. dev.off()
  353. # use sva to remove the batch effect
  354. ddssva <- dds
  355. ddssva$SV1 <- svseq$sv[,1]
  356. ddssva$SV2 <- svseq$sv[,2]
  357. design(ddssva) <- ~ SV1 SV2 dex
  358.  
  359. #5.2 Using RUV with DESeq2
  360. set <- newSeqExpressionSet(counts(dds))
  361. idx <- rowSums(counts(set) > 5) >= 2
  362. set <- set[idx, ]
  363. set <- betweenLaneNormalization(set, which="upper")
  364. res <- results(DESeq(DESeqDataSet(gse, design = ~ dex)))
  365. not.sig <- rownames(res)[which(res$pvalue > .1)]
  366. empirical <- rownames(set)[ rownames(set) %in% not.sig ]
  367. set <- RUVg(set, empirical, k=2)
  368. pData(set)
  369.  
  370. pdf(file = './res/ralationship_between_unwanted_variation_and_celline.pdf',width = 8,height = 6)
  371. par(mfrow = c(2, 1), mar = c(3,5,3,1))
  372. for (i in 1:2) {
  373. stripchart(pData(set)[, i] ~ dds$cell, vertical = TRUE, main = paste0("W", i))
  374. abline(h = 0)
  375. }
  376. dev.off()
  377.  
  378. # use unwanted factor with DESeq2
  379. ddsruv <- dds
  380. ddsruv$W1 <- set$W_1
  381. ddsruv$W2 <- set$W_2
  382. design(ddsruv) <- ~ W1 W2 dex
  383.  
  384. #---------------------------------------------------#
  385. #---------------------------------------------------#
  386. #6.Time course experiments
  387. # use a yeast fission data, control the effect of strain and time, investigate the strain-specific difference over time
  388. data("fission")
  389. ddsTC <- DESeqDataSet(fission, ~ strain minute strain:minute)
  390.  
  391. ddsTC <- DESeq(ddsTC, test="LRT", reduced = ~ strain minute)
  392. resTC <- results(ddsTC)
  393. resTC$symbol <- mcols(ddsTC)$symbol
  394. head(resTC[order(resTC$padj),], 4)
  395.  
  396. fiss <- plotCounts(ddsTC, which.min(resTC$padj), intgroup = c("minute","strain"), returnData = TRUE)
  397. fiss$minute <- as.numeric(as.character(fiss$minute))
  398. pdf(file = './res/show_strain_independent_fiss_gene_change_through_time.pdf',width = 8,height = 4)
  399. ggplot(fiss, aes(x = minute, y = count, color = strain, group = strain))
  400. geom_point()
  401. stat_summary(fun.y=mean, geom="line")
  402. scale_y_log10()
  403. dev.off()
  404. # Wald test
  405. resultsNames(ddsTC)
  406. res30 <- results(ddsTC, name="strainmut.minute30", test="Wald")
  407. res30[which.min(resTC$padj),]
  408. # Extract a matrix of model coefficients/standard errors
  409. betas <- coef(ddsTC)
  410. colnames(betas)
  411. # cluster significant genes by their profiles
  412. topGenes <- head(order(resTC$padj),20)
  413. mat <- betas[topGenes, -c(1,2)]
  414. thr <- 3
  415. mat[mat < -thr] <- -thr
  416. mat[mat > thr] <- thr
  417. pdf(file = './res/cluster_gene_by_coef_of_different_conditions.pdf',width = 8,height = 6)
  418. pheatmap(mat, breaks=seq(from=-thr, to=thr, length=101), cluster_col=FALSE)
  419. dev.off()

发表评论

匿名网友

拖动滑块以完成验证
加载失败