R/BioC序列处理之四:BSgenome简介

一、BSgenome和BSgenome数据包

Bioconductor提供了某些物种的全基因组序列数据包,这些数据包是基于Biostrings构建的,称为BSgenome数据包。不同物种的BSgenome数据包都有类似的数据结构,可以用统一的方式进行处理。但是BSgenome数据包仅包含有数据,它们的处理的方法由另外一个软件包提供,即BSgenome包。

先安装BSgenome包(如果没有安装):

  1. library(BiocInstaller)
  2. biocLite("BSgenome")

载入BSgenome包,并查看当前版本提供的BSgenome数据包:

  1. library(BSgenome)
  2. (ag <- available.genomes())
  3. ## [1] "BSgenome.Alyrata.JGI.v1"
  4. ## [2] "BSgenome.Amellifera.BeeBase.assembly4"
  5. ## [3] "BSgenome.Amellifera.UCSC.apiMel2"
  6. ## [4] "BSgenome.Athaliana.TAIR.04232008"
  7. ## [5] "BSgenome.Athaliana.TAIR.TAIR9"
  8. ## [6] "BSgenome.Btaurus.UCSC.bosTau3"
  9. ## [7] "BSgenome.Btaurus.UCSC.bosTau4"
  10. ## [8] "BSgenome.Btaurus.UCSC.bosTau6"
  11. ## [9] "BSgenome.Celegans.UCSC.ce10"
  12. ## [10] "BSgenome.Celegans.UCSC.ce2"
  13. ## [11] "BSgenome.Celegans.UCSC.ce6"
  14. ## [12] "BSgenome.Cfamiliaris.UCSC.canFam2"
  15. ## [13] "BSgenome.Cfamiliaris.UCSC.canFam3"
  16. ## [14] "BSgenome.Dmelanogaster.UCSC.dm2"
  17. ## [15] "BSgenome.Dmelanogaster.UCSC.dm3"
  18. ## [16] "BSgenome.Drerio.UCSC.danRer5"
  19. ## [17] "BSgenome.Drerio.UCSC.danRer6"
  20. ## [18] "BSgenome.Drerio.UCSC.danRer7"
  21. ## [19] "BSgenome.Ecoli.NCBI.20080805"
  22. ## [20] "BSgenome.Gaculeatus.UCSC.gasAcu1"
  23. ## [21] "BSgenome.Ggallus.UCSC.galGal3"
  24. ## [22] "BSgenome.Ggallus.UCSC.galGal4"
  25. ## [23] "BSgenome.Hsapiens.UCSC.hg17"
  26. ## [24] "BSgenome.Hsapiens.UCSC.hg18"
  27. ## [25] "BSgenome.Hsapiens.UCSC.hg19"
  28. ## [26] "BSgenome.Mmulatta.UCSC.rheMac2"
  29. ## [27] "BSgenome.Mmusculus.UCSC.mm10"
  30. ## [28] "BSgenome.Mmusculus.UCSC.mm8"
  31. ## [29] "BSgenome.Mmusculus.UCSC.mm9"
  32. ## [30] "BSgenome.Ptroglodytes.UCSC.panTro2"
  33. ## [31] "BSgenome.Ptroglodytes.UCSC.panTro3"
  34. ## [32] "BSgenome.Rnorvegicus.UCSC.rn4"
  35. ## [33] "BSgenome.Rnorvegicus.UCSC.rn5"
  36. ## [34] "BSgenome.Scerevisiae.UCSC.sacCer1"
  37. ## [35] "BSgenome.Scerevisiae.UCSC.sacCer2"
  38. ## [36] "BSgenome.Scerevisiae.UCSC.sacCer3"
  39. ## [37] "BSgenome.Tgondii.ToxoDB.7.0"
  40. # 当前版本提供了18个物种的37个基因组包
  41. unique(gsub("BSgenome\\.([^\\.]+).*", "\\1", ag))
  42. ## [1] "Alyrata" "Amellifera" "Athaliana" "Btaurus"
  43. ## [5] "Celegans" "Cfamiliaris" "Dmelanogaster" "Drerio"
  44. ## [9] "Ecoli" "Gaculeatus" "Ggallus" "Hsapiens"
  45. ## [13] "Mmulatta" "Mmusculus" "Ptroglodytes" "Rnorvegicus"
  46. ## [17] "Scerevisiae" "Tgondii"

获取拟南芥的BSgenome数据包(BSgenome.Athaliana.TAIR.TAIR9):

  1. biocLite("BSgenome.Athaliana.TAIR.TAIR9")

载入BSgenome.Athaliana.TAIR.TAIR9包,查看数据信息:

  1. library(BSgenome.Athaliana.TAIR.TAIR9)
  2. # BSgenome.Athaliana.TAIR.TAIR9只定义了一个对象
  3. ls("package:BSgenome.Athaliana.TAIR.TAIR9")
  4. ## [1] "Athaliana"
  5. # 查看总体信息
  6. Athaliana
  7. ## Arabidopsis genome
  8. ## |
  9. ## | organism: Arabidopsis thaliana (Arabidopsis)
  10. ## | provider: TAIR
  11. ## | provider version: TAIR9
  12. ## | release date: June 9, 2009
  13. ## | release name: TAIR9 Genome Release
  14. ## |
  15. ## | sequences (see '?seqnames'):
  16. ## | Chr1 Chr2 Chr3 Chr4 Chr5 ChrM ChrC
  17. ## |
  18. ## | (use the '$' or '[[' operator to access a given sequence)
  19. # 不载入序列就可以获取染色体名称和长度信息
  20. seqnames(Athaliana)
  21. ## [1] "Chr1" "Chr2" "Chr3" "Chr4" "Chr5" "ChrM" "ChrC"
  22. seqlengths(Athaliana)
  23. ## Chr1 Chr2 Chr3 Chr4 Chr5 ChrM ChrC
  24. ## 30427671 19698289 23459830 18585056 26975502 366924 154478

二、BSgenome方法

BSgenome数据包内包含的序列为DNAString,DNAStringSet或MaskedDNAString对象,完全可以使用Biostrings包的方法进行处理:

  1. # 可以使用多种方式获取某一序列
  2. Athaliana$Chr1
  3. ## 30427671-letter "DNAString" instance
  4. ## seq: CCCTAAACCCTAAACCCTAAACCCTAAACCTCTG...TAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGG
  5. Athaliana[[1]]
  6. ## 30427671-letter "DNAString" instance
  7. ## seq: CCCTAAACCCTAAACCCTAAACCCTAAACCTCTG...TAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGG
  8. Athaliana[["Chr1"]]
  9. ## 30427671-letter "DNAString" instance
  10. ## seq: CCCTAAACCCTAAACCCTAAACCCTAAACCTCTG...TAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGG

下面使用sapply函数统计碱基组成和GC含量。如果不熟悉sapply函数,请参考这篇文章。

  1. # 统计碱基组成
  2. sapply(seqnames(Athaliana), function(x) alphabetFrequency(Athaliana[[x]]))
  3. ## Chr1 Chr2 Chr3 Chr4 Chr5 ChrM ChrC
  4. ## A 9709674 6315641 7484757 5940546 8621974 102464 48546
  5. ## C 5435374 3542973 4258333 3371349 4832253 82661 28496
  6. ## G 5421151 3520766 4262704 3356091 4858759 81609 27570
  7. ## T 9697113 6316348 7448059 5914038 8652238 100190 49866
  8. ## M 76 5 2 1 0 0 0
  9. ## R 36 7 4 0 0 0 0
  10. ## W 124 18 2 0 0 0 0
  11. ## S 30 3 1 0 0 0 0
  12. ## Y 82 12 2 0 0 0 0
  13. ## K 53 10 0 0 0 0 0
  14. ## V 0 0 0 0 0 0 0
  15. ## H 0 0 0 0 0 0 0
  16. ## D 0 0 0 1 0 0 0
  17. ## B 0 0 0 0 0 0 0
  18. ## N 163958 2506 5966 3030 10278 0 0
  19. ## - 0 0 0 0 0 0 0
  20. ## + 0 0 0 0 0 0 0
  21. # 计算GC含量
  22. sapply(seqnames(Athaliana), function(x) letterFrequency(Athaliana[[x]], letters = "GC",
  23. as.prob = TRUE))
  24. ## Chr1.G|C Chr2.G|C Chr3.G|C Chr4.G|C Chr5.G|C ChrM.G|C ChrC.G|C
  25. ## 0.3568 0.3586 0.3632 0.3620 0.3593 0.4477 0.3629

但是BSgenome包试图提供一些简便的方法来处理基因组水平的BSgenome类数据,例如类似于apply函数的bsapply函数。要使用bsapply函数得先构建BSParams类对象,用于设置以下参数:

  • X:将要处理的BSgenome对象
  • FUN:将要对X中每条染色体进行处理函数
  • exclude:字符向量,表示排除的染色体名称
  • simplify:逻辑向量,它的意义和与sapply的simplify参数一样。默认为FALSE,bsapply返回GenomeData类数据;如果设置为TRUE,函数的结果尽量使用表格(数据框)类型显示
  • maskList:逻辑向量,表示各染色体使用掩膜的应用状态
  • motifList:字符向量,对序列进行掩膜的motif
  • userMask:RangesList对象,每个元素将用于对响应染色体进行掩膜处理,为用户提供额外的掩膜。
  • invertUserMask:TRUE/FALSE,表示是否对userMask进行反转。

FUN函数的其他参数可以在bsapply函数中以有名参数的方式进行设置。 下面代码的作用和前面sapply函数的应用是相同的:

  1. op <- options()
  2. options(digits = 4)
  3. params <- new("BSParams", X = Athaliana, FUN = letterFrequency, simplify = TRUE)
  4. bsapply(params, letters = "GC", as.prob = TRUE)
  5. ## Chr1.G|C Chr2.G|C Chr3.G|C Chr4.G|C Chr5.G|C ChrM.G|C ChrC.G|C
  6. ## 0.3568 0.3586 0.3632 0.3620 0.3593 0.4477 0.3629
  7. BSParamsS4类,对象数据可以修改,方便重复使用:
  8. params@motifList <- "N"
  9. bsapply(params, letters = "GC", as.prob = TRUE)
  10. ## Chr1.G|C Chr2.G|C Chr3.G|C Chr4.G|C Chr5.G|C ChrM.G|C ChrC.G|C
  11. ## 0.3587 0.3586 0.3633 0.3620 0.3594 0.4477 0.3629
  12. options(op)
  13. params@FUN <- alphabetFrequency
  14. bsapply(params, baseOnly = TRUE)
  15. ## Chr1 Chr2 Chr3 Chr4 Chr5 ChrM ChrC
  16. ## A 9709674 6315641 7484757 5940546 8621974 102464 48546
  17. ## C 5435374 3542973 4258333 3371349 4832253 82661 28496
  18. ## G 5421151 3520766 4262704 3356091 4858759 81609 27570
  19. ## T 9697113 6316348 7448059 5914038 8652238 100190 49866
  20. ## other 401 55 11 2 0 0 0

BSgenome包实现了BSgenome类对象的matchPWM,countPWM,vmatchPattern,vcountPattern,vmatchPDict和vcountPDict等操作,除继承了Biostrings中相应的方法外还增加了一些针对BSgenome类对象的参数设置(和bsapply函数参数类似):

  1. vmatchPattern("ACGTGKC", Athaliana, fix = "subject", exclude = c("ChrC", "ChrM"))
  2. ## GRanges with 13745 ranges and 0 metadata columns:
  3. ## seqnames ranges strand
  4. ##
  5. ## [1] Chr1 [ 1540, 1546] +
  6. ## [2] Chr1 [ 8276, 8282] +
  7. ## [3] Chr1 [12017, 12023] +
  8. ## [4] Chr1 [14697, 14703] +
  9. ## [5] Chr1 [34225, 34231] +
  10. ## ... ... ... ...
  11. ## [13741] Chr5 [26911361, 26911367] -
  12. ## [13742] Chr5 [26915247, 26915253] -
  13. ## [13743] Chr5 [26917155, 26917161] -
  14. ## [13744] Chr5 [26923565, 26923571] -
  15. ## [13745] Chr5 [26933617, 26933623] -
  16. ## ---
  17. ## seqlengths:
  18. ## Chr1 Chr2 Chr3 Chr4 Chr5 ChrM ChrC
  19. ## 30427671 19698289 23459830 18585056 26975502 366924 154478

BSgenome包还提供了其他一些方法,比如用户自行构建BSgenome数据包和SNP的处理等。这些不在我的关心范围之内,没去了解。

原文来自:http://blog.csdn.net/u014801157/article/details/24372467

发表评论

匿名网友

拖动滑块以完成验证
加载失败