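R/BioC Sequence Processing, Part 5: Rle and Ranges

1 Rle (Run-Length Encoding)

1.1 The Rle class and Rle objects

Sequences and genes ultimately have to be mapped to chromosomes. There are usually a great many sequences but only a few chromosomes, so annotating every sequence explicitly with its chromosome produces a large amount of repeated information and, worse, consumes a lot of memory. The IRanges package in BioC offers a simple and practical way to compress such data: Rle. Suppose chromosomes 1 to 3 carry 3000, 5000 and 2000 genes respectively. The chromosome annotation of the genes can then be stored either as a character vector or as an Rle object: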

  library(IRanges)  # optional: loading Biostrings also loads its dependency IRanges
  library(Biostrings)
  chr.str <- c(rep("ChrI", 3000), rep("ChrII", 5000), rep("ChrIII", 2000))
  chr.rle <- Rle(chr.str)

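The two representations carry exactly the same information, but the Rle object takes less than 2% of the space of the character vector:

  # Expanding the Rle object gives back exactly the original vector
  identical(as.vector(chr.rle), chr.str)
  ## [1] TRUE
  # Ratio of object sizes (memory footprint):
  as.vector(object.size(chr.rle)/object.size(chr.str))
  ## [1] 0.01795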

Rle does not always "compress" the data. When values are not repeated, or are repeated only rarely, the Rle object takes more memory than the plain vector:

  strx <- sample(DNA_BASES, 10000, replace = TRUE)
  strx.rle <- Rle(strx)
  as.vector(object.size(strx.rle)/object.size(strx))
  ## [1] 1.518
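Whether run-length encoding pays off depends on the number of runs rather than on the length of the vector. The sketch below uses base R's rle() (not the Bioconductor Rle class) to count runs for a highly repetitive vector and for a random one; the variable names are made up for illustration.

```r
# Base-R sketch: the number of runs indicates how well a vector compresses.
# Uses base::rle(), not the Bioconductor Rle class.
chr <- c(rep("ChrI", 3000), rep("ChrII", 5000), rep("ChrIII", 2000))
n_runs_chr <- length(rle(chr)$lengths)      # 3 runs for 10000 elements

set.seed(1)
bases <- sample(c("A", "C", "G", "T"), 10000, replace = TRUE)
# With four equally likely letters, about 3 in 4 positions start a new run.
n_runs_bases <- length(rle(bases)$lengths)

c(chr = n_runs_chr, bases = n_runs_bases)
```

With nearly as many runs as elements, storing a value and a length per run costs more than storing the elements themselves.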

An Rle object represents the original vector with two slots: values, which can be a vector or a factor, and lengths, an integer vector giving the number of times the value at the corresponding position is repeated.

  chr.rle
  ## character-Rle of length 10000 with 3 runs
  ## Lengths: 3000 5000 2000
  ## Values : "ChrI" "ChrII" "ChrIII"
  getClass(class(chr.rle))
  ## Class "Rle" [package "IRanges"]
  ##
  ## Slots:
  ##
  ## Name: values lengths elementMetadata metadata
  ## Class: vectorORfactor integer DataTableORNULL list
  ##
  ## Extends:
  ## Class "Vector", directly
  ## Class "Annotated", by class "Vector", distance 2
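The values/lengths representation can be tried out in base R, whose rle() and inverse.rle() functions implement the same idea on plain vectors. This is only a sketch of the encoding, not of the Rle class itself.

```r
# Base-R sketch of the values/lengths representation (base::rle, not IRanges::Rle).
x <- c("ChrI", "ChrI", "ChrI", "ChrII", "ChrII", "ChrIII")
enc <- rle(x)
enc$values     # one entry per run
enc$lengths    # how often each value is repeated

# Decoding repeats each value by its run length and restores the original vector.
dec <- inverse.rle(enc)
identical(dec, x)
```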

1.2 Working with Rle objects

1.2.1 Constructing Rle objects

Rle objects are created with the constructor Rle, which can be called in two ways:

  Rle(values)
  Rle(values, lengths)

Both values and lengths are (atomic) vectors. The first form was used above; here is the second:

  chr.rle <- Rle(values = c("Chr1", "Chr2", "Chr3", "Chr1", "Chr3"),
                 lengths = c(3, 2, 5, 4, 5))
  chr.rle
  ## character-Rle of length 19 with 5 runs
  ## Lengths: 3 2 5 4 5
  ## Values : "Chr1" "Chr2" "Chr3" "Chr1" "Chr3"

An Rle object can also be created from an atomic vector with the coercion function as, which is equivalent to the first form above:

  as(chr.str, "Rle")
  ## character-Rle of length 10000 with 3 runs
  ## Lengths: 3000 5000 2000
  ## Values : "ChrI" "ChrII" "ChrIII"

1.2.2 Accessing slots

Rle is an S4 class. The components of an Rle object, such as its values and lengths, are read with accessor functions:

  runLength(chr.rle)
  ## [1] 3 2 5 4 5
  runValue(chr.rle)
  ## [1] "Chr1" "Chr2" "Chr3" "Chr1" "Chr3"
  nrun(chr.rle)
  ## [1] 5
  start(chr.rle)
  ## [1] 1 4 6 11 15
  end(chr.rle)
  ## [1] 3 5 10 14 19
  width(chr.rle)
  ## [1] 3 2 5 4 5
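The start and end positions reported above follow directly from the run lengths. A minimal base-R sketch of that arithmetic (the variable names are illustrative, and this is not a claim about how the accessors are implemented):

```r
# Run lengths taken from the chr.rle example above.
run_lengths <- c(3L, 2L, 5L, 4L, 5L)

# Each run ends at the cumulative sum of the lengths so far ...
run_end <- cumsum(run_lengths)             # 3 5 10 14 19
# ... and starts run_length - 1 positions before its end.
run_start <- run_end - run_lengths + 1L    # 1 4 6 11 15

data.frame(start = run_start, end = run_end, width = run_lengths)
```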

1.2.3 Replacing lengths and values

The lengths and values of an Rle object can also be modified with the corresponding replacement functions:

  runLength(chr.rle) <- rep(3, nrun(chr.rle))
  chr.rle
  ## character-Rle of length 15 with 5 runs
  ## Lengths: 3 3 3 3 3
  ## Values : "Chr1" "Chr2" "Chr3" "Chr1" "Chr3"
  runValue(chr.rle)[3:4] <- c("III", "IV")
  chr.rle
  ## character-Rle of length 15 with 5 runs
  ## Lengths: 3 3 3 3 3
  ## Values : "Chr1" "Chr2" "III" "IV" "Chr3"
  # The replacement vector must have the same length as the vector it replaces,
  # otherwise an error is raised. Both statements below fail:
  runValue(chr.rle) <- c("ChrI", "ChrV")
  ## Error: 'length(lengths)' != 'length(values)'
  runLength(chr.rle) <- 3
  ## Error: 'length(lengths)' != 'length(values)'

1.2.4 Type conversion

Besides as.vector, Rle objects can be converted with many other functions, for example:

  as.factor(chr.rle)
  ## [1] Chr1 Chr1 Chr1 Chr2 Chr2 Chr2 III III III IV IV IV Chr3 Chr3
  ## [15] Chr3
  ## Levels: Chr1 Chr2 Chr3 III IV
  as.character(chr.rle)
  ## [1] "Chr1" "Chr1" "Chr1" "Chr2" "Chr2" "Chr2" "III" "III" "III" "IV"
  ## [11] "IV" "IV" "Chr3" "Chr3" "Chr3"

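1.2.5 S4 group generic operations on Rle

Rle is a basic data type defined by BioC. Being "basic", it should support the usual operations on R data, such as addition, subtraction, multiplication, division and modulo. That is indeed the case: Rle supports R's S4 group generic functions, which cover arithmetic, complex-number, comparison, logic and math operations as well as R's summary functions ("max", "min", "range", "prod", "sum", "any", "all" and so on). I have not verified that every one of these operations is implemented. Only a few simple examples are given below (note that operands of different lengths are recycled, as with ordinary vectors); see the Rle-class documentation for details: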

  set.seed(0)
  rle1 <- Rle(sample(4, 6, replace = TRUE))
  rle2 <- Rle(sample(5, 12, replace = TRUE))
  rle3 <- Rle(sample(4, 8, replace = TRUE))
  rle1 + rle2
  ## integer-Rle of length 12 with 11 runs
  ## Lengths: 1 1 1 1 1 1 1 1 1 2 1
  ## Values : 9 7 6 7 5 3 5 6 4 7 5
  rle1 + rle3
  ## integer-Rle of length 8 with 8 runs
  ## Lengths: 1 1 1 1 1 1 1 1
  ## Values : 8 4 6 7 5 4 5 4
  rle1 * rle2
  ## integer-Rle of length 12 with 11 runs
  ## Lengths: 1 1 1 1 1 1 1 1 1 2 1
  ## Values : 20 10 8 12 4 2 4 8 4 12 4
  sqrt(rle1)
  ## numeric-Rle of length 6 with 5 runs
  ## Lengths: 1 2 ... 1
  ## Values : 2 1.4142135623731 ... 1
  range(rle1)
  ## [1] 1 4
  cumsum(rle1)
  ## integer-Rle of length 6 with 6 runs
  ## Lengths: 1 1 1 1 1 1
  ## Values : 4 6 8 11 15 16
  (rle1 <- Rle(sample(DNA_BASES, 10, replace = TRUE)))
  ## character-Rle of length 10 with 9 runs
  ## Lengths: 1 1 1 1 2 1 1 1 1
  ## Values : "C" "A" "C" "T" "C" "G" "C" "A" "T"
  (rle2 <- Rle(sample(DNA_BASES, 8, replace = TRUE)))
  ## character-Rle of length 8 with 8 runs
  ## Lengths: 1 1 1 1 1 1 1 1
  ## Values : "G" "T" "A" "G" "C" "T" "G" "T"
  paste(rle1, rle2, sep = "")
  ## character-Rle of length 10 with 10 runs
  ## Lengths: 1 1 1 1 1 1 1 1 1 1
  ## Values : "CG" "AT" "CA" "TG" "CC" "CT" "GG" "CT" "AG" "TT"
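An element-wise function such as sqrt only needs to be evaluated once per run, because every element of a run has the same value. The base-R sketch below illustrates this with the six numbers behind the first rle1 above (4 2 2 3 4 1, as implied by the cumsum output). Whether Rle works exactly this way internally is an assumption.

```r
# Base-R sketch: apply sqrt() to the run values only, then decode.
x <- c(4, 2, 2, 3, 4, 1)
enc <- rle(x)                      # 5 runs, because the two 2s are adjacent

enc_sqrt <- enc
enc_sqrt$values <- sqrt(enc$values)   # 5 evaluations instead of 6
via_runs <- inverse.rle(enc_sqrt)

# Same result as applying sqrt() to every element of the expanded vector.
all.equal(via_runs, sqrt(x))
```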

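2 Ranges (sequence intervals)

2.1 Ranges in BioC

Ranges are a special but very common data type. They describe where short stretches of sequence lie within a longer sequence, together with information such as names and organisation. The BioC packages mainly involved in defining Ranges are IRanges, GenomicRanges and GenomicFeatures.

The IRanges package defines the general data structures and methods for Ranges but is not aimed at sequence processing directly. The GRanges and GRangesList classes defined in GenomicRanges store the sequence names, the DNA strand and similar information in addition to the ranges themselves. GenomicFeatures handles GRanges information supplied in database form, such as genes, exons, introns, promoters and UTRs.

Let us first look at the most basic Ranges class definition in BioC: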

  getClass("Ranges")
  ## Virtual Class "Ranges" [package "IRanges"]
  ##
  ## Slots:
  ##
  ## Name: elementType elementMetadata metadata
  ## Class: character DataTableORNULL list
  ##
  ## Extends:
  ## Class "IntegerList", directly
  ## Class "RangesORmissing", directly
  ## Class "AtomicList", by class "IntegerList", distance 2
  ## Class "List", by class "IntegerList", distance 3
  ## Class "Vector", by class "IntegerList", distance 4
  ## Class "Annotated", by class "IntegerList", distance 5
  ##
  ## Known Subclasses:
  ## Class "IRanges", directly
  ## Class "Partitioning", directly
  ## Class "GappedRanges", directly
  ## Class "IntervalTree", directly
  ## Class "NormalIRanges", by class "IRanges", distance 2
  ## Class "PartitioningByEnd", by class "Partitioning", distance 2
  ## Class "PartitioningByWidth", by class "Partitioning", distance 2

Ranges is a virtual class. The subclass used most often in practice is IRanges, which inherits the data structure of Ranges and adds three slots that store the start, width and names of the ranges. Because the ranges are defined by integers, the class is called IRanges (Integer Ranges), although some people read the name as Interval Ranges:

  getSlots("Ranges")
  ## elementType elementMetadata metadata
  ## "character" "DataTableORNULL" "list"
  getSlots("IRanges")
  ## start width NAMES elementType
  ## "integer" "integer" "characterORNULL" "character"
  ## elementMetadata metadata
  ## "DataTableORNULL" "list"

GRanges applies the concept of Ranges to sequence processing, but it has no inheritance relationship with IRanges:

  library(GenomicRanges)
  getSlots("GRanges")
  ## seqnames ranges strand elementMetadata
  ## "Rle" "IRanges" "Rle" "DataFrame"
  ## seqinfo metadata
  ## "Seqinfo" "list"

Ranges are very important for sequence processing. Besides GenomicRanges, several class definitions in Biostrings also make use of Ranges:

  getSlots("XStringViews")
  ## subject ranges elementType elementMetadata
  ## "XString" "IRanges" "character" "DataTableORNULL"
  ## metadata
  ## "list"

2.2 Constructing objects and accessing slots

An IRanges object is created with the constructor IRanges, given any two of the three arguments start, end and width:

  ir1 <- IRanges(start = 1:10, width = 10:1)
  ir2 <- IRanges(start = 1:10, end = 11)
  ir3 <- IRanges(end = 11, width = 10:1)
  ir1
  ## IRanges of length 10
  ## start end width
  ## [1] 1 10 10
  ## [2] 2 10 9
  ## [3] 3 10 8
  ## [4] 4 10 7
  ## [5] 5 10 6
  ## [6] 6 10 5
  ## [7] 7 10 4
  ## [8] 8 10 3
  ## [9] 9 10 2
  ## [10] 10 10 1
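Any two of the three arguments are enough because they are tied together by end = start + width - 1. The base-R sketch below fills in the missing one; complete_range is a hypothetical helper written for this illustration, not part of IRanges.

```r
# Hypothetical helper: derive whichever of start/end/width was not supplied,
# using the relation end = start + width - 1.
complete_range <- function(start = NULL, end = NULL, width = NULL) {
  given <- c(!is.null(start), !is.null(end), !is.null(width))
  stopifnot(sum(given) == 2)   # this sketch insists on exactly two arguments
  if (is.null(width)) width <- end - start + 1L
  if (is.null(end))   end   <- start + width - 1L
  if (is.null(start)) start <- end - width + 1L
  data.frame(start = start, end = end, width = width)
}

r1 <- complete_range(start = 1:10, width = 10:1)   # like ir1: every end is 10
r2 <- complete_range(start = 1:10, end = 11L)      # like ir2: widths 11 down to 2
r3 <- complete_range(end = 11L, width = 10:1)      # like ir3: starts 2 up to 11
r1
```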

GRanges objects are also created with a constructor, in a way that somewhat resembles building a data frame:

  genes <- GRanges(seqnames = c("Chr1", "Chr3", "Chr3"),
                   ranges = IRanges(start = c(1300, 1050, 2000), end = c(2500, 1870, 3200)),
                   strand = c("+", "+", "-"),
                   seqlengths = c(Chr1 = 1e+05, Chr3 = 2e+05))
  genes
  ## GRanges with 3 ranges and 0 metadata columns:
  ## seqnames ranges strand
  ##
  ## [1] Chr1 [1300, 2500] +
  ## [2] Chr3 [1050, 1870] +
  ## [3] Chr3 [2000, 3200] -
  ## ---
  ## seqlengths:
  ## Chr1 Chr3
  ## 100000 200000

IRanges and GRanges are both S4 classes, and their slots are read with the corresponding accessor functions:

  start(ir1)
  ## [1] 1 2 3 4 5 6 7 8 9 10
  end(ir1)
  ## [1] 10 10 10 10 10 10 10 10 10 10
  width(ir1)
  ## [1] 10 9 8 7 6 5 4 3 2 1
  ranges(genes)
  ## IRanges of length 3
  ## start end width
  ## [1] 1300 2500 1201
  ## [2] 1050 1870 821
  ## [3] 2000 3200 1201
  start(ranges(genes))
  ## [1] 1300 1050 2000

Views objects also contain an IRanges slot:

  # Generate a random sequence of length n from the letters in dict
  rndSeq <- function(dict, n) {
    paste(sample(dict, n, replace = TRUE), collapse = "")
  }
  set.seed(0)
  dna <- DNAString(rndSeq(DNA_BASES, 1000))
  vws <- as(maskMotif(dna, "TGA"), "Views")
  (ir <- ranges(vws))
  ## IRanges of length 18
  ## start end width
  ## [1] 1 104 104
  ## [2] 108 264 157
  ## [3] 268 268 1
  ## [4] 272 300 29
  ## [5] 304 393 90
  ## ... ... ... ...
  ## [14] 586 752 167
  ## [15] 756 851 96
  ## [16] 855 912 58
  ## [17] 916 989 74
  ## [18] 993 1000 8

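Pattern-matching functions return range-based objects as well: functions of the match family (such as matchPattern) return Views whose ranges are IRanges, while functions of the vmatch family return GRanges objects when they are applied to a BSgenome genome.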

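2.3 Operations and methods for IRanges objects

2.3.1 Intra-range transformations

Functions of this kind include shift, flank, narrow, reflect, resize, restrict and promoters. Each of them processes every range independently. To make the results easier to follow, we use a very handy IRanges plotting function taken from the IRanges vignette (slightly modified):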

  plotRanges <- function(x, xlim = x, main = deparse(substitute(x)), col = "black",
                         add = FALSE, ybottom = NULL, ...) {
    require(scales)
    col <- alpha(col, 0.5)
    height <- 1
    sep <- 0.5
    if (is(xlim, "Ranges")) {
      xlim <- c(min(start(xlim)), max(end(xlim)) * 1.2)
    }
    if (!add) {
      bins <- disjointBins(IRanges(start(x), end(x) + 1))
      ybottom <- bins * (sep + height) - height
      par(mar = c(3, 0.5, 2.5, 0.5), mgp = c(1.5, 0.5, 0))
      plot.new()
      plot.window(xlim, c(0, max(bins) * (height + sep)))
    }
    rect(start(x) - 0.5, ybottom, end(x) + 0.5, ybottom + height, col = col, ...)
    text((start(x) + end(x))/2, ybottom + height/2, 1:length(x), col = "white", xpd = TRUE)
    title(main)
    axis(1)
    invisible(ybottom)
  }

shift translates the ranges (in the figures below the original ranges are blue, the transformed ranges red, reference ranges black/grey, and other colours mark overlapping regions):

  ir <- IRanges(c(3000, 2500), width = c(300, 850))
  ir.trans <- shift(ir, 500)
  # x-axis limit covering both the original and the shifted ranges
  xlim <- c(0, max(end(ir), end(ir.trans)) * 1.3)
  ybottom <- plotRanges(ir, xlim = xlim, main = "shift", col = "blue")
  plotRanges(ir.trans, add = TRUE, ybottom = ybottom, main = "", col = "red")

[Figure 1: shift]

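flank returns the region adjacent to each range. With the default start = TRUE, a positive width gives the region immediately to the left of the start, while a negative width gives a region of that size inside the range, beginning at its start. Set start = FALSE to flank the end (the right-hand side) instead:

  ir.trans <- flank(ir, width = 200)
  xlim <- c(0, max(end(ir), end(ir.trans)) * 1.3)
  ybottom <- plotRanges(ir, xlim = xlim, main = "flank", col = "blue")
  plotRanges(ir.trans, add = TRUE, ybottom = ybottom, main = "", col = "red")

[Figure 2: flank]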
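The effect of the sign of width and of the start argument is easiest to see on plain start/end numbers. The base-R sketch below mimics the documented behaviour of flank for a single width and both = FALSE. flank_sketch and its at_start argument are hypothetical names used for illustration; this is not the IRanges implementation.

```r
# Hypothetical helper mimicking flank() on start/end vectors (both = FALSE).
flank_sketch <- function(start, end, width, at_start = TRUE) {
  if (at_start) {
    if (width >= 0) {
      data.frame(start = start - width, end = start - 1)      # left of the start
    } else {
      data.frame(start = start, end = start - width - 1)      # prefix of the range
    }
  } else {
    if (width >= 0) {
      data.frame(start = end + 1, end = end + width)          # right of the end
    } else {
      data.frame(start = end + width + 1, end = end)          # suffix of the range
    }
  }
}

s <- c(3000, 2500)
e <- s + c(300, 850) - 1                 # same two ranges as ir above

left   <- flank_sketch(s, e, 200)                     # [2800, 2999] and [2300, 2499]
prefix <- flank_sketch(s, e, -200)                    # [3000, 3199] and [2500, 2699]
right  <- flank_sketch(s, e, 200, at_start = FALSE)   # [3300, 3499] and [3350, 3549]
```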

reflect returns the mirror image of each range. bounds is a Ranges object that sets the position of the mirror:

  bounds <- IRanges(c(2000, 3000), width = 500)
  ir.trans <- reflect(ir, bounds = bounds)
  xlim <- c(0, max(end(ir), end(ir.trans), end(bounds)) * 1.3)
  ybottom <- plotRanges(ir, xlim = xlim, main = "reflect", col = "blue")
  plotRanges(bounds, add = TRUE, ybottom = ybottom, main = "")
  plotRanges(ir.trans, add = TRUE, ybottom = ybottom, main = "", col = "red")

[Figure 3: reflect]

promoters returns the promoter region. upstream and downstream set the length of sequence taken upstream and downstream, respectively:

  ir.trans <- promoters(ir, upstream = 1000, downstream = 100)
  xlim <- c(0, max(end(ir), end(ir.trans)) * 1.3)
  ybottom <- plotRanges(ir, xlim = xlim, main = "promoters", col = "blue")
  plotRanges(ir.trans, add = TRUE, ybottom = ybottom, main = "", col = "red")

[Figure 4: promoters]

resize changes the size of the ranges. width sets the new width and fix sets the position that stays in place (start/end/center):

  ir.trans <- resize(ir, width = c(100, 1300), fix = "start")
  xlim <- c(0, max(end(ir), end(ir.trans)) * 1.3)
  ybottom <- plotRanges(ir, xlim = xlim, main = "resize, fix=\"start\"", col = "blue")
  plotRanges(ir.trans, add = TRUE, ybottom = ybottom, main = "", col = "red")
  ir.trans <- resize(ir, width = c(100, 1300), fix = "center")
  xlim <- c(0, max(end(ir), end(ir.trans)) * 1.3)
  ybottom <- plotRanges(ir, xlim = xlim, main = "resize, fix=\"center\"", col = "blue")
  plotRanges(ir.trans, add = TRUE, ybottom = ybottom, main = "", col = "red")

[Figure 5: resize, fix="start"]

[Figure 6: resize, fix="center"]
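The three values of fix differ only in which position is held constant while the width changes. A base-R sketch of that arithmetic follows. resize_sketch is a hypothetical helper, and the integer division used for fix = "center" is an assumption about rounding that does not matter for the widths used here.

```r
# Hypothetical helper: resize ranges given as start/end vectors.
resize_sketch <- function(start, end, width, fix = c("start", "end", "center")) {
  fix <- match.arg(fix)
  old_width <- end - start + 1
  new_start <- switch(fix,
    start  = start,                                   # keep the start fixed
    end    = end - width + 1,                         # keep the end fixed
    center = start + (old_width - width) %/% 2        # shrink/grow around the middle
  )
  data.frame(start = new_start, end = new_start + width - 1, width = width)
}

s <- c(3000, 2500)
e <- s + c(300, 850) - 1                 # same two ranges as ir above

by_start  <- resize_sketch(s, e, c(100, 1300), fix = "start")    # [3000, 3099] and [2500, 3799]
by_center <- resize_sketch(s, e, c(100, 1300), fix = "center")   # [3100, 3199] and [2275, 3574]
```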

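Please try the remaining functions yourself.

2.3.2 Inter-range transformations

range returns the whole region spanned by the ranges (including the gaps between them); reduce merges overlapping ranges; gaps returns the regions between ranges: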

  ir <- IRanges(c(200, 1000, 3000, 2500), width = c(600, 1000, 300, 850))
  ir.trans <- range(ir)
  xlim <- c(0, max(end(ir), end(ir.trans)) * 1.3)
  ybottom <- plotRanges(ir, xlim = xlim, col = "blue")
  plotRanges(ir.trans, xlim = xlim, col = "red", main = "range")
  ir.trans <- reduce(ir)
  plotRanges(ir.trans, xlim = xlim, col = "red", main = "reduce")
  ir.trans <- gaps(ir)
  plotRanges(ir.trans, xlim = xlim, col = "red", main = "gaps")

[Figure 7: ir]

[Figure 8: range]

[Figure 9: reduce]

[Figure 10: gaps]
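To make the three operations concrete, the base-R sketch below reproduces them on plain start/end vectors for the same four ranges. reduce_sketch and gaps_sketch are hypothetical helpers; they merge overlapping and directly adjacent ranges, which is an assumption about the default behaviour and makes no difference for this data.

```r
# Hypothetical helper: merge overlapping (or directly adjacent) ranges.
reduce_sketch <- function(start, end) {
  o <- order(start, end)
  start <- start[o]
  end <- end[o]
  # Largest end seen before each range; a new merged range begins wherever
  # the start lies beyond that value plus one.
  prev_max_end <- c(-Inf, head(cummax(end), -1))
  group <- cumsum(start > prev_max_end + 1)
  data.frame(start = as.vector(tapply(start, group, min)),
             end   = as.vector(tapply(end, group, max)))
}

# Hypothetical helper: regions between consecutive merged ranges.
gaps_sketch <- function(start, end) {
  r <- reduce_sketch(start, end)
  n <- nrow(r)
  if (n < 2) return(data.frame(start = numeric(0), end = numeric(0)))
  data.frame(start = r$end[-n] + 1, end = r$start[-1] - 1)
}

s <- c(200, 1000, 3000, 2500)
e <- s + c(600, 1000, 300, 850) - 1

whole   <- c(start = min(s), end = max(e))   # range:  [200, 3349]
reduced <- reduce_sketch(s, e)               # reduce: [200, 799] [1000, 1999] [2500, 3349]
between <- gaps_sketch(s, e)                 # gaps:   [800, 999] [2000, 2499]
```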

2.3.3 Set operations between Ranges objects

intersect returns the intersection, setdiff the difference and union the union of two sets of ranges:

  ir1 <- IRanges(c(200, 1000, 3000, 2500), width = c(600, 1000, 300, 850))
  ir2 <- IRanges(c(100, 1500, 2000, 3500), width = c(600, 800, 1000, 550))
  xlim <- c(0, max(end(ir1), end(ir2)) * 1.3)
  ybottom <- plotRanges(reduce(ir1), xlim = xlim, col = "blue", main = "original")
  plotRanges(reduce(ir2), xlim = xlim, col = "blue", main = "", add = TRUE, ybottom = ybottom)
  plotRanges(intersect(ir1, ir2), xlim = xlim, col = "red")
  plotRanges(setdiff(ir1, ir2), xlim = xlim, col = "red")
  plotRanges(union(ir1, ir2), xlim = xlim, col = "red")

[Figure 11: original]

[Figure 12: intersect(ir1, ir2)]

[Figure 13: setdiff(ir1, ir2)]

[Figure 14: union(ir1, ir2)]
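These operations treat each Ranges object as the set of integer positions it covers. The base-R sketch below makes that explicit by expanding the ranges into positions, applying R's ordinary set functions and collapsing the result back into ranges. positions and to_ranges are hypothetical helpers, and the approach is only practical for small examples because it enumerates every position.

```r
# Hypothetical helper: all integer positions covered by a set of ranges.
positions <- function(start, end) unique(unlist(Map(seq, start, end)))

# Hypothetical helper: collapse positions into maximal contiguous ranges.
to_ranges <- function(pos) {
  pos <- sort(unique(pos))
  if (length(pos) == 0) return(data.frame(start = numeric(0), end = numeric(0)))
  breaks <- which(diff(pos) > 1)           # last position of each contiguous block
  data.frame(start = pos[c(1, breaks + 1)],
             end   = pos[c(breaks, length(pos))])
}

s1 <- c(200, 1000, 3000, 2500); e1 <- s1 + c(600, 1000, 300, 850) - 1   # like ir1
s2 <- c(100, 1500, 2000, 3500); e2 <- s2 + c(600, 800, 1000, 550) - 1   # like ir2
p1 <- positions(s1, e1)
p2 <- positions(s2, e2)

r_intersect <- to_ranges(intersect(p1, p2))   # [200, 699] [1500, 1999] [2500, 2999]
r_setdiff   <- to_ranges(setdiff(p1, p2))     # [700, 799] [1000, 1499] [3000, 3349]
r_union     <- to_ranges(union(p1, p2))       # [100, 799] [1000, 3349] [3500, 4049]
```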

There are also the functions punion, pintersect, psetdiff and pgap, which perform the corresponding operations element-wise.

Original source: http://blog.csdn.net/u014801157/article/details/24372479
