还是先获取随机DNA序列和其他序列对象:
library(Biostrings) rndSeq <- function(dict, n) { paste(sample(dict, n, replace = T), collapse = "") } set.seed(0) # 用mapply和rndSeq函数获取5条序列(字符串): DNA.raw <- mapply(rndSeq, list(DNA_BASES), rep(20, 5)) names(DNA.raw) <- paste("SEQ", 1:5, sep = "-") # DNAString对象,1条序列 DNA.str <- DNAString(DNA.raw[1]) # DNAStringSet对象,含5条序列 DNA.set <- DNAStringSet(DNA.raw) # Views对象 DNA.vws <- successiveViews(DNA.str, width = rep(4, 5))
一、获取序列基本信息
包括获取名称(names)、长度(length)、字符个数(nchar)和对象头/尾(head/tail)等信息的函数。
函数的用法简单,但需注意XString类对象的返回结果和其他类型有些差别:
# length函数---------------------- length(DNA.raw) ## [1] 5 # 结果为序列的长度 length(DNA.str) ## [1] 20 length(DNA.set) ## [1] 5 length(DNA.vws) ## [1] 5 # nchar函数----------------------- nchar(DNA.raw) ## SEQ-1 SEQ-2 SEQ-3 SEQ-4 SEQ-5 ## 20 20 20 20 20 nchar(DNA.str) ## [1] 20 nchar(DNA.set) ## [1] 20 20 20 20 20 nchar(DNA.vws) ## [1] 4 4 4 4 4 # head/tail函数------------------- head(DNA.raw, 2) ## SEQ-1 SEQ-2 ## "TCCGTATTGGAAAGCTCGTC" "TTAGACCACTCCGCATGTAG" # 结果为序列的前几个碱基 head(DNA.str, 2) ## 2-letter "DNAString" instance ## seq: TC head(DNA.set, 2) ## A DNAStringSet instance of length 2 ## width seq names ## [1] 20 TCCGTATTGGAAAGCTCGTC SEQ-1 ## [2] 20 TTAGACCACTCCGCATGTAG SEQ-2 head(DNA.vws, 2) ## Views on a 20-letter DNAString subject ## subject: TCCGTATTGGAAAGCTCGTC ## views: ## start end width ## [1] 1 4 4 [TCCG] ## [2] 5 8 4 [TATT]
二、序列转换
1、获取反向、互补、反向互补序列:
reverse(), complement(), reverseComplement()可以使用string, XString, XXXSet和Views类对象进行操作。下面看看reverse函数的结果:
DNA.raw ## SEQ-1 SEQ-2 SEQ-3 ## "TCCGTATTGGAAAGCTCGTC" "TTAGACCACTCCGCATGTAG" "CTGTGGTACGGCTCAAACGG" ## SEQ-4 SEQ-5 ## "CTCCCGCCTATCTCCCTTCT" "TCGCCTAGAAAAAGTTTCCT" reverse(DNA.raw) ## SEQ-1 SEQ-2 SEQ-3 ## "CTGCTCGAAAGGTTATGCCT" "GATGTACGCCTCACCAGATT" "GGCAAACTCGGCATGGTGTC" ## SEQ-4 SEQ-5 ## "TCTTCCCTCTATCCGCCCTC" "TCCTTTGAAAAAGATCCGCT" reverse(DNA.str) ## 20-letter "DNAString" instance ## seq: CTGCTCGAAAGGTTATGCCT reverse(DNA.set) ## A DNAStringSet instance of length 5 ## width seq names ## [1] 20 CTGCTCGAAAGGTTATGCCT SEQ-1 ## [2] 20 GATGTACGCCTCACCAGATT SEQ-2 ## [3] 20 GGCAAACTCGGCATGGTGTC SEQ-3 ## [4] 20 TCTTCCCTCTATCCGCCCTC SEQ-4 ## [5] 20 TCCTTTGAAAAAGATCCGCT SEQ-5 reverse(DNA.vws) ## Views on a 20-letter DNAString subject ## subject: CTGCTCGAAAGGTTATGCCT ## views: ## start end width ## [1] 17 20 4 [GCCT] ## [2] 13 16 4 [TTAT] ## [3] 9 12 4 [AAGG] ## [4] 5 8 4 [TCGA] ## [5] 1 4 4 [CTGC]
2、序列翻译:
翻译函数translate()只能使用XString和XXXSet类对象(Biostrings version 2.29.3):
# 错误 translate(DNA.raw) ## Error: unable to find an inherited method for function 'translate' for ## signature '"character"' translate(DNA.str) ## 6-letter "AAString" instance ## seq: SVLESS # 错误 translate(DNA.set) ## A AAStringSet instance of length 5 ## width seq ## [1] 6 SVLESS ## [2] 6 LDHSAC ## [3] 6 LWYGSN ## [4] 6 LPPISL ## [5] 6 SPRKSF # 错误 translate(DNA.vws) ## Error: unable to find an inherited method for function 'translate' for ## signature '"XStringViews"'
3、DNA/RNA互转:
使用dna2rna, rna2dna或cDNA函数。这些函数对于数据对象有严格的要求:
DNA.str ## 20-letter "DNAString" instance ## seq: TCCGTATTGGAAAGCTCGTC dna2rna(DNA.str) ## 20-letter "RNAString" instance ## seq: UCCGUAUUGGAAAGCUCGUC # 错误 dna2rna(DNA.raw) ## Error: dna2rna() only works on DNA input # 错误 cDNA(DNA.str) ## Error: cDNA() only works on RNA input cDNA(dna2rna(DNA.str)) ## 20-letter "DNAString" instance ## seq: AGGCATAACCTTTCGAGCAG
4、其他:
R base包的chartr函数已经重载,可以应用于序列对象,但最好避免使用,因为会出现符号检查问题:
chartr("T", "A", DNA.set) ## A DNAStringSet instance of length 5 ## width seq names ## [1] 20 ACCGAAAAGGAAAGCACGAC SEQ-1 ## [2] 20 AAAGACCACACCGCAAGAAG SEQ-2 ## [3] 20 CAGAGGAACGGCACAAACGG SEQ-3 ## [4] 20 CACCCGCCAAACACCCAACA SEQ-4 ## [5] 20 ACGCCAAGAAAAAGAAACCA SEQ-5 # 错误 chartr("T", "U", DNA.set) ## Error: key 85 (char 'U') not in lookup table
其他一些R base包的字符串操作函数也可以使用序列对象,但注意返回结果的类型。下面两行代码的返回值都是字符串向量,而不是DNAStringSet对象:
tolower(DNA.set) ## SEQ-1 SEQ-2 SEQ-3 ## "tccgtattggaaagctcgtc" "ttagaccactccgcatgtag" "ctgtggtacggctcaaacgg" ## SEQ-4 SEQ-5 ## "ctcccgcctatctcccttct" "tcgcctagaaaaagtttcct" gsub("T", "U", DNA.set) ## SEQ-1 SEQ-2 SEQ-3 ## "UCCGUAUUGGAAAGCUCGUC" "UUAGACCACUCCGCAUGUAG" "CUGUGGUACGGCUCAAACGG" ## SEQ-4 SEQ-5 ## "CUCCCGCCUAUCUCCCUUCU" "UCGCCUAGAAAAAGUUUCCU"
三、序列截取(sequence subset)
1、使用序列构造函数
Biostrings的XXXSet类和Views类序列构造函数本身就具备序列subset的功能,Views类对象可以通过类型转换获得XXXSet类对象。具体使用方法请参看上一篇文章。
2、subseq函数
subseq函数可以用于序列截取,也可以对选定序列进行修改:
DNAstr <- DNAString(rndSeq(DNA_BASES, 50)) (subseq(DNAstr, start = 20, end = 30)) ## 11-letter "DNAString" instance ## seq: CGTCCATCGAA # 上面语句的subseq函数仅仅是截取了序列,没有改变原序列 DNAstr ## 50-letter "DNAString" instance ## seq: GGCCTGAACTGTGCCAAGACGTCCATCGAAGGAAGTGGGTGGGACGCAGA # 下面语句的subseq函数改变了原序列 subseq(DNAstr, start = 20, end = 30) <- DNAString("NNN") DNAstr ## 42-letter "DNAString" instance ## seq: GGCCTGAACTGTGCCAAGANNNGGAAGTGGGTGGGACGCAGA
3、特殊序列查找和截取
回文(palindrome)序列相关的函数,有:
findPalindromes(subject, min.armlength = 4, max.looplength = 1, min.looplength = 0, max.mismatch = 0) palindromeArmLength(x, max.mismatch = 0, ...) palindromeLeftArm(x, max.mismatch = 0, ...) palindromeRightArm(x, max.mismatch = 0, ...) findComplementedPalindromes(subject, min.armlength = 4, max.looplength = 1, min.looplength = 0, max.mismatch = 0) complementedPalindromeArmLength(x, max.mismatch = 0, ...) complementedPalindromeLeftArm(x, max.mismatch = 0, ...) complementedPalindromeRightArm(x, max.mismatch = 0, ...)
有应用需求的自己去看看函数说明吧。
四、字符或寡核苷酸组合的统计
1、letterFrequency函数
这个函数用于统计指定字符的频率或比例:
letterFrequency(DNA.set, DNA_BASES) ## A C G T ## [1,] 4 5 5 6 ## [2,] 5 6 4 5 ## [3,] 4 5 7 4 ## [4,] 1 11 1 7 ## [5,] 6 5 3 6 letterFrequency(DNA.set, DNA_ALPHABET) ## A C G T M R W S Y K V H D B N - + ## [1,] 4 5 5 6 0 0 0 0 0 0 0 0 0 0 0 0 0 ## [2,] 5 6 4 5 0 0 0 0 0 0 0 0 0 0 0 0 0 ## [3,] 4 5 7 4 0 0 0 0 0 0 0 0 0 0 0 0 0 ## [4,] 1 11 1 7 0 0 0 0 0 0 0 0 0 0 0 0 0 ## [5,] 6 5 3 6 0 0 0 0 0 0 0 0 0 0 0 0 0 letterFrequency(DNA.set, DNA_BASES, as.prob = TRUE) ## A C G T ## [1,] 0.20 0.25 0.25 0.30 ## [2,] 0.25 0.30 0.20 0.25 ## [3,] 0.20 0.25 0.35 0.20 ## [4,] 0.05 0.55 0.05 0.35 ## [5,] 0.30 0.25 0.15 0.30 # 注意下面这种用法: letterFrequency(DNA.set, "GC", as.prob = TRUE) ## G|C ## [1,] 0.5 ## [2,] 0.5 ## [3,] 0.6 ## [4,] 0.6 ## [5,] 0.4
2、letterFrequencyInSlidingView函数
该函数按设置的窗口长度(view.width)一个个碱基滑动并统计字符频率或比例:
letterFrequencyInSlidingView(DNA.set[[1]], view.width = 16, letter = DNA_BASES) ## A C G T ## [1,] 4 3 4 5 ## [2,] 4 4 4 4 ## [3,] 4 3 5 4 ## [4,] 4 2 5 5 ## [5,] 4 3 4 5 letterFrequencyInSlidingView(DNA.set[[1]], 16, "GC", as.prob = TRUE) ## G|C ## [1,] 0.4375 ## [2,] 0.5000 ## [3,] 0.5000 ## [4,] 0.4375 ## [5,] 0.4375
3、alphabetFrequency函数
作用与letterFrequency函数类似,但按ALPHABET中的所有因子进行统计。baseOnly设置为TRUE可以对ALPHABET进行限制:
alphabetFrequency(DNA.set) ## A C G T M R W S Y K V H D B N - + ## [1,] 4 5 5 6 0 0 0 0 0 0 0 0 0 0 0 0 0 ## [2,] 5 6 4 5 0 0 0 0 0 0 0 0 0 0 0 0 0 ## [3,] 4 5 7 4 0 0 0 0 0 0 0 0 0 0 0 0 0 ## [4,] 1 11 1 7 0 0 0 0 0 0 0 0 0 0 0 0 0 ## [5,] 6 5 3 6 0 0 0 0 0 0 0 0 0 0 0 0 0 alphabetFrequency(DNA.set, as.prob = TRUE, baseOnly = TRUE) ## A C G T other ## [1,] 0.20 0.25 0.25 0.30 0 ## [2,] 0.25 0.30 0.20 0.25 0 ## [3,] 0.20 0.25 0.35 0.20 0 ## [4,] 0.05 0.55 0.05 0.35 0 ## [5,] 0.30 0.25 0.15 0.30 0
4、寡核苷酸统计:
有以下函数,分别统计2核苷酸组合、3核苷酸组合和寡核苷酸组合:
dinucleotideFrequency
trinucleotideFrequency
oligonucleotideFrequency
五、杂项函数
包括使用序列对象进行运算的一些函数(和符号):
# 反转对象元素的顺序,不是反转序列! rev(DNA.set) ## A DNAStringSet instance of length 5 ## width seq names ## [1] 20 TCGCCTAGAAAAAGTTTCCT SEQ-5 ## [2] 20 CTCCCGCCTATCTCCCTTCT SEQ-4 ## [3] 20 CTGTGGTACGGCTCAAACGG SEQ-3 ## [4] 20 TTAGACCACTCCGCATGTAG SEQ-2 ## [5] 20 TCCGTATTGGAAAGCTCGTC SEQ-1 # 判断两个对象是否相同 DNA.set[[1]] == DNA.set[[2]] ## [1] FALSE # 判断对象是否包含在另外一个对象的元素中 DNA.str %in% DNA.set ## [1] TRUE
Biostrings在2.0版后还添加了append函数,可以把几个序列集合合并,还可以使用c函数进行合并:
append(DNA.set, DNAStringSet(DNA.str)) ## A DNAStringSet instance of length 6 ## width seq names ## [1] 20 TCCGTATTGGAAAGCTCGTC SEQ-1 ## [2] 20 TTAGACCACTCCGCATGTAG SEQ-2 ## [3] 20 CTGTGGTACGGCTCAAACGG SEQ-3 ## [4] 20 CTCCCGCCTATCTCCCTTCT SEQ-4 ## [5] 20 TCGCCTAGAAAAAGTTTCCT SEQ-5 ## [6] 20 TCCGTATTGGAAAGCTCGTC c(DNA.set, DNAStringSet(DNA.str)) ## A DNAStringSet instance of length 6 ## width seq names ## [1] 20 TCCGTATTGGAAAGCTCGTC SEQ-1 ## [2] 20 TTAGACCACTCCGCATGTAG SEQ-2 ## [3] 20 CTGTGGTACGGCTCAAACGG SEQ-3 ## [4] 20 CTCCCGCCTATCTCCCTTCT SEQ-4 ## [5] 20 TCGCCTAGAAAAAGTTTCCT SEQ-5 ## [6] 20 TCCGTATTGGAAAGCTCGTC
除以上列出的例子外还有duplicated, unique, sort, order, split, relist等。随着Biostrings的完善,可能会添加更多的函数,使序列的运算更符合R语言的运算习惯。
注:本文R语言的代码加亮显示由RStudio/knitr产生,使用R脚本对其产生的HTML代码进行解析后发布到本博客。
原文来自:http://blog.csdn.net/u014801157/article/details/24372455