利用bioperl读取复杂序列

2012/07/27评论3,826

Genbank序列描述的内容就非常丰富（类似的还有SwissProt，EMBL等格式的序列），除了名称、描述和序列字符串以外还有序列号、形状（线状还是环状？）、发布日期、所属物种以及序列内包含的基因、RNA、CDS、内含子（如果有的话）……What's more，这些信息并不是你想怎么写就怎么写的，而都是有一定的规范格式，故BioPerl可以准确地识别并提取这些信息。所以，不要自行修改下载得到的Genbank序列，否则BioPerl可能会无法识别。

如何从本地文件中提取Genbank序列？

[perl]
#!usr/bin/perl -w

use Bio::SeqIO;
$catch_seqio_obj->new(-file => 'wrky.gbk', -format => 'genbank');
$seq_obj = $catch_seqio_obj->next_seq; #后缀加上去比较好，也可以使用ref函数来检查一下$catch_seqio_obj究竟是什么。ref($catch_seq); 该函数会返回Bio::SeqIO::genpept，所以说明$catch_seqio_obj是一个Bio::SeqIO类型的对象！
#$seq_obj = Bio::SeqIO -> new(-file => 'wrky.gbk', -format => 'genbank') -> next_seq; 上边两句合并，注意：如果一个文件中有多条序列，那是不能这么写的。
[/perl]
[perl]
$display_name = $seq_obj->display_name; # 序列的名称
$desc = $seq_obj->desc; # 序列的描述
$seq = $seq_obj->seq; # 序列字符串
$seq_type = $seq_obj->alphabet; # 序列的类型（dna还是蛋白质？）
$seq_length = $seq_obj->length; # 序列的长度
[/perl]
$seq_obj能调用的方法还有一些，在http://www.bioperl.org/wiki/HOWTO:Beginners的Table3中有详细的描述。

我们更想要物种名以及序列里面包含的基因、mRNA、CDS等元件的位置、名称、序列等信息。BioPerl有另外一套机制来获取它们。

处理多个Genbank文件：只是模块要换一下，把Bio::SeqIO改成Bio::SeqIO::MultiFile，
[perl]
use Bio::SeqIO::MultiFile;
$catch_seqio_obj = Bio::SeqIO::MultiFile -> new(-files=>['a.gbk','b.gbk'], -format=>'genbank'); # 读取a.gbk和b.gbk两个文件
while($seq_obj = $catch_seqio_obj -> next_seq) {
do sth...
}
[/perl]
或者：
[perl]
use Bio::SeqIO::MultiFile;
$catch_seqio_obj = Bio::SeqIO::MultiFile -> new(-files=>[glob "*.gbk"], -format=>'genbank'); # 读取当前目录所有后缀名为gbk的文件。
while($seq_obj = $catch_seqio_obj-> next_seq) {
do sth...
}
[/perl]
从命令行中传入参数

如果我们想用前面的代码处理另一个Genbank文件（比如seq.gbk），我们必须打开源代码文件，把-file => 'wrky.gbk' 改成 -file => 'seq.gbk'，这样子太麻烦了。事实上这些要处理的文件的名字应该由用户来决定，而不是由你写在源代码里。有没有办法像“钻石操作符”（<>）那样通过命令行参数读入文件名呢？当然可以请看：
[perl]
use Bio::SeqIO;
$filename = shift @ARGV; # 从命令行中读取文件名,而不是写在程序里
$catch_seqio_obj = Bio::SeqIO -> new(-file => $filename, -format => 'genbank');
[/perl]
呵呵，这样子这几行代码就可以处理世界上所有的Genbank文件了，再也不需要修改源代码了。

读取多个文件
[perl]
use Bio::SeqIO::MultiFile;
@filenames = @ARGV;
$catch_seq = Bio::SeqIO::MultiFile -> new(-files => [@filenames], -format => 'genbank');
[/perl]

Table 3: Values from the Sequence object (Genbank)
Method	Returns
display_id	ECORHO
desc	E.coli rho gene coding for transcription termination factor.
display_name	ECORHO
accession	J01673
primary_id	147605
seq_version	1
keywords	attenuator; leader peptide; rho gene; transcription terminator
is_circular
namespace
authority
length	1880
seq	AACCCT...ACAGGAC
division	BCT
molecule	DNA
get_dates	26-APR-1993
get_secondary_accessions	J01674

关于Bio::SeqIO模块究竟支持哪些序列格式，以及格式的写法如下：（from http://www.bioperl.org/wiki/HOWTO:SeqIO ）

Table 1: Bio::SeqIO modules and formats supported
Name	Description	File extension	Module
abi	ABI tracefile	ab[i1]	Bio::SeqIO::abi
ace	Ace database	ace	Bio::SeqIO::ace
agave	AGAVE XML		Bio::SeqIO::agave
alf	ALF tracefile	alf	Bio::SeqIO::alf
asciitree	write-only, to visualize features		Bio::SeqIO::asciitree
bsml	BSML, using XML::DOM	bsml	Bio::SeqIO::bsml
bsml_sax	BSML, using XML::SAX		Bio::SeqIO::bsml_sax
chadoxml	CHADO sequence format		Bio::SeqIO::chadoxml
chaos	CHAOS sequence format		Bio::SeqIO::chaos
chaosxml	Chaos XML		Bio::SeqIO::chaosxml
ctf	CTF tracefile	ctf	Bio::SeqIO::ctf
embl	EMBL database	ebl\|emb\|dat	Bio::SeqIO::embl
entrezgene	Entrez Gene ASN1		Bio::SeqIO::entrezgene
excel	Excel		Bio::SeqIO::excel
exp	Staden EXP format	exp	Bio::SeqIO::exp
fasta	FASTA	fast\|seq\|fa\|fsa\|nt\|aa	Bio::SeqIO::fasta
fastq	quality score data in FASTA-like format	fastq	Bio::SeqIO::fastq
flybase_chadoxml	variant of Chado XML		Bio::SeqIO::flybase_chadoxml
game	GAME XML		Bio::SeqIO::game
gcg	GCG	gcg	Bio::SeqIO::gcg
genbank	GenBank	gbank\|genbank	Bio::SeqIO::genbank
interpro	InterProScan XML		Bio::SeqIO::interpro
kegg	KEGG		Bio::SeqIO::kegg
largefasta	Large files, fasta format		Bio::SeqIO::largefasta
lasergene	Lasergene format		Bio::SeqIO::lasergene
locuslink	LocusLink LL_tmpl		Bio::SeqIO::locuslink
metafasta			Bio::SeqIO::metafasta
phd	Phred	phred	Bio::SeqIO::phd
pir	PIR database	pir	Bio::SeqIO::pir
pln	PLN tracefile	pln	Bio::SeqIO::pln
qual	Phred		Bio::SeqIO::qual
raw	plain text	txt	Bio::SeqIO::raw
scf	Standard Chromatogram Format	scf	Bio::SeqIO::scf
seqxml	SeqXML sequence format using XML::LibXML and XML::Writer	xml	Bio::SeqIO::seqxml
strider	DNA Strider format		Bio::SeqIO::strider
swiss	SwissProt	sp	Bio::SeqIO::swiss
tab	tab-delimited		Bio::SeqIO::tab
table	Table		Bio::SeqIO::table
tigr	TIGR XML		Bio::SeqIO::tigr
tigrxml	TIGR Coordset XML		Bio::SeqIO::tigrxml
tinyseq	NCBI TinySeq XML		Bio::SeqIO::tinyseq
ztr	ZTR tracefile	ztr	Bio::SeqIO::ztr

本文来自：http://blog.163.com/guojinyan2008@126/blog/static/3074706720111019101034907/

发表评论