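The Sequence Object
Earlier sections used sequence objects in many places and showed some ways to create and use them. This section describes in detail what a sequence object can do.
The table below lists the methods of the sequence object (a method is an object-oriented programming concept, see the earlier discussion). The RETURNS column shows the value the object returns when the method is called. Some methods, such as seq(), can be used both to get a value and to set one. For example, to get the sequence from an existing sequence object: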
- $sequence_as_string = $seq_obj->seq;
You can also set the sequence yourself:
- $seq_obj->seq("MMTYDFFFFVVNNNNPPPPAAAW");
NAME | RETURNS | EXAMPLE | NOTE |
---|---|---|---|
accession_number | identifier | $acc = $so->accession_number | get or set an identifier |
alphabet | alphabet | $so->alphabet('dna') | get or set the alphabet ('dna', 'rna', 'protein') |
authority | authority, if available | $so->authority("FlyBase") | get or set the organization |
desc | description | $so->desc("Example 1") | get or set a description |
display_id | identifier | $so->display_id("NP_123456") | get or set an identifier |
division | division, if available (e.g. PRI) | $div = $so->division | get division (e.g. "PRI") |
get_dates | array of dates, if available | @dates = $so->get_dates | get dates |
get_secondary_accessions | array of secondary accessions, if available | @accs = $so->get_secondary_accessions | get other identifiers |
is_circular | Boolean | if ($so->is_circular) { ... } | get or set |
keywords | keywords, if available | @array = $so->keywords | get or set keywords |
length | length, a number | $len = $so->length | get the length |
molecule | molecule type, if available | $type = $so->molecule | get molecule (e.g. "RNA", "DNA") |
namespace | namespace, if available | $so->namespace("Private") | get or set the name space |
new | Sequence object | $so = Bio::Seq->new(-seq => "MPQRAS") | create a new one, see Bio::Seq for more |
pid | pid, if available | $pid = $so->pid | get pid |
primary_id | identifier | $so->primary_id(12345) | get or set an identifier |
revcom | Sequence object | $so2 = $so1->revcom | Reverse complement |
seq | sequence string | $seq = $so->seq | get or set the sequence |
seq_version | version, if available | $so->seq_version("1") | get or set a version |
species | Species object | $species_obj = $so->species | See Bio::Species for more |
subseq | sequence string | $string = $seq_obj->subseq(10,40) | Arguments are start and end |
translate | protein Sequence object | $prot_obj = $dna_obj->translate | See the Bioperl Tutorial for more |
trunc | Sequence object | $so2 = $so1->trunc(10,40) | Arguments are start and end |
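As a minimal sketch of how several of the methods in the table behave, the script below builds a sequence object in memory and calls a few of them. The identifier, description and 12-base sequence are made up for illustration and do not come from any database.

```perl
use strict;
use warnings;
use Bio::Seq;

# Hypothetical 12-base DNA sequence: ATG GCC CTG TAA (M, A, L, stop).
my $seq_obj = Bio::Seq->new(
    -display_id => 'example1',
    -desc       => 'Example 1',
    -alphabet   => 'dna',
    -seq        => 'ATGGCCCTGTAA',
);

# Getters that return plain values.
print $seq_obj->length, "\n";          # 12
print $seq_obj->subseq(4, 9), "\n";    # GCCCTG (a string)

# Methods that return new sequence objects; call seq on them to get the string.
my $rev_obj   = $seq_obj->revcom;      # reverse complement
my $trunc_obj = $seq_obj->trunc(1, 9); # positions 1 to 9 as an object
my $prot_obj  = $seq_obj->translate;   # protein sequence object
print $rev_obj->seq, "\n";             # TTACAGGGCCAT
print $trunc_obj->seq, "\n";           # ATGGCCCTG
print $prot_obj->seq, "\n";            # MAL*

# The same method sets a value when given an argument.
$seq_obj->desc('Example 1, renamed');
print $seq_obj->desc, "\n";
```

Note the difference between subseq, which returns a string, and trunc, which returns a sequence object covering the same range.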
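Note that some of the methods listed above, such as molecule and division, only return a meaningful value when the sequence object actually holds that information, and some sequence formats do not include it. Before relying on a method, make sure you know the input sequence file and what it contains.
There are also methods that deal with sequence annotation, but that topic is somewhat outside the scope of this section; see the Feature-Annotation HOWTO for details. The table below lists some of the relevant methods.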
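One defensive way to handle methods that may be missing or empty is to check before calling them. The helper below, value_or_default, is a hypothetical name introduced here for illustration and is not part of Bioperl; it is meant for methods that return a single scalar value.

```perl
use strict;
use warnings;
use Bio::Seq;

# Hypothetical helper: return the value of a scalar accessor, or a
# fallback when the object has no such method or the value is empty.
sub value_or_default {
    my ($seq_obj, $method, $default) = @_;
    return $default unless $seq_obj->can($method);
    my $value = $seq_obj->$method;
    return (defined $value && length $value) ? $value : $default;
}

# A sequence object built by hand carries no database-specific fields.
my $plain_obj = Bio::Seq->new(-display_id => 'example2', -seq => 'ATGGCC');

for my $method (qw(display_id molecule division)) {
    printf "%-12s %s\n", $method,
        value_or_default($plain_obj, $method, '(not available)');
}
```

The same check works on objects read from files, where the available fields depend on the input format.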
NAME | RETURNS | NOTE |
---|---|---|
get_SeqFeatures | array of SeqFeature objects | |
get_all_SeqFeatures | array of SeqFeature objects | includes sub-features |
remove_SeqFeatures | array of SeqFeatures removed | |
feature_count | number of SeqFeature objects | |
add_SeqFeature | | |
annotation | array of Annotation objects | get or set |
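A short sketch of the feature methods, assuming a hand-built sequence and a single made-up CDS feature; real features normally come from parsing an annotated file such as a Genbank entry.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqFeature::Generic;

# Hypothetical sequence and feature coordinates, for illustration only.
my $seq_obj = Bio::Seq->new(
    -display_id => 'example3',
    -alphabet   => 'dna',
    -seq        => 'ATGGCCCTGTAAATGGCC',
);

my $feature = Bio::SeqFeature::Generic->new(
    -start       => 1,
    -end         => 12,
    -strand      => 1,
    -primary_tag => 'CDS',
    -tag         => { note => 'hypothetical example feature' },
);

# Attach the feature, then read it back through the sequence object.
$seq_obj->add_SeqFeature($feature);
print "features: ", $seq_obj->feature_count, "\n";

for my $feat ($seq_obj->get_SeqFeatures) {
    printf "%s %d..%d\n", $feat->primary_tag, $feat->start, $feat->end;
}
```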
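Examples
Next, let's see how the methods above are used: how sequence objects are obtained from different sources and what each method returns. First, fetch an entry from Genbank and create a sequence object from it: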
- use Bio::DB::GenBank;
- $db_obj = Bio::DB::GenBank->new;
- $seq_obj = $db_obj->get_Seq_by_acc("J01673");
Or read it from an existing local Genbank file:
- use Bio::SeqIO;
- $seqio_obj = Bio::SeqIO->new(-file => "J01673.gb", -format => "genbank" );
- $seq_obj = $seqio_obj->next_seq;
The Genbank file looks like this:
- LOCUS ECORHO 1880 bp DNA linear BCT 26-APR-1993
- DEFINITION E.coli rho gene coding for transcription termination factor.
- ACCESSION J01673 J01674
- VERSION J01673.1 GI:147605
- KEYWORDS attenuator; leader peptide; rho gene; transcription terminator.
- SOURCE Escherichia coli
- ORGANISM Escherichia coli
- Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
- Enterobacteriaceae; Escherichia.
- REFERENCE 1 (bases 1 to 1880)
- AUTHORS Brown,S., Albrechtsen,B., Pedersen,S. and Klemm,P.
- TITLE Localization and regulation of the structural gene for
- transcription-termination factor rho of Escherichia coli
- JOURNAL J. Mol. Biol. 162 (2), 283-298 (1982)
- MEDLINE 83138788
- PUBMED 6219230
- REFERENCE 2 (bases 1 to 1880)
- AUTHORS Pinkham,J.L. and Platt,T.
- TITLE The nucleotide sequence of the rho gene of E. coli K-12
- JOURNAL Nucleic Acids Res. 11 (11), 3531-3545 (1983)
- MEDLINE 83220759
- PUBMED 6304634
- COMMENT Original source text: Escherichia coli (strain K-12) DNA.
- A clean copy of the sequence for [2] was kindly provided by
- J.L.Pinkham and T.Platt.
- FEATURES Location/Qualifiers
- source 1..1880
- /organism="Escherichia coli"
- /mol_type="genomic DNA"
- /strain="K-12"
- /db_xref="taxon:562"
- mRNA 212..>1880
- /product="rho mRNA"
- CDS 282..383
- /note="rho operon leader peptide"
- /codon_start=1
- /transl_table=11
- /protein_id="AAA24531.1"
- /db_xref="GI:147606"
- /translation="MRSEQISGSSLNPSCRFSSAYSPVTRQRKDMSR"
- gene 468..1727
- /gene="rho"
- CDS 468..1727
- /gene="rho"
- /note="transcription termination factor"
- /codon_start=1
- /transl_table=11
- /protein_id="AAA24532.1"
- /db_xref="GI:147607"
- /translation="MNLTELKNTPVSELITLGENMGLENLARMRKQDIIFAILKQHAK
- SGEDIFGDGVLEILQDGFGFLRSADSSYLAGPDDIYVSPSQIRRFNLRTGDTISGKIR
- PPKEGERYFALLKVNEVNFDKPENARNKILFENLTPLHANSRLRMERGNGSTEDLTAR
- VLDLASPIGRGQRGLIVAPPKAGKTMLLQNIAQSIAYNHPDCVLMVLLIDERPEEVTE
- MQRLVKGEVVASTFDEPASRHVQVAEMVIEKAKRLVEHKKDVIILLDSITRLARAYNT
- VVPASGKVLTGGVDANALHRPKRFFGAARNVEEGGSLTIIATALIDTGSKMDEVIYEE
- FKGTGNMELHLSRKIAEKRVFPAIDYNRSGTRKEELLTTQEELQKMWILRKIIHPMGE
- IDAMEFLINKLAMTKTNDDFFEMMKRS"
- ORIGIN 15 bp upstream from HhaI site.
- 1 aaccctagca ctgcgccgaa atatggcatc cgtggtatcc cgactctgct gctgttcaaa
- 61 aacggtgaag tggcggcaac caaagtgggt gcactgtcta aaggtcagtt gaaagagttc
- ...deleted...
- 1801 tgggcatgtt aggaaaattc ctggaatttg ctggcatgtt atgcaatttg catatcaaat
- 1861 ggttaatttt tgcacaggac
- //
Either way, you end up with the same sequence object. The table below lists methods of this sequence object and the values they return.
METHOD | RETURNS |
---|---|
display_id | ECORHO |
desc | E.coli rho gene coding for transcription termination factor. |
display_name | ECORHO |
accession | J01673 |
primary_id | 147605 |
seq_version | 1 |
keywords | attenuator; leader peptide; rho gene; transcription terminator |
is_circular | |
namespace | |
authority | |
length | 1880 |
seq | AACCCT…ACAGGAC |
division | BCT |
molecule | DNA |
get_dates | 26-APR-1993 |
get_secondary_accessions | J01674 |
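A few remarks are in order. First, a lot of the information in the record is not returned. The "missing" information all belongs to the sequence's features and annotations; see the Feature and Annotation HOWTO. Second, some methods return empty values, for example namespace and authority. The reason is that there is not yet a commonly accepted format or agreed name for the corresponding information; the authors may rewrite the code once there is. (Translator's note: the authors probably defined the structure first without any content behind it. For now these methods return nothing useful and can be ignored.) Finally, you may wonder how each piece of sequence information gets matched to a particular method. In general, because there is no common standard, the code authors used common sense to give each piece of information a reasonable name and mapped it to a method.
Now consider a fasta file as input (still the same sequence). The fasta format, shown below, is very simple compared with Genbank: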
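A table like the one above can be produced with a small script. The sketch below is one possible way to do it, not the method used to build the table: describe_seq is a hypothetical helper, and the file name and format are taken from the command line. Fields such as division and molecule are only read when the parser returned a Bio::Seq::RichSeq object.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Hypothetical helper: collect a few method values into a hash reference.
sub describe_seq {
    my ($seq_obj) = @_;
    my %info;
    for my $method (qw(display_id desc accession_number primary_id length)) {
        $info{$method} = $seq_obj->$method;
    }
    # These fields exist only on rich sequence objects, which parsers
    # for annotated formats such as Genbank return.
    if ($seq_obj->isa('Bio::Seq::RichSeq')) {
        $info{division}    = $seq_obj->division;
        $info{molecule}    = $seq_obj->molecule;
        $info{seq_version} = $seq_obj->seq_version;
        $info{get_dates}   = join ' ', $seq_obj->get_dates;
        $info{get_secondary_accessions} =
            join ' ', $seq_obj->get_secondary_accessions;
    }
    return \%info;
}

# Usage: perl describe.pl J01673.gb genbank
if (@ARGV) {
    my ($file, $format) = @ARGV;
    my $seqio_obj = Bio::SeqIO->new(
        -file   => $file,
        -format => $format || 'genbank',
    );
    while (my $seq_obj = $seqio_obj->next_seq) {
        my $info = describe_seq($seq_obj);
        for my $method (sort keys %$info) {
            my $value = defined $info->{$method} ? $info->{$method} : '';
            printf "%-26s %s\n", $method, $value;
        }
    }
}
```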
- >gi|147605|gb|J01673.1|ECORHO E.coli rho gene coding for transcription termination factor
- AACCCTAGCACTGCGCCGAAATATGGCATCCGTGGTATCCCGACTCTGCTGCTGTTCAAAAACGGTGAAG
- TGGCGGCAACCAAAGTGGGTGCACTGTCTAAAGGTCAGTTGAAAGAGTTCCTCGACGCTAACCTGGCGTA
- ...deleted...
- ACGTGTTTACGTGGCGTTTTGCTTTTATATCTGTAATCTTAATGCCGCGCTGGGCATGTTAGGAAAATTC
- CTGGAATTTGCTGGCATGTTATGCAATTTGCATATCAAATGGTTAATTTTTGCACAGGAC
The values returned:
METHOD | RETURNS |
---|---|
display_id | gi\|147605\|gb\|J01673.1\|ECORHO |
desc | E.coli rho gene coding for transcription termination factor |
display_name | gi\|147605\|gb\|J01673.1\|ECORHO |
accession | unknown |
primary_id | gi\|147605\|gb\|J01673.1\|ECORHO |
is_circular | |
namespace | |
authority | |
length | 1880 |
seq | AACCCT…ACAGGAC |
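Compared with the Genbank file, some information is missing, such as seq_version. Other methods, such as display_id, return different values. The reason is that the rules the Genbank server follows when converting Genbank format to fasta differ from the rules the SwissProt server follows when converting SwissProt format to fasta. Without a single standard, the code authors generally map each piece of sequence information to a method according to their own judgment. Bioperl could follow one particular convention, such as the one Genbank uses, but the Bioperl authors decided by vote not to follow any conversion rule that comes from a single organization.
Next, consider input from a SwissProt file.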
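The sketch below parses a fasta record from an in-memory string to show where the header text ends up. Only the first sequence line of the entry is included, so the length will not match the full record. Splitting the identifier into its NCBI parts is done here with a regular expression written for this one header layout; it is an illustration, not something the fasta parser does for you.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Truncated copy of the fasta entry (first sequence line only).
my $fasta = <<'END_FASTA';
>gi|147605|gb|J01673.1|ECORHO E.coli rho gene coding for transcription termination factor
AACCCTAGCACTGCGCCGAAATATGGCATCCGTGGTATCCCGACTCTGCTGCTGTTCAAAAACGGTGAAG
END_FASTA

# Read from the string through an in-memory filehandle.
open my $fh, '<', \$fasta or die "Cannot open in-memory fasta: $!";
my $seqio_obj = Bio::SeqIO->new(-fh => $fh, -format => 'fasta');
my $seq_obj   = $seqio_obj->next_seq;

# Everything up to the first space becomes display_id; the rest is desc.
print "display_id: ", $seq_obj->display_id, "\n";
print "desc:       ", $seq_obj->desc, "\n";

# Pull the GI number and versioned accession out of the identifier.
my ($gi, $acc_ver) = $seq_obj->display_id =~ /^gi\|(\d+)\|gb\|([^|]+)\|/;
print "gi: $gi, accession.version: $acc_ver\n" if defined $gi;
```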
- ID A2S3_RAT STANDARD; PRT; 913 AA.
- AC Q8R2H7; Q8R2H6; Q8R4G3;
- DT 28-FEB-2003 (Rel. 41, Created)
- DE Amyotrophic lateral sclerosis 2 chromosomal region candidate gene
- DE protein 3 homolog (GABA-A receptor interacting factor-1) (GRIF-1) (O-
- DE GlcNAc transferase-interacting protein of 98 kDa).
- GN ALS2CR3 OR GRIF1 OR OIP98.
- OS Rattus norvegicus (Rat).
- OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
- OC Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Rattus.
- OX NCBI_TaxID=10116;
- RN [1]
- RP SEQUENCE FROM N.A. (ISOFORMS 1 AND 2), SUBCELLULAR LOCATION, AND
- RP INTERACTION WITH GABA-A RECEPTOR.
- RC TISSUE=Brain;
- RX MEDLINE=22162448; PubMed=12034717;
- RA Beck M., Brickley K., Wilkinson H.L., Sharma S., Smith M.,
- RA Chazot P.L., Pollard S., Stephenson F.A.;
- RT "Identification, molecular cloning, and characterization of a novel
- RT GABAA receptor-associated protein, GRIF-1.";
- RL J. Biol. Chem. 277:30079-30090(2002).
- RN [2]
- RP REVISIONS TO 579 AND 595-596, AND VARIANTS VAL-609 AND PRO-820.
- RA Stephenson F.A.;
- RL Submitted (FEB-2003) to the EMBL/GenBank/DDBJ databases.
- RN [3]
- RP SEQUENCE FROM N.A. (ISOFORM 3), INTERACTION WITH O-GLCNAC TRANSFERASE,
- RP AND O-GLYCOSYLATION.
- RC STRAIN=Sprague-Dawley; TISSUE=Brain;
- RX MEDLINE=22464403; PubMed=12435728;
- RA Iyer S.P.N., Akimoto Y., Hart G.W.;
- RT "Identification and cloning of a novel family of coiled-coil domain
- RT proteins that interact with O-GlcNAc transferase.";
- RL J. Biol. Chem. 278:5399-5409(2003).
- CC -!- SUBUNIT: Interacts with GABA-A receptor and O-GlcNac transferase.
- CC -!- SUBCELLULAR LOCATION: Cytoplasmic.
- CC -!- ALTERNATIVE PRODUCTS:
- CC Event=Alternative splicing; Named isoforms=3;
- CC Name=1; Synonyms=GRIF-1a;
- CC IsoId=Q8R2H7-1; Sequence=Displayed;
- CC Name=2; Synonyms=GRIF-1b;
- CC IsoId=Q8R2H7-2; Sequence=VSP_003786, VSP_003787;
- CC Name=3;
- CC IsoId=Q8R2H7-3; Sequence=VSP_003788;
- CC -!- PTM: O-glycosylated.
- CC -!- SIMILARITY: TO HUMAN OIP106.
- DR EMBL; AJ288898; CAC81785.2; -.
- DR EMBL; AJ288898; CAC81786.2; -.
- DR EMBL; AF474163; AAL84588.1; -.
- DR GO; GO:0005737; C:cytoplasm; IEP.
- DR GO; GO:0005634; C:nucleus; IDA.
- DR GO; GO:0005886; C:plasma membrane; IEP.
- DR GO; GO:0006357; P:regulation of transcription from Pol II pro...; IDA.
- DR InterPro; IPR006933; HAP1_N.
- DR Pfam; PF04849; HAP1_N; 1.
- KW Coiled coil; Alternative splicing; Polymorphism.
- FT DOMAIN 134 355 COILED COIL (POTENTIAL).
- FT VARSPLIC 653 672 VATSNPGKCLSFTNSTFTFT -> ALVSHHCPVEAVRAVHP
- FT TRL (in isoform 2).
- FT /FTId=VSP_003786.
- FT VARSPLIC 673 913 Missing (in isoform 2).
- FT /FTId=VSP_003787.
- FT VARSPLIC 620 687 VQQPLQLEQKPAPPPPVTGIFLPPMTSAGGPVSVATSNPGK
- FT CLSFTNSTFTFTTCRILHPSDITQVTP -> GSAASSTGAE
- FT ACTTPASNGYLPAAHDLSRGTSL (in isoform 3).
- FT /FTId=VSP_003788.
- FT VARIANT 609 609 E -> V.
- FT VARIANT 820 820 S -> P.
- SQ SEQUENCE 913 AA; 101638 MW; D0E135DBEC30C28C CRC64;
- MSLSQNAIFK SQTGEENLMS SNHRDSESIT DVCSNEDLPE VELVNLLEEQ LPQYKLRVDS
- LFLYENQDWS QSSHQQQDAS ETLSPVLAEE TFRYMILGTD RVEQMTKTYN DIDMVTHLLA
- ...deleted...
- GIARVVKTPV PRENGKSREA EMGLQKPDSA VYLNSGGSLL GGLRRNQSLP VMMGSFGAPV
- CTTSPKMGIL KED
- //
The corresponding return values are shown in the table below:
METHOD | RETURNS |
---|---|
display_id | A2S3_RAT |
desc | Amyotrophic lateral … protein of 98 kDa). |
display_name | A2S3_RAT |
accession | Q8R2H7 |
is_circular | |
namespace | |
authority | |
seq_version | |
keywords | Coiled coil; Alternative splicing; Polymorphism |
length | 913 |
seq | MSLSQ…ILKED |
division | RAT |
get_dates | 28-FEB-2003 (Rel. 41, Created) |
get_secondary_accessions | Q8R2H6 Q8R4G3 |
As with Genbank, see the Feature and Annotation HOWTO for how to access the sequence annotation.