Bioperl Sequence Objects in Detail (Bioperl HOWTO Translation, Part 7)

The Sequence Object

English original

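Earlier parts used sequence objects many times and showed some ways to create and use them. This part describes in detail what a sequence object can do.

The table below lists the methods of the sequence object ("method" is an object-oriented programming concept, see the earlier parts; the table contents are left in English). RETURNS is the value the object returns when the method is called. Some methods, such as seq(), can be used both to get a value and to set one. For example, to get the sequence from an existing sequence object:

    my $sequence_as_string = $seq_obj->seq;

You can also set the sequence yourself:

    $seq_obj->seq("MMTYDFFFFVVNNNNPPPPAAAW");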

Table 1: Sequence Object Methods

NAME                      RETURNS                                      EXAMPLE                                NOTE
accession_number          identifier                                   $acc = $so->accession_number           get or set an identifier
alphabet                  alphabet                                     $so->alphabet('dna')                   get or set the alphabet ('dna', 'rna', 'protein')
authority                 authority, if available                      $so->authority("FlyBase")              get or set the organization
desc                      description                                  $so->desc("Example 1")                 get or set a description
display_id                identifier                                   $so->display_id("NP_123456")           get or set an identifier
division                  division, if available (e.g. PRI)            $div = $so->division                   get division (e.g. "PRI")
get_dates                 array of dates, if available                 @dates = $so->get_dates                get dates
get_secondary_accessions  array of secondary accessions, if available  @accs = $so->get_secondary_accessions  get other identifiers
is_circular               Boolean                                      if ($so->is_circular) { # }            get or set
keywords                  keywords, if available                       @array = $so->keywords                 get or set keywords
length                    length, a number                             $len = $so->length                     get the length
molecule                  molecule type, if available                  $type = $so->molecule                  get molecule (e.g. "RNA", "DNA")
namespace                 namespace, if available                      $so->namespace("Private")              get or set the name space
new                       Sequence object                              $so = Bio::Seq->new(-seq => "MPQRAS")  create a new one, see Bio::Seq for more
pid                       pid, if available                            $pid = $so->pid                        get pid
primary_id                identifier                                   $so->primary_id(12345)                 get or set an identifier
revcom                    Sequence object                              $so2 = $so1->revcom                    Reverse complement
seq                       sequence string                              $seq = $so->seq                        get or set the sequence
seq_version               version, if available                        $so->seq_version("1")                  get or set a version
species                   Species object                               $species_obj = $so->species            See Bio::Species for more
subseq                    sequence string                              $string = $seq_obj->subseq(10,40)      Arguments are start and end
translate                 protein Sequence object                      $prot_obj = $dna_obj->translate        See the Bioperl Tutorial for more
trunc                     Sequence object                              $so2 = $so1->trunc(10,40)              Arguments are start and end
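
As a rough illustration of a few methods from Table 1, here is a minimal sketch that builds a small sequence object in memory and calls length, subseq, trunc, revcom and translate. The sequence string, the ID "example_1" and the description are made up for the example, and the sketch assumes Bioperl is installed.

```perl
use strict;
use warnings;
use Bio::Seq;

# Made-up 21-base DNA sequence; the ID and description are illustrative only.
my $dna_obj = Bio::Seq->new(
    -seq        => 'ATGGCGTTTAAACCCGGGTAA',
    -display_id => 'example_1',
    -desc       => 'Example 1',
    -alphabet   => 'dna',
);

my $len      = $dna_obj->length;          # number of residues
my $string   = $dna_obj->subseq(1, 6);    # returns a plain string
my $so2      = $dna_obj->trunc(1, 6);     # returns a new sequence object
my $rev_obj  = $dna_obj->revcom;          # reverse complement, also an object
my $prot_obj = $dna_obj->translate;       # protein sequence object

print "length:  $len\n";
print "subseq:  $string\n";
print "trunc:   ", $so2->seq, "\n";
print "revcom:  ", $rev_obj->seq, "\n";
print "protein: ", $prot_obj->seq, "\n";
```

Note the difference between subseq and trunc: both take a start and an end, but subseq hands back a string while trunc hands back an object, so you need ->seq to see its residues.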

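Note that some of the methods listed above, such as molecule and division, only return something useful when the sequence object actually holds a corresponding value, and some sequence formats do not carry that information at all. So before relying on a method, make sure you know your input sequence file and what it contains.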
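
One defensive way to handle this in a script is to ask the object whether it supports a method before calling it, and to skip values that come back empty. The sketch below is only an illustration under that assumption: the sequence and the ID "example_2" are made up, and the list of methods checked is an arbitrary choice. A plain Bio::Seq built by hand does not offer division or molecule, which in Bioperl come with the richer objects returned by parsers such as the GenBank one.

```perl
use strict;
use warnings;
use Bio::Seq;

# A sequence built by hand carries no GenBank-style fields (made-up data).
my $seq_obj = Bio::Seq->new(
    -seq        => 'MPQRAS',
    -display_id => 'example_2',
);

# Collect optional values only when the object can provide them.
my %optional;
for my $method (qw(division molecule seq_version)) {
    next unless $seq_obj->can($method);    # method not available on this class
    my $value = $seq_obj->$method;
    next unless defined $value && length $value;
    $optional{$method} = $value;
}

print "display_id: ", $seq_obj->display_id, "\n";
print "$_: $optional{$_}\n" for sort keys %optional;
print "no optional fields available\n" unless %optional;
```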

There are also methods that deal with sequence annotation. They are somewhat off-topic here; for details see the Feature-Annotation HOWTO. The table below lists some of them.

Table 2: Feature and Annotation Methods

NAME                 RETURNS                       NOTE
get_SeqFeatures      array of SeqFeature objects
get_all_SeqFeatures  array of SeqFeature objects   array includes sub-features
remove_SeqFeatures   array of SeqFeatures removed
feature_count        number of SeqFeature objects
add_SeqFeature
annotation           array of Annotation objects   get or set
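
To give a feel for the feature methods in Table 2, here is a small sketch that attaches one feature to a sequence object and reads it back. The sequence, the ID "example_3" and the feature itself (a "CDS" spanning the whole sequence) are invented for illustration; Bio::SeqFeature::Generic is used here simply as a convenient feature class, and the Feature-Annotation HOWTO remains the place to look for the full picture.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqFeature::Generic;

# Made-up sequence to hang a feature on.
my $seq_obj = Bio::Seq->new(
    -seq        => 'ATGGCGTTTAAACCCGGGTAA',
    -display_id => 'example_3',
);

# Hypothetical feature covering bases 1..21 on the forward strand.
my $feature = Bio::SeqFeature::Generic->new(
    -start       => 1,
    -end         => 21,
    -strand      => 1,
    -primary_tag => 'CDS',
    -tag         => { note => 'made-up example feature' },
);

$seq_obj->add_SeqFeature($feature);

# feature_count reports how many features are attached;
# get_SeqFeatures returns them as a list of feature objects.
my $count = $seq_obj->feature_count;
print "features: $count\n";
for my $feat ($seq_obj->get_SeqFeatures) {
    printf "%s %d..%d\n", $feat->primary_tag, $feat->start, $feat->end;
}
```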

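Examples

Now let's see how to use the methods described above, and what they return for sequence objects obtained from different sources. First, fetching a record from GenBank and creating a sequence object from it: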

    use Bio::DB::GenBank;

    # Fetch the record from GenBank over the network by accession number
    my $db_obj  = Bio::DB::GenBank->new;
    my $seq_obj = $db_obj->get_Seq_by_acc("J01673");

Or read it from a GenBank file you already have locally:

    use Bio::SeqIO;

    # Open the local GenBank file and read the first record in it
    my $seqio_obj = Bio::SeqIO->new(-file => "J01673.gb", -format => "genbank");
    my $seq_obj   = $seqio_obj->next_seq;

The GenBank file looks like this:

    LOCUS       ECORHO                  1880 bp    DNA     linear   BCT 26-APR-1993
    DEFINITION  E.coli rho gene coding for transcription termination factor.
    ACCESSION   J01673 J01674
    VERSION     J01673.1  GI:147605
    KEYWORDS    attenuator; leader peptide; rho gene; transcription terminator.
    SOURCE      Escherichia coli
      ORGANISM  Escherichia coli
                Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
                Enterobacteriaceae; Escherichia.
    REFERENCE   1  (bases 1 to 1880)
      AUTHORS   Brown,S., Albrechtsen,B., Pedersen,S. and Klemm,P.
      TITLE     Localization and regulation of the structural gene for
                transcription-termination factor rho of Escherichia coli
      JOURNAL   J. Mol. Biol. 162 (2), 283-298 (1982)
      MEDLINE   83138788
       PUBMED   6219230
    REFERENCE   2  (bases 1 to 1880)
      AUTHORS   Pinkham,J.L. and Platt,T.
      TITLE     The nucleotide sequence of the rho gene of E. coli K-12
      JOURNAL   Nucleic Acids Res. 11 (11), 3531-3545 (1983)
      MEDLINE   83220759
       PUBMED   6304634
    COMMENT     Original source text: Escherichia coli (strain K-12) DNA.
                A clean copy of the sequence for [2] was kindly provided by
                J.L.Pinkham and T.Platt.
    FEATURES             Location/Qualifiers
         source          1..1880
                         /organism="Escherichia coli"
                         /mol_type="genomic DNA"
                         /strain="K-12"
                         /db_xref="taxon:562"
         mRNA            212..>1880
                         /product="rho mRNA"
         CDS             282..383
                         /note="rho operon leader peptide"
                         /codon_start=1
                         /transl_table=11
                         /protein_id="AAA24531.1"
                         /db_xref="GI:147606"
                         /translation="MRSEQISGSSLNPSCRFSSAYSPVTRQRKDMSR"
         gene            468..1727
                         /gene="rho"
         CDS             468..1727
                         /gene="rho"
                         /note="transcription termination factor"
                         /codon_start=1
                         /transl_table=11
                         /protein_id="AAA24532.1"
                         /db_xref="GI:147607"
                         /translation="MNLTELKNTPVSELITLGENMGLENLARMRKQDIIFAILKQHAK
                         SGEDIFGDGVLEILQDGFGFLRSADSSYLAGPDDIYVSPSQIRRFNLRTGDTISGKIR
                         PPKEGERYFALLKVNEVNFDKPENARNKILFENLTPLHANSRLRMERGNGSTEDLTAR
                         VLDLASPIGRGQRGLIVAPPKAGKTMLLQNIAQSIAYNHPDCVLMVLLIDERPEEVTE
                         MQRLVKGEVVASTFDEPASRHVQVAEMVIEKAKRLVEHKKDVIILLDSITRLARAYNT
                         VVPASGKVLTGGVDANALHRPKRFFGAARNVEEGGSLTIIATALIDTGSKMDEVIYEE
                         FKGTGNMELHLSRKIAEKRVFPAIDYNRSGTRKEELLTTQEELQKMWILRKIIHPMGE
                         IDAMEFLINKLAMTKTNDDFFEMMKRS"
    ORIGIN      15 bp upstream from HhaI site.
            1 aaccctagca ctgcgccgaa atatggcatc cgtggtatcc cgactctgct gctgttcaaa
           61 aacggtgaag tggcggcaac caaagtgggt gcactgtcta aaggtcagtt gaaagagttc

    ...deleted...

         1801 tgggcatgtt aggaaaattc ctggaatttg ctggcatgtt atgcaatttg catatcaaat
         1861 ggttaatttt tgcacaggac
    //

Either way, you end up with the same sequence object. The table below lists the methods available on this object and the values they return.

Table 3: Values from the Sequence object (Genbank)

METHOD                    RETURNS
display_id                ECORHO
desc                      E.coli rho gene coding for transcription termination factor.
display_name              ECORHO
accession                 J01673
primary_id                147605
seq_version               1
keywords                  attenuator; leader peptide; rho gene; transcription terminator
is_circular               (empty)
namespace                 (empty)
authority                 (empty)
length                    1880
seq                       AACCCT…ACAGGAC
division                  BCT
molecule                  DNA
get_dates                 26-APR-1993
get_secondary_accessions  J01674

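A few remarks are in order. First, much of the information in the record is not returned. All of this "missing" information is annotation, which is covered in the Feature and Annotation HOWTO. Second, some methods return empty values, for example namespace and authority. The reason is that there is not yet a generally accepted format or fixed name for the corresponding information; the authors may rewrite the code once one is settled. (Translator's note: presumably the authors set up the structure first without any content behind it. In any case these methods are of no use for now and can be ignored.) Finally, you may wonder how each piece of sequence information is matched to a method. In general, since there is no common standard, the code authors use their own judgment to give each piece of information a reasonable name and then map it to a method. (This last sentence may not be translated accurately.)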

Now consider a FASTA file as input (still the same sequence). The FASTA format is shown below; compared with GenBank it is very simple:

    >gi|147605|gb|J01673.1|ECORHO E.coli rho gene coding for transcription termination factor
    AACCCTAGCACTGCGCCGAAATATGGCATCCGTGGTATCCCGACTCTGCTGCTGTTCAAAAACGGTGAAG
    TGGCGGCAACCAAAGTGGGTGCACTGTCTAAAGGTCAGTTGAAAGAGTTCCTCGACGCTAACCTGGCGTA

    ...deleted...

    ACGTGTTTACGTGGCGTTTTGCTTTTATATCTGTAATCTTAATGCCGCGCTGGGCATGTTAGGAAAATTC
    CTGGAATTTGCTGGCATGTTATGCAATTTGCATATCAAATGGTTAATTTTTGCACAGGAC

The values returned:

Table 4: Values from the Sequence object (Fasta)

METHOD                    RETURNS
display_id                gi|147605|gb|J01673.1|ECORHO
desc                      E.coli rho gene coding for transcription termination factor
display_name              gi|147605|gb|J01673.1|ECORHO
accession                 unknown
primary_id                gi|147605|gb|J01673.1|ECORHO
is_circular               (empty)
namespace                 (empty)
authority                 (empty)
length                    1880
seq                       AACCCT…ACAGGAC
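
The values in Table 4 can be checked with a short script. The sketch below reads a FASTA record from an in-memory string instead of a file, so that it is self-contained; the header is the one shown above, but the sequence is cut down to its first two lines, so the length it prints is not the 1880 of the full record. It assumes Bioperl is installed.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Shortened stand-in for the FASTA file: same header, first two sequence lines only.
my $fasta = <<'END_FASTA';
>gi|147605|gb|J01673.1|ECORHO E.coli rho gene coding for transcription termination factor
AACCCTAGCACTGCGCCGAAATATGGCATCCGTGGTATCCCGACTCTGCTGCTGTTCAAAAACGGTGAAG
TGGCGGCAACCAAAGTGGGTGCACTGTCTAAAGGTCAGTTGAAAGAGTTCCTCGACGCTAACCTGGCGTA
END_FASTA

# Treat the string as a file and hand the filehandle to Bio::SeqIO.
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
my $seqio_obj = Bio::SeqIO->new(-fh => $fh, -format => 'fasta');
my $seq_obj   = $seqio_obj->next_seq;

print "display_id: ", $seq_obj->display_id, "\n";
print "desc:       ", $seq_obj->desc, "\n";
print "accession:  ", $seq_obj->accession_number, "\n";
print "length:     ", $seq_obj->length, "\n";
```

Everything up to the first whitespace in the header line becomes the display_id and the rest becomes the description; the accession number is not extracted from the ID, which is why it is reported as "unknown".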

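Compared with what we got from the GenBank file, some information is missing, such as seq_version. In addition, some methods, such as display_id, return different values. The reason is that the rules the GenBank server follows when converting GenBank format to FASTA differ from the rules the SwissProt server follows when converting SwissProt format to FASTA. Unless there is a unified standard, code authors generally map each piece of sequence information to a method according to their own understanding. Bioperl could follow one particular set of rules, such as the one GenBank uses, but the Bioperl authors voted not to follow any conversion rule that comes from a single organization.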
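
Since the ID of a FASTA record is left as one string, a script that needs the GI number, accession or version from an NCBI-style header has to split it itself. The following is a minimal sketch in plain Perl, assuming the gi|number|database|accession.version|locus layout seen in the example header above. The helper name parse_ncbi_id is hypothetical and not part of Bioperl, and headers from other sources will not follow this layout.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bioperl: split an NCBI-style FASTA ID
# of the form gi|<gi>|<db>|<accession>.<version>|<locus>.
sub parse_ncbi_id {
    my ($id) = @_;
    my ($tag, $gi, $db, $acc_ver, $locus) = split /\|/, $id;

    # Give up on anything that does not look like the assumed layout.
    return unless defined $tag && $tag eq 'gi' && defined $acc_ver;

    my ($accession, $version) = split /\./, $acc_ver, 2;
    return {
        gi        => $gi,
        db        => $db,
        accession => $accession,
        version   => $version,    # undef when the ID has no ".N" suffix
        locus     => $locus,
    };
}

my $fields = parse_ncbi_id('gi|147605|gb|J01673.1|ECORHO');
print "$_: $fields->{$_}\n" for qw(gi db accession version locus);
```

With a sequence object read from FASTA, the same helper could be called as parse_ncbi_id($seq_obj->display_id), and the result used to fill in accession_number by hand if that is what your script needs.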

Next, a SwissProt file as input:

    ID   A2S3_RAT       STANDARD;      PRT;   913 AA.
    AC   Q8R2H7; Q8R2H6; Q8R4G3;
    DT   28-FEB-2003 (Rel. 41, Created)
    DE   Amyotrophic lateral sclerosis 2 chromosomal region candidate gene
    DE   protein 3 homolog (GABA-A receptor interacting factor-1) (GRIF-1) (O-
    DE   GlcNAc transferase-interacting protein of 98 kDa).
    GN   ALS2CR3 OR GRIF1 OR OIP98.
    OS   Rattus norvegicus (Rat).
    OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
    OC   Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Rattus.
    OX   NCBI_TaxID=10116;
    RN   [1]
    RP   SEQUENCE FROM N.A. (ISOFORMS 1 AND 2), SUBCELLULAR LOCATION, AND
    RP   INTERACTION WITH GABA-A RECEPTOR.
    RC   TISSUE=Brain;
    RX   MEDLINE=22162448; PubMed=12034717;
    RA   Beck M., Brickley K., Wilkinson H.L., Sharma S., Smith M.,
    RA   Chazot P.L., Pollard S., Stephenson F.A.;
    RT   "Identification, molecular cloning, and characterization of a novel
    RT   GABAA receptor-associated protein, GRIF-1.";
    RL   J. Biol. Chem. 277:30079-30090(2002).
    RN   [2]
    RP   REVISIONS TO 579 AND 595-596, AND VARIANTS VAL-609 AND PRO-820.
    RA   Stephenson F.A.;
    RL   Submitted (FEB-2003) to the EMBL/GenBank/DDBJ databases.
    RN   [3]
    RP   SEQUENCE FROM N.A. (ISOFORM 3), INTERACTION WITH O-GLCNAC TRANSFERASE,
    RP   AND O-GLYCOSYLATION.
    RC   STRAIN=Sprague-Dawley; TISSUE=Brain;
    RX   MEDLINE=22464403; PubMed=12435728;
    RA   Iyer S.P.N., Akimoto Y., Hart G.W.;
    RT   "Identification and cloning of a novel family of coiled-coil domain
    RT   proteins that interact with O-GlcNAc transferase.";
    RL   J. Biol. Chem. 278:5399-5409(2003).
    CC   -!- SUBUNIT: Interacts with GABA-A receptor and O-GlcNac transferase.
    CC   -!- SUBCELLULAR LOCATION: Cytoplasmic.
    CC   -!- ALTERNATIVE PRODUCTS:
    CC       Event=Alternative splicing; Named isoforms=3;
    CC       Name=1; Synonyms=GRIF-1a;
    CC         IsoId=Q8R2H7-1; Sequence=Displayed;
    CC       Name=2; Synonyms=GRIF-1b;
    CC         IsoId=Q8R2H7-2; Sequence=VSP_003786, VSP_003787;
    CC       Name=3;
    CC         IsoId=Q8R2H7-3; Sequence=VSP_003788;
    CC   -!- PTM: O-glycosylated.
    CC   -!- SIMILARITY: TO HUMAN OIP106.
    DR   EMBL; AJ288898; CAC81785.2; -.
    DR   EMBL; AJ288898; CAC81786.2; -.
    DR   EMBL; AF474163; AAL84588.1; -.
    DR   GO; GO:0005737; C:cytoplasm; IEP.
    DR   GO; GO:0005634; C:nucleus; IDA.
    DR   GO; GO:0005886; C:plasma membrane; IEP.
    DR   GO; GO:0006357; P:regulation of transcription from Pol II pro...; IDA.
    DR   InterPro; IPR006933; HAP1_N.
    DR   Pfam; PF04849; HAP1_N; 1.
    KW   Coiled coil; Alternative splicing; Polymorphism.
    FT   DOMAIN      134    355       COILED COIL (POTENTIAL).
    FT   VARSPLIC    653    672       VATSNPGKCLSFTNSTFTFT -> ALVSHHCPVEAVRAVHP
    FT                                TRL (in isoform 2).
    FT                                /FTId=VSP_003786.
    FT   VARSPLIC    673    913       Missing (in isoform 2).
    FT                                /FTId=VSP_003787.
    FT   VARSPLIC    620    687       VQQPLQLEQKPAPPPPVTGIFLPPMTSAGGPVSVATSNPGK
    FT                                CLSFTNSTFTFTTCRILHPSDITQVTP -> GSAASSTGAE
    FT                                ACTTPASNGYLPAAHDLSRGTSL (in isoform 3).
    FT                                /FTId=VSP_003788.
    FT   VARIANT     609    609       E -> V.
    FT   VARIANT     820    820       S -> P.
    SQ   SEQUENCE   913 AA;  101638 MW;  D0E135DBEC30C28C CRC64;
         MSLSQNAIFK SQTGEENLMS SNHRDSESIT DVCSNEDLPE VELVNLLEEQ LPQYKLRVDS
         LFLYENQDWS QSSHQQQDAS ETLSPVLAEE TFRYMILGTD RVEQMTKTYN DIDMVTHLLA
         ...deleted...
         GIARVVKTPV PRENGKSREA EMGLQKPDSA VYLNSGGSLL GGLRRNQSLP VMMGSFGAPV
         CTTSPKMGIL KED
    //

The corresponding return values are shown in the table below:

Table 5: Values from the Sequence object (Swissprot)

METHOD                    RETURNS
display_id                A2S3_RAT
desc                      Amyotrophic lateral … protein of 98 kDa).
display_name              A2S3_RAT
accession                 Q8R2H7
is_circular               (empty)
namespace                 (empty)
authority                 (empty)
seq_version               (empty)
keywords                  Coiled coil; Alternative splicing; Polymorphism
length                    913
seq                       MSLSQ…ILKED
division                  RAT
get_dates                 28-FEB-2003 (Rel. 41, Created)
get_secondary_accessions  Q8R2H6 Q8R4G3

As with GenBank, see the Feature and Annotation HOWTO for how to get at the annotation information.
