Preparing Data for GBrowse

2012/08/09

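This post covers what kinds of data GBrowse can display, which format the data should be prepared in, how it should be stored, and how GBrowse must be configured so that the data shows up on the genome map.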

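What kind of data can be displayed?

GBrowse can display genome annotation data such as contig assembly relationships, functional gene annotations, sequence features such as GC content, BLAST alignment results, SNPs, transcript abundance, and so on. In short, it shows anything that can be described as "what is there" between two positions on a sequence.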
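To make the "what lies between two positions" idea concrete, here is a minimal Python sketch that computes GC content over consecutive windows of a sequence. The function name `gc_windows`, the window size, and the sample sequence are made up for illustration; this is not part of GBrowse itself.

```python
def gc_windows(seq, window=100):
    """Yield (start, end, gc_fraction) for consecutive windows.

    Coordinates are 1-based and inclusive, matching the GFF convention.
    """
    seq = seq.upper()
    for offset in range(0, len(seq), window):
        chunk = seq[offset:offset + window]
        gc = chunk.count("G") + chunk.count("C")
        yield offset + 1, offset + len(chunk), gc / len(chunk)


# Hypothetical toy sequence; a real genome would be read from a FASTA file.
for start, end, gc in gc_windows("ATGCGCGCATATATGCGC", window=6):
    print(start, end, round(gc, 2))
```

Each tuple is a start position, an end position, and a value, which is the shape of data a genome browser track presents.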

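Data preparation format

More precisely, this is the data exchange format: annotations are in GFF3 format and sequences are in FASTA format. Many conversion scripts exist for GFF3, both for converting other formats to GFF3 and for loading GFF3 data into a database.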
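As an illustration of the two exchange formats, the following Python sketch formats one feature as a 9-column, tab-separated GFF3 line and one sequence as a FASTA record. The helper names (`gff3_line`, `fasta_record`) and the sample identifiers are hypothetical, and the sketch assumes attribute values contain no reserved characters, since it does no escaping.

```python
def gff3_line(seqid, source, ftype, start, end, strand, attributes,
              score=".", phase="."):
    """Format one feature as a tab-separated GFF3 line (9 columns)."""
    # Column 9 is a semicolon-separated list of key=value pairs.
    attr_text = ";".join(f"{key}={value}" for key, value in attributes.items())
    columns = [seqid, source, ftype, str(start), str(end),
               str(score), strand, str(phase), attr_text]
    return "\t".join(columns)


def fasta_record(seqid, sequence, width=60):
    """Format one sequence as a FASTA record wrapped at `width` columns."""
    lines = [f">{seqid}"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)


# Hypothetical example data.
print("##gff-version 3")
print(gff3_line("ctgA", "example", "gene", 1050, 9000, "+",
                {"ID": "gene01", "Name": "hypothetical_gene"}))
print(fasta_record("ctgA", "ACGTACGTACGT", width=8))
```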

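How GBrowse accesses the data

GBrowse does not read GFF files directly. The data has to be loaded into a database, or loaded entirely into memory, so that it becomes structured, schema-based data that GBrowse can query through a uniform database access interface. In the GBrowse configuration file this corresponds to the db_adaptor parameter. The official definition reads: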

db_adaptor is the name of a Perl database adaptor module for accessing the sequence annotation database

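The interfaces currently supported by GBrowse include:

Bio::DB::SeqFeature::Store: the database officially recommended by GBrowse, with GFF3 support
Bio::DB::Das::Chado: a general-purpose bioinformatics database schema supported by GMOD
Bio::Das: the Distributed Annotation System
Bio::DB::Das::BioSQL: a general-purpose bioinformatics database schema
Bio::DB::GFF: a GFF2-based database system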

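The storage adaptors available differ from one interface to another. Bio::DB::SeqFeature::Store, for example, supports the following three:

memory: the data is held in memory, so the amount of data it can handle depends on the system's memory. The usual advice is not to use the memory adaptor once the number of feature records exceeds 10,000.
DBI::mysql: a MySQL database. This is the usual form and also the default, which is why MySQL needs to be installed.
berkeleydb: a Berkeley DB database.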

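Configuration in GBrowse

Data sources can be added to or removed from the GBrowse.conf file. Each data source has its own configuration file, which specifies the access interface for that data source as well as access parameters such as the host and user name.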

###############################################################################################
#
# DATASOURCE DEFINITIONS
# One stanza for each configured data source
#
###############################################################################################
 
[yeast]
description   = Yeast chromosomes 1+2 (basic)
path          = yeast_simple.conf
 
[yeast_advanced]
description   = Yeast chromosomes 1+2 (advanced)
path          = yeast_chr1+2.conf

Example access interface configuration

# memory adaptor
db_adaptor = Bio::DB::SeqFeature::Store
db_args    = -adaptor memory
             -dir '~/httpd-2.2/htdocs/gbrowse2/databases/volvox'

# MySQL adaptor
db_adaptor = Bio::DB::SeqFeature::Store
db_args    = -adaptor DBI::mysql
             -dsn dbi:mysql:database=<database>;host=localhost
             -user <username>
             -pass <password>

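Setting up the MySQL database

The bp_seqfeature_load.pl script can create the database tables and load GFF3 data into them. Before running the script, first create a database and a user with the appropriate privileges on it.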

Usage: /usr/bin/bp_seqfeature_load.pl [options] gff_file1 gff_file2...
  Options:
          -d --dsn        The database name (dbi:mysql:test)
          -s --seqfeature The type of SeqFeature to create (Bio::DB::SeqFeature)
          -a --adaptor    The storage adaptor to use (DBI::mysql)
          -v --verbose    Turn on verbose progress reporting
             --noverbose  Turn off verbose progress reporting
          -f --fast       Activate fast loading (only some adaptors)
          -T --temporary-directory  Specify temporary directory for fast loading (/tmp)
          -c --create     Create the database and reinitialize it (will erase contents)
          -u --user       User to connect to database as
          -p --password   Password to use to connect to database
          -S --subfeatures   Turn on indexing of subfeatures (default)
             --nosubfeatures Turn off indexing of subfeatures
          -i --ignore-seqregion
                          If true, then ignore ##sequence-region directives in the
                          GFF3 file (default, create a feature for each region)
          -z --zip        If true, database tables will be compressed to save space

Example commands:

Recreate the database and load the data:

bp_seqfeature_load.pl -a DBI::mysql -d <db> -u <user> -p <passwd> --create *.gff3 *.fasta

Load with feature frequency summaries:

bp_seqfeature_load.pl -a DBI::mysql -d <db> -u <user> -p <passwd> --summary *.gff3

In newer versions of GBrowse, --summary appears to be the default, so the feature frequency table is always created.
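For scripted loading, the commands above can be assembled programmatically. The Python sketch below only builds the argument list from the flags shown in this post (-a, -d, -u, -p, --create, --summary) and does not run anything. The function name `build_load_command`, the database name, the credentials, and the file names are all made up for illustration.

```python
import shlex


def build_load_command(dsn, user, password, files, create=False,
                       summary=False, adaptor="DBI::mysql",
                       script="bp_seqfeature_load.pl"):
    """Return the argv list for a bp_seqfeature_load.pl invocation."""
    argv = [script, "-a", adaptor, "-d", dsn, "-u", user, "-p", password]
    if create:
        argv.append("--create")   # reinitializes the database, erasing contents
    if summary:
        argv.append("--summary")  # build feature frequency summaries
    argv.extend(files)
    return argv


# Hypothetical database, user, and file names.
argv = build_load_command("volvox", "gbrowse_user", "secret",
                          ["volvox.gff3", "volvox.fasta"], create=True)
print(shlex.join(argv))
```

To execute the command, the list could be passed to `subprocess.run(argv, check=True)`. Note that shell wildcards such as `*.gff3` are not expanded in that case, so the file list would need to come from something like `glob.glob("*.gff3")`.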

Source: http://boyun.sh.cn/bio/?p=1786