Analyzing RNA-Seq Results with TopHat

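TopHat is an RNA-Seq analysis tool built on Bowtie. It quickly identifies exon-exon splice junctions. Precompiled binaries are available for Linux and OS X x86_64, and you can also compile it from source for your own operating system.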

Upstream it relies on Bowtie; downstream, its output feeds into Cufflinks.

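In principle, TopHat was designed for reads produced by the Illumina Genome Analyzer. It can sometimes process data from other sources, but success is not guaranteed. It is optimized for reads 75bp or longer.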

Before using TopHat, the directory that contains the Bowtie executables must be added to the PATH variable, for example:

export PATH=$PATH:/share/sbin/bowtie

Make sure TopHat can run bowtie, bowtie-inspect and bowtie-build.

You also need to download and install samtools.
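A quick way to confirm these prerequisites before the first run is to look each program up on PATH. The following is a minimal sketch: the helper name `check_tools` is made up for illustration, and the tool list simply mirrors the programs named above.

```shell
# check_tools: hypothetical helper, not part of TopHat or Bowtie.
# Prints every listed program that cannot be found on PATH and
# returns non-zero if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Usage: the programs this post says TopHat needs, plus tophat itself.
if check_tools bowtie bowtie-inspect bowtie-build samtools tophat; then
    echo "all tools found"
fi
```

If anything is reported as missing, fix the PATH export first; TopHat's own startup checks would otherwise stop the run at the same point.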

TopHat usage:

[shell]
tophat [options]* <ebwt_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2]
[/shell]

First download the test data and unpack it.

[shell]
tar zxvf test_data.tar.gz
cd test_data
tophat -r 20 test_ref reads_1.fq reads_2.fq
[/shell]

If the run succeeds, you will see output similar to the following:

[shell]
[Mon May 4 11:07:23 2009] Beginning TopHat run (v1.1.1)
-----------------------------------------------
[Mon May 4 11:07:23 2009] Preparing output location ./tophat_out/
[Mon May 4 11:07:23 2009] Checking for Bowtie index files
[Mon May 4 11:07:23 2009] Checking for reference FASTA file
[Mon May 4 11:07:23 2009] Checking for Bowtie
Bowtie version: 0.9.9.1
[Mon May 4 11:07:23 2009] Checking reads
seed length: 75bp
format: fastq
quality scale: phred
Splitting reads into 3 segments
[Mon May 4 11:07:23 2009] Mapping reads against test_ref with Bowtie
[Mon May 4 11:07:24 2009] Mapping reads against test_ref with Bowtie
[Mon May 4 11:07:24 2009] Mapping reads against test_ref with Bowtie
Splitting reads into 3 segments
[Mon May 4 11:07:24 2009] Mapping reads against test_ref with Bowtie
[Mon May 4 11:07:24 2009] Mapping reads against test_ref with Bowtie
[Mon May 4 11:07:24 2009] Mapping reads against test_ref with Bowtie
[Mon May 4 11:07:24 2009] Searching for junctions via coverage islands
[Mon May 4 11:07:24 2009] Searching for junctions via mate-pair closures
[Mon May 4 11:07:24 2009] Retrieving sequences for splices
[Mon May 4 11:07:24 2009] Indexing splices
[Mon May 4 11:07:24 2009] Mapping reads against segment_juncs with Bowtie
[Mon May 4 11:07:24 2009] Mapping reads against segment_juncs with Bowtie
[Mon May 4 11:07:24 2009] Mapping reads against segment_juncs with Bowtie
[Mon May 4 11:07:24 2009] Joining segment hits
[Mon May 4 11:07:24 2009] Mapping reads against segment_juncs with Bowtie
[Mon May 4 11:07:24 2009] Mapping reads against segment_juncs with Bowtie
[Mon May 4 11:07:24 2009] Mapping reads against segment_juncs with Bowtie
[Mon May 4 11:07:24 2009] Joining segment hits
[Mon May 4 11:07:24 2009] Reporting output tracks
-----------------------------------------------
Run complete [00:00:00 elapsed]
[/shell]

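From the test run we can see that ebwt_base in the command is simply the filename prefix of the index files that Bowtie needs. These files can be downloaded from the TopHat website.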
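To make the prefix idea concrete: a Bowtie 1 index produced by `bowtie-build` consists of six files that share one prefix, with the suffixes `.1.ebwt`, `.2.ebwt`, `.3.ebwt`, `.4.ebwt`, `.rev.1.ebwt` and `.rev.2.ebwt`. The sketch below checks that all six exist for a given prefix. `have_bowtie_index` is a hypothetical helper name, and the check does not cover large-index or Bowtie 2 file suffixes.

```shell
# have_bowtie_index: hypothetical helper, not part of Bowtie or TopHat.
# Succeeds only if all six Bowtie 1 index files exist for prefix $1;
# otherwise reports the first missing file and returns 1.
have_bowtie_index() {
    prefix=$1
    for suffix in 1 2 3 4 rev.1 rev.2; do
        if [ ! -f "$prefix.$suffix.ebwt" ]; then
            echo "missing index file: $prefix.$suffix.ebwt"
            return 1
        fi
    done
    return 0
}

# Usage with the test data: test_ref is the <ebwt_base> passed to tophat.
if have_bowtie_index test_ref; then
    echo "index test_ref looks complete"
fi
```

If you only have a reference FASTA file, `bowtie-build <reference_in> <ebwt_base>` builds the index files under the prefix you choose.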

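The reads1_1[,...,readsN_1] and [reads1_2,...readsN_2] arguments that follow are the FASTQ files to analyze. For paired-end data, the files must be given as matching *_1 / *_2 pairs, listed in the same order on both sides.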
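When a sample is spread across several files, the two comma-separated lists have to line up file by file. Here is a small sketch of building them, assuming the `*_1.fq` / `*_2.fq` naming used in the test data and file names without spaces. `paired_lists` is a hypothetical helper, and the script only prints the resulting command rather than running it.

```shell
# paired_lists: hypothetical helper, not part of TopHat.
# For every *_1.fq file given, require the matching *_2.fq file and print
# two comma-separated lists (mate 1 files, then mate 2 files) on one line.
paired_lists() {
    left=""
    right=""
    for r1 in "$@"; do
        r2="${r1%_1.fq}_2.fq"
        if [ ! -f "$r2" ]; then
            echo "no mate file for $r1" >&2
            return 1
        fi
        left="${left:+$left,}$r1"
        right="${right:+$right,}$r2"
    done
    echo "$left $right"
}

# Usage: print (not run) a tophat command for all pairs in the current directory.
if lists=$(paired_lists *_1.fq); then
    echo "tophat -r 20 test_ref $lists"
fi
```

Building both lists in the same loop keeps the order identical on both sides, which is what the paired-end input requires.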

Some of the other options are explained below:

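-o/--output-dir <string>: Output directory. The default is "./tophat_out".
-r/--mate-inner-dist <int>: The expected (mean) inner distance between the two mates of a pair. For example, if your fragments average 300bp, including 2 adapters of 10bp each and 2 barcodes of 6bp each, and your reads are 50bp long, then r should be 160 = 300 - 2*50 - 2*20 (this is the case where the barcode lies within the reads). With 100bp reads, r would be 60 = 300 - 2*100 - 2*20. There is no default; this value is required for paired-end runs.
--mate-std-dev <int>: Standard deviation of the inner distance between mates in paired-end runs. The default is 20bp.
-a/--min-anchor-length <int>: The anchor length. TopHat reports a junction only when the reads spanning it have at least this many bases on each side of the junction. It must be at least 3; the default is 8.
-m/--splice-mismatches <int>: The maximum number of mismatches allowed in the anchor region. The default is 0.
-i/--min-intron-length <int>: The minimum intron length. The default is 70.
-I/--max-intron-length <int>: The maximum intron length. The default is 500000.
--max-insertion-length <int>: The maximum insertion length allowed in an alignment. The default is 3.
--max-deletion-length <int>: The maximum deletion length allowed in an alignment. The default is 3.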
--solexa-quals: Use the Solexa scale for quality values in FASTQ files.
--solexa1.3-quals: As of the Illumina GA pipeline version 1.3, quality scores are encoded in Phred-scaled base-64. Use this option for FASTQ files from pipeline 1.3 or later.
-Q/--quals: Separate quality value files. Colorspace read files (CSFASTA) come with separate qual files.
--integer-quals: Quality values are space-delimited integer values. This becomes the default when you specify -C/--color.
-C/--color: Colorspace reads. Note that this uses a colorspace Bowtie index and requires Bowtie 0.12.6 or higher.
Common usage: tophat --color --quals [other options]* <colorspace_index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2] <quals1_1[,...,qualsN_1]> [quals1_2,...qualsN_2]
-F/--min-isoform-fraction <0.0-1.0>: TopHat filters out junctions supported by too few alignments. Suppose a junction spanning two exons, A and B, is supported by S reads. Let the average depth of coverage of exon A be D, and assume that it is higher than that of B. If S / D is less than the minimum isoform fraction, the junction is not reported. A value of zero disables the filter. The default is 0.15.
-p/--num-threads <int>: Number of threads. The default is 1 (single-threaded).
-g/--max-multihits <int>: Instructs TopHat to allow up to this many alignments to the reference for a given read, and suppresses all alignments for reads with more than this many alignments. The default is 20 for read mapping (segment mapping uses twice that number, 40).
--no-closure-search: Disables the mate pair closure-based search for junctions. Currently this has no effect, because closure search is off by default.
--closure-search: Enables the mate pair closure-based search for junctions. Closure-based search should only be used when the expected inner distance between mates is small (<= 50bp).
--no-coverage-search: Disables the coverage-based search for junctions.
--coverage-search: Enables the coverage-based search for junctions. Use when coverage search is disabled by default (such as for reads 75bp or longer), for maximum sensitivity.
--microexon-search: With this option, TopHat searches for microexons during alignment. It only works for reads longer than 50bp.
--butterfly-search: A more sensitive search mode. It is best to enable this option if your reads come from pre-mRNA samples.
--library-type: TopHat will treat the reads as strand specific. Every read alignment will have an XS attribute tag. Consider supplying a library type to select the correct RNA-seq protocol.
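The -r arithmetic from the option list can be wrapped in a small helper. This sketch follows the post's own formula (fragment length, minus both read lengths, minus a fixed number of non-insert bases per end). `mate_inner_dist` and its third argument are assumptions for illustration, and the per-end value of 20 is taken from the example as given rather than derived. The final line only prints a command; the thread count and paths are placeholders.

```shell
# mate_inner_dist: hypothetical helper, not part of TopHat.
# $1 = mean fragment length
# $2 = read length
# $3 = bases per end subtracted for adapter/barcode (20 in the post's example)
mate_inner_dist() {
    echo $(( $1 - 2 * $2 - 2 * $3 ))
}

mate_inner_dist 300 50 20    # prints 160
mate_inner_dist 300 100 20   # prints 60

# Print (not run) a paired-end command that uses options described in this post.
r=$(mate_inner_dist 300 50 20)
echo "tophat -o ./tophat_out -r $r --mate-std-dev 20 -p 4 test_ref reads_1.fq reads_2.fq"
```

Keeping the calculation in one place makes it easy to recompute -r when the read length of a library changes.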

Original post: http://pgfe.umassmed.edu/ou/archives/2318/