Velvet and Oases For Transcriptome Assembly(基于转录组进行拼接)

2011/12/12评论5,270

首先要有Linux环境的服务器或PC，推荐最低需求：
64bit Linux System (Rethat, CentOS, Fedora), 4 core CPU processors, 2.26GHz, basically 16GB RAM, 20GB HD storage space.
Environment: GCC/G++ 4.3.5, PERL v5.8, Python 2.6, JAVA 6(JDK/JRE)

选择软件有：

Velvet+Oases;
ABySS+TransABySS;
Newbler(cDNA assembler)+iAssembler(CAP3)

前两个主要用于Illumina和ABi SOLiD的短序列RNA-seq组装，第三个用于454较长一些的序列(cDNA)组装，后面我们只介绍第一种。

下载，安装：

http://www.ebi.ac.uk/~zerbino/velvet/

建议下载用git(http://git.or.cz/ )
下载方法：安装好git后

获取velvet最新版本：(以下所有your-path是你要安装软件存放的目录）

$mkdir -p /your-path/velvet
$cd /your-path/velvet
$git clone git://github.com/dzerbino/velvet.git

获取oases最新版本：

$cd /your-path/velvet
$git clone git://github.com/dzerbino/oases.git

编译velvet：

$cd /your-path/velvet/velvet
$./update_velvet.sh

如果需要用大的kmer值(默认是31，这里假设需要75，根据需求，一般kmer值决定组装片段大小，但也受内存限制，16GB节点建议是31kmer以下，72GB建议45kmer以下，512GB建议75kmer)，则可以按下面方法编译：
$make MAXKMERLENGTH=75

编译oases：

$cd /your-path/velvet/oases/
$make ‘VELVET_DIR=/your-path/velvet/velvet’

如果需要所有用户可使用，即在有root权限时，运行以下命令即可：

$cd /usr/local/bin
$ln -s /your-path/velvet/velvet/velveth && ln -s /your-path/velvet/velvet/velvetg && ln -s /your-path/velvet/oases/oases
$chmod 555 velveth velvetg oases
就可以了，如果没有有两种办法运行：

1). 每次运行用绝对路径运行，如：

$/your-path/velvet/velvet/velveth
$/your-path/velvet/velvet/velvetg
$/your-path/velvet/oases/oases

2). 用编辑~/.bashrc，写入以下三行：

alias velveth=”/your-path/velvet/velvet/velveth”
alias velvetg=”/your-path/velvet/velvet/velvetg”
alias oases=”/your-path/velvet/oases/oases”
安装就好了，运行方法可以直接运行velveth, velvetg, oases就看到相关参数。

（第一步）hashing
Velveth用法：

velveth – simple hashing program
Version 1.1.04
Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk)
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Compilation settings:
CATEGORIES = 2
MAXKMERLENGTH = 75
Usage:
./velveth directory hash_length {[-file_format][-read_type] filename1 [filename2 ...]} {…} [options]
directory : directory name for output files
hash_length : EITHER an odd integer (if even, it will be decremented) <= 75 (if above, will be reduced)
: OR: m,M,s where m and M are odd integers (if not, they will be decremented) with m < M ILLUMINA-57021F:8:1:1106:3817#0/1
AGATATTTGAGATTTATGGATTAGGAAAGGAGGTTGTTGACCTAGCATTTTGTGGGAGTG
>ILLUMINA-57021F:8:1:1106:3817#0/2
AATTTTTGACACACTCCCACAAAATGCTAGGTCAACAACCTCCTTTCCTAATCCATAAAT
>ILLUMINA-57021F:8:1:1123:6213#0/1 score: 0 left_trim: 0
GACTGTCCCTGTTAATCATTACTCCGATCCCGAAGGCCAACAGAATAGGACCGAAATCCT
>ILLUMINA-57021F:8:1:1123:6213#0/2 score: 2 left_trim: 0
AGCATGGGATAACATCATAGGATTTCGGGCCTATTCTGTTGGCCTTCGGGATCGGAGTAA

这个在velvet的安装目录里有两个脚本可供做这样事情，shuffleSequences_fastq.pl, shuffleSequences_fasta.pl；
如果是mate pairs序列，由于方向是 RF；而不是—>|<— FR，所以mate pairs的序列全部要反转互补一下。

(5)如果RNASeq是strand specific测序的话，可以选择-strand_specific参数；

输出：
你所指定的一个文件夹名称(例子中的Assem），这个文件夹请勿改动，要用于velvetg和oases使用。

其余可参考velvet的说明手册：http://www.ebi.ac.uk/~zerbino/velvet/Manual.pdf

注：最新版的velvet1.1.04 支持多线程跑，如果内存足够，又不知道设置多大kmer合适，可以考虑用MetaVelveth(velvet/contrib/MetaVelvet-v0.3.1)，或设置kmer值时，用

$velveth output_directory/ m,M,s (..data files..)

m<=k<=m, s是步长；如31,45,2；就从31,33,35,…,45之间顺次设置

Velvet还有很多第三方程序和脚本，用于优化组装，见安装velvet中的velvet/contrib目录，有兴趣可对手册参照使用。

（第二步）de Bruijn graph assembly
Velvetg用法

velvetg – de Bruijn graph construction, error removal and repeat resolution
Version 1.1.04
Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk)
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Compilation settings:
CATEGORIES = 2
MAXKMERLENGTH = 75
Usage:
./velvetg directory [options]
directory : working directory name
Standard options:
-cov_cutoff : removal of low coverage nodes AFTER tour bus or allow the system to infer it
(default: no removal)
-ins_length : expected distance between two paired end reads (default: no read pairing)
-read_trkg : tracking of short read positions in assembly (default: no tracking)
-min_contig_lgth : minimum contig length exported to contigs.fa file (default: hash length * 2)
-amos_file : export assembly to AMOS file (default: no export)
-exp_cov : expected coverage of unique regions or allow the system to infer it
(default: no long or paired-end read resolution)
-long_cov_cutoff : removal of nodes with low long-read coverage AFTER tour bus
(default: no removal)
Advanced options:
-ins_length2 : expected distance between two paired-end reads in the second short-read dataset (default: no read pairing)
-ins_length_long : expected distance between two long paired-end reads (default: no read pairing)
-ins_length*_sd : est. standard deviation of respective dataset (default: 10% of corresponding length)
[replace '*' by nothing, '2' or '_long' as necessary]
-scaffolding : scaffolding of contigs used paired end information (default: on)
-max_branch_length : maximum length in base pair of bubble (default: 100)
-max_divergence : maximum divergence rate between two branches in a bubble (default: 0.2)
-max_gap_count : maximum number of gaps allowed in the alignment of the two branches of a bubble (default: 3)
-min_pair_count : minimum number of paired end connections to justify the scaffolding of two long contigs (default: 5)
-max_coverage : removal of high coverage nodes AFTER tour bus (default: no removal)
-long_mult_cutoff : minimum number of long reads required to merge contigs (default: 2)
-unused_reads : export unused reads in UnusedReads.fa file (default: no)
-alignments : export a summary of contig alignment to the reference sequences (default: no)
-exportFiltered : export the long nodes which were eliminated by the coverage filters (default: no)
-clean : remove all the intermediary files which are useless for recalculation (default : no)
-very_clean : remove all the intermediary files (no recalculation possible) (default: no)
-paired_exp_fraction : remove all the paired end connections which less than the specified fraction of the expected count (default: 0.1)
-shortMatePaired* : for mate-pair libraries, indicate that the library might be contaminated with paired-end reads (default no)
Output:
directory/contigs.fa : fasta file of contigs longer than twice hash length
directory/stats.txt : stats file (tab-spaced) useful for determining appropriate coverage cutoff
directory/LastGraph : special formatted file with all the information on the final graph
directory/velvet_asm.afg : (if requested) AMOS compatible assembly file

如上：velvetg有很多参数，对于RNA-seq中有用的有以下几个
输入：velveth运行指定的那个目录如例子中的Assem

参数选择：

cov_cutoff 最低覆盖度，去除低频度序列，用于纠错；
ins_length 插入片段长度，如500bp insert，则设置500,如果在velveth输入有多个插入片段的libraries,可以通过-ins_length2和-ins_length_long设置；
ins_length_sd 插入片段分布的标准差值，通过比对或建库信息获取，一般500bp以内的在5~30以内；
exp_cov 期望覆盖度，如果不确定可以设置为auto，这里设置的是浮点数，如果准确设置对组装有所贡献；
min_pair_count 用于搭建scaffolds(设置了-scaffolding yes)最少的pairs数，如设置3，即如果有3对pairs搭上相同的两个contigs（方向插入片段正确），则可认为这两个contigs可以构建为scaffolds；
max_coverage 最大覆盖度，用于去除高频序列，主要是低复杂度的重复序列，或polyA和adapters序列。

其他参数参考手册。

输出：contigs.fa这个即为组装的contigs序列，没有转录本信息。

（第三步）transcriptome assembling
oases – De novo transcriptome assembler for the Velvet package
Version 0.1.18
Copyright 2009,2010 Daniel Zerbino (dzerbino@soe.ucsc.edu)
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Compilation settings:
CATEGORIES = 20
MAXKMERLENGTH = 75
Usage:
./oases directory [options]
directory : working directory name
Standard options:
-ins_length2 : expected distance between two paired-end reads in the second short-read dataset (default: no read pairing)
-ins_length_long : expected distance between two long paired-end reads (default: no read pairing)
-ins_length*_sd : est. standard deviation of respective dataset (default: 10% of corresponding length)
[replace '*' by nothing, '2' or '_long' as necessary]
-unused_reads : export unused reads in UnusedReads.fa file (default: no)
-amos_file : export assembly to AMOS file (default: no export)
-alignments : export a summary of contig alignment to the reference sequences (default: no)
–help : this help message
Advanced options:
-cov_cutoff : removal of low coverage nodes AFTER tour bus or allow the system to infer it (default: 3)
-min_pair_count : minimum number of paired end connections to justify the scaffolding of two long contigs (default: 4)
-min_trans_lgth : Minimum length of output transcripts (default: hash-length)
-paired_cutoff : minimum ratio allowed between the numbers of observed and estimated connecting read pairs
Must be part of the open interval ]0,1[ (default: 0.1)
-scaffolding :Allow gaps in transcripts (default: yes)
-degree_cutoff : Maximum allowed degree on either end of a contigg to consider it ‘unique’ (default: 3)
Output:
directory/transcripts.fa
directory/splicing_events.txt
directory/contig-ordering.txt
Oases是最后一步用于组装transcripts,并报告splicing events(splicing isoforms)
输入：velveth运行指定的那个目录如例子中的Assem
参数选择：跟velvetg过程相似
输出：
1) transcripts.fa
FASTA格式，如下：
>Locus_1_Transcript_1/10_Confidence_0.111
CACAGTTGAGCAGATAAAAAAGGGATTTGCAAGATGGAGAGCATCCAAAGCTTTTGGCAA
TTGGGTGATGAGCTCCGAGGACAATCAAAAGTCTCAGAGGATCATAAGTGGTTAATGGCT
GCCTCGAAATTGGCTGAGCAGACAAGGTCAAAGGGGGAACGCATGAATAATCTTGATCTT
TCAAAAGGTCTGTCAGAATTAAGGCCAAGGGAGAAGATTGGGTTTCAGGCTGGAGATCGG
AAGAGCGGTTCAGCAGGAA
>Locus_1_Transcript_2/10_Confidence_0.222
CACAGTTGAGCAGATAAAAAAGGGATTTGCAAGATGGAGAGCATCCAAAGCTTTTGGCAA
TTGGGTGATGAGCTCCGAGGACAATCAAAAGTCTCAGAGGATCATAAGTGGTTAATGGCT
GCCTCGAAATTGGCTGAGCAGACAAGGTCAAAGGGGGAACGCATGAATAATCTTGATCTT
TCAAAAGGTCTGTCAGAATTAAGGCCAAGGGAGAAGATTGGGTTTCAGGAAGATAATAAA
TTTGAAAGCCTTAACTTTAATATGTTGAACTTGGAATCTAAGATGGCTGAAAATGTGGGC
AAAGGTTCTTTCAGAAGTGGAGTTTACAATATGAATGCAGTGTACCAGAAAACCAACTTA
AATGGTATTGGAAATCTTACTGGTAGCAAGTATAGCGGTAACAATCAGAATACCAAAGAC
CCCAACAACAACAGCAATAACAACATTAACAACATTGAGAACAGTGCAAATAATGCTGTT
GACAAAAGGTTCAAGACTCTGCCTGCAACAGAGACGCTTCCACGGAATGAGGTTCTTGGA
GGATACATCTTTGTCTGTAACAATGACACTATGCAGGAAGATTTGAAGCGCCAGCTATTT
GGTTTACCTCCAAGATATAGAGATTCTGTTCGGGCAATAACACCTGGCTTACCTCTCTTT
CTTTACAACTACACAACTCACCAGTTGCATGGTATATTTGAGGCAGCTAGTTTTGGGGGT
TCTAACATTGATCCTACTGCTTGGGAAGACAAAAAATGTAAAGGCGAGTCGAGACTGGAG
ATCGGAAGAGCGGTTCAGC

其中Locus_1是指gene locus；
1/10指这个基因locus有10个转录本；
Confidence是指置信度，一般转录本越多，置信度越低，具体可参考：
What do the confidence scores mean?

The confidence scores assigned by Oases are a heuristic measure that expresses the uniqueness of a transcript in a locus. In the case of loci with only one or two possible transcripts, they/it are assigned a confidence of 1. In the case of more complex loci, if a transcript of a locus shares the majority of the contigs associated with a locus than the Confidence of the transcript is high (close to 1), whereas a Confidence Score close to 0 indicates that only few of the contigs of a locus are part of the transcript.

2) splicing_events.txt

>Locus_1_Node_30005
CACAGTTGAGCAGATAAAAAAGGGATTTGCAAGATGGAGAGCATCCAAAGCTTTTGGCAA
TTGGGTGATGAGCTCCGAGGACAATCAAAAGTCTCAGAGGATCATAAGTGGTTAATGGCT
GCCTCGAAATTGGCTGAGCAGACAAGGTCAAAG
>Locus_1_Node_30006
TGCCTCGAAATTGGCTGAGCAGACAAGGTCAAAGGGGGAACGCATGAATAATCTTGATCT
TTCAAAAGGTCTGTCAGAATTAAGGCCAAGGGAGAAGATTGGGTTTCAGG

报告每个gene Locus中由哪些contigs组成，用 AStalavista nomenclature*(参考文献Sammeth M, Foissac S, Guigó R. A general definition and nomenclature for alternative splicing events. PLoS Comput Biol. 2008 Aug 8;4(8):e1000147. )方式来描述alternative splicing

3) contig-ordering.txt

>Locus_1_Node_120550
ATCGGAAGAGCGGTTCAG
>Locus_1_Node_120121
ACTGGAG
>Locus_1_Transcript_1/10_Confidence_0.111
30005:153-(0)->30006:229-(0)->129011:259
>Locus_1_Transcript_2/10_Confidence_0.222
30005:153-(0)->30006:229-(0)->1:767-(0)->92621:768-(0)->92622:773-(0)->94084:799

一种混合的FASTA格式，用来报告contigs的线性关系：
$contig_id:$cumulative_length-($distance_to_next_contig)->
在这里，累计长度是描述转录本从5′端向3′端的组装起来contigs的总长度,从这个结果可以找到在转录本序列来自于哪些contigs。

更多请参考：http://www.ebi.ac.uk/~zerbino/oases/Manual.txt

由于版本不兼容问题，如果不急于得到结果，或者不是必须使用多线程，建议安装旧的版本：

http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.0.18.tgz

和

http://www.ebi.ac.uk/~zerbino/oases/oases_0.1.18.tgz

另外也可以参考Adrian Reich的README, Velvet and Oases (transcriptome): https://sites.google.com/a/brown.edu/bioinformatics-in-biomed/velvet-and-oases-transcriptome

发表评论