color space的测序数据

2014/08/14来源：生信菜鸟团评论2,678

看了illumina的测序仪市场份额的确很夸张，像我这样在生信数据分析领域身经百战的老鸟，都是直到今天才碰到color space的测序数据。测序平台是AB 5500xl Genetic Analyzer，就是传说中的solid格式。主要是我在学习一篇关于tp53转录因子结合能力的文章的时候碰到的，我查看了下载的数据虽然还是fastq格式，但很诡异，我完全不认识里面的序列。这里总结一下，下面是我的学习过程及思路，有点乱，大家随便看看！

首先：测序仪给的数据应该是 (.csfasta & .qual) 这两个后缀名的文件

然后，可以用脚本把数据转为csfastq格式，与普通fastq数据格式是没有区别，但是里面包含的不是序列，是color的编码。

其次，color space不允许转为base space数据！！！

最后，之所以转为csfastq格式，是为了适应很多软件，fastqc,cutadap，SHRiMP，sequel和BFAST ，bowtie等等

csfastq数据如下，还是四行代表一条read：

@SRR2967009.1 100_1000_1168_F3
T10011023211201220121202030102221012302121010131001
2@@@@>@?@@@@<@@//;@@/@9?@8@=@@@6;6@66;<@6@67?2?;/@
@SRR2967009.2 100_1000_1211_F3
T20132312201120021312220200023110220113100012321011
@@@@@@@@@<@@@@@@@@@@@@@@@@@@@@@@?@@@@/?@@@@@@@@

比如对http://www.ncbi.nlm.nih.gov/sra?term=SRP066824 来说：

首先下载数据并且解压：
for ((i=7009;i<7014;i )) ;do wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP066/SRP066824/SRR296$i/SRR296$i.sra;done
因为测序平台是AB 5500xl Genetic Analyzer，就是传说中的solid格式，所以不应该用fastq-dump啦，应该用abi-dump才对！

参考：http://davetang.org/muse/2012/07/04/from-sra-to-fastq-for-solid-data/

ls *sra |while read id; do ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/abi-dump $id;done

解压之后是下面这样：
color space的测序数据-图片1

这样只能转为csfasta格式文件和qual文件，需要下载大名鼎鼎的lh写的一个脚本：wget http://www.bbmriwiki.nl/svn/bwa_45_patched/solid2fastq.pl 来转为fastq格式
程序非常好用：perl solid2fastq.pl SRR2967009_ SRR2967009 即可

也可以用Python程序来做这个转换，http://edison.cremag.org/resources/seq-analysis/tools/solid2fastq/

最后就是输出了fastq格式的 color space的数据，但是我测试了，直接用fastq-dump也可以把数据解压成fastq格式的color space的数据，并不需要那么麻烦的，因为我们不是从测序仪拿数据，而是从SRA数据库里面直接下载。(补充一下，直接用fastq-dump也可以把数据解压成fastq格式跟用abi-dump解压后再转换成csfastq有区别，但是我现在说不清楚区别是什么，建议用abi-dump)

SOLiD native (CSFASTA/QUAL)

All SRA data can be output into color space data. The utility ‘abi-dump’ can be used to output CSFASTA and QUAL data files (with appropriate options, fastq-dump can be used to output “CSFASTQ” format).

SHRiMP，sequel和BFAST 都可以来比对fastq格式的color space的数据，或者直接从 (.csfasta & .qual) 这两个文件开始处理，其实bowtie也可以的。

https://wikis.utexas.edu/display/bioiteam/BFAST

比对后的bam文件，就可以走正常的illumina数据分析流程啦！
转为了fastq格式的color space的数据，就可以直接进行fastqc看看质量控制图片，如果质量很差，可以直接用处理cutadapt等各种软件进行处理，in a .csfasta and a .qual file (this is the native SOLiD format).
参考：http://cutadapt.readthedocs.io/en/stable/colorspace.html
fastqc软件直接处理csfastq格式数据结果如下：
color space的测序数据-图片2

参考：http://seqanswers.com/forums/showpost.php?p=59156&postcount=4
Sequencer reads have a chance of read error (e.g. spot misidentification), combined with a chance of sequence error (e.g. polymerase misread in the PCR step).
For sequencers that output in base space, both these errors have a similar effect on the base-space mapping.

For sequencers that output in color-space, the read errors result in a somewhat unexpected base-space translation even if the underlying sequence has a perfect match to the reference.

The issues relating to color-space to base-space translation were discussed in the thread you linked to, but here’s my take on it (dumped from an email I recently sent to someone else):A color-space sequence is an encoding of adjacent dimers such that unchanging bases are encoded with ‘0’, complementary changes are encoded with ‘3’, the colour ‘1’ is used for a non-complementary base change on the same side of the alphabet (AC, CA, GT, or TG), and the colour ‘2’ is used for a non-complementary base change on a different side of the alphabet (AG, GA, CT, or TC). A table of these changes can be found here:

http://www.ploscompbiol.org/article/…i.1000386.g002

This has a few nice properties (e.g. the reverse-complement of a color-space sequence is the same as the reverse of the color-space sequence, a SNP will have two transitions), but many annoying and nasty properties.

The first is that a color-space sequence in itself is meaningless without a base reference (usually the starting base).

原文来自：http://www.bio-info-trainee.com/1850.html

SOLiD native (CSFASTA/QUAL)

发表评论