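NCBI recently converted its sequencing data to the *.sra format to save storage space and no longer provides *.fastq.gz files, so a dedicated tool is needed to convert downloaded *.sra files.
Download: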
http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software
The page provides builds for different platforms as well as the source code.
Conversion command
$ fastq-dump -A <SRR_accession> -D <Path_to_SRR_Directory> -O <Output_Path>
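As a rough sketch of how this might be applied to a whole folder of downloads, the function below issues one fastq-dump call per *.sra file. The function name `convert_all` and the `DRY_RUN` variable are hypothetical names made up for this example. It relies only on the `-O` option and a path argument, and by default it just prints the commands so they can be reviewed before anything runs.

```shell
#!/bin/sh
# Sketch: run (or just print) one fastq-dump call per *.sra file in a directory.
# convert_all and DRY_RUN are illustrative names, not part of the SRA Toolkit.
convert_all() {
    sra_dir=$1    # directory holding the downloaded *.sra files
    out_dir=$2    # directory where the fastq files should go
    for f in "$sra_dir"/*.sra; do
        [ -e "$f" ] || continue              # glob matched nothing: skip
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run (the default): show the command, do not execute it
            echo "fastq-dump -O $out_dir $f"
        else
            fastq-dump -O "$out_dir" "$f"
        fi
    done
}
```

Calling `convert_all ./sra ./fastq` lists the planned commands; `DRY_RUN=0 convert_all ./sra ./fastq` executes them, assuming fastq-dump is on the PATH.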
Basic command options
Command | Description |
---|---|
‘-A’ or ‘--accession’ | Enables modification of the output name used for the fastq files. For example: fastq-dump -A foo SRR000001 will produce files named ‘foo.fastq’, ‘foo_1.fastq’, and ‘foo_2.fastq’ |
‘-D’ or ‘--table-path’ | Makes the archive path more explicitly specified, thus preventing confusion when more than one option is specified. These two commands produce the same files: fastq-dump ~/SRR000001 and fastq-dump -D ~/SRR000001. However, the first command below will fail while the second will succeed: fastq-dump -C ~/SRR000001 and fastq-dump -C -D ~/SRR000001 (the ‘-C’ option is explained further below) |
‘-N’ or ‘--minSpotId’ | Minimum spot number at which to start the dump process |
‘-X’ or ‘--maxSpotId’ | Maximum spot number at which to stop the dump process. For example: fastq-dump -N 5 -X 10 SRR000001 will dump six spots, starting from spot ‘SRR000001.5’ and ending at spot ‘SRR000001.10’. Filtered spots can result in fewer than (maxSpotId - minSpotId + 1) total spots output. |
‘-G’ or ‘--spot-group’ | Boolean option that results in fastq files divided into spot groups as defined in the Experiment (or eventually Run) xml. This command: fastq-dump -G SRR051894 Produces these five fragment files: SRR051894.fastq SRR051894_GDSX2KN04_PSORIASISMDA-POOL-738_CB028-01WG.fastq SRR051894_GDSX2KN04_PSORIASISMDA-POOL-738_CB036-01WG.fastq SRR051894_GDSX2KN04_PSORIASISMDA-POOL-738_CD021-01WG.fastq SRR051894_GDSX2KN04_PSORIASISMDA-POOL-738_CD036-01WG.fastq |
‘-T’ or ‘--group-in-dirs’ | Boolean option directing the utility to produce fastq files in sub-directories rather than producing files within the same directory |
‘-O’ or ‘--outdir’ | Indicates the directory where the fastq result should be placed For example: fastq-dump -O /tmp -T SRR000001 will create a directory, SRR000001, in /tmp with this tree structure: >tree /tmp/SRR000001 /tmp/SRR000001 |-- 1 | `-- fastq |-- 2 | `-- fastq `-- fastq |
‘-K’ or ‘--keep-empty-files’ | Has no effect - at one time this option would represent all three possible files even if one or two were empty |
‘-M’ or ‘--minReadLen’ | Allows specification of the desired minimum read length to output (default is 25). The command ‘fastq-dump -M 0 SRR000001’ prevents any filtering based on read length. |
‘-W’ or ‘--noclip’ | Prevents clipping of a spot sequence based on the right clip information. Toggling ‘show-clipped’ in the ‘customize’ area for reads in the SRA Run Browser enables observing the effect of this option (e.g. see SRR000001). |
‘-F’ or ‘--origfmt’ | Results in fastq containing only the original identifier on the defline (i.e. no length or SRR identifier are present) |
‘-C’ or ‘--dumpcs’ | Forces color space sequence to be dumped instead of base space. If the optional ‘cskey’ is provided (i.e. A, C, T, or G), then all fastq files produced will use that key at the start of each color space sequence. |
‘-B’ or ‘--dumpbase’ | Forces base space sequence to be dumped instead of color space. |
‘-Q’ or ‘--offset’ | Allows using a different offset value to represent a different offset character in the fastq output. For example, using an offset of 64 represents using ‘@’ as the offset character. |
‘-I’ or ‘--readids’ | Appends a read index to the run identifier starting with ‘1’ as the first index. Note that this differs from the spot descriptor in the Experiment xml where the read indices start with ‘0’. In the case of SRR000001, the first spot in each file would have the identifiers ‘SRR000001.5.4’, ‘SRR000001.1.2’, and ‘SRR000001.1.4’. Note that the first spot sequence in SRR000001.fastq, the fragment file, comes from the second biological/application read which has an index of ‘4’. |
‘-E’ or ‘--no_qual_filter’ | This option turns off quality filtering based on leading/trailing low quality values. As reads have become longer this option has become a more viable alternative. |
‘-SF’ or ‘--complete’ | Outputs the separated reads into a single file. For example, the command: fastq-dump -SF SRR029338 Results in the first eight lines of the file, SRR029338.fastq, containing: @SRR029338.1 080115_EAS112_0034:8:1:615:780 length=36 GGTTGAGTAAAGTGTCTAAAGGCA +SRR029338.1 080115_EAS112_0034:8:1:615:780 length=36 IIIIIIIIIIIIIIIIIIIAIIA<I8I+7I9+II2I @SRR029338.1 080115_EAS112_0034:8:1:615:780 length=36 AAAGTCAAATTTGAATTGTTGTCA +SRR029338.1 080115_EAS112_0034:8:1:615:780 length=36 IIIIIIIIDIIIIIIIIIIIII.1F2II=8*2+//I In the case of 454 pair submissions, the second technical read (i.e. linker) is included in this single output file. |
‘-DB’ or ‘--defline-seq’ | Allows specification of the sequence defline format. For example: -DB "@$ac.$si $sn length=$rl" This specification produces the same output as the default output. See Appendix D for a more in-depth explanation. Note that submission of a ‘fastq-dump’ command to a compute farm (e.g. Sun Grid Engine) can require preceding a number of the characters with backslash characters when using this option. The above example might require this version: -DB "@\\\$ac.\\\$si \\\$sn length=\\\$rl" |
‘-DQ’ or ‘--defline-qual’ | Allows specification of the quality defline format. For example: -DQ "+$ac.$si $sn length=$rl" |
‘-alt [n]’ | Provides alternative output formats without having to indicate the individual options. Alternate ‘1’, the only option, results in this format for SRR029338_1.fastq: @SRR029338.1 080115_EAS112_0034:8:1:615:780/1 GGTTGAGTAAAGTGTCTAAAGGCA + IIIIIIIIIIIIIIIIIIIAIIA<I8I+7I9+II2I And this format for SRR029338_2.fastq: @SRR029338.1 080115_EAS112_0034:8:1:615:780/2 AAAGTCAAATTTGAATTGTTGTCA + IIIIIIIIDIIIIIIIIIIIII.1F2II=8*2+//I |
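The spot-range arithmetic described for ‘-N’ and ‘-X’ can be checked with a small sketch like the one below. The helper names `spot_count` and `dump_range_cmd` are hypothetical. The first computes the upper bound (maxSpotId - minSpotId + 1) on the number of spots in a range, and the second only prints the matching fastq-dump command rather than running it.

```shell
#!/bin/sh
# Sketch: helpers around the -N / -X spot range options described in the table.
# spot_count and dump_range_cmd are illustrative names, not toolkit commands.

# Upper bound on spots dumped for a range; filtering can reduce the real count.
spot_count() {
    min=$1
    max=$2
    echo $(( max - min + 1 ))
}

# Print the fastq-dump command for an accession and an inclusive spot range.
dump_range_cmd() {
    acc=$1
    min=$2
    max=$3
    echo "fastq-dump -N $min -X $max $acc"
}

spot_count 5 10                  # at most 6 spots
dump_range_cmd SRR000001 5 10    # fastq-dump -N 5 -X 10 SRR000001
```

Per the ‘-M’ row above, adding `-M 0` to the printed command would disable read-length filtering, which is one way to get closer to the upper bound.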
Converting *.sra files to SFF format
$ sff-dump -A <SRR_accession> -D <Path_to_SRR_Directory> -O <Output_Path>
Options:
Command | Description |
---|---|
-O | Allows user to specify an output directory. If not used, output will default to the current directory. |
-N | Minimum spot ID to output. The first spot in the output will be the number given for this option. |
-X | Maximum spot ID to output. The last spot in the output will be the number given. Min and Max spot options can be combined to output subsections of an SRR. |
-G | spotgroup-file Split into files by SPOT_GROUP |
-T | spotgroup-dir Split into subdirectories (of -O ) by SPOT_GROUP |
-L | Log level: 0-13 or fatal\|sys\|int\|err\|warn\|info\|debug[1-10]. (default: info) Set to ‘4’ to mimic the unix standard of no messages for a successful operation. |
-H | Prints this help message and version information. |
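Since the min and max spot options can be combined to output subsections of an SRR, one possible use is to split a large run into chunks. The sketch below only prints the sff-dump commands it would run. The function name `sff_chunks`, the per-chunk output directory layout, and the assumption that spot IDs start at 1 are all choices made for this example, and the total spot count has to be supplied by the caller.

```shell
#!/bin/sh
# Sketch: print one sff-dump command per chunk of spots.
# sff_chunks is an illustrative name; spot IDs are assumed to start at 1.
sff_chunks() {
    acc=$1      # SRR accession
    dir=$2      # path to the SRR directory (-D)
    out=$3      # root output directory; each chunk gets its own subdirectory
    total=$4    # total number of spots in the run (supplied by the caller)
    size=$5     # spots per chunk
    start=1
    while [ "$start" -le "$total" ]; do
        end=$(( start + size - 1 ))
        if [ "$end" -gt "$total" ]; then
            end=$total               # last chunk may be shorter
        fi
        echo "sff-dump -A $acc -D $dir -O $out/$start-$end -N $start -X $end"
        start=$(( end + 1 ))
    done
}

sff_chunks SRR000001 /data/SRR000001 /tmp/sff 25 10
```

Each chunk is written to its own subdirectory so that output files with the same name do not overwrite one another; those directories may need to be created before the printed commands are run.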
Converting *.sra files to Illumina native format
$ illumina-dump [options] -path <directory_containing_the_accession> <acces
Command | Description |
---|---|
-D, --table-path | Path to accession data. |
-O, --outdir | Output directory. Default: '.' |
-N, --minSpotId | Minimum spot id to output. |
-X, --maxSpotId | Maximum spot id to output. |
-G, --spot-group | Split into files by SPOT_GROUP (member). |
-T, --group-in-dirs | Split into subdirectories instead of files. |
-K, --keep-empty-files | Do not delete empty files. |
-L, --log-level | Logging level: 0-13 or fatal\|sys\|int\|err\|warn\|info\|debug[1-10]. Default: info |
-H, --help | Prints this message |
Format options:
Command | Description |
---|---|
-r, --read | Output READ: "seq". Default: on |
-q, --qual1 | Output QUALITY, into single (1) or multiple (2) files: "qcal". Default: 1 |
-p, --qual4 | Output full QUALITY: "prb". Default: off |
-i, --intensity | Output INTENSITY, if present: "int". Default: off |
-n, --noise | Output NOISE, if present: "nse". Default: off |
-s, --signal | Output SIGNAL, if present: "sig2". Default: off |
-qseq | Output QSEQ format: "qseq". Default: off |
1F
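I can't work out what the "space reasons" are supposed to be. The sra files are clearly a good deal larger than fastq.tar.gz, so what is NCBI trying to achieve? And now we have to go through this conversion every time.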
B1
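@ 怪羊基德 My guess is that sra is more compatible with data from the various high-throughput sequencing platforms. For example, the raw output from 454 machines is not fastq.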
B1
@ 怪羊基德 Could you help me convert an SRA file? I'm a beginner and have already sacrificed my whole Children's Day to this.
2F
What format do you want to convert the SRA to? If you want fastq, just run fastq-dump SRA_ID. Newer versions of fastq-dump do not require downloading the SRA file first.
3F
How do I do the conversion on Windows?