Running newbler: more de novo assembly parameters

2012/03/31

There is a long list of options/flags/parameters for a newbler assembly, some of which have been treated in the previous post. In this post I will describe some more parameters. At the end, as a bonus, I will share a parameter that is not mentioned in the current documentation…

-ss -sl -sc -ais -ads
These parameters control read overlap detection (there are two more, -mi and -ml, which I described in the previous post). More on seeds and overlap detection is described in the post explaining how newbler works. I never change these parameters as I assume 454 has done a good job optimizing them. But I would love to hear from people who have tested the effect of adjusting these parameters…

-ss sets the seed step, i.e. how many bases further down the read the next seed starts (default: 12)
-sl sets the seed length (default: 16)
-sc sets how many seeds two reads must share before they are deemed overlapping (I think) (default: 1)
-ais and -ads set the alignment identity score and the alignment difference score; these are used to sort overlaps when there are multiple ones (defaults: 2 and -3, respectively)
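
To make the seed step and seed length a bit more concrete, here is a small Python sketch that lays seeds out along a read using the default values. It only illustrates what the two numbers mean; the function name is made up, and starting the first seed at base 0 and dropping any seed that would run past the read end are my assumptions, not a description of what newbler does internally.

```python
def seed_positions(read_length, seed_step=12, seed_length=16):
    """Return (start, end) pairs for seeds along a read; end is exclusive.

    Assumption for illustration: the first seed starts at base 0 and a
    seed is only placed if it fits entirely within the read.
    """
    positions = []
    start = 0
    while start + seed_length <= read_length:
        positions.append((start, start + seed_length))
        start += seed_step
    return positions


# With the defaults, consecutive seeds overlap by 16 - 12 = 4 bases.
for start, end in seed_positions(100):
    print(start, end)
```

With a step smaller than the length, as in the defaults, every base of the read (apart from possibly the last few) is covered by at least one seed.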

-e
If (parts of) the genome you are sequencing are covered by many, many reads, say more than 50x coverage, it is possible that small sequencing errors between the reads will force newbler to artificially make two contigs of a region, where there should only be one. Telling newbler in advance about the depth using the -e parameter will adjust for this. An example could be a BAC/cosmid/Fosmid, where, since these are relatively short, there is a good chance you will have many more reads than you actually would need. If you don’t know the depth of the read dataset, just run a normal assembly first and have a look at the 454ContigGraph.txt file, described here.
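
If you have a rough idea of the size of the target, a back-of-the-envelope depth estimate is simply the total number of read bases divided by the target length. The sketch below shows that arithmetic in Python; the function name and the example numbers (20,000 reads of 400 bases on a 40 kb fosmid) are made up for illustration, and the result is only a starting point for the value you give to -e.

```python
def expected_depth(read_lengths, target_length):
    """Estimate average coverage: total read bases / target length."""
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    return sum(read_lengths) / target_length


# Hypothetical example: 20,000 reads of 400 bases on a 40,000 bp fosmid.
depth = expected_depth([400] * 20000, 40000)
print(depth)
```

In this hypothetical case the estimate comes out at 200x, well above the 50x mentioned above.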

-m
This parameter forces newbler to keep all sequence data in memory instead of on disc. It will make the assembly faster, but requires larger amounts of memory. I have never tried this, so I don’t know how much it speeds things up, nor how much memory newbler needs in this case.

-qo
For large assemblies, the output generation phase will take a long time (newbler has to go through all the flowgrams twice, and so far, this stage is not yet parallelized). To get a quick idea of what the assembly looks like, you could suppress parts of the output generation with this flag. In particular, newbler will not go through all the flow signal intensities to calculate average values, which are needed to determine consensus base quality. As a result, there will be more errors in the contigs, but at least you will get a feeling for the number and lengths of contigs/scaffolds, N50 etc. If you used the -nrm flag, or started the assembly with runProject, you can actually restart the assembly to get the full output by writing runProject projectname, see also this post.

-notrim
With this flag set, newbler will not do any additional trimming (based on quality, or primers/adaptors/vectors etc you might have added using -vt or -vs).

-nobig, -noace
With -nobig, the following (usually large) files will not be included in the output: ACE/consed files, 454PairAlign.txt, 454AlignmentInfo.tsv. With -noace, the ace file will not be generated.

-ar -at -ad
These parameters control how reads are entered in the ace file. -ar will result in the entire raw read (after basecalling) being added; with -at, only the trimmed portion of the read will be added; -ad resets to the default, which is trimmed.

-tr
And now for a hidden option that is not mentioned in the manual. I got special permission from my contacts at 454 to describe this parameter, but they wanted me to stress that it is not yet fully supported, but will be in the next software release (i.e. use at your own risk). -tr will result in two files, 454TrimmedReads.fna and 454TrimmedReads.qual. These files contain the reads after trimming (by newbler). Newbler describes the trimpoints in the 454TrimStatus.txt file, and uses these to generate these output files. Quite handy if you quickly need access to the reads as newbler used them! Another use of these files is to extract the singletons, by using the read IDs from the reads labeled “Singleton” in the 454ReadStatus file, and a script that pulls these out of the 454TrimmedReads.fna file.
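
The singleton extraction could be sketched in Python as below. This is a minimal sketch under a few assumptions that you should check against your own output: that the read status file is called 454ReadStatus.txt, that it is tab-delimited with the read ID in the first column and the status in the second, and that the read ID is the first word on each FASTA header line. The function names and the output file name are made up for this example.

```python
def singleton_ids(status_lines):
    """Collect IDs of reads whose status column says 'Singleton'.

    Assumes tab-delimited lines: read ID first, status second.
    A header line is skipped automatically, as its second column
    is not 'Singleton'.
    """
    ids = set()
    for line in status_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1].strip() == "Singleton":
            ids.add(fields[0].strip())
    return ids


def filter_fasta(fasta_lines, wanted_ids):
    """Yield the FASTA lines (header and sequence) of the wanted reads."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            # Assumes the read ID is the first word after '>'.
            keep = bool(tokens) and tokens[0] in wanted_ids
        if keep:
            yield line


def extract_singletons(status_path, fasta_path, out_path):
    """Write the singleton reads from fasta_path to out_path."""
    with open(status_path) as status:
        wanted = singleton_ids(status)
    with open(fasta_path) as reads, open(out_path, "w") as out:
        out.writelines(filter_fasta(reads, wanted))
```

Run from within the assembly output folder, a call could look like extract_singletons("454ReadStatus.txt", "454TrimmedReads.fna", "singletons.fna").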
