The files most people are after when they do an assembly are these: the actual contig and scaffold sequences. The contigs are in the files 454AllContigs and 454LargeContigs. By default, ‘All’ means contigs of at least 100 bp, while ‘Large’ means contigs of at least 500 bp. These lower limits can be set during assembly.
The ‘fna’ files contain the sequences (bases) in fasta format (I actually do not know why this extension was chosen over ‘fasta’ or ‘fa’, which are most often used). The ‘qual’ files contain phred-like quality scores. The contigs are in the same order in the fna and qual files, and the quality scores are in the same order as the bases:
>contig00005 length=962 numreads=77
CgaCTAGTATTGACACCCACAGTGAACTAACTATTGGTAACTATTATTAGGAACATGTAACTTGCATCAGGTACAGGTAACTAAAGGTATGTCTATTTAC
…
>contig00005 length=962 numreads=77
64 25 38 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 6464 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
…
Note the lower-case bases at the beginning of the sequence; these correspond to quality values below 40 (here 25 and 38).
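The relation between the two files can be sketched in a few lines of Python. This is only an illustration of the rule described above, not newbler's own code; the function name mask_low_quality and the threshold parameter are made up for this example, with the cutoff of 40 taken from the text.

```python
def mask_low_quality(bases, quals, threshold=40):
    """Lower-case every base whose quality score is below the threshold.

    'bases' is the contig sequence, 'quals' the list of scores from the
    matching record in the qual file (same order as the bases).
    """
    if len(bases) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    return "".join(
        base.lower() if qual < threshold else base.upper()
        for base, qual in zip(bases, quals)
    )


# First seven bases and scores of contig00005 from the example above.
print(mask_low_quality("CGACTAG", [64, 25, 38, 64, 64, 64, 64]))  # CgaCTAG
```

The length check is there because the whole approach relies on the fna and qual records lining up base for base.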
The fasta header:
>contig00005 length=962 numreads=77
gives the unique contig number, its length in bp and the number of reads in the alignment used to build this contig. Note, however, that this number counts all reads that were aligned, regardless of whether a read was aligned over its entire length or over just a part of it. For this to make sense, please remember what I explained in the previous post on the contig graph: reads can appear in more than one contig. So although the number of reads mentioned in the fasta header gives some indication of the read depth, it is not a good proxy. For the actual read depth (the total number of bases aligned to generate the consensus contig sequence, divided by the contig length), please refer to the 454ContigGraph.txt file, which will be the subject of the next post.
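If you want to pull these values out of the header yourself, a small regular expression will do. The sketch below assumes the header layout is exactly as in the example (name, then length=, then numreads=); the names HEADER_RE and parse_contig_header are invented for this illustration.

```python
import re

# Assumed layout: >name length=<int> numreads=<int>
HEADER_RE = re.compile(r"^>(\S+)\s+length=(\d+)\s+numreads=(\d+)")


def parse_contig_header(line):
    """Return the contig name, length (bp) and read count from a header line."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("not a newbler-style contig header: %r" % line)
    return {
        "name": match.group(1),
        "length": int(match.group(2)),
        "numreads": int(match.group(3)),
    }


print(parse_contig_header(">contig00005 length=962 numreads=77"))
```

Keep in mind that numreads divided by length is not the read depth, for the reasons given above.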
The 454Scaffolds.fna and .qual files contain the scaffolds, which are nothing more than selected contigs, interspersed with gaps indicated by strings of ‘N’. The number of ‘N’s represents the gap length estimate, with a lower gap length limit of 20 bases. The contigs (between the gaps) in the scaffolds can be found in the 454AllContigs.fna (or the 454LargeContigs.fna) file, except those that are below the ‘All contigs’ lower length limit, usually 100 bp.
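Because gaps are simply runs of ‘N’, they are easy to locate in a scaffold sequence. The following is a minimal sketch, assuming the scaffold has already been read into a single string; find_gaps is a hypothetical helper name, and the toy sequence is made up for the example.

```python
import re


def find_gaps(sequence):
    """Return (start, end, length) for every run of N in the sequence.

    Coordinates are 1-based and inclusive.
    """
    return [
        (m.start() + 1, m.end(), m.end() - m.start())
        for m in re.finditer(r"N+", sequence.upper())
    ]


# Toy scaffold: two short contigs separated by a 20-base gap.
toy_scaffold = "ACGT" + "N" * 20 + "TTGA"
print(find_gaps(toy_scaffold))  # [(5, 24, 20)]
```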
Finally, the 454Scaffolds.txt file contains a description of the scaffolds, showing which contigs make up each scaffold, and what the gap sizes are in between them. The file follows the NCBI scaffold layout ‘AGP’ format.
A portion of an example 454Scaffolds.txt file is shown here:
scaffold00001 1 27007 1 W contig00001 1 27007 +
scaffold00001 27008 27315 2 N 308 fragment yes
scaffold00001 27316 61770 3 W contig00002 1 34455 +
scaffold00001 61771 63341 4 N 1571 fragment yes
scaffold00001 63342 99181 5 W contig00003 1 35840 +
scaffold00001 99182 99489 6 N 308 fragment yes
scaffold00001 99490 132133 7 W contig00004 1 32644 +
Note column 5, which indicates whether the line describes a contig (‘W’) or a gap (‘N’). Column 1 gives the unique scaffold number/name, columns 2 and 3 give the start and end position of the contig or gap within that scaffold, and column 4 is an incremental line number (this restarts at ‘1’ for each new scaffold).
For the contig lines (‘W’ in column 5), the contig name is given in column 6, followed by the start and end position of the part of the contig used, which in the case of 454 newbler assemblies are always the first and last base of the contig. The last column indicates the orientation of the contig in the scaffold: ‘+’ means forward, ‘-’ reverse. Newbler adjusts the orientation of contigs such that this column will always be ‘+’.
For the lines indicating gaps (‘N’ in column 5), the gap length is given in column 6, followed by the gap type, which in the case of newbler assemblies is always ‘fragment’ (i.e. a gap between sequences). The final column is always ‘yes’ for gap lines, which indicates that “there is evidence of linkage between the adjacent lines”.
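The column layout described above translates directly into a small parser. This is a sketch that only handles the two line types seen in newbler output (‘W’ and ‘N’), and it assumes the columns are separated by whitespace; the function name parse_agp_line and the dictionary keys are my own choice.

```python
def parse_agp_line(line):
    """Split one 454Scaffolds.txt line into a dictionary of named fields."""
    fields = line.split()
    record = {
        "scaffold": fields[0],
        "start": int(fields[1]),
        "end": int(fields[2]),
        "part": int(fields[3]),
        "type": fields[4],
    }
    if fields[4] == "W":      # contig line
        record["contig"] = fields[5]
        record["contig_start"] = int(fields[6])
        record["contig_end"] = int(fields[7])
        record["orientation"] = fields[8]
    elif fields[4] == "N":    # gap line
        record["gap_length"] = int(fields[5])
        record["gap_type"] = fields[6]
        record["linkage"] = fields[7]
    else:
        raise ValueError("unexpected component type: %r" % fields[4])
    return record


example = [
    "scaffold00001 1 27007 1 W contig00001 1 27007 +",
    "scaffold00001 27008 27315 2 N 308 fragment yes",
]
for line in example:
    print(parse_agp_line(line))
```

A handy sanity check is that, for gap lines, end minus start plus one equals the gap length in column 6 (27315 - 27008 + 1 = 308 in the example).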
A note on scaffold definitions
I would define a scaffold as ‘two or more contigs connected by a minimum number of consistent paired end reads’ (where newbler’s minimum is 2 paired reads). In the output of newbler, besides these scaffolds, you will also find all unscaffolded contigs of at least 2000 bp, as these constitute significant parts of the genome sequenced. This means that in the 454Scaffolds.fna file, there will be scaffolds without any gap bases. These scaffolds will be represented in the 454Scaffolds.txt file by a single line. I tend to remove these scaffolds from an assembly and only report on genuine scaffolds, with at least one gap…
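Picking out the genuine scaffolds comes down to keeping only those with at least one gap line in 454Scaffolds.txt. The sketch below shows one way to do that, not a newbler feature; the function name genuine_scaffolds is invented, and scaffold00002 in the example is a made-up single-contig scaffold.

```python
from collections import defaultdict


def genuine_scaffolds(agp_lines):
    """Return the names of scaffolds that contain at least one gap ('N') line."""
    gap_counts = defaultdict(int)
    for line in agp_lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        # Register the scaffold even if this line is not a gap line.
        gap_counts[fields[0]] += 1 if fields[4] == "N" else 0
    return [name for name, count in gap_counts.items() if count > 0]


lines = [
    "scaffold00001 1 27007 1 W contig00001 1 27007 +",
    "scaffold00001 27008 27315 2 N 308 fragment yes",
    "scaffold00001 27316 61770 3 W contig00002 1 34455 +",
    # Hypothetical unscaffolded contig reported as a one-line scaffold:
    "scaffold00002 1 2500 1 W contig00010 1 2500 +",
]
print(genuine_scaffolds(lines))  # ['scaffold00001']
```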