With this post, I’ll start going through the output files newbler generates. Some of these will be described in detail as they contain a lot of important information.
For today’s post, we’ll start with the 454NewblerMetrics.txt file. This file contains a lot of details on the reads used during the assembly, as well as the resulting contigs and, in the case of paired end reads, scaffolds.
The file starts with some metadata, such as the date of the assembly, where is is located, and what version of newbler was used. For this post, I used a file of as assembly generated with version 2.3 and both shotgun and paired end read files. Note that the output will be slightly different for a mapping project (to be described in a later post) than for an assembly project.
Section 1: runData
file
{
path = "/your/path/yourfile1.sff";
numberOfReads = ######, ######;
numberOfBases = #########, #########;
}
For each input file, the numbers mentioned are reads and bases in the file, reads and bases after trimming.
Section 2: pairedReadData
For files with paired end reads, there is some additional information (and something strange):
{
path = "/your/path/yourfile2.sff";
numberOfReads = ######, ######;
numberOfBases = #########, #########;
numWithPairedRead = ######;
}
Not all reads have the paired end linker, or have long enough sequences on both ends of the linker, so the number of reads that have been treated as paired end reads is listed. Note that the number of trimmed reads is higher than the number of actual reads. The reason for this is that each half of the paired end reads is counted as a separate trimmed read.
Section 3: runMetrics
{
totalNumberOfReads = #########;
totalNumberOfBases = ###########;
numberSearches = ########;
seedHitsFound = ##########, ##.##;
overlapsFound = ##########, ###.##, ###.##%;
overlapsReported = ##########, ###.##, ##.##%;
overlapsUsed = #########, #.##, #.##%;
}
The numbers of searches, seeds and overlap have to do with the alignment phase of the assembly (see previous post).
Section 4: readAlignmentResults
{
path = "/your/path/yourfile1.sff";
numAlignedReads = ######, ##.##%;
numAlignedBases = #########, ##.##%;
inferredReadError = #.##%, #######;
}
Here, for each input file, the number of reads and bases that were aligned to other reads are reported. In addition, ‘inferredReadError’ indicates the number of errors (mainly indels) between the contigs and the reads in this file is reported. If this number is high, the read file might contain very low quality reads. I must admit that I don’t know that ‘high’ would be (if you have a bunch of sff files from different runs, one could compare the inferred error rate between them).
Section 5: pairedReadResults
{
path = "/your/path/yourfile2.sff";
numAlignedReads = ######, ##.##%;
numAlignedBases = ########, ##.##%;
inferredReadError = #.##%, #######;
numberWithBothMapped = #####;
numWithOneUnmapped = #####;
numWithMultiplyMapped = #####;
numWithBothUnmapped = ###;
}
For the paired end read files, the same information is displayed as in section 4. In addition, mapping results for the pair halves are mentioned: the number of reads for which i) both pair halves mapped (i.e aligned) uniquely (either to the same contig, or to two different contigs), ii) only one halve mapped, iii) one or both of the halves mapped to multiple places, neither halve aligned to any contig. Of those reads in group i), reads that map to the same contig are used to calculate the average paired end distance for the reads, while only those reads which do map to different contigs will be used for scaffolding.
This concludes the part on the input. Next are the actual assembly results
Section 6: consensusDistribution
This section deals with basecalling of the consensus contigs. After alignment and contig building (which are done in ‘base space’, i.e. using nucleotide sequences), the flow signals of the reads obtained during sequencing (i.e the light intensities during incorporation of nucleotides) is used to determine the consensus bases. In fact, the number of homopolymers (A, AA, AAA, AAAA, ….) is determined based on all the signals present at a certain position.
The section on ‘fullDistribution’ gives a histogram of all the normalized signal values (the value 0 means no base incorporated during that flow, this happens many times). You will see peaks around (but not exactly at) 1.0, 2.0 etc. These peaks are mentioned under “distributionPeaks”. Based on this information, newbler determines thresholds for the border between n A’s and n+1 A’s (or C’s, G’s T’s). These thresholds are mentioned under thresholdsUsed. For example, “threshold = 3, 4, 3.38” means the threshold between three and four bases lies at 3.38. Finally the “interpolationAmount” is used for signals over the highest peak mentioned, (but I do not know how this number is used…).
Section 7: consensusResults
Here is the really interesting stuff: the metrics of the assembly. But first, a summary of the reads, i.e. what was mentioned in section 4 and 5 is summarized under “readStatus” and “pairedReadStatus”. In addition, for all the reads it is mentioned whether they were
- assembled fully or partially (in the latter case not all of the read was used in a contig)
- singleton, i.e. reads without overlap to other reads
- repeat, i.e. the read had many seeds which were present in many other reads, indicating that they were from repeat regions
- outlier, i.e. reads which aligned, but in a way that was in inconsistent with the majority of reads
- too short, the (trimmed) read was shorter than the minimum length used by newbler (which is 50 bases, unless when paired end reads are present, in which case it is 20 bases)
Also, for paired end reads, the per-library average distance between halves is mentioned, as well as the standard deviation of the average.
library
{
l
ibraryName = "3kb_PE";
pairDistanceAvg = 2675.5;
pairDistanceDev = 668.9;
}
This number is calculated, as mentioned above, by using the pairs where both halves map to the same contig. The average distance is then used to estimate the distance between scaffolded contigs. The library name is usually the filename. It is possible, however, to have several sff files, coming from the same paired end library, grouped under one library name, with a common paired end distance, by changing the way newbler is run (it involves the “addRun” command, more on this another time). Note, by the way, that the standard deviation is always 25% of the average distance (so, this number is ‘set’ by newbler, not calculated)…
“scaffoldMetrics”:
{
numberOfScaffolds = #####;
numberOfBases = #########;
avgScaffoldSize = #####;
N50ScaffoldSize = ######;
largestScaffoldSize = #######;
}
Here, the number of scaffolds and bases (including gaps) are mentioned, as well as average and largest scaffold size. The N50 scaffold size is defined such that “half of the assembled bases reside in scaffolds having a length of at least the N50 scaffold length” Or, when you sum up the lengths of the scaffolds from longest to shortest, the N50 size is the size of the scaffold where your sum is hits or crosses half the total length. N50 is a more effective number to compare assemblies, as it is reflecting the length distribution of the scaffolds, and is less influenced by the total number of scaffolds than the average.
Next is “largeContigMetrics”:
{
numberOfContigs = ######;
numberOfBases = #########;
avgContigSize = ####;
N50ContigSize = ####;
largestContigSize = #####;
Q40PlusBases = #########, ##.##%;
Q40MinusBases = #######, #.##%;
}
Large contigs are those of at least 500 bp, (this cutoff can been set differently). The same metrics as for the scaffolds are given (this time, the total length obviously does not include gaps). The number of Q40 plus bases indicates the bases with a phred-like consensus quality score of at least 40 (meaning a 1:10000 [1 in 104.0
] chance of the base being wrong). It used to be said that for a finished genome, 100% of the bases had a quality score of at least 40.
Finally, the number and total length of all the contigs, i.e. those with a minimum length of 100 bp (again, this number can be changed) is listed.
Note that not all contigs are used in scaffolds. I feel this metric should be included in this file, but I might get back on how to obtain it using some small-scale bioinformatics.