Newbler output VI: the ‘status’ files

2012/03/31

The files that are the topic of this post are all tables, i.e. tab separated text files. The ‘status’ files describe what happened with all the reads and the paired end halves, while the AlignmentInfo file summarizes the contig alignments.

The fact that these files are tabular makes them easy to parse with perl/python or, my favorite, awk.

1) 454TrimStatus.txt

Accno           Trimpoints Used  Used Trimmed Length  Orig Trimpoints  Orig Trimmed Length  Raw Length
ERGMJHS01CYVHW  5-78             74                   5-98             94                   100
ERGMJHS01D6IHL  5-116            112                  5-116            112                  161
ERGMJHS01DYTX5  5-127            123                  5-127            123                  173
ERGMJHS01DYDH0  5-78             74                   5-78             74                   124
ERGMJHS01ECEGM  5-256            252                  5-256            252                  271
ERGMJHS01CRQ8D  5-272            268                  5-272            268                  273
ERGMJHS01ECMVT  5-260            256                  5-260            256                  270
ERGMJHS01EZ7VU  5-41             37                   5-61             57                   62
ERGMJHS01ERDXB  5-207            203                  5-207            203                  252

This file describes what (trimmed) part of the read was considered for alignment. The columns describe:

Accno: the unique read ID, where the first 7 characters describe the unique run ID, followed by the lane number, followed by the encoded x and y coordinates of the read on the picotiterplate.
Trimpoints Used: the start and end position of the part of the read newbler used. Most of the time, the start will be position 5, as the first four bases of every read comprise the key sequence that identifies the read as a sample read (as opposed to control reads that have different key sequences). Also, in contrast to traditional Sanger reads, read quality is usually high from the very first bases read after the sequencing primer. When MIDs (454’s Multiplex IDentifiers) or other tags/barcodes have been used during library generation, and the reads were split according to the tag (which removes the tag from the read), the starting position will be higher accordingly.
Used Trimmed Length: the length of the part of the read newbler used
Orig Trimpoints: the start and end positions of the trimmed read as it was given to newbler. These positions are the result of the trimming steps of the signal processing software (thanks to Steven Sullivan for pointing out that the original trimpoints come from the signal processing steps, not image processing).
Orig Trimmed Length: the corresponding original trimmed length
Raw Length: the length of the read as it was before image processing
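
To make the Accno layout described above concrete, here is a small Python sketch that splits a read ID into its parts. It assumes the lane number always takes two characters and that the remaining five characters hold the encoded coordinates; the function name is my own, and decoding the coordinates themselves is left out.

```python
def split_accno(accno):
    """Split a 454 read ID into run ID, lane and encoded plate coordinates.

    Assumption: IDs are 14 characters long, laid out as
    7 (run ID) + 2 (lane number) + 5 (encoded x/y coordinates).
    """
    if len(accno) != 14:
        raise ValueError(f"unexpected read ID length: {accno!r}")
    return {
        "run": accno[:7],      # unique run ID
        "lane": accno[7:9],    # lane number on the picotiterplate
        "coords": accno[9:],   # still-encoded x and y coordinates
    }
```

For example, split_accno("ERGMJHS01CYVHW") gives run 'ERGMJHS', lane '01' and coords 'CYVHW'.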

Comparing the Used Trimmed Length with the Orig Trimmed Length shows that for some reads, newbler trims even further than the signal processing software. Also, the usable part of a read can get shorter when the ‘trimming database’ option (-vt) was used during assembly, for example to remove vector/adaptor/primer sequences.
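
The comparison just described is easy to automate. Below is a minimal Python sketch, assuming the file is tab separated with the six columns listed above; the function name is hypothetical.

```python
def extra_trimmed(lines):
    """Yield (accno, bases_removed) for reads that newbler trimmed further
    than the original trimpoints.

    `lines` is any iterable of tab-separated lines, e.g. an open
    454TrimStatus.txt file handle.
    """
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        # Skip the header line and anything that is not a full data row.
        if len(fields) < 6 or fields[0] == "Accno":
            continue
        used_len, orig_len = int(fields[2]), int(fields[4])
        if used_len < orig_len:
            yield fields[0], orig_len - used_len
```

Called on an open file handle, this would report, for the excerpt above, reads ERGMJHS01CYVHW and ERGMJHS01EZ7VU, each with 20 bases removed.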

Another section of the same file:

FQL5QBG02GX6EQ_left   5-171    167  5-296  292  299
FQL5QBG02GX6EQ_right  217-296  80   5-296  292  299
FQL5QBG02GUPVF        255-255  1    5-255  251  265
FQL5QBG02IFXSU_left   5-173    169  5-305  301  308
FQL5QBG02IFXSU_right  219-305  87   5-305  301  308
FQL5QBG02GXQUO        29-268   240  5-268  264  268
FQL5QBG02JS960        5-270    266  5-275  271  304
FQL5QBG02H0VJ7_left   5-145    141  5-238  234  259
FQL5QBG02H0VJ7_right  190-238  49   5-238  234  259
FQL5QBG02HASXU        62-304   243  5-304  300  313

Here, some of the reads have ‘_left’ or ‘_right’ added at the end of the read ID (Accno). This indicates that the read was a paired end read (the linker sequence was detected in the read), and for this file, these reads get split into their constituent left and right halves. Note that, for example, for read FQL5QBG02GX6EQ, the position of the linker sequence can be determined from the trimpoints: from position 172 (following the last position of the left part) to 216 (just before the starting position of the right part). Also note that some reads of the same run are not paired end reads. These reads either lack the linker altogether (a result of the paired end library generation procedure), or have too few bases (fewer than 20) on one side of the linker to give two mappable read halves. These reads are used as normal shotgun reads.
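
The linker arithmetic above can be sketched in a few lines of Python. This assumes you have already pulled the first two columns (Accno and Trimpoints Used) out of the file; the function name is my own.

```python
def linker_spans(rows):
    """Return {read ID: (linker_start, linker_end)} for split paired end reads.

    `rows` is an iterable of (accno, trimpoints_used) pairs, i.e. the first
    two columns of 454TrimStatus.txt. Reads without both halves are ignored.
    """
    left_end, right_start = {}, {}
    for accno, trimpoints in rows:
        start, end = (int(x) for x in trimpoints.split("-"))
        if accno.endswith("_left"):
            left_end[accno[: -len("_left")]] = end
        elif accno.endswith("_right"):
            right_start[accno[: -len("_right")]] = start
    # The linker sits between the end of the left half and the start of the right half.
    return {
        read: (left_end[read] + 1, right_start[read] - 1)
        for read in left_end
        if read in right_start
    }
```

For read FQL5QBG02GX6EQ, with halves 5-171 and 217-296, this returns (172, 216), matching the positions worked out by hand.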

2) 454ReadStatus.txt

Accno                 Read Status         5' Contig    5' Position  5' Strand  3' Contig    3' Position  3' Strand
ERGMJHS01CYVHW        Assembled           contig00011  610          +          contig00011  685          -
ERGMJHS01CJOXV        PartiallyAssembled  contig00115  8069         -          contig00115  7943         +
ERGMJHS01DYDH0        Singleton
ERGMJHS01EZ7VU        Repeat
ERGMJHS01A8MP3        Outlier
FQL5QBG02GDUSS_left   Assembled           contig00106  3130         +          contig00106  3242         -
FQL5QBG02GDUSS_right  Assembled           contig00106  5787         -          contig00106  5759         +

This file describes where reads ended up after assembly was complete. For paired end reads, the ‘fate’ of each half is reported on a separate line. Columns are:

Accno: the unique read ID
Read Status: this can be
- Assembled: the read was placed in one or more contigs
- PartiallyAssembled: only part of the read was used for making contigs
- Singleton: there was no (significant) overlap between this read and all the others
- Repeat: the read was most likely derived from a repeated part of the genome. More technically: more than 70% of a read’s seeds (see this post) hit to more than 70 other reads.
- Outlier: a problematic read, e.g. a chimeric read
- TooShort: the trimmed portion of the read was below the length threshold. This minimum can be set with the -minlen flag during assembly. When it is not set, and no paired end reads are included, it is 50 bases; for an assembly with paired ends, it is 20 bases (if I’m not mistaken).

5′ Contig, 5′ Position, 5′ Strand: the contig and the position in it where the 5’ end of the read’s alignment begins, and the orientation of the read relative to the contig (‘+’ or ‘-’ for forward and reverse strand, respectively)
3′ Contig, 3′ Position, 3′ Strand: similar for the 3’ end of the read’s alignment

Note that only the start and end of each read’s alignment are shown. Due to the way newbler builds contigs, the middle of a read could be aligned within one or even several other contigs. It follows that this file cannot be used for determining all the reads that were used to build a contig, or all the contigs that a read is a part of.
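
As an example of what can be pulled from this file, the Python sketch below tallies the read status categories and lists reads whose alignment starts in one contig and ends in another. It assumes tab-separated lines with the eight columns described above, and that the contig columns are empty or absent for unplaced reads; the function name is hypothetical.

```python
from collections import Counter


def summarize_read_status(lines):
    """Return (status counts, IDs of reads whose 5' and 3' contigs differ).

    `lines` is any iterable of tab-separated lines, e.g. an open
    454ReadStatus.txt file handle.
    """
    counts = Counter()
    spanning = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2 or fields[0] == "Accno":
            continue  # header or malformed line
        counts[fields[1]] += 1
        # Columns 3 and 6 (1-based) hold the 5' and 3' contig, when present.
        if len(fields) >= 8 and fields[2] and fields[5] and fields[2] != fields[5]:
            spanning.append(fields[0])
    return counts, spanning
```

Keep the caveat above in mind: the second list only reflects the two ends of each alignment, not everything in between.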

3) 454PairStatus.txt

Template        Status          Distance  Left Contig  Left Pos  Left Dir  Right Contig  Right Pos  Right Dir  Left Distance  RightDistance
FQL5QBG02GDUSS  SameContig      2657      contig00106  3130      +         contig00106   5787       -
FQL5QBG02GRUHY  Link            1366      contig00208  267       -         contig00207   3298       +          267            1099
FQL5QBG02HRDSS  OneUnmapped     -         Unmapped                         contig00017   10630      -
FQL5QBG02FS0NM  BothUnmapped    -         Unmapped                         Unmapped
FQL5QBG02IIB8R  MultiplyMapped  -         Repeat                           contig00173   207        -
FQL5QBG02IJDOE  FalsePair       -         contig00015  72252     +         contig01166   7528       -

This file describes for each paired end read, how it ended up in the assembly. Columns are:

Template: the read ID
Status: this can be:
- SameContig: both halves of the paired end read mapped to (or, for long enough halves, were assembled into) the same contig with a consistent orientation (i.e. the halves ‘point towards each other’ as paired end halves should). These reads have been used to determine the library insert size.
- Link: the reads mapped to different contigs, close enough to the ends of these contigs so that they could be used to link the contigs together into a scaffold.
- OneUnmapped: only one of the halves was mapped, the other not
- BothUnmapped: neither the left nor the right half was mapped
- MultiplyMapped: one or both of the halves mapped to multiple contigs (repeated reads)
- FalsePair: both halves were mapped, but either to the same contig with incorrect orientation, or with a distance between the halves that was outside of the accepted range for the library.

So, of all these status categories, only the ones marked as ‘Link’ were actually used for scaffolding…
Distance:
- for reads that map to the same contig: the distance between the halves
- for reads that Link contigs into scaffolds: the sum of the distances from the position of each half to the end of its contig. So, the total distance between the halves for these pairs would be the distance mentioned in the 454PairStatus.txt file, plus the gap distance between the contigs. This total should then be consistent with the paired end library insert size.
Left Contig, Left Pos, Left Dir: the contig ID, position (of the 5’ end) and orientation (‘+’ or ‘-’ for forward and reverse strand, respectively) of the mapped left half. Left Contig can also be marked as ‘Unmapped’ or ‘Repeat’
Right Contig, Right Pos, Right Dir: similar for the right half. Note that ‘position’ here refers to the position of the 3’ end of the right half.
Left Distance: for reads that ‘Link’ contigs only: the distance from the 5’ end of the left half, to the end of the contig
Right Distance: for reads that ‘Link’ contigs only: the distance from the 3’ end of the right half, to the end of the contig

From this, it follows logically that for reads marked as ‘Link’, the sum of the Left and Right Distance columns is the same as the number listed in the Distance column (column 3). For read FQL5QBG02GRUHY above, for example, 267 + 1099 = 1366.
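
Both distance rules can be checked with a short Python sketch. It assumes tab-separated lines with the eleven columns in the order given above; the function names are my own.

```python
def link_distance_mismatches(lines):
    """Return template IDs of 'Link' pairs where Left + Right Distance != Distance."""
    mismatches = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 11 or fields[1] != "Link":
            continue
        distance, left, right = int(fields[2]), int(fields[9]), int(fields[10])
        if left + right != distance:
            mismatches.append(fields[0])
    return mismatches


def same_contig_distances(lines):
    """Return the Distance values of all 'SameContig' pairs."""
    distances = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) >= 3 and fields[1] == "SameContig":
            distances.append(int(fields[2]))
    return distances
```

The first function should return an empty list for a well-formed file. The second collects the distances of the pairs that, as described above, were used to determine the library insert size, so their mean and spread give a rough picture of the library.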

For pair halves marked as ‘Repeat’, the mapping information is not reported in this file. It is possible to obtain the mapping results by adding the -pair or -pairt flag during assembly. This will result in the 454TagPairAlign.txt file, which describes all alignments of pair halves shorter than 50 bases (these are not assembled, but mapped to contigs afterwards, see my first post). The file can either report all alignments (-pair), or a tabulated summary (-pairt).

4) 454AlignmentInfo.tsv

Position  Consensus  Quality Score  Unique Depth  Align Depth  Signal  StdDeviation
>contig00001 1
1         C          64             24            29           0.99    0.08
2         T          64             24            29           0.94    0.10
3         C          64             24            29           0.91    0.07
4         A          64             24            29           1.93    0.10
5         A          64             24            29           1.93    0.10
6         T          64             24            29           1.03    0.08
7         A          64             23            28           0.95    0.09
8         T          64             23            28           1.93    0.08
9         T          64             23            28           1.93    0.08
10        A          64             22            27           0.99    0.08

This file gives a consensus alignment overview for each position in each contig. Normally, this file is only present in the output when the project contains fewer than 4 million reads and less than 40 Mb of total assembled contig length. For larger assemblies, adding -info to the command line will output this file.

The information for each contig starts with a line giving the contig ID, e.g. >contig00001. The number that follows is always ‘1’ for assemblies (but can be different for mapping projects, perhaps the subject of a future post…)
Columns are:

Position: position in the contig
Consensus: consensus contig nucleotide (base) at this position
Quality Score: consensus contig quality score at this position
Unique Depth: the number of reads that align to (cover) the position, restricted to unique reads only (a significant proportion of 454 reads are duplicates, as a result of two beads being present in the same microreactor during emulsion PCR).
Align Depth: the number of all reads that align to the position (including duplicates)
Signal, StdDeviation: the average flow signal and the corresponding standard deviation for the flows at that position. Note that for stretches of identical bases, these numbers are identical (as 454 sequencing basically reads homopolymer lengths), e.g. see positions 4 and 5.
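
One typical use of this file is computing coverage per contig. Here is a minimal Python sketch that averages the Unique Depth column for each contig. It assumes tab-separated data rows with the seven columns above and contig header lines starting with ‘>’; the function name is hypothetical.

```python
def mean_unique_depth(lines):
    """Return {contig ID: mean Unique Depth} from 454AlignmentInfo.tsv lines."""
    totals = {}  # contig ID -> [sum of Unique Depth, number of positions]
    contig = None
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # Contig header, e.g. '>contig00001' followed by a number.
            contig = line[1:].split()[0]
            totals[contig] = [0, 0]
            continue
        fields = line.split("\t")
        # Skip the column header and anything that is not a position row.
        if contig is None or len(fields) < 5 or not fields[0].isdigit():
            continue
        totals[contig][0] += int(fields[3])  # Unique Depth is the 4th column
        totals[contig][1] += 1
    return {name: total / n for name, (total, n) in totals.items() if n}
```

Swapping fields[3] for fields[4] would give the mean Align Depth (duplicates included) instead.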

In closing, with these last four posts, I have described the most important output files, and the ones that usually are present by default. With a little programming skill, one should be able to distill all the necessary information from a newbler assembly using these files.
