In the post on what is new in newbler version 2.6, I introduced the -scaffold option. Briefly, with this option instances (i.e. the consensus sequence) of repeats are placed in gaps. As I mentioned, setting -scaffold results in two extra files. In this post, I will explain these in detail.
454ContigScaffolds.txt and its relation to the 454Scaffolds.txt file
Both these files are in the AGP format; see my earlier post on the 454Scaffolds.txt file. The examples for this post are based on a bacterial genome data set (shotgun and paired-end 454 reads), assembled with newbler 2.6 using the -scaffold flag.
With the -scaffold flag, the 454Scaffolds.txt file looks different from the one produced by an assembly without it:
scaffold00001 1 4543 1 W sctg_0001_0001 1 4543 +
scaffold00001 4544 5465 2 N 922 fragment yes
scaffold00001 5466 6758 3 W sctg_0001_0002 1 1293 +
scaffold00001 6759 6868 4 N 110 fragment yes
scaffold00001 6869 75179 5 W sctg_0001_0003 1 68311 +
scaffold00001 75180 75497 6 N 318 fragment yes
scaffold00001 75498 91133 7 W sctg_0001_0004 1 15636 +
scaffold00001 91134 91476 8 N 343 fragment yes
scaffold00001 91477 151573 9 W sctg_0001_0005 1 60097 +
scaffold00001 151574 154675 10 N 3102 fragment yes
scaffold00001 154676 220143 11 W sctg_0001_0006 1 65468 +
scaffold00001 220144 220163 12 N 20 fragment yes
scaffold00001 220164 221487 13 W sctg_0001_0007 1 1324 +
scaffold00001 221488 222941 14 N 1454 fragment yes
Instead of ‘contigXXXXX’ in the 6th column, there are names of the form sctg_XXXX_YYYY. ‘sctg’ stands for ‘ScaffoldContig’, see below. ‘sctg_0001’ stands for scaffold 1, while the following ‘_0001’ stands for the first contig in this scaffold. So, the 20th contig in scaffold 13 would be sctg_0013_0020. The 454Scaffolds.txt file has one line per ScaffoldContig, followed by one line for a gap.
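The naming scheme can be illustrated with a short Python sketch. The function names sctg_name and parse_sctg_name are made up for this example and are not part of newbler; the four-digit zero padding is assumed from the names shown in the listing above.

```python
import re

def sctg_name(scaffold: int, position: int) -> str:
    """Build a ScaffoldContig name from a scaffold number and the
    position of the contig within that scaffold (both 1-based).
    Four-digit zero padding is assumed from the example listing."""
    return f"sctg_{scaffold:04d}_{position:04d}"

def parse_sctg_name(name: str) -> tuple:
    """Split a ScaffoldContig name into (scaffold number, position)."""
    match = re.fullmatch(r"sctg_(\d+)_(\d+)", name)
    if match is None:
        raise ValueError(f"not a ScaffoldContig name: {name!r}")
    return int(match.group(1)), int(match.group(2))

print(sctg_name(13, 20))
print(parse_sctg_name("sctg_0001_0005"))
```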
In the new 454ContigScaffolds.txt file, the corresponding region of scaffold 1 looks like this:
scaffold00001 1 4543 1 W contig00001 1 4543 +
scaffold00001 4544 5465 2 N 922 fragment yes
scaffold00001 5466 6758 3 W contig00002 1 1293 +
scaffold00001 6759 6868 4 N 110 fragment yes
scaffold00001 6869 75179 5 W contig00003 1 68311 +
scaffold00001 75180 75497 6 N 318 fragment yes
scaffold00001 75498 91133 7 W contig00004 1 15636 +
scaffold00001 91134 91476 8 N 343 fragment yes
scaffold00001 91477 117498 9 W contig00005 1 26022 +
scaffold00001 117499 117527 10 W contig00006 1 29 +
scaffold00001 117528 117914 11 W contig00007 1 387 +
scaffold00001 117915 117970 12 W contig00008 1 56 +
scaffold00001 117971 118037 13 W contig00009 1 67 +
scaffold00001 118038 149720 14 W contig00010 1 31683 +
scaffold00001 149721 151573 15 W contig00011 1 1853 +
scaffold00001 151574 154675 16 N 3102 fragment yes
scaffold00001 154676 158800 17 W contig00012 1 4125 +
scaffold00001 158801 158926 18 W contig00013 1 126 +
scaffold00001 158927 158951 19 W contig00014 1 25 +
scaffold00001 158952 159192 20 W contig00015 1 241 +
scaffold00001 159193 159225 21 W contig00016 1 33 +
scaffold00001 159226 159843 22 W contig00017 1 618 +
scaffold00001 159844 159969 23 W contig00013 1 126 +
scaffold00001 159970 159994 24 W contig00014 1 25 +
scaffold00001 159995 160235 25 W contig00015 1 241 +
scaffold00001 160236 160268 26 W contig00016 1 33 +
scaffold00001 160269 206731 27 W contig00018 1 46463 +
scaffold00001 206732 207126 28 W contig00019 1 395 +
scaffold00001 207127 207156 29 W contig00020 1 30 +
scaffold00001 207157 220143 30 W contig00021 1 12987 +
scaffold00001 220144 220163 31 N 20 fragment yes
scaffold00001 220164 221487 32 W contig00022 1 1324 +
scaffold00001 221488 222941 33 N 1454 fragment yes
Note how there are many contigs between gaps! A careful comparison tells us that:
sctg_0001_0001 is contig00001 (4543 bp)
sctg_0001_0002 is contig00002 (1293 bp)
sctg_0001_0003 is contig00003 (68311 bp)
sctg_0001_0004 is contig00004 (15636 bp)
sctg_0001_0005 consists of contig00005 – contig00011 (these contigs are 60097 bp all together)
sctg_0001_0006 consists of contig00012 – contig00021 (these contigs are 65468 bp all together)
sctg_0001_0007 is contig00022 (1324 bp)
This shows how the -scaffold option works: repeat contigs are placed in gaps, and so-called ‘ScaffoldContigs’ are formed by concatenating the contigs that are now next to each other without gaps in between. The 454ContigScaffolds.txt file shows which contigs are placed where, while the 454Scaffolds.txt file shows the scaffolds as they are built up out of ScaffoldContigs.
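The ‘careful comparison’ can also be done with a few lines of Python. This is only a sketch of one way to do it, not how newbler itself works: it groups the consecutive contig (W) lines of a 454ContigScaffolds.txt file that lie between gap lines, and sums their lengths. The function name group_scaffold_contigs is my own, and I assume whitespace-separated AGP columns as in the listings in this post.

```python
def group_scaffold_contigs(agp_lines):
    """Group consecutive contig (W) lines that are not separated by a gap.

    Returns a list of (scaffold, [(contig, length), ...]) tuples, one per
    run of adjacent contigs, i.e. one per presumed ScaffoldContig.
    """
    groups = []
    current = []
    current_scaffold = None
    for line in agp_lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        scaffold, kind = fields[0], fields[4]
        # A gap line or a new scaffold closes the current run of contigs.
        if scaffold != current_scaffold or kind != "W":
            if current:
                groups.append((current_scaffold, current))
            current = []
            current_scaffold = scaffold
        if kind == "W":
            length = int(fields[7]) - int(fields[6]) + 1
            current.append((fields[5], length))
    if current:
        groups.append((current_scaffold, current))
    return groups

# Lines taken from the 454ContigScaffolds.txt listing in this post.
example = """\
scaffold00001 91134 91476 8 N 343 fragment yes
scaffold00001 91477 117498 9 W contig00005 1 26022 +
scaffold00001 117499 117527 10 W contig00006 1 29 +
scaffold00001 117528 117914 11 W contig00007 1 387 +
scaffold00001 117915 117970 12 W contig00008 1 56 +
scaffold00001 117971 118037 13 W contig00009 1 67 +
scaffold00001 118038 149720 14 W contig00010 1 31683 +
scaffold00001 149721 151573 15 W contig00011 1 1853 +
scaffold00001 151574 154675 16 N 3102 fragment yes
""".splitlines()

groups = group_scaffold_contigs(example)
for scaffold, contigs in groups:
    names = [name for name, _ in contigs]
    total = sum(length for _, length in contigs)
    print(scaffold, names[0], "to", names[-1], total, "bp")
```

The summed length of each group can then be matched against the ScaffoldContig lengths in the 454Scaffolds.txt file.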
If we now add the per-contig depth (from the 454ContigGraph.txt file) to the contigs that make up the ScaffoldContigs, we get:
For sctg_0001_0005:
contig length depth
contig00005 26022 39.3
contig00006 29 267.6
contig00007 387 352.3
contig00008 56 272.8
contig00009 67 203.2
contig00010 31683 41.0
contig00011 1853 26.0
So, we have a long 26 kb contig of ‘normal’ depth (about 40x), followed by four short contigs of quite high depth (203-352x), and then another long contig of almost 32 kb, again of ‘normal’ depth. This looks like four repeat contigs in between long single-copy contigs. Finally, there is a 1.9 kb contig of somewhat lower depth, which I cannot really explain…
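The same reasoning can be written down as a rough Python sketch. The cut-offs are my own assumptions for illustration, not values used by newbler: contigs of at least 1000 bp are taken as the ‘normal’ depth reference, and anything at twice their median depth or more is flagged as a possible repeat.

```python
from statistics import median

def flag_high_depth(contigs, min_long=1000, ratio=2.0):
    """Flag contigs whose depth is much higher than that of the long contigs.

    contigs  : list of (name, length, depth) tuples
    min_long : minimum length for a contig to count towards the baseline
               (assumed value; at least one such contig must be present)
    ratio    : depth / baseline ratio at or above which a contig is flagged
               (assumed value)
    Returns (baseline depth, list of flagged contig names).
    """
    long_depths = [depth for _, length, depth in contigs if length >= min_long]
    baseline = median(long_depths)
    flagged = [name for name, _, depth in contigs if depth >= ratio * baseline]
    return baseline, flagged

# Lengths and depths for sctg_0001_0005, copied from the table above.
sctg_0001_0005 = [
    ("contig00005", 26022, 39.3),
    ("contig00006", 29, 267.6),
    ("contig00007", 387, 352.3),
    ("contig00008", 56, 272.8),
    ("contig00009", 67, 203.2),
    ("contig00010", 31683, 41.0),
    ("contig00011", 1853, 26.0),
]

baseline, flagged = flag_high_depth(sctg_0001_0005)
print("baseline depth:", baseline)
print("possible repeats:", flagged)
```

With these settings the four short contigs are flagged, while contig00011 with its lower depth is not.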
For sctg_0001_0006:
contig length depth
contig00012 4125 34.1
contig00013 126 75.0
contig00014 25 203.8
contig00015 241 119.3
contig00016 33 79.2
contig00017 618 42.3
contig00013 126 75.0
contig00014 25 203.8
contig00015 241 119.3
contig00016 33 79.2
contig00018 46463 38.8
contig00019 395 180.6
contig00020 30 116.3
contig00021 12987 37.4
Here, there are four long contigs of 4 kb, 0.6 kb, 46.5 kb and 13 kb, all of ‘normal’ depth (34-42x), with shorter contigs in between, most of them with high depth (75-204x). Unsurprisingly, a quick blast identified contig00013 and contig00015 as being part of putative transposases, proteins known to be present in multiple copies in bacterial genomes…
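In the table above, contig00013 through contig00016 each appear twice within sctg_0001_0006, so the same repeat instances have been placed at two positions. A small Python sketch to list such contigs, with the placement order copied from the table (the variable names are just for this example):

```python
from collections import Counter

# Contigs of sctg_0001_0006 in the order they are placed in the scaffold.
placements = [
    "contig00012", "contig00013", "contig00014", "contig00015",
    "contig00016", "contig00017", "contig00013", "contig00014",
    "contig00015", "contig00016", "contig00018", "contig00019",
    "contig00020", "contig00021",
]

# Count how often each contig is placed and keep those placed more than once.
counts = Counter(placements)
placed_more_than_once = {name: n for name, n in counts.items() if n > 1}

for name, n in sorted(placed_more_than_once.items()):
    print(name, "placed", n, "times")
```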
454ScaffoldContigs.fna and .qual files
These simply list the sequences and quality scores of the ScaffoldContigs as listed in the 454Scaffolds.txt file.
In conclusion, 454 has tried to offer more complete scaffolds by placing repeats in gaps where possible.