Introduction to the Pileup Format

2016/02/07 Source: 郑泽鑫's blog

The pileup format was first used by Tony Cox and Zemin Ning at the Sanger Institute. It describes the base-pair information at each chromosomal position, showing how all reads covering a region align to the reference. This format facilitates SNP/indel calling and quick inspection of alignments by eye.

The pileup format has several variants. The default output of SAMtools looks like this:

seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
seq1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
seq1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
seq1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
seq1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<
seq1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
seq1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<

where each line consists of six columns:

chromosome
1-based coordinate (the position on the chromosome)
reference base (the base of the reference sequence at this site)
the number of reads covering the site (the depth of coverage)
read bases (how each read covering the site matches the reference there)
base qualities (one character per read base; a Phred quality score from 0 to 93 encoded as ASCII 33 to 126. In raw read data the Phred quality score rarely exceeds 60, but higher scores are possible in assemblies or read maps)
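
The column layout above can be sketched in Python. This is a minimal illustration, not SAMtools code: the names PileupRecord and parse_pileup_line are hypothetical, and it assumes the six whitespace-separated columns shown in the example, with the sixth column holding base qualities encoded as ASCII minus 33.

```python
from typing import List, NamedTuple


class PileupRecord(NamedTuple):
    """One pileup line (hypothetical container, for illustration only)."""
    chrom: str
    pos: int           # 1-based coordinate
    ref: str           # reference base
    depth: int         # number of reads covering the site
    bases: str         # raw read-bases column, markers included
    quals: List[int]   # decoded Phred base qualities


def parse_pileup_line(line: str) -> PileupRecord:
    # Assumes exactly six whitespace-separated columns (tabs or spaces).
    chrom, pos, ref, depth, bases, quals = line.split()
    # Each quality character encodes a Phred score as ASCII code minus 33.
    decoded = [ord(c) - 33 for c in quals]
    return PileupRecord(chrom, int(pos), ref, int(depth), bases, decoded)


rec = parse_pileup_line(
    "seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+"
)
print(rec.chrom, rec.pos, rec.ref, rec.depth)
print(rec.quals[:4])  # first four base qualities
```

For this line the number of decoded qualities equals the depth, which is a cheap sanity check when reading pileup files.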

read bases column

. stands for a match to the reference base on the forward strand

, stands for a match to the reference base on the reverse strand

ACGTN (upper case) stands for a mismatch on the forward strand; the letter is the base actually observed in the read, one of five possibilities

acgtn (lower case) stands for a mismatch on the reverse strand; the letter is the base actually observed in the read, one of five possibilities
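
The four symbol classes above can be turned into observed bases and strands with a short Python sketch. The function name observed_bases is hypothetical, and the sketch deliberately assumes a read-bases string that contains only . , and ACGTN/acgtn (no ^, $, or indel patterns), such as the one at position 277 in the example output.

```python
from collections import Counter
from typing import List, Tuple


def observed_bases(ref: str, bases: str) -> List[Tuple[str, str]]:
    """Return (base, strand) per read; strand is '+' or '-'.

    Assumption: `bases` holds only . , ACGTN acgtn. Any other
    character raises ValueError.
    """
    out = []
    for c in bases:
        if c == ".":
            out.append((ref.upper(), "+"))   # match, forward strand
        elif c == ",":
            out.append((ref.upper(), "-"))   # match, reverse strand
        elif c in "ACGTN":
            out.append((c, "+"))             # mismatch, forward strand
        elif c in "acgtn":
            out.append((c.upper(), "-"))     # mismatch, reverse strand
        else:
            raise ValueError(f"unsupported symbol: {c!r}")
    return out


# Position 277 from the example: reference base T, depth 22.
calls = observed_bases("T", "....,,.,.,.C.,,,.,..G.")
base_counts = Counter(base for base, _ in calls)
strand_counts = Counter(strand for _, strand in calls)
print(base_counts)
print(strand_counts)
```

Tallies like these are the raw material for SNP calling: a site where a non-reference base is seen on both strands is more convincing than one where it appears on a single strand.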

A pattern \+[0-9]+[ACGTNacgtn]+ indicates there is an insertion between this reference position and the next reference position. The length of the insertion is given by the integer in the pattern, followed by the inserted sequence.
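- Example: seq2 156 A 11 .$......+2AG.+2AG.+2AGGG <975;:<<<<< Here +2AG occurs 3 times, meaning that 3 reads carry a 2-bp insertion of AG (the trailing GG are two ordinary mismatch bases, not part of the last insertion).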
Similarly, a pattern -[0-9]+[ACGTNacgtn]+ represents a deletion from the reference.
- Example: seq3 200 A 20 ,,,,,..,.-4CACC.-4CACC....,.,,.^~. ==<<<<<<<<<<<::<;2<< Here -4CACC occurs 2 times, meaning that 2 reads carry a 4-bp deletion of CACC.
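
The two indel patterns can be extracted with a small Python sketch. The function name find_indels is hypothetical. Note one design choice: rather than matching [ACGTNacgtn]+ greedily, the sketch reads the integer and then takes exactly that many characters, because an indel may be followed directly by mismatch bases (as in +2AGGG above). It also removes read-start markers first, since the character after ^ can itself be + or -.

```python
import re
from typing import List, Tuple

# Sign and length of an indel; the sequence is sliced by length afterwards.
INDEL_HEAD = re.compile(r"([+-])(\d+)")


def find_indels(bases: str) -> List[Tuple[str, str]]:
    """Return (sign, sequence) for every indel in a read-bases string."""
    # Drop "^" plus the mapping-quality character that follows it.
    cleaned = re.sub(r"\^.", "", bases)
    indels = []
    i = 0
    while i < len(cleaned):
        m = INDEL_HEAD.match(cleaned, i)
        if m:
            length = int(m.group(2))
            seq = cleaned[m.end():m.end() + length]
            indels.append((m.group(1), seq))
            i = m.end() + length   # skip past the indel sequence
        else:
            i += 1
    return indels


print(find_indels(".$......+2AG.+2AG.+2AGGG"))
print(find_indels(",,,,,..,.-4CACC.-4CACC....,.,,.^~."))
```
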
A symbol ^ marks the start of a read segment, that is, a contiguous subsequence of the read separated by N/S/H CIGAR operations. In other words, the read begins at exactly this position.

The ASCII code of the character following ^, minus 33, gives the mapping quality of that read.
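
As a quick illustration of this rule, the following Python sketch pulls out the mapping quality of every read that starts at a site. The function name read_start_mapqs is hypothetical; the input strings are taken from the example output above.

```python
import re
from typing import List


def read_start_mapqs(bases: str) -> List[int]:
    """Mapping quality of each read that starts at this site."""
    # "^" is always followed by exactly one mapping-quality character.
    return [ord(m.group(1)) - 33 for m in re.finditer(r"\^(.)", bases)]


print(read_start_mapqs(",.$.....,,.,.,...,,,.,..^+."))  # position 272
print(read_start_mapqs(",$....,,.,.,...,,,.,...^l."))   # position 275
```

For position 272 the character after ^ is +, whose ASCII code is 43, so the mapping quality is 43 - 33 = 10.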

A symbol $ marks the end of a read segment, that is, the read ends at exactly this position.
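
Putting the rules together, the read-bases column can be split into one entry per read. The following Python sketch is only an illustration under the rules described in this post; the function name split_reads is hypothetical. It attaches ^ and its mapping-quality character to the base that follows, and attaches $ and indel patterns to the base that precedes them, so the number of entries should equal the depth column.

```python
from typing import List


def split_reads(bases: str) -> List[str]:
    """Split a read-bases string into one annotated entry per read."""
    reads: List[str] = []
    i, n = 0, len(bases)
    while i < n:
        c = bases[i]
        if c == "^":
            # "^" + mapping-quality char + the first base of the new read.
            reads.append(bases[i:i + 3])
            i += 3
        elif c == "$":
            reads[-1] += "$"           # the previous read ends here
            i += 1
        elif c in "+-":
            j = i + 1
            while j < n and bases[j].isdigit():
                j += 1
            length = int(bases[i + 1:j])
            end = j + length           # take exactly `length` bases
            reads[-1] += bases[i:end]  # indel belongs to the previous read
            i = end
        else:
            reads.append(c)            # ordinary match or mismatch symbol
            i += 1
    return reads


entries = split_reads(".$......+2AG.+2AG.+2AGGG")
print(len(entries))
print(entries)
```

Comparing len(entries) with the depth column (11 for this seq2 line) is a simple way to check that a line was tokenized correctly.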
