Common Terms for Sequences and Sequence Alignment

2011/12/02

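Some things you recognize on sight, yet when asked to explain them you find you have little to say; others you think you know, without realizing that you understand only part of them. Many concepts look like a single word, but behind each one sits a whole domain model, and each concept tells a different story in a different context. I have seen many explanations, and they differ every time; after each round of comparison and reflection I discover how shallow my earlier understanding was. Below is the list I have collected so far, with the definitions given in English and my own understanding of each term.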

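Domain (conserved domain): Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn²⁺-binding or Ca²⁺-binding domains the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.
A structural domain is a clearly distinguishable, relatively independent folding unit within a protein's tertiary structure, at a level of organization between secondary and tertiary structure. Each domain forms a compact three-dimensional structure of its own and can exist or fold independently, while the connections between domains are comparatively loose. A structural or functional domain usually consists of 25 to 300 amino acid residues. The number of domains differs from protein to protein, and the domains within one protein may resemble one another or differ. Domains are the functional, structural, and evolutionary units of proteins, and domain analysis plays an important role in classifying and predicting protein structures.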
Bits scores: Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a “random” model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
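The definition above can be sketched in a few lines. This is only an illustration of the log-odds idea, not how HMMer or BLAST compute scores internally; the function name `bits_score` and the two likelihood values are hypothetical.

```python
import math


def bits_score(likelihood_homolog: float, likelihood_random: float) -> float:
    """Return log2 of the likelihood ratio, i.e. a log-odds score in bits."""
    if likelihood_homolog <= 0 or likelihood_random <= 0:
        raise ValueError("likelihoods must be positive")
    return math.log2(likelihood_homolog / likelihood_random)


# Hypothetical likelihoods of one sequence under the two models.
score = bits_score(1e-20, 1e-26)
print(f"{score:.2f} bits")
```

Each additional bit means the homology model explains the sequence twice as well as the random model, which is why bits scores from different searches are easier to compare than raw scores.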
P-value: This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
E-value: This represents the number of sequences with a score greater-than, or equal to, X, expected absolutely by chance. The E-value connects the score (“X”) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores that would be expected from a search of a random sequence database of equivalent size. Since version 2.0 E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
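To make the relationship between score, E-value and P-value concrete, here is a minimal sketch. It assumes the Karlin-Altschul statistics used by BLAST, where E = m · n · 2^(−S′) for a bit score S′ and search-space lengths m and n, and P = 1 − e^(−E). The function names and the numbers are hypothetical, and HMM-based E-values are estimated differently.

```python
import math


def evalue_from_bits(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits scoring >= bit_score (assumed BLAST-style)."""
    return query_len * db_len * 2.0 ** (-bit_score)


def pvalue_from_evalue(evalue: float) -> float:
    """Probability of at least one chance hit, P = 1 - exp(-E)."""
    return -math.expm1(-evalue)  # numerically stable for very small E


# Hypothetical search: a 250-residue query against 50 million database residues.
e = evalue_from_bits(50.0, 250, 50_000_000)
p = pvalue_from_evalue(e)
print(f"E = {e:.3g}, P = {p:.3g}")
```

For small values the two numbers are nearly identical (P ≈ E), so the distinction only matters for weak hits with E close to or above 1.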
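Motif: Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
A motif is a locally conserved region within a sequence, or a short sequence pattern shared by a group of sequences. The term generally refers to the basic structure making up any characteristic sequence, but in most cases it means any sequence pattern that may be associated with molecular function, structural properties, or family membership. As subunits of domains, motifs carry out the domains' various biological functions. There are more than 28 classes of common protein structural motifs. Motif searching mainly relies on two approaches: sequence patterns (Pattern) and sequence profiles (Profile).
Pattern: The sequence-pattern approach searches directly for a few key conserved residues and ignores amino acid variability at the other positions. For example, "L-x(6)-L-x(6)-L-x(6)-L" (x stands for any amino acid) is the sequence pattern of the leucine zipper. Such a stretch usually lies in an active region or an important structural region of the protein, is fairly conserved, and is one of the targets of motif searching. Because the pattern approach does not search for the features of a complete domain or of the whole protein, it is suited to identifying conserved functional regions but cannot reliably identify functional regions with high sequence variation. In addition, short patterns can also occur in random amino acid sequences, so false positives are common. For this kind of search it is necessary to query several different databases and collect as many homologous sequences as possible, in order to better interpret the information contained in the sequence.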
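The leucine zipper pattern quoted above maps directly onto a regular expression. The sketch below handles only the small subset of pattern syntax needed here (single residues, `x`, and a repeat count in parentheses); the helper names and the test sequence are made up for illustration.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple pattern such as 'L-x(6)-L' into a regex string."""
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(r"([A-Za-z])(?:\((\d+)\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, count = m.groups()
        token = "." if residue == "x" else residue.upper()  # x = any amino acid
        parts.append(token + (f"{{{count}}}" if count else ""))
    return "".join(parts)


def find_motif(sequence: str, pattern: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched text) for every hit, overlaps included."""
    regex = re.compile(f"(?=({prosite_to_regex(pattern)}))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence.upper())]


LEUCINE_ZIPPER = "L-x(6)-L-x(6)-L-x(6)-L"
# Made-up sequence with a leucine at every seventh position.
toy_sequence = "MK" + "LAAAAAA" * 3 + "L" + "GG"
print(prosite_to_regex(LEUCINE_ZIPPER))
print(find_motif(toy_sequence, LEUCINE_ZIPPER))
```

A hit from such a search only says that the spacing of leucines is present. As noted above, short patterns also turn up by chance, so a match should be treated as a candidate rather than as evidence of a functional zipper.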
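Profile: A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref.: [1], [2], [3]). In CLUSTAL-W-derived profiles those sequences that are more distantly related are assigned higher weights ([4], [5], [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
Profile searching is based on the conserved regions in a multiple alignment of protein sequences. Because it takes into account position-specific weights for amino acids with different degrees of conservation, it can detect relationships between evolutionarily distant proteins with greater sensitivity than the pattern approach. However, the number of reliable profiles is often limited, which considerably hampers the method when predicting the function of new genes.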
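The idea of position-specific scores can be sketched with a toy scoring matrix. The example below is a deliberate simplification: it builds log-odds scores from an ungapped alignment with a uniform background and pseudocounts, and it leaves out the gap penalties and sequence weights that real profiles include. The alignment, the function names and the pseudocount value are all hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Build one dict of log2-odds scores per column of an ungapped alignment."""
    width = len(aligned[0])
    if any(len(seq) != width for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(width):
        counts = Counter(seq[col] for seq in aligned)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def best_window(pssm: list[dict[str, float]], sequence: str) -> tuple[float, int]:
    """Slide the profile along the sequence; return (best score, 0-based start)."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the profile")
    return max(
        (sum(pssm[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    )


# Hypothetical four-column family alignment.
family = ["LKEL", "LREL", "LKDL", "LKEL"]
score, start = best_window(build_pssm(family), "GGGLKELGGG")
print(f"best window starts at {start} with score {score:.2f}")
```

Unlike a fixed pattern, the profile still gives partial credit to a window such as "LRDL" that never appears in the alignment as a whole, which is the source of its extra sensitivity. The sketch only accepts the 20 standard residue letters.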
Alignment (multiple alignment, sequence comparison): Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions.
