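Working out an unknown genome (de novo sequencing) takes three steps: sequencing, assembly, and annotation. Sequencers and assembly software have been covered at length, but articles on genome annotation are rare. A recent review in Nature Reviews Genetics, "A beginner's guide to eukaryotic genome annotation", explains in detail how to annotate a genome and is an excellent introduction.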
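Once the genome is assembled, the usual first step is to detect and annotate repetitive sequences and then mask them. Protein-coding genes are predicted next (sometimes non-coding RNAs as well), and the last step is integration. Predictions made with different methods and different sources of evidence will disagree, so integration has to weigh prediction errors and alternative splicing to arrive at a reliable annotation. This step calls for manual inspection, gene by gene.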
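To make the masking step concrete, here is a minimal Python sketch. It assumes the repeat coordinates have already been produced by some repeat finder; the function name `mask_repeats`, the example contig, and the interval are all made up for illustration. Soft masking lowercases the repeat bases, while hard masking replaces them with `N`.

```python
def mask_repeats(seq, repeats, soft=True):
    """Mask repeat intervals (0-based, half-open) in a DNA sequence.

    soft=True lowercases repeat bases; soft=False replaces them with 'N'.
    """
    masked = list(seq.upper())
    for start, end in repeats:
        # Clamp each interval to the sequence bounds.
        start, end = max(start, 0), min(end, len(masked))
        for i in range(start, end):
            masked[i] = masked[i].lower() if soft else "N"
    return "".join(masked)


contig = "ATGCGTATATATATATGCCGTA"   # hypothetical contig
repeat_intervals = [(5, 15)]        # hypothetical output of a repeat finder

print(mask_repeats(contig, repeat_intervals))              # soft-masked
print(mask_repeats(contig, repeat_intervals, soft=False))  # hard-masked
```

Soft masking keeps the underlying bases, so a downstream gene predictor can still extend into a repeat if needed. Hard masking discards them.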
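Many programs are available for annotation (see the list in the article). They fall into two main approaches: ab initio prediction and evidence-driven prediction.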
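RNA-seq is now a mature technology, and RNA-seq data are usually generated alongside the genome sequence. These data can be used to analyze gene expression, and they are also an excellent source of evidence for gene annotation.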
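One simple way to picture RNA-seq as annotation evidence is to check whether the introns of a predicted gene model are confirmed by splice junctions seen in the reads. The sketch below is only an illustration under that assumption, not the method of any particular tool. The function `intron_support`, the coordinates, and the junction set are hypothetical.

```python
def intron_support(exons, junctions):
    """Fraction of a gene model's introns confirmed by RNA-seq junctions.

    exons: list of (start, end) exon coordinates, 0-based half-open
    junctions: set of (intron_start, intron_end) from spliced alignments
    Returns None for single-exon models, which have no introns to check.
    """
    exons = sorted(exons)
    # An intron spans from the end of one exon to the start of the next.
    introns = [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]
    if not introns:
        return None
    supported = sum(1 for intron in introns if intron in junctions)
    return supported / len(introns)


# Hypothetical gene model and junctions, for illustration only.
model = [(100, 200), (300, 400), (500, 600)]
rnaseq_junctions = {(200, 300), (500, 550)}
print(intron_support(model, rnaseq_junctions))
```

A model with low support is a candidate for manual inspection. It may be a prediction error, or an isoform that is not expressed in the sampled tissue.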
A beginner's guide to eukaryotic genome annotation
Mark Yandell & Daniel Ence
Nature Reviews Genetics 13, 329-342 (May 2012) | doi:10.1038/nrg3174
The falling cost of genome sequencing is having a marked impact on the research community with respect to which genomes are sequenced and how and where they are annotated. Genome annotation projects have generally become small-scale affairs that are often carried out by an individual laboratory. Although annotating a eukaryotic genome assembly is now within the reach of non-experts, it remains a challenging task. Here we provide an overview of the genome annotation process and the available tools and describe some best-practice approaches.
Full text: http://www.nature.com/nrg/journal/v13/n5/full/nrg3174.html