How the Different Human Genome Assembly Versions Correspond

2014/10/15 Source: 生信菜鸟团

First, how the NCBI assembly names map to UCSC names and to ENSEMBL releases:

NCBI36 (hg18): ENSEMBL release_52.
GRCh37 (hg19): ENSEMBL release_59/61/64/68/69/75.
GRCh38 (hg38): ENSEMBL release_76/77/78/80/81/82.

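As you can see, ENSEMBL's versioning is especially complicated and very easy to mix up!

UCSC's naming, by contrast, is simple: just hg18, hg19, and hg38. hg19 is the most commonly used, but I recommend that everyone move to hg38.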

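NCBI looks simple too, with just builds 36, 37, and 38, but there are also 37.1, 37.2, 37.3, and so on. These point versions generally indicate that the annotation was updated; the genome sequence itself generally does not change.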
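
The correspondence above can be written down as a small lookup, which helps keep scripts from mixing names. This is a minimal sketch: `ucsc_to_ncbi` is a hypothetical helper name, not part of any tool, and hg18 is mapped to NCBI36, the name NCBI used for build 36.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a UCSC assembly name to the NCBI/GRC assembly name.
# Only the three human assemblies discussed in this post are covered.
ucsc_to_ncbi() {
    case "$1" in
        hg18) echo "NCBI36" ;;
        hg19) echo "GRCh37" ;;
        hg38) echo "GRCh38" ;;
        *)    echo "unknown assembly: $1" >&2; return 1 ;;
    esac
}

ucsc_to_ncbi hg19   # prints GRCh37
```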

If you want to download a GTF annotation file, the genome version matters a great deal.

For NCBI: ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/GFF/ ## latest version (hg38)

ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/ARCHIVE/ ## other versions

For Ensembl:

ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz

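Change the release number in the middle of the URL to get any other version (the assembly name in the file name has to match that release). All releases are listed at ftp://ftp.ensembl.org/pub/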
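
As a rough sketch of that substitution, the URL can be assembled from the release number and the assembly name. `ensembl_gtf_url` is a hypothetical helper name, and the pattern is simply copied from the release-75 example above; it is an assumption that a given release follows the same layout, so check the FTP listing if a download fails.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build an Ensembl human GTF URL from a release number
# and the assembly name used by that release (for example 75 and GRCh37).
# The pattern mirrors the release-75 URL quoted in this post.
ensembl_gtf_url() {
    local release="$1" assembly="$2"
    echo "ftp://ftp.ensembl.org/pub/release-${release}/gtf/homo_sapiens/Homo_sapiens.${assembly}.${release}.gtf.gz"
}

ensembl_gtf_url 75 GRCh37
```

The function only prints the URL; to download, pass it to wget, for example `wget "$(ensembl_gtf_url 82 GRCh38)"`.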

For UCSC, it is a bit more work.

You need to set a series of options in the Table Browser:

1. Navigate to http://genome.ucsc.edu/cgi-bin/hgTables
2. Select the following options:
clade: Mammal
genome: Human
assembly: Feb. 2009 (GRCh37/hg19)
group: Genes and Gene Predictions
track: UCSC Genes
table: knownGene
region: Select "genome" for the entire genome.
output format: GTF - gene transfer format
output file: enter a file name to save your results to a file, or leave blank to display results in the browser
3. Click 'get output'.

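Now for the important part: once the version relationships are clear, it is time to download!

Downloading from UCSC is very convenient. Just build the URL from the assembly's short name: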

http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz
http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/chromFa.tar.gz
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/chromFa.tar.gz
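
The URL pattern above can be sketched as a one-line function. `ucsc_chromfa_url` is a hypothetical name, and the sketch assumes every assembly uses the same `chromFa.tar.gz` file name as in the list above; some bigZips directories may name the archive differently (for instance with the assembly name as a prefix), so check the directory listing if a download fails.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the bigZips chromFa URL for a UCSC assembly name.
# Assumes the archive is called chromFa.tar.gz, as in the URLs listed above.
ucsc_chromfa_url() {
    local assembly="$1"
    echo "http://hgdownload.cse.ucsc.edu/goldenPath/${assembly}/bigZips/chromFa.tar.gz"
}

# Print the URL for each assembly mentioned in this post.
for a in mm10 mm9 hg19 hg38; do
    ucsc_chromfa_url "$a"
done
```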

Or use a shell script to download specific chromosomes:

for i in $(seq 1 22) X Y M;
do echo $i;
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr${i}.fa.gz;
## NCBI can be used here as well: ftp://ftp.ncbi.nih.gov/genomes/M_musculus/ARCHIVE/MGSCv3_Release3/Assembled_Chromosomes/ (files with the chr prefix)
done
gunzip *.gz
## concatenate the chromosomes in a fixed order
for i in $(seq 1 22) X Y M;
do cat chr${i}.fa >> hg19.fasta;
done
## remove the per-chromosome files (they end in .fa, not .fasta)
rm -f chr*.fa
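
A quick sanity check on the merged file is to count its FASTA header lines. This is a minimal sketch with a hypothetical helper name, `count_fasta_records`; it assumes each per-chromosome file holds exactly one sequence, in which case the merged hg19.fasta should report 25 records (chromosomes 1 to 22 plus X, Y, and M).

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the sequences in a FASTA file by counting
# the header lines, which start with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}

# Usage, once hg19.fasta has been built:
#   count_fasta_records hg19.fasta
```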
