使用PHI-base数据库进行致病基因分析

2015/06/28来源：陈连福的生信博客1 4,806

1. PHI-base数据库简介

PHI-base数据库从文献中收集经过实验验证了的致病基因和效应基因的序列。目前（20190411）数据库版本为4.6版本，从3011篇文献中收集了263种致病菌（细菌、真菌、原生动物和线虫）的6438个基因和194种宿主（植物占~70%、脊椎动物、昆虫、线虫和真菌）的11340种相互关系，其中包含510中疾病。PHI-base将收集到的参考文献信息、基因信息、病原和宿主信息、疾病信息、表型和相互关系等记录到数据库中，并提供关键词进行搜索。同时，PHI-base还提供BLAST网页工具，用于将序列比对到数据库中，从而根据BLAST结果进行致病基因分析。此外，PHI-base也提供了数据库下载，能更方便地进行本地化批量注释。

PHI-base将数据库中基因的表型分为多类：

1. Loss of pathogenicity: 导致病原菌致病能力丧失的基因。
2. Reduced virulence： 导致病原菌致病能力减弱的基因。
3. Unaffected pathogenicity： 对病原菌致病性没有影响的基因。
4. Increased pathogenicity (Hypervirulence)： 导致病原致病能力增强的基因。
5. Effector (plant avirulence determinant)：致病效应基因。效应基因能直接或间接地让宿主识别病原，并激活植物防御反应机制，导致致病失败。此外，少数效应基因是一些易感宿主致病所必须的基因，大部分易感宿主致病不需要效应基因。
6. Enhanced antagonism: 导致植物内生菌表型出症状，病原菌在和宿主互作中占优势。
7. Lethal：致死因子。缺少致死基因或减少其表达能导致病原菌无法存活。这类基因的产物是生命所必不可少的。注意该类基因不是直接的致病基因，仅是保证病原生存的基因，和增强病原菌的致病能力没有关系。 
8. chemistry_target resistance to chemical: 抗药基因。
9. chemistry_target sensitivity to chemical: 感药基因。

对全基因组基因型PHI注释，大部分基因属于1~3类。属于第4类Increased pathogenicity (Hypervirulence)类型的基因才是关键致病基因；属于5类Effector类型和第6类Enhanced antagonism基因和致病也比较相关。

2. 下载并部署PHI-base数据库

貌似目前（201904）该数据库不能提供下载。我保留了之前4.5版本的文件并进行了BLAST数据创建：

$ mkdir /opt/biosoft/PHI-base4.5
$ cd /opt/biosoft/PHI-base4.5
# 数据库文件phi45.fas和phibase45.csv
$ grep -v -P "^#" phi45.fas > phi45.fasta
    .fas文件首行含有#符号，需要删除，才能构建BLAST数据库。
$ /opt/biosoft/ncbi-blast-2.2.28+/bin/makeblastdb -in phi45.fasta -dbtype prot -title protein -parse_seqids -out phi -logfile phi.protein.log
    .fasta文件头部的序列名超过了50个字符长度，新版本的makeblastdb不支持。
$ cp phi* /opt/biosoft/wwwblast/db/

3. 使用BLASTP进行PHI注释，并编写程序提炼结果

对全基因组蛋白序列和PHI数据库进行比对后，一条序列可能有多个Hits结果，而每个Hits结果可能有多种表型。为了得到更准确的PHI注释结果，推荐按如下方法进一步分析：首先对BLAST结果设定严格的evalue、identity和coverage阈值进行过滤；然后对每个基因BLAST结果中各种PHI表型的次数进行排序；最后选次数最高、BLAST比对得分最大的表型作为基因的注释结果，可以得到一个基因对应一个表型的结果；或者在前者基础上，额外增加次数出现也较多（出现的次数不低于最优表型次数的60%）的表型，可以得到一个基因对应多个表型的结果。

$ blast.pl --CPU 40 /opt/biosoft/wwwblast/db/phi proteins.fasta > phi.xml
$ parsing_blast_result.pl --evalue 1e-5 --identity 0.3 --subject-coverage 0.3 --query-coverage 0.3 --HSP-num 1 phi.xml > phi.tab
$ perl -e '<>; while (<>) { chomp; my @field = split /\t/, $_; $field[1] =~ m/.*#([^\t]+)/; my $annot = $1; my @annot = split /__/, $annot; foreach (@annot) { $annot{$field[0]}{$_} ++; $annot_score{$field[0]}{$_} = $_[11] if $annot_score{$field[0]}{$_} < $_[11];} } foreach my $gene (sort keys %annot) { my @annot = sort { $annot{$gene}{$b} <=> $annot{$gene}{$a} or $annot_score{$gene}{$b} <=> $annot_score{$gene}{$a} } keys %{$annot{$gene}}; print "$gene\t$annot[0]\n"; $stats{$annot[0]}{$gene} = 1; } foreach (sort keys %stats) { my @gene = sort keys %{$stats{$_}}; my $num = @gene; my $gene = join ",", @gene; print STDERR "$_\t$num\t$gene\n"; }' phi.tab > phi.annot 2> phi.stats
$ perl -e 'print "geneID\tPhenotype\tReportNum\n"; <>; while (<>) { chomp; my @field = split /\t/, $_; $field[1] =~ m/.*#([^\t]+)/; my $annot = $1; my @annot = split /__/, $annot; foreach (@annot) { $annot{$field[0]}{$_} ++; $annot_score{$field[0]}{$_} = $_[11] if $annot_score{$field[0]}{$_} < $_[11];} } foreach my $gene (sort keys %annot) { my @annot = sort { $annot{$gene}{$b} <=> $annot{$gene}{$a} or $annot_score{$gene}{$b} <=> $annot_score{$gene}{$a} } keys %{$annot{$gene}}; my $max_num = $annot{$gene}{$annot[0]}; print "$gene\t$annot[0]\t$annot{$gene}{$annot[0]}\n"; $stats{$annot[0]}{$gene} = 1; shift @annot; foreach (@annot) { if (($annot{$gene}{$_} / $max_num) >= 0.6) { print "$gene\t$_\t$annot{$gene}{$_}\n"; $stats{$_}{$gene} = 1;} } } foreach (sort keys %stats) { my @gene = sort keys %{$stats{$_}}; my $num = @gene; my $gene = join ",", @gene; print STDERR "$_\t$num\t$gene\n"; }' phi.tab > phi.annot 2> phi.stats

1. PHI-base数据库简介

2. 下载并部署PHI-base数据库

3. 使用BLASTP进行PHI注释，并编写程序提炼结果

发表评论