https://asia.ensembl.org/info/docs/tools/vep/index.html
https://github.com/Ensembl/ensembl-vep
Input a list of variant names and get annotation results back.
Annotation results:
#Uploaded_variation Location Allele Consequence IMPACT SYMBOL Gene Feature_type Feature BIOTYPE EXON INTRON HGVSc HGVSp cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation DISTANCE STRAND FLAGS SYMBOL_SOURCE HGNC_ID MANE TSL APPRIS SIFT PolyPhen AF CLIN_SIG SOMATIC PHENO PUBMED MOTIF_NAME MOTIF_POS HIGH_INF_POS MOTIF_SCORE_CHANGE TRANSCRIPTION_FACTORS
rs1258750482 19:61902-61902 A downstream_gene_variant MODIFIER WASH5P ENSG00000282458 Transcript ENST00000631796.1 processed_transcript - - - - - - - - - rs1258750482 3920 -1 - HGNC HGNC:33884 - 2 - - - - - - - - - - - - -
rs1258750482 19:61902-61902 A downstream_gene_variant MODIFIER WASH5P ENSG00000282458 Transcript ENST00000631994.1 processed_transcript - - - - - - - - - rs1258750482 4476 -1 - HGNC HGNC:33884 - 5 - - - - - - - - - - - - -
rs1258750482 19:61902-61902 A downstream_gene_variant MODIFIER WASH5P ENSG00000282458 Transcript ENST00000632089.1 processed_transcript - - - - - - - - - rs1258750482 3920 -1 - HGNC HGNC:33884 - 3 - - - - - - - - - - - - -
rs1258750482 19:61902-61902 A downstream_gene_variant MODIFIER WASH5P ENSG00000282458 Transcript ENST00000632496.1 processed_transcript - - - - - - - - - rs1258750482 3920 -1 - HGNC HGNC:33884 - 3 - - - - - - - - - - - - -
rs1258750482 19:61902-61902 A splice_region_variant,intron_variant,non_coding_transcript_variant LOW WASH5P ENSG00000282458 Transcript ENST00000632506.1 processed_transcript - 2/2 - - - - - - - rs1258750482 - -1 - HGNC HGNC:33884 - 1 - - - - - - - - - - - - -
rs1258750482 19:61902-61902 A downstream_gene_variant MODIFIER WASH5P ENSG00000282458 Transcript ENST00000633703.1 processed_transcript - - - - - - - - - rs1258750482 1919 -1 - HGNC HGNC:33884 - 5 - - - - - - - - - - - - -
rs1258750482 19:61902-61902 A downstream_gene_variant MODIFIER WASH5P ENSG00000282458 Transcript ENST00000633719.1 retained_intron - - - - - - - - - rs1258750482 211 -1 - HGNC HGNC:33884 - - - - - - - - - - - - - - -
rs1258750482 19:61902-61902 A downstream_gene_variant MODIFIER WASH5P ENSG00000282458 Transcript ENST00000633742.1 transcribed_processed_pseudogene - - - - - - - - - rs1258750482 4418 -1 - HGNC HGNC:33884 - - - - - - - - - - - - - - -
rs1258750482 19:61902-61902 A downstream_gene_variant MODIFIER WASH5P ENSG00000282458 Transcript ENST00000634023.1 processed_transcript - - - - - - - - - rs1258750482 3149 -1 - HGNC HGNC:33884 - 5 - - - - - - - - - - - - -
rs1156485833 19:107157-107157 C upstream_gene_variant MODIFIER OR4F17 ENSG00000176695 Transcript ENST00000318050.4 protein_coding - - - - - - - - - rs1156485833 3486 1 - HGNC HGNC:15381 - - - - - - - - - - - - - - -
rs1156485833 19:107157-107157 C splice_region_variant,5_prime_UTR_variant LOW OR4F17 ENSG00000176695 Transcript ENST00000585993.3 protein_coding 1/3 - - - 54 - - - - rs1156485833 - 1 - HGNC HGNC:15381 - - - - - - - - - - - - - - -
rs1156485833 19:107157-107157 C downstream_gene_variant MODIFIER OR4G1P ENSG00000267310 Transcript ENST00000588632.2 transcribed_unprocessed_pseudogene - - - - - - - - - rs1156485833 1685 1 - HGNC HGNC:8302 - - - - - - - - - - - - - - -
rs1156485833 19:107157-107157 C missense_variant,splice_region_variant MODERATE OR4F17 ENSG00000176695 Transcript ENST00000618231.3 protein_coding 1/2 - - - 54 9 3 K/N aaG/aaC rs1156485833 - 1 - HGNC HGNC:15381 - - P1 deleterious_low_confidence(0.03) benign(0.062) - - - - - - - - - -
rs1156485833 19:107157-107157 C downstream_gene_variant MODIFIER OR4G1P ENSG00000267310 Transcript ENST00000641173.1 processed_transcript - - - - - - - - - rs1156485833 1080 1 - HGNC HGNC:8302 - - - - - - - - - - - - - - -
rs1156485833 19:107157-107157 C splice_region_variant,non_coding_transcript_exon_variant LOW OR4F17 ENSG00000176695 Transcript ENST00000641591.1 processed_transcript 1/4 - - - 54 - - - - rs1156485833 - 1 - HGNC HGNC:15381 - - - - - - - - - - - - - - -
rs1156485833 19:107157-107157 C downstream_gene_variant MODIFIER OR4G1P ENSG00000267310 Transcript ENST00000641984.1 processed_transcript - - - - - - - - - rs1156485833 1080 1 - HGNC HGNC:8302 - - - - - - - - - - - - - - -
rs867704559 19:110630-110630 TT upstream_gene_variant MODIFIER OR4F17 ENSG00000176695 Transcript ENST00000318050.4 protein_coding - - - - - - - - - rs867704559 13 1 - HGNC HGNC:15381 - - - - - - - - - - - - - - -
rs867704559 19:110630-110630 TT 5_prime_UTR_variant MODIFIER OR4F17 ENSG00000176695 Transcript ENST00000585993.3 protein_coding 3/3 - - - 143 - - - - rs867704559 - 1 - HGNC HGNC:15381 - - - - - - - - - - - - - - -
rs867704559 19:110630-110630 TT frameshift_variant HIGH OR4F17 ENSG00000176695 Transcript ENST00000618231.3 protein_coding 2/2 - - - 60 15 5 T/TX acT/acTT rs867704559 - 1 - HGNC HGNC:15381 - - P1 - - - - - - - - - - - -
rs867704559 19:110630-110630 TT downstream_gene_variant MODIFIER OR4G1P ENSG00000267310 Transcript ENST00000641173.1 processed_transcript - - - - - - - - - rs867704559 4553 1 - HGNC HGNC:8302 - - - - - - - - - - - - - - -
rs867704559 19:110630-110630 TT non_coding_transcript_exon_variant MODIFIER OR4F17 ENSG00000176695 Transcript ENST00000641591.1 processed_transcript 3/4 - - - 143 - - - - rs867704559 - 1 - HGNC HGNC:15381 - - - - - - - - - - - - - - -
rs867704559 19:110630-110630 TT downstream_gene_variant MODIFIER OR4G1P ENSG00000267310 Transcript ENST00000641984.1 processed_transcript - - - - - - - - - rs867704559 4553 1 - HGNC HGNC:8302 - - - - - - - - - - - - - - -
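Questions:
Why does one SNP get so many annotations? Because annotation is done per transcript, and the same site can have a different function in each transcript. In addition, if two genes lie very close together, the site may be annotated to both genes. [Sort by priority and drop the redundant rows.]
Why are the annotations redundant, with both downstream and non_coding for the same variant? Redundancy is expected: annotation works at different levels and can be coarse or fine-grained.
For a small number of variants, use the web server: https://asia.ensembl.org/Tools/VEP [fine for up to a few hundred thousand variants; it accepts rs IDs, which is very convenient]
For a large number, use the local tool: https://github.com/Ensembl/ensembl-vep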
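The per-transcript behaviour can be checked directly on the output. The following is a minimal sketch, assuming whitespace-delimited rows like the ones above, truncated to the first 10 columns; the helper names `parse_rows` and `summarize` are made up for illustration. It groups rows by variant and reports which genes and how many transcripts each variant hits.

```python
from collections import defaultdict

# First 10 column names from the header of the VEP output above.
COLUMNS = ["Uploaded_variation", "Location", "Allele", "Consequence", "IMPACT",
           "SYMBOL", "Gene", "Feature_type", "Feature", "BIOTYPE"]

# Three rows copied from the output above, truncated to the first 10 columns.
ROWS = """\
rs867704559 19:110630-110630 TT upstream_gene_variant MODIFIER OR4F17 ENSG00000176695 Transcript ENST00000318050.4 protein_coding
rs867704559 19:110630-110630 TT frameshift_variant HIGH OR4F17 ENSG00000176695 Transcript ENST00000618231.3 protein_coding
rs867704559 19:110630-110630 TT downstream_gene_variant MODIFIER OR4G1P ENSG00000267310 Transcript ENST00000641173.1 processed_transcript
"""


def parse_rows(text):
    """Turn whitespace-delimited rows into dicts keyed by column name."""
    return [dict(zip(COLUMNS, line.split()))
            for line in text.splitlines() if line.strip()]


def summarize(records):
    """Collect the genes and transcripts hit by each variant."""
    genes = defaultdict(set)
    transcripts = defaultdict(set)
    for rec in records:
        genes[rec["Uploaded_variation"]].add(rec["SYMBOL"])
        transcripts[rec["Uploaded_variation"]].add(rec["Feature"])
    return {variant: {"genes": sorted(genes[variant]),
                      "n_transcripts": len(transcripts[variant])}
            for variant in genes}


if __name__ == "__main__":
    for variant, info in summarize(parse_rows(ROWS)).items():
        print(variant, info)
```

Here a single variant lands in three transcripts across two neighbouring genes, which is exactly the situation described in the answers above.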
Install the perl dependencies:
perl -MCPAN -e 'install Archive::Zip'
perl -MCPAN -e 'install DBI'
cpan Module::Build
https://github.com/Ensembl/Bio-DB-HTS: this module is hard to install; the build fails with fatal error: zlib.h: No such file or directory
Install the local cache:
cd $HOME/.vep
curl -O ftp://ftp.ensembl.org/pub/release-102/variation/indexed_vep_cache/homo_sapiens_vep_102_GRCh38.tar.gz
tar xzf homo_sapiens_vep_102_GRCh38.tar.gz
Problem: #include <zlib.h> zlib.h: No such file or directory [without a programming background, compile errors are a real headache]
Solution: Compilation error - missing zlib.h
export PATH=$PATH:/home/lizhixin/softwares/ensembl-vep/zlib-1.2.11
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/lizhixin/softwares/ensembl-vep/zlib-1.2.11/lib/
export LIBRARY_PATH=$LIBRARY_PATH:/home/lizhixin/softwares/ensembl-vep/zlib-1.2.11/lib/
export C_INCLUDE_PATH=/home/lizhixin/softwares/ensembl-vep/zlib-1.2.11/include/
export CPLUS_INCLUDE_PATH=/home/lizhixin/softwares/ensembl-vep/zlib-1.2.11/include/
export PKG_CONFIG_PATH=/home/lizhixin/softwares/ensembl-vep/zlib-1.2.11/lib/pkgconfig
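Problem: MSG: ERROR: Cannot use ID format in offline mode [local/offline mode cannot take rs IDs as input]
So prepare the input in another format and test again; VCF will certainly work.
If the installation still fails, fall back to docker: https://hub.docker.com/r/ensemblorg/ensembl-vep#install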
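As a minimal sketch of preparing VCF input, the snippet below writes variants as VCF records. The function name `to_vcf` is hypothetical, and the REF/ALT pair for rs1156485833 is an assumption inferred from the aaG/aaC codon change on the forward strand in the output above; for real data the reference allele has to come from the genome or from dbSNP.

```python
# Minimal VCF header: the format line plus the eight fixed columns.
VCF_HEADER = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]


def to_vcf(variants):
    """Render (chrom, pos, id, ref, alt) tuples as VCF text."""
    lines = list(VCF_HEADER)
    for chrom, pos, var_id, ref, alt in variants:
        # QUAL, FILTER and INFO are left as missing values (".").
        lines.append("\t".join([chrom, str(pos), var_id, ref, alt, ".", ".", "."]))
    return "\n".join(lines) + "\n"


# Assumed alleles: G>C, inferred from codons aaG/aaC on strand 1.
EXAMPLE = [("19", 107157, "rs1156485833", "G", "C")]

if __name__ == "__main__":
    print(to_vcf(EXAMPLE), end="")
```

Saved to a file (say `input.vcf`, a made-up name), this should be usable with something like `vep -i input.vcf --cache --offline`, since the variant is now given by coordinates and alleles rather than by ID alone.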
Interpreting the results:
Consequences (all)
- intron_variant: 44%
- non_coding_transcript_variant: 16%
- upstream_gene_variant: 12%
- downstream_gene_variant: 11%
- NMD_transcript_variant: 4%
- regulatory_region_variant: 3%
- intergenic_variant: 3%
- non_coding_transcript_exon_variant: 2%
- missense_variant: 1%
- Others
Coding consequences
- missense_variant: 73%
- synonymous_variant: 26%
- stop_gained: 1%
- protein_altering_variant: 0%
- frameshift_variant: 0%
- stop_lost: 0%
- coding_sequence_variant: 0%
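Percentages like these can be reproduced from the Consequence column. The sketch below assumes that every comma-separated term in every output row counts once, which may not match exactly how the VEP summary is computed; `consequence_percentages` is a hypothetical helper name.

```python
from collections import Counter


def consequence_percentages(consequence_fields):
    """Share of each consequence term, counting every term in every row once."""
    counts = Counter(term
                     for field in consequence_fields
                     for term in field.split(","))
    total = sum(counts.values())
    return {term: round(100 * n / total, 1) for term, n in counts.most_common()}


# Consequence values as they appear in the output rows (one string per row).
FIELDS = [
    "downstream_gene_variant",
    "downstream_gene_variant",
    "splice_region_variant,intron_variant",
]

if __name__ == "__main__":
    for term, pct in consequence_percentages(FIELDS).items():
        print(f"{term}: {pct}%")
```

Because one row can carry several terms, the total being divided by is the number of terms, not the number of rows or variants.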
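Notes:
- non-coding includes intergenic + UTR + intron
- exon includes CDS + UTR
- upstream and downstream mean the regions flanking a gene, often taken as 2 kbp; VEP's default window is 5 kbp, which is why DISTANCE values up to 4553 appear above
- ncRNA: exonic/splicing/intronic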
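If a narrower flanking window is wanted than the one used when VEP was run, the DISTANCE column can be filtered afterwards. A minimal sketch, assuming records are dicts with `Feature` and `DISTANCE` keys as in the output above (`within_distance` is a made-up helper name):

```python
def within_distance(records, max_bp):
    """Keep rows inside the feature ("-") or within max_bp of it."""
    kept = []
    for rec in records:
        distance = rec["DISTANCE"]
        # "-" means the variant overlaps the transcript, so no distance applies.
        if distance == "-" or int(distance) <= max_bp:
            kept.append(rec)
    return kept


# Feature and DISTANCE values taken from three rows of the output above.
RECORDS = [
    {"Feature": "ENST00000631796.1", "DISTANCE": "3920"},
    {"Feature": "ENST00000633719.1", "DISTANCE": "211"},
    {"Feature": "ENST00000632506.1", "DISTANCE": "-"},
]

if __name__ == "__main__":
    for rec in within_distance(RECORDS, 2000):
        print(rec["Feature"], rec["DISTANCE"])
```

With a 2000 bp cutoff, the row at 3920 bp is dropped while the overlapping row and the one at 211 bp remain.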
Priority:
- 1 splicing/ncRNA splicing
- 2 missense
- 3 coding region/ncRNA exonic
- 4 5'UTR/3'UTR
- 5 Upstream/Downstream
- 6 regulatory_region_variant
- 7 intronic/non_coding_transcript_variant
- 8 intergenic
- 9 others
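The ranking above can be applied mechanically to keep one row per variant. The sketch below is one possible reading of it: the mapping from the list's categories to Sequence Ontology terms is an assumption, not something the list specifies, and `PRIORITY`, `best_rank` and `pick_per_variant` are hypothetical names. A row is ranked by its best term, and on ties the first row seen wins.

```python
# Assumed mapping of the priority list above onto SO consequence terms.
PRIORITY = {
    "splice_acceptor_variant": 1, "splice_donor_variant": 1, "splice_region_variant": 1,
    "missense_variant": 2,
    "frameshift_variant": 3, "stop_gained": 3, "stop_lost": 3,
    "protein_altering_variant": 3, "synonymous_variant": 3,
    "coding_sequence_variant": 3, "non_coding_transcript_exon_variant": 3,
    "5_prime_UTR_variant": 4, "3_prime_UTR_variant": 4,
    "upstream_gene_variant": 5, "downstream_gene_variant": 5,
    "regulatory_region_variant": 6,
    "intron_variant": 7, "non_coding_transcript_variant": 7,
    "intergenic_variant": 8,
}
OTHERS = 9  # anything not listed above


def best_rank(consequence_field):
    """Lowest (most important) rank among the comma-separated terms."""
    return min(PRIORITY.get(term, OTHERS) for term in consequence_field.split(","))


def pick_per_variant(records):
    """Keep, for each variant, the first row with the best rank."""
    best = {}
    for rec in records:
        key = rec["Uploaded_variation"]
        if key not in best or best_rank(rec["Consequence"]) < best_rank(best[key]["Consequence"]):
            best[key] = rec
    return best


# Rows reduced to three columns, taken from the output above.
RECORDS = [
    {"Uploaded_variation": "rs1156485833", "Consequence": "upstream_gene_variant", "Feature": "ENST00000318050.4"},
    {"Uploaded_variation": "rs1156485833", "Consequence": "splice_region_variant,5_prime_UTR_variant", "Feature": "ENST00000585993.3"},
    {"Uploaded_variation": "rs1156485833", "Consequence": "missense_variant,splice_region_variant", "Feature": "ENST00000618231.3"},
    {"Uploaded_variation": "rs867704559", "Consequence": "5_prime_UTR_variant", "Feature": "ENST00000585993.3"},
    {"Uploaded_variation": "rs867704559", "Consequence": "frameshift_variant", "Feature": "ENST00000618231.3"},
]

if __name__ == "__main__":
    for variant, rec in pick_per_variant(RECORDS).items():
        print(variant, rec["Feature"], rec["Consequence"])
```

VEP also ships its own reduction options such as `--pick` and `--most_severe`, which follow VEP's severity order rather than the list above, so the two approaches can select different rows.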
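The consequence terms used here also come from an ontology (the Sequence Ontology).
Google search: vep regulatory_region_variant
Ensembl Variation - Calculated variant consequences: the annotation itself is derived from the function of Ensembl transcripts.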
http://www.sequenceontology.org/miso/current_svn/term/SO:0001566