• Ensembl Variant Effect Predictor (VEP) | variant annotation tool


    https://asia.ensembl.org/info/docs/tools/vep/index.html

    https://github.com/Ensembl/ensembl-vep

    Give it a list of variant identifiers and it returns annotations for them.

    Annotation output:

    #Uploaded_variation	Location	Allele	Consequence	IMPACT	SYMBOL	Gene	Feature_type	Feature	BIOTYPE	EXON	INTRON	HGVSc	HGVSp	cDNA_position	CDS_position	Protein_position	Amino_acids	Codons	Existing_variation	DISTANCE	STRAND	FLAGS	SYMBOL_SOURCE	HGNC_ID	MANE	TSL	APPRIS	SIFT	PolyPhen	AF	CLIN_SIG	SOMATIC	PHENO	PUBMED	MOTIF_NAME	MOTIF_POS	HIGH_INF_POS	MOTIF_SCORE_CHANGE	TRANSCRIPTION_FACTORS
    rs1258750482	19:61902-61902	A	downstream_gene_variant	MODIFIER	WASH5P	ENSG00000282458	Transcript	ENST00000631796.1	processed_transcript	-	-	-	-	-	-	-	-	-	rs1258750482	3920	-1	-	HGNC	HGNC:33884	-	2	-	-	-	-	-	-	-	-	-	-	-	-	-
    rs1258750482	19:61902-61902	A	downstream_gene_variant	MODIFIER	WASH5P	ENSG00000282458	Transcript	ENST00000631994.1	processed_transcript	-	-	-	-	-	-	-	-	-	rs1258750482	4476	-1	-	HGNC	HGNC:33884	-	5	-	-	-	-	-	-	-	-	-	-	-	-	-
    rs1258750482	19:61902-61902	A	downstream_gene_variant	MODIFIER	WASH5P	ENSG00000282458	Transcript	ENST00000632089.1	processed_transcript	-	-	-	-	-	-	-	-	-	rs1258750482	3920	-1	-	HGNC	HGNC:33884	-	3	-	-	-	-	-	-	-	-	-	-	-	-	-
    rs1258750482	19:61902-61902	A	downstream_gene_variant	MODIFIER	WASH5P	ENSG00000282458	Transcript	ENST00000632496.1	processed_transcript	-	-	-	-	-	-	-	-	-	rs1258750482	3920	-1	-	HGNC	HGNC:33884	-	3	-	-	-	-	-	-	-	-	-	-	-	-	-
    rs1258750482	19:61902-61902	A	splice_region_variant,intron_variant,non_coding_transcript_variant	LOW	WASH5P	ENSG00000282458	Transcript	ENST00000632506.1	processed_transcript	-	2/2	-	-	-	-	-	-	-	rs1258750482	-	-1	-	HGNC	HGNC:33884	-	1	-	-	-	-	-	-	-	-	-	-	-	-	-
    rs1258750482	19:61902-61902	A	downstream_gene_variant	MODIFIER	WASH5P	ENSG00000282458	Transcript	ENST00000633703.1	processed_transcript	-	-	-	-	-	-	-	-	-	rs1258750482	1919	-1	-	HGNC	HGNC:33884	-	5	-	-	-	-	-	-	-	-	-	-	-	-	-
    rs1258750482	19:61902-61902	A	downstream_gene_variant	MODIFIER	WASH5P	ENSG00000282458	Transcript	ENST00000633719.1	retained_intron	-	-	-	-	-	-	-	-	-	rs1258750482	211	-1	-	HGNC	HGNC:33884	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
    rs1258750482	19:61902-61902	A	downstream_gene_variant	MODIFIER	WASH5P	ENSG00000282458	Transcript	ENST00000633742.1	transcribed_processed_pseudogene	-	-	-	-	-	-	-	-	-	rs1258750482	4418	-1	-	HGNC	HGNC:33884	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
    rs1258750482	19:61902-61902	A	downstream_gene_variant	MODIFIER	WASH5P	ENSG00000282458	Transcript	ENST00000634023.1	processed_transcript	-	-	-	-	-	-	-	-	-	rs1258750482	3149	-1	-	HGNC	HGNC:33884	-	5	-	-	-	-	-	-	-	-	-	-	-	-	-
    rs1156485833	19:107157-107157	C	upstream_gene_variant	MODIFIER	OR4F17	ENSG00000176695	Transcript	ENST00000318050.4	protein_coding	-	-	-	-	-	-	-	-	-	rs1156485833	3486	1	-	HGNC	HGNC:15381	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
    rs1156485833	19:107157-107157	C	splice_region_variant,5_prime_UTR_variant	LOW	OR4F17	ENSG00000176695	Transcript	ENST00000585993.3	protein_coding	1/3	-	-	-	54	-	-	-	-	rs1156485833	-	1	-	HGNC	HGNC:15381	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
    rs1156485833	19:107157-107157	C	downstream_gene_variant	MODIFIER	OR4G1P	ENSG00000267310	Transcript	ENST00000588632.2	transcribed_unprocessed_pseudogene	-	-	-	-	-	-	-	-	-	rs1156485833	1685	1	-	HGNC	HGNC:8302	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
    rs1156485833	19:107157-107157	C	missense_variant,splice_region_variant	MODERATE	OR4F17	ENSG00000176695	Transcript	ENST00000618231.3	protein_coding	1/2	-	-	-	54	9	3	K/N	aaG/aaC	rs1156485833	-	1	-	HGNC	HGNC:15381	-	-	P1	deleterious_low_confidence(0.03)	benign(0.062)	-	-	-	-	-	-	-	-	-	-
    rs1156485833	19:107157-107157	C	downstream_gene_variant	MODIFIER	OR4G1P	ENSG00000267310	Transcript	ENST00000641173.1	processed_transcript	-	-	-	-	-	-	-	-	-	rs1156485833	1080	1	-	HGNC	HGNC:8302	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
    rs1156485833	19:107157-107157	C	splice_region_variant,non_coding_transcript_exon_variant	LOW	OR4F17	ENSG00000176695	Transcript	ENST00000641591.1	processed_transcript	1/4	-	-	-	54	-	-	-	-	rs1156485833	-	1	-	HGNC	HGNC:15381	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
    rs1156485833	19:107157-107157	C	downstream_gene_variant	MODIFIER	OR4G1P	ENSG00000267310	Transcript	ENST00000641984.1	processed_transcript	-	-	-	-	-	-	-	-	-	rs1156485833	1080	1	-	HGNC	HGNC:8302	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
    rs867704559	19:110630-110630	TT	upstream_gene_variant	MODIFIER	OR4F17	ENSG00000176695	Transcript	ENST00000318050.4	protein_coding	-	-	-	-	-	-	-	-	-	rs867704559	13	1	-	HGNC	HGNC:15381	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
    rs867704559	19:110630-110630	TT	5_prime_UTR_variant	MODIFIER	OR4F17	ENSG00000176695	Transcript	ENST00000585993.3	protein_coding	3/3	-	-	-	143	-	-	-	-	rs867704559	-	1	-	HGNC	HGNC:15381	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
    rs867704559	19:110630-110630	TT	frameshift_variant	HIGH	OR4F17	ENSG00000176695	Transcript	ENST00000618231.3	protein_coding	2/2	-	-	-	60	15	5	T/TX	acT/acTT	rs867704559	-	1	-	HGNC	HGNC:15381	-	-	P1	-	-	-	-	-	-	-	-	-	-	-	-
    rs867704559	19:110630-110630	TT	downstream_gene_variant	MODIFIER	OR4G1P	ENSG00000267310	Transcript	ENST00000641173.1	processed_transcript	-	-	-	-	-	-	-	-	-	rs867704559	4553	1	-	HGNC	HGNC:8302	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
    rs867704559	19:110630-110630	TT	non_coding_transcript_exon_variant	MODIFIER	OR4F17	ENSG00000176695	Transcript	ENST00000641591.1	processed_transcript	3/4	-	-	-	143	-	-	-	-	rs867704559	-	1	-	HGNC	HGNC:15381	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
    rs867704559	19:110630-110630	TT	downstream_gene_variant	MODIFIER	OR4G1P	ENSG00000267310	Transcript	ENST00000641984.1	processed_transcript	-	-	-	-	-	-	-	-	-	rs867704559	4553	1	-	HGNC	HGNC:8302	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
    

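    Questions:

    Why does a single SNP get so many annotations? Because VEP annotates per transcript, and the same site can have a different consequence in each transcript. In addition, when two genes lie close together, the variant may be annotated against both. [Sort the rows by priority and drop the redundant ones.]

    Why are the annotations redundant, with both downstream and non_coding terms? Some redundancy is expected: the annotation works at several levels, from very coarse to very fine.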
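The one-row-per-transcript layout is easy to confirm from the tab output itself. A minimal sketch, assuming the results were saved as tab-delimited text with the column order shown above (column 1 = #Uploaded_variation, column 7 = Gene); the function name rows_per_variant and the file name vep_output.txt are made up for illustration:

```shell
# Count how many annotation rows (one per transcript) and how many
# distinct genes each variant received.
rows_per_variant() {
    awk -F'\t' '
        /^#/ { next }                                  # skip header lines
        {
            rows[$1]++
            if (!(($1, $7) in seen)) { seen[$1, $7] = 1; genes[$1]++ }
            if (!($1 in order)) { order[$1] = ++n; name[n] = $1 }
        }
        END {
            # report variants in the order they first appeared
            for (i = 1; i <= n; i++)
                printf "%s\t%d rows\t%d genes\n", name[i], rows[name[i]], genes[name[i]]
        }
    ' "$@"
}

# Usage (hypothetical file name):
#   rows_per_variant vep_output.txt
```

On the output above this should report 9, 7 and 6 rows for the three rs IDs, with the last two variants each touching two genes (OR4F17 and OR4G1P).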


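    For a small number of variants, use the web server: https://asia.ensembl.org/Tools/VEP [fine for up to a few hundred thousand variants, and it accepts rs IDs, which is very convenient]

    For a large number of variants, use the local tool: https://github.com/Ensembl/ensembl-vep

    Install the Perl dependencies: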

    perl -MCPAN -e shell
    install Archive::Zip
    install DBI
    install Module::Build
    

      

    https://github.com/Ensembl/Bio-DB-HTS: this module is hard to install; the build stops with "fatal error: zlib.h: No such file or directory".

    Install the local database (the VEP cache):

    # $HOME/.vep is VEP's default cache directory; create it if it is missing
    mkdir -p "$HOME/.vep"
    cd "$HOME/.vep"
    curl -O ftp://ftp.ensembl.org/pub/release-102/variation/indexed_vep_cache/homo_sapiens_vep_102_GRCh38.tar.gz
    tar xzf homo_sapiens_vep_102_GRCh38.tar.gz
    

      

      

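    Problem: #include <zlib.h> zlib.h: No such file or directory [coming from a non-programming background, compile errors like this are a real headache]

    Solution (from "Compilation error - missing zlib.h"): point the compiler and linker at a local zlib build: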

    export PATH=$PATH:/home/lizhixin/softwares/ensembl-vep/zlib-1.2.11
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/lizhixin/softwares/ensembl-vep/zlib-1.2.11/lib/
    export LIBRARY_PATH=$LIBRARY_PATH:/home/lizhixin/softwares/ensembl-vep/zlib-1.2.11/lib/
    export C_INCLUDE_PATH=/home/lizhixin/softwares/ensembl-vep/zlib-1.2.11/include/
    export CPLUS_INCLUDE_PATH=/home/lizhixin/softwares/ensembl-vep/zlib-1.2.11/include/
    export PKG_CONFIG_PATH=/home/lizhixin/softwares/ensembl-vep/zlib-1.2.11/lib/pkgconfig
    

      

    Problem: MSG: ERROR: Cannot use ID format in offline mode [rs IDs cannot be used in local (offline) mode]

    So prepare the input in another format and test with that; VCF will certainly work.
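A minimal sketch of that workaround, using rs1156485833 from the output above. The reference base G is read off the aaG/aaC codon change on the forward-strand transcript; the helper names make_vcf and annotate_offline and the file names are invented for illustration, and the vep call assumes the GRCh38 cache downloaded earlier is in the default location.

```shell
# Write a minimal VCF to stdout.
# Columns are tab-separated: CHROM POS ID REF ALT QUAL FILTER INFO
make_vcf() {
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '19\t107157\trs1156485833\tG\tC\t.\t.\t.\n'
}

# Annotate a VCF against the local cache without contacting the Ensembl database.
# $1 = input VCF, $2 = output file (tab-delimited)
annotate_offline() {
    vep --cache --offline --assembly GRCh38 \
        -i "$1" -o "$2" --tab --force_overwrite
}

# Usage (hypothetical file names):
#   make_vcf > input.vcf
#   annotate_offline input.vcf output.txt
```

Because the VCF carries coordinates and alleles, VEP no longer has to look the rs ID up in the database, which is the step that fails in offline mode.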

      

    If the installation still fails, fall back to Docker: https://hub.docker.com/r/ensemblorg/ensembl-vep#install


    Interpreting the results:

    Consequences (all)

    • intron_variant: 44%
    • non_coding_transcript_variant: 16%
    • upstream_gene_variant: 12%
    • downstream_gene_variant: 11%
    • NMD_transcript_variant: 4%
    • regulatory_region_variant: 3%
    • intergenic_variant: 3%
    • non_coding_transcript_exon_variant: 2%
    • missense_variant: 1%
    • Others

    Coding consequences

    • missense_variant: 73%
    • synonymous_variant: 26%
    • stop_gained: 1%
    • protein_altering_variant: 0%
    • frameshift_variant: 0%
    • stop_lost: 0%
    • coding_sequence_variant: 0%
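A similar breakdown can be tallied from the tab output directly. A minimal sketch, assuming column 4 holds the comma-separated Consequence terms as in the output above; the function name consequence_summary is made up. It counts every term on every row, which is my reading of "all", so the percentages may not match the web summary exactly.

```shell
# Tally consequence terms across all rows and print term, count, percent,
# most frequent first.
consequence_summary() {
    awk -F'\t' '
        /^#/ { next }                                  # skip header lines
        {
            k = split($4, terms, ",")                  # one row can carry several terms
            for (i = 1; i <= k; i++) { count[terms[i]]++; total++ }
        }
        END {
            for (t in count)
                printf "%s\t%d\t%.0f%%\n", t, count[t], 100 * count[t] / total
        }
    ' "$@" | sort -t "$(printf '\t')" -k2,2nr -k1,1
}

# Usage (hypothetical file name):
#   consequence_summary vep_output.txt
```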

    Notes:

    • non-coding covers intergenic + UTR + intron
    • exon covers CDS + UTR
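    • upstream and downstream mean the regions flanking a gene; VEP's default window is 5 kb (adjustable with --distance), which is why DISTANCE values of up to 4553 appear in the output above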
    • ncRNA exonic/splicing/intronic

    Priority:

    • 1 splicing/ncRNA splicing
    • 2 missense
    • 3 coding region/ncRNA exonic
    • 4 5'UTR/3'UTR
    • 5 Upstream/Downstream
    • 6 regulatory_region_variant
    • 7 intronic/non_coding_transcript_variant
    • 8 intergenic
    • 9 others
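One way to apply this ranking is to keep a single best row per variant. A minimal sketch, assuming VEP tab output with the Consequence terms in column 4; the mapping from Sequence Ontology terms to the nine ranks is my own guess at what each category covers, and the function name top_consequence is made up.

```shell
# Keep, for each variant, the row whose best consequence term ranks highest
# under the priority list above (1 = highest). Ties keep the first row seen.
top_consequence() {
    awk -F'\t' '
        function rank(term) {
            # assumed mapping from SO terms to the nine categories
            if (term ~ /^splice_/) return 1
            if (term == "missense_variant") return 2
            if (term ~ /^(frameshift_variant|stop_gained|stop_lost|synonymous_variant|protein_altering_variant|coding_sequence_variant|non_coding_transcript_exon_variant)$/) return 3
            if (term ~ /_prime_UTR_variant$/) return 4
            if (term ~ /^(upstream|downstream)_gene_variant$/) return 5
            if (term == "regulatory_region_variant") return 6
            if (term == "intron_variant" || term == "non_coding_transcript_variant") return 7
            if (term == "intergenic_variant") return 8
            return 9
        }
        /^#/ { next }                                  # skip header lines
        {
            best = 9
            k = split($4, terms, ",")
            for (i = 1; i <= k; i++) { r = rank(terms[i]); if (r < best) best = r }
            if (!($1 in top) || best < top[$1]) { top[$1] = best; row[$1] = $0 }
            if (!($1 in seen)) { seen[$1] = 1; name[++n] = $1 }
        }
        END { for (i = 1; i <= n; i++) print top[name[i]] "\t" row[name[i]] }
    ' "$@"
}

# Usage (hypothetical file name):
#   top_consequence vep_output.txt
```

The rank is prepended as an extra first column so the choice can be checked by eye. Since ties keep the first row, a variant with several rank-1 transcripts still needs a manual look.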

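    The consequence terms used here are also organized as an ontology (the Sequence Ontology).

    Google search: vep regulatory_region_variant

    Ensembl Variation - Calculated variant consequences: the annotation itself is derived from the function of Ensembl transcripts.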

    http://www.sequenceontology.org/miso/current_svn/term/SO:0001566

    Critical association of ncRNA with introns

  • Original post: https://www.cnblogs.com/leezx/p/14310835.html