基于GWAS的分析结果做基因集富集分析。
基本代码:
# download sumstats file from https://ctg.cncr.nl/software/summary_statistics # download NCBI37 and g1000_eur from https://ctg.cncr.nl/software/magma # download msigdb.v7.2.entrez.gmt from http://www.gsea-msigdb.org/gsea/downloads.jsp
# Annotation ../magma --annotate --snp-loc g1000_eur/g1000_eur.bim --gene-loc NCBI37/NCBI37.3.gene.loc --out test # Gene analysis - SNP p-values ../magma --bfile g1000_eur/g1000_eur --pval sumstats/SESA_neuro_clus_sumstats.txt N=449484 --gene-annot test.genes.annot --out test.gene # cat msigdb.v7.2.entrez.gmt | cut -f1,3- > formated.msigdb.v7.2.entrez.gmt
# Gene-set analysis ../magma --gene-results test.gene.genes.raw --set-annot geneSet/formated.msigdb.v7.2.entrez.gmt --out GS_test
Annotation的结果
就是把SNP的ID归到了gene里,NCBI (Entrez) Gene IDs
# window_up = 0 # window_down = 0 79501 1:69091:70008 rs140739101 rs200505207 rs527512746 rs200676709 rs570453814 100996442 1:142447:174392 rs371474651 rs547574544 rs564576411 148398 1:859993:879961 rs568015169 rs542412615 rs61464428 rs57465118 rs78033073 rs57924093 rs60837925 rs57816 26155 1:879583:894679 rs544040354 rs6605067 rs2839 rs553373622 rs143853699 rs528505070 rs557736773 rs370120035 339451 1:895967:901099 rs113034360 rs112905931 rs532448472 rs144174542 rs562855695 rs531742319 rs28393498 rs56856 84069 1:901872:910488 rs116147894 rs201730138 rs62639981 rs199941754 rs28499371 rs530698349 rs561236606 rs77359 84808 1:910579:917473 rs556054629 rs532303174 rs4970429 rs182213575 rs2340592 rs533147788 rs3748588 rs14146 57801 1:934342:936608 rs186167017 rs569070395 rs201391049 rs200512735 rs113602214 rs531408504 rs2298214 rs14625 9636 1:948847:949920 rs4615788 rs15842 rs28491407 rs140198533 rs2465124 rs568580969 rs371949419 rs148041041
Gene analysis的结果
类似GWAS的sumstats的结果,核心就是P-value
GENE CHR START STOP NSNPS NPARAM N ZSTAT P 148398 1 859993 879961 92 21 449484 0.33571 0.36854 26155 1 879583 894679 72 13 449484 0.22321 0.41168 339451 1 895967 901099 27 8 449484 0.68675 0.24612 84069 1 901872 910488 36 17 449484 1.9254 0.027088 84808 1 910579 917473 24 6 449484 0.94576 0.17214 57801 1 934342 936608 15 5 449484 0.98676 0.16188 9636 1 948847 949920 9 4 449484 0.55709 0.28873 375790 1 955503 991499 193 32 449484 1.0999 0.13569 401934 1 1007126 1009687 9 4 449484 2.4592 0.0069632 54991 1 1017198 1051736 123 17 449484 0.38241 0.35108 254173 1 1109286 1133315 154 20 449484 1.4226 0.077429 8784 1 1138888 1142163 15 4 449484 1.9105 0.028037 7293 1 1146706 1149703 10 4 449484 1.7823 0.037349 51150 1 1152288 1167447 96 6 449484 0.1506 0.44014 126792 1 1167629 1170421 9 3 449484 2.1349 0.016383 388581 1 1177826 1182102 13 3 449484 0.49731 0.30948 118424 1 1189292 1209234 66 12 449484 1.5713 0.058059 6339 1 1215816 1227409 50 18 449484 0.069082 0.47246 116983 1 1227764 1243304 76 19 449484 0.33408 0.36916
Gene-set analysis的结果
以pathway为对象的sumstats结果
# TOTAL_GENES = 18152 # TEST_DIRECTION = one-sided, positive (set), two-sided (covar) # CONDITIONED_INTERNAL = gene size, gene density, inverse mac, log(gene size), log(gene density), log(inverse mac) VARIABLE TYPE NGENES BETA BETA_STD SE P FULL_NAME BIOCARTA_FEEDER_PATHWAY SET 9 -0.24616 -0.00548 0.21239 0.87677 BIOCARTA_FEEDER_PATHWAY BIOCARTA_PROTEASOME_PATHWAY SET 17 -0.076691 -0.0023459 0.22729 0.6321 BIOCARTA_PROTEASOME_PATHWAY BIOCARTA_KREB_PATHWAY SET 8 -0.51618 -0.010834 0.40188 0.90049 BIOCARTA_KREB_PATHWAY ST_INTERFERON_GAMMA_PATHWAY SET 10 0.53983 0.012667 0.32686 0.04932 ST_INTERFERON_GAMMA_PATHWAY ST_WNT_CA2_CYCLIC_GMP_PATHWAY SET 20 0.14579 0.0048366 0.19018 0.22167 ST_WNT_CA2_CYCLIC_GMP_PATHWAY ST_DIFFERENTIATION_PATHWAY_I... SET 44 -0.037477 -0.001843 0.14289 0.60345 ST_DIFFERENTIATION_PATHWAY_IN_PC12_CELL ST_TUMOR_NECROSIS_FACTOR_PAT... SET 28 0.11224 0.0044051 0.17192 0.25692 ST_TUMOR_NECROSIS_FACTOR_PATHWAY ST_ERK1_ERK2_MAPK_PATHWAY SET 29 -0.12764 -0.0050977 0.17231 0.77057 ST_ERK1_ERK2_MAPK_PATHWAY ST_GA12_PATHWAY SET 22 0.11411 0.0039702 0.21219 0.29537 ST_GA12_PATHWAY ST_G_ALPHA_S_PATHWAY SET 16 0.1042 0.0030923 0.22091 0.31858 ST_G_ALPHA_S_PATHWAY ST_G_ALPHA_I_PATHWAY SET 36 0.15883 0.0070663 0.14975 0.14444 ST_G_ALPHA_I_PATHWAY ST_IL_13_PATHWAY SET 5 0.048314 0.00080176 0.4724 0.45927 ST_IL_13_PATHWAY ST_P38_MAPK_PATHWAY SET 36 -0.16636 -0.0074015 0.15474 0.85882 ST_P38_MAPK_PATHWAY ST_JAK_STAT_PATHWAY SET 9 0.28402 0.0063229 0.34345 0.20413 ST_JAK_STAT_PATHWAY
pathway里每一个基因的信息
# ALPHA = 0.05 # NUMBER_OF_TESTS = 31071 # _SET1_ VARIABLE = GO_DISTAL_AXON (set) # _SET1_ NGENES = 290 # _SET1_ P-VALUE = 1.46129e-06 _SET1_ GENE CHR START STOP NSNPS NPARAM N ZSTAT P ZFITTED_BASE ZRESID_BASE _SET1_ 116983 1 1227764 1243304 76 19 449484 -0.0069549 0.36916 2.1164e-16 -0.0069549 _SET1_ 1855 1 1270658 1284509 53 14 449484 1.3402 0.050325 2.2204e-16 1.3402 _SET1_ 127262 1 3541556 3546695 20 6 449484 0.78373 0.12976 2.2204e-16 0.78373 _SET1_ 8514 1 6052358 6161253 345 59 449484 2.4031 0.001132 0 2.4031 _SET1_ 65018 1 20959948 20978004 89 15 449484 0.064676 0.28667 2.0817e-16 0.064676 _SET1_ 10236 1 23635953 23671143 84 16 449484 1.5148 0.02933 2.2204e-16 1.5148 _SET1_ 4985 1 29138654 29190208 231 37 449484 0.46618 0.14046 2.2204e-16 0.46618 _SET1_ 2899 1 37261128 37499844 562 68 449484 2.5742 0.00053178 0 2.5742 _SET1_ 1996 1 50513686 50669442 258 42 449484 0.97766 0.059361 2.2204e-16 0.97766 _SET1_ 85440 1 62920397 63154039 399 19 449484 0.064887 0.26601 2.0817e-16 0.064887 _SET1_ 4919 1 64239690 64647181 1179 91 449484 -1.5572 0.7101 2.2204e-16 -1.5572 _SET1_ 58155 1 97187161 97280605 303 28 449484 1.435 0.022342 2.2204e-16 1.435 _SET1_ 22911 1 109512835 109584860 241 32 449484 1.1109 0.04975 2.2204e-16 1.1109 _SET1_ 3749 1 110753336 110776674 40 14 449484 -0.78973 0.60279 2.2204e-16 -0.78973 _SET1_ 3737 1 111136202 111174096 110 15 449484 0.59466 0.15466 2.2204e-16 0.59466 _SET1_ 3738 1 111196182 111217655 49 10 449484 -1.3769 0.81738 2.2204e-16 -1.3769 _SET1_ 9860 1 113615792 113667824 127 22 449484 0.032914 0.31657 2.0817e-16 0.032914 _SET1_ 57657 1 155247218 155259639 34 8 449484 0.15263 0.28912 2.2204e-16 0.15263 _SET1_ 23208 1 155829260 155854990 45 12 449484 0.093267 0.30148 2.0817e-16 0.093267 _SET1_ 1314 1 160258377 160313354 113 23 449484 1.2371 0.03771 2.2204e-16 1.2371 _SET1_ 55811 1 167778357 167883608 487 74 449484 0.39334 0.1405 2.2204e-16 0.39334 _SET1_ 63923 1 175036994 175117202 333 37 449484 0.96139 0.050889 2.2204e-16 0.96139 _SET1_ 2752 1 182350839 182361341 40 10 449484 0.27951 0.22106 2.2204e-16 0.27951 _SET1_ 10092 1 183595328 183605076 47 3 449484 0.29785 0.33574 2.2204e-16 0.29785 _SET1_ 23046 1 200938514 200992828 204 29 449484 0.70264 0.10325 2.2204e-16 0.70264 _SET1_ 134 1 203096833 203136533 140 29 449484 0.40025 0.16777 2.2204e-16 0.40025
核心原理:
Gene analysis
The gene analysis in MAGMA is based on a multiple linear principal components regression [18] model, using an F-test to compute the gene p-value.
This model first projects the SNP matrix for a gene onto its principal components (PC), pruning away PCs with very small eigenvalues, and then uses those PCs as predictors for the phenotype in the linear regression model.
This improves power by removing redundant parameters, and guarantees that the model is identifiable in the presence of highly collinear SNPs. By default only 0.1% of the variance in the SNP data matrix is pruned away.
Gene-set analysis
To perform the gene-set analysis, for each gene g the gene p-value pg computed with the gene analysis is converted to a Z-value zg = Φ−1(1 – pg), where Φ−1 is the probit function. This yields a roughly normally distributed variable Z with elements zg that reflects the strength of the association each gene has with the phenotype, with higher values corresponding to stronger associations.
参考链接:
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004219
https://ctg.cncr.nl/software/magma
https://ctg.cncr.nl/software/MAGMA/doc/manual_v1.08.pdf
~/softwares/magma_v1.08b/test_data/