• [4] Proteomics identification software: MSGFPlus


    1. Introduction

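    MSGF+ is another protein identification engine that has seen wide use in recent years. It is written in Java, was first published in the Journal of Proteome Research (JPR) in 2008, and an upgraded version was published in Nature Communications (NC) in 2014. It is free, open source, and still actively updated and maintained. In addition, researchers have benchmarked different proteomics identification tools against each other, and MSGF+ performed very well in that comparison (I cannot find the reference right now).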

    Source code on GitHub: https://github.com/MSGFPlus/msgfplus
    Supported input formats: mzML, mzXML, Mascot Generic File (mgf), MS2 files, Micromass Peak List files (pkl), Concatenated DTA files (_dta.txt)
    It primarily supports the HUPO PSI standard formats: mzML for input and mzIdentML (abbreviated mzid) for output, which is easy to convert to TSV.

    For details on the mzIdentML format, see http://www.psidev.info/mzidentml

    2. Installation and running

    Download: https://github.com/MSGFPlus/msgfplus/releases

    For usage, MS-GF+ has very detailed documentation: MS-GF+ Documentation (see https://msgfplus.github.io/msgfplus/index.html)

    Parameter files:
    https://github.com/MSGFPlus/msgfplus/tree/master/docs/ParameterFiles

    For running searches, the documentation provides many examples and explanations of the parameters:
    https://msgfplus.github.io/msgfplus/MSGFPlus.html

    Run example 1:

    java -Xmx4000M -jar MSGFPlus.jar \
      -s test.mzML \
      -d uniprot_swissprot_human_20190313_20417.fasta \
      -t 20ppm -ti -1,2 -ntt 0 -tda 1 -e 0 -m 3 -inst 3 -minCharge 1 -maxCharge 6 -addFeatures 1 \
      -mod Mods.txt \
      -o test.mzid
    

    The modification file Mods.txt contains:

    # This file is used to specify modifications
    # # for comments
    #
    # Max Number of Modifications per peptide
    # If this value is large, the search takes long.
    NumMods=2
    
    # To input a modification, use the following command:
    # Mass or CompositionStr, Residues, ModType, Position, Name (all the five fields are required).
    # CompositionStr (C[Num]H[Num]N[Num]O[Num]S[Num]P[Num]Br[Num]Cl[Num]Fe[Num])
    #       - C (Carbon), H (Hydrogen), N (Nitrogen), O (Oxygen), S (Sulfur), P (Phosphorus), Br (Bromine), Cl (Chlorine), Fe (Iron), and Se (Selenium) are allowed.
    #       - Negative numbers are allowed.
    #       - E.g. C2H2O1 (valid), H2C1O1 (invalid)
    # Mass can be used instead of CompositionStr. It is important to specify accurate masses (integer masses are insufficient).
    #       - E.g. 15.994915
    # Residues: affected amino acids (must be uppercase letters)
    #       - Must be uppercase letters or *
    #       - Use * if this modification is applicable to any residue.
    #       - * should not be "anywhere" modification (e.g. "15.994915, *, opt, any, Oxidation" is not allowed.)
    #       - E.g. NQ, *
    # ModType: "fix" for fixed modifications, "opt" for variable modifications (case insensitive)
    # Position: position in the peptide where the modification can be attached.
    #       - One of the following five values should be used:
    #       - any (anywhere), N-term (peptide N-term), C-term (peptide C-term), Prot-N-term (protein N-term), Prot-C-term (protein C-term)
    #       - Case insensitive
    #       - "-" can be omitted
    #       - E.g. any, Any, Prot-n-Term, ProtNTerm => all valid
    # Name: name of the modification (Unimod PSI-MS name)
    #       - For proper mzIdentML output, this name should be the same as the Unimod PSI-MS name
    #       - E.g. Phospho, Acetyl
    #       - Visit http://www.unimod.org to get PSI-MS names.
    
    C2H3N1O1,C,fix,any,Carbamidomethyl              # Fixed Carbamidomethyl C
    #144.102063,*,fix,N-term,iTRAQ4plex             # iTRAQ 4 plex
    #144.102063,K,fix,any,iTRAQ4plex                        # iTRAQ 4 plex
    
    # Variable Modifications (default: none)
    O1,M,opt,any,Oxidation                          # Oxidation M
    #15.994915,M,opt,any,Oxidation                  # Oxidation M (mass is used instead of CompositionStr)
    H-1N-1O1,NQ,opt,any,Deamidated                  # Negative numbers are allowed.
    #C2H3NO,*,opt,N-term,Carbamidomethyl            # Variable Carbamidomethyl N-term
    #H-2O-1,E,opt,N-term,Glu->pyro-Glu                      # Pyro-glu from E
    #H-3N-1,Q,opt,N-term,Gln->pyro-Glu                      # Pyro-glu from Q
    #C2H2O,*,opt,Prot-N-term,Acetyl                 # Acetylation Protein N-term
    #C2H2O1,K,opt,any,Acetyl                        # Acetylation K
    #CH2,K,opt,any,Methyl                           # Methylation K
    #HO3P,STY,opt,any,Phospho                       # Phosphorylation STY
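The comments in Mods.txt note that a modification can be given either as a CompositionStr or as an accurate mass. The relationship between the two can be checked with a short Python sketch. This is an illustration only and not part of MS-GF+: the function name `composition_mass` and the table of monoisotopic masses are assumptions introduced here, only six of the allowed elements are included, and the sketch does not enforce the element order that MS-GF+ requires (the file lists `H2C1O1` as invalid).

```python
import re

# Monoisotopic masses in Da for a subset of the allowed elements (assumed table,
# entered by hand for this sketch; Br, Cl, Fe and Se are left out for brevity).
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "S": 31.97207100,
    "P": 30.97376163,
}

# One token is an element symbol followed by an optional, possibly negative count.
TOKEN = re.compile(r"([A-Z][a-z]?)(-?\d*)")


def composition_mass(composition):
    """Return the monoisotopic mass shift for a string such as 'C2H3N1O1'."""
    position = 0
    total = 0.0
    for match in TOKEN.finditer(composition):
        if match.start() != position:
            break  # unparseable characters between tokens
        element, count_text = match.group(1), match.group(2)
        if element not in MONOISOTOPIC_MASS:
            raise ValueError(f"unsupported element: {element}")
        count = int(count_text) if count_text else 1  # 'HO3P' means H1 O3 P1
        total += MONOISOTOPIC_MASS[element] * count
        position = match.end()
    if position == 0 or position != len(composition):
        raise ValueError(f"cannot parse composition: {composition!r}")
    return total


if __name__ == "__main__":
    for comp in ("C2H3N1O1", "O1", "H-1N-1O1", "HO3P"):
        print(comp, f"{composition_mass(comp):.6f}")
```

For `O1` this gives 15.994915, the same value the file uses in the commented-out mass form of the Oxidation entry.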
    

    Run example 2:

    java -Xmx4g -Xms1g -jar MSGFPlus.jar \
      -conf MSGFPlus_Parameters.txt \
      -d test.fasta \
      -s test.mzML \
      -o test.mzid
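When the same parameter file is reused for many spectrum files, a command like the one in run example 2 can be assembled programmatically. The following is a minimal Python sketch, not something shipped with MS-GF+: the helper name `build_msgfplus_command` and its default values are assumptions for illustration, and it uses only the flags that appear in that example (`-conf`, `-d`, `-s`, `-o`).

```python
from pathlib import Path


def build_msgfplus_command(spectrum_file, fasta, conf,
                           jar="MSGFPlus.jar", max_heap="4g"):
    """Return the argument list for one MS-GF+ search (hypothetical helper)."""
    spectrum = Path(spectrum_file)
    # Name the result after the spectrum file: test.mzML -> test.mzid
    output = spectrum.with_suffix(".mzid")
    return [
        "java", f"-Xmx{max_heap}", "-jar", jar,
        "-conf", str(conf),
        "-d", str(fasta),
        "-s", str(spectrum),
        "-o", str(output),
    ]


if __name__ == "__main__":
    cmd = build_msgfplus_command("test.mzML", "test.fasta",
                                 "MSGFPlus_Parameters.txt")
    print(" ".join(cmd))
    # To actually launch the search, pass the list to
    # subprocess.run(cmd, check=True) from the standard library.
```

Returning a list rather than a single string avoids shell quoting problems when file names contain spaces.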
    

    The parameter file MSGFPlus_Parameters.txt contains:

    #Parent mass tolerance
    #  Examples: 2.5Da or 30ppm
    #  Use comma to set asymmetric values, for example "0.5Da,2.5Da" will set 0.5Da to the left (expMass<theoMass) and 2.5Da to the right (expMass>theoMass)
    PrecursorMassTolerance=20ppm
    
    #Max Number of Modifications per peptide
    # If this value is large, the search will be slow
    NumMods=5
    
    #Modifications (see below for examples)
    StaticMod=C2H3N1O1,  C,   fix,  any,  Carbamidomethyl              # Fixed Carbamidomethyl C
    DynamicMod=O1,       M,   opt,  any,  Oxidation                    # Oxidized methionine
    DynamicMod=H-1N-1O1, NQ,  opt,  any,  Deamidated                   # Deamidation of Glutamine (+0.984016)
    
    #Custom amino acids
    CustomAA=C3H5NO,     U,  custom, U,   Selenocysteine               # Custom amino acids can only have C, H, N, O, and S
    #CustomAA=H0,        X,  custom, X,   RemoveAA                     # Remove AA
    
    #Fragmentation Method
    #  0 means as written in the spectrum or CID if no info (Default)
    #  1 means CID
    #  2 means ETD
    #  3 means HCD
    #  4 means Merge spectra from the same precursor (e.g. CID/ETD pairs, CID/HCD/ETD triplets)
    FragmentationMethodID=3
    
    #Instrument ID
    #  0 means Low-res LCQ/LTQ (Default for CID and ETD); use InstrumentID=0 if analyzing a dataset with low-res CID and high-res HCD spectra
    #  1 means High-res LTQ (Default for HCD; also appropriate for high res CID); use InstrumentID=1 for Orbitrap, Lumos, and QEHFX instruments
    #  2 means TOF
    #  3 means Q-Exactive
    InstrumentID=1
    
    #Enzyme ID
    #  0 means No enzyme used
    #  1 means Trypsin (Default); use this along with NTT=0 for a no-enzyme search of a tryptically digested sample
    #  2: Chymotrypsin, 3: Lys-C, 4: Lys-N, 5: Glu-C, 6: Arg-C, 7: Asp-N, 8: alphaLP, 9: No Enzyme (for peptidomics)
    EnzymeID=1
    
    #Isotope error range
    #  Takes into account the error introduced by choosing a non-monoisotopic peak for fragmentation.
    #  Useful for accurate precursor ion masses
    #  Ignored if the parent mass tolerance is > 0.5Da or 500ppm
    #  The combination of -t and -ti determines the precursor mass tolerance.
    #  e.g. "-t 20ppm -ti -1,2" tests abs(exp-calc-n*1.00335Da)<20ppm for n=-1, 0, 1, 2.
    IsotopeErrorRange=0,3
    
    #Number of tolerable termini
    #  The number of peptide termini that must have been cleaved by the enzyme (default 1)
    #  For trypsin, 2 means fully tryptic only, 1 means partially tryptic, and 0 means no-enzyme search
    NTT=2
    
    #Target/Decoy search mode
    #  0 means don't search decoy database (default)
    #  1 means search decoy database to compute FDR (source FASTA file must be forward-only proteins)
    TDA=1
    
    #Number of Threads (by default, uses all available cores)
    NumThreads=8
    
    #Minimum peptide length to consider
    MinPepLength=6
    
    #Maximum peptide length to consider
    MaxPepLength=50
    
    #Minimum precursor charge to consider (if not specified in the spectrum)
    MinCharge=1
    
    #Maximum precursor charge to consider (if not specified in the spectrum)
    MaxCharge=6
    
    #Number of matches per spectrum to be reported
    #If this value is greater than 1 then the FDR values computed by MS-GF+ will be skewed by high-scoring 2nd and 3rd hits
    NumMatchesPerSpec=1
    
    #Amino Acid Modification Examples
    # Specific static modifications using one or more StaticMod= entries
    # Specific dynamic modifications using one or more DynamicMod= entries
    # Modification format is:
    # Mass or CompositionStr, Residues, ModType, Position, Name (all the five fields are required).
    # Examples:
    #   C2H3N1O1,  C,  fix, any,         Carbamidomethyl    # Fixed Carbamidomethyl C (alkylation)
    #   O1,        M,  opt, any,         Oxidation          # Oxidation M
    #   15.994915, M,  opt, any,         Oxidation          # Oxidation M (mass is used instead of CompositionStr)
    #   H-1N-1O1,  NQ, opt, any,         Deamidated         # Negative numbers are allowed.
    #   CH2,       K,  opt, any,         Methyl             # Methylation K
    #   C2H2O1,    K,  opt, any,         Acetyl             # Acetylation K
    #   HO3P,      STY,opt, any,         Phospho            # Phosphorylation STY
    #   C2H3NO,    *,  opt, N-term,      Carbamidomethyl    # Variable Carbamidomethyl N-term
    #   H-2O-1,    E,  opt, N-term,      Glu->pyro-Glu      # Pyro-glu from E
    #   H-3N-1,    Q,  opt, N-term,      Gln->pyro-Glu      # Pyro-glu from Q
    #   C2H2O,     *,  opt, Prot-N-term, Acetyl             # Acetylation Protein N-term
    
    #Custom amino acids examples
    # Only supports empirical formulas of elements C H N O S.
    # If other elements are needed, or a specific mass is needed, they can be added as fixed modifications on the custom AA
    # Maximum atom counts: 255 C, 255 H, 63 N, 63 O, 15 S
    # Format spec is:
    # EmpiricalFormula, ResidueSymbol, custom, OriginalAA, Name (all the five fields are required, though OriginalAA is not actually used for anything)
    # Examples:
    #   C5H7N1O2S0,J,custom,P,Hydroxylation     # Hydroxyproline
    #   C3H6N2O0S1,X,custom,C,Amidation         # C-terminal amidation of Cys
    #   C5H5N1O1S0,Z,custom,E,Glu->pyro-Glu     # N-terminal pyroGlu residue, from either Glu OR Gln
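The IsotopeErrorRange comment in the parameter file above describes the precursor check as abs(exp-calc-n*1.00335Da)<20ppm for each n in the range. A minimal Python sketch of that test follows. The function name `matching_isotope_offset` is hypothetical, and expressing the ppm error relative to the calculated mass is an assumption made here; the actual MS-GF+ implementation may differ in detail.

```python
ISOTOPE_SPACING_DA = 1.00335  # spacing quoted in the parameter file comment


def matching_isotope_offset(exp_mass, calc_mass, tol_ppm=20.0,
                            isotope_range=(-1, 2)):
    """Return the first isotope offset n that brings the precursor within
    tol_ppm of the calculated mass, or None if no offset in the range does."""
    low, high = isotope_range
    for n in range(low, high + 1):
        error_da = exp_mass - calc_mass - n * ISOTOPE_SPACING_DA
        # Assumption: ppm error is taken relative to the calculated mass.
        error_ppm = abs(error_da) / calc_mass * 1e6
        if error_ppm < tol_ppm:
            return n
    return None


if __name__ == "__main__":
    # Made-up masses: the observed precursor sits one isotope peak above
    # the theoretical mass, plus 0.01 Da of measurement error.
    print(matching_isotope_offset(2001.01335, 2000.0))
    # A 0.5 Da deviation is not explained by any offset in -1..2.
    print(matching_isotope_offset(2000.5, 2000.0))
```

With `-ti -1,2` (run example 1) the offsets tested are -1, 0, 1, 2; with `IsotopeErrorRange=0,3` (the parameter file) they are 0, 1, 2, 3, which corresponds to `isotope_range=(0, 3)` in this sketch.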
    
    

    3. Results

    The raw output is in mzIdentML format; the example file is test.mzid.


    There are two ways to convert an mzid file to TSV so the results are easier to read. For details, see https://msgfplus.github.io/msgfplus/MzidToTsv.html

    • The first is the MzIDToTsv tool built into MSGFPlus.jar. It is easy to use but slow on large files.
    Usage: java -Xmx3500M -cp MSGFPlus.jar edu.ucsd.msjava.ui.MzIDToTsv
    	-i MzIDFile (MS-GF+ output file (*.mzid))
    	[-o TSVFile] (TSV output file (*.tsv) (Default: MzIDFileName.tsv))
    	[-showQValue 0/1] (0: do not show Q-values, 1: show Q-values (Default))
    	[-showDecoy 0/1] (0: do not show decoy PSMs (Default), 1: show decoy PSMs)
    	[-unroll 0/1] (0: merge shared peptides (Default), 1: unroll shared peptides)
    
    • The second is the standalone MzidToTsvConverter.exe tool. It converts faster and can handle large files, but it is Windows-only (on Linux it requires mono).
    MzidToTsvConverter.exe -mzid:SearchResults.mzid -unroll -showDecoy
    

    Example file after conversion to TSV: test_Unrolled.tsv

    The header contains the following columns:

          1 #SpecFile
          2 SpecID
          3 ScanNum
          4 FragMethod
          5 Precursor
          6 IsotopeError
          7 PrecursorError(ppm)
          8 Charge
          9 Peptide
         10 Protein
         11 DeNovoScore
         12 MSGFScore
         13 SpecEValue
         14 EValue
         15 QValue
         16 PepQValue
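A common next step with the converted TSV is to keep only PSMs below a q-value cutoff. The sketch below does this in Python with the standard `csv` module, relying on the `QValue` column from the header listed above. The function name `filter_psms`, the 0.01 cutoff, and the demo rows are assumptions made for illustration; the demo rows are invented and are not real search results.

```python
import csv
import io


def filter_psms(tsv_lines, max_qvalue=0.01):
    """Return the rows (as dicts keyed by column name) whose QValue is at
    most max_qvalue. tsv_lines is any iterable of lines, e.g. an open file."""
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    return [row for row in reader if float(row["QValue"]) <= max_qvalue]


if __name__ == "__main__":
    # Invented two-row table with a subset of the real columns.
    demo = io.StringIO(
        "#SpecFile\tScanNum\tPeptide\tQValue\n"
        "test.mzML\t101\tPEPTIDEK\t0.0\n"
        "test.mzML\t102\tSAMPLER\t0.25\n"
    )
    for row in filter_psms(demo):
        print(row["ScanNum"], row["Peptide"], row["QValue"])
```

For a real file the same function can be called as `with open("test_Unrolled.tsv", newline="") as fh: hits = filter_psms(fh)`. This assumes the search was run with a decoy database (`-tda 1`) so that the `QValue` column is present.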
    

    ref:
    https://msgfplus.github.io/msgfplus/index.html
    http://www.psidev.info/mzidentml
    https://omics.pnl.gov/software/ms-gf
    https://github.com/MSGFPlus/msgfplus
    https://github.com/MSGFPlus/msgfplus/tree/master/docs/ParameterFiles
    https://msgfplus.github.io/msgfplus/MzidToTsv.html
    https://github.com/MSGFPlus/msgfplus/releases


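    Summary of the proteomics identification and quantification software series:
    [1] Protein identification software: X!Tandem
    [2] Protein identification software: Comet
    [3] Protein identification software: Mascot
    [4] Proteomics identification software: MSGFPlus
    [5] Proteomics identification and quantification software: PD
    [6] Proteomics identification and quantification software: MaxQuant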
  • Original post: https://www.cnblogs.com/jessepeng/p/13578867.html