• An explanation of the GTF genome annotation file


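      GTF stands for Gene Transfer Format; the file annotates the genes on each chromosome. What does that mean? Gene names, loci and so on are just names that people later gave to stretches of DNA sequence; back in the cell, what actually exists is a long chromosome (a DNA sequence) inside the nucleus. The main job of a GTF file is to state where each so-called gene lies on the chromosome (its coordinates), along with additional information about that interval.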

      I usually download GTF files from Ensembl; GENCODE works too. Here are the links:

      ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens/

      http://www.gencodegenes.org/releases/current.html 
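      Once a file is downloaded, a minimal sketch for reading it in Python might look like the following. It assumes the file is either plain text or gzip-compressed, and that header/comment lines start with '#' (Ensembl GTF dumps typically begin with a few such lines). The function name and the path in the usage comment are illustrative, not part of any official tool.

```python
import gzip


def iter_gtf_lines(path):
    """Yield feature lines from a GTF file, skipping blank and '#' lines.

    Files ending in '.gz' are opened with gzip; anything else as plain text.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line or line.startswith("#"):
                continue  # header/comment or empty line
            yield line


# Usage (hypothetical local path; substitute the file you downloaded):
# for line in iter_gtf_lines("Homo_sapiens.GRCh38.89.gtf.gz"):
#     print(line)
#     break
```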

      

      For an explanation of the format, see the official Ensembl documentation: http://www.ensembl.org/info/website/upload/gff.html

     

    GFF/GTF File Format - Definition and supported options

    The GFF (General Feature Format) format consists of one line per feature, each containing 9 columns of data, plus optional track definition lines. The following documentation is based on the Version 2 specifications.

    The GTF (General Transfer Format) is identical to GFF version 2.

    Fields

    Fields must be tab-separated. Also, all but the final field in each feature line must contain a value; "empty" columns should be denoted with a '.'

    1. seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
    2. source - name of the program that generated this feature, or the data source (database or project name)
    3. feature - feature type name, e.g. Gene, Variation, Similarity
    4. start - Start position of the feature, with sequence numbering starting at 1.
    5. end - End position of the feature, with sequence numbering starting at 1.
    6. score - A floating point value.
    7. strand - defined as + (forward) or - (reverse).
    8. frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
    9. attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.

    Note that where the attributes contain identifiers that link the features together into a larger structure, these will be used by Ensembl to display the features as joined blocks.
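    The nine columns described above can be sketched as a small Python parser. This is only an illustration under a few assumptions: the names Feature, parse_feature_line, feature_length and to_zero_based_half_open are made up here, a '.' placeholder is mapped to None, and a line that omits the final (attribute) column is accepted with an empty attribute string.

```python
from typing import NamedTuple, Optional, Tuple


class Feature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int              # 1-based, inclusive
    end: int                # 1-based, inclusive
    score: Optional[float]  # None when the column is '.'
    strand: Optional[str]   # '+', '-' or None when the column is '.'
    frame: Optional[int]    # 0, 1, 2 or None when the column is '.'
    attribute: str


def parse_feature_line(line: str) -> Feature:
    """Split one tab-separated GFF/GTF feature line into typed fields."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) == 8:
        cols.append("")  # the final field is allowed to be missing
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqname, source, feature, start, end, score, strand, frame, attribute = cols
    return Feature(
        seqname, source, feature, int(start), int(end),
        None if score == "." else float(score),
        None if strand == "." else strand,
        None if frame == "." else int(frame),
        attribute.strip(),
    )


def feature_length(f: Feature) -> int:
    """Length in bases; both ends are inclusive, hence the +1."""
    return f.end - f.start + 1


def to_zero_based_half_open(f: Feature) -> Tuple[int, int]:
    """Convert to the 0-based, end-exclusive convention (as used by BED)."""
    return f.start - 1, f.end


# Usage: the first Ensembl sample record, with its columns joined by tabs.
line = "\t".join([
    "1", "transcribed_unprocessed_pseudogene", "gene", "11869", "14409",
    ".", "+", ".", 'gene_id "ENSG00000223972"; gene_name "DDX11L1";',
])
gene = parse_feature_line(line)
print(gene.seqname, gene.start, gene.end, feature_length(gene))
```

    Because start and end are both 1-based and inclusive, the length is end - start + 1, and converting to a 0-based half-open interval only shifts the start.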

    Sample GTF output from Ensembl data dump:

     1 transcribed_unprocessed_pseudogene  gene        11869 14409 . + . gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; 
     1 processed_transcript                transcript  11869 14409 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana";
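    The attribute column of the GTF sample uses `tag "value";` pairs. A rough way to turn it into a dictionary is sketched below; the function name parse_gtf_attributes is hypothetical, unquoted values are also accepted as an assumption about other GTF producers, and when a tag is repeated only its last value is kept.

```python
import re

# A tag, whitespace, then either a double-quoted value or a bare token.
_GTF_ATTR_RE = re.compile(r'(\w+)\s+(?:"([^"]*)"|([^;\s]+))')


def parse_gtf_attributes(text):
    """Parse 'tag "value"; tag "value";' into a dict of strings."""
    attrs = {}
    for match in _GTF_ATTR_RE.finditer(text):
        key = match.group(1)
        quoted, bare = match.group(2), match.group(3)
        attrs[key] = quoted if quoted is not None else bare
    return attrs


# Usage with the attribute column of the transcript record above.
attribute = (
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
    'gene_name "DDX11L1"; gene_source "havana"; '
    'gene_biotype "transcribed_unprocessed_pseudogene"; '
    'transcript_name "DDX11L1-002"; transcript_source "havana";'
)
info = parse_gtf_attributes(attribute)
print(info["gene_id"], info["transcript_id"], info["gene_name"])
```

    Shared identifiers such as gene_id and transcript_id are what link a transcript line back to its gene line.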

    Sample GFF output from Ensembl export:

    X    Ensembl    Repeat    2419108    2419128    42    .    .    hid=trf; hstart=1; hend=21
    X    Ensembl    Repeat    2419108    2419410    2502    -    .    hid=AluSx; hstart=1; hend=303
    X    Ensembl    Repeat    2419108    2419128    0    .    .    hid=dust; hstart=2419108; hend=2419128
    X    Ensembl    Pred.trans.    2416676    2418760    450.19    -    2    genscan=GENSCAN00000019335
    X    Ensembl    Variation    2413425    2413425    .    +    .
    X    Ensembl    Variation    2413805    2413805    .    +    .
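    The GFF export above writes its attributes as `key=value` pairs rather than the quoted `tag "value"` style of the GTF dump. A minimal sketch for that style follows, assuming values never contain ';' or '='; the function name parse_gff_attributes is illustrative only.

```python
def parse_gff_attributes(text):
    """Parse 'key=value; key=value' into a dict; an empty column gives {}."""
    attrs = {}
    for part in text.split(";"):
        part = part.strip()
        if not part:
            continue  # skip empty pieces, e.g. after a trailing ';'
        key, sep, value = part.partition("=")
        # A bare key without '=' is stored with value None.
        attrs[key.strip()] = value.strip() if sep else None
    return attrs


# Usage with the attribute column of the second Repeat record.
repeat = parse_gff_attributes("hid=AluSx; hstart=1; hend=303")
print(repeat["hid"], repeat["hstart"], repeat["hend"])
```

    The values come back as strings; converting hstart and hend to integers is left to the caller.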

    Track lines

    Although not part of the formal GFF specification, Ensembl uses track lines to further configure sets of features (thus maintaining compatibility with UCSC). Track lines should be placed at the beginning of the list of features they are to affect.

    The track line consists of the word 'track' followed by space-separated key=value pairs - see the example below. Valid parameters used by Ensembl are:

    • name - unique name to identify this track when parsing the file
    • description - Label to be displayed under the track in Region in Detail
    • priority - integer defining the order in which to display tracks, if multiple tracks are defined.
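    A track line of the shape described above could be parsed roughly as follows. The track line in the usage is invented for illustration, and the sketch assumes that values containing spaces are wrapped in double quotes; parse_track_line is a hypothetical helper, not an Ensembl function.

```python
import shlex


def parse_track_line(line):
    """Parse "track key=value ..." into a dict; 'priority' becomes an int."""
    tokens = shlex.split(line)  # keeps quoted values with spaces together
    if not tokens or tokens[0] != "track":
        raise ValueError("not a track line")
    params = {}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {token!r}")
        params[key] = value
    if "priority" in params:
        params["priority"] = int(params["priority"])
    return params


# Usage with a made-up track line.
track = parse_track_line('track name=ESTs description="ESTs from Ensembl" priority=3')
print(track)
```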

    More information

    For more information about this file format, see the documentation on the GMOD wiki.

                                    

  • Original post: https://www.cnblogs.com/Demo1589/p/6950196.html