• Sam format


    reference:https://davetang.org/wiki/tiki-index.php?page=SAM

    @SQ SN:contig1 LN:9401 (序列ID及长度)

    参考序列名,这些参考序列决定了比对结果sort的顺序,SN是参考序列名;LN是参考序列长度;每个参考序列为一行。

    @PG ID:bowtie2 PN:bowtie2 VN:2.0.0-beta7 (比对所使用的软件及版本)

    这里的ID是bwa,PN是bwa,VN是0.7.12-r1039版本。CL可以认为是运行程序

    K00133:143:H5LL5BBXX:8:1101:1194:1103   77      *       0       0       *       *       0       0A                  NCCATCCATAATTTATAGGGAAGTGTTAAAATGCGAATAATATCCTTTTCTTTTCCTGGGATTCGTTCAACATTTCCATTTCCATTTCCCTCCGAGTT                  GATTTATGGGAATTCTCAGCAGCCTTGTCCATAATTCATGAACAATCCCTG     A#AFFAFFJJJJJJJJJJJJJJ7FJJJJAJJJJJJJJJJFJJ                  JJJJJJJJJJJJJAJJJJJJJJJFJAJJJJJFJJFJJJJJJJJJJJJFFFJJJFA7AA7FFJAJ<AJ<AFF-FFFJFJFJJJF<AFFAFAFJJFAJJJ                  FA<<<AFJJA      AS:i:0  XS:i:0
    K00133:143:H5LL5BBXX:8:1101:1194:1103   141     *       0       0       *       *       0       0N                  AAGGAGAAGTTCCCATAGCAACAGCACCATTTTTTCCAGGGAAAAGGAGGAAAGAATCAAGGATGCGAAAGAGGGAATCAAAGCAGCTTAAGGGATAT                  AAAAAAAACAGGGATTGTTCATGAATTATGGACAAGGCAGCTGAGAATTCC     #AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJA-<7-7                  7A<7AJJF<AJJJJJF7JFAJF7AJJFJJJJJJJFF-AFJFJJJJJ--FFJJJJJJAAJFF<AJJJJJF777-----<-7F7-<--<<<<<7F<--7)                  --A7FF--<7      AS:i:0  XS:i:0
    K00133:143:H5LL5BBXX:8:1101:1255:1103   77      *       0       0       *       *       0       0T                  NTTTGAGTGCCTTCTAATTTTTTTTCTCCTGTATCTAAATTGCATGAGTATACTCTGCTACATTTAATCTTCAGGTAAAACAAAACTTTCTTTTATCA                  TATTTAGGAAGTATCACTAATCATGAATTAAAATTAAATGTGTATTGTCAG     A#AFAJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJFJJJJJJ                  FJJJFJJJJJJJJFJFJJJJJJJJJJJJJAJJFJAAJ77-AJFJFJJJJJFJJJJJJJFJJJFJF7AFFJAFFJFFA<FFAFJJAFAJFJJJJJJF-A                  FFJF77-A--      AS:i:0  XS:i:0
    K00133:143:H5LL5BBXX:8:1101:1255:1103   141     *       0       0       *       *       0       0N                  CTGGCAGGCACAGATGCCCACAGTACACACACTGATATTTATAAAGCACTTCACAAAAAGTCACAAAAAAACCCCCAAAAACGAACCAAACCCAAAAA                  CCCCACCAAGCAAGCTGACAATACACATTTAATTTTAATTCATGATTAGTG     #AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJ                  JFJJJJJJJJJJJJJJJJJAJJJJJJJJJJJJJJFFJJJJJ<JFJFJFJJFJ<JJJJJ<FJJJJJFJJJJF7F<FJFJ7AFAJJ7FFFAF7F-AAF<7                  JFA<-77<-<      AS:i:0  XS:i:0

     

    $1:QNAME( 比对的序列名称,即一条测序reads的名称 ):

    K00133:143:H5LL5BBXX:8:1101:1194:1103

    $2:FLAG(表明比对类型:paring,strand,mate strand等):

    可以使用samtools查到flag信息

    $ samtools flags 77
    0x4d    77      PAIRED,UNMAP,MUNMAP,READ1
    
    $ samtools flags 141
    0x8d    141     PAIRED,UNMAP,MUNMAP,READ2

    $3:RNAME(表示read比对的那条序列的序列名称):

    情况一:这条read有比对上的序列,则名称与头部的@SQ相对应,eg:

    $ bwa mem -t 4 -M ../02.index/t/t1.fa ../01.grep/t/o3.gz ../01.grep/t/o4.gz 2>bwa.log |les|awk '{if($3!="*"){print $0}}'
    @SQ     SN:scaffold332  LN:2588
    @SQ     SN:scaffold322  LN:8291
    @SQ     SN:scaffold342  LN:24194
    @SQ     SN:scaffold191  LN:43246
    @SQ     SN:scaffold1157 LN:21100
    @PG     ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem -t 4 -M ../02.index/t/t1.fa ../01.grep/t/o3.gz ../01.grep/t/o4.gz
    K00133:143:H5LL5BBXX:8:1101:14245:1138  69      scaffold191     6389    0       *       =       6389    0       CNTCTTTTCAACTTTCCTACCTGGGACACCAAAAGAAGTGAATTTGAAATGCTGGAATACCCATTTCTTTGTAATATACTACCAGGCAATAATTTCTTCACCAGTGTGTGTAAGCCTGTAGCACTGCCTTGTGACCCACGGGCAGGACCC    A#AFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJ   MC:Z:47S40M63S  AS:i:0  XS:i:0
    K00133:143:H5LL5BBXX:8:1101:14245:1138  137     scaffold191     6389    60      47S40M63S       =       6389    0       NCAAGCTGGGTGGGAGTGTTGATCTGCTGGAGGGCAGGAGGGCTCTGCAGAGGGATCTGGACAGGCTGGATCGATGGGCTGAGGCCAGTTGTATGAGGTTTAATAAGGTGAAATGCTGGGTCCTGCCCGTGGGTCACAAGGCAGTGCTAC    #AAFFJJJJJAJJJJJ<JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJFJJJJJJJJJJJJAJJJJFJJFFFJFJJJJJJ-<FJJJJJJJJJJJJFFFJAFJJJJJJJ7FJ<JJJFJJJJJJJJJJJJJ<FF   NM:i:1  MD:Z:26C13      AS:i:35 XS:i:0

    情况二:这条read没有比对上的序列,则名称是“*”,且这一行的第四,五,八,九 列是“0”,第六,七列与该列是相同的表示方法

    $4:POS(表示read比对到RNAME这条序列的最左边的位置)

    noted:

    (1)无论该read是正向比对或是反向比对到该序列,比对结果均是正向序列或反向序列最左端的比对位置

    (2)如果该read能够完全比对到这条序列,则CIGAR string为M

    $5:MAPQ(表示为mapping的质量值)
    (1)255:mapping值不可用,

    (2)0:通常对于unmapped read

    (3)一般结果是这一列的数值是从0到60,且0和60这两个数字出现次数最多

    (4)第五列为60表示mapping率最高,

    $6:CIGARCIGAR,简要比对信息表达式(Compact Idiosyncratic Gapped Alignment Report),其以参考序列为基础

    M:表示read可mapping到第三列的序列上,则read的碱基序列与第三列的序列碱基相同,表示正常的mapping结果,M表示完全匹配,但是无论reads与序列的正确匹配或是错误匹配该位置都显示

    I:表示read的碱基序列相对于第三列的RNAME序列,有碱基的插入

    D:表示read的碱基序列相对于第三列的RNAME序列,有碱基的删除
    N:read 中有gap
    P:read 和reference sequence都有gap
    S:read 与reference没有mapping上的区域,但是保留了该mismatchs

    H:read 与reference没有mapping上的区域,但是没有保留该mismatchs

    =:正确匹配到序列上
    X:错误匹配到序列上

    set ‘*’ if unavailable

    而H只出现在一条read的前端或末端,但不会出现在中间,S一般会和H成对出现,当有H出现时,一定会有一个与之对应的S出现
    例如:
    162M89S
    162H89M
    149M102S
    149H102M
    40S211M
    20M1D20M211H
    S可以单独出现,而H必须有与之对应的S出现时才可能出现,不可在相同第一列的情况下单独出现
    N:如果是mRNA-to-genome,N出现的位置代表内含子,其它比对形式出现N时则没有具体解释
    M/I/S/=/X:这些数值的加和等于第10列SEQ的长度

    如图:

    $7:MRNM 

    情况一:这条reads第二次比对的位置,

    在利用bwa mem产生sam文件时,如果该列是“=”而第3列RNAME不是“*”,则表示该reads比对到第3列显示序列名的序列上,而没有比对到其他位置,

    在利用bwa aln及bwa sampe比对生成的sam文件,如果和上述情况相同,则第7列为“=”,上述情况均表示该reads只比对到这一个位置

    情况二:如果这条reads没有匹配上的序列,第3列RNAME和第7列MRNM都为“*”

    情况三:如果这条reads匹配两个序列,则第一个序列的名称出现在第3列,而第二个序列的名称出现在第7列

    $8:MPOS:该列表示与该reads对应的mate pair reads的比对位置,
    如果这对pair-end reads比对到同一条reference序列上,在sam文件中reads的id出现2次,Read1比对的第4列等于Read2比对的第8列。同样Read1比对的第8列等于Read2比对的第4列。例如:
    第1列(Read id)····第4列(Read1比对位置)····第8列(mate-pair reads比对位置)
    22699:1759····124057649····124057667
    22699:1759····124057667····124057649
    相同的reads id一个来自Read1文件,一个来自Read2文件,第4列和第8列是对应的

    $ bwa mem -t 4 -M ../02.index/t/t1.fa ../01.grep/t/o3.gz ../01.grep/t/o4.gz 2>bwa.log |awk '{if($3!="*"){print $0}}'|awk '{print $1"	"$4"	"$8}'|tail -2
    K00133:143:H5LL5BBXX:8:1101:14245:1138  6389    6389
    K00133:143:H5LL5BBXX:8:1101:14245:1138  6389    6389

    $9:TLEN:insert length+mapping base
    如果R1端的read和R2端的read能够mapping到同一条Reference序列上(即第三列RNAME相同),则该列的值表示第8列减去第4列加上第6列的M值,即为R1端和R2端相同id的reads其第九列值相同,但该值为一正一负,R1文件的reads和R2文件的reads,相同id的reads要相对来看。在进行该第列值的计算时,如果取第6列的数值,一定要取出现M的值,S或H的值不能取。It is set as 0 for single-segment template or when the information is unavailable.

    $ bwa mem -t 4 -M ../02.index/t/t1.fa ../01.grep/t/o3.gz ../01.grep/t/o4.gz 2>bwa.log |awk '{if($3!="*"){print $0}}'|awk '{print $1"	"$9}'|tail -2
    K00133:143:H5LL5BBXX:8:1101:14245:1138  0
    K00133:143:H5LL5BBXX:8:1101:14245:1138  0

    $10:SEQ,(序列片段的序列信息,如果不存储此类信息,此处为’*‘,注意CIGAR中M/I/S/=/X对应数字的和要等于序列长度);

    $ bwa mem -t 4 -M ../02.index/t/t1.fa ../01.grep/t/o3.gz ../01.grep/t/o4.gz 2>bwa.log |awk '{if($3!="*"){print $0}}'|awk '{print $1"	"$10}'|tail -2
    K00133:143:H5LL5BBXX:8:1101:14245:1138  CNTCTTTTCAACTTTCCTACCTGGGACACCAAAAGAAGTGAATTTGAAATGCTGGAATACCCATTTCTTTGTAATATACTACCAGGCAATAATTTCTTCACCAGTGTGTGTAAGCCTGTAGCACTGCCTTGTGACCCACGGGCAGGACCC
    K00133:143:H5LL5BBXX:8:1101:14245:1138  NCAAGCTGGGTGGGAGTGTTGATCTGCTGGAGGGCAGGAGGGCTCTGCAGAGGGATCTGGACAGGCTGGATCGATGGGCTGAGGCCAGTTGTATGAGGTTTAATAAGGTGAAATGCTGGGTCCTGCCCGTGGGTCACAAGGCAGTGCTAC
    
    

    $11:QUAL,(ASCII,read质量的ASCII编码。)

    $ bwa mem -t 4 -M ../02.index/t/t1.fa ../01.grep/t/o3.gz ../01.grep/t/o4.gz 2>bwa.log "$13"	"$14"	"$15"	"$16}'|tail -2
    K00133:143:H5LL5BBXX:8:1101:14245:1138  MC:Z:47S40M63S  AS:i:0  XS:i:0
    K00133:143:H5LL5BBXX:8:1101:14245:1138  NM:i:1  MD:Z:26C13      AS:i:35 XS:i:0

    $12之后是自选option

    $ bwa mem -t 4 -M ../02.index/t/t1.fa ../01.grep/t/o3.gz ../01.grep/t/o4.gz 2>bwa.log |awk '{if($3!="*"){print $0}}'|awk '{print $1"	"$12"	"$13"	"$14"	"$15"	"$16}'|tail -2
    K00133:143:H5LL5BBXX:8:1101:14245:1138  MC:Z:47S40M63S  AS:i:0  XS:i:0
    K00133:143:H5LL5BBXX:8:1101:14245:1138  NM:i:1  MD:Z:26C13      AS:i:35 XS:i:0
     
  • 相关阅读:
    Linux学习 -- Shell编程 -- 字符截取命令
    Linux学习 -- Shell编程 -- 正则表达式
    Linux学习 -- Shell基础 -- Bash变量
    Linux学习 -- Shell基础 -- Bash基本功能
    Linux学习 -- Shell基础 -- 概述
    Linux学习 -- 备份与恢复
    Linux学习 -- 启动管理
    Linux学习 -- 日志管理
    chapter9_3 协同程序实现迭代器
    chapter9_2 管道与过滤器
  • 原文地址:https://www.cnblogs.com/yuanjingnan/p/11232351.html
Copyright © 2020-2023  润新知