• Introduction to blktrace: how it works and how to use it


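    Introduction to blktrace

    blktrace is a user-space tool that collects detailed information about disk I/O as it passes through the block device layer (the block layer, hence "blk trace"): events such as a request being submitted, queued, merged, and completed.

     

    The block device layer is the "block layer" in the figure below (figure borrowed from 褚霸).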

     

     
    How blktrace works

     

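    (1)     When tracing, blktrace starts one thread per logical CPU on the machine and binds each thread to one logical CPU to collect data.

    (2)     blktrace opens the traced device and calls ioctl with three arguments: the device's file descriptor, the request code _IOWR(0x12,115,struct blk_user_trace_setup), and the address of a struct blk_user_trace_setup. This system call asks the kernel to set up tracing. The kernel then exposes one trace file per CPU under the debugfs mount point (by default /sys/kernel/debug) and writes trace data into these files, and each blktrace thread reads the file for its CPU through its own file descriptor.

    (3)     blktrace is used together with blkparse, which parses the binary format that blktrace produces.

    (4)     blkparse only opens the files produced by blktrace, reads the data, displays it, and prints per-CPU statistics at the end. The action codes that blkparse displays (A, U, Q, and so on, described below) are derived by blkparse itself: it takes t->action & 0xffff and converts the numeric value into the corresponding letter.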
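
    Step (4) can be sketched as follows. This is a minimal Python illustration, not blkparse's actual code. The numeric values in ACTION_LETTERS are an assumption recalled from the enum in blktrace_api.h and should be checked against the header on your system; decode_action is a hypothetical helper name.

```python
# Hypothetical sketch: the low 16 bits of a trace record's action word
# hold the action number, the high 16 bits hold category flags.
# The numeric values below are assumptions recalled from blktrace_api.h;
# verify them against your copy of the header.
ACTION_LETTERS = {
    1: "Q",   # queue
    2: "M",   # back merge
    3: "F",   # front merge
    4: "G",   # get request
    5: "S",   # sleep request
    6: "R",   # requeue
    7: "D",   # issue to driver
    8: "C",   # complete
    9: "P",   # plug
    10: "U",  # unplug
    11: "T",  # unplug due to timer
    12: "I",  # insert
    13: "X",  # split
    14: "B",  # bounce
    15: "A",  # remap
}


def decode_action(action):
    """Return the single-letter code for a 32-bit action word."""
    return ACTION_LETTERS.get(action & 0xFFFF, "?")


# The category bits in the high half do not affect the lookup.
print(decode_action(0x00020001) == decode_action(1))
```

    Masking with 0xffff is what lets the same letter be shown regardless of which category flags the kernel set in the upper half of the word.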

     

    Installing blktrace

    1.       yum install blktrace

    2.       Get the source (you can also build and install from source)

    git clone git://git.kernel.org/pub/scm/linux/kernel/git/axboe/blktrace.git bt

    cd bt

    make

    make install

     

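    Using blktrace
     
    Mounting debugfs

        As described in the section on how blktrace works, blktrace relies on the kernel exporting trace data through the debugfs file system (debugfs lives in memory).

     

    So before using blktrace, mount debugfs first:

    mount -t debugfs debugfs /sys/kernel/debug

     

    Or add the following line to /etc/fstab so that it is mounted automatically at boot:

    debugfs    /sys/kernel/debug    debugfs    defaults    0    0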
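
    Before running blktrace it can be handy to confirm that debugfs is actually mounted. The sketch below is one possible check, assuming a Linux system where /proc/mounts lists one mount per line as "device mountpoint fstype options dump pass"; debugfs_mount_points is a hypothetical helper name.

```python
import os


def debugfs_mount_points(mounts_text):
    """Return the mount points whose file system type is debugfs.

    mounts_text is expected in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "debugfs":
            points.append(fields[1])
    return points


# Only read /proc/mounts where it exists (Linux).
if os.path.exists("/proc/mounts"):
    with open("/proc/mounts") as f:
        found = debugfs_mount_points(f.read())
    print(found if found else "debugfs is not mounted")
```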

     

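    Tracing a specific disk or partition with blktrace

    See man blktrace for the full syntax; only the common usage is covered here.

     

    Output to files

    mkdir test  # by default blktrace writes its data to the current directory; as noted in the section on how blktrace works, each logical CPU has one thread that produces one file, so there is one file per CPU

    blktrace -d /dev/sda -o test1

    # Traces /dev/sda. The output files are named test1.blktrace.<cpu>, with <cpu> from 0 to (number of CPUs - 1). They hold binary data that blkparse must parse.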
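
    The naming rule for the per-CPU files can be written down as a short sketch. expected_trace_files is a hypothetical helper, and using os.cpu_count() as the number of logical CPUs is an assumption that may not match what blktrace sees on every system.

```python
import os


def expected_trace_files(basename, ncpus):
    """List the per-CPU file names written for `blktrace -o basename`."""
    return [f"{basename}.blktrace.{cpu}" for cpu in range(ncpus)]


# Names expected on this machine, and which of them exist in the
# current directory after a trace run.
ncpus = os.cpu_count() or 1
names = expected_trace_files("test1", ncpus)
present = [name for name in names if os.path.exists(name)]
print(len(names), "expected,", len(present), "present")
```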

     

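    Output to the terminal

    blktrace -d /dev/sda -o - | blkparse -i -

    Passing "-" as the output name sends the data to standard output, but it is raw binary and unreadable, so it is piped to blkparse to be parsed in real time.

    blkparse takes a file name after "-i". For blktrace, an output name of "-" means standard output (this symbol is hard-coded in the source), and blkparse likewise uses "-" to mean that it should read its input from the pipe.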

     

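    Parsing blktrace data with blkparse

    See man blkparse for the full syntax; only the common usage is covered here.

     

    Parsing files

    blkparse -i test1  # parses all of test1.blktrace.<cpu>, for <cpu> from 0 to (number of CPUs - 1); only files that contain data are counted in the statistics

     

    Real-time parsing

    Real-time parsing is the pipeline shown under "Output to the terminal" in the blktrace section.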

     

    Example

    Terminal 1

    blktrace /dev/sda -o - | blkparse -i -    (leave this running)

    Terminal 2

    dd if=/dev/zero of=/root/a1 bs=4k count=1000

     

    Terminal 1 shows

    8,0   16     3041    94.435078912   891  A   W 72411584 + 8 <- (8,2) 71884224

    8,0   16     3042    94.435079691   891  Q   W 72411584 + 8 [flush-8:0]

    8,0   16     3043    94.435080790   891  M   W 72411584 + 8 [flush-8:0]

    8,0   16     3044    94.435083089   891  A   W 72411592 + 8 <- (8,2) 71884232

     

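    Interpreting the output

    This is the default output format. In the source, a fixed header is printed first, and then further fields are printed (or not) depending on the action.

     

    The header is printed first, equivalent to  -f "%D %2c %8s %5T.%9t %5p %2a %3d "
     

    Each letter has the meaning described below; the number is the field width in characters, as with numeric widths in printf.

     

    8,0   16     3042    94.435079691   891  Q   W 72411584 + 8 [flush-8:0]

    With the default header format -f "%D %2c %8s %5T.%9t %5p %2a %3d ":

    1. 8,0 corresponds to %D, the major and minor device numbers

    2. 16 corresponds to %2c, the CPU id

    3. 3042 corresponds to %8s, the sequence number (it numbers the trace event and is not part of the I/O request itself)

    4. 94.435079691 corresponds to %5T.%9t, the timestamp as seconds.nanoseconds

    5. 891 corresponds to %5p, the process id

    6. Q corresponds to %2a, the action. The actions are listed in the table below (for example, Q means "IO handled by request queue code"); see the appendix on actions for more detail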

    The following table shows the various actions which may be output.              

    Act Description

    A IO was remapped to a different device

    B IO bounced

    C IO completion

    D IO issued to driver

    F IO front merged with request on queue

    G Get request

    I IO inserted onto request queue

    M IO back merged with request on queue

    P Plug request

    Q IO handled by request queue code

    S Sleep request

    T Unplug due to timeout

    U Unplug request

    X Split

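    7. W corresponds to %3d, the RWBS field (here W means a write). The letters mean:

             It contains at least one of "RWD" (R read, W write, D discard)

             It may additionally contain "B" (barrier) and "S" (synchronous)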
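
    The seven header fields described above can be pulled out of a line of default blkparse output with a regular expression. This is a minimal sketch that assumes the default format only; LINE_RE and parse_header are hypothetical names, and lines produced with a custom -f format or message ('m') lines may not match.

```python
import re

# Header fields of the default format, followed by the action-specific rest.
LINE_RE = re.compile(
    r"^\s*(?P<major>\d+),(?P<minor>\d+)\s+"   # %D   device major,minor
    r"(?P<cpu>\d+)\s+"                        # %2c  CPU id
    r"(?P<seq>\d+)\s+"                        # %8s  sequence number
    r"(?P<sec>\d+)\.(?P<nsec>\d{9})\s+"       # %5T.%9t  seconds.nanoseconds
    r"(?P<pid>\d+)\s+"                        # %5p  process id
    r"(?P<action>\S+)\s+"                     # %2a  action
    r"(?P<rwbs>\S+)\s*"                       # %3d  RWBS field
    r"(?P<rest>.*)$"                          # action-specific remainder
)


def parse_header(line):
    """Split one default-format line into a dict, or return None."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    for key in ("major", "minor", "cpu", "seq", "sec", "nsec", "pid"):
        fields[key] = int(fields[key])
    return fields


sample = "8,0   16     3042    94.435079691   891  Q   W 72411584 + 8 [flush-8:0]"
print(parse_header(sample))
```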

     

    Then the action-specific remainder is printed (this is how the source code does it):

    switch (act[0]) {
    case 'R':   /* Requeue */
    case 'C':   /* Complete */
        if (t->action & BLK_TC_ACT(BLK_TC_PC)) {
            char *p = dump_pdu(pdu_buf, pdu_len);
            if (p)
                fprintf(ofp, "(%s) ", p);
            fprintf(ofp, "[%d]\n", t->error);
        } else {
            if (elapsed != -1ULL) {
                if (t_sec(t))
                    fprintf(ofp, "%llu + %u (%8llu) [%d]\n",
                        (unsigned long long) t->sector,
                        t_sec(t), elapsed, t->error);
                else
                    fprintf(ofp, "%llu (%8llu) [%d]\n",
                        (unsigned long long) t->sector,
                        elapsed, t->error);
            } else {
                if (t_sec(t))
                    fprintf(ofp, "%llu + %u [%d]\n",
                        (unsigned long long) t->sector,
                        t_sec(t), t->error);
                else
                    fprintf(ofp, "%llu [%d]\n",
                        (unsigned long long) t->sector,
                        t->error);
            }
        }
        break;

    case 'D':   /* Issue */
    case 'I':   /* Insert */
    case 'Q':   /* Queue */
    case 'B':   /* Bounce */
        if (t->action & BLK_TC_ACT(BLK_TC_PC)) {
            char *p;
            fprintf(ofp, "%u ", t->bytes);
            p = dump_pdu(pdu_buf, pdu_len);
            if (p)
                fprintf(ofp, "(%s) ", p);
            fprintf(ofp, "[%s]\n", name);
        } else {
            if (elapsed != -1ULL) {
                if (t_sec(t))
                    fprintf(ofp, "%llu + %u (%8llu) [%s]\n",
                        (unsigned long long) t->sector,
                        t_sec(t), elapsed, name);
                else
                    fprintf(ofp, "(%8llu) [%s]\n", elapsed,
                        name);
            } else {
                if (t_sec(t))
                    fprintf(ofp, "%llu + %u [%s]\n",
                        (unsigned long long) t->sector,
                        t_sec(t), name);
                else
                    fprintf(ofp, "[%s]\n", name);
            }
        }
        break;

    case 'M':   /* Back merge */
    case 'F':   /* Front merge */
    case 'G':   /* Get request */
    case 'S':   /* Sleep request */
        if (t_sec(t))
            fprintf(ofp, "%llu + %u [%s]\n",
                (unsigned long long) t->sector, t_sec(t), name);
        else
            fprintf(ofp, "[%s]\n", name);
        break;

    case 'P':   /* Plug */
        fprintf(ofp, "[%s]\n", name);
        break;

    case 'U':   /* Unplug IO */
    case 'T':   /* Unplug timer */
        fprintf(ofp, "[%s] %u\n", name, get_pdu_int(t));
        break;

    case 'A':   /* remap */
        get_pdu_remap(t, &r);
        fprintf(ofp, "%llu + %u <- (%d,%d) %llu\n",
            (unsigned long long) t->sector, t_sec(t),
            MAJOR(r.device_from), MINOR(r.device_from),
            (unsigned long long) r.sector_from);
        break;

    case 'X':   /* Split */
        fprintf(ofp, "%llu / %u [%s]\n", (unsigned long long) t->sector,
            get_pdu_int(t), name);
        break;

    case 'm':   /* Message */
        fprintf(ofp, "%*s\n", pdu_len, pdu_buf);
        break;

    default:
        fprintf(stderr, "Unknown action %c\n", act[0]);
        break;
    }

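    So:

     

    Detailed interpretation

    8,0   16     3042    94.435079691   891  Q   W 72411584 + 8 [flush-8:0]

    Here act[0]='Q'. The following 72411584 is the starting sector number relative to device 8,0 (that is, sda); "+ 8" means the 8 consecutive sectors starting there (a sector is 512 bytes by default, so 8 sectors is 4 KB); and the trailing [flush-8:0] is the name of the process.

     

    8,0   16     3041    94.435078912   891  A   W 72411584 + 8 <- (8,2) 71884224

    Here act[0]='A'. 72411584 is the starting sector number relative to 8,0 (that is, sda), and "(8,2) 71884224" means that the same I/O starts at sector 71884224 relative to the partition /dev/sda2. Because /dev/sda2 is a partition on the sda disk, a starting position on sda2 must first be remapped to a position on sda.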
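
    The arithmetic behind the remap line can be checked with the numbers from the sample output. In this sketch remap_offset is a hypothetical helper; reading the difference as the sector at which sda2 starts on sda is an inference from the remap description, which on a real system could be compared against /sys/block/sda/sda2/start.

```python
SECTOR_SIZE = 512  # bytes; the default sector size assumed in the text


def remap_offset(disk_sector, partition_sector):
    """Difference between the whole-disk and partition-relative sector."""
    return disk_sector - partition_sector


# Values taken from the two 'A' lines in the sample output.
first = remap_offset(72411584, 71884224)
second = remap_offset(72411592, 71884232)
print(first == second)        # both remaps use the same offset
print(8 * SECTOR_SIZE)        # size in bytes of a "+ 8" request
```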

     

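    Sector numbers are contiguous on the disk, and the file system divides the disk into blocks, each containing several sectors. So block number = sector number / (number of sectors per block).

    From the block number you can find the corresponding inode (here debugfs is the ext2/ext3/ext4 file system debugger, not the debugfs file system mounted earlier):

    debugfs -R 'icheck <block number>' <disk or partition>

    If the block number was computed from a sector number relative to sda2, then debugfs -R 'icheck <block number>' /dev/sda2 finds the corresponding inode.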
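
    As a worked example of the sector-to-block conversion, the sketch below uses the partition-relative sector from the sample remap line. The 4096-byte block size is an assumption for illustration; the real value for a file system is reported as "Block size" by tune2fs -l. sector_to_block is a hypothetical helper name.

```python
def sector_to_block(sector, block_size=4096, sector_size=512):
    """Convert a sector number to a file system block number.

    sector must be relative to the device handed to debugfs
    (for /dev/sda2, use the partition-relative sector).
    """
    sectors_per_block = block_size // sector_size
    return sector // sectors_per_block


# Partition-relative sector from the sample 'A' line, assumed 4 KB blocks.
print(sector_to_block(71884224))
```

    The result is the number to pass to icheck. Using the whole-disk sector (72411584) with /dev/sda2 would point at the wrong block, because the partition's starting offset would be counted twice.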

     

    From the inode you can then find out which file it is:
    find / -inum your_inode
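
    The same lookup can be done from a script. The sketch below is a rough Python counterpart to find with -inum, not a replacement for it; find_by_inode is a hypothetical helper. Because inode numbers are unique only within one file system, it skips entries that live on a different device than the starting directory.

```python
import os


def find_by_inode(root, inode):
    """Return paths under root whose inode number equals inode."""
    root_dev = os.lstat(root).st_dev
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # entry vanished or is unreadable
            # Ignore other file systems mounted below root.
            if st.st_dev == root_dev and st.st_ino == inode:
                matches.append(path)
    return matches
```

    A call such as find_by_inode("/root", your_inode) would search the directory used in the dd example; a file with several hard links is returned once per link.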

     

    For a worked example, see the linked article by a Taobao engineer.

     

    Appendix: meaning of each action

    C – complete A previously issued request has been completed. The output

    will detail the sector and size of that request, as well as the success or

    failure of it.

     

    D – issued A request that previously resided on the block layer queue or in

    the io scheduler has been sent to the driver.

     

    I – inserted A request is being sent to the io scheduler for addition to the

    internal queue and later service by the driver. The request is fully formed

    at this time.

     

    Q – queued This notes intent to queue io at the given location. No real request

    exists yet.

     

    B – bounced The data pages attached to this bio are not reachable by the

    hardware and must be bounced to a lower memory location. This causes

    a big slowdown in io performance, since the data must be copied to/from

    kernel buffers. Usually this can be fixed with using better hardware -

    either a better io controller, or a platform with an IOMMU.

     

    m – message Text message generated via kernel call to blk_add_trace_msg.

     

    M – back merge A previously inserted request exists that ends on the boundary

    of where this io begins, so the io scheduler can merge them together.

     

    F – front merge Same as the back merge, except this io ends where a previously

    inserted request starts.

     

    G – get request To send any type of request to a block device, a struct request

    container must be allocated first.

     

    S – sleep No request structures were available, so the issuer has to

    wait for one to be freed.

     

    P – plug When io is queued to a previously empty block device queue, Linux

    will plug the queue in anticipation of future ios being added before this

    data is needed.

     

    U – unplug Some request data already queued in the device, start sending

    requests to the driver. This may happen automatically if a timeout period

    has passed (see next entry) or if a number of requests have been added to

    the queue.

     

    T – unplug due to timer If nobody requests the io that was queued after

    plugging the queue, Linux will automatically unplug it after a defined

    period has passed.

     

    X – split On raid or device mapper setups, an incoming io may straddle a

    device or internal zone and needs to be chopped up into smaller pieces

    for service. This may indicate a performance problem due to a bad setup

    of that raid/dm device, but may also just be part of normal boundary

    conditions. dm is notably bad at this and will clone lots of io.

     

    A – remap For stacked devices, incoming io is remapped to device below it in

    the io stack. The remap action details what exactly is being remapped to

    what.

    A diagram is also included, which may make this clearer.

  • Original article: https://www.cnblogs.com/wuchanming/p/3768323.html