• 内核调试-perf introduction


    perf概念

    perf_event

    Perf_events是目前在Linux上使用广泛的profiling/tracing工具,除了本身是内核(kernel)的组成部分以外,还提供了用户空间(user-space)的命令行 工具(“perf”,“perf-record”,“perf-stat”等等)。

    perf_events提供两种工作模式:

    1. 采样模式(sampling)
    2. 计数模式(counting)

    “perf record”命令工作在采样模式:周期性地做事件采样,并把信息记录下来,默认保存在perf.data文件;而“perf stat”命令工作在计数模式:仅仅统计 某个事件发生的次数。

    我们经常看到类似这样的命令:“perf record -a ...... sleep 10”。在这里 ,“sleep”这个命令相当于一个“dummy”命令,没有做任何有意义的工作,它的 作用是让“perf record”命令对整个系统进行采样,并在10秒后自动结束采样工作。

    perf_event - PMU

    Perf_events所处理的hardware event(硬件事件)需要CPU的支持,而目前主 流的CPU基本都包含了PMU(Performance Monitoring Unit,性能监控单元)。PMU用来统计性能相关的参数,像cache命中率,指令周期等等。由于这些统计 工作是硬件完成的,所以CPU开销很小。

    以X86体系结构为例,PMU包含了两种MSRs(Model-Specific Registers,之所 以称之为Model-Specific,是因为不同model的CPU,有些register是不同的) :Performance Event Select Registers和Performance Monitoring Counters(PMC)。当想对某种性能事件(performance event)进行统计时,需要对Performance Event Select Register进行设置,统计结果会存在Performance Monitoring Counter中。

    当perf_events工作在采样模式(sampling,perf record命令即工作在这种模 式)时,由于采样事件发生时和实际处理采样事件之间有时间上的delay,以及CPU流水线和乱序执行等因素,所以得到的指令地址IP(Instruction Pointer) 并不是当时产生采样事件的IP,这个称之为skid。为了改善这种状况,使IP值 更加准确,Intel使用PEBS(Precise Event-Based Sampling),而AMD则使用IBS(Instruction-Based Sampling)。

    以PEBS为例:每次采样事件发生时,会先把采样数据存到一个缓冲区中(PEBS buffer),当缓冲区内容达到某一值时,再一次性处理,这样可以很好地解决skid问题。

    执行一下perf list --help命令,会看到下面内容:

    The p modifier can be used for specifying how precise the instruction address should be. The p modifier can be specified multiple times:
    
           0 - SAMPLE_IP can have arbitrary skid
           1 - SAMPLE_IP must have constant skid
           2 - SAMPLE_IP requested to have 0 skid
           3 - SAMPLE_IP must have 0 skid
    
    For Intel systems precise event sampling is implemented with PEBS which supports up to precise-level 2.
    

    现在可以理解,经常看到的类似“perf record -e "cpu/mem-loads/pp" -a”命 令中,pp就是指定IP精度的。

    系统调用perf_open_event

    代表一种事件资源,用户态调用 perf_open_event 即会创建一个与之对应的 perf_event 对象,相应的一些重要数据都会以这个数据结构为维度存放 包含 pmu ctx enabled_time

    running_time count 等信息
    
    include/linux/perf_event.h
    struct perf_event {
    
    }
    
    
    ./arch/arm64/kernel/perf_event.c
    
    

    例子

    下面我用 ls 命令来演示 sys_enter 这个 tracepoint 的使用:

    perf stat -e raw_syscalls:sys_enter ls
    

    指定pid, 采集1s:

    [root@localhost /home/ahao.mah]
    #perf stat -e syscalls:* -p 49770 sleep 1
    

    perf stat的输出简单介绍

    perf stat 的输出

    [root@localhost /home/ahao.mah]
    #perf stat ls
    perf.data  perf.data.old  test  test.c
    
     Performance counter stats for 'ls':
    
              1.256036      task-clock (msec)         #    0.724 CPUs utilized
                     4      context-switches          #    0.003 M/sec
                     0      cpu-migrations            #    0.000 K/sec
                   285      page-faults               #    0.227 M/sec
             2,506,596      cycles                    #    1.996 GHz                      (87.56%)
             1,891,085      stalled-cycles-frontend   #   75.44% frontend cycles idle
             1,526,425      stalled-cycles-backend    #   60.90% backend  cycles idle
             1,551,244      instructions              #    0.62  insns per cycle
                                                      #    1.22  stalled cycles per insn
               309,841      branches                  #  246.682 M/sec
                12,190      branch-misses             #    3.93% of all branches          (21.57%)
    
           0.001733685 seconds time elapsed
    
    1. 执行时间: 1.256036ms
    2. 持续时间: 0.001733685 seconds time , 持续时间肯定大于执行时间, 因为cpu的调度策略,抢占等原因
    3. cpu利用率:  #    0.724 CPUs utilized  等于 ( 执行时间/持续时间)
    

    perf stat 实现

    tools/perf/builtin-stat.c
    run_perf_stat
       __run_perf_stat
    
    print_stat
    
    

    perf使用

    #include <stdio.h>
    void longa()
    {
      int i,j;
      for(i = 0; i < 1000000; i++)
      j=i; //am I silly or crazy? I feel boring and desperate.
    }
    
    void foo2()
    {
      int i;
      for(i=0 ; i < 10; i++)
           longa();
    }
    
    void foo1()
    {
      int i;
      for(i = 0; i< 100; i++)
         longa();
    }
    
    int main(void)
    {
      foo1();
      foo2();
    }
    
    #perf stat -e kmem:*  ./t1
    
     Performance counter stats for './t1':
    
                     1      kmem:kmalloc
                 1,443      kmem:kmem_cache_alloc
                    85      kmem:kmalloc_node
                    85      kmem:kmem_cache_alloc_node
                 1,078      kmem:kfree
                 1,472      kmem:kmem_cache_free
                    37      kmem:mm_page_free
                    35      kmem:mm_page_free_batched
                    40      kmem:mm_page_alloc
                    70      kmem:mm_page_alloc_zone_locked
                     0      kmem:mm_page_pcpu_drain
                     0      kmem:mm_page_alloc_extfrag
    
           0.382027010 seconds time elapsed
    

    perf的开销

    环境:

    1. kernel 3.10
    2. 上面运行java 满载733.3%

    针对java 单个pid 进行perf record 采集, 启动阶段perf开销100%,稳定后7.5%

    #perf sched record  -p 49770
    
    #ps -eo pmem,pcpu,args   | grep perf  | grep -v grep
     0.0  0.0 [perf]
     0.0  7.5 perf sched record -p 49770
    

    使用perf全局采集syscall,针对单pid, 开销很大, 稳定在40%

    #perf stat -e syscalls:* -p 49770 sleep 10
    
     0.0 88.0 perf stat -e syscalls:* -p 49770 sleep 10
     0.0  0.0 perf stat -e syscalls:* -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0 96.5 perf stat -e syscalls:* -p 49770 sleep 10
     0.0  0.0 perf stat -e syscalls:* -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0 90.6 perf stat -e syscalls:* -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0 68.0 perf stat -e syscalls:* -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0 54.4 perf stat -e syscalls:* -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0 45.3 perf stat -e syscalls:* -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0 38.8 perf stat -e syscalls:* -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0 34.0 perf stat -e syscalls:* -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0 30.2 perf stat -e syscalls:* -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0 27.2 perf stat -e syscalls:* -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0 24.7 perf stat -e syscalls:* -p 49770 sleep 10
    

    全局采集syscall,开销稍微小点

    #perf stat -e syscalls:*  sleep 10
    
     0.0  0.0 [perf]
     0.0  0.0 perf stat -e syscalls:* sleep 10
     0.0  0.0 [perf]
     0.0  6.0 perf stat -e syscalls:* sleep 10
     0.0  0.0 [perf]
     0.0  3.0 perf stat -e syscalls:* sleep 10
     0.0  0.0 [perf]
     0.0  2.0 perf stat -e syscalls:* sleep 10
     0.0  0.0 [perf]
     0.0  1.5 perf stat -e syscalls:* sleep 10
     0.0  0.0 [perf]
     0.0  1.0 perf stat -e syscalls:* sleep 10
     0.0  0.0 [perf]
     0.0  0.8 perf stat -e syscalls:* sleep 10
     0.0  0.0 [perf]
     0.0  0.7 perf stat -e syscalls:* sleep 10
     0.0  0.0 [perf]
     0.0  0.6 perf stat -e syscalls:* sleep 10
     0.0  0.0 [perf]
     0.0  0.6 perf stat -e syscalls:* sleep 10
     0.0  0.0 [perf]
     0.0  0.5 perf stat -e syscalls:* sleep 10
     0.0  0.0 [perf]
     0.0  0.5 perf stat -e syscalls:* sleep 10
     0.0  0.0 [perf]
     0.0  0.4 perf stat -e syscalls:* sleep 10
     0.0  0.0 [perf]
     0.0  0.4 perf stat -e syscalls:* sleep 10
     0.0  0.0 [perf]
     0.0  0.4 perf stat -e syscalls:* sleep 10
     0.0  0.0 [perf]
     0.0  0.3 perf stat -e syscalls:* sleep 10
     0.0  0.0 [perf]
     0.0  0.3 perf stat -e syscalls:* sleep 10
     0.0  0.0 [perf]
     0.0  0.3 perf stat -e syscalls:* sleep 10
     0.0  0.0 [perf]
     0.0  0.4 perf stat -e syscalls:* sleep 10
    
    

    最简单的perf stat,开销比较低

    #perf stat  -p 49770 sleep 10
    
     0.0  0.0 [perf]
     0.0  0.0 perf stat -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0  3.0 perf stat -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0  1.0 perf stat -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0  0.7 perf stat -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0  0.6 perf stat -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0  0.5 perf stat -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0  0.4 perf stat -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0  0.3 perf stat -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0  0.3 perf stat -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0  0.3 perf stat -p 49770 sleep 10
    

    perf采集kmem相关event的开销

    #perf stat -e kmem:*  -p 49770 sleep 10
    
     Performance counter stats for process id '49770':
    
               163,603      kmem:kmalloc                                                  (100.00%)
               484,012      kmem:kmem_cache_alloc                                         (100.00%)
               302,553      kmem:kmalloc_node                                             (100.00%)
               301,051      kmem:kmem_cache_alloc_node                                     (100.00%)
               263,768      kmem:kfree                                                    (100.00%)
               774,941      kmem:kmem_cache_free                                          (100.00%)
                83,850      kmem:mm_page_free                                             (100.00%)
                   799      kmem:mm_page_free_batched                                     (100.00%)
                83,064      kmem:mm_page_alloc                                            (100.00%)
                 1,088      kmem:mm_page_alloc_zone_locked                                     (100.00%)
                   403      kmem:mm_page_pcpu_drain                                       (100.00%)
                     0      kmem:mm_page_alloc_extfrag
    
     0.0  7.0 perf stat -e kmem:* -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0  3.5 perf stat -e kmem:* -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0  2.3 perf stat -e kmem:* -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0  1.7 perf stat -e kmem:* -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0  1.4 perf stat -e kmem:* -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0  1.1 perf stat -e kmem:* -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0  1.0 perf stat -e kmem:* -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0  0.8 perf stat -e kmem:* -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0  0.7 perf stat -e kmem:* -p 49770 sleep 10
     0.0  0.0 [perf]
     0.0  0.7 perf stat -e kmem:* -p 49770 sleep 10
    

    REF

    使用 OProfile for Linux on POWER 识别性能瓶颈:
    https://www.ibm.com/developerworks/cn/linux/l-pow-oprofile/

    http://abcdxyzk.github.io/blog/2015/07/27/debug-perf/

    perf-tools:

    https://www.slideshare.net/brendangregg/linux-performance-analysis-new-tools-and-old-secrets

  • 相关阅读:
    (转)Computer Vision Open Source Algorithm Implementations
    关于项目依赖项
    fatal error CVT1100: duplicate resource. type:MANIFEST, name:1, language:0×0409
    转:error LNK2005
    C++ string类常用函数
    [转载]同步synchronized方法和代码块
    Java语言基础常用对象API(二)泛型、Map集合
    Java语言基础IO流(输入输出流) 字符流
    Java语言基础常用对象API(二)集合框架
    Java语言基础IO流(输入输出流) 字节流、转换流
  • 原文地址:https://www.cnblogs.com/muahao/p/8909211.html
Copyright © 2020-2023  润新知