• Linux: a failure caused by incorrect use of a read-write lock


    Environment information:

    WARNING: kernel version inconsistency between vmlinux and dumpfile
    
          KERNEL: vmlinux-47.90
        DUMPFILE: vmcore  [PARTIAL DUMP]
            CPUS: 32
            DATE: Wed Nov 14 11:08:24 2018
          UPTIME: 05:08:36
    LOAD AVERAGE: 484.39, 481.11, 385.18
           TASKS: 5287
        NODENAME: ycby25-3kh_2
         RELEASE: 3.0.101-0.47.90-default
         VERSION: #1 SMP Wed Oct 19 14:11:00 UTC 2016 (56c73f1)
         MACHINE: x86_64  (2600 Mhz)
          MEMORY: 255.6 GB
           PANIC: "[18477.566692] Kernel panic - not syncing: hung_task: blocked tasks"
             PID: 144
         COMMAND: "khungtaskd"
            TASK: ffff881fc096e080  [THREAD_INFO: ffff881fc0970000]
             CPU: 7
           STATE: TASK_RUNNING (PANIC)

    The last log messages in dmesg:

    [17013.334105] show_signal_msg: 30 callbacks suppressed
    [17013.334110] CMoniterThread[26144]: segfault at 0 ip 00007f13a100c699 sp 00007f138371fc30 error 6 in libCommonUtilitiesLib.so[7f13a0fdd000+4d000]
    [18477.566473] INFO: task dev_rdwriter:23683 blocked for more than 1200 seconds.
    [18477.566475] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [18477.566477] dev_rdwriter    D 0000000000800000     0 23683  18968 0x00000000
    [18477.566479]  ffff88173e599d70 0000000000000082 ffff88173e598010 0000000000010900
    [18477.566483]  0000000000010900 0000000000010900 0000000000010900 ffff88173e599fd8
    [18477.566486]  ffff88173e599fd8 0000000000010900 ffff88173e5964c0 ffff881fc3378580
    [18477.566489] Call Trace:
    [18477.566499]  [<ffffffff81467485>] rwsem_down_failed_common+0xb5/0x160
    [18477.566505]  [<ffffffff81264d13>] call_rwsem_down_write_failed+0x13/0x20
    [18477.566509]  [<ffffffff8146679c>] down_write+0x1c/0x20
    [18477.566541]  [<ffffffffa05ddb6c>] xfs_ilock+0xec/0x100 [xfs]
    [18477.566629]  [<ffffffffa0604e47>] xfs_file_fallocate+0xc7/0x190 [xfs]
    [18477.566665]  [<ffffffff8115d629>] do_fallocate+0x129/0x130
    [18477.566669]  [<ffffffff8115d676>] sys_fallocate+0x46/0x70
    [18477.566673]  [<ffffffff8146f5f2>] system_call_fastpath+0x16/0x1b
    [18477.566690]  [<00007f344662b010>] 0x7f344662b00f
    [18477.566692] Kernel panic - not syncing: hung_task: blocked tasks
    [18477.566698] Pid: 144, comm: khungtaskd Tainted: G           ENX 3.0.101-0.47.90-default #1
    [18477.566701] Call Trace:
    [18477.566707]  [<ffffffff81004b95>] dump_trace+0x75/0x300
    [18477.566712]  [<ffffffff81464663>] dump_stack+0x69/0x6f
    [18477.566717]  [<ffffffff8146471f>] panic+0xb6/0x224
    [18477.566722]  [<ffffffff810c8731>] check_hung_uninterruptible_tasks+0x1e1/0x1f0
    [18477.566726]  [<ffffffff810c8787>] watchdog+0x47/0x50
    [18477.566730]  [<ffffffff810845f6>] kthread+0x96/0xa0
    [18477.566735]  [<ffffffff81470764>] kernel_thread_helper+0x4/0x10
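
    For context, the khungtaskd watchdog that produced this panic works roughly as sketched below. This is a heavily condensed paraphrase of kernel/hung_task.c from this kernel generation, not a verbatim copy: the watchdog walks every thread that is in TASK_UNINTERRUPTIBLE and complains when a thread has not been scheduled at all since the last check.

    /* Condensed sketch: called for each TASK_UNINTERRUPTIBLE thread by
     * check_hung_uninterruptible_tasks() in the khungtaskd kthread. */
    static void check_hung_task(struct task_struct *t, unsigned long timeout)
    {
        unsigned long switch_count = t->nvcsw + t->nivcsw;

        /* has the task been scheduled at least once since the last check? */
        if (switch_count != t->last_switch_count) {
            t->last_switch_count = switch_count;
            return;
        }

        printk(KERN_ERR "INFO: task %s:%d blocked for more than %ld seconds.\n",
               t->comm, t->pid, timeout);
        sched_show_task(t);              /* the "dev_rdwriter D ..." backtrace */

        /* this system runs with hung_task_panic enabled, hence the crash dump */
        if (sysctl_hung_task_panic)
            panic("hung_task: blocked tasks");
    }
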
    Task 23683 had been blocked for a long time; the hung-task timeout we had configured at the time was 1200s. Let's analyze why it got blocked:

    crash> bt 23683
    PID: 23683  TASK: ffff88173e5964c0  CPU: 3   COMMAND: "dev_rdwriter"
     #0 [ffff88173e599c30] schedule at ffffffff814652b9
     #1 [ffff88173e599d78] rwsem_down_failed_common at ffffffff81467485
     #2 [ffff88173e599dd8] call_rwsem_down_write_failed at ffffffff81264d13
     #3 [ffff88173e599e18] down_write at ffffffff8146679c
     #4 [ffff88173e599e20] xfs_ilock at ffffffffa05ddb6c [xfs]
     #5 [ffff88173e599e50] xfs_file_fallocate at ffffffffa0604e47 [xfs]
     #6 [ffff88173e599f20] do_fallocate at ffffffff8115d629
     #7 [ffff88173e599f50] sys_fallocate at ffffffff8115d676
     #8 [ffff88173e599f80] system_call_fastpath at ffffffff8146f5f2
        RIP: 00007f344662b010  RSP: 00007f2db9d0a2a0  RFLAGS: 00003287
        RAX: 000000000000011d  RBX: ffffffff8146f5f2  RCX: 0000000000000000
        RDX: 0000000000800000  RSI: 0000000000000001  RDI: 0000000000000050
        RBP: 0000000000880000   R8: 00007f340a0ecb60   R9: 0000000000005c83
        R10: 0000000000100000  R11: 0000000000003246  R12: 0000000000880000
        R13: 0000000009b20a20  R14: 00007f3408e01328  R15: 0000000000004800
        ORIG_RAX: 000000000000011d  CS: 0033  SS: 002b

    The stack shows the task is sitting in the wait loop for the semaphore:

    static struct rw_semaphore __sched *
    rwsem_down_failed_common(struct rw_semaphore *sem,
                 unsigned int flags, signed long adjustment)
    {
        struct rwsem_waiter waiter;
        struct task_struct *tsk = current;
        signed long count;
    
        set_task_state(tsk, TASK_UNINTERRUPTIBLE);
    
        /* set up my own style of waitqueue */
        spin_lock_irq(&sem->wait_lock);
        waiter.task = tsk;
        waiter.flags = flags;
        get_task_struct(tsk);
    
        if (list_empty(&sem->wait_list))
            adjustment += RWSEM_WAITING_BIAS;
        list_add_tail(&waiter.list, &sem->wait_list);
    
        /* we're now waiting on the lock, but no longer actively locking */
        count = rwsem_atomic_update(adjustment, sem);
    
        /* If there are no active locks, wake the front queued process(es) up.
         *
         * Alternatively, if we're called from a failed down_write(), there
         * were already threads queued before us and there are no active
         * writers, the lock must be read owned; so we try to wake any read
         * locks that were queued ahead of us. */
        if (count == RWSEM_WAITING_BIAS)
            sem = __rwsem_do_wake(sem, RWSEM_WAKE_NO_ACTIVE);
        else if (count > RWSEM_WAITING_BIAS &&
             adjustment == -RWSEM_ACTIVE_WRITE_BIAS)
            sem = __rwsem_do_wake(sem, RWSEM_WAKE_READ_OWNED);
    
        spin_unlock_irq(&sem->wait_lock);
    
        /* wait to be given the lock */
        for (;;) {--------------------the task entered this loop and never came out; the loop exits only when the waiting task's waiter.task has been cleared to NULL.
            if (!waiter.task)
                break;
            schedule();
            set_task_state(tsk, TASK_UNINTERRUPTIBLE);
        }
    
        tsk->state = TASK_RUNNING;
    
        return sem;
    }

    Because the task stayed in the uninterruptible state longer than the threshold, the hung-task detector eventually triggered the panic. Why does the loop test whether waiter.task is NULL? Because when the read-write semaphore is released, the releaser looks at the wait queue; if there are waiters, it takes each waiter off the queue, sets waiter->task = NULL, and then wakes the waiter up. The woken task checks whether its own waiter.task has become NULL to decide whether it still needs to wait.
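
    For reference, the release side looks roughly like this. The snippet is a heavily condensed sketch of the writer wake-up inside __rwsem_do_wake() in this generation's lib/rwsem.c; the lock-granting atomic update and the reader-batching path are elided:

    /* __rwsem_do_wake(), writer wake-up (condensed sketch) */
    waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
    list_del(&waiter->list);        /* take the waiter off sem->wait_list      */
    tsk = waiter->task;
    smp_mb();                       /* 'waiter' lives on the waiter's stack,   */
    waiter->task = NULL;            /* so don't touch it after clearing ->task */
    wake_up_process(tsk);           /* wake the task parked in schedule()      */
    put_task_struct(tsk);           /* drop the reference taken when queueing  */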

    Now let's see which semaphore it is waiting on:

    crash> bt -f 23683
    PID: 23683  TASK: ffff88173e5964c0  CPU: 3   COMMAND: "dev_rdwriter"
     #0 [ffff88173e599c30] schedule at ffffffff814652b9
        ffff88173e599c38: 0000000000000082 ffff88173e598010
        ffff88173e599c48: 0000000000010900 0000000000010900
        ffff88173e599c58: 0000000000010900 0000000000010900
        ffff88173e599c68: ffff88173e599fd8 ffff88173e599fd8
        ffff88173e599c78: 0000000000010900 ffff88173e5964c0
        ffff88173e599c88: ffff881fc3378580 ffff881f22341000
        ffff88173e599c98: 00000000ffffff9c ffffffff81169f93
        ffff88173e599ca8: ffff88173e599d58 ffff88173e599de8
        ffff88173e599cb8: ffff881a8a7b21c0 ffff881a8a7b21c0
        ffff88173e599cc8: ffff88173d8aa0c0 ffff88173d8afd90
        ffff88173e599cd8: 0000000400000000 ffff88173e599dd8
        ffff88173e599ce8: ffff8805b87fa240 ffffffff811695d8
        ffff88173e599cf8: ffff88173e599d48 0000000000000000
        ffff88173e599d08: ffff88173e599dd8 ffffffff8116db50
        ffff88173e599d18: ffff88173e599e58 ffffffff8117661e
        ffff88173e599d28: ffff88173e599de8 ffff88173e599e58
        ffff88173e599d38: ffff8805b848f990 ffff88173e599e58
        ffff88173e599d48: 0000000000000002 ffff8805b848f8a8
        ffff88173e599d58: ffff8805b848f8b0 ffffffffffffffff
        ffff88173e599d68: 0000000000800000 ffff88173e5964c0
        ffff88173e599d78: ffffffff81467485
     #1 [ffff88173e599d78] rwsem_down_failed_common at ffffffff81467485
        ffff88173e599d80: ffff881706047ca8 ffff8805b848f8b8---------------------this task's rwsem_waiter starts here at ffff88173e599d80; these two words are its list.next (the next queued waiter) and list.prev (the semaphore's wait_list head)
        ffff88173e599d90: ffff88173e5964c0 ffff881f00000002
        ffff88173e599da0: ffff881f22341000 0000000000000000
        ffff88173e599db0: ffffffffffffffa1 0000000000000001
        ffff88173e599dc0: ffff8805b848f800 ffff8805b848f990
        ffff88173e599dd0: 0000000000100000 ffffffff81264d13
     #2 [ffff88173e599dd8] call_rwsem_down_write_failed at ffffffff81264d13
        ffff88173e599de0: 0000000000003246 0000000000100000
        ffff88173e599df0: ffff88171291ff20 ffff881ae2c64bc0
        ffff88173e599e00: 0000000000100000 0000000000000001
        ffff88173e599e10: ffff8805b848f8a8 ffffffff8146679c------------------ffff8805b848f8a8 is the rw_semaphore passed to down_write
     #3 [ffff88173e599e18] down_write at ffffffff8146679c
        ffff88173e599e20: ffffffffa05ddb6c

    From the disassembly:

    0xffffffff81467476 <rwsem_down_failed_common+166>:      cmpq   $0x0,0x10(%rsp)
    0xffffffff8146747c <rwsem_down_failed_common+172>:      je     0xffffffff814674a9 <rwsem_down_failed_common+217>
    0xffffffff8146747e <rwsem_down_failed_common+174>:      xchg   %ax,%ax
    /usr/src/linux-3.0.101-0.47.90/lib/rwsem.c: 213
    0xffffffff81467480 <rwsem_down_failed_common+176>:      callq  0xffffffff81465600 <schedule>

    The waiter.task field is tested at 0x10(%rsp), so the rwsem_waiter itself sits at %rsp. After returning from schedule() that is ffff88173e599d80; the word at +0x10 (ffff88173e599d90) is indeed this task's task_struct, ffff88173e5964c0, and the low 32 bits at +0x18 are 0x00000002, i.e. RWSEM_WAITING_FOR_WRITE.

    Since call_rwsem_down_write_failed is not C code (it is an assembly thunk), we keep walking up the stack to xfs_ilock, where we find the xfs_inode_t at 0xffff8805b848f800 and its read-write semaphore at 0xffff8805b848f8a8:

    crash> struct -x rw_semaphore 0xffff8805b848f8a8
    struct rw_semaphore {
      count = 0xffffffff00000001,
      wait_lock = {
        {
          rlock = {
            raw_lock = {
              slock = 0x70007
            }
          }
        }
      },
      wait_list = {
        next = 0xffff88173e599d80,
        prev = 0xffff881ec528bd80
      }
    }
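
    For reference, the structures involved look roughly like this on a 3.0-era x86_64 kernel (condensed from include/linux/rwsem.h and lib/rwsem.c, lockdep fields omitted). The layout is what gives wait_list its +0x10 offset used below, and it also puts waiter.task at offset 0x10, matching the 0x10(%rsp) test seen in the disassembly:

    struct rw_semaphore {
        long             count;     /* +0x00: 0xffffffff00000001 in our dump */
        spinlock_t       wait_lock; /* +0x08, padded out to 8 bytes          */
        struct list_head wait_list; /* +0x10: the queue we are about to walk */
    };

    /* lib/rwsem.c - lives on each waiter's kernel stack */
    struct rwsem_waiter {
        struct list_head   list;    /* +0x00: linked into sem->wait_list     */
        struct task_struct *task;   /* +0x10: cleared by the releaser        */
        unsigned int       flags;   /* RWSEM_WAITING_FOR_READ / _FOR_WRITE   */
    };

    Note also that on x86_64 a count of 0xffffffff00000001 is RWSEM_WAITING_BIAS plus an active count of one, i.e. the semaphore is still held by someone while tasks sit on wait_list, which already hints at the problem.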

    From the layout of struct rw_semaphore, the address of wait_list is ffff8805b848f8a8 + 0x10, i.e. ffff8805b848f8b8, so we can walk the waiter queue from there:

    crash> list  rwsem_waiter.list -H ffff8805b848f8b8 -s rwsem_waiter.task
    ffff88173e599d80
      task = 0xffff88173e5964c0
    ffff881706047ca8
      task = 0xffff8817060442c0
    ffff881706057ca8
      task = 0xffff8817060543c0
    ffff88170605bca8
      task = 0xffff881706058400
    ffff883ec2437d80
      task = 0xffff883e23c2c380
    ffff883e54a0fd80
      task = 0xffff883998b36200
    ffff881ec528bd80
      task = 0xffff881cd766e300

    Since a new waiter is always appended at the tail of the wait queue:

        if (list_empty(&sem->wait_list))
            adjustment += RWSEM_WAITING_BIAS;
        list_add_tail(&waiter.list, &sem->wait_list);

    the first waiting task is the one the list head's next points to, i.e. ffff88173e599d80, and the corresponding task is:

    crash> task 0xffff88173e5964c0
    PID: 23683  TASK: ffff88173e5964c0  CPU: 3   COMMAND: "dev_rdwriter"
    struct task_struct {
      state = 2,-----------------------------TASK_UNINTERRUPTIBLE
      stack = 0xffff88173e598000,
      usage = {
        counter = 3
      },

    And its backtrace is:

    crash> bt 23683
    PID: 23683  TASK: ffff88173e5964c0  CPU: 3   COMMAND: "dev_rdwriter"
     #0 [ffff88173e599c30] schedule at ffffffff814652b9
     #1 [ffff88173e599d78] rwsem_down_failed_common at ffffffff81467485
     #2 [ffff88173e599dd8] call_rwsem_down_write_failed at ffffffff81264d13
     #3 [ffff88173e599e18] down_write at ffffffff8146679c
     #4 [ffff88173e599e20] xfs_ilock at ffffffffa05ddb6c [xfs]
     #5 [ffff88173e599e50] xfs_file_fallocate at ffffffffa0604e47 [xfs]
     #6 [ffff88173e599f20] do_fallocate at ffffffff8115d629
     #7 [ffff88173e599f50] sys_fallocate at ffffffff8115d676
     #8 [ffff88173e599f80] system_call_fastpath at ffffffff8146f5f2
        RIP: 00007f344662b010  RSP: 00007f2db9d0a2a0  RFLAGS: 00003287
        RAX: 000000000000011d  RBX: ffffffff8146f5f2  RCX: 0000000000000000
        RDX: 0000000000800000  RSI: 0000000000000001  RDI: 0000000000000050
        RBP: 0000000000880000   R8: 00007f340a0ecb60   R9: 0000000000005c83
        R10: 0000000000100000  R11: 0000000000003246  R12: 0000000000880000
        R13: 0000000009b20a20  R14: 00007f3408e01328  R15: 0000000000004800
        ORIG_RAX: 000000000000011d  CS: 0033  SS: 002b

    Clearly, this first waiter is trying to take the write lock. The write lock is exclusive, and no matter whether the implementation favors readers or writers, it should never have to wait this long under normal conditions. The only remaining possibility is that someone took the lock and never released it.
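
    For context on why these two paths collide on the same semaphore: in this kernel's XFS, the fallocate path takes the inode's IO lock exclusively while the splice-read path takes it shared, and both end up on the rw_semaphore embedded in the xfs_inode's i_iolock. Roughly, as a condensed sketch of xfs_ilock() from this generation's fs/xfs code (ILOCK half and ASSERTs trimmed):

    void
    xfs_ilock(
        xfs_inode_t    *ip,
        uint           lock_flags)
    {
        if (lock_flags & XFS_IOLOCK_EXCL)
            mrupdate_nested(&ip->i_iolock,       /* -> down_write() */
                    XFS_IOLOCK_DEP(lock_flags));
        else if (lock_flags & XFS_IOLOCK_SHARED)
            mraccess_nested(&ip->i_iolock,       /* -> down_read()  */
                    XFS_IOLOCK_DEP(lock_flags));

        /* ... XFS_ILOCK_EXCL / XFS_ILOCK_SHARED do the same on ip->i_lock ... */
    }

    mrupdate_nested()/mraccess_nested() are thin wrappers (fs/xfs/mrlock.h) around down_write_nested()/down_read_nested() on the rw_semaphore embedded in the mrlock, which is the semaphore at 0xffff8805b848f8a8 found above. So a single leaked XFS_IOLOCK_SHARED, a down_read() never paired with an up_read(), is enough to block xfs_file_fallocate's XFS_IOLOCK_EXCL forever.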

    While walking through the other waiting tasks, we found the following:

    crash> task 0xffff8817060442c0
    PID: 26637  TASK: ffff8817060442c0  CPU: 6   COMMAND: "21-IFileSender"
    struct task_struct {
      state = 2,
      stack = 0xffff881706046000,
      usage = {
        counter = 3
      },
    
    crash> bt 26637
    PID: 26637  TASK: ffff8817060442c0  CPU: 6   COMMAND: "21-IFileSender"
     #0 [ffff881706047b58] schedule at ffffffff814652b9
     #1 [ffff881706047ca0] rwsem_down_failed_common at ffffffff81467485
     #2 [ffff881706047d00] call_rwsem_down_read_failed at ffffffff81264ce4
     #3 [ffff881706047d48] down_read at ffffffff8146677e
     #4 [ffff881706047d50] xfs_ilock at ffffffffa05ddb3c [xfs]
     #5 [ffff881706047d80] caq_xfs_file_splice_read at ffffffffa06a1850 [pagecachelimit]--------this is code that I added myself
     #6 [ffff881706047dc0] splice_direct_to_actor at ffffffff81188c6c
     #7 [ffff881706047e30] do_sendfile at ffffffffa07119ab [witdriver]
     #8 [ffff881706047ec0] my_sendfile at ffffffffa071db3a [witdriver]
     #9 [ffff881706047f80] system_call_fastpath at ffffffff8146f5f2
        RIP: 00007f13992bf3c9  RSP: 00007f136ab5f3e8  RFLAGS: 00000203
        RAX: 00000000000000b5  RBX: ffffffff8146f5f2  RCX: 00007f136ab5f42c
        RDX: 0000000000000048  RSI: 00007f136ab5e730  RDI: 0000000000000001
        RBP: 00007f136ab5f800   R8: 00007f136ab5f800   R9: 0000000000000000
        R10: 0000000000049fff  R11: 0000000000000246  R12: 0000000000000524
        R13: 0000000000000000  R14: 0000000000000003  R15: 0000000000042e28
        ORIG_RAX: 00000000000000b5  CS: 0033  SS: 002b

    One of the waiting paths goes through code that I added myself, so I immediately reviewed that code and found this:

    static ssize_t
    caq_xfs_file_splice_read(
        struct file        *infilp,
        loff_t            *ppos,
        struct pipe_inode_info    *pipe,
        size_t            count,
        unsigned int        flags)
    {
        loff_t isize, left;
        int ret;
    
        struct xfs_inode    *ip = XFS_I(infilp->f_mapping->host);
    
        xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);------------------------------the read lock is taken here
    
        isize = i_size_read(infilp->f_mapping->host);
        if (unlikely(*ppos >= isize))//the read start offset is beyond the file size
            return 0;------------------------------------------------------this is the bug: we return without releasing the lock
    
        left = isize - *ppos;
        if (unlikely(left < count))//make sure the read stops at the end of the file
            count = left;
    
        ret = caq___generic_file_splice_read(infilp, ppos, pipe, count,
                            flags);
        if (ret > 0) {
            *ppos += ret;//advance the current file offset
        }
    
        xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
    
        return ret;
    }

    The bug is obvious: in an error-handling branch the function returns right after taking the read lock, without releasing it. The business scenario that triggers it is multiple threads writing a file while multiple threads read the same file; when a read comes in with a start offset beyond the current file size, the early-return path is taken and the lock is leaked.
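
    A minimal fix is to make sure the early-return path also drops the lock, for example by funnelling every exit through one unlock label. The following is a sketch of the corrected function, keeping the original structure:

    static ssize_t
    caq_xfs_file_splice_read(
        struct file             *infilp,
        loff_t                  *ppos,
        struct pipe_inode_info  *pipe,
        size_t                  count,
        unsigned int            flags)
    {
        struct xfs_inode    *ip = XFS_I(infilp->f_mapping->host);
        loff_t              isize, left;
        ssize_t             ret = 0;

        xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);

        isize = i_size_read(infilp->f_mapping->host);
        if (unlikely(*ppos >= isize))
            goto out_unlock;        /* was "return 0;", which leaked the lock */

        left = isize - *ppos;
        if (unlikely(left < count))
            count = left;

        ret = caq___generic_file_splice_read(infilp, ppos, pipe, count, flags);
        if (ret > 0)
            *ppos += ret;           /* advance the current file offset */

    out_unlock:
        xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
        return ret;
    }

    The single out_unlock exit is the usual kernel idiom for this kind of cleanup and makes it much harder to leak the lock again when another early return is added later.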

    My understanding here is limited; if you spot any mistakes, please let me know.
  • Original post: https://www.cnblogs.com/10087622blog/p/9970072.html