Environment information:
WARNING: kernel version inconsistency between vmlinux and dumpfile
      KERNEL: vmlinux-47.90
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 32
        DATE: Wed Nov 14 11:08:24 2018
      UPTIME: 05:08:36
LOAD AVERAGE: 484.39, 481.11, 385.18
       TASKS: 5287
    NODENAME: ycby25-3kh_2
     RELEASE: 3.0.101-0.47.90-default
     VERSION: #1 SMP Wed Oct 19 14:11:00 UTC 2016 (56c73f1)
     MACHINE: x86_64  (2600 Mhz)
      MEMORY: 255.6 GB
       PANIC: "[18477.566692] Kernel panic - not syncing: hung_task: blocked tasks"
         PID: 144
     COMMAND: "khungtaskd"
        TASK: ffff881fc096e080  [THREAD_INFO: ffff881fc0970000]
         CPU: 7
       STATE: TASK_RUNNING (PANIC)
The last log messages in dmesg:
[17013.334105] show_signal_msg: 30 callbacks suppressed
[17013.334110] CMoniterThread[26144]: segfault at 0 ip 00007f13a100c699 sp 00007f138371fc30 error 6 in libCommonUtilitiesLib.so[7f13a0fdd000+4d000]
[18477.566473] INFO: task dev_rdwriter:23683 blocked for more than 1200 seconds.
[18477.566475] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[18477.566477] dev_rdwriter    D 0000000000800000     0 23683  18968 0x00000000
[18477.566479]  ffff88173e599d70 0000000000000082 ffff88173e598010 0000000000010900
[18477.566483]  0000000000010900 0000000000010900 0000000000010900 ffff88173e599fd8
[18477.566486]  ffff88173e599fd8 0000000000010900 ffff88173e5964c0 ffff881fc3378580
[18477.566489] Call Trace:
[18477.566499]  [<ffffffff81467485>] rwsem_down_failed_common+0xb5/0x160
[18477.566505]  [<ffffffff81264d13>] call_rwsem_down_write_failed+0x13/0x20
[18477.566509]  [<ffffffff8146679c>] down_write+0x1c/0x20
[18477.566541]  [<ffffffffa05ddb6c>] xfs_ilock+0xec/0x100 [xfs]
[18477.566629]  [<ffffffffa0604e47>] xfs_file_fallocate+0xc7/0x190 [xfs]
[18477.566665]  [<ffffffff8115d629>] do_fallocate+0x129/0x130
[18477.566669]  [<ffffffff8115d676>] sys_fallocate+0x46/0x70
[18477.566673]  [<ffffffff8146f5f2>] system_call_fastpath+0x16/0x1b
[18477.566690]  [<00007f344662b010>] 0x7f344662b00f
[18477.566692] Kernel panic - not syncing: hung_task: blocked tasks
[18477.566698] Pid: 144, comm: khungtaskd Tainted: G ENX 3.0.101-0.47.90-default #1
[18477.566701] Call Trace:
[18477.566707]  [<ffffffff81004b95>] dump_trace+0x75/0x300
[18477.566712]  [<ffffffff81464663>] dump_stack+0x69/0x6f
[18477.566717]  [<ffffffff8146471f>] panic+0xb6/0x224
[18477.566722]  [<ffffffff810c8731>] check_hung_uninterruptible_tasks+0x1e1/0x1f0
[18477.566726]  [<ffffffff810c8787>] watchdog+0x47/0x50
[18477.566730]  [<ffffffff810845f6>] kthread+0x96/0xa0
[18477.566735]  [<ffffffff81470764>] kernel_thread_helper+0x4/0x10
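The panic itself comes from the hung-task watchdog: khungtaskd periodically scans all tasks, and when a task has stayed in TASK_UNINTERRUPTIBLE without being scheduled for longer than hung_task_timeout_secs, it prints the trace above and, because hung_task_panic is set on this machine, panics so that a vmcore is captured. Roughly, as a simplified paraphrase from memory of kernel/hung_task.c in this kernel generation (field and sysctl names are approximate, not the exact source):

	/* Simplified paraphrase of check_hung_task(), called from
	 * check_hung_uninterruptible_tasks() for every D-state task. */
	static void check_hung_task(struct task_struct *t, unsigned long timeout)
	{
		unsigned long switch_count = t->nvcsw + t->nivcsw;

		/* The task was scheduled at least once since the last scan:
		 * remember the new context-switch count and give it another
		 * full timeout interval. */
		if (switch_count != t->last_switch_count) {
			t->last_switch_count = switch_count;
			return;
		}

		/* Still TASK_UNINTERRUPTIBLE and never scheduled for a whole
		 * timeout period (1200 s in our configuration): report it ... */
		printk(KERN_ERR "INFO: task %s:%d blocked for more than %lu seconds.\n",
		       t->comm, t->pid, timeout);
		sched_show_task(t);

		/* ... and, with hung_task_panic=1, bring the box down. */
		if (sysctl_hung_task_panic)
			panic("hung_task: blocked tasks");
	}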
Task 23683 had been blocked for a very long time; we had set the hung-task timeout to 1200 s. Let's analyze why it was blocked:
crash> bt 23683
PID: 23683  TASK: ffff88173e5964c0  CPU: 3   COMMAND: "dev_rdwriter"
 #0 [ffff88173e599c30] schedule at ffffffff814652b9
 #1 [ffff88173e599d78] rwsem_down_failed_common at ffffffff81467485
 #2 [ffff88173e599dd8] call_rwsem_down_write_failed at ffffffff81264d13
 #3 [ffff88173e599e18] down_write at ffffffff8146679c
 #4 [ffff88173e599e20] xfs_ilock at ffffffffa05ddb6c [xfs]
 #5 [ffff88173e599e50] xfs_file_fallocate at ffffffffa0604e47 [xfs]
 #6 [ffff88173e599f20] do_fallocate at ffffffff8115d629
 #7 [ffff88173e599f50] sys_fallocate at ffffffff8115d676
 #8 [ffff88173e599f80] system_call_fastpath at ffffffff8146f5f2
    RIP: 00007f344662b010  RSP: 00007f2db9d0a2a0  RFLAGS: 00003287
    RAX: 000000000000011d  RBX: ffffffff8146f5f2  RCX: 0000000000000000
    RDX: 0000000000800000  RSI: 0000000000000001  RDI: 0000000000000050
    RBP: 0000000000880000   R8: 00007f340a0ecb60   R9: 0000000000005c83
    R10: 0000000000100000  R11: 0000000000003246  R12: 0000000000880000
    R13: 0000000009b20a20  R14: 00007f3408e01328  R15: 0000000000004800
    ORIG_RAX: 000000000000011d  CS: 0033  SS: 002b
The stack shows it is sitting in the wait loop for a read-write semaphore:
static struct rw_semaphore __sched *
rwsem_down_failed_common(struct rw_semaphore *sem,
			 unsigned int flags, signed long adjustment)
{
	struct rwsem_waiter waiter;
	struct task_struct *tsk = current;
	signed long count;

	set_task_state(tsk, TASK_UNINTERRUPTIBLE);

	/* set up my own style of waitqueue */
	spin_lock_irq(&sem->wait_lock);
	waiter.task = tsk;
	waiter.flags = flags;
	get_task_struct(tsk);

	if (list_empty(&sem->wait_list))
		adjustment += RWSEM_WAITING_BIAS;
	list_add_tail(&waiter.list, &sem->wait_list);

	/* we're now waiting on the lock, but no longer actively locking */
	count = rwsem_atomic_update(adjustment, sem);

	/* If there are no active locks, wake the front queued process(es) up.
	 *
	 * Alternatively, if we're called from a failed down_write(), there
	 * were already threads queued before us and there are no active
	 * writers, the lock must be read owned; so we try to wake any read
	 * locks that were queued ahead of us. */
	if (count == RWSEM_WAITING_BIAS)
		sem = __rwsem_do_wake(sem, RWSEM_WAKE_NO_ACTIVE);
	else if (count > RWSEM_WAITING_BIAS &&
		 adjustment == -RWSEM_ACTIVE_WRITE_BIAS)
		sem = __rwsem_do_wake(sem, RWSEM_WAKE_READ_OWNED);

	spin_unlock_irq(&sem->wait_lock);

	/* wait to be given the lock */
	for (;;) {   /* <-- the task entered this loop and never came out; the only
	              *     exit condition is the waiter's task pointer becoming NULL */
		if (!waiter.task)
			break;
		schedule();
		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
	}

	tsk->state = TASK_RUNNING;
	return sem;
}
Because the task stayed in the uninterruptible state longer than the threshold, the hung-task detector eventually triggered the crash. Why does the loop test whether waiter.task is NULL? Because when the read-write semaphore is released, the releaser looks at the wait queue; if there is a waiter, it takes the waiter off the queue, sets waiter->task = NULL, and then wakes the waiter. The woken task checks its own waiter.task: if it is NULL the lock has been handed over and it stops waiting, otherwise it keeps waiting.
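For reference, this is roughly what the release side looks like; a paraphrase of the wake-up handshake in __rwsem_do_wake() in lib/rwsem.c of this kernel generation, not the exact source:

	/* Paraphrased wake-up handshake from lib/rwsem.c (3.0-era).  The releaser
	 * grants the lock by clearing waiter->task; the sleeper in
	 * rwsem_down_failed_common() only leaves its loop once it sees that NULL. */
	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
	list_del(&waiter->list);      /* take the waiter off the queue            */
	tsk = waiter->task;
	smp_mb();                     /* order the list removal vs. the NULL store */
	waiter->task = NULL;          /* <-- this is what the wait loop tests      */
	wake_up_process(tsk);
	put_task_struct(tsk);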
Let's see which semaphore it is waiting on:
crash> bt -f 23683
PID: 23683  TASK: ffff88173e5964c0  CPU: 3   COMMAND: "dev_rdwriter"
 #0 [ffff88173e599c30] schedule at ffffffff814652b9
    ffff88173e599c38: 0000000000000082 ffff88173e598010
    ffff88173e599c48: 0000000000010900 0000000000010900
    ffff88173e599c58: 0000000000010900 0000000000010900
    ffff88173e599c68: ffff88173e599fd8 ffff88173e599fd8
    ffff88173e599c78: 0000000000010900 ffff88173e5964c0
    ffff88173e599c88: ffff881fc3378580 ffff881f22341000
    ffff88173e599c98: 00000000ffffff9c ffffffff81169f93
    ffff88173e599ca8: ffff88173e599d58 ffff88173e599de8
    ffff88173e599cb8: ffff881a8a7b21c0 ffff881a8a7b21c0
    ffff88173e599cc8: ffff88173d8aa0c0 ffff88173d8afd90
    ffff88173e599cd8: 0000000400000000 ffff88173e599dd8
    ffff88173e599ce8: ffff8805b87fa240 ffffffff811695d8
    ffff88173e599cf8: ffff88173e599d48 0000000000000000
    ffff88173e599d08: ffff88173e599dd8 ffffffff8116db50
    ffff88173e599d18: ffff88173e599e58 ffffffff8117661e
    ffff88173e599d28: ffff88173e599de8 ffff88173e599e58
    ffff88173e599d38: ffff8805b848f990 ffff88173e599e58
    ffff88173e599d48: 0000000000000002 ffff8805b848f8a8
    ffff88173e599d58: ffff8805b848f8b0 ffffffffffffffff
    ffff88173e599d68: 0000000000800000 ffff88173e5964c0
    ffff88173e599d78: ffffffff81467485
 #1 [ffff88173e599d78] rwsem_down_failed_common at ffffffff81467485
    ffff88173e599d80: ffff881706047ca8 ffff8805b848f8b8---------------------this task's on-stack waiter starts at ffff88173e599d80; ffff881706047ca8 is its list.next (the next waiter), ffff8805b848f8b8 its list.prev (the wait_list head)
    ffff88173e599d90: ffff88173e5964c0 ffff881f00000002
    ffff88173e599da0: ffff881f22341000 0000000000000000
    ffff88173e599db0: ffffffffffffffa1 0000000000000001
    ffff88173e599dc0: ffff8805b848f800 ffff8805b848f990
    ffff88173e599dd0: 0000000000100000 ffffffff81264d13
 #2 [ffff88173e599dd8] call_rwsem_down_write_failed at ffffffff81264d13
    ffff88173e599de0: 0000000000003246 0000000000100000
    ffff88173e599df0: ffff88171291ff20 ffff881ae2c64bc0
    ffff88173e599e00: 0000000000100000 0000000000000001
    ffff88173e599e10: ffff8805b848f8a8 ffffffff8146679c------------------ffff8805b848f8a8 is the rw_semaphore passed to down_write
 #3 [ffff88173e599e18] down_write at ffffffff8146679c
    ffff88173e599e20: ffffffffa05ddb6c
Look at the disassembly:
0xffffffff81467476 <rwsem_down_failed_common+166>:	cmpq   $0x0,0x10(%rsp)
0xffffffff8146747c <rwsem_down_failed_common+172>:	je     0xffffffff814674a9 <rwsem_down_failed_common+217>
0xffffffff8146747e <rwsem_down_failed_common+174>:	xchg   %ax,%ax
/usr/src/linux-3.0.101-0.47.90/lib/rwsem.c: 213
0xffffffff81467480 <rwsem_down_failed_common+176>:	callq  0xffffffff81465600 <schedule>
This confirms that rsp points at the on-stack waiter: the cmpq tests waiter.task at rsp + 0x10. For this task the waiter sits at ffff88173e599d80, and the value ffff881706047ca8 stored there is waiter.list.next, i.e. the next waiter in the queue.
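The 0x10 offset matches the layout of the on-stack waiter; for reference, this is the structure from lib/rwsem.c of this kernel generation (reproduced from memory, so treat the exact definition as approximate):

	struct rwsem_waiter {
		struct list_head    list;   /* offset 0x00: next/prev, two 8-byte pointers    */
		struct task_struct *task;   /* offset 0x10: what cmpq $0x0,0x10(%rsp) tests   */
		unsigned int        flags;  /* RWSEM_WAITING_FOR_READ / RWSEM_WAITING_FOR_WRITE */
	};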
Since call_rwsem_down_write_failed is not C (it is an assembly thunk), we keep walking up the stack to xfs_ilock and find the xfs_inode_t at 0xffff8805b848f800; its read-write semaphore is at 0xffff8805b848f8a8.
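That rwsem is the inode's IOLOCK: in XFS of this generation, xfs_file_fallocate() takes xfs_ilock(ip, XFS_IOLOCK_EXCL), which ends up as a down_write() on a rw_semaphore embedded in the xfs_inode, while shared lockers end up in down_read(). A rough paraphrase, with the mrlock wrapper names simplified and treated as assumptions rather than the exact source:

	/* Rough paraphrase of 3.0-era xfs_ilock(); mrupdate()/mraccess() are thin
	 * wrappers around down_write()/down_read() on the mrlock's rw_semaphore. */
	void xfs_ilock(xfs_inode_t *ip, uint lock_flags)
	{
		if (lock_flags & XFS_IOLOCK_EXCL)
			mrupdate(&ip->i_iolock);   /* down_write() -- the fallocate path   */
		else if (lock_flags & XFS_IOLOCK_SHARED)
			mraccess(&ip->i_iolock);   /* down_read()  -- the splice-read path */

		/* XFS_ILOCK_EXCL / XFS_ILOCK_SHARED handling of ip->i_lock elided */
	}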
crash> struct -x rw_semaphore 0xffff8805b848f8a8
struct rw_semaphore {
  count = 0xffffffff00000001,
  wait_lock = {
    {
      rlock = {
        raw_lock = {
          slock = 0x70007
        }
      }
    }
  },
  wait_list = {
    next = 0xffff88173e599d80,
    prev = 0xffff881ec528bd80
  }
}
From the layout of struct rw_semaphore we can tell that wait_list sits at offset 0x10, so its address is ffff8805b848f8a8 + 0x10 = ffff8805b848f8b8.
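For reference, the layout behind that +0x10, again reproduced from memory of this kernel generation's include/linux/rwsem.h (without lockdep), so treat it as approximate:

	struct rw_semaphore {
		long             count;      /* offset 0x00                              */
		spinlock_t       wait_lock;  /* offset 0x08 (4 bytes plus padding)       */
		struct list_head wait_list;  /* offset 0x10 -- the address we walk below */
	};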
crash> list rwsem_waiter.list -H ffff8805b848f8b8 -s rwsem_waiter.task
ffff88173e599d80
  task = 0xffff88173e5964c0
ffff881706047ca8
  task = 0xffff8817060442c0
ffff881706057ca8
  task = 0xffff8817060543c0
ffff88170605bca8
  task = 0xffff881706058400
ffff883ec2437d80
  task = 0xffff883e23c2c380
ffff883e54a0fd80
  task = 0xffff883998b36200
ffff881ec528bd80
  task = 0xffff881cd766e300
Because a new waiter is added to the tail of the wait queue,
	if (list_empty(&sem->wait_list))
		adjustment += RWSEM_WAITING_BIAS;
	list_add_tail(&waiter.list, &sem->wait_list);
the first (oldest) waiter is the node pointed to by the list head's next, i.e. ffff88173e599d80, and the corresponding task is:
crash> task 0xffff88173e5964c0
PID: 23683  TASK: ffff88173e5964c0  CPU: 3   COMMAND: "dev_rdwriter"
struct task_struct {
  state = 2,-----------------------------TASK_UNINTERRUPTIBLE
  stack = 0xffff88173e598000,
  usage = {
    counter = 3
  },
And its backtrace is:
crash> bt 23683
PID: 23683  TASK: ffff88173e5964c0  CPU: 3   COMMAND: "dev_rdwriter"
 #0 [ffff88173e599c30] schedule at ffffffff814652b9
 #1 [ffff88173e599d78] rwsem_down_failed_common at ffffffff81467485
 #2 [ffff88173e599dd8] call_rwsem_down_write_failed at ffffffff81264d13
 #3 [ffff88173e599e18] down_write at ffffffff8146679c
 #4 [ffff88173e599e20] xfs_ilock at ffffffffa05ddb6c [xfs]
 #5 [ffff88173e599e50] xfs_file_fallocate at ffffffffa0604e47 [xfs]
 #6 [ffff88173e599f20] do_fallocate at ffffffff8115d629
 #7 [ffff88173e599f50] sys_fallocate at ffffffff8115d676
 #8 [ffff88173e599f80] system_call_fastpath at ffffffff8146f5f2
    RIP: 00007f344662b010  RSP: 00007f2db9d0a2a0  RFLAGS: 00003287
    RAX: 000000000000011d  RBX: ffffffff8146f5f2  RCX: 0000000000000000
    RDX: 0000000000800000  RSI: 0000000000000001  RDI: 0000000000000050
    RBP: 0000000000880000   R8: 00007f340a0ecb60   R9: 0000000000005c83
    R10: 0000000000100000  R11: 0000000000003246  R12: 0000000000880000
    R13: 0000000009b20a20  R14: 00007f3408e01328  R15: 0000000000004800
    ORIG_RAX: 000000000000011d  CS: 0033  SS: 002b
Clearly this first waiter is trying to take the write lock. A write lock is exclusive, and whether the rwsem favours readers or writers, under normal circumstances no waiter should sit in the queue this long. The only remaining possibility is that someone acquired the lock and never released it.
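The count field dumped above already hints at who that someone is. Assuming the standard x86_64 rwsem bias constants of this kernel generation (an assumption; check arch/x86/include/asm/rwsem.h of the exact tree), count = 0xffffffff00000001 decomposes into the waiting bias plus exactly one active holder, and that holder must be a reader:

	#include <stdio.h>

	/* Assumed x86_64 bias constants from arch/x86/include/asm/rwsem.h of this era */
	#define RWSEM_ACTIVE_MASK       0xffffffffL
	#define RWSEM_WAITING_BIAS      (-RWSEM_ACTIVE_MASK - 1)   /* 0xffffffff00000000 */
	#define RWSEM_ACTIVE_READ_BIAS  1L
	#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + 1L)

	int main(void)
	{
		long count = (long)0xffffffff00000001UL;   /* value dumped from the vmcore */

		/* low 32 bits = number of active holders -> prints 1 */
		printf("active holders: %ld\n", count & RWSEM_ACTIVE_MASK);

		/* what count would look like with waiters queued behind a writer,
		 * versus behind a single reader */
		printf("writer held: %lx\n",
		       (unsigned long)(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_WRITE_BIAS));  /* fffffffe00000001 */
		printf("reader held: %lx\n",
		       (unsigned long)(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_READ_BIAS));   /* ffffffff00000001 */
		return 0;
	}

Under that assumption, the dumped value matches "exactly one reader holds the lock, with waiters queued behind it", which already points towards a leaked down_read() rather than a stuck writer.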
While walking through the other waiting tasks, we noticed the following:
crash> task 0xffff8817060442c0
PID: 26637  TASK: ffff8817060442c0  CPU: 6   COMMAND: "21-IFileSender"
struct task_struct {
  state = 2,
  stack = 0xffff881706046000,
  usage = {
    counter = 3
  },

crash> bt 26637
PID: 26637  TASK: ffff8817060442c0  CPU: 6   COMMAND: "21-IFileSender"
 #0 [ffff881706047b58] schedule at ffffffff814652b9
 #1 [ffff881706047ca0] rwsem_down_failed_common at ffffffff81467485
 #2 [ffff881706047d00] call_rwsem_down_read_failed at ffffffff81264ce4
 #3 [ffff881706047d48] down_read at ffffffff8146677e
 #4 [ffff881706047d50] xfs_ilock at ffffffffa05ddb3c [xfs]
 #5 [ffff881706047d80] caq_xfs_file_splice_read at ffffffffa06a1850 [pagecachelimit]--------this is code we added ourselves
 #6 [ffff881706047dc0] splice_direct_to_actor at ffffffff81188c6c
 #7 [ffff881706047e30] do_sendfile at ffffffffa07119ab [witdriver]
 #8 [ffff881706047ec0] my_sendfile at ffffffffa071db3a [witdriver]
 #9 [ffff881706047f80] system_call_fastpath at ffffffff8146f5f2
    RIP: 00007f13992bf3c9  RSP: 00007f136ab5f3e8  RFLAGS: 00000203
    RAX: 00000000000000b5  RBX: ffffffff8146f5f2  RCX: 00007f136ab5f42c
    RDX: 0000000000000048  RSI: 00007f136ab5e730  RDI: 0000000000000001
    RBP: 00007f136ab5f800   R8: 00007f136ab5f800   R9: 0000000000000000
    R10: 0000000000049fff  R11: 0000000000000246  R12: 0000000000000524
    R13: 0000000000000000  R14: 0000000000000003  R15: 0000000000042e28
    ORIG_RAX: 00000000000000b5  CS: 0033  SS: 002b
One of the waiting paths goes through code we added ourselves, so we immediately reviewed that code and found the following:
static ssize_t caq_xfs_file_splice_read(
	struct file *infilp,
	loff_t *ppos,
	struct pipe_inode_info *pipe,
	size_t count,
	unsigned int flags)
{
	loff_t isize, left;
	int ret;
	struct xfs_inode *ip = XFS_I(infilp->f_mapping->host);

	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);------------------------------takes the lock

	isize = i_size_read(infilp->f_mapping->host);
	if (unlikely(*ppos >= isize))//the read start offset is beyond the file size
		return 0;------------------------------------------------------here is the problem: returns without releasing the lock

	left = isize - *ppos;
	if (unlikely(left < count))//make sure reading stops at the end of the file
		count = left;

	ret = caq___generic_file_splice_read(infilp, ppos, pipe, count, flags);
	if (ret > 0) {
		*ppos += ret;//current file offset
	}

	xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
	return ret;
}
The bug is obvious: on an error branch, the function returns right after taking the read lock and never releases it. The business scenario that triggers it is multiple threads writing a file while other threads read it; once a read is issued whose starting offset is beyond the current file size, the early-return path fires, the shared lock is leaked, and that is exactly why the fallocate writer above could never obtain the write lock.
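A minimal sketch of the fix (one way to write it; only the early-exit path changes, everything else stays as in the original function):

	static ssize_t caq_xfs_file_splice_read(
		struct file *infilp,
		loff_t *ppos,
		struct pipe_inode_info *pipe,
		size_t count,
		unsigned int flags)
	{
		loff_t isize, left;
		int ret = 0;
		struct xfs_inode *ip = XFS_I(infilp->f_mapping->host);

		xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);

		isize = i_size_read(infilp->f_mapping->host);
		if (unlikely(*ppos >= isize))
			goto out_unlock;        /* drop the shared IOLOCK on the early-exit path too */

		left = isize - *ppos;
		if (unlikely(left < count))
			count = left;

		ret = caq___generic_file_splice_read(infilp, ppos, pipe, count, flags);
		if (ret > 0)
			*ppos += ret;

	out_unlock:
		xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
		return ret;
	}

Funnelling every exit through a single unlock label keeps each return path symmetrical with xfs_rw_ilock(), so a future early return cannot leak the lock again.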