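A while ago, in an interview at Sangfor (深信服), the interviewer asked how to debug a segmentation fault, and I could not come up with a good answer on the spot. I had run into segmentation faults now and then, but I always just went back and changed the code after the program reported one, without ever digging any deeper.
What is a segmentation fault?
According to Wikipedia, a fairly complete definition of a segmentation fault is: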
In computing, a segmentation fault (often shortened to segfault) or access violation is a fault raised by hardware with memory protection, notifying an operating system (OS) about a memory access violation; on x86 computers this is a form of general protection fault. In short, a segmentation fault occurs when a program attempts to access a memory location that it is not allowed to access, or attempts to access a memory location in a way that is not allowed (e.g., attempts to write to a read-only location, or to overwrite part of the operating system). Systems based on processors like the Motorola 68000 tend to refer to these events as Address or Bus errors. On Unix-like operating systems, a process that accesses invalid memory receives the SIGSEGV signal. On Microsoft Windows, a process that accesses invalid memory receives the STATUS_ACCESS_VIOLATION exception.
Wikipedia also summarizes some typical causes of segmentation faults:
The following are some typical causes of a segmentation fault:
1. Dereferencing null pointers – this is special-cased by memory management hardware
2. Attempting to access a nonexistent memory address (outside process's address space)
3. Attempting to access memory the program does not have rights to (such as kernel structures in process context)
4. Attempting to write read-only memory (such as code segment)
These in turn are often caused by programming errors that result in invalid memory access:
1. Dereferencing or assigning to an uninitialized pointer (wild pointer, which points to a random memory address)
2. Dereferencing or assigning to a freed pointer (dangling pointer, which points to memory that has been freed/deallocated/deleted)
3. A buffer overflow
4. A stack overflow
5. Attempting to execute a program that does not compile correctly. (Some compilers will output an executable file despite the presence of compile-time errors.)
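Two of the causes listed above are easy to observe from a parent process. What follows is only a minimal sketch, assuming Linux (or a similar POSIX system). The names `signal_from_child`, `write_through_null`, `write_to_string_literal` and `return_normally` are made up for this illustration and do not come from the referenced posts. Each case runs in a forked child so the crash does not take down the caller, and the parent reports which signal ended the child.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical crash case: write through a null pointer.
 * The pointer variable is volatile so the compiler has to perform
 * the access instead of reasoning about the known NULL value. */
void write_through_null(void) {
    char *volatile p = NULL;
    *p = 0;
}

/* Hypothetical crash case: write into a string literal, which
 * compilers on Linux normally place in read-only memory. */
void write_to_string_literal(void) {
    char *volatile s = "read-only";
    *s = 'X';
}

/* Control case: does nothing and returns. */
void return_normally(void) {
}

/* Runs `action` in a child process.
 * Returns the number of the signal that killed the child,
 * 0 if the child exited on its own, or -1 if fork/waitpid failed. */
int signal_from_child(void (*action)(void)) {
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        action();
        _exit(0);   /* only reached when action() did not crash */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On Linux I would expect `signal_from_child(write_through_null)` to return `SIGSEGV` and `signal_from_child(return_normally)` to return 0. The string-literal case depends on the platform placing literals in read-only memory, so treat its result as platform-specific.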
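How do you debug a segmentation fault?
This part is based mainly on the blog post "你的java/c/c++程序崩溃了?揭秘段错误(Segmentation fault)(3)" (Did your Java/C/C++ program crash? Demystifying segmentation faults, part 3).
The faulty code
The code used as the example is: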
// stack.c
#include <stdio.h>
#include <string.h>
#include <stdlib.h>


int main(int argc, char **args) {
    char *p = NULL;
    *p = 0x0;   /* line 9: writes through a null pointer */
}
Running the program gives the following result:
Locating the problem
Step 1: use strace to see the signal details
strace -i -x -o segfault.txt ./segfault.o
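This produces the following information:
From it we learn:
1. Signal: SIGSEGV
2. Error code: SEGV_MAPERR
3. Faulting memory address: 0x0
4. The fault occurred at logical (instruction) address 0x400507.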
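From this we can guess that the program accesses a null pointer: it tries to write to 0x0, and that triggers the segmentation fault.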
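The fields that strace reports (signal, error code, faulting address) are the same ones a process receives in `siginfo_t` when it installs a handler with `SA_SIGINFO`. Below is a small sketch of that, assuming Linux. The names `probe_write`, `on_segv`, `fault_code`, `fault_addr` and `recover_point` are hypothetical, and jumping out of a SIGSEGV handler with `siglongjmp` is done here only to keep the demonstration alive; it is not something to rely on in real programs.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Filled in by the handler so the caller can inspect the fault. */
volatile sig_atomic_t fault_code;   /* si_code, e.g. SEGV_MAPERR */
void *volatile fault_addr;          /* si_addr, the address that was accessed */

static sigjmp_buf recover_point;

/* SA_SIGINFO-style handler: record the details, then jump back. */
static void on_segv(int signo, siginfo_t *info, void *context) {
    (void)signo;
    (void)context;
    fault_code = info->si_code;
    fault_addr = info->si_addr;
    siglongjmp(recover_point, 1);
}

/* Tries to write one zero byte to `addr`.
 * Returns 1 if the write raised SIGSEGV, 0 if it succeeded,
 * or -1 if the handler could not be installed. */
int probe_write(char *addr) {
    struct sigaction action;
    struct sigaction previous;

    memset(&action, 0, sizeof action);
    action.sa_sigaction = on_segv;
    action.sa_flags = SA_SIGINFO;
    sigemptyset(&action.sa_mask);
    if (sigaction(SIGSEGV, &action, &previous) != 0) {
        return -1;
    }

    if (sigsetjmp(recover_point, 1) != 0) {
        /* We get here from the handler, after the faulting write. */
        sigaction(SIGSEGV, &previous, NULL);
        return 1;
    }

    *(volatile char *)addr = 0;   /* may fault */

    sigaction(SIGSEGV, &previous, NULL);
    return 0;
}
```

After `probe_write(NULL)` returns 1, `fault_code` should hold `SEGV_MAPERR` and `fault_addr` should be the null address, which is the same pair of facts read off the strace output above.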
For how to use strace, see the blog post "Linux strace 命令" (The Linux strace command).
Step 2: use dmesg to inspect the fault site
dmesg
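This gives:
From it we learn:
1. Error type: segfault, i.e. a segmentation fault.
2. Instruction pointer (ip) at the time of the fault: 0x400507
3. Error number: 6, i.e. 110 in binary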
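Step 3: collect what we know so far
The error number and the ip are the key pieces here. The error number is decoded with the following table: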
/*
 * Page fault error code bits:
 *
 *   bit 0 ==    0: no page found       1: protection fault
 *   bit 1 ==    0: read access         1: write access
 *   bit 2 ==    0: kernel-mode access  1: user-mode access
 *   bit 3 ==                           1: use of reserved bit detected
 *   bit 4 ==                           1: fault was an instruction fetch
 */
/*enum x86_pf_error_code {
    PF_PROT  = 1 << 0,
    PF_WRITE = 1 << 1,
    PF_USER  = 1 << 2,
    PF_RSVD  = 1 << 3,
    PF_INSTR = 1 << 4,
};*/
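Checking against the table, we find:
Error number 6 = binary 110 = (PF_USER | PF_WRITE | 0).
That is: a user-mode access, a write-type page fault, and no page corresponding to the given address.
This matches our initial guess.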
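The same decoding can be done mechanically. Here is a minimal sketch that reuses the `x86_pf_error_code` bit values quoted above. The helper name `describe_pf_error` and the wording of the description are my own choices for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Bit values as listed in the x86_pf_error_code enum quoted above. */
enum x86_pf_error_code {
    PF_PROT  = 1 << 0,
    PF_WRITE = 1 << 1,
    PF_USER  = 1 << 2,
    PF_RSVD  = 1 << 3,
    PF_INSTR = 1 << 4
};

/* Hypothetical helper: writes a readable description of a page fault
 * error number into `buf` and returns what snprintf returns
 * (the length of the full description). */
int describe_pf_error(unsigned long code, char *buf, size_t size) {
    return snprintf(buf, size, "%s-mode %s, %s%s%s",
                    (code & PF_USER)  ? "user" : "kernel",
                    (code & PF_WRITE) ? "write" : "read",
                    (code & PF_PROT)  ? "protection fault" : "no page found",
                    (code & PF_RSVD)  ? ", reserved bit set" : "",
                    (code & PF_INSTR) ? ", instruction fetch" : "");
}
```

For the error number 6 from dmesg, this yields the text `user-mode write, no page found`, which is the same reading as the manual decoding.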
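The conclusions so far can be summarized as:
1. Error type: segfault, i.e. a segmentation fault.
2. Instruction pointer (ip) at the time of the fault: 0x400507
3. Error number: 6, i.e. 110 in binary
4. Error code: SEGV_MAPERR, i.e. the address is not mapped to an object.
5. Cause: a write to 0x0 triggered the segmentation fault, because 0x0 has no corresponding page (no mapping).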
Step 4: use the conclusions to find the faulty code
gdb ./segfault.o
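Using ip = 0x400507 from the conclusions, we immediately get:
This clearly confirms our conclusion: we tried to write the value 0x0 to address 0x0, which caused a segmentation fault for writing to an unmapped address.
We have also found the faulty code: line 9 of stack.c.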
Debugging a Core Dump
Besides the method described above, we can also locate the faulty code by debugging a core dump:
For details on core dumps, see the blog post "Linux Core Dump".
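A core dump is only written if the process's core file size limit allows it. That limit is usually changed from the shell with `ulimit -c`, but a program can also raise it for itself. The sketch below shows one way to do that, assuming a POSIX system; `enable_core_dumps` is a hypothetical helper name. Whether and where the core file actually appears still depends on system configuration, which this code does not touch.

```c
#include <sys/resource.h>

/* Hypothetical helper: raises the soft core-file size limit to the
 * hard limit, so that a later crash is allowed to leave a core file
 * as large as the hard limit permits.
 * Returns 0 on success, -1 on failure. */
int enable_core_dumps(void) {
    struct rlimit limit;

    if (getrlimit(RLIMIT_CORE, &limit) != 0) {
        return -1;
    }
    limit.rlim_cur = limit.rlim_max;   /* soft limit := hard limit */
    return setrlimit(RLIMIT_CORE, &limit);
}
```

Calling this early in `main` of the example program would be one way to make the crash on line 9 of stack.c eligible for a core dump. Note that if the hard limit is itself 0, the soft limit stays at 0 and no core file will be produced.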
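References
"你的java/c/c++程序崩溃了?揭秘段错误(Segmentation fault)(1)" (Did your Java/C/C++ program crash? Demystifying segmentation faults, part 1)
"你的java/c/c++程序崩溃了?揭秘段错误(Segmentation fault)(2)" (Did your Java/C/C++ program crash? Demystifying segmentation faults, part 2)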