CPU Study Notes (2)
Author: Badcoffee
Email: blog.oliver@gmail.com
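April 2005
Original source: http://blog.csdn.net/yayong
Copyright: when reposting, please cite the original source, the author information,
and this notice in the form of a hyperlink.
These are notes the author took while learning the basics of hardware. Having had little
exposure to this area and no systematic study of it, the author has likely made mistakes,
and corrections are welcome.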
I. Cache Coherence
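An article I wrote in 2004, X86 Assembly Language Study Notes (1), touched on the fact
that code compiled by gcc aligns the stack to 16 bytes by default. This is done mainly
for performance.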
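Most modern CPUs integrate the L1 and L2 caches on-die. The L1 cache is mostly
write-through; the L2 cache is write-back and does not write to memory immediately, so
the contents of cache and memory can become inconsistent. In addition, in an MP
(multiprocessor) environment each CPU has its own private cache, so the contents of
different CPUs' caches can also disagree. For this reason many MP architectures, whether
ccNUMA or SMP, implement a cache coherence mechanism, that is, a mechanism that keeps
the caches of different CPUs consistent.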
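One way to implement cache coherence is a cache-snooping protocol: each CPU snoops the
bus to monitor the cache reads and writes of the other CPUs.
First, note that a cache line is the smallest unit of data transfer between cache and
memory.
1. When CPU1 writes to its cache, every other CPU checks the corresponding cache line in
its own cache. If that line is dirty, it is written back to memory and CPU1's copy of
the line is refreshed; if it is not dirty, the line is invalidated.
2. When CPU1 reads from its cache, every other CPU writes back to memory whatever part
of its corresponding cache line is marked dirty, and CPU1's copy of the line is
refreshed.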
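The two rules above can be sketched as a toy model in C. This is only an illustration
under strong assumptions: two CPUs, one cached word, and three hypothetical line states
(INVALID, CLEAN, DIRTY) loosely modelled on a write-invalidate scheme. The names
cpu_read, cpu_write and snoop_others are made up for this sketch and do not describe
any real hardware interface.

```c
#include <assert.h>

#define NCPU 2 /* assumption: a two-CPU system */

/* Hypothetical per-line states for this toy model. */
enum line_state { INVALID, CLEAN, DIRTY };

struct cache_line {
    enum line_state state;
    int value;
};

static struct cache_line cache[NCPU]; /* one private line per CPU, starts INVALID */
static int memory_word;               /* the single memory word being cached */

/* Every CPU except the requester snoops the bus request. A dirty copy is
 * written back to memory first; on a write request the copy is then dropped. */
static void snoop_others(int requester, int is_write)
{
    for (int cpu = 0; cpu < NCPU; cpu++) {
        if (cpu == requester)
            continue;
        if (cache[cpu].state == DIRTY) {
            memory_word = cache[cpu].value; /* write back */
            cache[cpu].state = CLEAN;
        }
        if (is_write)
            cache[cpu].state = INVALID;
    }
}

/* Read through the requester's cache; a miss refreshes the line from memory. */
int cpu_read(int cpu)
{
    assert(cpu >= 0 && cpu < NCPU);
    if (cache[cpu].state == INVALID) {
        snoop_others(cpu, 0);
        cache[cpu].value = memory_word;
        cache[cpu].state = CLEAN;
    }
    return cache[cpu].value;
}

/* Write into the requester's cache only (write-back): memory is updated later. */
void cpu_write(int cpu, int value)
{
    assert(cpu >= 0 && cpu < NCPU);
    snoop_others(cpu, 1);
    cache[cpu].value = value;
    cache[cpu].state = DIRTY;
}
```

In this model, cpu_write(0, 42) followed by cpu_read(1) forces CPU 0 to write its dirty
line back before CPU 1 loads it, so the read returns 42 and memory_word becomes 42.
Every such exchange costs bus traffic.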
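Raising the CPU's cache hit rate and reducing data transfer between cache and memory
therefore improves system performance.
This is why keeping memory allocations for programs and binary objects cache line
aligned matters so much. Without cache line alignment, it becomes much more likely that
processes or threads running in parallel on several CPUs read and write the same cache
line at the same time. The caches and memory then go through repeated write-backs and
refreshes, a situation called cache thrashing (often also called false sharing when the
threads touch different data that merely share a cache line).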
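As a rough illustration of the layout problem, the sketch below compares two counters
packed side by side with two counters each placed on its own cache line. The 64-byte
line size is an assumption (it varies by CPU), and the struct and function names are
hypothetical. The check also assumes the enclosing object starts on a line boundary.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64 /* assumption: common on x86, not universal */

/* Two counters packed together: both normally land in one cache line. */
struct packed_counters {
    long a;
    long b;
};

/* Each counter is forced onto its own cache line (C11 alignas). */
struct padded_counters {
    alignas(CACHE_LINE_SIZE) long a;
    alignas(CACHE_LINE_SIZE) long b;
};

static_assert(alignof(struct padded_counters) == CACHE_LINE_SIZE,
              "padded struct must be cache line aligned");

/* 1 if two byte offsets fall in the same line, given a line-aligned base. */
static int same_line(size_t off1, size_t off2)
{
    return off1 / CACHE_LINE_SIZE == off2 / CACHE_LINE_SIZE;
}

int packed_counters_share_line(void)
{
    return same_line(offsetof(struct packed_counters, a),
                     offsetof(struct packed_counters, b));
}

int padded_counters_share_line(void)
{
    return same_line(offsetof(struct padded_counters, a),
                     offsetof(struct padded_counters, b));
}
```

If one thread updates a and another updates b, the packed layout makes both CPUs fight
over a single line, while the padded layout gives each CPU its own line at the cost of
a larger struct.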
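There are usually two ways to avoid cache thrashing effectively:
1. For heap allocation, many systems enforce alignment inside the malloc call.
2. For stack allocation, many compilers provide a stack alignment option.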
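For the heap case, a program can also request cache line alignment explicitly instead
of relying on whatever malloc provides. The sketch below uses the C11 aligned_alloc
function; the 64-byte line size and the helper name alloc_line_aligned are assumptions
made for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_SIZE 64 /* assumption: adjust for the target CPU */

/* Allocate at least `size` bytes starting on a cache line boundary.
 * Returns NULL on failure; release the block with free(). */
void *alloc_line_aligned(size_t size)
{
    if (size > SIZE_MAX - (CACHE_LINE_SIZE - 1))
        return NULL; /* rounding up would overflow */

    /* Round the size up to a whole number of lines before allocating. */
    size_t rounded = (size + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE * CACHE_LINE_SIZE;
    if (rounded == 0)
        rounded = CACHE_LINE_SIZE;

    void *p = aligned_alloc(CACHE_LINE_SIZE, rounded);
    assert(p == NULL || (uintptr_t)p % CACHE_LINE_SIZE == 0);
    return p;
}
```

Rounding the size up to a full line also means the block does not share its last cache
line with a neighbouring allocation.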
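Of course, if stack alignment is requested from the compiler, the program gets larger
and uses more memory. The trade-off therefore needs careful thought. Below is part of a
discussion I found through Google: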
One of our customers complained about the additional code generated to
maintain the stack aligned to 16-byte boundaries, and suggested us to
default to the minimum alignment when optimizing for code size. This
has the caveat that, when you link code optimized for size with code
optimized for speed, if a function optimized for size calls a
performance-critical function with the stack misaligned, the
performance-critical function may perform poorly.
II. gcc Alignment Options
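-mpreferred-stack-boundary was already mentioned in X86 Assembly Language Study Notes
(1). I also found, through Google, a mailing-list message discussing stack alignment,
which I share here: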
----- Original Message -----
From: "Andreas Jaeger"
To: gcc@gcc.gnu.org
Cc: "Jens Wallner" wallner@ims.uni-hannover.de
Sent: Saturday, February 03, 2001 2:37 AM
Subject: Question about -mpreferred-stack-boundary
>
> We (glibc team) got a bug report that the stack is not aligned
> properly - and I'm a bit confused by the documentation of
> -mpreferred-stack-boundary which is:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> @item -mpreferred-stack-boundary=@var{num}
> Attempt to keep the stack boundary aligned to a 2 raised to @var{num}
> byte boundary. If @samp{-mpreferred-stack-boundary} is not specified,
> the default is 4 (16 bytes or 128 bits).
>
> The stack is required to be aligned on a 4 byte boundary. On Pentium
> and PentiumPro, @code{double} and @code{long double} values should be
> aligned to an 8 byte boundary (see @samp{-malign-double}) or suffer
> significant run time performance penalties. On Pentium III, the
> Streaming SIMD Extension (SSE) data type @code{__m128} suffers similar
> penalties if it is not 16 byte aligned.
>
> To ensure proper alignment of this values on the stack, the stack boundary
> must be as aligned as that required by any value stored on the stack.
> Further, every function must be generated such that it keeps the stack
> aligned. Thus calling a function compiled with a higher preferred
> stack boundary from a function compiled with a lower preferred stack
> boundary will most likely misalign the stack. It is recommended that
> libraries that use callbacks always use the default setting.
>
> This extra alignment does consume extra stack space. Code that is sensitive
> to stack space usage, such as embedded systems and operating system kernels,
> may want to reduce the preferred alignment to
> @samp{-mpreferred-stack-boundary=2}.
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Who has to align the stack for calls to a function - the caller or the
> callee? In other words: Does this mean that the stack has to be
> aligned before calling a function? Or does it have to be aligned when
> entering a function?
>
> Andreas
> --
> Andreas Jaeger
> SuSE Labs aj@suse.de
> private aj@arthur.inka.de
> http://www.suse.de/~aj
I believe the preferred alignment for long double is a 16 byte boundary, and
the stack (and instruction) alignments must be so set before entering a function.
Pentium 4 increases preferred data alignments to 32 bytes in some situations,
as well as increasing the number of situations (SSE2 instructions) where 16 byte
alignment is needed.
This shows that stack alignment has to be guaranteed before the function is called:
the stack (and instruction) alignments must be so set before entering a function
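To make the numbers concrete, here is a small C sketch. The first helper only does the
arithmetic from the quoted documentation (a value of num means a boundary of 2 raised
to num bytes). The second checks that a local variable explicitly declared with 16-byte
alignment really sits on a 16-byte boundary. Both function names are hypothetical, and
the check exercises C11 alignas rather than the -mpreferred-stack-boundary flag itself.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Boundary in bytes requested by -mpreferred-stack-boundary=num. */
unsigned long boundary_bytes(unsigned num)
{
    assert(num < 32);
    return 1UL << num;
}

/* Returns 1 if a local declared with 16-byte alignment is placed on a
 * 16-byte boundary, as an SSE value such as __m128 would require. */
int local_is_16_byte_aligned(void)
{
    alignas(16) unsigned char slot[16] = {0};
    return (uintptr_t)slot % 16 == 0;
}
```

With the default value of 4, boundary_bytes gives 16, matching the "16 bytes or 128
bits" in the documentation; the value 2 suggested for space-sensitive code gives 4.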
Related documents:
X86 Assembly Language Study Notes (1)
CPU Study Notes (1)
Cache Coherence with Multi-Processor