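1. Pages
The kernel treats the page as the basic unit of memory management, and the MMU likewise manages the system's page tables in units of pages. Page size varies between systems; on Linux it is typically 4 KB. You can query it with the getconf PAGE_SIZE command or with the following C program: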
#include <unistd.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    printf("System page size: %d\n", getpagesize());
    return 0;
}
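As a complement to the program above, here is a minimal sketch that reads the page size through the POSIX sysconf(_SC_PAGESIZE) call and splits an address into a page number and an offset within that page. The sketch_* helper names are made up for this illustration and are not kernel or libc APIs.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Page size in bytes as reported by POSIX sysconf(). */
static long sketch_page_size(void)
{
    long size = sysconf(_SC_PAGESIZE);

    assert(size > 0);   /* sysconf() returns -1 on error */
    return size;
}

/* Index of the page that contains addr (hypothetical helper). */
static uintptr_t sketch_page_number(uintptr_t addr, uintptr_t page_size)
{
    return addr / page_size;
}

/* Byte offset of addr inside its page (hypothetical helper). */
static uintptr_t sketch_page_offset(uintptr_t addr, uintptr_t page_size)
{
    return addr % page_size;
}
```

With a 4 KB page, address 3 * 4096 + 5 falls in page 3 at offset 5.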
The kernel uses struct page to represent every physical page in the system:
/*
 * Each physical page in the system has a struct page associated with
 * it to keep track of whatever it is we are using the page for at the
 * moment. Note that we have no way to track which tasks are using
 * a page, though if it is a pagecache page, rmap structures can tell us
 * who is mapping it.
 *
 * The objects in struct page are organized in double word blocks in
 * order to allows us to use atomic double word operations on portions
 * of struct page. That is currently only used by slub but the arrangement
 * allows the use of atomic double word operations on the flags/mapping
 * and lru list pointers also.
 */
struct page {
    /* First double word block */
    unsigned long flags;            /* Atomic flags, some possibly
                                     * updated asynchronously */
    struct address_space *mapping;  /* If low bit clear, points to
                                     * inode address_space, or NULL.
                                     * If page mapped as anonymous
                                     * memory, low bit is set, and
                                     * it points to anon_vma object:
                                     * see PAGE_MAPPING_ANON below.
                                     */
    /* Second double word */
    struct {
        union {
            pgoff_t index;          /* Our offset within mapping. */
            void *freelist;         /* slub first free object */
        };

        union {
            /* Used for cmpxchg_double in slub */
            unsigned long counters;

            struct {

                union {
                    /*
                     * Count of ptes mapped in
                     * mms, to show when page is
                     * mapped & limit reverse map
                     * searches.
                     *
                     * Used also for tail pages
                     * refcounting instead of
                     * _count. Tail pages cannot
                     * be mapped and keeping the
                     * tail page _count zero at
                     * all times guarantees
                     * get_page_unless_zero() will
                     * never succeed on tail
                     * pages.
                     */
                    atomic_t _mapcount;

                    struct {
                        unsigned inuse:16;
                        unsigned objects:15;
                        unsigned frozen:1;
                    };
                };
                atomic_t _count;    /* Usage count, see below. */
            };
        };
    };

    /* Third double word block */
    union {
        struct list_head lru;       /* Pageout list, eg. active_list
                                     * protected by zone->lru_lock !
                                     */
        struct {                    /* slub per cpu partial pages */
            struct page *next;      /* Next partial slab */
#ifdef CONFIG_64BIT
            int pages;              /* Nr of partial slabs left */
            int pobjects;           /* Approximate # of objects */
#else
            short int pages;
            short int pobjects;
#endif
        };
    };

    /* Remainder is not double word aligned */
    union {
        unsigned long private;      /* Mapping-private opaque data:
                                     * usually used for buffer_heads
                                     * if PagePrivate set; used for
                                     * swp_entry_t if PageSwapCache;
                                     * indicates order in the buddy
                                     * system if PG_buddy is set.
                                     */

        struct kmem_cache *slab;    /* SLUB: Pointer to slab */
        struct page *first_page;    /* Compound tail pages */
    };

    /*
     * On machines where all RAM is mapped into kernel address space,
     * we can simply calculate the virtual address. On machines with
     * highmem some memory is mapped into kernel virtual memory
     * dynamically, so we need a place to store that address.
     * Note that this field could be 16 bits on x86 ... ;)
     *
     * Architectures with slow multiplication can define
     * WANT_PAGE_VIRTUAL in asm/page.h
     */
#if defined(WANT_PAGE_VIRTUAL)
    void *virtual;                  /* Kernel virtual address (NULL if
                                       not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
#ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS
    unsigned long debug_flags;      /* Use atomic bitops on this */
#endif

#ifdef CONFIG_KMEMCHECK
    /*
     * kmemcheck wants to track the status of each byte in a page; this
     * is a pointer to such a status block. NULL if not tracked.
     */
    void *shadow;
#endif
};
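The comment on the mapping field describes a pointer-tagging trick: because the pointed-to objects are aligned, the low bit of the pointer is always zero and can be reused as an "anonymous page" flag. The following user-space sketch illustrates the idea only. The sketch_* names are hypothetical, and the real kernel helpers and their details may differ between versions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low-bit flag, modeled on PAGE_MAPPING_ANON; the name is hypothetical. */
#define SKETCH_MAPPING_ANON ((uintptr_t)1)

/* Mark a pointer as "anonymous" by setting its low bit. */
static void *sketch_tag_anon(void *anon_vma)
{
    /* The trick only works if the pointer's low bit is free. */
    assert(((uintptr_t)anon_vma & SKETCH_MAPPING_ANON) == 0);
    return (void *)((uintptr_t)anon_vma | SKETCH_MAPPING_ANON);
}

/* True if the low bit is set, i.e. the mapping is anonymous memory. */
static bool sketch_is_anon(const void *mapping)
{
    return ((uintptr_t)mapping & SKETCH_MAPPING_ANON) != 0;
}

/* Recover the original pointer by clearing the flag bit. */
static void *sketch_untag(void *mapping)
{
    return (void *)((uintptr_t)mapping & ~SKETCH_MAPPING_ANON);
}
```

The benefit of this design is that struct page needs no extra field to distinguish the two cases, which matters because one struct page exists for every physical page.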
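2. Zones
The kernel uses zones to group pages with similar properties, so that physical pages can be allocated according to their intended use. Linux defines the following zones: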
enum zone_type {
#ifdef CONFIG_ZONE_DMA
    /*
     * ZONE_DMA is used when there are devices that are not able
     * to do DMA to all of addressable memory (ZONE_NORMAL). Then we
     * carve out the portion of memory that is needed for these devices.
     * The range is arch specific.
     *
     * Some examples
     *
     * Architecture         Limit
     * ---------------------------
     * parisc, ia64, sparc  <4G
     * s390                 <2G
     * arm                  Various
     * alpha                Unlimited or 0-16MB.
     *
     * i386, x86_64 and multiple other arches
     *                      <16M.
     */
    ZONE_DMA,
#endif
#ifdef CONFIG_ZONE_DMA32
    /*
     * x86_64 needs two ZONE_DMAs because it supports devices that are
     * only able to do DMA to the lower 16M but also 32 bit devices that
     * can only do DMA areas below 4G.
     */
    ZONE_DMA32,
#endif
    /*
     * Normal addressable memory is in ZONE_NORMAL. DMA operations can be
     * performed on pages in ZONE_NORMAL if the DMA devices support
     * transfers to all addressable memory.
     */
    ZONE_NORMAL,
#ifdef CONFIG_HIGHMEM
    /*
     * A memory area that is only addressable by the kernel through
     * mapping portions into its own address space. This is for example
     * used by i386 to allow the kernel to address the memory beyond
     * 900MB. The kernel will set up special mappings (page
     * table entries on i386) for each page that the kernel needs to
     * access.
     */
    ZONE_HIGHMEM,
#endif
    /* A pseudo zone used to prevent physical memory fragmentation. */
    ZONE_MOVABLE,
    /* End marker; used when iterating over all zones in the system. */
    __MAX_NR_ZONES
};
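To make the limits in these comments concrete, here is a small user-space sketch that classifies a physical address the way a classic i386 layout would: below 16 MB is the DMA zone, below 896 MB is the normal zone, and everything above is high memory. The enum and function names are hypothetical, the boundaries are the i386 figures only, and ZONE_DMA32 and ZONE_MOVABLE are deliberately left out.

```c
#include <stdint.h>

/* Simplified stand-in for enum zone_type (hypothetical names). */
enum sketch_zone {
    SKETCH_ZONE_DMA,
    SKETCH_ZONE_NORMAL,
    SKETCH_ZONE_HIGHMEM
};

#define SKETCH_MB (1024ULL * 1024ULL)

/* Classify a physical address using the i386 boundaries (16 MB, 896 MB). */
static enum sketch_zone sketch_zone_of(uint64_t phys_addr)
{
    if (phys_addr < 16 * SKETCH_MB)
        return SKETCH_ZONE_DMA;
    if (phys_addr < 896 * SKETCH_MB)
        return SKETCH_ZONE_NORMAL;
    return SKETCH_ZONE_HIGHMEM;
}
```

On other architectures the boundaries differ, as the table in the ZONE_DMA comment shows.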
3. Getting pages
Inside the kernel, physical pages are allocated through the following interfaces:
/* Allocate 2^order contiguous physical pages; returns a struct page * */
struct page *alloc_pages(gfp_t gfp_mask, unsigned int order);

/* Convert a struct page * to its logical address */
void *page_address(const struct page *page);

/* Allocate 2^order contiguous physical pages; returns the logical address of the first page */
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);

/* To allocate a single page, use one of the following two macros */
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
#define __get_free_page(gfp_mask) __get_free_pages((gfp_mask), 0)
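Because these interfaces take an order rather than a byte count, callers have to convert a size into the smallest order that covers it. The user-space sketch below shows that arithmetic. The sketch_* functions are hypothetical helpers written for this post, similar in spirit to the kernel's own size-to-order conversion, and they ignore overflow for very large sizes.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest order such that (page_size << order) >= size. */
static unsigned int sketch_get_order(size_t size, size_t page_size)
{
    unsigned int order = 0;
    size_t span = page_size;

    assert(size > 0 && page_size > 0);
    while (span < size) {
        span <<= 1;     /* each order doubles the number of pages */
        order++;
    }
    return order;
}

/* Number of bytes covered by an allocation of the given order. */
static size_t sketch_order_bytes(unsigned int order, size_t page_size)
{
    return page_size << order;
}
```

With 4 KB pages, a request for 8193 bytes needs order 2, that is four pages or 16384 bytes, so almost half of the allocation is unused. This rounding is one reason page-level allocation is a poor fit for small or oddly sized objects.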
To allocate a single page that is filled with zeros, use the following interface:
unsigned long get_zeroed_page(gfp_t gfp_mask)
{
    return __get_free_pages(gfp_mask | __GFP_ZERO, 0);
}
Once the allocated memory is no longer needed, it must be freed:
#define free_page(addr) free_pages((addr), 0)
void free_pages(unsigned long addr, unsigned int order);
void __free_pages(struct page *page, unsigned int order);
4. kmalloc
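The functions above are useful when you need a run of physically contiguous pages. For the more common case of allocating by the byte, the kernel provides the more convenient kmalloc. kmalloc is defined in include/linux/slab_def.h:
static __always_inline void *kmalloc(size_t size, gfp_t flags)
The memory it allocates is physically contiguous. When allocation fails (out of memory), it returns NULL. When you are done with the memory, call kfree to release it.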
Further explanation from http://stackoverflow.com/questions/12568379/kmalloc-size-allocation:
My understanding is as follows: the kernel is dealing with the physical memory of the system, which is available only in page-sized chunks; thus when you call kmalloc() you are going to get only certain predefined, fixed-size byte arrays.
The actual memory you get back is dependent on the system's architecture, but the smallest allocation that kmalloc can handle is as big as 32 or 64 bytes. You will get back from a call to kmalloc() at least as much memory as you asked for (usually more). Typically you will not get more than 128 KB (again, architecture dependent).
To get the page size (in bytes) of your system you can execute the command:
getconf PAGESIZE
or
getconf PAGE_SIZE
This information on max page size is in /usr/src/linux/include/linux/slab.h
And yes, the page sizes are generally powers of 2, but again, you're not going to get exactly what you ask for, but a little bit more.
You can use some code like this:
void *stuff;
stuff = kmalloc(1, GFP_KERNEL);
printk("I got: %zu bytes of memory\n", ksize(stuff));
kfree(stuff);
To show the actual amount of memory allocated:
[90144.702588] I got: 32 bytes of memory
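The rounding described in the quote can be modeled in user space. The sketch below assumes pure power-of-two size classes starting at a configurable minimum. That is a simplification, since a real kernel may have extra size classes and the minimum depends on the architecture and allocator. sketch_kmalloc_bucket is a hypothetical name and not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Round a request up to the next power-of-two bucket, never smaller
 * than min_bucket. Simplified model of kmalloc size classes.
 */
static size_t sketch_kmalloc_bucket(size_t size, size_t min_bucket)
{
    size_t bucket = min_bucket;

    assert(size > 0 && min_bucket > 0);
    while (bucket < size)
        bucket <<= 1;
    return bucket;
}
```

Under this model a 1-byte request with a 32-byte minimum lands in the 32-byte bucket, which matches the ksize() output quoted above.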
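The kernel defines a number of type flags for allocations with different purposes:
- GFP_ATOMIC: the allocation is performed atomically and never sleeps
- GFP_NOIO: the allocation may block, but will not start disk I/O
- GFP_NOFS: the allocation may block and may start disk I/O, but will not start filesystem operations
- GFP_KERNEL: the preferred way to allocate memory in the kernel; may block
- GFP_USER: allocates memory for user space
- GFP_HIGHUSER: allocates memory for user space from ZONE_HIGHMEM (I could not find out when this is actually used)
- GFP_DMA: allocates from ZONE_DMA; typically used by device drivers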
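One way to read the list above is as a decision procedure driven by the calling context. The sketch below encodes that reading in plain C. The enum, the struct, and the function are all hypothetical stand-ins invented for this post rather than the kernel's gfp_t machinery, and the zone-related flags (GFP_USER, GFP_HIGHUSER, GFP_DMA) are left out.

```c
#include <stdbool.h>

/* Stand-ins for the context-related flags (hypothetical names). */
enum sketch_gfp {
    SKETCH_GFP_ATOMIC,
    SKETCH_GFP_NOIO,
    SKETCH_GFP_NOFS,
    SKETCH_GFP_KERNEL
};

/* What the caller knows about its own context. */
struct sketch_context {
    bool cannot_sleep;        /* e.g. interrupt handler or spinlock held */
    bool in_block_io_path;    /* starting disk I/O could recurse */
    bool in_filesystem_code;  /* calling into the filesystem could recurse */
};

/* Pick the least restrictive flag that is still safe for the context. */
static enum sketch_gfp sketch_choose_gfp(const struct sketch_context *ctx)
{
    if (ctx->cannot_sleep)
        return SKETCH_GFP_ATOMIC;
    if (ctx->in_block_io_path)
        return SKETCH_GFP_NOIO;
    if (ctx->in_filesystem_code)
        return SKETCH_GFP_NOFS;
    return SKETCH_GFP_KERNEL;
}
```

The checks run from most to least restrictive, so ordinary process-context code with no special constraints ends up with the GFP_KERNEL equivalent.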
5. vmalloc
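vmalloc is similar to kmalloc, but the memory it allocates is contiguous only in virtual address space; the physical addresses need not be contiguous (this is also how user-space memory allocation works). vmalloc has to set up dedicated page table entries to map the physically discontiguous memory into contiguous pages of virtual address space. In general, vmalloc is used only for large allocations; most allocations in the kernel still use kmalloc.
Prototypes of vmalloc and vfree:
void *vmalloc(unsigned long size);
void vfree(const void *addr);
Note: vmalloc may sleep, so it must not be used in interrupt context.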
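The idea of "virtually contiguous, physically scattered" can be modeled in user space with a toy page table: an array of pointers to separately allocated frames, plus a lookup that translates a linear offset into a frame and an offset inside it. This is only an illustration of the concept. The sketch_* names and the fixed 4096-byte page size are assumptions, and the real vmalloc works on hardware page tables instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size for the model */

struct sketch_vregion {
    size_t nr_pages;
    unsigned char **frames;     /* toy "page table": one pointer per page */
};

/* Release every frame, then the table and the region itself. */
static void sketch_vfree(struct sketch_vregion *r)
{
    size_t i;

    if (!r)
        return;
    for (i = 0; i < r->nr_pages; i++)
        free(r->frames[i]);     /* free(NULL) is harmless */
    free(r->frames);
    free(r);
}

/* Allocate frames one by one; they need not be adjacent in memory. */
static struct sketch_vregion *sketch_valloc(size_t size)
{
    struct sketch_vregion *r;
    size_t i;

    if (size == 0 || !(r = malloc(sizeof(*r))))
        return NULL;
    r->nr_pages = (size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
    r->frames = calloc(r->nr_pages, sizeof(*r->frames));
    if (!r->frames) {
        free(r);
        return NULL;
    }
    for (i = 0; i < r->nr_pages; i++) {
        r->frames[i] = calloc(1, SKETCH_PAGE_SIZE);
        if (!r->frames[i]) {
            sketch_vfree(r);
            return NULL;
        }
    }
    return r;
}

/* Translate a linear offset into the byte it refers to. */
static unsigned char *sketch_vaddr(struct sketch_vregion *r, size_t offset)
{
    assert(offset < r->nr_pages * SKETCH_PAGE_SIZE);
    return &r->frames[offset / SKETCH_PAGE_SIZE][offset % SKETCH_PAGE_SIZE];
}
```

Every access pays for an extra lookup in this model. The kernel analogue is the cost of the additional page table entries and TLB pressure, which is part of why kmalloc remains the default choice.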
6. The slab layer
A good reference: http://www.ibm.com/developerworks/cn/linux/l-linux-slab-allocator/
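7. Static allocation on the stack
The kernel stack is unlike the user-space stack. In user space the stack can be large and can grow dynamically, whereas the kernel stack is small and fixed in size. The size of each process's kernel stack depends on the architecture and also on compile-time options. Historically, every process had a two-page kernel stack. Because 32-bit and 64-bit architectures typically use 4 KB and 8 KB pages respectively, the kernel stack is typically 8 KB and 16 KB.
In every function you must be frugal with stack space. Try to keep the total size of all local variables in a function to no more than a few hundred bytes. Making large static allocations on the kernel stack is dangerous. A stack overflow often goes unnoticed at first, but it inevitably has serious consequences. When the kernel stack overflows, the excess data overwrites whatever lies just past the end of the stack, and the first thing there is the thread_info structure. For anything large, dynamic memory allocation is therefore the sensible choice.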
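The advice above amounts to replacing large local arrays with dynamically allocated buffers. The user-space sketch below shows the pattern with malloc and free; in kernel code the equivalents would be kmalloc and kfree. The function name and the checksum task are made up for illustration, and the working copy exists only to give the function a reason to need scratch space.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Sum the bytes of data using a heap-allocated working copy.
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated.
 */
static int sketch_checksum(const unsigned char *data, size_t len,
                           unsigned long *out)
{
    unsigned char *scratch;
    unsigned long sum = 0;
    size_t i;

    /* Risky on a small fixed-size stack: unsigned char scratch[8192]; */
    scratch = malloc(len ? len : 1);
    if (!scratch)
        return -1;

    memcpy(scratch, data, len);
    for (i = 0; i < len; i++)
        sum += scratch[i];

    free(scratch);
    *out = sum;
    return 0;
}
```

The price of this pattern is an explicit failure path: unlike a local array, a dynamic allocation can fail, and the caller has to handle that.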
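8. High memory mapping
In Linux memory management, the kernel uses the linear address range from 3 GB to 4 GB, 1 GB in total. On 80x86, 896 MB of these linear addresses in the kernel page tables map one-to-one onto physical addresses, and the remaining 128 MB of linear addresses are reserved for other purposes (noncontiguous memory allocation and fixed-mapped linear addresses). Physical memory above 896 MB is usually called high memory.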
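The numbers in this paragraph translate into simple address arithmetic. The sketch below assumes the classic 3 GB / 1 GB split with a kernel base of 0xC0000000 and an 896 MB direct-mapped region. The macro and function names are hypothetical, and real kernels make these values configurable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_OFFSET  0xC0000000u             /* kernel starts at 3 GB */
#define SKETCH_LOWMEM_LIMIT (896u * 1024u * 1024u)  /* 896 MB direct-mapped */

/* Physical addresses at or above 896 MB are high memory. */
static bool sketch_is_highmem(uint32_t phys)
{
    return phys >= SKETCH_LOWMEM_LIMIT;
}

/* Direct mapping: kernel linear address = physical address + 3 GB. */
static uint32_t sketch_phys_to_virt(uint32_t phys)
{
    /* High memory has no permanent linear address in this scheme. */
    assert(!sketch_is_highmem(phys));
    return phys + SKETCH_PAGE_OFFSET;
}
```

The last directly mapped byte, at physical address 0x37FFFFFF, ends up at linear address 0xF7FFFFFF, which leaves the top 128 MB of the address space for the other mappings mentioned above.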
yunsongice has put together a good write-up on high memory: http://blog.csdn.net/yunsongice/article/details/5258589
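9. Per-CPU variables
The best synchronization technique is to design the kernel so that shared resources do not need synchronization in the first place. This is a way of thinking, because every explicit synchronization primitive has a performance cost that cannot be ignored.
The simplest and most important synchronization technique is to declare kernel variables or data structures as per-CPU variables. A per-CPU variable is essentially an array of data structures, with one element for each CPU in the system.
A CPU should not access the array elements that belong to other CPUs. On the other hand, it can freely read and modify its own element without worrying about race conditions, because it is the only CPU entitled to do so. This also means that per-CPU variables can be used only in particular cases, namely when the data on the system's CPUs is known to be logically independent.
The elements of a per-CPU array are laid out in main memory so that each data structure falls on a different line of the hardware cache. Concurrent accesses to the per-CPU array therefore do not cause cache line snooping and invalidation, which are costly operations.
Although per-CPU variables protect against concurrent access from different CPUs, they do not protect against access from asynchronous functions (interrupt handlers and deferrable functions). In those cases additional synchronization is required.
Furthermore, on both uniprocessor and multiprocessor systems, kernel preemption can cause race conditions on per-CPU variables. As a general rule, a kernel control path should access a per-CPU variable with preemption disabled. Consider a kernel control path that obtains the address of its local copy of a per-CPU variable, is then preempted and moved to another CPU, and still refers to the element that belongs to the original CPU: this is very dangerous.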
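The layout described above, one array element per CPU with each element on its own cache line, can be sketched in user-space C11. This is a model of the data layout only: the CPU index is passed in explicitly, nothing here disables preemption, and the 64-byte cache line and the sketch_* names are assumptions made for this post rather than kernel interfaces.

```c
#include <assert.h>

#define SKETCH_NR_CPUS    4
#define SKETCH_CACHE_LINE 64    /* assumed cache line size in bytes */

/* One counter per CPU, padded so that no two share a cache line. */
struct sketch_percpu_counter {
    _Alignas(SKETCH_CACHE_LINE) long value;
};

static struct sketch_percpu_counter sketch_counters[SKETCH_NR_CPUS];

/* Each CPU touches only its own slot, so no lock is needed. */
static void sketch_counter_inc(unsigned int cpu)
{
    assert(cpu < SKETCH_NR_CPUS);
    sketch_counters[cpu].value++;
}

/* Readers that want a global view add up all the slots. */
static long sketch_counter_sum(void)
{
    long total = 0;
    unsigned int cpu;

    for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
        total += sketch_counters[cpu].value;
    return total;
}
```

The padding trades memory for speed: each counter occupies a full cache line, but updates on one CPU never invalidate the line that another CPU is using.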
For the interfaces that operate on per-CPU variables, see http://blog.csdn.net/yunsongice/article/details/5605239