Virtual Memory I: the problem
http://en.wikipedia.org/wiki/High_memory
http://lwn.net/Articles/75174/
[Posted March 10, 2004 by corbet]
This article serves mostly as background to help understand why the kernel developers are considering making fundamental virtual memory changes at this point in the development cycle. It can probably be skipped by readers who understand how high and low memory work on 32-bit systems.
A 32-bit processor can address a maximum of 4GB of memory. One could, in theory, extend the instruction set to allow for larger pointers, but, in practice, nobody does that; the effects on performance and compatibility would be too strong. So the limitation remains: no process on a 32-bit system can have an address space larger than 4GB, and the kernel cannot directly address more than 4GB.
In fact, the limitations are more severe than that. Linux kernels split the 4GB address space between user processes and the kernel; under the most common configuration, the first 3GB of the 32-bit range are given over to user space, and the kernel gets the final 1GB starting at 0xc0000000. Sharing the address space gives a number of performance benefits; in particular, the hardware's address translation buffer can be shared between the kernel and user space.
If the kernel wishes to be able to access the system's physical memory directly, however, it must set up page tables which map that memory into the kernel's part of the address space. With the default 3GB/1GB mapping, the amount of physical memory which can be addressed in this way is somewhat less than 1GB - part of the kernel's space must be set aside for the kernel itself, for memory allocated with vmalloc(), and various other purposes. That is why, until a few years ago, Linux could not even fully handle 1GB of memory on 32-bit systems. In fact, back in 1999, Linus decreed that 32-bit Linux would never, ever support more than 2GB of memory. "This is not negotiable."
Linus's views notwithstanding, the rest of the world continued on with the strange notion that 32-bit systems should be able to support massive amounts of memory. The processor vendors added paging modes which could use physical addresses which exceed 32 bits in length, thus ending the 4GB limit for physical memory. The internal addressing limitations in the Linux kernel remained, however. Happily for users of large systems, Linus can acknowledge an error and change his mind; he did eventually allow large memory support into the 2.3 kernel. That support came with its own costs and limitations, however.
On 32-bit systems, memory is now divided into "high" and "low" memory. Low memory continues to be mapped directly into the kernel's address space, and is thus always reachable via a kernel-space pointer. High memory, instead, has no direct kernel mapping. When the kernel needs to work with a page in high memory, it must explicitly set up a special page table to map it into the kernel's address space first. This operation can be expensive, and there are limits on the number of high-memory pages which can be mapped at any particular time.
For the most part, the kernel's own data structures must live in low memory. Memory which is not permanently mapped cannot appear in linked lists (because its virtual address is transient and variable), and the performance costs of mapping and unmapping kernel memory are too high. High memory is useful for process pages and some kernel tasks (I/O buffers, for example), but the core of the kernel stays in low memory.
Some 32-bit processors can now address 64GB of physical memory, but the Linux kernel is still not able to deal effectively with that much; the current limit is around 8GB to 16GB, depending on the load. The problem now is that larger systems simply run out of low memory. As the system gets larger, it requires more kernel data structures to manage, and eventually room for those structures can run out. On a very large system, the system memory map (an array of struct page structures which represents physical memory) alone can occupy half of the available low memory.
There are users out there wanting to scale 32-bit Linux systems up to 32GB or more of main memory, so the enterprise-oriented Linux distributors have been scrambling to make that possible. One approach is the 4G/4G patch written by Ingo Molnar. This patch separates the kernel and user address spaces, allowing user processes to have 4GB of virtual memory while simultaneously expanding the kernel's low memory to 4GB. There is a cost, however: the translation buffer is no longer shared and must be flushed for every transition between kernel and user space. Estimates of the magnitude of the performance hit vary greatly, but numbers as high as 30% have been thrown around. This option makes some systems work, however, so Red Hat ships a 4G/4G kernel with its enterprise offerings.
The 4G/4G patch extends the capabilities of the Linux kernel, but it remains unpopular. It is widely seen as an ugly solution, and nobody likes the performance cost. So there are efforts afoot to extend the scalability of the Linux kernel via other means. Some of these efforts will likely go forward - in 2.6, even - but the kernel developers seem increasingly unwilling to distort the kernel's memory management systems to meet the needs of a small number of users who are trying to stretch 32-bit systems far beyond where they should go. There will come a time where they will all answer as Linus did back in 1999: go get a 64-bit system.