Rock2012’s Blog

May 3, 2009

Memory Anatomy

Filed under: Programming - rock2012 @ 11:27 AM

Memory management is the heart of operating systems; it is crucial for both programming and system administration. In the next few posts I’ll cover memory with an eye towards practical aspects, but without shying away from internals. While the concepts are generic, examples are mostly from Linux and Windows on 32-bit x86. This first post describes how programs are laid out in memory.

Each process in a multi-tasking OS runs in its own memory sandbox. This sandbox is the virtual address space, which in 32-bit mode is always a 4GB block of memory addresses. These virtual addresses are mapped to physical memory by page tables, which are maintained by the operating system kernel and consulted by the processor. Each process has its own set of page tables, but there is a catch. Once virtual addresses are enabled, they apply to all software running on the machine, including the kernel itself. Thus a portion of the virtual address space must be reserved for the kernel:

[Figure: kernel and user portions of the virtual address space]
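To make the page-table idea a little more concrete, here is a minimal sketch of how a 32-bit virtual address can be carved into lookup indices. It assumes classic non-PAE x86 paging with 4 KB pages (10 bits of page directory index, 10 bits of page table index, 12 bits of offset) and the default Linux arrangement where kernel space starts at the 3GB mark. Neither assumption is stated in the text above, and the function names are made up for illustration.

```python
# Sketch only: assumes non-PAE 32-bit x86 paging with 4 KB pages and a
# 3 GB / 1 GB user/kernel split. Other configurations use other numbers.
PAGE_SHIFT = 12              # 4 KB pages: the low 12 bits are the page offset
KERNEL_BASE = 0xC0000000     # assumed start of kernel space (the 3 GB mark)


def split_virtual_address(vaddr):
    """Return (page directory index, page table index, page offset)."""
    if not 0 <= vaddr <= 0xFFFFFFFF:
        raise ValueError("not a 32-bit address")
    directory_index = vaddr >> 22                  # top 10 bits
    table_index = (vaddr >> PAGE_SHIFT) & 0x3FF    # middle 10 bits
    offset = vaddr & 0xFFF                         # low 12 bits
    return directory_index, table_index, offset


def is_kernel_address(vaddr, kernel_base=KERNEL_BASE):
    """True if the address lies in the range reserved for the kernel."""
    return vaddr >= kernel_base


if __name__ == "__main__":
    for addr in (0x08048000, 0xBFFFF000, 0xC0100000):
        print(hex(addr), split_virtual_address(addr), is_kernel_address(addr))
```

The processor performs this split in hardware on every memory access; the sketch only mirrors the arithmetic. Under these assumptions, every address at or above KERNEL_BASE belongs to the reserved kernel portion.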

This does not mean the kernel uses that much physical memory, only that it has that portion of address space available to map whatever physical memory it wishes. Kernel space is flagged in the page tables as exclusive to privileged code (ring 2 or lower), hence a page fault is triggered if user-mode programs try to touch it. In Linux, kernel space is constantly present and mapped to the same physical memory in all processes. Kernel code and data are always addressable, ready to handle interrupts or system calls at any time. By contrast, the mapping for the user-mode portion of the address space changes whenever a process switch happens:

[Figure: user-mode mappings changing across a process switch]

Blue regions represent virtual addresses that are mapped to physical memory, whereas white regions are unmapped. In the example above, Firefox has used far more of its virtual address space due to its legendary memory hunger. The distinct bands in the address space correspond to memory segments like the heap, stack, and so on. Keep in mind these segments are simply a range of memory addresses and have nothing to do with Intel-style segments. Anyway, here is the standard segment layout in a Linux process:

[Figure: standard segment layout of a Linux process]

When computing was happy and safe and cuddly, the starting virtual addresses for the segments shown above were exactly the same for nearly every process in a machine. This made it easy to exploit security vulnerabilities remotely. An exploit often needs to reference absolute memory locations: an address on the stack, the address for a library function, etc. Remote attackers must choose this location blindly, counting on the fact that address spaces are all the same. When they are, people get pwned. Thus address space randomization has become popular. Linux does this for the stack in randomize_stack_top(), while the start of the memory mapping segment is shuffled around by mmap_base(). Unfortunately the 32-bit address space is pretty tight, leaving little room for randomization and hampering its effectiveness.
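One rough way to see the randomization at work is to start a couple of fresh processes and compare where their stacks landed. The sketch below does that by having a child Python interpreter read its own /proc/self/maps and report the line tagged [stack]. The helper name and the child script are my own, and it assumes a Linux system with /proc mounted; elsewhere it simply returns None.

```python
import subprocess
import sys

# Script run by the child: print the address range of its own stack area.
CHILD_CODE = (
    "import sys\n"
    "try:\n"
    "    lines = open('/proc/self/maps').read().splitlines()\n"
    "except OSError:\n"
    "    sys.exit(0)\n"
    "for line in lines:\n"
    "    if line.endswith('[stack]'):\n"
    "        print(line.split()[0])\n"
)


def child_stack_range():
    """Start a fresh interpreter and return its stack range, or None."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout.strip() or None


if __name__ == "__main__":
    # With randomization enabled the two ranges will usually differ.
    print(child_stack_range())
    print(child_stack_range())
```

If both runs print the same range, randomization is probably switched off on that machine.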

The topmost segment in the process address space is the stack, which stores local variables and function parameters in most programming languages. Calling a method or function pushes a new stack frame onto the stack. The stack frame is destroyed when the function returns. This simple design, possible because the data obeys strict LIFO order, means that no complex data structure is needed to track stack contents; a simple pointer to the top of the stack will do. Pushing and popping are thus very fast and deterministic. Also, the constant reuse of stack regions tends to keep active stack memory in the CPU caches, speeding up access. Each thread in a process gets its own stack.

It is possible to exhaust the area mapping the stack by pushing more data than it can fit. This triggers a page fault that is handled in Linux by expand_stack(), which in turn calls acct_stack_growth() to check whether it’s appropriate to grow the stack. If the stack size is below RLIMIT_STACK (usually 8MB), then normally the stack grows and the program continues merrily, unaware of what just happened. This is the normal mechanism whereby stack size adjusts to demand. However, if the maximum stack size has been reached, we have a stack overflow and the program receives a Segmentation Fault. While the mapped stack area expands to meet demand, it does not shrink back when the stack gets smaller. Like the federal budget, it only expands.
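If you want to see the limit on your own machine, the sketch below reads RLIMIT_STACK through Python's resource module (Unix only) and pairs it with a toy version of the size check. The check is a deliberately simplified model of the decision described above, not the kernel's actual code, and both function names are invented for this example.

```python
import resource


def stack_limit_bytes():
    """Soft RLIMIT_STACK in bytes, or None when the limit is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    return None if soft == resource.RLIM_INFINITY else soft


def may_grow_stack(new_size, limit):
    """Toy model: growth is allowed while the new size stays within the limit."""
    return limit is None or new_size <= limit


if __name__ == "__main__":
    limit = stack_limit_bytes()
    print("stack limit:", limit)
    print("4 MB stack allowed:", may_grow_stack(4 * 2**20, limit))
```

The real check looks at more than the size limit, so treat this as a picture of the idea rather than of the implementation.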

Dynamic stack growth is the only situation in which access to an unmapped memory region, shown in white above, might be valid. Any other access to unmapped memory triggers a page fault that results in a Segmentation Fault. Some mapped areas are read-only, hence write attempts to these areas also lead to segfaults.

Below the stack, we have the memory mapping segment. Here the kernel maps contents of files directly to memory. Any application can ask for such a mapping via the Linux mmap() system call or CreateFileMapping() / MapViewOfFile() in Windows. Memory mapping is a convenient and high-performance way to do file I/O, so it is used for loading dynamic libraries. It is also possible to create an anonymous memory mapping that does not correspond to any files, being used instead for program data. In Linux, if you request a large block of memory via malloc(), the C library will create such an anonymous mapping instead of using heap memory. ‘Large’ means larger than MMAP_THRESHOLD bytes, 128 kB by default and adjustable via mallopt().
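Both kinds of mapping are easy to try from a scripting language. The sketch below uses Python's mmap module, which wraps mmap() on Linux and the file-mapping calls on Windows: one helper reads a file through a mapping, the other stores data in an anonymous mapping. The helper names and the scratch file are just scaffolding for the example.

```python
import mmap
import tempfile


def read_via_file_mapping(data):
    """Write data to a scratch file, then read it back through a file mapping."""
    with tempfile.TemporaryFile() as scratch:
        scratch.write(data)
        scratch.flush()
        # Length 0 maps the whole file; ACCESS_READ makes the view read-only.
        with mmap.mmap(scratch.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return bytes(view[:])


def anonymous_roundtrip(payload):
    """Store payload in an anonymous mapping (no backing file) and read it back."""
    length = max(len(payload), mmap.PAGESIZE)
    with mmap.mmap(-1, length) as region:   # fileno -1 requests anonymous memory
        region[:len(payload)] = payload
        return bytes(region[:len(payload)])


if __name__ == "__main__":
    print(read_via_file_mapping(b"hello mapping"))
    print(anonymous_roundtrip(b"program data"))
```

Note that the file-backed view is read like an ordinary byte buffer; no explicit read() call is involved once the mapping exists.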

Speaking of the heap, it comes next in our plunge into address space. Like the stack, the heap provides runtime memory allocation; unlike the stack, it is meant for data that must outlive the function doing the allocation. Most languages provide heap management to programs. Satisfying memory requests is thus a joint affair between the language runtime and the kernel. In C, the interface to heap allocation is malloc() and friends, whereas in a garbage-collected language like C# the interface is the new keyword.

If there is enough space in the heap to satisfy a memory request, it can be handled by the language runtime without kernel involvement. Otherwise the heap is enlarged via the brk() system call to make room for the requested block. Heap management is complex, requiring sophisticated algorithms that strive for speed and efficient memory usage in the face of our programs’ chaotic allocation patterns. The time needed to service a heap request can vary substantially. Real-time systems have special-purpose allocators to deal with this problem. Heaps also become fragmented, as shown below:

[Figure: a fragmented heap]
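Fragmentation is easier to appreciate with a toy model. The sketch below is a deliberately naive first-fit allocator over a fixed arena; it is not how malloc() works internally, and the ToyHeap class is purely hypothetical. It shows how a heap can have plenty of free bytes in total yet fail a request because no single free run is large enough.

```python
class ToyHeap:
    """First-fit allocator over a fixed-size arena (illustration only)."""

    def __init__(self, size):
        self.free = [(0, size)]   # free extents as (offset, length), sorted
        self.used = {}            # allocated blocks: offset -> length

    def malloc(self, length):
        """Return the offset of a new block, or None if no free run fits."""
        for i, (offset, free_len) in enumerate(self.free):
            if free_len >= length:
                if free_len == length:
                    del self.free[i]
                else:
                    self.free[i] = (offset + length, free_len - length)
                self.used[offset] = length
                return offset
        return None   # a real allocator would now ask the kernel for more

    def free_block(self, offset):
        """Release a block and merge it with adjacent free extents."""
        length = self.used.pop(offset)
        self.free.append((offset, length))
        self.free.sort()
        merged = [self.free[0]]
        for off, ln in self.free[1:]:
            last_off, last_len = merged[-1]
            if last_off + last_len == off:
                merged[-1] = (last_off, last_len + ln)
            else:
                merged.append((off, ln))
        self.free = merged

    def total_free(self):
        return sum(length for _, length in self.free)


if __name__ == "__main__":
    heap = ToyHeap(100)
    blocks = [heap.malloc(10) for _ in range(10)]
    for offset in blocks[::2]:          # free every other block
        heap.free_block(offset)
    print("free bytes:", heap.total_free())
    print("malloc(20):", heap.malloc(20))
```

In the demo half of the arena is free, but it is scattered in 10-byte holes, so the 20-byte request cannot be placed until a neighbouring block is released and the holes merge.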

Finally, we get to the lowest segments of memory: BSS, data, and program text. Both BSS and data store contents for static (global) variables in C. The difference is that BSS stores the contents of uninitialized static variables, whose values are not set by the programmer in source code. The BSS memory area is anonymous: it does not map any file. If you say static int cntActiveUsers, the contents of cntActiveUsers live in the BSS.

The data segment, on the other hand, holds the contents for static variables initialized in source code. This memory area is not anonymous. It maps the part of the program’s binary image that contains the initial static values given in source code. So if you say static int cntWorkerBees = 10, the contents of cntWorkerBees live in the data segment and start out as 10. Even though the data segment maps a file, it is a private memory mapping, which means that updates to memory are not reflected in the underlying file. This must be the case, otherwise assignments to global variables would change your on-disk binary image. Inconceivable!
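The private-mapping behaviour can be reproduced in miniature. In the sketch below a scratch file stands in for the binary image, and Python's mmap.ACCESS_COPY gives a copy-on-write mapping of it, which is my stand-in for the private mapping the loader sets up. The function name and the file contents are invented for the example.

```python
import mmap
import tempfile


def private_mapping_demo(initial=b"cntWorkerBees=10"):
    """Modify a copy-on-write mapping and return (in_memory, on_disk) bytes."""
    with tempfile.TemporaryFile() as image:
        image.write(initial)
        image.flush()
        # ACCESS_COPY: writes change memory only, never the underlying file.
        with mmap.mmap(image.fileno(), 0, access=mmap.ACCESS_COPY) as view:
            size = len(view)
            view[size - 2:size] = b"99"      # "assign" a new value in memory
            in_memory = bytes(view[:])
        image.seek(0)
        on_disk = image.read()
    return in_memory, on_disk


if __name__ == "__main__":
    print(private_mapping_demo())
```

The in-memory copy sees the new value while the file keeps the original bytes, which is exactly why assigning to a global does not rewrite your executable.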

The data example in the diagram is trickier because it uses a pointer. In that case, the contents of pointer gonzo – a 4-byte memory address – live in the data segment. The actual string it points to does not, however. The string lives in the text segment, which is read-only and stores all of your code in addition to tidbits like string literals. The text segment also maps your binary file in memory, but writes to this area earn your program a Segmentation Fault. This helps prevent pointer bugs, though not as effectively as avoiding C in the first place. Here’s a diagram showing these segments and our example variables:

[Figure: BSS, data, and text segments with the example variables]

You can examine the memory areas in a Linux process by reading the file /proc/pid_of_process/maps. Keep in mind that a segment may contain many areas. For example, each memory mapped file normally has its own area in the mmap segment, and dynamic libraries have extra areas similar to BSS and data. The next post will clarify what ‘area’ really means. Also, sometimes people say “data segment” meaning all of data + bss + heap.
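The maps file is plain text, one area per line, so it is simple to pick apart. Here is a small sketch of a parser; the field order (address range, permissions, offset, device, inode, optional path) follows the usual /proc maps format, while the Area tuple and the function name are just names chosen for this example.

```python
import os
from collections import namedtuple

Area = namedtuple("Area", "start end perms offset dev inode path")


def parse_maps_line(line):
    """Parse one line of /proc/<pid>/maps into an Area tuple."""
    fields = line.split(None, 5)
    addr_range, perms, offset, dev, inode = fields[:5]
    # The path column is absent for anonymous areas.
    path = fields[5].strip() if len(fields) > 5 else ""
    start, end = (int(part, 16) for part in addr_range.split("-"))
    return Area(start, end, perms, int(offset, 16), dev, int(inode), path)


if __name__ == "__main__":
    # Only works where /proc is available (Linux).
    if os.path.exists("/proc/self/maps"):
        with open("/proc/self/maps") as maps:
            for line in maps:
                area = parse_maps_line(line)
                print(hex(area.start), hex(area.end), area.perms, area.path)
```

Run it against its own process and you should spot areas for the interpreter binary, the heap, mapped libraries, and the stack.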

You can examine binary images using the nm and objdump commands to display symbols, their addresses, segments, and so on. Finally, the virtual address layout described above is the “flexible” layout in Linux, which has been the default for a few years. It assumes that we have a value for RLIMIT_STACK. When that’s not the case, Linux reverts to the “classic” layout shown below:

[Figure: the classic Linux address space layout]

That’s it for virtual address space layout. The next post discusses how the kernel keeps track of these memory areas. Coming up, we’ll look at memory mapping, how file reading and writing tie into all this, and what memory usage figures mean.
