diff options
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/cgroups/cgroups.txt (renamed from Documentation/cgroups.txt) | 0 | ||||
-rw-r--r-- | Documentation/cgroups/freezer-subsystem.txt | 99 | ||||
-rw-r--r-- | Documentation/controllers/memory.txt | 24 | ||||
-rw-r--r-- | Documentation/cpusets.txt | 2 | ||||
-rw-r--r-- | Documentation/filesystems/ext3.txt | 5 | ||||
-rw-r--r-- | Documentation/filesystems/proc.txt | 28 | ||||
-rw-r--r-- | Documentation/kernel-parameters.txt | 2 | ||||
-rw-r--r-- | Documentation/mtd/nand_ecc.txt | 714 | ||||
-rw-r--r-- | Documentation/sysrq.txt | 3 | ||||
-rw-r--r-- | Documentation/vm/unevictable-lru.txt | 615 |
10 files changed, 1471 insertions, 21 deletions
diff --git a/Documentation/cgroups.txt b/Documentation/cgroups/cgroups.txt index d9014aa0eb68..d9014aa0eb68 100644 --- a/Documentation/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt diff --git a/Documentation/cgroups/freezer-subsystem.txt b/Documentation/cgroups/freezer-subsystem.txt new file mode 100644 index 000000000000..c50ab58b72eb --- /dev/null +++ b/Documentation/cgroups/freezer-subsystem.txt @@ -0,0 +1,99 @@ + The cgroup freezer is useful to batch job management system which start +and stop sets of tasks in order to schedule the resources of a machine +according to the desires of a system administrator. This sort of program +is often used on HPC clusters to schedule access to the cluster as a +whole. The cgroup freezer uses cgroups to describe the set of tasks to +be started/stopped by the batch job management system. It also provides +a means to start and stop the tasks composing the job. + + The cgroup freezer will also be useful for checkpointing running groups +of tasks. The freezer allows the checkpoint code to obtain a consistent +image of the tasks by attempting to force the tasks in a cgroup into a +quiescent state. Once the tasks are quiescent another task can +walk /proc or invoke a kernel interface to gather information about the +quiesced tasks. Checkpointed tasks can be restarted later should a +recoverable error occur. This also allows the checkpointed tasks to be +migrated between nodes in a cluster by copying the gathered information +to another node and restarting the tasks there. + + Sequences of SIGSTOP and SIGCONT are not always sufficient for stopping +and resuming tasks in userspace. Both of these signals are observable +from within the tasks we wish to freeze. While SIGSTOP cannot be caught, +blocked, or ignored it can be seen by waiting or ptracing parent tasks. +SIGCONT is especially unsuitable since it can be caught by the task. Any +programs designed to watch for SIGSTOP and SIGCONT could be broken by +attempting to use SIGSTOP and SIGCONT to stop and resume tasks. We can +demonstrate this problem using nested bash shells: + + $ echo $$ + 16644 + $ bash + $ echo $$ + 16690 + + From a second, unrelated bash shell: + $ kill -SIGSTOP 16690 + $ kill -SIGCONT 16990 + + <at this point 16990 exits and causes 16644 to exit too> + + This happens because bash can observe both signals and choose how it +responds to them. + + Another example of a program which catches and responds to these +signals is gdb. In fact any program designed to use ptrace is likely to +have a problem with this method of stopping and resuming tasks. + + In contrast, the cgroup freezer uses the kernel freezer code to +prevent the freeze/unfreeze cycle from becoming visible to the tasks +being frozen. This allows the bash example above and gdb to run as +expected. + + The freezer subsystem in the container filesystem defines a file named +freezer.state. Writing "FROZEN" to the state file will freeze all tasks in the +cgroup. Subsequently writing "THAWED" will unfreeze the tasks in the cgroup. +Reading will return the current state. + +* Examples of usage : + + # mkdir /containers/freezer + # mount -t cgroup -ofreezer freezer /containers + # mkdir /containers/0 + # echo $some_pid > /containers/0/tasks + +to get status of the freezer subsystem : + + # cat /containers/0/freezer.state + THAWED + +to freeze all tasks in the container : + + # echo FROZEN > /containers/0/freezer.state + # cat /containers/0/freezer.state + FREEZING + # cat /containers/0/freezer.state + FROZEN + +to unfreeze all tasks in the container : + + # echo THAWED > /containers/0/freezer.state + # cat /containers/0/freezer.state + THAWED + +This is the basic mechanism which should do the right thing for user space task +in a simple scenario. + +It's important to note that freezing can be incomplete. In that case we return +EBUSY. This means that some tasks in the cgroup are busy doing something that +prevents us from completely freezing the cgroup at this time. After EBUSY, +the cgroup will remain partially frozen -- reflected by freezer.state reporting +"FREEZING" when read. The state will remain "FREEZING" until one of these +things happens: + + 1) Userspace cancels the freezing operation by writing "THAWED" to + the freezer.state file + 2) Userspace retries the freezing operation by writing "FROZEN" to + the freezer.state file (writing "FREEZING" is not legal + and returns EIO) + 3) The tasks that blocked the cgroup from entering the "FROZEN" + state disappear from the cgroup's set of tasks. diff --git a/Documentation/controllers/memory.txt b/Documentation/controllers/memory.txt index 9b53d5827361..1c07547d3f81 100644 --- a/Documentation/controllers/memory.txt +++ b/Documentation/controllers/memory.txt @@ -112,14 +112,22 @@ the per cgroup LRU. 2.2.1 Accounting details -All mapped pages (RSS) and unmapped user pages (Page Cache) are accounted. -RSS pages are accounted at the time of page_add_*_rmap() unless they've already -been accounted for earlier. A file page will be accounted for as Page Cache; -it's mapped into the page tables of a process, duplicate accounting is carefully -avoided. Page Cache pages are accounted at the time of add_to_page_cache(). -The corresponding routines that remove a page from the page tables or removes -a page from Page Cache is used to decrement the accounting counters of the -cgroup. +All mapped anon pages (RSS) and cache pages (Page Cache) are accounted. +(some pages which never be reclaimable and will not be on global LRU + are not accounted. we just accounts pages under usual vm management.) + +RSS pages are accounted at page_fault unless they've already been accounted +for earlier. A file page will be accounted for as Page Cache when it's +inserted into inode (radix-tree). While it's mapped into the page tables of +processes, duplicate accounting is carefully avoided. + +A RSS page is unaccounted when it's fully unmapped. A PageCache page is +unaccounted when it's removed from radix-tree. + +At page migration, accounting information is kept. + +Note: we just account pages-on-lru because our purpose is to control amount +of used pages. not-on-lru pages are tend to be out-of-control from vm view. 2.3 Shared Page Accounting diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt index 47e568a9370a..5c86c258c791 100644 --- a/Documentation/cpusets.txt +++ b/Documentation/cpusets.txt @@ -48,7 +48,7 @@ hooks, beyond what is already present, required to manage dynamic job placement on large systems. Cpusets use the generic cgroup subsystem described in -Documentation/cgroup.txt. +Documentation/cgroups/cgroups.txt. Requests by a task, using the sched_setaffinity(2) system call to include CPUs in its CPU affinity mask, and using the mbind(2) and diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt index 295f26cd895a..9dd2a3bb2acc 100644 --- a/Documentation/filesystems/ext3.txt +++ b/Documentation/filesystems/ext3.txt @@ -96,6 +96,11 @@ errors=remount-ro(*) Remount the filesystem read-only on an error. errors=continue Keep going on a filesystem error. errors=panic Panic and halt the machine if an error occurs. +data_err=ignore(*) Just print an error message if an error occurs + in a file data buffer in ordered mode. +data_err=abort Abort the journal if an error occurs in a file + data buffer in ordered mode. + grpid Give objects the same group ID as their creator. bsdgroups diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index c032bf39e8b9..bcceb99b81dd 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -1384,15 +1384,18 @@ causes the kernel to prefer to reclaim dentries and inodes. dirty_background_ratio ---------------------- -Contains, as a percentage of total system memory, the number of pages at which -the pdflush background writeback daemon will start writing out dirty data. +Contains, as a percentage of the dirtyable system memory (free pages + mapped +pages + file cache, not including locked pages and HugePages), the number of +pages at which the pdflush background writeback daemon will start writing out +dirty data. dirty_ratio ----------------- -Contains, as a percentage of total system memory, the number of pages at which -a process which is generating disk writes will itself start writing out dirty -data. +Contains, as a percentage of the dirtyable system memory (free pages + mapped +pages + file cache, not including locked pages and HugePages), the number of +pages at which a process which is generating disk writes will itself start +writing out dirty data. dirty_writeback_centisecs ------------------------- @@ -2412,24 +2415,29 @@ will be dumped when the <pid> process is dumped. coredump_filter is a bitmask of memory types. If a bit of the bitmask is set, memory segments of the corresponding memory type are dumped, otherwise they are not dumped. -The following 4 memory types are supported: +The following 7 memory types are supported: - (bit 0) anonymous private memory - (bit 1) anonymous shared memory - (bit 2) file-backed private memory - (bit 3) file-backed shared memory - (bit 4) ELF header pages in file-backed private memory areas (it is effective only if the bit 2 is cleared) + - (bit 5) hugetlb private memory + - (bit 6) hugetlb shared memory Note that MMIO pages such as frame buffer are never dumped and vDSO pages are always dumped regardless of the bitmask status. -Default value of coredump_filter is 0x3; this means all anonymous memory -segments are dumped. + Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only + effected by bit 5-6. + +Default value of coredump_filter is 0x23; this means all anonymous memory +segments and hugetlb private memory are dumped. If you don't want to dump all shared memory segments attached to pid 1234, -write 1 to the process's proc file. +write 0x21 to the process's proc file. - $ echo 0x1 > /proc/1234/coredump_filter + $ echo 0x21 > /proc/1234/coredump_filter When a new process is created, the process inherits the bitmask status from its parent. It is useful to set up coredump_filter before the program runs. diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index bcecfaa1e770..0f1544f67400 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -690,7 +690,7 @@ and is between 256 and 4096 characters. It is defined in the file See Documentation/block/as-iosched.txt and Documentation/block/deadline-iosched.txt for details. - elfcorehdr= [X86-32, X86_64] + elfcorehdr= [IA64,PPC,SH,X86-32,X86_64] Specifies physical address of start of kernel core image elf header. Generally kexec loader will pass this option to capture kernel. diff --git a/Documentation/mtd/nand_ecc.txt b/Documentation/mtd/nand_ecc.txt new file mode 100644 index 000000000000..bdf93b7f0f24 --- /dev/null +++ b/Documentation/mtd/nand_ecc.txt @@ -0,0 +1,714 @@ +Introduction +============ + +Having looked at the linux mtd/nand driver and more specific at nand_ecc.c +I felt there was room for optimisation. I bashed the code for a few hours +performing tricks like table lookup removing superfluous code etc. +After that the speed was increased by 35-40%. +Still I was not too happy as I felt there was additional room for improvement. + +Bad! I was hooked. +I decided to annotate my steps in this file. Perhaps it is useful to someone +or someone learns something from it. + + +The problem +=========== + +NAND flash (at least SLC one) typically has sectors of 256 bytes. +However NAND flash is not extremely reliable so some error detection +(and sometimes correction) is needed. + +This is done by means of a Hamming code. I'll try to explain it in +laymans terms (and apologies to all the pro's in the field in case I do +not use the right terminology, my coding theory class was almost 30 +years ago, and I must admit it was not one of my favourites). + +As I said before the ecc calculation is performed on sectors of 256 +bytes. This is done by calculating several parity bits over the rows and +columns. The parity used is even parity which means that the parity bit = 1 +if the data over which the parity is calculated is 1 and the parity bit = 0 +if the data over which the parity is calculated is 0. So the total +number of bits over the data over which the parity is calculated + the +parity bit is even. (see wikipedia if you can't follow this). +Parity is often calculated by means of an exclusive or operation, +sometimes also referred to as xor. In C the operator for xor is ^ + +Back to ecc. +Let's give a small figure: + +byte 0: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp4 ... rp14 +byte 1: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp2 rp4 ... rp14 +byte 2: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp4 ... rp14 +byte 3: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp4 ... rp14 +byte 4: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp5 ... rp14 +.... +byte 254: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp5 ... rp15 +byte 255: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp5 ... rp15 + cp1 cp0 cp1 cp0 cp1 cp0 cp1 cp0 + cp3 cp3 cp2 cp2 cp3 cp3 cp2 cp2 + cp5 cp5 cp5 cp5 cp4 cp4 cp4 cp4 + +This figure represents a sector of 256 bytes. +cp is my abbreviaton for column parity, rp for row parity. + +Let's start to explain column parity. +cp0 is the parity that belongs to all bit0, bit2, bit4, bit6. +so the sum of all bit0, bit2, bit4 and bit6 values + cp0 itself is even. +Similarly cp1 is the sum of all bit1, bit3, bit5 and bit7. +cp2 is the parity over bit0, bit1, bit4 and bit5 +cp3 is the parity over bit2, bit3, bit6 and bit7. +cp4 is the parity over bit0, bit1, bit2 and bit3. +cp5 is the parity over bit4, bit5, bit6 and bit7. +Note that each of cp0 .. cp5 is exactly one bit. + +Row parity actually works almost the same. +rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254) +rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255) +rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ... +(so handle two bytes, then skip 2 bytes). +rp3 is covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...) +for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc. +so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ...) +and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, .. +The story now becomes quite boring. I guess you get the idea. +rp6 covers 8 bytes then skips 8 etc +rp7 skips 8 bytes then covers 8 etc +rp8 covers 16 bytes then skips 16 etc +rp9 skips 16 bytes then covers 16 etc +rp10 covers 32 bytes then skips 32 etc +rp11 skips 32 bytes then covers 32 etc +rp12 covers 64 bytes then skips 64 etc +rp13 skips 64 bytes then covers 64 etc +rp14 covers 128 bytes then skips 128 +rp15 skips 128 bytes then covers 128 + +In the end the parity bits are grouped together in three bytes as +follows: +ECC Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0 +ECC 0 rp07 rp06 rp05 rp04 rp03 rp02 rp01 rp00 +ECC 1 rp15 rp14 rp13 rp12 rp11 rp10 rp09 rp08 +ECC 2 cp5 cp4 cp3 cp2 cp1 cp0 1 1 + +I detected after writing this that ST application note AN1823 +(http://www.st.com/stonline/books/pdf/docs/10123.pdf) gives a much +nicer picture.(but they use line parity as term where I use row parity) +Oh well, I'm graphically challenged, so suffer with me for a moment :-) +And I could not reuse the ST picture anyway for copyright reasons. + + +Attempt 0 +========= + +Implementing the parity calculation is pretty simple. +In C pseudocode: +for (i = 0; i < 256; i++) +{ + if (i & 0x01) + rp1 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1; + else + rp0 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1; + if (i & 0x02) + rp3 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp3; + else + rp2 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp2; + if (i & 0x04) + rp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp5; + else + rp4 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp4; + if (i & 0x08) + rp7 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp7; + else + rp6 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp6; + if (i & 0x10) + rp9 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp9; + else + rp8 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp8; + if (i & 0x20) + rp11 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp11; + else + rp10 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp10; + if (i & 0x40) + rp13 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp13; + else + rp12 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp12; + if (i & 0x80) + rp15 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp15; + else + rp14 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp14; + cp0 = bit6 ^ bit4 ^ bit2 ^ bit0 ^ cp0; + cp1 = bit7 ^ bit5 ^ bit3 ^ bit1 ^ cp1; + cp2 = bit5 ^ bit4 ^ bit1 ^ bit0 ^ cp2; + cp3 = bit7 ^ bit6 ^ bit3 ^ bit2 ^ cp3 + cp4 = bit3 ^ bit2 ^ bit1 ^ bit0 ^ cp4 + cp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ cp5 +} + + +Analysis 0 +========== + +C does have bitwise operators but not really operators to do the above +efficiently (and most hardware has no such instructions either). +Therefore without implementing this it was clear that the code above was +not going to bring me a Nobel prize :-) + +Fortunately the exclusive or operation is commutative, so we can combine +the values in any order. So instead of calculating all the bits +individually, let us try to rearrange things. +For the column parity this is easy. We can just xor the bytes and in the +end filter out the relevant bits. This is pretty nice as it will bring +all cp calculation out of the if loop. + +Similarly we can first xor the bytes for the various rows. +This leads to: + + +Attempt 1 +========= + +const char parity[256] = { + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0 +}; + +void ecc1(const unsigned char *buf, unsigned char *code) +{ + int i; + const unsigned char *bp = buf; + unsigned char cur; + unsigned char rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; + unsigned char rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; + unsigned char par; + + par = 0; + rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; + rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; + rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; + rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; + + for (i = 0; i < 256; i++) + { + cur = *bp++; + par ^= cur; + if (i & 0x01) rp1 ^= cur; else rp0 ^= cur; + if (i & 0x02) rp3 ^= cur; else rp2 ^= cur; + if (i & 0x04) rp5 ^= cur; else rp4 ^= cur; + if (i & 0x08) rp7 ^= cur; else rp6 ^= cur; + if (i & 0x10) rp9 ^= cur; else rp8 ^= cur; + if (i & 0x20) rp11 ^= cur; else rp10 ^= cur; + if (i & 0x40) rp13 ^= cur; else rp12 ^= cur; + if (i & 0x80) rp15 ^= cur; else rp14 ^= cur; + } + code[0] = + (parity[rp7] << 7) | + (parity[rp6] << 6) | + (parity[rp5] << 5) | + (parity[rp4] << 4) | + (parity[rp3] << 3) | + (parity[rp2] << 2) | + (parity[rp1] << 1) | + (parity[rp0]); + code[1] = + (parity[rp15] << 7) | + (parity[rp14] << 6) | + (parity[rp13] << 5) | + (parity[rp12] << 4) | + (parity[rp11] << 3) | + (parity[rp10] << 2) | + (parity[rp9] << 1) | + (parity[rp8]); + code[2] = + (parity[par & 0xf0] << 7) | + (parity[par & 0x0f] << 6) | + (parity[par & 0xcc] << 5) | + (parity[par & 0x33] << 4) | + (parity[par & 0xaa] << 3) | + (parity[par & 0x55] << 2); + code[0] = ~code[0]; + code[1] = ~code[1]; + code[2] = ~code[2]; +} + +Still pretty straightforward. The last three invert statements are there to +give a checksum of 0xff 0xff 0xff for an empty flash. In an empty flash +all data is 0xff, so the checksum then matches. + +I also introduced the parity lookup. I expected this to be the fastest +way to calculate the parity, but I will investigate alternatives later +on. + + +Analysis 1 +========== + +The code works, but is not terribly efficient. On my system it took +almost 4 times as much time as the linux driver code. But hey, if it was +*that* easy this would have been done long before. +No pain. no gain. + +Fortunately there is plenty of room for improvement. + +In step 1 we moved from bit-wise calculation to byte-wise calculation. +However in C we can also use the unsigned long data type and virtually +every modern microprocessor supports 32 bit operations, so why not try +to write our code in such a way that we process data in 32 bit chunks. + +Of course this means some modification as the row parity is byte by +byte. A quick analysis: +for the column parity we use the par variable. When extending to 32 bits +we can in the end easily calculate p0 and p1 from it. +(because par now consists of 4 bytes, contributing to rp1, rp0, rp1, rp0 +respectively) +also rp2 and rp3 can be easily retrieved from par as rp3 covers the +first two bytes and rp2 the last two bytes. + +Note that of course now the loop is executed only 64 times (256/4). +And note that care must taken wrt byte ordering. The way bytes are +ordered in a long is machine dependent, and might affect us. +Anyway, if there is an issue: this code is developed on x86 (to be +precise: a DELL PC with a D920 Intel CPU) + +And of course the performance might depend on alignment, but I expect +that the I/O buffers in the nand driver are aligned properly (and +otherwise that should be fixed to get maximum performance). + +Let's give it a try... + + +Attempt 2 +========= + +extern const char parity[256]; + +void ecc2(const unsigned char *buf, unsigned char *code) +{ + int i; + const unsigned long *bp = (unsigned long *)buf; + unsigned long cur; + unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; + unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; + unsigned long par; + + par = 0; + rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; + rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; + rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; + rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; + + for (i = 0; i < 64; i++) + { + cur = *bp++; + par ^= cur; + if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; + if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; + if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; + if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; + if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; + if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; + } + /* + we need to adapt the code generation for the fact that rp vars are now + long; also the column parity calculation needs to be changed. + we'll bring rp4 to 15 back to single byte entities by shifting and + xoring + */ + rp4 ^= (rp4 >> 16); rp4 ^= (rp4 >> 8); rp4 &= 0xff; + rp5 ^= (rp5 >> 16); rp5 ^= (rp5 >> 8); rp5 &= 0xff; + rp6 ^= (rp6 >> 16); rp6 ^= (rp6 >> 8); rp6 &= 0xff; + rp7 ^= (rp7 >> 16); rp7 ^= (rp7 >> 8); rp7 &= 0xff; + rp8 ^= (rp8 >> 16); rp8 ^= (rp8 >> 8); rp8 &= 0xff; + rp9 ^= (rp9 >> 16); rp9 ^= (rp9 >> 8); rp9 &= 0xff; + rp10 ^= (rp10 >> 16); rp10 ^= (rp10 >> 8); rp10 &= 0xff; + rp11 ^= (rp11 >> 16); rp11 ^= (rp11 >> 8); rp11 &= 0xff; + rp12 ^= (rp12 >> 16); rp12 ^= (rp12 >> 8); rp12 &= 0xff; + rp13 ^= (rp13 >> 16); rp13 ^= (rp13 >> 8); rp13 &= 0xff; + rp14 ^= (rp14 >> 16); rp14 ^= (rp14 >> 8); rp14 &= 0xff; + rp15 ^= (rp15 >> 16); rp15 ^= (rp15 >> 8); rp15 &= 0xff; + rp3 = (par >> 16); rp3 ^= (rp3 >> 8); rp3 &= 0xff; + rp2 = par & 0xffff; rp2 ^= (rp2 >> 8); rp2 &= 0xff; + par ^= (par >> 16); + rp1 = (par >> 8); rp1 &= 0xff; + rp0 = (par & 0xff); + par ^= (par >> 8); par &= 0xff; + + code[0] = + (parity[rp7] << 7) | + (parity[rp6] << 6) | + (parity[rp5] << 5) | + (parity[rp4] << 4) | + (parity[rp3] << 3) | + (parity[rp2] << 2) | + (parity[rp1] << 1) | + (parity[rp0]); + code[1] = + (parity[rp15] << 7) | + (parity[rp14] << 6) | + (parity[rp13] << 5) | + (parity[rp12] << 4) | + (parity[rp11] << 3) | + (parity[rp10] << 2) | + (parity[rp9] << 1) | + (parity[rp8]); + code[2] = + (parity[par & 0xf0] << 7) | + (parity[par & 0x0f] << 6) | + (parity[par & 0xcc] << 5) | + (parity[par & 0x33] << 4) | + (parity[par & 0xaa] << 3) | + (parity[par & 0x55] << 2); + code[0] = ~code[0]; + code[1] = ~code[1]; + code[2] = ~code[2]; +} + +The parity array is not shown any more. Note also that for these +examples I kinda deviated from my regular programming style by allowing +multiple statements on a line, not using { } in then and else blocks +with only a single statement and by using operators like ^= + + +Analysis 2 +========== + +The code (of course) works, and hurray: we are a little bit faster than +the linux driver code (about 15%). But wait, don't cheer too quickly. +THere is more to be gained. +If we look at e.g. rp14 and rp15 we see that we either xor our data with +rp14 or with rp15. However we also have par which goes over all data. +This means there is no need to calculate rp14 as it can be calculated from +rp15 through rp14 = par ^ rp15; +(or if desired we can avoid calculating rp15 and calculate it from +rp14). That is why some places refer to inverse parity. +Of course the same thing holds for rp4/5, rp6/7, rp8/9, rp10/11 and rp12/13. +Effectively this means we can eliminate the else clause from the if +statements. Also we can optimise the calculation in the end a little bit +by going from long to byte first. Actually we can even avoid the table +lookups + +Attempt 3 +========= + +Odd replaced: + if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; + if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; + if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; + if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; + if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; + if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; +with + if (i & 0x01) rp5 ^= cur; + if (i & 0x02) rp7 ^= cur; + if (i & 0x04) rp9 ^= cur; + if (i & 0x08) rp11 ^= cur; + if (i & 0x10) rp13 ^= cur; + if (i & 0x20) rp15 ^= cur; + + and outside the loop added: + rp4 = par ^ rp5; + rp6 = par ^ rp7; + rp8 = par ^ rp9; + rp10 = par ^ rp11; + rp12 = par ^ rp13; + rp14 = par ^ rp15; + +And after that the code takes about 30% more time, although the number of +statements is reduced. This is also reflected in the assembly code. + + +Analysis 3 +========== + +Very weird. Guess it has to do with caching or instruction parallellism +or so. I also tried on an eeePC (Celeron, clocked at 900 Mhz). Interesting +observation was that this one is only 30% slower (according to time) +executing the code as my 3Ghz D920 processor. + +Well, it was expected not to be easy so maybe instead move to a +different track: let's move back to the code from attempt2 and do some +loop unrolling. This will eliminate a few if statements. I'll try +different amounts of unrolling to see what works best. + + +Attempt 4 +========= + +Unrolled the loop 1, 2, 3 and 4 times. +For 4 the code starts with: + + for (i = 0; i < 4; i++) + { + cur = *bp++; + par ^= cur; + rp4 ^= cur; + rp6 ^= cur; + rp8 ^= cur; + rp10 ^= cur; + if (i & 0x1) rp13 ^= cur; else rp12 ^= cur; + if (i & 0x2) rp15 ^= cur; else rp14 ^= cur; + cur = *bp++; + par ^= cur; + rp5 ^= cur; + rp6 ^= cur; + ... + + +Analysis 4 +========== + +Unrolling once gains about 15% +Unrolling twice keeps the gain at about 15% +Unrolling three times gives a gain of 30% compared to attempt 2. +Unrolling four times gives a marginal improvement compared to unrolling +three times. + +I decided to proceed with a four time unrolled loop anyway. It was my gut +feeling that in the next steps I would obtain additional gain from it. + +The next step was triggered by the fact that par contains the xor of all +bytes and rp4 and rp5 each contain the xor of half of the bytes. +So in effect par = rp4 ^ rp5. But as xor is commutative we can also say +that rp5 = par ^ rp4. So no need to keep both rp4 and rp5 around. We can +eliminate rp5 (or rp4, but I already foresaw another optimisation). +The same holds for rp6/7, rp8/9, rp10/11 rp12/13 and rp14/15. + + +Attempt 5 +========= + +Effectively so all odd digit rp assignments in the loop were removed. +This included the else clause of the if statements. +Of course after the loop we need to correct things by adding code like: + rp5 = par ^ rp4; +Also the initial assignments (rp5 = 0; etc) could be removed. +Along the line I also removed the initialisation of rp0/1/2/3. + + +Analysis 5 +========== + +Measurements showed this was a good move. The run-time roughly halved +compared with attempt 4 with 4 times unrolled, and we only require 1/3rd +of the processor time compared to the current code in the linux kernel. + +However, still I thought there was more. I didn't like all the if +statements. Why not keep a running parity and only keep the last if +statement. Time for yet another version! + + +Attempt 6 +========= + +THe code within the for loop was changed to: + + for (i = 0; i < 4; i++) + { + cur = *bp++; tmppar = cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; + + cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= cur; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; + + cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; rp8 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= cur; rp8 ^= cur; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp8 ^= cur; + cur = *bp++; tmppar ^= cur; rp8 ^= cur; + + cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= cur; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; + + par ^= tmppar; + if ((i & 0x1) == 0) rp12 ^= tmppar; + if ((i & 0x2) == 0) rp14 ^= tmppar; + } + +As you can see tmppar is used to accumulate the parity within a for +iteration. In the last 3 statements is is added to par and, if needed, +to rp12 and rp14. + +While making the changes I also found that I could exploit that tmppar +contains the running parity for this iteration. So instead of having: +rp4 ^= cur; rp6 = cur; +I removed the rp6 = cur; statement and did rp6 ^= tmppar; on next +statement. A similar change was done for rp8 and rp10 + + +Analysis 6 +========== + +Measuring this code again showed big gain. When executing the original +linux code 1 million times, this took about 1 second on my system. +(using time to measure the performance). After this iteration I was back +to 0.075 sec. Actually I had to decide to start measuring over 10 +million interations in order not to loose too much accuracy. This one +definitely seemed to be the jackpot! + +There is a little bit more room for improvement though. There are three +places with statements: +rp4 ^= cur; rp6 ^= cur; +It seems more efficient to also maintain a variable rp4_6 in the while +loop; This eliminates 3 statements per loop. Of course after the loop we +need to correct by adding: + rp4 ^= rp4_6; + rp6 ^= rp4_6 +Furthermore there are 4 sequential assingments to rp8. This can be +encoded slightly more efficient by saving tmppar before those 4 lines +and later do rp8 = rp8 ^ tmppar ^ notrp8; +(where notrp8 is the value of rp8 before those 4 lines). +Again a use of the commutative property of xor. +Time for a new test! + + +Attempt 7 +========= + +The new code now looks like: + + for (i = 0; i < 4; i++) + { + cur = *bp++; tmppar = cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; + + cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= cur; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; + + notrp8 = tmppar; + cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= cur; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; + rp8 = rp8 ^ tmppar ^ notrp8; + + cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= cur; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; + + par ^= tmppar; + if ((i & 0x1) == 0) rp12 ^= tmppar; + if ((i & 0x2) == 0) rp14 ^= tmppar; + } + rp4 ^= rp4_6; + rp6 ^= rp4_6; + + +Not a big change, but every penny counts :-) + + +Analysis 7 +========== + +Acutally this made things worse. Not very much, but I don't want to move +into the wrong direction. Maybe something to investigate later. Could +have to do with caching again. + +Guess that is what there is to win within the loop. Maybe unrolling one +more time will help. I'll keep the optimisations from 7 for now. + + +Attempt 8 +========= + +Unrolled the loop one more time. + + +Analysis 8 +========== + +This makes things worse. Let's stick with attempt 6 and continue from there. +Although it seems that the code within the loop cannot be optimised +further there is still room to optimize the generation of the ecc codes. +We can simply calcualate the total parity. If this is 0 then rp4 = rp5 +etc. If the parity is 1, then rp4 = !rp5; +But if rp4 = rp5 we do not need rp5 etc. We can just write the even bits +in the result byte and then do something like + code[0] |= (code[0] << 1); +Lets test this. + + +Attempt 9 +========= + +Changed the code but again this slightly degrades performance. Tried all +kind of other things, like having dedicated parity arrays to avoid the +shift after parity[rp7] << 7; No gain. +Change the lookup using the parity array by using shift operators (e.g. +replace parity[rp7] << 7 with: +rp7 ^= (rp7 << 4); +rp7 ^= (rp7 << 2); +rp7 ^= (rp7 << 1); +rp7 &= 0x80; +No gain. + +The only marginal change was inverting the parity bits, so we can remove +the last three invert statements. + +Ah well, pity this does not deliver more. Then again 10 million +iterations using the linux driver code takes between 13 and 13.5 +seconds, whereas my code now takes about 0.73 seconds for those 10 +million iterations. So basically I've improved the performance by a +factor 18 on my system. Not that bad. Of course on different hardware +you will get different results. No warranties! + +But of course there is no such thing as a free lunch. The codesize almost +tripled (from 562 bytes to 1434 bytes). Then again, it is not that much. + + +Correcting errors +================= + +For correcting errors I again used the ST application note as a starter, +but I also peeked at the existing code. +The algorithm itself is pretty straightforward. Just xor the given and +the calculated ecc. If all bytes are 0 there is no problem. If 11 bits +are 1 we have one correctable bit error. If there is 1 bit 1, we have an +error in the given ecc code. +It proved to be fastest to do some table lookups. Performance gain +introduced by this is about a factor 2 on my system when a repair had to +be done, and 1% or so if no repair had to be done. +Code size increased from 330 bytes to 686 bytes for this function. +(gcc 4.2, -O3) + + +Conclusion +========== + +The gain when calculating the ecc is tremendous. Om my development hardware +a speedup of a factor of 18 for ecc calculation was achieved. On a test on an +embedded system with a MIPS core a factor 7 was obtained. +On a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor +5 (big endian mode, gcc 4.1.2, -O3) +For correction not much gain could be obtained (as bitflips are rare). Then +again there are also much less cycles spent there. + +It seems there is not much more gain possible in this, at least when +programmed in C. Of course it might be possible to squeeze something more +out of it with an assembler program, but due to pipeline behaviour etc +this is very tricky (at least for intel hw). + +Author: Frans Meulenbroeks +Copyright (C) 2008 Koninklijke Philips Electronics NV. diff --git a/Documentation/sysrq.txt b/Documentation/sysrq.txt index 5ce0952aa065..49378a9f2b5f 100644 --- a/Documentation/sysrq.txt +++ b/Documentation/sysrq.txt @@ -95,7 +95,8 @@ On all - write a character to /proc/sysrq-trigger. e.g.: 'p' - Will dump the current registers and flags to your console. -'q' - Will dump a list of all running timers. +'q' - Will dump a list of all running hrtimers. + WARNING: Does not cover any other timers 'r' - Turns off keyboard raw mode and sets it to XLATE. diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt new file mode 100644 index 000000000000..125eed560e5a --- /dev/null +++ b/Documentation/vm/unevictable-lru.txt @@ -0,0 +1,615 @@ + +This document describes the Linux memory management "Unevictable LRU" +infrastructure and the use of this infrastructure to manage several types +of "unevictable" pages. The document attempts to provide the overall +rationale behind this mechanism and the rationale for some of the design +decisions that drove the implementation. The latter design rationale is +discussed in the context of an implementation description. Admittedly, one +can obtain the implementation details--the "what does it do?"--by reading the +code. One hopes that the descriptions below add value by provide the answer +to "why does it do that?". + +Unevictable LRU Infrastructure: + +The Unevictable LRU adds an additional LRU list to track unevictable pages +and to hide these pages from vmscan. This mechanism is based on a patch by +Larry Woodman of Red Hat to address several scalability problems with page +reclaim in Linux. The problems have been observed at customer sites on large +memory x86_64 systems. For example, a non-numal x86_64 platform with 128GB +of main memory will have over 32 million 4k pages in a single zone. When a +large fraction of these pages are not evictable for any reason [see below], +vmscan will spend a lot of time scanning the LRU lists looking for the small +fraction of pages that are evictable. This can result in a situation where +all cpus are spending 100% of their time in vmscan for hours or days on end, +with the system completely unresponsive. + +The Unevictable LRU infrastructure addresses the following classes of +unevictable pages: + ++ page owned by ramfs ++ page mapped into SHM_LOCKed shared memory regions ++ page mapped into VM_LOCKED [mlock()ed] vmas + +The infrastructure might be able to handle other conditions that make pages +unevictable, either by definition or by circumstance, in the future. + + +The Unevictable LRU List + +The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list +called the "unevictable" list and an associated page flag, PG_unevictable, to +indicate that the page is being managed on the unevictable list. The +PG_unevictable flag is analogous to, and mutually exclusive with, the PG_active +flag in that it indicates on which LRU list a page resides when PG_lru is set. +The unevictable LRU list is source configurable based on the UNEVICTABLE_LRU +Kconfig option. + +The Unevictable LRU infrastructure maintains unevictable pages on an additional +LRU list for a few reasons: + +1) We get to "treat unevictable pages just like we treat other pages in the + system, which means we get to use the same code to manipulate them, the + same code to isolate them (for migrate, etc.), the same code to keep track + of the statistics, etc..." [Rik van Riel] + +2) We want to be able to migrate unevictable pages between nodes--for memory + defragmentation, workload management and memory hotplug. The linux kernel + can only migrate pages that it can successfully isolate from the lru lists. + If we were to maintain pages elsewise than on an lru-like list, where they + can be found by isolate_lru_page(), we would prevent their migration, unless + we reworked migration code to find the unevictable pages. + + +The unevictable LRU list does not differentiate between file backed and swap +backed [anon] pages. This differentiation is only important while the pages +are, in fact, evictable. + +The unevictable LRU list benefits from the "arrayification" of the per-zone +LRU lists and statistics originally proposed and posted by Christoph Lameter. + +The unevictable list does not use the lru pagevec mechanism. Rather, +unevictable pages are placed directly on the page's zone's unevictable +list under the zone lru_lock. The reason for this is to prevent stranding +of pages on the unevictable list when one task has the page isolated from the +lru and other tasks are changing the "evictability" state of the page. + + +Unevictable LRU and Memory Controller Interaction + +The memory controller data structure automatically gets a per zone unevictable +lru list as a result of the "arrayification" of the per-zone LRU lists. The +memory controller tracks the movement of pages to and from the unevictable list. +When a memory control group comes under memory pressure, the controller will +not attempt to reclaim pages on the unevictable list. This has a couple of +effects. Because the pages are "hidden" from reclaim on the unevictable list, +the reclaim process can be more efficient, dealing only with pages that have +a chance of being reclaimed. On the other hand, if too many of the pages +charged to the control group are unevictable, the evictable portion of the +working set of the tasks in the control group may not fit into the available +memory. This can cause the control group to thrash or to oom-kill tasks. + + +Unevictable LRU: Detecting Unevictable Pages + +The function page_evictable(page, vma) in vmscan.c determines whether a +page is evictable or not. For ramfs pages and pages in SHM_LOCKed regions, +page_evictable() tests a new address space flag, AS_UNEVICTABLE, in the page's +address space using a wrapper function. Wrapper functions are used to set, +clear and test the flag to reduce the requirement for #ifdef's throughout the +source code. AS_UNEVICTABLE is set on ramfs inode/mapping when it is created. +This flag remains for the life of the inode. + +For shared memory regions, AS_UNEVICTABLE is set when an application +successfully SHM_LOCKs the region and is removed when the region is +SHM_UNLOCKed. Note that shmctl(SHM_LOCK, ...) does not populate the page +tables for the region as does, for example, mlock(). So, we make no special +effort to push any pages in the SHM_LOCKed region to the unevictable list. +Vmscan will do this when/if it encounters the pages during reclaim. On +SHM_UNLOCK, shmctl() scans the pages in the region and "rescues" them from the +unevictable list if no other condition keeps them unevictable. If a SHM_LOCKed +region is destroyed, the pages are also "rescued" from the unevictable list in +the process of freeing them. + +page_evictable() detects mlock()ed pages by testing an additional page flag, +PG_mlocked via the PageMlocked() wrapper. If the page is NOT mlocked, and a +non-NULL vma is supplied, page_evictable() will check whether the vma is +VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and +update the appropriate statistics if the vma is VM_LOCKED. This method allows +efficient "culling" of pages in the fault path that are being faulted in to +VM_LOCKED vmas. + + +Unevictable Pages and Vmscan [shrink_*_list()] + +If unevictable pages are culled in the fault path, or moved to the unevictable +list at mlock() or mmap() time, vmscan will never encounter the pages until +they have become evictable again, for example, via munlock() and have been +"rescued" from the unevictable list. However, there may be situations where we +decide, for the sake of expediency, to leave a unevictable page on one of the +regular active/inactive LRU lists for vmscan to deal with. Vmscan checks for +such pages in all of the shrink_{active|inactive|page}_list() functions and +will "cull" such pages that it encounters--that is, it diverts those pages to +the unevictable list for the zone being scanned. + +There may be situations where a page is mapped into a VM_LOCKED vma, but the +page is not marked as PageMlocked. Such pages will make it all the way to +shrink_page_list() where they will be detected when vmscan walks the reverse +map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK, shrink_page_list() +will cull the page at that point. + +Note that for anonymous pages, shrink_page_list() attempts to add the page to +the swap cache before it tries to unmap the page. To avoid this unnecessary +consumption of swap space, shrink_page_list() calls try_to_munlock() to check +whether any VM_LOCKED vmas map the page without attempting to unmap the page. +If try_to_munlock() returns SWAP_MLOCK, shrink_page_list() will cull the page +without consuming swap space. try_to_munlock() will be described below. + +To "cull" an unevictable page, vmscan simply puts the page back on the lru +list using putback_lru_page()--the inverse operation to isolate_lru_page()-- +after dropping the page lock. Because the condition which makes the page +unevictable may change once the page is unlocked, putback_lru_page() will +recheck the unevictable state of a page that it places on the unevictable lru +list. If the page has become unevictable, putback_lru_page() removes it from +the list and retries, including the page_unevictable() test. Because such a +race is a rare event and movement of pages onto the unevictable list should be +rare, these extra evictabilty checks should not occur in the majority of calls +to putback_lru_page(). + + +Mlocked Page: Prior Work + +The "Unevictable Mlocked Pages" infrastructure is based on work originally +posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU". +Nick posted his patch as an alternative to a patch posted by Christoph +Lameter to achieve the same objective--hiding mlocked pages from vmscan. +In Nick's patch, he used one of the struct page lru list link fields as a count +of VM_LOCKED vmas that map the page. This use of the link field for a count +prevented the management of the pages on an LRU list. Thus, mlocked pages were +not migratable as isolate_lru_page() could not find them and the lru list link +field was not available to the migration subsystem. Nick resolved this by +putting mlocked pages back on the lru list before attempting to isolate them, +thus abandoning the count of VM_LOCKED vmas. When Nick's patch was integrated +with the Unevictable LRU work, the count was replaced by walking the reverse +map to determine whether any VM_LOCKED vmas mapped the page. More on this +below. + + +Mlocked Pages: Basic Management + +Mlocked pages--pages mapped into a VM_LOCKED vma--represent one class of +unevictable pages. When such a page has been "noticed" by the memory +management subsystem, the page is marked with the PG_mlocked [PageMlocked()] +flag. A PageMlocked() page will be placed on the unevictable LRU list when +it is added to the LRU. Pages can be "noticed" by memory management in +several places: + +1) in the mlock()/mlockall() system call handlers. +2) in the mmap() system call handler when mmap()ing a region with the + MAP_LOCKED flag, or mmap()ing a region in a task that has called + mlockall() with the MCL_FUTURE flag. Both of these conditions result + in the VM_LOCKED flag being set for the vma. +3) in the fault path, if mlocked pages are "culled" in the fault path, + and when a VM_LOCKED stack segment is expanded. +4) as mentioned above, in vmscan:shrink_page_list() with attempting to + reclaim a page in a VM_LOCKED vma--via try_to_unmap() or try_to_munlock(). + +Mlocked pages become unlocked and rescued from the unevictable list when: + +1) mapped in a range unlocked via the munlock()/munlockall() system calls. +2) munmapped() out of the last VM_LOCKED vma that maps the page, including + unmapping at task exit. +3) when the page is truncated from the last VM_LOCKED vma of an mmap()ed file. +4) before a page is COWed in a VM_LOCKED vma. + + +Mlocked Pages: mlock()/mlockall() System Call Handling + +Both [do_]mlock() and [do_]mlockall() system call handlers call mlock_fixup() +for each vma in the range specified by the call. In the case of mlockall(), +this is the entire active address space of the task. Note that mlock_fixup() +is used for both mlock()ing and munlock()ing a range of memory. A call to +mlock() an already VM_LOCKED vma, or to munlock() a vma that is not VM_LOCKED +is treated as a no-op--mlock_fixup() simply returns. + +If the vma passes some filtering described in "Mlocked Pages: Filtering Vmas" +below, mlock_fixup() will attempt to merge the vma with its neighbors or split +off a subset of the vma if the range does not cover the entire vma. Once the +vma has been merged or split or neither, mlock_fixup() will call +__mlock_vma_pages_range() to fault in the pages via get_user_pages() and +to mark the pages as mlocked via mlock_vma_page(). + +Note that the vma being mlocked might be mapped with PROT_NONE. In this case, +get_user_pages() will be unable to fault in the pages. That's OK. If pages +do end up getting faulted into this VM_LOCKED vma, we'll handle them in the +fault path or in vmscan. + +Also note that a page returned by get_user_pages() could be truncated or +migrated out from under us, while we're trying to mlock it. To detect +this, __mlock_vma_pages_range() tests the page_mapping after acquiring +the page lock. If the page is still associated with its mapping, we'll +go ahead and call mlock_vma_page(). If the mapping is gone, we just +unlock the page and move on. Worse case, this results in page mapped +in a VM_LOCKED vma remaining on a normal LRU list without being +PageMlocked(). Again, vmscan will detect and cull such pages. + +mlock_vma_page(), called with the page locked [N.B., not "mlocked"], will +TestSetPageMlocked() for each page returned by get_user_pages(). We use +TestSetPageMlocked() because the page might already be mlocked by another +task/vma and we don't want to do extra work. We especially do not want to +count an mlocked page more than once in the statistics. If the page was +already mlocked, mlock_vma_page() is done. + +If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the +page from the LRU, as it is likely on the appropriate active or inactive list +at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will +putback the page--putback_lru_page()--which will notice that the page is now +mlocked and divert the page to the zone's unevictable LRU list. If +mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle +it later if/when it attempts to reclaim the page. + + +Mlocked Pages: Filtering Special Vmas + +mlock_fixup() filters several classes of "special" vmas: + +1) vmas with VM_IO|VM_PFNMAP set are skipped entirely. The pages behind + these mappings are inherently pinned, so we don't need to mark them as + mlocked. In any case, most of the pages have no struct page in which to + so mark the page. Because of this, get_user_pages() will fail for these + vmas, so there is no sense in attempting to visit them. + +2) vmas mapping hugetlbfs page are already effectively pinned into memory. + We don't need nor want to mlock() these pages. However, to preserve the + prior behavior of mlock()--before the unevictable/mlock changes--mlock_fixup() + will call make_pages_present() in the hugetlbfs vma range to allocate the + huge pages and populate the ptes. + +3) vmas with VM_DONTEXPAND|VM_RESERVED are generally user space mappings of + kernel pages, such as the vdso page, relay channel pages, etc. These pages + are inherently unevictable and are not managed on the LRU lists. + mlock_fixup() treats these vmas the same as hugetlbfs vmas. It calls + make_pages_present() to populate the ptes. + +Note that for all of these special vmas, mlock_fixup() does not set the +VM_LOCKED flag. Therefore, we won't have to deal with them later during +munlock() or munmap()--for example, at task exit. Neither does mlock_fixup() +account these vmas against the task's "locked_vm". + +Mlocked Pages: Downgrading the Mmap Semaphore. + +mlock_fixup() must be called with the mmap semaphore held for write, because +it may have to merge or split vmas. However, mlocking a large region of +memory can take a long time--especially if vmscan must reclaim pages to +satisfy the regions requirements. Faulting in a large region with the mmap +semaphore held for write can hold off other faults on the address space, in +the case of a multi-threaded task. It can also hold off scans of the task's +address space via /proc. While testing under heavy load, it was observed that +the ps(1) command could be held off for many minutes while a large segment was +mlock()ed down. + +To address this issue, and to make the system more responsive during mlock()ing +of large segments, mlock_fixup() downgrades the mmap semaphore to read mode +during the call to __mlock_vma_pages_range(). This works fine. However, the +callers of mlock_fixup() expect the semaphore to be returned in write mode. +So, mlock_fixup() "upgrades" the semphore to write mode. Linux does not +support an atomic upgrade_sem() call, so mlock_fixup() must drop the semaphore +and reacquire it in write mode. In a multi-threaded task, it is possible for +the task memory map to change while the semaphore is dropped. Therefore, +mlock_fixup() looks up the vma at the range start address after reacquiring +the semaphore in write mode and verifies that it still covers the original +range. If not, mlock_fixup() returns an error [-EAGAIN]. All callers of +mlock_fixup() have been changed to deal with this new error condition. + +Note: when munlocking a region, all of the pages should already be resident-- +unless we have racing threads mlocking() and munlocking() regions. So, +unlocking should not have to wait for page allocations nor faults of any kind. +Therefore mlock_fixup() does not downgrade the semaphore for munlock(). + + +Mlocked Pages: munlock()/munlockall() System Call Handling + +The munlock() and munlockall() system calls are handled by the same functions-- +do_mlock[all]()--as the mlock() and mlockall() system calls with the unlock +vs lock operation indicated by an argument. So, these system calls are also +handled by mlock_fixup(). Again, if called for an already munlock()ed vma, +mlock_fixup() simply returns. Because of the vma filtering discussed above, +VM_LOCKED will not be set in any "special" vmas. So, these vmas will be +ignored for munlock. + +If the vma is VM_LOCKED, mlock_fixup() again attempts to merge or split off +the specified range. The range is then munlocked via the function +__mlock_vma_pages_range()--the same function used to mlock a vma range-- +passing a flag to indicate that munlock() is being performed. + +Because the vma access protections could have been changed to PROT_NONE after +faulting in and mlocking some pages, get_user_pages() was unreliable for visiting +these pages for munlocking. Because we don't want to leave pages mlocked(), +get_user_pages() was enhanced to accept a flag to ignore the permissions when +fetching the pages--all of which should be resident as a result of previous +mlock()ing. + +For munlock(), __mlock_vma_pages_range() unlocks individual pages by calling +munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked +flag using TestClearPageMlocked(). As with mlock_vma_page(), munlock_vma_page() +use the Test*PageMlocked() function to handle the case where the page might +have already been unlocked by another task. If the page was mlocked, +munlock_vma_page() updates that zone statistics for the number of mlocked +pages. Note, however, that at this point we haven't checked whether the page +is mapped by other VM_LOCKED vmas. + +We can't call try_to_munlock(), the function that walks the reverse map to check +for other VM_LOCKED vmas, without first isolating the page from the LRU. +try_to_munlock() is a variant of try_to_unmap() and thus requires that the page +not be on an lru list. [More on these below.] However, the call to +isolate_lru_page() could fail, in which case we couldn't try_to_munlock(). +So, we go ahead and clear PG_mlocked up front, as this might be the only chance +we have. If we can successfully isolate the page, we go ahead and +try_to_munlock(), which will restore the PG_mlocked flag and update the zone +page statistics if it finds another vma holding the page mlocked. If we fail +to isolate the page, we'll have left a potentially mlocked page on the LRU. +This is fine, because we'll catch it later when/if vmscan tries to reclaim the +page. This should be relatively rare. + +Mlocked Pages: Migrating Them... + +A page that is being migrated has been isolated from the lru lists and is +held locked across unmapping of the page, updating the page's mapping +[address_space] entry and copying the contents and state, until the +page table entry has been replaced with an entry that refers to the new +page. Linux supports migration of mlocked pages and other unevictable +pages. This involves simply moving the PageMlocked and PageUnevictable states +from the old page to the new page. + +Note that page migration can race with mlocking or munlocking of the same +page. This has been discussed from the mlock/munlock perspective in the +respective sections above. Both processes [migration, m[un]locking], hold +the page locked. This provides the first level of synchronization. Page +migration zeros out the page_mapping of the old page before unlocking it, +so m[un]lock can skip these pages by testing the page mapping under page +lock. + +When completing page migration, we place the new and old pages back onto the +lru after dropping the page lock. The "unneeded" page--old page on success, +new page on failure--will be freed when the reference count held by the +migration process is released. To ensure that we don't strand pages on the +unevictable list because of a race between munlock and migration, page +migration uses the putback_lru_page() function to add migrated pages back to +the lru. + + +Mlocked Pages: mmap(MAP_LOCKED) System Call Handling + +In addition the the mlock()/mlockall() system calls, an application can request +that a region of memory be mlocked using the MAP_LOCKED flag with the mmap() +call. Furthermore, any mmap() call or brk() call that expands the heap by a +task that has previously called mlockall() with the MCL_FUTURE flag will result +in the newly mapped memory being mlocked. Before the unevictable/mlock changes, +the kernel simply called make_pages_present() to allocate pages and populate +the page table. + +To mlock a range of memory under the unevictable/mlock infrastructure, the +mmap() handler and task address space expansion functions call +mlock_vma_pages_range() specifying the vma and the address range to mlock. +mlock_vma_pages_range() filters vmas like mlock_fixup(), as described above in +"Mlocked Pages: Filtering Vmas". It will clear the VM_LOCKED flag, which will +have already been set by the caller, in filtered vmas. Thus these vma's need +not be visited for munlock when the region is unmapped. + +For "normal" vmas, mlock_vma_pages_range() calls __mlock_vma_pages_range() to +fault/allocate the pages and mlock them. Again, like mlock_fixup(), +mlock_vma_pages_range() downgrades the mmap semaphore to read mode before +attempting to fault/allocate and mlock the pages; and "upgrades" the semaphore +back to write mode before returning. + +The callers of mlock_vma_pages_range() will have already added the memory +range to be mlocked to the task's "locked_vm". To account for filtered vmas, +mlock_vma_pages_range() returns the number of pages NOT mlocked. All of the +callers then subtract a non-negative return value from the task's locked_vm. +A negative return value represent an error--for example, from get_user_pages() +attempting to fault in a vma with PROT_NONE access. In this case, we leave +the memory range accounted as locked_vm, as the protections could be changed +later and pages allocated into that region. + + +Mlocked Pages: munmap()/exit()/exec() System Call Handling + +When unmapping an mlocked region of memory, whether by an explicit call to +munmap() or via an internal unmap from exit() or exec() processing, we must +munlock the pages if we're removing the last VM_LOCKED vma that maps the pages. +Before the unevictable/mlock changes, mlocking did not mark the pages in any way, +so unmapping them required no processing. + +To munlock a range of memory under the unevictable/mlock infrastructure, the +munmap() hander and task address space tear down function call +munlock_vma_pages_all(). The name reflects the observation that one always +specifies the entire vma range when munlock()ing during unmap of a region. +Because of the vma filtering when mlocking() regions, only "normal" vmas that +actually contain mlocked pages will be passed to munlock_vma_pages_all(). + +munlock_vma_pages_all() clears the VM_LOCKED vma flag and, like mlock_fixup() +for the munlock case, calls __munlock_vma_pages_range() to walk the page table +for the vma's memory range and munlock_vma_page() each resident page mapped by +the vma. This effectively munlocks the page, only if this is the last +VM_LOCKED vma that maps the page. + + +Mlocked Page: try_to_unmap() + +[Note: the code changes represented by this section are really quite small +compared to the text to describe what happening and why, and to discuss the +implications.] + +Pages can, of course, be mapped into multiple vmas. Some of these vmas may +have VM_LOCKED flag set. It is possible for a page mapped into one or more +VM_LOCKED vmas not to have the PG_mlocked flag set and therefore reside on one +of the active or inactive LRU lists. This could happen if, for example, a +task in the process of munlock()ing the page could not isolate the page from +the LRU. As a result, vmscan/shrink_page_list() might encounter such a page +as described in "Unevictable Pages and Vmscan [shrink_*_list()]". To +handle this situation, try_to_unmap() has been enhanced to check for VM_LOCKED +vmas while it is walking a page's reverse map. + +try_to_unmap() is always called, by either vmscan for reclaim or for page +migration, with the argument page locked and isolated from the LRU. BUG_ON() +assertions enforce this requirement. Separate functions handle anonymous and +mapped file pages, as these types of pages have different reverse map +mechanisms. + + try_to_unmap_anon() + +To unmap anonymous pages, each vma in the list anchored in the anon_vma must be +visited--at least until a VM_LOCKED vma is encountered. If the page is being +unmapped for migration, VM_LOCKED vmas do not stop the process because mlocked +pages are migratable. However, for reclaim, if the page is mapped into a +VM_LOCKED vma, the scan stops. try_to_unmap() attempts to acquire the mmap +semphore of the mm_struct to which the vma belongs in read mode. If this is +successful, try_to_unmap() will mlock the page via mlock_vma_page()--we +wouldn't have gotten to try_to_unmap() if the page were already mlocked--and +will return SWAP_MLOCK, indicating that the page is unevictable. If the +mmap semaphore cannot be acquired, we are not sure whether the page is really +unevictable or not. In this case, try_to_unmap() will return SWAP_AGAIN. + + try_to_unmap_file() -- linear mappings + +Unmapping of a mapped file page works the same, except that the scan visits +all vmas that maps the page's index/page offset in the page's mapping's +reverse map priority search tree. It must also visit each vma in the page's +mapping's non-linear list, if the list is non-empty. As for anonymous pages, +on encountering a VM_LOCKED vma for a mapped file page, try_to_unmap() will +attempt to acquire the associated mm_struct's mmap semaphore to mlock the page, +returning SWAP_MLOCK if this is successful, and SWAP_AGAIN, if not. + + try_to_unmap_file() -- non-linear mappings + +If a page's mapping contains a non-empty non-linear mapping vma list, then +try_to_un{map|lock}() must also visit each vma in that list to determine +whether the page is mapped in a VM_LOCKED vma. Again, the scan must visit +all vmas in the non-linear list to ensure that the pages is not/should not be +mlocked. If a VM_LOCKED vma is found in the list, the scan could terminate. +However, there is no easy way to determine whether the page is actually mapped +in a given vma--either for unmapping or testing whether the VM_LOCKED vma +actually pins the page. + +So, try_to_unmap_file() handles non-linear mappings by scanning a certain +number of pages--a "cluster"--in each non-linear vma associated with the page's +mapping, for each file mapped page that vmscan tries to unmap. If this happens +to unmap the page we're trying to unmap, try_to_unmap() will notice this on +return--(page_mapcount(page) == 0)--and return SWAP_SUCCESS. Otherwise, it +will return SWAP_AGAIN, causing vmscan to recirculate this page. We take +advantage of the cluster scan in try_to_unmap_cluster() as follows: + +For each non-linear vma, try_to_unmap_cluster() attempts to acquire the mmap +semaphore of the associated mm_struct for read without blocking. If this +attempt is successful and the vma is VM_LOCKED, try_to_unmap_cluster() will +retain the mmap semaphore for the scan; otherwise it drops it here. Then, +for each page in the cluster, if we're holding the mmap semaphore for a locked +vma, try_to_unmap_cluster() calls mlock_vma_page() to mlock the page. This +call is a no-op if the page is already locked, but will mlock any pages in +the non-linear mapping that happen to be unlocked. If one of the pages so +mlocked is the page passed in to try_to_unmap(), try_to_unmap_cluster() will +return SWAP_MLOCK, rather than the default SWAP_AGAIN. This will allow vmscan +to cull the page, rather than recirculating it on the inactive list. Again, +if try_to_unmap_cluster() cannot acquire the vma's mmap sem, it returns +SWAP_AGAIN, indicating that the page is mapped by a VM_LOCKED vma, but +couldn't be mlocked. + + +Mlocked pages: try_to_munlock() Reverse Map Scan + +TODO/FIXME: a better name might be page_mlocked()--analogous to the +page_referenced() reverse map walker--especially if we continue to call this +from shrink_page_list(). See related TODO/FIXME below. + +When munlock_vma_page()--see "Mlocked Pages: munlock()/munlockall() System +Call Handling" above--tries to munlock a page, or when shrink_page_list() +encounters an anonymous page that is not yet in the swap cache, they need to +determine whether or not the page is mapped by any VM_LOCKED vma, without +actually attempting to unmap all ptes from the page. For this purpose, the +unevictable/mlock infrastructure introduced a variant of try_to_unmap() called +try_to_munlock(). + +try_to_munlock() calls the same functions as try_to_unmap() for anonymous and +mapped file pages with an additional argument specifing unlock versus unmap +processing. Again, these functions walk the respective reverse maps looking +for VM_LOCKED vmas. When such a vma is found for anonymous pages and file +pages mapped in linear VMAs, as in the try_to_unmap() case, the functions +attempt to acquire the associated mmap semphore, mlock the page via +mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the +pre-clearing of the page's PG_mlocked done by munlock_vma_page() and informs +shrink_page_list() that the anonymous page should be culled rather than added +to the swap cache in preparation for a try_to_unmap() that will almost +certainly fail. + +If try_to_unmap() is unable to acquire a VM_LOCKED vma's associated mmap +semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list() +to recycle the page on the inactive list and hope that it has better luck +with the page next time. + +For file pages mapped into non-linear vmas, the try_to_munlock() logic works +slightly differently. On encountering a VM_LOCKED non-linear vma that might +map the page, try_to_munlock() returns SWAP_AGAIN without actually mlocking +the page. munlock_vma_page() will just leave the page unlocked and let +vmscan deal with it--the usual fallback position. + +Note that try_to_munlock()'s reverse map walk must visit every vma in a pages' +reverse map to determine that a page is NOT mapped into any VM_LOCKED vma. +However, the scan can terminate when it encounters a VM_LOCKED vma and can +successfully acquire the vma's mmap semphore for read and mlock the page. +Although try_to_munlock() can be called many [very many!] times when +munlock()ing a large region or tearing down a large address space that has been +mlocked via mlockall(), overall this is a fairly rare event. In addition, +although shrink_page_list() calls try_to_munlock() for every anonymous page that +it handles that is not yet in the swap cache, on average anonymous pages will +have very short reverse map lists. + +Mlocked Page: Page Reclaim in shrink_*_list() + +shrink_active_list() culls any obviously unevictable pages--i.e., +!page_evictable(page, NULL)--diverting these to the unevictable lru +list. However, shrink_active_list() only sees unevictable pages that +made it onto the active/inactive lru lists. Note that these pages do not +have PageUnevictable set--otherwise, they would be on the unevictable list and +shrink_active_list would never see them. + +Some examples of these unevictable pages on the LRU lists are: + +1) ramfs pages that have been placed on the lru lists when first allocated. + +2) SHM_LOCKed shared memory pages. shmctl(SHM_LOCK) does not attempt to + allocate or fault in the pages in the shared memory region. This happens + when an application accesses the page the first time after SHM_LOCKing + the segment. + +3) Mlocked pages that could not be isolated from the lru and moved to the + unevictable list in mlock_vma_page(). + +3) Pages mapped into multiple VM_LOCKED vmas, but try_to_munlock() couldn't + acquire the vma's mmap semaphore to test the flags and set PageMlocked. + munlock_vma_page() was forced to let the page back on to the normal + LRU list for vmscan to handle. + +shrink_inactive_list() also culls any unevictable pages that it finds +on the inactive lists, again diverting them to the appropriate zone's unevictable +lru list. shrink_inactive_list() should only see SHM_LOCKed pages that became +SHM_LOCKed after shrink_active_list() had moved them to the inactive list, or +pages mapped into VM_LOCKED vmas that munlock_vma_page() couldn't isolate from +the lru to recheck via try_to_munlock(). shrink_inactive_list() won't notice +the latter, but will pass on to shrink_page_list(). + +shrink_page_list() again culls obviously unevictable pages that it could +encounter for similar reason to shrink_inactive_list(). As already discussed, +shrink_page_list() proactively looks for anonymous pages that should have +PG_mlocked set but don't--these would not be detected by page_evictable()--to +avoid adding them to the swap cache unnecessarily. File pages mapped into +VM_LOCKED vmas but without PG_mlocked set will make it all the way to +try_to_unmap(). shrink_page_list() will divert them to the unevictable list when +try_to_unmap() returns SWAP_MLOCK, as discussed above. + +TODO/FIXME: If we can enhance the swap cache to reliably remove entries +with page_count(page) > 2, as long as all ptes are mapped to the page and +not the swap entry, we can probably remove the call to try_to_munlock() in +shrink_page_list() and just remove the page from the swap cache when +try_to_unmap() returns SWAP_MLOCK. Currently, remove_exclusive_swap_page() +doesn't seem to allow that. + + |