From 55537871ef666b4153fd1ef8782e4a13fee142cc Mon Sep 17 00:00:00 2001 From: Jiri Kosina Date: Thu, 5 Nov 2015 18:44:41 -0800 Subject: kernel/watchdog.c: perform all-CPU backtrace in case of hard lockup In many cases of hardlockup reports, it's actually not possible to know why it triggered, because the CPU that got stuck is usually waiting on a resource (with IRQs disabled) that some other CPU is holding. IOW, we are often looking at the stacktrace of the victim and not the actual offender. Introduce a sysctl / cmdline parameter that makes it possible to have the hardlockup detector perform an all-CPU backtrace. Signed-off-by: Jiri Kosina Reviewed-by: Aaron Tomlin Cc: Ulrich Obergfell Acked-by: Don Zickus Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/kernel-parameters.txt | 5 +++++ Documentation/sysctl/kernel.txt | 12 ++++++++++++ 2 files changed, 17 insertions(+) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 6263a2da3e2f..0231f4508abe 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1269,6 +1269,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted. Format: such that (rxsize & ~0x1fffc0) == 0. Default: 1024 + hardlockup_all_cpu_backtrace= + [KNL] Should the hard-lockup detector generate + backtraces on all cpus. + Format: + hashdist= [KNL,NUMA] Large hashes allocated during boot are distributed across NUMA nodes. Defaults on for 64-bit NUMA, off otherwise. diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 6fccb69c03e7..af70d1541d3a 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -33,6 +33,7 @@ show up in /proc/sys/kernel: - domainname - hostname - hotplug +- hardlockup_all_cpu_backtrace - hung_task_panic - hung_task_check_count - hung_task_timeout_secs @@ -292,6 +293,17 @@ Information Service) or YP (Yellow Pages) domainname. These two domain names are in general different. For a detailed discussion see the hostname(1) man page. +============================================================== +hardlockup_all_cpu_backtrace: + +This value controls the hard lockup detector behavior when a hard +lockup condition is detected as to whether or not to gather further +debug information. If enabled, arch-specific all-CPU stack dumping +will be initiated. + +0: do nothing. This is the default behavior. + +1: on detection capture more debug information. ============================================================== hotplug: -- cgit v1.2.3 From ac1f591249d95372f3a5ab3828d4af5dfbf5efd3 Mon Sep 17 00:00:00 2001 From: Don Zickus Date: Thu, 5 Nov 2015 18:44:44 -0800 Subject: kernel/watchdog.c: add sysctl knob hardlockup_panic The only way to enable a hardlockup to panic the machine is to set 'nmi_watchdog=panic' on the kernel command line. This makes it awkward for end users and folks who want to run automated tests (like myself). Mimic the softlockup_panic knob and create a /proc/sys/kernel/hardlockup_panic knob.
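For example (illustrative usage, not part of either patch; both knobs accept 0 or 1), an automated test setup could enable the two sysctls added by this series at run time:

	# dump stacks on all CPUs when a hard lockup is detected
	sysctl -w kernel.hardlockup_all_cpu_backtrace=1
	# and then panic the machine
	sysctl -w kernel.hardlockup_panic=1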
Signed-off-by: Don Zickus Cc: Ulrich Obergfell Acked-by: Jiri Kosina Reviewed-by: Aaron Tomlin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/lockup-watchdogs.txt | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/lockup-watchdogs.txt b/Documentation/lockup-watchdogs.txt index 22dd6af2e4bd..4a6e33e1af61 100644 --- a/Documentation/lockup-watchdogs.txt +++ b/Documentation/lockup-watchdogs.txt @@ -20,8 +20,9 @@ kernel mode for more than 10 seconds (see "Implementation" below for details), without letting other interrupts have a chance to run. Similarly to the softlockup case, the current stack trace is displayed upon detection and the system will stay locked up unless the default -behavior is changed, which can be done through a compile time knob, -"BOOTPARAM_HARDLOCKUP_PANIC", and a kernel parameter, "nmi_watchdog" +behavior is changed, which can be done through a sysctl, +'hardlockup_panic', a compile time knob, "BOOTPARAM_HARDLOCKUP_PANIC", +and a kernel parameter, "nmi_watchdog" (see "Documentation/kernel-parameters.txt" for details). The panic option can be used in combination with panic_timeout (this -- cgit v1.2.3 From 76f8ec712aa94da9fbfc9c318edc89aa1e48006b Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Thu, 5 Nov 2015 18:45:40 -0800 Subject: Doc/slub: document slabinfo-gnuplot.sh script Add documentation on how to use the slabinfo-gnuplot.sh script. Signed-off-by: Sergey Senozhatsky Acked-by: Christoph Lameter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/slub.txt | 59 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 59 insertions(+) (limited to 'Documentation') diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.txt index b0c6d1bbb434..699d8ea5c230 100644 --- a/Documentation/vm/slub.txt +++ b/Documentation/vm/slub.txt @@ -280,4 +280,63 @@ of other objects. slub_debug=FZ,dentry +Extended slabinfo mode and plotting +----------------------------------- + +The slabinfo tool has a special 'extended' ('-X') mode that includes: + - Slabcache Totals + - Slabs sorted by size (up to -N slabs, default 1) + - Slabs sorted by loss (up to -N slabs, default 1) + +Additionally, in this mode slabinfo does not dynamically scale sizes (G/M/K) +and reports everything in bytes (this functionality is also available to +other slabinfo modes via the '-B' option) which makes reporting more precise and +accurate. Moreover, in some sense the `-X' mode also simplifies the analysis +of slabs' behaviour, because its output can be plotted using the +slabinfo-gnuplot.sh script. So it pushes the analysis from looking through +the numbers (tons of numbers) to something easier -- visual analysis. + +To generate plots: +a) collect slabinfo extended records, for example: + + while [ 1 ]; do slabinfo -X >> FOO_STATS; sleep 1; done + +b) pass the stats file(s) to the slabinfo-gnuplot.sh script: + slabinfo-gnuplot.sh FOO_STATS [FOO_STATS2 .. FOO_STATSN] + +The slabinfo-gnuplot.sh script will pre-process the collected records +and generate 3 png files (and 3 pre-processing cache files) per STATS +file: + - Slabcache Totals: FOO_STATS-totals.png + - Slabs sorted by size: FOO_STATS-slabs-by-size.png + - Slabs sorted by loss: FOO_STATS-slabs-by-loss.png + +Another use case where slabinfo-gnuplot.sh can be useful is when you need +to compare slabs' behaviour "prior to" and "after" some code modification.
+To help you out there, the slabinfo-gnuplot.sh script can 'merge' the +`Slabcache Totals` sections from different measurements. To visually +compare N plots: + +a) Collect as many STATS1, STATS2, .. STATSN files as you need + while [ 1 ]; do slabinfo -X >> STATS; sleep 1; done + +b) Pre-process those STATS files + slabinfo-gnuplot.sh STATS1 STATS2 .. STATSN + +c) Execute slabinfo-gnuplot.sh in '-t' mode, passing all of the +generated pre-processed *-totals + slabinfo-gnuplot.sh -t STATS1-totals STATS2-totals .. STATSN-totals + +This will produce a single plot (png file). + +Plots, expectedly, can be large, so some fluctuations or small spikes +can go unnoticed. To deal with that, `slabinfo-gnuplot.sh' has two +options to 'zoom-in'/'zoom-out': + a) -s %d,%d overrides the default image width and height + b) -r %d,%d specifies a range of samples to use (for example, + in `slabinfo -X >> FOO_STATS; sleep 1;' case, using + a "-r 40,60" range will plot only samples collected + between 40th and 60th seconds). + Christoph Lameter, May 30, 2007 +Sergey Senozhatsky, October 23, 2015 -- cgit v1.2.3 From 25ee01a2fca02dfb5a3ce316e77910c468108199 Mon Sep 17 00:00:00 2001 From: Naoya Horiguchi Date: Thu, 5 Nov 2015 18:47:11 -0800 Subject: mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB), which is inconvenient when we want to know per-task or per-vma hugetlb usage. To solve this, this patch adds new fields for hugetlb usage like below: Size: 20480 kB Rss: 0 kB Pss: 0 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 0 kB Referenced: 0 kB Anonymous: 0 kB AnonHugePages: 0 kB Shared_Hugetlb: 18432 kB Private_Hugetlb: 2048 kB Swap: 0 kB KernelPageSize: 2048 kB MMUPageSize: 2048 kB Locked: 0 kB VmFlags: rd wr mr mw me de ht [hughd@google.com: fix Private_Hugetlb alignment] Signed-off-by: Naoya Horiguchi Acked-by: Joern Engel Acked-by: David Rientjes Acked-by: Michal Hocko Cc: Mike Kravetz Signed-off-by: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.txt | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'Documentation') diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 3a9d65c912e7..a7d6c06f36c4 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -424,6 +424,9 @@ Private_Clean: 0 kB Private_Dirty: 0 kB Referenced: 892 kB Anonymous: 0 kB +AnonHugePages: 0 kB +Shared_Hugetlb: 0 kB +Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB KernelPageSize: 4 kB @@ -452,6 +455,11 @@ and a page is modified, the file page is replaced by a private anonymous copy. "Swap" shows how much would-be-anonymous memory is also used, but out on swap. "SwapPss" shows proportional swap share of this mapping. +"AnonHugePages" shows the amount of memory backed by transparent hugepages. +"Shared_Hugetlb" and "Private_Hugetlb" show the amounts of memory backed by +hugetlbfs pages which are *not* counted in "RSS" or "PSS" fields for historical +reasons. And these are not included in {Shared,Private}_{Clean,Dirty} fields. + "VmFlags" field deserves a separate description. This member represents the kernel flags associated with the particular virtual memory area in a two-letter encoded manner.
The codes are the following: -- cgit v1.2.3 From 5d317b2b6536592a9b51fe65faed43d65ca9158e Mon Sep 17 00:00:00 2001 From: Naoya Horiguchi Date: Thu, 5 Nov 2015 18:47:14 -0800 Subject: mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status Currently there's no easy way to get per-process usage of hugetlb pages, which is inconvenient because userspace applications which use hugetlb typically want to control their processes on the basis of how much memory (including hugetlb) they use. So this patch simply provides easy access to the info via /proc/PID/status. Signed-off-by: Naoya Horiguchi Acked-by: Joern Engel Acked-by: David Rientjes Acked-by: Michal Hocko Cc: Mike Kravetz Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.txt | 2 ++ 1 file changed, 2 insertions(+) (limited to 'Documentation') diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index a7d6c06f36c4..12ac0e4455d4 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -175,6 +175,7 @@ read the file /proc/PID/status: VmLib: 1412 kB VmPTE: 20 kb VmSwap: 0 kB + HugetlbPages: 0 kB Threads: 1 SigQ: 0/28578 SigPnd: 0000000000000000 @@ -238,6 +239,7 @@ Table 1-2: Contents of the status files (as of 4.1) VmPTE size of page table entries VmPMD size of second level page tables VmSwap size of swap usage (the number of referred swapents) + HugetlbPages size of hugetlb memory portions Threads number of threads SigQ number of signals queued/max. number for queue SigPnd bitmap of pending signals for the thread -- cgit v1.2.3 From 80f73b4b71f767f7fcd85f68be18af7795904484 Mon Sep 17 00:00:00 2001 From: Ebru Akagunduz Date: Thu, 5 Nov 2015 18:47:32 -0800 Subject: Documentation/vm/transhuge.txt: add information about max_ptes_swap max_ptes_swap specifies how many pages can be brought in from swap when collapsing a group of pages into a transparent huge page. /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap A higher value can cause excessive swap IO and waste memory. A lower value can prevent THPs from being collapsed, resulting in fewer pages being collapsed into THPs, and lower memory access performance. Signed-off-by: Ebru Akagunduz Acked-by: Rik van Riel Acked-by: David Rientjes Cc: Oleg Nesterov Cc: Kirill A. Shutemov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/transhuge.txt | 10 ++++++++++ 1 file changed, 10 insertions(+) (limited to 'Documentation') diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt index 8143b9e8373d..8a282687ee06 100644 --- a/Documentation/vm/transhuge.txt +++ b/Documentation/vm/transhuge.txt @@ -170,6 +170,16 @@ A lower value leads to gain less thp performance. Value of max_ptes_none can waste cpu time very little, you can ignore it. +max_ptes_swap specifies how many pages can be brought in from +swap when collapsing a group of pages into a transparent huge page. + +/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap + +A higher value can cause excessive swap IO and waste +memory. A lower value can prevent THPs from being +collapsed, resulting in fewer pages being collapsed into +THPs, and lower memory access performance.
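+
+For example, to let khugepaged bring in at most 32 pages from swap per
+collapse (an illustrative value, not a recommendation):
+
+	echo 32 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap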
+ == Boot parameter == You can change the sysfs boot time defaults of Transparent Hugepage -- cgit v1.2.3 From 7a14239a8fff45a241b6943a3ac444d5b67fcbed Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Thu, 5 Nov 2015 18:49:30 -0800 Subject: mm Documentation: undoc non-linear vmas While updating some mm Documentation, I came across a few straggling references to the non-linear vmas which were happily removed in v4.0. Delete them. Signed-off-by: Hugh Dickins Cc: Christoph Lameter Cc: "Kirill A. Shutemov" Cc: Rik van Riel Acked-by: Vlastimil Babka Cc: Davidlohr Bueso Cc: Oleg Nesterov Cc: Sasha Levin Cc: Dmitry Vyukov Cc: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.txt | 1 - Documentation/vm/page_migration | 10 +++--- Documentation/vm/unevictable-lru.txt | 63 +++--------------------------------- 3 files changed, 9 insertions(+), 65 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 12ac0e4455d4..9781b977edcb 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -485,7 +485,6 @@ manner. The codes are the following: ac - area is accountable nr - swap space is not reserved for the area ht - area uses huge tlb pages - nl - non-linear mapping ar - architecture specific flag dd - do not include area into core dump sd - soft-dirty flag diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration index 6513fe2d90b8..479c3d0329bd 100644 --- a/Documentation/vm/page_migration +++ b/Documentation/vm/page_migration @@ -99,12 +99,10 @@ Steps: 4. The new page is prepped with some settings from the old page so that accesses to the new page will discover a page with the correct settings. -5. All the page table references to the page are converted - to migration entries or dropped (nonlinear vmas). - This decrease the mapcount of a page. If the resulting - mapcount is not zero then we do not migrate the page. - All user space processes that attempt to access the page - will now wait on the page lock. +5. All the page table references to the page are converted to migration + entries. This decreases the mapcount of a page. If the resulting + mapcount is not zero then we do not migrate the page. All user space + processes that attempt to access the page will now wait on the page lock. 6. The radix tree lock is taken. This will cause all processes trying to access the page via the mapping to block on the radix tree spinlock. diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt index 32ee3a67dba2..e983ee4e2758 100644 --- a/Documentation/vm/unevictable-lru.txt +++ b/Documentation/vm/unevictable-lru.txt @@ -552,63 +552,17 @@ different reverse map mechanisms. is really unevictable or not. In this case, try_to_unmap_anon() will return SWAP_AGAIN. - (*) try_to_unmap_file() - linear mappings + (*) try_to_unmap_file() Unmapping of a mapped file page works the same as for anonymous mappings, except that the scan visits all VMAs that map the page's index/page offset - in the page's mapping's reverse map priority search tree. It also visits - each VMA in the page's mapping's non-linear list, if the list is - non-empty. + in the page's mapping's reverse map interval search tree. 
As for anonymous pages, on encountering a VM_LOCKED VMA for a mapped file page, try_to_unmap_file() will attempt to acquire the associated mm_struct's mmap semaphore to mlock the page, returning SWAP_MLOCK if this is successful, and SWAP_AGAIN, if not. - (*) try_to_unmap_file() - non-linear mappings - - If a page's mapping contains a non-empty non-linear mapping VMA list, then - try_to_un{map|lock}() must also visit each VMA in that list to determine - whether the page is mapped in a VM_LOCKED VMA. Again, the scan must visit - all VMAs in the non-linear list to ensure that the pages is not/should not - be mlocked. - - If a VM_LOCKED VMA is found in the list, the scan could terminate. - However, there is no easy way to determine whether the page is actually - mapped in a given VMA - either for unmapping or testing whether the - VM_LOCKED VMA actually pins the page. - - try_to_unmap_file() handles non-linear mappings by scanning a certain - number of pages - a "cluster" - in each non-linear VMA associated with the - page's mapping, for each file mapped page that vmscan tries to unmap. If - this happens to unmap the page we're trying to unmap, try_to_unmap() will - notice this on return (page_mapcount(page) will be 0) and return - SWAP_SUCCESS. Otherwise, it will return SWAP_AGAIN, causing vmscan to - recirculate this page. We take advantage of the cluster scan in - try_to_unmap_cluster() as follows: - - For each non-linear VMA, try_to_unmap_cluster() attempts to acquire the - mmap semaphore of the associated mm_struct for read without blocking. - - If this attempt is successful and the VMA is VM_LOCKED, - try_to_unmap_cluster() will retain the mmap semaphore for the scan; - otherwise it drops it here. - - Then, for each page in the cluster, if we're holding the mmap semaphore - for a locked VMA, try_to_unmap_cluster() calls mlock_vma_page() to - mlock the page. This call is a no-op if the page is already locked, - but will mlock any pages in the non-linear mapping that happen to be - unlocked. - - If one of the pages so mlocked is the page passed in to try_to_unmap(), - try_to_unmap_cluster() will return SWAP_MLOCK, rather than the default - SWAP_AGAIN. This will allow vmscan to cull the page, rather than - recirculating it on the inactive list. - - Again, if try_to_unmap_cluster() cannot acquire the VMA's mmap sem, it - returns SWAP_AGAIN, indicating that the page is mapped by a VM_LOCKED - VMA, but couldn't be mlocked. - try_to_munlock() REVERSE MAP SCAN --------------------------------- @@ -625,10 +579,9 @@ introduced a variant of try_to_unmap() called try_to_munlock(). try_to_munlock() calls the same functions as try_to_unmap() for anonymous and mapped file pages with an additional argument specifying unlock versus unmap processing. Again, these functions walk the respective reverse maps looking -for VM_LOCKED VMAs. When such a VMA is found for anonymous pages and file -pages mapped in linear VMAs, as in the try_to_unmap() case, the functions -attempt to acquire the associated mmap semaphore, mlock the page via -mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the +for VM_LOCKED VMAs. When such a VMA is found, as in the try_to_unmap() case, +the functions attempt to acquire the associated mmap semaphore, mlock the page +via mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page. 
If try_to_unmap() is unable to acquire a VM_LOCKED VMA's associated mmap @@ -636,12 +589,6 @@ semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list() to recycle the page on the inactive list and hope that it has better luck with the page next time. -For file pages mapped into non-linear VMAs, the try_to_munlock() logic works -slightly differently. On encountering a VM_LOCKED non-linear VMA that might -map the page, try_to_munlock() returns SWAP_AGAIN without actually mlocking the -page. munlock_vma_page() will just leave the page unlocked and let vmscan deal -with it - the usual fallback position. - Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA. However, the scan can terminate when it encounters a VM_LOCKED VMA and can -- cgit v1.2.3 From b87537d9e2feb30f6a962f27eb32768682698d3b Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Thu, 5 Nov 2015 18:49:33 -0800 Subject: mm: rmap use pte lock not mmap_sem to set PageMlocked KernelThreadSanitizer (ktsan) has shown that the down_read_trylock() of mmap_sem in try_to_unmap_one() (when going to set PageMlocked on a page found mapped in a VM_LOCKED vma) is ineffective against races with exit_mmap()'s munlock_vma_pages_all(), because mmap_sem is not held when tearing down an mm. But that's okay, those races are benign; and although we've believed for years in that ugly down_read_trylock(), it's unsuitable for the job, and frustrates the good intention of setting PageMlocked when it fails. It just doesn't matter if here we read vm_flags an instant before or after a racing mlock() or munlock() or exit_mmap() sets or clears VM_LOCKED: the syscalls (or exit) work their way up the address space (taking pt locks after updating vm_flags) to establish the final state. We do still need to be careful never to mark a page Mlocked (hence unevictable) by any race that will not be corrected shortly after. The page lock protects from many of the races, but not all (a page is not necessarily locked when it's unmapped). But the pte lock we just dropped is good to cover the rest (and serializes even with munlock_vma_pages_all(), so no special barriers required): now hold on to the pte lock while calling mlock_vma_page(). Is that lock ordering safe? Yes, that's how follow_page_pte() calls it, and how page_remove_rmap() calls the complementary clear_page_mlock(). This fixes the following case (though not a case which anyone has complained of), which mmap_sem did not: truncation's preliminary unmap_mapping_range() is supposed to remove even the anonymous COWs of filecache pages, and that might race with try_to_unmap_one() on a VM_LOCKED vma, so that mlock_vma_page() sets PageMlocked just after zap_pte_range() unmaps the page, causing "Bad page state (mlocked)" when freed. The pte lock protects against this. You could say that it also protects against the more ordinary case, racing with the preliminary unmapping of a filecache page itself: but in our current tree, that's independently protected by i_mmap_rwsem; and that race would be why "Bad page state (mlocked)" was seen before commit 48ec833b7851 ("Revert mm/memory.c: share the i_mmap_rwsem"). Vlastimil Babka points out another race which this patch protects against. try_to_unmap_one() might reach its mlock_vma_page() TestSetPageMlocked a moment after munlock_vma_pages_all() did its Phase 1 TestClearPageMlocked: leaving PageMlocked and unevictable when it should be evictable. 
mmap_sem is ineffective because exit_mmap() does not hold it; page lock ineffective because __munlock_pagevec() only takes it afterwards, in Phase 2; pte lock is effective because __munlock_pagevec_fill() takes it to get the page, after VM_LOCKED was cleared from vm_flags, so visible to try_to_unmap_one. Kirill Shutemov points out that if the compiler chooses to implement a "vma->vm_flags &= VM_WHATEVER" or "vma->vm_flags |= VM_WHATEVER" operation with an intermediate store of unrelated bits set, since I'm here foregoing its usual protection by mmap_sem, try_to_unmap_one() might catch sight of a spurious VM_LOCKED in vm_flags, and make the wrong decision. This does not appear to be an immediate problem, but we may want to define vm_flags accessors in future, to guard against such a possibility. While we're here, make a related optimization in try_to_unmap_one(): if it's doing TTU_MUNLOCK, then there's no point at all in descending the page tables and getting the pt lock, unless the vma is VM_LOCKED. Yes, that can change racily, but it can change racily even without the optimization: it's not critical. Far better not to waste time here. Stopped short of separating try_to_munlock_one() from try_to_unmap_one() on this occasion, but that's probably the sensible next step - with a rename, given that try_to_munlock()'s business is to try to set Mlocked. Updated the unevictable-lru Documentation, to remove its reference to mmap semaphore, but found a few more updates needed in just that area. Signed-off-by: Hugh Dickins Cc: Christoph Lameter Cc: "Kirill A. Shutemov" Cc: Rik van Riel Acked-by: Vlastimil Babka Cc: Davidlohr Bueso Cc: Oleg Nesterov Cc: Sasha Levin Cc: Dmitry Vyukov Cc: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/unevictable-lru.txt | 61 ++++++++++-------------------------- 1 file changed, 16 insertions(+), 45 deletions(-) (limited to 'Documentation') diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt index e983ee4e2758..fa3b527086fa 100644 --- a/Documentation/vm/unevictable-lru.txt +++ b/Documentation/vm/unevictable-lru.txt @@ -531,37 +531,20 @@ map. try_to_unmap() is always called, by either vmscan for reclaim or for page migration, with the argument page locked and isolated from the LRU. Separate -functions handle anonymous and mapped file pages, as these types of pages have -different reverse map mechanisms. +functions handle anonymous and mapped file and KSM pages, as these types of +pages have different reverse map lookup mechanisms, with different locking. +In each case, whether rmap_walk_anon() or rmap_walk_file() or rmap_walk_ksm(), +it will call try_to_unmap_one() for every VMA which might contain the page. - (*) try_to_unmap_anon() +When trying to reclaim, if try_to_unmap_one() finds the page in a VM_LOCKED +VMA, it will then mlock the page via mlock_vma_page() instead of unmapping it, +and return SWAP_MLOCK to indicate that the page is unevictable: and the scan +stops there. - To unmap anonymous pages, each VMA in the list anchored in the anon_vma - must be visited - at least until a VM_LOCKED VMA is encountered. If the - page is being unmapped for migration, VM_LOCKED VMAs do not stop the - process because mlocked pages are migratable. However, for reclaim, if - the page is mapped into a VM_LOCKED VMA, the scan stops. - - try_to_unmap_anon() attempts to acquire in read mode the mmap semaphore of - the mm_struct to which the VMA belongs.
If this is successful, it will - mlock the page via mlock_vma_page() - we wouldn't have gotten to - try_to_unmap_anon() if the page were already mlocked - and will return - SWAP_MLOCK, indicating that the page is unevictable. - - If the mmap semaphore cannot be acquired, we are not sure whether the page - is really unevictable or not. In this case, try_to_unmap_anon() will - return SWAP_AGAIN. - - (*) try_to_unmap_file() - - Unmapping of a mapped file page works the same as for anonymous mappings, - except that the scan visits all VMAs that map the page's index/page offset - in the page's mapping's reverse map interval search tree. - - As for anonymous pages, on encountering a VM_LOCKED VMA for a mapped file - page, try_to_unmap_file() will attempt to acquire the associated - mm_struct's mmap semaphore to mlock the page, returning SWAP_MLOCK if this - is successful, and SWAP_AGAIN, if not. +mlock_vma_page() is called while holding the page table's lock (in addition +to the page lock, and the rmap lock): to serialize against concurrent mlock or +munlock or munmap system calls, mm teardown (munlock_vma_pages_all), reclaim, +holepunching, and truncation of file pages and their anonymous COWed pages. try_to_munlock() REVERSE MAP SCAN @@ -577,22 +560,15 @@ all PTEs from the page. For this purpose, the unevictable/mlock infrastructure introduced a variant of try_to_unmap() called try_to_munlock(). try_to_munlock() calls the same functions as try_to_unmap() for anonymous and -mapped file pages with an additional argument specifying unlock versus unmap +mapped file and KSM pages with a flag argument specifying unlock versus unmap processing. Again, these functions walk the respective reverse maps looking for VM_LOCKED VMAs. When such a VMA is found, as in the try_to_unmap() case, -the functions attempt to acquire the associated mmap semaphore, mlock the page -via mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the -pre-clearing of the page's PG_mlocked done by munlock_vma_page. - -If try_to_unmap() is unable to acquire a VM_LOCKED VMA's associated mmap -semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list() to -recycle the page on the inactive list and hope that it has better luck with the -page next time. +the functions mlock the page via mlock_vma_page() and return SWAP_MLOCK. This +undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page. Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA. -However, the scan can terminate when it encounters a VM_LOCKED VMA and can -successfully acquire the VMA's mmap semaphore for read and mlock the page. +However, the scan can terminate when it encounters a VM_LOCKED VMA. Although try_to_munlock() might be called a great many times when munlocking a large region or tearing down a large address space that has been mlocked via mlockall(), overall this is a fairly rare event. @@ -620,11 +596,6 @@ Some examples of these unevictable pages on the LRU lists are: (3) mlocked pages that could not be isolated from the LRU and moved to the unevictable list in mlock_vma_page(). - (4) Pages mapped into multiple VM_LOCKED VMAs, but try_to_munlock() couldn't - acquire the VMA's mmap semaphore to test the flags and set PageMlocked. - munlock_vma_page() was forced to let the page back on to the normal LRU - list for vmscan to handle. 
- shrink_inactive_list() also diverts any unevictable pages that it finds on the inactive lists to the appropriate zone's unevictable list. -- cgit v1.2.3 From cf4b769abb8aef01f887543cb8308c0d8671367c Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Thu, 5 Nov 2015 18:50:02 -0800 Subject: mm: page migration avoid touching newpage until no going back We have had trouble in the past from the way in which page migration's newpage is initialized in dribs and drabs - see commit 8bdd63809160 ("mm: fix direct reclaim writeback regression") which proposed a cleanup. We have no actual problem now, but I think the procedure would be clearer (and alternative get_new_page pools safer to implement) if we assert that newpage is not touched until we are sure that it's going to be used - except for taking the trylock on it in __unmap_and_move(). So shift the early initializations from move_to_new_page() into migrate_page_move_mapping(), mapping and NULL-mapping paths. Similarly migrate_huge_page_move_mapping(), but its NULL-mapping path can just be deleted: you cannot reach hugetlbfs_migrate_page() with a NULL mapping. Adjust stages 3 to 8 in the Documentation file accordingly. Signed-off-by: Hugh Dickins Cc: Christoph Lameter Cc: "Kirill A. Shutemov" Cc: Rik van Riel Cc: Vlastimil Babka Cc: Davidlohr Bueso Cc: Oleg Nesterov Cc: Sasha Levin Cc: Dmitry Vyukov Cc: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/page_migration | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) (limited to 'Documentation') diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration index 479c3d0329bd..fea5c0864170 100644 --- a/Documentation/vm/page_migration +++ b/Documentation/vm/page_migration @@ -92,27 +92,26 @@ Steps: 2. Insure that writeback is complete. -3. Prep the new page that we want to move to. It is locked - and set to not being uptodate so that all accesses to the new - page immediately lock while the move is in progress. +3. Lock the new page that we want to move to. It is locked so that accesses to + this (not yet uptodate) page immediately lock while the move is in progress. -4. The new page is prepped with some settings from the old page so that - accesses to the new page will discover a page with the correct settings. - -5. All the page table references to the page are converted to migration +4. All the page table references to the page are converted to migration entries. This decreases the mapcount of a page. If the resulting mapcount is not zero then we do not migrate the page. All user space processes that attempt to access the page will now wait on the page lock. -6. The radix tree lock is taken. This will cause all processes trying +5. The radix tree lock is taken. This will cause all processes trying to access the page via the mapping to block on the radix tree spinlock. -7. The refcount of the page is examined and we back out if references remain +6. The refcount of the page is examined and we back out if references remain otherwise we know that we are the only one referencing this page. -8. The radix tree is checked and if it does not contain the pointer to this +7. The radix tree is checked and if it does not contain the pointer to this page then we back out because someone else modified the radix tree. +8. The new page is prepped with some settings from the old page so that + accesses to the new page will discover a page with the correct settings. + 9. The radix tree is changed to point to the new page. 10. 
The reference count of the old page is dropped because the radix tree -- cgit v1.2.3 From a5be35632cf3048d3922e2e2fb05a4c6f0889ff0 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Thu, 5 Nov 2015 18:50:37 -0800 Subject: Documentation/filesystems/proc.txt: a little tidying There's an odd line about "Locked" at the head of the description of /proc/meminfo: it seems to have strayed from /proc/PID/smaps, so lead it back there. Move "Swap" and "SwapPss" descriptions down above it, to match the order in the file (though "PageSize"s still undescribed). The example of "Locked: 374 kB" (the same as Pss, neither Rss nor Size) is so unlikely as to be misleading: just make it 0, this is /bin/bash text; which would be "dw" (disabled write) not "de" (do not expand). Signed-off-by: Hugh Dickins Acked-by: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.txt | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 9781b977edcb..1e4a6cc1b6ea 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -433,8 +433,8 @@ Swap: 0 kB SwapPss: 0 kB KernelPageSize: 4 kB MMUPageSize: 4 kB -Locked: 374 kB -VmFlags: rd ex mr mw me de +Locked: 0 kB +VmFlags: rd ex mr mw me dw the first of these lines shows the same information as is displayed for the mapping in /proc/PID/maps. The remaining lines show the size of the mapping @@ -454,13 +454,13 @@ accessed. "Anonymous" shows the amount of memory that does not belong to any file. Even a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE and a page is modified, the file page is replaced by a private anonymous copy. -"Swap" shows how much would-be-anonymous memory is also used, but out on -swap. -"SwapPss" shows proportional swap share of this mapping. "AnonHugePages" shows the amount of memory backed by transparent hugepages. "Shared_Hugetlb" and "Private_Hugetlb" show the amounts of memory backed by hugetlbfs pages which are *not* counted in "RSS" or "PSS" fields for historical reasons. And these are not included in {Shared,Private}_{Clean,Dirty} fields. +"Swap" shows how much would-be-anonymous memory is also used, but out on swap. +"SwapPss" shows proportional swap share of this mapping. +"Locked" indicates whether the mapping is locked in memory or not. "VmFlags" field deserves a separate description. This member represents the kernel flags associated with the particular virtual memory area in a two-letter encoded @@ -824,9 +824,6 @@ varies by architecture and compile options. The following is from a > cat /proc/meminfo -The "Locked" indicates whether the mapping is locked in memory or not.
- - MemTotal: 16344972 kB MemFree: 13634064 kB MemAvailable: 14836172 kB -- cgit v1.2.3 From 0295fd5d570626817d10deadf5a2ad5e49c36a1d Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Thu, 5 Nov 2015 18:51:06 -0800 Subject: kasan: various fixes in documentation [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Andrey Konovalov Cc: Andrey Ryabinin Cc: Dmitry Vyukov Cc: Alexander Potapenko Cc: Konstantin Serebryany Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/kasan.txt | 43 ++++++++++++++++++++++--------------------- 1 file changed, 22 insertions(+), 21 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kasan.txt b/Documentation/kasan.txt index 0d32355a4c34..94c881579374 100644 --- a/Documentation/kasan.txt +++ b/Documentation/kasan.txt @@ -1,32 +1,31 @@ -Kernel address sanitizer -================ +KernelAddressSanitizer (KASAN) +============================== 0. Overview =========== -Kernel Address sanitizer (KASan) is a dynamic memory error detector. It provides +KernelAddressSANitizer (KASAN) is a dynamic memory error detector. It provides a fast and comprehensive solution for finding use-after-free and out-of-bounds bugs. -KASan uses compile-time instrumentation for checking every memory access, -therefore you will need a gcc version of 4.9.2 or later. KASan could detect out -of bounds accesses to stack or global variables, but only if gcc 5.0 or later was -used to built the kernel. +KASAN uses compile-time instrumentation for checking every memory access, +therefore you will need a GCC version 4.9.2 or later. GCC 5.0 or later is +required for detection of out-of-bounds accesses to stack or global variables. -Currently KASan is supported only for x86_64 architecture and requires that the -kernel be built with the SLUB allocator. +Currently KASAN is supported only for x86_64 architecture and requires the +kernel to be built with the SLUB allocator. 1. Usage -========= +======== To enable KASAN configure kernel with: CONFIG_KASAN = y -and choose between CONFIG_KASAN_OUTLINE and CONFIG_KASAN_INLINE. Outline/inline -is compiler instrumentation types. The former produces smaller binary the -latter is 1.1 - 2 times faster. Inline instrumentation requires a gcc version -of 5.0 or later. +and choose between CONFIG_KASAN_OUTLINE and CONFIG_KASAN_INLINE. Outline and +inline are compiler instrumentation types. The former produces a smaller binary, +the latter is 1.1 - 2 times faster. Inline instrumentation requires a GCC +version 5.0 or later. Currently KASAN works only with the SLUB memory allocator. For better bug detection and nicer report, enable CONFIG_STACKTRACE and put @@ -42,7 +41,7 @@ similar to the following to the respective kernel Makefile: KASAN_SANITIZE := n 1.1 Error reports -========== +================= A typical out of bounds access report looks like this: @@ -119,14 +118,16 @@ Memory state around the buggy address: ffff8800693bc800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== -First sections describe slub object where bad access happened. -See 'SLUB Debug output' section in Documentation/vm/slub.txt for details. +The header of the report describes what kind of bug happened and what kind of +access caused it. It's followed by the description of the accessed slub object +(see 'SLUB Debug output' section in Documentation/vm/slub.txt for details) and +the description of the accessed memory page.
In the last section the report shows memory state around the accessed address. -Reading this part requires some more understanding of how KASAN works. +Reading this part requires some understanding of how KASAN works. -Each 8 bytes of memory are encoded in one shadow byte as accessible, -partially accessible, freed or they can be part of a redzone. +The state of each 8 aligned bytes of memory is encoded in one shadow byte. +Those 8 bytes can be accessible, partially accessible, freed or be a redzone. We use the following encoding for each shadow byte: 0 means that all 8 bytes of the corresponding memory region are accessible; number N (1 <= N <= 7) means that the first N bytes are accessible, and other (8 - N) bytes are not; @@ -139,7 +140,7 @@ the accessed address is partially accessible. 2. Implementation details -======================== +========================= From a high level, our approach to memory error detection is similar to that of kmemcheck: use shadow memory to record whether each byte of memory is safe -- cgit v1.2.3 From 89d3c87e20d95e3238eac85e43de7b3cb1f39d8b Mon Sep 17 00:00:00 2001 From: Andrey Ryabinin Date: Thu, 5 Nov 2015 18:51:23 -0800 Subject: mm, slub, kasan: enable user tracking by default with KASAN=y It's recommended to have slub's user tracking enabled with CONFIG_KASAN, because: a) User tracking disables slab merging which improves detecting out-of-bounds accesses. b) User tracking metadata acts as redzone which also improves detecting out-of-bounds accesses. c) User tracking provides additional information about the object. This information helps to understand bugs. Currently it is not enabled by default. Besides recompiling the kernel with KASAN and reinstalling it, the user also has to change the boot cmdline, which is not very handy. Enable slub user tracking by default with KASAN=y, since there is no good reason to not do this. [akpm@linux-foundation.org: little fixes, per David] Signed-off-by: Andrey Ryabinin Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/kasan.txt | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kasan.txt b/Documentation/kasan.txt index 94c881579374..aa1e0c91e368 100644 --- a/Documentation/kasan.txt +++ b/Documentation/kasan.txt @@ -28,8 +28,7 @@ the latter is 1.1 - 2 times faster. Inline instrumentation requires a GCC version 5.0 or later. Currently KASAN works only with the SLUB memory allocator. -For better bug detection and nicer report, enable CONFIG_STACKTRACE and put -at least 'slub_debug=U' in the boot cmdline. +For better bug detection and nicer reporting, enable CONFIG_STACKTRACE. -- cgit v1.2.3