summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2012-08-09workqueue: perform cpu down operations from low priority cpu_notifier()Tejun Heo
commit 6575820221f7a4dd6eadecf7bf83cdd154335eda upstream. Currently, all workqueue cpu hotplug operations run off CPU_PRI_WORKQUEUE which is higher than normal notifiers. This is to ensure that workqueue is up and running while bringing up a CPU before other notifiers try to use workqueue on the CPU. Per-cpu workqueues are supposed to remain working and bound to the CPU for normal CPU_DOWN_PREPARE notifiers. This holds mostly true even with workqueue offlining running with higher priority because workqueue CPU_DOWN_PREPARE only creates a bound trustee thread which runs the per-cpu workqueue without concurrency management without explicitly detaching the existing workers. However, if the trustee needs to create new workers, it creates unbound workers which may wander off to other CPUs while CPU_DOWN_PREPARE notifiers are in progress. Furthermore, if the CPU down is cancelled, the per-CPU workqueue may end up with workers which aren't bound to the CPU. While reliably reproducible with a convoluted artificial test-case involving scheduling and flushing CPU burning work items from CPU down notifiers, this isn't very likely to happen in the wild, and, even when it happens, the effects are likely to be hidden by the following successful CPU down. Fix it by using different priorities for up and down notifiers - high priority for up operations and low priority for down operations. Workqueue cpu hotplug operations will soon go through further cleanup. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09stable: update references to older 2.6 versions for 3.xPaul Gortmaker
commit 2584f5212d97b664be250ad5700a2d0fee31a10d upstream. Also add information on where the respective trees are. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Acked-by: Rob Landley <rob@landley.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09ftrace: Disable function tracing during suspend/resume and hibernation, againSrivatsa S. Bhat
commit 443772d408a25af62498793f6f805ce3c559309a upstream. If function tracing is enabled for some of the low-level suspend/resume functions, it leads to triple fault during resume from suspend, ultimately ending up in a reboot instead of a resume (or a total refusal to come out of suspended state, on some machines). This issue was explained in more detail in commit f42ac38c59e0a03d (ftrace: disable tracing for suspend to ram). However, the changes made by that commit got reverted by commit cbe2f5a6e84eebb (tracing: allow tracing of suspend/resume & hibernation code again). So, unfortunately since things are not yet robust enough to allow tracing of low-level suspend/resume functions, suspend/resume is still broken when ftrace is enabled. So fix this by disabling function tracing during suspend/resume & hibernation. Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09locks: fix checking of fcntl_setlease argumentJ. Bruce Fields
commit 0ec4f431eb56d633da3a55da67d5c4b88886ccc7 upstream. The only checks of the long argument passed to fcntl(fd,F_SETLEASE,.) are done after converting the long to an int. Thus some illegal values may be let through and cause problems in later code. [ They actually *don't* cause problems in mainline, as of Dave Jones's commit 8d657eb3b438 "Remove easily user-triggerable BUG from generic_setlease", but we should fix this anyway. And this patch will be necessary to fix real bugs on earlier kernels. ] Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09usb: gadget: Fix g_ether interface link statusKevin Cernekee
commit 31bde1ceaa873bcaecd49e829bfabceacc4c512d upstream. A "usb0" interface that has never been connected to a host has an unknown operstate, and therefore the IFF_RUNNING flag is (incorrectly) asserted when queried by ifconfig, ifplugd, etc. This is a result of calling netif_carrier_off() too early in the probe function; it should be called after register_netdev(). Similar problems have been fixed in many other drivers, e.g.: e826eafa6 (bonding: Call netif_carrier_off after register_netdevice) 0d672e9f8 (drivers/net: Call netif_carrier_off at the end of the probe) 6a3c869a6 (cxgb4: fix reported state of interfaces without link) Fix is to move netif_carrier_off() to the end of the function. Signed-off-by: Kevin Cernekee <cernekee@gmail.com> Signed-off-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09usbdevfs: Correct amount of data copied to user in processcompl_compatHans de Goede
commit 2102e06a5f2e414694921f23591f072a5ba7db9f upstream. iso data buffers may have holes in them if some packets were short, so for iso urbs we should always copy the entire buffer, just like the regular processcompl does. Signed-off-by: Hans de Goede <hdegoede@redhat.com> Acked-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09ALSA: hda - Add support for Realtek ALC282David Henningsson
commit 4e01ec636e64707d202a1ca21a47bbc6d53085b7 upstream. This codec has a separate dmic path (separate dmic only ADC), and thus it looks mostly like ALC275. BugLink: https://bugs.launchpad.net/bugs/1025377 Tested-by: Ray Chen <ray.chen@canonical.com> Signed-off-by: David Henningsson <david.henningsson@canonical.com> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09ARM: OMAP2+: OPP: Fix to ensure check of right oppdef after bad oneNishanth Menon
commit b110547e586eb5825bc1d04aa9147bff83b57672 upstream. Commit 9fa2df6b90786301b175e264f5fa9846aba81a65 (ARM: OMAP2+: OPP: allow OPP enumeration to continue if device is not present) makes the logic: for (i = 0; i < opp_def_size; i++) { <snip> if (!oh || !oh->od) { <snip> continue; } <snip> opp_def++; } In short, the moment we hit a "Bad OPP", we end up looping the list comparing against the bad opp definition pointer for the rest of the iteration count. Instead, increment opp_def in the for loop itself and allow continue to be used in code without much thought so that we check the next set of OPP definition pointers :) Cc: Steve Sakoman <steve@sakoman.com> Cc: Tony Lindgren <tony@atomide.com> Signed-off-by: Nishanth Menon <nm@ti.com> Signed-off-by: Kevin Hilman <khilman@ti.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09SCSI: Avoid dangling pointer in scsi_requeue_command()Bart Van Assche
commit 940f5d47e2f2e1fa00443921a0abf4822335b54d upstream. When we call scsi_unprep_request() the command associated with the request gets destroyed and therefore drops its reference on the device. If this was the only reference, the device may get released and we end up with a NULL pointer deref when we call blk_requeue_request. Reported-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Mike Christie <michaelc@cs.wisc.edu> Reviewed-by: Tejun Heo <tj@kernel.org> [jejb: enhance commend and add commit log for stable] Signed-off-by: James Bottomley <JBottomley@Parallels.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09SCSI: fix hot unplug vs async scan raceDan Williams
commit 3b661a92e869ebe2358de8f4b3230ad84f7fce51 upstream. The following crash results from cases where the end_device has been removed before scsi_sysfs_add_sdev has had a chance to run. BUG: unable to handle kernel NULL pointer dereference at 0000000000000098 IP: [<ffffffff8115e100>] sysfs_create_dir+0x32/0xb6 ... Call Trace: [<ffffffff8125e4a8>] kobject_add_internal+0x120/0x1e3 [<ffffffff81075149>] ? trace_hardirqs_on+0xd/0xf [<ffffffff8125e641>] kobject_add_varg+0x41/0x50 [<ffffffff8125e70b>] kobject_add+0x64/0x66 [<ffffffff8131122b>] device_add+0x12d/0x63a [<ffffffff814b65ea>] ? _raw_spin_unlock_irqrestore+0x47/0x56 [<ffffffff8107de15>] ? module_refcount+0x89/0xa0 [<ffffffff8132f348>] scsi_sysfs_add_sdev+0x4e/0x28a [<ffffffff8132dcbb>] do_scan_async+0x9c/0x145 ...teach scsi_sysfs_add_devices() to check for deleted devices() before trying to add them, and teach scsi_remove_target() how to remove targets that have not been added via device_add(). Reported-by: Dariusz Majchrzak <dariusz.majchrzak@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09SCSI: fix eh wakeup (scsi_schedule_eh vs scsi_restart_operations)Dan Williams
commit 57fc2e335fd3c2f898ee73570dc81426c28dc7b4 upstream. Rapid ata hotplug on a libsas controller results in cases where libsas is waiting indefinitely on eh to perform an ata probe. A race exists between scsi_schedule_eh() and scsi_restart_operations() in the case when scsi_restart_operations() issues i/o to other devices in the sas domain. When this happens the host state transitions from SHOST_RECOVERY (set by scsi_schedule_eh) back to SHOST_RUNNING and ->host_busy is non-zero so we put the eh thread to sleep even though ->host_eh_scheduled is active. Before putting the error handler to sleep we need to check if the host_state needs to return to SHOST_RECOVERY for another trip through eh. Since i/o that is released by scsi_restart_operations has been blocked for at least one eh cycle, this implementation allows those i/o's to run before another eh cycle starts to discourage hung task timeouts. Reported-by: Tom Jackson <thomas.p.jackson@intel.com> Tested-by: Tom Jackson <thomas.p.jackson@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09SCSI: libsas: fix sas_discover_devices return code handlingDan Williams
commit b17caa174a7e1fd2e17b26e210d4ee91c4c28b37 upstream. commit 198439e4 [SCSI] libsas: do not set res = 0 in sas_ex_discover_dev() commit 19252de6 [SCSI] libsas: fix wide port hotplug issues The above commits seem to have confused the return value of sas_ex_discover_dev which is non-zero on failure and sas_ex_join_wide_port which just indicates short circuiting discovery on already established ports. The result is random discovery failures depending on configuration. Calls to sas_ex_join_wide_port are the source of the trouble as its return value is errantly assigned to 'res'. Convert it to bool and stop returning its result up the stack. Tested-by: Dan Melnic <dan.melnic@amd.com> Reported-by: Dan Melnic <dan.melnic@amd.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jack Wang <jack_wang@usish.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09SCSI: libsas: continue revalidationDan Williams
commit 26f2f199ff150d8876b2641c41e60d1c92d2fb81 upstream. Continue running revalidation until no more broadcast devices are discovered. Fixes cases where re-discovery completes too early in a domain with multiple expanders with pending re-discovery events. Servicing BCNs can get backed up behind error recovery. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09powerpc: Fix wrong divisor in usecs_to_cputimeAndreas Schwab
commit 9f5072d4f63f28d30d343573830ac6c85fc0deff upstream. Commit d57af9b (taskstats: use real microsecond granularity for CPU times) renamed msecs_to_cputime to usecs_to_cputime, but failed to update all numbers on the way. This causes nonsensical cpu idle/iowait values to be displayed in /proc/stat (the only user of usecs_to_cputime so far). This also renames __cputime_msec_factor to __cputime_usec_factor, adapting its value and using it directly in cputime_to_usecs instead of doing two multiplications. Signed-off-by: Andreas Schwab <schwab@linux-m68k.org> Acked-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09powerpc: Add "memory" attribute for mfmsr()Tiejun Chen
commit b416c9a10baae6a177b4f9ee858b8d309542fbef upstream. Add "memory" attribute in inline assembly language as a compiler barrier to make sure 4.6.x GCC don't reorder mfmsr(). Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09powerpc/ftrace: Fix assembly trampoline register usageroger blofeld
commit fd5a42980e1cf327b7240adf5e7b51ea41c23437 upstream. Just like the module loader, ftrace needs to be updated to use r12 instead of r11 with newer gcc's. Signed-off-by: Roger Blofeld <blofeldus@yahoo.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09mmc: sdhci-pci: CaFe has broken card detectionDaniel Drake
commit 55fc05b7414274f17795cd0e8a3b1546f3649d5e upstream. At http://dev.laptop.org/ticket/11980 we have determined that the Marvell CaFe SDHCI controller reports bad card presence during resume. It reports that no card is present even when it is. This is a regression -- resume worked back around 2.6.37. Around 400ms after resuming, a "card inserted" interrupt is generated, at which point it starts reporting presence. Work around this hardware oddity by setting the SDHCI_QUIRK_BROKEN_CARD_DETECTION flag. Thanks to Chris Ball for helping with diagnosis. Signed-off-by: Daniel Drake <dsd@laptop.org> Signed-off-by: Chris Ball <cjb@laptop.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01Linux 3.0.39v3.0.39Greg Kroah-Hartman
2012-08-01vmscan: fix initial shrinker size handlingKonstantin Khlebnikov
commit 635697c663f38106063d5659f0cf2e45afcd4bb5 upstream. Stable note: The commit [acf92b48: vmscan: shrinker->nr updates race and go wrong] aimed to reduce excessive reclaim of slab objects but had bug in how it treated shrinker functions that returned -1. A shrinker function can return -1, means that it cannot do anything without a risk of deadlock. For example prune_super() does this if it cannot grab a superblock refrence, even if nr_to_scan=0. Currently we interpret this -1 as a ULONG_MAX size shrinker and evaluate `total_scan' according to this. So the next time around this shrinker can cause really big pressure. Let's skip such shrinkers instead. Also make total_scan signed, otherwise the check (total_scan < 0) below never works. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01mm/hugetlb: fix warning in alloc_huge_page/dequeue_huge_page_vmaKonstantin Khlebnikov
commit b1c12cbcd0a02527c180a862e8971e249d3b347d upstream. Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely expensive and severely impacted page allocator performance. This is part of a series of patches that reduce page allocator overhead. Fix a gcc warning (and bug?) introduced in cc9a6c877 ("cpuset: mm: reduce large amounts of memory barrier related damage v3") Local variable "page" can be uninitialized if the nodemask from vma policy does not intersects with nodemask from cpuset. Even if it doesn't happens it is better to initialize this variable explicitly than to introduce a kernel oops in a weird corner case. mm/hugetlb.c: In function `alloc_huge_page': mm/hugetlb.c:1135:5: warning: `page' may be used uninitialized in this function Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01cpuset: mm: reduce large amounts of memory barrier related damage v3Mel Gorman
commit cc9a6c8776615f9c194ccf0b63a0aa5628235545 upstream. Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely expensive and severely impacted page allocator performance. This is part of a series of patches that reduce page allocator overhead. Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing cpuset's mems") wins a super prize for the largest number of memory barriers entered into fast paths for one commit. [get|put]_mems_allowed is incredibly heavy with pairs of full memory barriers inserted into a number of hot paths. This was detected while investigating at large page allocator slowdown introduced some time after 2.6.32. The largest portion of this overhead was shown by oprofile to be at an mfence introduced by this commit into the page allocator hot path. For extra style points, the commit introduced the use of yield() in an implementation of what looks like a spinning mutex. This patch replaces the full memory barriers on both read and write sides with a sequence counter with just read barriers on the fast path side. This is much cheaper on some architectures, including x86. The main bulk of the patch is the retry logic if the nodemask changes in a manner that can cause a false failure. While updating the nodemask, a check is made to see if a false failure is a risk. If it is, the sequence number gets bumped and parallel allocators will briefly stall while the nodemask update takes place. In a page fault test microbenchmark, oprofile samples from __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The actual results were 3.3.0-rc3 3.3.0-rc3 rc3-vanilla nobarrier-v2r1 Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%) Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%) Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%) Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%) Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%) Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%) Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%) Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%) Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%) Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%) Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%) Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%) Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%) Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%) Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%) MMTests Statistics: duration Sys Time Running Test (seconds) 135.68 132.17 User+Sys Time Running Test (seconds) 164.2 160.13 Total Elapsed Time (seconds) 123.46 120.87 The overall improvement is small but the System CPU time is much improved and roughly in correlation to what oprofile reported (these performance figures are without profiling so skew is expected). The actual number of page faults is noticeably improved. For benchmarks like kernel builds, the overall benefit is marginal but the system CPU time is slightly reduced. To test the actual bug the commit fixed I opened two terminals. The first ran within a cpuset and continually ran a small program that faulted 100M of anonymous data. In a second window, the nodemask of the cpuset was continually randomised in a loop. Without the commit, the program would fail every so often (usually within 10 seconds) and obviously with the commit everything worked fine. With this patch applied, it also worked fine so the fix should be functionally equivalent. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemaskDavid Rientjes
commit b246272ecc5ac68c743b15c9e41a2275f7ce70e2 upstream. Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely expensive and severely impacted page allocator performance. This is part of a series of patches that reduce page allocator overhead. Kernels where MAX_NUMNODES > BITS_PER_LONG may temporarily see an empty nodemask in a tsk's mempolicy if its previous nodemask is remapped onto a new set of allowed cpuset nodes where the two nodemasks, as a result of the remap, are now disjoint. c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing cpuset's mems") adds get_mems_allowed() to prevent the set of allowed nodes from changing for a thread. This causes any update to a set of allowed nodes to stall until put_mems_allowed() is called. This stall is unncessary, however, if at least one node remains unchanged in the update to the set of allowed nodes. This was addressed by 89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one node remains set"), but it's still possible that an empty nodemask may be read from a mempolicy because the old nodemask may be remapped to the new nodemask during rebind. To prevent this, only avoid the stall if there is no mempolicy for the thread being changed. This is a temporary solution until all reads from mempolicy nodemasks can be guaranteed to not be empty without the get_mems_allowed() synchronization. Also moves the check for nodemask intersection inside task_lock() so that tsk->mems_allowed cannot change. This ensures that nothing can set this tsk's mems_allowed out from under us and also protects tsk->mempolicy. Reported-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <paul@paulmenage.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01cpusets: avoid looping when storing to mems_allowed if one node remains setDavid Rientjes
commit 89e8a244b97e48f1f30e898b6f32acca477f2a13 upstream. Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely expensive and severely impacted page allocator performance. This is part of a series of patches that reduce page allocator overhead. {get,put}_mems_allowed() exist so that general kernel code may locklessly access a task's set of allowable nodes without having the chance that a concurrent write will cause the nodemask to be empty on configurations where MAX_NUMNODES > BITS_PER_LONG. This could incur a significant delay, however, especially in low memory conditions because the page allocator is blocking and reclaim requires get_mems_allowed() itself. It is not atypical to see writes to cpuset.mems take over 2 seconds to complete, for example. In low memory conditions, this is problematic because it's one of the most imporant times to change cpuset.mems in the first place! The only way a task's set of allowable nodes may change is through cpusets by writing to cpuset.mems and when attaching a task to a generic code is not reading the nodemask with get_mems_allowed() at the same time, and then clearing all the old nodes. This prevents the possibility that a reader will see an empty nodemask at the same time the writer is storing a new nodemask. If at least one node remains unchanged, though, it's possible to simply set all new nodes and then clear all the old nodes. Changing a task's nodemask is protected by cgroup_mutex so it's guaranteed that two threads are not changing the same task's nodemask at the same time, so the nodemask is guaranteed to be stored before another thread changes it and determines whether a node remains set or not. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Paul Menage <paul@paulmenage.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01mm: vmscan: convert global reclaim to per-memcg LRU listsJohannes Weiner
commit b95a2f2d486d0d768a92879c023a03757b9c7e58 upstream - WARNING: this is a substitute patch. Stable note: Not tracked in Bugzilla. This is a partial backport of an upstream commit addressing a completely different issue that accidentally contained an important fix. The workload this patch helps was memcached when IO is started in the background. memcached should stay resident but without this patch it gets swapped. Sometimes this manifests as a drop in throughput but mostly it was observed through /proc/vmstat. Commit [246e87a9: memcg: fix get_scan_count() for small targets] was meant to fix a problem whereby small scan targets on memcg were ignored causing priority to raise too sharply. It forced scanning to take place if the target was small, memcg or kswapd. From the time it was introduced it caused excessive reclaim by kswapd with workloads being pushed to swap that previously would have stayed resident. This was accidentally fixed in commit [b95a2f2d: mm: vmscan: convert global reclaim to per-memcg LRU lists] by making it harder for kswapd to force scan small targets but that patchset is not suitable for backporting. This was later changed again by commit [90126375: mm/vmscan: push lruvec pointer into get_scan_count()] into a format that looks like it would be a straight-forward backport but there is a subtle difference due to the use of lruvecs. The impact of the accidental fix is to make it harder for kswapd to force scan small targets by taking zone->all_unreclaimable into account. This patch is the closest equivalent available based on what is backported. Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01mm: test PageSwapBacked in lumpy reclaimHugh Dickins
commit 043bcbe5ec51e0478ef2b44acef17193e01d7f70 upstream. Stable note: Not tracked in Bugzilla. There were reports of shared mapped pages being unfairly reclaimed in comparison to older kernels. This is being addressed over time. Even though the subject refers to lumpy reclaim, it impacts compaction as well. Lumpy reclaim does well to stop at a PageAnon when there's no swap, but better is to stop at any PageSwapBacked, which includes shmem/tmpfs too. Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-08-01mm/vmscan.c: consider swap space when deciding whether to continue reclaimMinchan Kim
commit 86cfd3a45042ab242d47f3935a02811a402beab6 upstream. Stable note: Not tracked in Bugzilla. This patch reduces kswapd CPU usage on swapless systems with high anonymous memory usage. It's pointless to continue reclaiming when we have no swap space and lots of anon pages in the inactive list. Without this patch, it is possible when swap is disabled to continue trying to reclaim when there are only anonymous pages in the system even though that will not make any progress. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01vmscan: activate executable pages after first usageKonstantin Khlebnikov
commit c909e99364c8b6ca07864d752950b6b4ecf6bef4 upstream. Stable note: Not tracked in Bugzilla. There were reports of shared mapped pages being unfairly reclaimed in comparison to older kernels. This is being addressed over time. Logic added in commit 8cab4754d24a0 ("vmscan: make mapped executable pages the first class citizen") was noticeably weakened in commit 645747462435d84 ("vmscan: detect mapped file pages used only once"). Currently these pages can become "first class citizens" only after second usage. After this patch page_check_references() will activate they after first usage, and executable code gets yet better chance to stay in memory. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01vmscan: promote shared file mapped pagesKonstantin Khlebnikov
commit 34dbc67a644f11ab3475d822d72e25409911e760 upstream. Stable note: Not tracked in Bugzilla. There were reports of shared mapped pages being unfairly reclaimed in comparison to older kernels. This is being addressed over time. The specific workload being addressed here in described in paragraph four and while paragraph five says it did not help performance as such, it made a difference to major page faults. I'm aware of at least one bug for a large vendor that was due to increased major faults. Commit 645747462435 ("vmscan: detect mapped file pages used only once") greatly decreases lifetime of single-used mapped file pages. Unfortunately it also decreases life time of all shared mapped file pages. Because after commit bf3f3bc5e7347 ("mm: don't mark_page_accessed in fault path") page-fault handler does not mark page active or even referenced. Thus page_check_references() activates file page only if it was used twice while it stays in inactive list, meanwhile it activates anon pages after first access. Inactive list can be small enough, this way reclaimer can accidentally throw away any widely used page if it wasn't used twice in short period. After this patch page_check_references() also activate file mapped page at first inactive list scan if this page is already used multiple times via several ptes. I found this while trying to fix degragation in rhel6 (~2.6.32) from rhel5 (~2.6.18). There a complete mess with >100 web/mail/spam/ftp containers, they share all their files but there a lot of anonymous pages: ~500mb shared file mapped memory and 15-20Gb non-shared anonymous memory. In this situation major-pagefaults are very costly, because all containers share the same page. In my load kernel created a disproportionate pressure on the file memory, compared with the anonymous, they equaled only if I raise swappiness up to 150 =) These patches actually wasn't helped a lot in my problem, but I saw noticable (10-20 times) reduce in count and average time of major-pagefault in file-mapped areas. Actually both patches are fixes for commit v2.6.33-5448-g6457474, because it was aimed at one scenario (singly used pages), but it breaks the logic in other scenarios (shared and/or executable pages) Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: Pekka Enberg <penberg@kernel.org> Acked-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-08-01mm: vmscan: check if reclaim should really abort even if compaction_ready() ↵Mel Gorman
is true for one zone commit 0cee34fd72c582b4f8ad8ce00645b75fb4168199 upstream. Stable note: Not tracked on Bugzilla. THP and compaction was found to aggressively reclaim pages and stall systems under different situations that was addressed piecemeal over time. If compaction can proceed for a given zone, shrink_zones() does not reclaim any more pages from it. After commit [e0c2327: vmscan: abort reclaim/compaction if compaction can proceed], do_try_to_free_pages() tries to finish as soon as possible once one zone can compact. This was intended to prevent slabs being shrunk unnecessarily but there are side-effects. One is that a small zone that is ready for compaction will abort reclaim even if the chances of successfully allocating a THP from that zone is small. It also means that reclaim can return too early even though sc->nr_to_reclaim pages were not reclaimed. This partially reverts the commit until it is proven that slabs are really being shrunk unnecessarily but preserves the check to return 1 to avoid OOM if reclaim was aborted prematurely. [aarcange@redhat.com: This patch replaces a revert from Andrea] Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Dave Jones <davej@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Andy Isaacson <adi@hexapodia.org> Cc: Nai Xia <nai.xia@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01mm: vmscan: do not OOM if aborting reclaim to start compactionMel Gorman
commit 7335084d446b83cbcb15da80497d03f0c1dc9e21 upstream. Stable note: Not tracked in Bugzilla. This patch makes later patches easier to apply but otherwise has little to justify it. The problem it fixes was never observed but the source of the theoretical problem did not exist for very long. During direct reclaim it is possible that reclaim will be aborted so that compaction can be attempted to satisfy a high-order allocation. If this decision is made before any pages are reclaimed, it is possible that 0 is returned to the page allocator potentially triggering an OOM. This has not been observed but it is a possibility so this patch addresses it. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Dave Jones <davej@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Andy Isaacson <adi@hexapodia.org> Cc: Nai Xia <nai.xia@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01mm: vmscan: when reclaiming for compaction, ensure there are sufficient free ↵Mel Gorman
pages available commit fe4b1b244bdb96136855f2c694071cb09d140766 upstream. Stable note: Not tracked on Bugzilla. THP and compaction was found to aggressively reclaim pages and stall systems under different situations that was addressed piecemeal over time. This patch addresses a problem where the fix regressed THP allocation success rates. In commit e0887c19 ("vmscan: limit direct reclaim for higher order allocations"), Rik noted that reclaim was too aggressive when THP was enabled. In his initial patch he used the number of free pages to decide if reclaim should abort for compaction. My feedback was that reclaim and compaction should be using the same logic when deciding if reclaim should be aborted. Unfortunately, this had the effect of reducing THP success rates when the workload included something like streaming reads that continually allocated pages. The window during which compaction could run and return a THP was too small. This patch combines Rik's two patches together. compaction_suitable() is still used to decide if reclaim should be aborted to allow compaction is used. However, it will also ensure that there is a reasonable buffer of free pages available. This improves upon the THP allocation success rates but bounds the number of pages that are freed for compaction. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel<riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Dave Jones <davej@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Andy Isaacson <adi@hexapodia.org> Cc: Nai Xia <nai.xia@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01mm: compaction: introduce sync-light migration for use by compactionMel Gorman
commit a6bc32b899223a877f595ef9ddc1e89ead5072b8 upstream. Stable note: Not tracked in Buzilla. This was part of a series that reduced interactivity stalls experienced when THP was enabled. These stalls were particularly noticable when copying data to a USB stick but the experiences for users varied a lot. This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT mode that avoids writing back pages to backing storage. Async compaction maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT. For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is used. This avoids sync compaction stalling for an excessive length of time, particularly when copying files to a USB stick where there might be a large number of dirty pages backed by a filesystem that does not support ->writepages. [aarcange@redhat.com: This patch is heavily based on Andrea's work] [akpm@linux-foundation.org: fix fs/nfs/write.c build] [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build] Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Dave Jones <davej@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Andy Isaacson <adi@hexapodia.org> Cc: Nai Xia <nai.xia@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01kswapd: assign new_order and new_classzone_idx after wakeup in sleepingAlex Shi
commit f0dfcde099453aa4c0dc42473828d15a6d492936 upstream. Stable note: Fixes https://bugzilla.redhat.com/show_bug.cgi?id=712019. This patch reduces kswapd CPU usage. There 2 places to read pgdat in kswapd. One is return from a successful balance, another is waked up from kswapd sleeping. The new_order and new_classzone_idx represent the balance input order and classzone_idx. But current new_order and new_classzone_idx are not assigned after kswapd_try_to_sleep(), that will cause a bug in the following scenario. 1: after a successful balance, kswapd goes to sleep, and new_order = 0; new_classzone_idx = __MAX_NR_ZONES - 1; 2: kswapd waked up with order = 3 and classzone_idx = ZONE_NORMAL 3: in the balance_pgdat() running, a new balance wakeup happened with order = 5, and classzone_idx = ZONE_NORMAL 4: the first wakeup(order = 3) finished successufly, return order = 3 but, the new_order is still 0, so, this balancing will be treated as a failed balance. And then the second tighter balancing will be missed. So, to avoid the above problem, the new_order and new_classzone_idx need to be assigned for later successful comparison. Signed-off-by: Alex Shi <alex.shi@intel.com> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Tested-by: Pádraig Brady <P@draigBrady.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01kswapd: avoid unnecessary rebalance after an unsuccessful balancingAlex Shi
commit d2ebd0f6b89567eb93ead4e2ca0cbe03021f344b upstream. Stable note: Fixes https://bugzilla.redhat.com/show_bug.cgi?id=712019. This patch reduces kswapd CPU usage. In commit 215ddd66 ("mm: vmscan: only read new_classzone_idx from pgdat when reclaiming successfully") , Mel Gorman said kswapd is better to sleep after a unsuccessful balancing if there is tighter reclaim request pending in the balancing. But in the following scenario, kswapd do something that is not matched our expectation. The patch fixes this issue. 1, Read pgdat request A (classzone_idx, order = 3) 2, balance_pgdat() 3, During pgdat, a new pgdat request B (classzone_idx, order = 5) is placed 4, balance_pgdat() returns but failed since returned order = 0 5, pgdat of request A assigned to balance_pgdat(), and do balancing again. While the expectation behavior of kswapd should try to sleep. Signed-off-by: Alex Shi <alex.shi@intel.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: Mel Gorman <mgorman@suse.de> Tested-by: Pádraig Brady <P@draigBrady.com> Cc: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01mm: compaction: make isolate_lru_page() filter-aware againMel Gorman
commit c82449352854ff09e43062246af86bdeb628f0c3 upstream. Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging information by reducing LRU list churning had the side-effect of reducing THP allocation success rates. This was part of a series to restore the success rates while preserving the reclaim fix. Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware") noted that compaction does not migrate dirty or writeback pages and that is was meaningless to pick the page and re-add it to the LRU list. This had to be partially reverted because some dirty pages can be migrated by compaction without blocking. This patch updates "mm: compaction: make isolate_lru_page" by skipping over pages that migration has no possibility of migrating to minimise LRU disruption. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel<riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Dave Jones <davej@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Andy Isaacson <adi@hexapodia.org> Cc: Nai Xia <nai.xia@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01mm: page allocator: do not call direct reclaim for THP allocations while ↵Mel Gorman
compaction is deferred commit 66199712e9eef5aede09dbcd9dfff87798a66917 upstream. Stable note: Not tracked in Buzilla. This was part of a series that reduced interactivity stalls experienced when THP was enabled. If compaction is deferred, direct reclaim is used to try to free enough pages for the allocation to succeed. For small high-orders, this has a reasonable chance of success. However, if the caller has specified __GFP_NO_KSWAPD to limit the disruption to the system, it makes more sense to fail the allocation rather than stall the caller in direct reclaim. This patch skips direct reclaim if compaction is deferred and the caller specifies __GFP_NO_KSWAPD. Async compaction only considers a subset of pages so it is possible for compaction to be deferred prematurely and not enter direct reclaim even in cases where it should. To compensate for this, this patch also defers compaction only if sync compaction failed. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Rik van Riel<riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Andy Isaacson <adi@hexapodia.org> Cc: Nai Xia <nai.xia@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01mm: compaction: determine if dirty pages can be migrated without blocking ↵Mel Gorman
within ->migratepage commit b969c4ab9f182a6e1b2a0848be349f99714947b0 upstream. Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging information by reducing LRU list churning had the side-effect of reducing THP allocation success rates. This was part of a series to restore the success rates while preserving the reclaim fix. Asynchronous compaction is used when allocating transparent hugepages to avoid blocking for long periods of time. Due to reports of stalling, there was a debate on disabling synchronous compaction but this severely impacted allocation success rates. Part of the reason was that many dirty pages are skipped in asynchronous compaction by the following check; if (PageDirty(page) && !sync && mapping->a_ops->migratepage != migrate_page) rc = -EBUSY; This skips over all mapping aops using buffer_migrate_page() even though it is possible to migrate some of these pages without blocking. This patch updates the ->migratepage callback with a "sync" parameter. It is the responsibility of the callback to fail gracefully if migration would block. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Dave Jones <davej@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Andy Isaacson <adi@hexapodia.org> Cc: Nai Xia <nai.xia@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01mm: compaction: allow compaction to isolate dirty pagesMel Gorman
commit a77ebd333cd810d7b680d544be88c875131c2bd3 upstream. Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging information by reducing LRU list churning had the side-effect of reducing THP allocation success rates. This was part of a series to restore the success rates while preserving the reclaim fix. Short summary: There are severe stalls when a USB stick using VFAT is used with THP enabled that are reduced by this series. If you are experiencing this problem, please test and report back and considering I have seen complaints from openSUSE and Fedora users on this as well as a few private mails, I'm guessing it's a widespread issue. This is a new type of USB-related stall because it is due to synchronous compaction writing where as in the past the big problem was dirty pages reaching the end of the LRU and being written by reclaim. Am cc'ing Andrew this time and this series would replace mm-do-not-stall-in-synchronous-compaction-for-thp-allocations.patch. I'm also cc'ing Dave Jones as he might have merged that patch to Fedora for wider testing and ideally it would be reverted and replaced by this series. That said, the later patches could really do with some review. If this series is not the answer then a new direction needs to be discussed because as it is, the stalls are unacceptable as the results in this leader show. For testers that try backporting this to 3.1, it won't work because there is a non-obvious dependency on not writing back pages in direct reclaim so you need those patches too. Changelog since V5 o Rebase to 3.2-rc5 o Tidy up the changelogs a bit Changelog since V4 o Added reviewed-bys, credited Andrea properly for sync-light o Allow dirty pages without mappings to be considered for migration o Bound the number of pages freed for compaction o Isolate PageReclaim pages on their own LRU list This is against 3.2-rc5 and follows on from discussions on "mm: Do not stall in synchronous compaction for THP allocations" and "[RFC PATCH 0/5] Reduce compaction-related stalls". Initially, the proposed patch eliminated stalls due to compaction which sometimes resulted in user-visible interactivity problems on browsers by simply never using sync compaction. The downside was that THP success allocation rates were lower because dirty pages were not being migrated as reported by Andrea. His approach at fixing this was nacked on the grounds that it reverted fixes from Rik merged that reduced the amount of pages reclaimed as it severely impacted his workloads performance. This series attempts to reconcile the requirements of maximising THP usage, without stalling in a user-visible fashion due to compaction or cheating by reclaiming an excessive number of pages. Patch 1 partially reverts commit 39deaf85 to allow migration to isolate dirty pages. This is because migration can move some dirty pages without blocking. Patch 2 notes that the /proc/sys/vm/compact_memory handler is not using synchronous compaction when it should be. This is unrelated to the reported stalls but is worth fixing. Patch 3 checks if we isolated a compound page during lumpy scan and account for it properly. For the most part, this affects tracing so it's unrelated to the stalls but worth fixing. Patch 4 notes that it is possible to abort reclaim early for compaction and return 0 to the page allocator potentially entering the "may oom" path. This has not been observed in practice but the rest of the series potentially makes it easier to happen. Patch 5 adds a sync parameter to the migratepage callback and gives the callback responsibility for migrating the page without blocking if sync==false. For example, fallback_migrate_page will not call writepage if sync==false. This increases the number of pages that can be handled by asynchronous compaction thereby reducing stalls. Patch 6 restores filter-awareness to isolate_lru_page for migration. In practice, it means that pages under writeback and pages without a ->migratepage callback will not be isolated for migration. Patch 7 avoids calling direct reclaim if compaction is deferred but makes sure that compaction is only deferred if sync compaction was used. Patch 8 introduces a sync-light migration mechanism that sync compaction uses. The objective is to allow some stalls but to not call ->writepage which can lead to significant user-visible stalls. Patch 9 notes that while we want to abort reclaim ASAP to allow compation to go ahead that we leave a very small window of opportunity for compaction to run. This patch allows more pages to be freed by reclaim but bounds the number to a reasonable level based on the high watermark on each zone. Patch 10 allows slabs to be shrunk even after compaction_ready() is true for one zone. This is to avoid a problem whereby a single small zone can abort reclaim even though no pages have been reclaimed and no suitably large zone is in a usable state. Patch 11 fixes a problem with the rate of page scanning. As reclaim is rarely stalling on pages under writeback it means that scan rates are very high. This is particularly true for direct reclaim which is not calling writepage. The vmstat figures implied that much of this was busy work with PageReclaim pages marked for immediate reclaim. This patch is a prototype that moves these pages to their own LRU list. This has been tested and other than 2 USB keys getting trashed, nothing horrible fell out. That said, I am a bit unhappy with the rescue logic in patch 11 but did not find a better way around it. It does significantly reduce scan rates and System CPU time indicating it is the right direction to take. What is of critical importance is that stalls due to compaction are massively reduced even though sync compaction was still allowed. Testing from people complaining about stalls copying to USBs with THP enabled are particularly welcome. The following tests all involve THP usage and USB keys in some way. Each test follows this type of pattern 1. Read from some fast fast storage, be it raw device or file. Each time the copy finishes, start again until the test ends 2. Write a large file to a filesystem on a USB stick. Each time the copy finishes, start again until the test ends 3. When memory is low, start an alloc process that creates a mapping the size of physical memory to stress THP allocation. This is the "real" part of the test and the part that is meant to trigger stalls when THP is enabled. Copying continues in the background. 4. Record the CPU usage and time to execute of the alloc process 5. Record the number of THP allocs and fallbacks as well as the number of THP pages in use a the end of the test just before alloc exited 6. Run the test 5 times to get an idea of variability 7. Between each run, sync is run and caches dropped and the test waits until nr_dirty is a small number to avoid interference or caching between iterations that would skew the figures. The individual tests were then writebackCPDeviceBasevfat Disable THP, read from a raw device (sda), vfat on USB stick writebackCPDeviceBaseext4 Disable THP, read from a raw device (sda), ext4 on USB stick writebackCPDevicevfat THP enabled, read from a raw device (sda), vfat on USB stick writebackCPDeviceext4 THP enabled, read from a raw device (sda), ext4 on USB stick writebackCPFilevfat THP enabled, read from a file on fast storage and USB, both vfat writebackCPFileext4 THP enabled, read from a file on fast storage and USB, both ext4 The kernels tested were 3.1 3.1 vanilla 3.2-rc5 freemore Patches 1-10 immediate Patches 1-11 andrea The 8 patches Andrea posted as a basis of comparison The results are very long unfortunately. I'll start with the case where we are not using THP at all writebackCPDeviceBasevfat 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1 System Time 1.28 ( 0.00%) 54.49 (-4143.46%) 48.63 (-3687.69%) 4.69 ( -265.11%) 51.88 (-3940.81%) +/- 0.06 ( 0.00%) 2.45 (-4305.55%) 4.75 (-8430.57%) 7.46 (-13282.76%) 4.76 (-8440.70%) User Time 0.09 ( 0.00%) 0.05 ( 40.91%) 0.06 ( 29.55%) 0.07 ( 15.91%) 0.06 ( 27.27%) +/- 0.02 ( 0.00%) 0.01 ( 45.39%) 0.02 ( 25.07%) 0.00 ( 77.06%) 0.01 ( 52.24%) Elapsed Time 110.27 ( 0.00%) 56.38 ( 48.87%) 49.95 ( 54.70%) 11.77 ( 89.33%) 53.43 ( 51.54%) +/- 7.33 ( 0.00%) 3.77 ( 48.61%) 4.94 ( 32.63%) 6.71 ( 8.50%) 4.76 ( 35.03%) THP Active 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) +/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Fault Alloc 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) +/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Fault Fallback 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) +/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) The THP figures are obviously all 0 because THP was enabled. The main thing to watch is the elapsed times and how they compare to times when THP is enabled later. It's also important to note that elapsed time is improved by this series as System CPu time is much reduced. writebackCPDevicevfat 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1 System Time 1.22 ( 0.00%) 13.89 (-1040.72%) 46.40 (-3709.20%) 4.44 ( -264.37%) 47.37 (-3789.33%) +/- 0.06 ( 0.00%) 22.82 (-37635.56%) 3.84 (-6249.44%) 6.48 (-10618.92%) 6.60 (-10818.53%) User Time 0.06 ( 0.00%) 0.06 ( -6.90%) 0.05 ( 17.24%) 0.05 ( 13.79%) 0.04 ( 31.03%) +/- 0.01 ( 0.00%) 0.01 ( 33.33%) 0.01 ( 33.33%) 0.01 ( 39.14%) 0.01 ( 25.46%) Elapsed Time 10445.54 ( 0.00%) 2249.92 ( 78.46%) 70.06 ( 99.33%) 16.59 ( 99.84%) 472.43 ( 95.48%) +/- 643.98 ( 0.00%) 811.62 ( -26.03%) 10.02 ( 98.44%) 7.03 ( 98.91%) 59.99 ( 90.68%) THP Active 15.60 ( 0.00%) 35.20 ( 225.64%) 65.00 ( 416.67%) 70.80 ( 453.85%) 62.20 ( 398.72%) +/- 18.48 ( 0.00%) 51.29 ( 277.59%) 15.99 ( 86.52%) 37.91 ( 205.18%) 22.02 ( 119.18%) Fault Alloc 121.80 ( 0.00%) 76.60 ( 62.89%) 155.40 ( 127.59%) 181.20 ( 148.77%) 286.60 ( 235.30%) +/- 73.51 ( 0.00%) 61.11 ( 83.12%) 34.89 ( 47.46%) 31.88 ( 43.36%) 68.13 ( 92.68%) Fault Fallback 881.20 ( 0.00%) 926.60 ( -5.15%) 847.60 ( 3.81%) 822.00 ( 6.72%) 716.60 ( 18.68%) +/- 73.51 ( 0.00%) 61.26 ( 16.67%) 34.89 ( 52.54%) 31.65 ( 56.94%) 67.75 ( 7.84%) MMTests Statistics: duration User/Sys Time Running Test (seconds) 3540.88 1945.37 716.04 64.97 1937.03 Total Elapsed Time (seconds) 52417.33 11425.90 501.02 230.95 2520.28 The first thing to note is the "Elapsed Time" for the vanilla kernels of 2249 seconds versus 56 with THP disabled which might explain the reports of USB stalls with THP enabled. Applying the patches brings performance in line with THP-disabled performance while isolating pages for immediate reclaim from the LRU cuts down System CPU time. The "Fault Alloc" success rate figures are also improved. The vanilla kernel only managed to allocate 76.6 pages on average over the course of 5 iterations where as applying the series allocated 181.20 on average albeit it is well within variance. It's worth noting that applies the series at least descreases the amount of variance which implies an improvement. Andrea's series had a higher success rate for THP allocations but at a severe cost to elapsed time which is still better than vanilla but still much worse than disabling THP altogether. One can bring my series close to Andrea's by removing this check /* * If compaction is deferred for high-order allocations, it is because * sync compaction recently failed. In this is the case and the caller * has requested the system not be heavily disrupted, fail the * allocation now instead of entering direct reclaim */ if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD)) goto nopage; I didn't include a patch that removed the above check because hurting overall performance to improve the THP figure is not what the average user wants. It's something to consider though if someone really wants to maximise THP usage no matter what it does to the workload initially. This is summary of vmstat figures from the same test. 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1 Page Ins 3257266139 1111844061 17263623 10901575 161423219 Page Outs 81054922 30364312 3626530 3657687 8753730 Swap Ins 3294 2851 6560 4964 4592 Swap Outs 390073 528094 620197 790912 698285 Direct pages scanned 1077581700 3024951463 1764930052 115140570 5901188831 Kswapd pages scanned 34826043 7112868 2131265 1686942 1893966 Kswapd pages reclaimed 28950067 4911036 1246044 966475 1497726 Direct pages reclaimed 805148398 280167837 3623473 2215044 40809360 Kswapd efficiency 83% 69% 58% 57% 79% Kswapd velocity 664.399 622.521 4253.852 7304.360 751.490 Direct efficiency 74% 9% 0% 1% 0% Direct velocity 20557.737 264745.137 3522673.849 498551.938 2341481.435 Percentage direct scans 96% 99% 99% 98% 99% Page writes by reclaim 722646 529174 620319 791018 699198 Page writes file 332573 1080 122 106 913 Page writes anon 390073 528094 620197 790912 698285 Page reclaim immediate 0 2552514720 1635858848 111281140 5478375032 Page rescued immediate 0 0 0 87848 0 Slabs scanned 23552 23552 9216 8192 9216 Direct inode steals 231 0 0 0 0 Kswapd inode steals 0 0 0 0 0 Kswapd skipped wait 28076 786 0 61 6 THP fault alloc 609 383 753 906 1433 THP collapse alloc 12 6 0 0 6 THP splits 536 211 456 593 1136 THP fault fallback 4406 4633 4263 4110 3583 THP collapse fail 120 127 0 0 4 Compaction stalls 1810 728 623 779 3200 Compaction success 196 53 60 80 123 Compaction failures 1614 675 563 699 3077 Compaction pages moved 193158 53545 243185 333457 226688 Compaction move failure 9952 9396 16424 23676 45070 The main things to look at are 1. Page In/out figures are much reduced by the series. 2. Direct page scanning is incredibly high (264745.137 pages scanned per second on the vanilla kernel) but isolating PageReclaim pages on their own list reduces the number of pages scanned significantly. 3. The fact that "Page rescued immediate" is a positive number implies that we sometimes race removing pages from the LRU_IMMEDIATE list that need to be put back on a normal LRU but it happens only for 0.07% of the pages marked for immediate reclaim. writebackCPDeviceext4 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1 System Time 1.51 ( 0.00%) 1.77 ( -17.66%) 1.46 ( 2.92%) 1.15 ( 23.77%) 1.89 ( -25.63%) +/- 0.27 ( 0.00%) 0.67 ( -148.52%) 0.33 ( -22.76%) 0.30 ( -11.15%) 0.19 ( 30.16%) User Time 0.03 ( 0.00%) 0.04 ( -37.50%) 0.05 ( -62.50%) 0.07 ( -112.50%) 0.04 ( -18.75%) +/- 0.01 ( 0.00%) 0.02 ( -146.64%) 0.02 ( -97.91%) 0.02 ( -75.59%) 0.02 ( -63.30%) Elapsed Time 124.93 ( 0.00%) 114.49 ( 8.36%) 96.77 ( 22.55%) 27.48 ( 78.00%) 205.70 ( -64.65%) +/- 20.20 ( 0.00%) 74.39 ( -268.34%) 59.88 ( -196.48%) 7.72 ( 61.79%) 25.03 ( -23.95%) THP Active 161.80 ( 0.00%) 83.60 ( 51.67%) 141.20 ( 87.27%) 84.60 ( 52.29%) 82.60 ( 51.05%) +/- 71.95 ( 0.00%) 43.80 ( 60.88%) 26.91 ( 37.40%) 59.02 ( 82.03%) 52.13 ( 72.45%) Fault Alloc 471.40 ( 0.00%) 228.60 ( 48.49%) 282.20 ( 59.86%) 225.20 ( 47.77%) 388.40 ( 82.39%) +/- 88.07 ( 0.00%) 87.42 ( 99.26%) 73.79 ( 83.78%) 109.62 ( 124.47%) 82.62 ( 93.81%) Fault Fallback 531.60 ( 0.00%) 774.60 ( -45.71%) 720.80 ( -35.59%) 777.80 ( -46.31%) 614.80 ( -15.65%) +/- 88.07 ( 0.00%) 87.26 ( 0.92%) 73.79 ( 16.22%) 109.62 ( -24.47%) 82.29 ( 6.56%) MMTests Statistics: duration User/Sys Time Running Test (seconds) 50.22 33.76 30.65 24.14 128.45 Total Elapsed Time (seconds) 1113.73 1132.19 1029.45 759.49 1707.26 Similar test but the USB stick is using ext4 instead of vfat. As ext4 does not use writepage for migration, the large stalls due to compaction when THP is enabled are not observed. Still, isolating PageReclaim pages on their own list helped completion time largely by reducing the number of pages scanned by direct reclaim although time spend in congestion_wait could also be a factor. Again, Andrea's series had far higher success rates for THP allocation at the cost of elapsed time. I didn't look too closely but a quick look at the vmstat figures tells me kswapd reclaimed 8 times more pages than the patch series and direct reclaim reclaimed roughly three times as many pages. It follows that if memory is aggressively reclaimed, there will be more available for THP. writebackCPFilevfat 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1 System Time 1.76 ( 0.00%) 29.10 (-1555.52%) 46.01 (-2517.18%) 4.79 ( -172.35%) 54.89 (-3022.53%) +/- 0.14 ( 0.00%) 25.61 (-18185.17%) 2.15 (-1434.83%) 6.60 (-4610.03%) 9.75 (-6863.76%) User Time 0.05 ( 0.00%) 0.07 ( -45.83%) 0.05 ( -4.17%) 0.06 ( -29.17%) 0.06 ( -16.67%) +/- 0.02 ( 0.00%) 0.02 ( 20.11%) 0.02 ( -3.14%) 0.01 ( 31.58%) 0.01 ( 47.41%) Elapsed Time 22520.79 ( 0.00%) 1082.85 ( 95.19%) 73.30 ( 99.67%) 32.43 ( 99.86%) 291.84 ( 98.70%) +/- 7277.23 ( 0.00%) 706.29 ( 90.29%) 19.05 ( 99.74%) 17.05 ( 99.77%) 125.55 ( 98.27%) THP Active 83.80 ( 0.00%) 12.80 ( 15.27%) 15.60 ( 18.62%) 13.00 ( 15.51%) 0.80 ( 0.95%) +/- 66.81 ( 0.00%) 20.19 ( 30.22%) 5.92 ( 8.86%) 15.06 ( 22.54%) 1.17 ( 1.75%) Fault Alloc 171.00 ( 0.00%) 67.80 ( 39.65%) 97.40 ( 56.96%) 125.60 ( 73.45%) 133.00 ( 77.78%) +/- 82.91 ( 0.00%) 30.69 ( 37.02%) 53.91 ( 65.02%) 55.05 ( 66.40%) 21.19 ( 25.56%) Fault Fallback 832.00 ( 0.00%) 935.20 ( -12.40%) 906.00 ( -8.89%) 877.40 ( -5.46%) 870.20 ( -4.59%) +/- 82.91 ( 0.00%) 30.69 ( 62.98%) 54.01 ( 34.86%) 55.05 ( 33.60%) 20.91 ( 74.78%) MMTests Statistics: duration User/Sys Time Running Test (seconds) 7229.81 928.42 704.52 80.68 1330.76 Total Elapsed Time (seconds) 112849.04 5618.69 571.11 360.54 1664.28 In this case, the test is reading/writing only from filesystems but as it's vfat, it's slow due to calling writepage during compaction. Little to observe really - the time to complete the test goes way down with the series applied and THP allocation success rates go up in comparison to 3.2-rc5. The success rates are lower than 3.1.0 but the elapsed time for that kernel is abysmal so it is not really a sensible comparison. As before, Andrea's series allocates more THPs at the cost of overall performance. writebackCPFileext4 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1 System Time 1.51 ( 0.00%) 1.77 ( -17.66%) 1.46 ( 2.92%) 1.15 ( 23.77%) 1.89 ( -25.63%) +/- 0.27 ( 0.00%) 0.67 ( -148.52%) 0.33 ( -22.76%) 0.30 ( -11.15%) 0.19 ( 30.16%) User Time 0.03 ( 0.00%) 0.04 ( -37.50%) 0.05 ( -62.50%) 0.07 ( -112.50%) 0.04 ( -18.75%) +/- 0.01 ( 0.00%) 0.02 ( -146.64%) 0.02 ( -97.91%) 0.02 ( -75.59%) 0.02 ( -63.30%) Elapsed Time 124.93 ( 0.00%) 114.49 ( 8.36%) 96.77 ( 22.55%) 27.48 ( 78.00%) 205.70 ( -64.65%) +/- 20.20 ( 0.00%) 74.39 ( -268.34%) 59.88 ( -196.48%) 7.72 ( 61.79%) 25.03 ( -23.95%) THP Active 161.80 ( 0.00%) 83.60 ( 51.67%) 141.20 ( 87.27%) 84.60 ( 52.29%) 82.60 ( 51.05%) +/- 71.95 ( 0.00%) 43.80 ( 60.88%) 26.91 ( 37.40%) 59.02 ( 82.03%) 52.13 ( 72.45%) Fault Alloc 471.40 ( 0.00%) 228.60 ( 48.49%) 282.20 ( 59.86%) 225.20 ( 47.77%) 388.40 ( 82.39%) +/- 88.07 ( 0.00%) 87.42 ( 99.26%) 73.79 ( 83.78%) 109.62 ( 124.47%) 82.62 ( 93.81%) Fault Fallback 531.60 ( 0.00%) 774.60 ( -45.71%) 720.80 ( -35.59%) 777.80 ( -46.31%) 614.80 ( -15.65%) +/- 88.07 ( 0.00%) 87.26 ( 0.92%) 73.79 ( 16.22%) 109.62 ( -24.47%) 82.29 ( 6.56%) MMTests Statistics: duration User/Sys Time Running Test (seconds) 50.22 33.76 30.65 24.14 128.45 Total Elapsed Time (seconds) 1113.73 1132.19 1029.45 759.49 1707.26 Same type of story - elapsed times go down. In this case, allocation success rates are roughtly the same. As before, Andrea's has higher success rates but takes a lot longer. Overall the series does reduce latencies and while the tests are inherency racy as alloc competes with the cp processes, the variability was included. The THP allocation rates are not as high as they could be but that is because we would have to be more aggressive about reclaim and compaction impacting overall performance. This patch: Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware") noted that compaction does not migrate dirty or writeback pages and that is was meaningless to pick the page and re-add it to the LRU list. What was missed during review is that asynchronous migration moves dirty pages if their ->migratepage callback is migrate_page() because these can be moved without blocking. This potentially impacted hugepage allocation success rates by a factor depending on how many dirty pages are in the system. This patch partially reverts 39deaf85 to allow migration to isolate dirty pages again. This increases how much compaction disrupts the LRU but that is addressed later in the series. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Dave Jones <davej@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Andy Isaacson <adi@hexapodia.org> Cc: Nai Xia <nai.xia@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01mm: migration: clean up unmap_and_move()Minchan Kim
commit 0dabec93de633a87adfbbe1d800a4c56cd19d73b upstream. Stable note: Not tracked in Bugzilla. This patch makes later patches easier to apply but has no other impact. unmap_and_move() is one a big messy function. Clean it up. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01mm: zone_reclaim: make isolate_lru_page() filter-awareMinchan Kim
commit f80c0673610e36ae29d63e3297175e22f70dde5f upstream. Stable note: Not tracked in Bugzilla. THP and compaction disrupt the LRU list leading to poor reclaim decisions which has a variable performance impact. In __zone_reclaim case, we don't want to shrink mapped page. Nonetheless, we have isolated mapped page and re-add it into LRU's head. It's unnecessary CPU overhead and makes LRU churning. Of course, when we isolate the page, the page might be mapped but when we try to migrate the page, the page would be not mapped. So it could be migrated. But race is rare and although it happens, it's no big deal. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01mm: compaction: make isolate_lru_page() filter-awareMinchan Kim
commit 39deaf8585152f1a35c1676d3d7dc6ae0fb65967 upstream. Stable note: Not tracked in Bugzilla. THP and compaction disrupt the LRU list leading to poor reclaim decisions which has a variable performance impact. In async mode, compaction doesn't migrate dirty or writeback pages. So, it's meaningless to pick the page and re-add it to lru list. Of course, when we isolate the page in compaction, the page might be dirty or writeback but when we try to migrate the page, the page would be not dirty, writeback. So it could be migrated. But it's very unlikely as isolate and migration cycle is much faster than writeout. So, this patch helps cpu overhead and prevent unnecessary LRU churning. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01mm: change isolate mode from #define to bitwise typeMinchan Kim
commit 4356f21d09283dc6d39a6f7287a65ddab61e2808 upstream. Stable note: Not tracked in Bugzilla. This patch makes later patches easier to apply but has no other impact. Change ISOLATE_XXX macro with bitwise isolate_mode_t type. Normally, macro isn't recommended as it's type-unsafe and making debugging harder as symbol cannot be passed throught to the debugger. Quote from Johannes " Hmm, it would probably be cleaner to fully convert the isolation mode into independent flags. INACTIVE, ACTIVE, BOTH is currently a tri-state among flags, which is a bit ugly." This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01mm: compaction: trivial clean up in acct_isolated()Minchan Kim
commit b9e84ac1536d35aee03b2601f19694949f0bd506 upstream. Stable note: Not tracked in Bugzilla. This patch makes later patches easier to apply but has no other impact. acct_isolated of compaction uses page_lru_base_type which returns only base type of LRU list so it never returns LRU_ACTIVE_ANON or LRU_ACTIVE_FILE. In addtion, cc->nr_[anon|file] is used in only acct_isolated so it doesn't have fields in conpact_control. This patch removes fields from compact_control and makes clear function of acct_issolated which counts the number of anon|file pages isolated. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01vmscan: abort reclaim/compaction if compaction can proceedMel Gorman
commit e0c23279c9f800c403f37511484d9014ac83adec upstream. Stable note: Not tracked on Bugzilla. THP and compaction was found to aggressively reclaim pages and stall systems under different situations that was addressed piecemeal over time. If compaction can proceed, shrink_zones() stops doing any work but its callers still call shrink_slab() which raises the priority and potentially sleeps. This is unnecessary and wasteful so this patch aborts direct reclaim/compaction entirely if compaction can proceed. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Cc: Josh Boyer <jwboyer@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01vmscan: limit direct reclaim for higher order allocationsRik van Riel
commit e0887c19b2daa140f20ca8104bdc5740f39dbb86 upstream. Stable note: Not tracked on Bugzilla. THP and compaction was found to aggressively reclaim pages and stall systems under different situations that was addressed piecemeal over time. Paragraph 3 of this changelog is the motivation for this patch. When suffering from memory fragmentation due to unfreeable pages, THP page faults will repeatedly try to compact memory. Due to the unfreeable pages, compaction fails. Needless to say, at that point page reclaim also fails to create free contiguous 2MB areas. However, that doesn't stop the current code from trying, over and over again, and freeing a minimum of 4MB (2UL << sc->order pages) at every single invocation. This resulted in my 12GB system having 2-3GB free memory, a corresponding amount of used swap and very sluggish response times. This can be avoided by having the direct reclaim code not reclaim from zones that already have plenty of free memory available for compaction. If compaction still fails due to unmovable memory, doing additional reclaim will only hurt the system, not help. [jweiner@redhat.com: change comment to explain the order check] Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01vmscan: reduce wind up shrinker->nr when shrinker can't do workDave Chinner
commit 3567b59aa80ac4417002bf58e35dce5c777d4164 upstream. Stable note: Not tracked in Bugzilla. This patch reduces excessive reclaim of slab objects reducing the amount of information that has to be brought back in from disk. The third and fourth paragram in the series describes the impact. When a shrinker returns -1 to shrink_slab() to indicate it cannot do any work given the current memory reclaim requirements, it adds the entire total_scan count to shrinker->nr. The idea ehind this is that whenteh shrinker is next called and can do work, it will do the work of the previously aborted shrinker call as well. However, if a filesystem is doing lots of allocation with GFP_NOFS set, then we get many, many more aborts from the shrinkers than we do successful calls. The result is that shrinker->nr winds up to it's maximum permissible value (twice the current cache size) and then when the next shrinker call that can do work is issued, it has enough scan count built up to free the entire cache twice over. This manifests itself in the cache going from full to empty in a matter of seconds, even when only a small part of the cache is needed to be emptied to free sufficient memory. Under metadata intensive workloads on ext4 and XFS, I'm seeing the VFS caches increase memory consumption up to 75% of memory (no page cache pressure) over a period of 30-60s, and then the shrinker empties them down to zero in the space of 2-3s. This cycle repeats over and over again, with the shrinker completely trashing the inode and dentry caches every minute or so the workload continues. This behaviour was made obvious by the shrink_slab tracepoints added earlier in the series, and made worse by the patch that corrected the concurrent accounting of shrinker->nr. To avoid this problem, stop repeated small increments of the total scan value from winding shrinker->nr up to a value that can cause the entire cache to be freed. We still need to allow it to wind up, so use the delta as the "large scan" threshold check - if the delta is more than a quarter of the entire cache size, then it is a large scan and allowed to cause lots of windup because we are clearly needing to free lots of memory. If it isn't a large scan then limit the total scan to half the size of the cache so that windup never increases to consume the whole cache. Reducing the total scan limit further does not allow enough wind-up to maintain the current levels of performance, whilst a higher threshold does not prevent the windup from freeing the entire cache under sustained workloads. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01vmscan: shrinker->nr updates race and go wrongDave Chinner
commit acf92b485cccf028177f46918e045c0c4e80ee10 upstream. Stable note: Not tracked in Bugzilla. This patch reduces excessive reclaim of slab objects reducing the amount of information that has to be brought back in from disk. shrink_slab() allows shrinkers to be called in parallel so the struct shrinker can be updated concurrently. It does not provide any exclusio for such updates, so we can get the shrinker->nr value increasing or decreasing incorrectly. As a result, when a shrinker repeatedly returns a value of -1 (e.g. a VFS shrinker called w/ GFP_NOFS), the shrinker->nr goes haywire, sometimes updating with the scan count that wasn't used, sometimes losing it altogether. Worse is when a shrinker does work and that update is lost due to racy updates, which means the shrinker will do the work again! Fix this by making the total_scan calculations independent of shrinker->nr, and making the shrinker->nr updates atomic w.r.t. to other updates via cmpxchg loops. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01vmscan: add shrink_slab tracepointsDave Chinner
commit 095760730c1047c69159ce88021a7fa3833502c8 upstream. Stable note: This patch makes later patches easier to apply but otherwise has little to justify it. It is a diagnostic patch that was part of a series addressing excessive slab shrinking after GFP_NOFS failures. There is detailed information on the series' motivation at https://lkml.org/lkml/2011/6/2/42 . It is impossible to understand what the shrinkers are actually doing without instrumenting the code, so add a some tracepoints to allow insight to be gained. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-08-01vmscan: clear ZONE_CONGESTED for zone with good watermarkShaohua Li
commit 439423f6894aa0dec22187526827456f5004baed upstream. Stable note: Not tracked in Bugzilla. kswapd is responsible for clearing ZONE_CONGESTED after it balances a zone and this patch fixes a bug where that was failing to happen. Without this patch, processes can stall in wait_iff_congested unnecessarily. For users, this can look like an interactivity stall but some workloads would see it as sudden drop in throughput. ZONE_CONGESTED is only cleared in kswapd, but pages can be freed in any task. It's possible ZONE_CONGESTED isn't cleared in some cases: 1. the zone is already balanced just entering balance_pgdat() for order-0 because concurrent tasks free memory. In this case, later check will skip the zone as it's balanced so the flag isn't cleared. 2. high order balance fallbacks to order-0. quote from Mel: At the end of balance_pgdat(), kswapd uses the following logic; If reclaiming at high order { for each zone { if all_unreclaimable skip if watermark is not met order = 0 loop again /* watermark is met */ clear congested } } i.e. it clears ZONE_CONGESTED if it the zone is balanced. if not, it restarts balancing at order-0. However, if the higher zones are balanced for order-0, kswapd will miss clearing ZONE_CONGESTED as that only happens after a zone is shrunk. This can mean that wait_iff_congested() stalls unnecessarily. This patch makes kswapd clear ZONE_CONGESTED during its initial highmem->dma scan for zones that are already balanced. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01mm: vmscan: fix force-scanning small targets without swapJohannes Weiner
commit a4d3e9e76337059406fcf3ead288c0df22a790e9 upstream. Stable note: Not tracked in Bugzilla. This patch augments an earlier commit that avoids scanning priority being artificially raised. The older fix was particularly important for small memcgs to avoid calling wait_iff_congested() unnecessarily. Without swap, anonymous pages are not scanned. As such, they should not count when considering force-scanning a small target if there is no swap. Otherwise, targets are not force-scanned even when their effective scan number is zero and the other conditions--kswapd/memcg--apply. This fixes 246e87a93934 ("memcg: fix get_scan_count() for small targets"). [akpm@linux-foundation.org: fix comment] Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Ying Han <yinghan@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>