Merge branch 'hugepage-fallbacks' (hugepatch patches from David Rientjes)

Merge hugepage allocation updates from David Rientjes: "We (mostly Linus, Andrea, and myself) have been discussing offlist how to implement a sane default allocation strategy for hugepages on NUMA platforms. With these reverts in place, the page allocator will happily allocate a remote hugepage immediately rather than try to make a local hugepage available. This incurs a substantial performance degradation when memory compaction would have otherwise made a local hugepage available. This series reverts those reverts and attempts to propose a more sane default allocation strategy specifically for hugepages. Andrea acknowledges this is likely to fix the swap storms that he originally reported that resulted in the patches that removed __GFP_THISNODE from hugepage allocations. The immediate goal is to return 5.3 to the behavior the kernel has implemented over the past several years so that remote hugepages are not immediately allocated when local hugepages could have been made available because the increased access latency is untenable. The next goal is to introduce a sane default allocation strategy for hugepages allocations in general regardless of the configuration of the system so that we prevent thrashing of local memory when compaction is unlikely to succeed and can prefer remote hugepages over remote native pages when the local node is low on memory." Note on timing: this reverts the hugepage VM behavior changes that got introduced fairly late in the 5.3 cycle, and that fixed a huge performance regression for certain loads that had been around since 4.18. Andrea had this note: "The regression of 4.18 was that it was taking hours to start a VM where 3.10 was only taking a few seconds, I reported all the details on lkml when it was finally tracked down in August 2018. https://lore.kernel.org/linux-mm/20180820032640.9896-2-aarcange@redhat.com/ __GFP_THISNODE in MADV_HUGEPAGE made the above enterprise vfio workload degrade like in the "current upstream" above. And it still would have been that bad as above until 5.3-rc5" where the bad behavior ends up happening as you fill up a local node, and without that change, you'd get into the nasty swap storm behavior due to compaction working overtime to make room for more memory on the nodes. As a result 5.3 got the two performance fix reverts in rc5. However, David Rientjes then noted that those performance fixes in turn regressed performance for other loads - although not quite to the same degree. He suggested reverting the reverts and instead replacing them with two small changes to how hugepage allocations are done (patch descriptions rephrased by me): - "avoid expensive reclaim when compaction may not succeed": just admit that the allocation failed when you're trying to allocate a huge-page and compaction wasn't successful. - "allow hugepage fallback to remote nodes when madvised": when that node-local huge-page allocation failed, retry without forcing the local node. but by then I judged it too late to replace the fixes for a 5.3 release. So 5.3 was released with behavior that harked back to the pre-4.18 logic. But now we're in the merge window for 5.4, and we can see if this alternate model fixes not just the horrendous swap storm behavior, but also restores the performance regression that the late reverts caused. Fingers crossed. * emailed patches from David Rientjes <rientjes@google.com>: mm, page_alloc: allow hugepage fallback to remote nodes when madvised mm, page_alloc: avoid expensive reclaim when compaction may not succeed Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"" Revert "Revert "mm, thp: restore node-local hugepage allocations""
author: Linus Torvalds <torvalds@linux-foundation.org> 2019-09-28 14:26:47 -0700
committer: Linus Torvalds <torvalds@linux-foundation.org> 2019-09-28 14:26:47 -0700
commit: edf445ad7c8d58c2784a5b733790e80999093d8f (patch)
tree: 2ffeaee2454bf0b03530c22ac23b4eb5edb4d89d /mm/page_alloc.c
parent: a2953204b576ea3ba4afd07b917811d50fc49778 (diff)
parent: 76e654cc91bbe627aa6067916f02a4d3ac041620 (diff)
1 files changed, 22 insertions, 0 deletions
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3334a769eb91..15c2050c629b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4467,6 +4467,28 @@ retry_cpuset:
 		if (page)
 			goto got_pg;
 
+		 if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
+			/*
+			 * If allocating entire pageblock(s) and compaction
+			 * failed because all zones are below low watermarks
+			 * or is prohibited because it recently failed at this
+			 * order, fail immediately.
+			 *
+			 * Reclaim is
+			 *  - potentially very expensive because zones are far
+			 *    below their low watermarks or this is part of very
+			 *    bursty high order allocations,
+			 *  - not guaranteed to help because isolate_freepages()
+			 *    may not iterate over freed pages as part of its
+			 *    linear scan, and
+			 *  - unlikely to make entire pageblocks free on its
+			 *    own.
+			 */
+			if (compact_result == COMPACT_SKIPPED ||
+			    compact_result == COMPACT_DEFERRED)
+				goto nopage;
+		}
+
 		/*
 		 * Checks for costly allocations with __GFP_NORETRY, which
 		 * includes THP page fault allocations
author	Linus Torvalds <torvalds@linux-foundation.org>	2019-09-28 14:26:47 -0700
committer	Linus Torvalds <torvalds@linux-foundation.org>	2019-09-28 14:26:47 -0700
commit	edf445ad7c8d58c2784a5b733790e80999093d8f (patch)
tree	2ffeaee2454bf0b03530c22ac23b4eb5edb4d89d /mm/page_alloc.c
parent	a2953204b576ea3ba4afd07b917811d50fc49778 (diff)
parent	76e654cc91bbe627aa6067916f02a4d3ac041620 (diff)