mm: ratelimit PFNs busy info message

commit 75dddef32514f7aa58930bde6a1263253bc3d4ba upstream. The RDMA subsystem can generate several thousand of these messages per second eventually leading to a kernel crash. Ratelimit these messages to prevent this crash. Doug said: "I've been carrying a version of this for several kernel versions. I don't remember when they started, but we have one (and only one) class of machines: Dell PE R730xd, that generate these errors. When it happens, without a rate limit, we get rcu timeouts and kernel oopses. With the rate limit, we just get a lot of annoying kernel messages but the machine continues on, recovers, and eventually the memory operations all succeed" And: "> Well... why are all these EBUSY's occurring? It sounds inefficient > (at least) but if it is expected, normal and unavoidable then > perhaps we should just remove that message altogether? I don't have an answer to that question. To be honest, I haven't looked real hard. We never had this at all, then it started out of the blue, but only on our Dell 730xd machines (and it hits all of them), but no other classes or brands of machines. And we have our 730xd machines loaded up with different brands and models of cards (for instance one dedicated to mlx4 hardware, one for qib, one for mlx5, an ocrdma/cxgb4 combo, etc), so the fact that it hit all of the machines meant it wasn't tied to any particular brand/model of RDMA hardware. To me, it always smelled of a hardware oddity specific to maybe the CPUs or mainboard chipsets in these machines, so given that I'm not an mm expert anyway, I never chased it down. A few other relevant details: it showed up somewhere around 4.8/4.9 or thereabouts. It never happened before, but the prinkt has been there since the 3.18 days, so possibly the test to trigger this message was changed, or something else in the allocator changed such that the situation started happening on these machines? And, like I said, it is specific to our 730xd machines (but they are all identical, so that could mean it's something like their specific ram configuration is causing the allocator to hit this on these machine but not on other machines in the cluster, I don't want to say it's necessarily the model of chipset or CPU, there are other bits of identicalness between these machines)" Link: http://lkml.kernel.org/r/499c0f6cc10d6eb829a67f2a4d75b4228a9b356e.1501695897.git.jtoppins@redhat.com Signed-off-by: Jonathan Toppins <jtoppins@redhat.com> Reviewed-by: Doug Ledford <dledford@redhat.com> Tested-by: Doug Ledford <dledford@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
author: Jonathan Toppins <jtoppins@redhat.com> 2017-08-10 15:23:35 -0700
committer: Greg Kroah-Hartman <gregkh@linuxfoundation.org> 2017-08-16 13:40:28 -0700
commit: 9ea732ebb53fdba26140989ac351dbc82056f224 (patch)
tree: 8d2124fafb52b74de150c42a031d4727cfd6f5ab /mm
parent: 97e371409da76ebe3282c0ba8bc86a84efc3694c (diff)
1 files changed, 1 insertions, 1 deletions
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f9d648fce8cd..53286b2f5b1c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6804,7 +6804,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 
 	/* Make sure the range is really isolated. */
 	if (test_pages_isolated(outer_start, end, false)) {
-		pr_info("%s: [%lx, %lx) PFNs busy\n",
+		pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
 			__func__, outer_start, end);
 		ret = -EBUSY;
 		goto done;
author	Jonathan Toppins <jtoppins@redhat.com>	2017-08-10 15:23:35 -0700
committer	Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-08-16 13:40:28 -0700
commit	9ea732ebb53fdba26140989ac351dbc82056f224 (patch)
tree	8d2124fafb52b74de150c42a031d4727cfd6f5ab /mm
parent	97e371409da76ebe3282c0ba8bc86a84efc3694c (diff)