From 08d1d0e6d0a00c6e687201774f3bf61177741e80 Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Sun, 6 Oct 2019 17:58:15 -0700 Subject: memcg: only record foreign writebacks with dirty pages when memcg is not disabled In kdump kernel, memcg usually is disabled with 'cgroup_disable=memory' for saving memory. Now kdump kernel will always panic when dump vmcore to local disk: BUG: kernel NULL pointer dereference, address: 0000000000000ab8 Oops: 0000 [#1] SMP NOPTI CPU: 0 PID: 598 Comm: makedumpfile Not tainted 5.3.0+ #26 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 10/02/2018 RIP: 0010:mem_cgroup_track_foreign_dirty_slowpath+0x38/0x140 Call Trace: __set_page_dirty+0x52/0xc0 iomap_set_page_dirty+0x50/0x90 iomap_write_end+0x6e/0x270 iomap_write_actor+0xce/0x170 iomap_apply+0xba/0x11e iomap_file_buffered_write+0x62/0x90 xfs_file_buffered_aio_write+0xca/0x320 [xfs] new_sync_write+0x12d/0x1d0 vfs_write+0xa5/0x1a0 ksys_write+0x59/0xd0 do_syscall_64+0x59/0x1e0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 And this will corrupt the 1st kernel too with 'cgroup_disable=memory'. Via the trace and with debugging, it is pointing to commit 97b27821b485 ("writeback, memcg: Implement foreign dirty flushing") which introduced this regression. Disabling memcg causes the null pointer dereference at uninitialized data in function mem_cgroup_track_foreign_dirty_slowpath(). Fix it by returning directly if memcg is disabled, but not trying to record the foreign writebacks with dirty pages. Link: http://lkml.kernel.org/r/20190924141928.GD31919@MiWiFi-R3L-srv Fixes: 97b27821b485 ("writeback, memcg: Implement foreign dirty flushing") Signed-off-by: Baoquan He Acked-by: Michal Hocko Cc: Johannes Weiner Cc: Jan Kara Cc: Tejun Heo Cc: Jens Axboe Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'include/linux/memcontrol.h') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 9b60863429cc..98380779f6d5 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1264,6 +1264,9 @@ void mem_cgroup_track_foreign_dirty_slowpath(struct page *page, static inline void mem_cgroup_track_foreign_dirty(struct page *page, struct bdi_writeback *wb) { + if (mem_cgroup_disabled()) + return; + if (unlikely(&page->mem_cgroup->css != wb->memcg_css)) mem_cgroup_track_foreign_dirty_slowpath(page, wb); } -- cgit v1.2.3 From 9783aa9917f8ae24759e67bf882f1aba32fe4ea1 Mon Sep 17 00:00:00 2001 From: Chris Down Date: Sun, 6 Oct 2019 17:58:32 -0700 Subject: mm, memcg: proportional memory.{low,min} reclaim cgroup v2 introduces two memory protection thresholds: memory.low (best-effort) and memory.min (hard protection). While they generally do what they say on the tin, there is a limitation in their implementation that makes them difficult to use effectively: that cliff behaviour often manifests when they become eligible for reclaim. This patch implements more intuitive and usable behaviour, where we gradually mount more reclaim pressure as cgroups further and further exceed their protection thresholds. This cliff edge behaviour happens because we only choose whether or not to reclaim based on whether the memcg is within its protection limits (see the use of mem_cgroup_protected in shrink_node), but we don't vary our reclaim behaviour based on this information. Imagine the following timeline, with the numbers the lruvec size in this zone: 1. memory.low=1000000, memory.current=999999. 0 pages may be scanned. 2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned. 3. memory.low=1000000, memory.current=1000001. 1000001* pages may be scanned. (?!) * Of course, we won't usually scan all available pages in the zone even without this patch because of scan control priority, over-reclaim protection, etc. However, as shown by the tests at the end, these techniques don't sufficiently throttle such an extreme change in input, so cliff-like behaviour isn't really averted by their existence alone. Here's an example of how this plays out in practice. At Facebook, we are trying to protect various workloads from "system" software, like configuration management tools, metric collectors, etc (see this[0] case study). In order to find a suitable memory.low value, we start by determining the expected memory range within which the workload will be comfortable operating. This isn't an exact science -- memory usage deemed "comfortable" will vary over time due to user behaviour, differences in composition of work, etc, etc. As such we need to ballpark memory.low, but doing this is currently problematic: 1. If we end up setting it too low for the workload, it won't have *any* effect (see discussion above). The group will receive the full weight of reclaim and won't have any priority while competing with the less important system software, as if we had no memory.low configured at all. 2. Because of this behaviour, we end up erring on the side of setting it too high, such that the comfort range is reliably covered. However, protected memory is completely unavailable to the rest of the system, so we might cause undue memory and IO pressure there when we *know* we have some elasticity in the workload. 3. Even if we get the value totally right, smack in the middle of the comfort zone, we get extreme jumps between no pressure and full pressure that cause unpredictable pressure spikes in the workload due to the current binary reclaim behaviour. With this patch, we can set it to our ballpark estimation without too much worry. Any undesirable behaviour, such as too much or too little reclaim pressure on the workload or system will be proportional to how far our estimation is off. This means we can set memory.low much more conservatively and thus waste less resources *without* the risk of the workload falling off a cliff if we overshoot. As a more abstract technical description, this unintuitive behaviour results in having to give high-priority workloads a large protection buffer on top of their expected usage to function reliably, as otherwise we have abrupt periods of dramatically increased memory pressure which hamper performance. Having to set these thresholds so high wastes resources and generally works against the principle of work conservation. In addition, having proportional memory reclaim behaviour has other benefits. Most notably, before this patch it's basically mandatory to set memory.low to a higher than desirable value because otherwise as soon as you exceed memory.low, all protection is lost, and all pages are eligible to scan again. By contrast, having a gradual ramp in reclaim pressure means that you now still get some protection when thresholds are exceeded, which means that one can now be more comfortable setting memory.low to lower values without worrying that all protection will be lost. This is important because workingset size is really hard to know exactly, especially with variable workloads, so at least getting *some* protection if your workingset size grows larger than you expect increases user confidence in setting memory.low without a huge buffer on top being needed. Thanks a lot to Johannes Weiner and Tejun Heo for their advice and assistance in thinking about how to make this work better. In testing these changes, I intended to verify that: 1. Changes in page scanning become gradual and proportional instead of binary. To test this, I experimented stepping further and further down memory.low protection on a workload that floats around 19G workingset when under memory.low protection, watching page scan rates for the workload cgroup: +------------+-----------------+--------------------+--------------+ | memory.low | test (pgscan/s) | control (pgscan/s) | % of control | +------------+-----------------+--------------------+--------------+ | 21G | 0 | 0 | N/A | | 17G | 867 | 3799 | 23% | | 12G | 1203 | 3543 | 34% | | 8G | 2534 | 3979 | 64% | | 4G | 3980 | 4147 | 96% | | 0 | 3799 | 3980 | 95% | +------------+-----------------+--------------------+--------------+ As you can see, the test kernel (with a kernel containing this patch) ramps up page scanning significantly more gradually than the control kernel (without this patch). 2. More gradual ramp up in reclaim aggression doesn't result in premature OOMs. To test this, I wrote a script that slowly increments the number of pages held by stress(1)'s --vm-keep mode until a production system entered severe overall memory contention. This script runs in a highly protected slice taking up the majority of available system memory. Watching vmstat revealed that page scanning continued essentially nominally between test and control, without causing forward reclaim progress to become arrested. [0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project [akpm@linux-foundation.org: reflow block comments to fit in 80 cols] [chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection] Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name Signed-off-by: Chris Down Acked-by: Johannes Weiner Reviewed-by: Roman Gushchin Cc: Michal Hocko Cc: Tejun Heo Cc: Dennis Zhou Cc: Tetsuo Handa Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) (limited to 'include/linux/memcontrol.h') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 98380779f6d5..fa9ba2edf7e0 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -356,6 +356,14 @@ static inline bool mem_cgroup_disabled(void) return !cgroup_subsys_enabled(memory_cgrp_subsys); } +static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg) +{ + if (mem_cgroup_disabled()) + return 0; + + return max(READ_ONCE(memcg->memory.emin), READ_ONCE(memcg->memory.elow)); +} + enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root, struct mem_cgroup *memcg); @@ -537,6 +545,8 @@ void mem_cgroup_handle_over_high(void); unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg); +unsigned long mem_cgroup_size(struct mem_cgroup *memcg); + void mem_cgroup_print_oom_context(struct mem_cgroup *memcg, struct task_struct *p); @@ -829,6 +839,11 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, { } +static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg) +{ + return 0; +} + static inline enum mem_cgroup_protection mem_cgroup_protected( struct mem_cgroup *root, struct mem_cgroup *memcg) { @@ -968,6 +983,11 @@ static inline unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg) return 0; } +static inline unsigned long mem_cgroup_size(struct mem_cgroup *memcg) +{ + return 0; +} + static inline void mem_cgroup_print_oom_context(struct mem_cgroup *memcg, struct task_struct *p) { -- cgit v1.2.3 From 9de7ca46ad2688bd51e80f7119fefa301ad7f3fa Mon Sep 17 00:00:00 2001 From: Chris Down Date: Sun, 6 Oct 2019 17:58:35 -0700 Subject: mm, memcg: make memory.emin the baseline for utilisation determination Roman points out that when when we do the low reclaim pass, we scale the reclaim pressure relative to position between 0 and the maximum protection threshold. However, if the maximum protection is based on memory.elow, and memory.emin is above zero, this means we still may get binary behaviour on second-pass low reclaim. This is because we scale starting at 0, not starting at memory.emin, and since we don't scan at all below emin, we end up with cliff behaviour. This should be a fairly uncommon case since usually we don't go into the second pass, but it makes sense to scale our low reclaim pressure starting at emin. You can test this by catting two large sparse files, one in a cgroup with emin set to some moderate size compared to physical RAM, and another cgroup without any emin. In both cgroups, set an elow larger than 50% of physical RAM. The one with emin will have less page scanning, as reclaim pressure is lower. Rebase on top of and apply the same idea as what was applied to handle cgroup_memory=disable properly for the original proportional patch http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name ("mm, memcg: Handle cgroup_disable=memory when getting memcg protection"). Link: http://lkml.kernel.org/r/20190201051810.GA18895@chrisdown.name Signed-off-by: Chris Down Suggested-by: Roman Gushchin Acked-by: Johannes Weiner Cc: Michal Hocko Cc: Tejun Heo Cc: Dennis Zhou Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-) (limited to 'include/linux/memcontrol.h') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index fa9ba2edf7e0..1cbad1248e5a 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -356,12 +356,17 @@ static inline bool mem_cgroup_disabled(void) return !cgroup_subsys_enabled(memory_cgrp_subsys); } -static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg) +static inline void mem_cgroup_protection(struct mem_cgroup *memcg, + unsigned long *min, unsigned long *low) { - if (mem_cgroup_disabled()) - return 0; + if (mem_cgroup_disabled()) { + *min = 0; + *low = 0; + return; + } - return max(READ_ONCE(memcg->memory.emin), READ_ONCE(memcg->memory.elow)); + *min = READ_ONCE(memcg->memory.emin); + *low = READ_ONCE(memcg->memory.elow); } enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root, @@ -839,9 +844,11 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, { } -static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg) +static inline void mem_cgroup_protection(struct mem_cgroup *memcg, + unsigned long *min, unsigned long *low) { - return 0; + *min = 0; + *low = 0; } static inline enum mem_cgroup_protection mem_cgroup_protected( -- cgit v1.2.3 From 1bc63fb1272be0773e925f78c0fbd06c89701d55 Mon Sep 17 00:00:00 2001 From: Chris Down Date: Sun, 6 Oct 2019 17:58:38 -0700 Subject: mm, memcg: make scan aggression always exclude protection This patch is an incremental improvement on the existing memory.{low,min} relative reclaim work to base its scan pressure calculations on how much protection is available compared to the current usage, rather than how much the current usage is over some protection threshold. This change doesn't change the experience for the user in the normal case too much. One benefit is that it replaces the (somewhat arbitrary) 100% cutoff with an indefinite slope, which makes it easier to ballpark a memory.low value. As well as this, the old methodology doesn't quite apply generically to machines with varying amounts of physical memory. Let's say we have a top level cgroup, workload.slice, and another top level cgroup, system-management.slice. We want to roughly give 12G to system-management.slice, so on a 32GB machine we set memory.low to 20GB in workload.slice, and on a 64GB machine we set memory.low to 52GB. However, because these are relative amounts to the total machine size, while the amount of memory we want to generally be willing to yield to system.slice is absolute (12G), we end up putting more pressure on system.slice just because we have a larger machine and a larger workload to fill it, which seems fairly unintuitive. With this new behaviour, we don't end up with this unintended side effect. Previously the way that memory.low protection works is that if you are 50% over a certain baseline, you get 50% of your normal scan pressure. This is certainly better than the previous cliff-edge behaviour, but it can be improved even further by always considering memory under the currently enforced protection threshold to be out of bounds. This means that we can set relatively low memory.low thresholds for variable or bursty workloads while still getting a reasonable level of protection, whereas with the previous version we may still trivially hit the 100% clamp. The previous 100% clamp is also somewhat arbitrary, whereas this one is more concretely based on the currently enforced protection threshold, which is likely easier to reason about. There is also a subtle issue with the way that proportional reclaim worked previously -- it promotes having no memory.low, since it makes pressure higher during low reclaim. This happens because we base our scan pressure modulation on how far memory.current is between memory.min and memory.low, but if memory.low is unset, we only use the overage method. In most cromulent configurations, this then means that we end up with *more* pressure than with no memory.low at all when we're in low reclaim, which is not really very usable or expected. With this patch, memory.low and memory.min affect reclaim pressure in a more understandable and composable way. For example, from a user standpoint, "protected" memory now remains untouchable from a reclaim aggression standpoint, and users can also have more confidence that bursty workloads will still receive some amount of guaranteed protection. Link: http://lkml.kernel.org/r/20190322160307.GA3316@chrisdown.name Signed-off-by: Chris Down Reviewed-by: Roman Gushchin Acked-by: Johannes Weiner Acked-by: Michal Hocko Cc: Tejun Heo Cc: Dennis Zhou Cc: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 25 ++++++++++++------------- 1 file changed, 12 insertions(+), 13 deletions(-) (limited to 'include/linux/memcontrol.h') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1cbad1248e5a..ae703ea3ef48 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -356,17 +356,17 @@ static inline bool mem_cgroup_disabled(void) return !cgroup_subsys_enabled(memory_cgrp_subsys); } -static inline void mem_cgroup_protection(struct mem_cgroup *memcg, - unsigned long *min, unsigned long *low) +static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg, + bool in_low_reclaim) { - if (mem_cgroup_disabled()) { - *min = 0; - *low = 0; - return; - } + if (mem_cgroup_disabled()) + return 0; + + if (in_low_reclaim) + return READ_ONCE(memcg->memory.emin); - *min = READ_ONCE(memcg->memory.emin); - *low = READ_ONCE(memcg->memory.elow); + return max(READ_ONCE(memcg->memory.emin), + READ_ONCE(memcg->memory.elow)); } enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root, @@ -844,11 +844,10 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, { } -static inline void mem_cgroup_protection(struct mem_cgroup *memcg, - unsigned long *min, unsigned long *low) +static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg, + bool in_low_reclaim) { - *min = 0; - *low = 0; + return 0; } static inline enum mem_cgroup_protection mem_cgroup_protected( -- cgit v1.2.3