linux-toradex.git/mm/page-writeback.c, branch v4.4.151

writeback: safer lock nesting

2018-04-24T07:32:12+00:00

commit 2e898e4c0a3897ccd434adac5abb8330194f527b upstream.

lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.

unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
the given inode is switching writeback domains.  Switches occur when
enough writes are issued from a new domain.

This existing pattern is thus suspicious:
    lock_page_memcg(page);
    unlocked_inode_to_wb_begin(inode, &locked);
    ...
    unlocked_inode_to_wb_end(inode, locked);
    unlock_page_memcg(page);

If both inode switch and process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock.  This suggests the
possibility of deadlock if an interrupt occurs before unlock_page_memcg().

    truncate
    __cancel_dirty_page
    lock_page_memcg
    unlocked_inode_to_wb_begin
    unlocked_inode_to_wb_end
    
                                    
                                    end_page_writeback
                                    test_clear_page_writeback
                                    lock_page_memcg
                                    
    unlock_page_memcg

Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).

If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:

  cd /mnt/cgroup/memory
  mkdir a b
  echo 1 > a/memory.move_charge_at_immigrate
  echo 1 > b/memory.move_charge_at_immigrate
  (
    echo $BASHPID > a/cgroup.procs
    while true; do
      dd if=/dev/zero of=/mnt/big bs=1M count=256
    done
  ) &
  while true; do
    sync
  done &
  sleep 1h &
  SLEEP=$!
  while true; do
    echo $SLEEP > a/cgroup.procs
    echo $SLEEP > b/cgroup.procs
  done

The deadlock does not seem possible, so it's debatable if there's any
reason to modify the kernel.  I suggest we should to prevent future
surprises.  And Wang Long said "this deadlock occurs three times in our
environment", so there's more reason to apply this, even to stable.
Stable 4.4 has minor conflicts applying this patch.  For a clean 4.4 patch
see "[PATCH for-4.4] writeback: safer lock nesting"
https://lkml.org/lkml/2018/4/11/146

Wang Long said "this deadlock occurs three times in our environment"

[gthelen@google.com: v4]
  Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
[akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Signed-off-by: Greg Thelen 
Reported-by: Wang Long 
Acked-by: Wang Long 
Acked-by: Michal Hocko 
Reviewed-by: Andrew Morton 
Cc: Johannes Weiner 
Cc: Tejun Heo 
Cc: Nicholas Piggin 
Cc: 	[v4.2+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
[natechancellor: Applied to 4.4 based on Greg's backport on lkml.org]
Signed-off-by: Nathan Chancellor 
Signed-off-by: Greg Kroah-Hartman

mm/page-writeback: fix dirty_ratelimit calculation

2018-01-17T08:35:27+00:00

commit d59b1087a98e402ed9a7cc577f4da435f9a555f5 upstream.

Calculation of dirty_ratelimit sometimes is not correct.  E.g.  initial
values of dirty_ratelimit == INIT_BW and step == 0, lead to the
following result:

   UBSAN: Undefined behaviour in ../mm/page-writeback.c:1286:7
   shift exponent 25600 is too large for 64-bit type 'long unsigned int'

The fix is straightforward - make step 0 if the shift exponent is too
big.

Signed-off-by: Andrey Ryabinin 
Cc: Wu Fengguang 
Cc: Tejun Heo 
Cc: Andy Shevchenko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Mel Gorman 
Signed-off-by: Greg Kroah-Hartman

writeback: use higher precision calculation in domain_dirty_limits()

2016-07-27T16:47:29+00:00

commit 62a584fe05eef1f80ed49a286a29328f1a224fb9 upstream.

As vm.dirty_[background_]bytes can't be applied verbatim to multiple
cgroup writeback domains, they get converted to percentages in
domain_dirty_limits() and applied the same way as
vm.dirty_[background]ratio.  However, if the specified bytes is lower
than 1% of available memory, the calculated ratios become zero and the
writeback domain gets throttled constantly.

Fix it by using per-PAGE_SIZE instead of percentage for ratio
calculations.  Also, the updated DIV_ROUND_UP() usages now should
yield 1/4096 (0.0244%) as the minimum ratio as long as the specified
bytes are above zero.

Signed-off-by: Tejun Heo 
Reported-by: Miao Xie 
Link: http://lkml.kernel.org/g/57333E75.3080309@huawei.com
Fixes: 9fc3a43e1757 ("writeback: separate out domain_dirty_limits()")
Reviewed-by: Jan Kara 
Signed-off-by: Greg Kroah-Hartman 

Adjusted comment based on Jan's suggestion.
Signed-off-by: Jens Axboe

writeback: Fix performance regression in wb_over_bg_thresh()

2016-05-11T09:21:18+00:00

commit 74d369443325063a5f0260e63971decb950fd8fa upstream.

Commit 947e9762a8dd ("writeback: update wb_over_bg_thresh() to use
wb_domain aware operations") unintentionally changed this function's
meaning from "are there more dirty pages than the background writeback
threshold" to "are there more dirty pages than the writeback threshold".
The background writeback threshold is typically half of the writeback
threshold, so this had the effect of raising the number of dirty pages
required to cause a writeback worker to perform background writeout.

This can cause a very severe performance regression when a BDI uses
BDI_CAP_STRICTLIMIT because balance_dirty_pages() and the writeback worker
can now disagree on whether writeback should be initiated.

For example, in a system having 1GB of RAM, a single spinning disk, and a
"pass-through" FUSE filesystem mounted over the disk, application code
mmapped a 128MB file on the disk and was randomly dirtying pages in that
mapping.

Because FUSE uses strictlimit and has a default max_ratio of only 1%, in
balance_dirty_pages, thresh is ~200, bg_thresh is ~100, and the
dirty_freerun_ceiling is the average of those, ~150. So, it pauses the
dirtying processes when we have 151 dirty pages and wakes up a background
writeback worker. But the worker tests the wrong threshold (200 instead of
100), so it does not initiate writeback and just returns.

Thus, balance_dirty_pages keeps looping, sleeping and then waking up the
worker who will do nothing. It remains stuck in this state until the few
dirty pages that we have finally expire and we write them back for that
reason. Then the whole process repeats, resulting in near-zero throughput
through the FUSE BDI.

The fix is to call the parameterized variant of wb_calc_thresh, so that the
worker will do writeback if the bg_thresh is exceeded which was the
behavior before the referenced commit.

Fixes: 947e9762a8dd ("writeback: update wb_over_bg_thresh() to use wb_domain aware operations")
Signed-off-by: Howard Cochran
Acked-by: Tejun Heo
Signed-off-by: Miklos Szeredi
Tested-by Sedat Dilek
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman

treewide: Remove old email address

2015-11-23T08:44:58+00:00

There were still a number of references to my old Red Hat email
address in the kernel source. Remove these while keeping the
Red Hat copyright notices intact.

Signed-off-by: Peter Zijlstra (Intel) 
Cc: Arnaldo Carvalho de Melo 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: Vince Weaver 
Signed-off-by: Ingo Molnar

mm/page-writeback.c: initialize m_dirty to avoid compile warning

2015-11-21T00:17:32+00:00

When building kernel with gcc 5.2, the below warning is raised:

  mm/page-writeback.c: In function 'balance_dirty_pages.isra.10':
  mm/page-writeback.c:1545:17: warning: 'm_dirty' may be used uninitialized in this function [-Wmaybe-uninitialized]
     unsigned long m_dirty, m_thresh, m_bg_thresh;

The m_dirty{thresh, bg_thresh} are initialized in the block of "if
(mdtc)", so if mdts is null, they won't be initialized before being used.
Initialize m_dirty to zero, also initialize m_thresh and m_bg_thresh to
keep consistency.

They are used later by if condition: !mdtc || m_dirty <=
dirty_freerun_ceiling(m_thresh, m_bg_thresh)

If mdtc is null, dirty_freerun_ceiling will not be called at all, so the
initialization will not change any behavior other than just ceasing the
compile warning.

(akpm: the patch actually reduces .text size by ~20 bytes on gcc-4.x.y)

[akpm@linux-foundation.org: add comment]
Signed-off-by: Yang Shi 
Cc: Tejun Heo 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

writeback: fix incorrect calculation of available memory for memcg domains

2015-10-12T16:31:13+00:00

For memcg domains, the amount of available memory was calculated as

 min(the amount currently in use + headroom according to memcg,
     total clean memory)

This isn't quite correct as what should be capped by the amount of
clean memory is the headroom, not the sum of memory in use and
headroom.  For example, if a memcg domain has a significant amount of
dirty memory, the above can lead to a value which is lower than the
current amount in use which doesn't make much sense.  In most
circumstances, the above leads to a number which is somewhat but not
drastically lower.

As the amount of memory which can be readily allocated to the memcg
domain is capped by the amount of system-wide clean memory which is
not already assigned to the memcg itself, the number we want is

 the amount currently in use +
 min(headroom according to memcg, clean memory elsewhere in the system)

This patch updates mem_cgroup_wb_stats() to return the number of
filepages and headroom instead of the calculated available pages.
mdtc_cap_avail() is renamed to mdtc_calc_avail() and performs the
above calculation from file, headroom, dirty and globally clean pages.

v2: Dummy mem_cgroup_wb_stats() implementation wasn't updated leading
    to build failure when !CGROUP_WRITEBACK.  Fixed.

Signed-off-by: Tejun Heo 
Fixes: c2aa723a6093 ("writeback: implement memcg writeback domain based throttling")
Signed-off-by: Jens Axboe

writeback: memcg dirty_throttle_control should be initialized with wb->memcg_completions

2015-10-12T16:31:13+00:00

MDTC_INIT() is used to initialize dirty_throttle_control for memcg
domains.  It used DTC_INIT_COMMON() to initialized mdtc->wb and
->wb_completions which is incorrect as DTC_INIT_COMMON() sets the
latter to wb->completions instead of wb->memcg_completions.  This can
lead to wildly incorrect results when calculating the proportion of
dirty memory the memcg domain should get.

Remove DTC_INIT_COMMON() and update MDTC_INIT() to initialize
mdtc->wb_completions to wb->memcg_completions.

Signed-off-by: Tejun Heo 
Fixes: c2aa723a6093 ("writeback: implement memcg writeback domain based throttling")
Signed-off-by: Jens Axboe

writeback: bdi_writeback iteration must not skip dying ones

2015-10-12T16:31:12+00:00

bdi_for_each_wb() is used in several places to wake up or issue
writeback work items to all wb's (bdi_writeback's) on a given bdi.
The iteration is performed by walking bdi->cgwb_tree; however, the
tree only indexes wb's which are currently active.

For example, when a memcg gets associated with a different blkcg, the
old wb is removed from the tree so that the new one can be indexed.
The old wb starts dying from then on but will linger till all its
inodes are drained.  As these dying wb's may still host dirty inodes,
writeback operations which affect all wb's must include them.
bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
failing to sync the inodes belonging to those wb's.

This patch adds a RCU protected @bdi->wb_list which lists all wb's
beloinging to that bdi.  wb's are added on creation and removed on
release rather than on the start of destruction.  bdi_for_each_wb()
usages are replaced with list_for_each[_continue]_rcu() iterations
over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.

v2: Updated as per Jan.  last_wb ref leak in bdi_split_work_to_wbs()
    fixed and unnecessary list head severing in cgwb_bdi_destroy()
    removed.

Signed-off-by: Tejun Heo 
Reported-and-tested-by: Artem Bityutskiy 
Fixes: ebe41ab0c79d ("writeback: implement bdi_for_each_wb()")
Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
Cc: Jan Kara 
Signed-off-by: Jens Axboe

writeback: laptop_mode_timer_fn() needs rcu_read_lock() around bdi_writeback iteration

2015-10-12T16:31:09+00:00

laptop_mode_timer_fn() was using bdi_for_each_wb() without the
required RCU locking leading to the following warning.

 WARNING: CPU: 0 PID: 0 at include/linux/backing-dev.h:415 laptop_mode_timer_fn+0x106/0x170()
 ...
 Call Trace:
    [] dump_stack+0x4e/0x82
  [] warn_slowpath_common+0x82/0xc0
  [] warn_slowpath_null+0x1a/0x20
  [] laptop_mode_timer_fn+0x106/0x170
  [] call_timer_fn+0xb3/0x2f0
  [] run_timer_softirq+0x205/0x370
  [] __do_softirq+0xd4/0x460
  [] irq_exit+0x89/0xa0
  [] smp_apic_timer_interrupt+0x42/0x50
  [] apic_timer_interrupt+0x84/0x90
 ...

Fix it by adding rcu_read_lock() around the iteration.

Signed-off-by: Tejun Heo 
Fixes: a06fd6b10228 ("writeback: make laptop_mode_timer_fn() handle multiple bdi_writeback's")
Reviewed-by: Jan Kara 
Signed-off-by: Jens Axboe