linux-toradex.git/mm/page-writeback.c, branch v3.2.60

mm/page-writeback.c: fix divide by zero in pos_ratio_polynom

2014-06-09T12:29:09+00:00

commit d5c9fde3dae750889168807038243ff36431d276 upstream.

It is possible for "limit - setpoint + 1" to equal zero, after getting
truncated to a 32 bit variable, and resulting in a divide by zero error.

Using the fully 64 bit divide functions avoids this problem.  It also
will cause pos_ratio_polynom() to return the correct value when
(setpoint - limit) exceeds 2^32.

Also uninline pos_ratio_polynom, at Andrew's request.

Signed-off-by: Rik van Riel 
Reviewed-by: Michal Hocko 
Cc: Aneesh Kumar K.V 
Cc: Mel Gorman 
Cc: Nishanth Aravamudan 
Cc: Luiz Capitulino 
Cc: Masayoshi Mizuma 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
[bwh: Backported to 3.2:
 Adjust context - pos_ratio_polynom() is not a separate function]
Signed-off-by: Ben Hutchings

Negative (setpoint-dirty) in bdi_position_ratio()

2014-06-09T12:29:09+00:00

commit ed84825b785ceb932af7dd5aa08614801721320b upstream.

In bdi_position_ratio(), get difference (setpoint-dirty) right even when
negative. Both setpoint and dirty are unsigned long, the difference was
zero-padded thus wrongly sign-extended to s64. This issue affects all
32-bit architectures, does not affect 64-bit architectures where long
and s64 are equivalent.

In this function, dirty is between freerun and limit, the pseudo-float x
is between [-1,1], expected to be negative about half the time. With
zero-padding, instead of a small negative x we obtained a large positive
one so bdi_position_ratio() returned garbage.

Casting the difference to s64 also prevents overflow with left-shift;
though normally these numbers are small and I never observed a 32-bit
overflow there.

(This patch does not solve the PAE OOM issue.)

Paul Szabo   psz@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia

Reviewed-by: Jan Kara 
Reported-by: Paul Szabo 
Reference: http://bugs.debian.org/695182
Signed-off-by: Paul Szabo 
Signed-off-by: Fengguang Wu 
Signed-off-by: Ben Hutchings

mm: __set_page_dirty_nobuffers() uses spin_lock_irqsave() instead of spin_lock_irq()

2014-04-01T23:58:49+00:00

commit a85d9df1ea1d23682a0ed1e100e6965006595d06 upstream.

During aio stress test, we observed the following lockdep warning.  This
mean AIO+numa_balancing is currently deadlockable.

The problem is, aio_migratepage disable interrupt, but
__set_page_dirty_nobuffers unintentionally enable it again.

Generally, all helper function should use spin_lock_irqsave() instead of
spin_lock_irq() because they don't know caller at all.

   other info that might help us debug this:
    Possible unsafe locking scenario:

          CPU0
          ----
     lock(&(&ctx->completion_lock)->rlock);
     
       lock(&(&ctx->completion_lock)->rlock);

    *** DEADLOCK ***

      dump_stack+0x19/0x1b
      print_usage_bug+0x1f7/0x208
      mark_lock+0x21d/0x2a0
      mark_held_locks+0xb9/0x140
      trace_hardirqs_on_caller+0x105/0x1d0
      trace_hardirqs_on+0xd/0x10
      _raw_spin_unlock_irq+0x2c/0x50
      __set_page_dirty_nobuffers+0x8c/0xf0
      migrate_page_copy+0x434/0x540
      aio_migratepage+0xb1/0x140
      move_to_new_page+0x7d/0x230
      migrate_pages+0x5e5/0x700
      migrate_misplaced_page+0xbc/0xf0
      do_numa_page+0x102/0x190
      handle_pte_fault+0x241/0x970
      handle_mm_fault+0x265/0x370
      __do_page_fault+0x172/0x5a0
      do_page_fault+0x1a/0x70
      page_fault+0x28/0x30

Signed-off-by: KOSAKI Motohiro 
Cc: Larry Woodman 
Cc: Rik van Riel 
Cc: Johannes Weiner 
Acked-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Ben Hutchings

writeback: fix dirtied pages accounting on redirty

2013-04-25T19:25:43+00:00

commit 2f800fbd777b792de54187088df19a7df0251254 upstream.

De-account the accumulative dirty counters on page redirty.

Page redirties (very common in ext4) will introduce mismatch between
counters (a) and (b)

a) NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied
b) NR_WRITTEN, BDI_WRITTEN

This will introduce systematic errors in balanced_rate and result in
dirty page position errors (ie. the dirty pages are no longer balanced
around the global/bdi setpoints).

Acked-by: Jan Kara 
Acked-by: Peter Zijlstra 
Signed-off-by: Wu Fengguang 
Signed-off-by: Ben Hutchings

writeback: set max_pause to lowest value on zero bdi_dirty

2011-12-08T02:49:29+00:00

Some trace shows lots of bdi_dirty=0 lines where it's actually some
small value if w/o the accounting errors in the per-cpu bdi stats.

In this case the max pause time should really be set to the smallest
(non-zero) value to avoid IO queue underrun and improve throughput.

Signed-off-by: Wu Fengguang

writeback: permit through good bdi even when global dirty exceeded

2011-12-08T02:49:27+00:00

On a system with 1 local mount and 1 NFS mount, if the NFS server
becomes not responding when dd to the NFS mount, the NFS dirty pages may
exceed the global dirty limit and _every_ task involving writing will be
blocked. The whole system appears unresponsive.

The workaround is to permit through the bdi's that only has a small
number of dirty pages. The number chosen (bdi_stat_error pages) is not
enough to enable the local disk to run in optimal throughput, however is
enough to make the system responsive on a broken NFS mount. The user can
then kill the dirtiers on the NFS mount and increase the global dirty
limit to bring up the local disk's throughput.

It risks allowing dirty pages to grow much larger than the global dirty
limit when there are 1000+ mounts, however that's very unlikely to happen,
especially in low memory profiles.

Signed-off-by: Wu Fengguang

writeback: comment on the bdi dirty threshold

2011-12-08T02:49:20+00:00

We do "floating proportions" to let active devices to grow its target
share of dirty pages and stalled/inactive devices to decrease its target
share over time.

It works well except in the case of "an inactive disk suddenly goes
busy", where the initial target share may be too small. To mitigate
this, bdi_position_ratio() has the below line to raise a small
bdi_thresh when it's safe to do so, so that the disk be feed with enough
dirty pages for efficient IO and in turn fast rampup of bdi_thresh:

        bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);

balance_dirty_pages() normally does negative feedback control which
adjusts ratelimit to balance the bdi dirty pages around the target.
In some extreme cases when that is not enough, it will have to block
the tasks completely until the bdi dirty pages drop below bdi_thresh.

Acked-by: Jan Kara 
Acked-by: Peter Zijlstra 
Signed-off-by: Wu Fengguang

writeback: remove vm_dirties and task->dirties

2011-11-17T12:49:06+00:00

They are not used any more.

Signed-off-by: Wu Fengguang

writeback: hard throttle 1000+ dd on a slow USB stick

2011-11-17T12:39:32+00:00

The sleep based balance_dirty_pages() can pause at most MAX_PAUSE=200ms
on every 1 4KB-page, which means it cannot throttle a task under
4KB/200ms=20KB/s. So when there are more than 512 dd writing to a
10MB/s USB stick, its bdi dirty pages could grow out of control.

Even if we can increase MAX_PAUSE, the minimal (task_ratelimit = 1)
means a limit of 4KB/s.
                                                       
They can eventually be safeguarded by the global limit check 
(nr_dirty < dirty_thresh). However if someone is also writing to an 
HDD at the same time, it'll get poor HDD write performance.
                                                       
We at least want to maintain good write performance for other devices
when one device is attacked by some "massive parallel" workload, or
suffers from slow write bandwidth, or somehow get stalled due to some 
error condition (eg. NFS server not responding).

For a stalled device, we need to completely block its dirtiers, too,
before its bdi dirty pages grow all the way up to the global limit and
leave no space for the other functional devices.

So change the loop exit condition to

	/*
	 * Always enforce global dirty limit; also enforce bdi dirty limit
	 * if the normal max_pause sleeps cannot keep things under control.
	 */
	if (nr_dirty < dirty_thresh &&
	    (bdi_dirty < bdi_thresh || bdi->dirty_ratelimit > 1))
		break;

which can be further simplified to

	if (task_ratelimit)
		break;

Signed-off-by: Wu Fengguang

mm: Make task in balance_dirty_pages() killable

2011-11-16T11:53:44+00:00

There is no reason why task in balance_dirty_pages() shouldn't be killable
and it helps in recovering from some error conditions (like when filesystem
goes in error state and cannot accept writeback anymore but we still want to
kill processes using it to be able to unmount it).

There will be follow up patches to further abort the generic_perform_write()
and other filesystem write loops, to avoid large write + SIGKILL combination
exceeding the dirty limit and possibly strange OOM.

Reported-by: Kazuya Mio 
Tested-by: Kazuya Mio 
Reviewed-by: Neil Brown 
Reviewed-by: KOSAKI Motohiro 
Signed-off-by: Jan Kara 
Signed-off-by: Wu Fengguang