<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/fs, branch v3.2.56</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>cifs: set MAY_SIGN when sec=krb5</title>
<updated>2014-04-01T23:59:01+00:00</updated>
<author>
<name>Martijn de Gouw</name>
<email>martijn.de.gouw@prodrive.nl</email>
</author>
<published>2012-10-24T09:45:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=9924c0fa968bb6104a99b7b7961a6c188fed9b31'/>
<id>9924c0fa968bb6104a99b7b7961a6c188fed9b31</id>
<content type='text'>
commit 0b7bc84000d71f3647ca33ab1bf5bd928535c846 upstream.

Setting this secFlg allows usage of dfs where some servers require
signing and others don't.

Signed-off-by: Martijn de Gouw &lt;martijn.de.gouw@prodrive.nl&gt;
Signed-off-by: Jeff Layton &lt;jlayton@redhat.com&gt;
Signed-off-by: Steve French &lt;sfrench@us.ibm.com&gt;
[Joseph Salisbury: This backport was done so including mainline commit
8830d7e07a5e38bc47650a7554b7c1cfd49902bf is not needed.]
BugLink: http://bugs.launchpad.net/bugs/1285723
Signed-off-by: Joseph Salisbury &lt;joseph.salisbury@canonical.com&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 0b7bc84000d71f3647ca33ab1bf5bd928535c846 upstream.

Setting this secFlg allows usage of dfs where some servers require
signing and others don't.

Signed-off-by: Martijn de Gouw &lt;martijn.de.gouw@prodrive.nl&gt;
Signed-off-by: Jeff Layton &lt;jlayton@redhat.com&gt;
Signed-off-by: Steve French &lt;sfrench@us.ibm.com&gt;
[Joseph Salisbury: This backport was done so including mainline commit
8830d7e07a5e38bc47650a7554b7c1cfd49902bf is not needed.]
BugLink: http://bugs.launchpad.net/bugs/1285723
Signed-off-by: Joseph Salisbury &lt;joseph.salisbury@canonical.com&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>hpfs: deadlock and race in directory lseek()</title>
<updated>2014-04-01T23:59:00+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2013-05-18T06:38:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=b89ff066efab68d53a3a060b2c7d9089f8afa8ea'/>
<id>b89ff066efab68d53a3a060b2c7d9089f8afa8ea</id>
<content type='text'>
commit 31abdab9c11bb1694ecd1476a7edbe8e964d94ac upstream.

For one thing, there's an ABBA deadlock on hpfs fs-wide lock and i_mutex
in hpfs_dir_lseek() - there's a lot of methods that grab the former with
the caller already holding the latter, so it must take i_mutex first.

For another, locking the damn thing, carefully validating the offset,
then dropping locks and assigning the offset is obviously racy.

Moreover, we _must_ do hpfs_add_pos(), or the machinery in dnode.c
won't modify the sucker on B-tree surgeries.

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 31abdab9c11bb1694ecd1476a7edbe8e964d94ac upstream.

For one thing, there's an ABBA deadlock on hpfs fs-wide lock and i_mutex
in hpfs_dir_lseek() - there's a lot of methods that grab the former with
the caller already holding the latter, so it must take i_mutex first.

For another, locking the damn thing, carefully validating the offset,
then dropping locks and assigning the offset is obviously racy.

Moreover, we _must_ do hpfs_add_pos(), or the machinery in dnode.c
won't modify the sucker on B-tree surgeries.

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>hpfs: remember free space</title>
<updated>2014-04-01T23:59:00+00:00</updated>
<author>
<name>Mikulas Patocka</name>
<email>mikulas@artax.karlin.mff.cuni.cz</email>
</author>
<published>2014-01-28T23:10:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=578d1903dcfd8911534aa602ea8c104c383fadda'/>
<id>578d1903dcfd8911534aa602ea8c104c383fadda</id>
<content type='text'>
commit 2cbe5c76fc5e38e9af4b709593146e4b8272b69e upstream.

Previously, hpfs scanned all bitmaps each time the user asked for free
space using statfs.  This patch changes it so that hpfs scans the
bitmaps only once, remembes the free space and on next invocation of
statfs it returns the value instantly.

New versions of wine are hammering on the statfs syscall very heavily,
making some games unplayable when they're stored on hpfs, with load
times in minutes.

This should be backported to the stable kernels because it fixes
user-visible problem (excessive level load times in wine).

Signed-off-by: Mikulas Patocka &lt;mikulas@artax.karlin.mff.cuni.cz&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
[ kamal: backport to 3.8 (no hpfs_prefetch_bitmap) ]
Signed-off-by: Kamal Mostafa &lt;kamal@canonical.com&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 2cbe5c76fc5e38e9af4b709593146e4b8272b69e upstream.

Previously, hpfs scanned all bitmaps each time the user asked for free
space using statfs.  This patch changes it so that hpfs scans the
bitmaps only once, remembes the free space and on next invocation of
statfs it returns the value instantly.

New versions of wine are hammering on the statfs syscall very heavily,
making some games unplayable when they're stored on hpfs, with load
times in minutes.

This should be backported to the stable kernels because it fixes
user-visible problem (excessive level load times in wine).

Signed-off-by: Mikulas Patocka &lt;mikulas@artax.karlin.mff.cuni.cz&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
[ kamal: backport to 3.8 (no hpfs_prefetch_bitmap) ]
Signed-off-by: Kamal Mostafa &lt;kamal@canonical.com&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>nfs: fix do_div() warning by instead using sector_div()</title>
<updated>2014-04-01T23:58:59+00:00</updated>
<author>
<name>Helge Deller</name>
<email>deller@gmx.de</email>
</author>
<published>2013-12-02T18:59:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=c725cf148008d81eac7fe4bebff7367ae2aa0aa2'/>
<id>c725cf148008d81eac7fe4bebff7367ae2aa0aa2</id>
<content type='text'>
commit 3873d064b8538686bbbd4b858dc8a07db1f7f43a upstream.

When compiling a 32bit kernel with CONFIG_LBDAF=n the compiler complains like
shown below.  Fix this warning by instead using sector_div() which is provided
by the kernel.h header file.

fs/nfs/blocklayout/extents.c: In function ‘normalize’:
include/asm-generic/div64.h:43:28: warning: comparison of distinct pointer types lacks a cast [enabled by default]
fs/nfs/blocklayout/extents.c:47:13: note: in expansion of macro ‘do_div’
nfs/blocklayout/extents.c:47:2: warning: right shift count &gt;= width of type [enabled by default]
fs/nfs/blocklayout/extents.c:47:2: warning: passing argument 1 of ‘__div64_32’ from incompatible pointer type [enabled by default]
include/asm-generic/div64.h:35:17: note: expected ‘uint64_t *’ but argument is of type ‘sector_t *’
extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);

Signed-off-by: Helge Deller &lt;deller@gmx.de&gt;
Signed-off-by: Trond Myklebust &lt;Trond.Myklebust@netapp.com&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 3873d064b8538686bbbd4b858dc8a07db1f7f43a upstream.

When compiling a 32bit kernel with CONFIG_LBDAF=n the compiler complains like
shown below.  Fix this warning by instead using sector_div() which is provided
by the kernel.h header file.

fs/nfs/blocklayout/extents.c: In function ‘normalize’:
include/asm-generic/div64.h:43:28: warning: comparison of distinct pointer types lacks a cast [enabled by default]
fs/nfs/blocklayout/extents.c:47:13: note: in expansion of macro ‘do_div’
nfs/blocklayout/extents.c:47:2: warning: right shift count &gt;= width of type [enabled by default]
fs/nfs/blocklayout/extents.c:47:2: warning: passing argument 1 of ‘__div64_32’ from incompatible pointer type [enabled by default]
include/asm-generic/div64.h:35:17: note: expected ‘uint64_t *’ but argument is of type ‘sector_t *’
extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);

Signed-off-by: Helge Deller &lt;deller@gmx.de&gt;
Signed-off-by: Trond Myklebust &lt;Trond.Myklebust@netapp.com&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>ocfs2 syncs the wrong range...</title>
<updated>2014-04-01T23:58:59+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2014-02-10T20:18:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=fb183963fa094a6973797eb7b4bf7106da565146'/>
<id>fb183963fa094a6973797eb7b4bf7106da565146</id>
<content type='text'>
commit 1b56e98990bcdbb20b9fab163654b9315bf158e8 upstream.

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 1b56e98990bcdbb20b9fab163654b9315bf158e8 upstream.

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>ocfs2: fix quota file corruption</title>
<updated>2014-04-01T23:58:57+00:00</updated>
<author>
<name>Jan Kara</name>
<email>jack@suse.cz</email>
</author>
<published>2014-03-03T23:38:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=803299de13d1693823f00b6c52d1e3a987f291bb'/>
<id>803299de13d1693823f00b6c52d1e3a987f291bb</id>
<content type='text'>
commit 15c34a760630ca2c803848fba90ca0646a9907dd upstream.

Global quota files are accessed from different nodes.  Thus we cannot
cache offset of quota structure in the quota file after we drop our node
reference count to it because after that moment quota structure may be
freed and reallocated elsewhere by a different node resulting in
corruption of quota file.

Fix the problem by clearing dq_off when we are releasing dquot structure.
We also remove the DB_READ_B handling because it is useless -
DQ_ACTIVE_B is set iff DQ_READ_B is set.

Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Cc: Goldwyn Rodrigues &lt;rgoldwyn@suse.de&gt;
Cc: Joel Becker &lt;jlbec@evilplan.org&gt;
Reviewed-by: Mark Fasheh &lt;mfasheh@suse.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 15c34a760630ca2c803848fba90ca0646a9907dd upstream.

Global quota files are accessed from different nodes.  Thus we cannot
cache offset of quota structure in the quota file after we drop our node
reference count to it because after that moment quota structure may be
freed and reallocated elsewhere by a different node resulting in
corruption of quota file.

Fix the problem by clearing dq_off when we are releasing dquot structure.
We also remove the DB_READ_B handling because it is useless -
DQ_ACTIVE_B is set iff DQ_READ_B is set.

Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Cc: Goldwyn Rodrigues &lt;rgoldwyn@suse.de&gt;
Cc: Joel Becker &lt;jlbec@evilplan.org&gt;
Reviewed-by: Mark Fasheh &lt;mfasheh@suse.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>quota: Fix race between dqput() and dquot_scan_active()</title>
<updated>2014-04-01T23:58:55+00:00</updated>
<author>
<name>Jan Kara</name>
<email>jack@suse.cz</email>
</author>
<published>2014-02-20T16:02:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=7149a5e5c1fecc22a838f87ed86bfa3cc0efc984'/>
<id>7149a5e5c1fecc22a838f87ed86bfa3cc0efc984</id>
<content type='text'>
commit 1362f4ea20fa63688ba6026e586d9746ff13a846 upstream.

Currently last dqput() can race with dquot_scan_active() causing it to
call callback for an already deactivated dquot. The race is as follows:

CPU1					CPU2
  dqput()
    spin_lock(&amp;dq_list_lock);
    if (atomic_read(&amp;dquot-&gt;dq_count) &gt; 1) {
     - not taken
    if (test_bit(DQ_ACTIVE_B, &amp;dquot-&gt;dq_flags)) {
      spin_unlock(&amp;dq_list_lock);
      -&gt;release_dquot(dquot);
        if (atomic_read(&amp;dquot-&gt;dq_count) &gt; 1)
         - not taken
					  dquot_scan_active()
					    spin_lock(&amp;dq_list_lock);
					    if (!test_bit(DQ_ACTIVE_B, &amp;dquot-&gt;dq_flags))
					     - not taken
					    atomic_inc(&amp;dquot-&gt;dq_count);
					    spin_unlock(&amp;dq_list_lock);
        - proceeds to release dquot
					    ret = fn(dquot, priv);
					     - called for inactive dquot

Fix the problem by making sure possible -&gt;release_dquot() is finished by
the time we call the callback and new calls to it will notice reference
dquot_scan_active() has taken and bail out.

Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 1362f4ea20fa63688ba6026e586d9746ff13a846 upstream.

Currently last dqput() can race with dquot_scan_active() causing it to
call callback for an already deactivated dquot. The race is as follows:

CPU1					CPU2
  dqput()
    spin_lock(&amp;dq_list_lock);
    if (atomic_read(&amp;dquot-&gt;dq_count) &gt; 1) {
     - not taken
    if (test_bit(DQ_ACTIVE_B, &amp;dquot-&gt;dq_flags)) {
      spin_unlock(&amp;dq_list_lock);
      -&gt;release_dquot(dquot);
        if (atomic_read(&amp;dquot-&gt;dq_count) &gt; 1)
         - not taken
					  dquot_scan_active()
					    spin_lock(&amp;dq_list_lock);
					    if (!test_bit(DQ_ACTIVE_B, &amp;dquot-&gt;dq_flags))
					     - not taken
					    atomic_inc(&amp;dquot-&gt;dq_count);
					    spin_unlock(&amp;dq_list_lock);
        - proceeds to release dquot
					    ret = fn(dquot, priv);
					     - called for inactive dquot

Fix the problem by making sure possible -&gt;release_dquot() is finished by
the time we call the callback and new calls to it will notice reference
dquot_scan_active() has taken and bail out.

Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>ext4: don't leave i_crtime.tv_sec uninitialized</title>
<updated>2014-04-01T23:58:53+00:00</updated>
<author>
<name>Theodore Ts'o</name>
<email>tytso@mit.edu</email>
</author>
<published>2014-02-17T00:29:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=426f6c8a5051be7dafe96dba65ff73b778ba7839'/>
<id>426f6c8a5051be7dafe96dba65ff73b778ba7839</id>
<content type='text'>
commit 19ea80603715d473600cd993b9987bc97d042e02 upstream.

If the i_crtime field is not present in the inode, don't leave the
field uninitialized.

Fixes: ef7f38359 ("ext4: Add nanosecond timestamps")
Reported-by: Vegard Nossum &lt;vegard.nossum@oracle.com&gt;
Tested-by: Vegard Nossum &lt;vegard.nossum@oracle.com&gt;
Signed-off-by: "Theodore Ts'o" &lt;tytso@mit.edu&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 19ea80603715d473600cd993b9987bc97d042e02 upstream.

If the i_crtime field is not present in the inode, don't leave the
field uninitialized.

Fixes: ef7f38359 ("ext4: Add nanosecond timestamps")
Reported-by: Vegard Nossum &lt;vegard.nossum@oracle.com&gt;
Tested-by: Vegard Nossum &lt;vegard.nossum@oracle.com&gt;
Signed-off-by: "Theodore Ts'o" &lt;tytso@mit.edu&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>lockd: send correct lock when granting a delayed lock.</title>
<updated>2014-04-01T23:58:52+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.de</email>
</author>
<published>2014-02-07T06:10:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=be1dad9009c25d4d7c1a4ff61583868f42d25257'/>
<id>be1dad9009c25d4d7c1a4ff61583868f42d25257</id>
<content type='text'>
commit 2ec197db1a56c9269d75e965f14c344b58b2a4f6 upstream.

If an NFS client attempts to get a lock (using NLM) and the lock is
not available, the server will remember the request and when the lock
becomes available it will send a GRANT request to the client to
provide the lock.

If the client already held an adjacent lock, the GRANT callback will
report the union of the existing and new locks, which can confuse the
client.

This happens because __posix_lock_file (called by vfs_lock_file)
updates the passed-in file_lock structure when adjacent or
over-lapping locks are found.

To avoid this problem we take a copy of the two fields that can
be changed (fl_start and fl_end) before the call and restore them
afterwards.
An alternate would be to allocate a 'struct file_lock', initialise it,
use locks_copy_lock() to take a copy, then locks_release_private()
after the vfs_lock_file() call.  But that is a lot more work.

Reported-by: Olaf Kirch &lt;okir@suse.com&gt;
Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
Signed-off-by: J. Bruce Fields &lt;bfields@redhat.com&gt;

--
v1 had a couple of issues (large on-stack struct and didn't really work properly).
This version is much better tested.
Signed-off-by: J. Bruce Fields &lt;bfields@redhat.com&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 2ec197db1a56c9269d75e965f14c344b58b2a4f6 upstream.

If an NFS client attempts to get a lock (using NLM) and the lock is
not available, the server will remember the request and when the lock
becomes available it will send a GRANT request to the client to
provide the lock.

If the client already held an adjacent lock, the GRANT callback will
report the union of the existing and new locks, which can confuse the
client.

This happens because __posix_lock_file (called by vfs_lock_file)
updates the passed-in file_lock structure when adjacent or
over-lapping locks are found.

To avoid this problem we take a copy of the two fields that can
be changed (fl_start and fl_end) before the call and restore them
afterwards.
An alternate would be to allocate a 'struct file_lock', initialise it,
use locks_copy_lock() to take a copy, then locks_release_private()
after the vfs_lock_file() call.  But that is a lot more work.

Reported-by: Olaf Kirch &lt;okir@suse.com&gt;
Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
Signed-off-by: J. Bruce Fields &lt;bfields@redhat.com&gt;

--
v1 had a couple of issues (large on-stack struct and didn't really work properly).
This version is much better tested.
Signed-off-by: J. Bruce Fields &lt;bfields@redhat.com&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>fs/file.c:fdtable: avoid triggering OOMs from alloc_fdmem</title>
<updated>2014-04-01T23:58:50+00:00</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2014-02-10T22:25:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=24f600a42821551c33114b7598f2a6b7fc1894b6'/>
<id>24f600a42821551c33114b7598f2a6b7fc1894b6</id>
<content type='text'>
commit 96c7a2ff21501691587e1ae969b83cbec8b78e08 upstream.

Recently due to a spike in connections per second memcached on 3
separate boxes triggered the OOM killer from accept.  At the time the
OOM killer was triggered there was 4GB out of 36GB free in zone 1.  The
problem was that alloc_fdtable was allocating an order 3 page (32KiB) to
hold a bitmap, and there was sufficient fragmentation that the largest
page available was 8KiB.

I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious
but I do agree that order 3 allocations are very likely to succeed.

There are always pathologies where order &gt; 0 allocations can fail when
there are copious amounts of free memory available.  Using the pigeon
hole principle it is easy to show that it requires 1 page more than 50%
of the pages being free to guarantee an order 1 (8KiB) allocation will
succeed, 1 page more than 75% of the pages being free to guarantee an
order 2 (16KiB) allocation will succeed and 1 page more than 87.5% of
the pages being free to guarantee an order 3 allocate will succeed.

A server churning memory with a lot of small requests and replies like
memcached is a common case that if anything can will skew the odds
against large pages being available.

Therefore let's not give external applications a practical way to kill
linux server applications, and specify __GFP_NORETRY to the kmalloc in
alloc_fdmem.  Unless I am misreading the code and by the time the code
reaches should_alloc_retry in __alloc_pages_slowpath (where
__GFP_NORETRY becomes signification).  We have already tried everything
reasonable to allocate a page and the only thing left to do is wait.  So
not waiting and falling back to vmalloc immediately seems like the
reasonable thing to do even if there wasn't a chance of triggering the
OOM killer.

Signed-off-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Cc: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Cc: Cong Wang &lt;cwang@twopensource.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 96c7a2ff21501691587e1ae969b83cbec8b78e08 upstream.

Recently due to a spike in connections per second memcached on 3
separate boxes triggered the OOM killer from accept.  At the time the
OOM killer was triggered there was 4GB out of 36GB free in zone 1.  The
problem was that alloc_fdtable was allocating an order 3 page (32KiB) to
hold a bitmap, and there was sufficient fragmentation that the largest
page available was 8KiB.

I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious
but I do agree that order 3 allocations are very likely to succeed.

There are always pathologies where order &gt; 0 allocations can fail when
there are copious amounts of free memory available.  Using the pigeon
hole principle it is easy to show that it requires 1 page more than 50%
of the pages being free to guarantee an order 1 (8KiB) allocation will
succeed, 1 page more than 75% of the pages being free to guarantee an
order 2 (16KiB) allocation will succeed and 1 page more than 87.5% of
the pages being free to guarantee an order 3 allocate will succeed.

A server churning memory with a lot of small requests and replies like
memcached is a common case that if anything can will skew the odds
against large pages being available.

Therefore let's not give external applications a practical way to kill
linux server applications, and specify __GFP_NORETRY to the kmalloc in
alloc_fdmem.  Unless I am misreading the code and by the time the code
reaches should_alloc_retry in __alloc_pages_slowpath (where
__GFP_NORETRY becomes signification).  We have already tried everything
reasonable to allocate a page and the only thing left to do is wait.  So
not waiting and falling back to vmalloc immediately seems like the
reasonable thing to do even if there wasn't a chance of triggering the
OOM killer.

Signed-off-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Cc: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Cc: Cong Wang &lt;cwang@twopensource.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</pre>
</div>
</content>
</entry>
</feed>
