<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/Documentation/sysctl, branch v4.4.38</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>pipe: limit the per-user amount of pages allocated in pipes</title>
<updated>2016-06-08T01:14:35+00:00</updated>
<author>
<name>Willy Tarreau</name>
<email>w@1wt.eu</email>
</author>
<published>2016-01-18T15:36:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=fa6d0ba12a8eb6a2e9a1646c5816da307c1f93a7'/>
<id>fa6d0ba12a8eb6a2e9a1646c5816da307c1f93a7</id>
<content type='text'>
commit 759c01142a5d0f364a462346168a56de28a80f52 upstream.

On no-so-small systems, it is possible for a single process to cause an
OOM condition by filling large pipes with data that are never read. A
typical process filling 4000 pipes with 1 MB of data will use 4 GB of
memory. On small systems it may be tricky to set the pipe max size to
prevent this from happening.

This patch makes it possible to enforce a per-user soft limit above
which new pipes will be limited to a single page, effectively limiting
them to 4 kB each, as well as a hard limit above which no new pipes may
be created for this user. This has the effect of protecting the system
against memory abuse without hurting other users, and still allowing
pipes to work correctly though with less data at once.

The limit are controlled by two new sysctls : pipe-user-pages-soft, and
pipe-user-pages-hard. Both may be disabled by setting them to zero. The
default soft limit allows the default number of FDs per process (1024)
to create pipes of the default size (64kB), thus reaching a limit of 64MB
before starting to create only smaller pipes. With 256 processes limited
to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB =
1084 MB of memory allocated for a user. The hard limit is disabled by
default to avoid breaking existing applications that make intensive use
of pipes (eg: for splicing).

Reported-by: socketpair@gmail.com
Reported-by: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
Mitigates: CVE-2013-4312 (Linux 2.0+)
Suggested-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Willy Tarreau &lt;w@1wt.eu&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Moritz Muehlenhoff &lt;moritz@wikimedia.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 759c01142a5d0f364a462346168a56de28a80f52 upstream.

On no-so-small systems, it is possible for a single process to cause an
OOM condition by filling large pipes with data that are never read. A
typical process filling 4000 pipes with 1 MB of data will use 4 GB of
memory. On small systems it may be tricky to set the pipe max size to
prevent this from happening.

This patch makes it possible to enforce a per-user soft limit above
which new pipes will be limited to a single page, effectively limiting
them to 4 kB each, as well as a hard limit above which no new pipes may
be created for this user. This has the effect of protecting the system
against memory abuse without hurting other users, and still allowing
pipes to work correctly though with less data at once.

The limit are controlled by two new sysctls : pipe-user-pages-soft, and
pipe-user-pages-hard. Both may be disabled by setting them to zero. The
default soft limit allows the default number of FDs per process (1024)
to create pipes of the default size (64kB), thus reaching a limit of 64MB
before starting to create only smaller pipes. With 256 processes limited
to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB =
1084 MB of memory allocated for a user. The hard limit is disabled by
default to avoid breaking existing applications that make intensive use
of pipes (eg: for splicing).

Reported-by: socketpair@gmail.com
Reported-by: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
Mitigates: CVE-2013-4312 (Linux 2.0+)
Suggested-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Willy Tarreau &lt;w@1wt.eu&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Moritz Muehlenhoff &lt;moritz@wikimedia.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>Documentation/sysctl/vm.txt: fix misleading code reference of overcommit_memory</title>
<updated>2015-11-09T23:11:24+00:00</updated>
<author>
<name>Chun Chen</name>
<email>chenchun.feed@gmail.com</email>
</author>
<published>2015-11-09T22:58:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=c56050c700d18f18fbec934f56069150bcec3709'/>
<id>c56050c700d18f18fbec934f56069150bcec3709</id>
<content type='text'>
The origin document references to cap_vm_enough_memory is because
cap_vm_enough_memory invoked __vm_enough_memory before and it no longer
does now.

Signed-off-by: Chun Chen &lt;ramichen@tencent.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The origin document references to cap_vm_enough_memory is because
cap_vm_enough_memory invoked __vm_enough_memory before and it no longer
does now.

Signed-off-by: Chun Chen &lt;ramichen@tencent.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>kernel/watchdog.c: perform all-CPU backtrace in case of hard lockup</title>
<updated>2015-11-06T03:34:48+00:00</updated>
<author>
<name>Jiri Kosina</name>
<email>jkosina@suse.cz</email>
</author>
<published>2015-11-06T02:44:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=55537871ef666b4153fd1ef8782e4a13fee142cc'/>
<id>55537871ef666b4153fd1ef8782e4a13fee142cc</id>
<content type='text'>
In many cases of hardlockup reports, it's actually not possible to know
why it triggered, because the CPU that got stuck is usually waiting on a
resource (with IRQs disabled) in posession of some other CPU is holding.

IOW, we are often looking at the stacktrace of the victim and not the
actual offender.

Introduce sysctl / cmdline parameter that makes it possible to have
hardlockup detector perform all-CPU backtrace.

Signed-off-by: Jiri Kosina &lt;jkosina@suse.cz&gt;
Reviewed-by: Aaron Tomlin &lt;atomlin@redhat.com&gt;
Cc: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Acked-by: Don Zickus &lt;dzickus@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In many cases of hardlockup reports, it's actually not possible to know
why it triggered, because the CPU that got stuck is usually waiting on a
resource (with IRQs disabled) in posession of some other CPU is holding.

IOW, we are often looking at the stacktrace of the victim and not the
actual offender.

Introduce sysctl / cmdline parameter that makes it possible to have
hardlockup detector perform all-CPU backtrace.

Signed-off-by: Jiri Kosina &lt;jkosina@suse.cz&gt;
Reviewed-by: Aaron Tomlin &lt;atomlin@redhat.com&gt;
Cc: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Acked-by: Don Zickus &lt;dzickus@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: qdisc: enhance default_qdisc documentation</title>
<updated>2015-09-17T23:09:22+00:00</updated>
<author>
<name>Phil Sutter</name>
<email>phil@nwl.cc</email>
</author>
<published>2015-09-15T08:33:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=2e64126bb0fb581b55e52e51ec451ebb0d6769c8'/>
<id>2e64126bb0fb581b55e52e51ec451ebb0d6769c8</id>
<content type='text'>
Aside from some lingual cleanup, point out which interfaces are not or
partly covered by this setting.

Signed-off-by: Phil Sutter &lt;phil@nwl.cc&gt;
Acked-by: Cong Wang &lt;cwang@twopensource.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Aside from some lingual cleanup, point out which interfaces are not or
partly covered by this setting.

Signed-off-by: Phil Sutter &lt;phil@nwl.cc&gt;
Acked-by: Cong Wang &lt;cwang@twopensource.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm/page_alloc.c: fix a misleading comment</title>
<updated>2015-09-08T22:35:28+00:00</updated>
<author>
<name>Yaowei Bai</name>
<email>bywxiaobai@163.com</email>
</author>
<published>2015-09-08T22:04:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=013110a73dcf970cb28c5b0a79f9eee577ea6aa2'/>
<id>013110a73dcf970cb28c5b0a79f9eee577ea6aa2</id>
<content type='text'>
The comment says that the per-cpu batchsize and zone watermarks are
determined by present_pages which is definitely wrong, they are both
calculated from managed_pages.  Fix it.

Signed-off-by: Yaowei Bai &lt;bywxiaobai@163.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The comment says that the per-cpu batchsize and zone watermarks are
determined by present_pages which is definitely wrong, they are both
calculated from managed_pages.  Fix it.

Signed-off-by: Yaowei Bai &lt;bywxiaobai@163.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Documentation: mm: fix location of extfrag_index</title>
<updated>2015-07-24T13:05:56+00:00</updated>
<author>
<name>Rabin Vincent</name>
<email>rabin.vincent@axis.com</email>
</author>
<published>2015-07-14T05:35:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=a10726bb5472ff2ed95180cfb5e82091c45d7b19'/>
<id>a10726bb5472ff2ed95180cfb5e82091c45d7b19</id>
<content type='text'>
/proc/extfrag_index does not exist.  This file is in debugfs.  Fix the
description of extfrag_threshold to reflect this.

Signed-off-by: Rabin Vincent &lt;rabin.vincent@axis.com&gt;
Signed-off-by: Jonathan Corbet &lt;corbet@lwn.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
/proc/extfrag_index does not exist.  This file is in debugfs.  Fix the
description of extfrag_threshold to reflect this.

Signed-off-by: Rabin Vincent &lt;rabin.vincent@axis.com&gt;
Signed-off-by: Jonathan Corbet &lt;corbet@lwn.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>coredump: use from_kuid/kgid when formatting corename</title>
<updated>2015-06-26T00:00:43+00:00</updated>
<author>
<name>Nicolas Iooss</name>
<email>nicolas.iooss_linux@m4x.org</email>
</author>
<published>2015-06-25T22:03:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=5202efe544c279be152f44f2821010ff7b2d7e5b'/>
<id>5202efe544c279be152f44f2821010ff7b2d7e5b</id>
<content type='text'>
When adding __printf attribute to cn_printf, gcc reports some issues:

  fs/coredump.c:213:5: warning: format '%d' expects argument of type
  'int', but argument 3 has type 'kuid_t' [-Wformat=]
       err = cn_printf(cn, "%d", cred-&gt;uid);
       ^
  fs/coredump.c:217:5: warning: format '%d' expects argument of type
  'int', but argument 3 has type 'kgid_t' [-Wformat=]
       err = cn_printf(cn, "%d", cred-&gt;gid);
       ^

These warnings come from the fact that the value of uid/gid needs to be
extracted from the kuid_t/kgid_t structure before being used as an
integer.  More precisely, cred-&gt;uid and cred-&gt;gid need to be converted to
either user-namespace uid/gid or to init_user_ns uid/gid.

Use init_user_ns in order not to break existing ABI, and document this in
Documentation/sysctl/kernel.txt.

While at it, format uid and gid values with %u instead of %d because
uid_t/__kernel_uid32_t and gid_t/__kernel_gid32_t are unsigned int.

Signed-off-by: Nicolas Iooss &lt;nicolas.iooss_linux@m4x.org&gt;
Acked-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When adding __printf attribute to cn_printf, gcc reports some issues:

  fs/coredump.c:213:5: warning: format '%d' expects argument of type
  'int', but argument 3 has type 'kuid_t' [-Wformat=]
       err = cn_printf(cn, "%d", cred-&gt;uid);
       ^
  fs/coredump.c:217:5: warning: format '%d' expects argument of type
  'int', but argument 3 has type 'kgid_t' [-Wformat=]
       err = cn_printf(cn, "%d", cred-&gt;gid);
       ^

These warnings come from the fact that the value of uid/gid needs to be
extracted from the kuid_t/kgid_t structure before being used as an
integer.  More precisely, cred-&gt;uid and cred-&gt;gid need to be converted to
either user-namespace uid/gid or to init_user_ns uid/gid.

Use init_user_ns in order not to break existing ABI, and document this in
Documentation/sysctl/kernel.txt.

While at it, format uid and gid values with %u instead of %d because
uid_t/__kernel_uid32_t and gid_t/__kernel_gid32_t are unsigned int.

Signed-off-by: Nicolas Iooss &lt;nicolas.iooss_linux@m4x.org&gt;
Acked-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>watchdog: add watchdog_cpumask sysctl to assist nohz</title>
<updated>2015-06-25T00:49:40+00:00</updated>
<author>
<name>Chris Metcalf</name>
<email>cmetcalf@ezchip.com</email>
</author>
<published>2015-06-24T23:55:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=fe4ba3c34352b7e8068b7f18eb233444aed17011'/>
<id>fe4ba3c34352b7e8068b7f18eb233444aed17011</id>
<content type='text'>
Change the default behavior of watchdog so it only runs on the
housekeeping cores when nohz_full is enabled at build and boot time.
Allow modifying the set of cores the watchdog is currently running on
with a new kernel.watchdog_cpumask sysctl.

In the current system, the watchdog subsystem runs a periodic timer that
schedules the watchdog kthread to run.  However, nohz_full cores are
designed to allow userspace application code running on those cores to
have 100% access to the CPU.  So the watchdog system prevents the
nohz_full application code from being able to run the way it wants to,
thus the motivation to suppress the watchdog on nohz_full cores, which
this patchset provides by default.

However, if we disable the watchdog globally, then the housekeeping
cores can't benefit from the watchdog functionality.  So we allow
disabling it only on some cores.  See Documentation/lockup-watchdogs.txt
for more information.

[jhubbard@nvidia.com: fix a watchdog crash in some configurations]
Signed-off-by: Chris Metcalf &lt;cmetcalf@ezchip.com&gt;
Acked-by: Don Zickus &lt;dzickus@redhat.com&gt;
Cc: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Signed-off-by: John Hubbard &lt;jhubbard@nvidia.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Change the default behavior of watchdog so it only runs on the
housekeeping cores when nohz_full is enabled at build and boot time.
Allow modifying the set of cores the watchdog is currently running on
with a new kernel.watchdog_cpumask sysctl.

In the current system, the watchdog subsystem runs a periodic timer that
schedules the watchdog kthread to run.  However, nohz_full cores are
designed to allow userspace application code running on those cores to
have 100% access to the CPU.  So the watchdog system prevents the
nohz_full application code from being able to run the way it wants to,
thus the motivation to suppress the watchdog on nohz_full cores, which
this patchset provides by default.

However, if we disable the watchdog globally, then the housekeeping
cores can't benefit from the watchdog functionality.  So we allow
disabling it only on some cores.  See Documentation/lockup-watchdogs.txt
for more information.

[jhubbard@nvidia.com: fix a watchdog crash in some configurations]
Signed-off-by: Chris Metcalf &lt;cmetcalf@ezchip.com&gt;
Acked-by: Don Zickus &lt;dzickus@redhat.com&gt;
Cc: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Signed-off-by: John Hubbard &lt;jhubbard@nvidia.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Doc/sysctl/kernel.txt: document threads-max</title>
<updated>2015-04-17T13:04:07+00:00</updated>
<author>
<name>Heinrich Schuchardt</name>
<email>xypron.glpk@gmx.de</email>
</author>
<published>2015-04-16T19:47:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=0ec62afeb143a34ce78143cf442f879ef68382f7'/>
<id>0ec62afeb143a34ce78143cf442f879ef68382f7</id>
<content type='text'>
File /proc/sys/kernel/threads-max controls the maximum number of threads
that can be created using fork().

[akpm@linux-foundation.org: fix typo, per Guenter]
Signed-off-by: Heinrich Schuchardt &lt;xypron.glpk@gmx.de&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Guenter Roeck &lt;linux@roeck-us.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
File /proc/sys/kernel/threads-max controls the maximum number of threads
that can be created using fork().

[akpm@linux-foundation.org: fix typo, per Guenter]
Signed-off-by: Heinrich Schuchardt &lt;xypron.glpk@gmx.de&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Guenter Roeck &lt;linux@roeck-us.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: allow compaction of unevictable pages</title>
<updated>2015-04-15T23:35:17+00:00</updated>
<author>
<name>Eric B Munson</name>
<email>emunson@akamai.com</email>
</author>
<published>2015-04-15T23:13:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=5bbe3547aa3ba5242366a322a28996872301b703'/>
<id>5bbe3547aa3ba5242366a322a28996872301b703</id>
<content type='text'>
Currently, pages which are marked as unevictable are protected from
compaction, but not from other types of migration.  The POSIX real time
extension explicitly states that mlock() will prevent a major page
fault, but the spirit of this is that mlock() should give a process the
ability to control sources of latency, including minor page faults.
However, the mlock manpage only explicitly says that a locked page will
not be written to swap and this can cause some confusion.  The
compaction code today does not give a developer who wants to avoid swap
but wants to have large contiguous areas available any method to achieve
this state.  This patch introduces a sysctl for controlling compaction
behavior with respect to the unevictable lru.  Users who demand no page
faults after a page is present can set compact_unevictable_allowed to 0
and users who need the large contiguous areas can enable compaction on
locked memory by leaving the default value of 1.

To illustrate this problem I wrote a quick test program that mmaps a
large number of 1MB files filled with random data.  These maps are
created locked and read only.  Then every other mmap is unmapped and I
attempt to allocate huge pages to the static huge page pool.  When the
compact_unevictable_allowed sysctl is 0, I cannot allocate hugepages
after fragmenting memory.  When the value is set to 1, allocations
succeed.

Signed-off-by: Eric B Munson &lt;emunson@akamai.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Acked-by: Christoph Lameter &lt;cl@linux.com&gt;
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Currently, pages which are marked as unevictable are protected from
compaction, but not from other types of migration.  The POSIX real time
extension explicitly states that mlock() will prevent a major page
fault, but the spirit of this is that mlock() should give a process the
ability to control sources of latency, including minor page faults.
However, the mlock manpage only explicitly says that a locked page will
not be written to swap and this can cause some confusion.  The
compaction code today does not give a developer who wants to avoid swap
but wants to have large contiguous areas available any method to achieve
this state.  This patch introduces a sysctl for controlling compaction
behavior with respect to the unevictable lru.  Users who demand no page
faults after a page is present can set compact_unevictable_allowed to 0
and users who need the large contiguous areas can enable compaction on
locked memory by leaving the default value of 1.

To illustrate this problem I wrote a quick test program that mmaps a
large number of 1MB files filled with random data.  These maps are
created locked and read only.  Then every other mmap is unmapped and I
attempt to allocate huge pages to the static huge page pool.  When the
compact_unevictable_allowed sysctl is 0, I cannot allocate hugepages
after fragmenting memory.  When the value is set to 1, allocations
succeed.

Signed-off-by: Eric B Munson &lt;emunson@akamai.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Acked-by: Christoph Lameter &lt;cl@linux.com&gt;
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
