linux-toradex.git/kernel/sysctl.c, branch v2.6.25-rc2

hugetlb: fix overcommit locking

2008-02-14T00:21:18+00:00

proc_doulongvec_minmax() calls copy_to_user()/copy_from_user(), so we can't
hold hugetlb_lock over the call.  Use a dummy variable to store the sysctl
result, like in hugetlb_sysctl_handler(), then grab the lock to update
nr_overcommit_huge_pages.
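
The pattern described above can be sketched in userspace C. This is only an analogy, not the kernel code: a pthread mutex stands in for hugetlb_lock, strtoul() stands in for the user-copy step done by proc_doulongvec_minmax(), and the names counter_lock, overcommit_pages, set_overcommit_pages() and get_overcommit_pages() are hypothetical.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Userspace stand-ins (assumptions): counter_lock plays the role of
 * hugetlb_lock, overcommit_pages the role of the protected counter. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long overcommit_pages;

/* Do the slow, possibly blocking step into a local temporary first,
 * then take the lock only for the final assignment. */
int set_overcommit_pages(const char *text)
{
	char *end;
	unsigned long tmp;

	errno = 0;
	tmp = strtoul(text, &end, 10);	/* no lock held here */
	if (errno != 0 || end == text)
		return -1;

	pthread_mutex_lock(&counter_lock);
	overcommit_pages = tmp;		/* short critical section */
	pthread_mutex_unlock(&counter_lock);
	return 0;
}

/* Read the counter under the same lock. */
unsigned long get_overcommit_pages(void)
{
	unsigned long v;

	pthread_mutex_lock(&counter_lock);
	v = overcommit_pages;
	pthread_mutex_unlock(&counter_lock);
	return v;
}
```

The point of the temporary is that nothing that can sleep or fault runs while the lock is held; a failed parse also leaves the counter untouched.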

Signed-off-by: Nishanth Aravamudan 
Reported-by: Miles Lane 
Cc: Adam Litke 
Cc: David Gibson 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

sched: rt-group: interface

2008-02-13T14:45:39+00:00

Change the rt_ratio interface to rt_runtime_us, to match rt_period_us.
This avoids picking a granularity for the ratio.

Extend the /sys/kernel/uids// interface to allow setting
the group's rt_runtime.
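
As a rough illustration of why a runtime in microseconds avoids the granularity question, the share of each period can be derived on demand at whatever precision the caller wants. The helper name rt_share_ppm() and the parts-per-million unit are assumptions made for this sketch, not part of the kernel interface.

```c
#include <stdint.h>

/* Hypothetical helper: share of each period given to realtime tasks,
 * in parts per million. Returns 0 when period_us is 0. Assumes
 * runtime_us is small enough that runtime_us * 1000000 fits in 64 bits. */
uint64_t rt_share_ppm(uint64_t runtime_us, uint64_t period_us)
{
	if (period_us == 0)
		return 0;
	return runtime_us * 1000000ULL / period_us;
}
```

With both values stored in the same unit, a caller needing percent or per-mille precision only changes the multiplier; the stored values do not have to be rescaled.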

Signed-off-by: Peter Zijlstra 
Signed-off-by: Ingo Molnar

printk_ratelimit() functions should use CONFIG_PRINTK

2008-02-08T17:22:39+00:00

Makes an embedded image a bit smaller.

Signed-off-by: Joe Perches 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Nuke duplicate header from sysctl.c

2008-02-08T17:22:34+00:00

Don't include linux/security.h twice in kernel/sysctl.c

Signed-off-by: Jesper Juhl 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Pidns: make full use of xxx_vnr() calls

2008-02-08T17:22:29+00:00

Some time ago the xxx_vnr() calls (e.g. pid_vnr or find_task_by_vpid) were
_all_ converted to operate on the current pid namespace.  After this, each
call like xxx_nr_ns(foo, current->nsproxy->pid_ns) is equivalent to
xxx_vnr(foo).

Switch all the xxx_nr_ns() callers to use the xxx_vnr() calls where
appropriate.

Signed-off-by: Pavel Emelyanov 
Reviewed-by: Oleg Nesterov 
Cc: "Eric W. Biederman" 
Cc: Balbir Singh 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

hugetlb: add locking for overcommit sysctl

2008-02-08T17:22:23+00:00

When I replaced hugetlb_dynamic_pool with nr_overcommit_hugepages I used
proc_doulongvec_minmax() directly.  However, hugetlb.c's locking rules
require that all counter modifications occur under the hugetlb_lock.  Add a
callback into the hugetlb code similar to the one for nr_hugepages.  Grab
the lock around the manipulation of nr_overcommit_hugepages in
proc_doulongvec_minmax().

Signed-off-by: Nishanth Aravamudan 
Acked-by: Adam Litke 
Cc: David Gibson 
Cc: William Lee Irwin III 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

oom: add sysctl to enable task memory dump

2008-02-07T16:42:19+00:00

Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a
dump of all system tasks (excluding kernel threads) when performing an
OOM kill.  Information includes pid, uid, tgid, vm size, rss, cpu,
oom_adj score, and name.

This is helpful for determining why there was an OOM condition and which
rogue task caused it.

It is configurable so that large systems, such as those with several
thousand tasks, do not incur a performance penalty associated with dumping
data they may not desire.

If an OOM was triggered as a result of a memory controller, the tasklist
is filtered to exclude tasks that are not members of the same cgroup.

Cc: Andrea Arcangeli 
Cc: Christoph Lameter 
Cc: Balbir Singh 
Signed-off-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

get rid of NR_OPEN and introduce a sysctl_nr_open

2008-02-06T18:41:06+00:00

NR_OPEN (historically set to 1024*1024) forbids processes from opening
more than 1024*1024 handles.

Unfortunately, some production servers hit the not so 'ridiculously high
value' of 1024*1024 file descriptors per process.

Changing NR_OPEN is not considered safe because it could exhaust vmalloc
space.

This patch introduces a new sysctl (/proc/sys/fs/nr_open) which defaults to
1024*1024, so that admins can decide to change this limit if their workload
needs it.
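
A minimal sketch of how a program might read the limit from /proc/sys/fs/nr_open. The helper names parse_ulong() and read_ulong_file() are hypothetical, and the sketch assumes the file holds a single unsigned decimal value.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse one unsigned decimal value, as a single-value sysctl file
 * would contain. Returns 0 on success, -1 on failure. */
int parse_ulong(const char *text, unsigned long *out)
{
	char *end;
	unsigned long v;

	errno = 0;
	v = strtoul(text, &end, 10);
	if (errno != 0 || end == text)
		return -1;
	*out = v;
	return 0;
}

/* Read a single-value file such as /proc/sys/fs/nr_open.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
int read_ulong_file(const char *path, unsigned long *out)
{
	char buf[64];
	FILE *f = fopen(path, "r");
	int ret;

	if (!f)
		return -1;
	ret = fgets(buf, sizeof(buf), f) ? parse_ulong(buf, out) : -1;
	fclose(f);
	return ret;
}
```

A caller would pass "/proc/sys/fs/nr_open" as the path and should treat a failure as "limit unknown", since kernels without this patch do not provide the file.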

[akpm@linux-foundation.org: export it for sparc64]
Signed-off-by: Eric Dumazet 
Cc: Alan Cox 
Cc: Richard Henderson 
Cc: Ivan Kokshaysky 
Cc: "David S. Miller" 
Cc: Ralf Baechle 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

capabilities: introduce per-process capability bounding set

2008-02-05T17:44:20+00:00

The capability bounding set is a set beyond which capabilities cannot grow.
Currently cap_bset is per-system.  It can be manipulated through sysctl,
but only init can add capabilities.  Root can remove capabilities.  By
default it includes all caps except CAP_SETPCAP.

This patch makes the bounding set per-process when file capabilities are
enabled.  It is inherited at fork from the parent.  No one can add elements;
CAP_SETPCAP is required to remove them.

One example use of this is to start a safer container.  For instance, until
device namespaces or per-container device whitelists are introduced, it is
best to take CAP_MKNOD away from a container.

The bounding set will not affect pP and pE immediately.  It will only
affect pP' and pE' after subsequent exec()s.  It also does not affect pI,
and exec() does not constrain pI'.  So to really start a shell with no way
of regaining CAP_MKNOD, you would do

	prctl(PR_CAPBSET_DROP, CAP_MKNOD);
	cap_t cap = cap_get_proc();
	cap_value_t caparray[1];
	caparray[0] = CAP_MKNOD;
	cap_set_flag(cap, CAP_INHERITABLE, 1, caparray, CAP_DROP);
	cap_set_proc(cap);
	cap_free(cap);

The following test program will get and set the bounding
set (but not pI).  For instance

	./bset get
		(lists capabilities in bset)
	./bset drop cap_net_raw
		(starts shell with new bset)
		(use capset, setuid binary, or binary with
		file capabilities to try to increase caps)

************************************************************
cap_bound.c
************************************************************
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 

 #ifndef PR_CAPBSET_READ
 #define PR_CAPBSET_READ 23
 #endif

 #ifndef PR_CAPBSET_DROP
 #define PR_CAPBSET_DROP 24
 #endif

int usage(char *me)
{
	printf("Usage: %s get\n", me);
	printf("       %s drop \n", me);
	return 1;
}

 #define numcaps 32
char *captable[numcaps] = {
	"cap_chown",
	"cap_dac_override",
	"cap_dac_read_search",
	"cap_fowner",
	"cap_fsetid",
	"cap_kill",
	"cap_setgid",
	"cap_setuid",
	"cap_setpcap",
	"cap_linux_immutable",
	"cap_net_bind_service",
	"cap_net_broadcast",
	"cap_net_admin",
	"cap_net_raw",
	"cap_ipc_lock",
	"cap_ipc_owner",
	"cap_sys_module",
	"cap_sys_rawio",
	"cap_sys_chroot",
	"cap_sys_ptrace",
	"cap_sys_pacct",
	"cap_sys_admin",
	"cap_sys_boot",
	"cap_sys_nice",
	"cap_sys_resource",
	"cap_sys_time",
	"cap_sys_tty_config",
	"cap_mknod",
	"cap_lease",
	"cap_audit_write",
	"cap_audit_control",
	"cap_setfcap"
};

int getbcap(void)
{
	int comma=0;
	unsigned long i;
	int ret;

	printf("i know of %d capabilities\n", numcaps);
	printf("capability bounding set:");
	for (i=0; i
Signed-off-by: Andrew G. Morgan 
Cc: Stephen Smalley 
Cc: James Morris 
Cc: Chris Wright 
Cc: Casey Schaufler
Signed-off-by: "Serge E. Hallyn" 
Tested-by: Jiri Slaby 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/page-writeback: highmem_is_dirtyable option

2008-02-05T17:44:18+00:00

Add vm.highmem_is_dirtyable toggle

A 32-bit machine with HIGHMEM64 enabled running DCC has an mmap()ed file of
approximately 2 GB which contains a hash format that is written
randomly by the dbclean process.  On 2.6.16 this process took a few
minutes.  With lowmem-only accounting of dirty ratios, this takes about 12
hours of 100% disk IO, all random writes.

Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to
add the highmem back to the total available memory count.
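
The effect of the toggle can be modelled with a simplified sketch. The function dirtyable_pages() and its parameters are hypothetical; the real kernel calculation is not shown in this log and takes more inputs than this.

```c
/* Hypothetical, simplified model: page counts are passed in rather
 * than read from kernel state. Returns the number of pages that the
 * dirty ratios would be applied to. */
unsigned long dirtyable_pages(unsigned long total_pages,
			      unsigned long highmem_pages,
			      int highmem_is_dirtyable)
{
	if (highmem_is_dirtyable)
		return total_pages;	/* highmem counted again */

	/* Lowmem-only accounting: exclude highmem from the base. */
	return total_pages > highmem_pages ?
		total_pages - highmem_pages : 0;
}
```

On a machine where most memory is highmem, the lowmem-only base is small, so the dirty thresholds derived from it are reached much sooner; setting the toggle to 1 restores the larger base.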

[akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build]
Signed-off-by: Bron Gondwana 
Cc: Ethan Solomita 
Cc: Peter Zijlstra 
Cc: WU Fengguang 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds