summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2011-03-23userns: user namespaces: convert several capable() callsSerge E. Hallyn
CAP_IPC_OWNER and CAP_IPC_LOCK can be checked against current_user_ns(), because the resource comes from current's own ipc namespace. setuid/setgid are to uids in own namespace, so again checks can be against current_user_ns(). Changelog: Jan 11: Use task_ns_capable() in place of sched_capable(). Jan 11: Use nsown_capable() as suggested by Bastian Blank. Jan 11: Clarify (hopefully) some logic in futex and sched.c Feb 15: use ns_capable for ipc, not nsown_capable Feb 23: let copy_ipcs handle setting ipc_ns->user_ns Feb 23: pass ns down rather than taking it from current [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23userns: add a user namespace owner of ipc nsSerge E. Hallyn
Changelog: Feb 15: Don't set new ipc->user_ns if we didn't create a new ipc_ns. Feb 23: Move extern declaration to ipc_namespace.h, and group fwd declarations at top. Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23userns: user namespaces: convert all capable checks in kernel/sys.cSerge E. Hallyn
This allows setuid/setgid in containers. It also fixes some corner cases where kernel logic foregoes capability checks when uids are equivalent. The latter will need to be done throughout the whole kernel. Changelog: Jan 11: Use nsown_capable() as suggested by Bastian Blank. Jan 11: Fix logic errors in uid checks pointed out by Bastian. Feb 15: allow prlimit to current (was regression in previous version) Feb 23: remove debugging printks, uninline set_one_prio_perm and make it bool, and document its return value. Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23userns: make has_capability* into real functionsSerge E. Hallyn
So we can let type safety keep things sane, and as a bonus we can remove the declaration of init_user_ns in capability.h. Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Cc: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23userns: allow ptrace from non-init user namespacesSerge E. Hallyn
ptrace is allowed to tasks in the same user namespace according to the usual rules (i.e. the same rules as for two tasks in the init user namespace). ptrace is also allowed to a user namespace to which the current task the has CAP_SYS_PTRACE capability. Changelog: Dec 31: Address feedback by Eric: . Correct ptrace uid check . Rename may_ptrace_ns to ptrace_capable . Also fix the cap_ptrace checks. Jan 1: Use const cred struct Jan 11: use task_ns_capable() in place of ptrace_capable(). Feb 23: same_or_ancestore_user_ns() was not an appropriate check to constrain cap_issubset. Rather, cap_issubset() only is meaningful when both capsets are in the same user_ns. Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23userns: allow killing tasks in your own or child usernsSerge E. Hallyn
Changelog: Dec 8: Fixed bug in my check_kill_permission pointed out by Eric Biederman. Dec 13: Apply Eric's suggestion to pass target task into kill_ok_by_cred() for clarity Dec 31: address comment by Eric Biederman: don't need cred/tcred in check_kill_permission. Jan 1: use const cred struct. Jan 11: Per Bastian Blank's advice, clean up kill_ok_by_cred(). Feb 16: kill_ok_by_cred: fix bad parentheses Feb 23: per akpm, let compiler inline kill_ok_by_cred Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23userns: allow sethostname in a containerSerge E. Hallyn
Changelog: Feb 23: let clone_uts_ns() handle setting uts->user_ns To do so we need to pass in the task_struct who'll get the utsname, so we can get its user_ns. Feb 23: As per Oleg's coment, just pass in tsk, instead of two of its members. Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23userns: security: make capabilities relative to the user namespaceSerge E. Hallyn
- Introduce ns_capable to test for a capability in a non-default user namespace. - Teach cap_capable to handle capabilities in a non-default user namespace. The motivation is to get to the unprivileged creation of new namespaces. It looks like this gets us 90% of the way there, with only potential uid confusion issues left. I still need to handle getting all caps after creation but otherwise I think I have a good starter patch that achieves all of your goals. Changelog: 11/05/2010: [serge] add apparmor 12/14/2010: [serge] fix capabilities to created user namespaces Without this, if user serge creates a user_ns, he won't have capabilities to the user_ns he created. THis is because we were first checking whether his effective caps had the caps he needed and returning -EPERM if not, and THEN checking whether he was the creator. Reverse those checks. 12/16/2010: [serge] security_real_capable needs ns argument in !security case 01/11/2011: [serge] add task_ns_capable helper 01/11/2011: [serge] add nsown_capable() helper per Bastian Blank suggestion 02/16/2011: [serge] fix a logic bug: the root user is always creator of init_user_ns, but should not always have capabilities to it! Fix the check in cap_capable(). 02/21/2011: Add the required user_ns parameter to security_capable, fixing a compile failure. 02/23/2011: Convert some macros to functions as per akpm comments. Some couldn't be converted because we can't easily forward-declare them (they are inline if !SECURITY, extern if SECURITY). Add a current_user_ns function so we can use it in capability.h without #including cred.h. Move all forward declarations together to the top of the #ifdef __KERNEL__ section, and use kernel-doc format. 02/23/2011: Per dhowells, clean up comment in cap_capable(). 02/23/2011: Per akpm, remove unreachable 'return -EPERM' in cap_capable. (Original written and signed off by Eric; latest, modified version acked by him) [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: export current_user_ns() for ecryptfs] [serge.hallyn@canonical.com: remove unneeded extra argument in selinux's task_has_capability] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23userns: add a user_namespace as creator/owner of uts_namespaceSerge E. Hallyn
The expected course of development for user namespaces targeted capabilities is laid out at https://wiki.ubuntu.com/UserNamespace. Goals: - Make it safe for an unprivileged user to unshare namespaces. They will be privileged with respect to the new namespace, but this should only include resources which the unprivileged user already owns. - Provide separate limits and accounting for userids in different namespaces. Status: Currently (as of 2.6.38) you can clone with the CLONE_NEWUSER flag to get a new user namespace if you have the CAP_SYS_ADMIN, CAP_SETUID, and CAP_SETGID capabilities. What this gets you is a whole new set of userids, meaning that user 500 will have a different 'struct user' in your namespace than in other namespaces. So any accounting information stored in struct user will be unique to your namespace. However, throughout the kernel there are checks which - simply check for a capability. Since root in a child namespace has all capabilities, this means that a child namespace is not constrained. - simply compare uid1 == uid2. Since these are the integer uids, uid 500 in namespace 1 will be said to be equal to uid 500 in namespace 2. As a result, the lxc implementation at lxc.sf.net does not use user namespaces. This is actually helpful because it leaves us free to develop user namespaces in such a way that, for some time, user namespaces may be unuseful. Bugs aside, this patchset is supposed to not at all affect systems which are not actively using user namespaces, and only restrict what tasks in child user namespace can do. They begin to limit privilege to a user namespace, so that root in a container cannot kill or ptrace tasks in the parent user namespace, and can only get world access rights to files. Since all files currently belong to the initila user namespace, that means that child user namespaces can only get world access rights to *all* files. While this temporarily makes user namespaces bad for system containers, it starts to get useful for some sandboxing. I've run the 'runltplite.sh' with and without this patchset and found no difference. This patch: copy_process() handles CLONE_NEWUSER before the rest of the namespaces. So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS) the new uts namespace will have the new user namespace as its owner. That is what we want, since we want root in that new userns to be able to have privilege over it. Changelog: Feb 15: don't set uts_ns->user_ns if we didn't create a new uts_ns. Feb 23: Move extern init_user_ns declaration from init/version.c to utsname.h. Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23pidns: call pid_ns_prepare_proc() from create_pid_namespace()Eric W. Biederman
Reorganize proc_get_sb() so it can be called before the struct pid of the first process is allocated. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Serge E. Hallyn <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23pid: remove the child_reaper special case in init/main.cEric W. Biederman
This patchset is a cleanup and a preparation to unshare the pid namespace. These prerequisites prepare for Eric's patchset to give a file descriptor to a namespace and join an existing namespace. This patch: It turns out that the existing assignment in copy_process of the child_reaper can handle the initial assignment of child_reaper we just need to generalize the test in kernel/fork.c Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Serge E. Hallyn <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23sysctl: restrict write access to dmesg_restrictRichard Weinberger
When dmesg_restrict is set to 1 CAP_SYS_ADMIN is needed to read the kernel ring buffer. But a root user without CAP_SYS_ADMIN is able to reset dmesg_restrict to 0. This is an issue when e.g. LXC (Linux Containers) are used and complete user space is running without CAP_SYS_ADMIN. A unprivileged and jailed root user can bypass the dmesg_restrict protection. With this patch writing to dmesg_restrict is only allowed when root has CAP_SYS_ADMIN. Signed-off-by: Richard Weinberger <richard@nod.at> Acked-by: Dan Rosenberg <drosenberg@vsecurity.com> Acked-by: Serge E. Hallyn <serge@hallyn.com> Cc: Eric Paris <eparis@redhat.com> Cc: Kees Cook <kees.cook@canonical.com> Cc: James Morris <jmorris@namei.org> Cc: Eugene Teo <eugeneteo@kernel.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23sysctl: add some missing input constraint checksPetr Holasek
Add boundaries of allowed input ranges for: dirty_expire_centisecs, drop_caches, overcommit_memory, page-cluster and panic_on_oom. Signed-off-by: Petr Holasek <pholasek@redhat.com> Acked-by: Dave Young <hidave.darkstar@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23sysctl_check: drop dead codeDenis Kirjanov
Drop dead code. Signed-off-by: Denis Kirjanov <dkirjanov@kernel.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23sysctl_check: drop table->procname checksDenis Kirjanov
Since the for loop checks for the table->procname drop useless table->procname checks inside the loop body Signed-off-by: Denis Kirjanov <dkirjanov@kernel.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23cpuset: hold callback_mutex in cpuset_post_clone()Li Zefan
Chaning cpuset->mems/cpuset->cpus should be protected under callback_mutex. cpuset_clone() doesn't follow this rule. It's ok because it's called when creating and initializing a cgroup, but we'd better hold the lock to avoid subtil break in the future. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23cpuset: fix unchecked calls to NODEMASK_ALLOC()Li Zefan
Those functions that use NODEMASK_ALLOC() can't propagate errno to users, but will fail silently. Fix it by using a static nodemask_t variable for each function, and those variables are protected by cgroup_mutex; [akpm@linux-foundation.org: fix comment spelling, strengthen cgroup_lock comment] Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23cpuset: remove unneeded NODEMASK_ALLOC() in cpuset_attach()Li Zefan
oldcs->mems_allowed is not modified during cpuset_attach(), so we don't have to copy it to a buffer allocated by NODEMASK_ALLOC(). Just pass it to cpuset_migrate_mm(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23cpuset: remove unneeded NODEMASK_ALLOC() in cpuset_sprintf_memlist()Li Zefan
It's not necessary to copy cpuset->mems_allowed to a buffer allocated by NODEMASK_ALLOC(). Just pass it to nodelist_scnprintf(). As spotted by Paul, a side effect is we fix a bug that the function can return -ENOMEM but the caller doesn't expect negative return value. Therefore change the return value of cpuset_sprintf_cpulist() and cpuset_sprintf_memlist() from int to size_t. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23memcg: remove direct page_cgroup-to-page pointerJohannes Weiner
In struct page_cgroup, we have a full word for flags but only a few are reserved. Use the remaining upper bits to encode, depending on configuration, the node or the section, to enable page_cgroup-to-page lookups without a direct pointer. This saves a full word for every page in a system with memory cgroups enabled. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23memcg: res_counter_read_u64(): fix potential races on 32-bit machinesKAMEZAWA Hiroyuki
res_counter_read_u64 reads u64 value without lock. It's dangerous in a 32bit environment. Add locking. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22printk: allow setting DEFAULT_MESSAGE_LEVEL via KconfigMandeep Singh Baines
We've been burned by regressions/bugs which we later realized could have been triaged quicker if only we'd paid closer attention to dmesg. To make it easier to audit dmesg, we'd like to make DEFAULT_MESSAGE_LEVEL Kconfig-settable. That way we can set it to KERN_NOTICE and audit any messages <= KERN_WARNING. Signed-off-by: Mandeep Singh Baines <msb@chromium.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Joe Perches <joe@perches.com> Cc: Olof Johansson <olofj@chromium.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22printk: use %pK for /proc/kallsyms and /proc/modulesKees Cook
In an effort to reduce kernel address leaks that might be used to help target kernel privilege escalation exploits, this patch uses %pK when displaying addresses in /proc/kallsyms, /proc/modules, and /sys/module/*/sections/*. Note that this changes %x to %p, so some legitimately 0 values in /proc/kallsyms would have changed from 00000000 to "(null)". To avoid this, "(null)" is not used when using the "K" format. Anything that was already successfully parsing "(null)" in addition to full hex digits should have no problem with this change. (Thanks to Joe Perches for the suggestion.) Due to the %x to %p, "void *" casts are needed since these addresses are already "unsigned long" everywhere internally, due to their starting life as ELF section offsets. Signed-off-by: Kees Cook <kees.cook@canonical.com> Cc: Eugene Teo <eugene@redhat.com> Cc: Dan Rosenberg <drosenberg@vsecurity.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22console: prevent registered consoles from dumping old kernel message over againFeng Tang
For a platform with many consoles like: "console=tty1 console=ttyMFD2 console=ttyS0 earlyprintk=mrst" Each time when the non "selected_console" (tty1 and ttyMFD2 here) get registered, the existing kernel message will be printed out on registered consoles again, the "mrst" early console will get some same message for 3 times, and "tty1" will get some for twice. As suggested by Andrew Morton, every time a new console is registered, it will be set as the "exclusive" console which will dump the already existing kernel messages. Signed-off-by: Feng Tang <feng.tang@intel.com> Cc: Greg KH <gregkh@suse.de> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22console: allow to retain boot console via boot option keep_bootconFabio M. Di Nitto
On some architectures, the boot process involves de-registering the boot console (early boot), initialize drivers and then re-register the console. This mechanism introduces a window in which no printk can happen on the console and messages are buffered and then printed once the new console is available. If a kernel crashes during this window, all it's left on the boot console is "console [foo] enabled, bootconsole disabled" making debug of the crash rather 'interesting'. By adding "keep_bootcon" option, do not unregister the boot console, that will allow to printk everything that is happening up to the crash. The option is clearly meant only for debugging purposes as it introduces lots of duplicated info printed on console, but will make bug report from users easier as it doesn't require a kernel build just to figure out where we crash. Signed-off-by: Fabio M. Di Nitto <fabbione@fabbione.net> Acked-by: David S. Miller <davem@davemloft.net> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Greg KH <gregkh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22kernel/watchdog.c: always return NOTIFY_OK during cpu up/down eventsDon Zickus
This patch addresses a couple of problems. One was the case when the hardlockup failed to start, it also failed to start the softlockup. There were valid cases when the hardlockup shouldn't start and that shouldn't block the softlockup (no lapic, bios controls perf counters). The second problem was when the hardlockup failed to start on boxes (from a no lapic or bios controlled perf counter case), it reported failure to the cpu notifier chain. This blocked the notifier from continuing to start other more critical pieces of cpu bring-up (in our case based on a 2.6.32 fork, it was the mce). As a result, during soft cpu online/offline testing, the system would panic when a cpu was offlined because the cpu notifier would succeed in processing a watchdog disable cpu event and would panic in the mce case as a result of un-initialized variables from a never executed cpu up event. I realized the hardlockup/softlockup cases are really just debugging aids and should never impede the progress of a cpu up/down event. Therefore I modified the code to always return NOTIFY_OK and instead rely on printks to inform the user of problems. Signed-off-by: Don Zickus <dzickus@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22kernel/watchdog.c: allow hardlockup to panic by defaultDon Zickus
When a cpu is considered stuck, instead of limping along and just printing a warning, it is sometimes preferred to just panic, let kdump capture the vmcore and reboot. This gets the machine back into a stable state quickly while saving the info that got it into a stuck state to begin with. Add a Kconfig option to allow users to set the hardlockup to panic by default. Also add in a 'nmi_watchdog=nopanic' to override this. [akpm@linux-foundation.org: fix strncmp length] Signed-off-by: Don Zickus <dzickus@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22sys_unshare: remove the dead CLONE_THREAD/SIGHAND/VM codeOleg Nesterov
Cleanup: kill the dead code which does nothing but complicates the code and confuses the reader. sys_unshare(CLONE_THREAD/SIGHAND/VM) is not really implemented, and I doubt very much it will ever work. At least, nobody even tried since the original 99d1419d96d7df9cfa56 ("unshare system call -v5: system call handler function") was applied more than 4 years ago. And the code is not consistent. unshare_thread() always fails unconditionally, while unshare_sighand() and unshare_vm() pretend to work if there is nothing to unshare. Remove unshare_thread(), unshare_sighand(), unshare_vm() helpers and related variables and add a simple CLONE_THREAD | CLONE_SIGHAND| CLONE_VM check into check_unshare_flags(). Also, move the "CLONE_NEWNS needs CLONE_FS" check from check_unshare_flags() to sys_unshare(). This looks more consistent and matches the similar do_sysvsem check in sys_unshare(). Note: with or without this patch "atomic_read(mm->mm_users) > 1" can give a false positive due to get_task_mm(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Janak Desai <janak@us.ibm.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22kernel/cpu.c: fix many errors related to style.Michael Rodriguez
Change the printk() calls to have the KERN_INFO/KERN_ERROR stuff, and fixes other coding style errors. Not _all_ of them are gone, though. [akpm@linux-foundation.org: revert the bits I disagree with] Signed-off-by: Michael Rodriguez <dkingston02@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22smp: move smp setup functions to kernel/smp.cAmerigo Wang
Move setup_nr_cpu_ids(), smp_init() and some other SMP boot parameter setup functions from init/main.c to kenrel/smp.c, saves some #ifdef CONFIG_SMP. Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Rakib Mullick <rakib.mullick@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Tejun Heo <tj@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22move x86 specific oops=panic to generic codeOlaf Hering
The oops=panic cmdline option is not x86 specific, move it to generic code. Update documentation. Signed-off-by: Olaf Hering <olaf@aepfle.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22kthread: use kthread_create_on_node()Eric Dumazet
ksoftirqd, kworker, migration, and pktgend kthreads can be created with kthread_create_on_node(), to get proper NUMA affinities for their stack and task_struct. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Reviewed-by: Andi Kleen <ak@linux.intel.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Tejun Heo <tj@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: David Howells <dhowells@redhat.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22kthread: NUMA aware kthread_create_on_node()Eric Dumazet
All kthreads being created from a single helper task, they all use memory from a single node for their kernel stack and task struct. This patch suite creates kthread_create_on_node(), adding a 'cpu' parameter to parameters already used by kthread_create(). This parameter serves in allocating memory for the new kthread on its memory node if possible. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Reviewed-by: Andi Kleen <ak@linux.intel.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Tejun Heo <tj@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: David Howells <dhowells@redhat.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22mm: NUMA aware alloc_thread_info_node()Eric Dumazet
Add a node parameter to alloc_thread_info(), and change its name to alloc_thread_info_node() This change is needed to allow NUMA aware kthread_create_on_cpu() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Reviewed-by: Andi Kleen <ak@linux.intel.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Tejun Heo <tj@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: David Howells <dhowells@redhat.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22mm: NUMA aware alloc_task_struct_node()Eric Dumazet
All kthreads being created from a single helper task, they all use memory from a single node for their kernel stack and task struct. This patch suite creates kthread_create_on_cpu(), adding a 'cpu' parameter to parameters already used by kthread_create(). This parameter serves in allocating memory for the new kthread on its memory node if available. Users of this new function are : ksoftirqd, kworker, migration, pktgend... This patch: Add a node parameter to alloc_task_struct(), and change its name to alloc_task_struct_node() This change is needed to allow NUMA aware kthread_create_on_cpu() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Reviewed-by: Andi Kleen <ak@linux.intel.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Tejun Heo <tj@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: David Howells <dhowells@redhat.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22cgroups: if you list_empty() a head then don't list_del() itPhil Carmody
list_del() leaves poison in the prev and next pointers. The next list_empty() will compare those poisons, and say the list isn't empty. Any list operations that assume the node is on a list because of such a check will be fooled into dereferencing poison. One needs to INIT the node after the del, and fortunately there's already a wrapper for that - list_del_init(). Some of the dels are followed by deallocations, so can be ignored, and one can be merged with an add to make a move. Apart from that, I erred on the side of caution in making nodes list_empty()-queriable. Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-21Prevent rt_sigqueueinfo and rt_tgsigqueueinfo from spoofing the signal codeJulien Tinnes
Userland should be able to trust the pid and uid of the sender of a signal if the si_code is SI_TKILL. Unfortunately, the kernel has historically allowed sigqueueinfo() to send any si_code at all (as long as it was negative - to distinguish it from kernel-generated signals like SIGILL etc), so it could spoof a SI_TKILL with incorrect siginfo values. Happily, it looks like glibc has always set si_code to the appropriate SI_QUEUE, so there are probably no actual user code that ever uses anything but the appropriate SI_QUEUE flag. So just tighten the check for si_code (we used to allow any negative value), and add a (one-time) warning in case there are binaries out there that might depend on using other si_code values. Signed-off-by: Julien Tinnes <jln@google.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-20Merge branch 'trivial' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6 * 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6: (25 commits) video: change to new flag variable scsi: change to new flag variable rtc: change to new flag variable rapidio: change to new flag variable pps: change to new flag variable net: change to new flag variable misc: change to new flag variable message: change to new flag variable memstick: change to new flag variable isdn: change to new flag variable ieee802154: change to new flag variable ide: change to new flag variable hwmon: change to new flag variable dma: change to new flag variable char: change to new flag variable fs: change to new flag variable xtensa: change to new flag variable um: change to new flag variables s390: change to new flag variable mips: change to new flag variable ... Fix up trivial conflict in drivers/hwmon/Makefile
2011-03-18Merge branches 'irq-fixes-for-linus' and 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: Fix incorrect unlock in __setup_irq() cris: Use generic show_interrupts() genirq: show_interrupts: Check desc->name before printing it blindly cris: Use accessor functions to set IRQ_PER_CPU flag cris: Fix irq conversion fallout * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched, kernel-doc: Fix runqueue_is_locked() description
2011-03-18Merge branch 'perf-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits) trace, filters: Initialize the match variable in process_ops() properly trace, documentation: Fix branch profiling location in debugfs oprofile, s390: Cleanups oprofile, s390: Remove hwsampler_files.c and merge it into init.c perf: Fix tear-down of inherited group events perf: Reorder & optimize perf_event_context to remove alignment padding on 64 bit builds perf: Handle stopped state with tracepoints perf: Fix the software events state check perf, powerpc: Handle events that raise an exception without overflowing perf, x86: Use INTEL_*_CONSTRAINT() for all PEBS event constraints perf, x86: Clean up SandyBridge PEBS events perf lock: Fix sorting by wait_min perf tools: Version incorrect with some versions of grep perf evlist: New command to list the names of events present in a perf.data file perf script: Add support for H/W and S/W events perf script: Add support for dumping symbols perf script: Support custom field selection for output perf script: Move printing of 'common' data from print_event and rename perf tracing: Remove print_graph_cpu and print_graph_proc from trace-event-parse perf script: Change process_event prototype ...
2011-03-18Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits) doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore Update cpuset info & webiste for cgroups dcdbas: force SMI to happen when expected arch/arm/Kconfig: remove one to many l's in the word. asm-generic/user.h: Fix spelling in comment drm: fix printk typo 'sracth' Remove one to many n's in a word Documentation/filesystems/romfs.txt: fixing link to genromfs drivers:scsi Change printk typo initate -> initiate serial, pch uart: Remove duplicate inclusion of linux/pci.h header fs/eventpoll.c: fix spelling mm: Fix out-of-date comments which refers non-existent functions drm: Fix printk typo 'failled' coh901318.c: Change initate to initiate. mbox-db5500.c Change initate to initiate. edac: correct i82975x error-info reported edac: correct i82975x mci initialisation edac: correct commented info fs: update comments to point correct document target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c ... Trivial conflict in fs/eventpoll.c (spelling vs addition)
2011-03-18trace, filters: Initialize the match variable in process_ops() properlyIngo Molnar
Make sure the 'match' variable always has a value. Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-17Merge branch 'kvm-updates/2.6.39' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
* 'kvm-updates/2.6.39' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (55 commits) KVM: unbreak userspace that does not sets tss address KVM: MMU: cleanup pte write path KVM: MMU: introduce a common function to get no-dirty-logged slot KVM: fix rcu usage in init_rmode_* functions KVM: fix kvmclock regression due to missing clock update KVM: emulator: Fix permission checking in io permission bitmap KVM: emulator: Fix io permission checking for 64bit guest KVM: SVM: Load %gs earlier if CONFIG_X86_32_LAZY_GS=n KVM: x86: Remove useless regs_page pointer from kvm_lapic KVM: improve comment on rcu use in irqfd_deassign KVM: MMU: remove unused macros KVM: MMU: cleanup page alloc and free KVM: MMU: do not record gfn in kvm_mmu_pte_write KVM: MMU: move mmu pages calculated out of mmu lock KVM: MMU: set spte accessed bit properly KVM: MMU: fix kvm_mmu_slot_remove_write_access dropping intermediate W bits KVM: Start lock documentation KVM: better readability of efer_reserved_bits KVM: Clear async page fault hash after switching to real mode KVM: VMX: Initialize vm86 TSS only once. ...
2011-03-17smp_call_function_interrupt: use typedef and %pfMilton Miller
Use the newly added smp_call_func_t in smp_call_function_interrupt for the func variable, and make the comment above the WARN more assertive and explicit. Also, func is a function pointer and does not need an offset, so use %pf not %pS. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-17smp_call_function_many: handle concurrent clearing of maskMilton Miller
Mike Galbraith reported finding a lockup ("perma-spin bug") where the cpumask passed to smp_call_function_many was cleared by other cpu(s) while a cpu was preparing its call_data block, resulting in no cpu to clear the last ref and unlock the block. Having cpus clear their bit asynchronously could be useful on a mask of cpus that might have a translation context, or cpus that need a push to complete an rcu window. Instead of adding a BUG_ON and requiring yet another cpumask copy, just detect the race and handle it. Note: arch_send_call_function_ipi_mask must still handle an empty cpumask because the data block is globally visible before the that arch callback is made. And (obviously) there are no guarantees to which cpus are notified if the mask is changed during the call; only cpus that were online and had their mask bit set during the whole call are guaranteed to be called. Reported-by: Mike Galbraith <efault@gmx.de> Reported-by: Jan Beulich <JBeulich@novell.com> Acked-by: Jan Beulich <jbeulich@novell.com> Cc: stable@kernel.org Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-17call_function_many: add missing orderingMilton Miller
Paul McKenney's review pointed out two problems with the barriers in the 2.6.38 update to the smp call function many code. First, a barrier that would force the func and info members of data to be visible before their consumption in the interrupt handler was missing. This can be solved by adding a smp_wmb between setting the func and info members and setting setting the cpumask; this will pair with the existing and required smp_rmb ordering the cpumask read before the read of refs. This placement avoids the need a second smp_rmb in the interrupt handler which would be executed on each of the N cpus executing the call request. (I was thinking this barrier was present but was not). Second, the previous write to refs (establishing the zero that we the interrupt handler was testing from all cpus) was performed by a third party cpu. This would invoke transitivity which, as a recient or concurrent addition to memory-barriers.txt now explicitly states, would require a full smp_mb(). However, we know the cpumask will only be set by one cpu (the data owner) and any preivous iteration of the mask would have cleared by the reading cpu. By redundantly writing refs to 0 on the owning cpu before the smp_wmb, the write to refs will follow the same path as the writes that set the cpumask, which in turn allows us to keep the barrier in the interrupt handler a smp_rmb instead of promoting it to a smp_mb (which will be be executed by N cpus for each of the possible M elements on the list). I moved and expanded the comment about our (ab)use of the rcu list primitives for the concurrent walk earlier into this function. I considered moving the first two paragraphs to the queue list head and lock, but felt it would have been too disconected from the code. Cc: Paul McKinney <paulmck@linux.vnet.ibm.com> Cc: stable@kernel.org (2.6.32 and later) Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-17call_function_many: fix list delete vs add raceMilton Miller
Peter pointed out there was nothing preventing the list_del_rcu in smp_call_function_interrupt from running before the list_add_rcu in smp_call_function_many. Fix this by not setting refs until we have gotten the lock for the list. Take advantage of the wmb in list_add_rcu to save an explicit additional one. I tried to force this race with a udelay before the lock & list_add and by mixing all 64 online cpus with just 3 random cpus in the mask, but was unsuccessful. Still, inspection shows a valid race, and the fix is a extension of the existing protection window in the current code. Cc: stable@kernel.org (v2.6.32 and later) Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-17export pid symbols needed for kvm_vcpu_on_spinRik van Riel
Export the symbols required for a race-free kvm_vcpu_on_spin. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17genirq: Fix incorrect unlock in __setup_irq()Dan Carpenter
goto out_thread is called before we take the lock. It causes a gcc warning: "kernel/irq/manage.c:858: warning: ‘flags’ may be used uninitialized in this function" [ tglx: Moved unlock before free_cpumask_var() ] Signed-off-by: Dan Carpenter <error27@gmail.com> LKML-Reference: <20110317114307.GJ2008@bicker> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-03-17genirq: show_interrupts: Check desc->name before printing it blindlyThomas Gleixner
desc->name is not required and not used by all architectures. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>