summaryrefslogtreecommitdiff
path: root/fs/exec.c
AgeCommit message (Collapse)Author
2013-11-29exec/ptrace: fix get_dumpable() incorrect testsKees Cook
commit d049f74f2dbe71354d43d393ac3a188947811348 upstream. The get_dumpable() return value is not boolean. Most users of the function actually want to be testing for non-SUID_DUMP_USER(1) rather than SUID_DUMP_DISABLE(0). The SUID_DUMP_ROOT(2) is also considered a protected state. Almost all places did this correctly, excepting the two places fixed in this patch. Wrong logic: if (dumpable == SUID_DUMP_DISABLE) { /* be protective */ } or if (dumpable == 0) { /* be protective */ } or if (!dumpable) { /* be protective */ } Correct logic: if (dumpable != SUID_DUMP_USER) { /* be protective */ } or if (dumpable != 1) { /* be protective */ } Without this patch, if the system had set the sysctl fs/suid_dumpable=2, a user was able to ptrace attach to processes that had dropped privileges to that user. (This may have been partially mitigated if Yama was enabled.) The macros have been moved into the file that declares get/set_dumpable(), which means things like the ia64 code can see them too. CVE-2013-2929 Reported-by: Vasily Kulikov <segoon@openwall.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-29exec: do not abuse ->cred_guard_mutex in threadgroup_lock()Oleg Nesterov
commit e56fb2874015370e3b7f8d85051f6dce26051df9 upstream. threadgroup_lock() takes signal->cred_guard_mutex to ensure that thread_group_leader() is stable. This doesn't look nice, the scope of this lock in do_execve() is huge. And as Dave pointed out this can lead to deadlock, we have the following dependencies: do_execve: cred_guard_mutex -> i_mutex cgroup_mount: i_mutex -> cgroup_mutex attach_task_by_pid: cgroup_mutex -> cred_guard_mutex Change de_thread() to take threadgroup_change_begin() around the switch-the-leader code and change threadgroup_lock() to avoid ->cred_guard_mutex. Note that de_thread() can't sleep with ->group_rwsem held, this can obviously deadlock with the exiting leader if the writer is active, so it does threadgroup_change_end() before schedule(). Reported-by: Dave Jones <davej@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [ zhj: adjust context ] Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-03perf: Disable monitoring on setuid processes for regular usersStephane Eranian
commit 2976b10f05bd7f6dab9f9e7524451ddfed656a89 upstream. There was a a bug in setup_new_exec(), whereby the test to disabled perf monitoring was not correct because the new credentials for the process were not yet committed and therefore the get_dumpable() test was never firing. The patch fixes the problem by moving the perf_event test until after the credentials are committed. Signed-off-by: Stephane Eranian <eranian@google.com> Tested-by: Jiri Olsa <jolsa@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-05-07mm: allow arch code to control the user page table ceilingHugh Dickins
commit 6ee8630e02be6dd89926ca0fbc21af68b23dc087 upstream. On architectures where a pgd entry may be shared between user and kernel (e.g. ARM+LPAE), freeing page tables needs a ceiling other than 0. This patch introduces a generic USER_PGTABLES_CEILING that arch code can override. It is the responsibility of the arch code setting the ceiling to ensure the complete freeing of the page tables (usually in pgd_free()). [catalin.marinas@arm.com: commit log; shift_arg_pages(), asm-generic/pgtables.h changes] Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Russell King <linux@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-28exec: use -ELOOP for max recursion depthKees Cook
commit d740269867021faf4ce38a449353d2b986c34a67 upstream. To avoid an explosion of request_module calls on a chain of abusive scripts, fail maximum recursion with -ELOOP instead of -ENOEXEC. As soon as maximum recursion depth is hit, the error will fail all the way back up the chain, aborting immediately. This also has the side-effect of stopping the user's shell from attempting to reexecute the top-level file as a shell script. As seen in the dash source: if (cmd != path_bshell && errno == ENOEXEC) { *argv-- = cmd; *argv = cmd = path_bshell; goto repeat; } The above logic was designed for running scripts automatically that lacked the "#!" header, not to re-try failed recursion. On a legitimate -ENOEXEC, things continue to behave as the shell expects. Additionally, when tracking recursion, the binfmt handlers should not be involved. The recursion being tracked is the depth of calls through search_binary_handler(), so that function should be exclusively responsible for tracking the depth. Signed-off-by: Kees Cook <keescook@chromium.org> Cc: halfdog <me@halfdog.net> Cc: P J P <ppandit@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-11exec: do not leave bprm->interp on stackKees Cook
commit b66c5984017533316fd1951770302649baf1aa33 upstream. If a series of scripts are executed, each triggering module loading via unprintable bytes in the script header, kernel stack contents can leak into the command line. Normally execution of binfmt_script and binfmt_misc happens recursively. However, when modules are enabled, and unprintable bytes exist in the bprm->buf, execution will restart after attempting to load matching binfmt modules. Unfortunately, the logic in binfmt_script and binfmt_misc does not expect to get restarted. They leave bprm->interp pointing to their local stack. This means on restart bprm->interp is left pointing into unused stack memory which can then be copied into the userspace argv areas. After additional study, it seems that both recursion and restart remains the desirable way to handle exec with scripts, misc, and modules. As such, we need to protect the changes to interp. This changes the logic to require allocation for any changes to the bprm->interp. To avoid adding a new kmalloc to every exec, the default value is left as-is. Only when passing through binfmt_script or binfmt_misc does an allocation take place. For a proof of concept, see DoTest.sh from: http://www.halfdog.net/Security/2012/LinuxKernelBinfmtScriptStackDataDisclosure/ Signed-off-by: Kees Cook <keescook@chromium.org> Cc: halfdog <me@halfdog.net> Cc: P J P <ppandit@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-31freezer: exec should clear PF_NOFREEZE along with PF_KTHREADOleg Nesterov
commit b40a79591ca918e7b91b0d9b6abd5d00f2e88c19 upstream. flush_old_exec() clears PF_KTHREAD but forgets about PF_NOFREEZE. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09posix_types.h: Cleanup stale __NFDBITS and related definitionsJosh Boyer
commit 8ded2bbc1845e19c771eb55209aab166ef011243 upstream. Recently, glibc made a change to suppress sign-conversion warnings in FD_SET (glibc commit ceb9e56b3d1). This uncovered an issue with the kernel's definition of __NFDBITS if applications #include <linux/types.h> after including <sys/select.h>. A build failure would be seen when passing the -Werror=sign-compare and -D_FORTIFY_SOURCE=2 flags to gcc. It was suggested that the kernel should either match the glibc definition of __NFDBITS or remove that entirely. The current in-kernel uses of __NFDBITS can be replaced with BITS_PER_LONG, and there are no uses of the related __FDELT and __FDMASK defines. Given that, we'll continue the cleanup that was started with commit 8b3d1cda4f5f ("posix_types: Remove fd_set macros") and drop the remaining unused macros. Additionally, linux/time.h has similar macros defined that expand to nothing so we'll remove those at the same time. Reported-by: Jeff Law <law@redhat.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Josh Boyer <jwboyer@redhat.com> [ .. and fix up whitespace as per akpm ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-16mm: correctly synchronize rss-counters at exit/execKonstantin Khlebnikov
commit 4fe7efdbdfb1c7e7a7f31decfd831c0f31d37091 upstream. do_exit() and exec_mmap() call sync_mm_rss() before mm_release() does put_user(clear_child_tid) which can update task->rss_stat and thus make mm->rss_stat inconsistent. This triggers the "BUG:" printk in check_mm(). Let's fix this bug in the safest way, and optimize/cleanup this later. Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-04Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar. * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/p4: Add format attributes tracing, sched, vfs: Fix 'old_pid' usage in trace_sched_process_exec()
2012-03-31tracing, sched, vfs: Fix 'old_pid' usage in trace_sched_process_exec()Oleg Nesterov
1. TRACE_EVENT(sched_process_exec) forgets to actually use the old pid argument, it sets ->old_pid = p->pid. 2. search_binary_handler() uses the wrong pid number. tracepoint needs the global pid_t from the root namespace, while old_pid is the virtual pid number as it seen by the tracer/parent. With this patch we have two pid_t's in search_binary_handler(), not really nice. Perhaps we should switch to "struct pid*", but in this case it would be better to cleanup the current code first and move the "depth == 0" code outside. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: David Smith <dsmith@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Denys Vlasenko <dvlasenk@redhat.com> Link: http://lkml.kernel.org/r/20120330162636.GA4857@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-03-29Merge branch 'x86-x32-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x32 support for x86-64 from Ingo Molnar: "This tree introduces the X32 binary format and execution mode for x86: 32-bit data space binaries using 64-bit instructions and 64-bit kernel syscalls. This allows applications whose working set fits into a 32 bits address space to make use of 64-bit instructions while using a 32-bit address space with shorter pointers, more compressed data structures, etc." Fix up trivial context conflicts in arch/x86/{Kconfig,vdso/vma.c} * 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits) x32: Fix alignment fail in struct compat_siginfo x32: Fix stupid ia32/x32 inversion in the siginfo format x32: Add ptrace for x32 x32: Switch to a 64-bit clock_t x32: Provide separate is_ia32_task() and is_x32_task() predicates x86, mtrr: Use explicit sizing and padding for the 64-bit ioctls x86/x32: Fix the binutils auto-detect x32: Warn and disable rather than error if binutils too old x32: Only clear TIF_X32 flag once x32: Make sure TS_COMPAT is cleared for x32 tasks fs: Remove missed ->fds_bits from cessation use of fd_set structs internally fs: Fix close_on_exec pointer in alloc_fdtable x32: Drop non-__vdso weak symbols from the x32 VDSO x32: Fix coding style violations in the x32 VDSO code x32: Add x32 VDSO support x32: Allow x32 to be configured x32: If configured, add x32 system calls to system call tables x32: Handle process creation x32: Signal-related system calls x86: Add #ifdef CONFIG_COMPAT to <asm/sys_ia32.h> ...
2012-03-28Add #includes needed to permit the removal of asm/system.hDavid Howells
asm/system.h is a cause of circular dependency problems because it contains commonly used primitive stuff like barrier definitions and uncommonly used stuff like switch_to() that might require MMU definitions. asm/system.h has been disintegrated by this point on all arches into the following common segments: (1) asm/barrier.h Moved memory barrier definitions here. (2) asm/cmpxchg.h Moved xchg() and cmpxchg() here. #included in asm/atomic.h. (3) asm/bug.h Moved die() and similar here. (4) asm/exec.h Moved arch_align_stack() here. (5) asm/elf.h Moved AT_VECTOR_SIZE_ARCH here. (6) asm/switch_to.h Moved switch_to() here. Signed-off-by: David Howells <dhowells@redhat.com>
2012-03-22Merge branch 'akpm' (Andrew's patch-bomb)Linus Torvalds
Merge first batch of patches from Andrew Morton: "A few misc things and all the MM queue" * emailed from Andrew Morton <akpm@linux-foundation.org>: (92 commits) memcg: avoid THP split in task migration thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE memcg: clean up existing move charge code mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read() mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event() mm/memcontrol.c: s/stealed/stolen/ memcg: fix performance of mem_cgroup_begin_update_page_stat() memcg: remove PCG_FILE_MAPPED memcg: use new logic for page stat accounting memcg: remove PCG_MOVE_LOCK flag from page_cgroup memcg: simplify move_account() check memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat) memcg: kill dead prev_priority stubs memcg: remove PCG_CACHE page_cgroup flag memcg: let css_get_next() rely upon rcu_read_lock() cgroup: revert ss_id_lock to spinlock idr: make idr_get_next() good for rcu_read_lock() memcg: remove unnecessary thp check in page stat accounting memcg: remove redundant returns memcg: enum lru_list lru ...
2012-03-21mm, counters: remove task argument to sync_mm_rss() and __sync_task_rss_stat()David Rientjes
sync_mm_rss() can only be used for current to avoid race conditions in iterating and clearing its per-task counters. Remove the task argument for it and its helper function, __sync_task_rss_stat(), to avoid thinking it can be used safely for anything other than current. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 1 from Al Viro: "This is _not_ all; in particular, Miklos' and Jan's stuff is not there yet." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits) ext4: initialization of ext4_li_mtx needs to be done earlier debugfs-related mode_t whack-a-mole hfsplus: add an ioctl to bless files hfsplus: change finder_info to u32 hfsplus: initialise userflags qnx4: new helper - try_extent() qnx4: get rid of qnx4_bread/qnx4_getblk take removal of PF_FORKNOEXEC to flush_old_exec() trim includes in inode.c um: uml_dup_mmap() relies on ->mmap_sem being held, but activate_mm() doesn't hold it um: embed ->stub_pages[] into mmu_context gadgetfs: list_for_each_safe() misuse ocfs2: fix leaks on failure exits in module_init ecryptfs: make register_filesystem() the last potential failure exit ntfs: forgets to unregister sysctls on register_filesystem() failure logfs: missing cleanup on register_filesystem() failure jfs: mising cleanup on register_filesystem() failure make configfs_pin_fs() return root dentry on success configfs: configfs_create_dir() has parent dentry in dentry->d_parent configfs: sanitize configfs_create() ...
2012-03-21Merge branch 'kmap_atomic' of git://github.com/congwang/linuxLinus Torvalds
Pull kmap_atomic cleanup from Cong Wang. It's been in -next for a long time, and it gets rid of the (no longer used) second argument to k[un]map_atomic(). Fix up a few trivial conflicts in various drivers, and do an "evil merge" to catch some new uses that have come in since Cong's tree. * 'kmap_atomic' of git://github.com/congwang/linux: (59 commits) feature-removal-schedule.txt: schedule the deprecated form of kmap_atomic() for removal highmem: kill all __kmap_atomic() [swarren@nvidia.com: highmem: Fix ARM build break due to __kmap_atomic rename] drbd: remove the second argument of k[un]map_atomic() zcache: remove the second argument of k[un]map_atomic() gma500: remove the second argument of k[un]map_atomic() dm: remove the second argument of k[un]map_atomic() tomoyo: remove the second argument of k[un]map_atomic() sunrpc: remove the second argument of k[un]map_atomic() rds: remove the second argument of k[un]map_atomic() net: remove the second argument of k[un]map_atomic() mm: remove the second argument of k[un]map_atomic() lib: remove the second argument of k[un]map_atomic() power: remove the second argument of k[un]map_atomic() kdb: remove the second argument of k[un]map_atomic() udf: remove the second argument of k[un]map_atomic() ubifs: remove the second argument of k[un]map_atomic() squashfs: remove the second argument of k[un]map_atomic() reiserfs: remove the second argument of k[un]map_atomic() ocfs2: remove the second argument of k[un]map_atomic() ntfs: remove the second argument of k[un]map_atomic() ...
2012-03-20take removal of PF_FORKNOEXEC to flush_old_exec()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-03-20__register_binfmt() made voidAl Viro
Just don't pass NULL to it - nobody does, anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-03-20exec: move de_thread()->setmax_mm_hiwater_rss() into exec_mmap()Oleg Nesterov
Minor cleanup. de_thread()->setmax_mm_hiwater_rss() looks a bit strange, move it into exec_mmap() which plays with old_mm. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-20exit_signal: simplify the "we have changed execution domain" logicOleg Nesterov
exit_notify() checks "tsk->self_exec_id != tsk->parent_exec_id" to handle the "we have changed execution domain" case. We can change do_thread() to always set ->exit_signal = SIGCHLD and remove this check to simplify the code. We could change setup_new_exec() instead, this looks more logical because it increments ->self_exec_id. But note that de_thread() already resets ->exit_signal if it changes the leader, let's keep both changes close to each other. Note that we change ->exit_signal lockless, this changes the rules. Thereafter ->exit_signal is not stable under tasklist but this is fine, the only possible change is OLDSIG -> SIGCHLD. This can race with eligible_child() but the race is harmless. We can race with reparent_leader() which changes our ->exit_signal in parallel, but it does the same change to SIGCHLD. The noticeable user-visible change is that the execing task is not "visible" to do_wait()->eligible_child(__WCLONE) right after exec. To me this looks more logical, and this is consistent with mt case. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-20fs: remove the second argument of k[un]map_atomic()Cong Wang
Acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Cong Wang <amwang@redhat.com>
2012-03-12Merge branch 'perf/urgent' into perf/coreIngo Molnar
Merge reason: We are going to queue up a dependent patch. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05coredump_wait: don't call complete_vfork_done()Oleg Nesterov
Now that CLONE_VFORK is killable, coredump_wait() no longer needs complete_vfork_done(). zap_threads() should find and kill all tasks with the same ->mm, this includes our parent if ->vfork_done is set. mm_release() becomes the only caller, unexport complete_vfork_done(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-05vfork: introduce complete_vfork_done()Oleg Nesterov
No functional changes. Move the clear-and-complete-vfork_done code into the new trivial helper, complete_vfork_done(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-05Merge branch 'perf/urgent' into perf/coreIngo Molnar
Conflicts: tools/perf/builtin-record.c tools/perf/builtin-top.c tools/perf/perf.h tools/perf/util/top.h Merge reason: resolve these cherry-picking conflicts. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-23tracepoint, vfs, sched: Add exec() tracepointDavid Smith
Added a minimal exec tracepoint. Exec is an important major event in the life of a task, like fork(), clone() or exit(), all of which we already trace. [ We also do scheduling re-balancing during exec() - so it's useful from a scheduler instrumentation POV as well. ] If you want to watch a task start up, when it gets exec'ed is a good place to start. With the addition of this tracepoint, exec's can be monitored and better picture of general system activity can be obtained. This tracepoint will also enable better process life tracking, allowing you to answer questions like "what process keeps starting up binary X?". This tracepoint can also be useful in ftrace filtering and trigger conditions: i.e. starting or stopping filtering when exec is called. Signed-off-by: David Smith <dsmith@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/4F314D19.7030504@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-19Replace the fd_sets in struct fdtable with an array of unsigned longsDavid Howells
Replace the fd_sets in struct fdtable with an array of unsigned longs and then use the standard non-atomic bit operations rather than the FD_* macros. This: (1) Removes the abuses of struct fd_set: (a) Since we don't want to allocate a full fd_set the vast majority of the time, we actually, in effect, just allocate a just-big-enough array of unsigned longs and cast it to an fd_set type - so why bother with the fd_set at all? (b) Some places outside of the core fdtable handling code (such as SELinux) want to look inside the array of unsigned longs hidden inside the fd_set struct for more efficient iteration over the entire set. (2) Eliminates the use of FD_*() macros in the kernel completely. (3) Permits the __FD_*() macros to be deleted entirely where not exposed to userspace. Signed-off-by: David Howells <dhowells@redhat.com> Link: http://lkml.kernel.org/r/20120216174954.23314.48147.stgit@warthog.procyon.org.uk Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk>
2012-02-19Wrap accesses to the fd_sets in struct fdtableDavid Howells
Wrap accesses to the fd_sets in struct fdtable (for recording open files and close-on-exec flags) so that we can move away from using fd_sets since we abuse the fd_set structs by not allocating the full-sized structure under normal circumstances and by non-core code looking at the internals of the fd_sets. The first abuse means that use of FD_ZERO() on these fd_sets is not permitted, since that cannot be told about their abnormal lengths. This introduces six wrapper functions for setting, clearing and testing close-on-exec flags and fd-is-open flags: void __set_close_on_exec(int fd, struct fdtable *fdt); void __clear_close_on_exec(int fd, struct fdtable *fdt); bool close_on_exec(int fd, const struct fdtable *fdt); void __set_open_fd(int fd, struct fdtable *fdt); void __clear_open_fd(int fd, struct fdtable *fdt); bool fd_is_open(int fd, const struct fdtable *fdt); Note that I've prepended '__' to the names of the set/clear functions because they require the caller to hold a lock to use them. Note also that I haven't added wrappers for looking behind the scenes at the the array. Possibly that should exist too. Signed-off-by: David Howells <dhowells@redhat.com> Link: http://lkml.kernel.org/r/20120216174942.23314.1364.stgit@warthog.procyon.org.uk Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk>
2012-02-06exec: fix use-after-free bug in setup_new_exec()Heiko Carstens
Setting the task name is done within setup_new_exec() by accessing bprm->filename. However this happens after flush_old_exec(). This may result in a use after free bug, flush_old_exec() may "complete" vfork_done, which will wake up the parent which in turn may free the passed in filename. To fix this add a new tcomm field in struct linux_binprm which contains the now early generated task name until it is used. Fixes this bug on s390: Unable to handle kernel pointer dereference at virtual kernel address 0000000039768000 Process kworker/u:3 (pid: 245, task: 000000003a3dc840, ksp: 0000000039453818) Krnl PSW : 0704000180000000 0000000000282e94 (setup_new_exec+0xa0/0x374) Call Trace: ([<0000000000282e2c>] setup_new_exec+0x38/0x374) [<00000000002dd12e>] load_elf_binary+0x402/0x1bf4 [<0000000000280a42>] search_binary_handler+0x38e/0x5bc [<0000000000282b6c>] do_execve_common+0x410/0x514 [<0000000000282cb6>] do_execve+0x46/0x58 [<00000000005bce58>] kernel_execve+0x28/0x70 [<000000000014ba2e>] ____call_usermodehelper+0x102/0x140 [<00000000005bc8da>] kernel_thread_starter+0x6/0xc [<00000000005bc8d4>] kernel_thread_starter+0x0/0xc Last Breaking-Event-Address: [<00000000002830f0>] setup_new_exec+0x2fc/0x374 Kernel panic - not syncing: Fatal exception: panic_on_oops Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-10tracepoint: add tracepoints for debugging oom_score_adjKAMEZAWA Hiroyuki
oom_score_adj is used for guarding processes from OOM-Killer. One of problem is that it's inherited at fork(). When a daemon set oom_score_adj and make children, it's hard to know where the value is set. This patch adds some tracepoints useful for debugging. This patch adds 3 trace points. - creating new task - renaming a task (exec) - set oom_score_adj To debug, users need to enable some trace pointer. Maybe filtering is useful as # EVENT=/sys/kernel/debug/tracing/events/task/ # echo "oom_score_adj != 0" > $EVENT/task_newtask/filter # echo "oom_score_adj != 0" > $EVENT/task_rename/filter # echo 1 > $EVENT/enable # EVENT=/sys/kernel/debug/tracing/events/oom/ # echo 1 > $EVENT/enable output will be like this. # grep oom /sys/kernel/debug/tracing/trace bash-7699 [007] d..3 5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000 bash-7699 [007] ...1 5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000 ls-7729 [003] ...2 5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000 bash-7699 [002] ...1 5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000 grep-7730 [007] ...2 5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000 Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-03trim fs/internal.hAl Viro
some stuff in there can actually become static; some belongs to pnode.h as it's a private interface between namespace.c and pnode.c... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-10-31oom: remove oom_disable_countDavid Rientjes
This removes mm->oom_disable_count entirely since it's unnecessary and currently buggy. The counter was intended to be per-process but it's currently decremented in the exit path for each thread that exits, causing it to underflow. The count was originally intended to prevent oom killing threads that share memory with threads that cannot be killed since it doesn't lead to future memory freeing. The counter could be fixed to represent all threads sharing the same mm, but it's better to remove the count since: - it is possible that the OOM_DISABLE thread sharing memory with the victim is waiting on that thread to exit and will actually cause future memory freeing, and - there is no guarantee that a thread is disabled from oom killing just because another thread sharing its mm is oom disabled. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-11move RLIMIT_NPROC check from set_user() to do_execve_common()Vasiliy Kulikov
The patch http://lkml.org/lkml/2003/7/13/226 introduced an RLIMIT_NPROC check in set_user() to check for NPROC exceeding via setuid() and similar functions. Before the check there was a possibility to greatly exceed the allowed number of processes by an unprivileged user if the program relied on rlimit only. But the check created new security threat: many poorly written programs simply don't check setuid() return code and believe it cannot fail if executed with root privileges. So, the check is removed in this patch because of too often privilege escalations related to buggy programs. The NPROC can still be enforced in the common code flow of daemons spawning user processes. Most of daemons do fork()+setuid()+execve(). The check introduced in execve() (1) enforces the same limit as in setuid() and (2) doesn't create similar security issues. Neil Brown suggested to track what specific process has exceeded the limit by setting PF_NPROC_EXCEEDED process flag. With the change only this process would fail on execve(), and other processes' execve() behaviour is not changed. Solar Designer suggested to re-check whether NPROC limit is still exceeded at the moment of execve(). If the process was sleeping for days between set*uid() and execve(), and the NPROC counter step down under the limit, the defered execve() failure because NPROC limit was exceeded days ago would be unexpected. If the limit is not exceeded anymore, we clear the flag on successful calls to execve() and fork(). The flag is also cleared on successful calls to set_user() as the limit was exceeded for the previous user, not the current one. Similar check was introduced in -ow patches (without the process flag). v3 - clear PF_NPROC_EXCEEDED on successful calls to set_user(). Reviewed-by: James Morris <jmorris@namei.org> Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26fs/exec.c:acct_arg_size(): ptl is no longer needed for add_mm_counter()Oleg Nesterov
acct_arg_size() takes ->page_table_lock around add_mm_counter() if !SPLIT_RSS_COUNTING. This is not needed after commit 172703b08cd0 ("mm: delete non-atomic mm counter implementation"). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26exec: do not retry load_binary method if CONFIG_MODULES=nTetsuo Handa
If CONFIG_MODULES=n, it makes no sense to retry the list of binary formats handler because the list will not be modified by request_module(). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Richard Weinberger <richard@nod.at> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26exec: do not call request_module() twice from search_binary_handler()Tetsuo Handa
Currently, search_binary_handler() tries to load binary loader module using request_module() if a loader for the requested program is not yet loaded. But second attempt of request_module() does not affect the result of search_binary_handler(). If request_module() triggered recursion, calling request_module() twice causes 2 to the power of MAX_KMOD_CONCURRENT (= 50) repetitions. It is not an infinite loop but is sufficient for users to consider as a hang up. Therefore, this patch changes not to call request_module() twice, making 1 to the power of MAX_KMOD_CONCURRENT repetitions in case of recursion. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Richard Weinberger <richard@nod.at> Tested-by: Richard Weinberger <richard@nod.at> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26fs/exec.c: use BUILD_BUG_ON for VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUPMichal Hocko
Commit a8bef8ff6ea1 ("mm: migration: avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks") introduced a BUG_ON() to ensure that VM_STACK_FLAGS and VM_STACK_INCOMPLETE_SETUP do not overlap. The check is a compile time one, so BUILD_BUG_ON is more appropriate. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Richard Weinberger <richard@nod.at> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26do_coredump: fix the "ispipe" error checkOleg Nesterov
do_coredump() assumes that if format_corename() fails it should return -ENOMEM. This is not true, for example cn_print_exe_file() can propagate the error from d_path. Even if it was true, this is too fragile. Change the code to check "ispipe < 0". Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Reviewed-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26coredump: escape / in hostname and commJiri Slaby
Change every occurence of / in comm and hostname to !. If the process changes its name to contain /, the core is not dumped (if the directory tree doesn't exist like that). The same with hostname being something like myhost/3. Fix this behaviour by using the escape loop used in %E. (We extract it to a separate function.) Now both with comm == myprocess/1 and hostname == myhost/1, the core is dumped like (kernel.core_pattern='core.%p.%e.%h): core.2349.myprocess!1.myhost!1 Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26coredump: use task comm instead of (unknown)Jiri Slaby
If we don't know the file corresponding to the binary (i.e. exe_file is unknown), use "task->comm (path unknown)" instead of simple "(unknown)" as suggested by ak. The fallback is the same as %e except it will append "(path unknown)". Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-22Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (107 commits) vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp isofs: Remove global fs lock jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory fix IN_DELETE_SELF on overwriting rename() on ramfs et.al. mm/truncate.c: fix build for CONFIG_BLOCK not enabled fs:update the NOTE of the file_operations structure Remove dead code in dget_parent() AFS: Fix silly characters in a comment switch d_add_ci() to d_splice_alias() in "found negative" case as well simplify gfs2_lookup() jfs_lookup(): don't bother with . or .. get rid of useless dget_parent() in btrfs rename() and link() get rid of useless dget_parent() in fs/btrfs/ioctl.c fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers drivers: fix up various ->llseek() implementations fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek Ext4: handle SEEK_HOLE/SEEK_DATA generically Btrfs: implement our own ->llseek fs: add SEEK_HOLE and SEEK_DATA flags reiserfs: make reiserfs default to barrier=flush ... Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_super.c due to the new shrinker callout for the inode cache, that clashed with the xfs code to start the periodic workers later.
2011-07-22Merge branch 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/miscLinus Torvalds
* 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (39 commits) ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever ptrace: fix ptrace_signal() && STOP_DEQUEUED interaction connector: add an event for monitoring process tracers ptrace: dont send SIGSTOP on auto-attach if PT_SEIZED ptrace: mv send-SIGSTOP from do_fork() to ptrace_init_task() ptrace_init_task: initialize child->jobctl explicitly has_stopped_jobs: s/task_is_stopped/SIGNAL_STOP_STOPPED/ ptrace: make former thread ID available via PTRACE_GETEVENTMSG after PTRACE_EVENT_EXEC stop ptrace: wait_consider_task: s/same_thread_group/ptrace_reparented/ ptrace: kill real_parent_is_ptracer() in in favor of ptrace_reparented() ptrace: ptrace_reparented() should check same_thread_group() redefine thread_group_leader() as exit_signal >= 0 do not change dead_task->exit_signal kill task_detached() reparent_leader: check EXIT_DEAD instead of task_detached() make do_notify_parent() __must_check, update the callers __ptrace_detach: avoid task_detached(), check do_notify_parent() kill tracehook_notify_death() make do_notify_parent() return bool ptrace: s/tracehook_tracer_task()/ptrace_parent()/ ...
2011-07-22ptrace: do_wait(traced_leader_killed_by_mt_exec) can block foreverOleg Nesterov
Test-case: void *tfunc(void *arg) { execvp("true", NULL); return NULL; } int main(void) { int pid; if (fork()) { pthread_t t; kill(getpid(), SIGSTOP); pthread_create(&t, NULL, tfunc, NULL); for (;;) pause(); } pid = getppid(); assert(ptrace(PTRACE_ATTACH, pid, 0,0) == 0); while (wait(NULL) > 0) ptrace(PTRACE_CONT, pid, 0,0); return 0; } It is racy, exit_notify() does __wake_up_parent() too. But in the likely case it triggers the problem: de_thread() does release_task() and the old leader goes away without the notification, the tracer sleeps in do_wait() without children/tracees. Change de_thread() to do __wake_up_parent(traced_leader->parent). Since it is already EXIT_DEAD we can do this without ptrace_unlink(), EXIT_DEAD threads do not exist from do_wait's pov. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Tejun Heo <tj@kernel.org>
2011-07-20consolidate BINPRM_FLAGS_ENFORCE_NONDUMP handlingAl Viro
new helper: would_dump(bprm, file). Checks if we are allowed to read the file and if we are not - sets ENFORCE_NODUMP. Exported, used in places that previously open-coded the same logics. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-01ptrace: make former thread ID available via PTRACE_GETEVENTMSG after ↵Denys Vlasenko
PTRACE_EVENT_EXEC stop When multithreaded program execs under ptrace, all traced threads report WIFEXITED status, except for thread group leader and the thread which execs. Unless tracer tracks thread group relationship between tracees, which is a nontrivial task, it will not detect that execed thread no longer exists. This patch allows tracer to figure out which thread performed this exec, by requesting PTRACE_GETEVENTMSG in PTRACE_EVENT_EXEC stop. Another, samller problem which is solved by this patch is that tracer now can figure out which of the several concurrent execs in multithreaded program succeeded. Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2011-06-27redefine thread_group_leader() as exit_signal >= 0Oleg Nesterov
Change de_thread() to set old_leader->exit_signal = -1. This is good for the consistency, it is no longer the leader and all sub-threads have exit_signal = -1 set by copy_process(CLONE_THREAD). And this allows us to micro-optimize thread_group_leader(), it can simply check exit_signal >= 0. This also makes sense because we should move ->group_leader from task_struct to signal_struct. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Tejun Heo <tj@kernel.org>
2011-06-22ptrace: kill clone/exec tracehooksTejun Heo
At this point, tracehooks aren't useful to mainline kernel and mostly just add an extra layer of obfuscation. Although they have comments, without actual in-kernel users, it is difficult to tell what are their assumptions and they're actually trying to achieve. To mainline kernel, they just aren't worth keeping around. This patch kills the following clone and exec related tracehooks. tracehook_prepare_clone() tracehook_finish_clone() tracehook_report_clone() tracehook_report_clone_complete() tracehook_unsafe_exec() The changes are mostly trivial - logic is moved to the caller and comments are merged and adjusted appropriately. The only exception is in check_unsafe_exec() where LSM_UNSAFE_PTRACE* are OR'd to bprm->unsafe instead of setting it, which produces the same result as the field is always zero on entry. It also tests p->ptrace instead of (p->ptrace & PT_PTRACED) for consistency, which also gives the same result. This doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2011-06-22ptrace: kill trivial tracehooksTejun Heo
At this point, tracehooks aren't useful to mainline kernel and mostly just add an extra layer of obfuscation. Although they have comments, without actual in-kernel users, it is difficult to tell what are their assumptions and they're actually trying to achieve. To mainline kernel, they just aren't worth keeping around. This patch kills the following trivial tracehooks. * Ones testing whether task is ptraced. Replace with ->ptrace test. tracehook_expect_breakpoints() tracehook_consider_ignored_signal() tracehook_consider_fatal_signal() * ptrace_event() wrappers. Call directly. tracehook_report_exec() tracehook_report_exit() tracehook_report_vfork_done() * ptrace_release_task() wrapper. Call directly. tracehook_finish_release_task() * noop tracehook_prepare_release_task() tracehook_report_death() This doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2011-06-17KEYS/DNS: Fix ____call_usermodehelper() to not lose the session keyringDavid Howells
____call_usermodehelper() now erases any credentials set by the subprocess_inf::init() function. The problem is that commit 17f60a7da150 ("capabilites: allow the application of capability limits to usermode helpers") creates and commits new credentials with prepare_kernel_cred() after the call to the init() function. This wipes all keyrings after umh_keys_init() is called. The best way to deal with this is to put the init() call just prior to the commit_creds() call, and pass the cred pointer to init(). That means that umh_keys_init() and suchlike can modify the credentials _before_ they are published and potentially in use by the rest of the system. This prevents request_key() from working as it is prevented from passing the session keyring it set up with the authorisation token to /sbin/request-key, and so the latter can't assume the authority to instantiate the key. This causes the in-kernel DNS resolver to fail with ENOKEY unconditionally. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Eric Paris <eparis@redhat.com> Tested-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>