From 3cf5d076fb4d48979f382bc9452765bf8b79e740 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Thu, 23 May 2019 10:17:27 -0500 Subject: signal: Remove task parameter from force_sig All of the remaining callers pass current into force_sig so remove the task parameter to make this obvious and to make misuse more difficult in the future. This also makes it clear force_sig passes current into force_sig_info. Signed-off-by: "Eric W. Biederman" --- include/linux/syscalls.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include/linux/syscalls.h') diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index e2870fe1be5b..fd6e0f5ebfdf 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -266,7 +266,7 @@ static inline void addr_limit_user_check(void) if (CHECK_DATA_CORRUPTION(!segment_eq(get_fs(), USER_DS), "Invalid address limit on user-mode return")) - force_sig(SIGKILL, current); + force_sig(SIGKILL); #ifdef TIF_FSCHECK clear_thread_flag(TIF_FSCHECK); -- cgit v1.2.3 From 7f192e3cd316ba58c88dfa26796cf77789dd9872 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Sat, 25 May 2019 11:36:41 +0200 Subject: fork: add clone3 This adds the clone3 system call. As mentioned several times already (cf. [7], [8]) here's the promised patchset for clone3(). We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last free flag from clone(). Independent of the CLONE_PIDFD patchset a time namespace has been discussed at Linux Plumber Conference last year and has been sent out and reviewed (cf. [5]). It is expected that it will go upstream in the not too distant future. However, it relies on the addition of the CLONE_NEWTIME flag to clone(). The only other good candidate - CLONE_DETACHED - is currently not recyclable as we have identified at least two large or widely used codebases that currently pass this flag (cf. [2], [3], and [4]). Given that CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively blocked. clone3() has the advantage that it will unblock this patchset again. In general, clone3() is extensible and allows for the implementation of new features. The idea is to keep clone3() very simple and close to the original clone(), specifically, to keep on supporting old clone()-based workloads. We know there have been various creative proposals how a new process creation syscall or even api is supposed to look like. Some people even going so far as to argue that the traditional fork()+exec() split should be abandoned in favor of an in-kernel version of spawn(). Independent of whether or not we personally think spawn() is a good idea this patchset has and does not want to have anything to do with this. One stance we take is that there's no real good alternative to clone()+exec() and we need and want to support this model going forward; independent of spawn(). The following requirements guided clone3(): - bump the number of available flags - move arguments that are currently passed as separate arguments in clone() into a dedicated struct clone_args - choose a struct layout that is easy to handle on 32 and on 64 bit - choose a struct layout that is extensible - give new flags that currently need to abuse another flag's dedicated return argument in clone() their own dedicated return argument (e.g. CLONE_PIDFD) - use a separate kernel internal struct kernel_clone_args that is properly typed according to current kernel conventions in fork.c and is different from the uapi struct clone_args - port _do_fork() to use kernel_clone_args so that all process creation syscalls such as fork(), vfork(), clone(), and clone3() behave identical (Arnd suggested, that we can probably also port do_fork() itself in a separate patchset.) - ease of transition for userspace from clone() to clone3() This very much means that we do *not* remove functionality that userspace currently relies on as the latter is a good way of creating a syscall that won't be adopted. - do not try to be clever or complex: keep clone3() as dumb as possible In accordance with Linus suggestions (cf. [11]), clone3() has the following signature: /* uapi */ struct clone_args { __aligned_u64 flags; __aligned_u64 pidfd; __aligned_u64 child_tid; __aligned_u64 parent_tid; __aligned_u64 exit_signal; __aligned_u64 stack; __aligned_u64 stack_size; __aligned_u64 tls; }; /* kernel internal */ struct kernel_clone_args { u64 flags; int __user *pidfd; int __user *child_tid; int __user *parent_tid; int exit_signal; unsigned long stack; unsigned long stack_size; unsigned long tls; }; long sys_clone3(struct clone_args __user *uargs, size_t size) clone3() cleanly supports all of the supported flags from clone() and thus all legacy workloads. The advantage of sticking close to the old clone() is the low cost for userspace to switch to this new api. Quite a lot of userspace apis (e.g. pthreads) are based on the clone() syscall. With the new clone3() syscall supporting all of the old workloads and opening up the ability to add new features should make switching to it for userspace more appealing. In essence, glibc can just write a simple wrapper to switch from clone() to clone3(). There has been some interest in this patchset already. We have received a patch from the CRIU corner for clone3() that would set the PID/TID of a restored process without /proc/sys/kernel/ns_last_pid to eliminate a race. /* User visible differences to legacy clone() */ - CLONE_DETACHED will cause EINVAL with clone3() - CSIGNAL is deprecated It is superseeded by a dedicated "exit_signal" argument in struct clone_args freeing up space for additional flags. This is based on a suggestion from Andrei and Linus (cf. [9] and [10]) /* References */ [1]: b3e5838252665ee4cfa76b82bdf1198dca81e5be [2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343 [3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233 [4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740 [5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@arista.com/ [6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@arista.com/ [7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/ [8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@brauner.io/ [9]: https://lore.kernel.org/lkml/20190529222414.GA6492@gmail.com/ [10]: https://lore.kernel.org/lkml/CAHk-=whQP-Ykxi=zSYaV9iXsHsENa+2fdj-zYKwyeyed63Lsfw@mail.gmail.com/ [11]: https://lore.kernel.org/lkml/CAHk-=wieuV4hGwznPsX-8E0G2FKhx3NjZ9X3dTKh5zKd+iqOBw@mail.gmail.com/ Suggested-by: Linus Torvalds Signed-off-by: Christian Brauner Acked-by: Arnd Bergmann Acked-by: Serge Hallyn Cc: Kees Cook Cc: Pavel Emelyanov Cc: Jann Horn Cc: David Howells Cc: Andrew Morton Cc: Oleg Nesterov Cc: Adrian Reber Cc: Linus Torvalds Cc: Andrei Vagin Cc: Al Viro Cc: Florian Weimer Cc: linux-api@vger.kernel.org --- include/linux/syscalls.h | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'include/linux/syscalls.h') diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index e2870fe1be5b..60a81f374ca3 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -70,6 +70,7 @@ struct sigaltstack; struct rseq; union bpf_attr; struct io_uring_params; +struct clone_args; #include #include @@ -852,6 +853,9 @@ asmlinkage long sys_clone(unsigned long, unsigned long, int __user *, int __user *, unsigned long); #endif #endif + +asmlinkage long sys_clone3(struct clone_args __user *uargs, size_t size); + asmlinkage long sys_execve(const char __user *filename, const char __user *const __user *argv, const char __user *const __user *envp); -- cgit v1.2.3 From 32fcb426ec001cb6d5a4a195091a8486ea77e2df Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Fri, 24 May 2019 12:43:51 +0200 Subject: pid: add pidfd_open() This adds the pidfd_open() syscall. It allows a caller to retrieve pollable pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a process that is created via traditional fork()/clone() calls that is only referenced by a PID: int pidfd = pidfd_open(1234, 0); ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0); With the introduction of pidfds through CLONE_PIDFD it is possible to created pidfds at process creation time. However, a lot of processes get created with traditional PID-based calls such as fork() or clone() (without CLONE_PIDFD). For these processes a caller can currently not create a pollable pidfd. This is a problem for Android's low memory killer (LMK) and service managers such as systemd. Both are examples of tools that want to make use of pidfds to get reliable notification of process exit for non-parents (pidfd polling) and race-free signal sending (pidfd_send_signal()). They intend to switch to this API for process supervision/management as soon as possible. Having no way to get pollable pidfds from PID-only processes is one of the biggest blockers for them in adopting this api. With pidfd_open() making it possible to retrieve pidfds for PID-based processes we enable them to adopt this api. In line with Arnd's recent changes to consolidate syscall numbers across architectures, I have added the pidfd_open() syscall to all architectures at the same time. Signed-off-by: Christian Brauner Reviewed-by: David Howells Reviewed-by: Oleg Nesterov Acked-by: Arnd Bergmann Cc: "Eric W. Biederman" Cc: Kees Cook Cc: Joel Fernandes (Google) Cc: Thomas Gleixner Cc: Jann Horn Cc: Andy Lutomirsky Cc: Andrew Morton Cc: Aleksa Sarai Cc: Linus Torvalds Cc: Al Viro Cc: linux-api@vger.kernel.org --- include/linux/syscalls.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/linux/syscalls.h') diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index e2870fe1be5b..989055e0b501 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -929,6 +929,7 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock, struct old_timex32 __user *tx); asmlinkage long sys_syncfs(int fd); asmlinkage long sys_setns(int fd, int nstype); +asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags); asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg, unsigned int vlen, unsigned flags); asmlinkage long sys_process_vm_readv(pid_t pid, -- cgit v1.2.3 From 33488845f211afcdb7e5c00a3152890e06cdc78e Mon Sep 17 00:00:00 2001 From: Al Viro Date: Fri, 31 May 2019 20:09:15 -0400 Subject: constify ksys_mount() string arguments Signed-off-by: Al Viro --- include/linux/syscalls.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'include/linux/syscalls.h') diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index e2870fe1be5b..2a0ac10a6f95 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -1228,8 +1228,8 @@ asmlinkage long sys_ni_syscall(void); * the ksys_xyzyyz() functions prototyped below. */ -int ksys_mount(char __user *dev_name, char __user *dir_name, char __user *type, - unsigned long flags, void __user *data); +int ksys_mount(const char __user *dev_name, const char __user *dir_name, + const char __user *type, unsigned long flags, void __user *data); int ksys_umount(char __user *name, int flags); int ksys_dup(unsigned int fildes); int ksys_chroot(const char __user *filename); -- cgit v1.2.3