linux-toradex.git/fs/proc, branch v4.9.100

fs/proc/kcore.c: use probe_kernel_read() instead of memcpy()

2018-02-17T12:21:18+00:00

commit d0290bc20d4739b7a900ae37eb5d4cc3be2b393f upstream.

Commit df04abfd181a ("fs/proc/kcore.c: Add bounce buffer for ktext
data") added a bounce buffer to avoid hardened usercopy checks.  Copying
to the bounce buffer was implemented with a simple memcpy() assuming
that it is always valid to read from kernel memory iff the
kern_addr_valid() check passed.

A simple, but pointless, test case like "dd if=/proc/kcore of=/dev/null"
can now easily crash the kernel, since the former exception handling on
invalid kernel addresses no longer works.

Also adding a kern_addr_valid() implementation wouldn't help here.  Most
architectures simply return 1 here, while a couple implemented a page
table walk to figure out if something is mapped at the address in
question.

With DEBUG_PAGEALLOC active, mappings are established and removed all the
time, so relying on the result of kern_addr_valid() before executing the
memcpy() doesn't work either.

Therefore simply use probe_kernel_read() to copy to the bounce buffer.
This also allows read_kcore() to be simplified.
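
The idea of a copy that reports a bad address instead of faulting can
be sketched in userspace. This is only an analogue, not the kernel
code: probe_read() and map_unreadable_page() are hypothetical helpers,
and the pipe trick merely stands in for what probe_kernel_read() does
inside the kernel. The bytes are routed through a pipe so that the
kernel performs the access and returns EFAULT for an unreadable source.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical userspace analogue of a fault-tolerant copy. A plain
 * memcpy() from an unmapped page would raise SIGSEGV; here the kernel
 * reads from src on our behalf and reports a bad address as an error.
 * Returns 0 on success, -EFAULT if src is not readable, or another
 * negative errno value on failure.
 */
static int probe_read(void *dst, const void *src, size_t len)
{
	int fds[2];
	ssize_t n;
	int ret = 0;

	if (len == 0 || len > 4096)	/* stay well below the pipe capacity */
		return -EINVAL;
	if (pipe(fds) < 0)
		return -errno;

	n = write(fds[1], src, len);	/* the kernel reads from src for us */
	if (n < 0)
		ret = -errno;		/* -EFAULT for an unreadable source */
	else if ((size_t)n != len)
		ret = -EFAULT;		/* only part of the range was readable */
	else if (read(fds[0], dst, len) != (ssize_t)len)
		ret = -EIO;

	close(fds[0]);
	close(fds[1]);
	return ret;
}

/* Maps one inaccessible page to probe; returns MAP_FAILED on error. */
static void *map_unreadable_page(void)
{
	return mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```

A caller checks the return value rather than assuming the source range
is valid, which is the property the memcpy() based bounce buffer copy
lacked.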

At least on s390 this fixes the observed crashes and doesn't introduce
warnings that were removed with df04abfd181a ("fs/proc/kcore.c: Add
bounce buffer for ktext data"), even though the generic
probe_kernel_read() implementation uses uaccess functions.

While looking into this I'm also wondering if kern_addr_valid() could be
completely removed...(?)

Link: http://lkml.kernel.org/r/20171202132739.99971-1-heiko.carstens@de.ibm.com
Fixes: df04abfd181a ("fs/proc/kcore.c: Add bounce buffer for ktext data")
Fixes: f5509cc18daa ("mm: Hardened usercopy")
Signed-off-by: Heiko Carstens 
Acked-by: Kees Cook 
Cc: Jiri Olsa 
Cc: Al Viro 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

proc: fix coredump vs read /proc/*/stat race

2018-01-23T18:57:08+00:00

commit 8bb2ee192e482c5d500df9f2b1b26a560bd3026f upstream.

do_task_stat() accesses IP and SP of a task without bumping reference
count of a stack (which became an entity with independent lifetime at
some point).

Steps to reproduce:

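    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <sys/time.h>
    #include <sys/resource.h>
    #include <unistd.h>
    #include <sys/wait.h>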

    int main(void)
    {
    	setrlimit(RLIMIT_CORE, &(struct rlimit){});

    	while (1) {
    		char buf[64];
    		char buf2[4096];
    		pid_t pid;
    		int fd;

    		pid = fork();
    		if (pid == 0) {
    			*(volatile int *)0 = 0;
    		}

    		snprintf(buf, sizeof(buf), "/proc/%u/stat", pid);
    		fd = open(buf, O_RDONLY);
    		read(fd, buf2, sizeof(buf2));
    		close(fd);

    		waitpid(pid, NULL, 0);
    	}
    	return 0;
    }

    BUG: unable to handle kernel paging request at 0000000000003fd8
    IP: do_task_stat+0x8b4/0xaf0
    PGD 800000003d73e067 P4D 800000003d73e067 PUD 3d558067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP PTI
    CPU: 0 PID: 1417 Comm: a.out Not tainted 4.15.0-rc8-dirty #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc27 04/01/2014
    RIP: 0010:do_task_stat+0x8b4/0xaf0
    Call Trace:
     proc_single_show+0x43/0x70
     seq_read+0xe6/0x3b0
     __vfs_read+0x1e/0x120
     vfs_read+0x84/0x110
     SyS_read+0x3d/0xa0
     entry_SYSCALL_64_fastpath+0x13/0x6c
    RIP: 0033:0x7f4d7928cba0
    RSP: 002b:00007ffddb245158 EFLAGS: 00000246
    Code: 03 b7 a0 01 00 00 4c 8b 4c 24 70 4c 8b 44 24 78 4c 89 74 24 18 e9 91 f9 ff ff f6 45 4d 02 0f 84 fd f7 ff ff 48 8b 45 40 48 89 ef <48> 8b 80 d8 3f 00 00 48 89 44 24 20 e8 9b 97 eb ff 48 89 44 24
    RIP: do_task_stat+0x8b4/0xaf0 RSP: ffffc90000607cc8
    CR2: 0000000000003fd8

John Ogness said: for my tests I added an else case to verify that the
race is hit and correctly mitigated.

Link: http://lkml.kernel.org/r/20180116175054.GA11513@avx2
Signed-off-by: Alexey Dobriyan 
Reported-by: "Kohli, Gaurav" 
Tested-by: John Ogness 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Oleg Nesterov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

tty fix oops when rmmod 8250

2017-12-20T09:07:32+00:00

[ Upstream commit c79dde629d2027ca80329c62854a7635e623d527 ]

After "rmmod 8250.ko", tty_kref_put() starts the work item
release_one_tty() to release the proc interface, which oopses when
accessing driver->driver_name in proc_tty_unregister_driver().

A jprobe showed that driver->driver_name points into 8250.ko:
static struct uart_driver serial8250_reg
.driver_name= serial,

Use the name stored in the proc_dir_entry instead of driver->driver_name
to fix the oops.
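
The difference between a borrowed name and an owned copy can be
sketched with a toy model. Everything here is hypothetical (the toy_*
types and functions are not kernel APIs), and it assumes that the proc
entry keeps its own copy of the name made at registration time.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a proc entry: it owns a private copy of its name. */
struct toy_proc_entry {
	char *name;
};

/* Toy stand-in for a tty driver whose strings live in module memory. */
struct toy_driver {
	char *driver_name;		/* borrowed: vanishes with the module */
	struct toy_proc_entry *ent;	/* entry created at registration */
};

/* Registration copies the name, so the entry outlives the module text. */
static int toy_register(struct toy_driver *drv)
{
	size_t len = strlen(drv->driver_name) + 1;
	struct toy_proc_entry *ent = malloc(sizeof(*ent));

	if (!ent)
		return -1;
	ent->name = malloc(len);
	if (!ent->name) {
		free(ent);
		return -1;
	}
	memcpy(ent->name, drv->driver_name, len);
	drv->ent = ent;
	return 0;
}

/* Name to remove at unregister time: taken from the entry's own copy. */
static const char *toy_unregister_name(const struct toy_driver *drv)
{
	return drv->ent ? drv->ent->name : NULL;
}

/* Releases the entry and its name copy. */
static void toy_unregister(struct toy_driver *drv)
{
	if (drv->ent) {
		free(drv->ent->name);
		free(drv->ent);
		drv->ent = NULL;
	}
}
```

In this model the unregister path never touches driver_name, so it
does not matter whether the memory behind that pointer still exists.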

Tested on Linux 4.1.12:

BUG: unable to handle kernel paging request at ffffffffa01979de
IP: [] strchr+0x0/0x30
PGD 1a0d067 PUD 1a0e063 PMD 851c1f067 PTE 0
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: ... ...  [last unloaded: 8250]
CPU: 7 PID: 116 Comm: kworker/7:1 Tainted: G           O    4.1.12 #1
Hardware name: Insyde RiverForest/Type2 - Board Product Name1, BIOS NE5KV904 12/21/2015
Workqueue: events release_one_tty
task: ffff88085b684960 ti: ffff880852884000 task.ti: ffff880852884000
RIP: 0010:[]  [] strchr+0x0/0x30
RSP: 0018:ffff880852887c90  EFLAGS: 00010282
RAX: ffffffff81a5eca0 RBX: ffffffffa01979de RCX: 0000000000000004
RDX: ffff880852887d10 RSI: 000000000000002f RDI: ffffffffa01979de
RBP: ffff880852887cd8 R08: 0000000000000000 R09: ffff88085f5d94d0
R10: 0000000000000195 R11: 0000000000000000 R12: ffffffffa01979de
R13: ffff880852887d00 R14: ffffffffa01979de R15: ffff88085f02e840
FS:  0000000000000000(0000) GS:ffff88085f5c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffa01979de CR3: 0000000001a0c000 CR4: 00000000001406e0
Stack:
 ffffffff812349b1 ffff880852887cb8 ffff880852887d10 ffff88085f5cd6c2
 ffff880852800a80 ffffffffa01979de ffff880852800a84 0000000000000010
 ffff88085bb28bd8 ffff880852887d38 ffffffff812354f0 ffff880852887d08
Call Trace:
 [] ? __xlate_proc_name+0x71/0xd0
 [] remove_proc_entry+0x40/0x180
 [] ? _raw_spin_lock_irqsave+0x41/0x60
 [] ? destruct_tty_driver+0x60/0xe0
 [] proc_tty_unregister_driver+0x28/0x40
 [] destruct_tty_driver+0x88/0xe0
 [] tty_driver_kref_put+0x1d/0x20
 [] release_one_tty+0x5a/0xd0
 [] process_one_work+0x139/0x420
 [] worker_thread+0x121/0x450
 [] ? process_scheduled_works+0x40/0x40
 [] kthread+0xec/0x110
 [] ? tg_rt_schedulable+0x210/0x220
 [] ? kthread_freezable_should_stop+0x80/0x80
 [] ret_from_fork+0x42/0x70
 [] ? kthread_freezable_should_stop+0x80/0x80

Signed-off-by: nixiaoming 
Signed-off-by: Greg Kroah-Hartman 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

fs/proc: Report eip/esp in /prod/PID/stat for coredumping

2017-10-05T07:43:58+00:00

commit fd7d56270b526ca3ed0c224362e3c64a0f86687a upstream.

Commit 0a1eb2d474ed ("fs/proc: Stop reporting eip and esp in
/proc/PID/stat") stopped reporting eip/esp because it is
racy and dangerous for executing tasks. The comment adds:

    As far as I know, there are no use programs that make any
    material use of these fields, so just get rid of them.

However, existing userspace core-dump-handler applications (for
example, minicoredumper) are using these fields since they
provide an excellent cross-platform interface to these valuable
pointers. So that commit introduced a user space visible
regression.

Partially revert the change and make the readout possible for
tasks with the proper permissions and only if the target task
has the PF_DUMPCORE flag set.
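
As an illustration of how a core-dump handler might consume these
values, the sketch below parses kstkesp and kstkeip, which proc(5)
documents as fields 29 and 30 of /proc/PID/stat. The helper names are
hypothetical and this is not taken from minicoredumper.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse kstkesp (field 29) and kstkeip (field 30) from one
 * /proc/PID/stat line. Returns 0 on success, -1 if the line is
 * malformed.
 */
static int parse_stat_sp_ip(const char *line, unsigned long *sp,
			    unsigned long *ip)
{
	/* comm (field 2) may contain spaces or ')', so find the last ')' */
	const char *p = strrchr(line, ')');
	int field;

	if (!p)
		return -1;
	p++;

	/* The token after ')' is field 3 (state); skip fields 3..28. */
	for (field = 3; field < 29; field++) {
		while (*p == ' ')
			p++;
		if (*p == '\0')
			return -1;
		while (*p != '\0' && *p != ' ')
			p++;
	}
	if (sscanf(p, "%lu %lu", sp, ip) != 2)
		return -1;
	return 0;
}

/* Read the first line of a stat file (e.g. "/proc/1234/stat") and parse it. */
static int read_stat_sp_ip(const char *path, unsigned long *sp,
			   unsigned long *ip)
{
	char buf[4096];
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return parse_stat_sp_ip(buf, sp, ip);
}
```

With this change applied, a reader should expect meaningful values
only for a task that is dumping core and that it has permission to
inspect.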

Fixes: 0a1eb2d474ed ("fs/proc: Stop reporting eip and esp in /proc/PID/stat")
Reported-by: Marco Felsch 
Signed-off-by: John Ogness 
Reviewed-by: Andy Lutomirski 
Cc: Tycho Andersen 
Cc: Kees Cook 
Cc: Peter Zijlstra 
Cc: Brian Gerst 
Cc: Tetsuo Handa 
Cc: Borislav Petkov 
Cc: Al Viro 
Cc: Linux API 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/87poatfwg6.fsf@linutronix.de
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman

mm: larger stack guard gap, between vmas

2017-06-24T05:11:18+00:00

commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream.

Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses alloca() allocations as large as 64kB in many
commonly used functions. Others use constructs like gid_t
buffer[NGROUPS_MAX], which is 256kB, or stack strings with MAX_ARG_STRLEN.

This is especially dangerous for suid binaries combined with the default
unlimited stack size, because those applications can be tricked into
consuming a large portion of the stack, after which a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunately.

Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somewhat inherent, but it should reduce the attack surface a lot.

One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications.  For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
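
The arithmetic behind the page-unit setting can be sketched as follows.
The figure of 256 pages is inferred here from the 1MB gap on 4k pages
mentioned above; guard_gap_bytes() is a hypothetical helper, not a
kernel function.

```c
#include <assert.h>

/* Default gap in pages: 1 MiB / 4 KiB = 256, as implied by the text. */
#define DEFAULT_GAP_PAGES 256UL

/* Converts a gap given in pages to bytes for a given page shift. */
static unsigned long guard_gap_bytes(unsigned long gap_pages,
				     unsigned int page_shift)
{
	return gap_pages << page_shift;
}
```

With a page shift of 12 (4k pages) the default comes to 1MB, and with
a page shift of 16 (64k pages) the same page count gives 16MB.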

Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.

Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and the vma tree's subtree_gap support for that.
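
A minimal userspace sketch of the "gap between vmas" idea, assuming a
simplified vma with only a start, an end and flags. The toy_* names
and the flag value are illustrative and not the kernel definitions.

```c
#include <assert.h>

#define TOY_VM_GROWSDOWN 0x0100UL	/* illustrative flag value */

/* Simplified stand-in for a vma; the gap is not part of the range. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_flags;
};

/*
 * Lowest address that a neighbouring mapping must stay below: for a
 * stack that grows down, the vma start minus the guard gap, clamped
 * at 0 if the subtraction wraps around.
 */
static unsigned long toy_vm_start_gap(const struct toy_vma *vma,
				      unsigned long gap)
{
	unsigned long start = vma->vm_start;

	if (vma->vm_flags & TOY_VM_GROWSDOWN) {
		start -= gap;
		if (start > vma->vm_start)	/* wrapped below zero */
			start = 0;
	}
	return start;
}
```

Because the gap is computed on demand rather than stored in the vma,
the accounted size of the stack mapping itself stays unchanged.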

Original-patch-by: Oleg Nesterov 
Original-patch-by: Michal Hocko 
Signed-off-by: Hugh Dickins 
Acked-by: Michal Hocko 
Tested-by: Helge Deller  # parisc
Signed-off-by: Linus Torvalds 
[wt: backport to 4.11: adjust context]
[wt: backport to 4.9: adjust context ; kernel doc was not in admin-guide]
Signed-off-by: Willy Tarreau 
Signed-off-by: Greg Kroah-Hartman

proc: add a schedule point in proc_pid_readdir()

2017-06-17T04:41:56+00:00

[ Upstream commit 3ba4bceef23206349d4130ddf140819b365de7c8 ]

We have seen proc_pid_readdir() invocations holding the CPU for more than
50 ms.  Add a cond_resched() to be gentle with other tasks.

[akpm@linux-foundation.org: coding style fix]
Link: http://lkml.kernel.org/r/1484238380.15816.42.camel@edumazet-glaptop3.roam.corp.google.com
Signed-off-by: Eric Dumazet 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

proc: Fix unbalanced hard link numbers

2017-05-25T13:44:37+00:00

commit d66bb1607e2d8d384e53f3d93db5c18483c8c4f7 upstream.

proc_create_mount_point() forgot to increase the parent's nlink, and
it resulted in unbalanced hard link numbers, e.g. /proc/fs shows one
less than expected.
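
The link count invariant can be shown with a toy model: a directory
starts with two links, and every subdirectory adds one to its parent.
The toy_* names are hypothetical and do not correspond to procfs code.

```c
#include <assert.h>

/* Toy directory that only tracks its hard link count. */
struct toy_dir {
	unsigned int nlink;
};

/* A fresh directory has two links: "." and its entry in the parent. */
static void toy_dir_init(struct toy_dir *d)
{
	d->nlink = 2;
}

/* Creating a subdirectory adds the child's ".." link to the parent. */
static void toy_mkdir(struct toy_dir *parent, struct toy_dir *child)
{
	toy_dir_init(child);
	parent->nlink++;	/* the step described above as forgotten */
}
```

A parent with N subdirectories should therefore report 2 + N links;
skipping the increment yields a count that is one too low for each
affected subdirectory.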

Fixes: eb6d38d5427b ("proc: Allow creating permanently empty directories...")
Reported-by: Tristan Ye 
Signed-off-by: Takashi Iwai 
Signed-off-by: Eric W. Biederman 
Signed-off-by: Greg Kroah-Hartman

thp: fix MADV_DONTNEED vs clear soft dirty race

2017-04-21T07:31:19+00:00

commit 5b7abeae3af8c08c577e599dd0578b9e3ee6687b upstream.

Yet another instance of the same race.

Fix is identical to change_huge_pmd().

See "thp: fix MADV_DONTNEED vs.  numa balancing race" for more details.

Link: http://lkml.kernel.org/r/20170302151034.27829-5-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov 
Cc: Andrea Arcangeli 
Cc: Hillf Danton 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

sysctl: Drop reference added by grab_header in proc_sys_readdir

2017-01-19T19:18:04+00:00

commit 93362fa47fe98b62e4a34ab408c4a418432e7939 upstream.

Fixes CVE-2016-9191: proc_sys_readdir() doesn't drop the reference
added by grab_header() when returning from the !dir_emit_dots() path.
This can cause any path that calls unregister_sysctl_table() to
wait forever.

The calltrace of CVE-2016-9191:

[ 5535.960522] Call Trace:
[ 5535.963265]  [] schedule+0x3f/0xa0
[ 5535.968817]  [] schedule_timeout+0x3db/0x6f0
[ 5535.975346]  [] ? wait_for_completion+0x45/0x130
[ 5535.982256]  [] wait_for_completion+0xc3/0x130
[ 5535.988972]  [] ? wake_up_q+0x80/0x80
[ 5535.994804]  [] drop_sysctl_table+0xc4/0xe0
[ 5536.001227]  [] drop_sysctl_table+0x77/0xe0
[ 5536.007648]  [] unregister_sysctl_table+0x4d/0xa0
[ 5536.014654]  [] unregister_sysctl_table+0x7f/0xa0
[ 5536.021657]  [] unregister_sched_domain_sysctl+0x15/0x40
[ 5536.029344]  [] partition_sched_domains+0x44/0x450
[ 5536.036447]  [] ? __mutex_unlock_slowpath+0x111/0x1f0
[ 5536.043844]  [] rebuild_sched_domains_locked+0x64/0xb0
[ 5536.051336]  [] update_flag+0x11d/0x210
[ 5536.057373]  [] ? mutex_lock_nested+0x2df/0x450
[ 5536.064186]  [] ? cpuset_css_offline+0x1b/0x60
[ 5536.070899]  [] ? trace_hardirqs_on+0xd/0x10
[ 5536.077420]  [] ? mutex_lock_nested+0x2df/0x450
[ 5536.084234]  [] ? css_killed_work_fn+0x25/0x220
[ 5536.091049]  [] cpuset_css_offline+0x35/0x60
[ 5536.097571]  [] css_killed_work_fn+0x5c/0x220
[ 5536.104207]  [] process_one_work+0x1df/0x710
[ 5536.110736]  [] ? process_one_work+0x160/0x710
[ 5536.117461]  [] worker_thread+0x12b/0x4a0
[ 5536.123697]  [] ? process_one_work+0x710/0x710
[ 5536.130426]  [] kthread+0xfe/0x120
[ 5536.135991]  [] ret_from_fork+0x1f/0x40
[ 5536.142041]  [] ? kthread_create_on_node+0x230/0x230

One cgroup maintainer mentioned that "cgroup is trying to offline
a cpuset css, which takes place under cgroup_mutex.  The offlining
ends up trying to drain active usages of a sysctl table which apparently
is not happening."
The real reason is that proc_sys_readdir() doesn't drop the reference
added by grab_header() when returning from the !dir_emit_dots() path. So
this cpuset offline path will wait here forever.
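
The leak pattern can be modelled with a toy use counter. This is a
sketch with hypothetical toy_* names, not the procfs code; the "fixed"
parameter exists only to contrast the early return with a common exit
path that always drops the reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy sysctl header that only tracks active users. */
struct toy_header {
	int used;
};

static void toy_grab(struct toy_header *h)
{
	h->used++;
}

static void toy_finish(struct toy_header *h)
{
	h->used--;
}

/*
 * Toy readdir: takes a reference, may bail out early when emitting
 * "." and ".." fails. With fixed == false the early return leaks the
 * reference; with fixed == true every path goes through "out".
 */
static int toy_readdir(struct toy_header *h, bool emit_dots_ok, bool fixed)
{
	toy_grab(h);

	if (!emit_dots_ok) {
		if (!fixed)
			return 0;	/* buggy: reference never dropped */
		goto out;
	}

	/* ... directory entries would be emitted here ... */
out:
	toy_finish(h);
	return 0;
}
```

Anything that waits for the use count to reach zero, as the unregister
path in the trace above does, would block forever after a single leaked
reference.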

See here for details: http://www.openwall.com/lists/oss-security/2016/11/04/13

Fixes: f0c3b5093add ("[readdir] convert procfs")
Reported-by: CAI Qian 
Tested-by: Yang Shukui 
Signed-off-by: Zhou Chengming 
Acked-by: Al Viro 
Signed-off-by: Eric W. Biederman 
Signed-off-by: Greg Kroah-Hartman

proc: fix NULL dereference when reading /proc/<pid>/auxv

2016-10-28T01:43:43+00:00

Reading auxv of any kernel thread results in a NULL pointer dereference
in auxv_read(), where mm can be NULL.  Fix that by checking for a NULL mm
and bailing out early.  This also restores the original behavior, which
was changed by recent commit c5317167854e ("proc: switch auxv to use of
__mem_open()").

  # cat /proc/2/auxv
  Unable to handle kernel NULL pointer dereference at virtual address 000000a8
  Internal error: Oops: 17 [#1] PREEMPT SMP ARM
  CPU: 3 PID: 113 Comm: cat Not tainted 4.9.0-rc1-ARCH+ #1
  Hardware name: BCM2709
  task: ea3b0b00 task.stack: e99b2000
  PC is at auxv_read+0x24/0x4c
  LR is at do_readv_writev+0x2fc/0x37c
  Process cat (pid: 113, stack limit = 0xe99b2210)
  Call chain:
    auxv_read
    do_readv_writev
    vfs_readv
    default_file_splice_read
    splice_direct_to_actor
    do_splice_direct
    do_sendfile
    SyS_sendfile64
    ret_fast_syscall
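
The early bail-out can be sketched with a toy model in which a kernel
thread is represented by a NULL mm pointer. The toy_* names and the
fixed vector size are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_AUXV_WORDS 8

/* Toy mm that only carries a saved auxiliary vector. */
struct toy_mm {
	unsigned long saved_auxv[TOY_AUXV_WORDS];
};

/*
 * Returns the number of bytes of auxv data, up to and including the
 * terminating zero-typed pair, or 0 when there is no mm at all.
 */
static size_t toy_auxv_len(const struct toy_mm *mm)
{
	size_t nwords = 0;

	if (!mm)	/* kernel thread: nothing to read, bail out early */
		return 0;

	do {
		nwords += 2;	/* each entry is a (type, value) pair */
	} while (nwords < TOY_AUXV_WORDS && mm->saved_auxv[nwords - 2] != 0);

	return nwords * sizeof(mm->saved_auxv[0]);
}
```

Without the NULL check, the first access to saved_auxv would
dereference a NULL mm, which matches the oops shown above.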

Fixes: c5317167854e ("proc: switch auxv to use of __mem_open()")
Link: http://lkml.kernel.org/r/1476966200-14457-1-git-send-email-chianglungyu@gmail.com
Signed-off-by: Leon Yu 
Acked-by: Oleg Nesterov 
Acked-by: Michal Hocko 
Cc: Al Viro 
Cc: Kees Cook 
Cc: John Stultz 
Cc: Mateusz Guzik 
Cc: Janis Danisevskis 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds