From 33c35aa4817864e056fd772230b0c6b552e36ea2 Mon Sep 17 00:00:00 2001
From: Waiman Long <longman@redhat.com>
Date: Mon, 15 May 2017 09:34:06 -0400
Subject: cgroup: Prevent kill_css() from being called more than once

The kill_css() function may be called more than once under the condition
that the css was killed but not physically removed yet followed by the
removal of the cgroup that is hosting the css. This patch prevents any
harmm from being done when that happens.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v4.5+
---
 kernel/cgroup/cgroup.c | 5 +++++
 1 file changed, 5 insertions(+)

(limited to 'kernel')

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c3c9a0e1b3c9..8d4e85eae42c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -4265,6 +4265,11 @@ static void kill_css(struct cgroup_subsys_state *css)
 {
 	lockdep_assert_held(&cgroup_mutex);
 
+	if (css->flags & CSS_DYING)
+		return;
+
+	css->flags |= CSS_DYING;
+
 	/*
 	 * This must happen before css is disassociated with its cgroup.
 	 * See seq_css() for details.
-- 
cgit v1.2.3


From 41c25707d21716826e3c1f60967f5550610ec1c9 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 24 May 2017 12:03:48 -0400
Subject: cpuset: consider dying css as offline

In most cases, a cgroup controller don't care about the liftimes of
cgroups.  For the controller, a css becomes online when ->css_online()
is called on it and offline when ->css_offline() is called.

However, cpuset is special in that the user interface it exposes cares
whether certain cgroups exist or not.  Combined with the RCU delay
between cgroup removal and css offlining, this can lead to user
visible behavior oddities where operations which should succeed after
cgroup removals fail for some time period.  The effects of cgroup
removals are delayed when seen from userland.

This patch adds css_is_dying() which tests whether offline is pending
and updates is_cpuset_online() so that the function returns false also
while offline is pending.  This gets rid of the userland visible
delays.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2@oracle.com
Cc: stable@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup/cpuset.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

(limited to 'kernel')

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index f6501f4f6040..ae643412948a 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -176,9 +176,9 @@ typedef enum {
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
-static inline bool is_cpuset_online(const struct cpuset *cs)
+static inline bool is_cpuset_online(struct cpuset *cs)
 {
-	return test_bit(CS_ONLINE, &cs->flags);
+	return test_bit(CS_ONLINE, &cs->flags) && !css_is_dying(&cs->css);
 }
 
 static inline int is_cpu_exclusive(const struct cpuset *cs)
-- 
cgit v1.2.3


From 5720acf4bfc142ba568d5b6782fceaf62ed15e0b Mon Sep 17 00:00:00 2001
From: Miroslav Benes <mbenes@suse.cz>
Date: Fri, 26 May 2017 14:45:21 +0200
Subject: livepatch: Make livepatch dependent on !TRIM_UNUSED_KSYMS

If TRIM_UNUSED_KSYMS is enabled, all unneeded exported symbols are made
unexported. Two-pass build of the kernel is done to find out which
symbols are needed based on a configuration. This effectively
complicates things for out-of-tree modules.

Livepatch exports functions to (un)register and enable/disable a live
patch. The only in-tree module which uses these functions is a sample in
samples/livepatch/. If the sample is disabled, the functions are
trimmed and out-of-tree live patches cannot be built.

Note that live patches are intended to be built out-of-tree.

Suggested-by: Michal Marek <mmarek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Jessica Yu <jeyu@redhat.com>
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
---
 kernel/livepatch/Kconfig | 1 +
 1 file changed, 1 insertion(+)

(limited to 'kernel')

diff --git a/kernel/livepatch/Kconfig b/kernel/livepatch/Kconfig
index 045022557936..ec4565122e65 100644
--- a/kernel/livepatch/Kconfig
+++ b/kernel/livepatch/Kconfig
@@ -10,6 +10,7 @@ config LIVEPATCH
 	depends on SYSFS
 	depends on KALLSYMS_ALL
 	depends on HAVE_LIVEPATCH
+	depends on !TRIM_UNUSED_KSYMS
 	help
 	  Say Y here if you want to support kernel live patching.
 	  This option has no runtime impact until a kernel "patch"
-- 
cgit v1.2.3


From 40da1b11f01e43aad1aa6cea64681b6125e8a2a7 Mon Sep 17 00:00:00 2001
From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Fri, 2 Jun 2017 16:27:14 +0200
Subject: cpu/hotplug: Drop the device lock on error

If a custom CPU target is specified and that one is not available _or_
can't be interrupted then the code returns to userland without dropping a
lock as notices by lockdep:

|echo 133 > /sys/devices/system/cpu/cpu7/hotplug/target
| ================================================
| [ BUG: lock held when returning to user space! ]
| ------------------------------------------------
| bash/503 is leaving the kernel with locks still held!
| 1 lock held by bash/503:
|  #0:  (device_hotplug_lock){+.+...}, at: [<ffffffff815b5650>] lock_device_hotplug_sysfs+0x10/0x40

So release the lock then.

Fixes: 757c989b9994 ("cpu/hotplug: Make target state writeable")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170602142714.3ogo25f2wbq6fjpj@linutronix.de
---
 kernel/cpu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

(limited to 'kernel')

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 9ae6fbe5b5cf..cb5103413bd8 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1658,13 +1658,13 @@ static ssize_t write_cpuhp_target(struct device *dev,
 	ret = !sp->name || sp->cant_stop ? -EINVAL : 0;
 	mutex_unlock(&cpuhp_state_mutex);
 	if (ret)
-		return ret;
+		goto out;
 
 	if (st->state < target)
 		ret = do_cpu_up(dev->id, target);
 	else
 		ret = do_cpu_down(dev->id, target);
-
+out:
 	unlock_device_hotplug();
 	return ret ? ret : count;
 }
-- 
cgit v1.2.3


From f4781e76f90df7aec400635d73ea4c35ee1d4765 Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <tglx@linutronix.de>
Date: Tue, 30 May 2017 23:15:34 +0200
Subject: alarmtimer: Prevent overflow of relative timers

Andrey reported a alartimer related RCU stall while fuzzing the kernel with
syzkaller.

The reason for this is an overflow in ktime_add() which brings the
resulting time into negative space and causes immediate expiry of the
timer. The following rearm with a small interval does not bring the timer
back into positive space due to the same issue.

This results in a permanent firing alarmtimer which hogs the CPU.

Use ktime_add_safe() instead which detects the overflow and clamps the
result to KTIME_SEC_MAX.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kostya Serebryany <kcc@google.com>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170530211655.802921648@linutronix.de
---
 kernel/time/alarmtimer.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

(limited to 'kernel')

diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
index 5cb5b0008d97..2b2e032b4eb4 100644
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -387,7 +387,7 @@ void alarm_start_relative(struct alarm *alarm, ktime_t start)
 {
 	struct alarm_base *base = &alarm_bases[alarm->type];
 
-	start = ktime_add(start, base->gettime());
+	start = ktime_add_safe(start, base->gettime());
 	alarm_start(alarm, start);
 }
 EXPORT_SYMBOL_GPL(alarm_start_relative);
@@ -475,7 +475,7 @@ u64 alarm_forward(struct alarm *alarm, ktime_t now, ktime_t interval)
 		overrun++;
 	}
 
-	alarm->node.expires = ktime_add(alarm->node.expires, interval);
+	alarm->node.expires = ktime_add_safe(alarm->node.expires, interval);
 	return overrun;
 }
 EXPORT_SYMBOL_GPL(alarm_forward);
@@ -666,7 +666,7 @@ static int alarm_timer_set(struct k_itimer *timr, int flags,
 		ktime_t now;
 
 		now = alarm_bases[timr->it.alarm.alarmtimer.type].gettime();
-		exp = ktime_add(now, exp);
+		exp = ktime_add_safe(now, exp);
 	}
 
 	alarm_start(&timr->it.alarm.alarmtimer, exp);
-- 
cgit v1.2.3


From ff86bf0c65f14346bf2440534f9ba5ac232c39a0 Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <tglx@linutronix.de>
Date: Tue, 30 May 2017 23:15:35 +0200
Subject: alarmtimer: Rate limit periodic intervals

The alarmtimer code has another source of potentially rearming itself too
fast. Interval timers with a very samll interval have a similar CPU hog
effect as the previously fixed overflow issue.

The reason is that alarmtimers do not implement the normal protection
against this kind of problem which the other posix timer use:

  timer expires -> queue signal -> deliver signal -> rearm timer

This scheme brings the rearming under scheduler control and prevents
permanently firing timers which hog the CPU.

Bringing this scheme to the alarm timer code is a major overhaul because it
lacks all the necessary mechanisms completely.

So for a quick fix limit the interval to one jiffie. This is not
problematic in practice as alarmtimers are usually backed by an RTC for
suspend which have 1 second resolution. It could be therefor argued that
the resolution of this clock should be set to 1 second in general, but
that's outside the scope of this fix.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kostya Serebryany <kcc@google.com>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170530211655.896767100@linutronix.de
---
 kernel/time/alarmtimer.c | 8 ++++++++
 1 file changed, 8 insertions(+)

(limited to 'kernel')

diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
index 2b2e032b4eb4..ee2f4202d82a 100644
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -660,6 +660,14 @@ static int alarm_timer_set(struct k_itimer *timr, int flags,
 
 	/* start the timer */
 	timr->it.alarm.interval = timespec64_to_ktime(new_setting->it_interval);
+
+	/*
+	 * Rate limit to the tick as a hot fix to prevent DOS. Will be
+	 * mopped up later.
+	 */
+	if (timr->it.alarm.interval < TICK_NSEC)
+		timr->it.alarm.interval = TICK_NSEC;
+
 	exp = timespec64_to_ktime(new_setting->it_value);
 	/* Convert (if necessary) to absolute time */
 	if (flags != TIMER_ABSTIME) {
-- 
cgit v1.2.3


From f3b7eaae1b35eb8077610eb7c7db042c9b0645e1 Mon Sep 17 00:00:00 2001
From: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Date: Wed, 7 Jun 2017 00:57:37 +0200
Subject: Revert "ACPI / sleep: Ignore spurious SCI wakeups from
 suspend-to-idle"

Revert commit eed4d47efe95 (ACPI / sleep: Ignore spurious SCI wakeups
from suspend-to-idle) as it turned out to be premature and triggered
a number of different issues on various systems.

That includes, but is not limited to, premature suspend-to-RAM aborts
on Dell XPS 13 (9343) reported by Dominik.

The issue the commit in question attempted to address is real and
will need to be taken care of going forward, but evidently more work
is needed for this purpose.

Reported-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 kernel/power/process.c |  2 +-
 kernel/power/suspend.c | 29 ++++-------------------------
 2 files changed, 5 insertions(+), 26 deletions(-)

(limited to 'kernel')

diff --git a/kernel/power/process.c b/kernel/power/process.c
index 78672d324a6e..c7209f060eeb 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -132,7 +132,7 @@ int freeze_processes(void)
 	if (!pm_freezing)
 		atomic_inc(&system_freezing_cnt);
 
-	pm_wakeup_clear(true);
+	pm_wakeup_clear();
 	pr_info("Freezing user space processes ... ");
 	pm_freezing = true;
 	error = try_to_freeze_tasks(true);
diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index c0248c74d6d4..15e6baef5c73 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -72,8 +72,6 @@ static void freeze_begin(void)
 
 static void freeze_enter(void)
 {
-	trace_suspend_resume(TPS("machine_suspend"), PM_SUSPEND_FREEZE, true);
-
 	spin_lock_irq(&suspend_freeze_lock);
 	if (pm_wakeup_pending())
 		goto out;
@@ -100,27 +98,6 @@ static void freeze_enter(void)
  out:
 	suspend_freeze_state = FREEZE_STATE_NONE;
 	spin_unlock_irq(&suspend_freeze_lock);
-
-	trace_suspend_resume(TPS("machine_suspend"), PM_SUSPEND_FREEZE, false);
-}
-
-static void s2idle_loop(void)
-{
-	do {
-		freeze_enter();
-
-		if (freeze_ops && freeze_ops->wake)
-			freeze_ops->wake();
-
-		dpm_resume_noirq(PMSG_RESUME);
-		if (freeze_ops && freeze_ops->sync)
-			freeze_ops->sync();
-
-		if (pm_wakeup_pending())
-			break;
-
-		pm_wakeup_clear(false);
-	} while (!dpm_suspend_noirq(PMSG_SUSPEND));
 }
 
 void freeze_wake(void)
@@ -394,8 +371,10 @@ static int suspend_enter(suspend_state_t state, bool *wakeup)
 	 * all the devices are suspended.
 	 */
 	if (state == PM_SUSPEND_FREEZE) {
-		s2idle_loop();
-		goto Platform_early_resume;
+		trace_suspend_resume(TPS("machine_suspend"), state, true);
+		freeze_enter();
+		trace_suspend_resume(TPS("machine_suspend"), state, false);
+		goto Platform_wake;
 	}
 
 	error = disable_nonboot_cpus();
-- 
cgit v1.2.3


From cc1582c231ea041fbc68861dfaf957eaf902b829 Mon Sep 17 00:00:00 2001
From: Jin Yao <yao.jin@linux.intel.com>
Date: Thu, 25 May 2017 18:09:07 +0800
Subject: perf/core: Drop kernel samples even though :u is specified

When doing sampling, for example:

  perf record -e cycles:u ...

On workloads that do a lot of kernel entry/exits we see kernel
samples, even though :u is specified. This is due to skid existing.

This might be a security issue because it can leak kernel addresses even
though kernel sampling support is disabled.

The patch drops the kernel samples if exclude_kernel is specified.

For example, test on Haswell desktop:

  perf record -e cycles:u <mgen>
  perf report --stdio

Before patch applied:

    99.77%  mgen     mgen              [.] buf_read
     0.20%  mgen     mgen              [.] rand_buf_init
     0.01%  mgen     [kernel.vmlinux]  [k] apic_timer_interrupt
     0.00%  mgen     mgen              [.] last_free_elem
     0.00%  mgen     libc-2.23.so      [.] __random_r
     0.00%  mgen     libc-2.23.so      [.] _int_malloc
     0.00%  mgen     mgen              [.] rand_array_init
     0.00%  mgen     [kernel.vmlinux]  [k] page_fault
     0.00%  mgen     libc-2.23.so      [.] __random
     0.00%  mgen     libc-2.23.so      [.] __strcasestr
     0.00%  mgen     ld-2.23.so        [.] strcmp
     0.00%  mgen     ld-2.23.so        [.] _dl_start
     0.00%  mgen     libc-2.23.so      [.] sched_setaffinity@@GLIBC_2.3.4
     0.00%  mgen     ld-2.23.so        [.] _start

We can see kernel symbols apic_timer_interrupt and page_fault.

After patch applied:

    99.79%  mgen     mgen           [.] buf_read
     0.19%  mgen     mgen           [.] rand_buf_init
     0.00%  mgen     libc-2.23.so   [.] __random_r
     0.00%  mgen     mgen           [.] rand_array_init
     0.00%  mgen     mgen           [.] last_free_elem
     0.00%  mgen     libc-2.23.so   [.] vfprintf
     0.00%  mgen     libc-2.23.so   [.] rand
     0.00%  mgen     libc-2.23.so   [.] __random
     0.00%  mgen     libc-2.23.so   [.] _int_malloc
     0.00%  mgen     libc-2.23.so   [.] _IO_doallocbuf
     0.00%  mgen     ld-2.23.so     [.] do_lookup_x
     0.00%  mgen     ld-2.23.so     [.] open_verify.constprop.7
     0.00%  mgen     ld-2.23.so     [.] _dl_important_hwcaps
     0.00%  mgen     libc-2.23.so   [.] sched_setaffinity@@GLIBC_2.3.4
     0.00%  mgen     ld-2.23.so     [.] _start

There are only userspace symbols.

Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: jolsa@kernel.org
Cc: kan.liang@intel.com
Cc: mark.rutland@arm.com
Cc: will.deacon@arm.com
Cc: yao.jin@intel.com
Link: http://lkml.kernel.org/r/1495706947-3744-1-git-send-email-yao.jin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/events/core.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

(limited to 'kernel')

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6e75a5c9412d..6c4e523dc1e2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7316,6 +7316,21 @@ int perf_event_account_interrupt(struct perf_event *event)
 	return __perf_event_account_interrupt(event, 1);
 }
 
+static bool sample_is_allowed(struct perf_event *event, struct pt_regs *regs)
+{
+	/*
+	 * Due to interrupt latency (AKA "skid"), we may enter the
+	 * kernel before taking an overflow, even if the PMU is only
+	 * counting user events.
+	 * To avoid leaking information to userspace, we must always
+	 * reject kernel samples when exclude_kernel is set.
+	 */
+	if (event->attr.exclude_kernel && !user_mode(regs))
+		return false;
+
+	return true;
+}
+
 /*
  * Generic event overflow handling, sampling.
  */
@@ -7336,6 +7351,12 @@ static int __perf_event_overflow(struct perf_event *event,
 
 	ret = __perf_event_account_interrupt(event, throttle);
 
+	/*
+	 * For security, drop the skid kernel samples if necessary.
+	 */
+	if (!sample_is_allowed(event, regs))
+		return ret;
+
 	/*
 	 * XXX event_limit might not quite work as expected on inherited
 	 * events
-- 
cgit v1.2.3


From dac8bbbae1d0ccba96402d25deeed3a2e87992c6 Mon Sep 17 00:00:00 2001
From: Petr Mladek <pmladek@suse.com>
Date: Thu, 8 Jun 2017 12:01:30 +0200
Subject: Revert "printk: fix double printing with earlycon"

This reverts commit cf39bf58afdaabc0b86f141630fb3fd18190294e.

The commit regression to users that define both console=ttyS1
and console=ttyS0 on the command line, see
https://lkml.kernel.org/r/20170509082915.GA13236@bistromath.localdomain

The kernel log messages always appeared only on one serial port. It is
even documented in Documentation/admin-guide/serial-console.rst:

"Note that you can only define one console per device type (serial,
video)."

The above mentioned commit changed the order in which the command line
parameters are searched. As a result, the kernel log messages go to
the last mentioned ttyS* instead of the first one.

We long thought that using two console=ttyS* on the command line
did not make sense. But then we realized that console= parameters
were handled also by systemd, see
http://0pointer.de/blog/projects/serial-console.html

"By default systemd will instantiate one serial-getty@.service on
the main kernel console, if it is not a virtual terminal."

where

"[4] If multiple kernel consoles are used simultaneously, the main
console is the one listed first in /sys/class/tty/console/active,
which is the last one listed on the kernel command line."

This puts the original report into another light. The system is running
in qemu. The first serial port is used to store the messages into a file.
The second one is used to login to the system via a socket. It depends
on systemd and the historic kernel behavior.

By other words, systemd causes that it makes sense to define both
console=ttyS1 console=ttyS0 on the command line. The kernel fix
caused regression related to userspace (systemd) and need to be
reverted.

In addition, it went out that the fix helped only partially.
The messages still were duplicated when the boot console was
removed early by late_initcall(printk_late_init). Then the entire
log was replayed when the same console was registered as a normal one.

Link: 20170606160339.GC7604@pathway.suse.cz
Cc: Aleksey Makarov <aleksey.makarov@linaro.org>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Robin Murphy <robin.murphy@arm.com>,
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Nair, Jayachandran" <Jayachandran.Nair@cavium.com>
Cc: linux-serial@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 kernel/printk/printk.c | 46 ++++++++++------------------------------------
 1 file changed, 10 insertions(+), 36 deletions(-)

(limited to 'kernel')

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 779479ac9f57..9dbceb76e6bc 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -269,7 +269,6 @@ static struct console *exclusive_console;
 #define MAX_CMDLINECONSOLES 8
 
 static struct console_cmdline console_cmdline[MAX_CMDLINECONSOLES];
-static int console_cmdline_cnt;
 
 static int preferred_console = -1;
 int console_set_on_cmdline;
@@ -1906,25 +1905,12 @@ static int __add_preferred_console(char *name, int idx, char *options,
 	 *	See if this tty is not yet registered, and
 	 *	if we have a slot free.
 	 */
-	for (i = 0, c = console_cmdline; i < console_cmdline_cnt; i++, c++) {
+	for (i = 0, c = console_cmdline;
+	     i < MAX_CMDLINECONSOLES && c->name[0];
+	     i++, c++) {
 		if (strcmp(c->name, name) == 0 && c->index == idx) {
-			if (brl_options)
-				return 0;
-
-			/*
-			 * Maintain an invariant that will help to find if
-			 * the matching console is preferred, see
-			 * register_console():
-			 *
-			 * The last non-braille console is always
-			 * the preferred one.
-			 */
-			if (i != console_cmdline_cnt - 1)
-				swap(console_cmdline[i],
-				     console_cmdline[console_cmdline_cnt - 1]);
-
-			preferred_console = console_cmdline_cnt - 1;
-
+			if (!brl_options)
+				preferred_console = i;
 			return 0;
 		}
 	}
@@ -1937,7 +1923,6 @@ static int __add_preferred_console(char *name, int idx, char *options,
 	braille_set_options(c, brl_options);
 
 	c->index = idx;
-	console_cmdline_cnt++;
 	return 0;
 }
 /*
@@ -2477,23 +2462,12 @@ void register_console(struct console *newcon)
 	}
 
 	/*
-	 * See if this console matches one we selected on the command line.
-	 *
-	 * There may be several entries in the console_cmdline array matching
-	 * with the same console, one with newcon->match(), another by
-	 * name/index:
-	 *
-	 *	pl011,mmio,0x87e024000000,115200 -- added from SPCR
-	 *	ttyAMA0 -- added from command line
-	 *
-	 * Traverse the console_cmdline array in reverse order to be
-	 * sure that if this console is preferred then it will be the first
-	 * matching entry.  We use the invariant that is maintained in
-	 * __add_preferred_console().
+	 *	See if this console matches one we selected on
+	 *	the command line.
 	 */
-	for (i = console_cmdline_cnt - 1; i >= 0; i--) {
-		c = console_cmdline + i;
-
+	for (i = 0, c = console_cmdline;
+	     i < MAX_CMDLINECONSOLES && c->name[0];
+	     i++, c++) {
 		if (!newcon->match ||
 		    newcon->match(newcon, c->name, c->index, c->options) != 0) {
 			/* default matching */
-- 
cgit v1.2.3


From cdf7abc4610a7f1c43d06cda246c5f748a4fd267 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini@redhat.com>
Date: Wed, 31 May 2017 14:03:10 +0200
Subject: srcu: Allow use of Tiny/Tree SRCU from both process and interrupt
 context

Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting
down a guest running iperf on a VFIO assigned device.  This happens
because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt
context, while a worker thread does the same inside kvm_set_irq().  If the
interrupt happens while the worker thread is executing __srcu_read_lock(),
updates to the Classic SRCU ->lock_count[] field or the Tree SRCU
->srcu_lock_count[] field can be lost.

The docs say you are not supposed to call srcu_read_lock() and
srcu_read_unlock() from irq context, but KVM interrupt injection happens
from (host) interrupt context and it would be nice if SRCU supported the
use case.  KVM is using SRCU here not really for the "sleepable" part,
but rather due to its IPI-free fast detection of grace periods.  It is
therefore not desirable to switch back to RCU, which would effectively
revert commit 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING",
2014-01-16).

However, the docs are overly conservative.  You can have an SRCU instance
only has users in irq context, and you can mix process and irq context
as long as process context users disable interrupts.  In addition,
__srcu_read_unlock() actually uses this_cpu_dec() on both Tree SRCU and
Classic SRCU.  For those two implementations, only srcu_read_lock()
is unsafe.

When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(),
in commit 5a41344a3d83 ("srcu: Simplify __srcu_read_unlock() via
this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments.
Therefore it kept __this_cpu_inc(), with preempt_disable/enable in
the caller.  Tree SRCU however only does one increment, so on most
architectures it is more efficient for __srcu_read_lock() to use
this_cpu_inc(), and any performance differences appear to be down in
the noise.

Unlike Classic and Tree SRCU, Tiny SRCU does increments and decrements on
a single variable.  Therefore, as Peter Zijlstra pointed out, Tiny SRCU's
implementation already supports mixed-context use of srcu_read_lock()
and srcu_read_unlock(), at least as long as uses of srcu_read_lock()
and srcu_read_unlock() in each handler are nested and paired properly.
In other words, it is still illegal to (say) invoke srcu_read_lock()
in an interrupt handler and to invoke the matching srcu_read_unlock()
in a softirq handler.  Therefore, the only change required for Tiny SRCU
is to its comments.

Fixes: 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING")
Reported-by: Linu Cherian <linuc.decode@gmail.com>
Suggested-by: Linu Cherian <linuc.decode@gmail.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paolo Bonzini <pbonzini@redhat.com>
---
 kernel/rcu/srcutiny.c | 7 ++++---
 kernel/rcu/srcutree.c | 5 ++---
 2 files changed, 6 insertions(+), 6 deletions(-)

(limited to 'kernel')

diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c
index 36e1f82faed1..32798eb14853 100644
--- a/kernel/rcu/srcutiny.c
+++ b/kernel/rcu/srcutiny.c
@@ -97,8 +97,9 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct);
 
 /*
  * Counts the new reader in the appropriate per-CPU element of the
- * srcu_struct.  Must be called from process context.
- * Returns an index that must be passed to the matching srcu_read_unlock().
+ * srcu_struct.  Can be invoked from irq/bh handlers, but the matching
+ * __srcu_read_unlock() must be in the same handler instance.  Returns an
+ * index that must be passed to the matching srcu_read_unlock().
  */
 int __srcu_read_lock(struct srcu_struct *sp)
 {
@@ -112,7 +113,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock);
 
 /*
  * Removes the count for the old reader from the appropriate element of
- * the srcu_struct.  Must be called from process context.
+ * the srcu_struct.
  */
 void __srcu_read_unlock(struct srcu_struct *sp, int idx)
 {
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 3ae8474557df..157654fa436a 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -357,7 +357,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct);
 
 /*
  * Counts the new reader in the appropriate per-CPU element of the
- * srcu_struct.  Must be called from process context.
+ * srcu_struct.
  * Returns an index that must be passed to the matching srcu_read_unlock().
  */
 int __srcu_read_lock(struct srcu_struct *sp)
@@ -365,7 +365,7 @@ int __srcu_read_lock(struct srcu_struct *sp)
 	int idx;
 
 	idx = READ_ONCE(sp->srcu_idx) & 0x1;
-	__this_cpu_inc(sp->sda->srcu_lock_count[idx]);
+	this_cpu_inc(sp->sda->srcu_lock_count[idx]);
 	smp_mb(); /* B */  /* Avoid leaking the critical section. */
 	return idx;
 }
@@ -375,7 +375,6 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock);
  * Removes the count for the old reader from the appropriate per-CPU
  * element of the srcu_struct.  Note that this may well be a different
  * CPU than that which was incremented by the corresponding srcu_read_lock().
- * Must be called from process context.
  */
 void __srcu_read_unlock(struct srcu_struct *sp, int idx)
 {
-- 
cgit v1.2.3


From 1123a6041654e8f889014659593bad4168e542c2 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini@redhat.com>
Date: Wed, 31 May 2017 14:03:11 +0200
Subject: srcu: Allow use of Classic SRCU from both process and interrupt
 context

Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting
down a guest running iperf on a VFIO assigned device.  This happens
because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt
context, while a worker thread does the same inside kvm_set_irq().  If the
interrupt happens while the worker thread is executing __srcu_read_lock(),
updates to the Classic SRCU ->lock_count[] field or the Tree SRCU
->srcu_lock_count[] field can be lost.

The docs say you are not supposed to call srcu_read_lock() and
srcu_read_unlock() from irq context, but KVM interrupt injection happens
from (host) interrupt context and it would be nice if SRCU supported the
use case.  KVM is using SRCU here not really for the "sleepable" part,
but rather due to its IPI-free fast detection of grace periods.  It is
therefore not desirable to switch back to RCU, which would effectively
revert commit 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING",
2014-01-16).

However, the docs are overly conservative.  You can have an SRCU instance
only has users in irq context, and you can mix process and irq context
as long as process context users disable interrupts.  In addition,
__srcu_read_unlock() actually uses this_cpu_dec() on both Tree SRCU and
Classic SRCU.  For those two implementations, only srcu_read_lock()
is unsafe.

When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(),
in commit 5a41344a3d83 ("srcu: Simplify __srcu_read_unlock() via
this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments.
Therefore it kept __this_cpu_inc(), with preempt_disable/enable in
the caller.  Tree SRCU however only does one increment, so on most
architectures it is more efficient for __srcu_read_lock() to use
this_cpu_inc(), and any performance differences appear to be down in
the noise.

Cc: stable@vger.kernel.org
Fixes: 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING")
Reported-by: Linu Cherian <linuc.decode@gmail.com>
Suggested-by: Linu Cherian <linuc.decode@gmail.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/srcu.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

(limited to 'kernel')

diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
index 584d8a983883..dea03614263f 100644
--- a/kernel/rcu/srcu.c
+++ b/kernel/rcu/srcu.c
@@ -263,7 +263,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct);
 
 /*
  * Counts the new reader in the appropriate per-CPU element of the
- * srcu_struct.  Must be called from process context.
+ * srcu_struct.
  * Returns an index that must be passed to the matching srcu_read_unlock().
  */
 int __srcu_read_lock(struct srcu_struct *sp)
@@ -271,7 +271,7 @@ int __srcu_read_lock(struct srcu_struct *sp)
 	int idx;
 
 	idx = READ_ONCE(sp->completed) & 0x1;
-	__this_cpu_inc(sp->per_cpu_ref->lock_count[idx]);
+	this_cpu_inc(sp->per_cpu_ref->lock_count[idx]);
 	smp_mb(); /* B */  /* Avoid leaking the critical section. */
 	return idx;
 }
@@ -281,7 +281,6 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock);
  * Removes the count for the old reader from the appropriate per-CPU
  * element of the srcu_struct.  Note that this may well be a different
  * CPU than that which was incremented by the corresponding srcu_read_lock().
- * Must be called from process context.
  */
 void __srcu_read_unlock(struct srcu_struct *sp, int idx)
 {
-- 
cgit v1.2.3


From f67abed585efe251edda52dc9690020d6441890f Mon Sep 17 00:00:00 2001
From: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
Date: Fri, 9 Jun 2017 10:00:29 +0200
Subject: sched/fair: Fix typo in printk message

'schedstats' kernel parameter should be set to enable/disable, so
correct the printk hint saying that it should be set to 'enable'
rather than 'enabled' to enable scheduler tracepoints.

Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1496995229-31245-1-git-send-email-marcin.nowakowski@imgtec.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'kernel')

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d71109321841..c77e4b1d51c0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3563,7 +3563,7 @@ static inline void check_schedstat_required(void)
 			trace_sched_stat_runtime_enabled())  {
 		printk_deferred_once("Scheduler tracepoints stat_sleep, stat_iowait, "
 			     "stat_blocked and stat_runtime require the "
-			     "kernel parameter schedstats=enabled or "
+			     "kernel parameter schedstats=enable or "
 			     "kernel.sched_schedstats=1\n");
 	}
 #endif
-- 
cgit v1.2.3


From 252d2a4117bc181b287eeddf848863788da733ae Mon Sep 17 00:00:00 2001
From: Andy Lutomirski <luto@kernel.org>
Date: Fri, 9 Jun 2017 11:49:15 -0700
Subject: sched/core: Idle_task_exit() shouldn't use switch_mm_irqs_off()

idle_task_exit() can be called with IRQs on x86 on and therefore
should use switch_mm(), not switch_mm_irqs_off().

This doesn't seem to cause any problems right now, but it will
confuse my upcoming TLB flush changes.  Nonetheless, I think it
should be backported because it's trivial.  There won't be any
meaningful performance impact because idle_task_exit() is only
used when offlining a CPU.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: f98db6013c55 ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler")
Link: http://lkml.kernel.org/r/ca3d1a9fa93a0b49f5a8ff729eda3640fb6abdf9.1497034141.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'kernel')

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 803c3bc274c4..326d4f88e2b1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5605,7 +5605,7 @@ void idle_task_exit(void)
 	BUG_ON(cpu_online(smp_processor_id()));
 
 	if (mm != &init_mm) {
-		switch_mm_irqs_off(mm, &init_mm, current);
+		switch_mm(mm, &init_mm, current);
 		finish_arch_post_lock_switch();
 	}
 	mmdrop(mm);
-- 
cgit v1.2.3


From ff0a6d6f932ff4b9eb8e8140f98cc1cf763d0d78 Mon Sep 17 00:00:00 2001
From: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Date: Mon, 12 Jun 2017 14:16:16 +0200
Subject: Revert "cpufreq: schedutil: Reduce frequencies slower"

Revert commit 39b64aa1c007 (cpufreq: schedutil: Reduce frequencies
slower) that introduced unintentional changes in behavior leading
to adverse effects on some systems.

Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 kernel/sched/cpufreq_schedutil.c | 3 ---
 1 file changed, 3 deletions(-)

(limited to 'kernel')

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 76877a62b5fa..8773d1efdfab 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -101,9 +101,6 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
 	if (sg_policy->next_freq == next_freq)
 		return;
 
-	if (sg_policy->next_freq > next_freq)
-		next_freq = (sg_policy->next_freq + next_freq) >> 1;
-
 	sg_policy->next_freq = next_freq;
 	sg_policy->last_freq_update_time = time;
 
-- 
cgit v1.2.3


From 94114c367553f3301747e47f6947cabde947575f Mon Sep 17 00:00:00 2001
From: Stephen Boyd <sboyd@codeaurora.org>
Date: Wed, 7 Jun 2017 23:36:03 -0700
Subject: tick/broadcast: Make tick_broadcast_setup_oneshot() static

This function isn't used outside of tick-broadcast.c, so let's
mark it static.

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Link: http://lkml.kernel.org/r/20170608063603.13276-1-sboyd@codeaurora.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/tick-broadcast.c | 4 +++-
 kernel/time/tick-internal.h  | 2 --
 2 files changed, 3 insertions(+), 3 deletions(-)

(limited to 'kernel')

diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 987e496bb51a..b398c2ea69b2 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -37,9 +37,11 @@ static int tick_broadcast_forced;
 static __cacheline_aligned_in_smp DEFINE_RAW_SPINLOCK(tick_broadcast_lock);
 
 #ifdef CONFIG_TICK_ONESHOT
+static void tick_broadcast_setup_oneshot(struct clock_event_device *bc);
 static void tick_broadcast_clear_oneshot(int cpu);
 static void tick_resume_broadcast_oneshot(struct clock_event_device *bc);
 #else
+static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc) { BUG(); }
 static inline void tick_broadcast_clear_oneshot(int cpu) { }
 static inline void tick_resume_broadcast_oneshot(struct clock_event_device *bc) { }
 #endif
@@ -867,7 +869,7 @@ static void tick_broadcast_init_next_event(struct cpumask *mask,
 /**
  * tick_broadcast_setup_oneshot - setup the broadcast device
  */
-void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
+static void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
 {
 	int cpu = smp_processor_id();
 
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index f738251000fe..be0ac01f2e12 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -126,7 +126,6 @@ static inline int tick_check_oneshot_change(int allow_nohz) { return 0; }
 
 /* Functions related to oneshot broadcasting */
 #if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_TICK_ONESHOT)
-extern void tick_broadcast_setup_oneshot(struct clock_event_device *bc);
 extern void tick_broadcast_switch_to_oneshot(void);
 extern void tick_shutdown_broadcast_oneshot(unsigned int cpu);
 extern int tick_broadcast_oneshot_active(void);
@@ -134,7 +133,6 @@ extern void tick_check_oneshot_broadcast_this_cpu(void);
 bool tick_broadcast_oneshot_available(void);
 extern struct cpumask *tick_get_broadcast_oneshot_mask(void);
 #else /* !(BROADCAST && ONESHOT): */
-static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc) { BUG(); }
 static inline void tick_broadcast_switch_to_oneshot(void) { }
 static inline void tick_shutdown_broadcast_oneshot(unsigned int cpu) { }
 static inline int tick_broadcast_oneshot_active(void) { return 0; }
-- 
cgit v1.2.3


From fa07ab72cbb0d843429e61bf179308aed6cbe0dd Mon Sep 17 00:00:00 2001
From: Heiner Kallweit <hkallweit1@gmail.com>
Date: Sun, 11 Jun 2017 00:38:36 +0200
Subject: genirq: Release resources in __setup_irq() error path

In case __irq_set_trigger() fails the resources requested via
irq_request_resources() are not released.

Add the missing release call into the error handling path.

Fixes: c1bacbae8192 ("genirq: Provide irq_request/release_resources chip callbacks")
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/655538f5-cb20-a892-ff15-fbd2dd1fa4ec@gmail.com
---
 kernel/irq/manage.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

(limited to 'kernel')

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 070be980c37a..425170d4439b 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1312,8 +1312,10 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
 			ret = __irq_set_trigger(desc,
 						new->flags & IRQF_TRIGGER_MASK);
 
-			if (ret)
+			if (ret) {
+				irq_release_resources(desc);
 				goto out_mask;
+			}
 		}
 
 		desc->istate &= ~(IRQS_AUTODETECT | IRQS_SPURIOUS_DISABLED | \
-- 
cgit v1.2.3


From 57db7e4a2d92c2d3dfbca4ef8057849b2682436b Mon Sep 17 00:00:00 2001
From: "Eric W. Biederman" <ebiederm@xmission.com>
Date: Tue, 13 Jun 2017 04:31:16 -0500
Subject: signal: Only reschedule timers on signals timers have sent

Thomas Gleixner  wrote:
> The CRIU support added a 'feature' which allows a user space task to send
> arbitrary (kernel) signals to itself. The changelog says:
>
>   The kernel prevents sending of siginfo with positive si_code, because
>   these codes are reserved for kernel.  I think we can allow a task to
>   send such a siginfo to itself.  This operation should not be dangerous.
>
> Quite contrary to that claim, it turns out that it is outright dangerous
> for signals with info->si_code == SI_TIMER. The following code sequence in
> a user space task allows to crash the kernel:
>
>    id = timer_create(CLOCK_XXX, ..... signo = SIGX);
>    timer_set(id, ....);
>    info->si_signo = SIGX;
>    info->si_code = SI_TIMER:
>    info->_sifields._timer._tid = id;
>    info->_sifields._timer._sys_private = 2;
>    rt_[tg]sigqueueinfo(..., SIGX, info);
>    sigemptyset(&sigset);
>    sigaddset(&sigset, SIGX);
>    rt_sigtimedwait(sigset, info);
>
> For timers based on CLOCK_PROCESS_CPUTIME_ID, CLOCK_THREAD_CPUTIME_ID this
> results in a kernel crash because sigwait() dequeues the signal and the
> dequeue code observes:
>
>   info->si_code == SI_TIMER && info->_sifields._timer._sys_private != 0
>
> which triggers the following callchain:
>
>  do_schedule_next_timer() -> posix_cpu_timer_schedule() -> arm_timer()
>
> arm_timer() executes a list_add() on the timer, which is already armed via
> the timer_set() syscall. That's a double list add which corrupts the posix
> cpu timer list. As a consequence the kernel crashes on the next operation
> touching the posix cpu timer list.
>
> Posix clocks which are internally implemented based on hrtimers are not
> affected by this because hrtimer_start() can handle already armed timers
> nicely, but it's a reliable way to trigger the WARN_ON() in
> hrtimer_forward(), which complains about calling that function on an
> already armed timer.

This problem has existed since the posix timer code was merged into
2.5.63. A few releases earlier in 2.5.60 ptrace gained the ability to
inject not just a signal (which linux has supported since 1.0) but the
full siginfo of a signal.

The core problem is that the code will reschedule in response to
signals getting dequeued not just for signals the timers sent but
for other signals that happen to a si_code of SI_TIMER.

Avoid this confusion by testing to see if the queued signal was
preallocated as all timer signals are preallocated, and so far
only the timer code preallocates signals.

Move the check for if a timer needs to be rescheduled up into
collect_signal where the preallocation check must be performed,
and pass the result back to dequeue_signal where the code reschedules
timers.   This makes it clear why the code cares about preallocated
timers.

Cc: stable@vger.kernel.org
Reported-by: Thomas Gleixner <tglx@linutronix.de>
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Reference: 66dd34ad31e5 ("signal: allow to send any siginfo to itself")
Reference: 1669ce53e2ff ("Add PTRACE_GETSIGINFO and PTRACE_SETSIGINFO")
Fixes: db8b50ba75f2 ("[PATCH] POSIX clocks & timers")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/signal.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

(limited to 'kernel')

diff --git a/kernel/signal.c b/kernel/signal.c
index ca92bcfeb322..45b4c1ffe14e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -510,7 +510,8 @@ int unhandled_signal(struct task_struct *tsk, int sig)
 	return !tsk->ptrace;
 }
 
-static void collect_signal(int sig, struct sigpending *list, siginfo_t *info)
+static void collect_signal(int sig, struct sigpending *list, siginfo_t *info,
+			   bool *resched_timer)
 {
 	struct sigqueue *q, *first = NULL;
 
@@ -532,6 +533,12 @@ static void collect_signal(int sig, struct sigpending *list, siginfo_t *info)
 still_pending:
 		list_del_init(&first->list);
 		copy_siginfo(info, &first->info);
+
+		*resched_timer =
+			(first->flags & SIGQUEUE_PREALLOC) &&
+			(info->si_code == SI_TIMER) &&
+			(info->si_sys_private);
+
 		__sigqueue_free(first);
 	} else {
 		/*
@@ -548,12 +555,12 @@ still_pending:
 }
 
 static int __dequeue_signal(struct sigpending *pending, sigset_t *mask,
-			siginfo_t *info)
+			siginfo_t *info, bool *resched_timer)
 {
 	int sig = next_signal(pending, mask);
 
 	if (sig)
-		collect_signal(sig, pending, info);
+		collect_signal(sig, pending, info, resched_timer);
 	return sig;
 }
 
@@ -565,15 +572,16 @@ static int __dequeue_signal(struct sigpending *pending, sigset_t *mask,
  */
 int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
 {
+	bool resched_timer = false;
 	int signr;
 
 	/* We only dequeue private signals from ourselves, we don't let
 	 * signalfd steal them
 	 */
-	signr = __dequeue_signal(&tsk->pending, mask, info);
+	signr = __dequeue_signal(&tsk->pending, mask, info, &resched_timer);
 	if (!signr) {
 		signr = __dequeue_signal(&tsk->signal->shared_pending,
-					 mask, info);
+					 mask, info, &resched_timer);
 #ifdef CONFIG_POSIX_TIMERS
 		/*
 		 * itimer signal ?
@@ -621,7 +629,7 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
 		current->jobctl |= JOBCTL_STOP_DEQUEUED;
 	}
 #ifdef CONFIG_POSIX_TIMERS
-	if ((info->si_code & __SI_MASK) == __SI_TIMER && info->si_sys_private) {
+	if (resched_timer) {
 		/*
 		 * Release the siglock to ensure proper locking order
 		 * of timer locks outside of siglocks.  Note, we leave
-- 
cgit v1.2.3


From ceea5e3771ed2378668455fa21861bead7504df5 Mon Sep 17 00:00:00 2001
From: John Stultz <john.stultz@linaro.org>
Date: Thu, 8 Jun 2017 16:44:20 -0700
Subject: time: Fix clock->read(clock) race around clocksource changes

In tests, which excercise switching of clocksources, a NULL
pointer dereference can be observed on AMR64 platforms in the
clocksource read() function:

u64 clocksource_mmio_readl_down(struct clocksource *c)
{
	return ~(u64)readl_relaxed(to_mmio_clksrc(c)->reg) & c->mask;
}

This is called from the core timekeeping code via:

	cycle_now = tkr->read(tkr->clock);

tkr->read is the cached tkr->clock->read() function pointer.
When the clocksource is changed then tkr->clock and tkr->read
are updated sequentially. The code above results in a sequential
load operation of tkr->read and tkr->clock as well.

If the store to tkr->clock hits between the loads of tkr->read
and tkr->clock, then the old read() function is called with the
new clock pointer. As a consequence the read() function
dereferences a different data structure and the resulting 'reg'
pointer can point anywhere including NULL.

This problem was introduced when the timekeeping code was
switched over to use struct tk_read_base. Before that, it was
theoretically possible as well when the compiler decided to
reload clock in the code sequence:

     now = tk->clock->read(tk->clock);

Add a helper function which avoids the issue by reading
tk_read_base->clock once into a local variable clk and then issue
the read function via clk->read(clk). This guarantees that the
read() function always gets the proper clocksource pointer handed
in.

Since there is now no use for the tkr.read pointer, this patch
also removes it, and to address stopping the fast timekeeper
during suspend/resume, it introduces a dummy clocksource to use
rather then just a dummy read function.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Cc: stable <stable@vger.kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Daniel Mentz <danielmentz@google.com>
Link: http://lkml.kernel.org/r/1496965462-20003-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/timekeeping.c | 52 ++++++++++++++++++++++++++++++++---------------
 1 file changed, 36 insertions(+), 16 deletions(-)

(limited to 'kernel')

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 9652bc57fd09..eff94cb8e89e 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -118,6 +118,26 @@ static inline void tk_update_sleep_time(struct timekeeper *tk, ktime_t delta)
 	tk->offs_boot = ktime_add(tk->offs_boot, delta);
 }
 
+/*
+ * tk_clock_read - atomic clocksource read() helper
+ *
+ * This helper is necessary to use in the read paths because, while the
+ * seqlock ensures we don't return a bad value while structures are updated,
+ * it doesn't protect from potential crashes. There is the possibility that
+ * the tkr's clocksource may change between the read reference, and the
+ * clock reference passed to the read function.  This can cause crashes if
+ * the wrong clocksource is passed to the wrong read function.
+ * This isn't necessary to use when holding the timekeeper_lock or doing
+ * a read of the fast-timekeeper tkrs (which is protected by its own locking
+ * and update logic).
+ */
+static inline u64 tk_clock_read(struct tk_read_base *tkr)
+{
+	struct clocksource *clock = READ_ONCE(tkr->clock);
+
+	return clock->read(clock);
+}
+
 #ifdef CONFIG_DEBUG_TIMEKEEPING
 #define WARNING_FREQ (HZ*300) /* 5 minute rate-limiting */
 
@@ -175,7 +195,7 @@ static inline u64 timekeeping_get_delta(struct tk_read_base *tkr)
 	 */
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
-		now = tkr->read(tkr->clock);
+		now = tk_clock_read(tkr);
 		last = tkr->cycle_last;
 		mask = tkr->mask;
 		max = tkr->clock->max_cycles;
@@ -209,7 +229,7 @@ static inline u64 timekeeping_get_delta(struct tk_read_base *tkr)
 	u64 cycle_now, delta;
 
 	/* read clocksource */
-	cycle_now = tkr->read(tkr->clock);
+	cycle_now = tk_clock_read(tkr);
 
 	/* calculate the delta since the last update_wall_time */
 	delta = clocksource_delta(cycle_now, tkr->cycle_last, tkr->mask);
@@ -238,12 +258,10 @@ static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock)
 	++tk->cs_was_changed_seq;
 	old_clock = tk->tkr_mono.clock;
 	tk->tkr_mono.clock = clock;
-	tk->tkr_mono.read = clock->read;
 	tk->tkr_mono.mask = clock->mask;
-	tk->tkr_mono.cycle_last = tk->tkr_mono.read(clock);
+	tk->tkr_mono.cycle_last = tk_clock_read(&tk->tkr_mono);
 
 	tk->tkr_raw.clock = clock;
-	tk->tkr_raw.read = clock->read;
 	tk->tkr_raw.mask = clock->mask;
 	tk->tkr_raw.cycle_last = tk->tkr_mono.cycle_last;
 
@@ -404,7 +422,7 @@ static __always_inline u64 __ktime_get_fast_ns(struct tk_fast *tkf)
 
 		now += timekeeping_delta_to_ns(tkr,
 				clocksource_delta(
-					tkr->read(tkr->clock),
+					tk_clock_read(tkr),
 					tkr->cycle_last,
 					tkr->mask));
 	} while (read_seqcount_retry(&tkf->seq, seq));
@@ -461,6 +479,10 @@ static u64 dummy_clock_read(struct clocksource *cs)
 	return cycles_at_suspend;
 }
 
+static struct clocksource dummy_clock = {
+	.read = dummy_clock_read,
+};
+
 /**
  * halt_fast_timekeeper - Prevent fast timekeeper from accessing clocksource.
  * @tk: Timekeeper to snapshot.
@@ -477,13 +499,13 @@ static void halt_fast_timekeeper(struct timekeeper *tk)
 	struct tk_read_base *tkr = &tk->tkr_mono;
 
 	memcpy(&tkr_dummy, tkr, sizeof(tkr_dummy));
-	cycles_at_suspend = tkr->read(tkr->clock);
-	tkr_dummy.read = dummy_clock_read;
+	cycles_at_suspend = tk_clock_read(tkr);
+	tkr_dummy.clock = &dummy_clock;
 	update_fast_timekeeper(&tkr_dummy, &tk_fast_mono);
 
 	tkr = &tk->tkr_raw;
 	memcpy(&tkr_dummy, tkr, sizeof(tkr_dummy));
-	tkr_dummy.read = dummy_clock_read;
+	tkr_dummy.clock = &dummy_clock;
 	update_fast_timekeeper(&tkr_dummy, &tk_fast_raw);
 }
 
@@ -649,11 +671,10 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action)
  */
 static void timekeeping_forward_now(struct timekeeper *tk)
 {
-	struct clocksource *clock = tk->tkr_mono.clock;
 	u64 cycle_now, delta;
 	u64 nsec;
 
-	cycle_now = tk->tkr_mono.read(clock);
+	cycle_now = tk_clock_read(&tk->tkr_mono);
 	delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask);
 	tk->tkr_mono.cycle_last = cycle_now;
 	tk->tkr_raw.cycle_last  = cycle_now;
@@ -929,8 +950,7 @@ void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot)
 
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
-
-		now = tk->tkr_mono.read(tk->tkr_mono.clock);
+		now = tk_clock_read(&tk->tkr_mono);
 		systime_snapshot->cs_was_changed_seq = tk->cs_was_changed_seq;
 		systime_snapshot->clock_was_set_seq = tk->clock_was_set_seq;
 		base_real = ktime_add(tk->tkr_mono.base,
@@ -1108,7 +1128,7 @@ int get_device_system_crosststamp(int (*get_time_fn)
 		 * Check whether the system counter value provided by the
 		 * device driver is on the current timekeeping interval.
 		 */
-		now = tk->tkr_mono.read(tk->tkr_mono.clock);
+		now = tk_clock_read(&tk->tkr_mono);
 		interval_start = tk->tkr_mono.cycle_last;
 		if (!cycle_between(interval_start, cycles, now)) {
 			clock_was_set_seq = tk->clock_was_set_seq;
@@ -1629,7 +1649,7 @@ void timekeeping_resume(void)
 	 * The less preferred source will only be tried if there is no better
 	 * usable source. The rtc part is handled separately in rtc core code.
 	 */
-	cycle_now = tk->tkr_mono.read(clock);
+	cycle_now = tk_clock_read(&tk->tkr_mono);
 	if ((clock->flags & CLOCK_SOURCE_SUSPEND_NONSTOP) &&
 		cycle_now > tk->tkr_mono.cycle_last) {
 		u64 nsec, cyc_delta;
@@ -2030,7 +2050,7 @@ void update_wall_time(void)
 #ifdef CONFIG_ARCH_USES_GETTIMEOFFSET
 	offset = real_tk->cycle_interval;
 #else
-	offset = clocksource_delta(tk->tkr_mono.read(tk->tkr_mono.clock),
+	offset = clocksource_delta(tk_clock_read(&tk->tkr_mono),
 				   tk->tkr_mono.cycle_last, tk->tkr_mono.mask);
 #endif
 
-- 
cgit v1.2.3


From 3d88d56c5873f6eebe23e05c3da701960146b801 Mon Sep 17 00:00:00 2001
From: John Stultz <john.stultz@linaro.org>
Date: Thu, 8 Jun 2017 16:44:21 -0700
Subject: time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accounting

Due to how the MONOTONIC_RAW accumulation logic was handled,
there is the potential for a 1ns discontinuity when we do
accumulations. This small discontinuity has for the most part
gone un-noticed, but since ARM64 enabled CLOCK_MONOTONIC_RAW
in their vDSO clock_gettime implementation, we've seen failures
with the inconsistency-check test in kselftest.

This patch addresses the issue by using the same sub-ns
accumulation handling that CLOCK_MONOTONIC uses, which avoids
the issue for in-kernel users.

Since the ARM64 vDSO implementation has its own clock_gettime
calculation logic, this patch reduces the frequency of errors,
but failures are still seen. The ARM64 vDSO will need to be
updated to include the sub-nanosecond xtime_nsec values in its
calculation for this issue to be completely fixed.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Tested-by: Daniel Mentz <danielmentz@google.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "stable #4 . 8+" <stable@vger.kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Link: http://lkml.kernel.org/r/1496965462-20003-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/timekeeping.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

(limited to 'kernel')

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index eff94cb8e89e..b602c48cb841 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -280,7 +280,7 @@ static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock)
 	/* Go back from cycles -> shifted ns */
 	tk->xtime_interval = interval * clock->mult;
 	tk->xtime_remainder = ntpinterval - tk->xtime_interval;
-	tk->raw_interval = (interval * clock->mult) >> clock->shift;
+	tk->raw_interval = interval * clock->mult;
 
 	 /* if changing clocks, convert xtime_nsec shift units */
 	if (old_clock) {
@@ -1996,7 +1996,7 @@ static u64 logarithmic_accumulation(struct timekeeper *tk, u64 offset,
 				    u32 shift, unsigned int *clock_set)
 {
 	u64 interval = tk->cycle_interval << shift;
-	u64 raw_nsecs;
+	u64 snsec_per_sec;
 
 	/* If the offset is smaller than a shifted interval, do nothing */
 	if (offset < interval)
@@ -2011,14 +2011,15 @@ static u64 logarithmic_accumulation(struct timekeeper *tk, u64 offset,
 	*clock_set |= accumulate_nsecs_to_secs(tk);
 
 	/* Accumulate raw time */
-	raw_nsecs = (u64)tk->raw_interval << shift;
-	raw_nsecs += tk->raw_time.tv_nsec;
-	if (raw_nsecs >= NSEC_PER_SEC) {
-		u64 raw_secs = raw_nsecs;
-		raw_nsecs = do_div(raw_secs, NSEC_PER_SEC);
-		tk->raw_time.tv_sec += raw_secs;
+	tk->tkr_raw.xtime_nsec += (u64)tk->raw_time.tv_nsec << tk->tkr_raw.shift;
+	tk->tkr_raw.xtime_nsec += tk->raw_interval << shift;
+	snsec_per_sec = (u64)NSEC_PER_SEC << tk->tkr_raw.shift;
+	while (tk->tkr_raw.xtime_nsec >= snsec_per_sec) {
+		tk->tkr_raw.xtime_nsec -= snsec_per_sec;
+		tk->raw_time.tv_sec++;
 	}
-	tk->raw_time.tv_nsec = raw_nsecs;
+	tk->raw_time.tv_nsec = tk->tkr_raw.xtime_nsec >> tk->tkr_raw.shift;
+	tk->tkr_raw.xtime_nsec -= (u64)tk->raw_time.tv_nsec << tk->tkr_raw.shift;
 
 	/* Accumulate error between NTP and clock interval */
 	tk->ntp_error += tk->ntp_tick << shift;
-- 
cgit v1.2.3


From 842c08846420baa619fe3cb8c9af538efdb89428 Mon Sep 17 00:00:00 2001
From: Petr Mladek <pmladek@suse.com>
Date: Wed, 14 Jun 2017 10:54:52 +0200
Subject: livepatch: Fix stacking of patches with respect to RCU

rcu_read_(un)lock(), list_*_rcu(), and synchronize_rcu() are used for a secure
access and manipulation of the list of patches that modify the same function.
In particular, it is the variable func_stack that is accessible from the ftrace
handler via struct ftrace_ops and klp_ops.

Of course, it synchronizes also some states of the patch on the top of the
stack, e.g. func->transition in klp_ftrace_handler.

At the same time, this mechanism guards also the manipulation of
task->patch_state. It is modified according to the state of the transition and
the state of the process.

Now, all this works well as long as RCU works well. Sadly livepatching might
get into some corner cases when this is not true. For example, RCU is not
watching when rcu_read_lock() is taken in idle threads.  It is because they
might sleep and prevent reaching the grace period for too long.

There are ways how to make RCU watching even in idle threads, see
rcu_irq_enter(). But there is a small location inside RCU infrastructure when
even this does not work.

This small problematic location can be detected either before calling
rcu_irq_enter() by rcu_irq_enter_disabled() or later by rcu_is_watching().
Sadly, there is no safe way how to handle it.  Once we detect that RCU was not
watching, we might see inconsistent state of the function stack and the related
variables in klp_ftrace_handler(). Then we could do a wrong decision, use an
incompatible implementation of the function and break the consistency of the
system. We could warn but we could not avoid the damage.

Fortunately, ftrace has similar problems and they seem to be solved well there.
It uses a heavy weight implementation of some RCU operations. In particular, it
replaces:

  + rcu_read_lock() with preempt_disable_notrace()
  + rcu_read_unlock() with preempt_enable_notrace()
  + synchronize_rcu() with schedule_on_each_cpu(sync_work)

My understanding is that this is RCU implementation from a stone age. It meets
the core RCU requirements but it is rather ineffective. Especially, it does not
allow to batch or speed up the synchronize calls.

On the other hand, it is very trivial. It allows to safely trace and/or
livepatch even the RCU core infrastructure.  And the effectiveness is a not a
big issue because using ftrace or livepatches on productive systems is a rare
operation.  The safety is much more important than a negligible extra load.

Note that the alternative implementation follows the RCU principles. Therefore,
     we could and actually must use list_*_rcu() variants when manipulating the
     func_stack.  These functions allow to access the pointers in the right
     order and with the right barriers. But they do not use any other
     information that would be set only by rcu_read_lock().

Also note that there are actually two problems solved in ftrace:

First, it cares about the consistency of RCU read sections.  It is being solved
the way as described and used in this patch.

Second, ftrace needs to make sure that nobody is inside the dynamic trampoline
when it is being freed. For this, it also calls synchronize_rcu_tasks() in
preemptive kernel in ftrace_shutdown().

Livepatch has similar problem but it is solved by ftrace for free.
klp_ftrace_handler() is a good guy and never sleeps. In addition, it is
registered with FTRACE_OPS_FL_DYNAMIC. It causes that
unregister_ftrace_function() calls:

	* schedule_on_each_cpu(ftrace_sync) - always
	* synchronize_rcu_tasks() - in preemptive kernel

The effect is that nobody is neither inside the dynamic trampoline nor inside
the ftrace handler after unregister_ftrace_function() returns.

[jkosina@suse.cz: reformat changelog, fix comment]
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
---
 kernel/livepatch/patch.c      |  8 ++++++--
 kernel/livepatch/transition.c | 36 +++++++++++++++++++++++++++++++-----
 2 files changed, 37 insertions(+), 7 deletions(-)

(limited to 'kernel')

diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
index f8269036bf0b..52c4e907c14b 100644
--- a/kernel/livepatch/patch.c
+++ b/kernel/livepatch/patch.c
@@ -59,7 +59,11 @@ static void notrace klp_ftrace_handler(unsigned long ip,
 
 	ops = container_of(fops, struct klp_ops, fops);
 
-	rcu_read_lock();
+	/*
+	 * A variant of synchronize_sched() is used to allow patching functions
+	 * where RCU is not watching, see klp_synchronize_transition().
+	 */
+	preempt_disable_notrace();
 
 	func = list_first_or_null_rcu(&ops->func_stack, struct klp_func,
 				      stack_node);
@@ -115,7 +119,7 @@ static void notrace klp_ftrace_handler(unsigned long ip,
 
 	klp_arch_set_pc(regs, (unsigned long)func->new_func);
 unlock:
-	rcu_read_unlock();
+	preempt_enable_notrace();
 }
 
 /*
diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index adc0cc64aa4b..b004a1fb6032 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -48,6 +48,28 @@ static void klp_transition_work_fn(struct work_struct *work)
 }
 static DECLARE_DELAYED_WORK(klp_transition_work, klp_transition_work_fn);
 
+/*
+ * This function is just a stub to implement a hard force
+ * of synchronize_sched(). This requires synchronizing
+ * tasks even in userspace and idle.
+ */
+static void klp_sync(struct work_struct *work)
+{
+}
+
+/*
+ * We allow to patch also functions where RCU is not watching,
+ * e.g. before user_exit(). We can not rely on the RCU infrastructure
+ * to do the synchronization. Instead hard force the sched synchronization.
+ *
+ * This approach allows to use RCU functions for manipulating func_stack
+ * safely.
+ */
+static void klp_synchronize_transition(void)
+{
+	schedule_on_each_cpu(klp_sync);
+}
+
 /*
  * The transition to the target patch state is complete.  Clean up the data
  * structures.
@@ -73,7 +95,7 @@ static void klp_complete_transition(void)
 		 * func->transition gets cleared, the handler may choose a
 		 * removed function.
 		 */
-		synchronize_rcu();
+		klp_synchronize_transition();
 	}
 
 	if (klp_transition_patch->immediate)
@@ -92,7 +114,7 @@ static void klp_complete_transition(void)
 
 	/* Prevent klp_ftrace_handler() from seeing KLP_UNDEFINED state */
 	if (klp_target_state == KLP_PATCHED)
-		synchronize_rcu();
+		klp_synchronize_transition();
 
 	read_lock(&tasklist_lock);
 	for_each_process_thread(g, task) {
@@ -136,7 +158,11 @@ void klp_cancel_transition(void)
  */
 void klp_update_patch_state(struct task_struct *task)
 {
-	rcu_read_lock();
+	/*
+	 * A variant of synchronize_sched() is used to allow patching functions
+	 * where RCU is not watching, see klp_synchronize_transition().
+	 */
+	preempt_disable_notrace();
 
 	/*
 	 * This test_and_clear_tsk_thread_flag() call also serves as a read
@@ -153,7 +179,7 @@ void klp_update_patch_state(struct task_struct *task)
 	if (test_and_clear_tsk_thread_flag(task, TIF_PATCH_PENDING))
 		task->patch_state = READ_ONCE(klp_target_state);
 
-	rcu_read_unlock();
+	preempt_enable_notrace();
 }
 
 /*
@@ -539,7 +565,7 @@ void klp_reverse_transition(void)
 		clear_tsk_thread_flag(idle_task(cpu), TIF_PATCH_PENDING);
 
 	/* Let any remaining calls to klp_update_patch_state() complete */
-	synchronize_rcu();
+	klp_synchronize_transition();
 
 	klp_start_transition();
 }
-- 
cgit v1.2.3


From 8a1898db51a3390241cd5fae267dc8aaa9db0f8b Mon Sep 17 00:00:00 2001
From: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Date: Tue, 20 Jun 2017 12:26:39 +0200
Subject: perf/aux: Correct return code of rb_alloc_aux() if !has_aux(ev)

If the event for which an AUX area is about to be allocated, does
not support setting up an AUX area, rb_alloc_aux() return -ENOTSUPP.

This error condition is being returned unfiltered to the user space,
and, for example, the perf tools fails with:

  failed to mmap with 524 (INTERNAL ERROR: strerror_r(524, 0x3fff497a1c8, 512)=22)

This error can be easily seen with "perf record -m 128,256 -e cpu-clock".

The 524 error code maps to -ENOTSUPP (in rb_alloc_aux()). The -ENOTSUPP
error code shall be only used within the kernel.  So the correct error
code would then be -EOPNOTSUPP.

With this commit, the perf tool then reports:

  failed to mmap with 95 (Operation not supported)

which is more clear.

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pu Hou <bjhoupu@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas-Mich Richter <tmricht@linux.vnet.ibm.com>
Cc: acme@kernel.org
Cc: linux-s390@vger.kernel.org
Link: http://lkml.kernel.org/r/1497954399-6355-1-git-send-email-brueckner@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/events/ring_buffer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'kernel')

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 2831480c63a2..ee97196bb151 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -580,7 +580,7 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 	int ret = -ENOMEM, max_order = 0;
 
 	if (!has_aux(event))
-		return -ENOTSUPP;
+		return -EOPNOTSUPP;
 
 	if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG) {
 		/*
-- 
cgit v1.2.3


From 0f17976568b3f72e676450af0c0db6f8752253d6 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Date: Thu, 29 Jun 2017 10:05:45 -0400
Subject: ftrace: Fix regression with module command in stack_trace_filter

When doing the following command:

 # echo ":mod:kvm_intel" > /sys/kernel/tracing/stack_trace_filter

it triggered a crash.

This happened with the clean up of probes. It required all callers to the
regex function (doing ftrace filtering) to have ops->private be a pointer to
a trace_array. But for the stack tracer, that is not the case.

Allow for the ops->private to be NULL, and change the function command
callbacks to handle the trace_array pointer being NULL as well.

Fixes: d2afd57a4b96 ("tracing/ftrace: Allow instances to have their own function probes")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 kernel/trace/ftrace.c          |  3 ---
 kernel/trace/trace.c           |  3 +++
 kernel/trace/trace_functions.c | 12 ++++++++++++
 kernel/trace/trace_stack.c     |  6 ++++--
 4 files changed, 19 insertions(+), 5 deletions(-)

(limited to 'kernel')

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 9e5841dc14b5..b308be30dfb9 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -4337,9 +4337,6 @@ static int ftrace_process_regex(struct ftrace_iterator *iter,
 
 	command = strsep(&next, ":");
 
-	if (WARN_ON_ONCE(!tr))
-		return -EINVAL;
-
 	mutex_lock(&ftrace_cmd_mutex);
 	list_for_each_entry(p, &ftrace_commands, list) {
 		if (strcmp(p->name, command) == 0) {
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 1122f151466f..091e801145c9 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6881,6 +6881,9 @@ ftrace_trace_snapshot_callback(struct trace_array *tr, struct ftrace_hash *hash,
 	char *number;
 	int ret;
 
+	if (!tr)
+		return -ENODEV;
+
 	/* hash funcs only work with set_ftrace_filter */
 	if (!enable)
 		return -EINVAL;
diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index a3bddbfd0874..a0910c0cdf2e 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -654,6 +654,9 @@ ftrace_trace_onoff_callback(struct trace_array *tr, struct ftrace_hash *hash,
 {
 	struct ftrace_probe_ops *ops;
 
+	if (!tr)
+		return -ENODEV;
+
 	/* we register both traceon and traceoff to this callback */
 	if (strcmp(cmd, "traceon") == 0)
 		ops = param ? &traceon_count_probe_ops : &traceon_probe_ops;
@@ -670,6 +673,9 @@ ftrace_stacktrace_callback(struct trace_array *tr, struct ftrace_hash *hash,
 {
 	struct ftrace_probe_ops *ops;
 
+	if (!tr)
+		return -ENODEV;
+
 	ops = param ? &stacktrace_count_probe_ops : &stacktrace_probe_ops;
 
 	return ftrace_trace_probe_callback(tr, ops, hash, glob, cmd,
@@ -682,6 +688,9 @@ ftrace_dump_callback(struct trace_array *tr, struct ftrace_hash *hash,
 {
 	struct ftrace_probe_ops *ops;
 
+	if (!tr)
+		return -ENODEV;
+
 	ops = &dump_probe_ops;
 
 	/* Only dump once. */
@@ -695,6 +704,9 @@ ftrace_cpudump_callback(struct trace_array *tr, struct ftrace_hash *hash,
 {
 	struct ftrace_probe_ops *ops;
 
+	if (!tr)
+		return -ENODEV;
+
 	ops = &cpudump_probe_ops;
 
 	/* Only dump once. */
diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
index 76aa04d4c925..b4a751e8f9d6 100644
--- a/kernel/trace/trace_stack.c
+++ b/kernel/trace/trace_stack.c
@@ -409,7 +409,9 @@ static const struct file_operations stack_trace_fops = {
 static int
 stack_trace_filter_open(struct inode *inode, struct file *file)
 {
-	return ftrace_regex_open(&trace_ops, FTRACE_ITER_FILTER,
+	struct ftrace_ops *ops = inode->i_private;
+
+	return ftrace_regex_open(ops, FTRACE_ITER_FILTER,
 				 inode, file);
 }
 
@@ -476,7 +478,7 @@ static __init int stack_trace_init(void)
 			NULL, &stack_trace_fops);
 
 	trace_create_file("stack_trace_filter", 0444, d_tracer,
-			NULL, &stack_trace_filter_fops);
+			  &trace_ops, &stack_trace_filter_fops);
 
 	if (stack_trace_filter_buf[0])
 		ftrace_set_early_filter(&trace_ops, stack_trace_filter_buf, 1);
-- 
cgit v1.2.3


From 6bdf6abc56b53103324dfd270a86580306e1a232 Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Thu, 29 Jun 2017 03:04:59 +0200
Subject: bpf: prevent leaking pointer via xadd on unpriviledged

Leaking kernel addresses on unpriviledged is generally disallowed,
for example, verifier rejects the following:

  0: (b7) r0 = 0
  1: (18) r2 = 0xffff897e82304400
  3: (7b) *(u64 *)(r1 +48) = r2
  R2 leaks addr into ctx

Doing pointer arithmetic on them is also forbidden, so that they
don't turn into unknown value and then get leaked out. However,
there's xadd as a special case, where we don't check the src reg
for being a pointer register, e.g. the following will pass:

  0: (b7) r0 = 0
  1: (7b) *(u64 *)(r1 +48) = r0
  2: (18) r2 = 0xffff897e82304400 ; map
  4: (db) lock *(u64 *)(r1 +48) += r2
  5: (95) exit

We could store the pointer into skb->cb, loose the type context,
and then read it out from there again to leak it eventually out
of a map value. Or more easily in a different variant, too:

   0: (bf) r6 = r1
   1: (7a) *(u64 *)(r10 -8) = 0
   2: (bf) r2 = r10
   3: (07) r2 += -8
   4: (18) r1 = 0x0
   6: (85) call bpf_map_lookup_elem#1
   7: (15) if r0 == 0x0 goto pc+3
   R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R6=ctx R10=fp
   8: (b7) r3 = 0
   9: (7b) *(u64 *)(r0 +0) = r3
  10: (db) lock *(u64 *)(r0 +0) += r6
  11: (b7) r0 = 0
  12: (95) exit

  from 7 to 11: R0=inv,min_value=0,max_value=0 R6=ctx R10=fp
  11: (b7) r0 = 0
  12: (95) exit

Prevent this by checking xadd src reg for pointer types. Also
add a couple of test cases related to this.

Fixes: 1be7f75d1668 ("bpf: enable non-root eBPF programs")
Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 kernel/bpf/verifier.c | 5 +++++
 1 file changed, 5 insertions(+)

(limited to 'kernel')

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 339c8a1371de..a8a725697bed 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -989,6 +989,11 @@ static int check_xadd(struct bpf_verifier_env *env, struct bpf_insn *insn)
 	if (err)
 		return err;
 
+	if (is_pointer_value(env, insn->src_reg)) {
+		verbose("R%d leaks addr into mem\n", insn->src_reg);
+		return -EACCES;
+	}
+
 	/* check whether atomic_add can read the memory */
 	err = check_mem_access(env, insn->dst_reg, insn->off,
 			       BPF_SIZE(insn->code), BPF_READ, -1);
-- 
cgit v1.2.3


From 9e52b32567126fe146f198971364f68d3bc5233f Mon Sep 17 00:00:00 2001
From: Sabrina Dubroca <sd@queasysnail.net>
Date: Thu, 22 Jun 2017 11:24:42 +0200
Subject: tracing/kprobes: Allow to create probe with a module name starting
 with a digit

Always try to parse an address, since kstrtoul() will safely fail when
given a symbol as input. If that fails (which will be the case for a
symbol), try to parse a symbol instead.

This allows creating a probe such as:

    p:probe/vlan_gro_receive 8021q:vlan_gro_receive+0

Which is necessary for this command to work:

    perf probe -m 8021q -a vlan_gro_receive

Link: http://lkml.kernel.org/r/fd72d666f45b114e2c5b9cf7e27b91de1ec966f1.1498122881.git.sd@queasysnail.net

Cc: stable@vger.kernel.org
Fixes: 413d37d1e ("tracing: Add kprobe-based event tracer")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 kernel/trace/trace_kprobe.c | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

(limited to 'kernel')

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index c129fca6ec99..b53c8d369163 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -707,20 +707,16 @@ static int create_trace_kprobe(int argc, char **argv)
 		pr_info("Probe point is not specified.\n");
 		return -EINVAL;
 	}
-	if (isdigit(argv[1][0])) {
-		/* an address specified */
-		ret = kstrtoul(&argv[1][0], 0, (unsigned long *)&addr);
-		if (ret) {
-			pr_info("Failed to parse address.\n");
-			return ret;
-		}
-	} else {
+
+	/* try to parse an address. if that fails, try to read the
+	 * input as a symbol. */
+	if (kstrtoul(argv[1], 0, (unsigned long *)&addr)) {
 		/* a symbol specified */
 		symbol = argv[1];
 		/* TODO: support .init module functions */
 		ret = traceprobe_split_symbol_offset(symbol, &offset);
 		if (ret) {
-			pr_info("Failed to parse symbol.\n");
+			pr_info("Failed to parse either an address or a symbol.\n");
 			return ret;
 		}
 		if (offset && is_return &&
-- 
cgit v1.2.3