summaryrefslogtreecommitdiff
path: root/kernel/sched/core.c
AgeCommit message (Collapse)Author
2015-01-16sched: Add missing rcu protection to wake_up_all_idle_cpusAndy Lutomirski
commit fd7de1e8d5b2b2b35e71332fafb899f584597150 upstream. Locklessly doing is_idle_task(rq->curr) is only okay because of RCU protection. The older variant of the broken code checked rq->curr == rq->idle instead and therefore didn't need RCU. Fixes: f6be8af1c95d ("sched: Add new API wake_up_if_idle() to wake up the idle cpu") Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Chuansheng Liu <chuansheng.liu@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/729365dddca178506dfd0a9451006344cd6808bc.1417277372.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-03context_tracking: Restore previous state in schedule_userAndy Lutomirski
It appears that some SCHEDULE_USER (asm for schedule_user) callers in arch/x86/kernel/entry_64.S are called from RCU kernel context, and schedule_user will return in RCU user context. This causes RCU warnings and possible failures. This is intended to be a minimal fix suitable for 3.18. Reported-and-tested-by: Dave Jones <davej@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Frédéric Weisbecker <fweisbec@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-16sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistencyStanislaw Gruszka
Commit d670ec13178d0 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc test case in cost of breaking another one. After that commit, calling clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result of Y time being smaller than X time. Reproducer/tester can be found further below, it can be compiled and ran by: gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread while ./tst-cpuclock2 ; do : ; done This reproducer, when running on a buggy kernel, will complain about "clock_gettime difference too small". Issue happens because on start in thread_group_cputimer() we initialize sum_exec_runtime of cputimer with threads runtime not yet accounted and then add the threads runtime to running cputimer again on scheduler tick, making it's sum_exec_runtime bigger than actual threads runtime. KOSAKI Motohiro posted a fix for this problem, but that patch was never applied: https://lkml.org/lkml/2013/5/26/191 . This patch takes different approach to cure the problem. It calls update_curr() when cputimer starts, that assure we will have updated stats of running threads and on the next schedule tick we will account only the runtime that elapsed from cputimer start. That also assure we have consistent state between cpu times of individual threads and cpu time of the process consisted by those threads. Full reproducer (tst-cpuclock2.c): #define _GNU_SOURCE #include <unistd.h> #include <sys/syscall.h> #include <stdio.h> #include <time.h> #include <pthread.h> #include <stdint.h> #include <inttypes.h> /* Parameters for the Linux kernel ABI for CPU clocks. */ #define CPUCLOCK_SCHED 2 #define MAKE_PROCESS_CPUCLOCK(pid, clock) \ ((~(clockid_t) (pid) << 3) | (clockid_t) (clock)) static pthread_barrier_t barrier; /* Help advance the clock. */ static void *chew_cpu(void *arg) { pthread_barrier_wait(&barrier); while (1) ; return NULL; } /* Don't use the glibc wrapper. */ static int do_nanosleep(int flags, const struct timespec *req) { clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED); return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL); } static int64_t tsdiff(const struct timespec *before, const struct timespec *after) { int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec; int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec; return after_i - before_i; } int main(void) { int result = 0; pthread_t th; pthread_barrier_init(&barrier, NULL, 2); if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) { perror("pthread_create"); return 1; } pthread_barrier_wait(&barrier); /* The test. */ struct timespec before, after, sleeptimeabs; int64_t sleepdiff, diffabs; const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 }; /* The relative nanosleep. Not sure why this is needed, but its presence seems to make it easier to reproduce the problem. */ if (do_nanosleep(0, &sleeptime) != 0) { perror("clock_nanosleep"); return 1; } /* Get the current time. */ if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) { perror("clock_gettime[2]"); return 1; } /* Compute the absolute sleep time based on the current time. */ uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec; sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000; sleeptimeabs.tv_nsec = nsec % 1000000000; /* Sleep for the computed time. */ if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) { perror("absolute clock_nanosleep"); return 1; } /* Get the time after the sleep. */ if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) { perror("clock_gettime[3]"); return 1; } /* The time after sleep should always be equal to or after the absolute sleep time passed to clock_nanosleep. */ sleepdiff = tsdiff(&sleeptimeabs, &after); if (sleepdiff < 0) { printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff); result = 1; printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec); printf("After %llu.%09llu\n", after.tv_sec, after.tv_nsec); printf("Sleep %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec); } /* The difference between the timestamps taken before and after the clock_nanosleep call should be equal to or more than the duration of the sleep. */ diffabs = tsdiff(&before, &after); if (diffabs < sleeptime.tv_nsec) { printf("clock_gettime difference too small: %" PRId64 "\n", diffabs); result = 1; } pthread_cancel(th); return result; } Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16sched/cputime: Fix cpu_timer_sample_group() double accountingPeter Zijlstra
While looking over the cpu-timer code I found that we appear to add the delta for the calling task twice, through: cpu_timer_sample_group() thread_group_cputimer() thread_group_cputime() times->sum_exec_runtime += task_sched_runtime(); *sample = cputime.sum_exec_runtime + task_delta_exec(); Which would make the sample run ahead, making the sleep short. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rik van Riel <riel@redhat.com> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20141112113737.GI10476@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-10sched/numa: Fix out of bounds read in sched_init_numa()Andrey Ryabinin
On latest mm + KASan patchset I've got this: ================================================================== BUG: AddressSanitizer: out of bounds access in sched_init_smp+0x3ba/0x62c at addr ffff88006d4bee6c ============================================================================= BUG kmalloc-8 (Not tainted): kasan error ----------------------------------------------------------------------------- Disabling lock debugging due to kernel taint INFO: Allocated in alloc_vfsmnt+0xb0/0x2c0 age=75 cpu=0 pid=0 __slab_alloc+0x4b4/0x4f0 __kmalloc_track_caller+0x15f/0x1e0 kstrdup+0x44/0x90 alloc_vfsmnt+0xb0/0x2c0 vfs_kern_mount+0x35/0x190 kern_mount_data+0x25/0x50 pid_ns_prepare_proc+0x19/0x50 alloc_pid+0x5e2/0x630 copy_process.part.41+0xdf5/0x2aa0 do_fork+0xf5/0x460 kernel_thread+0x21/0x30 rest_init+0x1e/0x90 start_kernel+0x522/0x531 x86_64_start_reservations+0x2a/0x2c x86_64_start_kernel+0x15b/0x16a INFO: Slab 0xffffea0001b52f80 objects=24 used=22 fp=0xffff88006d4befc0 flags=0x100000000004080 INFO: Object 0xffff88006d4bed20 @offset=3360 fp=0xffff88006d4bee70 Bytes b4 ffff88006d4bed10: 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ Object ffff88006d4bed20: 70 72 6f 63 00 6b 6b a5 proc.kk. Redzone ffff88006d4bed28: cc cc cc cc cc cc cc cc ........ Padding ffff88006d4bee68: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ CPU: 0 PID: 1 Comm: swapper/0 Tainted: G B 3.18.0-rc3-mm1+ #108 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 ffff88006d4be000 0000000000000000 ffff88006d4bed20 ffff88006c86fd18 ffffffff81cd0a59 0000000000000058 ffff88006d404240 ffff88006c86fd48 ffffffff811fa3a8 ffff88006d404240 ffffea0001b52f80 ffff88006d4bed20 Call Trace: dump_stack (lib/dump_stack.c:52) print_trailer (mm/slub.c:645) object_err (mm/slub.c:652) ? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063) kasan_report_error (mm/kasan/report.c:102 mm/kasan/report.c:178) ? kasan_poison_shadow (mm/kasan/kasan.c:48) ? kasan_unpoison_shadow (mm/kasan/kasan.c:54) ? kasan_poison_shadow (mm/kasan/kasan.c:48) ? kasan_kmalloc (mm/kasan/kasan.c:311) __asan_load4 (mm/kasan/kasan.c:371) ? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063) sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063) kernel_init_freeable (init/main.c:869 init/main.c:997) ? finish_task_switch (kernel/sched/sched.h:1036 kernel/sched/core.c:2248) ? rest_init (init/main.c:924) kernel_init (init/main.c:929) ? rest_init (init/main.c:924) ret_from_fork (arch/x86/kernel/entry_64.S:348) ? rest_init (init/main.c:924) Read of size 4 by task swapper/0: Memory state around the buggy address: ffff88006d4beb80: fc fc fc fc fc fc fc fc fc fc 00 fc fc fc fc fc ffff88006d4bec00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88006d4bec80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88006d4bed00: fc fc fc fc 00 fc fc fc fc fc fc fc fc fc fc fc ffff88006d4bed80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff88006d4bee00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc 04 fc ^ ffff88006d4bee80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88006d4bef00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88006d4bef80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb ffff88006d4bf000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88006d4bf080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== Zero 'level' (e.g. on non-NUMA system) causing out of bounds access in this line: sched_max_numa_distance = sched_domains_numa_distance[level - 1]; Fix this by exiting from sched_init_numa() earlier. Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Reviewed-by: Rik van Riel <riel@redhat.com> Fixes: 9942f79ba ("sched/numa: Export info needed for NUMA balancing on complex topologies") Cc: peterz@infradead.org Link: http://lkml.kernel.org/r/1415372020-1871-1-git-send-email-a.ryabinin@samsung.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-04sched: Remove lockdep check in sched_move_task()Kirill Tkhai
sched_move_task() is the only interface to change sched_task_group: cpu_cgrp_subsys methods and autogroup_move_group() use it. Everything is synchronized by task_rq_lock(), so cpu_cgroup_attach() is ordered with other users of sched_move_task(). This means we do no need RCU here: if we've dereferenced a tg here, the .attach method hasn't been called for it yet. Thus, we should pass "true" to task_css_check() to silence lockdep warnings. Fixes: eeb61e53ea19 ("sched: Fix race between task_group and sched_task_group") Reported-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1414473874.8574.2.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28sched: stop the unbound recursion in preempt_schedule_context()Oleg Nesterov
preempt_schedule_context() does preempt_enable_notrace() at the end and this can call the same function again; exception_exit() is heavy and it is quite possible that need-resched is true again. 1. Change this code to dec preempt_count() and check need_resched() by hand. 2. As Linus suggested, we can use the PREEMPT_ACTIVE bit and avoid the enable/disable dance around __schedule(). But in this case we need to move into sched/core.c. 3. Cosmetic, but x86 forgets to declare this function. This doesn't really matter because it is only called by asm helpers, still it make sense to add the declaration into asm/preempt.h to match preempt_schedule(). Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Graf <agraf@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Anvin <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Chuck Ebbert <cebbert.lkml@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20141005202322.GB27962@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28sched: Fix race between task_group and sched_task_groupKirill Tkhai
The race may happen when somebody is changing task_group of a forking task. Child's cgroup is the same as parent's after dup_task_struct() (there just memory copying). Also, cfs_rq and rt_rq are the same as parent's. But if parent changes its task_group before it's called cgroup_post_fork(), we do not reflect this situation on child. Child's cfs_rq and rt_rq remain the same, while child's task_group changes in cgroup_post_fork(). To fix this we introduce fork() method, which calls sched_move_task() directly. This function changes sched_task_group on appropriate (also its logic has no problem with freshly created tasks, so we shouldn't introduce something special; we are able just to use it). Possibly, this decides the Burke Libbey's problem: https://lkml.org/lkml/2014/10/24/456 Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1414405105.19914.169.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-13Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The main changes in this cycle were: - Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave Hansen) - Various sched/idle refinements for better idle handling (Nicolas Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot) - sched/numa updates and optimizations (Rik van Riel) - sysbench speedup (Vincent Guittot) - capacity calculation cleanups/refactoring (Vincent Guittot) - Various cleanups to thread group iteration (Oleg Nesterov) - Double-rq-lock removal optimization and various refactorings (Kirill Tkhai) - various sched/deadline fixes ... and lots of other changes" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits) sched/dl: Use dl_bw_of() under rcu_read_lock_sched() sched/fair: Delete resched_cpu() from idle_balance() sched, time: Fix build error with 64 bit cputime_t on 32 bit systems sched: Improve sysbench performance by fixing spurious active migration sched/x86: Fix up typo in topology detection x86, sched: Add new topology for multi-NUMA-node CPUs sched/rt: Use resched_curr() in task_tick_rt() sched: Use rq->rd in sched_setaffinity() under RCU read lock sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask' sched: Use dl_bw_of() under RCU read lock sched/fair: Remove duplicate code from can_migrate_task() sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW sched: print_rq(): Don't use tasklist_lock sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock() sched: Fix the task-group check in tg_has_rt_tasks() sched/fair: Leverage the idle state info when choosing the "idlest" cpu sched: Let the scheduler see CPU idle states sched/deadline: Fix inter- exclusive cpusets migrations sched/deadline: Clear dl_entity params when setscheduling to different class sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault() ...
2014-10-13Merge branch 'locking-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core locking updates from Ingo Molnar: "The main updates in this cycle were: - mutex MCS refactoring finishing touches: improve comments, refactor and clean up code, reduce debug data structure footprint, etc. - qrwlock finishing touches: remove old code, self-test updates. - small rwsem optimization - various smaller fixes/cleanups" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/lockdep: Revert qrwlock recusive stuff locking/rwsem: Avoid double checking before try acquiring write lock locking/rwsem: Move EXPORT_SYMBOL() lines to follow function definition locking/rwlock, x86: Delete unused asm/rwlock.h and rwlock.S locking/rwlock, x86: Clean up asm/spinlock*.h to remove old rwlock code locking/semaphore: Resolve some shadow warnings locking/selftest: Support queued rwlock locking/lockdep: Restrict the use of recursive read_lock() with qrwlock locking/spinlocks: Always evaluate the second argument of spin_lock_nested() locking/Documentation: Update locking/mutex-design.txt disadvantages locking/Documentation: Move locking related docs into Documentation/locking/ locking/mutexes: Use MUTEX_SPIN_ON_OWNER when appropriate locking/mutexes: Refactor optimistic spinning code locking/mcs: Remove obsolete comment locking/mutexes: Document quick lock release when unlocking locking/mutexes: Standardize arguments in lock/unlock slowpaths locking: Remove deprecated smp_mb__() barriers
2014-10-03sched/dl: Use dl_bw_of() under rcu_read_lock_sched()Kirill Tkhai
rq->rd is freed using call_rcu_sched(), so rcu_read_lock() to access it is not enough. We should use either rcu_read_lock_sched() or preempt_disable(). Reported-by: Sasha Levin <sasha.levin@oracle.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kirill Tkhai <ktkhai@parallels.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Fixes: 66339c31bc39 "sched: Use dl_bw_of() under RCU read lock" Link: http://lkml.kernel.org/r/1412065417.20287.24.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24sched: Use rq->rd in sched_setaffinity() under RCU read lockKirill Tkhai
Probability of use-after-free isn't zero in this place. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> # v3.14+ Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140922183636.11015.83611.stgit@localhost Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'Kirill Tkhai
Nothing is locked there, so label's name only confuses a reader. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140922183630.11015.59500.stgit@localhost Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24sched: Use dl_bw_of() under RCU read lockKirill Tkhai
dl_bw_of() dereferences rq->rd which has to have RCU read lock held. Probability of use-after-free isn't zero here. Also add lockdep assert into dl_bw_cpus(). Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> # v3.14+ Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140922183624.11015.71558.stgit@localhost Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSWPeter Zijlstra
Kirill found that there's a subtle race in the __ARCH_WANT_UNLOCKED_CTXSW code, and instead of fixing it, remove the entire exception because neither arch that uses it seems to actually still require it. Boot tested on mips64el (qemu) only. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kirill Tkhai <tkhai@yandex.ru> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Burton <paul.burton@imgtec.com> Cc: Qais Yousef <qais.yousef@imgtec.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Tony Luck <tony.luck@intel.com> Cc: oleg@redhat.com Cc: linux@roeck-us.net Cc: linux-ia64@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mips@linux-mips.org Link: http://lkml.kernel.org/r/20140923150641.GH3312@worktop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use ↵Oleg Nesterov
task_rq_lock() 1. read_lock(tasklist_lock) does not need to disable irqs. 2. ->mm != NULL is a common mistake, use PF_KTHREAD. 3. The second ->mm check can be simply removed. 4. task_rq_lock() looks better than raw_spin_lock(&p->pi_lock) + __task_rq_lock(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140921193338.GA28621@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24sched: Fix the task-group check in tg_has_rt_tasks()Oleg Nesterov
tg_has_rt_tasks() wants to find an RT task in this task_group, but task_rq(p)->rt.tg wrongly checks the root rt_rq. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Link: http://lkml.kernel.org/r/20140921193336.GA28618@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24sched/deadline: Clear dl_entity params when setscheduling to different classJuri Lelli
When a task is using SCHED_DEADLINE and the user setschedules it to a different class its sched_dl_entity static parameters are not cleaned up. This causes a bug if the user sets it back to SCHED_DEADLINE with the same parameters again. The problem resides in the check we perform at the very beginning of dl_overflow(): if (new_bw == p->dl.dl_bw) return 0; This condition is met in the case depicted above, so the function returns and dl_b->total_bw is not updated (the p->dl.dl_bw is not added to it). After this, admission control is broken. This patch fixes the thing, properly clearing static parameters for a task that ceases to use SCHED_DEADLINE. Reported-by: Daniele Alessandrelli <daniele.alessandrelli@gmail.com> Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Reported-by: Vincent Legout <vincent@legout.info> Tested-by: Luca Abeni <luca.abeni@unitn.it> Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Tested-by: Vincent Legout <vincent@legout.info> Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Fabio Checconi <fchecconi@gmail.com> Cc: Dario Faggioli <raistlin@linux.it> Cc: Michael Trimarchi <michael@amarulasolutions.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1411118561-26323-2-git-send-email-juri.lelli@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-21sched: Clean up some typos and grammatical errors in code/commentsZhihui Zhang
Signed-off-by: Zhihui Zhang <zzhsuny@gmail.com> Cc: peterz@infradead.org Link: http://lkml.kernel.org/r/1411262676-19928-1-git-send-email-zzhsuny@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19sched: Add default-disabled option to BUG() when stack end location is ↵Aaron Tomlin
overwritten Currently in the event of a stack overrun a call to schedule() does not check for this type of corruption. This corruption is often silent and can go unnoticed. However once the corrupted region is examined at a later stage, the outcome is undefined and often results in a sporadic page fault which cannot be handled. This patch checks for a stack overrun and takes appropriate action since the damage is already done, there is no point in continuing. Signed-off-by: Aaron Tomlin <atomlin@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: aneesh.kumar@linux.vnet.ibm.com Cc: dzickus@redhat.com Cc: bmr@redhat.com Cc: jcastillo@redhat.com Cc: oleg@redhat.com Cc: riel@redhat.com Cc: prarit@redhat.com Cc: jgh@redhat.com Cc: minchan@kernel.org Cc: mpe@ellerman.id.au Cc: tglx@linutronix.de Cc: rostedt@goodmis.org Cc: hannes@cmpxchg.org Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: David S. Miller <davem@davemloft.net> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lubomir Rintel <lkundrak@v3.sk> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1410527779-8133-4-git-send-email-atomlin@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19sched: Do not stop cpu in set_cpus_allowed_ptr() if task is not runningKirill Tkhai
If a task is queued but not running on it rq, we can simply migrate it without migration thread and switching of context. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1410519814.3569.7.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19sched/core: Use put_prev_task() accessor where possibleKirill Tkhai
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1410529300.3569.25.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19sched: Add new API wake_up_if_idle() to wake up the idle cpuChuansheng Liu
Implementing one new API wake_up_if_idle(), which is used to wake up the idle CPU. Suggested-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: daniel.lezcano@linaro.org Cc: rjw@rjwysocki.net Cc: linux-pm@vger.kernel.org Cc: changcheng.liu@intel.com Cc: xiaoming.wang@intel.com Cc: souvik.k.chakravarty@intel.com Cc: chuansheng.liu@intel.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1409815075-4180-1-git-send-email-chuansheng.liu@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-09sched: Migrate waking tasksLai Jiangshan
Current code can fail to migrate a waking task (silently) when TTWU_QUEUE is enabled. When a task is waking, it is pending on the wake_list of the rq, but it is not queued (task->on_rq == 0). In this case, set_cpus_allowed_ptr() and __migrate_task() will not migrate it because its invisible to them. This behavior is incorrect, because the task has been already woken, it will be running on the wrong CPU without correct placement until the next wake-up or update for cpus_allowed. To fix this problem, we need to finish the wakeup (so they appear on the runqueue) before we migrate them. Reported-by: Sasha Levin <sasha.levin@oracle.com> Reported-by: Jason J. Herne <jjherne@linux.vnet.ibm.com> Tested-by: Jason J. Herne <jjherne@linux.vnet.ibm.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/538ED7EB.5050303@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-08Merge tag 'v3.17-rc4' into sched/core, to prevent conflicts with upcoming ↵Ingo Molnar
patches, and to refresh the tree Linux 3.17-rc4
2014-09-07sched/deadline: Fix a precision problem in the microseconds rangexiaofeng.yan
An overrun could happen in function start_hrtick_dl() when a task with SCHED_DEADLINE runs in the microseconds range. For example, if a task with SCHED_DEADLINE has the following parameters: Task runtime deadline period P1 200us 500us 500us The deadline and period from task P1 are less than 1ms. In order to achieve microsecond precision, we need to enable HRTICK feature by the next command: PC#echo "HRTICK" > /sys/kernel/debug/sched_features PC#trace-cmd record -e sched_switch & PC#./schedtool -E -t 200000:500000:500000 -e ./test The binary test is in an endless while(1) loop here. Some pieces of trace.dat are as follows: <idle>-0 157.603157: sched_switch: :R ==> 2481:4294967295: test test-2481 157.603203: sched_switch: 2481:R ==> 0:120: swapper/2 <idle>-0 157.605657: sched_switch: :R ==> 2481:4294967295: test test-2481 157.608183: sched_switch: 2481:R ==> 2483:120: trace-cmd trace-cmd-2483 157.609656: sched_switch:2483:R==>2481:4294967295: test We can get the runtime of P1 from the information above: runtime = 157.608183 - 157.605657 runtime = 0.002526(2.526ms) The correct runtime should be less than or equal to 200us at some point. The problem is caused by a conditional judgment "delta > 10000" in function start_hrtick_dl(). Because no hrtimer start up to control the rest of runtime when the reset of runtime is less than 10us. So the process will continue to run until tick-period is coming. Move the code with the limit of the least time slice from hrtick_start_fair() to hrtick_start() because the EDF schedule class also needs this function in start_hrtick_dl(). To fix this problem, we call hrtimer_start() unconditionally in start_hrtick_dl(), and make sure the scheduling slice won't be smaller than 10us in hrtimer_start(). Signed-off-by: Xiaofeng Yan <xiaofeng.yan@huawei.com> Reviewed-by: Li Zefan <lizefan@huawei.com> Acked-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1409022941-5880-1-git-send-email-xiaofeng.yan@huawei.com [ Massaged the changelog and the code. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-25sched: Add function single_task_running to let a task check if it is the ↵Tim Chen
only task running on a cpu This function will help an async task processing batched jobs from workqueue decide if it wants to keep processing on more chunks of batched work that can be delayed, or to accumulate more work for more efficient batched processing later. If no other tasks are running on the cpu, the batching process can take advantgae of the available cpu cycles to a make decision to continue processing the existing accumulated work to minimize delay, otherwise it will yield. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-08-20sched: Remove double_rq_lock() from __migrate_task()Kirill Tkhai
Avoid double_rq_lock() and use TASK_ON_RQ_MIGRATING for __migrate_task(). The advantage is (obviously) not holding two rq->lock's at the same time and thereby increasing parallelism. The important point to note is that because we acquire dst->lock immediately after releasing src->lock the potential wait time of task_rq_lock() callers on TASK_ON_RQ_MIGRATING is not longer than it would have been in the double rq lock scenario. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Turner <pjt@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1408528070.23412.89.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-20sched: Teach scheduler to understand TASK_ON_RQ_MIGRATING stateKirill Tkhai
This is a new p->on_rq state which will be used to indicate that a task is in a process of migrating between two RQs. It allows to get rid of double_rq_lock(), which we used to use to change a rq of a queued task before. Let's consider an example. To move a task between src_rq and dst_rq we will do the following: raw_spin_lock(&src_rq->lock); /* p is a task which is queued on src_rq */ p = ...; dequeue_task(src_rq, p, 0); p->on_rq = TASK_ON_RQ_MIGRATING; set_task_cpu(p, dst_cpu); raw_spin_unlock(&src_rq->lock); /* * Both RQs are unlocked here. * Task p is dequeued from src_rq * but its on_rq value is not zero. */ raw_spin_lock(&dst_rq->lock); p->on_rq = TASK_ON_RQ_QUEUED; enqueue_task(dst_rq, p, 0); raw_spin_unlock(&dst_rq->lock); While p->on_rq is TASK_ON_RQ_MIGRATING, task is considered as "migrating", and other parallel scheduler actions with it are not available to parallel callers. The parallel caller is spining till migration is completed. The unavailable actions are changing of cpu affinity, changing of priority etc, in other words all the functionality which used to require task_rq(p)->lock before (and related to the task). To implement TASK_ON_RQ_MIGRATING support we primarily are using the following fact. Most of scheduler users (from which we are protecting a migrating task) use task_rq_lock() and __task_rq_lock() to get the lock of task_rq(p). These primitives know that task's cpu may change, and they are spining while the lock of the right RQ is not held. We add one more condition into them, so they will be also spinning until the migration is finished. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Turner <pjt@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1408528062.23412.88.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-20sched: Add wrapper for checking task_struct::on_rqKirill Tkhai
Implement task_on_rq_queued() and use it everywhere instead of on_rq check. No functional changes. The only exception is we do not use the wrapper in check_for_tasks(), because it requires to export task_on_rq_queued() in global header files. Next patch in series would return it back, so we do not twist it from here to there. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Turner <pjt@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1408528052.23412.87.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-20sched: s/do_each_thread/for_each_process_thread/ in core.cOleg Nesterov
Change kernel/sched/core.c to use for_each_process_thread(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Frank Mayhar <fmayhar@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Sanjay Rao <srao@redhat.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140813191953.GA19315@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-13locking: Remove deprecated smp_mb__() barriersPeter Zijlstra
Its been a while and there are no in-tree users left, so remove the deprecated barriers. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Chen, Gong <gong.chen@linux.intel.com> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com> Cc: Joe Perches <joe@perches.com> Cc: John Sullivan <jsrhbz@kanargh.force9.co.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-12sched: Rename a misleading variable in build_overlap_sched_groups()Zhihui Zhang
The child variable in build_overlap_sched_groups() actually refers to the peer or sibling domain of the given CPU. Rename it to sibling to be consistent with the naming in build_group_mask(). Signed-off-by: Zhihui Zhang <zzhsuny@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1406942283-18249-1-git-send-email-zzhsuny@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-11Merge branches 'pm-sleep', 'pm-cpufreq' and 'pm-cpuidle'Rafael J. Wysocki
* pm-sleep: PM / hibernate: avoid unsafe pages in e820 reserved regions * pm-cpufreq: cpufreq: arm_big_little: fix module license spec cpufreq: speedstep-smi: fix decimal printf specifiers cpufreq: OPP: Avoid sleeping while atomic cpufreq: cpu0: Do not print error message when deferring cpufreq: integrator: Use set_cpus_allowed_ptr * pm-cpuidle: cpuidle: menu: Lookup CPU runqueues less cpuidle: menu: Call nr_iowait_cpu less times cpuidle: menu: Use ktime_to_us instead of reinventing the wheel cpuidle: menu: Use shifts when calculating averages where possible
2014-08-06cpuidle: menu: Lookup CPU runqueues lessMel Gorman
The menu governer makes separate lookups of the CPU runqueue to get load and number of IO waiters but it can be done with a single lookup. Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-08-04Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Move the nohz kick code out of the scheduler tick to a dedicated IPI, from Frederic Weisbecker. This necessiated quite some background infrastructure rework, including: * Clean up some irq-work internals * Implement remote irq-work * Implement nohz kick on top of remote irq-work * Move full dynticks timer enqueue notification to new kick * Move multi-task notification to new kick * Remove unecessary barriers on multi-task notification - Remove proliferation of wait_on_bit() action functions and allow wait_on_bit_action() functions to support a timeout. (Neil Brown) - Another round of sched/numa improvements, cleanups and fixes. (Rik van Riel) - Implement fast idling of CPUs when the system is partially loaded, for better scalability. (Tim Chen) - Restructure and fix the CPU hotplug handling code that may leave cfs_rq and rt_rq's throttled when tasks are migrated away from a dead cpu. (Kirill Tkhai) - Robustify the sched topology setup code. (Peterz Zijlstra) - Improve sched_feat() handling wrt. static_keys (Jason Baron) - Misc fixes. * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits) sched/fair: Fix 'make xmldocs' warning caused by missing description sched: Use macro for magic number of -1 for setparam sched: Robustify topology setup sched: Fix sched_setparam() policy == -1 logic sched: Allow wait_on_bit_action() functions to support a timeout sched: Remove proliferation of wait_on_bit() action functions sched/numa: Revert "Use effective_load() to balance NUMA loads" sched: Fix static_key race with sched_feat() sched: Remove extra static_key*() function indirection sched/rt: Fix replenish_dl_entity() comments to match the current upstream code sched: Transform resched_task() into resched_curr() sched/deadline: Kill task_struct->pi_top_task sched: Rework check_for_tasks() sched/rt: Enqueue just unthrottled rt_rq back on the stack in __disable_runtime() sched/fair: Disable runtime_enabled on dying rq sched/numa: Change scan period code to match intent sched/numa: Rework best node setting in task_numa_migrate() sched/numa: Examine a task move when examining a task swap sched/numa: Simplify task_numa_compare() sched/numa: Use effective_load() to balance NUMA loads ...
2014-08-04Merge branch 'for-3.17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup changes from Tejun Heo: "Mostly changes to get the v2 interface ready. The core features are mostly ready now and I think it's reasonable to expect to drop the devel mask in one or two devel cycles at least for a subset of controllers. - cgroup added a controller dependency mechanism so that block cgroup can depend on memory cgroup. This will be used to finally support IO provisioning on the writeback traffic, which is currently being implemented. - The v2 interface now uses a separate table so that the interface files for the new interface are explicitly declared in one place. Each controller will explicitly review and add the files for the new interface. - cpuset is getting ready for the hierarchical behavior which is in the similar style with other controllers so that an ancestor's configuration change doesn't change the descendants' configurations irreversibly and processes aren't silently migrated when a CPU or node goes down. All the changes are to the new interface and no behavior changed for the multiple hierarchies" * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits) cpuset: fix the WARN_ON() in update_nodemasks_hier() cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core cgroup: distinguish the default and legacy hierarchies when handling cftypes cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes() cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[] cpuset: export effective masks to userspace cpuset: allow writing offlined masks to cpuset.cpus/mems cpuset: enable onlined cpu/node in effective masks cpuset: refactor cpuset_hotplug_update_tasks() cpuset: make cs->{cpus, mems}_allowed as user-configured masks cpuset: apply cs->effective_{cpus,mems} cpuset: initialize top_cpuset's configured masks at mount cpuset: use effective cpumask to build sched domains cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty cpuset: update cs->effective_{cpus, mems} when config changes cpuset: update cpuset->effective_{cpus,mems} at hotplug cpuset: add cs->effective_cpus and cs->effective_mems cgroup: clean up sane_behavior handling ...
2014-07-28sched: Use macro for magic number of -1 for setparamSteven Rostedt
Instead of passing around a magic number -1 for the sched_setparam() policy, use a more descriptive macro name like SETPARAM_POLICY. [ based on top of Daniel's sched_setparam() fix ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Daniel Bristot de Oliveira<bristot@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140723112826.6ed6cbce@gandalf.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-28sched: Robustify topology setupPeter Zijlstra
We hard assume that higher topology levels are supersets of lower levels. Detect, warn and try to fixup when we encounter this violated. Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Josh Boyer <jwboyer@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Bruno Wolff III <bruno@wolff.to> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140722094740.GJ12054@laptop.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-28Merge branch 'sched/urgent' into sched/core, to merge fixes before applying ↵Ingo Molnar
new changes Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-28sched: Fix sched_setparam() policy == -1 logicDaniel Bristot de Oliveira
The scheduler uses policy == -1 to preserve the current policy state to implement sched_setparam(). But, as (int) -1 is equals to 0xffffffff, it's matching the if (policy & SCHED_RESET_ON_FORK) on _sched_setscheduler(). This match changes the policy value to an invalid value, breaking the sched_setparam() syscall. This patch checks policy == -1 before check the SCHED_RESET_ON_FORK flag. The following program shows the bug: int main(void) { struct sched_param param = { .sched_priority = 5, }; sched_setscheduler(0, SCHED_FIFO, &param); param.sched_priority = 1; sched_setparam(0, &param); param.sched_priority = 0; sched_getparam(0, &param); if (param.sched_priority != 1) printf("failed priority setting (found %d instead of 1)\n", param.sched_priority); else printf("priority setting fine\n"); } Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: <stable@vger.kernel.org> # 3.14+ Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Fixes: 7479f3c9cf67 "sched: Move SCHED_RESET_ON_FORK into attr::sched_flags" Link: http://lkml.kernel.org/r/9ebe0566a08dbbb3999759d3f20d6004bb2dbcfa.1406079891.git.bristot@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16sched: Fix static_key race with sched_feat()Jason Baron
As pointed out by Andi Kleen, the usage of static keys can be racy in sched_feat_disable() vs. sched_feat_enable(). Currently, we first check the value of keys->enabled, and subsequently update the branch direction. This, can be racy and can potentially leave the keys in an inconsistent state. Take the i_mutex around these calls to resolve the race. Reported-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Jason Baron <jbaron@akamai.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: rostedt@goodmis.org Link: http://lkml.kernel.org/r/9d7780c83db26683955cd01e6bc654ee2586e67f.1404315388.git.jbaron@akamai.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16sched: Transform resched_task() into resched_curr()Kirill Tkhai
We always use resched_task() with rq->curr argument. It's not possible to reschedule any task but rq's current. The patch introduces resched_curr(struct rq *) to replace all of the repeating patterns. The main aim is cleanup, but there is a little size profit too: (before) $ size kernel/sched/built-in.o text data bss dec hex filename 155274 16445 7042 178761 2ba49 kernel/sched/built-in.o $ size vmlinux text data bss dec hex filename 7411490 1178376 991232 9581098 92322a vmlinux (after) $ size kernel/sched/built-in.o text data bss dec hex filename 155130 16445 7042 178617 2b9b9 kernel/sched/built-in.o $ size vmlinux text data bss dec hex filename 7411362 1178376 991232 9580970 9231aa vmlinux I was choosing between resched_curr() and resched_rq(), and the first name looks better for me. A little lie in Documentation/trace/ftrace.txt. I have not actually collected the tracing again. With a hope the patch won't make execution times much worse :) Signed-off-by: Kirill Tkhai <tkhai@yandex.ru> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20140628200219.1778.18735.stgit@localhost Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16sched/deadline: Kill task_struct->pi_top_taskOleg Nesterov
Remove task_struct->pi_top_task. The only user, rt_mutex_setprio(), can use a local. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Alex Thorlton <athorlton@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Daeseok Youn <daeseok.youn@gmail.com> Cc: Dario Faggioli <raistlin@linux.it> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: David Rientjes <rientjes@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Dempsky <mdempsky@chromium.org> Cc: Michal Simek <michal.simek@xilinx.com> Cc: Oleg Nesterov <oleg@redhat.com> Link: http://lkml.kernel.org/r/20140606165206.GB29465@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-15cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypesTejun Heo
Currently, cgroup_subsys->base_cftypes is used for both the unified default hierarchy and legacy ones and subsystems can mark each file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear only on one of them. This is quite hairy and error-prone. Also, we may end up exposing interface files to the default hierarchy without thinking it through. cgroup_subsys will grow two separate cftype arrays and apply each only on the hierarchies of the matching type. This will allow organizing cftypes in a lot clearer way and encourage subsystems to scrutinize the interface which is being exposed in the new default hierarchy. In preparation, this patch renames cgroup_subsys->base_cftypes to cgroup_subsys->legacy_cftypes. This patch is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Aristeu Rozanski <aris@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2014-07-05sched/fair: Disable runtime_enabled on dying rqKirill Tkhai
We kill rq->rd on the CPU_DOWN_PREPARE stage: cpuset_cpu_inactive -> cpuset_update_active_cpus -> partition_sched_domains -> -> cpu_attach_domain -> rq_attach_root -> set_rq_offline This unthrottles all throttled cfs_rqs. But the cpu is still able to call schedule() till take_cpu_down->__cpu_disable() is called from stop_machine. This case the tasks from just unthrottled cfs_rqs are pickable in a standard scheduler way, and they are picked by dying cpu. The cfs_rqs becomes throttled again, and migrate_tasks() in migration_call skips their tasks (one more unthrottle in migrate_tasks()->CPU_DYING does not happen, because rq->rd is already NULL). Patch sets runtime_enabled to zero. This guarantees, the runtime is not accounted, and the cfs_rqs won't exceed given cfs_rq->runtime_remaining = 1, and tasks will be pickable in migrate_tasks(). runtime_enabled is recalculated again when rq becomes online again. Ben Segall also noticed, we always enable runtime in tg_set_cfs_bandwidth(). Actually, we should do that for online cpus only. To prevent races with unthrottle_offline_cfs_rqs() we take get_online_cpus() lock. Reviewed-by: Ben Segall <bsegall@google.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> CC: Konstantin Khorenko <khorenko@parallels.com> CC: Paul Turner <pjt@google.com> CC: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403684382.3462.42.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05sched: Fix clock_gettime(CLOCK_[PROCESS/THREAD]_CPUTIME_ID) monotonicityMike Galbraith
If a task has been dequeued, it has been accounted. Do not project cycles that may or may not ever be accounted to a dequeued task, as that may make clock_gettime() both inaccurate and non-monotonic. Protect update_rq_clock() from slight TSC skew while at it. Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: kosaki.motohiro@jp.fujitsu.com Cc: pjt@google.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403588980.29711.11.camel@marge.simpson.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05sched/core: Fix formatting issues in sched_can_stop_tick()Viresh Kumar
sched_can_stop_tick() is using 7 spaces instead of 8 spaces or a 'tab' at the beginning of few lines. Which doesn't align well with the Coding Guidelines. Also remove local variable 'rq' as it is used at only one place and we can directly use this_rq() instead. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: fweisbec@gmail.com Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/afb781733e4a9ffbced5eb9fd25cc0aa5c6ffd7a.1403596966.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-23rcu: Reduce overhead of cond_resched() checks for RCUPaul E. McKenney
Commit ac1bea85781e (Make cond_resched() report RCU quiescent states) fixed a problem where a CPU looping in the kernel with but one runnable task would give RCU CPU stall warnings, even if the in-kernel loop contained cond_resched() calls. Unfortunately, in so doing, it introduced performance regressions in Anton Blanchard's will-it-scale "open1" test. The problem appears to be not so much the increased cond_resched() path length as an increase in the rate at which grace periods complete, which increased per-update grace-period overhead. This commit takes a different approach to fixing this bug, mainly by moving the RCU-visible quiescent state from cond_resched() to rcu_note_context_switch(), and by further reducing the check to a simple non-zero test of a single per-CPU variable. However, this approach requires that the force-quiescent-state processing send resched IPIs to the offending CPUs. These will be sent only once the grace period has reached an age specified by the boot/sysfs parameter rcutree.jiffies_till_sched_qs, or once the grace period reaches an age halfway to the point at which RCU CPU stall warnings will be emitted, whichever comes first. Reported-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> [ paulmck: Made rcu_momentary_dyntick_idle() as suggested by the ktest build robot. Also fixed smp_mb() comment as noted by Oleg Nesterov. ] Merge with e552592e (Reduce overhead of cond_resched() checks for RCU) Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-06-16nohz: Use IPI implicit full barrier against rq->nr_running r/wFrederic Weisbecker
A full dynticks CPU is allowed to stop its tick when a single task runs. Meanwhile when a new task gets enqueued, the CPU must be notified so that it can restart its tick to maintain local fairness and other accounting details. This notification is performed by way of an IPI. Then when the target receives the IPI, we expect it to see the new value of rq->nr_running. Hence the following ordering scenario: CPU 0 CPU 1 write rq->running get IPI smp_wmb() smp_rmb() send IPI read rq->nr_running But Paul Mckenney says that nowadays IPIs imply a full barrier on all architectures. So we can safely remove this pair and rely on the implicit barriers that come along IPI send/receive. Lets just comment on this new assumption. Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>