summaryrefslogtreecommitdiff
path: root/kernel/time
AgeCommit message (Collapse)Author
2011-03-07clockevents: Prevent oneshot mode when broadcast device is periodicThomas Gleixner
commit 3a142a0672b48a853f00af61f184c7341ac9c99d upstream. When the per cpu timer is marked CLOCK_EVT_FEAT_C3STOP, then we only can switch into oneshot mode, when the backup broadcast device supports oneshot mode as well. Otherwise we would try to switch the broadcast device into an unsupported mode unconditionally. This went unnoticed so far as the current available broadcast devices support oneshot mode. Seth unearthed this problem while debugging and working around an hpet related BIOS wreckage. Add the necessary check to tick_is_oneshot_available(). Reported-and-tested-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <alpine.LFD.2.00.1102252231200.2701@localhost6.localdomain6> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-13timekeeping: Fix clock_gettime vsyscall time warpLin Ming
commit 0696b711e4be45fa104c12329f617beb29c03f78 upstream. Since commit 0a544198 "timekeeping: Move NTP adjusted clock multiplier to struct timekeeper" the clock multiplier of vsyscall is updated with the unmodified clock multiplier of the clock source and not with the NTP adjusted multiplier of the timekeeper. This causes user space observerable time warps: new CLOCK-warp maximum: 120 nsecs, 00000025c337c537 -> 00000025c337c4bf Add a new argument "mult" to update_vsyscall() and hand in the timekeeping internal NTP adjusted multiplier. Signed-off-by: Lin Ming <ming.m.lin@intel.com> Cc: "Zhang Yanmin" <yanmin_zhang@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Tony Luck <tony.luck@intel.com> LKML-Reference: <1258436990.17765.83.camel@minggr.sh.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Kurt Garloff <garloff@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-13nohz: Reuse ktime in sub-functions of tick_check_idle.Martin Schwidefsky
commit eed3b9cf3fe3fcc7a50238dfcab63a63914e8f42 upstream. On a system with NOHZ=y tick_check_idle calls tick_nohz_stop_idle and tick_nohz_update_jiffies. Given the right conditions (ts->idle_active and/or ts->tick_stopped) both function get a time stamp with ktime_get. The same time stamp can be reused if both function require one. On s390 this change has the additional benefit that gcc inlines the tick_nohz_stop_idle function into tick_check_idle. The number of instructions to execute tick_check_idle drops from 225 to 144 (without the ktime_get optimization it is 367 vs 215 instructions). before: 0) | tick_check_idle() { 0) | tick_nohz_stop_idle() { 0) | ktime_get() { 0) | read_tod_clock() { 0) 0.601 us | } 0) 1.765 us | } 0) 3.047 us | } 0) | ktime_get() { 0) | read_tod_clock() { 0) 0.570 us | } 0) 1.727 us | } 0) | tick_do_update_jiffies64() { 0) 0.609 us | } 0) 8.055 us | } after: 0) | tick_check_idle() { 0) | ktime_get() { 0) | read_tod_clock() { 0) 0.617 us | } 0) 1.773 us | } 0) | tick_do_update_jiffies64() { 0) 0.593 us | } 0) 4.477 us | } Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: john stultz <johnstul@us.ibm.com> LKML-Reference: <20090929122533.206589318@de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Jolly <jjolly@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-13nohz: Introduce arch_needs_cpuMartin Schwidefsky
commit 3c5d92a0cfb5103c0d5ab74d4ae6373d3af38148 upstream. Allow the architecture to request a normal jiffy tick when the system goes idle and tick_nohz_stop_sched_tick is called . On s390 the hook is used to prevent the system going fully idle if there has been an interrupt other than a clock comparator interrupt since the last wakeup. On s390 the HiperSockets response time for 1 connection ping-pong goes down from 42 to 34 microseconds. The CPU cost decreases by 27%. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> LKML-Reference: <20090929122533.402715150@de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Jolly <jjolly@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-04-01hrtimer: Tune hrtimer_interrupt hang logicThomas Gleixner
commit 41d2e494937715d3150e5c75d01f0e75ae899337 upstream. The hrtimer_interrupt hang logic adjusts min_delta_ns based on the execution time of the hrtimer callbacks. This is error-prone for virtual machines, where a guest vcpu can be scheduled out during the execution of the callbacks (and the callbacks themselves can do operations that translate to blocking operations in the hypervisor), which in can lead to large min_delta_ns rendering the system unusable. Replace the current heuristics with something more reliable. Allow the interrupt code to try 3 times to catch up with the lost time. If that fails use the total time spent in the interrupt handler to defer the next timer interrupt so the system can catch up with other things which got delayed. Limit that deferment to 100ms. The retry events and the maximum time spent in the interrupt handler are recorded and exposed via /proc/timer_list Inspired by a patch from Marcelo. Reported-by: Michael Tokarev <mjt@tls.msk.ru> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Marcelo Tosatti <mtosatti@redhat.com> Cc: kvm@vger.kernel.org Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-04-01timekeeping: Prevent oops when GENERIC_TIME=njohn stultz
commit ad6759fbf35d104dbf573cd6f4c6784ad6823f7e upstream. Aaro Koskinen reported an issue in kernel.org bugzilla #15366, where on non-GENERIC_TIME systems, accessing /sys/devices/system/clocksource/clocksource0/current_clocksource results in an oops. It seems the timekeeper/clocksource rework missed initializing the curr_clocksource value in the !GENERIC_TIME case. Thanks to Aaro for reporting and diagnosing the issue as well as testing the fix! Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi> Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> LKML-Reference: <1267475683.4216.61.camel@localhost.localdomain> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-02-23Export the symbol of getboottime and mmonotonic_to_bootbasedJason Wang
commit c93d89f3dbf0202bf19c07960ca8602b48c2f9a0 upstream. Export getboottime and monotonic_to_bootbased in order to let them could be used by following patch. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-02-09clocksource: fix compilation if no GENERIC_TIMEAaro Koskinen
commit a362c638bdf052bf424bce7645d39b101090f6ba upstream Commit a9238ce3bb0fda6e760780b702c6cbd3793087d3 broke compilation on platforms that do not implement GENERIC_TIME (e.g. iop32x): kernel/time/clocksource.c: In function 'clocksource_register': kernel/time/clocksource.c:556: error: implicit declaration of function 'clocksource_max_deferment' Provide the implementation of clocksource_max_deferment() also for such platforms. Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-01-28nohz: Prevent clocksource wrapping during idleJon Hunter
commit 98962465ed9e6ea99c38e0af63fe1dcb5a79dc25 upstream. The dynamic tick allows the kernel to sleep for periods longer than a single tick, but it does not limit the sleep time currently. In the worst case the kernel could sleep longer than the wrap around time of the time keeping clock source which would result in losing track of time. Prevent this by limiting it to the safe maximum sleep time of the current time keeping clock source. The value is calculated when the clock source is registered. [ tglx: simplified the code a bit and massaged the commit msg ] Signed-off-by: Jon Hunter <jon-hunter@ti.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1250617512-23567-2-git-send-email-jon-hunter@ti.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-01-28clockevents: Add missing include to pacify sparseH Hartley Sweeten
commit 8e1a928a2ed7e8d5cad97c8e985294b4caedd168 upstream. Include "tick-internal.h" in order to pick up the extern function prototype for clockevents_shutdown(). This quiets the following sparse build noise: warning: symbol 'clockevents_shutdown' was not declared. Should it be static? Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> LKML-Reference: <BD79186B4FD85F4B8E60E381CAEE190901E24550@mi8nycmail19.Mi8.com> Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> Cc: johnstul@us.ibm.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-01-28clockevent: Don't remove broadcast device when cpu is deadXiaotian Feng
commit ea9d8e3f45404d411c00ae67b45cc35c58265bb7 upstream. Marc reported that the BUG_ON in clockevents_notify() triggers on his system. This happens because the kernel tries to remove an active clock event device (used for broadcasting) from the device list. The handling of devices which can be used as per cpu device and as a global broadcast device is suboptimal. The simplest solution for now (and for stable) is to check whether the device is used as global broadcast device, but this needs to be revisited. [ tglx: restored the cpuweight check and massaged the changelog ] Reported-by: Marc Dionne <marc.c.dionne@gmail.com> Tested-by: Marc Dionne <marc.c.dionne@gmail.com> Signed-off-by: Xiaotian Feng <dfeng@redhat.com> LKML-Reference: <1262834564-13033-1-git-send-email-dfeng@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-01-06clockevents: Prevent clockevent_devices list corruption on cpu hotplugThomas Gleixner
commit bb6eddf7676e1c1f3e637aa93c5224488d99036f upstream. Xiaotian Feng triggered a list corruption in the clock events list on CPU hotplug and debugged the root cause. If a CPU registers more than one per cpu clock event device, then only the active clock event device is removed on CPU_DEAD. The unused devices are kept in the clock events device list. On CPU up the clock event devices are registered again, which means that we list_add an already enqueued list_head. That results in list corruption. Resolve this by removing all devices which are associated to the dead CPU on CPU_DEAD. Reported-by: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-10-11headers: remove sched.h from interrupt.hAlexey Dobriyan
After m68k's task_thread_info() doesn't refer to current, it's possible to remove sched.h from interrupt.h and not break m68k! Many thanks to Heiko Carstens for allowing this. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-10-07NOHZ: update idle state also when NOHZ is inactiveEero Nurkkala
Commit f2e21c9610991e95621a81407cdbab881226419b had unfortunate side effects with cpufreq governors on some systems. If the system did not switch into NOHZ mode ts->inidle is not set when tick_nohz_stop_sched_tick() is called from the idle routine. Therefor all subsequent calls from irq_exit() to tick_nohz_stop_sched_tick() fail to call tick_nohz_start_idle(). This results in bogus idle accounting information which is passed to cpufreq governors. Set the inidle flag unconditionally of the NOHZ active state to keep the idle time accounting correct in any case. [ tglx: Added comment and tweaked the changelog ] Reported-by: Steven Noonan <steven@uplinklabs.net> Signed-off-by: Eero Nurkkala <ext-eero.nurkkala@nokia.com> Cc: Rik van Riel <riel@redhat.com> Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Greg KH <greg@kroah.com> Cc: Steven Noonan <steven@uplinklabs.net> Cc: stable@kernel.org LKML-Reference: <1254907901.30157.93.camel@eenurkka-desktop> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-10-01const: constify remaining file_operationsAlexey Dobriyan
[akpm@linux-foundation.org: fix KVM] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-26Merge branch 'timers-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clocksource: Resume clocksource without taking the clocksource mutex
2009-09-24clocksource: Resume clocksource without taking the clocksource mutexMartin Schwidefsky
git commit 75c5158f70c065b9 converted the clocksource spinlock to a mutex. This causes the following BUG: BUG: sleeping function called from invalid context at kernel/mutex.c:280 in_atomic(): 0, irqs_disabled(): 1, pid: 2473, name: pm-suspend 2 locks held by pm-suspend/2473: #0: (&buffer->mutex){......}, at: [<ffffffff8115ab13>] sysfs_write_file+0x3c/0x137 #1: (pm_mutex){......}, at: [<ffffffff810865b5>] enter_state+0x39/0x130 Pid: 2473, comm: pm-suspend Not tainted 2.6.31 #1 Call Trace: [<ffffffff810792f0>] ? __debug_show_held_locks+0x22/0x24 [<ffffffff8104a2ef>] __might_sleep+0x107/0x10b [<ffffffff8141fca9>] mutex_lock_nested+0x25/0x43 [<ffffffff81073537>] clocksource_resume+0x1c/0x60 [<ffffffff81072902>] timekeeping_resume+0x1e/0x1c8 [<ffffffff812aee62>] __sysdev_resume+0x25/0xcf [<ffffffff812aef79>] sysdev_resume+0x6d/0xae [<ffffffff810864f8>] suspend_devices_and_enter+0x12b/0x1af [<ffffffff8108665b>] enter_state+0xdf/0x130 [<ffffffff81085dc3>] state_store+0xb6/0xd3 [<ffffffff81204c73>] kobj_attr_store+0x17/0x19 [<ffffffff8115abd2>] sysfs_write_file+0xfb/0x137 [<ffffffff811057d2>] vfs_write+0xae/0x10b [<ffffffff81208392>] ? __up_read+0x1a/0x7f [<ffffffff811058ef>] sys_write+0x4a/0x6e [<ffffffff81011b82>] system_call_fastpath+0x16/0x1b clocksource_resume is called early in the resume process, there is only one cpu, no processes are running and the interrupts are disabled. It is therefore possible to resume the clocksources without taking the clocksource mutex. Reported-by: Xiaotian Feng <xtfeng@gmail.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Tested-by: Michal Schmidt <mschmidt@redhat.com> Cc: Xiaotian Feng <xtfeng@gmail.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20090924172952.49697825@mschwide.boeblingen.de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-24time: add function to convert between calendar time and broken-down time for ↵Zhaolei
universal use There are many similar code in kernel for one object: convert time between calendar time and broken-down time. Here is some source I found: fs/ncpfs/dir.c fs/smbfs/proc.c fs/fat/misc.c fs/udf/udftime.c fs/cifs/netmisc.c net/netfilter/xt_time.c drivers/scsi/ips.c drivers/input/misc/hp_sdc_rtc.c drivers/rtc/rtc-lib.c arch/ia64/hp/sim/boot/fw-emu.c arch/m68k/mac/misc.c arch/powerpc/kernel/time.c arch/parisc/include/asm/rtc.h ... We can make a common function for this type of conversion, At least we can get following benefit: 1: Make kernel simple and unify 2: Easy to fix bug in converting code 3: Reduce clone of code in future For example, I'm trying to make ftrace display walltime, this patch will make me easy. This code is based on code from glibc-2.6 Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Pavel Machek <pavel@ucw.cz> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-18Merge branch 'timers-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (34 commits) time: Prevent 32 bit overflow with set_normalized_timespec() clocksource: Delay clocksource down rating to late boot clocksource: clocksource_select must be called with mutex locked clocksource: Resolve cpu hotplug dead lock with TSC unstable, fix crash timers: Drop a function prototype clocksource: Resolve cpu hotplug dead lock with TSC unstable timer.c: Fix S/390 comments timekeeping: Fix invalid getboottime() value timekeeping: Fix up read_persistent_clock() breakage on sh timekeeping: Increase granularity of read_persistent_clock(), build fix time: Introduce CLOCK_REALTIME_COARSE x86: Do not unregister PIT clocksource on PIT oneshot setup/shutdown clocksource: Avoid clocksource watchdog circular locking dependency clocksource: Protect the watchdog rating changes with clocksource_mutex clocksource: Call clocksource_change_rating() outside of watchdog_lock timekeeping: Introduce read_boot_clock timekeeping: Increase granularity of read_persistent_clock() timekeeping: Update clocksource with stop_machine timekeeping: Add timekeeper read_clock helper functions timekeeping: Move NTP adjusted clock multiplier to struct timekeeper ... Fix trivial conflict due to MIPS lemote -> loongson renaming.
2009-09-14clocksource: Delay clocksource down rating to late bootThomas Gleixner
The down rating of clock sources in the early boot process via the clock source watchdog mechanism can happen way before the per cpu event queues are initialized. This leads to a boot crash on x86 when the TSC is marked unstable in the SMP bring up. The selection of a clock source for time keeping happens in the late boot process so we can safely delay the list manipulation until clocksource_done_booting() is called. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <new-submission> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
2009-09-14clocksource: clocksource_select must be called with mutex lockedThomas Gleixner
The callers of clocksource_select must hold clocksource_mutex to protect the clocksource_list. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <new-submission> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
2009-09-11clocksource: Resolve cpu hotplug dead lock with TSC unstable, fix crashMartin Schwidefsky
The watchdog timer is started after the watchdog clocksource and at least one watched clocksource have been registered. The clocksource work element watchdog_work is initialized just before the clocksource timer is started. This is too late for the clocksource_mark_unstable call from native_cpu_up. To fix this use a static initializer for watchdog_work. This resolves a boot crash reported by multiple people. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20090911153305.3fe9a361@skybase> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-28clocksource: Resolve cpu hotplug dead lock with TSC unstableThomas Gleixner
Martin Schwidefsky analyzed it: To register a clocksource the clocksource_mutex is acquired and if necessary timekeeping_notify is called to install the clocksource as the timekeeper clock. timekeeping_notify uses stop_machine which needs to take cpu_add_remove_lock mutex. Starting a new cpu is done with the cpu_add_remove_lock mutex held. native_cpu_up checks the tsc of the new cpu and if the tsc is no good clocksource_change_rating is called. Which needs the clocksource_mutex and the deadlock is complete. The solution is to replace the TSC via the clocksource watchdog mechanism. Mark the TSC as unstable and schedule the watchdog work so it gets removed in the watchdog thread context. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <new-submission> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: John Stultz <johnstul@us.ibm.com>
2009-08-25timekeeping: Fix invalid getboottime() valueHiroshi Shimamoto
Don't use timespec_add_safe() with wall_to_monotonic, because wall_to_monotonic has negative values which will cause overflow in timespec_add_safe(). That makes btime in /proc/stat invalid. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Daniel Walker <dwalker@fifo99.com> LKML-Reference: <4A937FDE.4050506@ct.jp.nec.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-21time: Introduce CLOCK_REALTIME_COARSEjohn stultz
After talking with some application writers who want very fast, but not fine-grained timestamps, I decided to try to implement new clock_ids to clock_gettime(): CLOCK_REALTIME_COARSE and CLOCK_MONOTONIC_COARSE which returns the time at the last tick. This is very fast as we don't have to access any hardware (which can be very painful if you're using something like the acpi_pm clocksource), and we can even use the vdso clock_gettime() method to avoid the syscall. The only trade off is you only get low-res tick grained time resolution. This isn't a new idea, I know Ingo has a patch in the -rt tree that made the vsyscall gettimeofday() return coarse grained time when the vsyscall64 sysctrl was set to 2. However this affects all applications on a system. With this method, applications can choose the proper speed/granularity trade-off for themselves. Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: nikolag@ca.ibm.com Cc: Darren Hart <dvhltc@us.ibm.com> Cc: arjan@infradead.org Cc: jonathan@jonmasters.org LKML-Reference: <1250734414.6897.5.camel@localhost.localdomain> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-19clockevent: Prevent dead lock on clockevents_lockSuresh Siddha
Currently clockevents_notify() is called with interrupts enabled at some places and interrupts disabled at some other places. This results in a deadlock in this scenario. cpu A holds clockevents_lock in clockevents_notify() with irqs enabled cpu B waits for clockevents_lock in clockevents_notify() with irqs disabled cpu C doing set_mtrr() which will try to rendezvous of all the cpus. This will result in C and A come to the rendezvous point and waiting for B. B is stuck forever waiting for the spinlock and thus not reaching the rendezvous point. Fix the clockevents code so that clockevents_lock is taken with interrupts disabled and thus avoid the above deadlock. Also call lapic_timer_propagate_broadcast() on the destination cpu so that we avoid calling smp_call_function() in the clockevents notifier chain. This issue left us wondering if we need to change the MTRR rendezvous logic to use stop machine logic (instead of smp_call_function) or add a check in spinlock debug code to see if there are other spinlocks which gets taken under both interrupts enabled/disabled conditions. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: "Pallipadi Venkatesh" <venkatesh.pallipadi@intel.com> Cc: "Brown Len" <len.brown@intel.com> LKML-Reference: <1250544899.2709.210.camel@sbs-t61.sc.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-19clocksource: Avoid clocksource watchdog circular locking dependencyMartin Schwidefsky
stop_machine from a multithreaded workqueue is not allowed because of a circular locking dependency between cpu_down and the workqueue execution. Use a kernel thread to do the clocksource downgrade. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: john stultz <johnstul@us.ibm.com> LKML-Reference: <20090818170942.3ab80c91@skybase> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-19clocksource: Protect the watchdog rating changes with clocksource_mutexThomas Gleixner
Martin pointed out that commit 6ea41d2529 (clocksource: Call clocksource_change_rating() outside of watchdog_lock) has a theoretical reference count problem. The calls to clocksource_change_rating() are now done outside of the clocksource mutex and outside of the watchdog lock. A concurrent clocksource_unregister() could remove the clock. Split out the code which changes the rating from clocksource_change_rating() into __clocksource_change_rating(). Protect the clocksource_watchdog_work() code sequence with the clocksource_mutex() and call __clocksource_change_rating(). LKML-Reference: <alpine.LFD.2.00.0908171038420.2782@localhost.localdomain> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
2009-08-17timers: Drop write permission on /proc/timer_listAmerigo Wang
/proc/timer_list and /proc/slabinfo are not supposed to be written, so there should be no write permissions on it. Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Cc: linux-mm@kvack.org Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Amerigo Wang <amwang@redhat.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Arjan van de Ven <arjan@linux.intel.com> LKML-Reference: <20090817094525.6355.88682.sendpatchset@localhost.localdomain> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-15clocksource: Call clocksource_change_rating() outside of watchdog_lockThomas Gleixner
The changes to the watchdog logic introduced a lock inversion between watchdog_lock and clocksource_mutex. Change the rating outside of watchdog_lock to avoid it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15timekeeping: Introduce read_boot_clockMartin Schwidefsky
Add the new function read_boot_clock to get the exact time the system has been started. For architectures without support for exact boot time a new weak function is added that returns 0. Use the exact boot time to initialize wall_to_monotonic, or xtime if the read_boot_clock returned 0. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Daniel Walker <dwalker@fifo99.com> LKML-Reference: <20090814134811.296703241@de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15timekeeping: Increase granularity of read_persistent_clock()Martin Schwidefsky
The persistent clock of some architectures (e.g. s390) have a better granularity than seconds. To reduce the delta between the host clock and the guest clock in a virtualized system change the read_persistent_clock function to return a struct timespec. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Daniel Walker <dwalker@fifo99.com> LKML-Reference: <20090814134811.013873340@de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15timekeeping: Update clocksource with stop_machineMartin Schwidefsky
update_wall_time calls change_clocksource HZ times per second to check if a new clock source is available. In close to 100% of all calls there is no new clock. Replace the tick based check by an update done with stop_machine. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Daniel Walker <dwalker@fifo99.com> LKML-Reference: <20090814134810.711836357@de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15timekeeping: Add timekeeper read_clock helper functionsMartin Schwidefsky
Add timekeeper_read_clock_ntp and timekeeper_read_clock_raw and use them for getnstimeofday, ktime_get, ktime_get_ts and getrawmonotonic. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Daniel Walker <dwalker@fifo99.com> LKML-Reference: <20090814134810.435105711@de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15timekeeping: Move NTP adjusted clock multiplier to struct timekeeperMartin Schwidefsky
The clocksource structure has two multipliers, the unmodified multiplier clock->mult_orig and the NTP corrected multiplier clock->mult. The NTP multiplier is misplaced in the struct clocksource, this is private information of the timekeeping code. Add the mult field to the struct timekeeper to contain the NTP corrected value, keep the unmodifed multiplier in clock->mult and remove clock->mult_orig. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Daniel Walker <dwalker@fifo99.com> LKML-Reference: <20090814134810.149047645@de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15timekeeping: Add xtime_shift and ntp_error_shift to struct timekeeperMartin Schwidefsky
The xtime_nsec value in the timekeeper structure is shifted by a few bits to improve precision. This happens to be the same value as the clock->shift. To improve readability add xtime_shift to the timekeeper and use it instead of the clock->shift. Likewise add ntp_error_shift and replace all (NTP_SCALE_SHIFT - clock->shift) expressions. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Daniel Walker <dwalker@fifo99.com> LKML-Reference: <20090814134809.871899606@de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15timekeeping: Introduce struct timekeeperMartin Schwidefsky
Add struct timekeeper to keep the internal values timekeeping.c needs in regard to the currently selected clock source. This moves the timekeeping intervals, xtime_nsec and the ntp error value from struct clocksource to struct timekeeper. The raw_time is removed from the clocksource as well. It gets treated like xtime as a global variable. Eventually xtime raw_time should be moved to struct timekeeper. [ tglx: minor cleanup ] Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Daniel Walker <dwalker@fifo99.com> LKML-Reference: <20090814134809.613209842@de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15clocksource: Move watchdog downgrade to a work queue threadMartin Schwidefsky
Move the downgrade of an unstable clocksource from the timer interrupt context into the process context of a work queue thread. This is needed to be able to do the clocksource switch with stop_machine. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Daniel Walker <dwalker@fifo99.com> LKML-Reference: <20090814134809.354926067@de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15clocksource: Refactor clocksource watchdogMartin Schwidefsky
Refactor clocksource watchdog code to make it more readable. Add clocksource_dequeue_watchdog to remove a clocksource from the watchdog list when it is unregistered. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Daniel Walker <dwalker@fifo99.com> LKML-Reference: <20090814134809.110881699@de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15clocksource: Simplify clocksource watchdog resume logicMartin Schwidefsky
To resume the clocksource watchdog just remove the CLOCK_SOURCE_WATCHDOG bit from the watched clocksource. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Daniel Walker <dwalker@fifo99.com> LKML-Reference: <20090814134808.880925790@de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15clocksource: Delay clocksource watchdog highres enablementMartin Schwidefsky
The clocksource watchdog marks a clock as highres capable before it checked the deviation from the watchdog clocksource even for a single time. Make sure that the deviation is at least checked once before doing the switch to highres mode. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Daniel Walker <dwalker@fifo99.com> LKML-Reference: <20090814134808.627795883@de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15clocksource: Cleanup clocksource selectionMartin Schwidefsky
If a non high-resolution clocksource is first set as override clock and then registered it becomes active even if the system is in one-shot mode. Move the override check from sysfs_override_clocksource to the clocksource selection. That fixes the bug and simplifies the code. The check in clocksource_register for double registration of the same clocksource is removed without replacement. To find the initial clocksource a new weak function in jiffies.c is defined that returns the jiffies clocksource. The architecture code can then override the weak function with a more suitable clocksource, e.g. the TOD clock on s390. [ tglx: Folded in a fix from John Stultz ] Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Daniel Walker <dwalker@fifo99.com> LKML-Reference: <20090814134808.388024160@de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15timekeeping: Move reset of cycle_last for tsc clocksource to tscMartin Schwidefsky
change_clocksource resets the cycle_last value to zero then sets it to a value read from the clocksource. The reset to zero is required only for the TSC clocksource to make the read_tsc function work after a resume. The reason is that the TSC read function uses cycle_last to detect backwards going TSCs. In the resume case cycle_last contains the TSC value from the last update before the suspend. On resume the TSC starts counting from 0 again and would trip over the cycle_last comparison. This is subtle and surprising. Move the reset to a resume function in the tsc code. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Daniel Walker <dwalker@fifo99.com> LKML-Reference: <20090814134808.142191175@de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15timekeeping: Remove clocksource inline functionsMartin Schwidefsky
The three inline functions clocksource_read, clocksource_enable and clocksource_disable are simple wrappers of an indirect call plus the copy from and to the mult_orig value. The functions are exclusively used by the timekeeping code which has intimate knowledge of the clocksource anyway. Therefore remove the inline functions. No functional change. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Daniel Walker <dwalker@fifo99.com> LKML-Reference: <20090814134807.903108946@de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15timekeeping: Introduce timekeeping_leap_insertJohn Stultz
Move the adjustment of xtime, wall_to_monotonic and the update of the vsyscall variables to the timekeeping code. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> LKML-Reference: <20090814134807.609730216@de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-14Merge branch 'linus' into timers/coreThomas Gleixner
Reason: Martin's timekeeping cleanup series depends on both timers/core and mainline changes. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-07-19clocksource: Prevent NULL pointer dereferenceThomas Gleixner
Writing a zero length string to sys/.../current_clocksource will cause a NULL pointer dereference if the clock events system is in one shot (highres or nohz) mode. Pointed-out-by: Dan Carpenter <error27@gmail.com> LKML-Reference: <alpine.DEB.2.00.0907191545580.12306@bicker> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-07-10hrtimer: Fix migration expiry checkThomas Gleixner
The timer migration expiry check should prevent the migration of a timer to another CPU when the timer expires before the next event is scheduled on the other CPU. Migrating the timer might delay it because we can not reprogram the clock event device on the other CPU. But the code implementing that check has two flaws: - for !HIGHRES the check compares the expiry value with the clock events device expiry value which is wrong for CLOCK_REALTIME based timers. - the check is racy. It holds the hrtimer base lock of the target CPU, but the clock event device expiry value can be modified nevertheless, e.g. by an timer interrupt firing. The !HIGHRES case is easy to fix as we can enqueue the timer on the cpu which was selected by the load balancer. It runs the idle balancing code once per jiffy anyway. So the maximum delay for the timer is the same as when we keep the tick on the current cpu going. In the HIGHRES case we can get the next expiry value from the hrtimer cpu_base of the target CPU and serialize the update with the cpu_base lock. This moves the lock section in hrtimer_interrupt() so we can set next_event to KTIME_MAX while we are handling the expired timers and set it to the next expiry value after we handled the timers under the base lock. While the expired timers are processed timer migration is blocked because the expiry time of the timer is always <= KTIME_MAX. Also remove the now useless clockevents_get_next_event() function. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-07-07timekeeping: Move ktime_get() functions to timekeeping.cThomas Gleixner
The ktime_get() functions for GENERIC_TIME=n are still located in hrtimer.c. Move them to time/timekeeping.c where they belong. LKML-Reference: <new-submission> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-07-07timekeeping: optimized ktime_get[_ts] for GENERIC_TIME=yMartin Schwidefsky
The generic ktime_get function defined in kernel/hrtimer.c is suboptimial for GENERIC_TIME=y: 0) | ktime_get() { 0) | ktime_get_ts() { 0) | getnstimeofday() { 0) | read_tod_clock() { 0) 0.601 us | } 0) 1.938 us | } 0) | set_normalized_timespec() { 0) 0.602 us | } 0) 4.375 us | } 0) 5.523 us | } Overall there are two read_seqbegin/read_seqretry loops and a lot of unnecessary struct timespec calculations. ktime_get returns a nano second value which is the sum of xtime, wall_to_monotonic and the nano second delta from the clock source. ktime_get can be optimized for GENERIC_TIME=y. The new version only calls clocksource_read: 0) | ktime_get() { 0) | read_tod_clock() { 0) 0.610 us | } 0) 1.977 us | } It uses a single read_seqbegin/readseqretry loop and just adds everthing to a nano second value. ktime_get_ts is optimized in a similar fashion. [ tglx: added WARN_ON(timekeeping_suspended) as in getnstimeofday() ] Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Acked-by: john stultz <johnstul@us.ibm.com> LKML-Reference: <20090707112728.3005244d@skybase> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>