Currently with 2.6.32-longterm, its possible for time() to occasionally
return values one second earlier then the previous time() call.
This happens because update_xtime_cache() does:
xtime_cache = xtime;
timespec_add_ns(&xtime_cache, nsec);
Its possible that xtime is 1sec,999msecs, and nsecs is 1ms, resulting in
a xtime_cache that is 2sec,0ms.
get_seconds() (which is used by sys_time()) does not take the
xtime_lock, which is ok as the xtime.tv_sec value is a long and can be
atomically read safely.
The problem occurs the next call to update_xtime_cache() if xtime has
not increased:
/* This sets xtime_cache back to 1sec, 999msec */
xtime_cache = xtime;
/* get_seconds, calls here, and sees a 1second inconsistency */
timespec_add_ns(&xtime_cache, nsec);
In order to resolve this, we could add locking to get_seconds(), but it
needs to be lock free, as it is called from the machine check handler,
opening a possible deadlock.
So instead, this patch introduces an intermediate value for the
calculations, so that we only assign xtime_cache once with the correct
time, using ACCESS_ONCE to make sure the compiler doesn't optimize out
any intermediate values.
The xtime_cache manipulations were removed with 2.6.35, so that kernel
and later do not need this change.
In 2.6.33 and 2.6.34 the logarithmic accumulation should make it so
xtime is updated each tick, so it is unlikely that two updates to
xtime_cache could occur while the difference between xtime and
xtime_cache crosses the second boundary. However, the paranoid might
want to pull this into 2.6.33/34-longterm just to be sure.
Thanks to Stephen for helping finally narrow down the root cause and
many hours of help with testing and validation. Also thanks to Max,
Andi, Eric and Paul for review of earlier attempts and helping clarify
what is possible with regard to out of order execution.
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Willy Tarreau <w@1wt.eu>
This is a merge of two mainline commits, intended
for stable@kernel.org submission for 2.6.27 kernel.
commit f833bab87f
and
commit 918aae42aa
Changelog of both:
Currently clockevents_notify() is called with interrupts enabled at
some places and interrupts disabled at some other places.
This results in a deadlock in this scenario.
cpu A holds clockevents_lock in clockevents_notify() with irqs enabled
cpu B waits for clockevents_lock in clockevents_notify() with irqs disabled
cpu C doing set_mtrr() which will try to rendezvous of all the cpus.
This will result in C and A come to the rendezvous point and waiting
for B. B is stuck forever waiting for the spinlock and thus not
reaching the rendezvous point.
Fix the clockevents code so that clockevents_lock is taken with
interrupts disabled and thus avoid the above deadlock.
Also call lapic_timer_propagate_broadcast() on the destination cpu so
that we avoid calling smp_call_function() in the clockevents notifier
chain.
This issue left us wondering if we need to change the MTRR rendezvous
logic to use stop machine logic (instead of smp_call_function) or add
a check in spinlock debug code to see if there are other spinlocks
which gets taken under both interrupts enabled/disabled conditions.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Brown Len" <len.brown@intel.com>
Cc: stable@kernel.org
LKML-Reference: <1250544899.2709.210.camel@sbs-t61.sc.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
I got following warning on ia64 box:
In function 'acpi_processor_power_verify':
642: warning: passing argument 2 of 'smp_call_function_single' from
incompatible pointer type
This smp_call_function_single() was introduced by a commit
f833bab87f:
The problem is that the lapic_timer_propagate_broadcast() has 2 versions:
One is real code that modified in the above commit, and the other is NOP
code that used when !ARCH_APICTIMER_STOPS_ON_C3:
static void lapic_timer_propagate_broadcast(struct acpi_processor *pr) { }
So I got warning because of !ARCH_APICTIMER_STOPS_ON_C3.
We really want to do nothing here on !ARCH_APICTIMER_STOPS_ON_C3, so
modify lapic_timer_propagate_broadcast() of real version to use
smp_call_function_single() in it.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit b845b517b5 upstream.
Avoid deadlocks against rq->lock and xtime_lock by deferring the klogd
wakeup by polling from the timer tick.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit 6c9bacb41c upstream.
Impact: fix time warp bug
Alex Shi, along with Yanmin Zhang have been noticing occasional time
inconsistencies recently. Through their great diagnosis, they found that
the xtime_nsec value used in update_wall_time was occasionally going
negative. After looking through the code for awhile, I realized we have
the possibility for an underflow when three conditions are met in
update_wall_time():
1) We have accumulated a second's worth of nanoseconds, so we
incremented xtime.tv_sec and appropriately decrement xtime_nsec.
(This doesn't cause xtime_nsec to go negative, but it can cause it
to be small).
2) The remaining offset value is large, but just slightly less then
cycle_interval.
3) clocksource_adjust() is speeding up the clock, causing a
corrective amount (compensating for the increase in the multiplier
being multiplied against the unaccumulated offset value) to be
subtracted from xtime_nsec.
This can cause xtime_nsec to underflow.
Unfortunately, since we notify the NTP subsystem via second_overflow()
whenever we accumulate a full second, and this effects the error
accumulation that has already occured, we cannot simply revert the
accumulated second from xtime nor move the second accumulation to after
the clocksource_adjust call without a change in behavior.
This leaves us with (at least) two options:
1) Simply return from clocksource_adjust() without making a change if we
notice the adjustment would cause xtime_nsec to go negative.
This would work, but I'm concerned that if a large adjustment was needed
(due to the error being large), it may be possible to get stuck with an
ever increasing error that becomes too large to correct (since it may
always force xtime_nsec negative). This may just be paranoia on my part.
2) Catch xtime_nsec if it is negative, then add back the amount its
negative to both xtime_nsec and the error.
This second method is consistent with how we've handled earlier rounding
issues, and also has the benefit that the error being added is always in
the oposite direction also always equal or smaller then the correction
being applied. So the risk of a corner case where things get out of
control is lessened.
This patch fixes bug 11970, as tested by Yanmin Zhang
http://bugzilla.kernel.org/show_bug.cgi?id=11970
Reported-by: alex.shi@intel.com
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Acked-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Tested-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit 74a03b69d1 upstream.
tick_handle_periodic() can lock up hard when a one shot clock event
device is used in combination with jiffies clocksource.
Avoid an endless loop issue by requiring that a highres valid
clocksource be installed before we call tick_periodic() in a loop when
using ONESHOT mode. The result is we will only increment jiffies once
per interrupt until a continuous hardware clocksource is available.
Without this, we can run into a endless loop, where each cycle through
the loop, jiffies is updated which increments time by tick_period or
more (due to clock steering), which can cause the event programming to
think the next event was before the newly incremented time and fail
causing tick_periodic() to be called again and the whole process loops
forever.
[ Impact: prevent hard lock up ]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit 9a055117d3 upstream.
To keep the raw monotonic patch simple first introduce
clocksource_forward_now(), which takes care of the offset since the last
update_wall_time() call and adds it to the clock, so there is no need
anymore to deal with it explicitly at various places, which need to make
significant changes to the clock.
This is also gets rid of the timekeeping_suspend_nsecs, instead of
waiting until resume, the value is accumulated during suspend. In the end
there is only a single user of __get_nsec_offset() left, so I integrated
it back to getnstimeofday().
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit 1c5745aa38 upstream.
Redo:
5b7dba4: sched_clock: prevent scd->clock from moving backwards
which had to be reverted due to s2ram hangs:
ca7e716: Revert "sched_clock: prevent scd->clock from moving backwards"
... this time with resume restoring GTOD later in the sequence
taken into account as well.
The "timekeeping_suspended" flag is not very nice but we cannot call into
GTOD before it has been properly resumed and the scheduler will run very
early in the resume sequence.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Impact: jiffies increment too fast.
Hugh Dickins noted that with NOHZ=n and HIGHRES=n jiffies get
incremented too fast. The reason is a wrong check in the broadcast
enter/exit code, which keeps the local apic timer in periodic mode
when the switch happens.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Impact: per CPU hrtimers can be migrated from a dead CPU
The hrtimer code has no knowledge about per CPU timers, but we need to
prevent the migration of such timers and warn when such a timer is
active at migration time.
Explicitely mark the timers as per CPU and use a more understandable
mode descriptor for the interrupts safe unlocked callback mode, which
is used by hrtimer_sleeper and the scheduler code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
kernel/time/tick-common.c: In function ‘tick_setup_periodic’:
kernel/time/tick-common.c:113: error: implicit declaration of function ‘tick_broadcast_oneshot_active’
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: timer hang on CPU online observed on AMD C1E systems
When a CPU is brought online then the broadcast machinery can
be in the one shot state already. Check this and setup the timer
device of the new CPU in one shot mode so the broadcast code
can pick up the next_event value correctly.
Another AMD C1E oddity, as we switch to broadcast immediately and
not after the full bring up via the ACPI cpu idle code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Impact: Possible hang on CPU online observed on AMD C1E machines.
The broadcast setup code looks at the mode of the tick device to
determine whether it needs to be shut down or setup. This is wrong
when the broadcast mode is set to one shot already. This can happen
when a CPU is brought online as it goes through the periodic setup
first.
The problem went unnoticed as sane systems do not call into that code
before the switch to one shot for the clock event device happens.
The AMD C1E idle routine switches over immediately and thereby shuts
down the just setup device before the first interrupt happens.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Impact: possible hang on CPU onlining in timer one shot mode.
The tick_next_period variable is only used during boot on nohz/highres
enabled systems, but for CPU onlining it needs to be maintained when
the per cpu clock events device operates in one shot mode.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Impact: rare hang which can be triggered on CPU online.
tick_do_timer_cpu keeps track of the CPU which updates jiffies
via do_timer. The value -1 is used to signal, that currently no
CPU is doing this. There are two cases, where the variable can
have this state:
boot:
necessary for systems where the boot cpu id can be != 0
nohz long idle sleep:
When the CPU which did the jiffies update last goes into
a long idle sleep it drops the update jiffies duty so
another CPU which is not idle can pick it up and keep
jiffies going.
Using the same value for both situations is wrong, as the CPU online
code can see the -1 state when the timer of the newly onlined CPU is
setup. The setup for a newly onlined CPU goes through periodic mode
and can pick up the do_timer duty without being aware of the nohz /
highres mode of the already running system.
Use two separate states and make them constants to avoid magic
numbers confusion.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The device shut down does not cleanup the next_event variable of the
clock event device. So when the device is reactivated the possible
stale next_event value can prevent the device to be reprogrammed as it
claims to wait on a event already.
This is the root cause of the resurfacing suspend/resume problem,
where systems need key press to come back to life.
Fix this by setting next_event to KTIME_MAX when the device is shut
down. Use a separate function for shutdown which takes care of that
and only keep the direct set mode call in the broadcast code, where we
can not touch the next_event value.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The issue of the endless reprogramming loop due to a too small
min_delta_ns was fixed with the previous updates of the clock events
code, but we had no information about the spread of this problem. I
added a WARN_ON to get automated information via kerneloops.org and to
get some direct reports, which allowed me to analyse the affected
machines.
The WARN_ON has served its purpose and would be annoying for a release
kernel. Remove it and just keep the information about the increase of
the min_delta_ns value.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We have a bug in the calculation of the next jiffie to trigger the RTC
synchronisation. The aim here is to run sync_cmos_clock() as close as
possible to the middle of a second. Which means we want this function to
be called less than or equal to half a jiffie away from when now.tv_nsec
equals 5e8 (500000000).
If this is not the case for a given call to the function, for this purpose
instead of updating the RTC we calculate the offset in nanoseconds to the
next point in time where now.tv_nsec will be equal 5e8. The calculated
offset is then converted to jiffies as these are the unit used by the
timer.
Hovewer timespec_to_jiffies() used here uses a ceil()-type rounding mode,
where the resulting value is rounded up. As a result the range of
now.tv_nsec when the timer will trigger is from 5e8 to 5e8 + TICK_NSEC
rather than the desired 5e8 - TICK_NSEC / 2 to 5e8 + TICK_NSEC / 2.
As a result if for example sync_cmos_clock() happens to be called at the
time when now.tv_nsec is between 5e8 + TICK_NSEC / 2 and 5e8 to 5e8 +
TICK_NSEC, it will simply be rescheduled HZ jiffies later, falling in the
same range of now.tv_nsec again. Similarly for cases offsetted by an
integer multiple of TICK_NSEC.
This change addresses the problem by subtracting TICK_NSEC / 2 from the
nanosecond offset to the next point in time where now.tv_nsec will be
equal 5e8, effectively shifting the following rounding in
timespec_to_jiffies() so that it produces a rounded-to-nearest result.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Until the C1E patches arrived there where no users of periodic broadcast
before switching to oneshot mode. Now we need to trigger a possible
waiter for a periodic broadcast when switching to oneshot mode.
Otherwise we can starve them for ever.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If HLT stops the TSC, we'll fail to account idle time, thereby inflating the
actual process times. Fix this by re-calibrating the clock against GTOD when
leaving nohz mode.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The C1E/HPET bug reports on AMDX2/RS690 systems where tracked down to a
too small value of the HPET minumum delta for programming an event.
The clockevents code needs to enforce an interrupt event on the clock event
device in some cases. The enforcement code was stupid and naive, as it just
added the minimum delta to the current time and tried to reprogram the device.
When the minimum delta is too small, then this loops forever.
Add a sanity check. Allow reprogramming to fail 3 times, then print a warning
and double the minimum delta value to make sure, that this does not happen again.
Use the same function for both tick-oneshot and tick-broadcast code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While chasing the C1E/HPET bugreports I went through the clock events
code inch by inch and found that the broadcast device can be initialized
and shutdown multiple times. Multiple shutdowns are not critical, but
useless waste of time. Multiple initializations are simply broken. Another
CPU might have the device in use already after the first initialization and
the second init could just render it unusable again.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In tick_oneshot_setup we program the device to the given next_event,
but we do not check the return value. We need to make sure that the
device is programmed enforced so the interrupt handler engine starts
working. Split out the reprogramming function from tick_program_event()
and call it with the device, which was handed in to tick_setup_oneshot().
Set the force argument, so the devices is firing an interrupt.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The reprogramming of the periodic broadcast handler was broken,
when the first programming returned -ETIME. The clockevents code
stores the new expiry value in the clock events device next_event field
only when the programming time has not been elapsed yet. The loop in
question calculates the new expiry value from the next_event value
and therefor never increases.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There is a ordering related problem with clockevents code, due to which
clockevents_register_device() called after tickless/highres switch
will not work. The new clockevent ends up with clockevents_handle_noop as
event handler, resulting in no timer activity.
The problematic path seems to be
* old device already has hrtimer_interrupt as the event_handler
* new clockevent device registers with a higher rating
* tick_check_new_device() is called
* clockevents_exchange_device() gets called
* old->event_handler is set to clockevents_handle_noop
* tick_setup_device() is called for the new device
* which sets new->event_handler using the old->event_handler which is noop.
Change the ordering so that new device inherits the proper handler.
This does not have any issue in normal case as most likely all the clockevent
devices are setup before the highres switch. But, can potentially be affecting
some corner case where HPET force detect happens after the highres switch.
This was a problem with HPET in MSI mode code that we have been experimenting
with.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On the tickless system(CONFIG_NO_HZ=y and CONFIG_HIGH_RES_TIMERS=n), after
I made an offlined cpu online, I found this cpu's event handler was
tick_handle_periodic, not tick_nohz_handler.
After debuging, I found this bug was caused by the wrong tick mode. the
tick mode is not changed to NOHZ_MODE_INACTIVE when the cpu is offline.
This patch fixes this bug.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Found an interactivity problem on a quad core test-system - simple
CPU loops would occasionally delay the system un an unacceptable way.
After much debugging with Peter Zijlstra it turned out that the problem
is caused by the string of sched_clock() changes - they caused the CPU
clock to jump backwards a bit - which confuses the scheduler arithmetics.
(which is unsigned for performance reasons)
So revert:
# c300ba2: sched_clock: and multiplier for TSC to gtod drift
# c0c8773: sched_clock: only update deltas with local reads.
# af52a90: sched_clock: stop maximum check on NO HZ
# f7cce27: sched_clock: widen the max and min time
This solves the interactivity problems.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mike Galbraith <efault@gmx.de>
* Replace previous instances of the cpumask_of_cpu_ptr* macros
with a the new (lvalue capable) generic cpumask_of_cpu().
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
nohz: adjust tick_nohz_stop_sched_tick() call of s390 as well
nohz: prevent tick stop outside of the idle loop
* 'cpus4096-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits)
NR_CPUS: Replace NR_CPUS in speedstep-centrino.c
cpumask: Provide a generic set of CPUMASK_ALLOC macros, FIXUP
NR_CPUS: Replace NR_CPUS in cpufreq userspace routines
NR_CPUS: Replace per_cpu(..., smp_processor_id()) with __get_cpu_var
NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genapic_flat_64.c
NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genx2apic_uv_x.c
NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/proc.c
NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/mcheck/mce_64.c
cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c, fix
cpumask: Use optimized CPUMASK_ALLOC macros in the centrino_target
cpumask: Provide a generic set of CPUMASK_ALLOC macros
cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c
cpumask: Optimize cpumask_of_cpu in kernel/time/tick-common.c
cpumask: Optimize cpumask_of_cpu in drivers/misc/sgi-xp/xpc_main.c
cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/ldt.c
cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/io_apic_64.c
cpumask: Replace cpumask_of_cpu with cpumask_of_cpu_ptr
Revert "cpumask: introduce new APIs"
cpumask: make for_each_cpu_mask a bit smaller
net: Pass reference to cpumask variable in net/sunrpc/svc.c
...
Fix up trivial conflicts in drivers/cpufreq/cpufreq.c manually
* 'core/softlockup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
softlockup: fix invalid proc_handler for softlockup_panic
softlockup: fix watchdog task wakeup frequency
softlockup: fix watchdog task wakeup frequency
softlockup: show irqtrace
softlockup: print a module list on being stuck
softlockup: fix NMI hangs due to lock race - 2.6.26-rc regression
softlockup: fix false positives on nohz if CPU is 100% idle for more than 60 seconds
softlockup: fix softlockup_thresh fix
softlockup: fix softlockup_thresh unaligned access and disable detection at runtime
softlockup: allow panic on lockup
This allow to dynamically generate attributes and share show/store
functions between attributes. Right now most attributes are generated
by special macros and lots of duplicated code. With the attribute
passed it's instead possible to attach some data to the attribute
and then use that in shared low level functions to do different things.
I need this for the dynamically generated bank attributes in the x86
machine check code, but it'll allow some further cleanups.
I converted all users in tree to the new show/store prototype. It's a single
huge patch to avoid unbisectable sections.
Runtime tested: x86-32, x86-64
Compiled only: ia64, powerpc
Not compile tested/only grep converted: sh, arm, avr32
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* Optimize various places where a pointer to the cpumask_of_cpu value
will result in reducing stack pressure.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Jack Ren and Eric Miao tracked down the following long standing
problem in the NOHZ code:
scheduler switch to idle task
enable interrupts
Window starts here
----> interrupt happens (does not set NEED_RESCHED)
irq_exit() stops the tick
----> interrupt happens (does set NEED_RESCHED)
return from schedule()
cpu_idle(): preempt_disable();
Window ends here
The interrupts can happen at any point inside the race window. The
first interrupt stops the tick, the second one causes the scheduler to
rerun and switch away from idle again and we end up with the tick
disabled.
The fact that it needs two interrupts where the first one does not set
NEED_RESCHED and the second one does made the bug obscure and extremly
hard to reproduce and analyse. Kudos to Jack and Eric.
Solution: Limit the NOHZ functionality to the idle loop to make sure
that we can not run into such a situation ever again.
cpu_idle()
{
preempt_disable();
while(1) {
tick_nohz_stop_sched_tick(1); <- tell NOHZ code that we
are in the idle loop
while (!need_resched())
halt();
tick_nohz_restart_sched_tick(); <- disables NOHZ mode
preempt_enable_no_resched();
schedule();
preempt_disable();
}
}
In hindsight we should have done this forever, but ...
/me grabs a large brown paperbag.
Debugged-by: Jack Ren <jack.ren@marvell.com>,
Debugged-by: eric miao <eric.y.miao@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* 'timers/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: add PCI ID for 6300ESB force hpet
x86: add another PCI ID for ICH6 force-hpet
kernel-paramaters: document pmtmr= command line option
acpi_pm clccksource: fix printk format warning
nohz: don't stop idle tick if softirqs are pending.
pmtmr: allow command line override of ioport
nohz: reduce jiffies polling overhead
hrtimer: Remove unused variables in ktime_divns()
hrtimer: remove warning in hres_timers_resume
posix-timers: print RT watchdog message
* 'sched/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (76 commits)
sched_clock: and multiplier for TSC to gtod drift
sched_clock: record TSC after gtod
sched_clock: only update deltas with local reads.
sched_clock: fix calculation of other CPU
sched_clock: stop maximum check on NO HZ
sched_clock: widen the max and min time
sched_clock: record from last tick
sched: fix accounting in task delay accounting & migration
sched: add avg-overlap support to RT tasks
sched: terminate newidle balancing once at least one task has moved over
sched: fix warning
sched: build fix
sched: sched_clock_cpu() based cpu_clock(), lockdep fix
sched: export cpu_clock
sched: make sched_{rt,fair}.c ifdefs more readable
sched: bias effective_load() error towards failing wake_affine().
sched: incremental effective_load()
sched: correct wakeup weight calculations
sched: fix mult overflow
sched: update shares on wakeup
...
Working with ftrace I would get large jumps of 11 millisecs or more with
the clock tracer. This killed the latencing timings of ftrace and also
caused the irqoff self tests to fail.
What was happening is with NO_HZ the idle would stop the jiffy counter and
before the jiffy counter was updated the sched_clock would have a bad
delta jiffies to compare with the gtod with the maximum.
The jiffies would stop and the last sched_tick would record the last gtod.
On wakeup, the sched clock update would compare the gtod + delta jiffies
(which would be zero) and compare it to the TSC. The TSC would have
correctly (with a stable TSC) moved forward several jiffies. But because the
jiffies has not been updated yet the clock would be prevented from moving
forward because it would appear that the TSC jumped too far ahead.
The clock would then virtually stop, until the jiffies are updated. Then
the next sched clock update would see that the clock was very much behind
since the delta jiffies is now correct. This would then jump the clock
forward by several jiffies.
This caused ftrace to report several milliseconds of interrupts off
latency at every resume from NO_HZ idle.
This patch adds hooks into the nohz code to disable the checking of the
maximum clock update when nohz is in effect. It resumes the max check
when nohz has updated the jiffies again.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In case a cpu goes idle but softirqs are pending only an error message is
printed to the console. It may take a very long time until the pending
softirqs will finally be executed. Worst case would be a hanging system.
With this patch the timer tick just continues and the softirqs will be
executed after the next interrupt. Still a delay but better than a
hanging system.
Currently we have at least two device drivers on s390 which under certain
circumstances schedule a tasklet from process context. This is a reason
why we can end up with pending softirqs when going idle. Fixing these
drivers seems to be non-trivial.
However there is no question that the drivers should be fixed.
This patch shouldn't be considered as a bug fix. It just is intended to
keep a system running even if device drivers are buggy.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jan Glauber <jan.glauber@de.ibm.com>
Cc: Stefan Weinhuber <wein@de.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
C1E on AMD machines is like C3 but without control from the OS. Up to
now we disabled the local apic timer for those machines as it stops
when the CPU goes into C1E. This excludes those machines from high
resolution timers / dynamic ticks, which hurts especially X2 based
laptops.
The current boot time C1E detection has another, more serious flaw
as well: some BIOSes do not enable C1E until the ACPI processor module
is loaded. This causes systems to stop working after that point.
To work nicely with C1E enabled machines we use a separate idle
function, which checks on idle entry whether C1E was enabled in the
Interrupt Pending Message MSR. This allows us to do timer broadcasting
for C1E and covers the late enablement of C1E as well.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It's never used and the comments refer to nonatomic and retry
interchangably. So get rid of it.
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Fix (probably theoretical only) rq->clock update bug:
in tick_nohz_update_jiffies() [which is called on all irq
entry on all cpus where the irq entry hits an idle cpu] we
call touch_softlockup_watchdog() before we update jiffies.
That works fine most of the time when idle timeouts are within
60 seconds. But when an idle timeout is beyond 60 seconds,
jiffies is updated with a jump of more than 60 seconds,
which causes a jump in cpu-clock of more than 60 seconds,
triggering a false positive.
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Change references from for_each_cpu_mask to for_each_cpu_mask_nr
where appropriate
Reviewed-by: Paul Jackson <pj@sgi.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There is no harm, when users can read the info and we ask often enough
during debugging for this kind of information.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
File permissions for
/sys/devices/system/clocksource/clocksource0/available_clocksource
are 600 which allows write access. But this is in fact a read only
file. So change permissions to 400.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Remove the leap second handling from second_overflow(), which doesn't have to
check for it every second anymore. With CONFIG_NO_HZ this also makes sure the
leap second is handled close to the full second. Additionally this makes it
possible to abort a leap second properly by resetting the STA_INS/STA_DEL
status bits.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>