<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/kernel, branch v2.6.16.50</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>hrtimer: prevent overrun DoS in hrtimer_forward()</title>
<updated>2007-04-08T23:04:23+00:00</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2007-04-08T23:04:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=75da4e64268ffdbd56143b4642d340b9bf3ee350'/>
<id>75da4e64268ffdbd56143b4642d340b9bf3ee350</id>
<content type='text'>
hrtimer_forward() does not check for the possible overflow of
timer-&gt;expires. This can happen on 64 bit machines with large interval
values and results currently in an endless loop in the softirq because
the expiry value becomes negative and therefor the timer is expired all
the time.

Check for this condition and set the expiry value to the max. expiry
time in the future.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Adrian Bunk &lt;bunk@stusta.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
hrtimer_forward() does not check for the possible overflow of
timer-&gt;expires. This can happen on 64 bit machines with large interval
values and results currently in an endless loop in the softirq because
the expiry value becomes negative and therefor the timer is expired all
the time.

Check for this condition and set the expiry value to the max. expiry
time in the future.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Adrian Bunk &lt;bunk@stusta.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Fix reparenting to the same thread group. (take 2)</title>
<updated>2007-04-08T21:47:24+00:00</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2007-04-08T21:47:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=f95540a5193f909b5700555ceafc69bcb13d2174'/>
<id>f95540a5193f909b5700555ceafc69bcb13d2174</id>
<content type='text'>
This patch fixes the case when we reparent to a different thread in the
same thread group.  This modifies the code so that we do not send
signals and do not change the signal to send to SIGCHLD unless we have
change the thread group of our parents.  It also suppresses sending
pdeath_sig in this cas as well since the result of geppid doesn't
change.

Thanks to Oleg for spotting my bug of only fixing this for non-ptraced
tasks.

This fixes the issues identified by Albert Cahalan in thread
http://lkml.org/lkml/2006/12/21/22

Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
Signed-off-by: Adrian Bunk &lt;bunk@stusta.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This patch fixes the case when we reparent to a different thread in the
same thread group.  This modifies the code so that we do not send
signals and do not change the signal to send to SIGCHLD unless we have
change the thread group of our parents.  It also suppresses sending
pdeath_sig in this cas as well since the result of geppid doesn't
change.

Thanks to Oleg for spotting my bug of only fixing this for non-ptraced
tasks.

This fixes the issues identified by Albert Cahalan in thread
http://lkml.org/lkml/2006/12/21/22

Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
Signed-off-by: Adrian Bunk &lt;bunk@stusta.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>load_module: no BUG if module_subsys uninitialized</title>
<updated>2007-03-20T23:28:30+00:00</updated>
<author>
<name>Ed Swierk</name>
<email>eswierk@arastra.com</email>
</author>
<published>2007-03-20T23:28:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=de37843b235a600c2402a44ab235d0d1634f82ca'/>
<id>de37843b235a600c2402a44ab235d0d1634f82ca</id>
<content type='text'>
Invoking load_module() before param_sysfs_init() is called crashes in
mod_sysfs_setup(), since the kset in module_subsys is not initialized yet.

In my case, net-pf-1 is getting modprobed as a result of hotplug trying to
create a UNIX socket.  Calls to hotplug begin after the topology_init
initcall.

Another patch for the same symptom (module_subsys-initialize-earlier.patch)
moves param_sysfs_init() to the subsys initcalls, but this is still not
early enough in the boot process in some cases.  In particular,
topology_init() causes /sbin/hotplug to run, which requests net-pf-1 (the
UNIX socket protocol) which can be compiled as a module.  Moving
param_sysfs_init() to the postcore initcalls fixes this particular race,
but there might well be other cases where a usermodehelper causes a module
to load earlier still.

The patch makes load_module() return an error rather than crashing the
kernel if invoked before module_subsys is initialized.

Signed-off-by: Adrian Bunk &lt;bunk@stusta.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Invoking load_module() before param_sysfs_init() is called crashes in
mod_sysfs_setup(), since the kset in module_subsys is not initialized yet.

In my case, net-pf-1 is getting modprobed as a result of hotplug trying to
create a UNIX socket.  Calls to hotplug begin after the topology_init
initcall.

Another patch for the same symptom (module_subsys-initialize-earlier.patch)
moves param_sysfs_init() to the subsys initcalls, but this is still not
early enough in the boot process in some cases.  In particular,
topology_init() causes /sbin/hotplug to run, which requests net-pf-1 (the
UNIX socket protocol) which can be compiled as a module.  Moving
param_sysfs_init() to the postcore initcalls fixes this particular race,
but there might well be other cases where a usermodehelper causes a module
to load earlier still.

The patch makes load_module() return an error rather than crashing the
kernel if invoked before module_subsys is initialized.

Signed-off-by: Adrian Bunk &lt;bunk@stusta.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>softirq: remove BUG_ONs which can incorrectly trigger</title>
<updated>2006-12-15T00:38:04+00:00</updated>
<author>
<name>Zachary Amsden</name>
<email>zach@vmware.com</email>
</author>
<published>2006-12-15T00:38:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=e89da8fc50c416abf4abab77d851e051df0edbec'/>
<id>e89da8fc50c416abf4abab77d851e051df0edbec</id>
<content type='text'>
It is possible to have tasklets get scheduled before softirqd has had a chance
to spawn on all CPUs.  This is totally harmless; after success during action
CPU_UP_PREPARE, action CPU_ONLINE will be called, which immediately wakes
softirqd on the appropriate CPU to process the already pending tasklets.  So
there is no danger of having a missed wakeup for any tasklets that were
already pending.

In particular, i386 is affected by this during startup, and is visible when
using a very large initrd; during the time it takes for the initrd to be
decompressed, a timer IRQ can come in and schedule RCU callbacks.  It is also
possible that resending of a hardware IRQ via a softirq triggers the same bug.

Because of different timing conditions, this shows up in all emulators and
virtual machines tested, including Xen, VMware, Virtual PC, and Qemu.  It is
also possible to trigger on native hardware with a large enough initrd,
although I don't have a reliable case demonstrating that.

Signed-off-by: Zachary Amsden &lt;zach@vmware.com&gt;
Signed-off-by: Adrian Bunk &lt;bunk@stusta.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
It is possible to have tasklets get scheduled before softirqd has had a chance
to spawn on all CPUs.  This is totally harmless; after success during action
CPU_UP_PREPARE, action CPU_ONLINE will be called, which immediately wakes
softirqd on the appropriate CPU to process the already pending tasklets.  So
there is no danger of having a missed wakeup for any tasklets that were
already pending.

In particular, i386 is affected by this during startup, and is visible when
using a very large initrd; during the time it takes for the initrd to be
decompressed, a timer IRQ can come in and schedule RCU callbacks.  It is also
possible that resending of a hardware IRQ via a softirq triggers the same bug.

Because of different timing conditions, this shows up in all emulators and
virtual machines tested, including Xen, VMware, Virtual PC, and Qemu.  It is
also possible to trigger on native hardware with a large enough initrd,
although I don't have a reliable case demonstrating that.

Signed-off-by: Zachary Amsden &lt;zach@vmware.com&gt;
Signed-off-by: Adrian Bunk &lt;bunk@stusta.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Remove redundant up() in stop_machine()</title>
<updated>2006-12-04T12:58:06+00:00</updated>
<author>
<name>Zhou Yingchao</name>
<email>yingchao.zhou@gmail.com</email>
</author>
<published>2006-12-04T12:58:06+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=4c3b882b6efbf091fdc0109a952844668d431e6a'/>
<id>4c3b882b6efbf091fdc0109a952844668d431e6a</id>
<content type='text'>
An up() is called in kernel/stop_machine.c on failure, and also in the
caller (unconditionally).

Signed-off-by: Zhou Yingchao &lt;yingchao.zhou@gmail.com&gt;
Signed-off-by: Adrian Bunk &lt;bunk@stusta.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
An up() is called in kernel/stop_machine.c on failure, and also in the
caller (unconditionally).

Signed-off-by: Zhou Yingchao &lt;yingchao.zhou@gmail.com&gt;
Signed-off-by: Adrian Bunk &lt;bunk@stusta.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>fix sys_getppid oopses on debug kernel</title>
<updated>2006-11-24T02:08:27+00:00</updated>
<author>
<name>Kirill Korotaev</name>
<email>dev@openvz.org</email>
</author>
<published>2006-11-24T02:08:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=7eeeab0403c1d0dffbbe236e229ab98fde0aacac'/>
<id>7eeeab0403c1d0dffbbe236e229ab98fde0aacac</id>
<content type='text'>
sys_getppid() optimization can access a freed memory.  On kernels with
DEBUG_SLAB turned ON, this results in Oops.  As Dave Hansen noted, this
optimization is also unsafe for memory hotplug.

So this patch always takes the lock to be safe.

Signed-off-by: Kirill Korotaev &lt;dev@openvz.org&gt;
Signed-off-by: Adrian Bunk &lt;bunk@stusta.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
sys_getppid() optimization can access a freed memory.  On kernels with
DEBUG_SLAB turned ON, this results in Oops.  As Dave Hansen noted, this
optimization is also unsafe for memory hotplug.

So this patch always takes the lock to be safe.

Signed-off-by: Kirill Korotaev &lt;dev@openvz.org&gt;
Signed-off-by: Adrian Bunk &lt;bunk@stusta.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Fix longstanding load balancing bug in the scheduler</title>
<updated>2006-11-11T11:13:53+00:00</updated>
<author>
<name>Christoph Lameter</name>
<email>clameter@sgi.com</email>
</author>
<published>2006-11-11T11:13:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=ef147950bee58ecfa672433badf679608fb90668'/>
<id>ef147950bee58ecfa672433badf679608fb90668</id>
<content type='text'>
The scheduler will stop load balancing if the most busy processor contains
processes pinned via processor affinity.

The scheduler currently only does one search for busiest cpu.  If it cannot
pull any tasks away from the busiest cpu because they were pinned then the
scheduler goes into a corner and sulks leaving the idle processors idle.

F.e.  If you have processor 0 busy running four tasks pinned via taskset,
there are none on processor 1 and one just started two processes on
processor 2 then the scheduler will not move one of the two processes away
from processor 2.

This patch fixes that issue by forcing the scheduler to come out of its
corner and retrying the load balancing by considering other processors for
load balancing.

This patch was originally developed by John Hawkes and discussed at

    http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=113901368523205&amp;w=2.

I have removed extraneous material and gone back to equipping struct rq
with the cpu the queue is associated with since this makes the patch much
easier and it is likely that others in the future will have the same
difficulty of figuring out which processor owns which runqueue.

The overhead added through these patches is a single word on the stack if
the kernel is configured to support 32 cpus or less (32 bit).  For 32 bit
environments the maximum number of cpus that can be configued is 255 which
would result in the use of 32 bytes additional on the stack.  On IA64 up to
1k cpus can be configured which will result in the use of 128 additional
bytes on the stack.  The maximum additional cache footprint is one
cacheline.  Typically memory use will be much less than a cacheline and the
additional cpumask will be placed on the stack in a cacheline that already
contains other local variable.

Signed-off-by: Christoph Lameter &lt;clameter@sgi.com&gt;
Signed-off-by: Adrian Bunk &lt;bunk@stusta.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The scheduler will stop load balancing if the most busy processor contains
processes pinned via processor affinity.

The scheduler currently only does one search for busiest cpu.  If it cannot
pull any tasks away from the busiest cpu because they were pinned then the
scheduler goes into a corner and sulks leaving the idle processors idle.

F.e.  If you have processor 0 busy running four tasks pinned via taskset,
there are none on processor 1 and one just started two processes on
processor 2 then the scheduler will not move one of the two processes away
from processor 2.

This patch fixes that issue by forcing the scheduler to come out of its
corner and retrying the load balancing by considering other processors for
load balancing.

This patch was originally developed by John Hawkes and discussed at

    http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=113901368523205&amp;w=2.

I have removed extraneous material and gone back to equipping struct rq
with the cpu the queue is associated with since this makes the patch much
easier and it is likely that others in the future will have the same
difficulty of figuring out which processor owns which runqueue.

The overhead added through these patches is a single word on the stack if
the kernel is configured to support 32 cpus or less (32 bit).  For 32 bit
environments the maximum number of cpus that can be configued is 255 which
would result in the use of 32 bytes additional on the stack.  On IA64 up to
1k cpus can be configured which will result in the use of 128 additional
bytes on the stack.  The maximum additional cache footprint is one
cacheline.  Typically memory use will be much less than a cacheline and the
additional cpumask will be placed on the stack in a cacheline that already
contains other local variable.

Signed-off-by: Christoph Lameter &lt;clameter@sgi.com&gt;
Signed-off-by: Adrian Bunk &lt;bunk@stusta.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>posix-cpu-timers: prevent signal delivery starvation</title>
<updated>2006-11-05T08:02:46+00:00</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2006-11-05T08:02:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=fe8187b80c7b4fee6fd5dfb0d68891230ae59dcb'/>
<id>fe8187b80c7b4fee6fd5dfb0d68891230ae59dcb</id>
<content type='text'>
The integer divisions in the timer accounting code can round the result
down to 0.  Adding 0 is without effect and the signal delivery stops.

Clamp the division result to minimum 1 to avoid this.

Problem was reported by Seongbae Park &lt;spark@google.com&gt;, who provided
also an inital patch.

Roland sayeth:

  I have had some more time to think about the problem, and to reproduce it
  using Toyo's test case.  For the record, if my understanding of the problem
  is correct, this happens only in one very particular case.  First, the
  expiry time has to be so soon that in cputime_t units (usually 1s/HZ ticks)
  it's &lt; nthreads so the division yields zero.  Second, it only affects each
  thread that is so new that its CPU time accumulation is zero so now+0 is
  still zero and -&gt;it_*_expires winds up staying zero.  For the VIRT and PROF
  clocks when cputime_t is tick granularity (or the SCHED clock on
  configurations where sched_clock's value only advances on clock ticks), this
  is not hard to arrange with new threads starting up and blocking before they
  accumulate a whole tick of CPU time.  That's what happens in Toyo's test
  case.

  Note that in general it is fine for that division to round down to zero,
  and set each thread's expiry time to its "now" time.  The problem only
  arises with thread's whose "now" value is still zero, so that now+0 winds up
  0 and is interpreted as "not set" instead of "&gt;= now".  So it would be a
  sufficient and more precise fix to just use max(ticks, 1) inside the loop
  when setting each it_*_expires value.

  But, it does no harm to round the division up to one and always advance
  every thread's expiry time.  If the thread didn't already fire timers for
  the expiry time of "now", there is no expectation that it will do so before
  the next tick anyway.  So I followed Thomas's patch in lifting the max out
  of the loops.

  This patch also covers the reload cases, which are harder to write a test
  for (and I didn't try).  I've tested it with Toyo's case and it fixes that.

[toyoa@mvista.com: fix: min_t -&gt; max_t]
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Roland McGrath &lt;roland@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@osdl.org&gt;
Signed-off-by: Adrian Bunk &lt;bunk@stusta.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The integer divisions in the timer accounting code can round the result
down to 0.  Adding 0 is without effect and the signal delivery stops.

Clamp the division result to minimum 1 to avoid this.

Problem was reported by Seongbae Park &lt;spark@google.com&gt;, who provided
also an inital patch.

Roland sayeth:

  I have had some more time to think about the problem, and to reproduce it
  using Toyo's test case.  For the record, if my understanding of the problem
  is correct, this happens only in one very particular case.  First, the
  expiry time has to be so soon that in cputime_t units (usually 1s/HZ ticks)
  it's &lt; nthreads so the division yields zero.  Second, it only affects each
  thread that is so new that its CPU time accumulation is zero so now+0 is
  still zero and -&gt;it_*_expires winds up staying zero.  For the VIRT and PROF
  clocks when cputime_t is tick granularity (or the SCHED clock on
  configurations where sched_clock's value only advances on clock ticks), this
  is not hard to arrange with new threads starting up and blocking before they
  accumulate a whole tick of CPU time.  That's what happens in Toyo's test
  case.

  Note that in general it is fine for that division to round down to zero,
  and set each thread's expiry time to its "now" time.  The problem only
  arises with thread's whose "now" value is still zero, so that now+0 winds up
  0 and is interpreted as "not set" instead of "&gt;= now".  So it would be a
  sufficient and more precise fix to just use max(ticks, 1) inside the loop
  when setting each it_*_expires value.

  But, it does no harm to round the division up to one and always advance
  every thread's expiry time.  If the thread didn't already fire timers for
  the expiry time of "now", there is no expectation that it will do so before
  the next tick anyway.  So I followed Thomas's patch in lifting the max out
  of the loops.

  This patch also covers the reload cases, which are harder to write a test
  for (and I didn't try).  I've tested it with Toyo's case and it fixes that.

[toyoa@mvista.com: fix: min_t -&gt; max_t]
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Roland McGrath &lt;roland@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@osdl.org&gt;
Signed-off-by: Adrian Bunk &lt;bunk@stusta.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>kernel/kmod.c: fix a race condition in usermodehelper.</title>
<updated>2006-09-22T00:32:57+00:00</updated>
<author>
<name>Martin Schwidefsky</name>
<email>schwidefsky@de.ibm.com</email>
</author>
<published>2006-09-22T00:32:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=579adde438b457e2110e6f6ec11fb7ffefa43b8c'/>
<id>579adde438b457e2110e6f6ec11fb7ffefa43b8c</id>
<content type='text'>
There is a race between call_usermodehelper_keys, __call_usermodehelper
and wait_for_helper. It should only happen if preemption is enabled or
on a virtualized system.

If the cpu is preempted or put to sleep by the hypervisor in
__call_usermodehelper between the creation of the wait_for_helper
thread and the second check on sub_info-&gt;wait, the whole execution
of wait_for_helper including the complete call and the continuation
after the wait_for_completion in call_usermodehelper_keys can have
happened before __call_usermodehelper checks sub_info-&gt;wait for the
second time. Since sub_info can already have been clobbered,
sub_info-&gt;wait could be zero and complete is called a second time
with an invalid argument. This has happened on s390. It took me only
three days to find out ..

Thanks to Arnd Bergmann for his help to spot this bug.

Kenneth Lee also sent the same patch independently.

Signed-off-by: Martin Schwidefsky &lt;schwidefsky@de.ibm.com&gt;
Signed-off-by: Adrian Bunk &lt;bunk@stusta.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
There is a race between call_usermodehelper_keys, __call_usermodehelper
and wait_for_helper. It should only happen if preemption is enabled or
on a virtualized system.

If the cpu is preempted or put to sleep by the hypervisor in
__call_usermodehelper between the creation of the wait_for_helper
thread and the second check on sub_info-&gt;wait, the whole execution
of wait_for_helper including the complete call and the continuation
after the wait_for_completion in call_usermodehelper_keys can have
happened before __call_usermodehelper checks sub_info-&gt;wait for the
second time. Since sub_info can already have been clobbered,
sub_info-&gt;wait could be zero and complete is called a second time
with an invalid argument. This has happened on s390. It took me only
three days to find out ..

Thanks to Arnd Bergmann for his help to spot this bug.

Kenneth Lee also sent the same patch independently.

Signed-off-by: Martin Schwidefsky &lt;schwidefsky@de.ibm.com&gt;
Signed-off-by: Adrian Bunk &lt;bunk@stusta.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>fix misoptimization in futex unqueue_me</title>
<updated>2006-09-06T14:01:43+00:00</updated>
<author>
<name>Christian Borntraeger</name>
<email>borntrae@de.ibm.com</email>
</author>
<published>2006-09-06T14:01:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=397f9fde9b9a9149be2249e1439ccb3a74b85594'/>
<id>397f9fde9b9a9149be2249e1439ccb3a74b85594</id>
<content type='text'>
This patch adds a barrier() in futex unqueue_me to avoid aliasing of two
pointers.

On my s390x system I saw the following oops:

Unable to handle kernel pointer dereference at virtual kernel address
0000000000000000
Oops: 0004 [#1]
CPU:    0    Not tainted
Process mytool (pid: 13613, task: 000000003ecb6ac0, ksp: 00000000366bdbd8)
Krnl PSW : 0704d00180000000 00000000003c9ac2 (_spin_lock+0xe/0x30)
Krnl GPRS: 00000000ffffffff 000000003ecb6ac0 0000000000000000 0700000000000000
           0000000000000000 0000000000000000 000001fe00002028 00000000000c091f
           000001fe00002054 000001fe00002054 0000000000000000 00000000366bddc0
           00000000005ef8c0 00000000003d00e8 0000000000144f91 00000000366bdcb8
Krnl Code: ba 4e 20 00 12 44 b9 16 00 3e a7 84 00 08 e3 e0 f0 88 00 04
Call Trace:
([&lt;0000000000144f90&gt;] unqueue_me+0x40/0xe4)
 [&lt;0000000000145a0c&gt;] do_futex+0x33c/0xc40
 [&lt;000000000014643e&gt;] sys_futex+0x12e/0x144
 [&lt;000000000010bb00&gt;] sysc_noemu+0x10/0x16
 [&lt;000002000003741c&gt;] 0x2000003741c

The code in question is:

static int unqueue_me(struct futex_q *q)
{
        int ret = 0;
        spinlock_t *lock_ptr;

        /* In the common case we don't take the spinlock, which is nice. */
 retry:
        lock_ptr = q-&gt;lock_ptr;
        if (lock_ptr != 0) {
                spin_lock(lock_ptr);
                /*
                 * q-&gt;lock_ptr can change between reading it and
                 * spin_lock(), causing us to take the wrong lock.  This
                 * corrects the race condition.
[...]

and my compiler (gcc 4.1.0) makes the following out of it:

00000000000003c8 &lt;unqueue_me&gt;:
     3c8:       eb bf f0 70 00 24       stmg    %r11,%r15,112(%r15)
     3ce:       c0 d0 00 00 00 00       larl    %r13,3ce &lt;unqueue_me+0x6&gt;
                        3d0: R_390_PC32DBL      .rodata+0x2a
     3d4:       a7 f1 1e 00             tml     %r15,7680
     3d8:       a7 84 00 01             je      3da &lt;unqueue_me+0x12&gt;
     3dc:       b9 04 00 ef             lgr     %r14,%r15
     3e0:       a7 fb ff d0             aghi    %r15,-48
     3e4:       b9 04 00 b2             lgr     %r11,%r2
     3e8:       e3 e0 f0 98 00 24       stg     %r14,152(%r15)
     3ee:       e3 c0 b0 28 00 04       lg      %r12,40(%r11)
                /* write q-&gt;lock_ptr in r12 */
     3f4:       b9 02 00 cc             ltgr    %r12,%r12
     3f8:       a7 84 00 4b             je      48e &lt;unqueue_me+0xc6&gt;
                /* if r12 is zero then jump over the code.... */
     3fc:       e3 20 b0 28 00 04       lg      %r2,40(%r11)
                /* write q-&gt;lock_ptr in r2 */
     402:       c0 e5 00 00 00 00       brasl   %r14,402 &lt;unqueue_me+0x3a&gt;
                        404: R_390_PC32DBL      _spin_lock+0x2
                /* use r2 as parameter for spin_lock */

So the code becomes more or less:
if (q-&gt;lock_ptr != 0) spin_lock(q-&gt;lock_ptr)
instead of
if (lock_ptr != 0) spin_lock(lock_ptr)

Which caused the oops from above.
After adding a barrier gcc creates code without this problem:
[...] (the same)
     3ee:       e3 c0 b0 28 00 04       lg      %r12,40(%r11)
     3f4:       b9 02 00 cc             ltgr    %r12,%r12
     3f8:       b9 04 00 2c             lgr     %r2,%r12
     3fc:       a7 84 00 48             je      48c &lt;unqueue_me+0xc4&gt;
     400:       c0 e5 00 00 00 00       brasl   %r14,400 &lt;unqueue_me+0x38&gt;
                        402: R_390_PC32DBL      _spin_lock+0x2

As a general note, this code of unqueue_me seems a bit fishy. The retry logic
of unqueue_me only works if we can guarantee, that the original value of
q-&gt;lock_ptr is always a spinlock (Otherwise we overwrite kernel memory). We
know that q-&gt;lock_ptr can change. I dont know what happens with the original
spinlock, as I am not an expert with the futex code.

Signed-off-by: Christian Borntraeger &lt;borntrae@de.ibm.com&gt;
Acked-by: Ingo Molnar &lt;mingo@redhat.com&gt;
Signed-off-by: Adrian Bunk &lt;bunk@stusta.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This patch adds a barrier() in futex unqueue_me to avoid aliasing of two
pointers.

On my s390x system I saw the following oops:

Unable to handle kernel pointer dereference at virtual kernel address
0000000000000000
Oops: 0004 [#1]
CPU:    0    Not tainted
Process mytool (pid: 13613, task: 000000003ecb6ac0, ksp: 00000000366bdbd8)
Krnl PSW : 0704d00180000000 00000000003c9ac2 (_spin_lock+0xe/0x30)
Krnl GPRS: 00000000ffffffff 000000003ecb6ac0 0000000000000000 0700000000000000
           0000000000000000 0000000000000000 000001fe00002028 00000000000c091f
           000001fe00002054 000001fe00002054 0000000000000000 00000000366bddc0
           00000000005ef8c0 00000000003d00e8 0000000000144f91 00000000366bdcb8
Krnl Code: ba 4e 20 00 12 44 b9 16 00 3e a7 84 00 08 e3 e0 f0 88 00 04
Call Trace:
([&lt;0000000000144f90&gt;] unqueue_me+0x40/0xe4)
 [&lt;0000000000145a0c&gt;] do_futex+0x33c/0xc40
 [&lt;000000000014643e&gt;] sys_futex+0x12e/0x144
 [&lt;000000000010bb00&gt;] sysc_noemu+0x10/0x16
 [&lt;000002000003741c&gt;] 0x2000003741c

The code in question is:

static int unqueue_me(struct futex_q *q)
{
        int ret = 0;
        spinlock_t *lock_ptr;

        /* In the common case we don't take the spinlock, which is nice. */
 retry:
        lock_ptr = q-&gt;lock_ptr;
        if (lock_ptr != 0) {
                spin_lock(lock_ptr);
                /*
                 * q-&gt;lock_ptr can change between reading it and
                 * spin_lock(), causing us to take the wrong lock.  This
                 * corrects the race condition.
[...]

and my compiler (gcc 4.1.0) makes the following out of it:

00000000000003c8 &lt;unqueue_me&gt;:
     3c8:       eb bf f0 70 00 24       stmg    %r11,%r15,112(%r15)
     3ce:       c0 d0 00 00 00 00       larl    %r13,3ce &lt;unqueue_me+0x6&gt;
                        3d0: R_390_PC32DBL      .rodata+0x2a
     3d4:       a7 f1 1e 00             tml     %r15,7680
     3d8:       a7 84 00 01             je      3da &lt;unqueue_me+0x12&gt;
     3dc:       b9 04 00 ef             lgr     %r14,%r15
     3e0:       a7 fb ff d0             aghi    %r15,-48
     3e4:       b9 04 00 b2             lgr     %r11,%r2
     3e8:       e3 e0 f0 98 00 24       stg     %r14,152(%r15)
     3ee:       e3 c0 b0 28 00 04       lg      %r12,40(%r11)
                /* write q-&gt;lock_ptr in r12 */
     3f4:       b9 02 00 cc             ltgr    %r12,%r12
     3f8:       a7 84 00 4b             je      48e &lt;unqueue_me+0xc6&gt;
                /* if r12 is zero then jump over the code.... */
     3fc:       e3 20 b0 28 00 04       lg      %r2,40(%r11)
                /* write q-&gt;lock_ptr in r2 */
     402:       c0 e5 00 00 00 00       brasl   %r14,402 &lt;unqueue_me+0x3a&gt;
                        404: R_390_PC32DBL      _spin_lock+0x2
                /* use r2 as parameter for spin_lock */

So the code becomes more or less:
if (q-&gt;lock_ptr != 0) spin_lock(q-&gt;lock_ptr)
instead of
if (lock_ptr != 0) spin_lock(lock_ptr)

Which caused the oops from above.
After adding a barrier gcc creates code without this problem:
[...] (the same)
     3ee:       e3 c0 b0 28 00 04       lg      %r12,40(%r11)
     3f4:       b9 02 00 cc             ltgr    %r12,%r12
     3f8:       b9 04 00 2c             lgr     %r2,%r12
     3fc:       a7 84 00 48             je      48c &lt;unqueue_me+0xc4&gt;
     400:       c0 e5 00 00 00 00       brasl   %r14,400 &lt;unqueue_me+0x38&gt;
                        402: R_390_PC32DBL      _spin_lock+0x2

As a general note, this code of unqueue_me seems a bit fishy. The retry logic
of unqueue_me only works if we can guarantee, that the original value of
q-&gt;lock_ptr is always a spinlock (Otherwise we overwrite kernel memory). We
know that q-&gt;lock_ptr can change. I dont know what happens with the original
spinlock, as I am not an expert with the futex code.

Signed-off-by: Christian Borntraeger &lt;borntrae@de.ibm.com&gt;
Acked-by: Ingo Molnar &lt;mingo@redhat.com&gt;
Signed-off-by: Adrian Bunk &lt;bunk@stusta.de&gt;
</pre>
</div>
</content>
</entry>
</feed>
