From 75065ff619e42fe35178eda863cbcddd57776794 Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Thu, 8 May 2008 14:06:19 +0200 Subject: Revert "relay: fix splice problem" This reverts commit c3270e577c18b3d0e984c3371493205a4807db9d. --- kernel/relay.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/relay.c b/kernel/relay.c index 7de644cdec43..bc24dcdc570f 100644 --- a/kernel/relay.c +++ b/kernel/relay.c @@ -1191,7 +1191,7 @@ static ssize_t relay_file_splice_read(struct file *in, ret = 0; spliced = 0; - while (len && !spliced) { + while (len) { ret = subbuf_splice_actor(in, ppos, pipe, len, flags, &nonpad_ret); if (ret < 0) break; -- cgit v1.2.3 From bf726eab3711cf192405d21688a4b21e07b6188a Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Thu, 8 May 2008 11:53:48 +0200 Subject: semaphore: fix MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Yanmin Zhang reported: | Comparing with kernel 2.6.25, AIM7 (use tmpfs) has more th | regression under 2.6.26-rc1 on my 8-core stoakley, 16-core tigerton, | and Itanium Montecito. Bisect located the patch below: | | 64ac24e738823161693bf791f87adc802cf529ff is first bad commit | commit 64ac24e738823161693bf791f87adc802cf529ff | Author: Matthew Wilcox | Date: Fri Mar 7 21:55:58 2008 -0500 | | Generic semaphore implementation | | After I manually reverted the patch against 2.6.26-rc1 while fixing | lots of conflicts/errors, aim7 regression became less than 2%. i reproduced the AIM7 workload and can confirm Yanmin's findings that -.26-rc1 regresses over .25 - by over 67% here. Looking at the workload i found and fixed what i believe to be the real bug causing the AIM7 regression: it was inefficient wakeup / scheduling / locking behavior of the new generic semaphore code, causing suboptimal performance. The problem comes from the following code. The new semaphore code does this on down(): spin_lock_irqsave(&sem->lock, flags); if (likely(sem->count > 0)) sem->count--; else __down(sem); spin_unlock_irqrestore(&sem->lock, flags); and this on up(): spin_lock_irqsave(&sem->lock, flags); if (likely(list_empty(&sem->wait_list))) sem->count++; else __up(sem); spin_unlock_irqrestore(&sem->lock, flags); where __up() does: list_del(&waiter->list); waiter->up = 1; wake_up_process(waiter->task); and where __down() does this in essence: list_add_tail(&waiter.list, &sem->wait_list); waiter.task = task; waiter.up = 0; for (;;) { [...] spin_unlock_irq(&sem->lock); timeout = schedule_timeout(timeout); spin_lock_irq(&sem->lock); if (waiter.up) return 0; } the fastpath looks good and obvious, but note the following property of the contended path: if there's a task on the ->wait_list, the up() of the current owner will "pass over" ownership to that waiting task, in a wake-one manner, via the waiter->up flag and by removing the waiter from the wait list. That is all and fine in principle, but as implemented in kernel/semaphore.c it also creates a nasty, hidden source of contention! The contention comes from the following property of the new semaphore code: the new owner owns the semaphore exclusively, even if it is not running yet. So if the old owner, even if just a few instructions later, does a down() [lock_kernel()] again, it will be blocked and will have to wait on the new owner to eventually be scheduled (possibly on another CPU)! Or if another task gets to lock_kernel() sooner than the "new owner" scheduled, it will be blocked unnecessarily and for a very long time when there are 2000 tasks running. I.e. the implementation of the new semaphores code does wake-one and lock ownership in a very restrictive way - it does not allow opportunistic re-locking of the lock at all and keeps the scheduler from picking task order intelligently. This kind of scheduling, with 2000 AIM7 processes running, creates awful cross-scheduling between those 2000 tasks, causes reduced parallelism, a throttled runqueue length and a lot of idle time. With increasing number of CPUs it causes an exponentially worse behavior in AIM7, as the chance for a newly woken new-owner task to actually run anytime soon is less and less likely. Note that it takes just a tiny bit of contention for the 'new-semaphore catastrophy' to happen: the wakeup latencies get added to whatever small contention there is, and quickly snowball out of control! I believe Yanmin's findings and numbers support this analysis too. The best fix for this problem is to use the same scheduling logic that the kernel/mutex.c code uses: keep the wake-one behavior (that is OK and wanted because we do not want to over-schedule), but also allow opportunistic locking of the lock even if a wakee is already "in flight". The patch below implements this new logic. With this patch applied the AIM7 regression is largely fixed on my quad testbox: # v2.6.25 vanilla: .................. Tasks Jobs/Min JTI Real CPU Jobs/sec/task 2000 56096.4 91 207.5 789.7 0.4675 2000 55894.4 94 208.2 792.7 0.4658 # v2.6.26-rc1-166-gc0a1811 vanilla: ................................... Tasks Jobs/Min JTI Real CPU Jobs/sec/task 2000 33230.6 83 350.3 784.5 0.2769 2000 31778.1 86 366.3 783.6 0.2648 # v2.6.26-rc1-166-gc0a1811 + semaphore-speedup: ............................................... Tasks Jobs/Min JTI Real CPU Jobs/sec/task 2000 55707.1 92 209.0 795.6 0.4642 2000 55704.4 96 209.0 796.0 0.4642 i.e. a 67% speedup. We are now back to within 1% of the v2.6.25 performance levels and have zero idle time during the test, as expected. Btw., interactivity also improved dramatically with the fix - for example console-switching became almost instantaneous during this workload (which after all is running 2000 tasks at once!), without the patch it was stuck for a minute at times. There's another nice side-effect of this speedup patch, the new generic semaphore code got even smaller: text data bss dec hex filename 1241 0 0 1241 4d9 semaphore.o.before 1207 0 0 1207 4b7 semaphore.o.after (because the waiter.up complication got removed.) Longer-term we should look into using the mutex code for the generic semaphore code as well - but i's not easy due to legacies and it's outside of the scope of v2.6.26 and outside the scope of this patch as well. Bisected-by: "Zhang, Yanmin" Signed-off-by: Ingo Molnar --- kernel/semaphore.c | 64 +++++++++++++++++++++++++----------------------------- 1 file changed, 30 insertions(+), 34 deletions(-) (limited to 'kernel') diff --git a/kernel/semaphore.c b/kernel/semaphore.c index 5c2942e768cd..5e41217239e8 100644 --- a/kernel/semaphore.c +++ b/kernel/semaphore.c @@ -54,10 +54,9 @@ void down(struct semaphore *sem) unsigned long flags; spin_lock_irqsave(&sem->lock, flags); - if (likely(sem->count > 0)) - sem->count--; - else + if (unlikely(!sem->count)) __down(sem); + sem->count--; spin_unlock_irqrestore(&sem->lock, flags); } EXPORT_SYMBOL(down); @@ -77,10 +76,10 @@ int down_interruptible(struct semaphore *sem) int result = 0; spin_lock_irqsave(&sem->lock, flags); - if (likely(sem->count > 0)) - sem->count--; - else + if (unlikely(!sem->count)) result = __down_interruptible(sem); + if (!result) + sem->count--; spin_unlock_irqrestore(&sem->lock, flags); return result; @@ -103,10 +102,10 @@ int down_killable(struct semaphore *sem) int result = 0; spin_lock_irqsave(&sem->lock, flags); - if (likely(sem->count > 0)) - sem->count--; - else + if (unlikely(!sem->count)) result = __down_killable(sem); + if (!result) + sem->count--; spin_unlock_irqrestore(&sem->lock, flags); return result; @@ -157,10 +156,10 @@ int down_timeout(struct semaphore *sem, long jiffies) int result = 0; spin_lock_irqsave(&sem->lock, flags); - if (likely(sem->count > 0)) - sem->count--; - else + if (unlikely(!sem->count)) result = __down_timeout(sem, jiffies); + if (!result) + sem->count--; spin_unlock_irqrestore(&sem->lock, flags); return result; @@ -179,9 +178,8 @@ void up(struct semaphore *sem) unsigned long flags; spin_lock_irqsave(&sem->lock, flags); - if (likely(list_empty(&sem->wait_list))) - sem->count++; - else + sem->count++; + if (unlikely(!list_empty(&sem->wait_list))) __up(sem); spin_unlock_irqrestore(&sem->lock, flags); } @@ -192,7 +190,6 @@ EXPORT_SYMBOL(up); struct semaphore_waiter { struct list_head list; struct task_struct *task; - int up; }; /* @@ -205,33 +202,34 @@ static inline int __sched __down_common(struct semaphore *sem, long state, { struct task_struct *task = current; struct semaphore_waiter waiter; + int ret = 0; - list_add_tail(&waiter.list, &sem->wait_list); waiter.task = task; - waiter.up = 0; + list_add_tail(&waiter.list, &sem->wait_list); for (;;) { - if (state == TASK_INTERRUPTIBLE && signal_pending(task)) - goto interrupted; - if (state == TASK_KILLABLE && fatal_signal_pending(task)) - goto interrupted; - if (timeout <= 0) - goto timed_out; + if (state == TASK_INTERRUPTIBLE && signal_pending(task)) { + ret = -EINTR; + break; + } + if (state == TASK_KILLABLE && fatal_signal_pending(task)) { + ret = -EINTR; + break; + } + if (timeout <= 0) { + ret = -ETIME; + break; + } __set_task_state(task, state); spin_unlock_irq(&sem->lock); timeout = schedule_timeout(timeout); spin_lock_irq(&sem->lock); - if (waiter.up) - return 0; + if (sem->count > 0) + break; } - timed_out: - list_del(&waiter.list); - return -ETIME; - - interrupted: list_del(&waiter.list); - return -EINTR; + return ret; } static noinline void __sched __down(struct semaphore *sem) @@ -258,7 +256,5 @@ static noinline void __sched __up(struct semaphore *sem) { struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list, struct semaphore_waiter, list); - list_del(&waiter->list); - waiter->up = 1; wake_up_process(waiter->task); } -- cgit v1.2.3 From 46151122e0a2e80e5a6b2889f595e371fe2b600d Mon Sep 17 00:00:00 2001 From: Mike Galbraith Date: Thu, 8 May 2008 17:00:42 +0200 Subject: sched: fix weight calculations The conversion between virtual and real time is as follows: dvt = rw/w * dt <=> dt = w/rw * dvt Since we want the fair sleeper granularity to be in real time, we actually need to do: dvt = - rw/w * l This bug could be related to the regression reported by Yanmin Zhang: | Comparing with kernel 2.6.25, sysbench+mysql(oltp, readonly) has lots | of regressions with 2.6.26-rc1: | | 1) 8-core stoakley: 28%; | 2) 16-core tigerton: 20%; | 3) Itanium Montvale: 50%. Reported-by: "Zhang, Yanmin" Signed-off-by: Mike Galbraith Signed-off-by: Peter Zijlstra Signed-off-by: Ingo Molnar --- kernel/sched_fair.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c index c863663d204d..e24ecd39c4b8 100644 --- a/kernel/sched_fair.c +++ b/kernel/sched_fair.c @@ -662,10 +662,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial) if (!initial) { /* sleeps upto a single latency don't count. */ if (sched_feat(NEW_FAIR_SLEEPERS)) { + unsigned long thresh = sysctl_sched_latency; + + /* + * convert the sleeper threshold into virtual time + */ if (sched_feat(NORMALIZED_SLEEPER)) - vruntime -= calc_delta_weight(sysctl_sched_latency, se); - else - vruntime -= sysctl_sched_latency; + thresh = calc_delta_fair(thresh, se); + + vruntime -= thresh; } /* ensure we never gain time by being placed backwards. */ -- cgit v1.2.3 From 5be7a4792a31df6f2cd44bfba8da467ea20a0642 Mon Sep 17 00:00:00 2001 From: Paul Menage Date: Tue, 6 May 2008 20:42:41 -0700 Subject: Fix cpuset sched_relax_domain_level control file Due to a merge conflict, the sched_relax_domain_level control file was marked as being handled by cpuset_read/write_u64, but the code to handle it was actually in cpuset_common_file_read/write. Since the value being written/read is in fact a signed integer, it should be treated as such; this patch adds cpuset_read/write_s64 functions, and uses them to handle the sched_relax_domain_level file. With this patch, the sched_relax_domain_level can be read and written, and the correct contents seen/updated. Signed-off-by: Paul Menage Cc: Hidetoshi Seto Cc: Paul Jackson Cc: Ingo Molnar Reviewed-by: Li Zefan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cpuset.c | 52 ++++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 40 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/cpuset.c b/kernel/cpuset.c index 8da627d33804..86ea9e34e326 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -1031,11 +1031,9 @@ int current_cpuset_is_being_rebound(void) return task_cs(current) == cpuset_being_rebound; } -static int update_relax_domain_level(struct cpuset *cs, char *buf) +static int update_relax_domain_level(struct cpuset *cs, s64 val) { - int val = simple_strtol(buf, NULL, 10); - - if (val < 0) + if ((int)val < 0) val = -1; if (val != cs->relax_domain_level) { @@ -1280,9 +1278,6 @@ static ssize_t cpuset_common_file_write(struct cgroup *cont, case FILE_MEMLIST: retval = update_nodemask(cs, buffer); break; - case FILE_SCHED_RELAX_DOMAIN_LEVEL: - retval = update_relax_domain_level(cs, buffer); - break; default: retval = -EINVAL; goto out2; @@ -1348,6 +1343,30 @@ static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val) return retval; } +static int cpuset_write_s64(struct cgroup *cgrp, struct cftype *cft, s64 val) +{ + int retval = 0; + struct cpuset *cs = cgroup_cs(cgrp); + cpuset_filetype_t type = cft->private; + + cgroup_lock(); + + if (cgroup_is_removed(cgrp)) { + cgroup_unlock(); + return -ENODEV; + } + switch (type) { + case FILE_SCHED_RELAX_DOMAIN_LEVEL: + retval = update_relax_domain_level(cs, val); + break; + default: + retval = -EINVAL; + break; + } + cgroup_unlock(); + return retval; +} + /* * These ascii lists should be read in a single call, by using a user * buffer large enough to hold the entire map. If read in smaller @@ -1406,9 +1425,6 @@ static ssize_t cpuset_common_file_read(struct cgroup *cont, case FILE_MEMLIST: s += cpuset_sprintf_memlist(s, cs); break; - case FILE_SCHED_RELAX_DOMAIN_LEVEL: - s += sprintf(s, "%d", cs->relax_domain_level); - break; default: retval = -EINVAL; goto out; @@ -1449,6 +1465,18 @@ static u64 cpuset_read_u64(struct cgroup *cont, struct cftype *cft) } } +static s64 cpuset_read_s64(struct cgroup *cont, struct cftype *cft) +{ + struct cpuset *cs = cgroup_cs(cont); + cpuset_filetype_t type = cft->private; + switch (type) { + case FILE_SCHED_RELAX_DOMAIN_LEVEL: + return cs->relax_domain_level; + default: + BUG(); + } +} + /* * for the common functions, 'private' gives the type of file @@ -1499,8 +1527,8 @@ static struct cftype files[] = { { .name = "sched_relax_domain_level", - .read_u64 = cpuset_read_u64, - .write_u64 = cpuset_write_u64, + .read_s64 = cpuset_read_s64, + .write_s64 = cpuset_write_s64, .private = FILE_SCHED_RELAX_DOMAIN_LEVEL, }, -- cgit v1.2.3 From a5dd69707424a35d2d2cc094e870f595ad61e916 Mon Sep 17 00:00:00 2001 From: Rusty Russell Date: Fri, 9 May 2008 16:24:21 +1000 Subject: module: be more picky about allowing missing module versions We allow missing __versions sections, because modprobe --force strips it. It makes less sense to allow sections where there's no version for a specific symbol the module uses, so disallow that. Signed-off-by: Rusty Russell Signed-off-by: Linus Torvalds --- kernel/module.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/module.c b/kernel/module.c index 8e4528c9909f..2584c0e2762d 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -917,6 +917,10 @@ static int check_version(Elf_Shdr *sechdrs, if (!crc) return 1; + /* No versions at all? modprobe --force does this. */ + if (versindex == 0) + return try_to_force_load(mod, symname) == 0; + versions = (void *) sechdrs[versindex].sh_addr; num_versions = sechdrs[versindex].sh_size / sizeof(struct modversion_info); @@ -932,8 +936,9 @@ static int check_version(Elf_Shdr *sechdrs, goto bad_version; } - if (!try_to_force_load(mod, symname)) - return 1; + printk(KERN_WARNING "%s: no symbol version for %s\n", + mod->name, symname); + return 0; bad_version: printk("%s: disagrees about version of symbol %s\n", -- cgit v1.2.3 From 91e37a793b5a9436a2d12b2f0a8f52db3a133e1d Mon Sep 17 00:00:00 2001 From: Rusty Russell Date: Fri, 9 May 2008 16:25:28 +1000 Subject: module: don't ignore vermagic string if module doesn't have modversions Linus found a logic bug: we ignore the version number in a module's vermagic string if we have CONFIG_MODVERSIONS set, but modversions also lets through a module with no __versions section for modprobe --force (with tainting, but still). We should only ignore the start of the vermagic string if the module actually *has* crcs to check. Rather than (say) having an entertaining hissy fit and creating a config option to work around the buggy code. Signed-off-by: Rusty Russell Signed-off-by: Linus Torvalds --- kernel/module.c | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/module.c b/kernel/module.c index 2584c0e2762d..f5e9491ef7ac 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -957,11 +957,14 @@ static inline int check_modstruct_version(Elf_Shdr *sechdrs, return check_version(sechdrs, versindex, "struct_module", mod, crc); } -/* First part is kernel version, which we ignore. */ -static inline int same_magic(const char *amagic, const char *bmagic) +/* First part is kernel version, which we ignore if module has crcs. */ +static inline int same_magic(const char *amagic, const char *bmagic, + bool has_crcs) { - amagic += strcspn(amagic, " "); - bmagic += strcspn(bmagic, " "); + if (has_crcs) { + amagic += strcspn(amagic, " "); + bmagic += strcspn(bmagic, " "); + } return strcmp(amagic, bmagic) == 0; } #else @@ -981,7 +984,8 @@ static inline int check_modstruct_version(Elf_Shdr *sechdrs, return 1; } -static inline int same_magic(const char *amagic, const char *bmagic) +static inline int same_magic(const char *amagic, const char *bmagic, + bool has_crcs) { return strcmp(amagic, bmagic) == 0; } @@ -1874,7 +1878,7 @@ static struct module *load_module(void __user *umod, err = try_to_force_load(mod, "magic"); if (err) goto free_hdr; - } else if (!same_magic(modmagic, vermagic)) { + } else if (!same_magic(modmagic, vermagic, versindex)) { printk(KERN_ERR "%s: version magic '%s' should be '%s'\n", mod->name, modmagic, vermagic); err = -ENOEXEC; -- cgit v1.2.3 From 00b41ec2611dc98f87f30753ee00a53db648d662 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Sat, 10 May 2008 20:43:22 -0700 Subject: Revert "semaphore: fix" This reverts commit bf726eab3711cf192405d21688a4b21e07b6188a, as it has been reported to cause a regression with processes stuck in __down(), apparently because some missing wakeup. Quoth Sven Wegener: "I'm currently investigating a regression that has showed up with my last git pull yesterday. Bisecting the commits showed bf726e "semaphore: fix" to be the culprit, reverting it fixed the issue. Symptoms: During heavy filesystem usage (e.g. a kernel compile) I get several compiler processes in uninterruptible sleep, blocking all i/o on the filesystem. System is an Intel Core 2 Quad running a 64bit kernel and userspace. Filesystem is xfs on top of lvm. See below for the output of sysrq-w." See http://lkml.org/lkml/2008/5/10/45 for full report. In the meantime, we can just fix the BKL performance regression by reverting back to the good old BKL spinlock implementation instead, since any sleeping lock will generally perform badly, especially if it tries to be fair. Reported-by: Sven Wegener Cc: Andrew Morton Cc: Ingo Molnar Signed-off-by: Linus Torvalds --- kernel/semaphore.c | 64 +++++++++++++++++++++++++++++------------------------- 1 file changed, 34 insertions(+), 30 deletions(-) (limited to 'kernel') diff --git a/kernel/semaphore.c b/kernel/semaphore.c index 5e41217239e8..5c2942e768cd 100644 --- a/kernel/semaphore.c +++ b/kernel/semaphore.c @@ -54,9 +54,10 @@ void down(struct semaphore *sem) unsigned long flags; spin_lock_irqsave(&sem->lock, flags); - if (unlikely(!sem->count)) + if (likely(sem->count > 0)) + sem->count--; + else __down(sem); - sem->count--; spin_unlock_irqrestore(&sem->lock, flags); } EXPORT_SYMBOL(down); @@ -76,10 +77,10 @@ int down_interruptible(struct semaphore *sem) int result = 0; spin_lock_irqsave(&sem->lock, flags); - if (unlikely(!sem->count)) - result = __down_interruptible(sem); - if (!result) + if (likely(sem->count > 0)) sem->count--; + else + result = __down_interruptible(sem); spin_unlock_irqrestore(&sem->lock, flags); return result; @@ -102,10 +103,10 @@ int down_killable(struct semaphore *sem) int result = 0; spin_lock_irqsave(&sem->lock, flags); - if (unlikely(!sem->count)) - result = __down_killable(sem); - if (!result) + if (likely(sem->count > 0)) sem->count--; + else + result = __down_killable(sem); spin_unlock_irqrestore(&sem->lock, flags); return result; @@ -156,10 +157,10 @@ int down_timeout(struct semaphore *sem, long jiffies) int result = 0; spin_lock_irqsave(&sem->lock, flags); - if (unlikely(!sem->count)) - result = __down_timeout(sem, jiffies); - if (!result) + if (likely(sem->count > 0)) sem->count--; + else + result = __down_timeout(sem, jiffies); spin_unlock_irqrestore(&sem->lock, flags); return result; @@ -178,8 +179,9 @@ void up(struct semaphore *sem) unsigned long flags; spin_lock_irqsave(&sem->lock, flags); - sem->count++; - if (unlikely(!list_empty(&sem->wait_list))) + if (likely(list_empty(&sem->wait_list))) + sem->count++; + else __up(sem); spin_unlock_irqrestore(&sem->lock, flags); } @@ -190,6 +192,7 @@ EXPORT_SYMBOL(up); struct semaphore_waiter { struct list_head list; struct task_struct *task; + int up; }; /* @@ -202,34 +205,33 @@ static inline int __sched __down_common(struct semaphore *sem, long state, { struct task_struct *task = current; struct semaphore_waiter waiter; - int ret = 0; - waiter.task = task; list_add_tail(&waiter.list, &sem->wait_list); + waiter.task = task; + waiter.up = 0; for (;;) { - if (state == TASK_INTERRUPTIBLE && signal_pending(task)) { - ret = -EINTR; - break; - } - if (state == TASK_KILLABLE && fatal_signal_pending(task)) { - ret = -EINTR; - break; - } - if (timeout <= 0) { - ret = -ETIME; - break; - } + if (state == TASK_INTERRUPTIBLE && signal_pending(task)) + goto interrupted; + if (state == TASK_KILLABLE && fatal_signal_pending(task)) + goto interrupted; + if (timeout <= 0) + goto timed_out; __set_task_state(task, state); spin_unlock_irq(&sem->lock); timeout = schedule_timeout(timeout); spin_lock_irq(&sem->lock); - if (sem->count > 0) - break; + if (waiter.up) + return 0; } + timed_out: + list_del(&waiter.list); + return -ETIME; + + interrupted: list_del(&waiter.list); - return ret; + return -EINTR; } static noinline void __sched __down(struct semaphore *sem) @@ -256,5 +258,7 @@ static noinline void __sched __up(struct semaphore *sem) { struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list, struct semaphore_waiter, list); + list_del(&waiter->list); + waiter->up = 1; wake_up_process(waiter->task); } -- cgit v1.2.3 From 8e3e076c5a78519a9f64cd384e8f18bc21882ce0 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Sat, 10 May 2008 20:58:02 -0700 Subject: BKL: revert back to the old spinlock implementation The generic semaphore rewrite had a huge performance regression on AIM7 (and potentially other BKL-heavy benchmarks) because the generic semaphores had been rewritten to be simple to understand and fair. The latter, in particular, turns a semaphore-based BKL implementation into a mess of scheduling. The attempt to fix the performance regression failed miserably (see the previous commit 00b41ec2611dc98f87f30753ee00a53db648d662 'Revert "semaphore: fix"'), and so for now the simple and sane approach is to instead just go back to the old spinlock-based BKL implementation that never had any issues like this. This patch also has the advantage of being reported to fix the regression completely according to Yanmin Zhang, unlike the semaphore hack which still left a couple percentage point regression. As a spinlock, the BKL obviously has the potential to be a latency issue, but it's not really any different from any other spinlock in that respect. We do want to get rid of the BKL asap, but that has been the plan for several years. These days, the biggest users are in the tty layer (open/release in particular) and Alan holds out some hope: "tty release is probably a few months away from getting cured - I'm afraid it will almost certainly be the very last user of the BKL in tty to get fixed as it depends on everything else being sanely locked." so while we're not there yet, we do have a plan of action. Tested-by: Yanmin Zhang Cc: Ingo Molnar Cc: Andi Kleen Cc: Matthew Wilcox Cc: Alexander Viro Cc: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sched.c | 27 ++++----------------------- 1 file changed, 4 insertions(+), 23 deletions(-) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index 58fb8af15776..c51b6565e07c 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -4567,8 +4567,6 @@ EXPORT_SYMBOL(schedule); asmlinkage void __sched preempt_schedule(void) { struct thread_info *ti = current_thread_info(); - struct task_struct *task = current; - int saved_lock_depth; /* * If there is a non-zero preempt_count or interrupts are disabled, @@ -4579,16 +4577,7 @@ asmlinkage void __sched preempt_schedule(void) do { add_preempt_count(PREEMPT_ACTIVE); - - /* - * We keep the big kernel semaphore locked, but we - * clear ->lock_depth so that schedule() doesnt - * auto-release the semaphore: - */ - saved_lock_depth = task->lock_depth; - task->lock_depth = -1; schedule(); - task->lock_depth = saved_lock_depth; sub_preempt_count(PREEMPT_ACTIVE); /* @@ -4609,26 +4598,15 @@ EXPORT_SYMBOL(preempt_schedule); asmlinkage void __sched preempt_schedule_irq(void) { struct thread_info *ti = current_thread_info(); - struct task_struct *task = current; - int saved_lock_depth; /* Catch callers which need to be fixed */ BUG_ON(ti->preempt_count || !irqs_disabled()); do { add_preempt_count(PREEMPT_ACTIVE); - - /* - * We keep the big kernel semaphore locked, but we - * clear ->lock_depth so that schedule() doesnt - * auto-release the semaphore: - */ - saved_lock_depth = task->lock_depth; - task->lock_depth = -1; local_irq_enable(); schedule(); local_irq_disable(); - task->lock_depth = saved_lock_depth; sub_preempt_count(PREEMPT_ACTIVE); /* @@ -5853,8 +5831,11 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu) spin_unlock_irqrestore(&rq->lock, flags); /* Set the preempt count _outside_ the spinlocks! */ +#if defined(CONFIG_PREEMPT) + task_thread_info(idle)->preempt_count = (idle->lock_depth >= 0); +#else task_thread_info(idle)->preempt_count = 0; - +#endif /* * The idle tasks have their own, simple scheduling class: */ -- cgit v1.2.3 From c3921ab71507b108d51a0f1ee960f80cd668a93d Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Sun, 11 May 2008 16:04:48 -0700 Subject: Add new 'cond_resched_bkl()' helper function It acts exactly like a regular 'cond_resched()', but will not get optimized away when CONFIG_PREEMPT is set. Normal kernel code is already preemptable in the presense of CONFIG_PREEMPT, so cond_resched() is optimized away (see commit 02b67cc3ba36bdba351d6c3a00593f4ec550d9d3 "sched: do not do cond_resched() when CONFIG_PREEMPT"). But when wanting to conditionally reschedule while holding a lock, you need to use "cond_sched_lock(lock)", and the new function is the BKL equivalent of that. Also make fs/locks.c use it. Signed-off-by: Linus Torvalds --- kernel/sched.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index c51b6565e07c..8841a915545d 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -5525,7 +5525,6 @@ static void __cond_resched(void) } while (need_resched()); } -#if !defined(CONFIG_PREEMPT) || defined(CONFIG_PREEMPT_VOLUNTARY) int __sched _cond_resched(void) { if (need_resched() && !(preempt_count() & PREEMPT_ACTIVE) && @@ -5536,7 +5535,6 @@ int __sched _cond_resched(void) return 0; } EXPORT_SYMBOL(_cond_resched); -#endif /* * cond_resched_lock() - if a reschedule is pending, drop the given lock, -- cgit v1.2.3 From 0c70814c311581a6c86198db4f982aa683c68fb8 Mon Sep 17 00:00:00 2001 From: Mirco Tischler Date: Wed, 14 May 2008 16:05:46 -0700 Subject: cgroups: fix compile warning Return type of cpu_rt_runtime_write() should be int instead of ssize_t. Signed-off-by: Mirco Tischler Acked-by: Paul Menage Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sched.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index 8841a915545d..cfa222a91539 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -8986,7 +8986,7 @@ static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft) #endif #ifdef CONFIG_RT_GROUP_SCHED -static ssize_t cpu_rt_runtime_write(struct cgroup *cgrp, struct cftype *cft, +static int cpu_rt_runtime_write(struct cgroup *cgrp, struct cftype *cft, s64 val) { return sched_group_set_rt_runtime(cgroup_tg(cgrp), val); -- cgit v1.2.3 From 3fc957721d18c93662f7d4dab455b80f53dd2641 Mon Sep 17 00:00:00 2001 From: Harvey Harrison Date: Wed, 14 May 2008 16:05:49 -0700 Subject: lib: create common ascii hex array Add a common hex array in hexdump.c so everyone can use it. Add a common hi/lo helper to avoid the shifting masking that is done to get the upper and lower nibbles of a byte value. Pull the pack_hex_byte helper from kgdb as it is opencoded many places in the tree that will be consolidated. Signed-off-by: Harvey Harrison Acked-by: Paul Mundt Cc: Jason Wessel Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kgdb.c | 8 -------- 1 file changed, 8 deletions(-) (limited to 'kernel') diff --git a/kernel/kgdb.c b/kernel/kgdb.c index 39e31a036f5b..14787de568b3 100644 --- a/kernel/kgdb.c +++ b/kernel/kgdb.c @@ -346,14 +346,6 @@ static void put_packet(char *buffer) } } -static char *pack_hex_byte(char *pkt, u8 byte) -{ - *pkt++ = hexchars[byte >> 4]; - *pkt++ = hexchars[byte & 0xf]; - - return pkt; -} - /* * Convert the memory pointed to by mem into hex, placing result in buf. * Return a pointer to the last char put in buf (null). May return an error. -- cgit v1.2.3 From 02afc6267f6d55d47aba9fcafdbd1b7230d2294a Mon Sep 17 00:00:00 2001 From: Al Viro Date: Thu, 8 May 2008 19:42:56 -0400 Subject: [PATCH] dup_fd() fixes, part 1 Move the sucker to fs/file.c in preparation to the rest Signed-off-by: Al Viro --- kernel/fork.c | 130 ---------------------------------------------------------- 1 file changed, 130 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index 933e60ebccae..19908b26cf80 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -660,136 +660,6 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk) return 0; } -static int count_open_files(struct fdtable *fdt) -{ - int size = fdt->max_fds; - int i; - - /* Find the last open fd */ - for (i = size/(8*sizeof(long)); i > 0; ) { - if (fdt->open_fds->fds_bits[--i]) - break; - } - i = (i+1) * 8 * sizeof(long); - return i; -} - -static struct files_struct *alloc_files(void) -{ - struct files_struct *newf; - struct fdtable *fdt; - - newf = kmem_cache_alloc(files_cachep, GFP_KERNEL); - if (!newf) - goto out; - - atomic_set(&newf->count, 1); - - spin_lock_init(&newf->file_lock); - newf->next_fd = 0; - fdt = &newf->fdtab; - fdt->max_fds = NR_OPEN_DEFAULT; - fdt->close_on_exec = (fd_set *)&newf->close_on_exec_init; - fdt->open_fds = (fd_set *)&newf->open_fds_init; - fdt->fd = &newf->fd_array[0]; - INIT_RCU_HEAD(&fdt->rcu); - fdt->next = NULL; - rcu_assign_pointer(newf->fdt, fdt); -out: - return newf; -} - -/* - * Allocate a new files structure and copy contents from the - * passed in files structure. - * errorp will be valid only when the returned files_struct is NULL. - */ -static struct files_struct *dup_fd(struct files_struct *oldf, int *errorp) -{ - struct files_struct *newf; - struct file **old_fds, **new_fds; - int open_files, size, i; - struct fdtable *old_fdt, *new_fdt; - - *errorp = -ENOMEM; - newf = alloc_files(); - if (!newf) - goto out; - - spin_lock(&oldf->file_lock); - old_fdt = files_fdtable(oldf); - new_fdt = files_fdtable(newf); - open_files = count_open_files(old_fdt); - - /* - * Check whether we need to allocate a larger fd array and fd set. - * Note: we're not a clone task, so the open count won't change. - */ - if (open_files > new_fdt->max_fds) { - new_fdt->max_fds = 0; - spin_unlock(&oldf->file_lock); - spin_lock(&newf->file_lock); - *errorp = expand_files(newf, open_files-1); - spin_unlock(&newf->file_lock); - if (*errorp < 0) - goto out_release; - new_fdt = files_fdtable(newf); - /* - * Reacquire the oldf lock and a pointer to its fd table - * who knows it may have a new bigger fd table. We need - * the latest pointer. - */ - spin_lock(&oldf->file_lock); - old_fdt = files_fdtable(oldf); - } - - old_fds = old_fdt->fd; - new_fds = new_fdt->fd; - - memcpy(new_fdt->open_fds->fds_bits, - old_fdt->open_fds->fds_bits, open_files/8); - memcpy(new_fdt->close_on_exec->fds_bits, - old_fdt->close_on_exec->fds_bits, open_files/8); - - for (i = open_files; i != 0; i--) { - struct file *f = *old_fds++; - if (f) { - get_file(f); - } else { - /* - * The fd may be claimed in the fd bitmap but not yet - * instantiated in the files array if a sibling thread - * is partway through open(). So make sure that this - * fd is available to the new process. - */ - FD_CLR(open_files - i, new_fdt->open_fds); - } - rcu_assign_pointer(*new_fds++, f); - } - spin_unlock(&oldf->file_lock); - - /* compute the remainder to be cleared */ - size = (new_fdt->max_fds - open_files) * sizeof(struct file *); - - /* This is long word aligned thus could use a optimized version */ - memset(new_fds, 0, size); - - if (new_fdt->max_fds > open_files) { - int left = (new_fdt->max_fds-open_files)/8; - int start = open_files / (8 * sizeof(unsigned long)); - - memset(&new_fdt->open_fds->fds_bits[start], 0, left); - memset(&new_fdt->close_on_exec->fds_bits[start], 0, left); - } - - return newf; - -out_release: - kmem_cache_free(files_cachep, newf); -out: - return NULL; -} - static int copy_files(unsigned long clone_flags, struct task_struct * tsk) { struct files_struct *oldf, *newf; -- cgit v1.2.3 From eceea0b3df05ed262ae32e0c6340cc7a3626632d Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 10 May 2008 10:08:32 -0400 Subject: [PATCH] avoid multiplication overflows and signedness issues for max_fds Limit sysctl_nr_open - we don't want ->max_fds to exceed MAX_INT and we don't want size calculation for ->fd[] to overflow. Signed-off-by: Al Viro --- kernel/sysctl.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sysctl.c b/kernel/sysctl.c index d7ffdc59816a..29116652dca8 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -81,6 +81,7 @@ extern int compat_log; extern int maps_protect; extern int sysctl_stat_interval; extern int latencytop_enabled; +extern int sysctl_nr_open_min, sysctl_nr_open_max; /* Constants used for minimum and maximum */ #if defined(CONFIG_DETECT_SOFTLOCKUP) || defined(CONFIG_HIGHMEM) @@ -1190,7 +1191,9 @@ static struct ctl_table fs_table[] = { .data = &sysctl_nr_open, .maxlen = sizeof(int), .mode = 0644, - .proc_handler = &proc_dointvec, + .proc_handler = &proc_dointvec_minmax, + .extra1 = &sysctl_nr_open_min, + .extra2 = &sysctl_nr_open_max, }, { .ctl_name = FS_DENTRY, -- cgit v1.2.3 From fcaf1eb8685a00a99259e138e403841e984385b0 Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Wed, 14 May 2008 16:11:48 -0700 Subject: [patch 1/1] audit_send_reply(): fix error-path memory leak Addresses http://bugzilla.kernel.org/show_bug.cgi?id=10663 Reporter: Daniel Marjamki Cc: Al Viro Signed-off-by: Andrew Morton Signed-off-by: Al Viro --- kernel/audit.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/audit.c b/kernel/audit.c index b7d3709cc452..e8692a5748c2 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -572,16 +572,17 @@ void audit_send_reply(int pid, int seq, int type, int done, int multi, skb = audit_make_reply(pid, seq, type, done, multi, payload, size); if (!skb) - return; + goto out; reply->pid = pid; reply->skb = skb; tsk = kthread_run(audit_send_reply_thread, reply, "audit_send_reply"); - if (IS_ERR(tsk)) { - kfree(reply); - kfree_skb(skb); - } + if (!IS_ERR(tsk)) + return; + kfree_skb(skb); +out: + kfree(reply); } /* -- cgit v1.2.3 From 6793a051fb9311f0f1ab7eafc5a9e69b8a1bd8d4 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 14 May 2008 17:10:12 -0700 Subject: [PATCH] list_for_each_rcu must die: audit All uses of list_for_each_rcu() can be profitably replaced by the easier-to-use list_for_each_entry_rcu(). This patch makes this change for the Audit system, in preparation for removing the list_for_each_rcu() API entirely. This time with well-formed SOB. Signed-off-by: Paul E. McKenney Signed-off-by: Al Viro --- kernel/audit_tree.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c index 9ef5e0aacc3c..f7921a2ecf16 100644 --- a/kernel/audit_tree.c +++ b/kernel/audit_tree.c @@ -172,10 +172,9 @@ static void insert_hash(struct audit_chunk *chunk) struct audit_chunk *audit_tree_lookup(const struct inode *inode) { struct list_head *list = chunk_hash(inode); - struct list_head *pos; + struct audit_chunk *p; - list_for_each_rcu(pos, list) { - struct audit_chunk *p = container_of(pos, struct audit_chunk, hash); + list_for_each_entry_rcu(p, list, hash) { if (p->watch.inode == inode) { get_inotify_watch(&p->watch); return p; -- cgit v1.2.3