summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2009-01-18mm lockless pagecache barrier fixNick Piggin
commit e8c82c2e23e3527e0c9dc195e432c16784d270fa upstream. An XFS workload showed up a bug in the lockless pagecache patch. Basically it would go into an "infinite" loop, although it would sometimes be able to break out of the loop! The reason is a missing compiler barrier in the "increment reference count unless it was zero" case of the lockless pagecache protocol in the gang lookup functions. This would cause the compiler to use a cached value of struct page pointer to retry the operation with, rather than reload it. So the page might have been removed from pagecache and freed (refcount==0) but the lookup would not correctly notice the page is no longer in pagecache, and keep attempting to increment the refcount and failing, until the page gets reallocated for something else. This isn't a data corruption because the condition will be detected if the page has been reallocated. However it can result in a lockup. Linus points out that ACCESS_ONCE is also required in that pointer load, even if it's absence is not causing a bug on our particular build. The most general way to solve this is just to put an rcu_dereference in radix_tree_deref_slot. Assembly of find_get_pages, before: .L220: movq (%rbx), %rax #* ivtmp.1162, tmp82 movq (%rax), %rdi #, prephitmp.1149 .L218: testb $1, %dil #, prephitmp.1149 jne .L217 #, testq %rdi, %rdi # prephitmp.1149 je .L203 #, cmpq $-1, %rdi #, prephitmp.1149 je .L217 #, movl 8(%rdi), %esi # <variable>._count.counter, c testl %esi, %esi # c je .L218 #, after: .L212: movq (%rbx), %rax #* ivtmp.1109, tmp81 movq (%rax), %rdi #, ret testb $1, %dil #, ret jne .L211 #, testq %rdi, %rdi # ret je .L197 #, cmpq $-1, %rdi #, ret je .L211 #, movl 8(%rdi), %esi # <variable>._count.counter, c testl %esi, %esi # c je .L212 #, (notice the obvious infinite loop in the first example, if page->count remains 0) Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18mm: fix assertionNick Piggin
commit 18e6959c385f3edf3991fa6662a53dac4eb10d5b upstream. This assertion is incorrect for lockless pagecache. By definition if we have an unpinned page that we are trying to take a speculative reference to, it may become the tail of a compound page at any time (if it is freed, then reallocated as a compound page). It was still a valid assertion for the vmscan.c LRU isolation case, but it doesn't seem incredibly helpful... if somebody wants it, they can put it back directly where it applies in the vmscan code. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18fs: symlink write_begin allocation context fixNick Piggin
commit 54566b2c1594c2326a645a3551f9d989f7ba3c5e upstream. With the write_begin/write_end aops, page_symlink was broken because it could no longer pass a GFP_NOFS type mask into the point where the allocations happened. They are done in write_begin, which would always assume that the filesystem can be entered from reclaim. This bug could cause filesystem deadlocks. The funny thing with having a gfp_t mask there is that it doesn't really allow the caller to arbitrarily tinker with the context in which it can be called. It couldn't ever be GFP_ATOMIC, for example, because it needs to take the page lock. The only thing any callers care about is __GFP_FS anyway, so turn that into a single flag. Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on this flag in their write_begin function. Change __grab_cache_page to accept a nofs argument as well, to honour that flag (while we're there, change the name to grab_cache_page_write_begin which is more instructive and does away with random leading underscores). This is really a more flexible way to go in the end anyway -- if a filesystem happens to want any extra allocations aside from the pagecache ones in ints write_begin function, it may now use GFP_KERNEL (rather than GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a random example). [kosaki.motohiro@jp.fujitsu.com: fix ubifs] [kosaki.motohiro@jp.fujitsu.com: fix fuse] Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Cleaned up the calling convention: just pass in the AOP flags untouched to the grab_cache_page_write_begin() function. That just simplifies everybody, and may even allow future expansion of the logic. - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 33Heiko Carstens
commit 2b66421995d2e93c9d1a0111acf2581f8529c6e5 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 32Heiko Carstens
commit d4e82042c4cfa87a7d51710b71f568fe80132551 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18powerpc: Enable syscall wrappers for 64-bitBenjamin Herrenschmidt
commit ee6a093222549ac0c72cfd296c69fa5e7d6daa34 upstream. This enables the use of syscall wrappers to do proper sign extension for 64-bit programs. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrapper infrastructureHeiko Carstens
commit 1a94bc34768e463a93cb3751819709ab0ea80a01 upstream. From: Martin Schwidefsky <schwidefsky@de.ibm.com> By selecting HAVE_SYSCALL_WRAPPERS architectures can activate system call wrappers in order to sign extend system call arguments. All architectures where the ABI defines that the caller of a function has to perform sign extension probably need this. Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18Rename old_readdir to sys_old_readdirHeiko Carstens
commit e55380edf68796d75bf41391a781c68ee678587d upstream. This way it matches the generic system call name convention. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18Convert all system calls to return a longHeiko Carstens
commit 2ed7c03ec17779afb4fcfa3b8c61df61bd4879ba upstream. Convert all system calls to return a long. This should be a NOP since all converted types should have the same size anyway. With the exception of sys_exit_group which returned void. But that doesn't matter since the system call doesn't return. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18Move compat system call declarations to compat header fileHeiko Carstens
commit 4c696ba7982501d43dea11dbbaabd2aa8a19cc42 upstream. Move declarations to correct header file. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18inotify: fix type errors in interfacesMichael Kerrisk
commit 4ae8978cf92a96257cd8998a49e781be83571d64 upstream. The problems lie in the types used for some inotify interfaces, both at the kernel level and at the glibc level. This mail addresses the kernel problem. I will follow up with some suggestions for glibc changes. For the sys_inotify_rm_watch() interface, the type of the 'wd' argument is currently 'u32', it should be '__s32' . That is Robert's suggestion, and is consistent with the other declarations of watch descriptors in the kernel source, in particular, the inotify_event structure in include/linux/inotify.h: struct inotify_event { __s32 wd; /* watch descriptor */ __u32 mask; /* watch mask */ __u32 cookie; /* cookie to synchronize two events */ __u32 len; /* length (including nulls) of name */ char name[0]; /* stub for possible name */ }; The patch makes the changes needed for inotify_rm_watch(). Signed-off-by: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Robert Love <rlove@google.com> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18sched_clock: prevent scd->clock from moving backwards, take #2Thomas Gleixner
commit 1c5745aa380efb6417b5681104b007c8612fb496 upstream. Redo: 5b7dba4: sched_clock: prevent scd->clock from moving backwards which had to be reverted due to s2ram hangs: ca7e716: Revert "sched_clock: prevent scd->clock from moving backwards" ... this time with resume restoring GTOD later in the sequence taken into account as well. The "timekeeping_suspended" flag is not very nice but we cannot call into GTOD before it has been properly resumed and the scheduler will run very early in the resume sequence. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-18can: Fix CAN_(EFF|RTR)_FLAG handling in can_filterOliver Hartkopp
commit d253eee20195b25e298bf162a6e72f14bf4803e5 upstream. Due to a wrong safety check in af_can.c it was not possible to filter for SFF frames with a specific CAN identifier without getting the same selected CAN identifier from a received EFF frame also. This fix has a minimum (but user visible) impact on the CAN filter API and therefore the CAN version is set to a new date. Indeed the 'old' API is still working as-is. But when now setting CAN_(EFF|RTR)_FLAG in can_filter.can_mask you might get less traffic than before - but still the stuff that you expected to get for your defined filter ... Thanks to Kurt Van Dijck for pointing at this issue and for the review. Signed-off-by: Oliver Hartkopp <oliver@hartkopp.net> Acked-by: Kurt Van Dijck <kurt.van.dijck@eia.be> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-13pnp: make the resource type an unsigned longRene Herman
commit b563cf59c4d67da7d671788a9848416bfa4180ab upstream. PnP encodes the resource type directly as its struct resource->flags value which is an unsigned long. Make it so... Signed-off-by: Rene Herman <rene.herman@gmail.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Acked-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-13Allow recursion in binfmt_script and binfmt_miscKirill A. Shutemov
commit bf2a9a39639b8b51377905397a5005f444e9a892 upstream. binfmt_script and binfmt_misc disallow recursion to avoid stack overflow using sh_bang and misc_bang. It causes problem in some cases: $ echo '#!/bin/ls' > /tmp/t0 $ echo '#!/tmp/t0' > /tmp/t1 $ echo '#!/tmp/t1' > /tmp/t2 $ chmod +x /tmp/t* $ /tmp/t2 zsh: exec format error: /tmp/t2 Similar problem with binfmt_misc. This patch introduces field 'recursion_depth' into struct linux_binprm to track recursion level in binfmt_misc and binfmt_script. If recursion level more then BINPRM_MAX_RECURSION it generates -ENOEXEC. [akpm@linux-foundation.org: make linux_binprm.recursion_depth a uint] Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-13jbd: fix error handling for checkpoint ioHidehiro Kawai
commit 4afe978530702c934dfdb11f54073136818b2119 upstream. When a checkpointing IO fails, current JBD code doesn't check the error and continue journaling. This means latest metadata can be lost from both the journal and filesystem. This patch leaves the failed metadata blocks in the journal space and aborts journaling in the case of log_do_checkpoint(). To achieve this, we need to do: 1. don't remove the failed buffer from the checkpoint list where in the case of __try_to_free_cp_buf() because it may be released or overwritten by a later transaction 2. log_do_checkpoint() is the last chance, remove the failed buffer from the checkpoint list and abort the journal 3. when checkpointing fails, don't update the journal super block to prevent the journaled contents from being cleaned. For safety, don't update j_tail and j_tail_sequence either 4. when checkpointing fails, notify this error to the ext3 layer so that ext3 don't clear the needs_recovery flag, otherwise the journaled contents are ignored and cleaned in the recovery phase 5. if the recovery fails, keep the needs_recovery flag 6. prevent cleanup_journal_tail() from being called between __journal_drop_transaction() and journal_abort() (a race issue between journal_flush() and __log_wait_for_space() Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Jan Kara <jack@suse.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-13Enforce a minimum SG_IO timeoutLinus Torvalds
commit f2f1fa78a155524b849edf359e42a3001ea652c0 upstream. There's no point in having too short SG_IO timeouts, since if the command does end up timing out, we'll end up through the reset sequence that is several seconds long in order to abort the command that timed out. As a result, shorter timeouts than a few seconds simply do not make sense, as the recovery would be longer than the timeout itself. Add a BLK_MIN_SG_TIMEOUT to match the existign BLK_DEFAULT_SG_TIMEOUT. Suggested-by: Alan Cox <alan@lxorguk.ukuu.org.uk> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: Jeff Garzik <jeff@garzik.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05jbd2: fix /proc setup for devices that contain '/' in their namesTheodore Ts'o
trimed down version of commit 05496769e5da83ce22ed97345afd9c7b71d6bd24 upstream. Some devices such as "cciss/c0d0p9" will cause jbd2 setup and teardown failures when /proc filenames are created with embedded slashes. This is a slimmed down version of commit 05496769, with the stack reduction aspects of the patch omitted to meet the -stable criteria. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05libata: blacklist Seagate drives which time out FLUSH_CACHE when used with NCQTejun Heo
commit ac70a964b0e22a95af3628c344815857a01461b7 upstream. Some recent Seagate harddrives have firmware bug which causes FLUSH CACHE to timeout under certain circumstances if NCQ is being used. This can be worked around by disabling NCQ and fixed by updating the firmware. Implement ATA_HORKAGE_FIRMWARE_UPDATE and blacklist these devices. The wiki page has been updated to contain information on this issue. http://ata.wiki.kernel.org/index.php/Known_issues Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jeff Garzik <jgarzik@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05PCI: Hotplug core: remove 'name'Alex Chiang
commit 58319b802a614f10f1b5238fbde7a4b2e9a60069 upstream. Now that the PCI core manages the 'name' for each individual hotplug driver, and all drivers (except rpaphp) have been converted to use hotplug_slot_name(), there is no need for the PCI hotplug core to drag around its own copy of name either. Cc: kristen.c.accardi@intel.com Cc: matthew@wil.cx Acked-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com> Signed-off-by: Alex Chiang <achiang@hp.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05PCI, PCI Hotplug: introduce slot_name helpersAlex Chiang
commit 0ad772ec464d3fcf9d210836b97e654f393606c4 upstream In preparation for cleaning up the various hotplug drivers such that they don't have to manage their own 'name' parameters anymore, we provide the following convenience functions: pci_slot_name() hotplug_slot_name() These helpers will be used by individual hotplug drivers. Cc: kristen.c.accardi@intel.com Cc: matthew@wil.cx Acked-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com> Signed-off-by: Alex Chiang <achiang@hp.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05PCI: update pci_create_slot() to take a 'hotplug' paramAlex Chiang
commit 828f37683e6d3ab5912989df0d04201db7ad798e upstream. Slot detection drivers can co-exist with hotplug drivers. The names of the detected/claimed slots may be different depending on module load order. For legacy reasons, we need to allow hotplug drivers to override the slot name if a detection driver is loaded first (and they find the same slots). Creating and overriding slot names should be an atomic operation, otherwise you get a locking nightmare as various drivers race to call pci_create_slot(). pci_create_slot() is already serialized by grabbing the pci_bus_sem. We update the API and add a 'hotplug' param, which is: set if the caller is a hotplug driver NULL if the caller is a detection driver pci_create_slot() does not actually use the 'hotplug' parameter in this patch. A later patch will add the logic that uses it. Cc: kristen.c.accardi@intel.com Cc: matthew@wil.cx Acked-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com> Signed-off-by: Alex Chiang <achiang@hp.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05PCI Hotplug core: add 'name' param pci_hp_register interfaceAlex Chiang
commit 1359f2701b96abd9bb69c1273fb995a093b6409a upstream. Update pci_hp_register() to take a const char *name parameter. The motivation for this is to clean up the individual hotplug drivers so that each one does not have to manage its own name. The PCI core should be the place where we manage the name. We update the interface and all callsites first, in a "no functional change" manner, and clean up the drivers later. Cc: kristen.c.accardi@intel.com Acked-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com> Reviewed-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Alex Chiang <achiang@hp.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05lib/idr.c: fix rcu related race with idr_findManfred Spraul
commit 6ff2d39b91aec3dcae951afa982059e3dd9b49dc upstream. 2nd part of the fixes needed for http://bugzilla.kernel.org/show_bug.cgi?id=11796. When the idr tree is either grown or shrunk, then the update to the number of layers and the top pointer were not atomic. This race caused crashes. The attached patch fixes that by replicating the layers counter in each layer, thus idr_find doesn't need idp->layers anymore. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Clement Calmels <cboulte@gmail.com> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Cc: Pierre Peiffer <peifferp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05Fix inotify watch removal/umount racesAl Viro
commit 8f7b0ba1c853919b85b54774775f567f30006107 upstream. Inotify watch removals suck violently. To kick the watch out we need (in this order) inode->inotify_mutex and ih->mutex. That's fine if we have a hold on inode; however, for all other cases we need to make damn sure we don't race with umount. We can *NOT* just grab a reference to a watch - inotify_unmount_inodes() will happily sail past it and we'll end with reference to inode potentially outliving its superblock. Ideally we just want to grab an active reference to superblock if we can; that will make sure we won't go into inotify_umount_inodes() until we are done. Cleanup is just deactivate_super(). However, that leaves a messy case - what if we *are* racing with umount() and active references to superblock can't be acquired anymore? We can bump ->s_count, grab ->s_umount, which will almost certainly wait until the superblock is shut down and the watch in question is pining for fjords. That's fine, but there is a problem - we might have hit the window between ->s_active getting to 0 / ->s_count - below S_BIAS (i.e. the moment when superblock is past the point of no return and is heading for shutdown) and the moment when deactivate_super() acquires ->s_umount. We could just do drop_super() yield() and retry, but that's rather antisocial and this stuff is luser-triggerable. OTOH, having grabbed ->s_umount and having found that we'd got there first (i.e. that ->s_root is non-NULL) we know that we won't race with inotify_umount_inodes(). So we could grab a reference to watch and do the rest as above, just with drop_super() instead of deactivate_super(), right? Wrong. We had to drop ih->mutex before we could grab ->s_umount. So the watch could've been gone already. That still can be dealt with - we need to save watch->wd, do idr_find() and compare its result with our pointer. If they match, we either have the damn thing still alive or we'd lost not one but two races at once, the watch had been killed and a new one got created with the same ->wd at the same address. That couldn't have happened in inotify_destroy(), but inotify_rm_wd() could run into that. Still, "new one got created" is not a problem - we have every right to kill it or leave it alone, whatever's more convenient. So we can use idr_find(...) == watch && watch->inode->i_sb == sb as "grab it and kill it" check. If it's been our original watch, we are fine, if it's a newcomer - nevermind, just pretend that we'd won the race and kill the fscker anyway; we are safe since we know that its superblock won't be going away. And yes, this is far beyond mere "not very pretty"; so's the entire concept of inotify to start with. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Greg KH <greg@kroah.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05epoll: introduce resource usage limitsDavide Libenzi
commit 7ef9964e6d1b911b78709f144000aacadd0ebc21 upstream. It has been thought that the per-user file descriptors limit would also limit the resources that a normal user can request via the epoll interface. Vegard Nossum reported a very simple program (a modified version attached) that can make a normal user to request a pretty large amount of kernel memory, well within the its maximum number of fds. To solve such problem, default limits are now imposed, and /proc based configuration has been introduced. A new directory has been created, named /proc/sys/fs/epoll/ and inside there, there are two configuration points: max_user_instances = Maximum number of devices - per user max_user_watches = Maximum number of "watched" fds - per user The current default for "max_user_watches" limits the memory used by epoll to store "watches", to 1/32 of the amount of the low RAM. As example, a 256MB 32bit machine, will have "max_user_watches" set to roughly 90000. That should be enough to not break existing heavy epoll users. The default value for "max_user_instances" is set to 128, that should be enough too. This also changes the userspace, because a new error code can now come out from EPOLL_CTL_ADD (-ENOSPC). The EMFILE from epoll_create() was already listed, so that should be ok. [akpm@linux-foundation.org: use get_current_user()] Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Reported-by: Vegard Nossum <vegardno@ifi.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-11-20USB: don't register endpoints for interfaces that are going awayAlan Stern
commit 352d026338378b1f13f044e33c1047da6e470056 upstream. This patch (as1155) fixes a bug in usbcore. When interfaces are deleted, either because the device was disconnected or because of a configuration change, the extra attribute files and child endpoint devices may get left behind. This is because the core removes them before calling device_del(). But during device_del(), after the driver is unbound the core will reinstall altsetting 0 and recreate those extra attributes and children. The patch prevents this by adding a flag to record when the interface is in the midst of being unregistered. When the flag is set, the attribute files and child devices will not be created. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-11-20block: fix nr_phys_segments miscalculation bugFUJITA Tomonori
commit 8677142710516d986d932d6f1fba7be8382c1fec upstream backported by Nikanth Karthikesan <knikanth@suse.de> to the 2.6.27.y tree. block: fix nr_phys_segments miscalculation bug This fixes the bug reported by Nikanth Karthikesan <knikanth@suse.de>: http://lkml.org/lkml/2008/10/2/203 The root cause of the bug is that blk_phys_contig_segment miscalculates q->max_segment_size. blk_phys_contig_segment checks: req->biotail->bi_size + next_req->bio->bi_size > q->max_segment_size But blk_recalc_rq_segments might expect that req->biotail and the previous bio in the req are supposed be merged into one segment. blk_recalc_rq_segments might also expect that next_req->bio and the next bio in the next_req are supposed be merged into one segment. In such case, we merge two requests that can't be merged here. Later, blk_rq_map_sg gives more segments than it should. We need to keep track of segment size in blk_recalc_rq_segments and use it to see if two requests can be merged. This patch implements it in the similar way that we used to do for hw merging (virtual merging). Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Cc: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-11-13MTD: [NOR] Fix cfi_send_gen_cmd handling of x16 devices in x8 mode (v4)Eric W. Biederman
commit 467622ef2acb01986eab37ef96c3632b3ea35999 upstream For "unlock" cycles to 16bit devices in 8bit compatibility mode we need to use the byte addresses 0xaaa and 0x555. These effectively match the word address 0x555 and 0x2aa, except the latter has its low bit set. Most chips don't care about the value of the 'A-1' pin in x8 mode, but some -- like the ST M29W320D -- do. So we need to be careful to set it where appropriate. cfi_send_gen_cmd is only ever passed addresses where the low byte is 0x00, 0x55 or 0xaa. Of those, only addresses ending 0xaa are affected by this patch, by masking in the extra low bit when the device is known to be in compatibility mode. [dwmw2: Do it only when (cmd_ofs & 0xff) == 0xaa] v4: Fix stupid typo in cfi_build_cmd_addr that failed to compile I'm writing this patch way to late at night. v3: Bring all of the work back into cfi_build_cmd_addr including calling of map_bankwidth(map) and cfi_interleave(cfi) So every caller doesn't need to. v2: Only modified the address if we our device_type is larger than our bus width. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-11-07net: Fix recursive descent in __scm_destroy().David Miller
commit f8d570a4745835f2238a33b537218a1bb03fc671 and 3b53fbf4314594fa04544b02b2fc6e607912da18 upstream (because once wasn't good enough...) __scm_destroy() walks the list of file descriptors in the scm_fp_list pointed to by the scm_cookie argument. Those, in turn, can close sockets and invoke __scm_destroy() again. There is nothing which limits how deeply this can occur. The idea for how to fix this is from Linus. Basically, we do all of the fput()s at the top level by collecting all of the scm_fp_list objects hit by an fput(). Inside of the initial __scm_destroy() we keep running the list until it is empty. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-10-05ide-cd: temporary tray close fixBorislav Petkov
This one fixes http://bugzilla.kernel.org/show_bug.cgi?id=11602. A more generic fix for drives which cannot autoclose tray will follow. Signed-off-by: Borislav Petkov <petkovbb@gmail.com> Cc: Jens Axboe <jens.axboe@oracle.com> [bart: add an extra parentheses for consistency with the rest of kernel code] Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-10-03include/linux/stacktrace.h: declare struct task_structAndrew Morton
include/linux/stacktrace.h:13: warning: 'struct task_struct' declared inside parameter list (This might be a hard error on sparc64, which uses this header and has -Werror) Reported-by: "Randy.Dunlap" <rdunlap@xenotime.net> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-02mm: tiny-shmem nommu fixNick Piggin
The previous patch db203d53d474aa068984e409d807628f5841da1b ("mm: tiny-shmem fix lock ordering: mmap_sem vs i_mutex") to fix the lock ordering in tiny-shmem breaks shared anonymous and IPC memory on NOMMU architectures because it was using the expanding truncate to signal ramfs to allocate a physically contiguous RAM backing the inode (otherwise it is unusable for "memory mapping" it to userspace). However do_truncate is what caused the lock ordering error, due to it taking i_mutex. In this case, we can actually just call ramfs directly to allocate memory for the mapping, rather than go via truncate. Acked-by: David Howells <dhowells@redhat.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-29hrtimer: prevent migration of per CPU hrtimersThomas Gleixner
Impact: per CPU hrtimers can be migrated from a dead CPU The hrtimer code has no knowledge about per CPU timers, but we need to prevent the migration of such timers and warn when such a timer is active at migration time. Explicitely mark the timers as per CPU and use a more understandable mode descriptor for the interrupts safe unlocked callback mode, which is used by hrtimer_sleeper and the scheduler code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-09-29hrtimer: mark migration stateThomas Gleixner
Impact: during migration active hrtimers can be seen as inactive The migration code removes the hrtimers from the queues of the dead CPU and sets the state temporary to INACTIVE. The enqueue code sets it to ACTIVE/PENDING again. Prevent that the wrong state can be seen by using a separate migration state bit. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-09-24MN10300: Move asm-arm/cnt32_to_63.h to include/linux/David Howells
Move asm-arm/cnt32_to_63.h to include/linux/ so that MN10300 can make use of it too. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-23Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: PCI: fix compiler warnings in pci_get_subsys() PCI: Fix pcie_aspm=force
2008-09-23smb.h: do not include linux/time.h in userspaceKirill A. Shutemov
linux/time.h conflicts with time.h from glibc It breaks building smbmount from samba. It's regression introduced by commit 76308da (" smb.h: uses struct timespec but didn't include linux/time.h"). Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: <stable@kernel.org> [2.6.26.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-19Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: IPoIB: Fix deadlock on RTNL between bcast join comp and ipoib_stop() RDMA/nes: Fix client side QP destroy IB/mlx4: Fix up fast register page list format mlx4_core: Set RAE and init mtt_sz field in FRMR MPT entries
2008-09-16Fix PNP build failure, bugzilla #11276David Miller
This fill fix the following regression list entry: Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11276 Subject : build error: CONFIG_OPTIMIZE_INLINING=y causes gcc 4.2 to do stupid things Submitter : Randy Dunlap <randy.dunlap@oracle.com> Date : 2008-08-06 17:18 (38 days old) References : http://marc.info/?l=linux-kernel&m=121804329014332&w=4 http://lkml.org/lkml/2008/7/22/353 Handled-By : Bjorn Helgaas <bjorn.helgaas@hp.com> Patch : http://lkml.org/lkml/2008/7/22/364 with what I believe is a better fix than the one referenced in the regression entry above. These PNP header interfaces try to work in such a way that you can reference some of them even if PNP is not enabled, and the compiler was expected to optimize everything away. Which is mostly fine, except that there was one interface for which there was not provided an inline "NOP" implementation. Once we add that, all of these compile failures cannot handle any more. pnp: Provide NOP inline implementation of pnp_get_resource() when !PNP Fixes kernel bugzilla #11276. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-16PCI: fix compiler warnings in pci_get_subsys()Greg KH
pci_get_subsys() changed in 2.6.26 so that the from pointer is modified when the call is being invoked, so fix up the 'const' marking of it that the compiler is complaining about. Reported-by: Rufus & Azrael <rufus-azrael@numericable.fr> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2008-09-15IB/mlx4: Fix up fast register page list formatVladimir Sokolovsky
Byte swap the addresses in the page list for fast register work requests to big endian to match what the HCA expectx. Also, the addresses must have the "present" bit set so that the HCA knows it can access them. Otherwise the HCA will fault the first time it accesses the memory region. Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-09-13Merge branch 'upstream-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev: [libata] LBA28/LBA48 off-by-one bug in ata.h sata_inic162x: enable LED blinking ata: duplicate variable sparse warning
2008-09-13memstick: fix MSProHG 8-bit interface mode supportAlex Dubov
- 8-bit interface mode never worked properly. The only adapter I have which supports the 8b mode (the Jmicron) had some problems with its clock wiring and they discovered it only now. We also discovered that ProHG media is more sensitive to the ordering of initialization commands. - Make the driver fall back to highest supported mode instead of always falling back to serial. The driver will attempt the switch to 8b mode for any new MSPro card, but not all of them support it. Previously, these new cards ended up in serial mode, which is not the best idea (they work fine with 4b, after all). - Edit some macros for better conformance to Sony documentation Signed-off-by: Alex Dubov <oakad@yahoo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-13mm: mark the correct zone as full when scanning zonelistsMel Gorman
The iterator for_each_zone_zonelist() uses a struct zoneref *z cursor when scanning zonelists to keep track of where in the zonelist it is. The zoneref that is returned corresponds to the the next zone that is to be scanned, not the current one. It was intended to be treated as an opaque list. When the page allocator is scanning a zonelist, it marks elements in the zonelist corresponding to zones that are temporarily full. As the zonelist is being updated, it uses the cursor here; if (NUMA_BUILD) zlc_mark_zone_full(zonelist, z); This is intended to prevent rescanning in the near future but the zoneref cursor does not correspond to the zone that has been found to be full. This is an easy misunderstanding to make so this patch corrects the problem by changing zoneref cursor to be the current zone being scanned instead of the next one. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@kernel.org> [2.6.26.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-13include/linux/ioport.h: add missing macro argument for devm_release_* familyHiroshi DOYU
akpm: these have no callers at this time, but they shall soon, so let's get them right. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Hiroshi DOYU <Hiroshi.DOYU@nokia.com> Cc: Tony Lindgren <tony@atomide.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-13[libata] LBA28/LBA48 off-by-one bug in ata.hTaisuke Yamada
I recently bought 3 HGST P7K500-series 500GB SATA drives and had trouble accessing the block right on the LBA28-LBA48 border. Here's how it fails (same for all 3 drives): # dd if=/dev/sdc bs=512 count=1 skip=268435455 > /dev/null dd: reading `/dev/sdc': Input/output error 0+0 records in 0+0 records out 0 bytes (0 B) copied, 0.288033 seconds, 0.0 kB/s # dmesg ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 ata1.00: BMDMA stat 0x25 ata1.00: cmd c8/00:08:f8:ff:ff/00:00:00:00:00/ef tag 0 dma 4096 in res 51/04:08:f8:ff:ff/00:00:00:00:00/ef Emask 0x1 (device error) ata1.00: status: { DRDY ERR } ata1.00: error: { ABRT } ata1.00: configured for UDMA/33 ata1: EH complete ... After some investigations, it turned out this seems to be caused by misinterpretation of the ATA specification on LBA28 access. Following part is the code in question: === include/linux/ata.h === static inline int lba_28_ok(u64 block, u32 n_block) { /* check the ending block number */ return ((block + n_block - 1) < ((u64)1 << 28)) && (n_block <= 256); } HGST drive (sometimes) fails with LBA28 access of {block = 0xfffffff, n_block = 1}, and this behavior seems to be comformant. Other drives, including other HGST drives are not that strict, through. >From the ATA specification: (http://www.t13.org/Documents/UploadedDocuments/project/d1410r3b-ATA-ATAPI-6.pdf) 8.15.29 Word (61:60): Total number of user addressable sectors This field contains a value that is one greater than the total number of user addressable sectors (see 6.2). The maximum value that shall be placed in this field is 0FFFFFFFh. So the driver shouldn't use the value of 0xfffffff for LBA28 request as this exceeds maximum user addressable sector. The logical maximum value for LBA28 is 0xffffffe. The obvious fix is to cut "- 1" part, and the patch attached just do that. I've been using the patched kernel for about a month now, and the same fix is also floating on the net for some time. So I believe this fix works reliably. Just FYI, many Windows/Intel platform users also seems to be struck by this, and HGST has issued a note pointing to Intel ICH8/9 driver. "28-bit LBA command is being used to access LBAs 29-bits in length" http://www.hitachigst.com/hddt/knowtree.nsf/cffe836ed7c12018862565b000530c74/b531b8bce8745fb78825740f00580e23 Also, *BSDs seems to have similar fix included sometime around ~2004, through I have not checked out exact portion of the code. Signed-off-by: Taisuke Yamada <tai@rakugaki.org> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-09-11block: disable sysfs parts of the disk command filterJens Axboe
We still have life time issues with the sysfs command filter kobject, so disable it for 2.6.27 release. We can revisit this and make it work properly for 2.6.28, for 2.6.27 release it's too risky. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-09-08Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: arch_reinit_sched_domains() must destroy domains to force rebuild sched, cpuset: rework sched domains and CPU hotplug handling (v4)
2008-09-06Merge branch 'timers-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clocksource, acpi_pm.c: check for monotonicity clocksource, acpi_pm.c: use proper read function also in errata mode ntp: fix calculation of the next jiffie to trigger RTC sync x86: HPET: read back compare register before reading counter x86: HPET fix moronic 32/64bit thinko clockevents: broadcast fixup possible waiters HPET: make minimum reprogramming delta useful clockevents: prevent endless loop lockup clockevents: prevent multiple init/shutdown clockevents: enforce reprogram in oneshot setup clockevents: prevent endless loop in periodic broadcast handler clockevents: prevent clockevent event_handler ending up handler_noop