path: root/fs
Age    Commit message    Author
2011-11-11  hppfs: missing include  (Al Viro)
commit d6b722aa383a467a43d09ee38e866981abba08ab upstream. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11  nfsd4: ignore WANT bits in open downgrade  (J. Bruce Fields)
commit c30e92df30d7d5fe65262fbce5d1b7de675fe34e upstream. We don't use WANT bits yet--and sending them can probably trigger a BUG() further down. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11  nfsd4: fix open downgrade, again  (J. Bruce Fields)
commit 3d02fa29dec920c597dd7b7db608a4bc71f088ce upstream. Yet another open-management regression:
  - nfs4_file_downgrade() doesn't remove the BOTH access bit on downgrade, so the server's idea of the stateid's access gets out of sync with the client's. If we want to keep an O_RDWR open in this case, we should do that in the file_put_access logic rather than here.
  - We forgot to convert v4 access to an open mode here.
This logic has proven too hard to get right. In the future we may consider:
  - reexamining the lock/openowner relationship (locks probably don't really need to take their own references here).
  - adding open upgrade/downgrade support to the vfs.
  - removing the atomic operations. They're redundant as long as this is all under some other lock.
Also, maybe some kind of additional static checking would help catch O_/NFS4_SHARE_ACCESS confusion. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11  nfsd4: permit read opens of executable-only files  (J. Bruce Fields)
commit a043226bc140a2c1dde162246d68a67e5043e6b2 upstream. A client that wants to execute a file must be able to read it. Read opens over nfs are therefore implicitly allowed for executable files even when those files are not readable. NFSv2/v3 get this right by using a passed-in NFSD_MAY_OWNER_OVERRIDE on read requests, but NFSv4 has gotten this wrong ever since dc730e173785e29b297aa605786c94adaffe2544 "nfsd4: fix owner-override on open", when we realized that the file owner shouldn't override permissions on non-reclaim NFSv4 opens. So we can't use NFSD_MAY_OWNER_OVERRIDE to tell nfsd_permission to allow reads of executable files. So, do the same thing we do whenever we encounter another weird NFS permission nit: define yet another NFSD_MAY_* flag. The industry's future standardization on 128-bit processors will be motivated primarily by the need for integers with enough bits for all the NFSD_MAY_* flags. Reported-by: Leonardo Borda <leonardoborda@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11  nfsd4: fix seqid_mutating_error  (J. Bruce Fields)
commit 576163005de286bbd418fcb99cfd0971523a0c6d upstream. The set of errors here does *not* agree with the set of errors specified in the rfc! While we're there, turn this macro into a function, for the usual reasons, and move it to the one place where it's actually used. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11  nfsd4: stop using nfserr_resource for transitory errors  (J. Bruce Fields)
commit 3e77246393c0a433247631a1f0e9ec98d3d78a1c upstream. The server is returning nfserr_resource for both permanent errors and for errors (like allocation failures) that might be resolved by retrying later. Save nfserr_resource for the former and use delay/jukebox for the latter. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11  nfsd4: Remove check for a 32-bit cookie in nfsd4_readdir()  (Bernd Schubert)
commit 832023bffb4b493f230be901f681020caf3ed1f8 upstream. Fan Yong <yong.fan@whamcloud.com> noticed setting FMODE_32bithash wouldn't work with nfsd v4, as nfsd4_readdir() checks for 32 bit cookies. However, according to RFC 3530 cookies have a 64 bit type and cookies are also defined as u64 in 'struct nfsd4_readdir'. So remove the test for >32-bit values. Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11  nfs: don't try to migrate pages with active requests  (Jeff Layton)
commit 2da956523526e440ef4f4dd174e26f5ac06fe011 upstream. nfs_find_and_lock_request will take a reference to the nfs_page and will then put it if the req is already locked. It's possible though that the reference will be the last one. That put can then kick off a whole series of reference puts: nfs_page -> nfs_open_context -> dentry -> inode. If the inode ends up being deleted, then the VFS will call truncate_inode_pages. That function will try to take the page lock, but it was already locked when migrate_page was called. The code deadlocks. Fix this by simply refusing the migration request if PagePrivate is already set, indicating that the page is already associated with an active read or write request. We've had a customer test a backported version of this patch and the preliminary results seem good. Cc: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Harshula Jayasuriya <harshula@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
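As a rough sketch of the shape described above (illustrative only, not the literal hunk; the helper names are the ones referenced in the message):

  /* Sketch: refuse to migrate a page that still has an nfs_page
   * request attached (PagePrivate set), rather than trying to lock
   * and tear down the request here and deadlocking. */
  if (PagePrivate(page))
          return -EBUSY;
  return migrate_page(mapping, newpage, page);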
2011-11-11  nfs: don't redirty inode when ncommit == 0 in nfs_commit_unstable_pages  (Jeff Layton)
commit 3236c3e1adc0c7ec83eaff1de2d06746b7c5bb28 upstream. commit 420e3646 allowed the kernel to reduce the number of unnecessary commit calls by skipping the commit when there are a large number of outstanding pages. However, the current test in nfs_commit_unstable_pages does not handle the edge condition properly. When ncommit == 0, then that means that the kernel doesn't need to do anything more for the inode. The current test though in the WB_SYNC_NONE case will return true, and the inode will end up being marked dirty. Once that happens the inode will never be clean until there's a WB_SYNC_ALL flush. Fix this by immediately returning from nfs_commit_unstable_pages when ncommit == 0. Mike noticed this problem initially in RHEL5 (2.6.18-based kernel) which has a backported version of 420e3646. The inode cache there was growing very large. The inode cache was unable to be shrunk since the inodes were all marked dirty. Calling sync() would essentially "fix" the problem -- the WB_SYNC_ALL flush would result in the inodes all being marked clean. What I'm not clear on is how big a problem this is in mainline kernels as the writeback code there is very different. Either way, it seems incorrect to re-mark the inode dirty in this case. Reported-by: Mike McLean <mikem@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
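Conceptually (a sketch, not the actual diff), the early return described above amounts to:

  /* Sketch: nothing left to commit for this inode, so return without
   * re-marking it dirty; otherwise it can never become clean until a
   * WB_SYNC_ALL flush. */
  if (ncommit == 0)
          return 0;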
2011-11-11  Revert "NFS: Ensure that writeback_single_inode() calls write_inode() when syncing"  (Trond Myklebust)
commit 59b7c05fffba030e5d9e72324691e2f99aa69b79 upstream. This reverts commit b80c3cb628f0ebc241b02e38dd028969fb8026a2. The reverted commit was rendered obsolete by a VFS fix: commit 5547e8aac6f71505d621a612de2fca0dd988b439 (writeback: Update dirty flags in two steps). We now no longer need to worry about writeback_single_inode() missing our marking the inode for COMMIT in the 'do_writepages()' call. Reverting this patch fixes a performance regression in which the inode would continuously get queued to the dirty list, causing the writeback code to unnecessarily try to send a COMMIT. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Tested-by: Simon Kirby <sim@hostway.ca> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11  epoll: fix spurious lockdep warnings  (Nelson Elhage)
commit d8805e633e054c816c47cb6e727c81f156d9253d upstream. epoll can recursively acquire ep->mtx on multiple "struct eventpoll"s at once in the case where one epoll fd is monitoring another epoll fd. This is perfectly OK, since we're careful about the lock ordering, but it causes spurious lockdep warnings. Annotate the recursion using mutex_lock_nested, and add a comment explaining the nesting rules for good measure. Recent versions of systemd are triggering this, and it can also be demonstrated with the following trivial test program:

--------------------8<--------------------
int main(void)
{
        int e1, e2;
        struct epoll_event evt = { .events = EPOLLIN };

        e1 = epoll_create1(0);
        e2 = epoll_create1(0);
        epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt);
        return 0;
}
--------------------8<--------------------

Reported-by: Paul Bolle <pebolle@tiscali.nl> Tested-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Nelson Elhage <nelhage@nelhage.com> Acked-by: Jason Baron <jbaron@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
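The lockdep annotation mentioned above relies on mutex_lock_nested(); a minimal sketch of the idea (the subclass value here is purely illustrative):

  /* Sketch: tell lockdep this nested ep->mtx acquisition is an
   * intentional, correctly ordered recursion rather than a deadlock. */
  mutex_lock_nested(&ep->mtx, 1 /* nesting depth, illustrative */);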
2011-11-11  CIFS: Fix DFS handling in cifs_get_file_info  (Pavel Shilovsky)
commit 42274bb22afc3e877ae5abed787b0b09d7dede52 upstream. We should call cifs_all_info_to_fattr only in the rc == 0 case. Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11  CIFS: Fix incorrect max RFC1002 write size value  (Pavel Shilovsky)
commit 94443f43404239c2a6dc4252a7cb9e77f5b1eb6e upstream. The RFC1002 length field has only 17 bits. Acked-by: Jeff Layton <jlayton@samba.org> Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
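For reference, a 17-bit length field can encode at most 2^17 - 1 = 0x1FFFF = 131071 bytes, so any advertised write size above that bound cannot be represented in the RFC1002 framing.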
2011-10-25  hfsplus: Fix kfree of wrong pointers in hfsplus_fill_super() error path  (Seth Forshee)
commit f588c960fcaa6fa8bf82930bb819c9aca4eb9347 upstream. Commit 6596528e391a ("hfsplus: ensure bio requests are not smaller than the hardware sectors") changed the pointers used for volume header allocations but failed to free the correct pointers in the error paths of hfsplus_fill_super() and hfsplus_read_wrapper(). The second hunk came from a separate patch by Pavel Ivanov. Reported-by: Pavel Ivanov <paivanof@gmail.com> Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Christoph Hellwig <hch@tuxera.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-25  xfs: revert to using a kthread for AIL pushing  (Christoph Hellwig)
commit 0030807c66f058230bcb20d2573bcaf28852e804 upstream Currently we have a few issues with the way the workqueue code is used to implement AIL pushing:
  - it accidentally uses the same workqueue as the syncer action, and thus can be prevented from running if there are enough sync actions active in the system.
  - it doesn't use the HIGHPRI flag to queue at the head of the queue of work items
At this point I'm not confident enough in getting all the workqueue flags and tweaks right to provide a perfectly reliable execution context for AIL pushing, which is the most important piece in XFS to make forward progress when the log fills. Revert back to use a kthread per filesystem which fixes all the above issues at the cost of having a task struct and stack around for each mounted filesystem. In addition this also gives us much better ways to diagnose any issues involving hung AIL pushing and removes a small amount of code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Stefan Priebe <s.priebe@profihost.ag> Tested-by: Stefan Priebe <s.priebe@profihost.ag> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-25  xfs: force the log if we encounter pinned buffers in .iop_pushbuf  (Christoph Hellwig)
commit 17b38471c3c07a49f0bbc2ecc2e92050c164e226 upstream We need to check for pinned buffers even in .iop_pushbuf given that inode items flush into the same buffers that may be pinned directly due to operations on the unlinked inode list operating directly on buffers. To do this add a return value to .iop_pushbuf that tells the AIL push code about this and use the existing log force mechanisms to unpin it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Stefan Priebe <s.priebe@profihost.ag> Tested-by: Stefan Priebe <s.priebe@profihost.ag> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-25  xfs: do not update xa_last_pushed_lsn for locked items  (Christoph Hellwig)
commit bc6e588a8971aa74c02e42db4d6e0248679f3738 upstream If an item was locked we should not update xa_last_pushed_lsn and thus skip it when restarting the AIL scan as we need to be able to lock and write it out as soon as possible. Otherwise heavy lock contention might starve AIL pushing too easily, especially given the larger backoff once we moved xa_last_pushed_lsn all the way to the target lsn. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Stefan Priebe <s.priebe@profihost.ag> Tested-by: Stefan Priebe <s.priebe@profihost.ag> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-25  xfs: use a cursor for bulk AIL insertion  (Dave Chinner)
commit 1d8c95a363bf8cd4d4182dd19c01693b635311c2 upstream Delayed logging can insert tens of thousands of log items into the AIL at the same LSN. When the committing of log commit records occurs, we can get insertions occurring at an LSN that is not at the end of the AIL. If there are thousands of items in the AIL on the tail LSN, each insertion has to walk the AIL to find the correct place to insert the new item into the AIL. This can consume large amounts of CPU time and block other operations from occurring while the traversals are in progress. To avoid this repeated walk, use an AIL cursor to record where we should be inserting the new items into the AIL without having to repeat the walk. The cursor infrastructure already provides this functionality for push walks, so this is a simple extension of existing code. While this will not avoid the initial walk, it will avoid repeating it tens of thousands of times during a single checkpoint commit. This version includes logic improvements from Christoph Hellwig. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-25  xfs: start periodic workers later  (Christoph Hellwig)
commit 2bcf6e970f5a88fa05dced5eeb0326e13d93c4a1 upstream Start the periodic sync workers only after we have finished xfs_mountfs and thus fully set up the filesystem structures. Without this we can call into xfs_qm_sync before the quotainfo structure is set up if the mount takes unusually long, and probably hit other incomplete states as well. Also clean up the xfs_fs_fill_super error path by using consistent label names, and removing an impossible-to-reach case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-25  CIFS: Fix ERR_PTR dereference in cifs_get_root  (Pavel Shilovsky)
commit 5b980b01212199833ee8023770fa4cbf1b85e9f4 upstream. Move the check to the beginning of the loop. Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Cc: Josh Boyer <jwboyer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-25  hfsplus: ensure bio requests are not smaller than the hardware sectors  (Seth Forshee)
commit 6596528e391ad978a6a120142cba97a1d7324cb6 upstream. Currently all bio requests are 512 bytes, which may fail for media whose physical sector size is larger than this. Ensure these requests are not smaller than the block device logical block size. BugLink: http://bugs.launchpad.net/bugs/734883 Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Josh Boyer <jwboyer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-25  fuse: fix memory leak  (Miklos Szeredi)
commit 5dfcc87fd79dfb96ed155b524337dbd0da4f5993 upstream. kmemleak is reporting that 32 bytes are being leaked by FUSE:

unreferenced object 0xe373b270 (size 32):
  comm "fusermount", pid 1207, jiffies 4294707026 (age 2675.187s)
  hex dump (first 32 bytes):
    01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<b05517d7>] kmemleak_alloc+0x27/0x50
    [<b0196435>] kmem_cache_alloc+0xc5/0x180
    [<b02455be>] fuse_alloc_forget+0x1e/0x20
    [<b0245670>] fuse_alloc_inode+0xb0/0xd0
    [<b01b1a8c>] alloc_inode+0x1c/0x80
    [<b01b290f>] iget5_locked+0x8f/0x1a0
    [<b0246022>] fuse_iget+0x72/0x1a0
    [<b02461da>] fuse_get_root_inode+0x8a/0x90
    [<b02465cf>] fuse_fill_super+0x3ef/0x590
    [<b019e56f>] mount_nodev+0x3f/0x90
    [<b0244e95>] fuse_mount+0x15/0x20
    [<b019d1bc>] mount_fs+0x1c/0xc0
    [<b01b5811>] vfs_kern_mount+0x41/0x90
    [<b01b5af9>] do_kern_mount+0x39/0xd0
    [<b01b7585>] do_mount+0x2e5/0x660
    [<b01b7966>] sys_mount+0x66/0xa0

This leak report is consistent and happens once per boot on 3.1.0-rc5-dirty. This happens if a FORGET request is queued after the fuse device was released. Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Josh Boyer <jwboyer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-16  exec: do not call request_module() twice from search_binary_handler()  (Tetsuo Handa)
commit 912193521b719fbfc2f16776febf5232fe8ba261 upstream. Currently, search_binary_handler() tries to load a binary loader module using request_module() if a loader for the requested program is not yet loaded. But the second attempt at request_module() does not affect the result of search_binary_handler(). If request_module() triggered recursion, calling request_module() twice causes 2^MAX_KMOD_CONCURRENT (with MAX_KMOD_CONCURRENT = 50) repetitions. It is not an infinite loop, but it is enough for users to consider it a hang-up. Therefore, this patch changes the code to not call request_module() twice, so the recursion case causes only 1^MAX_KMOD_CONCURRENT = 1 repetition. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Richard Weinberger <richard@nod.at> Tested-by: Richard Weinberger <richard@nod.at> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Maxim Uvarov <muvarov@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-03  btrfs: fix d_off in the first dirent  (Hidetoshi Seto)
commit 3765fefaee2da83f10829fa64a74e6b7360350cb upstream. Since the d_off in the first dirent for "." (that originates from the 4th argument "offset" of filldir() for the 2nd dirent for "..") is wrongly assigned in btrfs_real_readdir(), telldir returns the same offset for different locations.

 | # mkfs.btrfs /dev/sdb1
 | # mount /dev/sdb1 fs0
 | # cd fs0
 | # touch file0 file1
 | # ../test
 | telldir: 0
 | readdir: d_off = 2, d_name = "."
 | telldir: 2
 | readdir: d_off = 2, d_name = ".."
 | telldir: 2
 | readdir: d_off = 3, d_name = "file0"
 | telldir: 3
 | readdir: d_off = 2147483647, d_name = "file1"
 | telldir: 2147483647

To fix this problem, pass filp->f_pos (which is loff_t) instead.

 | # ../test
 | telldir: 0
 | readdir: d_off = 1, d_name = "."
 | telldir: 1
 | readdir: d_off = 2, d_name = ".."
 | telldir: 2
 | readdir: d_off = 3, d_name = "file0"
 :

At the moment the "offset" for "." is unused because there is no preceding dirent, however it is better to pass filp->f_pos to follow grammatical usage. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Cc: Grazvydas Ignotas <notasas@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03  writeback: update dirtied_when for synced inode to prevent livelock  (Wu Fengguang)
commit 94c3dcbb0b0cdfd82cedd21705424d8044edc42c upstream. Explicitly update .dirtied_when on synced inodes, so that they are no longer considered for writeback in the next round. It can prevent both of the following livelock schemes:
  - while true; do echo data >> f; done
  - while true; do touch f; done        (in theory)
The exact livelock condition is, during sync(1):
  (1) no new inodes are dirtied
  (2) an inode being actively dirtied
On (2), the inode will be tagged and synced with .nr_to_write=LONG_MAX. When finished, it will be redirty_tail()ed because it's still dirty and (.nr_to_write > 0). redirty_tail() won't update its ->dirtied_when on condition (1). The sync work will then revisit it on the next queue_io() and find it eligible again because its old ->dirtied_when predates the sync work start time. We'll do more aggressive "keep writeback as long as we wrote something" logic in wb_writeback(). The "use LONG_MAX .nr_to_write" trick in commit b9543dac5bbc ("writeback: avoid livelocking WB_SYNC_ALL writeback") will no longer be enough to stop sync livelock. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
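In effect (a sketch only, not the patch itself), the sync path refreshes the timestamp once an inode has been written:

  /* Sketch: after writing the inode out during sync, push dirtied_when
   * forward so this same sync pass won't consider it eligible again. */
  inode->dirtied_when = jiffies;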
2011-10-03  writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage  (Wu Fengguang)
commit 6e6938b6d3130305a5960c86b1a9b21e58cf6144 upstream. sync(2) is performed in two stages: the WB_SYNC_NONE sync and the WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and do livelock prevention for it, too. Jan's commit f446daaea9 ("mm: implement writeback livelock avoidance using page tagging") is a partial fix in that it only fixed the WB_SYNC_ALL phase livelock. Although ext4 has been tested to no longer livelock with commit f446daaea9, that may be due to some "redirty_tail() after pages_skipped" effect, which is by no means a guarantee for _all_ the file systems. Note that writeback_inodes_sb() is called not only by sync(); all of its callers are treated the same because the other callers also need livelock prevention. Impact: It changes the order in which pages/inodes are synced to disk. Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode until finished with the current inode. Acked-by: Jan Kara <jack@suse.cz> CC: Dave Chinner <david@fromorbit.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03  teach /proc/$pid/numa_maps about transparent hugepages  (Dave Hansen)
commit 32ef43848f283e0ef945d3c67e851c143fea3970 upstream. This is modeled after the smaps code. It detects transparent hugepages and then does a single gather_stats() for the page as a whole. This has two benefits:
  1. It is more efficient since it does many pages in a single shot.
  2. It does not have to break down the huge page.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03  break out numa_maps gather_pte_stats() checks  (Dave Hansen)
commit 3200a8aaab0c9ccdc0f59b0dac2d4a47029137fa upstream. gather_pte_stats() does a number of checks on a target page to see whether it should even be considered for statistics. This breaks that code out into a separate function so that we can use it in the transparent hugepage case in the next patch. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Christoph Lameter <cl@gentwo.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03  make /proc/$pid/numa_maps gather_stats() take variable page size  (Dave Hansen)
commit eb4866d0066ffd5446751c102d64feb3318d8bd1 upstream. We need to teach the numa_maps code about transparent huge pages. The first step is to teach gather_stats() that the pte it is dealing with might represent more than one page. Note that we will use this in a moment for transparent huge pages since they use a single pmd_t which _acts_ as a "surrogate" for a bunch of smaller pte_t's. I'm a _bit_ unhappy that this interface counts in hugetlbfs page sizes for hugetlbfs pages and PAGE_SIZE for normal pages. That means that to figure out how many _bytes_ "dirty=1" means, you must first know the hugetlbfs page size. That's easier said than done, especially if you don't have visibility into the mount. But, that's probably a discussion for another day, especially since it would change behavior to fix it. But, just in case anyone wonders why this patch only passes a '1' in the hugetlb case... Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03  Fix the conflict between rwpidforward and rw mount options  (Steve French)
commit c9c7fa0064f4afe1d040e72f24c2256dd8ac402d upstream. Both of these options start with "rw", which is why the first one isn't switched on even if it is specified. Fix this by adding a length check to the "rw" option check. Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
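The underlying pitfall is plain string-prefix matching; a hypothetical, standalone illustration (not the cifs parser itself) of why a bare prefix test picks the wrong option and how a length check distinguishes the two:

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          const char *opt = "rwpidforward";

          /* Without a length check, "rwpidforward" also matches the
           * "rw" prefix, so the wrong option would be selected. */
          if (strncmp(opt, "rw", 2) == 0)
                  printf("prefix match: looks like \"rw\"\n");

          /* Also requiring the option to be exactly two characters
           * long separates "rw" from "rwpidforward". */
          if (strncmp(opt, "rw", 2) == 0 && strlen(opt) == 2)
                  printf("exact match: \"rw\"\n");
          else
                  printf("not plain \"rw\"\n");
          return 0;
  }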
2011-10-03  cifs: fix possible memory corruption in CIFSFindNext  (Jeff Layton)
commit 9438fabb73eb48055b58b89fc51e0bc4db22fabd upstream. The name_len variable in CIFSFindNext is a signed int that gets set to the resume_name_len in the cifs_search_info. The resume_name_len however is unsigned and for some infolevels is populated directly from a 32 bit value sent by the server. If the server sends a very large value for this, then that value could look negative when converted to a signed int. That would make that value pass the PATH_MAX check later in CIFSFindNext. The name_len would then be used as a length value for a memcpy. It would then be treated as unsigned again, and the memcpy scribbles over a ton of memory. Fix this by making the name_len an unsigned value in CIFSFindNext. Reported-by: Darren Lavender <dcl@hppine99.gbr.hp.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
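The signed/unsigned conversion described above is easy to see in a hypothetical standalone example (not the CIFS code; the constant is made up):

  #include <limits.h>
  #include <stdio.h>

  int main(void)
  {
          unsigned int resume_name_len = 0xFFFFFF00u; /* huge value from server */
          int name_len = resume_name_len;             /* becomes negative */

          if (name_len < PATH_MAX)                    /* signed compare passes */
                  printf("PATH_MAX check passed, name_len = %d\n", name_len);

          /* memcpy takes a size_t, so the negative value turns back into
           * an enormous unsigned length and overruns the destination. */
          printf("memcpy length would be %zu bytes\n", (size_t)name_len);
          return 0;
  }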
2011-10-03  restore pinning the victim dentry in vfs_rmdir()/vfs_rename_dir()  (Al Viro)
commit 1d2ef5901483004d74947bbf78d5146c24038fe7 upstream. We used to get the victim pinned by dentry_unhash() prior to commit 64252c75a219 ("vfs: remove dget() from dentry_unhash()") and ->rmdir() and ->rename() instances relied on that; most of them don't care, but ones that used d_delete() themselves do. As the result, we are getting rmdir() oopses on NFS now. Just grab the reference before locking the victim and drop it explicitly after unlocking, same as vfs_rename_other() does. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Simon Kirby <sim@hostway.ca> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03  fs/9p: Use protocol-defined value for lock/getlock 'type' field.  (Jim Garlick)
commit 51b8b4fb32271d39fbdd760397406177b2b0fd36 upstream. Signed-off-by: Jim Garlick <garlick@llnl.gov> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Harsh Prateek Bora <harsh@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03  fs/9p: Always ask new inode in lookup for cache mode disabled  (Aneesh Kumar K.V)
commit 73f507171cfa407b19f254aef95cbb058c8180cf upstream. This makes sure we don't end up reusing the unlinked inode object. The ideal way would be to use the inode i_generation, but i_generation is not always available in userspace. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03  fs/9p: Add OS dependent open flags in 9p protocol  (Aneesh Kumar K.V)
commit f88657ce3f9713a0c62101dffb0e972a979e77b9 upstream. Some of the flags are OS/arch dependent, so we add a 9p protocol value which maps to asm-generic/fcntl.h values in Linux. Based on the original patch from Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> [extra comments from author as to why this needs to go to stable: Earlier, for different operations such as open, we used the open flag values as defined by the OS. But some of these flags, such as O_DIRECT, are arch dependent. So if we have the 9p client and server running on different architectures, we end up with the client sending its architecture's value of these open flags and the server trying to map those values to what its own architecture states. For example, O_DIRECT on an x86 client maps to

  #define O_DIRECT  00040000

whereas on a sparc server it maps to

  #define O_DIRECT  0x100000

Hence we need to map these open flags to OS/arch independent flag values. Getting these changes into an early kernel version ensures that we work with different combinations of client and server. We should ideally backport this patch to all possible kernel versions.] Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Harsh Prateek Bora <harsh@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03  fs/9p: Don't update file type when updating file attributes  (Aneesh Kumar K.V)
commit 45089142b1497dab2327d60f6c71c40766fc3ea4 upstream. We should only update attributes that we can change on stat2inode. Also do file type initialization in v9fs_init_inode. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03  fs/9p: Add fid before dentry instantiation  (Aneesh Kumar K.V)
commit 5441ae5eb3614d3c28f77073370738a2820c88e4 upstream. d_instantiate marks the dentry positive, so a parallel lookup and mkdir of the directory can find a dentry that doesn't have a fid attached. This can result in both code paths doing v9fs_fid_add, which results in a v9fs_dentry leak. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03  9p: close ACL leaks  (Al Viro)
commit 1ec95bf34d976b38897d1977b155a544d77b05e7 upstream. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03  fs/9p: Always ask new inode in create  (Aneesh Kumar K.V)
commit ed80fcfac2565fa866d93ba14f0e75de17a8223e upstream. This makes sure we don't end up reusing the unlinked inode object. The ideal way would be to use the inode i_generation, but i_generation is not always available in userspace. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03  fs/9p: Fix invalid mount options/args  (Prem Karat)
commit a2dd43bb0d7b9ce28f8a39254c25840c0730498e upstream. Without this fix, if any invalid mount options/args are passed while mounting the 9p fs, no error (-EINVAL) is returned and the default arg value is assigned. This fix returns -EINVAL when an invalid argument is found while parsing mount options. Signed-off-by: Prem Karat <prem.karat@linux.vnet.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03  fs/9p: When doing inode lookup compare qid details and inode mode bits.  (Aneesh Kumar K.V)
commit fd2421f54423f307ecd31bdebdca6bc317e0c492 upstream. This makes sure we don't use the wrong inode from the inode hash. The inode number of a deleted file is reused by the next file system object created, and if we only use the inode number for the inode hash lookup we could end up with the wrong struct inode. Also compare the inode generation number. Not all Linux file systems provide st_gen in userspace, so it could be 0. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03  Avoid dereferencing a 'request_queue' after last close.  (NeilBrown)
commit 94007751bb02797ba87bac7aacee2731ac2039a3 upstream. On the last close of an 'md' device which has been stopped, the device is destroyed and in particular the request_queue is freed. The free is done in a separate thread so it might happen a short time later. __blkdev_put calls bdev_inode_switch_bdi *after* ->release has been called. Since commit f758eeabeb96f878c860e8f110f94ec8820822a9, bdev_inode_switch_bdi will dereference the 'old' bdi, which lives inside a request_queue, to get a spin lock. This causes the last close on an md device to sometimes take a spin_lock which lives in freed memory - which results in an oops. So move the call to bdev_inode_switch_bdi before the call to ->release. Cc: Christoph Hellwig <hch@lst.de> Cc: Hugh Dickins <hughd@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29  fuse: check size of FUSE_NOTIFY_INVAL_ENTRY message  (Miklos Szeredi)
commit c2183d1e9b3f313dd8ba2b1b0197c8d9fb86a7ae upstream. FUSE_NOTIFY_INVAL_ENTRY didn't check the length of the write so the message processing could overrun and result in a "kernel BUG at fs/fuse/dev.c:629!" Reported-by: Han-Wen Nienhuys <hanwenn@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29  ext4: fix nomblk_io_submit option so it correctly converts uninit blocks  (Theodore Ts'o)
commit 9dd75f1f1a02d656a11a7b9b9e6c2759b9c1e946 upstream. Bug discovered by Jan Kara: Finally, commit 1449032be17abb69116dbc393f67ceb8bd034f92 returned back the old IO submission code but apparently it forgot to return the old handling of uninitialized buffers, so we unconditionally call block_write_full_page() without specifying an end_io function. So AFAICS we never convert unwritten extents to written in some cases. For example when I mount the fs as:

  mount -t ext4 -o nomblk_io_submit,dioread_nolock /dev/ubdb /mnt

and do

  int fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0600);
  char buf[1024];
  memset(buf, 'a', sizeof(buf));
  fallocate(fd, 0, 0, 16384);
  write(fd, buf, sizeof(buf));

I get a file full of zeros (after remounting the filesystem so that the pagecache is dropped) instead of seeing the first KB contain 'a's. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29  ext4: Resolve the hang of direct i/o read in handling EXT4_IO_END_UNWRITTEN.  (Tao Ma)
commit 32c80b32c053dc52712dedac5e4d0aa7c93fc353 upstream. Setting the EXT4_IO_END_UNWRITTEN flag and increasing i_aiodio_unwritten should be done simultaneously, since ext4_end_io_nolock always clears the flag and decreases the counter at the same time. We don't increase i_aiodio_unwritten when setting EXT4_IO_END_UNWRITTEN, so it will go negative and cause some processes to wait forever. Part of the patch came from Eric in his e-mail, but it doesn't actually fix the problem met by Michael. http://marc.info/?l=linux-ext4&m=131316851417460&w=2 Reported-and-Tested-by: Michael Tokarev <mjt@tls.msk.ru> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
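A sketch of the invariant the message describes (the identifiers are the ones named above; this is not the literal hunk):

  /* Sketch: set the flag and bump the counter together, mirroring
   * ext4_end_io_nolock, which clears the flag and decrements the
   * counter together. */
  io_end->flag |= EXT4_IO_END_UNWRITTEN;
  atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);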
2011-08-29  ext4: call ext4_ioend_wait and ext4_flush_completed_IO in ext4_evict_inode  (Jiaying Zhang)
commit 2581fdc810889fdea97689cb62481201d579c796 upstream. Flush the inode's i_completed_io_list before calling ext4_io_wait to prevent the following deadlock scenario: A page fault happens while some process is writing inode A. During the page fault, shrink_icache_memory is called, which in turn evicts another inode B. Inode B has some pending io_end work so it calls ext4_ioend_wait(), which waits for inode B's i_ioend_count to become zero. However, inode B's ioend work was queued behind some of inode A's ioend work on the same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten thread on that cpu is processing inode A's ioend work, it tries to grab inode A's i_mutex lock. Since the i_mutex lock of inode A was already held before the page fault happened, we enter a deadlock. This patch also moves ext4_flush_completed_IO and ext4_ioend_wait from ext4_destroy_inode() to ext4_evict_inode(). During inode deletion, ext4_evict_inode() is called before ext4_destroy_inode(), and in ext4_evict_inode() we may call ext4_truncate() without holding the i_mutex lock. As a result, there is a race between flush_completed_IO, which is called from ext4_ext_truncate(), and ext4_end_io_work, which may cause corruption of an io_end structure. This change moves ext4_flush_completed_IO and ext4_ioend_wait from ext4_destroy_inode() to ext4_evict_inode() to resolve the race between ext4_truncate() and ext4_end_io_work during inode deletion. Signed-off-by: Jiaying Zhang <jiayingz@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29  ext4: Fix ext4_should_writeback_data() for no-journal mode  (Curt Wohlgemuth)
commit 441c850857148935babe000fc2ba1455fe54a6a9 upstream. ext4_should_writeback_data() had an incorrect sequence of tests to determine if it should return 0 or 1: in particular, even in no-journal mode, 0 was being returned for a non-regular-file inode. This meant that, in non-journal mode, we would use ext4_journalled_aops for directories, symlinks, and other non-regular files. However, calling journalled aop callbacks when there is no valid handle can cause problems. This would cause a kernel crash with Jan Kara's commit 2d859db3e4 ("ext4: fix data corruption in inodes with journalled data"), because we now dereference 'handle' in ext4_journalled_write_end(). I also added BUG_ONs to check for a valid handle in the obviously journal-only aops callbacks. I tested this running xfstests with a scratch device in these modes:
  - no-journal
  - data=ordered
  - data=writeback
  - data=journal
All work fine; the data=journal run has many failures and a crash in xfstests 074, but this is no different from a vanilla kernel. Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29  Btrfs: fix an oops of log replay  (liubo)
commit 34f3e4f23ca3d259fe078f62a128d97ca83508ef upstream. When btrfs recovers from a crash, it may hit the oops below:

------------[ cut here ]------------
kernel BUG at fs/btrfs/inode.c:4580!
[...]
RIP: 0010:[<ffffffffa03df251>]  [<ffffffffa03df251>] btrfs_add_link+0x161/0x1c0 [btrfs]
[...]
Call Trace:
 [<ffffffffa03e7b31>] ? btrfs_inode_ref_index+0x31/0x80 [btrfs]
 [<ffffffffa04054e9>] add_inode_ref+0x319/0x3f0 [btrfs]
 [<ffffffffa0407087>] replay_one_buffer+0x2c7/0x390 [btrfs]
 [<ffffffffa040444a>] walk_down_log_tree+0x32a/0x480 [btrfs]
 [<ffffffffa0404695>] walk_log_tree+0xf5/0x240 [btrfs]
 [<ffffffffa0406cc0>] btrfs_recover_log_trees+0x250/0x350 [btrfs]
 [<ffffffffa0406dc0>] ? btrfs_recover_log_trees+0x350/0x350 [btrfs]
 [<ffffffffa03d18b2>] open_ctree+0x1442/0x17d0 [btrfs]
[...]

This happens because, while replaying an inode ref item, we forget to check for old conflicting DIR_ITEM and DIR_INDEX items in the fs/file tree, so we hit corner cases that lead to the BUG_ON(). Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Tested-by: Andy Lutomirski <luto@mit.edu> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29  Btrfs: detect whether a device supports discard  (Josef Bacik)
commit d5e2003c2bcda93a8f2e668eb4642d70c9c38301 upstream. We have a problem where if a user specifies discard but the device doesn't actually support it, we will return EOPNOTSUPP from btrfs_discard_extent. This is a problem because this gets called (in a fashion) from the tree log recovery code, which has a nice little BUG_ON(ret) after it, which causes us to fail the tree log replay. So instead detect whether our devices support discard when we're adding them, and then don't issue discards if we know that the device doesn't support it. And just for good measure set ret = 0 in btrfs_issue_discard just in case we still get EOPNOTSUPP so we don't screw anybody up like this again. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29  NFSv4.1: Return NFS4ERR_BADSESSION to callbacks during session resets  (Trond Myklebust)
commit 910ac68a2b80c7de95bc8488734067b1bb15d583 upstream. If the client is in the process of resetting the session when it receives a callback, then returning NFS4ERR_DELAY may cause a deadlock with the DESTROY_SESSION call. Basically, if the client returns NFS4ERR_DELAY in response to the CB_SEQUENCE call, then the server is entitled to believe that the client is busy because it is already processing that call. In that case, the server is perfectly entitled to respond with a NFS4ERR_BACK_CHAN_BUSY to any DESTROY_SESSION call. Fix this by having the client reply with a NFS4ERR_BADSESSION in response to the callback if it is resetting the session. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>