summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2011-11-30fs: partitions: Fix warnings in fs/partitions/check.cColin Cross
Change-Id: I4398ace0c55d4833b1fcbb7a4e71ab8f0b1b044a Signed-off-by: Colin Cross <ccross@android.com>
2011-11-30block: genhd: Add disk/partition specific uevent callbacks for partition infoSan Mehat
For disk devices, a new uevent parameter 'NPARTS' specifies the number of partitions detected by the kernel. Partition devices get 'PARTN' which specifies the partitions index in the table. Signed-off-by: San Mehat <san@google.com>
2011-11-30proc: smaps: Allow smaps access for CAP_SYS_RESOURCESan Mehat
Signed-off-by: San Mehat <san@google.com>
2011-11-30fs: yaffs: don't force YAFFS_TRACE_ALWAYS for all trace levelsDima Zavin
Change-Id: I9ddc676382d26aef7f12145d412fe670cb486317 Signed-off-by: Dima Zavin <dima@android.com>
2011-11-30fs: yaffs: Import yaffs from Thu Dec 23 13:31:37 2010 +1300Arve Hjønnevåg
commit ddf33fed15c2376bfb602d62dd018c63fce60df8 Author: Timothy Manning <tfhmanning@gmail.com> Date: Thu Dec 23 13:31:37 2010 +1300 yaffs updated direct/timothy_tests/quick_tests Signed-off-by: Timothy Manning <tfhmanning@gmail.com> Change-Id: I5bbe5a05277bdf8a6fe188bbe4c77725b3fa2aae Signed-off-by: Dima Zavin <dima@android.com>
2011-11-30fs: block_dump: Don't display inode changes if block_dump < 2San Mehat
Signed-off-by: San Mehat <san@android.com>
2011-11-30Grants system server access to /proc/<pid>/oom_adj for Android applications.Mike Chan
Signed-off-by: Brian Swetland <swetland@google.com>
2011-11-30FAT: Add new ioctl VFAT_IOCTL_GET_VOLUME_ID for reading the volume ID.Mike Lockwood
Signed-off-by: Brian Swetland <swetland@google.com>
2011-11-26vmscan: fix shrinker callback bug in fs/super.cMikulas Patocka
commit 09f363c7363eb10cfb4b82094bd7064e5608258b upstream. The callback must not return -1 when nr_to_scan is zero. Fix the bug in fs/super.c and add this requirement to the callback specification. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-26nfs: when attempting to open a directory, fall back on normal lookup (try #5)Jeff Layton
commit 1788ea6e3b2a58cf4fb00206e362d9caff8d86a7 upstream. commit d953126 changed how nfs_atomic_lookup handles an -EISDIR return from an OPEN call. Prior to that patch, that caused the client to fall back to doing a normal lookup. When that patch went in, the code began returning that error to userspace. The d_revalidate codepath however never had the corresponding change, so it was still possible to end up with a NULL ctx->state pointer after that. That patch caused a regression. When we attempt to open a directory that does not have a cached dentry, that open now errors out with EISDIR. If you attempt the same open with a cached dentry, it will succeed. Fix this by reverting the change in nfs_atomic_lookup and allowing attempts to open directories to fall back to a normal lookup Also, add a NFSv4-specific f_ops->open routine that just returns -ENOTDIR. This should never be called if things are working properly, but if it ever is, then the dprintk may help in debugging. To facilitate this, a new file_operations field is also added to the nfs_rpc_ops struct. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-21hfs: add sanity check for file name lengthDan Carpenter
commit bc5b8a9003132ae44559edd63a1623b7b99dfb68 upstream. On a corrupted file system the ->len field could be wrong leading to a buffer overflow. Reported-and-acked-by: Clement LECIGNE <clement.lecigne@netasq.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11VFS: we need to set LOOKUP_JUMPED on mountpoint crossingAl Viro
commit a3fbbde70a0cec017f2431e8f8de208708c76acc upstream. Mountpoint crossing is similar to following procfs symlinks - we do not get ->d_revalidate() called for dentry we have arrived at, with unpleasant consequences for NFS4. Simple way to reproduce the problem in mainline: cat >/tmp/a.c <<'EOF' #include <unistd.h> #include <fcntl.h> #include <stdio.h> main() { struct flock fl = {.l_type = F_RDLCK, .l_whence = SEEK_SET, .l_len = 1}; if (fcntl(0, F_SETLK, &fl)) perror("setlk"); } EOF cc /tmp/a.c -o /tmp/test then on nfs4: mount --bind file1 file2 /tmp/test < file1 # ok /tmp/test < file2 # spews "setlk: No locks available"... What happens is the missing call of ->d_revalidate() after mountpoint crossing and that's where NFS4 would issue OPEN request to server. The fix is simple - treat mountpoint crossing the same way we deal with following procfs-style symlinks. I.e. set LOOKUP_JUMPED... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11VFS: fix statfs() automounter semantics regressionDan McGee
commit 5c8a0fbba543d9428a486f0d1282bbcf3cf1d95a upstream. No one in their right mind would expect statfs() to not work on a automounter managed mount point. Fix it. [ I'm not sure about the "no one in their right mind" part. It's not mounted, and you didn't ask for it to be mounted. But nobody will really care, and this probably makes it match previous semantics, so.. - Linus ] This mirrors the fix made to the quota code in 815d405ceff0d69646. Signed-off-by: Dan McGee <dpmcgee@gmail.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11block: make gendisk hold a reference to its queueTejun Heo
commit f992ae801a7dec34a4ed99a6598bbbbfb82af4fb upstream. The following command sequence triggers an oops. # mount /dev/sdb1 /mnt # echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete # umount /mnt general protection fault: 0000 [#1] PREEMPT SMP CPU 2 Modules linked in: Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs RIP: 0010:[<ffffffff810d0879>] [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60 ... Call Trace: [<ffffffff810d2845>] lock_acquire+0x95/0x140 [<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50 [<ffffffff811573bc>] bdi_lock_two+0x5c/0x70 [<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0 [<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0 [<ffffffff811c4010>] __blkdev_put+0x160/0x1d0 [<ffffffff811c40df>] blkdev_put+0x5f/0x190 [<ffffffff8118f18d>] kill_block_super+0x4d/0x80 [<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70 [<ffffffff8119003a>] deactivate_super+0x4a/0x70 [<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130 [<ffffffff811acf2e>] sys_umount+0x7e/0x3a0 [<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b This is because bdev holds on to disk but disk doesn't pin the associated queue. If a SCSI device is removed while the device is still open, the sdev puts the base reference to the queue on release. When the bdev is finally released, the associated queue is already gone along with the bdi and bdev_inode_switch_bdi() ends up dereferencing already freed bdi. Even if it were not for this bug, disk not holding onto the associated queue is very unusual and error-prone. Fix it by making add_disk() take an extra reference to its queue and put it on disk_release() and ensuring that disk and its fops owner are put in that order after all accesses to the disk and queue are complete. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11ext4: fix race in xattr block allocation pathEric Sandeen
commit 6d6a435190bdf2e04c9465cde5bdc3ac68cf11a4 upstream. Ceph users reported that when using Ceph on ext4, the filesystem would often become corrupted, containing inodes with incorrect i_blocks counters. I managed to reproduce this with a very hacked-up "streamtest" binary from the Ceph tree. Ceph is doing a lot of xattr writes, to out-of-inode blocks. There is also another thread which does sync_file_range and close, of the same files. The problem appears to happen due to this race: sync/flush thread xattr-set thread ----------------- ---------------- do_writepages ext4_xattr_set ext4_da_writepages ext4_xattr_set_handle mpage_da_map_blocks ext4_xattr_block_set set DELALLOC_RESERVE ext4_new_meta_blocks ext4_mb_new_blocks if (!i_delalloc_reserved_flag) vfs_dq_alloc_block ext4_get_blocks down_write(i_data_sem) set i_delalloc_reserved_flag ... up_write(i_data_sem) if (i_delalloc_reserved_flag) vfs_dq_alloc_block_nofail In other words, the sync/flush thread pops in and sets i_delalloc_reserved_flag on the inode, which makes the xattr thread think that it's in a delalloc path in ext4_new_meta_blocks(), and add the block for a second time, after already having added it once in the !i_delalloc_reserved_flag case in ext4_mb_new_blocks The real problem is that we shouldn't be using the DELALLOC_RESERVED state flag, and instead we should be passing EXT4_GET_BLOCKS_DELALLOC_RESERVE down to ext4_map_blocks() instead of using an inode state flag. We'll fix this for now with using i_data_sem to prevent this race, but this is really not the right way to fix things. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11ext4: let ext4_page_mkwrite stop started handle in failureYongqiang Yang
commit fcbb5515825f1bb20b7a6f75ec48bee61416f879 upstream. The started journal handle should be stopped in failure case. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11ext4: call ext4_handle_dirty_metadata with correct inode in ext4_dx_add_entryTheodore Ts'o
commit 5930ea643805feb50a2f8383ae12eb6f10935e49 upstream. ext4_dx_add_entry manipulates bh2 and frames[0].bh, which are two buffer_heads that point to directory blocks assigned to the directory inode. However, the function calls ext4_handle_dirty_metadata with the inode of the file that's being added to the directory, not the directory inode itself. Therefore, correct the code to dirty the directory buffers with the directory inode, not the file inode. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11ext4: ext4_mkdir should dirty dir_block with newly created directory inodeDarrick J. Wong
commit f9287c1f2d329f4d78a3bbc9cf0db0ebae6f146a upstream. ext4_mkdir calls ext4_handle_dirty_metadata with dir_block and the inode "dir". Unfortunately, dir_block belongs to the newly created directory (which is "inode"), not the parent directory (which is "dir"). Fix the incorrect association. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11ext4: ext4_rename should dirty dir_bh with the correct directoryDarrick J. Wong
commit bcaa992975041e40449be8c010c26192b8c8b409 upstream. When ext4_rename performs a directory rename (move), dir_bh is a buffer that is modified to update the '..' link in the directory being moved (old_inode). However, ext4_handle_dirty_metadata is called with the old parent directory inode (old_dir) and dir_bh, which is incorrect because dir_bh does not belong to the parent inode. Fix this error. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11ext2,ext3,ext4: don't inherit APPEND_FL or IMMUTABLE_FL for new inodesTheodore Ts'o
commit 1cd9f0976aa4606db8d6e3dc3edd0aca8019372a upstream. This doesn't make much sense, and it exposes a bug in the kernel where attempts to create a new file in an append-only directory using O_CREAT will fail (but still leave a zero-length file). This was discovered when xfstests #79 was generalized so it could run on all file systems. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11binfmt_elf: fix PIE execution with randomization disabledJiri Kosina
commit a3defbe5c337dbc6da911f8cc49ae3cc3b49b453 upstream. The case of address space randomization being disabled in runtime through randomize_va_space sysctl is not treated properly in load_elf_binary(), resulting in SIGKILL coming at exec() time for certain PIE-linked binaries in case the randomization has been disabled at runtime prior to calling exec(). Handle the randomize_va_space == 0 case the same way as if we were not supporting .text randomization at all. Based on original patch by H.J. Lu and Josh Boyer. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Cc: Ingo Molnar <mingo@elte.hu> Cc: Russell King <rmk@arm.linux.org.uk> Cc: H.J. Lu <hongjiu.lu@intel.com> Cc: <stable@kernel.org> Tested-by: Josh Boyer <jwboyer@redhat.com> Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11readlinkat: ensure we return ENOENT for the empty pathname for normal lookupsAndy Whitcroft
commit 1fa1e7f615f4d3ae436fa319af6e4eebdd4026a8 upstream. Since the commit below which added O_PATH support to the *at() calls, the error return for readlink/readlinkat for the empty pathname has switched from ENOENT to EINVAL: commit 65cfc6722361570bfe255698d9cd4dccaf47570d Author: Al Viro <viro@zeniv.linux.org.uk> Date: Sun Mar 13 15:56:26 2011 -0400 readlinkat(), fchownat() and fstatat() with empty relative pathnames This is both unexpected for userspace and makes readlink/readlinkat inconsistant with all other interfaces; and inconsistant with our stated return for these pathnames. As the readlinkat call does not have a flags parameter we cannot use the AT_EMPTY_PATH approach used in the other calls. Therefore expose whether the original path is infact entry via a new user_path_at_empty() path lookup function. Use this to determine whether to default to EINVAL or ENOENT for failures. Addresses http://bugs.launchpad.net/bugs/817187 [akpm@linux-foundation.org: remove unused getname_flags()] Signed-off-by: Andy Whitcroft <apw@canonical.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11/proc/self/numa_maps: restore "huge" tag for hugetlb vmasAndrew Morton
commit fc360bd9cdcf875639a77f07fafec26699c546f3 upstream. The display of the "huge" tag was accidentally removed in 29ea2f698 ("mm: use walk_page_range() instead of custom page table walking code"). Reported-by: Stephen Hemminger <shemminger@vyatta.com> Tested-by: Stephen Hemminger <shemminger@vyatta.com> Reviewed-by: Stephen Wilson <wilsons@start.ca> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11vfs: add "device" tag to /proc/self/mountstatsBryan Schumaker
commit a877ee03ac010ded434b77f7831f43cbb1fcc60f upstream. nfsiostat was failing to find mounted filesystems on kernels after 2.6.38 because of changes to show_vfsstat() by commit c7f404b40a3665d9f4e9a927cc5c1ee0479ed8f9. This patch adds back the "device" tag before the nfs server entry so scripts can parse the mountstats file correctly. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11nfsd4: ignore WANT bits in open downgradeJ. Bruce Fields
commit c30e92df30d7d5fe65262fbce5d1b7de675fe34e upstream. We don't use WANT bits yet--and sending them can probably trigger a BUG() further down. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11nfsd4: fix open downgrade, againJ. Bruce Fields
commit 3d02fa29dec920c597dd7b7db608a4bc71f088ce upstream. Yet another open-management regression: - nfs4_file_downgrade() doesn't remove the BOTH access bit on downgrade, so the server's idea of the stateid's access gets out of sync with the client's. If we want to keep an O_RDWR open in this case, we should do that in the file_put_access logic rather than here. - We forgot to convert v4 access to an open mode here. This logic has proven too hard to get right. In the future we may consider: - reexamining the lock/openowner relationship (locks probably don't really need to take their own references here). - adding open upgrade/downgrade support to the vfs. - removing the atomic operations. They're redundant as long as this is all under some other lock. Also, maybe some kind of additional static checking would help catch O_/NFS4_SHARE_ACCESS confusion. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11nfsd4: permit read opens of executable-only filesJ. Bruce Fields
commit a043226bc140a2c1dde162246d68a67e5043e6b2 upstream. A client that wants to execute a file must be able to read it. Read opens over nfs are therefore implicitly allowed for executable files even when those files are not readable. NFSv2/v3 get this right by using a passed-in NFSD_MAY_OWNER_OVERRIDE on read requests, but NFSv4 has gotten this wrong ever since dc730e173785e29b297aa605786c94adaffe2544 "nfsd4: fix owner-override on open", when we realized that the file owner shouldn't override permissions on non-reclaim NFSv4 opens. So we can't use NFSD_MAY_OWNER_OVERRIDE to tell nfsd_permission to allow reads of executable files. So, do the same thing we do whenever we encounter another weird NFS permission nit: define yet another NFSD_MAY_* flag. The industry's future standardization on 128-bit processors will be motivated primarily by the need for integers with enough bits for all the NFSD_MAY_* flags. Reported-by: Leonardo Borda <leonardoborda@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11nfsd4: fix seqid_mutating_errorJ. Bruce Fields
commit 576163005de286bbd418fcb99cfd0971523a0c6d upstream. The set of errors here does *not* agree with the set of errors specified in the rfc! While we're there, turn this macros into a function, for the usual reasons, and move it to the one place where it's actually used. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11nfsd4: stop using nfserr_resource for transitory errorsJ. Bruce Fields
commit 3e77246393c0a433247631a1f0e9ec98d3d78a1c upstream. The server is returning nfserr_resource for both permanent errors and for errors (like allocation failures) that might be resolved by retrying later. Save nfserr_resource for the former and use delay/jukebox for the latter. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11nfsd4: Remove check for a 32-bit cookie in nfsd4_readdir()Bernd Schubert
commit 832023bffb4b493f230be901f681020caf3ed1f8 upstream. Fan Yong <yong.fan@whamcloud.com> noticed setting FMODE_32bithash wouldn't work with nfsd v4, as nfsd4_readdir() checks for 32 bit cookies. However, according to RFC 3530 cookies have a 64 bit type and cookies are also defined as u64 in 'struct nfsd4_readdir'. So remove the test for >32-bit values. Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11nfs: don't try to migrate pages with active requestsJeff Layton
commit 2da956523526e440ef4f4dd174e26f5ac06fe011 upstream. nfs_find_and_lock_request will take a reference to the nfs_page and will then put it if the req is already locked. It's possible though that the reference will be the last one. That put then can kick off a whole series of reference puts: nfs_page nfs_open_context dentry inode If the inode ends up being deleted, then the VFS will call truncate_inode_pages. That function will try to take the page lock, but it was already locked when migrate_page was called. The code deadlocks. Fix this by simply refusing the migration request if PagePrivate is already set, indicating that the page is already associated with an active read or write request. We've had a customer test a backported version of this patch and the preliminary results seem good. Cc: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Harshula Jayasuriya <harshula@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11nfs: don't redirty inode when ncommit == 0 in nfs_commit_unstable_pagesJeff Layton
commit 3236c3e1adc0c7ec83eaff1de2d06746b7c5bb28 upstream. commit 420e3646 allowed the kernel to reduce the number of unnecessary commit calls by skipping the commit when there are a large number of outstanding pages. However, the current test in nfs_commit_unstable_pages does not handle the edge condition properly. When ncommit == 0, then that means that the kernel doesn't need to do anything more for the inode. The current test though in the WB_SYNC_NONE case will return true, and the inode will end up being marked dirty. Once that happens the inode will never be clean until there's a WB_SYNC_ALL flush. Fix this by immediately returning from nfs_commit_unstable_pages when ncommit == 0. Mike noticed this problem initially in RHEL5 (2.6.18-based kernel) which has a backported version of 420e3646. The inode cache there was growing very large. The inode cache was unable to be shrunk since the inodes were all marked dirty. Calling sync() would essentially "fix" the problem -- the WB_SYNC_ALL flush would result in the inodes all being marked clean. What I'm not clear on is how big a problem this is in mainline kernels as the writeback code there is very different. Either way, it seems incorrect to re-mark the inode dirty in this case. Reported-by: Mike McLean <mikem@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11SUNRPC/NFS: make rpc pipe upcall genericPeng Tao
commit c1225158a8dad9e9d5eee8a17dbbd9c7cda05ab9 upstream. The same function is used by idmap, gss and blocklayout code. Make it generic. Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11Revert "NFS: Ensure that writeback_single_inode() calls write_inode() when ↵Trond Myklebust
syncing" commit 59b7c05fffba030e5d9e72324691e2f99aa69b79 upstream. This reverts commit b80c3cb628f0ebc241b02e38dd028969fb8026a2. The reverted commit was rendered obsolete by a VFS fix: commit 5547e8aac6f71505d621a612de2fca0dd988b439 (writeback: Update dirty flags in two steps). We now no longer need to worry about writeback_single_inode() missing our marking the inode for COMMIT in 'do_writepages()' call. Reverting this patch, fixes a performance regression in which the inode would continuously get queued to the dirty list, causing the writeback code to unnecessarily try to send a COMMIT. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Tested-by: Simon Kirby <sim@hostway.ca> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11pnfsblock: fix writeback deadlockPeng Tao
commit 7542274519b3ba87555410c66e8356ac1e3bc9b3 upstream. We should check if the sector is already initialized before trying to grab the page from page cache. Otherwise when two pages of the same block are written back by two threads each calling from writepage_locked, it can cause deadlock like bellow. [ 1080.972099] INFO: task kswapd0:25 blocked for more than 120 seconds. [ 1080.972377] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1080.972812] kswapd0 D ffff88000c4926c0 0 25 2 0x00000000 [ 1080.972816] ffff88000df276b0 0000000000000046 ffff88000df27640 ffffffff81013ba7 [ 1080.972821] ffff88000c492310 ffff88000df27fd8 ffff88000df27fd8 00000000001d3440 [ 1080.972824] ffff88000c378000 ffff88000c492310 ffff8800175d3d40 ffff880017fc75a8 [ 1080.972828] Call Trace: [ 1080.972860] [<ffffffff81013ba7>] ? read_tsc+0x9/0x19 [ 1080.972877] [<ffffffff810e0b23>] ? lock_page+0x2b/0x2b [ 1080.972899] [<ffffffff81475a1d>] io_schedule+0x63/0x7e [ 1080.972902] [<ffffffff810e0b31>] sleep_on_page+0xe/0x12 [ 1080.972905] [<ffffffff81475fe8>] __wait_on_bit_lock+0x46/0x8f [ 1080.972916] [<ffffffff810822d7>] ? lock_release_holdtime.part.7+0x6b/0x72 [ 1080.972919] [<ffffffff810e0af6>] __lock_page+0x66/0x68 [ 1080.972928] [<ffffffff81072705>] ? autoremove_wake_function+0x3d/0x3d [ 1080.972932] [<ffffffff810e0b1f>] lock_page+0x27/0x2b [ 1080.972934] [<ffffffff810e0bcf>] find_lock_page+0x34/0x57 [ 1080.972937] [<ffffffff810e1738>] find_or_create_page+0x34/0x8a [ 1080.972947] [<ffffffffa034245b>] bl_write_pagelist+0x205/0x6da [blocklayoutdriver] [ 1080.972951] [<ffffffffa034145d>] ? bl_free_lseg+0x38/0x38 [blocklayoutdriver] [ 1080.972995] [<ffffffffa02e27b9>] ? nfs_write_rpcsetup+0x118/0x123 [nfs] [ 1080.973033] [<ffffffffa030246b>] pnfs_generic_pg_writepages+0x10b/0x1f4 [nfs] [ 1080.973089] [<ffffffffa02deaae>] nfs_pageio_doio+0x1a/0x43 [nfs] [ 1080.973098] [<ffffffffa02df035>] nfs_pageio_complete+0x16/0x2d [nfs] [ 1080.973108] [<ffffffffa02e2d8f>] nfs_writepage_locked+0xa0/0xbf [nfs] [ 1080.973119] [<ffffffffa02e36a1>] nfs_writepage+0x16/0x2b [nfs] [ 1080.973122] [<ffffffff810e8762>] ? clear_page_dirty_for_io+0x87/0x9a [ 1080.973133] [<ffffffff810efc5b>] shrink_page_list+0x39b/0x6c8 [ 1080.973139] [<ffffffff810f03bb>] shrink_inactive_list+0x22c/0x39e [ 1080.973144] [<ffffffff810822d7>] ? lock_release_holdtime.part.7+0x6b/0x72 [ 1080.973148] [<ffffffff810f0c33>] shrink_zone+0x445/0x588 [ 1080.973152] [<ffffffff810f1a11>] balance_pgdat+0x2c2/0x56b [ 1080.973170] [<ffffffff81254208>] ? __bitmap_weight+0x34/0x80 [ 1080.973175] [<ffffffff810f1f78>] kswapd+0x2be/0x2fa [ 1080.973179] [<ffffffff810726c8>] ? __init_waitqueue_head+0x4b/0x4b [ 1080.973183] [<ffffffff810f1cba>] ? balance_pgdat+0x56b/0x56b [ 1080.973187] [<ffffffff81071f69>] kthread+0xa8/0xb0 [ 1080.973200] [<ffffffff814806b4>] kernel_thread_helper+0x4/0x10 [ 1080.973205] [<ffffffff81071ec1>] ? __init_kthread_worker+0x5a/0x5a [ 1080.973210] [<ffffffff814806b0>] ? gs_change+0x13/0x13 [ 1080.973213] no locks held by kswapd0/25. Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11pnfsblock: fix NULL pointer dereferencePeng Tao
commit e6d05a757c314ad88d0649d3835a8a1daa964236 upstream. bl_add_page_to_bio returns error pointer. bio should be reset to NULL in failure cases as the out path always calls bl_submit_bio. Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11pnfs: recoalesce when ld read pagelist failsPeng Tao
commit 9b7eecdcfeb943f130d86bbc249fde4994b6fe30 upstream. For pnfs pagelist read failure, we need to pg_recoalesce and resend IO to mds. Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11pnfs: recoalesce when ld write pagelist failsPeng Tao
commit 8ce160c5ef06cc89c2b6b26bfa5ef7a5ce2c93e0 upstream. For pnfs pagelist write failure, we need to pg_recoalesce and resend IO to mds. Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11pnfs: make _set_lo_fail genericPeng Tao
commit 1b0ae068779874f54b55aac3a2a992bcf3f2c3c4 upstream. file layout and block layout both use it to set mark layout io failure bit. So make it generic. Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11pnfsblock: add missing rpc_put_mount and path_putPeng Tao
commit 760383f1ee4d14b0e0bdf0cddee648d9b8633429 upstream. Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11pnfsblock: fix size of upcall messageJim Rees
commit fdc17abbc4b6094b34ee8ff5d91eaba8637594a2 upstream. Make the status field explicitly 32 bits. "...it's unlikely that the kernel and userspace would differ on the size of an int here, but it might be a good idea to go ahead and make that explicitly 32 bits in case we end up dealing with more exotic arches at some point in the future." Suggested-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11pnfsblock: fix return code confusionJim Rees
commit 516f2e24faa7548a61d9ba790958528469c2e284 upstream. Always return PTR_ERR, not NULL, from nfs4_blk_get_deviceinfo and nfs4_blk_decode_device. Check for IS_ERR, not NULL, in bl_set_layoutdriver when calling nfs4_blk_get_deviceinfo. Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11epoll: fix spurious lockdep warningsNelson Elhage
commit d8805e633e054c816c47cb6e727c81f156d9253d upstream. epoll can acquire recursively acquire ep->mtx on multiple "struct eventpoll"s at once in the case where one epoll fd is monitoring another epoll fd. This is perfectly OK, since we're careful about the lock ordering, but it causes spurious lockdep warnings. Annotate the recursion using mutex_lock_nested, and add a comment explaining the nesting rules for good measure. Recent versions of systemd are triggering this, and it can also be demonstrated with the following trivial test program: --------------------8<-------------------- int main(void) { int e1, e2; struct epoll_event evt = { .events = EPOLLIN }; e1 = epoll_create1(0); e2 = epoll_create1(0); epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt); return 0; } --------------------8<-------------------- Reported-by: Paul Bolle <pebolle@tiscali.nl> Tested-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Nelson Elhage <nelhage@nelhage.com> Acked-by: Jason Baron <jbaron@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11CIFS: Fix DFS handling in cifs_get_file_infoPavel Shilovsky
commit 42274bb22afc3e877ae5abed787b0b09d7dede52 upstream. We should call cifs_all_info_to_fattr in rc == 0 case only. Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11CIFS: Fix incorrect max RFC1002 write size valuePavel Shilovsky
commit 94443f43404239c2a6dc4252a7cb9e77f5b1eb6e upstream. ..the length field has only 17 bits. Acked-by: Jeff Layton <jlayton@samba.org> Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-14Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfsLinus Torvalds
* 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: revert to using a kthread for AIL pushing xfs: force the log if we encounter pinned buffers in .iop_pushbuf xfs: do not update xa_last_pushed_lsn for locked items
2011-10-13Merge branch 'btrfs-3.0' of git://github.com/chrismason/linuxLinus Torvalds
* 'btrfs-3.0' of git://github.com/chrismason/linux: Btrfs: make sure not to defrag extents past i_size Btrfs: fix recursive auto-defrag
2011-10-11xfs: revert to using a kthread for AIL pushingChristoph Hellwig
Currently we have a few issues with the way the workqueue code is used to implement AIL pushing: - it accidentally uses the same workqueue as the syncer action, and thus can be prevented from running if there are enough sync actions active in the system. - it doesn't use the HIGHPRI flag to queue at the head of the queue of work items At this point I'm not confident enough in getting all the workqueue flags and tweaks right to provide a perfectly reliable execution context for AIL pushing, which is the most important piece in XFS to make forward progress when the log fills. Revert back to use a kthread per filesystem which fixes all the above issues at the cost of having a task struct and stack around for each mounted filesystem. In addition this also gives us much better ways to diagnose any issues involving hung AIL pushing and removes a small amount of code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Stefan Priebe <s.priebe@profihost.ag> Tested-by: Stefan Priebe <s.priebe@profihost.ag> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2011-10-11xfs: force the log if we encounter pinned buffers in .iop_pushbufChristoph Hellwig
We need to check for pinned buffers even in .iop_pushbuf given that inode items flush into the same buffers that may be pinned directly due operations on the unlinked inode list operating directly on buffers. To do this add a return value to .iop_pushbuf that tells the AIL push about this and use the existing log force mechanisms to unpin it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Stefan Priebe <s.priebe@profihost.ag> Tested-by: Stefan Priebe <s.priebe@profihost.ag> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2011-10-11xfs: do not update xa_last_pushed_lsn for locked itemsChristoph Hellwig
If an item was locked we should not update xa_last_pushed_lsn and thus skip it when restarting the AIL scan as we need to be able to lock and write it out as soon as possible. Otherwise heavy lock contention might starve AIL pushing too easily, especially given the larger backoff once we moved xa_last_pushed_lsn all the way to the target lsn. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Stefan Priebe <s.priebe@profihost.ag> Tested-by: Stefan Priebe <s.priebe@profihost.ag> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>