summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2015-09-29hfs,hfsplus: cache pages correctly between bnode_create and bnode_freeHin-Tak Leung
commit 7cb74be6fd827e314f81df3c5889b87e4c87c569 upstream. Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and hfs_bnode_find() for finding or creating pages corresponding to an inode) are immediately kmap()'ed and used (both read and write) and kunmap()'ed, and should not be page_cache_release()'ed until hfs_bnode_free(). This patch fixes a problem I first saw in July 2012: merely running "du" on a large hfsplus-mounted directory a few times on a reasonably loaded system would get the hfsplus driver all confused and complaining about B-tree inconsistencies, and generates a "BUG: Bad page state". Most recently, I can generate this problem on up-to-date Fedora 22 with shipped kernel 4.0.5, by running "du /" (="/" + "/home" + "/mnt" + other smaller mounts) and "du /mnt" simultaneously on two windows, where /mnt is a lightly-used QEMU VM image of the full Mac OS X 10.9: $ df -i / /home /mnt Filesystem Inodes IUsed IFree IUse% Mounted on /dev/mapper/fedora-root 3276800 551665 2725135 17% / /dev/mapper/fedora-home 52879360 716221 52163139 2% /home /dev/nbd0p2 4294967295 1387818 4293579477 1% /mnt After applying the patch, I was able to run "du /" (60+ times) and "du /mnt" (150+ times) continuously and simultaneously for 6+ hours. There are many reports of the hfsplus driver getting confused under load and generating "BUG: Bad page state" or other similar issues over the years. [1] The unpatched code [2] has always been wrong since it entered the kernel tree. The only reason why it gets away with it is that the kmap/memcpy/kunmap follow very quickly after the page_cache_release() so the kernel has not had a chance to reuse the memory for something else, most of the time. The current RW driver appears to have followed the design and development of the earlier read-only hfsplus driver [3], where-by version 0.1 (Dec 2001) had a B-tree node-centric approach to read_cache_page()/page_cache_release() per bnode_get()/bnode_put(), migrating towards version 0.2 (June 2002) of caching and releasing pages per inode extents. When the current RW code first entered the kernel [2] in 2005, there was an REF_PAGES conditional (and "//" commented out code) to switch between B-node centric paging to inode-centric paging. There was a mistake with the direction of one of the REF_PAGES conditionals in __hfs_bnode_create(). In a subsequent "remove debug code" commit [4], the read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were removed, but a page_cache_release() was mistakenly left in (propagating the "REF_PAGES <-> !REF_PAGE" mistake), and the commented-out page_cache_release() in bnode_release() (which should be spanned by !REF_PAGES) was never enabled. References: [1]: Michael Fox, Apr 2013 http://www.spinics.net/lists/linux-fsdevel/msg63807.html ("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'") Sasha Levin, Feb 2015 http://lkml.org/lkml/2015/2/20/85 ("use after free") https://bugs.launchpad.net/ubuntu/+source/linux/+bug/740814 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1027887 https://bugzilla.kernel.org/show_bug.cgi?id=42342 https://bugzilla.kernel.org/show_bug.cgi?id=63841 https://bugzilla.kernel.org/show_bug.cgi?id=78761 [2]: http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\ fs/hfs/bnode.c?id=d1081202f1d0ee35ab0beb490da4b65d4bc763db commit d1081202f1d0ee35ab0beb490da4b65d4bc763db Author: Andrew Morton <akpm@osdl.org> Date: Wed Feb 25 16:17:36 2004 -0800 [PATCH] HFS rewrite http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\ fs/hfsplus/bnode.c?id=91556682e0bf004d98a529bf829d339abb98bbbd commit 91556682e0bf004d98a529bf829d339abb98bbbd Author: Andrew Morton <akpm@osdl.org> Date: Wed Feb 25 16:17:48 2004 -0800 [PATCH] HFS+ support [3]: http://sourceforge.net/projects/linux-hfsplus/ http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.1/ http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.2/ http://linux-hfsplus.cvs.sourceforge.net/viewvc/linux-hfsplus/linux/\ fs/hfsplus/bnode.c?r1=1.4&r2=1.5 Date: Thu Jun 6 09:45:14 2002 +0000 Use buffer cache instead of page cache in bnode.c. Cache inode extents. [4]: http://git.kernel.org/cgit/linux/kernel/git/\ stable/linux-stable.git/commit/?id=a5e3985fa014029eb6795664c704953720cc7f7d commit a5e3985fa014029eb6795664c704953720cc7f7d Author: Roman Zippel <zippel@linux-m68k.org> Date: Tue Sep 6 15:18:47 2005 -0700 [PATCH] hfs: remove debug code Signed-off-by: Hin-Tak Leung <htl10@users.sourceforge.net> Signed-off-by: Sergei Antonov <saproj@gmail.com> Reviewed-by: Anton Altaparmakov <anton@tuxera.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Sougata Santra <sougata@tuxera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29hfs: fix B-tree corruption after insertion at position 0Hin-Tak Leung
commit b4cc0efea4f0bfa2477c56af406cfcf3d3e58680 upstream. Fix B-tree corruption when a new record is inserted at position 0 in the node in hfs_brec_insert(). This is an identical change to the corresponding hfs b-tree code to Sergei Antonov's "hfsplus: fix B-tree corruption after insertion at position 0", to keep similar code paths in the hfs and hfsplus drivers in sync, where appropriate. Signed-off-by: Hin-Tak Leung <htl10@users.sourceforge.net> Cc: Sergei Antonov <saproj@gmail.com> Cc: Joe Perches <joe@perches.com> Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Anton Altaparmakov <anton@tuxera.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29eCryptfs: Invalidate dcache entries when lower i_nlink is zeroTyler Hicks
commit 5556e7e6d30e8e9b5ee51b0e5edd526ee80e5e36 upstream. Consider eCryptfs dcache entries to be stale when the corresponding lower inode's i_nlink count is zero. This solves a problem caused by the lower inode being directly modified, without going through the eCryptfs mount, leaving stale eCryptfs dentries cached and the eCryptfs inode's i_nlink count not being cleared. Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Reported-by: Richard Weinberger <richard@nod.at> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29fs: Don't dump core if the corefile would become world-readable.Jann Horn
commit 40f705a736eac10e7dca7ab5dd5ed675a6df031d upstream. On a filesystem like vfat, all files are created with the same owner and mode independent of who created the file. When a vfat filesystem is mounted with root as owner of all files and read access for everyone, root's processes left world-readable coredumps on it (but other users' processes only left empty corefiles when given write access because of the uid mismatch). Given that the old behavior was inconsistent and insecure, I don't see a problem with changing it. Now, all processes refuse to dump core unless the resulting corefile will only be readable by their owner. Signed-off-by: Jann Horn <jann@thejh.net> Acked-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29fs: if a coredump already exists, unlink and recreate with O_EXCLJann Horn
commit fbb1816942c04429e85dbf4c1a080accc534299e upstream. It was possible for an attacking user to trick root (or another user) into writing his coredumps into an attacker-readable, pre-existing file using rename() or link(), causing the disclosure of secret data from the victim process' virtual memory. Depending on the configuration, it was also possible to trick root into overwriting system files with coredumps. Fix that issue by never writing coredumps into existing files. Requirements for the attack: - The attack only applies if the victim's process has a nonzero RLIMIT_CORE and is dumpable. - The attacker can trick the victim into coredumping into an attacker-writable directory D, either because the core_pattern is relative and the victim's cwd is attacker-writable or because an absolute core_pattern pointing to a world-writable directory is used. - The attacker has one of these: A: on a system with protected_hardlinks=0: execute access to a folder containing a victim-owned, attacker-readable file on the same partition as D, and the victim-owned file will be deleted before the main part of the attack takes place. (In practice, there are lots of files that fulfill this condition, e.g. entries in Debian's /var/lib/dpkg/info/.) This does not apply to most Linux systems because most distros set protected_hardlinks=1. B: on a system with protected_hardlinks=1: execute access to a folder containing a victim-owned, attacker-readable and attacker-writable file on the same partition as D, and the victim-owned file will be deleted before the main part of the attack takes place. (This seems to be uncommon.) C: on any system, independent of protected_hardlinks: write access to a non-sticky folder containing a victim-owned, attacker-readable file on the same partition as D (This seems to be uncommon.) The basic idea is that the attacker moves the victim-owned file to where he expects the victim process to dump its core. The victim process dumps its core into the existing file, and the attacker reads the coredump from it. If the attacker can't move the file because he does not have write access to the containing directory, he can instead link the file to a directory he controls, then wait for the original link to the file to be deleted (because the kernel checks that the link count of the corefile is 1). A less reliable variant that requires D to be non-sticky works with link() and does not require deletion of the original link: link() the file into D, but then unlink() it directly before the kernel performs the link count check. On systems with protected_hardlinks=0, this variant allows an attacker to not only gain information from coredumps, but also clobber existing, victim-writable files with coredumps. (This could theoretically lead to a privilege escalation.) Signed-off-by: Jann Horn <jann@thejh.net> Cc: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29Revert "NFSv4: Remove incorrect check in can_open_delegated()"Trond Myklebust
commit 36319608e28701c07cad80ae3be8b0fdfb1ab40f upstream. This reverts commit 4e379d36c050b0117b5d10048be63a44f5036115. This commit opens up a race between the recovery code and the open code. Reported-by: Olga Kornievskaia <aglo@umich.edu> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29NFSv4.1: Fix a protocol issue with CLOSE stateidsTrond Myklebust
commit 4a1e2feb9d246775dee0f78ed5b18826bae2b1c5 upstream. According to RFC5661 Section 18.2.4, CLOSE is supposed to return the zero stateid. This means that nfs_clear_open_stateid_locked() cannot assume that the result stateid will always match the 'other' field of the existing open stateid when trying to determine a race with a parallel OPEN. Instead, we look at the argument, and check for matches. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29NFSv4.1/flexfiles: Fix a protocol error in layoutreturnTrond Myklebust
commit d13549074cf066d6d5bb29903d044beffea342d3 upstream. According to the flexfiles protocol, the layoutreturn should specify an array of errors in the following format: struct ff_ioerr4 { offset4 ffie_offset; length4 ffie_length; stateid4 ffie_stateid; device_error4 ffie_errors<>; }; This patch fixes up the code to ensure that our ffie_errors is indeed encoded as an array (albeit with only a single entry). Reported-by: Tom Haynes <thomas.haynes@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29NFS41/flexfiles: zero out DS write wccPeng Tao
commit 5420401079e152ff68a8024f6a375804b1c21505 upstream. We do not want to update inode attributes with DS values. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29NFSv4: Force a post-op attribute update when holding a delegationTrond Myklebust
commit aaae3f00d3f67f681a1f3cb7af999e976e8a24ce upstream. If the ctime or mtime or change attribute have changed because of an operation we initiated, we should make sure that we force an attribute update. However we do not want to mark the page cache for revalidation. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29NFS41/flexfiles: update inode after write finishesPeng Tao
commit 69f230d907e8c1ca3f9bd528993eeb98f712b0dd upstream. Otherwise we break fstest case tests/read_write/mctime.t Does files layout need the same fix as well? Cc: Anna Schumaker <anna.schumaker@netapp.com> Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29NFS: nfs_set_pgio_error sometimes misses errorsTrond Myklebust
commit e9ae58aeee8842a50f7e199d602a5ccb2e41a95f upstream. We should ensure that we always set the pgio_header's error field if a READ or WRITE RPC call returns an error. The current code depends on 'hdr->good_bytes' always being initialised to a large value, which is not always done correctly by callers. When this happens, applications may end up missing important errors. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29NFS: Fix a NULL pointer dereference of migration recovery ops for v4.2 clientKinglong Mee
commit 18e3b739fdc826481c6a1335ce0c5b19b3d415da upstream. ---Steps to Reproduce-- <nfs-server> # cat /etc/exports /nfs/referal *(rw,insecure,no_subtree_check,no_root_squash,crossmnt) /nfs/old *(ro,insecure,subtree_check,root_squash,crossmnt) <nfs-client> # mount -t nfs nfs-server:/nfs/ /mnt/ # ll /mnt/*/ <nfs-server> # cat /etc/exports /nfs/referal *(rw,insecure,no_subtree_check,no_root_squash,crossmnt,refer=/nfs/old/@nfs-server) /nfs/old *(ro,insecure,subtree_check,root_squash,crossmnt) # service nfs restart <nfs-client> # ll /mnt/*/ --->>>>> oops here [ 5123.102925] BUG: unable to handle kernel NULL pointer dereference at (null) [ 5123.103363] IP: [<ffffffffa03ed38b>] nfs4_proc_get_locations+0x9b/0x120 [nfsv4] [ 5123.103752] PGD 587b9067 PUD 3cbf5067 PMD 0 [ 5123.104131] Oops: 0000 [#1] [ 5123.104529] Modules linked in: nfsv4(OE) nfs(OE) fscache(E) nfsd(OE) xfs libcrc32c iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi coretemp crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ppdev vmw_balloon parport_pc parport i2c_piix4 shpchp auth_rpcgss nfs_acl vmw_vmci lockd grace sunrpc vmwgfx drm_kms_helper ttm drm mptspi serio_raw scsi_transport_spi e1000 mptscsih mptbase ata_generic pata_acpi [last unloaded: nfsd] [ 5123.105887] CPU: 0 PID: 15853 Comm: ::1-manager Tainted: G OE 4.2.0-rc6+ #214 [ 5123.106358] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014 [ 5123.106860] task: ffff88007620f300 ti: ffff88005877c000 task.ti: ffff88005877c000 [ 5123.107363] RIP: 0010:[<ffffffffa03ed38b>] [<ffffffffa03ed38b>] nfs4_proc_get_locations+0x9b/0x120 [nfsv4] [ 5123.107909] RSP: 0018:ffff88005877fdb8 EFLAGS: 00010246 [ 5123.108435] RAX: ffff880053f3bc00 RBX: ffff88006ce6c908 RCX: ffff880053a0d240 [ 5123.108968] RDX: ffffea0000e6d940 RSI: ffff8800399a0000 RDI: ffff88006ce6c908 [ 5123.109503] RBP: ffff88005877fe28 R08: ffffffff81c708a0 R09: 0000000000000000 [ 5123.110045] R10: 00000000000001a2 R11: ffff88003ba7f5c8 R12: ffff880054c55800 [ 5123.110618] R13: 0000000000000000 R14: ffff880053a0d240 R15: ffff880053a0d240 [ 5123.111169] FS: 0000000000000000(0000) GS:ffffffff81c27000(0000) knlGS:0000000000000000 [ 5123.111726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5123.112286] CR2: 0000000000000000 CR3: 0000000054cac000 CR4: 00000000001406f0 [ 5123.112888] Stack: [ 5123.113458] ffffea0000e6d940 ffff8800399a0000 00000000000167d0 0000000000000000 [ 5123.114049] 0000000000000000 0000000000000000 0000000000000000 00000000a7ec82c6 [ 5123.114662] ffff88005877fe18 ffffea0000e6d940 ffff8800399a0000 ffff880054c55800 [ 5123.115264] Call Trace: [ 5123.115868] [<ffffffffa03fb44b>] nfs4_try_migration+0xbb/0x220 [nfsv4] [ 5123.116487] [<ffffffffa03fcb3b>] nfs4_run_state_manager+0x4ab/0x7b0 [nfsv4] [ 5123.117104] [<ffffffffa03fc690>] ? nfs4_do_reclaim+0x510/0x510 [nfsv4] [ 5123.117813] [<ffffffff810a4527>] kthread+0xd7/0xf0 [ 5123.118456] [<ffffffff810a4450>] ? kthread_worker_fn+0x160/0x160 [ 5123.119108] [<ffffffff816d9cdf>] ret_from_fork+0x3f/0x70 [ 5123.119723] [<ffffffff810a4450>] ? kthread_worker_fn+0x160/0x160 [ 5123.120329] Code: 4c 8b 6a 58 74 17 eb 52 48 8d 55 a8 89 c6 4c 89 e7 e8 4a b5 ff ff 8b 45 b0 85 c0 74 1c 4c 89 f9 48 8b 55 90 48 8b 75 98 48 89 df <41> ff 55 00 3d e8 d8 ff ff 41 89 c6 74 cf 48 8b 4d c8 65 48 33 [ 5123.121643] RIP [<ffffffffa03ed38b>] nfs4_proc_get_locations+0x9b/0x120 [nfsv4] [ 5123.122308] RSP <ffff88005877fdb8> [ 5123.122942] CR2: 0000000000000000 Fixes: ec011fe847 ("NFS: Introduce a vector of migration recovery ops") Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29NFSv4.1/pNFS: Fix borken function _same_data_server_addrs_locked()Trond Myklebust
commit 6f536936b79bd4b5cea8fb0e5b8b0bce8cd1ea4a upstream. - Switch back to using list_for_each_entry(). Fixes an incorrect test for list NULL termination. - Do not assume that lists are sorted. - Finally, consider an existing entry to match if it consists of a subset of the addresses in the new entry. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29NFS: Don't let the ctime override attribute barriers.Trond Myklebust
commit 7c2dad99d60c86ec686b3bfdcb787c450a7ea89f upstream. Chuck reports seeing cases where a GETATTR that happens to race with an asynchronous WRITE is overriding the file size, despite the attribute barrier being set by the writeback code. The culprit turns out to be the check in nfs_ctime_need_update(), which sees that the ctime is newer than the cached ctime, and assumes that it is safe to override the attribute barrier. This patch removes that override, and ensures that attribute barriers are always respected. Reported-by: Chuck Lever <chuck.lever@oracle.com> Fixes: a08a8cd375db9 ("NFS: Add attribute update barriers to NFS writebacks") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29NFSv4: don't set SETATTR for O_RDONLY|O_EXCLNeilBrown
commit efcbc04e16dfa95fef76309f89710dd1d99a5453 upstream. It is unusual to combine the open flags O_RDONLY and O_EXCL, but it appears that libre-office does just that. [pid 3250] stat("/home/USER/.config", {st_mode=S_IFDIR|0700, st_size=8192, ...}) = 0 [pid 3250] open("/home/USER/.config/libreoffice/4-suse/user/extensions/buildid", O_RDONLY|O_EXCL <unfinished ...> NFSv4 takes O_EXCL as a sign that a setattr command should be sent, probably to reset the timestamps. When it was an O_RDONLY open, the SETATTR command does not identify any actual attributes to change. If no delegation was provided to the open, the SETATTR uses the all-zeros stateid and the request is accepted (at least by the Linux NFS server - no harm, no foul). If a read-delegation was provided, this is used in the SETATTR request, and a Netapp filer will justifiably claim NFS4ERR_BAD_STATEID, which the Linux client takes as a sign to retry - indefinitely. So only treat O_EXCL specially if O_CREAT was also given. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29nfsd: ensure that delegation stateid hash references are only put onceJeff Layton
commit 3fcbbd244ed1d20dc0eb7d48d729503992fa9b7d upstream. It's possible that a DELEGRETURN could race with (e.g.) client expiry, in which case we could end up putting the delegation hash reference more than once. Have unhash_delegation_locked return a bool that indicates whether it was already unhashed. In the case of destroy_delegation we only conditionally put the hash reference if that returns true. The other callers of unhash_delegation_locked call it while walking list_heads that shouldn't yet be detached. If we find that it doesn't return true in those cases, then throw a WARN_ON as that indicates that we have a partially hashed delegation, and that something is likely very wrong. Tested-by: Andrew W Elble <aweits@rit.edu> Tested-by: Anna Schumaker <Anna.Schumaker@netapp.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29nfsd: ensure that the ol stateid hash reference is only put onceJeff Layton
commit e85687393f3ee0a77ccca016f903d1558bb69258 upstream. When an open or lock stateid is hashed, we take an extra reference to it. When we unhash it, we drop that reference. The code however does not properly account for the case where we have two callers concurrently trying to unhash the stateid. This can lead to list corruption and the hash reference being put more than once. Fix this by having unhash_ol_stateid use list_del_init on the st_perfile list_head, and then testing to see if that list_head is empty before releasing the hash reference. This means that some of the unhashing wrappers now become bool return functions so we can test to see whether the stateid was unhashed before we put the reference. Reported-by: Andrew W Elble <aweits@rit.edu> Tested-by: Andrew W Elble <aweits@rit.edu> Reported-by: Anna Schumaker <Anna.Schumaker@netapp.com> Tested-by: Anna Schumaker <Anna.Schumaker@netapp.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29nfsd: Fix an FS_LAYOUT_TYPES/LAYOUT_TYPES encode bugKinglong Mee
commit 6896f15aabde505b35888039af93d1d182a0108a upstream. Currently we'll respond correctly to a request for either FS_LAYOUT_TYPES or LAYOUT_TYPES, but not to a request for both attributes simultaneously. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29NFSv4/pnfs: Ensure we don't miss a file extensionTrond Myklebust
commit 2b83d3de4c18af49800e0b26ae013db4fcf43a4a upstream. pNFS writes don't return attributes, however that doesn't mean that we should ignore the fact that they may be extending the file. This patch ensures that if a write is seen to extend the file, then we always set an attribute barrier, and update the cached file size. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29Btrfs: check if previous transaction aborted to avoid fs corruptionFilipe Manana
commit 1f9b8c8fbc9a4d029760b16f477b9d15500e3a34 upstream. While we are committing a transaction, it's possible the previous one is still finishing its commit and therefore we wait for it to finish first. However we were not checking if that previous transaction ended up getting aborted after we waited for it to commit, so we ended up committing the current transaction which can lead to fs corruption because the new superblock can point to trees that have had one or more nodes/leafs that were never durably persisted. The following sequence diagram exemplifies how this is possible: CPU 0 CPU 1 transaction N starts (...) btrfs_commit_transaction(N) cur_trans->state = TRANS_STATE_COMMIT_START; (...) cur_trans->state = TRANS_STATE_COMMIT_DOING; (...) cur_trans->state = TRANS_STATE_UNBLOCKED; root->fs_info->running_transaction = NULL; btrfs_start_transaction() --> starts transaction N + 1 btrfs_write_and_wait_transaction(trans, root); --> starts writing all new or COWed ebs created at transaction N creates some new ebs, COWs some existing ebs but doesn't COW or deletes eb X btrfs_commit_transaction(N + 1) (...) cur_trans->state = TRANS_STATE_COMMIT_START; (...) wait_for_commit(root, prev_trans); --> prev_trans == transaction N btrfs_write_and_wait_transaction() continues writing ebs --> fails writing eb X, we abort transaction N and set bit BTRFS_FS_STATE_ERROR on fs_info->fs_state, so no new transactions can start after setting that bit cleanup_transaction() btrfs_cleanup_one_transaction() wakes up task at CPU 1 continues, doesn't abort because cur_trans->aborted (transaction N + 1) is zero, and no checks for bit BTRFS_FS_STATE_ERROR in fs_info->fs_state are made btrfs_write_and_wait_transaction(trans, root); --> succeeds, no errors during writeback write_ctree_super(trans, root, 0); --> succeeds --> we have now a superblock that points us to some root that uses eb X, which was never written to disk In this scenario future attempts to read eb X from disk results in an error message like "parent transid verify failed on X wanted Y found Z". So fix this by aborting the current transaction if after waiting for the previous transaction we verify that it was aborted. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29CIFS: fix type confusion in copy offload ioctlJann Horn
commit 4c17a6d56bb0cad3066a714e94f7185a24b40f49 upstream. This might lead to local privilege escalation (code execution as kernel) for systems where the following conditions are met: - CONFIG_CIFS_SMB2 and CONFIG_CIFS_POSIX are enabled - a cifs filesystem is mounted where: - the mount option "vers" was used and set to a value >=2.0 - the attacker has write access to at least one file on the filesystem To attack this, an attacker would have to guess the target_tcon pointer (but guessing wrong doesn't cause a crash, it just returns an error code) and win a narrow race. Signed-off-by: Jann Horn <jann@thejh.net> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29Revert "ext4: remove block_device_ejected"Theodore Ts'o
commit bdfe0cbd746aa9b2509c2f6d6be17193cf7facd7 upstream. This reverts commit 08439fec266c3cc5702953b4f54bdf5649357de0. Unfortunately we still need to test for bdi->dev to avoid a crash when a USB stick is yanked out while a file system is mounted: usb 2-2: USB disconnect, device number 2 Buffer I/O error on dev sdb1, logical block 15237120, lost sync page write JBD2: Error -5 detected when updating journal superblock for sdb1-8. BUG: unable to handle kernel paging request at 34beb000 IP: [<c136ce88>] __percpu_counter_add+0x18/0xc0 *pdpt = 0000000023db9001 *pde = 0000000000000000 Oops: 0000 [#1] SMP CPU: 0 PID: 4083 Comm: umount Tainted: G U OE 4.1.1-040101-generic #201507011435 Hardware name: LENOVO 7675CTO/7675CTO, BIOS 7NETC2WW (2.22 ) 03/22/2011 task: ebf06b50 ti: ebebc000 task.ti: ebebc000 EIP: 0060:[<c136ce88>] EFLAGS: 00010082 CPU: 0 EIP is at __percpu_counter_add+0x18/0xc0 EAX: f21c8e88 EBX: f21c8e88 ECX: 00000000 EDX: 00000001 ESI: 00000001 EDI: 00000000 EBP: ebebde60 ESP: ebebde40 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 CR0: 8005003b CR2: 34beb000 CR3: 33354200 CR4: 000007f0 Stack: c1abe100 edcb0098 edcb00ec ffffffff f21c8e68 ffffffff f21c8e68 f286d160 ebebde84 c1160454 00000010 00000282 f72a77f8 00000984 f72a77f8 f286d160 f286d170 ebebdea0 c11e613f 00000000 00000282 f72a77f8 edd7f4d0 00000000 Call Trace: [<c1160454>] account_page_dirtied+0x74/0x110 [<c11e613f>] __set_page_dirty+0x3f/0xb0 [<c11e6203>] mark_buffer_dirty+0x53/0xc0 [<c124a0cb>] ext4_commit_super+0x17b/0x250 [<c124ac71>] ext4_put_super+0xc1/0x320 [<c11f04ba>] ? fsnotify_unmount_inodes+0x1aa/0x1c0 [<c11cfeda>] ? evict_inodes+0xca/0xe0 [<c11b925a>] generic_shutdown_super+0x6a/0xe0 [<c10a1df0>] ? prepare_to_wait_event+0xd0/0xd0 [<c1165a50>] ? unregister_shrinker+0x40/0x50 [<c11b92f6>] kill_block_super+0x26/0x70 [<c11b94f5>] deactivate_locked_super+0x45/0x80 [<c11ba007>] deactivate_super+0x47/0x60 [<c11d2b39>] cleanup_mnt+0x39/0x80 [<c11d2bc0>] __cleanup_mnt+0x10/0x20 [<c1080b51>] task_work_run+0x91/0xd0 [<c1011e3c>] do_notify_resume+0x7c/0x90 [<c1720da5>] work_notify Code: 8b 55 e8 e9 f4 fe ff ff 90 90 90 90 90 90 90 90 90 90 90 55 89 e5 83 ec 20 89 5d f4 89 c3 89 75 f8 89 d6 89 7d fc 89 cf 8b 48 14 <64> 8b 01 89 45 ec 89 c2 8b 45 08 c1 fa 1f 01 75 ec 89 55 f0 89 EIP: [<c136ce88>] __percpu_counter_add+0x18/0xc0 SS:ESP 0068:ebebde40 CR2: 0000000034beb000 ---[ end trace dd564a7bea834ecd ]--- Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=101011 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29ext4: don't manipulate recovery flag when freezing no-journal fsEric Sandeen
commit c642dc9e1aaed953597e7092d7df329e6234096e upstream. At some point along this sequence of changes: f6e63f9 ext4: fold ext4_nojournal_sops into ext4_sops bb04457 ext4: support freezing ext2 (nojournal) file systems 9ca9238 ext4: Use separate super_operations structure for no_journal filesystems ext4 started setting needs_recovery on filesystems without journals when they are unfrozen. This makes no sense, and in fact confuses blkid to the point where it doesn't recognize the filesystem at all. (freeze ext2; unfreeze ext2; run blkid; see no output; run dumpe2fs, see needs_recovery set on fs w/ no journal). To fix this, don't manipulate the INCOMPAT_RECOVER feature on filesystems without journals. Reported-by: Stu Mark <smark@datto.com> Reviewed-by: Jan Kara <jack@suse.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-21fs: create and use seq_show_option for escapingKees Cook
commit a068acf2ee77693e0bf39d6e07139ba704f461c3 upstream. Many file systems that implement the show_options hook fail to correctly escape their output which could lead to unescaped characters (e.g. new lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This could lead to confusion, spoofed entries (resulting in things like systemd issuing false d-bus "mount" notifications), and who knows what else. This looks like it would only be the root user stepping on themselves, but it's possible weird things could happen in containers or in other situations with delegated mount privileges. Here's an example using overlay with setuid fusermount trusting the contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use of "sudo" is something more sneaky: $ BASE="ovl" $ MNT="$BASE/mnt" $ LOW="$BASE/lower" $ UP="$BASE/upper" $ WORK="$BASE/work/ 0 0 none /proc fuse.pwn user_id=1000" $ mkdir -p "$LOW" "$UP" "$WORK" $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt $ cat /proc/mounts none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0 none /proc fuse.pwn user_id=1000 0 0 $ fusermount -u /proc $ cat /proc/mounts cat: /proc/mounts: No such file or directory This fixes the problem by adding new seq_show_option and seq_show_option_n helpers, and updating the vulnerable show_option handlers to use them as needed. Some, like SELinux, need to be open coded due to unusual existing escape mechanisms. [akpm@linux-foundation.org: add lost chunk, per Kees] [keescook@chromium.org: seq_show_option should be using const parameters] Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Jan Kara <jack@suse.com> Acked-by: Paul Moore <paul@paul-moore.com> Cc: J. R. Okajima <hooanon05g@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-21hpfs: update ctime and mtime on directory modificationMikulas Patocka
commit f49a26e7718dd30b49e3541e3e25aecf5e7294e2 upstream. Update ctime and mtime when a directory is modified. (though OS/2 doesn't update them anyway) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-21fs: Set the size of empty dirs to 0.Eric W. Biederman
commit 4b75de8615050c1b0dd8d7794838c42f74ed36ba upstream. Before the make_empty_dir_inode calls were introduce into proc, sysfs, and sysctl those directories when stated reported an i_size of 0. make_empty_dir_inode started reporting an i_size of 2. At least one userspace application depended on stat returning i_size of 0. So modify make_empty_dir_inode to cause an i_size of 0 to be reported for these directories. Reported-by: Tejun Heo <tj@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-21xfs: Fix file type directory corruption for btree directoriesJan Kara
commit 037542345a82aaaa228ec280fe6ddff1568d169f upstream. Users have occasionally reported that file type for some directory entries is wrong. This mostly happened after updating libraries some libraries. After some debugging the problem was traced down to xfs_dir2_node_replace(). The function uses args->filetype as a file type to store in the replaced directory entry however it also calls xfs_da3_node_lookup_int() which will store file type of the current directory entry in args->filetype. Thus we fail to change file type of a directory entry to a proper type. Fix the problem by storing new file type in a local variable before calling xfs_da3_node_lookup_int(). Reported-by: Giacomo Comes <comes@naic.edu> Signed-off-by: Jan Kara <jack@suse.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-21xfs: Fix xfs_attr_leafblock definitionJan Kara
commit ffeecc5213024ae663377b442eedcfbacf6d0c5d upstream. struct xfs_attr_leafblock contains 'entries' array which is declared with size 1 altough it can in fact contain much more entries. Since this array is followed by further struct members, gcc (at least in version 4.8.3) thinks that the array has the fixed size of 1 element and thus may optimize away all accesses beyond the end of array resulting in non-working code. This problem was only observed with userspace code in xfsprogs, however it's better to be safe in kernel as well and have matching kernel and xfsprogs definitions. Signed-off-by: Jan Kara <jack@suse.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-21libxfs: readahead of dir3 data blocks should use the read verifierDarrick J. Wong
commit 2f123bce18943fff819bc10f8868ffb9149fc622 upstream. In the dir3 data block readahead function, use the regular read verifier to check the block's CRC and spot-check the block contents instead of directly calling only the spot-checking routine. This prevents corrupted directory data blocks from being read into the kernel, which can lead to garbage ls output and directory loops (if say one of the entries contains slashes and other junk). Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-16nfsd: do nfs4_check_fh in nfs4_check_file instead of nfs4_check_olstateidJeff Layton
commit 8fcd461db7c09337b6d2e22d25eb411123f379e3 upstream. Currently, preprocess_stateid_op calls nfs4_check_olstateid which verifies that the open stateid corresponds to the current filehandle in the call by calling nfs4_check_fh. If the stateid is a NFS4_DELEG_STID however, then no such check is done. This could cause incorrect enforcement of permissions, because the nfsd_permission() call in nfs4_check_file uses current the current filehandle, but any subsequent IO operation will use the file descriptor in the stateid. Move the call to nfs4_check_fh into nfs4_check_file instead so that it can be done for all stateid types. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> [bfields: moved fh check to avoid NULL deref in special stateid case] Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-16nfsd: refactor nfs4_preprocess_stateid_opChristoph Hellwig
commit a0649b2d3fffb1cde8745568c767f3a55a3462bc upstream. Split out two self contained helpers to make the function more readable. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Cc: Jeff Layton <jlayton@poochiereds.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-16signalfd: fix information leak in signalfd_copyinfoAmanieu d'Antras
commit 3ead7c52bdb0ab44f4bb1feed505a8323cc12ba7 upstream. This function may copy the si_addr_lsb field to user mode when it hasn't been initialized, which can leak kernel stack data to user mode. Just checking the value of si_code is insufficient because the same si_code value is shared between multiple signals. This is solved by checking the value of si_signo in addition to si_code. Signed-off-by: Amanieu d'Antras <amanieu@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-16nfsd: Drop BUG_ON and ignore SECLABEL on absent filesystemKinglong Mee
commit c2227a39a078473115910512aa0f8d53bd915e60 upstream. On an absent filesystem (one served by another server), we need to be able to handle requests for certain attributest (like fs_locations, so the client can find out which server does have the filesystem), but others we can't. We forgot to take that into account when adding another attribute bitmask work for the SECURITY_LABEL attribute. There an export entry with the "refer" option can result in: [ 88.414272] kernel BUG at fs/nfsd/nfs4xdr.c:2249! [ 88.414828] invalid opcode: 0000 [#1] SMP [ 88.415368] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache nfsd xfs libcrc32c iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iosf_mbi ppdev btrfs coretemp crct10dif_pclmul crc32_pclmul crc32c_intel xor ghash_clmulni_intel raid6_pq vmw_balloon parport_pc parport i2c_piix4 shpchp vmw_vmci acpi_cpufreq auth_rpcgss nfs_acl lockd grace sunrpc vmwgfx drm_kms_helper ttm drm mptspi mptscsih serio_raw mptbase e1000 scsi_transport_spi ata_generic pata_acpi [last unloaded: nfsd] [ 88.417827] CPU: 0 PID: 2116 Comm: nfsd Not tainted 4.0.7-300.fc22.x86_64 #1 [ 88.418448] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014 [ 88.419093] task: ffff880079146d50 ti: ffff8800785d8000 task.ti: ffff8800785d8000 [ 88.419729] RIP: 0010:[<ffffffffa04b3c10>] [<ffffffffa04b3c10>] nfsd4_encode_fattr+0x820/0x1f00 [nfsd] [ 88.420376] RSP: 0000:ffff8800785db998 EFLAGS: 00010206 [ 88.421027] RAX: 0000000000000001 RBX: 000000000018091a RCX: ffff88006668b980 [ 88.421676] RDX: 00000000fffef7fc RSI: 0000000000000000 RDI: ffff880078d05000 [ 88.422315] RBP: ffff8800785dbb58 R08: ffff880078d043f8 R09: ffff880078d4a000 [ 88.422968] R10: 0000000000010000 R11: 0000000000000002 R12: 0000000000b0a23a [ 88.423612] R13: ffff880078d05000 R14: ffff880078683100 R15: ffff88006668b980 [ 88.424295] FS: 0000000000000000(0000) GS:ffff88007c600000(0000) knlGS:0000000000000000 [ 88.424944] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 88.425597] CR2: 00007f40bc370f90 CR3: 0000000035af5000 CR4: 00000000001407f0 [ 88.426285] Stack: [ 88.426921] ffff8800785dbaa8 ffffffffa049e4af ffff8800785dba08 ffffffff813298f0 [ 88.427585] ffff880078683300 ffff8800769b0de8 0000089d00000001 0000000087f805e0 [ 88.428228] ffff880000000000 ffff880079434a00 0000000000000000 ffff88006668b980 [ 88.428877] Call Trace: [ 88.429527] [<ffffffffa049e4af>] ? exp_get_by_name+0x7f/0xb0 [nfsd] [ 88.430168] [<ffffffff813298f0>] ? inode_doinit_with_dentry+0x210/0x6a0 [ 88.430807] [<ffffffff8123833e>] ? d_lookup+0x2e/0x60 [ 88.431449] [<ffffffff81236133>] ? dput+0x33/0x230 [ 88.432097] [<ffffffff8123f214>] ? mntput+0x24/0x40 [ 88.432719] [<ffffffff812272b2>] ? path_put+0x22/0x30 [ 88.433340] [<ffffffffa049ac87>] ? nfsd_cross_mnt+0xb7/0x1c0 [nfsd] [ 88.433954] [<ffffffffa04b54e0>] nfsd4_encode_dirent+0x1b0/0x3d0 [nfsd] [ 88.434601] [<ffffffffa04b5330>] ? nfsd4_encode_getattr+0x40/0x40 [nfsd] [ 88.435172] [<ffffffffa049c991>] nfsd_readdir+0x1c1/0x2a0 [nfsd] [ 88.435710] [<ffffffffa049a530>] ? nfsd_direct_splice_actor+0x20/0x20 [nfsd] [ 88.436447] [<ffffffffa04abf30>] nfsd4_encode_readdir+0x120/0x220 [nfsd] [ 88.437011] [<ffffffffa04b58cd>] nfsd4_encode_operation+0x7d/0x190 [nfsd] [ 88.437566] [<ffffffffa04aa6dd>] nfsd4_proc_compound+0x24d/0x6f0 [nfsd] [ 88.438157] [<ffffffffa0496103>] nfsd_dispatch+0xc3/0x220 [nfsd] [ 88.438680] [<ffffffffa006f0cb>] svc_process_common+0x43b/0x690 [sunrpc] [ 88.439192] [<ffffffffa0070493>] svc_process+0x103/0x1b0 [sunrpc] [ 88.439694] [<ffffffffa0495a57>] nfsd+0x117/0x190 [nfsd] [ 88.440194] [<ffffffffa0495940>] ? nfsd_destroy+0x90/0x90 [nfsd] [ 88.440697] [<ffffffff810bb728>] kthread+0xd8/0xf0 [ 88.441260] [<ffffffff810bb650>] ? kthread_worker_fn+0x180/0x180 [ 88.441762] [<ffffffff81789e58>] ret_from_fork+0x58/0x90 [ 88.442322] [<ffffffff810bb650>] ? kthread_worker_fn+0x180/0x180 [ 88.442879] Code: 0f 84 93 05 00 00 83 f8 ea c7 85 a0 fe ff ff 00 00 27 30 0f 84 ba fe ff ff 85 c0 0f 85 a5 fe ff ff e9 e3 f9 ff ff 0f 1f 44 00 00 <0f> 0b 66 0f 1f 44 00 00 be 04 00 00 00 4c 89 ef 4c 89 8d 68 fe [ 88.444052] RIP [<ffffffffa04b3c10>] nfsd4_encode_fattr+0x820/0x1f00 [nfsd] [ 88.444658] RSP <ffff8800785db998> [ 88.445232] ---[ end trace 6cb9d0487d94a29f ]--- Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-16ocfs2: fix shift left overflowJoseph Qi
commit 32e5a2a2be6b085febaac36efff495ad65a55e6c upstream. When using a large volume, for example 9T volume with 2T already used, frequent creation of small files with O_DIRECT when the IO is not cluster aligned may clear sectors in the wrong place. This will cause filesystem corruption. This is because p_cpos is a u32. When calculating the corresponding sector it should be converted to u64 first, otherwise it may overflow. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-16ocfs2: fix BUG in ocfs2_downconvert_thread_do_work()Joseph Qi
commit 209f7512d007980fd111a74a064d70a3656079cf upstream. The "BUG_ON(list_empty(&osb->blocked_lock_list))" in ocfs2_downconvert_thread_do_work can be triggered in the following case: ocfs2dc has firstly saved osb->blocked_lock_count to local varibale processed, and then processes the dentry lockres. During the dentry put, it calls iput and then deletes rw, inode and open lockres from blocked list in ocfs2_mark_lockres_freeing. And this causes the variable `processed' to not reflect the number of blocked lockres to be processed, which triggers the BUG. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-16fsnotify: fix oops in fsnotify_clear_marks_by_group_flags()Jan Kara
commit 8f2f3eb59dff4ec538de55f2e0592fec85966aab upstream. fsnotify_clear_marks_by_group_flags() can race with fsnotify_destroy_marks() so that when fsnotify_destroy_mark_locked() drops mark_mutex, a mark from the list iterated by fsnotify_clear_marks_by_group_flags() can be freed and thus the next entry pointer we have cached may become stale and we dereference free memory. Fix the problem by first moving marks to free to a special private list and then always free the first entry in the special list. This method is safe even when entries from the list can disappear once we drop the lock. Signed-off-by: Jan Kara <jack@suse.com> Reported-by: Ashish Sangwan <a.sangwan@samsung.com> Reviewed-by: Ashish Sangwan <a.sangwan@samsung.com> Cc: Lino Sanfilippo <LinoSanfilippo@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-10xfs: remote attributes need to be considered dataDave Chinner
commit df150ed102baa0e78c06e08e975dfb47147dd677 upstream. We don't log remote attribute contents, and instead write them synchronously before we commit the block allocation and attribute tree update transaction. As a result we are writing to the allocated space before the allcoation has been made permanent. As a result, we cannot consider this allocation to be a metadata allocation. Metadata allocation can take blocks from the free list and so reuse them before the transaction that freed the block is committed to disk. This behaviour is perfectly fine for journalled metadata changes as log recovery will ensure the free operation is replayed before the overwrite, but for remote attribute writes this is not the case. Hence we have to consider the remote attribute blocks to contain data and allocate accordingly. We do this by dropping the XFS_BMAPI_METADATA flag from the block allocation. This means the allocation will not use blocks that are on the busy list without first ensuring that the freeing transaction has been committed to disk and the blocks removed from the busy list. This ensures we will never overwrite a freed block without first ensuring that it is really free. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-10xfs: remote attribute headers contain an invalid LSNDave Chinner
commit e3c32ee9e3e747fec01eb38e6610a9157d44c3ea upstream. In recent testing, a system that crashed failed log recovery on restart with a bad symlink buffer magic number: XFS (vda): Starting recovery (logdev: internal) XFS (vda): Bad symlink block magic! XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060 On examination of the log via xfs_logprint, none of the symlink buffers in the log had a bad magic number, nor were any other types of buffer log format headers mis-identified as symlink buffers. Tracing was used to find the buffer the kernel was tripping over, and xfs_db identified it's contents as: 000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e 020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d 040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d ..... This is a remote attribute buffer, which are notable in that they are not logged but are instead written synchronously by the remote attribute code so that they exist on disk before the attribute transactions are committed to the journal. The above remote attribute block has an invalid LSN in it - cycle 0xd002000, block 0 - which means when log recovery comes along to determine if the transaction that writes to the underlying block should be replayed, it sees a block that has a future LSN and so does not replay the buffer data in the transaction. Instead, it validates the buffer magic number and attaches the buffer verifier to it. It is this buffer magic number check that is failing in the above assert, indicating that we skipped replay due to the LSN of the underlying buffer. The problem here is that the remote attribute buffers cannot have a valid LSN placed into them, because the transaction that contains the attribute tree pointer changes and the block allocation that the attribute data is being written to hasn't yet been committed. Hence the LSN field in the attribute block is completely unwritten, thereby leaving the underlying contents of the block in the LSN field. It could have any value, and hence a future overwrite of the block by log recovery may or may not work correctly. Fix this by always writing an invalid LSN to the remote attribute block, as any buffer in log recovery that needs to write over the remote attribute should occur. We are protected from having old data written over the attribute by the fact that freeing the block before the remote attribute is written will result in the buffer being marked stale in the log and so all changes prior to the buffer stale transaction will be cancelled by log recovery. Hence it is safe to ignore the LSN in the case or synchronously written, unlogged metadata such as remote attribute blocks, and to ensure we do that correctly, we need to write an invalid LSN to all remote attribute blocks to trigger immediate recovery of metadata that is written over the top. As a further protection for filesystems that may already have remote attribute blocks with bad LSNs on disk, change the log recovery code to always trigger immediate recovery of metadata over remote attribute blocks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-10NFS: Fix a memory leak in nfs_do_recoalesceTrond Myklebust
commit 03d5eb65b53889fe98a5ecddfe205c16e3093190 upstream. If the function exits early, then we must put those requests that were not processed back onto the &mirror->pg_list so they can be cleaned up by nfs_pgio_error(). Fixes: a7d42ddb30997 ("nfs: add mirroring support to pgio layer") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-10NFSv4: We must set NFS_OPEN_STATE flag in nfs_resync_open_stateid_lockedTrond Myklebust
commit 3c38cbe2ade88240fabb585b408f779ad3b9a31b upstream. Otherwise, nfs4_select_rw_stateid() will always return the zero stateid instead of the correct open stateid. Fixes: f95549cf24660 ("NFSv4: More CLOSE/OPEN races") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-10NFS: Don't revalidate the mapping if both size and change attr are up to dateTrond Myklebust
commit 85a23cee3f2c928475f31777ead5a71340a12fc3 upstream. If we've ensured that the size and the change attribute are both correct, then there is no point in marking those attributes as needing revalidation again. Only do so if we know the size is incorrect and was not updated. Fixes: f2467b6f64da ("NFS: Clear NFS_INO_REVAL_PAGECACHE when...") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-10mnt: In detach_mounts detach the appropriate unmounted mountEric W. Biederman
commit fe78fcc85a2046c51f1535710996860557eeec20 upstream. The handling of in detach_mounts of unmounted but connected mounts is buggy and can lead to an infinite loop. Correct the handling of unmounted mounts in detach_mount. When the mountpoint of an unmounted but connected mount is connected to a dentry, and that dentry is deleted we need to disconnect that mount from the parent mount and the deleted dentry. Nothing changes for the unmounted and connected children. They can be safely ignored. Fixes: ce07d891a0891d3c0d0c2d73d577490486b809e1 mnt: Honor MNT_LOCKED when detaching mounts Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-10mnt: Clarify and correct the disconnect logic in umount_treeEric W. Biederman
commit f2d0a123bcf16d1a9cf7942ddc98e0ef77862c2b upstream. rmdir mntpoint will result in an infinite loop when there is a mount locked on the mountpoint in another mount namespace. This is because the logic to test to see if a mount should be disconnected in umount_tree is buggy. Move the logic to decide if a mount should remain connected to it's mountpoint into it's own function disconnect_mount so that clarity of expression instead of terseness of expression becomes a virtue. When the conditions where it is invalid to leave a mount connected are first ruled out, the logic for deciding if a mount should be disconnected becomes much clearer and simpler. Fixes: e0c9c0afd2fc958ffa34b697972721d81df8a56f mnt: Update detach_mounts to leave mounts connected Fixes: ce07d891a0891d3c0d0c2d73d577490486b809e1 mnt: Honor MNT_LOCKED when detaching mounts Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-10freeing unlinked file indefinitely delayedAl Viro
commit 75a6f82a0d10ef8f13cd8fe7212911a0252ab99e upstream. Normally opening a file, unlinking it and then closing will have the inode freed upon close() (provided that it's not otherwise busy and has no remaining links, of course). However, there's one case where that does *not* happen. Namely, if you open it by fhandle with cold dcache, then unlink() and close(). In normal case you get d_delete() in unlink(2) notice that dentry is busy and unhash it; on the final dput() it will be forcibly evicted from dcache, triggering iput() and inode removal. In this case, though, we end up with *two* dentries - disconnected (created by open-by-fhandle) and regular one (used by unlink()). The latter will have its reference to inode dropped just fine, but the former will not - it's considered hashed (it is on the ->s_anon list), so it will stay around until the memory pressure will finally do it in. As the result, we have the final iput() delayed indefinitely. It's trivial to reproduce - void flush_dcache(void) { system("mount -o remount,rw /"); } static char buf[20 * 1024 * 1024]; main() { int fd; union { struct file_handle f; char buf[MAX_HANDLE_SZ]; } x; int m; x.f.handle_bytes = sizeof(x); chdir("/root"); mkdir("foo", 0700); fd = open("foo/bar", O_CREAT | O_RDWR, 0600); close(fd); name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0); flush_dcache(); fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR); unlink("foo/bar"); write(fd, buf, sizeof(buf)); system("df ."); /* 20Mb eaten */ close(fd); system("df ."); /* should've freed those 20Mb */ flush_dcache(); system("df ."); /* should be the same as #2 */ } will spit out something like Filesystem 1K-blocks Used Available Use% Mounted on /dev/root 322023 303843 1131 100% / Filesystem 1K-blocks Used Available Use% Mounted on /dev/root 322023 303843 1131 100% / Filesystem 1K-blocks Used Available Use% Mounted on /dev/root 322023 283282 21692 93% / - inode gets freed only when dentry is finally evicted (here we trigger than by remount; normally it would've happened in response to memory pressure hell knows when). Acked-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03hpfs: hpfs_error: Remove static buffer, use vsprintf extension %pV insteadJoe Perches
commit a28e4b2b18ccb90df402da3f21e1a83c9d4f8ec1 upstream. Removing unnecessary static buffers is good. Use the vsprintf %pV extension instead. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Mikulas Patocka <mikulas@twibright.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03hpfs: kstrdup() out of memory handlingSanidhya Kashyap
commit ce657611baf902f14ae559ce4e0787ead6712067 upstream. There is a possibility of nothing being allocated to the new_opts in case of memory pressure, therefore return ENOMEM for such case. Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Signed-off-by: Mikulas Patocka <mikulas@twibright.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-039p: don't leave a half-initialized inode sitting aroundAl Viro
commit 0a73d0a204a4a04a1e110539c5a524ae51f91d6d upstream. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03nfs: always update creds in mirror, even when we have an already connected dsJeff Layton
commit 0c8315dd56577445dd1afe6b9cfa06b7efdf2f82 upstream. A ds can be associated with more than one mirror, but we currently skip setting a mirror's credentials if we find that it's already set up with a connected client. The upshot is that we can end up sending DS writes with MDS credentials instead of properly setting them up. Fix nfs4_ff_layout_prepare_ds to always verify that the mirror's credentials are set up, even when we have a DS that's already connected. Reported-by: Tom Haynes <thomas.haynes@primarydata.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03nfs: fix potential credential leak in ff_layout_update_mirror_credJeff Layton
commit a24221dca1868101c9b4b5adde4a6a5b1a3a64a7 upstream. If we have two tasks racing to update a mirror's credentials, then they can end up leaking one (or more) sets of credentials. The first task will set mirror->cred and then the second task will just overwrite it. Use a cmpxchg to ensure that the creds are only set once. If we get to the point where we would set mirror->cred and find that they're already set, then we just release the creds that were just found. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>