summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2016-05-27Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "Followups to the parallel lookup work: - update docs - restore killability of the places that used to take ->i_mutex killably now that we have down_write_killable() merged - Additionally, it turns out that I missed a prerequisite for security_d_instantiate() stuff - ->getxattr() wasn't the only thing that could be called before dentry is attached to inode; with smack we needed the same treatment applied to ->setxattr() as well" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: switch ->setxattr() to passing dentry and inode separately switch xattr_handler->set() to passing dentry and inode separately restore killability of old mutex_lock_killable(&inode->i_mutex) users add down_write_killable_nested() update D/f/directory-locking
2016-05-27switch ->setxattr() to passing dentry and inode separatelyAl Viro
smack ->d_instantiate() uses ->setxattr(), so to be able to call it before we'd hashed the new dentry and attached it to inode, we need ->setxattr() instances getting the inode as an explicit argument rather than obtaining it from dentry. Similar change for ->getxattr() had been done in commit ce23e64. Unlike ->getxattr() (which is used by both selinux and smack instances of ->d_instantiate()) ->setxattr() is used only by smack one and unfortunately it got missed back then. Reported-by: Seung-Woo Kim <sw0312.kim@samsung.com> Tested-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-27Merge branch 'overlayfs-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs update from Miklos Szeredi: "The meat of this is a change to use the mounter's credentials for operations that require elevated privileges (such as whiteout creation). This fixes behavior under user namespaces as well as being a nice cleanup" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: Do d_type check only if work dir creation was successful ovl: update documentation ovl: override creds with the ones from the superblock mounter
2016-05-27Merge branch 'for-linus-4.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs cleanups and fixes from Chris Mason: "We have another round of fixes and a few cleanups. I have a fix for short returns from btrfs_copy_from_user, which finally nails down a very hard to find regression we added in v4.6. Dave is pushing around gfp parameters, mostly to cleanup internal apis and make it a little more consistent. The rest are smaller fixes, and one speelling fixup patch" * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (22 commits) Btrfs: fix handling of faults from btrfs_copy_from_user btrfs: fix string and comment grammatical issues and typos btrfs: scrub: Set bbio to NULL before calling btrfs_map_block Btrfs: fix unexpected return value of fiemap Btrfs: free sys_array eb as soon as possible btrfs: sink gfp parameter to convert_extent_bit btrfs: make state preallocation more speculative in __set_extent_bit btrfs: untangle gotos a bit in convert_extent_bit btrfs: untangle gotos a bit in __clear_extent_bit btrfs: untangle gotos a bit in __set_extent_bit btrfs: sink gfp parameter to set_record_extent_bits btrfs: sink gfp parameter to set_extent_new btrfs: sink gfp parameter to set_extent_defrag btrfs: sink gfp parameter to set_extent_delalloc btrfs: sink gfp parameter to clear_extent_dirty btrfs: sink gfp parameter to clear_record_extent_bits btrfs: sink gfp parameter to clear_extent_bits btrfs: sink gfp parameter to set_extent_bits btrfs: make find_workspace warn if there are no workspaces btrfs: make find_workspace always succeed ...
2016-05-27mm: remove more IS_ERR_VALUE abusesLinus Torvalds
The do_brk() and vm_brk() return value was "unsigned long" and returned the starting address on success, and an error value on failure. The reasons are entirely historical, and go back to it basically behaving like the mmap() interface does. However, nobody actually wanted that interface, and it causes totally pointless IS_ERR_VALUE() confusion. What every single caller actually wants is just the simpler integer return of zero for success and negative error number on failure. So just convert to that much clearer and more common calling convention, and get rid of all the IS_ERR_VALUE() uses wrt vm_brk(). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27remove lots of IS_ERR_VALUE abusesArnd Bergmann
Most users of IS_ERR_VALUE() in the kernel are wrong, as they pass an 'int' into a function that takes an 'unsigned long' argument. This happens to work because the type is sign-extended on 64-bit architectures before it gets converted into an unsigned type. However, anything that passes an 'unsigned short' or 'unsigned int' argument into IS_ERR_VALUE() is guaranteed to be broken, as are 8-bit integers and types that are wider than 'unsigned long'. Andrzej Hajda has already fixed a lot of the worst abusers that were causing actual bugs, but it would be nice to prevent any users that are not passing 'unsigned long' arguments. This patch changes all users of IS_ERR_VALUE() that I could find on 32-bit ARM randconfig builds and x86 allmodconfig. For the moment, this doesn't change the definition of IS_ERR_VALUE() because there are probably still architecture specific users elsewhere. Almost all the warnings I got are for files that are better off using 'if (err)' or 'if (err < 0)'. The only legitimate user I could find that we get a warning for is the (32-bit only) freescale fman driver, so I did not remove the IS_ERR_VALUE() there but changed the type to 'unsigned long'. For 9pfs, I just worked around one user whose calling conventions are so obscure that I did not dare change the behavior. I was using this definition for testing: #define IS_ERR_VALUE(x) ((unsigned long*)NULL == (typeof (x)*)NULL && \ unlikely((unsigned long long)(x) >= (unsigned long long)(typeof(x))-MAX_ERRNO)) which ends up making all 16-bit or wider types work correctly with the most plausible interpretation of what IS_ERR_VALUE() was supposed to return according to its users, but also causes a compile-time warning for any users that do not pass an 'unsigned long' argument. I suggested this approach earlier this year, but back then we ended up deciding to just fix the users that are obviously broken. After the initial warning that caused me to get involved in the discussion (fs/gfs2/dir.c) showed up again in the mainline kernel, Linus asked me to send the whole thing again. [ Updated the 9p parts as per Al Viro - Linus ] Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Andrzej Hajda <a.hajda@samsung.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.org/lkml/2016/1/7/363 Link: https://lkml.org/lkml/2016/5/27/486 Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> # For nvmem part Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27direct-io: fix direct write stale data exposure from concurrent buffered readEryu Guan
Currently direct writes inside i_size on a DIO_SKIP_HOLES filesystem are not allowed to allocate blocks(get_more_blocks() sets 'create' to 0 before calling get_block() callback), if it's a sparse file, direct writes fall back to buffered writes to avoid stale data exposure from concurrent buffered read. But there're two cases that can result in stale data exposure are not correctly detected. 1. The detection for "writing inside i_size" is not sufficient, writes can be treated as "extending writes" wrongly. For example, direct write 1FSB (file system block) to a 1FSB sparse file on ext2/3/4, starting from offset 0, in this case it's writing inside i_size, but 'create' is non-zero, because 'block_in_file' and '(i_size_read(inode) >> blkbits' are both zero. 2. Direct writes starting from or beyong i_size (not inside i_size) also could trigger block allocation and expose stale data. For example, consider a sparse file with i_size of 2k, and a write to offset 2k or 3k into the file, with a filesystem block size of 4k. (Thanks to Jeff Moyer for pointing this case out in his review.) The first problem can be demostrated by running ltp-aiodio test ADSP045 many times. When testing on extN filesystems, I see test failures occasionally, buffered read could read non-zero (stale) data. ADSP045: dio_sparse -a 4k -w 4k -s 2k -n 1 dio_sparse 0 TINFO : Dirtying free blocks dio_sparse 0 TINFO : Starting I/O tests non zero buffer at buf[0] => 0xffffffaa,ffffffaa,ffffffaa,ffffffaa non-zero read at offset 0 dio_sparse 0 TINFO : Killing childrens(s) dio_sparse 1 TFAIL : dio_sparse.c:191: 1 children(s) exited abnormally The second problem can also be reproduced easily by a hacked dio_sparse program, which accepts an option to specify the write offset. What we should really do is to disable block allocation for writes that could result in filling holes inside i_size. Link: http://lkml.kernel.org/r/1463156728-13357-1-git-send-email-guaneryu@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Eryu Guan <guaneryu@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27ocfs2: bump up o2cb network protocol versionJunxiao Bi
Two new messages are added to support negotiating hb timeout. Stop nodes frmo talking an old version to mount as they will cause the negotiation to fail. Link: http://lkml.kernel.org/r/1464231615-27939-1-git-send-email-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27ocfs2: o2hb: fix hb hung timeJunxiao Bi
hr_last_timeout_start should be set as the last time where hb is still OK. When hb write timeout, hung time will be (jiffies - hr_last_timeout_start). Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Gang He <ghe@suse.com> Cc: rwxybh <rwxybh@126.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27ocfs2: o2hb: don't negotiate if last hb failJunxiao Bi
Sometimes io error is returned when storage is down for a while. Like for iscsi device, stroage is made offline when session timeout, and this will make all io return -EIO. For this case, nodes shouldn't do negotiate timeout but should fence self. So let nodes fence self when o2hb_do_disk_heartbeat return an error, this is the same behavior with o2hb without negotiate timer. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Gang He <ghe@suse.com> Cc: rwxybh <rwxybh@126.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27ocfs2: o2hb: add some user/debug logJunxiao Bi
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Gang He <ghe@suse.com> Cc: rwxybh <rwxybh@126.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27ocfs2: o2hb: add NEGOTIATE_APPROVE messageJunxiao Bi
This message is used to re-queue write timeout timer and negotiate timer when all nodes suffer a write hung to storage, this makes node not fence self if storage down. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Gang He <ghe@suse.com> Cc: rwxybh <rwxybh@126.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27ocfs2: o2hb: add NEGO_TIMEOUT messageJunxiao Bi
This message is sent to master node when non-master nodes's negotiate timer expired. Master node records these nodes in a bitmap which is used to do write timeout timer re-queue decision. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Gang He <ghe@suse.com> Cc: rwxybh <rwxybh@126.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27ocfs2: o2hb: add negotiate timerJunxiao Bi
This series of patches is to fix the issue that when storage down, all nodes will fence self due to write timeout. With this patch set, all nodes will keep going until storage back online, except if the following issue happens, then all nodes will do as before to fence self. 1. io error got 2. network between nodes down 3. nodes panic This patch (of 6): When storage down, all nodes will fence self due to write timeout. The negotiate timer is designed to avoid this, with it node will wait until storage up again. Negotiate timer working in the following way: 1. The timer expires before write timeout timer, its timeout is half of write timeout now. It is re-queued along with write timeout timer. If expires, it will send NEGO_TIMEOUT message to master node(node with lowest node number). This message does nothing but marks a bit in a bitmap recording which nodes are negotiating timeout on master node. 2. If storage down, nodes will send this message to master node, then when master node finds its bitmap including all online nodes, it sends NEGO_APPROVL message to all nodes one by one, this message will re-queue write timeout timer and negotiate timer. For any node doesn't receive this message or meets some issue when handling this message, it will be fenced. If storage up at any time, o2hb_thread will run and re-queue all the timer, nothing will be affected by these two steps. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Gang He <ghe@suse.com> Cc: rwxybh <rwxybh@126.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27switch xattr_handler->set() to passing dentry and inode separatelyAl Viro
preparation for similar switch in ->setxattr() (see the next commit for rationale). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-27ovl: Do d_type check only if work dir creation was successfulVivek Goyal
d_type check requires successful creation of workdir as iterates through work dir and expects work dir to be present in it. If that's not the case, this check will always return d_type not supported even if underlying filesystem might be supporting it. So don't do this check if work dir creation failed in previous step. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-05-27ovl: override creds with the ones from the superblock mounterAntonio Murdaca
In user namespace the whiteout creation fails with -EPERM because the current process isn't capable(CAP_SYS_ADMIN) when setting xattr. A simple reproducer: $ mkdir upper lower work merged lower/dir $ sudo mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged $ unshare -m -p -f -U -r bash Now as root in the user namespace: \# touch merged/dir/{1,2,3} # this will force a copy up of lower/dir \# rm -fR merged/* This ends up failing with -EPERM after the files in dir has been correctly deleted: unlinkat(4, "2", 0) = 0 unlinkat(4, "1", 0) = 0 unlinkat(4, "3", 0) = 0 close(4) = 0 unlinkat(AT_FDCWD, "merged/dir", AT_REMOVEDIR) = -1 EPERM (Operation not permitted) Interestingly, if you don't place files in merged/dir you can remove it, meaning if upper/dir does not exist, creating the char device file works properly in that same location. This patch uses ovl_sb_creator_cred() to get the cred struct from the superblock mounter and override the old cred with these new ones so that the whiteout creation is possible because overlay is wrong in assuming that the creds it will get with prepare_creds will be in the initial user namespace. The old cap_raise game is removed in favor of just overriding the old cred struct. This patch also drops from ovl_copy_up_one() the following two lines: override_cred->fsuid = stat->uid; override_cred->fsgid = stat->gid; This is because the correct uid and gid are taken directly with the stat struct and correctly set with ovl_set_attr(). Signed-off-by: Antonio Murdaca <runcom@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-05-26Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge fixes from Andrew Morton: "10 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: drivers/pinctrl/intel/pinctrl-baytrail.c: fix build with gcc-4.4 update "mm/zsmalloc: don't fail if can't create debugfs info" dma-debug: avoid spinlock recursion when disabling dma-debug mm: oom_reaper: remove some bloat memcg: fix mem_cgroup_out_of_memory() return value. ocfs2: fix improper handling of return errno mm: slub: remove unused virt_to_obj() mm: kasan: remove unused 'reserved' field from struct kasan_alloc_meta mm: make CONFIG_DEFERRED_STRUCT_PAGE_INIT depends on !FLATMEM explicitly seqlock: fix raw_read_seqcount_latch()
2016-05-26Merge tag 'dax-locking-for-4.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull DAX locking updates from Ross Zwisler: "Filesystem DAX locking for 4.7 - We use a bit in an exceptional radix tree entry as a lock bit and use it similarly to how page lock is used for normal faults. This fixes races between hole instantiation and read faults of the same index. - Filesystem DAX PMD faults are disabled, and will be re-enabled when PMD locking is implemented" * tag 'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: Remove i_mmap_lock protection dax: Use radix tree entry lock to protect cow faults dax: New fault locking dax: Allow DAX code to replace exceptional entries dax: Define DAX lock bit for radix tree exceptional entry dax: Make huge page handling depend of CONFIG_BROKEN dax: Fix condition for filling of PMD holes
2016-05-26Merge tag 'dax-misc-for-4.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull misc DAX updates from Vishal Verma: "DAX error handling for 4.7 - Until now, dax has been disabled if media errors were found on any device. This enables the use of DAX in the presence of these errors by making all sector-aligned zeroing go through the driver. - The driver (already) has the ability to clear errors on writes that are sent through the block layer using 'DSMs' defined in ACPI 6.1. Other misc changes: - When mounting DAX filesystems, check to make sure the partition is page aligned. This is a requirement for DAX, and previously, we allowed such unaligned mounts to succeed, but subsequent reads/writes would fail. - Misc/cleanup fixes from Jan that remove unused code from DAX related to zeroing, writeback, and some size checks" * tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: fix a comment in dax_zero_page_range and dax_truncate_page dax: for truncate/hole-punch, do zeroing through the driver if possible dax: export a low-level __dax_zero_page_range helper dax: use sb_issue_zerout instead of calling dax_clear_sectors dax: enable dax in the presence of known media errors (badblocks) dax: fallback from pmd to pte on error block: Update blkdev_dax_capable() for consistency xfs: Add alignment check for DAX mount ext2: Add alignment check for DAX mount ext4: Add alignment check for DAX mount block: Add bdev_dax_supported() for dax mount checks block: Add vfs_msg() interface dax: Remove redundant inode size checks dax: Remove pointless writeback from dax_do_io() dax: Remove zeroing from dax_io() dax: Remove dead zeroing code from fault handlers ext2: Avoid DAX zeroing to corrupt data ext2: Fix block zeroing in ext2_get_blocks() for DAX dax: Remove complete_unwritten argument DAX: move RADIX_DAX_ definitions to dax.c
2016-05-26ocfs2: fix improper handling of return errnoEric Ren
Previously, if a bad inode was found in ocfs2_iget(), -ESTALE was returned back to the caller anyway. Since commit d2b9d71a2da7 ("ocfs2: check/fix inode block for online file check") can handle with return value from ocfs2_read_locked_inode() now, we know the exact errno returned for us. Link: http://lkml.kernel.org/r/1463970656-18413-1-git-send-email-zren@suse.com Signed-off-by: Eric Ren <zren@suse.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "This changeset has a few main parts: - Ilya has finished a huge refactoring effort to sync up the client-side logic in libceph with the user-space client code, which has evolved significantly over the last couple years, with lots of additional behaviors (e.g., how requests are handled when cluster is full and transitions from full to non-full). This structure of the code is more closely aligned with userspace now such that it will be much easier to maintain going forward when behavior changes take place. There are some locking improvements bundled in as well. - Zheng adds multi-filesystem support (multiple namespaces within the same Ceph cluster) - Zheng has changed the readdir offsets and directory enumeration so that dentry offsets are hash-based and therefore stable across directory fragmentation events on the MDS. - Zheng has a smorgasbord of bug fixes across fs/ceph" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (71 commits) ceph: fix wake_up_session_cb() ceph: don't use truncate_pagecache() to invalidate read cache ceph: SetPageError() for writeback pages if writepages fails ceph: handle interrupted ceph_writepage() ceph: make ceph_update_writeable_page() uninterruptible libceph: make ceph_osdc_wait_request() uninterruptible ceph: handle -EAGAIN returned by ceph_update_writeable_page() ceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM ceph: block non-fatal signals for fault/page_mkwrite ceph: make logical calculation functions return bool ceph: tolerate bad i_size for symlink inode ceph: improve fragtree change detection ceph: keep leaf frag when updating fragtree ceph: fix dir_auth check in ceph_fill_dirfrag() ceph: don't assume frag tree splits in mds reply are sorted ceph: fix inode reference leak ceph: using hash value to compose dentry offset ceph: don't forbid marking directory complete after forward seek ceph: record 'offset' for each entry of readdir result ceph: define 'end/complete' in readdir reply as bit flags ...
2016-05-26Btrfs: fix handling of faults from btrfs_copy_from_userChris Mason
When btrfs_copy_from_user isn't able to copy all of the pages, we need to adjust our accounting to reflect the work that was actually done. Commit 2e78c927d79 changed around the decisions a little and we ended up skipping the accounting adjustments some of the time. This commit makes sure that when we don't copy anything at all, we still hop into the adjustments, and switches to release_bytes instead of write_bytes, since write_bytes isn't aligned. The accounting errors led to warnings during btrfs_destroy_inode: [ 70.847532] WARNING: CPU: 10 PID: 514 at fs/btrfs/inode.c:9350 btrfs_destroy_inode+0x2b3/0x2c0 [ 70.847536] Modules linked in: i2c_piix4 virtio_net i2c_core input_leds button led_class serio_raw acpi_cpufreq sch_fq_codel autofs4 virtio_blk [ 70.847538] CPU: 10 PID: 514 Comm: umount Tainted: G W 4.6.0-rc6_00062_g2997da1-dirty #23 [ 70.847539] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014 [ 70.847542] 0000000000000000 ffff880ff5cafab8 ffffffff8149d5e9 0000000000000202 [ 70.847543] 0000000000000000 0000000000000000 0000000000000000 ffff880ff5cafb08 [ 70.847547] ffffffff8107bdfd ffff880ff5cafaf8 000024868120013d ffff880ff5cafb28 [ 70.847547] Call Trace: [ 70.847550] [<ffffffff8149d5e9>] dump_stack+0x51/0x78 [ 70.847551] [<ffffffff8107bdfd>] __warn+0xfd/0x120 [ 70.847553] [<ffffffff8107be3d>] warn_slowpath_null+0x1d/0x20 [ 70.847555] [<ffffffff8139c9e3>] btrfs_destroy_inode+0x2b3/0x2c0 [ 70.847556] [<ffffffff812003a1>] ? __destroy_inode+0x71/0x140 [ 70.847558] [<ffffffff812004b3>] destroy_inode+0x43/0x70 [ 70.847559] [<ffffffff810b7b5f>] ? wake_up_bit+0x2f/0x40 [ 70.847560] [<ffffffff81200c68>] evict+0x148/0x1d0 [ 70.847562] [<ffffffff81398ade>] ? start_transaction+0x3de/0x460 [ 70.847564] [<ffffffff81200d49>] dispose_list+0x59/0x80 [ 70.847565] [<ffffffff81201ba0>] evict_inodes+0x180/0x190 [ 70.847566] [<ffffffff812191ff>] ? __sync_filesystem+0x3f/0x50 [ 70.847568] [<ffffffff811e95f8>] generic_shutdown_super+0x48/0x100 [ 70.847569] [<ffffffff810b75c0>] ? woken_wake_function+0x20/0x20 [ 70.847571] [<ffffffff811e9796>] kill_anon_super+0x16/0x30 [ 70.847573] [<ffffffff81365cde>] btrfs_kill_super+0x1e/0x130 [ 70.847574] [<ffffffff811e99be>] deactivate_locked_super+0x4e/0x90 [ 70.847576] [<ffffffff811e9e61>] deactivate_super+0x51/0x70 [ 70.847577] [<ffffffff8120536f>] cleanup_mnt+0x3f/0x80 [ 70.847579] [<ffffffff81205402>] __cleanup_mnt+0x12/0x20 [ 70.847581] [<ffffffff81098358>] task_work_run+0x68/0xa0 [ 70.847582] [<ffffffff810022b6>] exit_to_usermode_loop+0xd6/0xe0 [ 70.847583] [<ffffffff81002e1d>] do_syscall_64+0xbd/0x170 [ 70.847586] [<ffffffff817d4dbc>] entry_SYSCALL64_slow_path+0x25/0x25 This is the test program I used to force short returns from btrfs_copy_from_user void *dontneed(void *arg) { char *p = arg; int ret; while(1) { ret = madvise(p, BUFSIZE/4, MADV_DONTNEED); if (ret) { perror("madvise"); exit(1); } } } int main(int ac, char **av) { int ret; int fd; char *filename; unsigned long offset; char *buf; int i; pthread_t tid; if (ac != 2) { fprintf(stderr, "usage: dammitdave filename\n"); exit(1); } buf = mmap(NULL, BUFSIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); if (buf == MAP_FAILED) { perror("mmap"); exit(1); } memset(buf, 'a', BUFSIZE); filename = av[1]; ret = pthread_create(&tid, NULL, dontneed, buf); if (ret) { fprintf(stderr, "error %d from pthread_create\n", ret); exit(1); } ret = pthread_detach(tid); if (ret) { fprintf(stderr, "pthread detach failed %d\n", ret); exit(1); } while (1) { fd = open(filename, O_RDWR | O_CREAT, 0600); if (fd < 0) { perror("open"); exit(1); } for (i = 0; i < ROUNDS; i++) { int this_write = BUFSIZE; offset = rand() % MAXSIZE; ret = pwrite(fd, buf, this_write, offset); if (ret < 0) { perror("pwrite"); exit(1); } else if (ret != this_write) { fprintf(stderr, "short write to %s offset %lu ret %d\n", filename, offset, ret); exit(1); } if (i == ROUNDS - 1) { ret = sync_file_range(fd, offset, 4096, SYNC_FILE_RANGE_WRITE); if (ret < 0) { perror("sync_file_range"); exit(1); } } } ret = ftruncate(fd, 0); if (ret < 0) { perror("ftruncate"); exit(1); } ret = close(fd); if (ret) { perror("close"); exit(1); } ret = unlink(filename); if (ret) { perror("unlink"); exit(1); } } return 0; } Signed-off-by: Chris Mason <clm@fb.com> Reported-by: Dave Jones <dsj@fb.com> Fixes: 2e78c927d79333f299a8ac81c2fd2952caeef335 cc: stable@vger.kernel.org # v4.6 Signed-off-by: Chris Mason <clm@fb.com>
2016-05-26Merge tag 'nfs-for-4.7-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client updates from Anna Schumaker: "Highlights include: Features: - Add support for the NFS v4.2 COPY operation - Add support for NFS/RDMA over IPv6 Bugfixes and cleanups: - Avoid race that crashes nfs_init_commit() - Fix oops in callback path - Fix LOCK/OPEN race when unlinking an open file - Choose correct stateids when using delegations in setattr, read and write - Don't send empty SETATTR after OPEN_CREATE - xprtrdma: Prevent server from writing a reply into memory client has released - xprtrdma: Support using Read list and Reply chunk in one RPC call" * tag 'nfs-for-4.7-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (61 commits) pnfs: pnfs_update_layout needs to consider if strict iomode checking is on nfs/flexfiles: Use the layout segment for reading unless it a IOMODE_RW and reading is disabled nfs/flexfiles: Helper function to detect FF_FLAGS_NO_READ_IO nfs: avoid race that crashes nfs_init_commit NFS: checking for NULL instead of IS_ERR() in nfs_commit_file() pnfs: make pnfs_layout_process more robust pnfs: rework LAYOUTGET retry handling pnfs: lift retry logic from send_layoutget to pnfs_update_layout pnfs: fix bad error handling in send_layoutget flexfiles: add kerneldoc header to nfs4_ff_layout_prepare_ds flexfiles: remove pointless setting of NFS_LAYOUT_RETURN_REQUESTED pnfs: only tear down lsegs that precede seqid in LAYOUTRETURN args pnfs: keep track of the return sequence number in pnfs_layout_hdr pnfs: record sequence in pnfs_layout_segment when it's created pnfs: don't merge new ff lsegs with ones that have LAYOUTRETURN bit set pNFS/flexfiles: When initing reads or writes, we might have to retry connecting to DSes pNFS/flexfiles: When checking for available DSes, conditionally check for MDS io pNFS/flexfile: Fix erroneous fall back to read/write through the MDS NFS: Reclaim writes via writepage are opportunistic NFSv4: Use the right stateid for delegations in setattr, read and write ...
2016-05-26Merge tag 'xfs-for-linus-4.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs updates from Dave Chinner: "A pretty average collection of fixes, cleanups and improvements in this request. Summary: - fixes for mount line parsing, sparse warnings, read-only compat feature remount behaviour - allow fast path symlink lookups for inline symlinks. - attribute listing cleanups - writeback goes direct to bios rather than indirecting through bufferheads - transaction allocation cleanup - optimised kmem_realloc - added configurable error handling for metadata write errors, changed default error handling behaviour from "retry forever" to "retry until unmount then fail" - fixed several inode cluster writeback lookup vs reclaim race conditions - fixed inode cluster writeback checking wrong inode after lookup - fixed bugs where struct xfs_inode freeing wasn't actually RCU safe - cleaned up inode reclaim tagging" * tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (39 commits) xfs: fix warning in xfs_finish_page_writeback for non-debug builds xfs: move reclaim tagging functions xfs: simplify inode reclaim tagging interfaces xfs: rename variables in xfs_iflush_cluster for clarity xfs: xfs_iflush_cluster has range issues xfs: mark reclaimed inodes invalid earlier xfs: xfs_inode_free() isn't RCU safe xfs: optimise xfs_iext_destroy xfs: skip stale inodes in xfs_iflush_cluster xfs: fix inode validity check in xfs_iflush_cluster xfs: xfs_iflush_cluster fails to abort on error xfs: remove xfs_fs_evict_inode() xfs: add "fail at unmount" error handling configuration xfs: add configuration handlers for specific errors xfs: add configuration of error failure speed xfs: introduce table-based init for error behaviors xfs: add configurable error support to metadata buffers xfs: introduce metadata IO error class xfs: configurable error behavior via sysfs xfs: buffer ->bi_end_io function requires irq-safe lock ...
2016-05-26pnfs: pnfs_update_layout needs to consider if strict iomode checking is onTom Haynes
As flexfiles has FF_FLAGS_NO_READ_IO, there is a need to generically support enforcing that a IOMODE_RW segment will not allow READ I/O. Signed-off-by: Tom Haynes <loghyr@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-26nfs/flexfiles: Use the layout segment for reading unless it a IOMODE_RW and ↵Tom Haynes
reading is disabled Signed-off-by: Tom Haynes <loghyr@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-26restore killability of old mutex_lock_killable(&inode->i_mutex) usersAl Viro
The ones that are taking it exclusive, that is... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-26ceph: fix wake_up_session_cb()Yan, Zheng
We should reset i_requested_max_size before waking the waiters. (zero i_requested_max_size make waiter re-request the max size) Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26ceph: don't use truncate_pagecache() to invalidate read cacheYan, Zheng
truncate_pagecache() drops dirty pages, it's dangerous to use it to invalidate read cache. Besides, we shouldn't start invalidating read cache while there are buffer writers. Because buffer writers may add dirty pages later. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26ceph: SetPageError() for writeback pages if writepages failsYan, Zheng
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26ceph: handle interrupted ceph_writepage()Yan, Zheng
writepage() can be interrupted when it's called by direct memory reclaimer (the direct memory relaimer is killed). To avoid lossing data, we redirty the page. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26ceph: make ceph_update_writeable_page() uninterruptibleYan, Zheng
ceph_update_writeable_page() is used by ceph_write_begin(). It beaks atomicity of write operation if it's interruptible. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26ceph: handle -EAGAIN returned by ceph_update_writeable_page()Yan, Zheng
when ceph_update_writeable_page() return -EAGAIN, caller should lock the page and call ceph_update_writeable_page() again. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26ceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEMYan, Zheng
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26ceph: block non-fatal signals for fault/page_mkwriteYan, Zheng
Fault and page_mkwrite are supposed to be uninterruptable. But they call ceph functions that are interruptible. So they should block signals before calling functions that are interruptible Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26ceph: make logical calculation functions return boolZhang Zhuoyu
This patch makes serverl logical caculation functions return bool to improve readability due to these particular functions only using 0/1 as their return value. No functional change. Signed-off-by: Zhang Zhuoyu <zhangzhuoyu@cmss.chinamobile.com>
2016-05-26ceph: tolerate bad i_size for symlink inodeYan, Zheng
A mds bug can cause symlink's size to be truncated to zero. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26ceph: improve fragtree change detectionYan, Zheng
check if number of splits in i_fragtree is equal to number of splits in mds reply Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26ceph: keep leaf frag when updating fragtreeYan, Zheng
Nodes in i_fragtree are sorted according to ceph_compare_frag(). It means frag node in i_fragtree always follow its direct parent node. To check if a leaf node is valid, we just need to check if it's child of previous split node. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26ceph: fix dir_auth check in ceph_fill_dirfrag()Yan, Zheng
-1 is CDIR_AUTH_PARENT, it means dir's auth mds is the same as inode's auth mds Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26ceph: don't assume frag tree splits in mds reply are sortedYan, Zheng
The algorithm that updates i_fragtree relies on that the frag tree splits in mds reply are of the same order of i_fragtree. This is not true because current MDS encodes frag tree splits in ascending order of (unsigned)frag_t. But nodes in i_fragtree are sorted according to ceph_frag_compare(). The fix is sort the frag tree splits first, then updates i_fragtree. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26ceph: fix inode reference leakYan, Zheng
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26ceph: using hash value to compose dentry offsetYan, Zheng
If MDS sorts dentries in dirfrag in hash order, we use hash value to compose dentry offset. dentry offset is: (0xff << 52) | ((24 bits hash) << 28) | (the nth entry hash hash collision) This offset is stable across directory fragmentation. This alos means there is no need to reset readdir offset if directory get fragmented in the middle of readdir. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26ceph: don't forbid marking directory complete after forward seekYan, Zheng
Forward seek within same frag does not update fi->last_name, it will not affect contents of later readdir reply. So there is no need to forbid marking directory complete Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26ceph: record 'offset' for each entry of readdir resultYan, Zheng
This is preparation for using hash value as dentry 'offset' Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26ceph: define 'end/complete' in readdir reply as bit flagsYan, Zheng
Set a flag in readdir request, which indicates that client interprets 'end/complete' as bit flags. So that mds can reply additional flags in readdir reply. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26ceph: define struct for dir entry in readdir replyYan, Zheng
This avoids defining multiple arrays for entries in readdir reply Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26ceph: simplify 'offset in frag'Yan, Zheng
don't distinguish leftmost frag from other frags. always use 2 as first entry's offset. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26ceph: remove unnecessary checks in __dcache_readdirYan, Zheng
we never add snapdir and the hidden .ceph dir into readdir cache Signed-off-by: Yan, Zheng <zyan@redhat.com>