summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2015-07-31Merge branch 'for-linus-4.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Filipe fixed up a hard to trigger ENOSPC regression from our merge window pull, and we have a few other smaller fixes" * 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix quick exhaustion of the system array in the superblock btrfs: its btrfs_err() instead of btrfs_error() btrfs: Avoid NULL pointer dereference of free_extent_buffer when read_tree_block() fail btrfs: Fix lockdep warning of btrfs_run_delayed_iputs()
2015-07-30Merge tag 'xfs-for-linus-4.2-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs fixes from Dave Chinner: "There are a couple of recently found, long standing remote attribute corruption fixes caused by log recovery getting confused after a crash, and the new DAX code in XFS (merged in 4.2-rc1) needs to actually use the DAX fault path on read faults. Summary: - remote attribute log recovery corruption fixes - DAX page faults need to use direct mappings, not a page cache mapping" * tag 'xfs-for-linus-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: xfs: remote attributes need to be considered data xfs: remote attribute headers contain an invalid LSN xfs: call dax_fault on read page faults for DAX
2015-07-29xfs: remote attributes need to be considered dataDave Chinner
We don't log remote attribute contents, and instead write them synchronously before we commit the block allocation and attribute tree update transaction. As a result we are writing to the allocated space before the allcoation has been made permanent. As a result, we cannot consider this allocation to be a metadata allocation. Metadata allocation can take blocks from the free list and so reuse them before the transaction that freed the block is committed to disk. This behaviour is perfectly fine for journalled metadata changes as log recovery will ensure the free operation is replayed before the overwrite, but for remote attribute writes this is not the case. Hence we have to consider the remote attribute blocks to contain data and allocate accordingly. We do this by dropping the XFS_BMAPI_METADATA flag from the block allocation. This means the allocation will not use blocks that are on the busy list without first ensuring that the freeing transaction has been committed to disk and the blocks removed from the busy list. This ensures we will never overwrite a freed block without first ensuring that it is really free. cc: <stable@vger.kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-07-29xfs: remote attribute headers contain an invalid LSNDave Chinner
In recent testing, a system that crashed failed log recovery on restart with a bad symlink buffer magic number: XFS (vda): Starting recovery (logdev: internal) XFS (vda): Bad symlink block magic! XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060 On examination of the log via xfs_logprint, none of the symlink buffers in the log had a bad magic number, nor were any other types of buffer log format headers mis-identified as symlink buffers. Tracing was used to find the buffer the kernel was tripping over, and xfs_db identified it's contents as: 000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e 020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d 040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d ..... This is a remote attribute buffer, which are notable in that they are not logged but are instead written synchronously by the remote attribute code so that they exist on disk before the attribute transactions are committed to the journal. The above remote attribute block has an invalid LSN in it - cycle 0xd002000, block 0 - which means when log recovery comes along to determine if the transaction that writes to the underlying block should be replayed, it sees a block that has a future LSN and so does not replay the buffer data in the transaction. Instead, it validates the buffer magic number and attaches the buffer verifier to it. It is this buffer magic number check that is failing in the above assert, indicating that we skipped replay due to the LSN of the underlying buffer. The problem here is that the remote attribute buffers cannot have a valid LSN placed into them, because the transaction that contains the attribute tree pointer changes and the block allocation that the attribute data is being written to hasn't yet been committed. Hence the LSN field in the attribute block is completely unwritten, thereby leaving the underlying contents of the block in the LSN field. It could have any value, and hence a future overwrite of the block by log recovery may or may not work correctly. Fix this by always writing an invalid LSN to the remote attribute block, as any buffer in log recovery that needs to write over the remote attribute should occur. We are protected from having old data written over the attribute by the fact that freeing the block before the remote attribute is written will result in the buffer being marked stale in the log and so all changes prior to the buffer stale transaction will be cancelled by log recovery. Hence it is safe to ignore the LSN in the case or synchronously written, unlogged metadata such as remote attribute blocks, and to ensure we do that correctly, we need to write an invalid LSN to all remote attribute blocks to trigger immediate recovery of metadata that is written over the top. As a further protection for filesystems that may already have remote attribute blocks with bad LSNs on disk, change the log recovery code to always trigger immediate recovery of metadata over remote attribute blocks. cc: <stable@vger.kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-07-29xfs: call dax_fault on read page faults for DAXDave Chinner
When modifying the patch series to handle the XFS MMAP_LOCK nesting of page faults, I botched the conversion of the read page fault path, and so it is only every calling through the page cache. Re-add the necessary __dax_fault() call for such files. Because the get_blocks callback on read faults may not set up the mapping buffer correctly to allow unwritten extent completion to be run, we need to allow callers of __dax_fault() to pass a null complete_unwritten() callback. The DAX code always zeros the unwritten page when it is read faulted so there are no stale data exposure issues with not doing the conversion. The only downside will be the potential for increased CPU overhead on repeated read faults of the same page. If this proves to be a problem, then the filesystem needs to fix it's get_block callback and provide a convert_unwritten() callback to the read fault path. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Matthew Wilcox <willy@linux.intel.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-07-28Merge tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Trond Myklebust: "Highlights include: Stable patches: - Fix a situation where the client uses the wrong (zero) stateid. - Fix a memory leak in nfs_do_recoalesce Bugfixes: - Plug a memory leak when ->prepare_layoutcommit fails - Fix an Oops in the NFSv4 open code - Fix a backchannel deadlock - Fix a livelock in sunrpc when sendmsg fails due to low memory availability - Don't revalidate the mapping if both size and change attr are up to date - Ensure we don't miss a file extension when doing pNFS - Several fixes to handle NFSv4.1 sequence operation status bits correctly - Several pNFS layout return bugfixes" * tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits) nfs: Fix an oops caused by using other thread's stack space in ASYNC mode nfs: plug memory leak when ->prepare_layoutcommit fails SUNRPC: Report TCP errors to the caller sunrpc: translate -EAGAIN to -ENOBUFS when socket is writable. NFSv4.2: handle NFS-specific llseek errors NFS: Don't clear desc->pg_moreio in nfs_do_recoalesce() NFS: Fix a memory leak in nfs_do_recoalesce NFS: nfs_mark_for_revalidate should always set NFS_INO_REVAL_PAGECACHE NFS: Remove the "NFS_CAP_CHANGE_ATTR" capability NFS: Set NFS_INO_REVAL_PAGECACHE if the change attribute is uninitialised NFS: Don't revalidate the mapping if both size and change attr are up to date NFSv4/pnfs: Ensure we don't miss a file extension NFSv4: We must set NFS_OPEN_STATE flag in nfs_resync_open_stateid_locked SUNRPC: xprt_complete_bc_request must also decrement the free slot count SUNRPC: Fix a backchannel deadlock pNFS: Don't throw out valid layout segments pNFS: pnfs_roc_drain() fix a race with open pNFS: Fix races between return-on-close and layoutreturn. pNFS: pnfs_roc_drain should return 'true' when sleeping pNFS: Layoutreturn must invalidate all existing layout segments. ...
2015-07-28nfs: Fix an oops caused by using other thread's stack space in ASYNC modeKinglong Mee
An oops caused by using other thread's stack space in sunrpc ASYNC sending thread. [ 9839.007187] ------------[ cut here ]------------ [ 9839.007923] kernel BUG at fs/nfs/nfs4xdr.c:910! [ 9839.008069] invalid opcode: 0000 [#1] SMP [ 9839.008069] Modules linked in: blocklayoutdriver rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm joydev iosf_mbi crct10dif_pclmul snd_timer crc32_pclmul crc32c_intel ghash_clmulni_intel snd soundcore ppdev pvpanic parport_pc i2c_piix4 serio_raw virtio_balloon parport acpi_cpufreq nfsd nfs_acl lockd grace auth_rpcgss sunrpc qxl drm_kms_helper virtio_net virtio_console virtio_blk ttm drm virtio_pci virtio_ring virtio ata_generic pata_acpi [ 9839.008069] CPU: 0 PID: 308 Comm: kworker/0:1H Not tainted 4.0.0-0.rc4.git1.3.fc23.x86_64 #1 [ 9839.008069] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 9839.008069] Workqueue: rpciod rpc_async_schedule [sunrpc] [ 9839.008069] task: ffff8800d8b4d8e0 ti: ffff880036678000 task.ti: ffff880036678000 [ 9839.008069] RIP: 0010:[<ffffffffa0339cc9>] [<ffffffffa0339cc9>] reserve_space.part.73+0x9/0x10 [nfsv4] [ 9839.008069] RSP: 0018:ffff88003667ba58 EFLAGS: 00010246 [ 9839.008069] RAX: 0000000000000000 RBX: 000000001fc15e18 RCX: ffff8800c0193800 [ 9839.008069] RDX: ffff8800e4ae3f24 RSI: 000000001fc15e2c RDI: ffff88003667bcd0 [ 9839.008069] RBP: ffff88003667ba58 R08: ffff8800d9173008 R09: 0000000000000003 [ 9839.008069] R10: ffff88003667bcd0 R11: 000000000000000c R12: 0000000000010000 [ 9839.008069] R13: ffff8800d9173350 R14: 0000000000000000 R15: ffff8800c0067b98 [ 9839.008069] FS: 0000000000000000(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000 [ 9839.008069] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 9839.008069] CR2: 00007f988c9c8bb0 CR3: 00000000d99b6000 CR4: 00000000000407f0 [ 9839.008069] Stack: [ 9839.008069] ffff88003667bbc8 ffffffffa03412c5 00000000c6c55680 ffff880000000003 [ 9839.008069] 0000000000000088 00000010c6c55680 0001000000000002 ffffffff816e87e9 [ 9839.008069] 0000000000000000 00000000477290e2 ffff88003667bab8 ffffffff81327ba3 [ 9839.008069] Call Trace: [ 9839.008069] [<ffffffffa03412c5>] encode_attrs+0x435/0x530 [nfsv4] [ 9839.008069] [<ffffffff816e87e9>] ? inet_sendmsg+0x69/0xb0 [ 9839.008069] [<ffffffff81327ba3>] ? selinux_socket_sendmsg+0x23/0x30 [ 9839.008069] [<ffffffff8164c1df>] ? do_sock_sendmsg+0x9f/0xc0 [ 9839.008069] [<ffffffff8164c278>] ? kernel_sendmsg+0x58/0x70 [ 9839.008069] [<ffffffffa011acc0>] ? xdr_reserve_space+0x20/0x170 [sunrpc] [ 9839.008069] [<ffffffffa011acc0>] ? xdr_reserve_space+0x20/0x170 [sunrpc] [ 9839.008069] [<ffffffffa0341b40>] ? nfs4_xdr_enc_open_noattr+0x130/0x130 [nfsv4] [ 9839.008069] [<ffffffffa03419a5>] encode_open+0x2d5/0x340 [nfsv4] [ 9839.008069] [<ffffffffa0341b40>] ? nfs4_xdr_enc_open_noattr+0x130/0x130 [nfsv4] [ 9839.008069] [<ffffffffa011ab89>] ? xdr_encode_opaque+0x19/0x20 [sunrpc] [ 9839.008069] [<ffffffffa0339cfb>] ? encode_string+0x2b/0x40 [nfsv4] [ 9839.008069] [<ffffffffa0341bf3>] nfs4_xdr_enc_open+0xb3/0x140 [nfsv4] [ 9839.008069] [<ffffffffa0110a4c>] rpcauth_wrap_req+0xac/0xf0 [sunrpc] [ 9839.008069] [<ffffffffa01017db>] call_transmit+0x18b/0x2d0 [sunrpc] [ 9839.008069] [<ffffffffa0101650>] ? call_decode+0x860/0x860 [sunrpc] [ 9839.008069] [<ffffffffa0101650>] ? call_decode+0x860/0x860 [sunrpc] [ 9839.008069] [<ffffffffa010caa0>] __rpc_execute+0x90/0x460 [sunrpc] [ 9839.008069] [<ffffffffa010ce85>] rpc_async_schedule+0x15/0x20 [sunrpc] [ 9839.008069] [<ffffffff810b452b>] process_one_work+0x1bb/0x410 [ 9839.008069] [<ffffffff810b47d3>] worker_thread+0x53/0x470 [ 9839.008069] [<ffffffff810b4780>] ? process_one_work+0x410/0x410 [ 9839.008069] [<ffffffff810b4780>] ? process_one_work+0x410/0x410 [ 9839.008069] [<ffffffff810ba7b8>] kthread+0xd8/0xf0 [ 9839.008069] [<ffffffff810ba6e0>] ? kthread_worker_fn+0x180/0x180 [ 9839.008069] [<ffffffff81786418>] ret_from_fork+0x58/0x90 [ 9839.008069] [<ffffffff810ba6e0>] ? kthread_worker_fn+0x180/0x180 [ 9839.008069] Code: 00 00 48 c7 c7 21 fa 37 a0 e8 94 1c d6 e0 c6 05 d2 17 05 00 01 8b 03 eb d7 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 <0f> 0b 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 89 f3 [ 9839.008069] RIP [<ffffffffa0339cc9>] reserve_space.part.73+0x9/0x10 [nfsv4] [ 9839.008069] RSP <ffff88003667ba58> [ 9839.071114] ---[ end trace cc14c03adb522e94 ]--- Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-28nfs: plug memory leak when ->prepare_layoutcommit failsJeff Layton
"data" is currently leaked when the prepare_layoutcommit operation returns an error. Put the cred before taking the spinlock in that case, take the lock and then goto out_unlock which will drop the lock and then free "data". Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-27NFSv4.2: handle NFS-specific llseek errorsJ. Bruce Fields
Handle NFS-specific llseek errors instead of letting them leak out to userspace. Reported-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-27NFS: Don't clear desc->pg_moreio in nfs_do_recoalesce()Trond Myklebust
Recoalescing does not affect whether or not we've already sent off I/O, and doing so means that we end up sending a bunch of synchronous for cases where we actually need to be using unstable writes. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-27NFS: Fix a memory leak in nfs_do_recoalesceTrond Myklebust
If the function exits early, then we must put those requests that were not processed back onto the &mirror->pg_list so they can be cleaned up by nfs_pgio_error(). Fixes: a7d42ddb30997 ("nfs: add mirroring support to pgio layer") Cc: stable@vger.kernel.org # v4.0+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-25f2fs: call set_page_dirty to attach i_wb for cgroupJaegeuk Kim
The cgroup attaches inode->i_wb via mark_inode_dirty and when set_page_writeback is called, __inc_wb_stat() updates i_wb's stat. So, we need to explicitly call set_page_dirty->__mark_inode_dirty in prior to any writebacking pages. This patch should resolve the following kernel panic reported by Andreas Reis. https://bugzilla.kernel.org/show_bug.cgi?id=101801 --- Comment #2 from Andreas Reis <andreas.reis@gmail.com> --- BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8 IP: [<ffffffff8149deea>] __percpu_counter_add+0x1a/0x90 PGD 2951ff067 PUD 2df43f067 PMD 0 Oops: 0000 [#1] PREEMPT SMP Modules linked in: CPU: 7 PID: 10356 Comm: gcc Tainted: G W 4.2.0-1-cu #1 Hardware name: Gigabyte Technology Co., Ltd. G1.Sniper M5/G1.Sniper M5, BIOS T01 02/03/2015 task: ffff880295044f80 ti: ffff880295140000 task.ti: ffff880295140000 RIP: 0010:[<ffffffff8149deea>] [<ffffffff8149deea>] __percpu_counter_add+0x1a/0x90 RSP: 0018:ffff880295143ac8 EFLAGS: 00010082 RAX: 0000000000000003 RBX: ffffea000a526d40 RCX: 0000000000000001 RDX: 0000000000000020 RSI: 0000000000000001 RDI: 0000000000000088 RBP: ffff880295143ae8 R08: 0000000000000000 R09: ffff88008f69bb30 R10: 00000000fffffffa R11: 0000000000000000 R12: 0000000000000088 R13: 0000000000000001 R14: ffff88041d099000 R15: ffff880084a205d0 FS: 00007f8549374700(0000) GS:ffff88042f3c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000a8 CR3: 000000033e1d5000 CR4: 00000000001406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Stack: 0000000000000000 ffffea000a526d40 ffff880084a20738 ffff880084a20750 ffff880295143b48 ffffffff811cc91e ffff880000000000 0000000000000296 0000000000000000 ffff880417090198 0000000000000000 ffffea000a526d40 Call Trace: [<ffffffff811cc91e>] __test_set_page_writeback+0xde/0x1d0 [<ffffffff813fee87>] do_write_data_page+0xe7/0x3a0 [<ffffffff813faeea>] gc_data_segment+0x5aa/0x640 [<ffffffff813fb0b8>] do_garbage_collect+0x138/0x150 [<ffffffff813fb3fe>] f2fs_gc+0x1be/0x3e0 [<ffffffff81405541>] f2fs_balance_fs+0x81/0x90 [<ffffffff813ee357>] f2fs_unlink+0x47/0x1d0 [<ffffffff81239329>] vfs_unlink+0x109/0x1b0 [<ffffffff8123e3d7>] do_unlinkat+0x287/0x2c0 [<ffffffff8123ebc6>] SyS_unlink+0x16/0x20 [<ffffffff81942e2e>] entry_SYSCALL_64_fastpath+0x12/0x71 Code: 41 5e 5d c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 55 49 89 f5 41 54 49 89 fc 53 48 83 ec 08 65 ff 05 e6 d9 b6 7e <48> 8b 47 20 48 63 ca 65 8b 18 48 63 db 48 01 f3 48 39 cb 7d 0a RIP [<ffffffff8149deea>] __percpu_counter_add+0x1a/0x90 RSP <ffff880295143ac8> CR2: 00000000000000a8 ---[ end trace 5132449a58ed93a3 ]--- note: gcc[10356] exited with preempt_count 2 Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-07-25f2fs: handle error cases in move_encrypted_blockJaegeuk Kim
This patch fixes some missing error handlers. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-07-24Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "Four smaller fixes for the current series. This contains: - A fix for clones of discard bio's, that can cause data corruption. From Martin. - A fix for null_blk, where in certain queue modes it could access a request after it had been freed. From Mike Krinkin. - An error handling leak fix for blkcg, from Tejun. - Also from Tejun, export of the functions that a file system needs to implement cgroup writeback support" * 'for-linus' of git://git.kernel.dk/linux-block: block: Do a full clone when splitting discard bios block: export bio_associate_*() and wbc_account_io() blkcg: fix gendisk reference leak in blkg_conf_prep() null_blk: fix use-after-free problem
2015-07-23Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace fixes from Eric Biederman: "While reading through the code of detach_mounts I realized the code was slightly off. Testing it revealed two buggy corner cases that can send the code of detach_mounts into an infinite loop. Fixing the code to do the right thing removes the possibility of these user triggered infinite loops in the code" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: mnt: In detach_mounts detach the appropriate unmounted mount mnt: Clarify and correct the disconnect logic in umount_tree
2015-07-23block: export bio_associate_*() and wbc_account_io()Tejun Heo
bio_associate_blkcg(), bio_associate_current() and wbc_account_io() are used to implement cgroup writeback support for filesystems and thus need to be exported. Export them. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-23mnt: In detach_mounts detach the appropriate unmounted mountEric W. Biederman
The handling of in detach_mounts of unmounted but connected mounts is buggy and can lead to an infinite loop. Correct the handling of unmounted mounts in detach_mount. When the mountpoint of an unmounted but connected mount is connected to a dentry, and that dentry is deleted we need to disconnect that mount from the parent mount and the deleted dentry. Nothing changes for the unmounted and connected children. They can be safely ignored. Cc: stable@vger.kernel.org Fixes: ce07d891a0891d3c0d0c2d73d577490486b809e1 mnt: Honor MNT_LOCKED when detaching mounts Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-22mnt: Clarify and correct the disconnect logic in umount_treeEric W. Biederman
rmdir mntpoint will result in an infinite loop when there is a mount locked on the mountpoint in another mount namespace. This is because the logic to test to see if a mount should be disconnected in umount_tree is buggy. Move the logic to decide if a mount should remain connected to it's mountpoint into it's own function disconnect_mount so that clarity of expression instead of terseness of expression becomes a virtue. When the conditions where it is invalid to leave a mount connected are first ruled out, the logic for deciding if a mount should be disconnected becomes much clearer and simpler. Fixes: e0c9c0afd2fc958ffa34b697972721d81df8a56f mnt: Update detach_mounts to leave mounts connected Fixes: ce07d891a0891d3c0d0c2d73d577490486b809e1 mnt: Honor MNT_LOCKED when detaching mounts Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-22Btrfs: fix quick exhaustion of the system array in the superblockFilipe Manana
Omar reported that after commit 4fbcdf669454 ("Btrfs: fix -ENOSPC when finishing block group creation"), introduced in 4.2-rc1, the following test was failing due to exhaustion of the system array in the superblock: #!/bin/bash truncate -s 100T big.img mkfs.btrfs big.img mount -o loop big.img /mnt/loop num=5 sz=10T for ((i = 0; i < $num; i++)); do echo fallocate $i $sz fallocate -l $sz /mnt/loop/testfile$i done btrfs filesystem sync /mnt/loop for ((i = 0; i < $num; i++)); do echo rm $i rm /mnt/loop/testfile$i btrfs filesystem sync /mnt/loop done umount /mnt/loop This made btrfs_add_system_chunk() fail with -EFBIG due to excessive allocation of system block groups. This happened because the test creates a large number of data block groups per transaction and when committing the transaction we start the writeout of the block group caches for all the new new (dirty) block groups, which results in pre-allocating space for each block group's free space cache using the same transaction handle. That in turn often leads to creation of more block groups, and all get attached to the new_bgs list of the same transaction handle to the point of getting a list with over 1500 elements, and creation of new block groups leads to the need of reserving space in the chunk block reserve and often creating a new system block group too. So that made us quickly exhaust the chunk block reserve/system space info, because as of the commit mentioned before, we do reserve space for each new block group in the chunk block reserve, unlike before where we would not and would at most allocate one new system block group and therefore would only ensure that there was enough space in the system space info to allocate 1 new block group even if we ended up allocating thousands of new block groups using the same transaction handle. That worked most of the time because the computed required space at check_system_chunk() is very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and that all nodes/leafs in a path will be COWed and split) and since the updates to the chunk tree all happen at btrfs_create_pending_block_groups it is unlikely that a path needs to be COWed more than once (unless writepages() for the btree inode is called by mm in between) and that compensated for the need of creating any new nodes/leads in the chunk tree. So fix this by ensuring we don't accumulate a too large list of new block groups in a transaction's handles new_bgs list, inserting/updating the chunk tree for all accumulated new block groups and releasing the unused space from the chunk block reserve whenever the list becomes sufficiently large. This is a generic solution even though the problem currently can only happen when starting the writeout of the free space caches for all dirty block groups (btrfs_start_dirty_block_groups()). Reported-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-07-22btrfs: its btrfs_err() instead of btrfs_error()Anand Jain
sorry I indented to use btrfs_err() and I have no idea how btrfs_error() got there. infact I was thinking about these kind of oversights since these two func are too closely named. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-07-22btrfs: Avoid NULL pointer dereference of free_extent_buffer when ↵Zhao Lei
read_tree_block() fail When read_tree_block() failed, we can see following dmesg: [ 134.371389] BUG: unable to handle kernel NULL pointer dereference at 0000000000000063 [ 134.372236] IP: [<ffffffff813a4a51>] free_extent_buffer+0x21/0x90 [ 134.372236] PGD 0 [ 134.372236] Oops: 0000 [#1] SMP [ 134.372236] Modules linked in: [ 134.372236] CPU: 0 PID: 2289 Comm: mount Not tainted 4.2.0-rc1_HEAD_c65b99f046843d2455aa231747b5a07a999a9f3d_+ #115 [ 134.372236] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014 [ 134.372236] task: ffff88003b6e1a00 ti: ffff880011e60000 task.ti: ffff880011e60000 [ 134.372236] RIP: 0010:[<ffffffff813a4a51>] [<ffffffff813a4a51>] free_extent_buffer+0x21/0x90 ... [ 134.372236] Call Trace: [ 134.372236] [<ffffffff81379aa1>] free_root_extent_buffers+0x91/0xb0 [ 134.372236] [<ffffffff81379c3d>] free_root_pointers+0x17d/0x190 [ 134.372236] [<ffffffff813801b0>] open_ctree+0x1ca0/0x25b0 [ 134.372236] [<ffffffff8144d017>] ? disk_name+0x97/0xb0 [ 134.372236] [<ffffffff813558aa>] btrfs_mount+0x8fa/0xab0 ... Reason: read_tree_block() changed to return error number on fail, and this value(not NULL) is set to tree_root->node, then subsequent code will run to: free_root_pointers() ->free_root_extent_buffers() ->free_extent_buffer() ->atomic_read((extent_buffer *)(-E_XXX)->refs); and trigger above error. Fix: Set tree_root->node to NULL on fail to make error_handle code happy. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-07-22btrfs: Fix lockdep warning of btrfs_run_delayed_iputs()Zhao Lei
Liu Bo <bo.li.liu@oracle.com> reported a lockdep warning of delayed_iput_sem in xfstests generic/241: [ 2061.345955] ============================================= [ 2061.346027] [ INFO: possible recursive locking detected ] [ 2061.346027] 4.1.0+ #268 Tainted: G W [ 2061.346027] --------------------------------------------- [ 2061.346027] btrfs-cleaner/3045 is trying to acquire lock: [ 2061.346027] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100 [ 2061.346027] but task is already holding lock: [ 2061.346027] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100 [ 2061.346027] other info that might help us debug this: [ 2061.346027] Possible unsafe locking scenario: [ 2061.346027] CPU0 [ 2061.346027] ---- [ 2061.346027] lock(&fs_info->delayed_iput_sem); [ 2061.346027] lock(&fs_info->delayed_iput_sem); [ 2061.346027] *** DEADLOCK *** It is rarely happened, about 1/400 in my test env. The reason is recursion of btrfs_run_delayed_iputs(): cleaner_kthread -> btrfs_run_delayed_iputs() *1 -> get delayed_iput_sem lock *2 -> iput() -> ... -> btrfs_commit_transaction() -> btrfs_run_delayed_iputs() *1 -> get delayed_iput_sem lock (dead lock) *2 *1: recursion of btrfs_run_delayed_iputs() *2: warning of lockdep about delayed_iput_sem When fs is in high stress, new iputs may added into fs_info->delayed_iputs list when btrfs_run_delayed_iputs() is running, which cause second btrfs_run_delayed_iputs() run into down_read(&fs_info->delayed_iput_sem) again, and cause above lockdep warning. Actually, it will not cause real problem because both locks are read lock, but to avoid lockdep warning, we can do a fix. Fix: Don't do btrfs_run_delayed_iputs() in btrfs_commit_transaction() for cleaner_kthread thread to break above recursion path. cleaner_kthread is calling btrfs_run_delayed_iputs() explicitly in code, and don't need to call btrfs_run_delayed_iputs() again in btrfs_commit_transaction(), it also give us a bonus to avoid stack overflow. Test: No above lockdep warning after patch in 1200 generic/241 tests. Reported-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-07-22NFS: Remove the "NFS_CAP_CHANGE_ATTR" capabilityTrond Myklebust
Setting the change attribute has been mandatory for all NFS versions, since commit 3a1556e8662c ("NFSv2/v3: Simulate the change attribute"). We should therefore not have anything be conditional on it being set/unset. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-22NFS: Set NFS_INO_REVAL_PAGECACHE if the change attribute is uninitialisedTrond Myklebust
We can't allow caching of data until the change attribute has been initialised correctly. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-22NFS: Don't revalidate the mapping if both size and change attr are up to dateTrond Myklebust
If we've ensured that the size and the change attribute are both correct, then there is no point in marking those attributes as needing revalidation again. Only do so if we know the size is incorrect and was not updated. Fixes: f2467b6f64da ("NFS: Clear NFS_INO_REVAL_PAGECACHE when...") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-22NFSv4/pnfs: Ensure we don't miss a file extensionTrond Myklebust
pNFS writes don't return attributes, however that doesn't mean that we should ignore the fact that they may be extending the file. This patch ensures that if a write is seen to extend the file, then we always set an attribute barrier, and update the cached file size. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-22NFSv4: We must set NFS_OPEN_STATE flag in nfs_resync_open_stateid_lockedTrond Myklebust
Otherwise, nfs4_select_rw_stateid() will always return the zero stateid instead of the correct open stateid. Fixes: f95549cf24660 ("NFSv4: More CLOSE/OPEN races") Cc: stable@vger.kernel.org # 4.0+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-21Revert "fsnotify: fix oops in fsnotify_clear_marks_by_group_flags()"Linus Torvalds
This reverts commit a2673b6e040663bf16a552f8619e6bde9f4b9acf. Kinglong Mee reports a memory leak with that patch, and Jan Kara confirms: "Thanks for report! You are right that my patch introduces a race between fsnotify kthread and fsnotify_destroy_group() which can result in leaking inotify event on group destruction. I haven't yet decided whether the right fix is not to queue events for dying notification group (as that is pointless anyway) or whether we should just fix the original problem differently... Whenever I look at fsnotify code mark handling I get lost in the maze of locks, lists, and subtle differences between how different notification systems handle notification marks :( I'll think about it over night" and after thinking about it, Jan says: "OK, I have looked into the code some more and I found another relatively simple way of fixing the original oops. It will be IMHO better than trying to fixup this issue which has more potential for breakage. I'll ask Linus to revert the fsnotify fix he already merged and send a new fix" Reported-by: Kinglong Mee <kinglongmee@gmail.com> Requested-by: Jan Kara <jack@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-21Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull UDF fix from Jan Kara: "A fix for UDF corruption when certain disk-format feature is enabled" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: udf: Don't corrupt unalloc spacetable when writing it
2015-07-18Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Two families of fixes: - Fix an FPU context related boot crash on newer x86 hardware with larger context sizes than what most people test. To fix this without ugly kludges or extensive reverts we had to touch core task allocator, to allow x86 to determine the task size dynamically, at boot time. I've tested it on a number of x86 platforms, and I cross-built it to a handful of architectures: (warns) (warns) testing x86-64: -git: pass ( 0), -tip: pass ( 0) testing x86-32: -git: pass ( 0), -tip: pass ( 0) testing arm: -git: pass ( 1359), -tip: pass ( 1359) testing cris: -git: pass ( 1031), -tip: pass ( 1031) testing m32r: -git: pass ( 1135), -tip: pass ( 1135) testing m68k: -git: pass ( 1471), -tip: pass ( 1471) testing mips: -git: pass ( 1162), -tip: pass ( 1162) testing mn10300: -git: pass ( 1058), -tip: pass ( 1058) testing parisc: -git: pass ( 1846), -tip: pass ( 1846) testing sparc: -git: pass ( 1185), -tip: pass ( 1185) ... so I hope the cross-arch impact 'none', as intended. (by Dave Hansen) - Fix various NMI handling related bugs unearthed by the big asm code rewrite and generally make the NMI code more robust and more maintainable while at it. These changes are a bit late in the cycle, I hope they are still acceptable. (by Andy Lutomirski)" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/fpu, sched: Introduce CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT and use it on x86 x86/fpu, sched: Dynamically allocate 'struct fpu' x86/entry/64, x86/nmi/64: Add CONFIG_DEBUG_ENTRY NMI testing code x86/nmi/64: Make the "NMI executing" variable more consistent x86/nmi/64: Minor asm simplification x86/nmi/64: Use DF to avoid userspace RSP confusing nested NMI detection x86/nmi/64: Reorder nested NMI checks x86/nmi/64: Improve nested NMI comments x86/nmi/64: Switch stacks on userspace NMI entry x86/nmi/64: Remove asm code that saves CR2 x86/nmi: Enable nested do_nmi() handling for 64-bit kernels
2015-07-18Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge fixes from Andrew Morton: "25 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (25 commits) lib/decompress: set the compressor name to NULL on error mm/cma_debug: correct size input to bitmap function mm/cma_debug: fix debugging alloc/free interface mm/page_owner: set correct gfp_mask on page_owner mm/page_owner: fix possible access violation fsnotify: fix oops in fsnotify_clear_marks_by_group_flags() /proc/$PID/cmdline: fixup empty ARGV case dma-debug: skip debug_dma_assert_idle() when disabled hexdump: fix for non-aligned buffers checkpatch: fix long line messages about patch context mm: clean up per architecture MM hook header files MAINTAINERS: uclinux-h8-devel is moderated for non-subscribers mailmap: update Sudeep Holla's email id Update Viresh Kumar's email address mm, meminit: suppress unused memory variable warning configfs: fix kernel infoleak through user-controlled format string include, lib: add __printf attributes to several function prototypes s390/hugetlb: add hugepages_supported define mm: hugetlb: allow hugepages_supported to be architecture specific revert "s390/mm: make hugepages_supported a boot time decision" ...
2015-07-17Merge branch 'for-linus-4.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "These are all from Filipe, and cover a few problems we've had reported on the list recently (along with ones he found on his own)" * 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix file corruption after cloning inline extents Btrfs: fix order by which delayed references are run Btrfs: fix list transaction->pending_ordered corruption Btrfs: fix memory leak in the extent_same ioctl Btrfs: fix shrinking truncate when the no_holes feature is enabled
2015-07-18x86/fpu, sched: Introduce CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT and use it ↵Ingo Molnar
on x86 Don't burden architectures without dynamic task_struct sizing with the overhead of dynamic sizing. Also optimize the x86 code a bit by caching task_struct_size. Acked-and-Tested-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1437128892-9831-3-git-send-email-mingo@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-18x86/fpu, sched: Dynamically allocate 'struct fpu'Dave Hansen
The FPU rewrite removed the dynamic allocations of 'struct fpu'. But, this potentially wastes massive amounts of memory (2k per task on systems that do not have AVX-512 for instance). Instead of having a separate slab, this patch just appends the space that we need to the 'task_struct' which we dynamically allocate already. This saves from doing an extra slab allocation at fork(). The only real downside here is that we have to stick everything and the end of the task_struct. But, I think the BUILD_BUG_ON()s I stuck in there should keep that from being too fragile. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1437128892-9831-2-git-send-email-mingo@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-17fsnotify: fix oops in fsnotify_clear_marks_by_group_flags()Jan Kara
fsnotify_clear_marks_by_group_flags() can race with fsnotify_destroy_marks() so when fsnotify_destroy_mark_locked() drops mark_mutex, a mark from the list iterated by fsnotify_clear_marks_by_group_flags() can be freed and we dereference free memory in the loop there. Fix the problem by keeping mark_mutex held in fsnotify_destroy_mark_locked(). The reason why we drop that mutex is that we need to call a ->freeing_mark() callback which may acquire mark_mutex again. To avoid this and similar lock inversion issues, we move the call to ->freeing_mark() callback to the kthread destroying the mark. Signed-off-by: Jan Kara <jack@suse.cz> Reported-by: Ashish Sangwan <a.sangwan@samsung.com> Suggested-by: Lino Sanfilippo <LinoSanfilippo@gmx.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-17/proc/$PID/cmdline: fixup empty ARGV caseAlexey Dobriyan
/proc/*/cmdline code checks if it should look at ENVP area by checking last byte of ARGV area: rv = access_remote_vm(mm, arg_end - 1, &c, 1, 0); if (rv <= 0) goto out_free_page; If ARGV is somehow made empty (by doing execve(..., NULL, ...) or manually setting ->arg_start and ->arg_end to equal values), the decision will be based on byte which doesn't even belong to ARGV/ENVP. So, quickly check if ARGV area is empty and report 0 to match previous behaviour. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-17configfs: fix kernel infoleak through user-controlled format stringNicolas Iooss
Some modules call config_item_init_type_name() and config_group_init_type_name() with parameter "name" directly controlled by userspace. These two functions call config_item_set_name() with this name used as a format string, which can be used to leak information such as content of the stack to userspace. For example, make_netconsole_target() in netconsole module calls config_item_init_type_name() with the name of a newly-created directory. This means that the following commands give some unexpected output, with configfs mounted in /sys/kernel/config/ and on a system with a configured eth0 ethernet interface: # modprobe netconsole # mkdir /sys/kernel/config/netconsole/target_%lx # echo eth0 > /sys/kernel/config/netconsole/target_%lx/dev_name # echo 1 > /sys/kernel/config/netconsole/target_%lx/enabled # echo eth0 > /sys/kernel/config/netconsole/target_%lx/dev_name # dmesg |tail -n1 [ 142.697668] netconsole: target (target_ffffffffc0ae8080) is enabled, disable to update parameters The directory name is correct but %lx has been interpreted in the internal item name, displayed here in the error message used by store_dev_name() in drivers/net/netconsole.c. To fix this, update every caller of config_item_set_name to use "%s" when operating on untrusted input. This issue was found using -Wformat-security gcc flag, once a __printf attribute has been added to config_item_set_name(). Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Felipe Balbi <balbi@ti.com> Acked-by: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-17fs, proc: add help for CONFIG_PROC_CHILDRENIago López Galeiras
The purpose of the option was documented in Documentation/filesystems/proc.txt but the help text was missing. Add small help text that also points to the documentation. Signed-off-by: Iago López Galeiras <iago@endocode.com> Reviewed-by: Jean Delvare <jdelvare@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-16Merge tag 'jfs-4.2' of git://github.com/kleikamp/linux-shaggyLinus Torvalds
Pull jfs fixes from David Kleikamp: "A couple trivial fixes and an error path fix" * tag 'jfs-4.2' of git://github.com/kleikamp/linux-shaggy: jfs: clean up jfs_rename and fix out of order unlock jfs: fix indentation on if statement jfs: removed a prohibited space after opening parenthesis
2015-07-15Merge tag 'locks-v4.2-1' of git://git.samba.org/jlayton/linuxLinus Torvalds
Pull file locking updates from Jeff Layton: "I had thought that I was going to get away without a pull request this cycle. There was a NFSv4 file locking problem that cropped up that I tried to fix in the NFSv4 code alone, but that fix has turned out to be problematic. These patches fix this in the correct way. Note that this touches some NFSv4 code as well. Ordinarily I'd wait for Trond to ACK this, but he's on holiday right now and the bug is rather nasty. So I suggest we merge this and if he raises issues with it we can sort it out when he gets back" Acked-by: Bruce Fields <bfields@fieldses.org> Acked-by: Dan Williams <dan.j.williams@intel.com> [ +1 to this series fixing a 100% reproducible slab corruption + general protection fault in my nfs-root test environment. - Dan ] Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com> * tag 'locks-v4.2-1' of git://git.samba.org/jlayton/linux: locks: inline posix_lock_file_wait and flock_lock_file_wait nfs4: have do_vfs_lock take an inode pointer locks: new helpers - flock_lock_inode_wait and posix_lock_inode_wait locks: have flock_lock_file take an inode pointer instead of a filp Revert "nfs: take extra reference to fl->fl_file when running a LOCKU operation"
2015-07-15jfs: clean up jfs_rename and fix out of order unlockDave Kleikamp
The end of jfs_rename(), which is also used by the error paths, included a call to IWRITE_UNLOCK(new_ip) after labels out1, out2 and out3. If we come in through these labels, IWRITE_LOCK() has not been called yet. In moving that call to the correct spot, I also moved some exceptional truncate code earlier as well, since the early error paths don't need to deal with it, and I renamed out4: to out_tx: so a future patch by Jan Kara doesn't need to deal with renumbering or confusing out-of-order labels. Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2015-07-14Btrfs: fix file corruption after cloning inline extentsFilipe Manana
Using the clone ioctl (or extent_same ioctl, which calls the same extent cloning function as well) we end up allowing copy an inline extent from the source file into a non-zero offset of the destination file. This is something not expected and that the btrfs code is not prepared to deal with - all inline extents must be at a file offset equals to 0. For example, the following excerpt of a test case for fstests triggers a crash/BUG_ON() on a write operation after an inline extent is cloned into a non-zero offset: _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount # Create our test files. File foo has the same 2K of data at offset 4K # as file bar has at its offset 0. $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 4K" \ -c "pwrite -S 0xbb 4k 2K" \ -c "pwrite -S 0xcc 8K 4K" \ $SCRATCH_MNT/foo | _filter_xfs_io # File bar consists of a single inline extent (2K size). $XFS_IO_PROG -f -s -c "pwrite -S 0xbb 0 2K" \ $SCRATCH_MNT/bar | _filter_xfs_io # Now call the clone ioctl to clone the extent of file bar into file # foo at its offset 4K. This made file foo have an inline extent at # offset 4K, something which the btrfs code can not deal with in future # IO operations because all inline extents are supposed to start at an # offset of 0, resulting in all sorts of chaos. # So here we validate that clone ioctl returns an EOPNOTSUPP, which is # what it returns for other cases dealing with inlined extents. $CLONER_PROG -s 0 -d $((4 * 1024)) -l $((2 * 1024)) \ $SCRATCH_MNT/bar $SCRATCH_MNT/foo # Because of the inline extent at offset 4K, the following write made # the kernel crash with a BUG_ON(). $XFS_IO_PROG -c "pwrite -S 0xdd 6K 2K" $SCRATCH_MNT/foo | _filter_xfs_io status=0 exit The stack trace of the BUG_ON() triggered by the last write is: [152154.035903] ------------[ cut here ]------------ [152154.036424] kernel BUG at mm/page-writeback.c:2286! [152154.036424] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [152154.036424] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpu$ [152154.036424] CPU: 2 PID: 17873 Comm: xfs_io Tainted: G W 4.1.0-rc6-btrfs-next-11+ #2 [152154.036424] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [152154.036424] task: ffff880429f70990 ti: ffff880429efc000 task.ti: ffff880429efc000 [152154.036424] RIP: 0010:[<ffffffff8111a9d5>] [<ffffffff8111a9d5>] clear_page_dirty_for_io+0x1e/0x90 [152154.036424] RSP: 0018:ffff880429effc68 EFLAGS: 00010246 [152154.036424] RAX: 0200000000000806 RBX: ffffea0006a6d8f0 RCX: 0000000000000001 [152154.036424] RDX: 0000000000000000 RSI: ffffffff81155d1b RDI: ffffea0006a6d8f0 [152154.036424] RBP: ffff880429effc78 R08: ffff8801ce389fe0 R09: 0000000000000001 [152154.036424] R10: 0000000000002000 R11: ffffffffffffffff R12: ffff8800200dce68 [152154.036424] R13: 0000000000000000 R14: ffff8800200dcc88 R15: ffff8803d5736d80 [152154.036424] FS: 00007fbf119f6700(0000) GS:ffff88043d280000(0000) knlGS:0000000000000000 [152154.036424] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [152154.036424] CR2: 0000000001bdc000 CR3: 00000003aa555000 CR4: 00000000000006e0 [152154.036424] Stack: [152154.036424] ffff8803d5736d80 0000000000000001 ffff880429effcd8 ffffffffa04e97c1 [152154.036424] ffff880429effd68 ffff880429effd60 0000000000000001 ffff8800200dc9c8 [152154.036424] 0000000000000001 ffff8800200dcc88 0000000000000000 0000000000001000 [152154.036424] Call Trace: [152154.036424] [<ffffffffa04e97c1>] lock_and_cleanup_extent_if_need+0x147/0x18d [btrfs] [152154.036424] [<ffffffffa04ea82c>] __btrfs_buffered_write+0x245/0x4c8 [btrfs] [152154.036424] [<ffffffffa04ed14b>] ? btrfs_file_write_iter+0x150/0x3e0 [btrfs] [152154.036424] [<ffffffffa04ed15a>] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs] [152154.036424] [<ffffffffa04ed2c7>] btrfs_file_write_iter+0x2cc/0x3e0 [btrfs] [152154.036424] [<ffffffff81165a4a>] __vfs_write+0x7c/0xa5 [152154.036424] [<ffffffff81165f89>] vfs_write+0xa0/0xe4 [152154.036424] [<ffffffff81166855>] SyS_pwrite64+0x64/0x82 [152154.036424] [<ffffffff81465197>] system_call_fastpath+0x12/0x6f [152154.036424] Code: 48 89 c7 e8 0f ff ff ff 5b 41 5c 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 ae ef 00 00 49 89 c4 48 8b 03 a8 01 75 02 <0f> 0b 4d 85 e4 74 59 49 8b 3c 2$ [152154.036424] RIP [<ffffffff8111a9d5>] clear_page_dirty_for_io+0x1e/0x90 [152154.036424] RSP <ffff880429effc68> [152154.242621] ---[ end trace e3d3376b23a57041 ]--- Fix this by returning the error EOPNOTSUPP if an attempt to copy an inline extent into a non-zero offset happens, just like what is done for other scenarios that would require copying/splitting inline extents, which were introduced by the following commits: 00fdf13a2e9f ("Btrfs: fix a crash of clone with inline extents's split") 3f9e3df8da3c ("btrfs: replace error code from btrfs_drop_extents") Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-07-13locks: inline posix_lock_file_wait and flock_lock_file_waitJeff Layton
They just call file_inode and then the corresponding *_inode_file_wait function. Just make them static inlines instead. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
2015-07-13nfs4: have do_vfs_lock take an inode pointerJeff Layton
Now that we have file locking helpers that can deal with an inode instead of a filp, we can change the NFSv4 locking code to use that instead. This should fix the case where we have a filp that is closed while flock or OFD locks are set on it, and the task is signaled so that it doesn't wait for the LOCKU reply to come in before the filp is freed. At that point we can end up with a use-after-free with the current code, which relies on dereferencing the fl_file in the lock request. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Reviewed-by: "J. Bruce Fields" <bfields@fieldses.org> Tested-by: "J. Bruce Fields" <bfields@fieldses.org>
2015-07-13locks: new helpers - flock_lock_inode_wait and posix_lock_inode_waitJeff Layton
Allow callers to pass in an inode instead of a filp. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Reviewed-by: "J. Bruce Fields" <bfields@fieldses.org> Tested-by: "J. Bruce Fields" <bfields@fieldses.org>
2015-07-13locks: have flock_lock_file take an inode pointer instead of a filpJeff Layton
...and rename it to better describe how it works. In order to fix a use-after-free in NFS, we need to be able to remove locks from an inode after the filp associated with them may have already been freed. flock_lock_file already only dereferences the filp to get to the inode, so just change it so the callers do that. All of the callers already pass in a lock request that has the fl_file set properly, so we don't need to pass it in individually. With that change it now only dereferences the filp to get to the inode, so just push that out to the callers. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Reviewed-by: "J. Bruce Fields" <bfields@fieldses.org> Tested-by: "J. Bruce Fields" <bfields@fieldses.org>
2015-07-13Revert "nfs: take extra reference to fl->fl_file when running a LOCKU operation"Jeff Layton
This reverts commit db2efec0caba4f81a22d95a34da640b86c313c8e. William reported that he was seeing instability with this patch, which is likely due to the fact that it can cause the kernel to take a new reference to a filp after the last reference has already been put. Revert this patch for now, as we'll need to fix this in another way. Cc: stable@vger.kernel.org Reported-by: William Dauchy <william@gandi.net> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Reviewed-by: "J. Bruce Fields" <bfields@fieldses.org> Tested-by: "J. Bruce Fields" <bfields@fieldses.org>
2015-07-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS fixes from Al Viro: "Fixes for this cycle regression in overlayfs and a couple of long-standing (== all the way back to 2.6.12, at least) bugs" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: freeing unlinked file indefinitely delayed fix a braino in ovl_d_select_inode() 9p: don't leave a half-initialized inode sitting around
2015-07-12freeing unlinked file indefinitely delayedAl Viro
Normally opening a file, unlinking it and then closing will have the inode freed upon close() (provided that it's not otherwise busy and has no remaining links, of course). However, there's one case where that does *not* happen. Namely, if you open it by fhandle with cold dcache, then unlink() and close(). In normal case you get d_delete() in unlink(2) notice that dentry is busy and unhash it; on the final dput() it will be forcibly evicted from dcache, triggering iput() and inode removal. In this case, though, we end up with *two* dentries - disconnected (created by open-by-fhandle) and regular one (used by unlink()). The latter will have its reference to inode dropped just fine, but the former will not - it's considered hashed (it is on the ->s_anon list), so it will stay around until the memory pressure will finally do it in. As the result, we have the final iput() delayed indefinitely. It's trivial to reproduce - void flush_dcache(void) { system("mount -o remount,rw /"); } static char buf[20 * 1024 * 1024]; main() { int fd; union { struct file_handle f; char buf[MAX_HANDLE_SZ]; } x; int m; x.f.handle_bytes = sizeof(x); chdir("/root"); mkdir("foo", 0700); fd = open("foo/bar", O_CREAT | O_RDWR, 0600); close(fd); name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0); flush_dcache(); fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR); unlink("foo/bar"); write(fd, buf, sizeof(buf)); system("df ."); /* 20Mb eaten */ close(fd); system("df ."); /* should've freed those 20Mb */ flush_dcache(); system("df ."); /* should be the same as #2 */ } will spit out something like Filesystem 1K-blocks Used Available Use% Mounted on /dev/root 322023 303843 1131 100% / Filesystem 1K-blocks Used Available Use% Mounted on /dev/root 322023 303843 1131 100% / Filesystem 1K-blocks Used Available Use% Mounted on /dev/root 322023 283282 21692 93% / - inode gets freed only when dentry is finally evicted (here we trigger than by remount; normally it would've happened in response to memory pressure hell knows when). Cc: stable@vger.kernel.org # v2.6.38+; earlier ones need s/kill_it/unhash_it/ Acked-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-12fix a braino in ovl_d_select_inode()Al Viro
when opening a directory we want the overlayfs inode, not one from the topmost layer. Reported-By: Andrey Jr. Melnikov <temnota.am@gmail.com> Tested-By: Andrey Jr. Melnikov <temnota.am@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>