summaryrefslogtreecommitdiff
path: root/fs/ext4/mballoc.c
AgeCommit message (Collapse)Author
2017-01-06ext4: fix stack memory corruption with 64k block sizeChandan Rajendra
commit 30a9d7afe70ed6bd9191d3000e2ef1a34fb58493 upstream. The number of 'counters' elements needed in 'struct sg' is super_block->s_blocksize_bits + 2. Presently we have 16 'counters' elements in the array. This is insufficient for block sizes >= 32k. In such cases the memcpy operation performed in ext4_mb_seq_groups_show() would cause stack memory corruption. Fixes: c9de560ded61f Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06ext4: fix mballoc breakage with 64k block sizeChandan Rajendra
commit 69e43e8cc971a79dd1ee5d4343d8e63f82725123 upstream. 'border' variable is set to a value of 2 times the block size of the underlying filesystem. With 64k block size, the resulting value won't fit into a 16-bit variable. Hence this commit changes the data type of 'border' to 'unsigned int'. Fixes: c9de560ded61f Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-24ext4: use __GFP_NOFAIL in ext4_free_blocks()Konstantin Khlebnikov
commit adb7ef600cc9d9d15ecc934cc26af5c1379777df upstream. This might be unexpected but pages allocated for sbi->s_buddy_cache are charged to current memory cgroup. So, GFP_NOFS allocation could fail if current task has been killed by OOM or if current memory cgroup has no free memory left. Block allocator cannot handle such failures here yet. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-16ext4: fix reference counting bug on block allocation errorVegard Nossum
commit 554a5ccc4e4a20c5f3ec859de0842db4b4b9c77e upstream. If we hit this error when mounted with errors=continue or errors=remount-ro: EXT4-fs error (device loop0): ext4_mb_mark_diskspace_used:2940: comm ext4.exe: Allocating blocks 5090-6081 which overlap fs metadata then ext4_mb_new_blocks() will call ext4_mb_release_context() and try to continue. However, ext4_mb_release_context() is the wrong thing to call here since we are still actually using the allocation context. Instead, just error out. We could retry the allocation, but there is a possibility of getting stuck in an infinite loop instead, so this seems safer. [ Fixed up so we don't return EAGAIN to userspace. --tytso ] Fixes: 8556e8f3b6 ("ext4: Don't allow new groups to be added during block allocation") Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07ext4: silence UBSAN in ext4_mb_init()Nicolai Stange
commit 935244cd54b86ca46e69bc6604d2adfb1aec2d42 upstream. Currently, in ext4_mb_init(), there's a loop like the following: do { ... offset += 1 << (sb->s_blocksize_bits - i); i++; } while (i <= sb->s_blocksize_bits + 1); Note that the updated offset is used in the loop's next iteration only. However, at the last iteration, that is at i == sb->s_blocksize_bits + 1, the shift count becomes equal to (unsigned)-1 > 31 (c.f. C99 6.5.7(3)) and UBSAN reports UBSAN: Undefined behaviour in fs/ext4/mballoc.c:2621:15 shift exponent 4294967295 is too large for 32-bit type 'int' [...] Call Trace: [<ffffffff818c4d25>] dump_stack+0xbc/0x117 [<ffffffff818c4c69>] ? _atomic_dec_and_lock+0x169/0x169 [<ffffffff819411ab>] ubsan_epilogue+0xd/0x4e [<ffffffff81941cac>] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254 [<ffffffff81941ab1>] ? __ubsan_handle_load_invalid_value+0x158/0x158 [<ffffffff814b6dc1>] ? kmem_cache_alloc+0x101/0x390 [<ffffffff816fc13b>] ? ext4_mb_init+0x13b/0xfd0 [<ffffffff814293c7>] ? create_cache+0x57/0x1f0 [<ffffffff8142948a>] ? create_cache+0x11a/0x1f0 [<ffffffff821c2168>] ? mutex_lock+0x38/0x60 [<ffffffff821c23ab>] ? mutex_unlock+0x1b/0x50 [<ffffffff814c26ab>] ? put_online_mems+0x5b/0xc0 [<ffffffff81429677>] ? kmem_cache_create+0x117/0x2c0 [<ffffffff816fcc49>] ext4_mb_init+0xc49/0xfd0 [...] Observe that the mentioned shift exponent, 4294967295, equals (unsigned)-1. Unless compilers start to do some fancy transformations (which at least GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the such calculated value of offset is never used again. Silence UBSAN by introducing another variable, offset_incr, holding the next increment to apply to offset and adjust that one by right shifting it by one position per loop iteration. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701 Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161 Signed-off-by: Nicolai Stange <nicstange@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07ext4: address UBSAN warning in mb_find_order_for_block()Nicolai Stange
commit b5cb316cdf3a3f5f6125412b0f6065185240cfdc upstream. Currently, in mb_find_order_for_block(), there's a loop like the following: while (order <= e4b->bd_blkbits + 1) { ... bb += 1 << (e4b->bd_blkbits - order); } Note that the updated bb is used in the loop's next iteration only. However, at the last iteration, that is at order == e4b->bd_blkbits + 1, the shift count becomes negative (c.f. C99 6.5.7(3)) and UBSAN reports UBSAN: Undefined behaviour in fs/ext4/mballoc.c:1281:11 shift exponent -1 is negative [...] Call Trace: [<ffffffff818c4d35>] dump_stack+0xbc/0x117 [<ffffffff818c4c79>] ? _atomic_dec_and_lock+0x169/0x169 [<ffffffff819411bb>] ubsan_epilogue+0xd/0x4e [<ffffffff81941cbc>] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254 [<ffffffff81941ac1>] ? __ubsan_handle_load_invalid_value+0x158/0x158 [<ffffffff816e93a0>] ? ext4_mb_generate_from_pa+0x590/0x590 [<ffffffff816502c8>] ? ext4_read_block_bitmap_nowait+0x598/0xe80 [<ffffffff816e7b7e>] mb_find_order_for_block+0x1ce/0x240 [...] Unless compilers start to do some fancy transformations (which at least GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the such calculated value of bb is never used again. Silence UBSAN by introducing another variable, bb_incr, holding the next increment to apply to bb and adjust that one by right shifting it by one position per loop iteration. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701 Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161 Signed-off-by: Nicolai Stange <nicstange@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-11-09remove abs64()Andrew Morton
Switch everything to the new and more capable implementation of abs(). Mainly to give the new abs() a bit of a workout. Cc: Michal Nazarewicz <mina86@mina86.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-19ext4: fix abs() usage in ext4_mb_check_group_paJohn Stultz
The ext4_fsblk_t type is a long long, which should not be used with abs(), as is done in ext4_mb_check_group_pa(). This patch modifies ext4_mb_check_group_pa() to use abs64() instead. Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-17ext4: fix xfstest generic/269 double revoked buffer bug with bigallocDaeho Jeong
When you repeatly execute xfstest generic/269 with bigalloc_1k option enabled using the below command: "./kvm-xfstests -c bigalloc_1k -m nodelalloc -C 1000 generic/269" you can easily see the below bug message. "JBD2 unexpected failure: jbd2_journal_revoke: !buffer_revoked(bh);" This means that an already revoked buffer is erroneously revoked again and it is caused by doing revoke for the buffer at the wrong position in ext4_free_blocks(). We need to re-position the buffer revoke procedure for an unspecified buffer after checking the cluster boundary for bigalloc option. If not, some part of the cluster can be doubly revoked. Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
2015-10-17ext4: make the bitmap read routines return real error codesDarrick J. Wong
Make the bitmap reaading routines return real error codes (EIO, EFSCORRUPTED, EFSBADCRC) which can then be reflected back to userspace for more precise diagnosis work. In particular, this means that mballoc no longer claims that we're out of memory if the block bitmaps become corrupt. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-09-23ext4: move procfs registration code to fs/ext4/sysfs.cTheodore Ts'o
This allows us to refactor the procfs code, which saves a bit of compiled space. More importantly it isolates most of the procfs support code into a single file, so it's easier to #ifdef it out if the proc file system has been disabled. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-05Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bugfixes from Ted Ts'o: "Bug fixes (all for stable kernels) for ext4: - address corner cases for indirect blocks->extent migration - fix reserved block accounting invalidate_page when page_size != block_size (i.e., ppc or 1k block size file systems) - fix deadlocks when a memcg is under heavy memory pressure - fix fencepost error in lazytime optimization" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: replace open coded nofail allocation in ext4_free_blocks() ext4: correctly migrate a file with a hole at the beginning ext4: be more strict when migrating to non-extent based file ext4: fix reservation release on invalidatepage for delalloc fs ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp bufferhead: Add _gfp version for sb_getblk() ext4: fix fencepost error in lazytime optimization
2015-07-05ext4: replace open coded nofail allocation in ext4_free_blocks()Michal Hocko
ext4_free_blocks is looping around the allocation request and mimics __GFP_NOFAIL behavior without any allocation fallback strategy. Let's remove the open coded loop and replace it with __GFP_NOFAIL. Without the flag the allocator has no way to find out never-fail requirement and cannot help in any way. Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2015-06-25Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull cgroup writeback support from Jens Axboe: "This is the big pull request for adding cgroup writeback support. This code has been in development for a long time, and it has been simmering in for-next for a good chunk of this cycle too. This is one of those problems that has been talked about for at least half a decade, finally there's a solution and code to go with it. Also see last weeks writeup on LWN: http://lwn.net/Articles/648292/" * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits) writeback, blkio: add documentation for cgroup writeback support vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB writeback: do foreign inode detection iff cgroup writeback is enabled v9fs: fix error handling in v9fs_session_init() bdi: fix wrong error return value in cgwb_create() buffer: remove unusued 'ret' variable writeback: disassociate inodes from dying bdi_writebacks writeback: implement foreign cgroup inode bdi_writeback switching writeback: add lockdep annotation to inode_to_wb() writeback: use unlocked_inode_to_wb transaction in inode_congested() writeback: implement unlocked_inode_to_wb transaction and use it for stat updates writeback: implement [locked_]inode_to_wb_and_lock_list() writeback: implement foreign cgroup inode detection writeback: make writeback_control track the inode being written back writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb() mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use writeback: implement memcg writeback domain based throttling writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes writeback: implement memcg wb_domain writeback: update wb_over_bg_thresh() to use wb_domain aware operations ...
2015-06-15ext4: mballoc: avoid 20-argument function callRasmus Villemoes
Making a function call with 20 arguments is rather expensive in both stack and .text. In this case, doing the formatting manually doesn't make it any less readable, so we might as well save 155 bytes of .text and 112 bytes of stack. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
2015-06-08ext4: return error code from ext4_mb_good_group()Lukas Czerner
Currently ext4_mb_good_group() only returns 0 or 1 depending on whether the allocation group is suitable for use or not. However we might get various errors and fail while initializing new group including -EIO which would never get propagated up the call chain. This might lead to an endless loop at writeback when we're trying to find a good group to allocate from and we fail to initialize new group (read error for example). Fix this by returning proper error code from ext4_mb_good_group() and using it in ext4_mb_regular_allocator(). In ext4_mb_regular_allocator() we will always return only the first occurred error from ext4_mb_good_group() and we only propagate it back to the caller if we do not get any other errors and we fail to allocate any blocks. Note that with other modes than errors=continue, we will fail immediately in ext4_mb_good_group() in case of error, however with errors=continue we should try to continue using the file system, that's why we're not going to fail immediately when we see an error from ext4_mb_good_group(), but rather when we fail to find a suitable block group to allocate from due to an problem in group initialization. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2015-06-08ext4: try to initialize all groups we can in case of failure on ppc64Lukas Czerner
Currently on the machines with page size > block size when initializing block group buddy cache we initialize it for all the block group bitmaps in the page. However in the case of read error, checksum error, or if a single bitmap is in any way corrupted we would fail to initialize all of the bitmaps. This is problematic because we will not have access to the other allocation groups even though those might be perfectly fine and usable. Fix this by reading all the bitmaps instead of error out on the first problem and simply skip the bitmaps which were either not read properly, or are not valid. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-02writeback: separate out include/linux/backing-dev-defs.hTejun Heo
With the planned cgroup writeback support, backing-dev related declarations will be more widely used across block and cgroup; unfortunately, including backing-dev.h from include/linux/blkdev.h makes cyclic include dependency quite likely. This patch separates out backing-dev-defs.h which only has the essential definitions and updates blkdev.h to include it. c files which need access to more backing-dev details now include backing-dev.h directly. This takes backing-dev.h off the common include dependency chain making it a lot easier to use it across block and cgroup. v2: fs/fat build failure fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-25ext4: Remove an unnecessary check for NULL before iput()Markus Elfring
The iput() function tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25ext4: cleanup GFP flags inside resize pathDmitry Monakhov
We must use GFP_NOFS instead GFP_KERNEL inside ext4_mb_add_groupinfo and ext4_calculate_overhead() because they are called from inside a journal transaction. Call trace: ioctl ->ext4_group_add ->journal_start ->ext4_setup_new_descs ->ext4_mb_add_groupinfo -> GFP_KERNEL ->ext4_flex_group_add ->ext4_update_super ->ext4_calculate_overhead -> GFP_KERNEL ->journal_stop Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-20ext4: kill ext4_kvfree()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-20Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "A large number of cleanups and bug fixes, with some (minor) journal optimizations" [ This got sent to me before -rc1, but was stuck in my spam folder. - Linus ] * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (67 commits) ext4: check s_chksum_driver when looking for bg csum presence ext4: move error report out of atomic context in ext4_init_block_bitmap() ext4: Replace open coded mdata csum feature to helper function ext4: delete useless comments about ext4_move_extents ext4: fix reservation overflow in ext4_da_write_begin ext4: add ext4_iget_normal() which is to be used for dir tree lookups ext4: don't orphan or truncate the boot loader inode ext4: grab missed write_count for EXT4_IOC_SWAP_BOOT ext4: optimize block allocation on grow indepth ext4: get rid of code duplication ext4: fix over-defensive complaint after journal abort ext4: fix return value of ext4_do_update_inode ext4: fix mmap data corruption when blocksize < pagesize vfs: fix data corruption when blocksize < pagesize for mmaped data ext4: fold ext4_nojournal_sops into ext4_sops ext4: support freezing ext2 (nojournal) file systems ext4: fold ext4_sync_fs_nojournal() into ext4_sync_fs() ext4: don't check quota format when there are no quota files jbd2: simplify calling convention around __jbd2_journal_clean_checkpoint_list jbd2: avoid pointless scanning of checkpoint lists ...
2014-10-15Merge branch 'for-3.18-consistent-ops' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu consistent-ops changes from Tejun Heo: "Way back, before the current percpu allocator was implemented, static and dynamic percpu memory areas were allocated and handled separately and had their own accessors. The distinction has been gone for many years now; however, the now duplicate two sets of accessors remained with the pointer based ones - this_cpu_*() - evolving various other operations over time. During the process, we also accumulated other inconsistent operations. This pull request contains Christoph's patches to clean up the duplicate accessor situation. __get_cpu_var() uses are replaced with with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr(). Unfortunately, the former sometimes is tricky thanks to C being a bit messy with the distinction between lvalues and pointers, which led to a rather ugly solution for cpumask_var_t involving the introduction of this_cpu_cpumask_var_ptr(). This converts most of the uses but not all. Christoph will follow up with the remaining conversions in this merge window and hopefully remove the obsolete accessors" * 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits) irqchip: Properly fetch the per cpu offset percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write. percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t Revert "powerpc: Replace __get_cpu_var uses" percpu: Remove __this_cpu_ptr clocksource: Replace __this_cpu_ptr with raw_cpu_ptr sparc: Replace __get_cpu_var uses avr32: Replace __get_cpu_var with __this_cpu_write blackfin: Replace __get_cpu_var uses tile: Use this_cpu_ptr() for hardware counters tile: Replace __get_cpu_var uses powerpc: Replace __get_cpu_var uses alpha: Replace __get_cpu_var ia64: Replace __get_cpu_var uses s390: cio driver &__get_cpu_var replacements s390: Replace __get_cpu_var uses mips: Replace __get_cpu_var uses MIPS: Replace __get_cpu_var uses in FPU emulator. arm: Replace __this_cpu_ptr with raw_cpu_ptr ...
2014-10-01ext4: get rid of code duplicationDmitry Monakhov
Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-04ext4: drop the EXT4_STATE_DELALLOC_RESERVED flagTheodore Ts'o
Having done a full regression test, we can now drop the DELALLOC_RESERVED state flag. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2014-09-04ext4: prepare to drop EXT4_STATE_DELALLOC_RESERVEDTheodore Ts'o
The EXT4_STATE_DELALLOC_RESERVED flag was originally implemented because it was too hard to make sure the mballoc and get_block flags could be reliably passed down through all of the codepaths that end up calling ext4_mb_new_blocks(). Since then, we have mb_flags passed down through most of the code paths, so getting rid of EXT4_STATE_DELALLOC_RESERVED isn't as tricky as it used to. This commit plumbs in the last of what is required, and then adds a WARN_ON check to make sure we haven't missed anything. If this passes a full regression test run, we can then drop EXT4_STATE_DELALLOC_RESERVED. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2014-08-26block: Replace __this_cpu_ptr with raw_cpu_ptrChristoph Lameter
__this_cpu_ptr is being phased out use raw_cpu_ptr instead which was introduced in 3.15-rc1. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-08-23ext4: fix BUG_ON in mb_free_blocks()Theodore Ts'o
If we suffer a block allocation failure (for example due to a memory allocation failure), it's possible that we will call ext4_discard_allocated_blocks() before we've actually allocated any blocks. In that case, fe_len and fe_start in ac->ac_f_ex will still be zero, and this will result in mb_free_blocks(inode, e4b, 0, 0) triggering the BUG_ON on mb_free_blocks(): BUG_ON(last >= (sb->s_blocksize << 3)); Fix this by bailing out of ext4_discard_allocated_blocks() if fs_len is zero. Also fix a missing ext4_mb_unload_buddy() call in ext4_discard_allocated_blocks(). Google-Bug-Id: 16844242 Fixes: 86f0afd463215fc3e58020493482faa4ac3a4d69 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-07-30ext4: fix ext4_discard_allocated_blocks() if we can't allocate the pa structTheodore Ts'o
If there is a failure while allocating the preallocation structure, a number of blocks can end up getting marked in the in-memory buddy bitmap, and then not getting released. This can result in the following corruption getting reported by the kernel: EXT4-fs error (device sda3): ext4_mb_generate_buddy:758: group 1126, 12793 clusters in bitmap, 12729 in gd In that case, we need to release the blocks using mb_free_blocks(). Tested: fs smoke test; also demonstrated that with injected errors, the file system is no longer getting corrupted Google-Bug-Id: 16657874 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-07-27ext4: fix wrong size computation in ext4_mb_normalize_request()Xiaoguang Wang
As the member fe_len defined in struct ext4_free_extent is expressed as number of clusters, the variable "size" computation is wrong, we need to first translate fe_len to block number, then to bytes. Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-07-15ext4: remove metadata reservation checksTheodore Ts'o
Commit 27dd43854227b ("ext4: introduce reserved space") reserves 2% of the file system space to make sure metadata allocations will always succeed. Given that, tracking the reservation of metadata blocks is no longer necessary. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-05ext4: clarify ext4_error message in ext4_mb_generate_buddy_error()Theodore Ts'o
We are spending a lot of time explaining to users what this error means. Let's try to improve the message to avoid this problem. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-06-26ext4: decrement free clusters/inodes counters when block group declared badNamjae Jeon
We should decrement free clusters counter when block bitmap is marked as corrupt and free inodes counter when the allocation bitmap is marked as corrupt to avoid misunderstanding due to incorrect available size in statfs result. User can get immediately ENOSPC error from write begin without reaching for the writepages. Cc: Darrick J. Wong<darrick.wong@oracle.com> Reported-by: Amit Sahrawat <amit.sahrawat83@gmail.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
2014-06-08Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Clean ups and miscellaneous bug fixes, in particular for the new collapse_range and zero_range fallocate functions. In addition, improve the scalability of adding and remove inodes from the orphan list" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits) ext4: handle symlink properly with inline_data ext4: fix wrong assert in ext4_mb_normalize_request() ext4: fix zeroing of page during writeback ext4: remove unused local variable "stored" from ext4_readdir(...) ext4: fix ZERO_RANGE test failure in data journalling ext4: reduce contention on s_orphan_lock ext4: use sbi in ext4_orphan_{add|del}() ext4: use EXT_MAX_BLOCKS in ext4_es_can_be_merged() ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access ext4: remove unnecessary double parentheses ext4: do not destroy ext4_groupinfo_caches if ext4_mb_init() fails ext4: make local functions static ext4: fix block bitmap validation when bigalloc, ^flex_bg ext4: fix block bitmap initialization under sparse_super2 ext4: find the group descriptors on a 1k-block bigalloc,meta_bg filesystem ext4: avoid unneeded lookup when xattr name is invalid ext4: fix data integrity sync in ordered mode ext4: remove obsoleted check ext4: add a new spinlock i_raw_lock to protect the ext4's raw inode ext4: fix locking for O_APPEND writes ...
2014-06-04mm: non-atomically mark page accessed during page cache allocation where ↵Mel Gorman
possible aops->write_begin may allocate a new page and make it visible only to have mark_page_accessed called almost immediately after. Once the page is visible the atomic operations are necessary which is noticable overhead when writing to an in-memory filesystem like tmpfs but should also be noticable with fast storage. The objective of the patch is to initialse the accessed information with non-atomic operations before the page is visible. The bulk of filesystems directly or indirectly use grab_cache_page_write_begin or find_or_create_page for the initial allocation of a page cache page. This patch adds an init_page_accessed() helper which behaves like the first call to mark_page_accessed() but may called before the page is visible and can be done non-atomically. The primary APIs of concern in this care are the following and are used by most filesystems. find_get_page find_lock_page find_or_create_page grab_cache_page_nowait grab_cache_page_write_begin All of them are very similar in detail to the patch creates a core helper pagecache_get_page() which takes a flags parameter that affects its behavior such as whether the page should be marked accessed or not. Then old API is preserved but is basically a thin wrapper around this core function. Each of the filesystems are then updated to avoid calling mark_page_accessed when it is known that the VM interfaces have already done the job. There is a slight snag in that the timing of the mark_page_accessed() has now changed so in rare cases it's possible a page gets to the end of the LRU as PageReferenced where as previously it might have been repromoted. This is expected to be rare but it's worth the filesystem people thinking about it in case they see a problem with the timing change. It is also the case that some filesystems may be marking pages accessed that previously did not but it makes sense that filesystems have consistent behaviour in this regard. The test case used to evaulate this is a simple dd of a large file done multiple times with the file deleted on each iterations. The size of the file is 1/10th physical memory to avoid dirty page balancing. In the async case it will be possible that the workload completes without even hitting the disk and will have variable results but highlight the impact of mark_page_accessed for async IO. The sync results are expected to be more stable. The exception is tmpfs where the normal case is for the "IO" to not hit the disk. The test machine was single socket and UMA to avoid any scheduling or NUMA artifacts. Throughput and wall times are presented for sync IO, only wall times are shown for async as the granularity reported by dd and the variability is unsuitable for comparison. As async results were variable do to writback timings, I'm only reporting the maximum figures. The sync results were stable enough to make the mean and stddev uninteresting. The performance results are reported based on a run with no profiling. Profile data is based on a separate run with oprofile running. async dd 3.15.0-rc3 3.15.0-rc3 vanilla accessed-v2 ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%) tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%) btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%) ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%) xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%) The XFS figure is a bit strange as it managed to avoid a worst case by sheer luck but the average figures looked reasonable. samples percentage ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-27ext4: fix wrong assert in ext4_mb_normalize_request()Maurizio Lombardi
The variable "size" is expressed as number of blocks and not as number of clusters, this could trigger a kernel panic when using ext4 with the size of a cluster different from the size of a block. Cc: stable@vger.kernel.org Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-05-12ext4: add missing BUFFER_TRACE before ext4_journal_get_write_accessliang xie
Make them more consistently Signed-off-by: xieliang <xieliang@xiaomi.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12ext4: do not destroy ext4_groupinfo_caches if ext4_mb_init() failsAndrey Tsyvarev
Caches from 'ext4_groupinfo_caches' may be in use by other mounts, which have already existed. So, it is incorrect to destroy them when newly requested mount fails. Found by Linux File System Verification project (linuxtesting.org). Signed-off-by: Andrey Tsyvarev <tsyvarev@ispras.ru> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-04-12ext4: silence sparse check warning for function ext4_trim_extentjon ernst
This fixes the following sparse warning: CHECK fs/ext4/mballoc.c fs/ext4/mballoc.c:5019:9: warning: context imbalance in 'ext4_trim_extent' - unexpected unlock Signed-off-by: "Jon Ernst" <jonernst07@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-10ext4: return ENOMEM rather than EIO when find_###_page() failsYounger Liu
Return ENOMEM rather than EIO when find_get_page() fails in ext4_mb_get_buddy_page_lock() and find_or_create_page() fails in ext4_mb_load_buddy(). Signed-off-by: Younger Liu <younger.liucn@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-02-20ext4: remove unused ac_ex_scannedEric Sandeen
When looking at a bug report with: > kernel: EXT4-fs: 0 scanned, 0 found I thought wow, 0 scanned, that's odd? But it's not odd; it's printing a variable that is initialized to 0 and never touched again. It's never been used since the original merge, so I don't really even know what the original intent was, either. If anyone knows how to hook it up, speak now via patch, otherwise just yank it so it's not making a confusing situation more confusing in kernel logs. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-02-20ext4: make sure ex.fe_logical is initializedTheodore Ts'o
The lowest levels of mballoc set all of the fields of struct ext4_free_extent except for fe_logical, since they are just trying to find the requested free set of blocks, and the logical block hasn't been set yet. This makes some static code checkers sad. Set it to various different debug values, which would be useful when debugging mballoc if these values were to ever show up due to the parts of mballoc triyng to use ac->ac_b_ex.fe_logical before it is properly upper layers of mballoc failing to properly set, usually by ext4_mb_use_best_found(). Addresses-Coverity-Id: #139697 Addresses-Coverity-Id: #139698 Addresses-Coverity-Id: #139699 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-12-20ext4: add explicit casts when masking cluster sizesTheodore Ts'o
The missing casts can cause the high 64-bits of the physical blocks to be lost. Set up new macros which allows us to make sure the right thing happen, even if at some point we end up supporting larger logical block numbers. Thanks to the Emese Revfy and the PaX security team for reporting this issue. Reported-by: PaX Team <pageexec@freemail.hu> Reported-by: Emese Revfy <re.emese@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-12-03ext4: fix use-after-free in ext4_mb_new_blocksJunho Ryu
ext4_mb_put_pa should hold pa->pa_lock before accessing pa->pa_count. While ext4_mb_use_preallocated checks pa->pa_deleted first and then increments pa->count later, ext4_mb_put_pa decrements pa->pa_count before holding pa->pa_lock and then sets pa->pa_deleted. * Free sequence ext4_mb_put_pa (1): atomic_dec_and_test pa->pa_count ext4_mb_put_pa (2): lock pa->pa_lock ext4_mb_put_pa (3): check pa->pa_deleted ext4_mb_put_pa (4): set pa->pa_deleted=1 ext4_mb_put_pa (5): unlock pa->pa_lock ext4_mb_put_pa (6): remove pa from a list ext4_mb_pa_callback: free pa * Use sequence ext4_mb_use_preallocated (1): iterate over preallocation ext4_mb_use_preallocated (2): lock pa->pa_lock ext4_mb_use_preallocated (3): check pa->pa_deleted ext4_mb_use_preallocated (4): increase pa->pa_count ext4_mb_use_preallocated (5): unlock pa->pa_lock ext4_mb_release_context: access pa * Use-after-free sequence [initial status] <pa->pa_deleted = 0, pa_count = 1> ext4_mb_use_preallocated (1): iterate over preallocation ext4_mb_use_preallocated (2): lock pa->pa_lock ext4_mb_use_preallocated (3): check pa->pa_deleted ext4_mb_put_pa (1): atomic_dec_and_test pa->pa_count [pa_count decremented] <pa->pa_deleted = 0, pa_count = 0> ext4_mb_use_preallocated (4): increase pa->pa_count [pa_count incremented] <pa->pa_deleted = 0, pa_count = 1> ext4_mb_use_preallocated (5): unlock pa->pa_lock ext4_mb_put_pa (2): lock pa->pa_lock ext4_mb_put_pa (3): check pa->pa_deleted ext4_mb_put_pa (4): set pa->pa_deleted=1 [race condition!] <pa->pa_deleted = 1, pa_count = 1> ext4_mb_put_pa (5): unlock pa->pa_lock ext4_mb_put_pa (6): remove pa from a list ext4_mb_pa_callback: free pa ext4_mb_release_context: access pa AddressSanitizer has detected use-after-free in ext4_mb_new_blocks Bug report: http://goo.gl/rG1On3 Signed-off-by: Junho Ryu <jayr@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-10-30ext4: fix FITRIM in no journal modeLukas Czerner
When using FITRIM ioctl on a file system without journal it will only trim the block group once, no matter how many times you invoke FITRIM ioctl and how many block you release from the block group. It is because we only clear EXT4_GROUP_INFO_WAS_TRIMMED_BIT in journal callback. Fix this by clearing the bit in no journal mode as well. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reported-by: Jorge Fábregas <jorge.fabregas@gmail.com>
2013-08-28ext4: mark block group as corrupt on block bitmap errorDarrick J. Wong
When we notice a block-bitmap corruption (because of device failure or something else), we should mark this group as corrupt and prevent further block allocations/deallocations from it. Currently, we end up generating one error message for every block in the bitmap. This potentially could make the system unstable as noticed in some bugs. With this patch, the error will be printed only the first time and mark the entire block group as corrupted. This prevents future access allocations/deallocations from it. Also tested by corrupting the block bitmap and forcefully introducing the mb_free_blocks error: (1) create a largefile (2Gb) $ dd if=/dev/zero of=largefile oflag=direct bs=10485760 count=200 (2) umount filesystem. use dumpe2fs to see which block-bitmaps are in use by largefile and note their block numbers (3) use dd to zero-out the used block bitmaps $ dd if=/dev/zero of=/dev/hdc4 bs=4096 seek=14 count=8 oflag=direct (4) mount the FS and delete the largefile. (5) recreate the largefile. verify that the new largefile does not get any blocks from the groups marked as bad. Without the patch, we will see mb_free_blocks error for each bit in each zero'ed out bitmap at (4). With the patch, we only see the error once per blockgroup: [ 309.706803] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 15: 32768 clusters in bitmap, 0 in gd. blk grp corrupted. [ 309.720824] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 14: 32768 clusters in bitmap, 0 in gd. blk grp corrupted. [ 309.732858] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure [ 309.748321] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 13: 32768 clusters in bitmap, 0 in gd. blk grp corrupted. [ 309.760331] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure [ 309.769695] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 12: 32768 clusters in bitmap, 0 in gd. blk grp corrupted. [ 309.781721] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure [ 309.798166] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 11: 32768 clusters in bitmap, 0 in gd. blk grp corrupted. [ 309.810184] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure [ 309.819532] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 10: 32768 clusters in bitmap, 0 in gd. blk grp corrupted. Google-Bug-Id: 7258357 [darrick.wong@oracle.com] Further modifications (by Darrick) to make more obvious that this corruption bit applies to blocks only. Set the corruption flag if the block group bitmap verification fails. Original-author: Aditya Kali <adityakali@google.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-17ext4: fix warning in ext4_da_update_reserve_space()Jan Kara
reaim workfile.dbase test easily triggers warning in ext4_da_update_reserve_space(): EXT4-fs warning (device ram0): ext4_da_update_reserve_space:365: ino 12, allocated 1 with only 0 reserved metadata blocks (releasing 1 blocks with reserved 9 data blocks) The problem is that (one of) tests creates file and then randomly writes to it with O_SYNC. That results in writing back pages of the file in random order so we create extents for written blocks say 0, 2, 4, 6, 8 - this last allocation also allocates new block for extents. Then we writeout block 1 so we have extents 0-2, 4, 6, 8 and we release indirect extent block because extents fit in the inode again. Then we writeout block 10 and we need to allocate indirect extent block again which triggers the warning because we don't have the reservation anymore. Fix the problem by giving back freed metadata blocks resulting from extent merging into inode's reservation pool. Signed-off-by: Jan Kara <jack@suse.cz>
2013-07-13ext4: don't allow ext4_free_blocks() to fail due to ENOMEMTheodore Ts'o
The filesystem should not be marked inconsistent if ext4_free_blocks() is not able to allocate memory. Unfortunately some callers (most notably ext4_truncate) don't have a way to reflect an error back up to the VFS. And even if we did, most userspace applications won't deal with most system calls returning ENOMEM anyway. Reported-by: Nagachandra P <nagachandra@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-07-01ext4: implement error handling of ext4_mb_new_preallocation()Alexey Khoroshilov
If memory allocation in ext4_mb_new_group_pa() is failed, it returns error code, ext4_mb_new_preallocation() propages it, but ext4_mb_new_blocks() ignores it. An observed result was: - allocation fail means ext4_mb_new_group_pa() does not update ext4_allocation_context; - ext4_mb_new_blocks() sets ext4_allocation_request->len (ar->len = ac->ac_b_ex.fe_len;) to number of blocks preallocated (512) instead of number of blocks requested (1); - that activates update cycle in ext4_splice_branch(): for (i = 1; i < blks; i++) <-- blks is 512 instead of 1 here *(where->p + i) = cpu_to_le32(current_block++); - it iterates 511 times and corrupts a chunk of memory including inode structure; - page fault happens at EXT4_SB(inode->i_sb) in ext4_mark_inode_dirty(); - system hangs with 'scheduling while atomic' BUG. The patch implements a check for ext4_mb_new_preallocation() error code and handles its failure as if ext4_mb_regular_allocator() fails. Found by Linux File System Verification project (linuxtesting.org). [ Patch restructed by tytso to make the flow of control easier to follow. ] Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12ext4: add cond_resched() to ext4_free_blocks() & ext4_mb_regular_allocator()Theodore Ts'o
For a file systems with a very large number of block groups, if all of the block group bitmaps are in memory and the file system is relatively badly fragmented, it's possible ext4_mb_regular_allocator() to take a long time trying to find a good match. This is especially true if the tuning parameter mb_max_to_scan has been sent to a very large number. So add a cond_resched() to avoid soft lockup warnings and to provide better system responsiveness. For ext4_free_blocks(), if we are deleting a large range of blocks, and data=journal is enabled so that EXT4_FREE_BLOCKS_FORGET is passed, the loop to call sb_find_get_block() and to call ext4_forget() can take over 10-15 milliseocnds or more. So it's better to add a cond_resched() here a well. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>