summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2017-06-07Linux 4.9.31v4.9.31Greg Kroah-Hartman
2017-06-07xfs: Fix off-by-in in loop termination in xfs_find_get_desired_pgoff()Jan Kara
commit d7fd24257aa60316bf81093f7f909dc9475ae974 upstream. There is an off-by-one error in loop termination conditions in xfs_find_get_desired_pgoff() since 'end' may index a page beyond end of desired range if 'endoff' is page aligned. It doesn't have any visible effects but still it is good to fix it. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07xfs: fix unaligned access in xfs_btree_visit_blocksEric Sandeen
commit a4d768e702de224cc85e0c8eac9311763403b368 upstream. This structure copy was throwing unaligned access warnings on sparc64: Kernel unaligned access at TPC[1043c088] xfs_btree_visit_blocks+0x88/0xe0 [xfs] xfs_btree_copy_ptrs does a memcpy, which avoids it. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07xfs: avoid mount-time deadlock in CoW extent recoveryDarrick J. Wong
commit 3ecb3ac7b950ff8f6c6a61e8b7b0d6e3546429a0 upstream. If a malicious user corrupts the refcount btree to cause a cycle between different levels of the tree, the next mount attempt will deadlock in the CoW recovery routine while grabbing buffer locks. We can use the ability to re-grab a buffer that was previous locked to a transaction to avoid deadlocks, so do that here. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07xfs: xfs_trans_alloc_emptyChristoph Hellwig
This is a partial cherry-pick of commit e89c041338 ("xfs: implement the GETFSMAP ioctl"), which also adds this helper, and a great example of why feature patches should be properly split into their parts. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> [hch: split from the larger patch for -stable] Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-07xfs: bad assertion for delalloc an extent that start at i_sizeZorro Lang
commit 892d2a5f705723b2cb488bfb38bcbdcf83273184 upstream. By run fsstress long enough time enough in RHEL-7, I find an assertion failure (harder to reproduce on linux-4.11, but problem is still there): XFS: Assertion failed: (iflags & BMV_IF_DELALLOC) != 0, file: fs/xfs/xfs_bmap_util.c The assertion is in xfs_getbmap() funciton: if (map[i].br_startblock == DELAYSTARTBLOCK && --> map[i].br_startoff <= XFS_B_TO_FSB(mp, XFS_ISIZE(ip))) ASSERT((iflags & BMV_IF_DELALLOC) != 0); When map[i].br_startoff == XFS_B_TO_FSB(mp, XFS_ISIZE(ip)), the startoff is just at EOF. But we only need to make sure delalloc extents that are within EOF, not include EOF. Signed-off-by: Zorro Lang <zlang@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07xfs: BMAPX shouldn't barf on inline-format directoriesDarrick J. Wong
commit 6eadbf4c8ba816c10d1c97bed9aa861d9fd17809 upstream. When we're fulfilling a BMAPX request, jump out early if the data fork is in local format. This prevents us from hitting a debugging check in bmapi_read and barfing errors back to userspace. The on-disk extent count check later isn't sufficient for IF_DELALLOC mode because da extents are in memory and not on disk. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07xfs: fix indlen accounting error on partial delalloc conversionBrian Foster
commit 0daaecacb83bc6b656a56393ab77a31c28139bc7 upstream. The delalloc -> real block conversion path uses an incorrect calculation in the case where the middle part of a delalloc extent is being converted. This is documented as a rare situation because XFS generally attempts to maximize contiguity by converting as much of a delalloc extent as possible. If this situation does occur, the indlen reservation for the two new delalloc extents left behind by the conversion of the middle range is calculated and compared with the original reservation. If more blocks are required, the delta is allocated from the global block pool. This delta value can be characterized as the difference between the new total requirement (temp + temp2) and the currently available reservation minus those blocks that have already been allocated (startblockval(PREV.br_startblock) - allocated). The problem is that the current code does not account for previously allocated blocks correctly. It subtracts the current allocation count from the (new - old) delta rather than the old indlen reservation. This means that more indlen blocks than have been allocated end up stashed in the remaining extents and free space accounting is broken as a result. Fix up the calculation to subtract the allocated block count from the original extent indlen and thus correctly allocate the reservation delta based on the difference between the new total requirement and the unused blocks from the original reservation. Also remove a bogus assert that contradicts the fact that the new indlen reservation can be larger than the original indlen reservation. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07xfs: fix use-after-free in xfs_finish_page_writebackEryu Guan
commit 161f55efba5ddccc690139fae9373cafc3447a97 upstream. Commit 28b783e47ad7 ("xfs: bufferhead chains are invalid after end_page_writeback") fixed one use-after-free issue by pre-calculating the loop conditionals before calling bh->b_end_io() in the end_io processing loop, but it assigned 'next' pointer before checking end offset boundary & breaking the loop, at which point the bh might be freed already, and caused use-after-free. This is caught by KASAN when running fstests generic/127 on sub-page block size XFS. [ 2517.244502] run fstests generic/127 at 2017-04-27 07:30:50 [ 2747.868840] ================================================================== [ 2747.876949] BUG: KASAN: use-after-free in xfs_destroy_ioend+0x3d3/0x4e0 [xfs] at addr ffff8801395ae698 ... [ 2747.918245] Call Trace: [ 2747.920975] dump_stack+0x63/0x84 [ 2747.924673] kasan_object_err+0x21/0x70 [ 2747.928950] kasan_report+0x271/0x530 [ 2747.933064] ? xfs_destroy_ioend+0x3d3/0x4e0 [xfs] [ 2747.938409] ? end_page_writeback+0xce/0x110 [ 2747.943171] __asan_report_load8_noabort+0x19/0x20 [ 2747.948545] xfs_destroy_ioend+0x3d3/0x4e0 [xfs] [ 2747.953724] xfs_end_io+0x1af/0x2b0 [xfs] [ 2747.958197] process_one_work+0x5ff/0x1000 [ 2747.962766] worker_thread+0xe4/0x10e0 [ 2747.966946] kthread+0x2d3/0x3d0 [ 2747.970546] ? process_one_work+0x1000/0x1000 [ 2747.975405] ? kthread_create_on_node+0xc0/0xc0 [ 2747.980457] ? syscall_return_slowpath+0xe6/0x140 [ 2747.985706] ? do_page_fault+0x30/0x80 [ 2747.989887] ret_from_fork+0x2c/0x40 [ 2747.993874] Object at ffff8801395ae690, in cache buffer_head size: 104 [ 2748.001155] Allocated: [ 2748.003782] PID = 8327 [ 2748.006411] save_stack_trace+0x1b/0x20 [ 2748.010688] save_stack+0x46/0xd0 [ 2748.014383] kasan_kmalloc+0xad/0xe0 [ 2748.018370] kasan_slab_alloc+0x12/0x20 [ 2748.022648] kmem_cache_alloc+0xb8/0x1b0 [ 2748.027024] alloc_buffer_head+0x22/0xc0 [ 2748.031399] alloc_page_buffers+0xd1/0x250 [ 2748.035968] create_empty_buffers+0x30/0x410 [ 2748.040730] create_page_buffers+0x120/0x1b0 [ 2748.045493] __block_write_begin_int+0x17a/0x1800 [ 2748.050740] iomap_write_begin+0x100/0x2f0 [ 2748.055308] iomap_zero_range_actor+0x253/0x5c0 [ 2748.060362] iomap_apply+0x157/0x270 [ 2748.064347] iomap_zero_range+0x5a/0x80 [ 2748.068624] iomap_truncate_page+0x6b/0xa0 [ 2748.073227] xfs_setattr_size+0x1f7/0xa10 [xfs] [ 2748.078312] xfs_vn_setattr_size+0x68/0x140 [xfs] [ 2748.083589] xfs_file_fallocate+0x4ac/0x820 [xfs] [ 2748.088838] vfs_fallocate+0x2cf/0x780 [ 2748.093021] SyS_fallocate+0x48/0x80 [ 2748.097006] do_syscall_64+0x18a/0x430 [ 2748.101186] return_from_SYSCALL_64+0x0/0x6a [ 2748.105948] Freed: [ 2748.108189] PID = 8327 [ 2748.110816] save_stack_trace+0x1b/0x20 [ 2748.115093] save_stack+0x46/0xd0 [ 2748.118788] kasan_slab_free+0x73/0xc0 [ 2748.122969] kmem_cache_free+0x7a/0x200 [ 2748.127247] free_buffer_head+0x41/0x80 [ 2748.131524] try_to_free_buffers+0x178/0x250 [ 2748.136316] xfs_vm_releasepage+0x2e9/0x3d0 [xfs] [ 2748.141563] try_to_release_page+0x100/0x180 [ 2748.146325] invalidate_inode_pages2_range+0x7da/0xcf0 [ 2748.152087] xfs_shift_file_space+0x37d/0x6e0 [xfs] [ 2748.157557] xfs_collapse_file_space+0x49/0x120 [xfs] [ 2748.163223] xfs_file_fallocate+0x2a7/0x820 [xfs] [ 2748.168462] vfs_fallocate+0x2cf/0x780 [ 2748.172642] SyS_fallocate+0x48/0x80 [ 2748.176629] do_syscall_64+0x18a/0x430 [ 2748.180810] return_from_SYSCALL_64+0x0/0x6a Fixed it by checking on offset against end & breaking out first, dereference bh only if there're still bufferheads to process. Signed-off-by: Eryu Guan <eguan@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07xfs: reserve enough blocks to handle btree splits when remappingDarrick J. Wong
commit fe0be23e68200573de027de9b8cc2b27e7fce35e upstream. In xfs_reflink_end_cow, we erroneously reserve only enough blocks to handle adding 1 extent. This is problematic if we fragment free space, have to do CoW, and then have to perform multiple bmap btree expansions. Furthermore, the BUI recovery routine doesn't reserve /any/ blocks to handle btree splits, so log recovery fails after our first error causes the filesystem to go down. Therefore, refactor the transaction block reservation macros until we have a macro that works for our deferred (re)mapping activities, and fix both problems by using that macro. With 1k blocks we can hit this fairly often in g/187 if the scratch fs is big enough. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07xfs: wait on new inodes during quotaoff dquot releaseBrian Foster
commit e20c8a517f259cb4d258e10b0cd5d4b30d4167a0 upstream. The quotaoff operation has a race with inode allocation that results in a livelock. An inode allocation that occurs before the quota status flags are updated acquires the appropriate dquots for the inode via xfs_qm_vop_dqalloc(). It then inserts the XFS_INEW inode into the perag radix tree, sometime later attaches the dquots to the inode and finally clears the XFS_INEW flag. Quotaoff expects to release the dquots from all inodes in the filesystem via xfs_qm_dqrele_all_inodes(). This invokes the AG inode iterator, which skips inodes in the XFS_INEW state because they are not fully constructed. If the scan occurs after dquots have been attached to an inode, but before XFS_INEW is cleared, the newly allocated inode will continue to hold a reference to the applicable dquots. When quotaoff invokes xfs_qm_dqpurge_all(), the reference count of those dquot(s) remain elevated and the dqpurge scan spins indefinitely. To address this problem, update the xfs_qm_dqrele_all_inodes() scan to wait on inodes marked on the XFS_INEW state. We wait on the inodes explicitly rather than skip and retry to avoid continuous retry loops due to a parallel inode allocation workload. Since quotaoff updates the quota state flags and uses a synchronous transaction before the dqrele scan, and dquots are attached to inodes after radix tree insertion iff quota is enabled, one INEW waiting pass through the AG guarantees that the scan has processed all inodes that could possibly hold dquot references. Reported-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07xfs: update ag iterator to support wait on new inodesBrian Foster
commit ae2c4ac2dd39b23a87ddb14ceddc3f2872c6aef5 upstream. The AG inode iterator currently skips new inodes as such inodes are inserted into the inode radix tree before they are fully constructed. Certain contexts require the ability to wait on the construction of new inodes, however. The fs-wide dquot release from the quotaoff sequence is an example of this. Update the AG inode iterator to support the ability to wait on inodes flagged with XFS_INEW upon request. Create a new xfs_inode_ag_iterator_flags() interface and support a set of iteration flags to modify the iteration behavior. When the XFS_AGITER_INEW_WAIT flag is set, include XFS_INEW flags in the radix tree inode lookup and wait on them before the callback is executed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07xfs: support ability to wait on new inodesBrian Foster
commit 756baca27fff3ecaeab9dbc7a5ee35a1d7bc0c7f upstream. Inodes that are inserted into the perag tree but still under construction are flagged with the XFS_INEW bit. Most contexts either skip such inodes when they are encountered or have the ability to handle them. The runtime quotaoff sequence introduces a context that must wait for construction of such inodes to correctly ensure that all dquots in the fs are released. In anticipation of this, support the ability to wait on new inodes. Wake the appropriate bit when XFS_INEW is cleared. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07xfs: fix up quotacheck buffer list error handlingBrian Foster
commit 20e8a063786050083fe05b4f45be338c60b49126 upstream. The quotacheck error handling of the delwri buffer list assumes the resident buffers are locked and doesn't clear the _XBF_DELWRI_Q flag on the buffers that are dequeued. This can lead to assert failures on buffer release and possibly other locking problems. Move this code to a delwri queue cancel helper function to encapsulate the logic required to properly release buffers from a delwri queue. Update the helper to clear the delwri queue flag and call it from quotacheck. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07xfs: prevent multi-fsb dir readahead from reading random blocksBrian Foster
commit cb52ee334a45ae6c78a3999e4b473c43ddc528f4 upstream. Directory block readahead uses a complex iteration mechanism to map between high-level directory blocks and underlying physical extents. This mechanism attempts to traverse the higher-level dir blocks in a manner that handles multi-fsb directory blocks and simultaneously maintains a reference to the corresponding physical blocks. This logic doesn't handle certain (discontiguous) physical extent layouts correctly with multi-fsb directory blocks. For example, consider the case of a 4k FSB filesystem with a 2 FSB (8k) directory block size and a directory with the following extent layout: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..7]: 88..95 0 (88..95) 8 1: [8..15]: 80..87 0 (80..87) 8 2: [16..39]: 168..191 0 (168..191) 24 3: [40..63]: 5242952..5242975 1 (72..95) 24 Directory block 0 spans physical extents 0 and 1, dirblk 1 lies entirely within extent 2 and dirblk 2 spans extents 2 and 3. Because extent 2 is larger than the directory block size, the readahead code erroneously assumes the block is contiguous and issues a readahead based on the physical mapping of the first fsb of the dirblk. This results in read verifier failure and a spurious corruption or crc failure, depending on the filesystem format. Further, the subsequent readahead code responsible for walking through the physical table doesn't correctly advance the physical block reference for dirblk 2. Instead of advancing two physical filesystem blocks, the first iteration of the loop advances 1 block (correctly), but the subsequent iteration advances 2 more physical blocks because the next physical extent (extent 3, above) happens to cover more than dirblk 2. At this point, the higher-level directory block walking is completely off the rails of the actual physical layout of the directory for the respective mapping table. Update the contiguous dirblock logic to consider the current offset in the physical extent to avoid issuing directory readahead to unrelated blocks. Also, update the mapping table advancing code to consider the current offset within the current dirblock to avoid advancing the mapping reference too far beyond the dirblock. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07xfs: handle array index overrun in xfs_dir2_leaf_readbuf()Eric Sandeen
commit 023cc840b40fad95c6fe26fff1d380a8c9d45939 upstream. Carlos had a case where "find" seemed to start spinning forever and never return. This was on a filesystem with non-default multi-fsb (8k) directory blocks, and a fragmented directory with extents like this: 0:[0,133646,2,0] 1:[2,195888,1,0] 2:[3,195890,1,0] 3:[4,195892,1,0] 4:[5,195894,1,0] 5:[6,195896,1,0] 6:[7,195898,1,0] 7:[8,195900,1,0] 8:[9,195902,1,0] 9:[10,195908,1,0] 10:[11,195910,1,0] 11:[12,195912,1,0] 12:[13,195914,1,0] ... i.e. the first extent is a contiguous 2-fsb dir block, but after that it is fragmented into 1 block extents. At the top of the readdir path, we allocate a mapping array which (for this filesystem geometry) can hold 10 extents; see the assignment to map_info->map_size. During readdir, we are therefore able to map extents 0 through 9 above into the array for readahead purposes. If we count by 2, we see that the last mapped index (9) is the first block of a 2-fsb directory block. At the end of xfs_dir2_leaf_readbuf() we have 2 loops to fill more readahead; the outer loop assumes one full dir block is processed each loop iteration, and an inner loop that ensures that this is so by advancing to the next extent until a full directory block is mapped. The problem is that this inner loop may step past the last extent in the mapping array as it tries to reach the end of the directory block. This will read garbage for the extent length, and as a result the loop control variable 'j' may become corrupted and never fail the loop conditional. The number of valid mappings we have in our array is stored in map->map_valid, so stop this inner loop based on that limit. There is an ASSERT at the top of the outer loop for this same condition, but we never made it out of the inner loop, so the ASSERT never fired. Huge appreciation for Carlos for debugging and isolating the problem. Debugged-and-analyzed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Tested-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07xfs: fix integer truncation in xfs_bmap_remap_allocChristoph Hellwig
commit 52813fb13ff90bd9c39a93446cbf1103c290b6e9 upstream. bno should be a xfs_fsblock_t, which is 64-bit wides instead of a xfs_aglock_t, which truncates the value to 32 bits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07xfs: drop iolock from reclaim context to appease lockdepBrian Foster
commit 3b4683c294095b5f777c03307ef8c60f47320e12 upstream. Lockdep complains about use of the iolock in inode reclaim context because it doesn't understand that reclaim has the last reference to the inode, and thus an iolock->reclaim->iolock deadlock is not possible. The iolock is technically not necessary in xfs_inactive() and was only added to appease an assert in xfs_free_eofblocks(), which can be called from other non-reclaim contexts. Therefore, just kill the assert and drop the use of the iolock from reclaim context to quiet lockdep. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07xfs: actually report xattr extents via iomapDarrick J. Wong
commit 84358536dc355a9c8978ee425f87e116186bed16 upstream. Apparently FIEMAP for xattrs has been broken since we switched to the iomap backend because of an incorrect check for xattr presence. Also fix the broken locking. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07xfs: fix over-copying of getbmap parameters from userspaceDarrick J. Wong
commit be6324c00c4d1e0e665f03ed1fc18863a88da119 upstream. In xfs_ioc_getbmap, we should only copy the fields of struct getbmap from userspace, or else we end up copying random stack contents into the kernel. struct getbmap is a strict subset of getbmapx, so a partial structure copy should work fine. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07xfs: use dedicated log worker wq to avoid deadlock with cil wqBrian Foster
commit 696a562072e3c14bcd13ae5acc19cdf27679e865 upstream. The log covering background task used to be part of the xfssyncd workqueue. That workqueue was removed as of commit 5889608df ("xfs: syncd workqueue is no more") and the associated work item scheduled to the xfs-log wq. The latter is used for log buffer I/O completion. Since xfs_log_worker() can invoke a log flush, a deadlock is possible between the xfs-log and xfs-cil workqueues. Consider the following codepath from xfs_log_worker(): xfs_log_worker() xfs_log_force() _xfs_log_force() xlog_cil_force() xlog_cil_force_lsn() xlog_cil_push_now() flush_work() The above is in xfs-log wq context and blocked waiting on the completion of an xfs-cil work item. Concurrently, the cil push in progress can end up blocked here: xlog_cil_push_work() xlog_cil_push() xlog_write() xlog_state_get_iclog_space() xlog_wait(&log->l_flush_wait, ...) The above is in xfs-cil context waiting on log buffer I/O completion, which executes in xfs-log wq context. In this scenario both workqueues are deadlocked waiting on eachother. Add a new workqueue specifically for the high level log covering and ail pushing worker, as was the case prior to commit 5889608df. Diagnosed-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07xfs: fix kernel memory exposure problemsDarrick J. Wong
commit bf9216f922612d2db7666aae01e65064da2ffb3a upstream. Fix a memory exposure problems in inumbers where we allocate an array of structures with holes, fail to zero the holes, then blindly copy the kernel memory contents (junk and all) into userspace. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07xfs: rework the inline directory verifiersDarrick J. Wong
commit 78420281a9d74014af7616958806c3aba056319e upstream. The inline directory verifiers should be called on the inode fork data, which means after iformat_local on the read side, and prior to ifork_flush on the write side. This makes the fork verifier more consistent with the way buffer verifiers work -- i.e. they will operate on the memory buffer that the code will be reading and writing directly. Furthermore, revise the verifier function to return -EFSCORRUPTED so that we don't flood the logs with corruption messages and assert notices. This has been a particular problem with xfs/348, which triggers the XFS_WANT_CORRUPTED_RETURN assertions, which halts the kernel when CONFIG_XFS_DEBUG=y. Disk corruption isn't supposed to do that, at least not in a verifier. Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07xfs: verify inline directory data forksDarrick J. Wong
commit 630a04e79dd41ff746b545d4fc052e0abb836120 upstream. When we're reading or writing the data fork of an inline directory, check the contents to make sure we're not overflowing buffers or eating garbage data. xfs/348 corrupts an inline symlink into an inline directory, triggering a buffer overflow bug. v2: add more checks consistent with _dir2_sf_check and make the verifier usable from anywhere. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07xfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff()Eryu Guan
commit 8affebe16d79ebefb1d9d6d56a46dc89716f9453 upstream. xfs_find_get_desired_pgoff() is used to search for offset of hole or data in page range [index, end] (both inclusive), and the max number of pages to search should be at least one, if end == index. Otherwise the only page is missed and no hole or data is found, which is not correct. When block size is smaller than page size, this can be demonstrated by preallocating a file with size smaller than page size and writing data to the last block. E.g. run this xfs_io command on a 1k block size XFS on x86_64 host. # xfs_io -fc "falloc 0 3k" -c "pwrite 2k 1k" \ -c "seek -d 0" /mnt/xfs/testfile wrote 1024/1024 bytes at offset 2048 1 KiB, 1 ops; 0.0000 sec (33.675 MiB/sec and 34482.7586 ops/sec) Whence Result DATA EOF Data at offset 2k was missed, and lseek(2) returned ENXIO. This is uncovered by generic/285 subtest 07 and 08 on ppc64 host, where pagesize is 64k. Because a recent change to generic/285 reduced the preallocated file size to smaller than 64k. Signed-off-by: Eryu Guan <eguan@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07xfs: use ->b_state to fix buffer I/O accounting release raceBrian Foster
commit 63db7c815bc0997c29e484d2409684fdd9fcd93b upstream. We've had user reports of unmount hangs in xfs_wait_buftarg() that analysis shows is due to btp->bt_io_count == -1. bt_io_count represents the count of in-flight asynchronous buffers and thus should always be >= 0. xfs_wait_buftarg() waits for this value to stabilize to zero in order to ensure that all untracked (with respect to the lru) buffers have completed I/O processing before unmount proceeds to tear down in-core data structures. The value of -1 implies an I/O accounting decrement race. Indeed, the fact that xfs_buf_ioacct_dec() is called from xfs_buf_rele() (where the buffer lock is no longer held) means that bp->b_flags can be updated from an unsafe context. While a user-level reproducer is currently not available, some intrusive hacks to run racing buffer lookups/ioacct/releases from multiple threads was used to successfully manufacture this problem. Existing callers do not expect to acquire the buffer lock from xfs_buf_rele(). Therefore, we can not safely update ->b_flags from this context. It turns out that we already have separate buffer state bits and associated serialization for dealing with buffer LRU state in the form of ->b_state and ->b_lock. Therefore, replace the _XBF_IN_FLIGHT flag with a ->b_state variant, update the I/O accounting wrappers appropriately and make sure they are used with the correct locking. This ensures that buffer in-flight state can be modified at buffer release time without racing with modifications from a buffer lock holder. Fixes: 9c7504aa72b6 ("xfs: track and serialize in-flight async buffers against unmount") Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Libor Pechacek <lpechacek@suse.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07xfs: Fix missed holes in SEEK_HOLE implementationJan Kara
commit 5375023ae1266553a7baa0845e82917d8803f48c upstream. XFS SEEK_HOLE implementation could miss a hole in an unwritten extent as can be seen by the following command: xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "pwrite 128k 8k" -c "seek -h 0" file wrote 57344/57344 bytes at offset 0 56 KiB, 14 ops; 0.0000 sec (49.312 MiB/sec and 12623.9856 ops/sec) wrote 8192/8192 bytes at offset 131072 8 KiB, 2 ops; 0.0000 sec (70.383 MiB/sec and 18018.0180 ops/sec) Whence Result HOLE 139264 Where we can see that hole at offset 56k was just ignored by SEEK_HOLE implementation. The bug is in xfs_find_get_desired_pgoff() which does not properly detect the case when pages are not contiguous. Fix the problem by properly detecting when found page has larger offset than expected. Fixes: d126d43f631f996daeee5006714fed914be32368 Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07drm/gma500/psb: Actually use VBT mode when it is foundPatrik Jakobsson
commit 82bc9a42cf854fdf63155759c0aa790bd1f361b0 upstream. With LVDS we were incorrectly picking the pre-programmed mode instead of the prefered mode provided by VBT. Make sure we pick the VBT mode if one is provided. It is likely that the mode read-out code is still wrong but this patch fixes the immediate problem on most machines. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=78562 Signed-off-by: Patrik Jakobsson <patrik.r.jakobsson@gmail.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170418114332.12183-1-patrik.r.jakobsson@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07mm/slub.c: trace free objects at KERN_INFODaniel Thompson
commit aa2efd5ea4041754da4046c3d2e7edaac9526258 upstream. Currently when trace is enabled (e.g. slub_debug=T,kmalloc-128 ) the trace messages are mostly output at KERN_INFO. However the trace code also calls print_section() to hexdump the head of a free object. This is hard coded to use KERN_ERR, meaning the console is deluged with trace messages even if we've asked for quiet. Fix this the obvious way but adding a level parameter to print_section(), allowing calls from the trace code to use the same trace level as other trace messages. Link: http://lkml.kernel.org/r/20170113154850.518-1-daniel.thompson@linaro.org Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07slub/memcg: cure the brainless abuse of sysfs attributesThomas Gleixner
commit 478fe3037b2278d276d4cd9cd0ab06c4cb2e9b32 upstream. memcg_propagate_slab_attrs() abuses the sysfs attribute file functions to propagate settings from the root kmem_cache to a newly created kmem_cache. It does that with: attr->show(root, buf); attr->store(new, buf, strlen(bug); Aside of being a lazy and absurd hackery this is broken because it does not check the return value of the show() function. Some of the show() functions return 0 w/o touching the buffer. That means in such a case the store function is called with the stale content of the previous show(). That causes nonsense like invoking kmem_cache_shrink() on a newly created kmem_cache. In the worst case it would cause handing in an uninitialized buffer. This should be rewritten proper by adding a propagate() callback to those slub_attributes which must be propagated and avoid that insane conversion to and from ASCII, but that's too large for a hot fix. Check at least the return value of the show() function, so calling store() with stale content is prevented. Steven said: "It can cause a deadlock with get_online_cpus() that has been uncovered by recent cpu hotplug and lockdep changes that Thomas and Peter have been doing. Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(cpu_hotplug.lock); lock(slab_mutex); lock(cpu_hotplug.lock); lock(slab_mutex); *** DEADLOCK ***" Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1705201244540.2255@nanos Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reported-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07ksm: prevent crash after write_protect_page failsAndrea Arcangeli
commit a7306c3436e9c8e584a4b9fad5f3dc91be2a6076 upstream. "err" needs to be left set to -EFAULT if split_huge_page succeeds. Otherwise if "err" gets clobbered with zero and write_protect_page fails, try_to_merge_one_page() will succeed instead of returning -EFAULT and then try_to_merge_with_ksm_page() will continue thinking kpage is a PageKsm when in fact it's still an anonymous page. Eventually it'll crash in page_add_anon_rmap. This has been reproduced on Fedora25 kernel but I can reproduce with upstream too. The bug was introduced in commit f765f540598a ("ksm: prepare to new THP semantics") introduced in v4.5. page:fffff67546ce1cc0 count:4 mapcount:2 mapping:ffffa094551e36e1 index:0x7f0f46673 flags: 0x2ffffc0004007c(referenced|uptodate|dirty|lru|active|swapbacked) page dumped because: VM_BUG_ON_PAGE(!PageLocked(page)) page->mem_cgroup:ffffa09674bf0000 ------------[ cut here ]------------ kernel BUG at mm/rmap.c:1222! CPU: 1 PID: 76 Comm: ksmd Not tainted 4.9.3-200.fc25.x86_64 #1 RIP: do_page_add_anon_rmap+0x1c4/0x240 Call Trace: page_add_anon_rmap+0x18/0x20 try_to_merge_with_ksm_page+0x50b/0x780 ksm_scan_thread+0x1211/0x1410 ? prepare_to_wait_event+0x100/0x100 ? try_to_merge_with_ksm_page+0x780/0x780 kthread+0xd9/0xf0 ? kthread_park+0x60/0x60 ret_from_fork+0x25/0x30 Fixes: f765f54059 ("ksm: prepare to new THP semantics") Link: http://lkml.kernel.org/r/20170513131040.21732-1-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Federico Simoncelli <fsimonce@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07x86/boot: Use CROSS_COMPILE prefix for readelfRob Landley
commit 3780578761921f094179c6289072a74b2228c602 upstream. The boot code Makefile contains a straight 'readelf' invocation. This causes build warnings in cross compile environments, when there is no unprefixed readelf accessible via $PATH. Add the missing $(CROSS_COMPILE) prefix. [ tglx: Rewrote changelog ] Fixes: 98f78525371b ("x86/boot: Refuse to build with data relocations") Signed-off-by: Rob Landley <rob@landley.net> Acked-by: Kees Cook <keescook@chromium.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Paul Bolle <pebolle@tiscali.nl> Cc: "H.J. Lu" <hjl.tools@gmail.com> Link: http://lkml.kernel.org/r/ced18878-693a-9576-a024-113ef39a22c0@landley.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07RDMA/qib,hfi1: Fix MR reference count leak on write with immediateMike Marciniszyn
commit 1feb40067cf04ae48d65f728d62ca255c9449178 upstream. The handling of IB_RDMA_WRITE_ONLY_WITH_IMMEDIATE will leak a memory reference when a buffer cannot be allocated for returning the immediate data. The issue is that the rkey validation has already occurred and the RNR nak fails to release the reference that was fruitlessly gotten. The the peer will send the identical single packet request when its RNR timer pops. The fix is to release the held reference prior to the rnr nak exit. This is the only sequence the requires both rkey validation and the buffer allocation on the same packet. Tested-by: Tadeusz Struk <tadeusz.struk@intel.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07mm: consider memblock reservations for deferred memory initialization sizingMichal Hocko
commit 864b9a393dcb5aed09b8fd31b9bbda0fdda99374 upstream. We have seen an early OOM killer invocation on ppc64 systems with crashkernel=4096M: kthreadd invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=7, order=0, oom_score_adj=0 kthreadd cpuset=/ mems_allowed=7 CPU: 0 PID: 2 Comm: kthreadd Not tainted 4.4.68-1.gd7fe927-default #1 Call Trace: dump_stack+0xb0/0xf0 (unreliable) dump_header+0xb0/0x258 out_of_memory+0x5f0/0x640 __alloc_pages_nodemask+0xa8c/0xc80 kmem_getpages+0x84/0x1a0 fallback_alloc+0x2a4/0x320 kmem_cache_alloc_node+0xc0/0x2e0 copy_process.isra.25+0x260/0x1b30 _do_fork+0x94/0x470 kernel_thread+0x48/0x60 kthreadd+0x264/0x330 ret_from_kernel_thread+0x5c/0xa4 Mem-Info: active_anon:0 inactive_anon:0 isolated_anon:0 active_file:0 inactive_file:0 isolated_file:0 unevictable:0 dirty:0 writeback:0 unstable:0 slab_reclaimable:5 slab_unreclaimable:73 mapped:0 shmem:0 pagetables:0 bounce:0 free:0 free_pcp:0 free_cma:0 Node 7 DMA free:0kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:52428800kB managed:110016kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:320kB slab_unreclaimable:4672kB kernel_stack:1152kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 7 DMA: 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 0kB 0 total pagecache pages 0 pages in swap cache Swap cache stats: add 0, delete 0, find 0/0 Free swap = 0kB Total swap = 0kB 819200 pages RAM 0 pages HighMem/MovableOnly 817481 pages reserved 0 pages cma reserved 0 pages hwpoisoned the reason is that the managed memory is too low (only 110MB) while the rest of the the 50GB is still waiting for the deferred intialization to be done. update_defer_init estimates the initial memoty to initialize to 2GB at least but it doesn't consider any memory allocated in that range. In this particular case we've had Reserving 4096MB of memory at 128MB for crashkernel (System RAM: 51200MB) so the low 2GB is mostly depleted. Fix this by considering memblock allocations in the initial static initialization estimation. Move the max_initialise to reset_deferred_meminit and implement a simple memblock_reserved_memory helper which iterates all reserved blocks and sums the size of all that start below the given address. The cumulative size is than added on top of the initial estimation. This is still not ideal because reset_deferred_meminit doesn't consider holes and so reservation might be above the initial estimation whihch we ignore but let's make the logic simpler until we really need to handle more complicated cases. Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set") Link: http://lkml.kernel.org/r/20170531104010.GI27783@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07mlock: fix mlock count can not decrease in race conditionYisheng Xie
commit 70feee0e1ef331b22cc51f383d532a0d043fbdcc upstream. Kefeng reported that when running the follow test, the mlock count in meminfo will increase permanently: [1] testcase linux:~ # cat test_mlockal grep Mlocked /proc/meminfo for j in `seq 0 10` do for i in `seq 4 15` do ./p_mlockall >> log & done sleep 0.2 done # wait some time to let mlock counter decrease and 5s may not enough sleep 5 grep Mlocked /proc/meminfo linux:~ # cat p_mlockall.c #include <sys/mman.h> #include <stdlib.h> #include <stdio.h> #define SPACE_LEN 4096 int main(int argc, char ** argv) { int ret; void *adr = malloc(SPACE_LEN); if (!adr) return -1; ret = mlockall(MCL_CURRENT | MCL_FUTURE); printf("mlcokall ret = %d\n", ret); ret = munlockall(); printf("munlcokall ret = %d\n", ret); free(adr); return 0; } In __munlock_pagevec() we should decrement NR_MLOCK for each page where we clear the PageMlocked flag. Commit 1ebb7cc6a583 ("mm: munlock: batch NR_MLOCK zone state updates") has introduced a bug where we don't decrement NR_MLOCK for pages where we clear the flag, but fail to isolate them from the lru list (e.g. when the pages are on some other cpu's percpu pagevec). Since PageMlocked stays cleared, the NR_MLOCK accounting gets permanently disrupted by this. Fix it by counting the number of page whose PageMlock flag is cleared. Fixes: 1ebb7cc6a583 (" mm: munlock: batch NR_MLOCK zone state updates") Link: http://lkml.kernel.org/r/1495678405-54569-1-git-send-email-xieyisheng1@huawei.com Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joern Engel <joern@logfs.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: zhongjiang <zhongjiang@huawei.com> Cc: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07mm/migrate: fix refcount handling when !hugepage_migration_supported()Punit Agrawal
commit 30809f559a0d348c2dfd7ab05e9a451e2384962e upstream. On failing to migrate a page, soft_offline_huge_page() performs the necessary update to the hugepage ref-count. But when !hugepage_migration_supported() , unmap_and_move_hugepage() also decrements the page ref-count for the hugepage. The combined behaviour leaves the ref-count in an inconsistent state. This leads to soft lockups when running the overcommitted hugepage test from mce-tests suite. Soft offlining pfn 0x83ed600 at process virtual address 0x400000000000 soft offline: 0x83ed600: migration failed 1, type 1fffc00000008008 (uptodate|head) INFO: rcu_preempt detected stalls on CPUs/tasks: Tasks blocked on level-0 rcu_node (CPUs 0-7): P2715 (detected by 7, t=5254 jiffies, g=963, c=962, q=321) thugetlb_overco R running task 0 2715 2685 0x00000008 Call trace: dump_backtrace+0x0/0x268 show_stack+0x24/0x30 sched_show_task+0x134/0x180 rcu_print_detail_task_stall_rnp+0x54/0x7c rcu_check_callbacks+0xa74/0xb08 update_process_times+0x34/0x60 tick_sched_handle.isra.7+0x38/0x70 tick_sched_timer+0x4c/0x98 __hrtimer_run_queues+0xc0/0x300 hrtimer_interrupt+0xac/0x228 arch_timer_handler_phys+0x3c/0x50 handle_percpu_devid_irq+0x8c/0x290 generic_handle_irq+0x34/0x50 __handle_domain_irq+0x68/0xc0 gic_handle_irq+0x5c/0xb0 Address this by changing the putback_active_hugepage() in soft_offline_huge_page() to putback_movable_pages(). This only triggers on systems that enable memory failure handling (ARCH_SUPPORTS_MEMORY_FAILURE) but not hugepage migration (!ARCH_ENABLE_HUGEPAGE_MIGRATION). I imagine this wasn't triggered as there aren't many systems running this configuration. [akpm@linux-foundation.org: remove dead comment, per Naoya] Link: http://lkml.kernel.org/r/20170525135146.32011-1-punit.agrawal@arm.com Reported-by: Manoj Iyer <manoj.iyer@canonical.com> Tested-by: Manoj Iyer <manoj.iyer@canonical.com> Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Punit Agrawal <punit.agrawal@arm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07ALSA: hda - apply STAC_9200_DELL_M22 quirk for Dell Latitude D430Alexander Tsoy
commit 1fc2e41f7af4572b07190f9dec28396b418e9a36 upstream. This model is actually called 92XXM2-8 in Windows driver. But since pin configs for M22 and M28 are identical, just reuse M22 quirk. Fixes external microphone (tested) and probably docking station ports (not tested). Signed-off-by: Alexander Tsoy <alexander@tsoy.me> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07pcmcia: remove left-over %Z formatNicolas Iooss
commit ff5a20169b98d84ad8d7f99f27c5ebbb008204d6 upstream. Commit 5b5e0928f742 ("lib/vsprintf.c: remove %Z support") removed some usages of format %Z but forgot "%.2Zx". This makes clang 4.0 reports a -Wformat-extra-args warning because it does not know about %Z. Replace %Z with %z. Link: http://lkml.kernel.org/r/20170520090946.22562-1-nicolas.iooss_linux@m4x.org Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Cc: Harald Welte <laforge@gnumonks.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07drm/radeon: Fix vram_size/visible values in DRM_RADEON_GEM_INFO ioctlMichel Dänzer
commit 51964e9e12d0a054002a1a0d1dec4f661c7aaf28 upstream. vram_size is supposed to be the total amount of VRAM that can be used by userspace, which corresponds to the TTM VRAM manager size (which is normally the full amount of VRAM, but can be just the visible VRAM when DMA can't be used for BO migration for some reason). The above was incorrectly used for vram_visible before, resulting in generally too large values being reported. Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Michel Dänzer <michel.daenzer@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07drm/radeon: Unbreak HPD handling for r600+Lyude
commit 3d18e33735a02b1a90aecf14410bf3edbfd4d3dc upstream. We end up reading the interrupt register for HPD5, and then writing it to HPD6 which on systems without anything using HPD5 results in permanently disabling hotplug on one of the display outputs after the first time we acknowledge a hotplug interrupt from the GPU. This code is really bad. But for now, let's just fix this. I will hopefully have a large patch series to refactor all of this soon. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Lyude <lyude@redhat.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07drm/radeon/ci: disable mclk switching for high refresh rates (v2)Alex Deucher
commit 58d7e3e427db1bd68f33025519a9468140280a75 upstream. Even if the vblank period would allow it, it still seems to be problematic on some cards. v2: fix logic inversion (Nils) bug: https://bugs.freedesktop.org/show_bug.cgi?id=96868 Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07scsi: mpt3sas: Force request partial completion alignmentRam Pai
commit f2e767bb5d6ee0d988cb7d4e54b0b21175802b6b upstream. The firmware or device, possibly under a heavy I/O load, can return on a partial unaligned boundary. Scsi-ml expects these requests to be completed on an alignment boundary. Scsi-ml blindly requeues the I/O without checking the alignment boundary of the I/O request for the remaining bytes. This leads to errors, since devices cannot perform non-aligned read/write operations. This patch fixes the issue in the driver. It aligns unaligned completions of FS requests, by truncating them to the nearest alignment boundary. [mkp: simplified if statement] Reported-by: Mauricio Faria De Oliveira <mauricfo@linux.vnet.ibm.com> Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> Acked-by: Sreekanth Reddy <Sreekanth.Reddy@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07nvme: avoid to use blk_mq_abort_requeue_list()Ming Lei
commit 986f75c876dbafed98eba7cb516c5118f155db23 upstream. NVMe may add request into requeue list simply and not kick off the requeue if hw queues are stopped. Then blk_mq_abort_requeue_list() is called in both nvme_kill_queues() and nvme_ns_remove() for dealing with this issue. Unfortunately blk_mq_abort_requeue_list() is absolutely a race maker, for example, one request may be requeued during the aborting. So this patch just calls blk_mq_kick_requeue_list() in nvme_kill_queues() to handle this issue like what nvme_start_queues() does. Now all requests in requeue list when queues are stopped will be handled by blk_mq_kick_requeue_list() when queues are restarted, either in nvme_start_queues() or in nvme_kill_queues(). Reported-by: Zhang Yi <yizhan@redhat.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07nvme: use blk_mq_start_hw_queues() in nvme_kill_queues()Ming Lei
commit 806f026f9b901eaf1a6baeb48b5da18d6a4f818e upstream. Inside nvme_kill_queues(), we have to start hw queues for draining requests in sw queues, .dispatch list and requeue list, so use blk_mq_start_hw_queues() instead of blk_mq_start_stopped_hw_queues() which only run queues if queues are stopped, but the queues may have been started already, for example nvme_start_queues() is called in reset work function. blk_mq_start_hw_queues() run hw queues in current context, instead of running asynchronously like before. Given nvme_kill_queues() is run from either remove context or reset worker context, both are fine to run hw queue directly. And the mutex of namespaces_mutex isn't a problem too becasue nvme_start_freeze() runs hw queue in this way already. Reported-by: Zhang Yi <yizhan@redhat.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07nvme-rdma: support devices with queue size < 32Marta Rybczynska
commit 0544f5494a03b8846db74e02be5685d1f32b06c9 upstream. In the case of small NVMe-oF queue size (<32) we may enter a deadlock caused by the fact that the IB completions aren't sent waiting for 32 and the send queue will fill up. The error is seen as (using mlx5): [ 2048.693355] mlx5_0:mlx5_ib_post_send:3765:(pid 7273): [ 2048.693360] nvme nvme1: nvme_rdma_post_send failed with error code -12 This patch changes the way the signaling is done so that it depends on the queue depth now. The magic define has been removed completely. Signed-off-by: Marta Rybczynska <marta.rybczynska@kalray.eu> Signed-off-by: Samuel Jones <sjones@kalray.eu> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07HID: wacom: Have wacom_tpc_irq guard against possible NULL dereferenceJason Gerecke
commit 2ac97f0f6654da14312d125005c77a6010e0ea38 upstream. The following Smatch complaint was generated in response to commit 2a6cdbd ("HID: wacom: Introduce new 'touch_input' device"): drivers/hid/wacom_wac.c:1586 wacom_tpc_irq() error: we previously assumed 'wacom->touch_input' could be null (see line 1577) The 'touch_input' and 'pen_input' variables point to the 'struct input_dev' used for relaying touch and pen events to userspace, respectively. If a device does not have a touch interface or pen interface, the associated input variable is NULL. The 'wacom_tpc_irq()' function is responsible for forwarding input reports to a more-specific IRQ handler function. An unknown report could theoretically be mistaken as e.g. a touch report on a device which does not have a touch interface. This can be prevented by only calling the pen/touch functions are called when the pen/touch pointers are valid. Fixes: 2a6cdbd ("HID: wacom: Introduce new 'touch_input' device") Signed-off-by: Jason Gerecke <jason.gerecke@wacom.com> Reviewed-by: Ping Cheng <ping.cheng@wacom.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07ibmvscsis: Fix the incorrect req_lim_deltaBryant G. Ly
commit 75dbf2d36f6b122ad3c1070fe4bf95f71bbff321 upstream. The current code is not correctly calculating the req_lim_delta. We want to make sure vscsi->credit is always incremented when we do not send a response for the scsi op. Thus for the case where there is a successfully aborted task we need to make sure the vscsi->credit is incremented. v2 - Moves the original location of the vscsi->credit increment to a better spot. Since if we increment credit, the next command we send back will have increased req_lim_delta. But we probably shouldn't be doing that until the aborted cmd is actually released. Otherwise the client will think that it can send a new command, and we could find ourselves short of command elements. Not likely, but could happen. This patch depends on both: commit 25e78531268e ("ibmvscsis: Do not send aborted task response") commit 98883f1b5415 ("ibmvscsis: Clear left-over abort_cmd pointers") Signed-off-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com> Reviewed-by: Michael Cyr <mikecyr@linux.vnet.ibm.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07ibmvscsis: Clear left-over abort_cmd pointersBryant G. Ly
commit 98883f1b5415ea9dce60d5178877d15f4faa10b8 upstream. With the addition of ibmvscsis->abort_cmd pointer within commit 25e78531268e ("ibmvscsis: Do not send aborted task response"), make sure to explicitly NULL these pointers when clearing DELAY_SEND flag. Do this for two cases, when getting the new new ibmvscsis descriptor in ibmvscsis_get_free_cmd() and before posting the response completion in ibmvscsis_send_messages(). Signed-off-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com> Reviewed-by: Michael Cyr <mikecyr@linux.vnet.ibm.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07iscsi-target: Always wait for kthread_should_stop() before kthread exitJiang Yi
commit 5e0cf5e6c43b9e19fc0284f69e5cd2b4a47523b0 upstream. There are three timing problems in the kthread usages of iscsi_target_mod: - np_thread of struct iscsi_np - rx_thread and tx_thread of struct iscsi_conn In iscsit_close_connection(), it calls send_sig(SIGINT, conn->tx_thread, 1); kthread_stop(conn->tx_thread); In conn->tx_thread, which is iscsi_target_tx_thread(), when it receive SIGINT the kthread will exit without checking the return value of kthread_should_stop(). So if iscsi_target_tx_thread() exit right between send_sig(SIGINT...) and kthread_stop(...), the kthread_stop() will try to stop an already stopped kthread. This is invalid according to the documentation of kthread_stop(). (Fix -ECONNRESET logout handling in iscsi_target_tx_thread and early iscsi_target_rx_thread failure case - nab) Signed-off-by: Jiang Yi <jiangyilism@gmail.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07mmc: sdhci-iproc: suppress spurious interrupt with Multiblock readSrinath Mannam
commit f5f968f2371ccdebb8a365487649673c9af68d09 upstream. The stingray SDHCI hardware supports ACMD12 and automatically issues after multi block transfer completed. If ACMD12 in SDHCI is disabled, spurious tx done interrupts are seen on multi block read command with below error message: Got data interrupt 0x00000002 even though no data operation was in progress. This patch uses SDHCI_QUIRK_MULTIBLOCK_READ_ACMD12 to enable ACM12 support in SDHCI hardware and suppress spurious interrupt. Signed-off-by: Srinath Mannam <srinath.mannam@broadcom.com> Reviewed-by: Ray Jui <ray.jui@broadcom.com> Reviewed-by: Scott Branden <scott.branden@broadcom.com> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Fixes: b580c52d58d9 ("mmc: sdhci-iproc: add IPROC SDHCI driver") Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>