linux-toradex.git/fs/btrfs, branch v4.9.45

Btrfs: fix early ENOSPC due to delalloc

2017-08-11T15:49:31+00:00

commit 17024ad0a0fdfcfe53043afb969b813d3e020c21 upstream.

If a lot of metadata is reserved for outstanding delayed allocations, we
rely on shrink_delalloc() to reclaim metadata space in order to fulfill
reservation tickets. However, shrink_delalloc() has a shortcut where if
it determines that space can be overcommitted, it will stop early. This
made sense before the ticketed enospc system, but now it means that
shrink_delalloc() will often not reclaim enough space to fulfill any
tickets, leading to an early ENOSPC. (Reservation tickets don't care
about being able to overcommit, they need every byte accounted for.)

Fix it by getting rid of the shortcut so that shrink_delalloc() reclaims
all of the metadata it is supposed to. This fixes early ENOSPCs we were
seeing when doing a btrfs receive to populate a new filesystem, as well
as early ENOSPCs Christoph saw when doing a big cp -r onto Btrfs.

Fixes: 957780eb2788 ("Btrfs: introduce ticketed enospc infrastructure")
Tested-by: Christoph Anton Mitterer 
Reviewed-by: Josef Bacik 
Signed-off-by: Omar Sandoval 
Signed-off-by: David Sterba 
Signed-off-by: Nikolay Borisov 
Signed-off-by: Greg Kroah-Hartman

Btrfs: adjust outstanding_extents counter properly when dio write is split

2017-08-07T01:59:47+00:00

[ Upstream commit c2931667c83ded6504b3857e99cc45b21fa496fb ]

Currently how btrfs dio deals with split dio write is not good
enough if dio write is split into several segments due to the
lack of contiguous space, a large dio write like 'dd bs=1G count=1'
can end up with incorrect outstanding_extents counter and endio
would complain loudly with an assertion.

This fixes the problem by compensating the outstanding_extents
counter in inode if a large dio write gets split.

Reported-by: Anand Jain 
Tested-by: Anand Jain 
Signed-off-by: Liu Bo 
Signed-off-by: David Sterba 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

Btrfs: fix lockdep warning about log_mutex

2017-08-07T01:59:47+00:00

[ Upstream commit 781feef7e6befafd4d9787d1f7ada1f9ccd504e4 ]

While checking INODE_REF/INODE_EXTREF for a corner case, we may acquire a
different inode's log_mutex with holding the current inode's log_mutex, and
lockdep has complained this with a possilble deadlock warning.

Fix this by using mutex_lock_nested() when processing the other inode's
log_mutex.

Reviewed-by: Filipe Manana 
Signed-off-by: Liu Bo 
Signed-off-by: David Sterba 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

Btrfs: use down_read_nested to make lockdep silent

2017-08-07T01:59:47+00:00

[ Upstream commit e321f8a801d7b4c40da8005257b05b9c2b51b072 ]

If @block_group is not @used_bg, it'll try to get @used_bg's lock without
droping @block_group 's lock and lockdep has throwed a scary deadlock warning
about it.
Fix it by using down_read_nested.

Signed-off-by: Liu Bo 
Reviewed-by: David Sterba 
Signed-off-by: David Sterba 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

btrfs: Don't clear SGID when inheriting ACLs

2017-07-27T22:07:58+00:00

commit b7f8a09f8097db776b8d160862540e4fc1f51296 upstream.

When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.

Fix the problem by moving posix_acl_update_mode() out of
__btrfs_set_acl() into btrfs_set_acl(). That way the function will not be
called when inheriting ACLs which is what we want as it prevents SGID
bit clearing and the mode has been properly set by posix_acl_create()
anyway.

Fixes: 073931017b49d9458aa351605b43a7e34598caef
CC: linux-btrfs@vger.kernel.org
CC: David Sterba 
Signed-off-by: Jan Kara 
Signed-off-by: David Sterba 
Signed-off-by: Greg Kroah-Hartman

Btrfs: fix truncate down when no_holes feature is enabled

2017-07-05T12:40:22+00:00

[ Upstream commit 91298eec05cd8d4e828cf7ee5d4a6334f70cf69a ]

For such a file mapping,

[0-4k][hole][8k-12k]

In NO_HOLES mode, we don't have the [hole] extent any more.
Commit c1aa45759e90 ("Btrfs: fix shrinking truncate when the no_holes feature is enabled")
 fixed disk isize not being updated in NO_HOLES mode when data is not flushed.

However, even if data has been flushed, we can still have trouble
in updating disk isize since we updated disk isize to 'start' of
the last evicted extent.

Reviewed-by: Chris Mason 
Signed-off-by: Liu Bo 
Signed-off-by: David Sterba 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

Btrfs: Fix deadlock between direct IO and fast fsync

2017-07-05T12:40:22+00:00

[ Upstream commit 97dcdea076ecef41ea4aaa23d4397c2f622e4265 ]

The following deadlock is seen when executing generic/113 test,

 ---------------------------------------------------------+----------------------------------------------------
  Direct I/O task                                           Fast fsync task
 ---------------------------------------------------------+----------------------------------------------------
  btrfs_direct_IO
    __blockdev_direct_IO
     do_blockdev_direct_IO
      do_direct_IO
       btrfs_get_blocks_direct
        while (blocks needs to written)
         get_more_blocks (first iteration)
          btrfs_get_blocks_direct
           btrfs_create_dio_extent
             down_read(&BTRFS_I(inode) >dio_sem)
             Create and add extent map and ordered extent
             up_read(&BTRFS_I(inode) >dio_sem)
                                                            btrfs_sync_file
                                                              btrfs_log_dentry_safe
                                                               btrfs_log_inode_parent
                                                                btrfs_log_inode
                                                                 btrfs_log_changed_extents
                                                                  down_write(&BTRFS_I(inode) >dio_sem)
                                                                   Collect new extent maps and ordered extents
                                                                    wait for ordered extent completion
         get_more_blocks (second iteration)
          btrfs_get_blocks_direct
           btrfs_create_dio_extent
             down_read(&BTRFS_I(inode) >dio_sem)
 --------------------------------------------------------------------------------------------------------------

In the above description, Btrfs direct I/O code path has not yet started
submitting bios for file range covered by the initial ordered
extent. Meanwhile, The fast fsync task obtains the write semaphore and
waits for I/O on the ordered extent to get completed. However, the
Direct I/O task is now blocked on obtaining the read semaphore.

To resolve the deadlock, this commit modifies the Direct I/O code path
to obtain the read semaphore before invoking
__blockdev_direct_IO(). The semaphore is then given up after
__blockdev_direct_IO() returns. This allows the Direct I/O code to
complete I/O on all the ordered extents it creates.

Signed-off-by: Chandan Rajendra 
Reviewed-by: Filipe Manana 
Signed-off-by: David Sterba 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

crypto: Work around deallocated stack frame reference gcc bug on sparc.

2017-06-24T05:11:17+00:00

commit d41519a69b35b10af7fda867fb9100df24fdf403 upstream.

On sparc, if we have an alloca() like situation, as is the case with
SHASH_DESC_ON_STACK(), we can end up referencing deallocated stack
memory.  The result can be that the value is clobbered if a trap
or interrupt arrives at just the right instruction.

It only occurs if the function ends returning a value from that
alloca() area and that value can be placed into the return value
register using a single instruction.

For example, in lib/libcrc32c.c:crc32c() we end up with a return
sequence like:

        return  %i7+8
         lduw   [%o5+16], %o0   ! MEM[(u32 *)__shash_desc.1_10 + 16B],

%o5 holds the base of the on-stack area allocated for the shash
descriptor.  But the return released the stack frame and the
register window.

So if an intererupt arrives between 'return' and 'lduw', then
the value read at %o5+16 can be corrupted.

Add a data compiler barrier to work around this problem.  This is
exactly what the gcc fix will end up doing as well, and it absolutely
should not change the code generated for other cpus (unless gcc
on them has the same bug :-)

With crucial insight from Eric Sandeen.

Reported-by: Anatoly Pugachev 
Signed-off-by: David S. Miller 
Signed-off-by: Herbert Xu 
Signed-off-by: Greg Kroah-Hartman

btrfs: fix memory leak in update_space_info failure path

2017-06-14T13:06:02+00:00

commit 896533a7da929136d0432713f02a3edffece2826 upstream.

If we fail to add the space_info kobject, we'll leak the memory
for the percpu counter.

Fixes: 6ab0a2029c (btrfs: publish allocation data in sysfs)
Signed-off-by: Jeff Mahoney 
Reviewed-by: Liu Bo 
Signed-off-by: David Sterba 
Signed-off-by: Greg Kroah-Hartman

btrfs: use correct types for page indices in btrfs_page_exists_in_range

2017-06-14T13:06:02+00:00

commit cc2b702c52094b637a351d7491ac5200331d0445 upstream.

Variables start_idx and end_idx are supposed to hold a page index
derived from the file offsets. The int type is not the right one though,
offsets larger than 1 << 44 will get silently trimmed off the high bits.
(1 << 44 is 16TiB)

What can go wrong, if start is below the boundary and end gets trimmed:
- if there's a page after start, we'll find it (radix_tree_gang_lookup_slot)
- the final check "if (page->index <= end_idx)" will unexpectedly fail

The function will return false, ie. "there's no page in the range",
although there is at least one.

btrfs_page_exists_in_range is used to prevent races in:

* in hole punching, where we make sure there are not pages in the
  truncated range, otherwise we'll wait for them to finish and redo
  truncation, but we're going to replace the pages with holes anyway so
  the only problem is the intermediate state

* lock_extent_direct: we want to make sure there are no pages before we
  lock and start DIO, to prevent stale data reads

For practical occurence of the bug, there are several constaints.  The
file must be quite large, the affected range must cross the 16TiB
boundary and the internal state of the file pages and pending operations
must match.  Also, we must not have started any ordered data in the
range, otherwise we don't even reach the buggy function check.

DIO locking tries hard in several places to avoid deadlocks with
buffered IO and avoids waiting for ranges. The worst consequence seems
to be stale data read.

CC: Liu Bo 
Fixes: fc4adbff823f7 ("btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking")
Reviewed-by: Liu Bo 
Signed-off-by: David Sterba 
Signed-off-by: Greg Kroah-Hartman