<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/fs/btrfs, branch v4.1.10</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>Btrfs: check if previous transaction aborted to avoid fs corruption</title>
<updated>2015-09-29T17:26:07+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2015-08-12T10:54:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=fe995635458e8967dedc75ed8d47fe208e77a9be'/>
<id>fe995635458e8967dedc75ed8d47fe208e77a9be</id>
<content type='text'>
commit 1f9b8c8fbc9a4d029760b16f477b9d15500e3a34 upstream.

While we are committing a transaction, it's possible the previous one is
still finishing its commit and therefore we wait for it to finish first.
However we were not checking if that previous transaction ended up getting
aborted after we waited for it to commit, so we ended up committing the
current transaction which can lead to fs corruption because the new
superblock can point to trees that have had one or more nodes/leafs that
were never durably persisted.
The following sequence diagram exemplifies how this is possible:

          CPU 0                                                        CPU 1

  transaction N starts

  (...)

  btrfs_commit_transaction(N)

    cur_trans-&gt;state = TRANS_STATE_COMMIT_START;
    (...)
    cur_trans-&gt;state = TRANS_STATE_COMMIT_DOING;
    (...)

    cur_trans-&gt;state = TRANS_STATE_UNBLOCKED;
    root-&gt;fs_info-&gt;running_transaction = NULL;

                                                              btrfs_start_transaction()
                                                                 --&gt; starts transaction N + 1

    btrfs_write_and_wait_transaction(trans, root);
      --&gt; starts writing all new or COWed ebs created
          at transaction N

                                                              creates some new ebs, COWs some
                                                              existing ebs but doesn't COW or
                                                              deletes eb X

                                                              btrfs_commit_transaction(N + 1)
                                                                (...)
                                                                cur_trans-&gt;state = TRANS_STATE_COMMIT_START;
                                                                (...)
                                                                wait_for_commit(root, prev_trans);
                                                                  --&gt; prev_trans == transaction N

    btrfs_write_and_wait_transaction() continues
    writing ebs
       --&gt; fails writing eb X, we abort transaction N
           and set bit BTRFS_FS_STATE_ERROR on
           fs_info-&gt;fs_state, so no new transactions
           can start after setting that bit

       cleanup_transaction()
         btrfs_cleanup_one_transaction()
           wakes up task at CPU 1

                                                                continues, doesn't abort because
                                                                cur_trans-&gt;aborted (transaction N + 1)
                                                                is zero, and no checks for bit
                                                                BTRFS_FS_STATE_ERROR in fs_info-&gt;fs_state
                                                                are made

                                                                btrfs_write_and_wait_transaction(trans, root);
                                                                  --&gt; succeeds, no errors during writeback

                                                                write_ctree_super(trans, root, 0);
                                                                  --&gt; succeeds
                                                                  --&gt; we have now a superblock that points us
                                                                      to some root that uses eb X, which was
                                                                      never written to disk

In this scenario future attempts to read eb X from disk results in an
error message like "parent transid verify failed on X wanted Y found Z".

So fix this by aborting the current transaction if after waiting for the
previous transaction we verify that it was aborted.

Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Reviewed-by: Josef Bacik &lt;jbacik@fb.com&gt;
Reviewed-by: Liu Bo &lt;bo.li.liu@oracle.com&gt;
Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 1f9b8c8fbc9a4d029760b16f477b9d15500e3a34 upstream.

While we are committing a transaction, it's possible the previous one is
still finishing its commit and therefore we wait for it to finish first.
However we were not checking if that previous transaction ended up getting
aborted after we waited for it to commit, so we ended up committing the
current transaction which can lead to fs corruption because the new
superblock can point to trees that have had one or more nodes/leafs that
were never durably persisted.
The following sequence diagram exemplifies how this is possible:

          CPU 0                                                        CPU 1

  transaction N starts

  (...)

  btrfs_commit_transaction(N)

    cur_trans-&gt;state = TRANS_STATE_COMMIT_START;
    (...)
    cur_trans-&gt;state = TRANS_STATE_COMMIT_DOING;
    (...)

    cur_trans-&gt;state = TRANS_STATE_UNBLOCKED;
    root-&gt;fs_info-&gt;running_transaction = NULL;

                                                              btrfs_start_transaction()
                                                                 --&gt; starts transaction N + 1

    btrfs_write_and_wait_transaction(trans, root);
      --&gt; starts writing all new or COWed ebs created
          at transaction N

                                                              creates some new ebs, COWs some
                                                              existing ebs but doesn't COW or
                                                              deletes eb X

                                                              btrfs_commit_transaction(N + 1)
                                                                (...)
                                                                cur_trans-&gt;state = TRANS_STATE_COMMIT_START;
                                                                (...)
                                                                wait_for_commit(root, prev_trans);
                                                                  --&gt; prev_trans == transaction N

    btrfs_write_and_wait_transaction() continues
    writing ebs
       --&gt; fails writing eb X, we abort transaction N
           and set bit BTRFS_FS_STATE_ERROR on
           fs_info-&gt;fs_state, so no new transactions
           can start after setting that bit

       cleanup_transaction()
         btrfs_cleanup_one_transaction()
           wakes up task at CPU 1

                                                                continues, doesn't abort because
                                                                cur_trans-&gt;aborted (transaction N + 1)
                                                                is zero, and no checks for bit
                                                                BTRFS_FS_STATE_ERROR in fs_info-&gt;fs_state
                                                                are made

                                                                btrfs_write_and_wait_transaction(trans, root);
                                                                  --&gt; succeeds, no errors during writeback

                                                                write_ctree_super(trans, root, 0);
                                                                  --&gt; succeeds
                                                                  --&gt; we have now a superblock that points us
                                                                      to some root that uses eb X, which was
                                                                      never written to disk

In this scenario future attempts to read eb X from disk results in an
error message like "parent transid verify failed on X wanted Y found Z".

So fix this by aborting the current transaction if after waiting for the
previous transaction we verify that it was aborted.

Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Reviewed-by: Josef Bacik &lt;jbacik@fb.com&gt;
Reviewed-by: Liu Bo &lt;bo.li.liu@oracle.com&gt;
Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>Btrfs: fix file corruption after cloning inline extents</title>
<updated>2015-08-03T16:29:14+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2015-07-14T15:09:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=df7c9ca8f5925d9b839864437b7090a865d560e4'/>
<id>df7c9ca8f5925d9b839864437b7090a865d560e4</id>
<content type='text'>
commit ed958762644b404654a6f5d23e869f496fe127c6 upstream.

Using the clone ioctl (or extent_same ioctl, which calls the same extent
cloning function as well) we end up allowing copy an inline extent from
the source file into a non-zero offset of the destination file. This is
something not expected and that the btrfs code is not prepared to deal
with - all inline extents must be at a file offset equals to 0.

For example, the following excerpt of a test case for fstests triggers
a crash/BUG_ON() on a write operation after an inline extent is cloned
into a non-zero offset:

  _scratch_mkfs &gt;&gt;$seqres.full 2&gt;&amp;1
  _scratch_mount

  # Create our test files. File foo has the same 2K of data at offset 4K
  # as file bar has at its offset 0.
  $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 4K" \
      -c "pwrite -S 0xbb 4k 2K" \
      -c "pwrite -S 0xcc 8K 4K" \
      $SCRATCH_MNT/foo | _filter_xfs_io

  # File bar consists of a single inline extent (2K size).
  $XFS_IO_PROG -f -s -c "pwrite -S 0xbb 0 2K" \
     $SCRATCH_MNT/bar | _filter_xfs_io

  # Now call the clone ioctl to clone the extent of file bar into file
  # foo at its offset 4K. This made file foo have an inline extent at
  # offset 4K, something which the btrfs code can not deal with in future
  # IO operations because all inline extents are supposed to start at an
  # offset of 0, resulting in all sorts of chaos.
  # So here we validate that clone ioctl returns an EOPNOTSUPP, which is
  # what it returns for other cases dealing with inlined extents.
  $CLONER_PROG -s 0 -d $((4 * 1024)) -l $((2 * 1024)) \
      $SCRATCH_MNT/bar $SCRATCH_MNT/foo

  # Because of the inline extent at offset 4K, the following write made
  # the kernel crash with a BUG_ON().
  $XFS_IO_PROG -c "pwrite -S 0xdd 6K 2K" $SCRATCH_MNT/foo | _filter_xfs_io

  status=0
  exit

The stack trace of the BUG_ON() triggered by the last write is:

  [152154.035903] ------------[ cut here ]------------
  [152154.036424] kernel BUG at mm/page-writeback.c:2286!
  [152154.036424] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  [152154.036424] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpu$
  [152154.036424] CPU: 2 PID: 17873 Comm: xfs_io Tainted: G        W       4.1.0-rc6-btrfs-next-11+ #2
  [152154.036424] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
  [152154.036424] task: ffff880429f70990 ti: ffff880429efc000 task.ti: ffff880429efc000
  [152154.036424] RIP: 0010:[&lt;ffffffff8111a9d5&gt;]  [&lt;ffffffff8111a9d5&gt;] clear_page_dirty_for_io+0x1e/0x90
  [152154.036424] RSP: 0018:ffff880429effc68  EFLAGS: 00010246
  [152154.036424] RAX: 0200000000000806 RBX: ffffea0006a6d8f0 RCX: 0000000000000001
  [152154.036424] RDX: 0000000000000000 RSI: ffffffff81155d1b RDI: ffffea0006a6d8f0
  [152154.036424] RBP: ffff880429effc78 R08: ffff8801ce389fe0 R09: 0000000000000001
  [152154.036424] R10: 0000000000002000 R11: ffffffffffffffff R12: ffff8800200dce68
  [152154.036424] R13: 0000000000000000 R14: ffff8800200dcc88 R15: ffff8803d5736d80
  [152154.036424] FS:  00007fbf119f6700(0000) GS:ffff88043d280000(0000) knlGS:0000000000000000
  [152154.036424] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [152154.036424] CR2: 0000000001bdc000 CR3: 00000003aa555000 CR4: 00000000000006e0
  [152154.036424] Stack:
  [152154.036424]  ffff8803d5736d80 0000000000000001 ffff880429effcd8 ffffffffa04e97c1
  [152154.036424]  ffff880429effd68 ffff880429effd60 0000000000000001 ffff8800200dc9c8
  [152154.036424]  0000000000000001 ffff8800200dcc88 0000000000000000 0000000000001000
  [152154.036424] Call Trace:
  [152154.036424]  [&lt;ffffffffa04e97c1&gt;] lock_and_cleanup_extent_if_need+0x147/0x18d [btrfs]
  [152154.036424]  [&lt;ffffffffa04ea82c&gt;] __btrfs_buffered_write+0x245/0x4c8 [btrfs]
  [152154.036424]  [&lt;ffffffffa04ed14b&gt;] ? btrfs_file_write_iter+0x150/0x3e0 [btrfs]
  [152154.036424]  [&lt;ffffffffa04ed15a&gt;] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs]
  [152154.036424]  [&lt;ffffffffa04ed2c7&gt;] btrfs_file_write_iter+0x2cc/0x3e0 [btrfs]
  [152154.036424]  [&lt;ffffffff81165a4a&gt;] __vfs_write+0x7c/0xa5
  [152154.036424]  [&lt;ffffffff81165f89&gt;] vfs_write+0xa0/0xe4
  [152154.036424]  [&lt;ffffffff81166855&gt;] SyS_pwrite64+0x64/0x82
  [152154.036424]  [&lt;ffffffff81465197&gt;] system_call_fastpath+0x12/0x6f
  [152154.036424] Code: 48 89 c7 e8 0f ff ff ff 5b 41 5c 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 ae ef 00 00 49 89 c4 48 8b 03 a8 01 75 02 &lt;0f&gt; 0b 4d 85 e4 74 59 49 8b 3c 2$
  [152154.036424] RIP  [&lt;ffffffff8111a9d5&gt;] clear_page_dirty_for_io+0x1e/0x90
  [152154.036424]  RSP &lt;ffff880429effc68&gt;
  [152154.242621] ---[ end trace e3d3376b23a57041 ]---

Fix this by returning the error EOPNOTSUPP if an attempt to copy an
inline extent into a non-zero offset happens, just like what is done for
other scenarios that would require copying/splitting inline extents,
which were introduced by the following commits:

   00fdf13a2e9f ("Btrfs: fix a crash of clone with inline extents's split")
   3f9e3df8da3c ("btrfs: replace error code from btrfs_drop_extents")

Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit ed958762644b404654a6f5d23e869f496fe127c6 upstream.

Using the clone ioctl (or extent_same ioctl, which calls the same extent
cloning function as well) we end up allowing copy an inline extent from
the source file into a non-zero offset of the destination file. This is
something not expected and that the btrfs code is not prepared to deal
with - all inline extents must be at a file offset equals to 0.

For example, the following excerpt of a test case for fstests triggers
a crash/BUG_ON() on a write operation after an inline extent is cloned
into a non-zero offset:

  _scratch_mkfs &gt;&gt;$seqres.full 2&gt;&amp;1
  _scratch_mount

  # Create our test files. File foo has the same 2K of data at offset 4K
  # as file bar has at its offset 0.
  $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 4K" \
      -c "pwrite -S 0xbb 4k 2K" \
      -c "pwrite -S 0xcc 8K 4K" \
      $SCRATCH_MNT/foo | _filter_xfs_io

  # File bar consists of a single inline extent (2K size).
  $XFS_IO_PROG -f -s -c "pwrite -S 0xbb 0 2K" \
     $SCRATCH_MNT/bar | _filter_xfs_io

  # Now call the clone ioctl to clone the extent of file bar into file
  # foo at its offset 4K. This made file foo have an inline extent at
  # offset 4K, something which the btrfs code can not deal with in future
  # IO operations because all inline extents are supposed to start at an
  # offset of 0, resulting in all sorts of chaos.
  # So here we validate that clone ioctl returns an EOPNOTSUPP, which is
  # what it returns for other cases dealing with inlined extents.
  $CLONER_PROG -s 0 -d $((4 * 1024)) -l $((2 * 1024)) \
      $SCRATCH_MNT/bar $SCRATCH_MNT/foo

  # Because of the inline extent at offset 4K, the following write made
  # the kernel crash with a BUG_ON().
  $XFS_IO_PROG -c "pwrite -S 0xdd 6K 2K" $SCRATCH_MNT/foo | _filter_xfs_io

  status=0
  exit

The stack trace of the BUG_ON() triggered by the last write is:

  [152154.035903] ------------[ cut here ]------------
  [152154.036424] kernel BUG at mm/page-writeback.c:2286!
  [152154.036424] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  [152154.036424] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpu$
  [152154.036424] CPU: 2 PID: 17873 Comm: xfs_io Tainted: G        W       4.1.0-rc6-btrfs-next-11+ #2
  [152154.036424] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
  [152154.036424] task: ffff880429f70990 ti: ffff880429efc000 task.ti: ffff880429efc000
  [152154.036424] RIP: 0010:[&lt;ffffffff8111a9d5&gt;]  [&lt;ffffffff8111a9d5&gt;] clear_page_dirty_for_io+0x1e/0x90
  [152154.036424] RSP: 0018:ffff880429effc68  EFLAGS: 00010246
  [152154.036424] RAX: 0200000000000806 RBX: ffffea0006a6d8f0 RCX: 0000000000000001
  [152154.036424] RDX: 0000000000000000 RSI: ffffffff81155d1b RDI: ffffea0006a6d8f0
  [152154.036424] RBP: ffff880429effc78 R08: ffff8801ce389fe0 R09: 0000000000000001
  [152154.036424] R10: 0000000000002000 R11: ffffffffffffffff R12: ffff8800200dce68
  [152154.036424] R13: 0000000000000000 R14: ffff8800200dcc88 R15: ffff8803d5736d80
  [152154.036424] FS:  00007fbf119f6700(0000) GS:ffff88043d280000(0000) knlGS:0000000000000000
  [152154.036424] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [152154.036424] CR2: 0000000001bdc000 CR3: 00000003aa555000 CR4: 00000000000006e0
  [152154.036424] Stack:
  [152154.036424]  ffff8803d5736d80 0000000000000001 ffff880429effcd8 ffffffffa04e97c1
  [152154.036424]  ffff880429effd68 ffff880429effd60 0000000000000001 ffff8800200dc9c8
  [152154.036424]  0000000000000001 ffff8800200dcc88 0000000000000000 0000000000001000
  [152154.036424] Call Trace:
  [152154.036424]  [&lt;ffffffffa04e97c1&gt;] lock_and_cleanup_extent_if_need+0x147/0x18d [btrfs]
  [152154.036424]  [&lt;ffffffffa04ea82c&gt;] __btrfs_buffered_write+0x245/0x4c8 [btrfs]
  [152154.036424]  [&lt;ffffffffa04ed14b&gt;] ? btrfs_file_write_iter+0x150/0x3e0 [btrfs]
  [152154.036424]  [&lt;ffffffffa04ed15a&gt;] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs]
  [152154.036424]  [&lt;ffffffffa04ed2c7&gt;] btrfs_file_write_iter+0x2cc/0x3e0 [btrfs]
  [152154.036424]  [&lt;ffffffff81165a4a&gt;] __vfs_write+0x7c/0xa5
  [152154.036424]  [&lt;ffffffff81165f89&gt;] vfs_write+0xa0/0xe4
  [152154.036424]  [&lt;ffffffff81166855&gt;] SyS_pwrite64+0x64/0x82
  [152154.036424]  [&lt;ffffffff81465197&gt;] system_call_fastpath+0x12/0x6f
  [152154.036424] Code: 48 89 c7 e8 0f ff ff ff 5b 41 5c 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 ae ef 00 00 49 89 c4 48 8b 03 a8 01 75 02 &lt;0f&gt; 0b 4d 85 e4 74 59 49 8b 3c 2$
  [152154.036424] RIP  [&lt;ffffffff8111a9d5&gt;] clear_page_dirty_for_io+0x1e/0x90
  [152154.036424]  RSP &lt;ffff880429effc68&gt;
  [152154.242621] ---[ end trace e3d3376b23a57041 ]---

Fix this by returning the error EOPNOTSUPP if an attempt to copy an
inline extent into a non-zero offset happens, just like what is done for
other scenarios that would require copying/splitting inline extents,
which were introduced by the following commits:

   00fdf13a2e9f ("Btrfs: fix a crash of clone with inline extents's split")
   3f9e3df8da3c ("btrfs: replace error code from btrfs_drop_extents")

Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>Btrfs: fix list transaction-&gt;pending_ordered corruption</title>
<updated>2015-08-03T16:29:13+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2015-07-03T19:30:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=98f7bfe6a25a7b822efc7fcc72b7b494fefd438f'/>
<id>98f7bfe6a25a7b822efc7fcc72b7b494fefd438f</id>
<content type='text'>
commit d3efe08400317888f559bbedf0e42cd31575d0ef upstream.

When we call btrfs_commit_transaction(), we splice the list "ordered"
of our transaction handle into the transaction's "pending_ordered"
list, but we don't re-initialize the "ordered" list of our transaction
handle, this means it still points to the same elements it used to
before the splice. Then we check if the current transaction's state is
&gt;= TRANS_STATE_COMMIT_START and if it is we end up calling
btrfs_end_transaction() which simply splices again the "ordered" list
of our handle into the transaction's "pending_ordered" list, leaving
multiple pointers to the same ordered extents which results in list
corruption when we are iterating, removing and freeing ordered extents
at btrfs_wait_pending_ordered(), resulting in access to dangling
pointers / use-after-free issues.
Similarly, btrfs_end_transaction() can end up in some cases calling
btrfs_commit_transaction(), and both did a list splice of the transaction
handle's "ordered" list into the transaction's "pending_ordered" without
re-initializing the handle's "ordered" list, resulting in exactly the
same problem.

This produces the following warning on a kernel with linked list
debugging enabled:

[109749.265416] ------------[ cut here ]------------
[109749.266410] WARNING: CPU: 7 PID: 324 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()
[109749.267969] list_del corruption. prev-&gt;next should be ffff8800ba087e20, but was fffffff8c1f7c35d
(...)
[109749.287505] Call Trace:
[109749.288135]  [&lt;ffffffff8145f077&gt;] dump_stack+0x4f/0x7b
[109749.298080]  [&lt;ffffffff81095de5&gt;] ? console_unlock+0x356/0x3a2
[109749.331605]  [&lt;ffffffff8104b3b0&gt;] warn_slowpath_common+0xa1/0xbb
[109749.334849]  [&lt;ffffffff81260642&gt;] ? __list_del_entry+0x5a/0x98
[109749.337093]  [&lt;ffffffff8104b410&gt;] warn_slowpath_fmt+0x46/0x48
[109749.337847]  [&lt;ffffffff81260642&gt;] __list_del_entry+0x5a/0x98
[109749.338678]  [&lt;ffffffffa053e8bf&gt;] btrfs_wait_pending_ordered+0x46/0xdb [btrfs]
[109749.340145]  [&lt;ffffffffa058a65f&gt;] ? __btrfs_run_delayed_items+0x149/0x163 [btrfs]
[109749.348313]  [&lt;ffffffffa054077d&gt;] btrfs_commit_transaction+0x36b/0xa10 [btrfs]
[109749.349745]  [&lt;ffffffff81087310&gt;] ? trace_hardirqs_on+0xd/0xf
[109749.350819]  [&lt;ffffffffa055370d&gt;] btrfs_sync_file+0x36f/0x3fc [btrfs]
[109749.351976]  [&lt;ffffffff8118ec98&gt;] vfs_fsync_range+0x8f/0x9e
[109749.360341]  [&lt;ffffffff8118ecc3&gt;] vfs_fsync+0x1c/0x1e
[109749.368828]  [&lt;ffffffff8118ee1d&gt;] do_fsync+0x34/0x4e
[109749.369790]  [&lt;ffffffff8118f045&gt;] SyS_fsync+0x10/0x14
[109749.370925]  [&lt;ffffffff81465197&gt;] system_call_fastpath+0x12/0x6f
[109749.382274] ---[ end trace 48e0d07f7c03d95a ]---

On a non-debug kernel this leads to invalid memory accesses, causing a
crash. Fix this by using list_splice_init() instead of list_splice() in
btrfs_commit_transaction() and btrfs_end_transaction().

Fixes: 50d9aa99bd35 ("Btrfs: make sure logged extents complete in the current transaction V3"
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit d3efe08400317888f559bbedf0e42cd31575d0ef upstream.

When we call btrfs_commit_transaction(), we splice the list "ordered"
of our transaction handle into the transaction's "pending_ordered"
list, but we don't re-initialize the "ordered" list of our transaction
handle, this means it still points to the same elements it used to
before the splice. Then we check if the current transaction's state is
&gt;= TRANS_STATE_COMMIT_START and if it is we end up calling
btrfs_end_transaction() which simply splices again the "ordered" list
of our handle into the transaction's "pending_ordered" list, leaving
multiple pointers to the same ordered extents which results in list
corruption when we are iterating, removing and freeing ordered extents
at btrfs_wait_pending_ordered(), resulting in access to dangling
pointers / use-after-free issues.
Similarly, btrfs_end_transaction() can end up in some cases calling
btrfs_commit_transaction(), and both did a list splice of the transaction
handle's "ordered" list into the transaction's "pending_ordered" without
re-initializing the handle's "ordered" list, resulting in exactly the
same problem.

This produces the following warning on a kernel with linked list
debugging enabled:

[109749.265416] ------------[ cut here ]------------
[109749.266410] WARNING: CPU: 7 PID: 324 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()
[109749.267969] list_del corruption. prev-&gt;next should be ffff8800ba087e20, but was fffffff8c1f7c35d
(...)
[109749.287505] Call Trace:
[109749.288135]  [&lt;ffffffff8145f077&gt;] dump_stack+0x4f/0x7b
[109749.298080]  [&lt;ffffffff81095de5&gt;] ? console_unlock+0x356/0x3a2
[109749.331605]  [&lt;ffffffff8104b3b0&gt;] warn_slowpath_common+0xa1/0xbb
[109749.334849]  [&lt;ffffffff81260642&gt;] ? __list_del_entry+0x5a/0x98
[109749.337093]  [&lt;ffffffff8104b410&gt;] warn_slowpath_fmt+0x46/0x48
[109749.337847]  [&lt;ffffffff81260642&gt;] __list_del_entry+0x5a/0x98
[109749.338678]  [&lt;ffffffffa053e8bf&gt;] btrfs_wait_pending_ordered+0x46/0xdb [btrfs]
[109749.340145]  [&lt;ffffffffa058a65f&gt;] ? __btrfs_run_delayed_items+0x149/0x163 [btrfs]
[109749.348313]  [&lt;ffffffffa054077d&gt;] btrfs_commit_transaction+0x36b/0xa10 [btrfs]
[109749.349745]  [&lt;ffffffff81087310&gt;] ? trace_hardirqs_on+0xd/0xf
[109749.350819]  [&lt;ffffffffa055370d&gt;] btrfs_sync_file+0x36f/0x3fc [btrfs]
[109749.351976]  [&lt;ffffffff8118ec98&gt;] vfs_fsync_range+0x8f/0x9e
[109749.360341]  [&lt;ffffffff8118ecc3&gt;] vfs_fsync+0x1c/0x1e
[109749.368828]  [&lt;ffffffff8118ee1d&gt;] do_fsync+0x34/0x4e
[109749.369790]  [&lt;ffffffff8118f045&gt;] SyS_fsync+0x10/0x14
[109749.370925]  [&lt;ffffffff81465197&gt;] system_call_fastpath+0x12/0x6f
[109749.382274] ---[ end trace 48e0d07f7c03d95a ]---

On a non-debug kernel this leads to invalid memory accesses, causing a
crash. Fix this by using list_splice_init() instead of list_splice() in
btrfs_commit_transaction() and btrfs_end_transaction().

Fixes: 50d9aa99bd35 ("Btrfs: make sure logged extents complete in the current transaction V3"
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>Btrfs: fix memory leak in the extent_same ioctl</title>
<updated>2015-08-03T16:29:13+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2015-07-03T07:36:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=992a3fbb5f5d77b23e4d7a785e4ad22b4acf82cd'/>
<id>992a3fbb5f5d77b23e4d7a785e4ad22b4acf82cd</id>
<content type='text'>
commit 497b4050e0eacd4c746dd396d14916b1e669849d upstream.

We were allocating memory with memdup_user() but we were never releasing
that memory. This affected pretty much every call to the ioctl, whether
it deduplicated extents or not.

This issue was reported on IRC by Julian Taylor and on the mailing list
by Marcel Ritter, credit goes to them for finding the issue.

Reported-by: Julian Taylor &lt;jtaylor.debian@googlemail.com&gt;
Reported-by: Marcel Ritter &lt;ritter.marcel@gmail.com&gt;
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Reviewed-by: Mark Fasheh &lt;mfasheh@suse.de&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 497b4050e0eacd4c746dd396d14916b1e669849d upstream.

We were allocating memory with memdup_user() but we were never releasing
that memory. This affected pretty much every call to the ioctl, whether
it deduplicated extents or not.

This issue was reported on IRC by Julian Taylor and on the mailing list
by Marcel Ritter, credit goes to them for finding the issue.

Reported-by: Julian Taylor &lt;jtaylor.debian@googlemail.com&gt;
Reported-by: Marcel Ritter &lt;ritter.marcel@gmail.com&gt;
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Reviewed-by: Mark Fasheh &lt;mfasheh@suse.de&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>Btrfs: fix fsync data loss after append write</title>
<updated>2015-08-03T16:29:13+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2015-06-17T11:49:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=544f8fbe0af1ab753531cfc4efedf39d64a2a656'/>
<id>544f8fbe0af1ab753531cfc4efedf39d64a2a656</id>
<content type='text'>
commit e4545de5b035c7debb73d260c78377dbb69cbfb5 upstream.

If we do an append write to a file (which increases its inode's i_size)
that does not have the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode,
and the previous transaction added a new hard link to the file, which sets
the flag BTRFS_INODE_COPY_EVERYTHING in the file's inode, and then fsync
the file, the inode's new i_size isn't logged. This has the consequence
that after the fsync log is replayed, the file size remains what it was
before the append write operation, which means users/applications will
not be able to read the data that was successsfully fsync'ed before.

This happens because neither the inode item nor the delayed inode get
their i_size updated when the append write is made - doing so would
require starting a transaction in the buffered write path, something that
we do not do intentionally for performance reasons.

Fix this by making sure that when the flag BTRFS_INODE_COPY_EVERYTHING is
set the inode is logged with its current i_size (log the in-memory inode
into the log tree).

This issue is not a recent regression and is easy to reproduce with the
following test case for fstests:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  here=`pwd`
  tmp=/tmp/$$
  status=1	# failure is the default!

  _cleanup()
  {
          _cleanup_flakey
          rm -f $tmp.*
  }
  trap "_cleanup; exit \$status" 0 1 2 3 15

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _supported_fs generic
  _supported_os Linux
  _need_to_be_root
  _require_scratch
  _require_dm_flakey
  _require_metadata_journaling $SCRATCH_DEV

  _crash_and_mount()
  {
          # Simulate a crash/power loss.
          _load_flakey_table $FLAKEY_DROP_WRITES
          _unmount_flakey
          # Allow writes again and mount. This makes the fs replay its fsync log.
          _load_flakey_table $FLAKEY_ALLOW_WRITES
          _mount_flakey
  }

  rm -f $seqres.full

  _scratch_mkfs &gt;&gt; $seqres.full 2&gt;&amp;1
  _init_flakey
  _mount_flakey

  # Create the test file with some initial data and then fsync it.
  # The fsync here is only needed to trigger the issue in btrfs, as it causes the
  # the flag BTRFS_INODE_NEEDS_FULL_SYNC to be removed from the btrfs inode.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \
                  -c "fsync" \
                  $SCRATCH_MNT/foo | _filter_xfs_io
  sync

  # Add a hard link to our file.
  # On btrfs this sets the flag BTRFS_INODE_COPY_EVERYTHING on the btrfs inode,
  # which is a necessary condition to trigger the issue.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/bar

  # Sync the filesystem to force a commit of the current btrfs transaction, this
  # is a necessary condition to trigger the bug on btrfs.
  sync

  # Now append more data to our file, increasing its size, and fsync the file.
  # In btrfs because the inode flag BTRFS_INODE_COPY_EVERYTHING was set and the
  # write path did not update the inode item in the btree nor the delayed inode
  # item (in memory struture) in the current transaction (created by the fsync
  # handler), the fsync did not record the inode's new i_size in the fsync
  # log/journal. This made the data unavailable after the fsync log/journal is
  # replayed.
  $XFS_IO_PROG -c "pwrite -S 0xbb 32K 32K" \
               -c "fsync" \
               $SCRATCH_MNT/foo | _filter_xfs_io

  echo "File content after fsync and before crash:"
  od -t x1 $SCRATCH_MNT/foo

  _crash_and_mount

  echo "File content after crash and log replay:"
  od -t x1 $SCRATCH_MNT/foo

  status=0
  exit

The expected file output before and after the crash/power failure expects the
appended data to be available, which is:

  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0100000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
  *
  0200000

Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Reviewed-by: Liu Bo &lt;bo.li.liu@oracle.com&gt;
Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit e4545de5b035c7debb73d260c78377dbb69cbfb5 upstream.

If we do an append write to a file (which increases its inode's i_size)
that does not have the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode,
and the previous transaction added a new hard link to the file, which sets
the flag BTRFS_INODE_COPY_EVERYTHING in the file's inode, and then fsync
the file, the inode's new i_size isn't logged. This has the consequence
that after the fsync log is replayed, the file size remains what it was
before the append write operation, which means users/applications will
not be able to read the data that was successsfully fsync'ed before.

This happens because neither the inode item nor the delayed inode get
their i_size updated when the append write is made - doing so would
require starting a transaction in the buffered write path, something that
we do not do intentionally for performance reasons.

Fix this by making sure that when the flag BTRFS_INODE_COPY_EVERYTHING is
set the inode is logged with its current i_size (log the in-memory inode
into the log tree).

This issue is not a recent regression and is easy to reproduce with the
following test case for fstests:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  here=`pwd`
  tmp=/tmp/$$
  status=1	# failure is the default!

  _cleanup()
  {
          _cleanup_flakey
          rm -f $tmp.*
  }
  trap "_cleanup; exit \$status" 0 1 2 3 15

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _supported_fs generic
  _supported_os Linux
  _need_to_be_root
  _require_scratch
  _require_dm_flakey
  _require_metadata_journaling $SCRATCH_DEV

  _crash_and_mount()
  {
          # Simulate a crash/power loss.
          _load_flakey_table $FLAKEY_DROP_WRITES
          _unmount_flakey
          # Allow writes again and mount. This makes the fs replay its fsync log.
          _load_flakey_table $FLAKEY_ALLOW_WRITES
          _mount_flakey
  }

  rm -f $seqres.full

  _scratch_mkfs &gt;&gt; $seqres.full 2&gt;&amp;1
  _init_flakey
  _mount_flakey

  # Create the test file with some initial data and then fsync it.
  # The fsync here is only needed to trigger the issue in btrfs, as it causes the
  # the flag BTRFS_INODE_NEEDS_FULL_SYNC to be removed from the btrfs inode.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \
                  -c "fsync" \
                  $SCRATCH_MNT/foo | _filter_xfs_io
  sync

  # Add a hard link to our file.
  # On btrfs this sets the flag BTRFS_INODE_COPY_EVERYTHING on the btrfs inode,
  # which is a necessary condition to trigger the issue.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/bar

  # Sync the filesystem to force a commit of the current btrfs transaction, this
  # is a necessary condition to trigger the bug on btrfs.
  sync

  # Now append more data to our file, increasing its size, and fsync the file.
  # In btrfs because the inode flag BTRFS_INODE_COPY_EVERYTHING was set and the
  # write path did not update the inode item in the btree nor the delayed inode
  # item (in memory struture) in the current transaction (created by the fsync
  # handler), the fsync did not record the inode's new i_size in the fsync
  # log/journal. This made the data unavailable after the fsync log/journal is
  # replayed.
  $XFS_IO_PROG -c "pwrite -S 0xbb 32K 32K" \
               -c "fsync" \
               $SCRATCH_MNT/foo | _filter_xfs_io

  echo "File content after fsync and before crash:"
  od -t x1 $SCRATCH_MNT/foo

  _crash_and_mount

  echo "File content after crash and log replay:"
  od -t x1 $SCRATCH_MNT/foo

  status=0
  exit

The expected file output before and after the crash/power failure expects the
appended data to be available, which is:

  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0100000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
  *
  0200000

Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Reviewed-by: Liu Bo &lt;bo.li.liu@oracle.com&gt;
Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>Btrfs: fix race between caching kthread and returning inode to inode cache</title>
<updated>2015-08-03T16:29:13+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2015-06-13T05:52:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=9547e86be4af61fb535d1487d6a0a607bb67f392'/>
<id>9547e86be4af61fb535d1487d6a0a607bb67f392</id>
<content type='text'>
commit ae9d8f17118551bedd797406a6768b87c2146234 upstream.

While the inode cache caching kthread is calling btrfs_unpin_free_ino(),
we could have a concurrent call to btrfs_return_ino() that adds a new
entry to the root's free space cache of pinned inodes. This concurrent
call does not acquire the fs_info-&gt;commit_root_sem before adding a new
entry if the caching state is BTRFS_CACHE_FINISHED, which is a problem
because the caching kthread calls btrfs_unpin_free_ino() after setting
the caching state to BTRFS_CACHE_FINISHED and therefore races with
the task calling btrfs_return_ino(), which is adding a new entry, while
the former (caching kthread) is navigating the cache's rbtree, removing
and freeing nodes from the cache's rbtree without acquiring the spinlock
that protects the rbtree.

This race resulted in memory corruption due to double free of struct
btrfs_free_space objects because both tasks can end up doing freeing the
same objects. Note that adding a new entry can result in merging it with
other entries in the cache, in which case those entries are freed.
This is particularly important as btrfs_free_space structures are also
used for the block group free space caches.

This memory corruption can be detected by a debugging kernel, which
reports it with the following trace:

[132408.501148] slab error in verify_redzone_free(): cache `btrfs_free_space': double free detected
[132408.505075] CPU: 15 PID: 12248 Comm: btrfs-ino-cache Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
[132408.505075] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[132408.505075]  ffff880023e7d320 ffff880163d73cd8 ffffffff8145eec7 ffffffff81095dce
[132408.505075]  ffff880009735d40 ffff880163d73ce8 ffffffff81154e1e ffff880163d73d68
[132408.505075]  ffffffff81155733 ffffffffa054a95a ffff8801b6099f00 ffffffffa0505b5f
[132408.505075] Call Trace:
[132408.505075]  [&lt;ffffffff8145eec7&gt;] dump_stack+0x4f/0x7b
[132408.505075]  [&lt;ffffffff81095dce&gt;] ? console_unlock+0x356/0x3a2
[132408.505075]  [&lt;ffffffff81154e1e&gt;] __slab_error.isra.28+0x25/0x36
[132408.505075]  [&lt;ffffffff81155733&gt;] __cache_free+0xe2/0x4b6
[132408.505075]  [&lt;ffffffffa054a95a&gt;] ? __btrfs_add_free_space+0x2f0/0x343 [btrfs]
[132408.505075]  [&lt;ffffffffa0505b5f&gt;] ? btrfs_unpin_free_ino+0x8e/0x99 [btrfs]
[132408.505075]  [&lt;ffffffff810f3b30&gt;] ? time_hardirqs_off+0x15/0x28
[132408.505075]  [&lt;ffffffff81084d42&gt;] ? trace_hardirqs_off+0xd/0xf
[132408.505075]  [&lt;ffffffff811563a1&gt;] ? kfree+0xb6/0x14e
[132408.505075]  [&lt;ffffffff811563d0&gt;] kfree+0xe5/0x14e
[132408.505075]  [&lt;ffffffffa0505b5f&gt;] btrfs_unpin_free_ino+0x8e/0x99 [btrfs]
[132408.505075]  [&lt;ffffffffa0505e08&gt;] caching_kthread+0x29e/0x2d9 [btrfs]
[132408.505075]  [&lt;ffffffffa0505b6a&gt;] ? btrfs_unpin_free_ino+0x99/0x99 [btrfs]
[132408.505075]  [&lt;ffffffff8106698f&gt;] kthread+0xef/0xf7
[132408.505075]  [&lt;ffffffff810f3b08&gt;] ? time_hardirqs_on+0x15/0x28
[132408.505075]  [&lt;ffffffff810668a0&gt;] ? __kthread_parkme+0xad/0xad
[132408.505075]  [&lt;ffffffff814653d2&gt;] ret_from_fork+0x42/0x70
[132408.505075]  [&lt;ffffffff810668a0&gt;] ? __kthread_parkme+0xad/0xad
[132408.505075] ffff880023e7d320: redzone 1:0x9f911029d74e35b, redzone 2:0x9f911029d74e35b.
[132409.501654] slab: double free detected in cache 'btrfs_free_space', objp ffff880023e7d320
[132409.503355] ------------[ cut here ]------------
[132409.504241] kernel BUG at mm/slab.c:2571!

Therefore fix this by having btrfs_unpin_free_ino() acquire the lock
that protects the rbtree while doing the searches and removing entries.

Fixes: 1c70d8fb4dfa ("Btrfs: fix inode caching vs tree log")
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit ae9d8f17118551bedd797406a6768b87c2146234 upstream.

While the inode cache caching kthread is calling btrfs_unpin_free_ino(),
we could have a concurrent call to btrfs_return_ino() that adds a new
entry to the root's free space cache of pinned inodes. This concurrent
call does not acquire the fs_info-&gt;commit_root_sem before adding a new
entry if the caching state is BTRFS_CACHE_FINISHED, which is a problem
because the caching kthread calls btrfs_unpin_free_ino() after setting
the caching state to BTRFS_CACHE_FINISHED and therefore races with
the task calling btrfs_return_ino(), which is adding a new entry, while
the former (caching kthread) is navigating the cache's rbtree, removing
and freeing nodes from the cache's rbtree without acquiring the spinlock
that protects the rbtree.

This race resulted in memory corruption due to double free of struct
btrfs_free_space objects because both tasks can end up doing freeing the
same objects. Note that adding a new entry can result in merging it with
other entries in the cache, in which case those entries are freed.
This is particularly important as btrfs_free_space structures are also
used for the block group free space caches.

This memory corruption can be detected by a debugging kernel, which
reports it with the following trace:

[132408.501148] slab error in verify_redzone_free(): cache `btrfs_free_space': double free detected
[132408.505075] CPU: 15 PID: 12248 Comm: btrfs-ino-cache Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
[132408.505075] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[132408.505075]  ffff880023e7d320 ffff880163d73cd8 ffffffff8145eec7 ffffffff81095dce
[132408.505075]  ffff880009735d40 ffff880163d73ce8 ffffffff81154e1e ffff880163d73d68
[132408.505075]  ffffffff81155733 ffffffffa054a95a ffff8801b6099f00 ffffffffa0505b5f
[132408.505075] Call Trace:
[132408.505075]  [&lt;ffffffff8145eec7&gt;] dump_stack+0x4f/0x7b
[132408.505075]  [&lt;ffffffff81095dce&gt;] ? console_unlock+0x356/0x3a2
[132408.505075]  [&lt;ffffffff81154e1e&gt;] __slab_error.isra.28+0x25/0x36
[132408.505075]  [&lt;ffffffff81155733&gt;] __cache_free+0xe2/0x4b6
[132408.505075]  [&lt;ffffffffa054a95a&gt;] ? __btrfs_add_free_space+0x2f0/0x343 [btrfs]
[132408.505075]  [&lt;ffffffffa0505b5f&gt;] ? btrfs_unpin_free_ino+0x8e/0x99 [btrfs]
[132408.505075]  [&lt;ffffffff810f3b30&gt;] ? time_hardirqs_off+0x15/0x28
[132408.505075]  [&lt;ffffffff81084d42&gt;] ? trace_hardirqs_off+0xd/0xf
[132408.505075]  [&lt;ffffffff811563a1&gt;] ? kfree+0xb6/0x14e
[132408.505075]  [&lt;ffffffff811563d0&gt;] kfree+0xe5/0x14e
[132408.505075]  [&lt;ffffffffa0505b5f&gt;] btrfs_unpin_free_ino+0x8e/0x99 [btrfs]
[132408.505075]  [&lt;ffffffffa0505e08&gt;] caching_kthread+0x29e/0x2d9 [btrfs]
[132408.505075]  [&lt;ffffffffa0505b6a&gt;] ? btrfs_unpin_free_ino+0x99/0x99 [btrfs]
[132408.505075]  [&lt;ffffffff8106698f&gt;] kthread+0xef/0xf7
[132408.505075]  [&lt;ffffffff810f3b08&gt;] ? time_hardirqs_on+0x15/0x28
[132408.505075]  [&lt;ffffffff810668a0&gt;] ? __kthread_parkme+0xad/0xad
[132408.505075]  [&lt;ffffffff814653d2&gt;] ret_from_fork+0x42/0x70
[132408.505075]  [&lt;ffffffff810668a0&gt;] ? __kthread_parkme+0xad/0xad
[132408.505075] ffff880023e7d320: redzone 1:0x9f911029d74e35b, redzone 2:0x9f911029d74e35b.
[132409.501654] slab: double free detected in cache 'btrfs_free_space', objp ffff880023e7d320
[132409.503355] ------------[ cut here ]------------
[132409.504241] kernel BUG at mm/slab.c:2571!

Therefore fix this by having btrfs_unpin_free_ino() acquire the lock
that protects the rbtree while doing the searches and removing entries.

Fixes: 1c70d8fb4dfa ("Btrfs: fix inode caching vs tree log")
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>Btrfs: use kmem_cache_free when freeing entry in inode cache</title>
<updated>2015-08-03T16:29:13+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2015-06-13T05:52:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=6f953ad80fba78eed807fe61d96c4dde50708698'/>
<id>6f953ad80fba78eed807fe61d96c4dde50708698</id>
<content type='text'>
commit c3f4a1685bb87e59c886ee68f7967eae07d4dffa upstream.

The free space entries are allocated using kmem_cache_zalloc(),
through __btrfs_add_free_space(), therefore we should use
kmem_cache_free() and not kfree() to avoid any confusion and
any potential problem. Looking at the kfree() definition at
mm/slab.c it has the following comment:

  /*
   * (...)
   *
   * Don't free memory not originally allocated by kmalloc()
   * or you will run into trouble.
   */

So better be safe and use kmem_cache_free().

Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.cz&gt;
Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit c3f4a1685bb87e59c886ee68f7967eae07d4dffa upstream.

The free space entries are allocated using kmem_cache_zalloc(),
through __btrfs_add_free_space(), therefore we should use
kmem_cache_free() and not kfree() to avoid any confusion and
any potential problem. Looking at the kfree() definition at
mm/slab.c it has the following comment:

  /*
   * (...)
   *
   * Don't free memory not originally allocated by kmalloc()
   * or you will run into trouble.
   */

So better be safe and use kmem_cache_free().

Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.cz&gt;
Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>Btrfs: don't invalidate root dentry when subvolume deletion fails</title>
<updated>2015-08-03T16:29:13+00:00</updated>
<author>
<name>Omar Sandoval</name>
<email>osandov@osandov.com</email>
</author>
<published>2015-06-03T00:31:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=54b1fb57886f3b2a7be247937efa26d6e305aaa4'/>
<id>54b1fb57886f3b2a7be247937efa26d6e305aaa4</id>
<content type='text'>
commit 64ad6c488975d7516230cf7849190a991fd615ae upstream.

Since commit bafc9b754f75 ("vfs: More precise tests in d_invalidate"),
mounted subvolumes can be deleted because d_invalidate() won't fail.
However, we run into problems when we attempt to delete the default
subvolume while it is mounted as the root filesystem:

	# btrfs subvol list /
	ID 257 gen 306 top level 5 path rootvol
	ID 267 gen 334 top level 5 path snap1
	# btrfs subvol get-default /
	ID 267 gen 334 top level 5 path snap1
	# btrfs inspect-internal rootid /
	267
	# mount -o subvol=/ /dev/vda1 /mnt
	# btrfs subvol del /mnt/snap1
	Delete subvolume (no-commit): '/mnt/snap1'
	ERROR: cannot delete '/mnt/snap1' - Operation not permitted
	# findmnt /
	findmnt: can't read /proc/mounts: No such file or directory
	# ls /proc
	#

Markus reported that this same scenario simply led to a kernel oops.

This happens because in btrfs_ioctl_snap_destroy(), we call
d_invalidate() before we check may_destroy_subvol(), which means that we
detach the submounts and drop the dentry before erroring out. Instead,
we should only invalidate the dentry once the deletion has succeeded.
Additionally, the shrink_dcache_sb() isn't necessary; d_invalidate()
will prune the dcache for the deleted subvolume.

Fixes: bafc9b754f75 ("vfs: More precise tests in d_invalidate")
Reported-by: Markus Schauler &lt;mschauler@gmail.com&gt;
Signed-off-by: Omar Sandoval &lt;osandov@osandov.com&gt;
Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 64ad6c488975d7516230cf7849190a991fd615ae upstream.

Since commit bafc9b754f75 ("vfs: More precise tests in d_invalidate"),
mounted subvolumes can be deleted because d_invalidate() won't fail.
However, we run into problems when we attempt to delete the default
subvolume while it is mounted as the root filesystem:

	# btrfs subvol list /
	ID 257 gen 306 top level 5 path rootvol
	ID 267 gen 334 top level 5 path snap1
	# btrfs subvol get-default /
	ID 267 gen 334 top level 5 path snap1
	# btrfs inspect-internal rootid /
	267
	# mount -o subvol=/ /dev/vda1 /mnt
	# btrfs subvol del /mnt/snap1
	Delete subvolume (no-commit): '/mnt/snap1'
	ERROR: cannot delete '/mnt/snap1' - Operation not permitted
	# findmnt /
	findmnt: can't read /proc/mounts: No such file or directory
	# ls /proc
	#

Markus reported that this same scenario simply led to a kernel oops.

This happens because in btrfs_ioctl_snap_destroy(), we call
d_invalidate() before we check may_destroy_subvol(), which means that we
detach the submounts and drop the dentry before erroring out. Instead,
we should only invalidate the dentry once the deletion has succeeded.
Additionally, the shrink_dcache_sb() isn't necessary; d_invalidate()
will prune the dcache for the deleted subvolume.

Fixes: bafc9b754f75 ("vfs: More precise tests in d_invalidate")
Reported-by: Markus Schauler &lt;mschauler@gmail.com&gt;
Signed-off-by: Omar Sandoval &lt;osandov@osandov.com&gt;
Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs</title>
<updated>2015-05-23T18:14:10+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2015-05-23T18:14:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=7ce14f6ff26460819345fe8495cf2dd6538b7cdc'/>
<id>7ce14f6ff26460819345fe8495cf2dd6538b7cdc</id>
<content type='text'>
Pull btrfs fixes from Chris Mason:
 "I fixed up a regression from 4.0 where conversion between different
  raid levels would sometimes bail out without converting.

  Filipe tracked down a race where it was possible to double allocate
  chunks on the drive.

  Mark has a fix for fiemap.  All three will get bundled off for stable
  as well"

* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix regression in raid level conversion
  Btrfs: fix racy system chunk allocation when setting block group ro
  btrfs: clear 'ret' in btrfs_check_shared() loop
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull btrfs fixes from Chris Mason:
 "I fixed up a regression from 4.0 where conversion between different
  raid levels would sometimes bail out without converting.

  Filipe tracked down a race where it was possible to double allocate
  chunks on the drive.

  Mark has a fix for fiemap.  All three will get bundled off for stable
  as well"

* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix regression in raid level conversion
  Btrfs: fix racy system chunk allocation when setting block group ro
  btrfs: clear 'ret' in btrfs_check_shared() loop
</pre>
</div>
</content>
</entry>
<entry>
<title>Btrfs: fix regression in raid level conversion</title>
<updated>2015-05-20T18:03:38+00:00</updated>
<author>
<name>Chris Mason</name>
<email>clm@fb.com</email>
</author>
<published>2015-05-20T01:54:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=153c35b6cccc0c72de9fae06c8e2c8b2c47d79d4'/>
<id>153c35b6cccc0c72de9fae06c8e2c8b2c47d79d4</id>
<content type='text'>
Commit 2f0810880f082fa8ba66ab2c33b02e4ff9770a5e changed
btrfs_set_block_group_ro to avoid trying to allocate new chunks with the
new raid profile during conversion.  This fixed failures when there was
no space on the drive to allocate a new chunk, but the metadata
reserves were sufficient to continue the conversion.

But this ended up causing a regression when the drive had plenty of
space to allocate new chunks, mostly because reduce_alloc_profile isn't
using the new raid profile.

Fixing btrfs_reduce_alloc_profile is a bigger patch.  For now, do a
partial revert of 2f0810880, and don't error out if we hit ENOSPC.

Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
Tested-by: Dave Sterba &lt;dsterba@suse.cz&gt;
Reported-by: Holger Hoffstaette &lt;holger.hoffstaette@googlemail.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Commit 2f0810880f082fa8ba66ab2c33b02e4ff9770a5e changed
btrfs_set_block_group_ro to avoid trying to allocate new chunks with the
new raid profile during conversion.  This fixed failures when there was
no space on the drive to allocate a new chunk, but the metadata
reserves were sufficient to continue the conversion.

But this ended up causing a regression when the drive had plenty of
space to allocate new chunks, mostly because reduce_alloc_profile isn't
using the new raid profile.

Fixing btrfs_reduce_alloc_profile is a bigger patch.  For now, do a
partial revert of 2f0810880, and don't error out if we hit ENOSPC.

Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
Tested-by: Dave Sterba &lt;dsterba@suse.cz&gt;
Reported-by: Holger Hoffstaette &lt;holger.hoffstaette@googlemail.com&gt;
</pre>
</div>
</content>
</entry>
</feed>
