<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/drivers/md, branch v4.4.136</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>bcache: quit dc-&gt;writeback_thread when BCACHE_DEV_DETACHING is set</title>
<updated>2018-05-30T05:49:11+00:00</updated>
<author>
<name>Coly Li</name>
<email>colyli@suse.de</email>
</author>
<published>2018-03-19T00:36:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=7fb3ba30296b5d7b3758fe223614ea559bdc77e8'/>
<id>7fb3ba30296b5d7b3758fe223614ea559bdc77e8</id>
<content type='text'>
[ Upstream commit fadd94e05c02afec7b70b0b14915624f1782f578 ]

In patch "bcache: fix cached_dev-&gt;count usage for bch_cache_set_error()",
cached_dev_get() is called when creating dc-&gt;writeback_thread, and
cached_dev_put() is called when exiting dc-&gt;writeback_thread. This
modification works well unless people detach the bcache device manually by
    'echo 1 &gt; /sys/block/bcache&lt;N&gt;/bcache/detach'
Because this sysfs interface only calls bch_cached_dev_detach() which wakes
up dc-&gt;writeback_thread but does not stop it. The reason is, before patch
"bcache: fix cached_dev-&gt;count usage for bch_cache_set_error()", inside
bch_writeback_thread(), if cache is not dirty after writeback,
cached_dev_put() will be called here. And in cached_dev_make_request() when
a new write request makes cache from clean to dirty, cached_dev_get() will
be called there. Since we don't operate dc-&gt;count in these locations,
refcount d-&gt;count cannot be dropped after cache becomes clean, and
cached_dev_detach_finish() won't be called to detach bcache device.

This patch fixes the issue by checking whether BCACHE_DEV_DETACHING is
set inside bch_writeback_thread(). If this bit is set and cache is clean
(no existing writeback_keys), break the while-loop, call cached_dev_put()
and quit the writeback thread.

Please note if cache is still dirty, even BCACHE_DEV_DETACHING is set the
writeback thread should continue to perform writeback, this is the original
design of manually detach.

It is safe to do the following check without locking, let me explain why,
+	if (!test_bit(BCACHE_DEV_DETACHING, &amp;dc-&gt;disk.flags) &amp;&amp;
+	    (!atomic_read(&amp;dc-&gt;has_dirty) || !dc-&gt;writeback_running)) {

If the kenrel thread does not sleep and continue to run due to conditions
are not updated in time on the running CPU core, it just consumes more CPU
cycles and has no hurt. This should-sleep-but-run is safe here. We just
focus on the should-run-but-sleep condition, which means the writeback
thread goes to sleep in mistake while it should continue to run.
1, First of all, no matter the writeback thread is hung or not,
   kthread_stop() from cached_dev_detach_finish() will wake up it and
   terminate by making kthread_should_stop() return true. And in normal
   run time, bit on index BCACHE_DEV_DETACHING is always cleared, the
   condition
	!test_bit(BCACHE_DEV_DETACHING, &amp;dc-&gt;disk.flags)
   is always true and can be ignored as constant value.
2, If one of the following conditions is true, the writeback thread should
   go to sleep,
   "!atomic_read(&amp;dc-&gt;has_dirty)" or "!dc-&gt;writeback_running)"
   each of them independently controls the writeback thread should sleep or
   not, let's analyse them one by one.
2.1 condition "!atomic_read(&amp;dc-&gt;has_dirty)"
   If dc-&gt;has_dirty is set from 0 to 1 on another CPU core, bcache will
   call bch_writeback_queue() immediately or call bch_writeback_add() which
   indirectly calls bch_writeback_queue() too. In bch_writeback_queue(),
   wake_up_process(dc-&gt;writeback_thread) is called. It sets writeback
   thread's task state to TASK_RUNNING and following an implicit memory
   barrier, then tries to wake up the writeback thread.
   In writeback thread, its task state is set to TASK_INTERRUPTIBLE before
   doing the condition check. If other CPU core sets the TASK_RUNNING state
   after writeback thread setting TASK_INTERRUPTIBLE, the writeback thread
   will be scheduled to run very soon because its state is not
   TASK_INTERRUPTIBLE. If other CPU core sets the TASK_RUNNING state before
   writeback thread setting TASK_INTERRUPTIBLE, the implict memory barrier
   of wake_up_process() will make sure modification of dc-&gt;has_dirty on
   other CPU core is updated and observed on the CPU core of writeback
   thread. Therefore the condition check will correctly be false, and
   continue writeback code without sleeping.
2.2 condition "!dc-&gt;writeback_running)"
   dc-&gt;writeback_running can be changed via sysfs file, every time it is
   modified, a following bch_writeback_queue() is alwasy called. So the
   change is always observed on the CPU core of writeback thread. If
   dc-&gt;writeback_running is changed from 0 to 1 on other CPU core, this
   condition check will observe the modification and allow writeback
   thread to continue to run without sleeping.
Now we can see, even without a locking protection, multiple conditions
check is safe here, no deadlock or process hang up will happen.

I compose a separte patch because that patch "bcache: fix cached_dev-&gt;count
usage for bch_cache_set_error()" already gets a "Reviewed-by:" from Hannes
Reinecke. Also this fix is not trivial and good for a separate patch.

Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Reviewed-by: Michael Lyle &lt;mlyle@lyle.org&gt;
Cc: Hannes Reinecke &lt;hare@suse.com&gt;
Cc: Huijun Tang &lt;tang.junhui@zte.com.cn&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit fadd94e05c02afec7b70b0b14915624f1782f578 ]

In patch "bcache: fix cached_dev-&gt;count usage for bch_cache_set_error()",
cached_dev_get() is called when creating dc-&gt;writeback_thread, and
cached_dev_put() is called when exiting dc-&gt;writeback_thread. This
modification works well unless people detach the bcache device manually by
    'echo 1 &gt; /sys/block/bcache&lt;N&gt;/bcache/detach'
Because this sysfs interface only calls bch_cached_dev_detach() which wakes
up dc-&gt;writeback_thread but does not stop it. The reason is, before patch
"bcache: fix cached_dev-&gt;count usage for bch_cache_set_error()", inside
bch_writeback_thread(), if cache is not dirty after writeback,
cached_dev_put() will be called here. And in cached_dev_make_request() when
a new write request makes cache from clean to dirty, cached_dev_get() will
be called there. Since we don't operate dc-&gt;count in these locations,
refcount d-&gt;count cannot be dropped after cache becomes clean, and
cached_dev_detach_finish() won't be called to detach bcache device.

This patch fixes the issue by checking whether BCACHE_DEV_DETACHING is
set inside bch_writeback_thread(). If this bit is set and cache is clean
(no existing writeback_keys), break the while-loop, call cached_dev_put()
and quit the writeback thread.

Please note if cache is still dirty, even BCACHE_DEV_DETACHING is set the
writeback thread should continue to perform writeback, this is the original
design of manually detach.

It is safe to do the following check without locking, let me explain why,
+	if (!test_bit(BCACHE_DEV_DETACHING, &amp;dc-&gt;disk.flags) &amp;&amp;
+	    (!atomic_read(&amp;dc-&gt;has_dirty) || !dc-&gt;writeback_running)) {

If the kenrel thread does not sleep and continue to run due to conditions
are not updated in time on the running CPU core, it just consumes more CPU
cycles and has no hurt. This should-sleep-but-run is safe here. We just
focus on the should-run-but-sleep condition, which means the writeback
thread goes to sleep in mistake while it should continue to run.
1, First of all, no matter the writeback thread is hung or not,
   kthread_stop() from cached_dev_detach_finish() will wake up it and
   terminate by making kthread_should_stop() return true. And in normal
   run time, bit on index BCACHE_DEV_DETACHING is always cleared, the
   condition
	!test_bit(BCACHE_DEV_DETACHING, &amp;dc-&gt;disk.flags)
   is always true and can be ignored as constant value.
2, If one of the following conditions is true, the writeback thread should
   go to sleep,
   "!atomic_read(&amp;dc-&gt;has_dirty)" or "!dc-&gt;writeback_running)"
   each of them independently controls the writeback thread should sleep or
   not, let's analyse them one by one.
2.1 condition "!atomic_read(&amp;dc-&gt;has_dirty)"
   If dc-&gt;has_dirty is set from 0 to 1 on another CPU core, bcache will
   call bch_writeback_queue() immediately or call bch_writeback_add() which
   indirectly calls bch_writeback_queue() too. In bch_writeback_queue(),
   wake_up_process(dc-&gt;writeback_thread) is called. It sets writeback
   thread's task state to TASK_RUNNING and following an implicit memory
   barrier, then tries to wake up the writeback thread.
   In writeback thread, its task state is set to TASK_INTERRUPTIBLE before
   doing the condition check. If other CPU core sets the TASK_RUNNING state
   after writeback thread setting TASK_INTERRUPTIBLE, the writeback thread
   will be scheduled to run very soon because its state is not
   TASK_INTERRUPTIBLE. If other CPU core sets the TASK_RUNNING state before
   writeback thread setting TASK_INTERRUPTIBLE, the implict memory barrier
   of wake_up_process() will make sure modification of dc-&gt;has_dirty on
   other CPU core is updated and observed on the CPU core of writeback
   thread. Therefore the condition check will correctly be false, and
   continue writeback code without sleeping.
2.2 condition "!dc-&gt;writeback_running)"
   dc-&gt;writeback_running can be changed via sysfs file, every time it is
   modified, a following bch_writeback_queue() is alwasy called. So the
   change is always observed on the CPU core of writeback thread. If
   dc-&gt;writeback_running is changed from 0 to 1 on other CPU core, this
   condition check will observe the modification and allow writeback
   thread to continue to run without sleeping.
Now we can see, even without a locking protection, multiple conditions
check is safe here, no deadlock or process hang up will happen.

I compose a separte patch because that patch "bcache: fix cached_dev-&gt;count
usage for bch_cache_set_error()" already gets a "Reviewed-by:" from Hannes
Reinecke. Also this fix is not trivial and good for a separate patch.

Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Reviewed-by: Michael Lyle &lt;mlyle@lyle.org&gt;
Cc: Hannes Reinecke &lt;hare@suse.com&gt;
Cc: Huijun Tang &lt;tang.junhui@zte.com.cn&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>bcache: fix kcrashes with fio in RAID5 backend dev</title>
<updated>2018-05-30T05:49:02+00:00</updated>
<author>
<name>Tang Junhui</name>
<email>tang.junhui@zte.com.cn</email>
</author>
<published>2018-02-27T17:49:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=3d0dca33a2346608db5c3466f2f654e1e2e9e76e'/>
<id>3d0dca33a2346608db5c3466f2f654e1e2e9e76e</id>
<content type='text'>
[ Upstream commit 60eb34ec5526e264c2bbaea4f7512d714d791caf ]

Kernel crashed when run fio in a RAID5 backend bcache device, the call
trace is bellow:
[  440.012034] kernel BUG at block/blk-ioc.c:146!
[  440.012696] invalid opcode: 0000 [#1] SMP NOPTI
[  440.026537] CPU: 2 PID: 2205 Comm: md127_raid5 Not tainted 4.15.0 #8
[  440.027441] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 07/16
/2015
[  440.028615] RIP: 0010:put_io_context+0x8b/0x90
[  440.029246] RSP: 0018:ffffa8c882b43af8 EFLAGS: 00010246
[  440.029990] RAX: 0000000000000000 RBX: ffffa8c88294fca0 RCX: 0000000000
0f4240
[  440.031006] RDX: 0000000000000004 RSI: 0000000000000286 RDI: ffffa8c882
94fca0
[  440.032030] RBP: ffffa8c882b43b10 R08: 0000000000000003 R09: ffff949cb8
0c1700
[  440.033206] R10: 0000000000000104 R11: 000000000000b71c R12: 00000000000
01000
[  440.034222] R13: 0000000000000000 R14: ffff949cad84db70 R15: ffff949cb11
bd1e0
[  440.035239] FS:  0000000000000000(0000) GS:ffff949cba280000(0000) knlGS:
0000000000000000
[  440.060190] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  440.084967] CR2: 00007ff0493ef000 CR3: 00000002f1e0a002 CR4: 00000000001
606e0
[  440.110498] Call Trace:
[  440.135443]  bio_disassociate_task+0x1b/0x60
[  440.160355]  bio_free+0x1b/0x60
[  440.184666]  bio_put+0x23/0x30
[  440.208272]  search_free+0x23/0x40 [bcache]
[  440.231448]  cached_dev_write_complete+0x31/0x70 [bcache]
[  440.254468]  closure_put+0xb6/0xd0 [bcache]
[  440.277087]  request_endio+0x30/0x40 [bcache]
[  440.298703]  bio_endio+0xa1/0x120
[  440.319644]  handle_stripe+0x418/0x2270 [raid456]
[  440.340614]  ? load_balance+0x17b/0x9c0
[  440.360506]  handle_active_stripes.isra.58+0x387/0x5a0 [raid456]
[  440.380675]  ? __release_stripe+0x15/0x20 [raid456]
[  440.400132]  raid5d+0x3ed/0x5d0 [raid456]
[  440.419193]  ? schedule+0x36/0x80
[  440.437932]  ? schedule_timeout+0x1d2/0x2f0
[  440.456136]  md_thread+0x122/0x150
[  440.473687]  ? wait_woken+0x80/0x80
[  440.491411]  kthread+0x102/0x140
[  440.508636]  ? find_pers+0x70/0x70
[  440.524927]  ? kthread_associate_blkcg+0xa0/0xa0
[  440.541791]  ret_from_fork+0x35/0x40
[  440.558020] Code: c2 48 00 5b 41 5c 41 5d 5d c3 48 89 c6 4c 89 e7 e8 bb c2
48 00 48 8b 3d bc 36 4b 01 48 89 de e8 7c f7 e0 ff 5b 41 5c 41 5d 5d c3 &lt;0f&gt; 0b
0f 1f 00 0f 1f 44 00 00 55 48 8d 47 b8 48 89 e5 41 57 41
[  440.610020] RIP: put_io_context+0x8b/0x90 RSP: ffffa8c882b43af8
[  440.628575] ---[ end trace a1fd79d85643a73e ]--

All the crash issue happened when a bypass IO coming, in such scenario
s-&gt;iop.bio is pointed to the s-&gt;orig_bio. In search_free(), it finishes the
s-&gt;orig_bio by calling bio_complete(), and after that, s-&gt;iop.bio became
invalid, then kernel would crash when calling bio_put(). Maybe its upper
layer's faulty, since bio should not be freed before we calling bio_put(),
but we'd better calling bio_put() first before calling bio_complete() to
notify upper layer ending this bio.

This patch moves bio_complete() under bio_put() to avoid kernel crash.

[mlyle: fixed commit subject for character limits]

Reported-by: Matthias Ferdinand &lt;bcache@mfedv.net&gt;
Tested-by: Matthias Ferdinand &lt;bcache@mfedv.net&gt;
Signed-off-by: Tang Junhui &lt;tang.junhui@zte.com.cn&gt;
Reviewed-by: Michael Lyle &lt;mlyle@lyle.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 60eb34ec5526e264c2bbaea4f7512d714d791caf ]

Kernel crashed when run fio in a RAID5 backend bcache device, the call
trace is bellow:
[  440.012034] kernel BUG at block/blk-ioc.c:146!
[  440.012696] invalid opcode: 0000 [#1] SMP NOPTI
[  440.026537] CPU: 2 PID: 2205 Comm: md127_raid5 Not tainted 4.15.0 #8
[  440.027441] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 07/16
/2015
[  440.028615] RIP: 0010:put_io_context+0x8b/0x90
[  440.029246] RSP: 0018:ffffa8c882b43af8 EFLAGS: 00010246
[  440.029990] RAX: 0000000000000000 RBX: ffffa8c88294fca0 RCX: 0000000000
0f4240
[  440.031006] RDX: 0000000000000004 RSI: 0000000000000286 RDI: ffffa8c882
94fca0
[  440.032030] RBP: ffffa8c882b43b10 R08: 0000000000000003 R09: ffff949cb8
0c1700
[  440.033206] R10: 0000000000000104 R11: 000000000000b71c R12: 00000000000
01000
[  440.034222] R13: 0000000000000000 R14: ffff949cad84db70 R15: ffff949cb11
bd1e0
[  440.035239] FS:  0000000000000000(0000) GS:ffff949cba280000(0000) knlGS:
0000000000000000
[  440.060190] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  440.084967] CR2: 00007ff0493ef000 CR3: 00000002f1e0a002 CR4: 00000000001
606e0
[  440.110498] Call Trace:
[  440.135443]  bio_disassociate_task+0x1b/0x60
[  440.160355]  bio_free+0x1b/0x60
[  440.184666]  bio_put+0x23/0x30
[  440.208272]  search_free+0x23/0x40 [bcache]
[  440.231448]  cached_dev_write_complete+0x31/0x70 [bcache]
[  440.254468]  closure_put+0xb6/0xd0 [bcache]
[  440.277087]  request_endio+0x30/0x40 [bcache]
[  440.298703]  bio_endio+0xa1/0x120
[  440.319644]  handle_stripe+0x418/0x2270 [raid456]
[  440.340614]  ? load_balance+0x17b/0x9c0
[  440.360506]  handle_active_stripes.isra.58+0x387/0x5a0 [raid456]
[  440.380675]  ? __release_stripe+0x15/0x20 [raid456]
[  440.400132]  raid5d+0x3ed/0x5d0 [raid456]
[  440.419193]  ? schedule+0x36/0x80
[  440.437932]  ? schedule_timeout+0x1d2/0x2f0
[  440.456136]  md_thread+0x122/0x150
[  440.473687]  ? wait_woken+0x80/0x80
[  440.491411]  kthread+0x102/0x140
[  440.508636]  ? find_pers+0x70/0x70
[  440.524927]  ? kthread_associate_blkcg+0xa0/0xa0
[  440.541791]  ret_from_fork+0x35/0x40
[  440.558020] Code: c2 48 00 5b 41 5c 41 5d 5d c3 48 89 c6 4c 89 e7 e8 bb c2
48 00 48 8b 3d bc 36 4b 01 48 89 de e8 7c f7 e0 ff 5b 41 5c 41 5d 5d c3 &lt;0f&gt; 0b
0f 1f 00 0f 1f 44 00 00 55 48 8d 47 b8 48 89 e5 41 57 41
[  440.610020] RIP: put_io_context+0x8b/0x90 RSP: ffffa8c882b43af8
[  440.628575] ---[ end trace a1fd79d85643a73e ]--

All the crash issue happened when a bypass IO coming, in such scenario
s-&gt;iop.bio is pointed to the s-&gt;orig_bio. In search_free(), it finishes the
s-&gt;orig_bio by calling bio_complete(), and after that, s-&gt;iop.bio became
invalid, then kernel would crash when calling bio_put(). Maybe its upper
layer's faulty, since bio should not be freed before we calling bio_put(),
but we'd better calling bio_put() first before calling bio_complete() to
notify upper layer ending this bio.

This patch moves bio_complete() under bio_put() to avoid kernel crash.

[mlyle: fixed commit subject for character limits]

Reported-by: Matthias Ferdinand &lt;bcache@mfedv.net&gt;
Tested-by: Matthias Ferdinand &lt;bcache@mfedv.net&gt;
Signed-off-by: Tang Junhui &lt;tang.junhui@zte.com.cn&gt;
Reviewed-by: Michael Lyle &lt;mlyle@lyle.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md/raid1: fix NULL pointer dereference</title>
<updated>2018-05-30T05:49:01+00:00</updated>
<author>
<name>Yufen Yu</name>
<email>yuyufen@huawei.com</email>
</author>
<published>2018-02-24T04:05:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=8d16ec1c91ab008f0d7113cfd25467e677aeda4a'/>
<id>8d16ec1c91ab008f0d7113cfd25467e677aeda4a</id>
<content type='text'>
[ Upstream commit 3de59bb9d551428cbdc76a9ea57883f82e350b4d ]

In handle_write_finished(), if r1_bio-&gt;bios[m] != NULL, it thinks
the corresponding conf-&gt;mirrors[m].rdev is also not NULL. But, it
is not always true.

Even if some io hold replacement rdev(i.e. rdev-&gt;nr_pending.count &gt; 0),
raid1_remove_disk() can also set the rdev as NULL. That means,
bios[m] != NULL, but mirrors[m].rdev is NULL, resulting in NULL
pointer dereference in handle_write_finished and sync_request_write.

This patch can fix BUGs as follows:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000140
 IP: [&lt;ffffffff815bbbbd&gt;] raid1d+0x2bd/0xfc0
 PGD 12ab52067 PUD 12f587067 PMD 0
 Oops: 0000 [#1] SMP
 CPU: 1 PID: 2008 Comm: md3_raid1 Not tainted 4.1.44+ #130
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
 Call Trace:
  ? schedule+0x37/0x90
  ? prepare_to_wait_event+0x83/0xf0
  md_thread+0x144/0x150
  ? wake_atomic_t_function+0x70/0x70
  ? md_start_sync+0xf0/0xf0
  kthread+0xd8/0xf0
  ? kthread_worker_fn+0x160/0x160
  ret_from_fork+0x42/0x70
  ? kthread_worker_fn+0x160/0x160

 BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8
 IP: sync_request_write+0x9e/0x980
 PGD 800000007c518067 P4D 800000007c518067 PUD 8002b067 PMD 0
 Oops: 0000 [#1] SMP PTI
 CPU: 24 PID: 2549 Comm: md3_raid1 Not tainted 4.15.0+ #118
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
 Call Trace:
  ? sched_clock+0x5/0x10
  ? sched_clock_cpu+0xc/0xb0
  ? flush_pending_writes+0x3a/0xd0
  ? pick_next_task_fair+0x4d5/0x5f0
  ? __switch_to+0xa2/0x430
  raid1d+0x65a/0x870
  ? find_pers+0x70/0x70
  ? find_pers+0x70/0x70
  ? md_thread+0x11c/0x160
  md_thread+0x11c/0x160
  ? finish_wait+0x80/0x80
  kthread+0x111/0x130
  ? kthread_create_worker_on_cpu+0x70/0x70
  ? do_syscall_64+0x6f/0x190
  ? SyS_exit_group+0x10/0x10
  ret_from_fork+0x35/0x40

Reviewed-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Yufen Yu &lt;yuyufen@huawei.com&gt;
Signed-off-by: Shaohua Li &lt;sh.li@alibaba-inc.com&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 3de59bb9d551428cbdc76a9ea57883f82e350b4d ]

In handle_write_finished(), if r1_bio-&gt;bios[m] != NULL, it thinks
the corresponding conf-&gt;mirrors[m].rdev is also not NULL. But, it
is not always true.

Even if some io hold replacement rdev(i.e. rdev-&gt;nr_pending.count &gt; 0),
raid1_remove_disk() can also set the rdev as NULL. That means,
bios[m] != NULL, but mirrors[m].rdev is NULL, resulting in NULL
pointer dereference in handle_write_finished and sync_request_write.

This patch can fix BUGs as follows:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000140
 IP: [&lt;ffffffff815bbbbd&gt;] raid1d+0x2bd/0xfc0
 PGD 12ab52067 PUD 12f587067 PMD 0
 Oops: 0000 [#1] SMP
 CPU: 1 PID: 2008 Comm: md3_raid1 Not tainted 4.1.44+ #130
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
 Call Trace:
  ? schedule+0x37/0x90
  ? prepare_to_wait_event+0x83/0xf0
  md_thread+0x144/0x150
  ? wake_atomic_t_function+0x70/0x70
  ? md_start_sync+0xf0/0xf0
  kthread+0xd8/0xf0
  ? kthread_worker_fn+0x160/0x160
  ret_from_fork+0x42/0x70
  ? kthread_worker_fn+0x160/0x160

 BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8
 IP: sync_request_write+0x9e/0x980
 PGD 800000007c518067 P4D 800000007c518067 PUD 8002b067 PMD 0
 Oops: 0000 [#1] SMP PTI
 CPU: 24 PID: 2549 Comm: md3_raid1 Not tainted 4.15.0+ #118
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
 Call Trace:
  ? sched_clock+0x5/0x10
  ? sched_clock_cpu+0xc/0xb0
  ? flush_pending_writes+0x3a/0xd0
  ? pick_next_task_fair+0x4d5/0x5f0
  ? __switch_to+0xa2/0x430
  raid1d+0x65a/0x870
  ? find_pers+0x70/0x70
  ? find_pers+0x70/0x70
  ? md_thread+0x11c/0x160
  md_thread+0x11c/0x160
  ? finish_wait+0x80/0x80
  kthread+0x111/0x130
  ? kthread_create_worker_on_cpu+0x70/0x70
  ? do_syscall_64+0x6f/0x190
  ? SyS_exit_group+0x10/0x10
  ret_from_fork+0x35/0x40

Reviewed-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Yufen Yu &lt;yuyufen@huawei.com&gt;
Signed-off-by: Shaohua Li &lt;sh.li@alibaba-inc.com&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md: raid5: avoid string overflow warning</title>
<updated>2018-05-30T05:49:00+00:00</updated>
<author>
<name>Arnd Bergmann</name>
<email>arnd@arndb.de</email>
</author>
<published>2018-02-20T13:09:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=e163691fa8e8403d1637690c7ef43c0a423deb25'/>
<id>e163691fa8e8403d1637690c7ef43c0a423deb25</id>
<content type='text'>
[ Upstream commit 53b8d89ddbdbb0e4625a46d2cdbb6f106c52f801 ]

gcc warns about a possible overflow of the kmem_cache string, when adding
four characters to a string of the same length:

drivers/md/raid5.c: In function 'setup_conf':
drivers/md/raid5.c:2207:34: error: '-alt' directive writing 4 bytes into a region of size between 1 and 32 [-Werror=format-overflow=]
  sprintf(conf-&gt;cache_name[1], "%s-alt", conf-&gt;cache_name[0]);
                                  ^~~~
drivers/md/raid5.c:2207:2: note: 'sprintf' output between 5 and 36 bytes into a destination of size 32
  sprintf(conf-&gt;cache_name[1], "%s-alt", conf-&gt;cache_name[0]);
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If I'm counting correctly, we need 11 characters for the fixed part
of the string and 18 characters for a 64-bit pointer (when no gendisk
is used), so that leaves three characters for conf-&gt;level, which should
always be sufficient.

This makes the code use snprintf() with the correct length, to
make the code more robust against changes, and to get the compiler
to shut up.

In commit f4be6b43f1ac ("md/raid5: ensure we create a unique name for
kmem_cache when mddev has no gendisk") from 2010, Neil said that
the pointer could be removed "shortly" once devices without gendisk
are disallowed. I have no idea if that happened, but if it did, that
should probably be changed as well.

Signed-off-by: Arnd Bergmann &lt;arnd@arndb.de&gt;
Signed-off-by: Shaohua Li &lt;sh.li@alibaba-inc.com&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 53b8d89ddbdbb0e4625a46d2cdbb6f106c52f801 ]

gcc warns about a possible overflow of the kmem_cache string, when adding
four characters to a string of the same length:

drivers/md/raid5.c: In function 'setup_conf':
drivers/md/raid5.c:2207:34: error: '-alt' directive writing 4 bytes into a region of size between 1 and 32 [-Werror=format-overflow=]
  sprintf(conf-&gt;cache_name[1], "%s-alt", conf-&gt;cache_name[0]);
                                  ^~~~
drivers/md/raid5.c:2207:2: note: 'sprintf' output between 5 and 36 bytes into a destination of size 32
  sprintf(conf-&gt;cache_name[1], "%s-alt", conf-&gt;cache_name[0]);
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If I'm counting correctly, we need 11 characters for the fixed part
of the string and 18 characters for a 64-bit pointer (when no gendisk
is used), so that leaves three characters for conf-&gt;level, which should
always be sufficient.

This makes the code use snprintf() with the correct length, to
make the code more robust against changes, and to get the compiler
to shut up.

In commit f4be6b43f1ac ("md/raid5: ensure we create a unique name for
kmem_cache when mddev has no gendisk") from 2010, Neil said that
the pointer could be removed "shortly" once devices without gendisk
are disallowed. I have no idea if that happened, but if it did, that
should probably be changed as well.

Signed-off-by: Arnd Bergmann &lt;arnd@arndb.de&gt;
Signed-off-by: Shaohua Li &lt;sh.li@alibaba-inc.com&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md raid10: fix NULL deference in handle_write_completed()</title>
<updated>2018-05-30T05:48:59+00:00</updated>
<author>
<name>Yufen Yu</name>
<email>yuyufen@huawei.com</email>
</author>
<published>2018-02-06T09:39:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=247e3b4dd0ccf6e703dfd9eb86af953849ed2999'/>
<id>247e3b4dd0ccf6e703dfd9eb86af953849ed2999</id>
<content type='text'>
[ Upstream commit 01a69cab01c184d3786af09e9339311123d63d22 ]

In the case of 'recover', an r10bio with R10BIO_WriteError &amp;
R10BIO_IsRecover will be progressed by handle_write_completed().
This function traverses all r10bio-&gt;devs[copies].
If devs[m].repl_bio != NULL, it thinks conf-&gt;mirrors[dev].replacement
is also not NULL. However, this is not always true.

When there is an rdev of raid10 has replacement, then each r10bio
-&gt;devs[m].repl_bio != NULL in conf-&gt;r10buf_pool. However, in 'recover',
even if corresponded replacement is NULL, it doesn't clear r10bio
-&gt;devs[m].repl_bio, resulting in replacement NULL deference.

This bug was introduced when replacement support for raid10 was
added in Linux 3.3.

As NeilBrown suggested:
	Elsewhere the determination of "is this device part of the
	resync/recovery" is made by resting bio-&gt;bi_end_io.
	If this is end_sync_write, then we tried to write here.
	If it is NULL, then we didn't try to write.

Fixes: 9ad1aefc8ae8 ("md/raid10:  Handle replacement devices during resync.")
Cc: stable (V3.3+)
Suggested-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Yufen Yu &lt;yuyufen@huawei.com&gt;
Signed-off-by: Shaohua Li &lt;sh.li@alibaba-inc.com&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 01a69cab01c184d3786af09e9339311123d63d22 ]

In the case of 'recover', an r10bio with R10BIO_WriteError &amp;
R10BIO_IsRecover will be progressed by handle_write_completed().
This function traverses all r10bio-&gt;devs[copies].
If devs[m].repl_bio != NULL, it thinks conf-&gt;mirrors[dev].replacement
is also not NULL. However, this is not always true.

When there is an rdev of raid10 has replacement, then each r10bio
-&gt;devs[m].repl_bio != NULL in conf-&gt;r10buf_pool. However, in 'recover',
even if corresponded replacement is NULL, it doesn't clear r10bio
-&gt;devs[m].repl_bio, resulting in replacement NULL deference.

This bug was introduced when replacement support for raid10 was
added in Linux 3.3.

As NeilBrown suggested:
	Elsewhere the determination of "is this device part of the
	resync/recovery" is made by resting bio-&gt;bi_end_io.
	If this is end_sync_write, then we tried to write here.
	If it is NULL, then we didn't try to write.

Fixes: 9ad1aefc8ae8 ("md/raid10:  Handle replacement devices during resync.")
Cc: stable (V3.3+)
Suggested-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Yufen Yu &lt;yuyufen@huawei.com&gt;
Signed-off-by: Shaohua Li &lt;sh.li@alibaba-inc.com&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>bcache: return attach error when no cache set exist</title>
<updated>2018-05-30T05:48:57+00:00</updated>
<author>
<name>Tang Junhui</name>
<email>tang.junhui@zte.com.cn</email>
</author>
<published>2018-02-07T19:41:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=b2fce717a374c3801c5560dfb4df60ac3fb4e34b'/>
<id>b2fce717a374c3801c5560dfb4df60ac3fb4e34b</id>
<content type='text'>
[ Upstream commit 7f4fc93d4713394ee8f1cd44c238e046e11b4f15 ]

I attach a back-end device to a cache set, and the cache set is not
registered yet, this back-end device did not attach successfully, and no
error returned:
[root]# echo 87859280-fec6-4bcc-20df7ca8f86b &gt; /sys/block/sde/bcache/attach
[root]#

In sysfs_attach(), the return value "v" is initialized to "size" in
the beginning, and if no cache set exist in bch_cache_sets, the "v" value
would not change any more, and return to sysfs, sysfs regard it as success
since the "size" is a positive number.

This patch fixes this issue by assigning "v" with "-ENOENT" in the
initialization.

Signed-off-by: Tang Junhui &lt;tang.junhui@zte.com.cn&gt;
Reviewed-by: Michael Lyle &lt;mlyle@lyle.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 7f4fc93d4713394ee8f1cd44c238e046e11b4f15 ]

I attach a back-end device to a cache set, and the cache set is not
registered yet, this back-end device did not attach successfully, and no
error returned:
[root]# echo 87859280-fec6-4bcc-20df7ca8f86b &gt; /sys/block/sde/bcache/attach
[root]#

In sysfs_attach(), the return value "v" is initialized to "size" in
the beginning, and if no cache set exist in bch_cache_sets, the "v" value
would not change any more, and return to sysfs, sysfs regard it as success
since the "size" is a positive number.

This patch fixes this issue by assigning "v" with "-ENOENT" in the
initialization.

Signed-off-by: Tang Junhui &lt;tang.junhui@zte.com.cn&gt;
Reviewed-by: Michael Lyle &lt;mlyle@lyle.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>bcache: fix for data collapse after re-attaching an attached device</title>
<updated>2018-05-30T05:48:57+00:00</updated>
<author>
<name>Tang Junhui</name>
<email>tang.junhui@zte.com.cn</email>
</author>
<published>2018-02-07T19:41:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=0b803fa8db94a81e95e2dc6b5ddc25af6e162b9e'/>
<id>0b803fa8db94a81e95e2dc6b5ddc25af6e162b9e</id>
<content type='text'>
[ Upstream commit 73ac105be390c1de42a2f21643c9778a5e002930 ]

back-end device sdm has already attached a cache_set with ID
f67ebe1f-f8bc-4d73-bfe5-9dc88607f119, then try to attach with
another cache set, and it returns with an error:
[root]# cd /sys/block/sdm/bcache
[root]# echo 5ccd0a63-148e-48b8-afa2-aca9cbd6279f &gt; attach
-bash: echo: write error: Invalid argument

After that, execute a command to modify the label of bcache
device:
[root]# echo data_disk1 &gt; label

Then we reboot the system, when the system power on, the back-end
device can not attach to cache_set, a messages show in the log:
Feb  5 12:05:52 ceph152 kernel: [922385.508498] bcache:
bch_cached_dev_attach() couldn't find uuid for sdm in set

In sysfs_attach(), dc-&gt;sb.set_uuid was assigned to the value
which input through sysfs, no matter whether it is success
or not in bch_cached_dev_attach(). For example, If the back-end
device has already attached to an cache set, bch_cached_dev_attach()
would fail, but dc-&gt;sb.set_uuid was changed. Then modify the
label of bcache device, it will call bch_write_bdev_super(),
which would write the dc-&gt;sb.set_uuid to the super block, so we
record a wrong cache set ID in the super block, after the system
reboot, the cache set couldn't find the uuid of the back-end
device, so the bcache device couldn't exist and use any more.

In this patch, we don't assigned cache set ID to dc-&gt;sb.set_uuid
in sysfs_attach() directly, but input it into bch_cached_dev_attach(),
and assigned dc-&gt;sb.set_uuid to the cache set ID after the back-end
device attached to the cache set successful.

Signed-off-by: Tang Junhui &lt;tang.junhui@zte.com.cn&gt;
Reviewed-by: Michael Lyle &lt;mlyle@lyle.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 73ac105be390c1de42a2f21643c9778a5e002930 ]

back-end device sdm has already attached a cache_set with ID
f67ebe1f-f8bc-4d73-bfe5-9dc88607f119, then try to attach with
another cache set, and it returns with an error:
[root]# cd /sys/block/sdm/bcache
[root]# echo 5ccd0a63-148e-48b8-afa2-aca9cbd6279f &gt; attach
-bash: echo: write error: Invalid argument

After that, execute a command to modify the label of bcache
device:
[root]# echo data_disk1 &gt; label

Then we reboot the system, when the system power on, the back-end
device can not attach to cache_set, a messages show in the log:
Feb  5 12:05:52 ceph152 kernel: [922385.508498] bcache:
bch_cached_dev_attach() couldn't find uuid for sdm in set

In sysfs_attach(), dc-&gt;sb.set_uuid was assigned to the value
which input through sysfs, no matter whether it is success
or not in bch_cached_dev_attach(). For example, If the back-end
device has already attached to an cache set, bch_cached_dev_attach()
would fail, but dc-&gt;sb.set_uuid was changed. Then modify the
label of bcache device, it will call bch_write_bdev_super(),
which would write the dc-&gt;sb.set_uuid to the super block, so we
record a wrong cache set ID in the super block, after the system
reboot, the cache set couldn't find the uuid of the back-end
device, so the bcache device couldn't exist and use any more.

In this patch, we don't assigned cache set ID to dc-&gt;sb.set_uuid
in sysfs_attach() directly, but input it into bch_cached_dev_attach(),
and assigned dc-&gt;sb.set_uuid to the cache set ID after the back-end
device attached to the cache set successful.

Signed-off-by: Tang Junhui &lt;tang.junhui@zte.com.cn&gt;
Reviewed-by: Michael Lyle &lt;mlyle@lyle.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>bcache: fix for allocator and register thread race</title>
<updated>2018-05-30T05:48:57+00:00</updated>
<author>
<name>Tang Junhui</name>
<email>tang.junhui@zte.com.cn</email>
</author>
<published>2018-02-07T19:41:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=cab95b251b76047e2422f16589413d706a2d74ec'/>
<id>cab95b251b76047e2422f16589413d706a2d74ec</id>
<content type='text'>
[ Upstream commit 682811b3ce1a5a4e20d700939a9042f01dbc66c4 ]

After long time running of random small IO writing,
I reboot the machine, and after the machine power on,
I found bcache got stuck, the stack is:
[root@ceph153 ~]# cat /proc/2510/task/*/stack
[&lt;ffffffffa06b2455&gt;] closure_sync+0x25/0x90 [bcache]
[&lt;ffffffffa06b6be8&gt;] bch_journal+0x118/0x2b0 [bcache]
[&lt;ffffffffa06b6dc7&gt;] bch_journal_meta+0x47/0x70 [bcache]
[&lt;ffffffffa06be8f7&gt;] bch_prio_write+0x237/0x340 [bcache]
[&lt;ffffffffa06a8018&gt;] bch_allocator_thread+0x3c8/0x3d0 [bcache]
[&lt;ffffffff810a631f&gt;] kthread+0xcf/0xe0
[&lt;ffffffff8164c318&gt;] ret_from_fork+0x58/0x90
[&lt;ffffffffffffffff&gt;] 0xffffffffffffffff
[root@ceph153 ~]# cat /proc/2038/task/*/stack
[&lt;ffffffffa06b1abd&gt;] __bch_btree_map_nodes+0x12d/0x150 [bcache]
[&lt;ffffffffa06b1bd1&gt;] bch_btree_insert+0xf1/0x170 [bcache]
[&lt;ffffffffa06b637f&gt;] bch_journal_replay+0x13f/0x230 [bcache]
[&lt;ffffffffa06c75fe&gt;] run_cache_set+0x79a/0x7c2 [bcache]
[&lt;ffffffffa06c0cf8&gt;] register_bcache+0xd48/0x1310 [bcache]
[&lt;ffffffff812f702f&gt;] kobj_attr_store+0xf/0x20
[&lt;ffffffff8125b216&gt;] sysfs_write_file+0xc6/0x140
[&lt;ffffffff811dfbfd&gt;] vfs_write+0xbd/0x1e0
[&lt;ffffffff811e069f&gt;] SyS_write+0x7f/0xe0
[&lt;ffffffff8164c3c9&gt;] system_call_fastpath+0x16/0x1
The stack shows the register thread and allocator thread
were getting stuck when registering cache device.

I reboot the machine several times, the issue always
exsit in this machine.

I debug the code, and found the call trace as bellow:
register_bcache()
   ==&gt;run_cache_set()
      ==&gt;bch_journal_replay()
         ==&gt;bch_btree_insert()
            ==&gt;__bch_btree_map_nodes()
               ==&gt;btree_insert_fn()
                  ==&gt;btree_split() //node need split
                     ==&gt;btree_check_reserve()
In btree_check_reserve(), It will check if there is enough buckets
of RESERVE_BTREE type, since allocator thread did not work yet, so
no buckets of RESERVE_BTREE type allocated, so the register thread
waits on c-&gt;btree_cache_wait, and goes to sleep.

Then the allocator thread initialized, the call trace is bellow:
bch_allocator_thread()
==&gt;bch_prio_write()
   ==&gt;bch_journal_meta()
      ==&gt;bch_journal()
         ==&gt;journal_wait_for_write()
In journal_wait_for_write(), It will check if journal is full by
journal_full(), but the long time random small IO writing
causes the exhaustion of journal buckets(journal.blocks_free=0),
In order to release the journal buckets,
the allocator calls btree_flush_write() to flush keys to
btree nodes, and waits on c-&gt;journal.wait until btree nodes writing
over or there has already some journal buckets space, then the
allocator thread goes to sleep. but in btree_flush_write(), since
bch_journal_replay() is not finished, so no btree nodes have journal
(condition "if (btree_current_write(b)-&gt;journal)" never satisfied),
so we got no btree node to flush, no journal bucket released,
and allocator sleep all the times.

Through the above analysis, we can see that:
1) Register thread wait for allocator thread to allocate buckets of
   RESERVE_BTREE type;
2) Alloctor thread wait for register thread to replay journal, so it
   can flush btree nodes and get journal bucket.
   then they are all got stuck by waiting for each other.

Hua Rui provided a patch for me, by allocating some buckets of
RESERVE_BTREE type in advance, so the register thread can get bucket
when btree node splitting and no need to waiting for the allocator
thread. I tested it, it has effect, and register thread run a step
forward, but finally are still got stuck, the reason is only 8 bucket
of RESERVE_BTREE type were allocated, and in bch_journal_replay(),
after 2 btree nodes splitting, only 4 bucket of RESERVE_BTREE type left,
then btree_check_reserve() is not satisfied anymore, so it goes to sleep
again, and in the same time, alloctor thread did not flush enough btree
nodes to release a journal bucket, so they all got stuck again.

So we need to allocate more buckets of RESERVE_BTREE type in advance,
but how much is enough?  By experience and test, I think it should be
as much as journal buckets. Then I modify the code as this patch,
and test in the machine, and it works.

This patch modified base on Hua Rui’s patch, and allocate more buckets
of RESERVE_BTREE type in advance to avoid register thread and allocate
thread going to wait for each other.

[patch v2] ca-&gt;sb.njournal_buckets would be 0 in the first time after
cache creation, and no journal exists, so just 8 btree buckets is OK.

Signed-off-by: Hua Rui &lt;huarui.dev@gmail.com&gt;
Signed-off-by: Tang Junhui &lt;tang.junhui@zte.com.cn&gt;
Reviewed-by: Michael Lyle &lt;mlyle@lyle.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 682811b3ce1a5a4e20d700939a9042f01dbc66c4 ]

After long time running of random small IO writing,
I reboot the machine, and after the machine power on,
I found bcache got stuck, the stack is:
[root@ceph153 ~]# cat /proc/2510/task/*/stack
[&lt;ffffffffa06b2455&gt;] closure_sync+0x25/0x90 [bcache]
[&lt;ffffffffa06b6be8&gt;] bch_journal+0x118/0x2b0 [bcache]
[&lt;ffffffffa06b6dc7&gt;] bch_journal_meta+0x47/0x70 [bcache]
[&lt;ffffffffa06be8f7&gt;] bch_prio_write+0x237/0x340 [bcache]
[&lt;ffffffffa06a8018&gt;] bch_allocator_thread+0x3c8/0x3d0 [bcache]
[&lt;ffffffff810a631f&gt;] kthread+0xcf/0xe0
[&lt;ffffffff8164c318&gt;] ret_from_fork+0x58/0x90
[&lt;ffffffffffffffff&gt;] 0xffffffffffffffff
[root@ceph153 ~]# cat /proc/2038/task/*/stack
[&lt;ffffffffa06b1abd&gt;] __bch_btree_map_nodes+0x12d/0x150 [bcache]
[&lt;ffffffffa06b1bd1&gt;] bch_btree_insert+0xf1/0x170 [bcache]
[&lt;ffffffffa06b637f&gt;] bch_journal_replay+0x13f/0x230 [bcache]
[&lt;ffffffffa06c75fe&gt;] run_cache_set+0x79a/0x7c2 [bcache]
[&lt;ffffffffa06c0cf8&gt;] register_bcache+0xd48/0x1310 [bcache]
[&lt;ffffffff812f702f&gt;] kobj_attr_store+0xf/0x20
[&lt;ffffffff8125b216&gt;] sysfs_write_file+0xc6/0x140
[&lt;ffffffff811dfbfd&gt;] vfs_write+0xbd/0x1e0
[&lt;ffffffff811e069f&gt;] SyS_write+0x7f/0xe0
[&lt;ffffffff8164c3c9&gt;] system_call_fastpath+0x16/0x1
The stack shows the register thread and allocator thread
were getting stuck when registering cache device.

I reboot the machine several times, the issue always
exsit in this machine.

I debug the code, and found the call trace as bellow:
register_bcache()
   ==&gt;run_cache_set()
      ==&gt;bch_journal_replay()
         ==&gt;bch_btree_insert()
            ==&gt;__bch_btree_map_nodes()
               ==&gt;btree_insert_fn()
                  ==&gt;btree_split() //node need split
                     ==&gt;btree_check_reserve()
In btree_check_reserve(), It will check if there is enough buckets
of RESERVE_BTREE type, since allocator thread did not work yet, so
no buckets of RESERVE_BTREE type allocated, so the register thread
waits on c-&gt;btree_cache_wait, and goes to sleep.

Then the allocator thread initialized, the call trace is bellow:
bch_allocator_thread()
==&gt;bch_prio_write()
   ==&gt;bch_journal_meta()
      ==&gt;bch_journal()
         ==&gt;journal_wait_for_write()
In journal_wait_for_write(), It will check if journal is full by
journal_full(), but the long time random small IO writing
causes the exhaustion of journal buckets(journal.blocks_free=0),
In order to release the journal buckets,
the allocator calls btree_flush_write() to flush keys to
btree nodes, and waits on c-&gt;journal.wait until btree nodes writing
over or there has already some journal buckets space, then the
allocator thread goes to sleep. but in btree_flush_write(), since
bch_journal_replay() is not finished, so no btree nodes have journal
(condition "if (btree_current_write(b)-&gt;journal)" never satisfied),
so we got no btree node to flush, no journal bucket released,
and allocator sleep all the times.

Through the above analysis, we can see that:
1) Register thread wait for allocator thread to allocate buckets of
   RESERVE_BTREE type;
2) Alloctor thread wait for register thread to replay journal, so it
   can flush btree nodes and get journal bucket.
   then they are all got stuck by waiting for each other.

Hua Rui provided a patch for me, by allocating some buckets of
RESERVE_BTREE type in advance, so the register thread can get bucket
when btree node splitting and no need to waiting for the allocator
thread. I tested it, it has effect, and register thread run a step
forward, but finally are still got stuck, the reason is only 8 bucket
of RESERVE_BTREE type were allocated, and in bch_journal_replay(),
after 2 btree nodes splitting, only 4 bucket of RESERVE_BTREE type left,
then btree_check_reserve() is not satisfied anymore, so it goes to sleep
again, and in the same time, alloctor thread did not flush enough btree
nodes to release a journal bucket, so they all got stuck again.

So we need to allocate more buckets of RESERVE_BTREE type in advance,
but how much is enough?  By experience and test, I think it should be
as much as journal buckets. Then I modify the code as this patch,
and test in the machine, and it works.

This patch modified base on Hua Rui’s patch, and allocate more buckets
of RESERVE_BTREE type in advance to avoid register thread and allocate
thread going to wait for each other.

[patch v2] ca-&gt;sb.njournal_buckets would be 0 in the first time after
cache creation, and no journal exists, so just 8 btree buckets is OK.

Signed-off-by: Hua Rui &lt;huarui.dev@gmail.com&gt;
Signed-off-by: Tang Junhui &lt;tang.junhui@zte.com.cn&gt;
Reviewed-by: Michael Lyle &lt;mlyle@lyle.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>bcache: properly set task state in bch_writeback_thread()</title>
<updated>2018-05-30T05:48:57+00:00</updated>
<author>
<name>Coly Li</name>
<email>colyli@suse.de</email>
</author>
<published>2018-02-07T19:41:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=acd8bb42c820b482a57c30b29d0f24cc5e1eab48'/>
<id>acd8bb42c820b482a57c30b29d0f24cc5e1eab48</id>
<content type='text'>
[ Upstream commit 99361bbf26337186f02561109c17a4c4b1a7536a ]

Kernel thread routine bch_writeback_thread() has the following code block,

447         down_write(&amp;dc-&gt;writeback_lock);
448~450     if (check conditions) {
451                 up_write(&amp;dc-&gt;writeback_lock);
452                 set_current_state(TASK_INTERRUPTIBLE);
453
454                 if (kthread_should_stop())
455                         return 0;
456
457                 schedule();
458                 continue;
459         }

If condition check is true, its task state is set to TASK_INTERRUPTIBLE
and call schedule() to wait for others to wake up it.

There are 2 issues in current code,
1, Task state is set to TASK_INTERRUPTIBLE after the condition checks, if
   another process changes the condition and call wake_up_process(dc-&gt;
   writeback_thread), then at line 452 task state is set back to
   TASK_INTERRUPTIBLE, the writeback kernel thread will lose a chance to be
   waken up.
2, At line 454 if kthread_should_stop() is true, writeback kernel thread
   will return to kernel/kthread.c:kthread() with TASK_INTERRUPTIBLE and
   call do_exit(). It is not good to enter do_exit() with task state
   TASK_INTERRUPTIBLE, in following code path might_sleep() is called and a
   warning message is reported by __might_sleep(): "WARNING: do not call
   blocking ops when !TASK_RUNNING; state=1 set at [xxxx]".

For the first issue, task state should be set before condition checks.
Ineed because dc-&gt;writeback_lock is required when modifying all the
conditions, calling set_current_state() inside code block where dc-&gt;
writeback_lock is hold is safe. But this is quite implicit, so I still move
set_current_state() before all the condition checks.

For the second issue, frankley speaking it does not hurt when kernel thread
exits with TASK_INTERRUPTIBLE state, but this warning message scares users,
makes them feel there might be something risky with bcache and hurt their
data.  Setting task state to TASK_RUNNING before returning fixes this
problem.

In alloc.c:allocator_wait(), there is also a similar issue, and is also
fixed in this patch.

Changelog:
v3: merge two similar fixes into one patch
v2: fix the race issue in v1 patch.
v1: initial buggy fix.

Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Reviewed-by: Hannes Reinecke &lt;hare@suse.de&gt;
Reviewed-by: Michael Lyle &lt;mlyle@lyle.org&gt;
Cc: Michael Lyle &lt;mlyle@lyle.org&gt;
Cc: Junhui Tang &lt;tang.junhui@zte.com.cn&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 99361bbf26337186f02561109c17a4c4b1a7536a ]

Kernel thread routine bch_writeback_thread() has the following code block,

447         down_write(&amp;dc-&gt;writeback_lock);
448~450     if (check conditions) {
451                 up_write(&amp;dc-&gt;writeback_lock);
452                 set_current_state(TASK_INTERRUPTIBLE);
453
454                 if (kthread_should_stop())
455                         return 0;
456
457                 schedule();
458                 continue;
459         }

If condition check is true, its task state is set to TASK_INTERRUPTIBLE
and call schedule() to wait for others to wake up it.

There are 2 issues in current code,
1, Task state is set to TASK_INTERRUPTIBLE after the condition checks, if
   another process changes the condition and call wake_up_process(dc-&gt;
   writeback_thread), then at line 452 task state is set back to
   TASK_INTERRUPTIBLE, the writeback kernel thread will lose a chance to be
   waken up.
2, At line 454 if kthread_should_stop() is true, writeback kernel thread
   will return to kernel/kthread.c:kthread() with TASK_INTERRUPTIBLE and
   call do_exit(). It is not good to enter do_exit() with task state
   TASK_INTERRUPTIBLE, in following code path might_sleep() is called and a
   warning message is reported by __might_sleep(): "WARNING: do not call
   blocking ops when !TASK_RUNNING; state=1 set at [xxxx]".

For the first issue, task state should be set before condition checks.
Ineed because dc-&gt;writeback_lock is required when modifying all the
conditions, calling set_current_state() inside code block where dc-&gt;
writeback_lock is hold is safe. But this is quite implicit, so I still move
set_current_state() before all the condition checks.

For the second issue, frankley speaking it does not hurt when kernel thread
exits with TASK_INTERRUPTIBLE state, but this warning message scares users,
makes them feel there might be something risky with bcache and hurt their
data.  Setting task state to TASK_RUNNING before returning fixes this
problem.

In alloc.c:allocator_wait(), there is also a similar issue, and is also
fixed in this patch.

Changelog:
v3: merge two similar fixes into one patch
v2: fix the race issue in v1 patch.
v1: initial buggy fix.

Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Reviewed-by: Hannes Reinecke &lt;hare@suse.de&gt;
Reviewed-by: Michael Lyle &lt;mlyle@lyle.org&gt;
Cc: Michael Lyle &lt;mlyle@lyle.org&gt;
Cc: Junhui Tang &lt;tang.junhui@zte.com.cn&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>bcache: segregate flash only volume write streams</title>
<updated>2018-04-13T17:50:22+00:00</updated>
<author>
<name>Tang Junhui</name>
<email>tang.junhui@zte.com.cn</email>
</author>
<published>2018-01-08T20:21:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=ebd67e20fba8f6318d06d69cf043a6caacf3dbbb'/>
<id>ebd67e20fba8f6318d06d69cf043a6caacf3dbbb</id>
<content type='text'>
[ Upstream commit 4eca1cb28d8b0574ca4f1f48e9331c5f852d43b9 ]

In such scenario that there are some flash only volumes
, and some cached devices, when many tasks request these devices in
writeback mode, the write IOs may fall to the same bucket as bellow:
| cached data | flash data | cached data | cached data| flash data|
then after writeback of these cached devices, the bucket would
be like bellow bucket:
| free | flash data | free | free | flash data |

So, there are many free space in this bucket, but since data of flash
only volumes still exists, so this bucket cannot be reclaimable,
which would cause waste of bucket space.

In this patch, we segregate flash only volume write streams from
cached devices, so data from flash only volumes and cached devices
can store in different buckets.

Compare to v1 patch, this patch do not add a additionally open bucket
list, and it is try best to segregate flash only volume write streams
from cached devices, sectors of flash only volumes may still be mixed
with dirty sectors of cached device, but the number is very small.

[mlyle: fixed commit log formatting, permissions, line endings]

Signed-off-by: Tang Junhui &lt;tang.junhui@zte.com.cn&gt;
Reviewed-by: Michael Lyle &lt;mlyle@lyle.org&gt;
Signed-off-by: Michael Lyle &lt;mlyle@lyle.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 4eca1cb28d8b0574ca4f1f48e9331c5f852d43b9 ]

In such scenario that there are some flash only volumes
, and some cached devices, when many tasks request these devices in
writeback mode, the write IOs may fall to the same bucket as bellow:
| cached data | flash data | cached data | cached data| flash data|
then after writeback of these cached devices, the bucket would
be like bellow bucket:
| free | flash data | free | free | flash data |

So, there are many free space in this bucket, but since data of flash
only volumes still exists, so this bucket cannot be reclaimable,
which would cause waste of bucket space.

In this patch, we segregate flash only volume write streams from
cached devices, so data from flash only volumes and cached devices
can store in different buckets.

Compare to v1 patch, this patch do not add a additionally open bucket
list, and it is try best to segregate flash only volume write streams
from cached devices, sectors of flash only volumes may still be mixed
with dirty sectors of cached device, but the number is very small.

[mlyle: fixed commit log formatting, permissions, line endings]

Signed-off-by: Tang Junhui &lt;tang.junhui@zte.com.cn&gt;
Reviewed-by: Michael Lyle &lt;mlyle@lyle.org&gt;
Signed-off-by: Michael Lyle &lt;mlyle@lyle.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
