<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/net/ceph, branch v5.5-rc7</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>libceph, rbd, ceph: convert to use the new mount API</title>
<updated>2019-11-27T21:28:37+00:00</updated>
<author>
<name>David Howells</name>
<email>dhowells@redhat.com</email>
</author>
<published>2019-03-25T16:38:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=82995cc6c5ae4bf4d72edef381a085e52d5b5905'/>
<id>82995cc6c5ae4bf4d72edef381a085e52d5b5905</id>
<content type='text'>
Convert the ceph filesystem to the new internal mount API as the old
one will be obsoleted and removed.  This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

[ Numerous string handling, leak and regression fixes; rbd conversion
  was particularly broken and had to be redone almost from scratch. ]

Signed-off-by: David Howells &lt;dhowells@redhat.com&gt;
Signed-off-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Convert the ceph filesystem to the new internal mount API as the old
one will be obsoleted and removed.  This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

[ Numerous string handling, leak and regression fixes; rbd conversion
  was particularly broken and had to be redone almost from scratch. ]

Signed-off-by: David Howells &lt;dhowells@redhat.com&gt;
Signed-off-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>libceph: drop unnecessary check from dispatch() in mon_client.c</title>
<updated>2019-11-25T10:44:02+00:00</updated>
<author>
<name>Ilya Dryomov</name>
<email>idryomov@gmail.com</email>
</author>
<published>2019-10-14T14:25:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=d8f544c30ba0ec7b3e5d0dd2c4258181b2adb8a0'/>
<id>d8f544c30ba0ec7b3e5d0dd2c4258181b2adb8a0</id>
<content type='text'>
con-&gt;private is set in ceph_con_init() and is never cleared.

Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
con-&gt;private is set in ceph_con_init() and is never cleared.

Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>libceph: use ceph_kvmalloc() for osdmap arrays</title>
<updated>2019-09-16T10:06:25+00:00</updated>
<author>
<name>Ilya Dryomov</name>
<email>idryomov@gmail.com</email>
</author>
<published>2019-09-04T13:40:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=cf73d882cc51c1f245a890cccb79952a260302d3'/>
<id>cf73d882cc51c1f245a890cccb79952a260302d3</id>
<content type='text'>
osdmap has a bunch of arrays that grow linearly with the number of
OSDs.  osd_state, osd_weight and osd_primary_affinity take 4 bytes per
OSD.  osd_addr takes 136 bytes per OSD because of sockaddr_storage.
The CRUSH workspace area also grows linearly with the number of OSDs.

Normally these arrays are allocated at client startup.  The osdmap is
usually updated in small incrementals, but once in a while a full map
may need to be processed.  For a cluster with 10000 OSDs, this means
a bunch of 40K allocations followed by a 1.3M allocation, all of which
are currently required to be physically contiguous.  This results in
sporadic ENOMEM errors, hanging the client.

Go back to manually (re)allocating arrays and use ceph_kvmalloc() to
fall back to non-contiguous allocation when necessary.

Link: https://tracker.ceph.com/issues/40481
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Reviewed-by: Jeff Layton &lt;jlayton@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
osdmap has a bunch of arrays that grow linearly with the number of
OSDs.  osd_state, osd_weight and osd_primary_affinity take 4 bytes per
OSD.  osd_addr takes 136 bytes per OSD because of sockaddr_storage.
The CRUSH workspace area also grows linearly with the number of OSDs.

Normally these arrays are allocated at client startup.  The osdmap is
usually updated in small incrementals, but once in a while a full map
may need to be processed.  For a cluster with 10000 OSDs, this means
a bunch of 40K allocations followed by a 1.3M allocation, all of which
are currently required to be physically contiguous.  This results in
sporadic ENOMEM errors, hanging the client.

Go back to manually (re)allocating arrays and use ceph_kvmalloc() to
fall back to non-contiguous allocation when necessary.

Link: https://tracker.ceph.com/issues/40481
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Reviewed-by: Jeff Layton &lt;jlayton@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>libceph: avoid a __vmalloc() deadlock in ceph_kvmalloc()</title>
<updated>2019-09-16T10:06:25+00:00</updated>
<author>
<name>Ilya Dryomov</name>
<email>idryomov@gmail.com</email>
</author>
<published>2019-08-30T18:44:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=10c12851a022662bf6085bd4384b4ebed4c447ce'/>
<id>10c12851a022662bf6085bd4384b4ebed4c447ce</id>
<content type='text'>
The vmalloc allocator doesn't fully respect the specified gfp mask:
while the actual pages are allocated as requested, the page table pages
are always allocated with GFP_KERNEL.  ceph_kvmalloc() may be called
with GFP_NOFS and GFP_NOIO (for ceph and rbd respectively), so this may
result in a deadlock.

There is no real reason for the current PAGE_ALLOC_COSTLY_ORDER logic,
it's just something that seemed sensible at the time (ceph_kvmalloc()
predates kvmalloc()).  kvmalloc() is smarter: in an attempt to reduce
long term fragmentation, it first tries to kmalloc non-disruptively.

Switch to kvmalloc() and set the respective PF_MEMALLOC_* flag using
the scope API to avoid the deadlock.  Note that kvmalloc() needs to be
passed GFP_KERNEL to enable the fallback.

Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Reviewed-by: Jeff Layton &lt;jlayton@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The vmalloc allocator doesn't fully respect the specified gfp mask:
while the actual pages are allocated as requested, the page table pages
are always allocated with GFP_KERNEL.  ceph_kvmalloc() may be called
with GFP_NOFS and GFP_NOIO (for ceph and rbd respectively), so this may
result in a deadlock.

There is no real reason for the current PAGE_ALLOC_COSTLY_ORDER logic,
it's just something that seemed sensible at the time (ceph_kvmalloc()
predates kvmalloc()).  kvmalloc() is smarter: in an attempt to reduce
long term fragmentation, it first tries to kmalloc non-disruptively.

Switch to kvmalloc() and set the respective PF_MEMALLOC_* flag using
the scope API to avoid the deadlock.  Note that kvmalloc() needs to be
passed GFP_KERNEL to enable the fallback.

Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Reviewed-by: Jeff Layton &lt;jlayton@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>libceph: drop unused con parameter of calc_target()</title>
<updated>2019-09-16T10:06:25+00:00</updated>
<author>
<name>Ilya Dryomov</name>
<email>idryomov@gmail.com</email>
</author>
<published>2019-08-21T11:57:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=8edf84ba4d0ecad7d1c5b393ba004b959dae76ea'/>
<id>8edf84ba4d0ecad7d1c5b393ba004b959dae76ea</id>
<content type='text'>
This bit was omitted from a561372405cf ("libceph: fix PG split vs OSD
(re)connect race") to avoid backport conflicts.

Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This bit was omitted from a561372405cf ("libceph: fix PG split vs OSD
(re)connect race") to avoid backport conflicts.

Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>libceph: handle OSD op ceph_pagelist_append() errors</title>
<updated>2019-09-16T10:06:25+00:00</updated>
<author>
<name>David Disseldorp</name>
<email>ddiss@suse.de</email>
</author>
<published>2019-07-03T15:59:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=4766815b1179dbe4fe263a5f95795710e29276e2'/>
<id>4766815b1179dbe4fe263a5f95795710e29276e2</id>
<content type='text'>
osd_req_op_cls_init() and osd_req_op_xattr_init() currently propagate
ceph_pagelist_alloc() ENOMEM errors but ignore ceph_pagelist_append()
memory allocation failures. Add these checks and cleanup on error.

Signed-off-by: David Disseldorp &lt;ddiss@suse.de&gt;
Reviewed-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
osd_req_op_cls_init() and osd_req_op_xattr_init() currently propagate
ceph_pagelist_alloc() ENOMEM errors but ignore ceph_pagelist_append()
memory allocation failures. Add these checks and cleanup on error.

Signed-off-by: David Disseldorp &lt;ddiss@suse.de&gt;
Reviewed-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>libceph: add function that clears osd client's abort_err</title>
<updated>2019-09-16T10:06:23+00:00</updated>
<author>
<name>Yan, Zheng</name>
<email>zyan@redhat.com</email>
</author>
<published>2019-07-25T12:16:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=2cef0ba8032ce0df3f29621921fe7d6372f68d3e'/>
<id>2cef0ba8032ce0df3f29621921fe7d6372f68d3e</id>
<content type='text'>
Signed-off-by: "Yan, Zheng" &lt;zyan@redhat.com&gt;
Reviewed-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Signed-off-by: "Yan, Zheng" &lt;zyan@redhat.com&gt;
Reviewed-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>libceph: add function that reset client's entity addr</title>
<updated>2019-09-16T10:06:23+00:00</updated>
<author>
<name>Yan, Zheng</name>
<email>zyan@redhat.com</email>
</author>
<published>2019-07-25T12:16:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=120a75ea9f4ba02f852171e75d44f29139b9c83e'/>
<id>120a75ea9f4ba02f852171e75d44f29139b9c83e</id>
<content type='text'>
This function also re-open connections to OSD/MON, and re-send in-flight
OSD requests after re-opening connections to OSD.

Signed-off-by: "Yan, Zheng" &lt;zyan@redhat.com&gt;
Reviewed-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This function also re-open connections to OSD/MON, and re-send in-flight
OSD requests after re-opening connections to OSD.

Signed-off-by: "Yan, Zheng" &lt;zyan@redhat.com&gt;
Reviewed-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>libceph: don't call crypto_free_sync_skcipher() on a NULL tfm</title>
<updated>2019-08-28T10:33:46+00:00</updated>
<author>
<name>Jia-Ju Bai</name>
<email>baijiaju1990@gmail.com</email>
</author>
<published>2019-07-24T09:43:06+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=e8c99200b4d117c340c392ebd5e62d85dfeed027'/>
<id>e8c99200b4d117c340c392ebd5e62d85dfeed027</id>
<content type='text'>
In set_secret(), key-&gt;tfm is assigned to NULL on line 55, and then
ceph_crypto_key_destroy(key) is executed.

ceph_crypto_key_destroy(key)
  crypto_free_sync_skcipher(key-&gt;tfm)
    crypto_free_skcipher(&amp;tfm-&gt;base);

This happens to work because crypto_sync_skcipher is a trivial wrapper
around crypto_skcipher: &amp;tfm-&gt;base is still 0 and crypto_free_skcipher()
handles that.  Let's not rely on the layout of crypto_sync_skcipher.

This bug is found by a static analysis tool STCheck written by us.

Fixes: 69d6302b65a8 ("libceph: Remove VLA usage of skcipher").
Signed-off-by: Jia-Ju Bai &lt;baijiaju1990@gmail.com&gt;
Reviewed-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In set_secret(), key-&gt;tfm is assigned to NULL on line 55, and then
ceph_crypto_key_destroy(key) is executed.

ceph_crypto_key_destroy(key)
  crypto_free_sync_skcipher(key-&gt;tfm)
    crypto_free_skcipher(&amp;tfm-&gt;base);

This happens to work because crypto_sync_skcipher is a trivial wrapper
around crypto_skcipher: &amp;tfm-&gt;base is still 0 and crypto_free_skcipher()
handles that.  Let's not rely on the layout of crypto_sync_skcipher.

This bug is found by a static analysis tool STCheck written by us.

Fixes: 69d6302b65a8 ("libceph: Remove VLA usage of skcipher").
Signed-off-by: Jia-Ju Bai &lt;baijiaju1990@gmail.com&gt;
Reviewed-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>libceph: fix PG split vs OSD (re)connect race</title>
<updated>2019-08-22T08:47:41+00:00</updated>
<author>
<name>Ilya Dryomov</name>
<email>idryomov@gmail.com</email>
</author>
<published>2019-08-20T14:40:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=a561372405cf6bc6f14239b3a9e57bb39f2788b0'/>
<id>a561372405cf6bc6f14239b3a9e57bb39f2788b0</id>
<content type='text'>
We can't rely on -&gt;peer_features in calc_target() because it may be
called both when the OSD session is established and open and when it's
not.  -&gt;peer_features is not valid unless the OSD session is open.  If
this happens on a PG split (pg_num increase), that could mean we don't
resend a request that should have been resent, hanging the client
indefinitely.

In userspace this was fixed by looking at require_osd_release and
get_xinfo[osd].features fields of the osdmap.  However these fields
belong to the OSD section of the osdmap, which the kernel doesn't
decode (only the client section is decoded).

Instead, let's drop this feature check.  It effectively checks for
luminous, so only pre-luminous OSDs would be affected in that on a PG
split the kernel might resend a request that should not have been
resent.  Duplicates can occur in other scenarios, so both sides should
already be prepared for them: see dup/replay logic on the OSD side and
retry_attempt check on the client side.

Cc: stable@vger.kernel.org
Fixes: 7de030d6b10a ("libceph: resend on PG splits if OSD has RESEND_ON_SPLIT")
Link: https://tracker.ceph.com/issues/41162
Reported-by: Jerry Lee &lt;leisurelysw24@gmail.com&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Tested-by: Jerry Lee &lt;leisurelysw24@gmail.com&gt;
Reviewed-by: Jeff Layton &lt;jlayton@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We can't rely on -&gt;peer_features in calc_target() because it may be
called both when the OSD session is established and open and when it's
not.  -&gt;peer_features is not valid unless the OSD session is open.  If
this happens on a PG split (pg_num increase), that could mean we don't
resend a request that should have been resent, hanging the client
indefinitely.

In userspace this was fixed by looking at require_osd_release and
get_xinfo[osd].features fields of the osdmap.  However these fields
belong to the OSD section of the osdmap, which the kernel doesn't
decode (only the client section is decoded).

Instead, let's drop this feature check.  It effectively checks for
luminous, so only pre-luminous OSDs would be affected in that on a PG
split the kernel might resend a request that should not have been
resent.  Duplicates can occur in other scenarios, so both sides should
already be prepared for them: see dup/replay logic on the OSD side and
retry_attempt check on the client side.

Cc: stable@vger.kernel.org
Fixes: 7de030d6b10a ("libceph: resend on PG splits if OSD has RESEND_ON_SPLIT")
Link: https://tracker.ceph.com/issues/41162
Reported-by: Jerry Lee &lt;leisurelysw24@gmail.com&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Tested-by: Jerry Lee &lt;leisurelysw24@gmail.com&gt;
Reviewed-by: Jeff Layton &lt;jlayton@kernel.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
