linux-toradex.git/net/ceph/messenger.c, branch v4.4.78

libceph: force GFP_NOIO for socket allocations

2017-04-08T07:53:30+00:00

commit 633ee407b9d15a75ac9740ba9d3338815e1fcb95 upstream.

sock_alloc_inode() allocates socket+inode and socket_wq with
GFP_KERNEL, which is not allowed on the writeback path:

    Workqueue: ceph-msgr con_work [libceph]
    ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
    0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
    ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
    Call Trace:
    schedule+0x29/0x70
    schedule_timeout+0x1bd/0x200
    ? ttwu_do_wakeup+0x2c/0x120
    ? ttwu_do_activate.constprop.135+0x66/0x70
    wait_for_completion+0xbf/0x180
    ? try_to_wake_up+0x390/0x390
    flush_work+0x165/0x250
    ? worker_detach_from_pool+0xd0/0xd0
    xlog_cil_force_lsn+0x81/0x200 [xfs]
    ? __slab_free+0xee/0x234
    _xfs_log_force_lsn+0x4d/0x2c0 [xfs]
    ? lookup_page_cgroup_used+0xe/0x30
    ? xfs_reclaim_inode+0xa3/0x330 [xfs]
    xfs_log_force_lsn+0x3f/0xf0 [xfs]
    ? xfs_reclaim_inode+0xa3/0x330 [xfs]
    xfs_iunpin_wait+0xc6/0x1a0 [xfs]
    ? wake_atomic_t_function+0x40/0x40
    xfs_reclaim_inode+0xa3/0x330 [xfs]
    xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs]
    xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
    xfs_fs_free_cached_objects+0x15/0x20 [xfs]
    super_cache_scan+0x178/0x180
    shrink_slab_node+0x14e/0x340
    ? mem_cgroup_iter+0x16b/0x450
    shrink_slab+0x100/0x140
    do_try_to_free_pages+0x335/0x490
    try_to_free_pages+0xb9/0x1f0
    ? __alloc_pages_direct_compact+0x69/0x1be
    __alloc_pages_nodemask+0x69a/0xb40
    alloc_pages_current+0x9e/0x110
    new_slab+0x2c5/0x390
    __slab_alloc+0x33b/0x459
    ? sock_alloc_inode+0x2d/0xd0
    ? inet_sendmsg+0x71/0xc0
    ? sock_alloc_inode+0x2d/0xd0
    kmem_cache_alloc+0x1a2/0x1b0
    sock_alloc_inode+0x2d/0xd0
    alloc_inode+0x26/0xa0
    new_inode_pseudo+0x1a/0x70
    sock_alloc+0x1e/0x80
    __sock_create+0x95/0x220
    sock_create_kern+0x24/0x30
    con_work+0xef9/0x2050 [libceph]
    ? rbd_img_request_submit+0x4c/0x60 [rbd]
    process_one_work+0x159/0x4f0
    worker_thread+0x11b/0x530
    ? create_worker+0x1d0/0x1d0
    kthread+0xc9/0xe0
    ? flush_kthread_worker+0x90/0x90
    ret_from_fork+0x58/0x90
    ? flush_kthread_worker+0x90/0x90

Use memalloc_noio_{save,restore}() to temporarily force GFP_NOIO here.
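
The save/restore scoping used here can be illustrated with a small
userspace model.  This is only a sketch, not the kernel code: the flag
bit, task_flags and every *_model name are hypothetical stand-ins for
current->flags and the real memalloc_noio_{save,restore}() helpers.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-task "no I/O in reclaim" flag bit. */
#define NOIO_FLAG_MODEL 0x1u

/* Models current->flags for a single task. */
static unsigned int task_flags;

/* Set the flag and return its previous state so that scopes can nest. */
static unsigned int noio_save_model(void)
{
    unsigned int old = task_flags & NOIO_FLAG_MODEL;

    task_flags |= NOIO_FLAG_MODEL;
    return old;
}

/* Put the flag back to whatever it was before the matching save. */
static void noio_restore_model(unsigned int old)
{
    task_flags = (task_flags & ~NOIO_FLAG_MODEL) | old;
}

/* An allocation may start I/O during reclaim only if the flag is clear. */
static bool alloc_may_do_io(void)
{
    return !(task_flags & NOIO_FLAG_MODEL);
}

/*
 * Models the socket creation call site: the allocation happens inside
 * the save/restore window.  Returns whether it was allowed to do I/O.
 */
static bool create_socket_model(void)
{
    unsigned int old = noio_save_model();
    bool may_io = alloc_may_do_io();    /* allocation would happen here */

    noio_restore_model(old);
    return may_io;
}
```

In this model create_socket_model() always reports false, and the flag
is back in its previous state afterwards, including when the caller was
already inside an outer save/restore window.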

Link: http://tracker.ceph.com/issues/19309
Reported-by: Sergey Jerusalimov 
Signed-off-by: Ilya Dryomov 
Reviewed-by: Jeff Layton 
Signed-off-by: Greg Kroah-Hartman

libceph: verify authorize reply on connect

2017-01-09T07:07:52+00:00

commit 5c056fdc5b474329037f2aa18401bd73033e0ce0 upstream.

After sending an authorizer (ceph_x_authorize_a + ceph_x_authorize_b),
the client gets back a ceph_x_authorize_reply, which it is supposed to
verify to ensure the authenticity and protect against replay attacks.
The code for doing this is there (ceph_x_verify_authorizer_reply(),
ceph_auth_verify_authorizer_reply() + plumbing), but it is never
invoked by the messenger.

AFAICT this goes back to 2009, when ceph authentication protocols
support was added to the kernel client in 4e7a5dcd1bba ("ceph:
negotiate authentication protocol; implement AUTH_NONE protocol").

The second param of ceph_connection_operations::verify_authorizer_reply
is unused all the way down.  Pass 0 to facilitate backporting, and kill
it in the next commit.

Signed-off-by: Ilya Dryomov 
Reviewed-by: Sage Weil 
Signed-off-by: Greg Kroah-Hartman

libceph: use the right footer size when skipping a message

2016-03-03T23:07:26+00:00

commit dbc0d3caff5b7591e0cf8e34ca686ca6f4479ee1 upstream.

ceph_msg_footer is 21 bytes long, while ceph_msg_footer_old is only 13.
Don't skip too much when CEPH_FEATURE_MSG_AUTH isn't negotiated.
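
The two sizes can be checked with a standalone sketch.  The struct and
macro names below are hypothetical, the field layout is an assumption
chosen to match the 21 and 13 byte sizes quoted above, and the feature
bit value is a placeholder rather than the real CEPH_FEATURE_MSG_AUTH
definition.  The packed attribute requires GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>

/* Old-style footer: three CRCs and a flags byte (assumed layout). */
struct footer_old_model {
    uint32_t front_crc;
    uint32_t middle_crc;
    uint32_t data_crc;
    uint8_t flags;
} __attribute__((packed));

/* New-style footer: same fields plus a 64-bit signature (assumed layout). */
struct footer_model {
    uint32_t front_crc;
    uint32_t middle_crc;
    uint32_t data_crc;
    uint64_t sig;
    uint8_t flags;
} __attribute__((packed));

/* Placeholder feature bit; the real value lives in the ceph headers. */
#define FEATURE_MSG_AUTH_MODEL (1ULL << 23)

/* Pick the number of footer bytes to skip based on negotiated features. */
static size_t footer_skip_size(uint64_t peer_features)
{
    if (peer_features & FEATURE_MSG_AUTH_MODEL)
        return sizeof(struct footer_model);
    return sizeof(struct footer_old_model);
}
```

With these definitions footer_skip_size(0) is 13 and
footer_skip_size(FEATURE_MSG_AUTH_MODEL) is 21, so always skipping the
new footer size would consume 8 bytes of the next message.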

Signed-off-by: Ilya Dryomov 
Reviewed-by: Alex Elder 
Signed-off-by: Greg Kroah-Hartman

libceph: don't bail early from try_read() when skipping a message

2016-03-03T23:07:26+00:00

commit e7a88e82fe380459b864e05b372638aeacb0f52d upstream.

The contract between try_read() and try_write() is that when called
each processes as much data as possible.  When instructed by osd_client
to skip a message, try_read() is violating this contract by returning
after receiving and discarding a single message instead of checking for
more.  try_write() then gets a chance to write out more requests,
generating more replies/skips for try_read() to handle, forcing the
messenger into a starvation loop.
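
A toy model of the read side shows the difference between the two
behaviours.  It is a sketch only: the queue, the reader struct and the
bail_after_skip switch are hypothetical and do not correspond to
anything in messenger.c.

```c
#include <stdbool.h>
#include <stddef.h>

/* One incoming message; skip means the upper layer asked to discard it. */
struct in_msg_model {
    bool skip;
};

/* Hypothetical reader state: pending input plus simple counters. */
struct reader_model {
    const struct in_msg_model *queue;
    size_t len;
    size_t pos;
    size_t delivered;
    size_t skipped;
};

/*
 * Process pending input.  With bail_after_skip set, the function returns
 * right after discarding one message (the behaviour described as the
 * bug); otherwise it keeps going until the input is exhausted.
 */
static void try_read_model(struct reader_model *r, bool bail_after_skip)
{
    while (r->pos < r->len) {
        const struct in_msg_model *m = &r->queue[r->pos++];

        if (m->skip) {
            r->skipped++;
            if (bail_after_skip)
                return;
            continue;    /* check for more data instead of returning */
        }
        r->delivered++;
    }
}
```

For a queue of {skip, normal, skip, normal}, the bailing variant stops
with three messages still pending, while the other one consumes all
four in a single call.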

Reported-by: Varada Kari 
Signed-off-by: Ilya Dryomov 
Tested-by: Varada Kari 
Reviewed-by: Alex Elder 
Signed-off-by: Greg Kroah-Hartman

libceph: fix ceph_msg_revoke()

2016-03-03T23:07:26+00:00

commit 67645d7619738e51c668ca69f097cb90b5470422 upstream.

There are a number of problems with revoking a "was sending" message:

(1) We never make any attempt to revoke data - only kvecs contribute to
con->out_skip.  However, once the header (envelope) is written to the
socket, our peer learns data_len and sets itself to expect at least
data_len bytes to follow front or front+middle.  If ceph_msg_revoke()
is called while the messenger is sending message's data portion,
anything we send after that call is counted by the OSD towards the now
revoked message's data portion.  The effects vary; the most common one
is an eventual hang - higher layers get stuck waiting for the reply to
the message that was sent out after ceph_msg_revoke() returned and
treated by the OSD as a bunch of data bytes.  This is what Matt ran
into.

(2) Flat out zeroing con->out_kvec_bytes worth of bytes to handle kvecs
is wrong.  If ceph_msg_revoke() is called before the tag is sent out or
while the messenger is sending the header, we will get a connection
reset, either due to a bad tag (0 is not a valid tag) or a bad header
CRC, which kind of defeats the purpose of revoke.  Currently the kernel
client refuses to work with header CRCs disabled, but that will likely
change in the future, making this even worse.

(3) con->out_skip is not reset on connection reset, leading to one or
more spurious connection resets if we happen to get a real one after
con->out_skip is set in ceph_msg_revoke() but before it's cleared in
write_partial_skip().

Fixing (1) and (3) is trivial.  The idea behind fixing (2) is to never
zero the tag or the header, i.e. send out tag+header regardless of when
ceph_msg_revoke() is called.  That way the header is always correct, no
unnecessary resets are induced and revoke stands ready for disabled
CRCs.  Since ceph_msg_revoke() rips out con->out_msg, introduce a new
"message out temp" and copy the header into it before sending.
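
The byte accounting behind (1) and (2) can be sketched as a simplified
userspace model.  All names here are hypothetical and the layout
(envelope, then front, middle, data, footer) is taken from the
description above; the real code works on kvecs and a data cursor
rather than on a single byte offset.

```c
#include <stddef.h>

/* Lengths of the on-wire pieces of one message. */
struct wire_msg_model {
    size_t envelope_len;    /* tag + header */
    size_t front_len;
    size_t middle_len;
    size_t data_len;
    size_t footer_len;
};

/* What is still owed to the peer after a revoke. */
struct revoke_plan_model {
    size_t send_real;    /* remaining envelope bytes, sent unmodified */
    size_t send_zero;    /* remaining body bytes, sent as zeros */
};

/*
 * Given how many bytes of the message were already written, work out
 * how much of the envelope must still go out as is and how many body
 * bytes have to be zero-filled so the peer's framing stays intact.
 */
static struct revoke_plan_model plan_revoke(const struct wire_msg_model *m,
                                            size_t sent)
{
    size_t body = m->front_len + m->middle_len + m->data_len +
                  m->footer_len;
    size_t total = m->envelope_len + body;
    struct revoke_plan_model p = { 0, 0 };

    if (sent > total)
        sent = total;

    if (sent < m->envelope_len) {
        /* Never zero the tag or header: finish them first. */
        p.send_real = m->envelope_len - sent;
        p.send_zero = body;
    } else {
        /* Envelope is out; everything left, data included, is skipped. */
        p.send_zero = total - sent;
    }
    return p;
}
```

In every case sent + send_real + send_zero adds up to the full message
length, which is what the peer expects once it has seen the header.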

Reported-by: Matt Conner 
Signed-off-by: Ilya Dryomov 
Tested-by: Matt Conner 
Reviewed-by: Sage Weil 
Signed-off-by: Greg Kroah-Hartman

libceph: clear msg->con in ceph_msg_release() only

2015-11-02T22:37:46+00:00

The following bit in ceph_msg_revoke_incoming() is unsafe:

    struct ceph_connection *con = msg->con;
    if (!con)
            return;
    mutex_lock(&con->mutex);
    <msg->con use>

There is nothing preventing con from getting destroyed right after
msg->con test.  One easy way to reproduce this is to disable message
signing only on the server side and try to map an image.  The system
will go into a

    libceph: read_partial_message ffff880073f0ab68 signature check failed
    libceph: osd0 192.168.255.155:6801 bad crc/signature
    libceph: read_partial_message ffff880073f0ab68 signature check failed
    libceph: osd0 192.168.255.155:6801 bad crc/signature

loop which has to be interrupted with Ctrl-C.  Hit Ctrl-C and you are
likely to end up with a random GP fault if the reset handler executes
"within" ceph_msg_revoke_incoming():

                     
                                   ...
          
    rbd_obj_request_end
      ceph_osdc_cancel_request
        __unregister_request
          ceph_osdc_put_request
            ceph_msg_revoke_incoming
                                   ...
                                osd_reset
                                  __kick_osd_requests
                                    __reset_osd
                                      remove_osd
                                        ceph_con_close
                                          reset_connection
                                            <clear con->in_msg->con>
                                            
                                              put_osd
                                                
              <msg->con use> <-- !!!

If ceph_msg_revoke_incoming() executes "before" the reset handler,
osd/con will be leaked because ceph_msg_revoke_incoming() clears
con->in_msg but doesn't put con ref, while reset_connection() only puts
con ref if con->in_msg != NULL.

The current msg->con scheme was introduced by commits 38941f8031bf
("libceph: have messages point to their connection") and 92ce034b5a74
("libceph: have messages take a connection reference"), which defined
when messages get associated with a connection and when that
association goes away.  Part of the problem is that this association is
supposed to go away in far too many places; closing this race entirely
requires either a rework of the existing synchronization or the
addition of a new layer of it.

In lieu of that, we can make it *much* less likely to hit by
disassociating messages only on their destruction and resend through
a different connection.  This makes the code simpler and is probably
a good thing to do regardless - this patch adds a msg_con_set() helper
which is called from only three places: ceph_con_send() and
ceph_con_in_msg_alloc() to set msg->con and ceph_msg_release() to clear
it.
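
The idea of funnelling every msg->con change through one helper can be
sketched with a refcounted userspace model.  This is an illustration
under assumptions, not the helper from the patch: the structs and the
con_get()/con_put() functions are hypothetical stand-ins for the
connection's get/put operations.

```c
#include <stddef.h>

/* Hypothetical connection with a plain reference counter. */
struct con_model {
    int refs;
};

/* Hypothetical message that may point at one connection. */
struct msg_model {
    struct con_model *con;
};

static struct con_model *con_get(struct con_model *c)
{
    c->refs++;
    return c;
}

static void con_put(struct con_model *c)
{
    c->refs--;
}

/*
 * Single place where msg->con changes: drop the reference to the old
 * connection, if any, then take one on the new connection, if any.
 */
static void msg_con_set_model(struct msg_model *msg, struct con_model *con)
{
    if (msg->con)
        con_put(msg->con);

    msg->con = con ? con_get(con) : NULL;
}
```

Because the put and the get always happen together, a message that is
resent through a different connection releases the old reference and
takes the new one in a single step, and clearing on release is just a
call with NULL.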

Signed-off-by: Ilya Dryomov

libceph: add nocephx_sign_messages option

2015-11-02T22:37:46+00:00

Support for message signing was merged into 3.19, along with
nocephx_require_signatures option.  But, all that option does is allow
the kernel client to talk to clusters that don't support MSG_AUTH
feature bit.  That's pretty useless, given that it's been supported
since bobtail.

Meanwhile, if one disables message signing on the server side with
"cephx sign messages = false", it becomes impossible to use the kernel
client since it expects messages to be signed if MSG_AUTH was
negotiated.  Add nocephx_sign_messages option to support this use case.

Signed-off-by: Ilya Dryomov

libceph: stop duplicating client fields in messenger

2015-11-02T22:37:46+00:00

supported_features and required_features serve no purpose at all, while
nocrc and tcp_nodelay belong to ceph_options::flags.

Signed-off-by: Ilya Dryomov

libceph: msg signing callouts don't need con argument

2015-11-02T22:37:45+00:00

We can use msg->con instead - at the point we sign an outgoing message
or check the signature on the incoming one, msg->con is always set.  We
wouldn't know how to sign a message without an associated session (i.e.
msg->con == NULL) and being able to sign a message using an explicitly
provided authorizer is of no use.

Signed-off-by: Ilya Dryomov

libceph: use local variable cursor instead of &msg->cursor

2015-11-02T22:36:47+00:00

Use local variable cursor in place of &msg->cursor in
read_partial_msg_data() and write_partial_msg_data().

Signed-off-by: Shraddha Barke 
Signed-off-by: Ilya Dryomov