linux-toradex.git/net/ceph/messenger.c, branch v4.9.52

libceph: force GFP_NOIO for socket allocations

2017-04-08T07:30:30+00:00

commit 633ee407b9d15a75ac9740ba9d3338815e1fcb95 upstream.

sock_alloc_inode() allocates socket+inode and socket_wq with
GFP_KERNEL, which is not allowed on the writeback path:

    Workqueue: ceph-msgr con_work [libceph]
    ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
    0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
    ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
    Call Trace:
    [] schedule+0x29/0x70
    [] schedule_timeout+0x1bd/0x200
    [] ? ttwu_do_wakeup+0x2c/0x120
    [] ? ttwu_do_activate.constprop.135+0x66/0x70
    [] wait_for_completion+0xbf/0x180
    [] ? try_to_wake_up+0x390/0x390
    [] flush_work+0x165/0x250
    [] ? worker_detach_from_pool+0xd0/0xd0
    [] xlog_cil_force_lsn+0x81/0x200 [xfs]
    [] ? __slab_free+0xee/0x234
    [] _xfs_log_force_lsn+0x4d/0x2c0 [xfs]
    [] ? lookup_page_cgroup_used+0xe/0x30
    [] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
    [] xfs_log_force_lsn+0x3f/0xf0 [xfs]
    [] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
    [] xfs_iunpin_wait+0xc6/0x1a0 [xfs]
    [] ? wake_atomic_t_function+0x40/0x40
    [] xfs_reclaim_inode+0xa3/0x330 [xfs]
    [] xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs]
    [] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
    [] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
    [] super_cache_scan+0x178/0x180
    [] shrink_slab_node+0x14e/0x340
    [] ? mem_cgroup_iter+0x16b/0x450
    [] shrink_slab+0x100/0x140
    [] do_try_to_free_pages+0x335/0x490
    [] try_to_free_pages+0xb9/0x1f0
    [] ? __alloc_pages_direct_compact+0x69/0x1be
    [] __alloc_pages_nodemask+0x69a/0xb40
    [] alloc_pages_current+0x9e/0x110
    [] new_slab+0x2c5/0x390
    [] __slab_alloc+0x33b/0x459
    [] ? sock_alloc_inode+0x2d/0xd0
    [] ? inet_sendmsg+0x71/0xc0
    [] ? sock_alloc_inode+0x2d/0xd0
    [] kmem_cache_alloc+0x1a2/0x1b0
    [] sock_alloc_inode+0x2d/0xd0
    [] alloc_inode+0x26/0xa0
    [] new_inode_pseudo+0x1a/0x70
    [] sock_alloc+0x1e/0x80
    [] __sock_create+0x95/0x220
    [] sock_create_kern+0x24/0x30
    [] con_work+0xef9/0x2050 [libceph]
    [] ? rbd_img_request_submit+0x4c/0x60 [rbd]
    [] process_one_work+0x159/0x4f0
    [] worker_thread+0x11b/0x530
    [] ? create_worker+0x1d0/0x1d0
    [] kthread+0xc9/0xe0
    [] ? flush_kthread_worker+0x90/0x90
    [] ret_from_fork+0x58/0x90
    [] ? flush_kthread_worker+0x90/0x90

Use memalloc_noio_{save,restore}() to temporarily force GFP_NOIO here.
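
The save/restore pattern can be illustrated with a small userspace model. This is only a sketch: every `model_` / `MODEL_` name is hypothetical, the flag values are made up, and the real helpers operate on `current->flags` inside the kernel. It shows how a scoped task flag makes a callee's GFP_KERNEL request behave like GFP_NOIO without changing the callee.

```c
/* Userspace model; all names and bit values here are illustrative only. */
#define MODEL_PF_MEMALLOC_NOIO	0x1u

#define MODEL_GFP_RECLAIM	0x4u
#define MODEL_GFP_IO		0x1u
#define MODEL_GFP_FS		0x2u
#define MODEL_GFP_KERNEL	(MODEL_GFP_RECLAIM | MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOIO		(MODEL_GFP_RECLAIM)

static unsigned int model_task_flags;	/* stands in for current->flags */

/* Set the NOIO flag and return its previous state so nesting works. */
unsigned int model_noio_save(void)
{
	unsigned int old = model_task_flags & MODEL_PF_MEMALLOC_NOIO;

	model_task_flags |= MODEL_PF_MEMALLOC_NOIO;
	return old;
}

/* Put the NOIO flag back to whatever it was before the matching save. */
void model_noio_restore(unsigned int old)
{
	model_task_flags = (model_task_flags & ~MODEL_PF_MEMALLOC_NOIO) | old;
}

/* The allocator strips the IO and FS bits while the task flag is set. */
unsigned int model_effective_gfp(unsigned int gfp)
{
	if (model_task_flags & MODEL_PF_MEMALLOC_NOIO)
		gfp &= ~(MODEL_GFP_IO | MODEL_GFP_FS);
	return gfp;
}

/* Stand-in for a socket-creation helper that hardcodes GFP_KERNEL. */
unsigned int model_sock_create(void)
{
	return model_effective_gfp(MODEL_GFP_KERNEL);
}

/* Caller on the writeback path: wrap the call in save/restore. */
unsigned int model_connect(void)
{
	unsigned int noio_flag, used;

	noio_flag = model_noio_save();
	used = model_sock_create();
	model_noio_restore(noio_flag);
	return used;
}
```

In this model, model_connect() reports MODEL_GFP_NOIO, while a bare model_sock_create() call outside the save/restore window still reports MODEL_GFP_KERNEL.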

Link: http://tracker.ceph.com/issues/19309
Reported-by: Sergey Jerusalimov 
Signed-off-by: Ilya Dryomov 
Reviewed-by: Jeff Layton 
Signed-off-by: Greg Kroah-Hartman

libceph: verify authorize reply on connect

2017-01-09T07:32:24+00:00

commit 5c056fdc5b474329037f2aa18401bd73033e0ce0 upstream.

After sending an authorizer (ceph_x_authorize_a + ceph_x_authorize_b),
the client gets back a ceph_x_authorize_reply, which it is supposed to
verify to ensure the authenticity and protect against replay attacks.
The code for doing this is there (ceph_x_verify_authorizer_reply(),
ceph_auth_verify_authorizer_reply() + plumbing), but it is never
invoked by the messenger.

AFAICT this goes back to 2009, when ceph authentication protocols
support was added to the kernel client in 4e7a5dcd1bba ("ceph:
negotiate authentication protocol; implement AUTH_NONE protocol").

The second param of ceph_connection_operations::verify_authorizer_reply
is unused all the way down.  Pass 0 to facilitate backporting, and kill
it in the next commit.
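
A rough userspace model of the missing step, assuming the messenger only needs to call the verify hook when an authorizer was actually sent. The `model_` types and functions are hypothetical stand-ins, not the kernel structures; the error string is likewise illustrative.

```c
#include <stddef.h>

struct model_con;

struct model_con_ops {
	/* The second parameter is unused, mirroring the description above. */
	int (*verify_authorizer_reply)(struct model_con *con, int len);
};

struct model_con {
	const struct model_con_ops *ops;
	const void *auth_reply_buf;	/* non-NULL if an authorizer was sent */
	const char *error_msg;
};

/*
 * Connect-reply handling: verify the authorize reply before going any
 * further.  Returns 0 to continue the handshake, negative to fail it.
 */
int model_process_connect(struct model_con *con)
{
	int ret;

	if (con->auth_reply_buf) {
		ret = con->ops->verify_authorizer_reply(con, 0);
		if (ret < 0) {
			con->error_msg = "bad authorize reply";
			return ret;
		}
	}
	return 0;
}

/* Two toy verifiers used to exercise the model. */
int model_verify_ok(struct model_con *con, int len)
{
	(void)con;
	(void)len;
	return 0;
}

int model_verify_bad(struct model_con *con, int len)
{
	(void)con;
	(void)len;
	return -1;
}
```

With model_verify_bad installed and a reply buffer present, model_process_connect() fails and records an error message; without a reply buffer the hook is not consulted at all.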

Signed-off-by: Ilya Dryomov 
Reviewed-by: Sage Weil 
Signed-off-by: Greg Kroah-Hartman

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

2016-04-04T17:41:08+00:00

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the
page cache with chunks bigger than PAGE_SIZE.

This promise never materialized, and it is unlikely that it ever will.

We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE.  And it's a constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();
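
As an illustration of what the rename means for a typical caller, here is a userspace sketch of index/offset arithmetic written in the post-conversion style. The `MODEL_` macros mirror the PAGE_* names and assume 4 KiB pages; the helper functions are hypothetical, and the "before" forms are shown only in comments.

```c
/* Illustrative values: 4 KiB pages are an assumption of this sketch. */
#define MODEL_PAGE_SHIFT	12
#define MODEL_PAGE_SIZE		(1UL << MODEL_PAGE_SHIFT)
#define MODEL_PAGE_MASK		(~(MODEL_PAGE_SIZE - 1))

/*
 * Before: pos >> PAGE_CACHE_SHIFT
 * After:  pos >> PAGE_SHIFT
 */
unsigned long model_page_index(unsigned long long pos)
{
	return (unsigned long)(pos >> MODEL_PAGE_SHIFT);
}

/*
 * Before: pos & ~PAGE_CACHE_MASK
 * After:  pos & ~PAGE_MASK
 */
unsigned long model_offset_in_page(unsigned long long pos)
{
	return (unsigned long)(pos & ~MODEL_PAGE_MASK);
}

/*
 * Before: index << (PAGE_CACHE_SHIFT - PAGE_SHIFT), a shift by zero
 * After:  index
 */
unsigned long model_cache_index_to_page_index(unsigned long index)
{
	return index;
}
```

Because the two families of constants were always equal in practice, none of these conversions changes the computed values.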

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.

There are a few places in the code that coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation will
also be addressed in a separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov 
Acked-by: Michal Hocko 
Signed-off-by: Linus Torvalds

libceph: use KMEM_CACHE macro

2016-03-25T17:51:57+00:00

Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code.
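
To show what the macro saves, here is a userspace model of the pattern: the cache name, object size and alignment are all derived from the struct name. Everything prefixed `model_` / `MODEL_` is a hypothetical stand-in, and `struct ceph_msg_demo` is an invented example type, not a libceph structure. `__alignof__` is a GCC/Clang extension.

```c
#include <stddef.h>

struct model_kmem_cache {
	const char *name;
	size_t size;
	size_t align;
	unsigned long flags;
};

static struct model_kmem_cache model_last_cache;

/* Stand-in for the long-form constructor with its five arguments. */
struct model_kmem_cache *model_kmem_cache_create(const char *name,
						 size_t size, size_t align,
						 unsigned long flags,
						 void (*ctor)(void *))
{
	(void)ctor;
	model_last_cache.name = name;
	model_last_cache.size = size;
	model_last_cache.align = align;
	model_last_cache.flags = flags;
	return &model_last_cache;
}

/* Short form: stringify the struct name, take sizeof and alignment. */
#define MODEL_KMEM_CACHE(__struct, __flags)				\
	model_kmem_cache_create(#__struct, sizeof(struct __struct),	\
				__alignof__(struct __struct),		\
				(__flags), NULL)

/* Invented example object type. */
struct ceph_msg_demo {
	int type;
	void *front;
};

struct model_kmem_cache *model_msg_cache_init(void)
{
	return MODEL_KMEM_CACHE(ceph_msg_demo, 0);
}
```

The short form leaves less room for a mismatched name or size, at the cost of always naming the cache after the struct.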

Signed-off-by: Geliang Tang 
Signed-off-by: Ilya Dryomov

libceph: use sizeof_footer() more

2016-03-25T17:51:53+00:00

Don't open-code sizeof_footer() in read_partial_message() and
ceph_msg_revoke().  Also, after switching to sizeof_footer(), it's now
possible to use con_out_kvec_add() in prepare_write_message_footer().

Signed-off-by: Ilya Dryomov 
Reviewed-by: Alex Elder

libceph: use the right footer size when skipping a message

2016-02-24T19:28:40+00:00

ceph_msg_footer is 21 bytes long, while ceph_msg_footer_old is only 13.
Don't skip too much when CEPH_FEATURE_MSG_AUTH isn't negotiated.
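
The two sizes can be checked with a small userspace sketch. The field layout below is an assumption chosen so that the packed sizes come out at the 21 and 13 bytes quoted above; the `model_` names and the feature bit position are illustrative rather than taken from the kernel headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Old footer: three CRCs plus a flags byte. */
struct model_footer_old {
	uint32_t front_crc;
	uint32_t middle_crc;
	uint32_t data_crc;
	uint8_t flags;
} __attribute__((packed));

/* New footer: same fields plus a 64-bit signature. */
struct model_footer {
	uint32_t front_crc;
	uint32_t middle_crc;
	uint32_t data_crc;
	uint64_t sig;
	uint8_t flags;
} __attribute__((packed));

/* Illustrative bit position for the negotiated feature. */
#define MODEL_FEATURE_MSG_AUTH	(1ULL << 23)

/* Pick the footer size that matches what the peer negotiated. */
size_t model_sizeof_footer(uint64_t peer_features)
{
	return (peer_features & MODEL_FEATURE_MSG_AUTH) ?
	    sizeof(struct model_footer) :
	    sizeof(struct model_footer_old);
}

/* Bytes to discard when skipping a whole message body plus footer. */
size_t model_skip_len(uint64_t peer_features, size_t front_len,
		      size_t middle_len, size_t data_len)
{
	return front_len + middle_len + data_len +
	       model_sizeof_footer(peer_features);
}
```

Skipping with the larger size when the feature is off would consume 8 bytes that belong to whatever follows on the wire.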

Cc: stable@vger.kernel.org # 3.19+
Signed-off-by: Ilya Dryomov 
Reviewed-by: Alex Elder

libceph: don't bail early from try_read() when skipping a message

2016-02-24T19:28:31+00:00

The contract between try_read() and try_write() is that when called
each processes as much data as possible.  When instructed by osd_client
to skip a message, try_read() is violating this contract by returning
after receiving and discarding a single message instead of checking for
more.  try_write() then gets a chance to write out more requests,
generating more replies/skips for try_read() to handle, forcing the
messenger into a starvation loop.
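
The difference can be modelled in userspace with a queue of already-received messages, some of which are marked to be skipped. This is a simplified sketch with hypothetical `model_` names; the real try_read() works on a socket and a state machine rather than an array.

```c
#include <stddef.h>

struct model_msg {
	int skip;	/* nonzero: the upper layer asked to skip this one */
};

/*
 * Pre-fix behaviour: return right after discarding a single skipped
 * message, leaving the rest of the queue unread.
 * Returns the number of messages consumed.
 */
size_t model_try_read_old(const struct model_msg *q, size_t n,
			  size_t *delivered)
{
	size_t i;

	*delivered = 0;
	for (i = 0; i < n; i++) {
		if (q[i].skip)
			return i + 1;	/* discarded one message, bail */
		(*delivered)++;
	}
	return n;
}

/*
 * Post-fix behaviour: discard the skipped message and keep going until
 * there is nothing left to read.
 */
size_t model_try_read_new(const struct model_msg *q, size_t n,
			  size_t *delivered)
{
	size_t i;

	*delivered = 0;
	for (i = 0; i < n; i++) {
		if (q[i].skip)
			continue;	/* discard and check for more */
		(*delivered)++;
	}
	return n;
}
```

In the old variant every skip hands control back to the writer with input still pending, which is the condition the starvation loop depends on.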

Cc: stable@vger.kernel.org # 3.10+
Reported-by: Varada Kari 
Signed-off-by: Ilya Dryomov 
Tested-by: Varada Kari 
Reviewed-by: Alex Elder

libceph: clear messenger auth_retry flag if we fault

2016-01-21T18:36:08+00:00

Commit 20e55c4cc758 ("libceph: clear messenger auth_retry flag when we
authenticate") got us only half way there.  We clear the flag if the
second attempt succeeds, but it also needs to be cleared if that
attempt fails, to allow for the exponential backoff to kick in.
Otherwise, if ->should_authenticate() thinks our keys are valid, we
will busy loop, incrementing auth_retry to no avail:

    process_connect ffff880079a63830 got BADAUTHORIZER attempt 1
    process_connect ffff880079a63830 got BADAUTHORIZER attempt 2
    process_connect ffff880079a63830 got BADAUTHORIZER attempt 3
    process_connect ffff880079a63830 got BADAUTHORIZER attempt 4
    process_connect ffff880079a63830 got BADAUTHORIZER attempt 5
    ...
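
A toy model of the loop, assuming the retry logic faults only when the counter is exactly 2 (an assumption of this sketch, as are all `model_` names). It counts how many faults, and therefore backoff opportunities, occur over a run of consecutive BADAUTHORIZER replies.

```c
struct model_auth_con {
	int auth_retry;
	int faults;
};

/*
 * Handle one BADAUTHORIZER reply.  Returns -1 when the connection
 * should fault (which is what lets backoff kick in), 0 to retry
 * immediately.
 */
int model_handle_badauthorizer(struct model_auth_con *con)
{
	con->auth_retry++;
	if (con->auth_retry == 2)
		return -1;
	return 0;
}

/* Fault handling; clear_on_fault selects the fixed behaviour. */
void model_fault(struct model_auth_con *con, int clear_on_fault)
{
	con->faults++;
	if (clear_on_fault)
		con->auth_retry = 0;
}

/* Feed `attempts` consecutive rejections and count the faults. */
int model_run(int attempts, int clear_on_fault)
{
	struct model_auth_con con = { 0, 0 };
	int i;

	for (i = 0; i < attempts; i++) {
		if (model_handle_badauthorizer(&con) < 0)
			model_fault(&con, clear_on_fault);
	}
	return con.faults;
}
```

Without the reset the counter passes 2 once and never matches again, so model_run(6, 0) records a single fault; with the reset, model_run(6, 1) faults on every second attempt.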

Signed-off-by: Ilya Dryomov 
Reviewed-by: Sage Weil

libceph: fix ceph_msg_revoke()

2016-01-21T18:36:08+00:00

There are a number of problems with revoking a "was sending" message:

(1) We never make any attempt to revoke data - only kvecs contribute to
con->out_skip.  However, once the header (envelope) is written to the
socket, our peer learns data_len and sets itself to expect at least
data_len bytes to follow front or front+middle.  If ceph_msg_revoke()
is called while the messenger is sending message's data portion,
anything we send after that call is counted by the OSD towards the now
revoked message's data portion.  The effects vary; the most common one
is an eventual hang - higher layers get stuck waiting for the reply to
the message that was sent out after ceph_msg_revoke() returned and was
treated by the OSD as a bunch of data bytes.  This is what Matt ran
into.

(2) Flat out zeroing con->out_kvec_bytes worth of bytes to handle kvecs
is wrong.  If ceph_msg_revoke() is called before the tag is sent out or
while the messenger is sending the header, we will get a connection
reset, either due to a bad tag (0 is not a valid tag) or a bad header
CRC, which kind of defeats the purpose of revoke.  Currently the kernel
client refuses to work with header CRCs disabled, but that will likely
change in the future, making this even worse.

(3) con->out_skip is not reset on connection reset, leading to one or
more spurious connection resets if we happen to get a real one after
con->out_skip is set in ceph_msg_revoke() but before it's cleared in
write_partial_skip().

Fixing (1) and (3) is trivial.  The idea behind fixing (2) is to never
zero the tag or the header, i.e. send out tag+header regardless of when
ceph_msg_revoke() is called.  That way the header is always correct, no
unnecessary resets are induced and revoke stands ready for disabled
CRCs.  Since ceph_msg_revoke() rips out con->out_msg, introduce a new
"message out temp" and copy the header into it before sending.

Cc: stable@vger.kernel.org # 4.0+
Reported-by: Matt Conner 
Signed-off-by: Ilya Dryomov 
Tested-by: Matt Conner 
Reviewed-by: Sage Weil

libceph: use list_for_each_entry_safe

2016-01-21T18:36:08+00:00

Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.
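
For comparison, here is a userspace re-creation of the two iteration styles. The list helpers and macros are minimal stand-ins written for this sketch (they follow the shape of the kernel's list API but are not copied from it), `struct demo_msg` is invented, and `__typeof__` is a GCC/Clang extension.

```c
#include <stddef.h>
#include <stdlib.h>

struct list_head { struct list_head *next, *prev; };

#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Cursor is a raw list_head; the caller converts it by hand. */
#define list_for_each_safe(pos, n, head) \
	for (pos = (head)->next, n = pos->next; pos != (head); \
	     pos = n, n = pos->next)

/* Cursor is the containing struct; the next entry is cached up front. */
#define list_for_each_entry_safe(pos, n, head, member)			\
	for (pos = list_entry((head)->next, __typeof__(*pos), member),	\
	     n = list_entry(pos->member.next, __typeof__(*pos), member); \
	     &pos->member != (head);					\
	     pos = n, n = list_entry(n->member.next, __typeof__(*n), member))

struct demo_msg { int id; struct list_head item; };

void list_init(struct list_head *h) { h->next = h->prev = h; }

void list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

/* Allocate a message and append it; returns 0 on success. */
int demo_add(struct list_head *head, int id)
{
	struct demo_msg *m = malloc(sizeof(*m));

	if (!m)
		return -1;
	m->id = id;
	m->item.next = head;
	m->item.prev = head->prev;
	head->prev->next = &m->item;
	head->prev = &m->item;
	return 0;
}

/* Old style: iterate list_heads, then call list_entry() in the body. */
int demo_drain_old(struct list_head *head)
{
	struct list_head *p, *n;
	int sum = 0;

	list_for_each_safe(p, n, head) {
		struct demo_msg *m = list_entry(p, struct demo_msg, item);

		sum += m->id;
		list_del(p);
		free(m);
	}
	return sum;
}

/* New style: the macro hands out the entry directly. */
int demo_drain(struct list_head *head)
{
	struct demo_msg *m, *next;
	int sum = 0;

	list_for_each_entry_safe(m, next, head, item) {
		sum += m->id;
		list_del(&m->item);
		free(m);
	}
	return sum;
}
```

Both functions free every entry and leave the list empty; the entry-based form just drops the explicit list_entry() conversion from the loop body.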

Signed-off-by: Geliang Tang 
[idryomov@gmail.com: nuke call to list_splice_init() as well]
Signed-off-by: Ilya Dryomov