linux-toradex.git/fs/dlm, branch v6.2-rc3

Treewide: Stop corrupting socket's task_frag

2022-12-20T01:28:49+00:00

Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
GFP_NOIO flag on sk_allocation which the networking system uses to decide
when it is safe to use current->task_frag.  The result is unexpected
corruption in task_frag when SUNRPC is involved in memory reclaim.

The corruption can be seen in crashes, but the root cause is often
difficult to ascertain as a crashing machine's stack trace will have no
evidence of being near NFS or SUNRPC code.  I believe this problem to
be much more pervasive than reports to the community may indicate.

Fix this by having kernel users of sockets that may corrupt task_frag due
to reclaim set sk_use_task_frag = false.  Preemptively correcting this
situation for users that still set sk_allocation allows them to convert to
memalloc_nofs_save/restore without the same unexpected corruptions that are
sure to follow, unlikely to show up in testing, and difficult to bisect.
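
As an illustration only, the selection described above can be modelled in
userspace C. This is a minimal sketch that assumes the networking code picks
the fragment purely from the sk_use_task_frag flag. The struct layouts,
current_task, kernel_user_init_sock() and task_frag_example() are
hypothetical stand-ins and not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-ins; the kernel types carry many more fields. */
struct page_frag { unsigned int offset; };

struct task { struct page_frag task_frag; };

struct sock {
	struct page_frag sk_frag;	/* per-socket fragment */
	bool sk_use_task_frag;		/* flag named in the commit message */
};

/* Hypothetical stand-in for the kernel's "current" task. */
static struct task current_task;

/* Pick the page fragment a send path may use for this socket. */
static struct page_frag *sk_page_frag(struct sock *sk)
{
	if (sk->sk_use_task_frag)
		return &current_task.task_frag;
	return &sk->sk_frag;
}

/*
 * What a kernel socket user that can be entered from memory reclaim
 * would do after creating its socket (hypothetical helper name).
 */
static void kernel_user_init_sock(struct sock *sk)
{
	sk->sk_use_task_frag = false;
}

/* Usage example: the flag alone decides which fragment is handed out. */
static void task_frag_example(void)
{
	struct sock sk = { .sk_use_task_frag = true };

	assert(sk_page_frag(&sk) == &current_task.task_frag);
	kernel_user_init_sock(&sk);
	assert(sk_page_frag(&sk) == &sk.sk_frag);
}
```

With the flag cleared, a send that is entered recursively from reclaim works
on the socket's own fragment and cannot clobber the task's fragment.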

CC: Philipp Reisner 
CC: Lars Ellenberg 
CC: "Christoph Böhmwalder" 
CC: Jens Axboe 
CC: Josef Bacik 
CC: Keith Busch 
CC: Christoph Hellwig 
CC: Sagi Grimberg 
CC: Lee Duncan 
CC: Chris Leech 
CC: Mike Christie 
CC: "James E.J. Bottomley" 
CC: "Martin K. Petersen" 
CC: Valentina Manea 
CC: Shuah Khan 
CC: Greg Kroah-Hartman 
CC: David Howells 
CC: Marc Dionne 
CC: Steve French 
CC: Christine Caulfield 
CC: David Teigland 
CC: Mark Fasheh 
CC: Joel Becker 
CC: Joseph Qi 
CC: Eric Van Hensbergen 
CC: Latchesar Ionkov 
CC: Dominique Martinet 
CC: Ilya Dryomov 
CC: Xiubo Li 
CC: Chuck Lever 
CC: Jeff Layton 
CC: Trond Myklebust 
CC: Anna Schumaker 
CC: Steffen Klassert 
CC: Herbert Xu 

Suggested-by: Guillaume Nault 
Signed-off-by: Benjamin Coddington 
Reviewed-by: Guillaume Nault 
Signed-off-by: Jakub Kicinski

fs: dlm: fix building without lockdep

2022-11-22T16:14:26+00:00

This patch uses assert_spin_locked() instead of lockdep_is_held()
where possible, because lockdep_is_held() is only available if
CONFIG_LOCKDEP is set.

In other cases, such as lockdep_sock_is_held(), we surround the call with
a CONFIG_LOCKDEP ifdef.
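
The two patterns can be sketched in userspace C as follows. This is a toy
model under stated assumptions: struct toy_spinlock and the toy_*() helpers
are hypothetical and only mimic the shape of the kernel helpers, and
CONFIG_LOCKDEP is treated as an ordinary preprocessor symbol.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy spinlock; the kernel's spinlock_t is far more involved. */
struct toy_spinlock { bool locked; };

static void toy_spin_lock(struct toy_spinlock *l)   { l->locked = true; }
static void toy_spin_unlock(struct toy_spinlock *l) { l->locked = false; }

/* Always available, in the spirit of assert_spin_locked(). */
static void toy_assert_spin_locked(struct toy_spinlock *l)
{
	assert(l->locked);
}

#ifdef CONFIG_LOCKDEP
/* Only exists when lockdep is configured. */
static bool toy_lockdep_is_held(struct toy_spinlock *l)
{
	return l->locked;
}
#endif

/* A function that requires its caller to hold the lock. */
static int update_counter(struct toy_spinlock *l, int *counter)
{
	/* Pattern 1: a check that builds with or without lockdep. */
	toy_assert_spin_locked(l);

#ifdef CONFIG_LOCKDEP
	/* Pattern 2: a lockdep-only check, guarded by an ifdef. */
	assert(toy_lockdep_is_held(l));
#endif
	return ++(*counter);
}
```

Building this sketch without -DCONFIG_LOCKDEP still compiles, which is the
property the fix restores.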

Fixes: dbb751ffab0b ("fs: dlm: parallelize lowcomms socket handling")
Reported-by: kernel test robot 
Signed-off-by: Alexander Aring 
Signed-off-by: David Teigland

fs: dlm: parallelize lowcomms socket handling

2022-11-21T15:45:49+00:00

This patch reworks the lowcomms handling. The main goal is to let
recvmsg() and sendpage() run in parallel, in two senses: 1. per
connection, and 2. recvmsg() and sendpage() don't block each other.

Currently recvmsg()/sendpage() cannot run in parallel because the two
workqueues "dlm_recv" and "dlm_send" are ordered workqueues. That means
only one work item can be executed at a time per workqueue, while the
number of queued items grows with the number of nodes in the cluster. The
two workqueues for sending and receiving can also block each other if the
same connection is handled at the same time in the dlm_recv and dlm_send
workqueues, because of a per connection mutex for the socket handling.

To make it more parallel we introduce one "dlm_io" workqueue which is
not an ordered workqueue; the number of workers is not limited. Using
per connection SEND/RECV pending flags, we schedule workers ordered per
connection and per send and receive task. To get rid of the mutex that
blocked workers of the same connection from doing socket handling, we
switched to a semaphore which handles socket operations as read locks and
sock releases as write locks, to prevent sock_release() being called
while the socket is being used.
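
The per connection pending flags can be sketched in userspace C with C11
atomics. This is only a model of the idea, not the lowcomms code: the flag
names CF_RECV_PENDING and CF_SEND_PENDING, the queued_* counters (standing in
for queueing a work item on "dlm_io") and the helper names are assumptions
made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed flag names; one bit per direction and connection. */
enum { CF_RECV_PENDING = 1u << 0, CF_SEND_PENDING = 1u << 1 };

struct connection {
	atomic_uint flags;
	int queued_recv;	/* counts queued receive workers */
	int queued_send;	/* counts queued send workers */
};

/* Returns true if the caller set the bit, i.e. no worker was pending. */
static bool test_and_set_pending(struct connection *con, unsigned int bit)
{
	return !(atomic_fetch_or(&con->flags, bit) & bit);
}

/* Queue at most one receive worker per connection. */
static void schedule_recv(struct connection *con)
{
	if (test_and_set_pending(con, CF_RECV_PENDING))
		con->queued_recv++;	/* stands in for queueing the work */
}

/* Queue at most one send worker per connection. */
static void schedule_send(struct connection *con)
{
	if (test_and_set_pending(con, CF_SEND_PENDING))
		con->queued_send++;
}

/* The receive worker finished: allow the next one to be queued. */
static void recv_worker_done(struct connection *con)
{
	atomic_fetch_and(&con->flags, ~(unsigned int)CF_RECV_PENDING);
}

/* Usage example: duplicates are dropped, send and recv are independent. */
static void pending_flags_example(void)
{
	struct connection con = { 0 };

	schedule_recv(&con);
	schedule_recv(&con);		/* already pending, not queued again */
	schedule_send(&con);		/* independent of the recv flag */
	assert(con.queued_recv == 1 && con.queued_send == 1);

	recv_worker_done(&con);
	schedule_recv(&con);
	assert(con.queued_recv == 2);
}
```

Because each direction of each connection has at most one worker in flight,
work stays ordered per connection even though the workqueue itself is not
ordered.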

There might be more optimization possible by removing the semaphore and
replacing it with another synchronization mechanism. However, due to other
circumstances, e.g. the othercon behaviour, this change seems complicated.
I added comments about removing the othercon handling and moving to a
different synchronization mechanism once that is done. That has to wait
for the next dlm major version upgrade because it is not backwards
compatible with the current connect mechanism.

The processing of dlm messages still needs to be handled by an ordered
workqueue. A dlm_process ordered workqueue was introduced which gets
filled by the receive worker. This is probably the next bottleneck of
DLM, but the application can't currently parse dlm messages in parallel. A
comment was added about moving dlm processing out of workqueue context
into a non-sleepable softirq to get message processing done fast.

Signed-off-by: Alexander Aring 
Signed-off-by: David Teigland

fs: dlm: don't init error value

2022-11-21T15:45:49+00:00

This patch removes an unnecessary initialization of an error value to
-EINVAL.

Signed-off-by: Alexander Aring 
Signed-off-by: David Teigland

fs: dlm: use saved sk_error_report()

2022-11-21T15:45:49+00:00

This patch changes how the original sk_error_report() is called: instead
of copying the callback onto the stack and calling it later, the saved
callback is called directly. If listen_sock.sk_error_report() is NULL at
this moment, it indicates a bug in our implementation.
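
A userspace sketch of the callback handling, under stated assumptions: the
trimmed-down struct sock, orig_error_report() and install_callbacks() are
hypothetical, and only the names listen_sock and sk_error_report come from
the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Trimmed-down stand-in for the kernel's struct sock. */
struct sock {
	void (*sk_error_report)(struct sock *sk);
	int errors_seen;
};

/* Original callback, saved once when the callbacks are installed. */
static struct {
	void (*sk_error_report)(struct sock *sk);
} listen_sock;

/* Hypothetical default callback the socket started out with. */
static void orig_error_report(struct sock *sk)
{
	sk->errors_seen++;
}

/* Replacement callback installed by the socket user. */
static void lowcomms_error_report(struct sock *sk)
{
	/* Protocol-specific error handling would go here. */

	/*
	 * Call the saved callback directly instead of copying the
	 * pointer to the stack first. A NULL pointer here would be a
	 * bug in the setup code, so it is deliberately not checked.
	 */
	listen_sock.sk_error_report(sk);
}

/* Save the original callback, then install the replacement. */
static void install_callbacks(struct sock *sk)
{
	assert(sk->sk_error_report != NULL);
	listen_sock.sk_error_report = sk->sk_error_report;
	sk->sk_error_report = lowcomms_error_report;
}
```

After install_callbacks(), invoking sk->sk_error_report(sk) runs the
replacement and then the saved original.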

Signed-off-by: Alexander Aring 
Signed-off-by: David Teigland

fs: dlm: use sock2con without checking null

2022-11-21T15:45:49+00:00

This patch removes null checks on private data for sockets. If a null
dereference happens there, we have a bug in our implementation, because
such a callback must not occur in this state.

Signed-off-by: Alexander Aring 
Signed-off-by: David Teigland

fs: dlm: remove dlm_node_addrs lookup list

2022-11-21T15:45:49+00:00

This patch merges the dlm_node_addrs lookup list into the connection
structure. It is a per node mapping to some configuration set up by
configfs. We don't need two lookup structures. The connection hash now
has the same lifetime as the dlm_node_addrs entries. This means we add new
entries only when the cluster is configured and not when new connections
come in, remove a connection when its node got fenced, and clean up all
connections when dlm exits. It should work the same and will even expose
more issues, because we no longer try to somehow keep those two data
structures in sync with the current cluster configuration.

Signed-off-by: Alexander Aring 
Signed-off-by: David Teigland

fs: dlm: don't put dlm_local_addrs on heap

2022-11-21T15:45:49+00:00

This patch stops allocating the dlm_local_addr[] entries on the heap.
Instead we directly store elements of type "struct sockaddr_storage".
This removes the function deinit_local() because it was only freeing
memory.
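
The resulting layout can be sketched in userspace C. This is a minimal
model, not the lowcomms code: the array size of 3, the counter
dlm_local_count and the helper add_local_addr() are assumptions made for
illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>

/* Assumed upper bound on local addresses for this sketch. */
#define DLM_MAX_ADDR_COUNT 3

/*
 * The addresses are stored by value. Before such a change the array
 * would hold pointers to separately allocated entries, which then need
 * a matching cleanup function.
 */
static struct sockaddr_storage dlm_local_addr[DLM_MAX_ADDR_COUNT];
static int dlm_local_count;

/* Copy one address into the next free slot; returns -1 when full. */
static int add_local_addr(const struct sockaddr_storage *addr)
{
	assert(addr != NULL);
	if (dlm_local_count >= DLM_MAX_ADDR_COUNT)
		return -1;
	memcpy(&dlm_local_addr[dlm_local_count++], addr, sizeof(*addr));
	return 0;
}
```

Since nothing is allocated, there is no failure path for allocation and no
memory to free on shutdown; resetting dlm_local_count is enough.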

Signed-off-by: Alexander Aring 
Signed-off-by: David Teigland

fs: dlm: cleanup listen sock handling

2022-11-21T15:45:49+00:00

This patch removes save_listen_callbacks() and add_listen_sock() as they
are only used once in the lowcomms functionality. When shutting down
lowcomms it's not necessary to flush the whole workqueues to synchronize
with restoring the old sk_data_ready() callback. Only the listen con
receive work needs to be cancelled. For each individual node shutdown we
should be sure that the last ack has been transmitted, which is done by
flushing the per connection swork worker.

Signed-off-by: Alexander Aring 
Signed-off-by: David Teigland

fs: dlm: remove socket shutdown handling

2022-11-21T15:45:49+00:00

Since commit 489d8e559c65 ("fs: dlm: add reliable connection if
reconnect") we have functionality like what TCP offers for half-closed
sockets on the dlm application protocol layer. This feature is required
because cluster manager events about leaving resource memberships may
already have occurred locally while other cluster nodes still have a
pending leave membership in progress over the cluster manager protocol.
During this time the local dlm node has already shut down its connection
and doesn't transmit any new dlm messages, but it still needs to be able
to accept dlm messages because of the pending leave membership request of
the cluster manager protocol, over which the dlm kernel implementation has
no control.

We have this functionality on the application layer for two reasons. The
main reason is that SCTP does not support such functionality on the
socket layer, but we can do it inside the application layer.

Another small issue is that this feature is broken in the TCP world
because some NAT devices do not implement such functionality
correctly. This is the same reason why the reliable connection session
layer in DLM exists. We give up on middle devices in the network
which send out e.g. TCP resets. In DLM we cannot have any message
dropped, and we ensure over a session layer that it can't happen.

Back to the half-closed graceful shutdown handling. It's no longer
necessary to do it on the socket layer (where it is only supported for TCP
sockets) because we do it on the application layer. This patch removes
this handling. If there are still issues, then we have a problem in the
application layer handling.

Signed-off-by: Alexander Aring 
Signed-off-by: David Teigland