From 003f82b58c99146dfb0c9ce1ee7ed59bc572959b Mon Sep 17 00:00:00 2001
From: Jens Axboe
Date: Wed, 16 Oct 2024 15:49:49 -0600
Subject: io_uring/rw: get rid of using req->imu

It's assigned in the same function in which it's used, so get rid of it.
A local variable will do just fine.

Signed-off-by: Jens Axboe
---
 io_uring/rw.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

(limited to 'io_uring/rw.c')

diff --git a/io_uring/rw.c b/io_uring/rw.c
index 354c4e175654..d8b9e7a712f6 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -330,6 +330,7 @@ static int io_prep_rw_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe
 {
 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
 	struct io_ring_ctx *ctx = req->ctx;
+	struct io_mapped_ubuf *imu;
 	struct io_async_rw *io;
 	u16 index;
 	int ret;
@@ -341,11 +342,11 @@ static int io_prep_rw_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe
 	if (unlikely(req->buf_index >= ctx->nr_user_bufs))
 		return -EFAULT;
 	index = array_index_nospec(req->buf_index, ctx->nr_user_bufs);
-	req->imu = ctx->user_bufs[index];
+	imu = ctx->user_bufs[index];
 	io_req_set_rsrc_node(req, ctx, 0);

 	io = req->async_data;
-	ret = io_import_fixed(ddir, &io->iter, req->imu, rw->addr, rw->len);
+	ret = io_import_fixed(ddir, &io->iter, imu, rw->addr, rw->len);
 	iov_iter_save_state(&io->iter, &io->iter_state);
 	return ret;
 }
--
cgit v1.2.3


From 1caa00d6b61651e04c04c2b50b3e149f24c6764d Mon Sep 17 00:00:00 2001
From: Jens Axboe
Date: Wed, 23 Oct 2024 07:14:22 -0600
Subject: io_uring: remove 'issue_flags' argument for io_req_set_rsrc_node()

All callers already hold the ring lock and hence are passing '0', so
remove the argument and the conditional locking that it controlled.

Suggested-by: Pavel Begunkov
Signed-off-by: Jens Axboe
---
 io_uring/rw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'io_uring/rw.c')

diff --git a/io_uring/rw.c b/io_uring/rw.c
index d8b9e7a712f6..8080ffd6d571 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -343,7 +343,7 @@ static int io_prep_rw_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe
 		return -EFAULT;
 	index = array_index_nospec(req->buf_index, ctx->nr_user_bufs);
 	imu = ctx->user_bufs[index];
-	io_req_set_rsrc_node(req, ctx, 0);
+	io_req_set_rsrc_node(req, ctx);

 	io = req->async_data;
 	ret = io_import_fixed(ddir, &io->iter, imu, rw->addr, rw->len);
--
cgit v1.2.3


From 7029acd8a950393ee3a3d8e1a7ee1a9b77808a3b Mon Sep 17 00:00:00 2001
From: Jens Axboe
Date: Fri, 25 Oct 2024 19:27:39 -0600
Subject: io_uring/rsrc: get rid of per-ring io_rsrc_node list

Work in progress, but get rid of the per-ring serialization of resource
nodes, like registered buffers and files. The main issue here is that
one node can otherwise hold up a bunch of other nodes from getting
freed, which is especially a problem for file resource nodes and
networked workloads where some descriptors may not see activity in a
long time.

As an example, instantiate an io_uring ring fd and create a sparse
registered file table. Even 2 entries will do. Then create a socket and
register it as fixed file 0, F0. The number of open files in the app is
now 5, with 0/1/2 being the usual stdin/out/err, 3 being the ring fd,
and 4 being the socket. This socket (eg "the listener") now occupies
slot 0 of the registered file table. Now add an operation on the socket
that uses slot 0. Finally, loop N times, where each loop creates a new
socket, registers said socket as a file, then unregisters the socket,
and finally closes the socket.
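Sketched in liburing terms (illustrative only - it assumes a kernel that
accepts sparse (-1) entries in the file table; error handling and the
rest of the app are elided):

	#include <unistd.h>
	#include <sys/socket.h>
	#include <liburing.h>

	static void example(struct io_uring *ring, int loops)
	{
		int sparse[2] = { -1, -1 }, unreg = -1;
		int listener, sock;

		io_uring_register_files(ring, sparse, 2); /* sparse table */

		listener = socket(AF_INET, SOCK_STREAM, 0);
		/* slot 0: F0, "the listener" */
		io_uring_register_files_update(ring, 0, &listener, 1);
		/* ... queue an operation that uses fixed file slot 0 ... */

		while (loops--) {
			sock = socket(AF_INET, SOCK_STREAM, 0);
			io_uring_register_files_update(ring, 1, &sock, 1);
			io_uring_register_files_update(ring, 1, &unreg, 1);
			close(sock);
		}
	}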
This is roughly similar to what a basic accept loop would look like.

At the end of this loop, it's not unreasonable to expect that there
would still be 5 open files. Each socket created and registered in the
loop is also unregistered and closed. But since the listener socket
registered first is still active, it still holds references to its
resource node, and each subsequent socket unregistration is stuck
behind it for reclaim. Hence 5 + N files are still open at that point,
where N is awaiting the final put held up by the listener socket.

Rewrite the io_rsrc_node handling to NOT rely on serialization. Struct
io_kiocb now gets explicit resource nodes assigned, with each holding a
reference to the parent node. A parent node is either of type FILE or
BUFFER, which are the two types of nodes that exist. A request can have
two nodes assigned, if it's using both registered files and buffers.
Since request issue and task_work completion both happen under the ring
private lock, no atomics are needed to handle these references - it's a
simple unlocked inc/dec. As before, the registered buffer and file
tables each hold a reference to the registered nodes as well. The final
put of a node removes it and frees the underlying resource, eg unmaps
the buffer or puts the file.

Outside of removing the stall in resource reclaim described above, this
has the following advantages:

1) It's a lot simpler than the previous scheme, and easier to follow.
   No need for specific quiesce handling anymore.

2) There are no resource node allocations in the fast path, all of that
   happens at resource registration time.

3) The structs related to resource handling can all get simplified
   quite a bit, like io_rsrc_node and io_rsrc_data. io_rsrc_put can go
   away completely.

4) Handling of resource tags is much simpler, and doesn't require
   persistent storage as they can simply get assigned up front at
   registration time - just copy them in one by one and assign each to
   its resource node.

The only real downside is that a request is now explicitly limited to
pinning 2 resources, one file and one buffer, where before just
assigning a resource node to a request would pin all of them. The
upside is that it's easier to follow now, as an individual resource is
explicitly referenced and assigned to the request. With this in place,
the above mentioned example will be using exactly 5 files at the end of
the loop, not 5 + N.
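As a rough illustration, a minimal sketch of the reference scheme - the
type and helper names here are assumed for illustration, not the exact
kernel definitions:

	/* Sketch only: simplified node with an unlocked reference count. */
	struct io_rsrc_node {
		int type;	/* IORING_RSRC_FILE or IORING_RSRC_BUFFER */
		int refs;	/* plain int: only touched under the ring lock */
		unsigned long long tag;
	};

	/*
	 * Issue and task_work completion both run under the ring lock,
	 * so a non-atomic inc/dec is enough.
	 */
	static inline void io_rsrc_node_ref(struct io_rsrc_node *node)
	{
		node->refs++;
	}

	static inline void io_rsrc_node_unref(struct io_rsrc_node *node)
	{
		/* final put: unmap the buffer or put the file */
		if (!--node->refs)
			io_free_rsrc_node(node);	/* hypothetical helper */
	}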
Signed-off-by: Jens Axboe
---
 io_uring/rw.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

(limited to 'io_uring/rw.c')

diff --git a/io_uring/rw.c b/io_uring/rw.c
index 8080ffd6d571..65491f4f2c7e 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -330,7 +330,7 @@ static int io_prep_rw_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe
 {
 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
 	struct io_ring_ctx *ctx = req->ctx;
-	struct io_mapped_ubuf *imu;
+	struct io_rsrc_node *node;
 	struct io_async_rw *io;
 	u16 index;
 	int ret;
@@ -342,11 +342,11 @@ static int io_prep_rw_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe
 	if (unlikely(req->buf_index >= ctx->nr_user_bufs))
 		return -EFAULT;
 	index = array_index_nospec(req->buf_index, ctx->nr_user_bufs);
-	imu = ctx->user_bufs[index];
-	io_req_set_rsrc_node(req, ctx);
+	node = ctx->user_bufs[index];
+	io_req_assign_rsrc_node(req, node);

 	io = req->async_data;
-	ret = io_import_fixed(ddir, &io->iter, imu, rw->addr, rw->len);
+	ret = io_import_fixed(ddir, &io->iter, node->buf, rw->addr, rw->len);
 	iov_iter_save_state(&io->iter, &io->iter_state);
 	return ret;
 }
--
cgit v1.2.3


From 3597f2786b687a7f26361ce00a805ea0af41b65f Mon Sep 17 00:00:00 2001
From: Jens Axboe
Date: Sat, 26 Oct 2024 14:50:13 -0600
Subject: io_uring/rsrc: unify file and buffer resource tables

For files, there's nr_user_files/file_table/file_data, and buffers have
nr_user_bufs/user_bufs/buf_data. There's no reason why file_table and
file_data can't be the same thing, and ditto for the buffer side. That
gets rid of more io_ring_ctx state that's in two spots rather than just
being in one spot, as it should be.

Put all the registered file data in one location, and ditto on the
buffer front. This also avoids having both io_rsrc_data->nodes being an
allocated array, and ->user_bufs[] or ->file_table.nodes. There's no
reason to have this information duplicated. Keep it in one spot,
io_rsrc_data, along with how many resources are available.

Signed-off-by: Jens Axboe
---
 io_uring/rw.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

(limited to 'io_uring/rw.c')

diff --git a/io_uring/rw.c b/io_uring/rw.c
index 65491f4f2c7e..28fff18ebb19 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -339,10 +339,10 @@ static int io_prep_rw_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe
 	if (unlikely(ret))
 		return ret;

-	if (unlikely(req->buf_index >= ctx->nr_user_bufs))
+	if (unlikely(req->buf_index >= ctx->buf_table.nr))
 		return -EFAULT;
-	index = array_index_nospec(req->buf_index, ctx->nr_user_bufs);
-	node = ctx->user_bufs[index];
+	index = array_index_nospec(req->buf_index, ctx->buf_table.nr);
+	node = ctx->buf_table.nodes[index];
 	io_req_assign_rsrc_node(req, node);

 	io = req->async_data;
--
cgit v1.2.3


From b54a14041ee6444692d95ff38c8b3d1af682aa17 Mon Sep 17 00:00:00 2001
From: Jens Axboe
Date: Sun, 27 Oct 2024 09:08:31 -0600
Subject: io_uring/rsrc: add io_rsrc_node_lookup() helper

There are lots of spots open-coding this functionality; add a generic
helper that does the node lookup in a speculation-safe way.
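The helper itself lands outside this file, but a sketch of what such a
speculation-safe lookup looks like, inferred from the call site in the
hunk below (the struct io_rsrc_data layout is assumed):

	/* Sketch only - inferred from the call sites, not the exact code. */
	static inline struct io_rsrc_node *
	io_rsrc_node_lookup(struct io_rsrc_data *data, int index)
	{
		/* clamp the index under speculation before using it */
		if (index < data->nr)
			return data->nodes[array_index_nospec(index, data->nr)];
		return NULL;
	}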
Signed-off-by: Jens Axboe
---
 io_uring/rw.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

(limited to 'io_uring/rw.c')

diff --git a/io_uring/rw.c b/io_uring/rw.c
index 28fff18ebb19..30448f343c7f 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -332,17 +332,15 @@ static int io_prep_rw_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe
 	struct io_ring_ctx *ctx = req->ctx;
 	struct io_rsrc_node *node;
 	struct io_async_rw *io;
-	u16 index;
 	int ret;

 	ret = io_prep_rw(req, sqe, ddir, false);
 	if (unlikely(ret))
 		return ret;

-	if (unlikely(req->buf_index >= ctx->buf_table.nr))
+	node = io_rsrc_node_lookup(&ctx->buf_table, req->buf_index);
+	if (!node)
 		return -EFAULT;
-	index = array_index_nospec(req->buf_index, ctx->buf_table.nr);
-	node = ctx->buf_table.nodes[index];
 	io_req_assign_rsrc_node(req, node);

 	io = req->async_data;
--
cgit v1.2.3


From 01ee194d1aba1202f0926d5047a2a4cf84d0e45d Mon Sep 17 00:00:00 2001
From: hexue
Date: Fri, 1 Nov 2024 17:19:57 +0800
Subject: io_uring: add support for hybrid IOPOLL

A new hybrid poll is implemented on the io_uring layer. Once an IO is
issued, it will not poll immediately, but rather block first and re-run
before the IO completes, then poll to reap the IO. While this poll
method could be suboptimal when running on a single thread, its
performance sits below regular polling but above IRQ-driven completion,
and its CPU utilization is lower than regular polling as well.

To use hybrid polling, the ring must be setup with both the
IORING_SETUP_IOPOLL and IORING_SETUP_HYBRID_IOPOLL flags set. Hybrid
polling has the same restrictions as IOPOLL, in that commands must
explicitly support it.

Signed-off-by: hexue
Link: https://lore.kernel.org/r/20241101091957.564220-2-xue01.he@samsung.com
Signed-off-by: Jens Axboe
---
 io_uring/rw.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 81 insertions(+), 11 deletions(-)

(limited to 'io_uring/rw.c')

diff --git a/io_uring/rw.c b/io_uring/rw.c
index 30448f343c7f..1ea6be2ccc90 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -817,6 +817,11 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 		kiocb->ki_flags |= IOCB_HIPRI;
 		kiocb->ki_complete = io_complete_rw_iopoll;
 		req->iopoll_completed = 0;
+		if (ctx->flags & IORING_SETUP_HYBRID_IOPOLL) {
+			/* make sure every req only blocks once */
+			req->flags &= ~REQ_F_IOPOLL_STATE;
+			req->iopoll_start = ktime_get_ns();
+		}
 	} else {
 		if (kiocb->ki_flags & IOCB_HIPRI)
 			return -EINVAL;
@@ -1115,6 +1120,78 @@ void io_rw_fail(struct io_kiocb *req)
 	io_req_set_res(req, res, req->cqe.flags);
 }

+static int io_uring_classic_poll(struct io_kiocb *req, struct io_comp_batch *iob,
+				 unsigned int poll_flags)
+{
+	struct file *file = req->file;
+
+	if (req->opcode == IORING_OP_URING_CMD) {
+		struct io_uring_cmd *ioucmd;
+
+		ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+		return file->f_op->uring_cmd_iopoll(ioucmd, iob, poll_flags);
+	} else {
+		struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+
+		return file->f_op->iopoll(&rw->kiocb, iob, poll_flags);
+	}
+}
+
+static u64 io_hybrid_iopoll_delay(struct io_ring_ctx *ctx, struct io_kiocb *req)
+{
+	struct hrtimer_sleeper timer;
+	enum hrtimer_mode mode;
+	ktime_t kt;
+	u64 sleep_time;
+
+	if (req->flags & REQ_F_IOPOLL_STATE)
+		return 0;
+
+	if (ctx->hybrid_poll_time == LLONG_MAX)
+		return 0;
+
+	/* Using half the running time to do schedule */
+	sleep_time = ctx->hybrid_poll_time / 2;
+
+	kt = ktime_set(0, sleep_time);
+	req->flags |= REQ_F_IOPOLL_STATE;
+
+	mode = HRTIMER_MODE_REL;
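+	/*
+	 * Sleep out roughly the first half of the expected completion
+	 * time on a relative hrtimer before falling back to polling, so
+	 * CPU is only burned polling the tail end of the IO.
+	 */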
+	hrtimer_init_sleeper_on_stack(&timer, CLOCK_MONOTONIC, mode);
+	hrtimer_set_expires(&timer.timer, kt);
+	set_current_state(TASK_INTERRUPTIBLE);
+	hrtimer_sleeper_start_expires(&timer, mode);
+
+	if (timer.task)
+		io_schedule();
+
+	hrtimer_cancel(&timer.timer);
+	__set_current_state(TASK_RUNNING);
+	destroy_hrtimer_on_stack(&timer.timer);
+	return sleep_time;
+}
+
+static int io_uring_hybrid_poll(struct io_kiocb *req,
+				struct io_comp_batch *iob, unsigned int poll_flags)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+	u64 runtime, sleep_time;
+	int ret;
+
+	sleep_time = io_hybrid_iopoll_delay(ctx, req);
+	ret = io_uring_classic_poll(req, iob, poll_flags);
+	runtime = ktime_get_ns() - req->iopoll_start - sleep_time;
+
+	/*
+	 * Use minimum sleep time if we're polling devices with different
+	 * latencies. We could get more completions from the faster ones.
+	 */
+	if (ctx->hybrid_poll_time > runtime)
+		ctx->hybrid_poll_time = runtime;
+
+	return ret;
+}
+
 int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 {
 	struct io_wq_work_node *pos, *start, *prev;
@@ -1131,7 +1208,6 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 	wq_list_for_each(pos, start, &ctx->iopoll_list) {
 		struct io_kiocb *req = container_of(pos, struct io_kiocb,
 						    comp_list);
-		struct file *file = req->file;
 		int ret;

 		/*
@@ -1142,17 +1218,11 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 		if (READ_ONCE(req->iopoll_completed))
 			break;

-		if (req->opcode == IORING_OP_URING_CMD) {
-			struct io_uring_cmd *ioucmd;
-
-			ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
-			ret = file->f_op->uring_cmd_iopoll(ioucmd, &iob,
-							   poll_flags);
-		} else {
-			struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+		if (ctx->flags & IORING_SETUP_HYBRID_IOPOLL)
+			ret = io_uring_hybrid_poll(req, &iob, poll_flags);
+		else
+			ret = io_uring_classic_poll(req, &iob, poll_flags);

-			ret = file->f_op->iopoll(&rw->kiocb, &iob, poll_flags);
-		}
 		if (unlikely(ret < 0))
 			return ret;
 		else if (ret)
--
cgit v1.2.3


From 6f94cbc29adacc15007c5a16295052e674099282 Mon Sep 17 00:00:00 2001
From: Jens Axboe
Date: Sun, 3 Nov 2024 08:46:07 -0700
Subject: io_uring/rsrc: split io_kiocb node type assignments

Currently the io_rsrc_node assignment in io_kiocb is an array of two
pointers, as two nodes may be assigned to a request - one file node,
and one buffer node. However, the buffer node can co-exist with the
provided buffers, as it's currently not supported to use both provided
and registered buffers at the same time. Hence, split the single array
into a separate file node and buffer node, with the buffer node sharing
space with the provided buffer state. This crucially brings struct
io_kiocb down to 4 cache lines again, as before it spilled into the 5th
cacheline.
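A sketch of the resulting layout - member names are inferred from the
call sites below, not the exact kernel definition:

	/* Sketch only: split node assignments in the request. */
	struct io_kiocb_sketch {
		struct io_rsrc_node *file_node;	/* registered file, if any */
		union {
			/* provided and registered buffers are exclusive */
			struct io_buffer_list *buf_list; /* provided buffer */
			struct io_rsrc_node *buf_node;	 /* registered buffer */
		};
		/* ... rest of the request ... */
	};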
Signed-off-by: Jens Axboe
---
 io_uring/rw.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

(limited to 'io_uring/rw.c')

diff --git a/io_uring/rw.c b/io_uring/rw.c
index 1ea6be2ccc90..144730344c0f 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -341,7 +341,8 @@ static int io_prep_rw_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe
 	node = io_rsrc_node_lookup(&ctx->buf_table, req->buf_index);
 	if (!node)
 		return -EFAULT;
-	io_req_assign_rsrc_node(req, node);
+	io_req_assign_rsrc_node(&req->buf_node, node);
+	req->flags |= REQ_F_BUF_NODE;

 	io = req->async_data;
 	ret = io_import_fixed(ddir, &io->iter, node->buf, rw->addr, rw->len);
--
cgit v1.2.3


From b6f58a3f4aa8dba424356c7a69388a81f4459300 Mon Sep 17 00:00:00 2001
From: Jens Axboe
Date: Sun, 3 Nov 2024 10:23:38 -0700
Subject: io_uring: move struct io_kiocb from task_struct to io_uring_task

Rather than store the task_struct itself in struct io_kiocb, store the
io_uring-specific task structure. The lifetimes are the same in terms
of io_uring, and this avoids doing some dereferences through the
task_struct. For the hot path of putting local task references, we can
deref req->tctx instead, which we'll need anyway in that function
regardless of whether it's local or remote references.

This is mostly straightforward, except the original task PF_EXITING
check needs a bit of tweaking. task_work is _always_ run from the
originating task, except in the fallback case, where it's run from a
kernel thread. Replace the potentially racy (in case of fallback work)
checks for req->task->flags with current->flags. It's either still the
original task, in which case PF_EXITING will be sane, or it has
PF_KTHREAD set, in which case it's fallback work. Both cases should
prevent moving forward with the given request.

Signed-off-by: Jens Axboe
---
 io_uring/rw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'io_uring/rw.c')

diff --git a/io_uring/rw.c b/io_uring/rw.c
index 144730344c0f..e368b9afde03 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -435,7 +435,7 @@ static bool io_rw_should_reissue(struct io_kiocb *req)
 	 * Play it safe and assume not safe to re-import and reissue if we're
 	 * not in the original thread group (or in task context).
 	 */
-	if (!same_thread_group(req->task, current) || !in_task())
+	if (!same_thread_group(req->tctx->task, current) || !in_task())
 		return false;
 	return true;
 }
--
cgit v1.2.3


From 039c878db7add23c1c9ea18424c442cce76670f9 Mon Sep 17 00:00:00 2001
From: Ming Lei
Date: Thu, 7 Nov 2024 19:01:36 +0800
Subject: io_uring/rsrc: add & apply io_req_assign_buf_node()

The following pattern is showing up more and more:

+	io_req_assign_rsrc_node(&req->buf_node, node);
+	req->flags |= REQ_F_BUF_NODE;

so make it a helper, which is less fragile to use than the above code -
for example, the BUF_NODE flag is even missing in the current
io_uring_cmd_prep().
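The helper presumably wraps exactly that pattern; a sketch consistent
with the message above and the call site below:

	/*
	 * Assign a buffer node to the request and flag it, so the node
	 * reference is dropped when the request completes.
	 */
	static inline void io_req_assign_buf_node(struct io_kiocb *req,
						  struct io_rsrc_node *node)
	{
		io_req_assign_rsrc_node(&req->buf_node, node);
		req->flags |= REQ_F_BUF_NODE;
	}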
Signed-off-by: Ming Lei
Link: https://lore.kernel.org/r/20241107110149.890530-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe
---
 io_uring/rw.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

(limited to 'io_uring/rw.c')

diff --git a/io_uring/rw.c b/io_uring/rw.c
index e368b9afde03..b62cdb5fc936 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -341,8 +341,7 @@ static int io_prep_rw_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe
 	node = io_rsrc_node_lookup(&ctx->buf_table, req->buf_index);
 	if (!node)
 		return -EFAULT;
-	io_req_assign_rsrc_node(&req->buf_node, node);
-	req->flags |= REQ_F_BUF_NODE;
+	io_req_assign_buf_node(req, node);

 	io = req->async_data;
 	ret = io_import_fixed(ddir, &io->iter, node->buf, rw->addr, rw->len);
--
cgit v1.2.3
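Taken together, a minimal userspace sketch exercising both features
touched by this series - a registered (fixed) buffer read on a ring set
up for hybrid polling. This assumes a liburing and kernel new enough to
expose IORING_SETUP_HYBRID_IOPOLL; the device path is an example and
error handling is elided:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdlib.h>
	#include <liburing.h>

	int main(void)
	{
		struct io_uring ring;
		struct io_uring_sqe *sqe;
		struct io_uring_cqe *cqe;
		struct iovec iov;
		int fd;

		/* IOPOLL generally requires O_DIRECT */
		fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);

		/* hybrid: sleep first, then poll for the completion */
		io_uring_queue_init(8, &ring, IORING_SETUP_IOPOLL |
					      IORING_SETUP_HYBRID_IOPOLL);

		/* one registered buffer; index 0 is used by the fixed read */
		posix_memalign(&iov.iov_base, 4096, 4096);
		iov.iov_len = 4096;
		io_uring_register_buffers(&ring, &iov, 1);

		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_read_fixed(sqe, fd, iov.iov_base, 4096, 0, 0);

		io_uring_submit(&ring);
		io_uring_wait_cqe(&ring, &cqe);	/* drives the poll loop */
		io_uring_cqe_seen(&ring, cqe);
		io_uring_queue_exit(&ring);
		return 0;
	}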