From f0e1a0643a59bf1f922fa209cec86a170b784f3f Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Tue, 18 Jun 2024 10:09:17 -1000 Subject: sched_ext: Implement BPF extensible scheduler class MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Implement a new scheduler class sched_ext (SCX), which allows scheduling policies to be implemented as BPF programs to achieve the following: 1. Ease of experimentation and exploration: Enabling rapid iteration of new scheduling policies. 2. Customization: Building application-specific schedulers which implement policies that are not applicable to general-purpose schedulers. 3. Rapid scheduler deployments: Non-disruptive swap-outs of scheduling policies in production environments. sched_ext leverages BPF’s struct_ops feature to define a structure which exports function callbacks and flags to BPF programs that wish to implement scheduling policies. The struct_ops structure exported by sched_ext is struct sched_ext_ops, and is conceptually similar to struct sched_class. The role of sched_ext is to map the complex sched_class callbacks to the simpler and more ergonomic struct sched_ext_ops callbacks. For more detailed discussion on the motivations and overview, please refer to the cover letter. Later patches will also add several example schedulers and documentation. This patch implements the minimum core framework to enable implementation of BPF schedulers. Subsequent patches will gradually add functionality including safety guarantee mechanisms, nohz and cgroup support. include/linux/sched/ext.h defines struct sched_ext_ops. With the comment on top, each operation should be self-explanatory. The following are worth noting: - Both "sched_ext" and its shorthand "scx" are used. If the identifier already has "sched" in it, "ext" is used; otherwise, "scx". - In sched_ext_ops, only .name is mandatory. Every operation is optional and, if omitted, a simple but functional default behavior is provided.
- A new policy constant SCHED_EXT is added and a task can select sched_ext by invoking sched_setscheduler(2) with the new policy constant. However, if the BPF scheduler is not loaded, SCHED_EXT is the same as SCHED_NORMAL and the task is scheduled by CFS. When the BPF scheduler is loaded, all tasks which have the SCHED_EXT policy are switched to sched_ext. - To bridge the workflow imbalance between the scheduler core and sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch queues (dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and one local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for convenience and need not be used by a scheduler that doesn't require it. SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when putting the next task on the CPU. The BPF scheduler can manage an arbitrary number of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq(). - sched_ext guarantees system integrity no matter what the BPF scheduler does. To enable this, each task's ownership is tracked through p->scx.ops_state and all tasks are put on scx_tasks list. The disable path can always recover and revert all tasks back to CFS. See p->scx.ops_state and scx_tasks. - A task is not tied to its rq while enqueued. This decouples CPU selection from queueing and allows sharing a scheduling queue across an arbitrary subset of CPUs. This adds some complexities as a task may need to be bounced between rq's right before it starts executing. See dispatch_to_local_dsq() and move_task_to_local_dsq(). - One complication that arises from the above weak association between task and rq is that synchronizing with dequeue() gets complicated as dequeue() may happen anytime while the task is enqueued and the dispatch path might need to release the rq lock to transfer the task. Solving this requires a bit of complexity. See the logic around p->scx.sticky_cpu and p->scx.ops_qseq. - Both enable and disable paths are a bit complicated. 
The enable path switches all tasks without blocking to avoid issues which can arise from partially switched states (e.g. the switching task itself being starved). The disable path can't trust the BPF scheduler at all, so it also has to guarantee forward progress without blocking. See scx_ops_enable() and scx_ops_disable_workfn(). - When sched_ext is disabled, static_branches are used to shut down the entry points from hot paths. v7: - scx_ops_bypass() was incorrectly and unnecessarily trying to grab scx_ops_enable_mutex which can lead to deadlocks in the disable path. Fixed. - Fixed TASK_DEAD handling bug in scx_ops_enable() path which could lead to use-after-free. - Consolidated per-cpu variable usages and other cleanups. v6: - SCX_NR_ONLINE_OPS replaced with SCX_OPI_*_BEGIN/END so that multiple groups can be expressed. Later CPU hotplug operations are put into their own group. - SCX_OPS_DISABLING state is replaced with the new bypass mechanism which allows temporarily putting the system into simple FIFO scheduling mode bypassing the BPF scheduler. In addition to the shut down path, this will also be used to isolate the BPF scheduler across PM events. Enabling and disabling the bypass mode requires iterating all runnable tasks. rq->scx.runnable_list addition is moved from the later watchdog patch. - ops.prep_enable() is replaced with ops.init_task() and ops.enable/disable() are now called whenever the task enters and leaves sched_ext instead of when the task becomes schedulable on sched_ext and stops being so. A new operation - ops.exit_task() - is called when the task stops being schedulable on sched_ext. - scx_bpf_dispatch() can now be called from ops.select_cpu() too. This removes the need for communicating local dispatch decision made by ops.select_cpu() to ops.enqueue() via per-task storage. SCX_KF_SELECT_CPU is added to support the change. 
- SCX_TASK_ENQ_LOCAL which told the BPF scheduler that scx_select_cpu_dfl() wants the task to be dispatched to the local DSQ was removed. Instead, scx_bpf_select_cpu_dfl() now dispatches directly if it finds a suitable idle CPU. If such behavior is not desired, users can use scx_bpf_select_cpu_dfl() which returns the verdict in a bool out param. - scx_select_cpu_dfl() was mishandling WAKE_SYNC and could end up queueing many tasks on a local DSQ, which made tasks execute in order while other CPUs stayed idle, making some hackbench numbers really bad. Fixed. - The current state of sched_ext can now be monitored through files under /sys/sched_ext instead of /sys/kernel/debug/sched/ext. This is to enable monitoring on kernels which don't enable debugfs. - sched_ext wasn't telling BPF that ops.dispatch()'s @prev argument may be NULL and a BPF scheduler which derefs the pointer without checking could crash the kernel. Tell BPF. This is currently a bit ugly. A better way to annotate this is expected in the future. - scx_exit_info updated to carry pointers to message buffers instead of embedding them directly. This decouples buffer sizes from API so that they can be changed without breaking compatibility. - exit_code added to scx_exit_info. This is used to indicate different exit conditions on non-error exits and will be used to handle e.g. CPU hotplugs. - The patch "sched_ext: Allow BPF schedulers to switch all eligible tasks into sched_ext" is folded in and the interface is changed so that partial switching is indicated with a new ops flag %SCX_OPS_SWITCH_PARTIAL. This makes scx_bpf_switch_all(), and in turn SCX_KF_INIT, unnecessary. ops.init() is now called with SCX_KF_SLEEPABLE. - Code reorganized so that only the parts necessary to integrate with the rest of the kernel are in the header files. - Changes to reflect the BPF and other kernel changes including the addition of bpf_sched_ext_ops.cfi_stubs.
v5: - To accommodate 32-bit configs, p->scx.ops_state is now atomic_long_t instead of atomic64_t and scx_dsp_buf_ent.qseq which uses load_acquire/store_release is now unsigned long instead of u64. - Fix the bug where bpf_scx_btf_struct_access() was allowing write access to arbitrary fields. - Distinguish between kfuncs which can be called from any sched_ext ops and those which can be called from anywhere. E.g. scx_bpf_pick_idle_cpu() can now be called only from sched_ext ops. - Rename "type" to "kind" in scx_exit_info to make it easier to use in languages in which "type" is a reserved keyword. - Since cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup"), PF_IDLE is not set on idle tasks which haven't been online yet, which made scx_task_iter_next_filtered() include those idle tasks in iterations, leading to oopses. Update scx_task_iter_next_filtered() to directly test p->sched_class against idle_sched_class instead of using is_idle_task() which tests PF_IDLE. - Other updates to match upstream changes such as adding const to set_cpumask() param and renaming check_preempt_curr() to wakeup_preempt(). v4: - SCHED_CHANGE_BLOCK replaced with the previous sched_deq_and_put_task()/sched_enq_and_set_task() pair. This is because upstream is adopting a different generic cleanup mechanism. Once that lands, the code will be adapted accordingly. - task_on_scx() used to test whether a task should be switched into SCX, which is confusing. Renamed to task_should_scx(). task_on_scx() now tests whether a task is currently on SCX. - scx_has_idle_cpus is barely used anymore and is replaced with a direct check on the idle cpumask. - SCX_PICK_IDLE_CORE added and scx_pick_idle_cpu() improved to prefer fully idle cores. - ops.enable() now sees up-to-date p->scx.weight value. - ttwu_queue path is disabled for tasks on SCX to avoid confusing BPF schedulers expecting a ->select_cpu() call. - Use cpu_smt_mask() instead of topology_sibling_cpumask() like the rest of the scheduler.
v3: - ops.set_weight() added to allow BPF schedulers to track weight changes without polling p->scx.weight. - move_task_to_local_dsq() was losing SCX-specific enq_flags when enqueueing the task on the target dsq because it goes through activate_task() which loses the upper 32 bits of the flags. Carry the flags through rq->scx.extra_enq_flags. - scx_bpf_dispatch(), scx_bpf_pick_idle_cpu(), scx_bpf_task_running() and scx_bpf_task_cpu() now use the new KF_RCU instead of KF_TRUSTED_ARGS to make it easier for BPF schedulers to call them. - The kfunc helper access control mechanism implemented through sched_ext_entity.kf_mask is improved. Now SCX_CALL_OP*() is always used when invoking scx_ops operations. v2: - balance_scx_on_up() is dropped. Instead, on UP, balance_scx() is called from put_prev_task_scx() and pick_next_task_scx() as necessary. To determine whether balance_scx() should be called from put_prev_task_scx(), the SCX_TASK_DEQD_FOR_SLEEP flag is added. See the comment in put_prev_task_scx() for details. - sched_deq_and_put_task() / sched_enq_and_set_task() sequences replaced with SCHED_CHANGE_BLOCK(). - Unused all_dsqs list removed. This was a leftover from previous iterations. - p->scx.kf_mask is added to track and enforce which kfunc helpers are allowed. Also, init/exit sequences are updated to make some kfuncs always safe to call regardless of the current BPF scheduler state. Combined, this should make all the kfuncs safe. - BPF now supports sleepable struct_ops operations. Hacky workaround removed and operations and kfunc helpers are tagged appropriately. - BPF now supports bitmask / cpumask helpers. scx_bpf_get_idle_cpumask() and friends are added so that BPF schedulers can use the idle masks with the generic helpers. This replaces the hacky kfunc helpers added by a separate patch in V1. - CONFIG_SCHED_CLASS_EXT can no longer be enabled if SCHED_CORE is enabled. This restriction will be removed by a later patch which adds core-sched support.
- Add MAINTAINERS entries and other misc changes. Signed-off-by: Tejun Heo Co-authored-by: David Vernet Acked-by: Josh Don Acked-by: Hao Luo Acked-by: Barret Rhoden Cc: Andrea Righi --- include/uapi/linux/sched.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h index 3bac0a8ceab2..359a14cc76a4 100644 --- a/include/uapi/linux/sched.h +++ b/include/uapi/linux/sched.h @@ -118,6 +118,7 @@ struct clone_args { /* SCHED_ISO: reserved but not implemented yet */ #define SCHED_IDLE 5 #define SCHED_DEADLINE 6 +#define SCHED_EXT 7 /* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */ #define SCHED_RESET_ON_FORK 0x40000000 -- cgit v1.2.3 From 7543ae2269a855683a39af57048035f44bc8ef9c Mon Sep 17 00:00:00 2001 From: Wouter Verhelst Date: Thu, 25 Jul 2024 18:45:36 +0200 Subject: nbd: add support for rotational devices The NBD protocol defines the flag NBD_FLAG_ROTATIONAL to flag that the export in use should be treated as a rotational device. Add support for that flag to the kernel driver. 
Signed-off-by: Wouter Verhelst Reviewed-by: Eric Blake Link: https://lore.kernel.org/r/20240725164536.1275851-1-w@uter.be Signed-off-by: Jens Axboe --- include/uapi/linux/nbd.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/nbd.h b/include/uapi/linux/nbd.h index 80ce0ef43afd..d75215f2c675 100644 --- a/include/uapi/linux/nbd.h +++ b/include/uapi/linux/nbd.h @@ -51,8 +51,9 @@ enum { #define NBD_FLAG_READ_ONLY (1 << 1) /* device is read-only */ #define NBD_FLAG_SEND_FLUSH (1 << 2) /* can flush writeback cache */ #define NBD_FLAG_SEND_FUA (1 << 3) /* send FUA (forced unit access) */ -/* there is a gap here to match userspace */ +#define NBD_FLAG_ROTATIONAL (1 << 4) /* device is rotational */ #define NBD_FLAG_SEND_TRIM (1 << 5) /* send trim/discard */ +/* there is a gap here to match userspace */ #define NBD_FLAG_CAN_MULTI_CONN (1 << 8) /* Server supports multiple connections per export. */ /* values for cmd flags in the upper 16 bits of request type */ -- cgit v1.2.3 From f58872f45c36ded048bccc22701b0986019c24d8 Mon Sep 17 00:00:00 2001 From: Marcelo Schmitt Date: Fri, 12 Jul 2024 16:20:36 -0300 Subject: spi: Enable controllers to extend the SPI protocol with MOSI idle configuration The behavior of an SPI controller data output line (SDO or MOSI or COPI (Controller Output Peripheral Input) for disambiguation) is usually not specified when the controller is not clocking out data on SCLK edges. However, there do exist SPI peripherals that require a specific MOSI line state when data is not being clocked out of the controller. Conventional SPI controllers may set the MOSI line on SCLK edges and then bring it low when no data is going out, or leave the line in the state of the last transfer bit. More elaborate controllers are capable of setting the MOSI idle state according to different configurable levels and thus are more suitable for interfacing with demanding peripherals.
Add SPI mode bits to allow peripherals to request explicit MOSI idle state when needed. When supporting a particular MOSI idle configuration, the data output line state is expected to remain at the configured level when the controller is not clocking out data. When a device that needs a specific MOSI idle state is identified, its driver should request the MOSI idle configuration by setting the proper SPI mode bit. Acked-by: Nuno Sa Reviewed-by: Jonathan Cameron Reviewed-by: David Lechner Tested-by: David Lechner Signed-off-by: Marcelo Schmitt Link: https://patch.msgid.link/9802160b5e5baed7f83ee43ac819cb757a19be55.1720810545.git.marcelo.schmitt@analog.com Signed-off-by: Mark Brown --- include/uapi/linux/spi/spi.h | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/spi/spi.h b/include/uapi/linux/spi/spi.h index ca56e477d161..ee4ac812b8f8 100644 --- a/include/uapi/linux/spi/spi.h +++ b/include/uapi/linux/spi/spi.h @@ -28,7 +28,8 @@ #define SPI_RX_OCTAL _BITUL(14) /* receive with 8 wires */ #define SPI_3WIRE_HIZ _BITUL(15) /* high impedance turnaround */ #define SPI_RX_CPHA_FLIP _BITUL(16) /* flip CPHA on Rx only xfer */ -#define SPI_MOSI_IDLE_LOW _BITUL(17) /* leave mosi line low when idle */ +#define SPI_MOSI_IDLE_LOW _BITUL(17) /* leave MOSI line low when idle */ +#define SPI_MOSI_IDLE_HIGH _BITUL(18) /* leave MOSI line high when idle */ /* * All the bits defined above should be covered by SPI_MODE_USER_MASK. @@ -38,6 +39,6 @@ * These bits must not overlap. A static assert check should make sure of that. * If adding extra bits, make sure to increase the bit index below as well. 
*/ -#define SPI_MODE_USER_MASK (_BITUL(18) - 1) +#define SPI_MODE_USER_MASK (_BITUL(19) - 1) #endif /* _UAPI_SPI_H */ -- cgit v1.2.3 From ba386777a30b38dabcc7fb8a89ec2869a09915f7 Mon Sep 17 00:00:00 2001 From: Vignesh Balasubramanian Date: Thu, 25 Jul 2024 21:40:18 +0530 Subject: x86/elf: Add a new FPU buffer layout info to x86 core files Add a new .note section containing type, size, offset and flags of every xfeature that is present. This information will be used by debuggers to understand the XSAVE layout of the machine where the core file has been dumped, and to read XSAVE registers, especially during cross-platform debugging. The XSAVE layouts of modern AMD and Intel CPUs differ, especially since Memory Protection Keys and the AVX-512 features have been incorporated into the AMD CPUs. Since AMD never adopted (and hence never left room in the XSAVE layout for) the Intel MPX feature, tools like GDB had assumed a fixed XSAVE layout matching that of Intel (based on the XCR0 mask). Hence, core dumps from AMD CPUs didn't match the known size for the XCR0 mask. This resulted in GDB and other tools not being able to access the values of the AVX-512 and PKRU registers on AMD CPUs. To solve this, an interim solution has been accepted into GDB, and is already part of GDB 14; see https://sourceware.org/pipermail/gdb-patches/2023-March/198081.html. But it depends on heuristics based on the total XSAVE register set size and the XCR0 mask to infer the layouts of the various register blocks for core dumps, and hence is not a foolproof mechanism to determine the layout of the XSAVE area. Therefore, add a new core dump note in order to allow GDB/LLDB and other relevant tools to determine the layout of the XSAVE area of the machine where the corefile was dumped. The new core dump note (which is being proposed as a per-process .note section), NT_X86_XSAVE_LAYOUT (0x205) contains an array of structures.
Each structure describes an individual extended feature containing offset, size and flags in this format: struct x86_xfeat_component { u32 type; u32 size; u32 offset; u32 flags; }; and in an independent manner, allowing for future extensions without depending on hw arch specifics like CPUID etc. [ bp: Massage commit message, zap trailing whitespace. ] Co-developed-by: Jini Susan George Signed-off-by: Jini Susan George Co-developed-by: Borislav Petkov (AMD) Signed-off-by: Borislav Petkov (AMD) Signed-off-by: Vignesh Balasubramanian Link: https://lore.kernel.org/r/20240725161017.112111-2-vigbalas@amd.com --- include/uapi/linux/elf.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h index b54b313bcf07..e30a9b47dc87 100644 --- a/include/uapi/linux/elf.h +++ b/include/uapi/linux/elf.h @@ -411,6 +411,7 @@ typedef struct elf64_shdr { #define NT_X86_XSTATE 0x202 /* x86 extended state using xsave */ /* Old binutils treats 0x203 as a CET state */ #define NT_X86_SHSTK 0x204 /* x86 SHSTK state */ +#define NT_X86_XSAVE_LAYOUT 0x205 /* XSAVE layout description */ #define NT_S390_HIGH_GPRS 0x300 /* s390 upper register halves */ #define NT_S390_TIMER 0x301 /* s390 timer register */ #define NT_S390_TODCMP 0x302 /* s390 TOD clock comparator register */ -- cgit v1.2.3 From d579b04a52a183db47dfcb7a44304d7747d551e1 Mon Sep 17 00:00:00 2001 From: Yu-Ting Tseng Date: Tue, 9 Jul 2024 00:00:47 -0700 Subject: binder: frozen notification Frozen processes present a significant challenge in binder transactions. When a process is frozen, it cannot, by design, accept and/or respond to binder transactions. As a result, the sender needs to adjust its behavior, such as postponing transactions until the peer process unfreezes. However, there is currently no way to subscribe to these state change events, making it impossible to implement frozen-aware behaviors efficiently. 
Introduce a binder API for subscribing to frozen state change events. This allows programs to react to changes in peer process state, mitigating issues related to binder transactions sent to frozen processes. Implementation details: For a given binder_ref, the state of the frozen notification can be one of the following: 1. Userspace doesn't want a notification. binder_ref->freeze is null. 2. Userspace wants a notification but none is in flight. list_empty(&binder_ref->freeze->work.entry) = true 3. A notification is in flight and waiting to be read by userspace. binder_ref_freeze.sent is false. 4. A notification was read by userspace and the kernel is waiting for an ack. binder_ref_freeze.sent is true. When a notification is in flight, new state change events are coalesced into the existing binder_ref_freeze struct. If userspace hasn't picked up the notification yet, the driver simply rewrites the state. Otherwise, the notification is flagged as requiring a resend, which will be performed once userspace acks the original notification that's in flight. See https://r.android.com/3070045 for how userspace is going to use this feature.
Signed-off-by: Yu-Ting Tseng Acked-by: Carlos Llamas Link: https://lore.kernel.org/r/20240709070047.4055369-4-yutingtseng@google.com Signed-off-by: Greg Kroah-Hartman --- include/uapi/linux/android/binder.h | 36 ++++++++++++++++++++++++++++++++++++ 1 file changed, 36 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/android/binder.h b/include/uapi/linux/android/binder.h index d44a8118b2ed..1fd92021a573 100644 --- a/include/uapi/linux/android/binder.h +++ b/include/uapi/linux/android/binder.h @@ -236,6 +236,12 @@ struct binder_frozen_status_info { __u32 async_recv; }; +struct binder_frozen_state_info { + binder_uintptr_t cookie; + __u32 is_frozen; + __u32 reserved; +}; + /* struct binder_extened_error - extended error information * @id: identifier for the failed operation * @command: command as defined by binder_driver_return_protocol @@ -467,6 +473,17 @@ enum binder_driver_return_protocol { /* * The target of the last async transaction is frozen. No parameters. */ + + BR_FROZEN_BINDER = _IOR('r', 21, struct binder_frozen_state_info), + /* + * The cookie and a boolean (is_frozen) that indicates whether the process + * transitioned into a frozen or an unfrozen state. + */ + + BR_CLEAR_FREEZE_NOTIFICATION_DONE = _IOR('r', 22, binder_uintptr_t), + /* + * void *: cookie + */ }; enum binder_driver_command_protocol { @@ -550,6 +567,25 @@ enum binder_driver_command_protocol { /* * binder_transaction_data_sg: the sent command. 
*/ + + BC_REQUEST_FREEZE_NOTIFICATION = + _IOW('c', 19, struct binder_handle_cookie), + /* + * int: handle + * void *: cookie + */ + + BC_CLEAR_FREEZE_NOTIFICATION = _IOW('c', 20, + struct binder_handle_cookie), + /* + * int: handle + * void *: cookie + */ + + BC_FREEZE_NOTIFICATION_DONE = _IOW('c', 21, binder_uintptr_t), + /* + * void *: cookie + */ }; #endif /* _UAPI_LINUX_BINDER_H */ -- cgit v1.2.3 From 613f21505b25a4f43f33de00f11afc059bedde2b Mon Sep 17 00:00:00 2001 From: Hans Verkuil Date: Thu, 4 Jul 2024 11:01:51 +0200 Subject: media: cec: core: add new CEC_MSG_FL_REPLY_VENDOR_ID flag If this flag is set, then the reply is expected to consist of the CEC_MSG_VENDOR_COMMAND_WITH_ID opcode followed by the Vendor ID (as used in bytes 1-4 of the message), followed by the struct cec_msg reply field. Note that this assumes that the byte after the Vendor ID is a vendor-specific opcode. This flag makes it easier to wait for replies to vendor commands, using the same CEC framework support for waiting for regular replies. Support for this flag is indicated by setting the new CEC_CAP_REPLY_VENDOR_ID capability. 
Signed-off-by: Hans Verkuil Signed-off-by: Mauro Carvalho Chehab --- include/uapi/linux/cec.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/cec.h b/include/uapi/linux/cec.h index b8e071abaea5..894fffc66f2c 100644 --- a/include/uapi/linux/cec.h +++ b/include/uapi/linux/cec.h @@ -165,6 +165,7 @@ static inline int cec_msg_recv_is_rx_result(const struct cec_msg *msg) /* cec_msg flags field */ #define CEC_MSG_FL_REPLY_TO_FOLLOWERS (1 << 0) #define CEC_MSG_FL_RAW (1 << 1) +#define CEC_MSG_FL_REPLY_VENDOR_ID (1 << 2) /* cec_msg tx/rx_status field */ #define CEC_TX_STATUS_OK (1 << 0) @@ -339,6 +340,8 @@ static inline int cec_is_unconfigured(__u16 log_addr_mask) #define CEC_CAP_MONITOR_PIN (1 << 7) /* CEC_ADAP_G_CONNECTOR_INFO is available */ #define CEC_CAP_CONNECTOR_INFO (1 << 8) +/* CEC_MSG_FL_REPLY_VENDOR_ID is available */ +#define CEC_CAP_REPLY_VENDOR_ID (1 << 9) /** * struct cec_caps - CEC capabilities structure. -- cgit v1.2.3 From 0079c9d1e58a39148e6ce13bda55307ea6aa3a9e Mon Sep 17 00:00:00 2001 From: Takashi Iwai Date: Tue, 6 Aug 2024 09:00:23 +0200 Subject: ALSA: ump: Handle MIDI 1.0 Function Block in MIDI 2.0 protocol The UMP v1.1 spec says in section 6.2.1: "If a UMP Endpoint declares MIDI 2.0 Protocol but a Function Block represents a MIDI 1.0 connection, then may optionally be used for messages to/from that Function Block." It implies that the driver can (and should) keep MIDI 1.0 CVM exceptionally for those FBs even if the UMP Endpoint is running in MIDI 2.0 protocol, and the current driver lacks this. This patch extends the sequencer port info to indicate a MIDI 1.0 port, and tries to send/receive MIDI 1.0 CVM as is when this port is the source or sink. The sequencer port flag is set by the driver when parsing FBs and GTBs, although applications can also set it for their own user-space clients.
Link: https://patch.msgid.link/20240806070024.14301-1-tiwai@suse.de Signed-off-by: Takashi Iwai --- include/uapi/sound/asequencer.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/sound/asequencer.h b/include/uapi/sound/asequencer.h index 39b37edcf813..bc30c1f2a109 100644 --- a/include/uapi/sound/asequencer.h +++ b/include/uapi/sound/asequencer.h @@ -461,6 +461,8 @@ struct snd_seq_remove_events { #define SNDRV_SEQ_PORT_FLG_TIMESTAMP (1<<1) #define SNDRV_SEQ_PORT_FLG_TIME_REAL (1<<2) +#define SNDRV_SEQ_PORT_FLG_IS_MIDI1 (1<<3) /* Keep MIDI 1.0 protocol */ + /* port direction */ #define SNDRV_SEQ_PORT_DIR_UNKNOWN 0 #define SNDRV_SEQ_PORT_DIR_INPUT 1 -- cgit v1.2.3 From 599f6899051cb70c4e0aa9fd591b9ee220cb6f14 Mon Sep 17 00:00:00 2001 From: Hans Verkuil Date: Wed, 7 Aug 2024 09:22:10 +0200 Subject: media: uapi/linux/cec.h: cec_msg_set_reply_to: zero flags The cec_msg_set_reply_to() helper function never zeroed the struct cec_msg flags field; this can cause unexpected behavior if flags was uninitialized to begin with. Signed-off-by: Hans Verkuil Fixes: 0dbacebede1e ("[media] cec: move the CEC framework out of staging and to media") Cc: Signed-off-by: Mauro Carvalho Chehab --- include/uapi/linux/cec.h | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/cec.h b/include/uapi/linux/cec.h index 894fffc66f2c..b2af1dddd4d7 100644 --- a/include/uapi/linux/cec.h +++ b/include/uapi/linux/cec.h @@ -132,6 +132,8 @@ static inline void cec_msg_init(struct cec_msg *msg, * Set the msg destination to the orig initiator and the msg initiator to the * orig destination. Note that msg and orig may be the same pointer, in which * case the change is done in place. + * + * It also zeroes the reply, timeout and flags fields.
*/ static inline void cec_msg_set_reply_to(struct cec_msg *msg, struct cec_msg *orig) @@ -139,7 +141,9 @@ static inline void cec_msg_set_reply_to(struct cec_msg *msg, /* The destination becomes the initiator and vice versa */ msg->msg[0] = (cec_msg_destination(orig) << 4) | cec_msg_initiator(orig); - msg->reply = msg->timeout = 0; + msg->reply = 0; + msg->timeout = 0; + msg->flags = 0; } /** -- cgit v1.2.3 From 3882dccf48f9fbe787b5df3187d708ef348ac860 Mon Sep 17 00:00:00 2001 From: Alan Maguire Date: Thu, 8 Aug 2024 16:05:57 +0100 Subject: bpf/bpf_get,set_sockopt: add option to set TCP-BPF sock ops flags Currently the only opportunity to set sock ops flags dictating which callbacks fire for a socket is from within a TCP-BPF sockops program. This is problematic if the connection is already set up as there is no further chance to specify callbacks for that socket. Add TCP_BPF_SOCK_OPS_CB_FLAGS to bpf_setsockopt() and bpf_getsockopt() to allow users to specify callbacks later, either via an iterator over sockets or via a socket-specific program triggered by a setsockopt() on the socket. Previous discussion on this here [1]. [1] https://lore.kernel.org/bpf/f42f157b-6e52-dd4d-3d97-9b86c84c0b00@oracle.com/ Signed-off-by: Alan Maguire Link: https://lore.kernel.org/r/20240808150558.1035626-2-alan.maguire@oracle.com Signed-off-by: Martin KaFai Lau --- include/uapi/linux/bpf.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 35bcf52dbc65..e05b39e39c3f 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2851,7 +2851,7 @@ union bpf_attr { * **TCP_SYNCNT**, **TCP_USER_TIMEOUT**, **TCP_NOTSENT_LOWAT**, * **TCP_NODELAY**, **TCP_MAXSEG**, **TCP_WINDOW_CLAMP**, * **TCP_THIN_LINEAR_TIMEOUTS**, **TCP_BPF_DELACK_MAX**, - * **TCP_BPF_RTO_MIN**. + * **TCP_BPF_RTO_MIN**, **TCP_BPF_SOCK_OPS_CB_FLAGS**. * * **IPPROTO_IP**, which supports *optname* **IP_TOS**. 
* * **IPPROTO_IPV6**, which supports the following *optname*\ s: * **IPV6_TCLASS**, **IPV6_AUTOFLOWLABEL**. @@ -7080,6 +7080,7 @@ enum { TCP_BPF_SYN = 1005, /* Copy the TCP header */ TCP_BPF_SYN_IP = 1006, /* Copy the IP[46] and TCP header */ TCP_BPF_SYN_MAC = 1007, /* Copy the MAC, IP[46], and TCP header */ + TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */ }; enum { -- cgit v1.2.3 From a1d220d9dafa8d76ba60a784a1016c3134e6a1e8 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Fri, 19 Jul 2024 13:41:52 +0200 Subject: nsfs: iterate through mount namespaces It is already possible to list mounts in other mount namespaces and to retrieve namespace file descriptors without having to go through procfs by deriving them from pidfds. Augment these abilities by adding the ability to retrieve information about a mount namespace via NS_MNT_GET_INFO. This will return the mount namespace id and the number of mounts currently in the mount namespace. The number of mounts can be used to size the buffer that needs to be used for listmount() and is in general useful without having to actually iterate through all the mounts. The structure is extensible. Also add the ability to iterate through all mount namespaces over which the caller holds privilege, returning the file descriptor for the next or previous mount namespace. To retrieve a mount namespace the caller must be privileged with respect to its owning user namespace. This means that PID 1 on the host can list all mounts in all mount namespaces or that a container can list all mounts of its nested containers. Optionally pass a structure for NS_MNT_GET_INFO with NS_MNT_GET_{PREV,NEXT} to retrieve information about the mount namespace in one go. Both ioctls can be implemented for other namespace types easily. Together with recent API additions this means one can iterate through all mounts in all mount namespaces without ever touching procfs.
Link: https://lore.kernel.org/r/20240719-work-mount-namespace-v1-5-834113cab0d2@kernel.org Reviewed-by: Josef Bacik Reviewed-by: Jeff Layton Signed-off-by: Christian Brauner --- include/uapi/linux/nsfs.h | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/nsfs.h b/include/uapi/linux/nsfs.h index b133211331f6..cb31af3c1840 100644 --- a/include/uapi/linux/nsfs.h +++ b/include/uapi/linux/nsfs.h @@ -3,6 +3,7 @@ #define __LINUX_NSFS_H #include +#include #define NSIO 0xb7 @@ -26,4 +27,19 @@ /* Return thread-group leader id of pid in the target pid namespace. */ #define NS_GET_TGID_IN_PIDNS _IOR(NSIO, 0x9, int) +struct mnt_ns_info { + __u32 size; + __u32 nr_mounts; + __u64 mnt_ns_id; +}; + +#define MNT_NS_INFO_SIZE_VER0 16 /* size of first published struct */ + +/* Get information about namespace. */ +#define NS_MNT_GET_INFO _IOR(NSIO, 10, struct mnt_ns_info) +/* Get next namespace. */ +#define NS_MNT_GET_NEXT _IOR(NSIO, 11, struct mnt_ns_info) +/* Get previous namespace. */ +#define NS_MNT_GET_PREV _IOR(NSIO, 12, struct mnt_ns_info) + #endif /* __LINUX_NSFS_H */ -- cgit v1.2.3 From de8f847a5114ff7cfcdfc114af8485c431dec703 Mon Sep 17 00:00:00 2001 From: Yishai Hadas Date: Thu, 1 Aug 2024 15:05:16 +0300 Subject: RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with a Data-direct device. When userspace registers a DMABUF MR with the data direct bit set, the following algorithm is used. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache.
The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky --- include/uapi/rdma/mlx5_user_ioctl_cmds.h | 4 ++++ include/uapi/rdma/mlx5_user_ioctl_verbs.h | 4 ++++ 2 files changed, 8 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/rdma/mlx5_user_ioctl_cmds.h b/include/uapi/rdma/mlx5_user_ioctl_cmds.h index 5b74d6534899..106276a4cce7 100644 --- a/include/uapi/rdma/mlx5_user_ioctl_cmds.h +++ b/include/uapi/rdma/mlx5_user_ioctl_cmds.h @@ -274,6 +274,10 @@ enum mlx5_ib_create_cq_attrs { MLX5_IB_ATTR_CREATE_CQ_UAR_INDEX = UVERBS_ID_DRIVER_NS_WITH_UHW, }; +enum mlx5_ib_reg_dmabuf_mr_attrs { + MLX5_IB_ATTR_REG_DMABUF_MR_ACCESS_FLAGS = (1U << UVERBS_ID_NS_SHIFT), +}; + #define MLX5_IB_DW_MATCH_PARAM 0xA0 struct mlx5_ib_match_params { diff --git a/include/uapi/rdma/mlx5_user_ioctl_verbs.h b/include/uapi/rdma/mlx5_user_ioctl_verbs.h index 3189c7f08d17..7c233df475e7 100644 --- a/include/uapi/rdma/mlx5_user_ioctl_verbs.h +++ b/include/uapi/rdma/mlx5_user_ioctl_verbs.h @@ -54,6 +54,10 @@ enum mlx5_ib_uapi_flow_action_packet_reformat_type { 
MLX5_IB_UAPI_FLOW_ACTION_PACKET_REFORMAT_TYPE_L2_TO_L3_TUNNEL = 0x3, }; +enum mlx5_ib_uapi_reg_dmabuf_flags { + MLX5_IB_UAPI_REG_DMABUF_ACCESS_DATA_DIRECT = 1 << 0, +}; + struct mlx5_ib_uapi_devx_async_cmd_hdr { __aligned_u64 wr_id; __u8 out_data[]; -- cgit v1.2.3 From ec7ad6530909983c8736c80af46e3529ce7bab55 Mon Sep 17 00:00:00 2001 From: Yishai Hadas Date: Thu, 1 Aug 2024 15:05:17 +0300 Subject: RDMA/mlx5: Introduce GET_DATA_DIRECT_SYSFS_PATH ioctl Introduce the 'GET_DATA_DIRECT_SYSFS_PATH' ioctl to return the sysfs path of the affiliated 'data direct' device for a given device. Signed-off-by: Yishai Hadas Link: https://patch.msgid.link/403745463e0ef52adbef681ff09aa6a29a756352.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky --- include/uapi/rdma/mlx5_user_ioctl_cmds.h | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/rdma/mlx5_user_ioctl_cmds.h b/include/uapi/rdma/mlx5_user_ioctl_cmds.h index 106276a4cce7..fd2e4a3a56b3 100644 --- a/include/uapi/rdma/mlx5_user_ioctl_cmds.h +++ b/include/uapi/rdma/mlx5_user_ioctl_cmds.h @@ -348,6 +348,7 @@ enum mlx5_ib_pd_methods { enum mlx5_ib_device_methods { MLX5_IB_METHOD_QUERY_PORT = (1U << UVERBS_ID_NS_SHIFT), + MLX5_IB_METHOD_GET_DATA_DIRECT_SYSFS_PATH, }; enum mlx5_ib_query_port_attrs { @@ -355,4 +356,8 @@ enum mlx5_ib_query_port_attrs { MLX5_IB_ATTR_QUERY_PORT, }; +enum mlx5_ib_get_data_direct_sysfs_path_attrs { + MLX5_IB_ATTR_GET_DATA_DIRECT_SYSFS_PATH = (1U << UVERBS_ID_NS_SHIFT), +}; + #endif -- cgit v1.2.3 From e9d05e9d5db155dcc708891611f1b6f977c3fa11 Mon Sep 17 00:00:00 2001 From: Jacopo Mondi Date: Thu, 8 Aug 2024 22:40:54 +0200 Subject: media: uapi: rkisp1-config: Add extensible params format Add to the rkisp1-config.h header data types and documentation of the extensible parameters format. 
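The buffer layout this patch documents can be illustrated with a small serializer. The sketch mirrors the block header and the outer `rkisp1_ext_params_cfg` container with hypothetical local types (a 256-byte data area instead of `RKISP1_EXT_PARAMS_MAX_SIZE`, and a made-up `demo_block` payload standing in for a real block such as `rkisp1_ext_params_bls_config`), so it depends on neither the real header nor the driver. `append_block()` packs blocks back to back without gaps, and `check_blocks()` walks them the way the header documentation describes: read each header, validate `size`, and reject a block with both ENABLE and DISABLE set. The real driver's validation may well be stricter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EXT_PARAMS_V1      1          /* mirrors RKISP1_EXT_PARAM_BUFFER_V1 */
#define FL_BLOCK_DISABLE   (1U << 0)  /* mirrors RKISP1_EXT_PARAMS_FL_BLOCK_DISABLE */
#define FL_BLOCK_ENABLE    (1U << 1)  /* mirrors RKISP1_EXT_PARAMS_FL_BLOCK_ENABLE */
#define DEMO_DATA_CAPACITY 256        /* hypothetical; the real size is RKISP1_EXT_PARAMS_MAX_SIZE */

/* Mirrors struct rkisp1_ext_params_block_header. */
struct block_header {
	uint16_t type;
	uint16_t flags;
	uint32_t size;  /* whole block, header included */
};
static_assert(sizeof(struct block_header) == 8, "header is 8 bytes");

/* Hypothetical block: common header first, then block-specific data. */
struct demo_block {
	struct block_header header;
	int16_t fixed_val[4];
} __attribute__((aligned(8)));

/* Mirrors struct rkisp1_ext_params_cfg with a smaller data area. */
struct ext_params_cfg {
	uint32_t version;
	uint32_t data_size;  /* bytes of data[] actually used */
	uint8_t data[DEMO_DATA_CAPACITY];
};

/* Copy one block right after the previous one. Returns 0 or -1. */
static int append_block(struct ext_params_cfg *cfg, const void *block)
{
	struct block_header hdr;

	memcpy(&hdr, block, sizeof(hdr));
	if (hdr.size < sizeof(hdr) ||
	    hdr.size > sizeof(cfg->data) - cfg->data_size)
		return -1;
	memcpy(cfg->data + cfg->data_size, block, hdr.size);
	cfg->data_size += hdr.size;
	return 0;
}

/* Walk the serialized blocks; return the block count, or -1 if invalid. */
static int check_blocks(const struct ext_params_cfg *cfg)
{
	size_t off = 0;
	int n = 0;

	if (cfg->version != EXT_PARAMS_V1 || cfg->data_size > sizeof(cfg->data))
		return -1;
	while (off < cfg->data_size) {
		struct block_header hdr;

		if (cfg->data_size - off < sizeof(hdr))
			return -1;  /* truncated header */
		memcpy(&hdr, cfg->data + off, sizeof(hdr));
		if (hdr.size < sizeof(hdr) || hdr.size > cfg->data_size - off)
			return -1;  /* size overruns data_size */
		if ((hdr.flags & FL_BLOCK_ENABLE) && (hdr.flags & FL_BLOCK_DISABLE))
			return -1;  /* both bits set is not allowed */
		off += hdr.size;
		n++;
	}
	return n;
}
```

Setting `data_size` to exactly the sum of the appended block sizes is what lets the buffer shrink to the blocks userspace actually wants to update.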
Signed-off-by: Jacopo Mondi Reviewed-by: Laurent Pinchart Reviewed-by: Paul Elder Tested-by: Kieran Bingham Acked-by: Sakari Ailus Signed-off-by: Laurent Pinchart --- include/uapi/linux/rkisp1-config.h | 491 +++++++++++++++++++++++++++++++++++++ 1 file changed, 491 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/rkisp1-config.h b/include/uapi/linux/rkisp1-config.h index 6eeaf8bf2362..9767bb19c72d 100644 --- a/include/uapi/linux/rkisp1-config.h +++ b/include/uapi/linux/rkisp1-config.h @@ -996,4 +996,495 @@ struct rkisp1_stat_buffer { struct rkisp1_cif_isp_stat params; }; +/*---------- PART3: Extensible Configuration Parameters ------------*/ + +/** + * enum rkisp1_ext_params_block_type - RkISP1 extensible params block type + * + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_BLS: Black level subtraction + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_DPCC: Defect pixel cluster correction + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_SDG: Sensor de-gamma + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_AWB_GAIN: Auto white balance gains + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_FLT: ISP filtering + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_BDM: Bayer de-mosaic + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_CTK: Cross-talk correction + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_GOC: Gamma out correction + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_DPF: De-noise pre-filter + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_DPF_STRENGTH: De-noise pre-filter strength + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_CPROC: Color processing + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_IE: Image effects + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_LSC: Lens shading correction + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_AWB_MEAS: Auto white balance statistics + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_HST_MEAS: Histogram statistics + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_AEC_MEAS: Auto exposure statistics + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_AFC_MEAS: Auto-focus statistics + */ +enum rkisp1_ext_params_block_type { + RKISP1_EXT_PARAMS_BLOCK_TYPE_BLS, + RKISP1_EXT_PARAMS_BLOCK_TYPE_DPCC, + RKISP1_EXT_PARAMS_BLOCK_TYPE_SDG, + 
RKISP1_EXT_PARAMS_BLOCK_TYPE_AWB_GAIN, + RKISP1_EXT_PARAMS_BLOCK_TYPE_FLT, + RKISP1_EXT_PARAMS_BLOCK_TYPE_BDM, + RKISP1_EXT_PARAMS_BLOCK_TYPE_CTK, + RKISP1_EXT_PARAMS_BLOCK_TYPE_GOC, + RKISP1_EXT_PARAMS_BLOCK_TYPE_DPF, + RKISP1_EXT_PARAMS_BLOCK_TYPE_DPF_STRENGTH, + RKISP1_EXT_PARAMS_BLOCK_TYPE_CPROC, + RKISP1_EXT_PARAMS_BLOCK_TYPE_IE, + RKISP1_EXT_PARAMS_BLOCK_TYPE_LSC, + RKISP1_EXT_PARAMS_BLOCK_TYPE_AWB_MEAS, + RKISP1_EXT_PARAMS_BLOCK_TYPE_HST_MEAS, + RKISP1_EXT_PARAMS_BLOCK_TYPE_AEC_MEAS, + RKISP1_EXT_PARAMS_BLOCK_TYPE_AFC_MEAS, +}; + +#define RKISP1_EXT_PARAMS_FL_BLOCK_DISABLE (1U << 0) +#define RKISP1_EXT_PARAMS_FL_BLOCK_ENABLE (1U << 1) + +/** + * struct rkisp1_ext_params_block_header - RkISP1 extensible parameters block + * header + * + * This structure represents the common part of all the ISP configuration + * blocks. Each parameters block shall embed an instance of this structure type + * as its first member, followed by the block-specific configuration data. The + * driver inspects this common header to discern the block type and its size and + * properly handle the block content by casting it to the correct block-specific + * type. + * + * The @type field is one of the values enumerated by + * :c:type:`rkisp1_ext_params_block_type` and specifies how the data should be + * interpreted by the driver. The @size field specifies the size of the + * parameters block and is used by the driver for validation purposes. + * + * The @flags field is a bitmask of per-block flags RKISP1_EXT_PARAMS_FL_*. + * + * When userspace wants to configure and enable an ISP block it shall fully + * populate the block configuration and set the + * RKISP1_EXT_PARAMS_FL_BLOCK_ENABLE bit in the @flags field. + * + * When userspace simply wants to disable an ISP block the + * RKISP1_EXT_PARAMS_FL_BLOCK_DISABLE bit should be set in @flags field. The + * driver ignores the rest of the block configuration structure in this case. 
+ * + * If a new configuration of an ISP block has to be applied, userspace shall + * fully populate the ISP block configuration and omit setting the + * RKISP1_EXT_PARAMS_FL_BLOCK_ENABLE and RKISP1_EXT_PARAMS_FL_BLOCK_DISABLE bits + * in the @flags field. + * + * Setting both the RKISP1_EXT_PARAMS_FL_BLOCK_ENABLE and + * RKISP1_EXT_PARAMS_FL_BLOCK_DISABLE bits in the @flags field is not allowed + * and not accepted by the driver. + * + * Userspace is responsible for correctly populating the parameters block header + * fields (@type, @flags and @size) and the block-specific parameters. + * + * For example: + * + * .. code-block:: c + * + * void populate_bls(struct rkisp1_ext_params_block_header *block) { + * struct rkisp1_ext_params_bls_config *bls = + * (struct rkisp1_ext_params_bls_config *)block; + * + * bls->header.type = RKISP1_EXT_PARAMS_BLOCK_TYPE_BLS; + * bls->header.flags = RKISP1_EXT_PARAMS_FL_BLOCK_ENABLE; + * bls->header.size = sizeof(*bls); + * + * bls->config.enable_auto = 0; + * bls->config.fixed_val.r = blackLevelRed_; + * bls->config.fixed_val.gr = blackLevelGreenR_; + * bls->config.fixed_val.gb = blackLevelGreenB_; + * bls->config.fixed_val.b = blackLevelBlue_; + * } + * + * @type: The parameters block type, see + * :c:type:`rkisp1_ext_params_block_type` + * @flags: A bitmask of block flags + * @size: Size (in bytes) of the parameters block, including this header + */ +struct rkisp1_ext_params_block_header { + __u16 type; + __u16 flags; + __u32 size; +}; + +/** + * struct rkisp1_ext_params_bls_config - RkISP1 extensible params BLS config + * + * RkISP1 extensible parameters Black Level Subtraction configuration block. + * Identified by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_BLS`.
+ * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Black Level Subtraction configuration, see + * :c:type:`rkisp1_cif_isp_bls_config` + */ +struct rkisp1_ext_params_bls_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_bls_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_dpcc_config - RkISP1 extensible params DPCC config + * + * RkISP1 extensible parameters Defective Pixel Cluster Correction configuration + * block. Identified by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_DPCC`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Defective Pixel Cluster Correction configuration, see + * :c:type:`rkisp1_cif_isp_dpcc_config` + */ +struct rkisp1_ext_params_dpcc_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_dpcc_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_sdg_config - RkISP1 extensible params SDG config + * + * RkISP1 extensible parameters Sensor Degamma configuration block. Identified + * by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_SDG`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Sensor Degamma configuration, see + * :c:type:`rkisp1_cif_isp_sdg_config` + */ +struct rkisp1_ext_params_sdg_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_sdg_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_lsc_config - RkISP1 extensible params LSC config + * + * RkISP1 extensible parameters Lens Shading Correction configuration block. + * Identified by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_LSC`. 
+ * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Lens Shading Correction configuration, see + * :c:type:`rkisp1_cif_isp_lsc_config` + */ +struct rkisp1_ext_params_lsc_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_lsc_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_awb_gain_config - RkISP1 extensible params AWB + * gain config + * + * RkISP1 extensible parameters Auto-White Balance Gains configuration block. + * Identified by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_AWB_GAIN`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Auto-White Balance Gains configuration, see + * :c:type:`rkisp1_cif_isp_awb_gain_config` + */ +struct rkisp1_ext_params_awb_gain_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_awb_gain_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_flt_config - RkISP1 extensible params FLT config + * + * RkISP1 extensible parameters Filter configuration block. Identified by + * :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_FLT`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Filter configuration, see :c:type:`rkisp1_cif_isp_flt_config` + */ +struct rkisp1_ext_params_flt_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_flt_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_bdm_config - RkISP1 extensible params BDM config + * + * RkISP1 extensible parameters Demosaicing configuration block. Identified by + * :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_BDM`. 
+ * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Demosaicing configuration, see :c:type:`rkisp1_cif_isp_bdm_config` + */ +struct rkisp1_ext_params_bdm_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_bdm_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_ctk_config - RkISP1 extensible params CTK config + * + * RkISP1 extensible parameters Cross-Talk configuration block. Identified by + * :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_CTK`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Cross-Talk configuration, see :c:type:`rkisp1_cif_isp_ctk_config` + */ +struct rkisp1_ext_params_ctk_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_ctk_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_goc_config - RkISP1 extensible params GOC config + * + * RkISP1 extensible parameters Gamma-Out configuration block. Identified by + * :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_GOC`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Gamma-Out configuration, see :c:type:`rkisp1_cif_isp_goc_config` + */ +struct rkisp1_ext_params_goc_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_goc_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_dpf_config - RkISP1 extensible params DPF config + * + * RkISP1 extensible parameters De-noise Pre-Filter configuration block. + * Identified by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_DPF`. 
+ * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: De-noise Pre-Filter configuration, see + * :c:type:`rkisp1_cif_isp_dpf_config` + */ +struct rkisp1_ext_params_dpf_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_dpf_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_dpf_strength_config - RkISP1 extensible params DPF + * strength config + * + * RkISP1 extensible parameters De-noise Pre-Filter strength configuration + * block. Identified by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_DPF_STRENGTH`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: De-noise Pre-Filter strength configuration, see + * :c:type:`rkisp1_cif_isp_dpf_strength_config` + */ +struct rkisp1_ext_params_dpf_strength_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_dpf_strength_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_cproc_config - RkISP1 extensible params CPROC config + * + * RkISP1 extensible parameters Color Processing configuration block. + * Identified by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_CPROC`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Color processing configuration, see + * :c:type:`rkisp1_cif_isp_cproc_config` + */ +struct rkisp1_ext_params_cproc_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_cproc_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_ie_config - RkISP1 extensible params IE config + * + * RkISP1 extensible parameters Image Effect configuration block. Identified by + * :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_IE`. 
+ * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Image Effect configuration, see :c:type:`rkisp1_cif_isp_ie_config` + */ +struct rkisp1_ext_params_ie_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_ie_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_awb_meas_config - RkISP1 extensible params AWB + * Meas config + * + * RkISP1 extensible parameters Auto-White Balance Measurement configuration + * block. Identified by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_AWB_MEAS`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Auto-White Balance measure configuration, see + * :c:type:`rkisp1_cif_isp_awb_meas_config` + */ +struct rkisp1_ext_params_awb_meas_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_awb_meas_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_hst_config - RkISP1 extensible params Histogram config + * + * RkISP1 extensible parameters Histogram statistics configuration block. + * Identified by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_HST_MEAS`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Histogram statistics configuration, see + * :c:type:`rkisp1_cif_isp_hst_config` + */ +struct rkisp1_ext_params_hst_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_hst_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_aec_config - RkISP1 extensible params AEC config + * + * RkISP1 extensible parameters Auto-Exposure statistics configuration block. + * Identified by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_AEC_MEAS`. 
+ * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Auto-Exposure statistics configuration, see + * :c:type:`rkisp1_cif_isp_aec_config` + */ +struct rkisp1_ext_params_aec_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_aec_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_afc_config - RkISP1 extensible params AFC config + * + * RkISP1 extensible parameters Auto-Focus statistics configuration block. + * Identified by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_AFC_MEAS`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Auto-Focus statistics configuration, see + * :c:type:`rkisp1_cif_isp_afc_config` + */ +struct rkisp1_ext_params_afc_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_afc_config config; +} __attribute__((aligned(8))); + +#define RKISP1_EXT_PARAMS_MAX_SIZE \ + (sizeof(struct rkisp1_ext_params_bls_config) +\ + sizeof(struct rkisp1_ext_params_dpcc_config) +\ + sizeof(struct rkisp1_ext_params_sdg_config) +\ + sizeof(struct rkisp1_ext_params_lsc_config) +\ + sizeof(struct rkisp1_ext_params_awb_gain_config) +\ + sizeof(struct rkisp1_ext_params_flt_config) +\ + sizeof(struct rkisp1_ext_params_bdm_config) +\ + sizeof(struct rkisp1_ext_params_ctk_config) +\ + sizeof(struct rkisp1_ext_params_goc_config) +\ + sizeof(struct rkisp1_ext_params_dpf_config) +\ + sizeof(struct rkisp1_ext_params_dpf_strength_config) +\ + sizeof(struct rkisp1_ext_params_cproc_config) +\ + sizeof(struct rkisp1_ext_params_ie_config) +\ + sizeof(struct rkisp1_ext_params_awb_meas_config) +\ + sizeof(struct rkisp1_ext_params_hst_config) +\ + sizeof(struct rkisp1_ext_params_aec_config) +\ + sizeof(struct rkisp1_ext_params_afc_config)) + +/** + * enum rksip1_ext_param_buffer_version - RkISP1 extensible parameters version + * + * @RKISP1_EXT_PARAM_BUFFER_V1: First 
version of RkISP1 extensible parameters + */ +enum rksip1_ext_param_buffer_version { + RKISP1_EXT_PARAM_BUFFER_V1 = 1, +}; + +/** + * struct rkisp1_ext_params_cfg - RkISP1 extensible parameters configuration + * + * This struct contains the configuration parameters of the RkISP1 ISP + * algorithms, serialized by userspace into a data buffer. Each configuration + * parameter block is represented by a block-specific structure which contains a + * :c:type:`rkisp1_ext_params_block_header` entry as first member. Userspace + * populates the @data buffer with configuration parameters for the blocks that + * it intends to configure. As a consequence, the data buffer effective size + * changes according to the number of ISP blocks that userspace intends to + * configure and is set by userspace in the @data_size field. + * + * The parameters buffer is versioned by the @version field to allow modifying + * and extending its definition. Userspace shall populate the @version field to + * inform the driver about the version it intends to use. The driver will parse + * and handle the @data buffer according to the data layout specific to the + * indicated version and return an error if the desired version is not + * supported. + * + * Currently the single RKISP1_EXT_PARAM_BUFFER_V1 version is supported. + * When a new format version will be added, a mechanism for userspace to query + * the supported format versions will be implemented in the form of a read-only + * V4L2 control. If such control is not available, userspace should assume only + * RKISP1_EXT_PARAM_BUFFER_V1 is supported by the driver. + * + * For each ISP block that userspace wants to configure, a block-specific + * structure is appended to the @data buffer, one after the other without gaps + * in between nor overlaps. Userspace shall populate the @data_size field with + * the effective size, in bytes, of the @data buffer. 
+ * + * The expected memory layout of the parameters buffer is:: + * + * +-------------------- struct rkisp1_ext_params_cfg -------------------+ + * | version = RKISP1_EXT_PARAM_BUFFER_V1; | + * | data_size = sizeof(struct rkisp1_ext_params_bls_config) | + * | + sizeof(struct rkisp1_ext_params_dpcc_config); | + * | +------------------------- data ---------------------------------+ | + * | | +------------- struct rkisp1_ext_params_bls_config -----------+ | | + * | | | +-------- struct rkisp1_ext_params_block_header ---------+ | | | + * | | | | type = RKISP1_EXT_PARAMS_BLOCK_TYPE_BLS; | | | | + * | | | | flags = RKISP1_EXT_PARAMS_FL_BLOCK_ENABLE; | | | | + * | | | | size = sizeof(struct rkisp1_ext_params_bls_config); | | | | + * | | | +---------------------------------------------------------+ | | | + * | | | +---------- struct rkisp1_cif_isp_bls_config -------------+ | | | + * | | | | enable_auto = 0; | | | | + * | | | | fixed_val.r = 256; | | | | + * | | | | fixed_val.gr = 256; | | | | + * | | | | fixed_val.gb = 256; | | | | + * | | | | fixed_val.b = 256; | | | | + * | | | +---------------------------------------------------------+ | | | + * | | +------------ struct rkisp1_ext_params_dpcc_config -----------+ | | + * | | | +-------- struct rkisp1_ext_params_block_header ---------+ | | | + * | | | | type = RKISP1_EXT_PARAMS_BLOCK_TYPE_DPCC; | | | | + * | | | | flags = RKISP1_EXT_PARAMS_FL_BLOCK_ENABLE; | | | | + * | | | | size = sizeof(struct rkisp1_ext_params_dpcc_config); | | | | + * | | | +---------------------------------------------------------+ | | | + * | | | +---------- struct rkisp1_cif_isp_dpcc_config ------------+ | | | + * | | | | mode = RKISP1_CIF_ISP_DPCC_MODE_STAGE1_ENABLE; | | | | + * | | | | output_mode = | | | | + * | | | | RKISP1_CIF_ISP_DPCC_OUTPUT_MODE_STAGE1_INCL_G_CENTER; | | | | + * | | | | set_use = ... ; | | | | + * | | | | ... = ...
; | | | | + * | | | +---------------------------------------------------------+ | | | + * | | +-------------------------------------------------------------+ | | + * | +-----------------------------------------------------------------+ | + * +---------------------------------------------------------------------+ + * + * @version: The RkISP1 extensible parameters buffer version, see + * :c:type:`rksip1_ext_param_buffer_version` + * @data_size: The RkISP1 configuration data effective size, excluding this + * header + * @data: The RkISP1 extensible configuration data blocks + */ +struct rkisp1_ext_params_cfg { + __u32 version; + __u32 data_size; + __u8 data[RKISP1_EXT_PARAMS_MAX_SIZE]; +}; + #endif /* _UAPI_RKISP1_CONFIG_H */ -- cgit v1.2.3 From 1fc379f6241b331207c4573c4ff43526fe404301 Mon Sep 17 00:00:00 2001 From: Jacopo Mondi Date: Thu, 8 Aug 2024 22:40:55 +0200 Subject: media: uapi: videodev2: Add V4L2_META_FMT_RK_ISP1_EXT_PARAMS The rkisp1 driver stores ISP configuration parameters in the fixed rkisp1_params_cfg structure. As the members of the structure are part of the userspace API, the structure layout is immutable and cannot be extended further. Introducing new parameters or modifying the existing ones would change the buffer layout and cause breakages in existing applications. To allow for future extensions to the ISP parameters, introduce a new extensible parameters format, with a new format 4CC. Document usage of the new format in the rkisp1 admin guide.
Signed-off-by: Jacopo Mondi Reviewed-by: Daniel Scally Reviewed-by: Paul Elder Reviewed-by: Laurent Pinchart Tested-by: Kieran Bingham Acked-by: Sakari Ailus Signed-off-by: Laurent Pinchart --- include/uapi/linux/videodev2.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/videodev2.h b/include/uapi/linux/videodev2.h index 4e91362da6da..725e86c4bbbd 100644 --- a/include/uapi/linux/videodev2.h +++ b/include/uapi/linux/videodev2.h @@ -854,6 +854,7 @@ struct v4l2_pix_format { /* Vendor specific - used for RK_ISP1 camera sub-system */ #define V4L2_META_FMT_RK_ISP1_PARAMS v4l2_fourcc('R', 'K', '1', 'P') /* Rockchip ISP1 3A Parameters */ #define V4L2_META_FMT_RK_ISP1_STAT_3A v4l2_fourcc('R', 'K', '1', 'S') /* Rockchip ISP1 3A Statistics */ +#define V4L2_META_FMT_RK_ISP1_EXT_PARAMS v4l2_fourcc('R', 'K', '1', 'E') /* Rockchip ISP1 3a Extensible Parameters */ /* Vendor specific - used for RaspberryPi PiSP */ #define V4L2_META_FMT_RPI_BE_CFG v4l2_fourcc('R', 'P', 'B', 'C') /* PiSP BE configuration */ -- cgit v1.2.3 From 3d50c66c0609c8b64fb22e9c188fca88f34e7c98 Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Fri, 9 Aug 2024 22:37:26 -0700 Subject: ethtool: rss: support skipping contexts during dump Applications may want to deal with dynamic RSS contexts only. So dumping context 0 will be counter-productive for them. Support starting the dump from a given context ID. Alternative would be to implement a dump flag to skip just context 0, not sure which is better... Reviewed-by: Edward Cree Signed-off-by: Jakub Kicinski Signed-off-by: David S. 
Miller --- include/uapi/linux/ethtool_netlink.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/ethtool_netlink.h b/include/uapi/linux/ethtool_netlink.h index 6d5bdcc67631..93c57525a975 100644 --- a/include/uapi/linux/ethtool_netlink.h +++ b/include/uapi/linux/ethtool_netlink.h @@ -965,6 +965,7 @@ enum { ETHTOOL_A_RSS_INDIR, /* binary */ ETHTOOL_A_RSS_HKEY, /* binary */ ETHTOOL_A_RSS_INPUT_XFRM, /* u32 */ + ETHTOOL_A_RSS_START_CONTEXT, /* u32 */ __ETHTOOL_A_RSS_CNT, ETHTOOL_A_RSS_MAX = (__ETHTOOL_A_RSS_CNT - 1), -- cgit v1.2.3 From 75bab45e6b2da379fe2ebda48ed35f8ce371a2ef Mon Sep 17 00:00:00 2001 From: Petr Machata Date: Wed, 7 Aug 2024 16:13:46 +0200 Subject: net: nexthop: Add flag to assert that NHGRP reserved fields are zero There are many unpatched kernel versions out there that do not initialize the reserved fields of struct nexthop_grp. The issue with that is that if those fields were to be used for some end (i.e. stop being reserved), old kernels would still keep sending random data through the field, and a new userspace could not rely on the value. In this patch, use the existing NHA_OP_FLAGS, which is currently inbound only, to carry flags back to the userspace. Add a flag to indicate that the reserved fields in struct nexthop_grp are zeroed before dumping. This is reliant on the actual fix from commit 6d745cd0e972 ("net: nexthop: Initialize all fields in dumped nexthops"). 
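On the userspace side, the response flag lets a dump consumer decide whether the reserved bytes of `struct nexthop_grp` can be trusted. The sketch assumes the caller has already parsed the NHA_OP_FLAGS attribute of a dump response into `resp_flags` (zero when the attribute is absent), and it anticipates the following patch in this series, which repurposes `resvd1` as `weight_high`. The struct and the BIT(31) value are mirrored locally, and `dump_entry_weight()` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of NHA_OP_FLAG_RESP_GRP_RESVD_0, i.e. BIT(31). */
#define RESP_GRP_RESVD_0 (1U << 31)

/* Local mirror of struct nexthop_grp after the weight widening. */
struct nexthop_grp {
	uint32_t id;
	uint8_t weight;       /* low 8 bits of (weight - 1) */
	uint8_t weight_high;  /* high 8 bits; formerly resvd1 */
	uint16_t resvd2;
};
static_assert(sizeof(struct nexthop_grp) == 8, "UAPI entry is 8 bytes");

/* Decode the weight of a dumped group entry. Without the response flag
 * an unpatched kernel may leave garbage in the reserved byte, so only
 * the low 8 bits are used in that case. */
static uint32_t dump_entry_weight(const struct nexthop_grp *e,
				  uint32_t resp_flags)
{
	uint32_t w = e->weight;

	if (resp_flags & RESP_GRP_RESVD_0)
		w |= (uint32_t)e->weight_high << 8;
	return w + 1;  /* the wire format stores weight - 1 */
}
```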
Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel Link: https://patch.msgid.link/21037748d4f9d8ff486151f4c09083bcf12d5df8.1723036486.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski --- include/uapi/linux/nexthop.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h index dd8787f9cf39..f4f060a87cc2 100644 --- a/include/uapi/linux/nexthop.h +++ b/include/uapi/linux/nexthop.h @@ -33,6 +33,9 @@ enum { #define NHA_OP_FLAG_DUMP_STATS BIT(0) #define NHA_OP_FLAG_DUMP_HW_STATS BIT(1) +/* Response OP_FLAGS. */ +#define NHA_OP_FLAG_RESP_GRP_RESVD_0 BIT(31) /* Dump clears resvd fields. */ + enum { NHA_UNSPEC, NHA_ID, /* u32; id for nexthop. id == 0 means auto-assign */ -- cgit v1.2.3 From b72a6a7ab9573e06d5c2fcb92eaa28614a735bfd Mon Sep 17 00:00:00 2001 From: Petr Machata Date: Wed, 7 Aug 2024 16:13:47 +0200 Subject: net: nexthop: Increase weight to u16 In CLOS networks, as link failures occur at various points in the network, ECMP weights of the involved nodes are adjusted to compensate. With high fan-out of the involved nodes, and overall high number of nodes, a (non-)ECMP weight ratio that we would like to configure does not fit into 8 bits. Instead of, say, 255:254, we might like to configure something like 1000:999. For these deployments, the 8-bit weight may not be enough. To that end, in this patch increase the next hop weight from u8 to u16. Increasing the width of an integral type can be tricky, because while the code still compiles, the types may not check out anymore, and numerical errors come up. To prevent this, the conversion was done in two steps. First the type was changed from u8 to a single-member structure, which invalidated all uses of the field. This allowed going through them one by one and audit for type correctness. Then the structure was replaced with a vanilla u16 again. This should ensure that no place was missed. 
The UAPI for configuring nexthop group members is that an attribute NHA_GROUP carries an array of struct nexthop_grp entries: struct nexthop_grp { __u32 id; /* nexthop id - must exist */ __u8 weight; /* weight of this nexthop */ __u8 resvd1; __u16 resvd2; }; The field resvd1 is currently validated and required to be zero. We can lift this requirement and carry high-order bits of the weight in the reserved field: struct nexthop_grp { __u32 id; /* nexthop id - must exist */ __u8 weight; /* weight of this nexthop */ __u8 weight_high; __u16 resvd2; }; Keeping the fields split this way was chosen in case an existing userspace makes assumptions about the width of the weight field, and to sidestep any endianness issues. The weight field is currently encoded as the weight value minus one, because weight of 0 is invalid. This same trick is impossible for the new weight_high field, because zero must mean actual zero. With this in place: - Old userspace is guaranteed to carry weight_high of 0, therefore configuring 8-bit weights as appropriate. When dumping nexthops with 16-bit weight, it would only show the lower 8 bits. But configuring such nexthops implies the existence of a userspace aware of the extension in the first place. - New userspace talking to an old kernel will work as long as it only attempts to configure 8-bit weights, where the high-order bits are zero. Old kernel will bounce attempts at configuring >8-bit weights. Renaming reserved fields as they are allocated for some purpose is commonly done in Linux. Whoever touches a reserved field is doing so at their own risk. nexthop_grp::resvd1 in particular is currently used by at least strace; however, strace carries its own copy of the UAPI headers, and the conversion should be trivial. A helper is provided for decoding the weight out of the two fields. Forcing a conversion seems preferable to bending over backwards and introducing anonymous unions or whatever.
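The inverse of the `nexthop_grp_weight()` helper can be sketched for the configuration path. It assumes weights in the range 1 through 65535; `nexthop_grp_set_weight()` is a hypothetical name, not part of the UAPI, and the struct and decoding helper are mirrored locally from the updated header.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of the updated struct nexthop_grp. */
struct nexthop_grp {
	uint32_t id;
	uint8_t weight;
	uint8_t weight_high;
	uint16_t resvd2;
};
static_assert(sizeof(struct nexthop_grp) == 8, "UAPI entry is 8 bytes");

/* Same decoding as the UAPI helper nexthop_grp_weight(). */
static inline uint16_t nexthop_grp_weight(const struct nexthop_grp *entry)
{
	return ((entry->weight_high << 8) | entry->weight) + 1;
}

/* Store weight - 1, split across the low and high bytes.
 * Returns 0 on success, -1 for the invalid weight 0. */
static int nexthop_grp_set_weight(struct nexthop_grp *entry, uint16_t weight)
{
	if (weight == 0)
		return -1;
	entry->weight = (uint8_t)((weight - 1) & 0xff);
	entry->weight_high = (uint8_t)((weight - 1) >> 8);
	return 0;
}
```

Weights up to 256 encode with `weight_high == 0`, which is the range the commit message says an old kernel still accepts; anything larger sets `weight_high` and will be bounced by such a kernel.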
Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel Reviewed-by: David Ahern Reviewed-by: Przemek Kitszel Link: https://patch.msgid.link/483e2fcf4beb0d9135d62e7d27b46fa2685479d4.1723036486.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski --- include/uapi/linux/nexthop.h | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h index f4f060a87cc2..bc49baf4a267 100644 --- a/include/uapi/linux/nexthop.h +++ b/include/uapi/linux/nexthop.h @@ -16,10 +16,15 @@ struct nhmsg { struct nexthop_grp { __u32 id; /* nexthop id - must exist */ __u8 weight; /* weight of this nexthop */ - __u8 resvd1; + __u8 weight_high; /* high order bits of weight */ __u16 resvd2; }; +static inline __u16 nexthop_grp_weight(const struct nexthop_grp *entry) +{ + return ((entry->weight_high << 8) | entry->weight) + 1; +} + enum { NEXTHOP_GRP_TYPE_MPATH, /* hash-threshold nexthop group * default type if not specified -- cgit v1.2.3 From c26cee817f8bd9a22bfade20f739ec2fc6f20221 Mon Sep 17 00:00:00 2001 From: David Sands Date: Sat, 10 Aug 2024 20:00:05 -0400 Subject: usb: gadget: f_fs: add capability for dfu functional descriptor Add the ability for the USB FunctionFS (FFS) gadget driver to create Device Firmware Upgrade (DFU) functional descriptors. [1] This patch allows implementation of DFU in userspace using the FFS gadget. The DFU protocol uses the control pipe (ep0) for all messaging, so only the addition of the DFU functional descriptor is needed in the kernel driver. The DFU functional descriptor is written to the ep0 file along with any other descriptors during FFS setup. DFU requires an interface descriptor followed by the DFU functional descriptor. This patch includes documentation of the added descriptor for DFU and conversion of some existing documentation to kernel-doc format so that it can be included in the generated docs.
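As a rough illustration of what userspace appends after the interface descriptor, the sketch below serializes a DFU functional descriptor byte by byte, so the little-endian 16-bit fields come out right on any host. The constants are local copies of USB_DT_DFU_FUNCTIONAL and the DFU_FUNC_ATT_* bits introduced by this patch (spelled without _BITUL), and the 9-byte length follows from the packed struct usb_dfu_functional_descriptor; dfu_func_desc_write() and the example values are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies of values added by this patch to ch9.h and functionfs.h. */
#define USB_DT_DFU_FUNCTIONAL 0x21
#define DFU_FUNC_ATT_CAN_DOWNLOAD (1u << 0)
#define DFU_FUNC_ATT_CAN_UPLOAD (1u << 1)
#define DFU_FUNC_ATT_MANIFEST_TOLERANT (1u << 2)
#define DFU_FUNC_ATT_WILL_DETACH (1u << 3)

/* Packed layout: three u8 fields followed by three le16 fields. */
#define DFU_FUNC_DESC_LEN 9

/* Store a 16-bit value in little-endian byte order. */
static void put_le16(uint8_t *p, uint16_t v)
{
	p[0] = v & 0xff;
	p[1] = v >> 8;
}

/* Hypothetical helper: serialize a DFU functional descriptor into buf
 * (which must hold DFU_FUNC_DESC_LEN bytes) and return its length. */
static size_t dfu_func_desc_write(uint8_t *buf, uint8_t attrs,
				  uint16_t detach_timeout_ms,
				  uint16_t transfer_size, uint16_t bcd_version)
{
	buf[0] = DFU_FUNC_DESC_LEN;     /* bLength */
	buf[1] = USB_DT_DFU_FUNCTIONAL; /* bDescriptorType */
	buf[2] = attrs;                 /* bmAttributes */
	put_le16(&buf[3], detach_timeout_ms); /* wDetachTimeOut */
	put_le16(&buf[5], transfer_size);     /* wTransferSize */
	put_le16(&buf[7], bcd_version);       /* bcdDFUVersion */
	return DFU_FUNC_DESC_LEN;
}
```

Code built against a kernel carrying this patch would normally include <linux/usb/ch9.h> and <linux/usb/functionfs.h> rather than redefining the constants.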
An implementation of DFU 1.1 that implements just the runtime descriptor using the FunctionFS gadget (with rebooting into u-boot for DFU mode) has been tested on an i.MX8 Nano. An implementation of DFU 1.1 that implements both runtime and DFU mode using the FunctionFS gadget has been tested on Xilinx Zynq UltraScale+. Note that for the best performance of firmware update file transfers, the userspace program should respond as quickly as possible to the setup packets. [1] https://www.usb.org/sites/default/files/DFU_1.1.pdf Signed-off-by: David Sands Co-developed-by: Chris Wulff Signed-off-by: Chris Wulff Link: https://lore.kernel.org/r/20240811000004.1395888-2-crwulff@gmail.com Signed-off-by: Greg Kroah-Hartman --- include/uapi/linux/usb/ch9.h | 8 ++- include/uapi/linux/usb/functionfs.h | 97 ++++++++++++++++++++++++++++++++++--- 2 files changed, 95 insertions(+), 10 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/usb/ch9.h b/include/uapi/linux/usb/ch9.h index 44d73ba8788d..91f0f7e214a5 100644 --- a/include/uapi/linux/usb/ch9.h +++ b/include/uapi/linux/usb/ch9.h @@ -254,6 +254,9 @@ struct usb_ctrlrequest { #define USB_DT_DEVICE_CAPABILITY 0x10 #define USB_DT_WIRELESS_ENDPOINT_COMP 0x11 #define USB_DT_WIRE_ADAPTER 0x21 +/* From USB Device Firmware Upgrade Specification, Revision 1.1 */ +#define USB_DT_DFU_FUNCTIONAL 0x21 +/* these are from the Wireless USB spec */ #define USB_DT_RPIPE 0x22 #define USB_DT_CS_RADIO_CONTROL 0x23 /* From the T10 UAS specification */ @@ -329,9 +332,10 @@ struct usb_device_descriptor { #define USB_CLASS_USB_TYPE_C_BRIDGE 0x12 #define USB_CLASS_MISC 0xef #define USB_CLASS_APP_SPEC 0xfe -#define USB_CLASS_VENDOR_SPEC 0xff +#define USB_SUBCLASS_DFU 0x01 -#define USB_SUBCLASS_VENDOR_SPEC 0xff +#define USB_CLASS_VENDOR_SPEC 0xff +#define USB_SUBCLASS_VENDOR_SPEC 0xff /*-------------------------------------------------------------------------*/ diff --git a/include/uapi/linux/usb/functionfs.h
b/include/uapi/linux/usb/functionfs.h index 9f88de9c3d66..2ebdba111a8f 100644 --- a/include/uapi/linux/usb/functionfs.h +++ b/include/uapi/linux/usb/functionfs.h @@ -3,6 +3,7 @@ #define _UAPI__LINUX_FUNCTIONFS_H__ +#include #include #include @@ -37,6 +38,31 @@ struct usb_endpoint_descriptor_no_audio { __u8 bInterval; } __attribute__((packed)); +/** + * struct usb_dfu_functional_descriptor - DFU Functional descriptor + * @bLength: Size of the descriptor (bytes) + * @bDescriptorType: USB_DT_DFU_FUNCTIONAL + * @bmAttributes: DFU attributes + * @wDetachTimeOut: Maximum time to wait after DFU_DETACH (ms, le16) + * @wTransferSize: Maximum number of bytes per control-write (le16) + * @bcdDFUVersion: DFU Spec version (BCD, le16) + */ +struct usb_dfu_functional_descriptor { + __u8 bLength; + __u8 bDescriptorType; + __u8 bmAttributes; + __le16 wDetachTimeOut; + __le16 wTransferSize; + __le16 bcdDFUVersion; +} __attribute__ ((packed)); + +/* from DFU functional descriptor bmAttributes */ +#define DFU_FUNC_ATT_CAN_DOWNLOAD _BITUL(0) +#define DFU_FUNC_ATT_CAN_UPLOAD _BITUL(1) +#define DFU_FUNC_ATT_MANIFEST_TOLERANT _BITUL(2) +#define DFU_FUNC_ATT_WILL_DETACH _BITUL(3) + + struct usb_functionfs_descs_head_v2 { __le32 magic; __le32 length; @@ -104,23 +130,38 @@ struct usb_ffs_dmabuf_transfer_req { #ifndef __KERNEL__ -/* +/** + * DOC: descriptors + * * Descriptors format: * + * +-----+-----------+--------------+--------------------------------------+ * | off | name | type | description | - * |-----+-----------+--------------+--------------------------------------| + * +-----+-----------+--------------+--------------------------------------+ * | 0 | magic | LE32 | FUNCTIONFS_DESCRIPTORS_MAGIC_V2 | + * +-----+-----------+--------------+--------------------------------------+ * | 4 | length | LE32 | length of the whole data chunk | + * +-----+-----------+--------------+--------------------------------------+ * | 8 | flags | LE32 | combination of functionfs_flags | + * 
+-----+-----------+--------------+--------------------------------------+ * | | eventfd | LE32 | eventfd file descriptor | + * +-----+-----------+--------------+--------------------------------------+ * | | fs_count | LE32 | number of full-speed descriptors | + * +-----+-----------+--------------+--------------------------------------+ * | | hs_count | LE32 | number of high-speed descriptors | + * +-----+-----------+--------------+--------------------------------------+ * | | ss_count | LE32 | number of super-speed descriptors | + * +-----+-----------+--------------+--------------------------------------+ * | | os_count | LE32 | number of MS OS descriptors | + * +-----+-----------+--------------+--------------------------------------+ * | | fs_descrs | Descriptor[] | list of full-speed descriptors | + * +-----+-----------+--------------+--------------------------------------+ * | | hs_descrs | Descriptor[] | list of high-speed descriptors | + * +-----+-----------+--------------+--------------------------------------+ * | | ss_descrs | Descriptor[] | list of super-speed descriptors | + * +-----+-----------+--------------+--------------------------------------+ * | | os_descrs | OSDesc[] | list of MS OS descriptors | + * +-----+-----------+--------------+--------------------------------------+ * * Depending on which flags are set, various fields may be missing in the * structure. 
Any flags that are not recognised cause the whole block to be @@ -128,71 +169,111 @@ struct usb_ffs_dmabuf_transfer_req { * * Legacy descriptors format (deprecated as of 3.14): * + * +-----+-----------+--------------+--------------------------------------+ * | off | name | type | description | - * |-----+-----------+--------------+--------------------------------------| + * +-----+-----------+--------------+--------------------------------------+ * | 0 | magic | LE32 | FUNCTIONFS_DESCRIPTORS_MAGIC | + * +-----+-----------+--------------+--------------------------------------+ * | 4 | length | LE32 | length of the whole data chunk | + * +-----+-----------+--------------+--------------------------------------+ * | 8 | fs_count | LE32 | number of full-speed descriptors | + * +-----+-----------+--------------+--------------------------------------+ * | 12 | hs_count | LE32 | number of high-speed descriptors | + * +-----+-----------+--------------+--------------------------------------+ * | 16 | fs_descrs | Descriptor[] | list of full-speed descriptors | + * +-----+-----------+--------------+--------------------------------------+ * | | hs_descrs | Descriptor[] | list of high-speed descriptors | + * +-----+-----------+--------------+--------------------------------------+ * * All numbers must be in little endian order. 
* * Descriptor[] is an array of valid USB descriptors which have the following * format: * + * +-----+-----------------+------+--------------------------+ * | off | name | type | description | - * |-----+-----------------+------+--------------------------| + * +-----+-----------------+------+--------------------------+ * | 0 | bLength | U8 | length of the descriptor | + * +-----+-----------------+------+--------------------------+ * | 1 | bDescriptorType | U8 | descriptor type | + * +-----+-----------------+------+--------------------------+ * | 2 | payload | | descriptor's payload | + * +-----+-----------------+------+--------------------------+ * * OSDesc[] is an array of valid MS OS Feature Descriptors which have one of * the following formats: * + * +-----+-----------------+------+--------------------------+ * | off | name | type | description | - * |-----+-----------------+------+--------------------------| + * +-----+-----------------+------+--------------------------+ * | 0 | inteface | U8 | related interface number | + * +-----+-----------------+------+--------------------------+ * | 1 | dwLength | U32 | length of the descriptor | + * +-----+-----------------+------+--------------------------+ * | 5 | bcdVersion | U16 | currently supported: 1 | + * +-----+-----------------+------+--------------------------+ * | 7 | wIndex | U16 | currently supported: 4 | + * +-----+-----------------+------+--------------------------+ * | 9 | bCount | U8 | number of ext. compat. | + * +-----+-----------------+------+--------------------------+ * | 10 | Reserved | U8 | 0 | + * +-----+-----------------+------+--------------------------+ * | 11 | ExtCompat[] | | list of ext. compat. d. 
| + * +-----+-----------------+------+--------------------------+ * + * +-----+-----------------+------+--------------------------+ * | off | name | type | description | - * |-----+-----------------+------+--------------------------| + * +-----+-----------------+------+--------------------------+ * | 0 | inteface | U8 | related interface number | + * +-----+-----------------+------+--------------------------+ * | 1 | dwLength | U32 | length of the descriptor | + * +-----+-----------------+------+--------------------------+ * | 5 | bcdVersion | U16 | currently supported: 1 | + * +-----+-----------------+------+--------------------------+ * | 7 | wIndex | U16 | currently supported: 5 | + * +-----+-----------------+------+--------------------------+ * | 9 | wCount | U16 | number of ext. compat. | + * +-----+-----------------+------+--------------------------+ * | 11 | ExtProp[] | | list of ext. prop. d. | + * +-----+-----------------+------+--------------------------+ * * ExtCompat[] is an array of valid Extended Compatiblity descriptors * which have the following format: * + * +-----+-----------------------+------+-------------------------------------+ * | off | name | type | description | - * |-----+-----------------------+------+-------------------------------------| + * +-----+-----------------------+------+-------------------------------------+ * | 0 | bFirstInterfaceNumber | U8 | index of the interface or of the 1st| + * +-----+-----------------------+------+-------------------------------------+ * | | | | interface in an IAD group | + * +-----+-----------------------+------+-------------------------------------+ * | 1 | Reserved | U8 | 1 | + * +-----+-----------------------+------+-------------------------------------+ * | 2 | CompatibleID | U8[8]| compatible ID string | + * +-----+-----------------------+------+-------------------------------------+ * | 10 | SubCompatibleID | U8[8]| subcompatible ID string | + * 
+-----+-----------------------+------+-------------------------------------+ * | 18 | Reserved | U8[6]| 0 | + * +-----+-----------------------+------+-------------------------------------+ * * ExtProp[] is an array of valid Extended Properties descriptors * which have the following format: * + * +-----+-----------------------+------+-------------------------------------+ * | off | name | type | description | - * |-----+-----------------------+------+-------------------------------------| + * +-----+-----------------------+------+-------------------------------------+ * | 0 | dwSize | U32 | length of the descriptor | + * +-----+-----------------------+------+-------------------------------------+ * | 4 | dwPropertyDataType | U32 | 1..7 | + * +-----+-----------------------+------+-------------------------------------+ * | 8 | wPropertyNameLength | U16 | bPropertyName length (NL) | + * +-----+-----------------------+------+-------------------------------------+ * | 10 | bPropertyName |U8[NL]| name of this property | + * +-----+-----------------------+------+-------------------------------------+ * |10+NL| dwPropertyDataLength | U32 | bPropertyData length (DL) | + * +-----+-----------------------+------+-------------------------------------+ * |14+NL| bProperty |U8[DL]| payload of this property | + * +-----+-----------------------+------+-------------------------------------+ */ struct usb_functionfs_strings_head { -- cgit v1.2.3 From ac79beb913dc40b1a49b628e31afdfeb20194ab6 Mon Sep 17 00:00:00 2001 From: Paul Elder Date: Thu, 8 Aug 2024 22:41:05 +0200 Subject: media: rkisp1: Add support for the companding block Add support to the rkisp1 driver for the companding block that exists on the i.MX8MP version of the ISP. This requires usage of the new extensible parameters format, and showcases how the format allows for extensions without breaking backward compatibility. 
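The compand curve's px[] field is documented in this patch as storing, for each point, log2 of the distance from the previous x-value on 5 bits. A minimal sketch of deriving such values follows, assuming the caller's breakpoints are absolute, strictly increasing positions, that the first distance is measured from 0, and that every gap is a power of two. compand_px_from_x() is hypothetical, and its input positions are not the struct's x[] field, whose function the header says is unknown.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors RKISP1_CIF_ISP_COMPAND_NUM_POINTS from this patch. */
#define COMPAND_NUM_POINTS 64

/* Hypothetical helper: fill px[i] with log2(x[i] - x[i - 1]), taking the
 * distance of the first point from 0. Returns 0 on success, or -1 if there
 * are too many points, the positions are not strictly increasing, or a gap
 * is not a power of two. */
static int compand_px_from_x(const uint32_t *x, unsigned int n, uint8_t *px)
{
	uint32_t prev = 0;

	if (n > COMPAND_NUM_POINTS)
		return -1;

	for (unsigned int i = 0; i < n; i++) {
		uint32_t gap;
		uint8_t lg = 0;

		if (x[i] <= prev)
			return -1;
		gap = x[i] - prev;
		if (gap & (gap - 1))
			return -1; /* not a power of two */

		/* A power-of-two uint32_t gap is at most 2^31, so lg <= 31
		 * and always fits in the 5 bits described for px. */
		while (gap >>= 1)
			lg++;

		px[i] = lg;
		prev = x[i];
	}
	return 0;
}
```

Breakpoints at 1, 3, 7 and 15, for instance, have gaps of 1, 2, 4 and 8 and therefore yield px values 0, 1, 2 and 3.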
Signed-off-by: Paul Elder Reviewed-by: Jacopo Mondi Reviewed-by: Paul Elder Signed-off-by: Jacopo Mondi Tested-by: Kieran Bingham Acked-by: Sakari Ailus Signed-off-by: Laurent Pinchart --- include/uapi/linux/rkisp1-config.h | 89 +++++++++++++++++++++++++++++++++++++- 1 file changed, 88 insertions(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/rkisp1-config.h b/include/uapi/linux/rkisp1-config.h index 9767bb19c72d..430daceafac7 100644 --- a/include/uapi/linux/rkisp1-config.h +++ b/include/uapi/linux/rkisp1-config.h @@ -164,6 +164,11 @@ #define RKISP1_CIF_ISP_DPF_MAX_NLF_COEFFS 17 #define RKISP1_CIF_ISP_DPF_MAX_SPATIAL_COEFFS 6 +/* + * Compand + */ +#define RKISP1_CIF_ISP_COMPAND_NUM_POINTS 64 + /* * Measurement types */ @@ -851,6 +856,39 @@ struct rkisp1_params_cfg { struct rkisp1_cif_isp_isp_other_cfg others; }; +/** + * struct rkisp1_cif_isp_compand_bls_config - Rockchip ISP1 Companding parameters (BLS) + * @r: Fixed subtraction value for Bayer pattern R + * @gr: Fixed subtraction value for Bayer pattern Gr + * @gb: Fixed subtraction value for Bayer pattern Gb + * @b: Fixed subtraction value for Bayer pattern B + * + * The values will be subtracted from the sensor values. Note that unlike the + * dedicated BLS block, the BLS values in the compander are 20-bit unsigned. + */ +struct rkisp1_cif_isp_compand_bls_config { + __u32 r; + __u32 gr; + __u32 gb; + __u32 b; +}; + +/** + * struct rkisp1_cif_isp_compand_curve_config - Rockchip ISP1 Companding + * parameters (expand and compression curves) + * @px: Compand curve x-values. Each value stores the distance from the + * previous x-value, expressed as log2 of the distance on 5 bits. + * @x: Compand curve x-values. The functionality of these parameters are + * unknown due to do a lack of hardware documentation, but these are left + * here for future compatibility purposes. 
+ * @y: Compand curve y-values + */ +struct rkisp1_cif_isp_compand_curve_config { + __u8 px[RKISP1_CIF_ISP_COMPAND_NUM_POINTS]; + __u32 x[RKISP1_CIF_ISP_COMPAND_NUM_POINTS]; + __u32 y[RKISP1_CIF_ISP_COMPAND_NUM_POINTS]; +}; + /*---------- PART2: Measurement Statistics ------------*/ /** @@ -1018,6 +1056,9 @@ struct rkisp1_stat_buffer { * @RKISP1_EXT_PARAMS_BLOCK_TYPE_HST_MEAS: Histogram statistics * @RKISP1_EXT_PARAMS_BLOCK_TYPE_AEC_MEAS: Auto exposure statistics * @RKISP1_EXT_PARAMS_BLOCK_TYPE_AFC_MEAS: Auto-focus statistics + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_COMPAND_BLS: BLS in the compand block + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_COMPAND_EXPAND: Companding expand curve + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_COMPAND_COMPRESS: Companding compress curve */ enum rkisp1_ext_params_block_type { RKISP1_EXT_PARAMS_BLOCK_TYPE_BLS, @@ -1037,6 +1078,9 @@ enum rkisp1_ext_params_block_type { RKISP1_EXT_PARAMS_BLOCK_TYPE_HST_MEAS, RKISP1_EXT_PARAMS_BLOCK_TYPE_AEC_MEAS, RKISP1_EXT_PARAMS_BLOCK_TYPE_AFC_MEAS, + RKISP1_EXT_PARAMS_BLOCK_TYPE_COMPAND_BLS, + RKISP1_EXT_PARAMS_BLOCK_TYPE_COMPAND_EXPAND, + RKISP1_EXT_PARAMS_BLOCK_TYPE_COMPAND_COMPRESS, }; #define RKISP1_EXT_PARAMS_FL_BLOCK_DISABLE (1U << 0) @@ -1380,6 +1424,46 @@ struct rkisp1_ext_params_afc_config { struct rkisp1_cif_isp_afc_config config; } __attribute__((aligned(8))); +/** + * struct rkisp1_ext_params_compand_bls_config - RkISP1 extensible params + * Compand BLS config + * + * RkISP1 extensible parameters Companding configuration block (black level + * subtraction). Identified by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_COMPAND_BLS`. 
+ * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Companding BLS configuration, see + * :c:type:`rkisp1_cif_isp_compand_bls_config` + */ +struct rkisp1_ext_params_compand_bls_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_compand_bls_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_compand_curve_config - RkISP1 extensible params + * Compand curve config + * + * RkISP1 extensible parameters Companding configuration block (expand and + * compression curves). Identified by + * :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_COMPAND_EXPAND` or + * :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_COMPAND_COMPRESS`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Companding curve configuration, see + * :c:type:`rkisp1_cif_isp_compand_curve_config` + */ +struct rkisp1_ext_params_compand_curve_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_compand_curve_config config; +} __attribute__((aligned(8))); + +/* + * The rkisp1_ext_params_compand_curve_config structure is counted twice as it + * is used for both the COMPAND_EXPAND and COMPAND_COMPRESS block types. 
+ */ #define RKISP1_EXT_PARAMS_MAX_SIZE \ (sizeof(struct rkisp1_ext_params_bls_config) +\ sizeof(struct rkisp1_ext_params_dpcc_config) +\ @@ -1397,7 +1481,10 @@ struct rkisp1_ext_params_afc_config { sizeof(struct rkisp1_ext_params_awb_meas_config) +\ sizeof(struct rkisp1_ext_params_hst_config) +\ sizeof(struct rkisp1_ext_params_aec_config) +\ - sizeof(struct rkisp1_ext_params_afc_config)) + sizeof(struct rkisp1_ext_params_afc_config) +\ + sizeof(struct rkisp1_ext_params_compand_bls_config) +\ + sizeof(struct rkisp1_ext_params_compand_curve_config) +\ + sizeof(struct rkisp1_ext_params_compand_curve_config)) /** * enum rksip1_ext_param_buffer_version - RkISP1 extensible parameters version -- cgit v1.2.3 From 216203bdc2280d8fc5baf60707eee2051de1426e Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Tue, 13 Aug 2024 16:15:02 -0600 Subject: UAPI: net/sched: Use __struct_group() in flex struct tc_u32_sel Use the `__struct_group()` helper to create a new tagged `struct tc_u32_sel_hdr`. This structure groups together all the members of the flexible `struct tc_u32_sel` except the flexible array. As a result, the array is effectively separated from the rest of the members without modifying the memory layout of the flexible structure. This new tagged struct will be used to fix problematic declarations of middle-flex-arrays in composite structs[1]. [1] https://git.kernel.org/linus/d88cabfd9abc Signed-off-by: Gustavo A. R. 
Silva Link: https://patch.msgid.link/e59fe833564ddc5b2cc83056a4c504be887d6193.1723586870.git.gustavoars@kernel.org Signed-off-by: Jakub Kicinski --- include/uapi/linux/pkt_cls.h | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h index d36d9cdf0c00..2c32080416b5 100644 --- a/include/uapi/linux/pkt_cls.h +++ b/include/uapi/linux/pkt_cls.h @@ -246,16 +246,19 @@ struct tc_u32_key { }; struct tc_u32_sel { - unsigned char flags; - unsigned char offshift; - unsigned char nkeys; - - __be16 offmask; - __u16 off; - short offoff; - - short hoff; - __be32 hmask; + /* New members MUST be added within the __struct_group() macro below. */ + __struct_group(tc_u32_sel_hdr, hdr, /* no attrs */, + unsigned char flags; + unsigned char offshift; + unsigned char nkeys; + + __be16 offmask; + __u16 off; + short offoff; + + short hoff; + __be32 hmask; + ); struct tc_u32_key keys[]; }; -- cgit v1.2.3 From 2140e63cd87fa707acf514d594725097bed018fd Mon Sep 17 00:00:00 2001 From: Oleksij Rempel Date: Mon, 12 Aug 2024 09:30:44 +0200 Subject: ethtool: Add new result codes for TDR diagnostics Add new result codes to support TDR diagnostics in preparation for Open Alliance 1000BaseT1 TDR support: - ETHTOOL_A_CABLE_RESULT_CODE_NOISE: TDR not possible due to high noise level. - ETHTOOL_A_CABLE_RESULT_CODE_RESOLUTION_NOT_POSSIBLE: TDR resolution not possible / out of distance. 
Reviewed-by: Andrew Lunn Signed-off-by: Oleksij Rempel Link: https://patch.msgid.link/20240812073046.1728288-1-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski --- include/uapi/linux/ethtool_netlink.h | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/ethtool_netlink.h b/include/uapi/linux/ethtool_netlink.h index 93c57525a975..9074fa309bd6 100644 --- a/include/uapi/linux/ethtool_netlink.h +++ b/include/uapi/linux/ethtool_netlink.h @@ -556,6 +556,10 @@ enum { * a regular 100 Ohm cable and a part with the abnormal impedance value */ ETHTOOL_A_CABLE_RESULT_CODE_IMPEDANCE_MISMATCH, + /* TDR not possible due to high noise level */ + ETHTOOL_A_CABLE_RESULT_CODE_NOISE, + /* TDR resolution not possible / out of distance */ + ETHTOOL_A_CABLE_RESULT_CODE_RESOLUTION_NOT_POSSIBLE, }; enum { -- cgit v1.2.3 From 37745918e0e7575bc40f38da93a99b9fa6406224 Mon Sep 17 00:00:00 2001 From: Ivan Orlov Date: Tue, 13 Aug 2024 13:07:00 +0100 Subject: ALSA: timer: Introduce virtual userspace-driven timers Implement two ioctl calls in order to support virtual userspace-driven ALSA timers. The first ioctl is SNDRV_TIMER_IOCTL_CREATE, which gets the snd_timer_uinfo struct as a parameter and puts a file descriptor of a virtual timer into the `fd` field of the snd_timer_uinfo structure. It also updates the `id` field of the snd_timer_uinfo struct, which provides a unique identifier for the timer (basically, the subdevice number which can be used when creating timer instances). This patch also introduces a tiny id allocator for the userspace-driven timers, which guarantees that we don't have more than 128 of them in the system. Another ioctl is SNDRV_TIMER_IOCTL_TRIGGER, which allows us to trigger the virtual timer (and calls snd_timer_interrupt for the timer under the hood), causing all of the timer instances bound to this timer to execute their callbacks.
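The commit only states that the id allocator caps userspace-driven timers at 128. A bitmap is one simple way to get that property; the sketch below is a hypothetical single-threaded illustration (the names, the 0-based ids, and the lowest-free-id policy are assumptions), not the ALSA implementation, which would also need locking.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on userspace-driven timers stated in the commit message. */
#define MAX_UTIMERS 128

/* Hypothetical allocator state: one bit per possible timer id. */
struct utimer_ids {
	uint64_t bits[MAX_UTIMERS / 64];
};

/* Return the lowest free id in [0, MAX_UTIMERS), or -1 if all are taken. */
static int utimer_id_alloc(struct utimer_ids *ids)
{
	for (int id = 0; id < MAX_UTIMERS; id++) {
		uint64_t mask = UINT64_C(1) << (id % 64);

		if (!(ids->bits[id / 64] & mask)) {
			ids->bits[id / 64] |= mask;
			return id;
		}
	}
	return -1;
}

/* Release an id previously returned by utimer_id_alloc(). */
static void utimer_id_free(struct utimer_ids *ids, int id)
{
	assert(id >= 0 && id < MAX_UTIMERS);
	ids->bits[id / 64] &= ~(UINT64_C(1) << (id % 64));
}
```

Once all 128 ids are handed out, further allocations fail until one is freed, which is the guarantee the commit message describes.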
The maximum number of ticks available for the timer is 1 for the sake of simplicity of the userspace API. 'start', 'stop', 'open' and 'close' callbacks for the userspace-driven timers are empty since we don't really do any hardware initialization here. Suggested-by: Axel Holzinger Signed-off-by: Ivan Orlov Signed-off-by: Takashi Iwai Link: https://patch.msgid.link/20240813120701.171743-4-ivan.orlov0322@gmail.com --- include/uapi/sound/asound.h | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/include/uapi/sound/asound.h b/include/uapi/sound/asound.h index 8bf7e8a0eb6f..4cd513215bcd 100644 --- a/include/uapi/sound/asound.h +++ b/include/uapi/sound/asound.h @@ -869,7 +869,7 @@ struct snd_ump_block_info { * Timer section - /dev/snd/timer */ -#define SNDRV_TIMER_VERSION SNDRV_PROTOCOL_VERSION(2, 0, 7) +#define SNDRV_TIMER_VERSION SNDRV_PROTOCOL_VERSION(2, 0, 8) enum { SNDRV_TIMER_CLASS_NONE = -1, @@ -894,6 +894,7 @@ enum { #define SNDRV_TIMER_GLOBAL_RTC 1 /* unused */ #define SNDRV_TIMER_GLOBAL_HPET 2 #define SNDRV_TIMER_GLOBAL_HRTIMER 3 +#define SNDRV_TIMER_GLOBAL_UDRIVEN 4 /* info flags */ #define SNDRV_TIMER_FLG_SLAVE (1<<0) /* cannot be controlled */ @@ -974,6 +975,18 @@ struct snd_timer_status { }; #endif +/* + * This structure describes the userspace-driven timer. Such timers are purely virtual, + * and can only be triggered from software (for instance, by userspace application). + */ +struct snd_timer_uinfo { + /* To pretend being a normal timer, we need to know the resolution in ns.
*/ + __u64 resolution; + int fd; + unsigned int id; + unsigned char reserved[16]; +}; + #define SNDRV_TIMER_IOCTL_PVERSION _IOR('T', 0x00, int) #define SNDRV_TIMER_IOCTL_NEXT_DEVICE _IOWR('T', 0x01, struct snd_timer_id) #define SNDRV_TIMER_IOCTL_TREAD_OLD _IOW('T', 0x02, int) @@ -990,6 +1003,8 @@ struct snd_timer_status { #define SNDRV_TIMER_IOCTL_CONTINUE _IO('T', 0xa2) #define SNDRV_TIMER_IOCTL_PAUSE _IO('T', 0xa3) #define SNDRV_TIMER_IOCTL_TREAD64 _IOW('T', 0xa4, int) +#define SNDRV_TIMER_IOCTL_CREATE _IOWR('T', 0xa5, struct snd_timer_uinfo) +#define SNDRV_TIMER_IOCTL_TRIGGER _IO('T', 0xa6) #if __BITS_PER_LONG == 64 #define SNDRV_TIMER_IOCTL_TREAD SNDRV_TIMER_IOCTL_TREAD_OLD -- cgit v1.2.3 From 820a185896b77814557302b981b092a9e7b36814 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Wed, 24 Jul 2024 15:15:35 +0200 Subject: fcntl: add F_CREATED_QUERY Systemd has a helper called openat_report_new() that returns whether a file was created anew or it already existed before for cases where O_CREAT has to be used without O_EXCL (cf. [1]). That apparently isn't something that's specific to systemd but it's where I noticed it. The current logic is that it first attempts to open the file without O_CREAT | O_EXCL and if it gets ENOENT the helper tries again with both flags. If that succeeds all is well. If it now reports EEXIST it retries. That works fairly well but some corner cases make this more involved. If this operates on a dangling symlink the first openat() without O_CREAT | O_EXCL will return ENOENT but the second openat() with O_CREAT | O_EXCL will fail with EEXIST. The reason is that openat() without O_CREAT | O_EXCL follows the symlink while O_CREAT | O_EXCL doesn't for security reasons. So it's not something we can really change unless we add an explicit opt-in via O_FOLLOW which seems really ugly. The caller could try and use fanotify() to register to listen for creation events in the directory before calling openat(). 
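A minimal sketch of the retry strategy described above, written against plain POSIX openat(); the function name, the bool out-parameter, and the retry bound are hypothetical and are not systemd's code. The bound matters: with a dangling symlink the first openat() keeps failing with ENOENT and the second with EEXIST, so an unbounded loop would never terminate.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical bound on retries; not taken from systemd. */
#define OPEN_REPORT_NEW_MAX_TRIES 16

/* Open dirfd/path, creating it if missing, and report via *created whether
 * this call created it. Returns an fd, or -1 with errno set. */
static int open_report_new(int dirfd, const char *path, int flags,
			   mode_t mode, bool *created)
{
	for (int attempt = 0; attempt < OPEN_REPORT_NEW_MAX_TRIES; attempt++) {
		/* 1. Open an existing file; this follows symlinks. */
		int fd = openat(dirfd, path, flags & ~(O_CREAT | O_EXCL));
		if (fd >= 0) {
			*created = false;
			return fd;
		}
		if (errno != ENOENT)
			return -1;

		/* 2. Nothing there: create exclusively; with O_EXCL a
		 * symlink is not followed and yields EEXIST. */
		fd = openat(dirfd, path, flags | O_CREAT | O_EXCL, mode);
		if (fd >= 0) {
			*created = true;
			return fd;
		}
		if (errno != EEXIST)
			return -1;

		/* 3. Lost a race with another creator, or path is a
		 * dangling symlink: try again. */
	}
	return -1; /* errno is still EEXIST from the last attempt */
}
```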
The caller could then compare the returned tid to its own tid to ensure that even in threaded environments it actually created the file. That might work, but it is a lot of work for something that should be fairly simple and I'm uncertain about its reliability. The caller could use a bpf lsm hook to hook into security_file_open() to figure out whether they created the file. That also seems a bit wild. So let's add F_CREATED_QUERY which allows the caller to check whether they actually did create the file. That has caveats of course, but I don't think they are problematic: * In multi-threaded environments a thread can only be sure that it did create the file if it calls openat() with O_CREAT. In other words, it's obviously not enough to just go through its fdtable and check these fds because another thread could've created the file. * If there are any codepaths where an openat() with O_CREAT would yield the same struct file as that of another thread it would obviously cause wrong results. I'm not aware of any such codepaths from openat() itself. Imho, that would be a bug. * Related to the previous point, calling the new fcntl() on files created and opened via special-purpose system calls or ioctl()s would cause wrong results only if the affected subsystem a) raises FMODE_CREATED and b) may return the same struct file for two different calls. I'm not seeing anything outside of regular VFS code that raises FMODE_CREATED. There is code for b) in e.g., the drm layer where the same struct file is resurfaced but again FMODE_CREATED isn't used and it would be very misleading if it did.
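With F_CREATED_QUERY the same question can be answered after a single open. The sketch below assumes the fcntl() returns 1 when the file was created by that open and 0 otherwise, and that kernels without the command fail with EINVAL. The fallback definitions repeat the value from include/uapi/linux/fcntl.h (where F_LINUX_SPECIFIC_BASE is 1024) for libc headers that predate this patch; open_query_created() is a hypothetical helper.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks for libc headers that predate this patch; values as in
 * include/uapi/linux/fcntl.h. */
#ifndef F_LINUX_SPECIFIC_BASE
#define F_LINUX_SPECIFIC_BASE 1024
#endif
#ifndef F_CREATED_QUERY
#define F_CREATED_QUERY (F_LINUX_SPECIFIC_BASE + 4)
#endif

/* Open dirfd/path with O_CREAT (no O_EXCL) and ask the kernel whether this
 * open created the file. On success returns the fd and stores the fcntl()
 * result (assumed 1 or 0) in *created. On failure returns -1 with errno set;
 * kernels without F_CREATED_QUERY are assumed to fail with EINVAL. */
static int open_query_created(int dirfd, const char *path, int flags,
			      mode_t mode, int *created)
{
	int fd = openat(dirfd, path, (flags | O_CREAT) & ~O_EXCL, mode);
	if (fd < 0)
		return -1;

	int r = fcntl(fd, F_CREATED_QUERY, 0);
	if (r < 0) {
		int saved = errno;

		close(fd);
		errno = saved;
		return -1;
	}
	*created = r;
	return fd;
}
```

Per the caveats above, the answer is only meaningful for an fd that the calling thread obtained from its own openat() with O_CREAT.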
Link: https://github.com/systemd/systemd/blob/11d5e2b5fbf9f6bfa5763fd45b56829ad4f0777f/src/basic/fs-util.c#L1078 [1] Link: https://lore.kernel.org/r/20240724-work-fcntl-v1-1-e8153a2f1991@kernel.org Reviewed-by: Jeff Layton Reviewed-by: Jan Kara Signed-off-by: Christian Brauner --- include/uapi/linux/fcntl.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index c0bcc185fa48..e55a3314bcb0 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -16,6 +16,9 @@ #define F_DUPFD_QUERY (F_LINUX_SPECIFIC_BASE + 3) +/* Was the file just created? */ +#define F_CREATED_QUERY (F_LINUX_SPECIFIC_BASE + 4) + /* * Cancel a blocking posix lock; internal use only until we expose an * asynchronous lock api to userspace: -- cgit v1.2.3 From 0311507792b54069ac72e0a6c6b35c5d40aadad8 Mon Sep 17 00:00:00 2001 From: Deven Bowers Date: Fri, 2 Aug 2024 23:08:15 -0700 Subject: lsm: add IPE lsm Integrity Policy Enforcement (IPE) is an LSM that provides a complementary approach to Mandatory Access Control, compared with existing LSMs. Existing LSMs have centered around the concept that access to a resource should be controlled by the current user's credentials. IPE's approach is that access to a resource should be controlled by the system's trust of a current resource. The basis of this approach is defining a global policy to specify which resources can be trusted.
Signed-off-by: Deven Bowers Signed-off-by: Fan Wu [PM: subject line tweak] Signed-off-by: Paul Moore --- include/uapi/linux/lsm.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/lsm.h b/include/uapi/linux/lsm.h index 33d8c9f4aa6b..938593dfd5da 100644 --- a/include/uapi/linux/lsm.h +++ b/include/uapi/linux/lsm.h @@ -64,6 +64,7 @@ struct lsm_ctx { #define LSM_ID_LANDLOCK 110 #define LSM_ID_IMA 111 #define LSM_ID_EVM 112 +#define LSM_ID_IPE 113 /* * LSM_ATTR_XXX definitions identify different LSM attributes -- cgit v1.2.3 From d386d59b7c1a39112ca875327339ed519df2b96c Mon Sep 17 00:00:00 2001 From: Wen Gu Date: Wed, 14 Aug 2024 21:08:26 +0800 Subject: net/smc: introduce statistics for allocated ringbufs of link group Currently we have the statistics on sndbuf/RMB sizes of all connections that have ever been on the link group, namely smc_stats_memsize. However, these statistics are incremental, and since the ringbufs of a link group are allowed to be reused, they cannot tell us which buffers are actually allocated. So introduce a statistic on the actually allocated ringbufs of the link group; it is incremented when a new ringbuf is added to buf_list and decremented when it is deleted from buf_list.
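To see why cumulative size statistics cannot show what a link group currently holds once ringbufs are reused, compare a counter that grows on every use with one that follows buf_list additions and deletions. The struct, field names, and byte units below are hypothetical simplifications, not the SMC code or its netlink attributes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-link-group counters (illustrative only). */
struct lgr_buf_stats {
	uint64_t sndbuf_used_total; /* cumulative: grows on every use */
	uint64_t sndbuf_alloc;      /* bytes currently on buf_list */
};

/* A connection got a freshly allocated ringbuf, added to buf_list. */
static void lgr_buf_add_new(struct lgr_buf_stats *s, uint64_t size)
{
	s->sndbuf_used_total += size;
	s->sndbuf_alloc += size;
}

/* A connection reused a ringbuf already on buf_list: nothing new is
 * allocated, but the cumulative counter still grows. */
static void lgr_buf_reuse(struct lgr_buf_stats *s, uint64_t size)
{
	s->sndbuf_used_total += size;
}

/* A ringbuf was deleted from buf_list. */
static void lgr_buf_delete(struct lgr_buf_stats *s, uint64_t size)
{
	assert(s->sndbuf_alloc >= size);
	s->sndbuf_alloc -= size;
}
```

After one allocation and two reuses of a 64 KiB buffer, the cumulative counter reports 192 KiB while only 64 KiB is actually allocated; after the delete, the allocation counter drops back to zero.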
Signed-off-by: Wen Gu Signed-off-by: Paolo Abeni --- include/uapi/linux/smc.h | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/smc.h b/include/uapi/linux/smc.h index b531e3ef011a..9b74ef79070a 100644 --- a/include/uapi/linux/smc.h +++ b/include/uapi/linux/smc.h @@ -127,6 +127,8 @@ enum { SMC_NLA_LGR_R_NET_COOKIE, /* u64 */ SMC_NLA_LGR_R_PAD, /* flag */ SMC_NLA_LGR_R_BUF_TYPE, /* u8 */ + SMC_NLA_LGR_R_SNDBUF_ALLOC, /* uint */ + SMC_NLA_LGR_R_RMB_ALLOC, /* uint */ __SMC_NLA_LGR_R_MAX, SMC_NLA_LGR_R_MAX = __SMC_NLA_LGR_R_MAX - 1 }; @@ -162,6 +164,8 @@ enum { SMC_NLA_LGR_D_V2_COMMON, /* nest */ SMC_NLA_LGR_D_EXT_GID, /* u64 */ SMC_NLA_LGR_D_PEER_EXT_GID, /* u64 */ + SMC_NLA_LGR_D_SNDBUF_ALLOC, /* uint */ + SMC_NLA_LGR_D_DMB_ALLOC, /* uint */ __SMC_NLA_LGR_D_MAX, SMC_NLA_LGR_D_MAX = __SMC_NLA_LGR_D_MAX - 1 }; -- cgit v1.2.3 From e0d103542b06c36240e3887edfe49578464866eb Mon Sep 17 00:00:00 2001 From: Wen Gu Date: Wed, 14 Aug 2024 21:08:27 +0800 Subject: net/smc: introduce statistics for ringbufs usage of net namespace The buffer size histograms in smc_stats, namely rx/tx_rmbsize, record the sizes of ringbufs for all connections that have ever appeared in the net namespace. They are incremental, so the actual ringbuf usage cannot be determined from them. So introduce statistics into smc_stats for the current ringbuf usage of existing smc connections in the net namespace; they are incremented when a new connection uses a ringbuf and decremented when the ringbuf becomes unused.
Signed-off-by: Wen Gu Signed-off-by: Paolo Abeni --- include/uapi/linux/smc.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/smc.h b/include/uapi/linux/smc.h index 9b74ef79070a..1f58cb0c266b 100644 --- a/include/uapi/linux/smc.h +++ b/include/uapi/linux/smc.h @@ -253,6 +253,8 @@ enum { SMC_NLA_STATS_T_TX_BYTES, /* u64 */ SMC_NLA_STATS_T_RX_CNT, /* u64 */ SMC_NLA_STATS_T_TX_CNT, /* u64 */ + SMC_NLA_STATS_T_RX_RMB_USAGE, /* uint */ + SMC_NLA_STATS_T_TX_RMB_USAGE, /* uint */ __SMC_NLA_STATS_T_MAX, SMC_NLA_STATS_T_MAX = __SMC_NLA_STATS_T_MAX - 1 }; -- cgit v1.2.3 From 1fa3314c14c6a20d098991a0a6980f9b18b2f930 Mon Sep 17 00:00:00 2001 From: Ido Schimmel Date: Wed, 14 Aug 2024 15:52:24 +0300 Subject: ipv4: Centralize TOS matching The TOS field in the IPv4 flow information structure ('flowi4_tos') is matched by the kernel against the TOS selector in IPv4 rules and routes. The field is initialized differently by different call sites. Some treat it as DSCP (RFC 2474) and initialize all six DSCP bits, some treat it as RFC 1349 TOS and initialize it using RT_TOS() and some treat it as RFC 791 TOS and initialize it using IPTOS_RT_MASK. What is common to all these call sites is that they all initialize the lower three DSCP bits, which fits the TOS definition in the initial IPv4 specification (RFC 791). Therefore, the kernel only allows configuring IPv4 FIB rules that match on the lower three DSCP bits, which are always guaranteed to be initialized by all call sites: # ip -4 rule add tos 0x1c table 100 # ip -4 rule add tos 0x3c table 100 Error: Invalid tos. While this works, it is unlikely to be very useful. RFC 791, which initially defined the TOS and IP precedence fields, was updated over twenty-five years ago by RFC 2474, which replaced these fields with a single six-bit DSCP field.
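The rule-side restriction shown in the `ip -4 rule` examples, together with the masking this commit applies to 'flowi4_tos', can be sketched in plain C. This is a simplified model, not the kernel's code: the kernel masks the flow field with RT_TOS() inside its own helper, while the hypothetical functions below use an illustrative LOWER_DSCP_BITS constant (0x1c, the lower three DSCP bits in their DS-byte position).

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative constant, not taken from any header: the lower three DSCP
 * bits as they sit in the DS byte (DSCP = bits 7..2, ECN = bits 1..0). */
#define LOWER_DSCP_BITS 0x1c

/* Rule side: reject a "tos" selector that sets any bit outside the lower
 * three DSCP bits (0x1c is accepted, 0x3c is rejected). */
bool tos_selector_valid(uint8_t rule_tos)
{
	return (rule_tos & ~LOWER_DSCP_BITS) == 0;
}

/* Match side: mask the flow's DS field down to the same bits before the
 * comparison, so upper DSCP bits initialized by a caller cannot change
 * whether a TOS selector matches. */
bool tos_selector_matches(uint8_t flow_ds, uint8_t rule_tos)
{
	return (flow_ds & LOWER_DSCP_BITS) == rule_tos;
}
```

With this masking, a forwarded packet whose DS field is 0xfc still matches a rule configured with 'tos 0x1c', which is the existing behavior the commit sets out to preserve once call sites start initializing all six DSCP bits.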
Extending FIB rules to match on DSCP can be done by adding a new DSCP selector while maintaining the existing semantics of the TOS selector for applications that rely on that. A prerequisite for allowing FIB rules to match on DSCP is to adjust all the call sites to initialize the high order DSCP bits and remove their masking along the path to the core where the field is matched on. However, making this change alone will result in a behavior change. For example, a forwarded IPv4 packet with a DS field of 0xfc will no longer match a FIB rule that was configured with 'tos 0x1c'. This behavior change can be avoided by masking the upper three DSCP bits in 'flowi4_tos' before comparing it against the TOS selectors in FIB rules and routes. Implement the above by adding a new function that checks whether a given DSCP value matches the one specified in the IPv4 flow information structure and invoke it from the three places that currently match on 'flowi4_tos'. Use RT_TOS() for the masking of 'flowi4_tos' instead of IPTOS_RT_MASK since the latter is not uAPI and we should be able to remove it at some point. Include in since the former defines IPTOS_TOS_MASK which is used in the definition of RT_TOS() in . No regressions in FIB tests: # ./fib_tests.sh [...] Tests passed: 218 Tests failed: 0 And FIB rule tests: # ./fib_rule_tests.sh [...] 
Tests passed: 116 Tests failed: 0 Signed-off-by: Ido Schimmel Signed-off-by: Paolo Abeni --- include/uapi/linux/in_route.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/in_route.h b/include/uapi/linux/in_route.h index 0cc2c23b47f8..10bdd7e7107f 100644 --- a/include/uapi/linux/in_route.h +++ b/include/uapi/linux/in_route.h @@ -2,6 +2,8 @@ #ifndef _LINUX_IN_ROUTE_H #define _LINUX_IN_ROUTE_H +#include + /* IPv4 routing cache flags */ #define RTCF_DEAD RTNH_F_DEAD -- cgit v1.2.3 From f44554b5067b36c14cc91ed811fa1bd58baed34a Mon Sep 17 00:00:00 2001 From: Deven Bowers Date: Fri, 2 Aug 2024 23:08:23 -0700 Subject: audit,ipe: add IPE auditing support Users of IPE require a way to identify when and why an operation fails, allowing them to both respond to violations of policy and be notified of potentially malicious actions on their systems with respect to IPE itself. This patch introduces 3 new audit events. AUDIT_IPE_ACCESS(1420) indicates the result of an IPE policy evaluation of a resource. AUDIT_IPE_CONFIG_CHANGE(1421) indicates the current active IPE policy has been changed to another loaded policy. AUDIT_IPE_POLICY_LOAD(1422) indicates a new IPE policy has been loaded into the kernel. This patch also adds support for success auditing, allowing users to identify why an allow decision was made for a resource. However, it is recommended to use this option with caution, as it is quite noisy. 
Here are some examples of the new audit record types: AUDIT_IPE_ACCESS(1420): audit: AUDIT1420 ipe_op=EXECUTE ipe_hook=BPRM_CHECK enforcing=1 pid=297 comm="sh" path="/root/vol/bin/hello" dev="tmpfs" ino=3897 rule="op=EXECUTE boot_verified=TRUE action=ALLOW" audit: AUDIT1420 ipe_op=EXECUTE ipe_hook=BPRM_CHECK enforcing=1 pid=299 comm="sh" path="/mnt/ipe/bin/hello" dev="dm-0" ino=2 rule="DEFAULT action=DENY" audit: AUDIT1420 ipe_op=EXECUTE ipe_hook=BPRM_CHECK enforcing=1 pid=300 path="/tmp/tmpdp2h1lub/deny/bin/hello" dev="tmpfs" ino=131 rule="DEFAULT action=DENY" The above three records were generated when the active IPE policy only allowed binaries from the initramfs to run. Three identical `hello` binaries were placed at different locations, and only the first one, from the rootfs (initramfs), was allowed to run. Field ipe_op is followed by the name of the IPE operation associated with the log. Field ipe_hook is followed by the name of the LSM hook that triggered the IPE event. Field enforcing is followed by the enforcement state of IPE (introduced in the next commit). Field pid is followed by the pid of the process that triggered the IPE event. Field comm is followed by the command name of the process that triggered the IPE event. Field path is followed by the file's path name. Field dev is followed by the name, as found in /dev, of the device the file is on. Note that for device-mapper devices the name `dm-X` is used instead of the name in /dev/mapper. For a file on a temporary file system, which does not come from a device, the field is `tmpfs`. This part of the implementation follows the existing LSM_AUDIT_DATA_INODE case in security/lsm_audit.c. Field ino is followed by the file's inode number. Field rule is followed by the IPE rule that made the access decision. The whole rule must be audited because the decision is based on the combination of all property conditions in the rule. Together with the SYSCALL audit record, users can determine why an operation was blocked.
For example: audit: AUDIT1420 ipe_op=EXECUTE ipe_hook=BPRM_CHECK enforcing=1 pid=2138 comm="bash" path="/mnt/ipe/bin/hello" dev="dm-0" ino=2 rule="DEFAULT action=DENY" audit[1956]: SYSCALL arch=c000003e syscall=59 success=no exit=-13 a0=556790138df0 a1=556790135390 a2=5567901338b0 a3=ab2a41a67f4f1f4e items=1 ppid=147 pid=1956 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=4294967295 comm="bash" exe="/usr/bin/bash" key=(null) The above two records show that bash used execve to run "hello" and was blocked by IPE. Note that the IPE record always precedes the SYSCALL record. AUDIT_IPE_CONFIG_CHANGE(1421): audit: AUDIT1421 old_active_pol_name="Allow_All" old_active_pol_version=0.0.0 old_policy_digest=sha256:E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649 new_active_pol_name="boot_verified" new_active_pol_version=0.0.0 new_policy_digest=sha256:820EEA5B40CA42B51F68962354BA083122A20BB846F auid=4294967295 ses=4294967295 lsm=ipe res=1 The above record shows the active IPE policy switching from `Allow_All` to `boot_verified`, along with the version and hash digest of both policies. Note that IPE can only have one active policy at a time; all access decisions are evaluated against the current active policy. The normal procedure to deploy a policy is to load it into the kernel first and then switch the active policy to it. AUDIT_IPE_POLICY_LOAD(1422): audit: AUDIT1422 policy_name="boot_verified" policy_version=0.0.0 policy_digest=sha256:820EEA5B40CA42B51F68962354BA083122A20BB846F2676 auid=4294967295 ses=4294967295 lsm=ipe res=1 The above record shows that a new policy has been loaded into the kernel, with its name, version and hash.
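Because these records use a key=value layout, tooling can pull individual fields out of them. Below is a hedged sketch of such an extractor in C. `audit_field()` is a hypothetical helper, not part of libaudit or the kernel, and it assumes space-separated fields and quoted values without embedded quotes. It skips bare words such as `AUDIT1420` and deliberately does not match keys that occur only inside a quoted value (for example `action` inside `rule="..."`).

```c
#include <stddef.h>
#include <string.h>

/* Copy the value of `key` from an audit record into out (NUL-terminated).
 * Quoted values are returned without their quotes. Returns 0 on success,
 * -1 if the key is absent, a quote is unterminated, or out is too small. */
int audit_field(const char *rec, const char *key, char *out, size_t outsz)
{
	size_t klen = strlen(key);
	const char *p = rec;

	while (*p) {
		const char *k, *eq, *v, *end;
		size_t n;

		while (*p == ' ')                /* skip field separators */
			p++;
		k = p;
		eq = k + strcspn(k, "= ");       /* end of key or bare word */
		if (*eq != '=') {                /* bare word like "AUDIT1420" */
			p = eq;
			continue;
		}
		v = eq + 1;
		if (*v == '"') {                 /* quoted value: up to closing quote */
			v++;
			end = strchr(v, '"');
			if (!end)
				return -1;
			p = end + 1;
		} else {                         /* plain value: up to next space */
			end = v + strcspn(v, " ");
			p = end;
		}
		if ((size_t)(eq - k) == klen && strncmp(k, key, klen) == 0) {
			n = (size_t)(end - v);
			if (n >= outsz)
				return -1;
			memcpy(out, v, n);
			out[n] = '\0';
			return 0;
		}
	}
	return -1;
}
```

Applied to the AUDIT1420 record above, asking for `rule` yields `DEFAULT action=DENY` and asking for `dev` yields `dm-0`.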
Signed-off-by: Deven Bowers Signed-off-by: Fan Wu [PM: subject line tweak] Signed-off-by: Paul Moore --- include/uapi/linux/audit.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h index d676ed2b246e..75e21a135483 100644 --- a/include/uapi/linux/audit.h +++ b/include/uapi/linux/audit.h @@ -143,6 +143,9 @@ #define AUDIT_MAC_UNLBL_STCDEL 1417 /* NetLabel: del a static label */ #define AUDIT_MAC_CALIPSO_ADD 1418 /* NetLabel: add CALIPSO DOI entry */ #define AUDIT_MAC_CALIPSO_DEL 1419 /* NetLabel: del CALIPSO DOI entry */ +#define AUDIT_IPE_ACCESS 1420 /* IPE denial or grant */ +#define AUDIT_IPE_CONFIG_CHANGE 1421 /* IPE config change */ +#define AUDIT_IPE_POLICY_LOAD 1422 /* IPE policy load */ #define AUDIT_FIRST_KERN_ANOM_MSG 1700 #define AUDIT_LAST_KERN_ANOM_MSG 1799 -- cgit v1.2.3 From 273f8c142003dfd874d45b1a60965809e95ccd50 Mon Sep 17 00:00:00 2001 From: Justin Iurman Date: Sat, 17 Aug 2024 15:18:18 +0200 Subject: net: ipv6: ioam6: new feature tunsrc This patch provides a new feature (i.e., "tunsrc") for the tunnel (i.e., "encap") mode of ioam6, just like seg6 already does, except that it is attached to a route. "tunsrc" is optional: when it is not provided (the default), automatic resolution is applied. Using "tunsrc" when possible has a benefit: performance. See the comparison: - before (= "encap" mode): https://ibb.co/bNCzvf7 - after (= "encap" mode with "tunsrc"): https://ibb.co/PT8L6yq Signed-off-by: Justin Iurman Signed-off-by: Paolo Abeni --- include/uapi/linux/ioam6_iptunnel.h | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/ioam6_iptunnel.h b/include/uapi/linux/ioam6_iptunnel.h index 38f6a8fdfd34..8aef21e4a8c1 100644 --- a/include/uapi/linux/ioam6_iptunnel.h +++ b/include/uapi/linux/ioam6_iptunnel.h @@ -50,6 +50,12 @@ enum { IOAM6_IPTUNNEL_FREQ_K, /* u32 */ IOAM6_IPTUNNEL_FREQ_N, /* u32 */ + /* Tunnel src address.
+ * For encap,auto modes. + * Optional (automatic if not provided). + */ + IOAM6_IPTUNNEL_SRC, /* struct in6_addr */ + + __IOAM6_IPTUNNEL_MAX, }; -- cgit v1.2.3 From a139c98f760efa1b6f0ee2f36ea6f62f04c8b20a Mon Sep 17 00:00:00 2001 From: Chris Wulff Date: Sat, 17 Aug 2024 10:28:51 -0400 Subject: USB: gadget: f_hid: Add GET_REPORT via userspace IOCTL While support for the GET_REPORT request is mandatory per the HID specification, the current implementation responds to the USB Host with an empty reply of the requested length. However, some USB Hosts will request the contents of feature reports via the GET_REPORT request. In addition, some proprietary HID 'protocols' expect different data, for the same report ID, to become available in the feature report after the host sends a preceding SET_REPORT to the USB Device that defines what data is to be presented when that feature report is subsequently retrieved via GET_REPORT (with a very fast turnaround of < 5ms between the SET_REPORT and the GET_REPORT). There are two other patch sets already submitted for adding GET_REPORT support. The first [1] allows pre-priming a list of reports via IOCTLs, which the USB Host can then request, with no further userspace interaction possible during the GET_REPORT request. The second [2] allows a single report to be set up by userspace via IOCTL, which is fetched and returned by the kernel for subsequent GET_REPORT requests by the USB Host, also with no further userspace interaction possible. This patch, while loosely based on both patch sets, differs by giving userspace the option to respond to each GET_REPORT request, using poll to notify userspace that a new GET_REPORT request has arrived. To support this, two extra IOCTLs are supplied. The first is used to retrieve the report ID of the GET_REPORT request (in the case of having non-zero report IDs in the HID descriptor).
The second IOCTL stores report responses in a list used to answer requests; an entry is added if it does not exist yet, or updated if it already exists. A flag (userspace_req) controls whether subsequent requests notify userspace. Basic operation when a GET_REPORT request arrives from the USB Host:
- If the report ID exists in the list and is set for immediate return (i.e. userspace_req == false), the response is sent immediately and userspace is not notified.
- If the report ID does not exist, or exists but is set to notify userspace (i.e. userspace_req == true), userspace is notified via poll:
  - If userspace responds, it adds or updates the response in the list, and the contents are sent to the host.
  - If userspace does not respond within the fixed timeout (2500ms) but the report has been set previously, the 'old' report contents are sent.
  - If userspace does not respond within the fixed timeout (2500ms) and the report does not exist in the list, an empty report is sent.
Note that userspace can 'prime' the report list at any other time. While this patch allows flexibility in how the system responds to requests, and therefore in the HID 'protocols' that can be supported, a drawback is the time it takes to service the requests and therefore the maximum achievable throughput. The USB HID Specification v1.11 itself states that GET_REPORT is not intended for periodic data polling, so this limitation is not severe. Testing on an iMX8M Nano Ultra Lite under heavy multi-core CPU load showed that userspace can typically respond to the GET_REPORT request within 1200ms, which is well within the 5000ms most operating systems seem to allow, and within the 2500ms set by this patch.
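The basic-operation rules for an incoming GET_REPORT reduce to a small decision function. This sketch is illustrative only: the enum and `get_report_decide()` are hypothetical names, and the three booleans stand in for the list lookup, the entry's userspace_req flag, and whether userspace answered before the 2500ms timeout. The real f_hid driver implements this with its own locking, wait queue and timer.

```c
#include <stdbool.h>

/* What the gadget sends back for one GET_REPORT request (hypothetical). */
enum get_report_reply {
	REPLY_STORED,          /* stored entry sent at once, userspace not notified */
	REPLY_FROM_USERSPACE,  /* report userspace added or updated in time */
	REPLY_OLD_STORED,      /* userspace timed out; previously set report sent */
	REPLY_EMPTY,           /* userspace timed out; nothing stored, empty report */
};

/* in_list:       the report ID already has an entry in the list
 * userspace_req: that entry asks for userspace to be notified
 * answered:      userspace wrote a report before the 2500ms timeout */
enum get_report_reply get_report_decide(bool in_list, bool userspace_req,
					bool answered)
{
	if (in_list && !userspace_req)
		return REPLY_STORED;           /* immediate return path */

	/* Otherwise userspace is woken via poll() and given until the timeout. */
	if (answered)
		return REPLY_FROM_USERSPACE;
	return in_list ? REPLY_OLD_STORED : REPLY_EMPTY;
}
```

Keeping the timeout fallback separate from the immediate path mirrors the list above: a previously set report is only ever reused when userspace fails to answer.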
[1] https://lore.kernel.org/all/20220805070507.123151-2-sunil@amarulasolutions.com/ [2] https://lore.kernel.org/all/20220726005824.2817646-1-vi@endrift.com/ Signed-off-by: David Sands Signed-off-by: Chris Wulff Link: https://lore.kernel.org/r/20240817142850.1311460-2-crwulff@gmail.com Signed-off-by: Greg Kroah-Hartman --- include/uapi/linux/usb/g_hid.h | 40 +++++++++++++++++++++++++++++++++++++++ include/uapi/linux/usb/gadgetfs.h | 2 +- 2 files changed, 41 insertions(+), 1 deletion(-) create mode 100644 include/uapi/linux/usb/g_hid.h (limited to 'include/uapi') diff --git a/include/uapi/linux/usb/g_hid.h b/include/uapi/linux/usb/g_hid.h new file mode 100644 index 000000000000..b965092db476 --- /dev/null +++ b/include/uapi/linux/usb/g_hid.h @@ -0,0 +1,40 @@ +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */ + +#ifndef __UAPI_LINUX_USB_G_HID_H +#define __UAPI_LINUX_USB_G_HID_H + +#include + +/* Maximum HID report length for High-Speed USB (i.e. USB 2.0) */ +#define MAX_REPORT_LENGTH 64 + +/** + * struct usb_hidg_report - response to GET_REPORT + * @report_id: report ID that this is a response for + * @userspace_req: + * !0 this report is used for any pending GET_REPORT request + * but wait on userspace to issue a new report on future requests + * 0 this report is to be used for any future GET_REPORT requests + * @length: length of the report response + * @data: report response + * @padding: padding for 32/64 bit compatibility + * + * Structure used by GADGET_HID_WRITE_GET_REPORT ioctl on /dev/hidg*. + */ +struct usb_hidg_report { + __u8 report_id; + __u8 userspace_req; + __u16 length; + __u8 data[MAX_REPORT_LENGTH]; + __u8 padding[4]; +}; + +/* The 'g' code is used by gadgetfs and hid gadget ioctl requests. + * Don't add any colliding codes to either driver, and keep + * them in unique ranges. 
+ */ + +#define GADGET_HID_READ_GET_REPORT_ID _IOR('g', 0x41, __u8) +#define GADGET_HID_WRITE_GET_REPORT _IOW('g', 0x42, struct usb_hidg_report) + +#endif /* __UAPI_LINUX_USB_G_HID_H */ diff --git a/include/uapi/linux/usb/gadgetfs.h b/include/uapi/linux/usb/gadgetfs.h index 835473910a49..9754822b2a40 100644 --- a/include/uapi/linux/usb/gadgetfs.h +++ b/include/uapi/linux/usb/gadgetfs.h @@ -62,7 +62,7 @@ struct usb_gadgetfs_event { }; -/* The 'g' code is also used by printer gadget ioctl requests. +/* The 'g' code is also used by printer and hid gadget ioctl requests. * Don't add any colliding codes to either driver, and keep * them in unique ranges (size 0x20 for now). */ -- cgit v1.2.3 From 5ac86f0ed04bce41242167ffa12ad92038788a95 Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Wed, 10 Jul 2024 16:15:55 -0700 Subject: virt: vbox: struct vmmdev_hgcm_pagelist: Replace 1-element array with flexible array Replace the deprecated[1] use of a 1-element array in struct vmmdev_hgcm_pagelist with a modern flexible array. As this is UAPI, we cannot trivially change the size of the struct, so use a union to retain the old first element's size, but switch "pages" to a flexible array. No binary differences are present after this conversion. Link: https://github.com/KSPP/linux/issues/79 [1] Reviewed-by: Gustavo A. R. Silva Link: https://lore.kernel.org/r/20240710231555.work.406-kees@kernel.org Signed-off-by: Kees Cook --- include/uapi/linux/vbox_vmmdev_types.h | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/vbox_vmmdev_types.h b/include/uapi/linux/vbox_vmmdev_types.h index f8a8d6b3c521..6073858d52a2 100644 --- a/include/uapi/linux/vbox_vmmdev_types.h +++ b/include/uapi/linux/vbox_vmmdev_types.h @@ -282,7 +282,10 @@ struct vmmdev_hgcm_pagelist { __u32 flags; /** VMMDEV_HGCM_F_PARM_*. */ __u16 offset_first_page; /** Data offset in the first page. */ __u16 page_count; /** Number of pages. 
*/ - __u64 pages[1]; /** Page addresses. */ + union { + __u64 unused; /** Deprecated place-holder for first "pages" entry. */ + __DECLARE_FLEX_ARRAY(__u64, pages); /** Page addresses. */ + }; }; VMMDEV_ASSERT_SIZE(vmmdev_hgcm_pagelist, 4 + 2 + 2 + 8); -- cgit v1.2.3 From 3849687869092094003ba009dc00e2e0237a3b8a Mon Sep 17 00:00:00 2001 From: Maxime Chevallier Date: Wed, 21 Aug 2024 17:09:55 +0200 Subject: net: phy: Introduce ethernet link topology representation Link topologies containing multiple network PHYs attached to the same net_device can be found when using a PHY as a media converter for use with an SFP connector, on which an SFP transceiver containing a PHY can be used. With the current model, the transceiver's PHY can't be used for operations such as cable testing, timestamping, macsec offload, etc. This is because most of the logic for these configurations, coming from either ethtool netlink or ioctls, tends to use netdev->phydev, which in multi-PHY systems references the PHY closest to the MAC. Introduce a numbering scheme that allows enumerating the PHY devices belonging to any netdev, which in turn lets userspace make more precise decisions with regard to each PHY's configuration. The numbering is maintained per-netdev, in a phy_device_list. The numbering works similarly to a netdevice's ifindex, with identifiers that are only recycled once INT_MAX has been reached. This prevents races that could occur between PHY listing and SFP transceiver removal/insertion. The identifiers are assigned at phy_attach time, as the numbering depends on the netdevice the PHY is attached to. The PHY index can be re-used for PHYs that are persistent. Signed-off-by: Maxime Chevallier Reviewed-by: Christophe Leroy Tested-by: Christophe Leroy Signed-off-by: David S.
Miller --- include/uapi/linux/ethtool.h | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h index 4a0a6e703483..c405ed63acfa 100644 --- a/include/uapi/linux/ethtool.h +++ b/include/uapi/linux/ethtool.h @@ -2533,4 +2533,20 @@ struct ethtool_link_settings { * __u32 map_lp_advertising[link_mode_masks_nwords]; */ }; + +/** + * enum phy_upstream - Represents the upstream component a given PHY device + * is connected to, as in what is on the other end of the MII bus. Most PHYs + * will be attached to an Ethernet MAC controller, but in some cases, there's + * an intermediate PHY used as a media-converter, which will driver another + * MII interface as its output. + * @PHY_UPSTREAM_MAC: Upstream component is a MAC (a switch port, + * or ethernet controller) + * @PHY_UPSTREAM_PHY: Upstream component is a PHY (likely a media converter) + */ +enum phy_upstream { + PHY_UPSTREAM_MAC, + PHY_UPSTREAM_PHY, +}; + #endif /* _UAPI_LINUX_ETHTOOL_H */ -- cgit v1.2.3 From c15e065b46dc4e19837275b826c1960d55564abd Mon Sep 17 00:00:00 2001 From: Maxime Chevallier Date: Wed, 21 Aug 2024 17:09:59 +0200 Subject: net: ethtool: Allow passing a phy index for some commands Some netlink commands are targeted at Ethernet PHYs, to control some of their features. As there are several such commands, add the ability to pass a PHY index in the ethnl request, which will populate the generic ethnl_req_info with the passed phy_index. Add a helper that netlink command handlers need to use to grab the targeted PHY from the req_info. This helper needs to hold rtnl_lock() while interacting with the PHY, as it may be removed at any point. Signed-off-by: Maxime Chevallier Reviewed-by: Christophe Leroy Tested-by: Christophe Leroy Signed-off-by: David S.
Miller --- include/uapi/linux/ethtool_netlink.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/ethtool_netlink.h b/include/uapi/linux/ethtool_netlink.h index 9074fa309bd6..49d1f9220fde 100644 --- a/include/uapi/linux/ethtool_netlink.h +++ b/include/uapi/linux/ethtool_netlink.h @@ -134,6 +134,7 @@ enum { ETHTOOL_A_HEADER_DEV_INDEX, /* u32 */ ETHTOOL_A_HEADER_DEV_NAME, /* string */ ETHTOOL_A_HEADER_FLAGS, /* u32 - ETHTOOL_FLAG_* */ + ETHTOOL_A_HEADER_PHY_INDEX, /* u32 */ /* add new constants above here */ __ETHTOOL_A_HEADER_CNT, -- cgit v1.2.3 From 17194be4c8e1e82d8b484e58cdcb495c0714d1fd Mon Sep 17 00:00:00 2001 From: Maxime Chevallier Date: Wed, 21 Aug 2024 17:10:01 +0200 Subject: net: ethtool: Introduce a command to list PHYs on an interface As we have the ability to track the PHYs connected to a net_device through the link_topology, we can expose this list to userspace. This allows userspace to use these identifiers for PHY-specific commands and to decide which PHY to target based on the link topology. Add PHY_GET and PHY_DUMP; the DUMP operation can be filtered to list the devices on only one interface. Signed-off-by: Maxime Chevallier Reviewed-by: Christophe Leroy Tested-by: Christophe Leroy Signed-off-by: David S.
Miller --- include/uapi/linux/ethtool_netlink.h | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/ethtool_netlink.h b/include/uapi/linux/ethtool_netlink.h index 49d1f9220fde..45d8bcdea056 100644 --- a/include/uapi/linux/ethtool_netlink.h +++ b/include/uapi/linux/ethtool_netlink.h @@ -58,6 +58,7 @@ enum { ETHTOOL_MSG_MM_GET, ETHTOOL_MSG_MM_SET, ETHTOOL_MSG_MODULE_FW_FLASH_ACT, + ETHTOOL_MSG_PHY_GET, /* add new constants above here */ __ETHTOOL_MSG_USER_CNT, @@ -111,6 +112,8 @@ enum { ETHTOOL_MSG_MM_GET_REPLY, ETHTOOL_MSG_MM_NTF, ETHTOOL_MSG_MODULE_FW_FLASH_NTF, + ETHTOOL_MSG_PHY_GET_REPLY, + ETHTOOL_MSG_PHY_NTF, /* add new constants above here */ __ETHTOOL_MSG_KERNEL_CNT, @@ -1055,6 +1058,22 @@ enum { ETHTOOL_A_MODULE_FW_FLASH_MAX = (__ETHTOOL_A_MODULE_FW_FLASH_CNT - 1) }; +enum { + ETHTOOL_A_PHY_UNSPEC, + ETHTOOL_A_PHY_HEADER, /* nest - _A_HEADER_* */ + ETHTOOL_A_PHY_INDEX, /* u32 */ + ETHTOOL_A_PHY_DRVNAME, /* string */ + ETHTOOL_A_PHY_NAME, /* string */ + ETHTOOL_A_PHY_UPSTREAM_TYPE, /* u32 */ + ETHTOOL_A_PHY_UPSTREAM_INDEX, /* u32 */ + ETHTOOL_A_PHY_UPSTREAM_SFP_NAME, /* string */ + ETHTOOL_A_PHY_DOWNSTREAM_SFP_NAME, /* string */ + + /* add new constants above here */ + __ETHTOOL_A_PHY_CNT, + ETHTOOL_A_PHY_MAX = (__ETHTOOL_A_PHY_CNT - 1) +}; + /* generic netlink info */ #define ETHTOOL_GENL_NAME "ethtool" #define ETHTOOL_GENL_VERSION 1 -- cgit v1.2.3 From b0966c724584a5a9fd7fb529de19807c31f27a45 Mon Sep 17 00:00:00 2001 From: Dave Marchevsky Date: Tue, 13 Aug 2024 21:24:23 +0000 Subject: bpf: Support bpf_kptr_xchg into local kptr Currently, users can only stash kptr into map values with bpf_kptr_xchg(). This patch further supports stashing kptr into local kptr by adding local kptr as a valid destination type. When stashing into local kptr, btf_record in program BTF is used instead of btf_record in map to search for the btf_field of the local kptr. 
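The exchange semantics documented for bpf_kptr_xchg() (store the new pointer, return the old one, and hand ownership of the old value to the caller) can be modelled in userspace with C11 atomic_exchange(). This is an analogy, not BPF code: a real BPF program calls the helper on a map value or local kptr field, and the verifier tracks the references. `struct obj` and `kptr_xchg_model()` are hypothetical names.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Stand-in for a referenced object stored in a kptr field. */
struct obj {
	int id;
};

/* Atomically store new_ptr (which may be NULL) in *slot and return the
 * previous pointer. After the call, the slot owns new_ptr and the caller
 * owns the returned pointer (if not NULL) and must release it. */
struct obj *kptr_xchg_model(_Atomic(struct obj *) *slot, struct obj *new_ptr)
{
	return atomic_exchange(slot, new_ptr);
}
```

Passing NULL is how a program takes a stashed pointer back out of a slot, whether that slot lives in a map value or, with this patch, in a local kptr.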
The local kptr specific checks in check_reg_type() only apply when the source argument of bpf_kptr_xchg() is local kptr. Therefore, we make the scope of the check explicit as the destination now can also be local kptr. Acked-by: Martin KaFai Lau Signed-off-by: Dave Marchevsky Signed-off-by: Amery Hung Link: https://lore.kernel.org/r/20240813212424.2871455-5-amery.hung@bytedance.com Signed-off-by: Alexei Starovoitov --- include/uapi/linux/bpf.h | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 35bcf52dbc65..e2629457d72d 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -5519,11 +5519,12 @@ union bpf_attr { * **-EOPNOTSUPP** if the hash calculation failed or **-EINVAL** if * invalid arguments are passed. * - * void *bpf_kptr_xchg(void *map_value, void *ptr) + * void *bpf_kptr_xchg(void *dst, void *ptr) * Description - * Exchange kptr at pointer *map_value* with *ptr*, and return the - * old value. *ptr* can be NULL, otherwise it must be a referenced - * pointer which will be released when this helper is called. + * Exchange kptr at pointer *dst* with *ptr*, and return the old value. + * *dst* can be map value or local kptr. *ptr* can be NULL, otherwise + * it must be a referenced pointer which will be released when this helper + * is called. * Return * The old value of kptr (which can be NULL). The returned pointer * if not NULL, is a reference which must be released using its -- cgit v1.2.3 From 65ab5ac4df012388481d0414fcac1d5ac1721fb3 Mon Sep 17 00:00:00 2001 From: Jordan Rome Date: Fri, 23 Aug 2024 12:51:00 -0700 Subject: bpf: Add bpf_copy_from_user_str kfunc This adds a kfunc wrapper around strncpy_from_user, which can be called from sleepable BPF programs. 
This matches the non-sleepable 'bpf_probe_read_user_str' helper except it includes an additional 'flags' param, which allows consumers to clear the entire destination buffer on success or failure. Signed-off-by: Jordan Rome Link: https://lore.kernel.org/r/20240823195101.3621028-1-linux@jordanrome.com Signed-off-by: Alexei Starovoitov --- include/uapi/linux/bpf.h | 9 +++++++++ 1 file changed, 9 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index e2629457d72d..c3a5728db115 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -7513,4 +7513,13 @@ struct bpf_iter_num { __u64 __opaque[1]; } __attribute__((aligned(8))); +/* + * Flags to control BPF kfunc behaviour. + * - BPF_F_PAD_ZEROS: Pad destination buffer with zeros. (See the respective + * helper documentation for details.) + */ +enum bpf_kfunc_flags { + BPF_F_PAD_ZEROS = (1ULL << 0), +}; + #endif /* _UAPI__LINUX_BPF_H__ */ -- cgit v1.2.3 From d29cb3726f03cdac7889f0109a7cb84f79e168a8 Mon Sep 17 00:00:00 2001 From: Pavel Begunkov Date: Wed, 7 Aug 2024 15:18:13 +0100 Subject: io_uring: add absolute mode wait timeouts In addition to current relative timeouts for the waiting loop, where the timespec argument specifies the maximum time it can wait for, add support for the absolute mode, with the value carrying a CLOCK_MONOTONIC absolute time until which we should return control back to the user. 
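With the absolute mode, the caller computes the deadline itself instead of passing a duration. A minimal sketch, assuming the deadline is derived from a relative timeout on CLOCK_MONOTONIC (the clock this commit names). `deadline_from_relative()` and `monotonic_deadline()` are hypothetical helpers, and handing the result to io_uring_enter(2) (via the timespec that the `ts` field of struct io_uring_getevents_arg points to, with IORING_ENTER_EXT_ARG and IORING_ENTER_ABS_TIMER) is left out of the sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* deadline = now + rel, with tv_nsec normalised into [0, 1e9).
 * Both inputs are assumed to be normalised already. */
struct timespec deadline_from_relative(struct timespec now, struct timespec rel)
{
	struct timespec d;

	d.tv_sec = now.tv_sec + rel.tv_sec;
	d.tv_nsec = now.tv_nsec + rel.tv_nsec;
	if (d.tv_nsec >= 1000000000L) {   /* carry into seconds */
		d.tv_sec += 1;
		d.tv_nsec -= 1000000000L;
	}
	return d;
}

/* Absolute CLOCK_MONOTONIC deadline `rel` from now; 0 on success. */
int monotonic_deadline(struct timespec rel, struct timespec *out)
{
	struct timespec now;

	if (clock_gettime(CLOCK_MONOTONIC, &now) != 0)
		return -1;
	*out = deadline_from_relative(now, rel);
	return 0;
}
```

One benefit of an absolute deadline is that a wait loop which gets interrupted and retried can reuse the same deadline rather than recomputing a shrinking relative timeout.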
Suggested-by: Lewis Baker Signed-off-by: Pavel Begunkov Link: https://lore.kernel.org/r/4d5b74d67ada882590b2e42aa3aa7117bbf6b55f.1723039801.git.asml.silence@gmail.com Signed-off-by: Jens Axboe --- include/uapi/linux/io_uring.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index adc2524fd8e3..6a81f55fcd0d 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -507,6 +507,7 @@ struct io_cqring_offsets { #define IORING_ENTER_SQ_WAIT (1U << 2) #define IORING_ENTER_EXT_ARG (1U << 3) #define IORING_ENTER_REGISTERED_RING (1U << 4) +#define IORING_ENTER_ABS_TIMER (1U << 5) /* * Passed in for io_uring_setup(2). Copied back with updated info on success -- cgit v1.2.3 From 2b8e976b984278edbeab3251d370e76d237699f9 Mon Sep 17 00:00:00 2001 From: Pavel Begunkov Date: Wed, 7 Aug 2024 15:18:14 +0100 Subject: io_uring: user registered clockid for wait timeouts Add a new registration opcode IORING_REGISTER_CLOCK, which allows the user to select which clock id it wants to use with CQ waiting timeouts. It only allows a subset of all posix clocks and currently supports CLOCK_MONOTONIC and CLOCK_BOOTTIME. 
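A hedged sketch of preparing the IORING_REGISTER_CLOCK argument on the userspace side. `struct clock_register_sketch` mirrors the layout of struct io_uring_clock_register from the diff that follows (a clockid plus three reserved words), so the code builds against older installed headers; `prepare_clock_register()` is hypothetical. The kernel performs its own validation, and requiring the reserved words to be zero is an assumption based on common io_uring practice.

```c
#define _GNU_SOURCE             /* make Linux-specific clock ids visible */
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Local mirror of struct io_uring_clock_register's layout. */
struct clock_register_sketch {
	uint32_t clockid;
	uint32_t resv[3];
};

/* Fill the argument, refusing clocks other than the two the commit says
 * are supported. Reserved words are zeroed (assumed to be required). */
int prepare_clock_register(struct clock_register_sketch *reg, clockid_t id)
{
	if (id != CLOCK_MONOTONIC && id != CLOCK_BOOTTIME)
		return -EINVAL;
	memset(reg, 0, sizeof(*reg));
	reg->clockid = (uint32_t)id;
	return 0;
}
```

The filled structure would then be passed to io_uring_register(2) with the IORING_REGISTER_CLOCK opcode; consult the io_uring_register(2) documentation for the exact calling convention.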
Suggested-by: Lewis Baker Signed-off-by: Pavel Begunkov Link: https://lore.kernel.org/r/98f2bc8a3c36cdf8f0e6a275245e81e903459703.1723039801.git.asml.silence@gmail.com Signed-off-by: Jens Axboe --- include/uapi/linux/io_uring.h | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 6a81f55fcd0d..7af716136df9 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -596,6 +596,8 @@ enum io_uring_register_op { IORING_REGISTER_NAPI = 27, IORING_UNREGISTER_NAPI = 28, + IORING_REGISTER_CLOCK = 29, + /* this goes last */ IORING_REGISTER_LAST, @@ -676,6 +678,11 @@ struct io_uring_restriction { __u32 resv2[3]; }; +struct io_uring_clock_register { + __u32 clockid; + __u32 __resv[3]; +}; + struct io_uring_buf { __u64 addr; __u32 len; -- cgit v1.2.3 From 7ed9e09e2d13d5d43385153bba4734cb0eafd7fd Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Thu, 4 Jan 2024 10:46:30 -0700 Subject: io_uring: wire up min batch wake timeout Expose min_wait_usec in io_uring_getevents_arg, replacing the existing pad member. The value is in microseconds, as the name indicates. Note that if min_wait_usec and a normal timeout are used together, the normal timeout is still relative to the base time. For example, if min_wait_usec is set to 100 and the normal timeout is 1000, the max total time waited is still 1000. This also means that if the normal timeout is shorter than min_wait_usec, only min_wait_usec takes effect. See the previous commit for an explanation of how this works. IORING_FEAT_MIN_TIMEOUT is added as a feature flag for this, as applications doing submit_and_wait_timeout() style operations will generally not see the -EINVAL from the wait side, since they return the number of IOs submitted. Only if no IOs are submitted will the -EINVAL bubble back up to the application.
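Below is one reading of the timeout interplay described above, expressed as a sketch. `struct getevents_arg_sketch` is a local mirror of io_uring_getevents_arg after this patch (the 32-bit pad becomes min_wait_usec, so the size is unchanged), and `max_wait_usec()` is a hypothetical helper. It only computes the upper bound that the commit message states; it does not model the batching behavior itself.

```c
#include <stdint.h>

/* Local mirror of struct io_uring_getevents_arg after this patch. */
struct getevents_arg_sketch {
	uint64_t sigmask;
	uint32_t sigmask_sz;
	uint32_t min_wait_usec;   /* formerly the 32-bit pad */
	uint64_t ts;
};

/* Upper bound on the wait in microseconds, per the commit message: the
 * normal timeout stays relative to the start of the wait, so min_wait_usec
 * does not extend it; a normal timeout shorter than min_wait_usec leaves
 * only min_wait_usec in effect. */
uint64_t max_wait_usec(uint64_t timeout_usec, uint32_t min_wait_usec)
{
	return timeout_usec >= min_wait_usec ? timeout_usec : min_wait_usec;
}
```

Reusing the former padding keeps the structure at 24 bytes, so existing binaries that pass zero there see no change in layout.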
Signed-off-by: Jens Axboe --- include/uapi/linux/io_uring.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 7af716136df9..042eab793e26 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -543,6 +543,7 @@ struct io_uring_params { #define IORING_FEAT_LINKED_FILE (1U << 12) #define IORING_FEAT_REG_REG_RING (1U << 13) #define IORING_FEAT_RECVSEND_BUNDLE (1U << 14) +#define IORING_FEAT_MIN_TIMEOUT (1U << 15) /* * io_uring_register(2) opcodes and arguments @@ -766,7 +767,7 @@ enum io_uring_register_restriction_op { struct io_uring_getevents_arg { __u64 sigmask; __u32 sigmask_sz; - __u32 pad; + __u32 min_wait_usec; __u64 ts; }; -- cgit v1.2.3 From 1d4684fbe88dc28e2bf79f5e94a432f0469d2dac Mon Sep 17 00:00:00 2001 From: Nicolin Chen Date: Fri, 2 Aug 2024 17:32:02 -0700 Subject: iommufd: Reorder include files Reorder include files to alphabetic order to simplify maintenance, and separate local headers and global headers with a blank line. No functional change intended. 
Link: https://patch.msgid.link/r/7524b037cc05afe19db3c18f863253e1d1554fa2.1722644866.git.nicolinc@nvidia.com Signed-off-by: Nicolin Chen Signed-off-by: Jason Gunthorpe --- include/uapi/linux/iommufd.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index 4dde745cfb7e..72010f71c5e4 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -4,8 +4,8 @@ #ifndef _UAPI_IOMMUFD_H #define _UAPI_IOMMUFD_H -#include #include +#include #define IOMMUFD_TYPE (';') -- cgit v1.2.3 From abcd3026dd63417692a5e80aff70e7cd9b5c14ea Mon Sep 17 00:00:00 2001 From: Oleksij Rempel Date: Thu, 22 Aug 2024 14:07:01 +0200 Subject: ethtool: Extend cable testing interface with result source information Extend the ethtool netlink cable testing interface by adding support for specifying the source of cable testing results. This allows users to differentiate between results obtained through different diagnostic methods. For example, some TI 10BaseT1L PHYs provide two variants of cable diagnostics: Time Domain Reflectometry (TDR) and Active Link Cable Diagnostic (ALCD). By introducing `ETHTOOL_A_CABLE_RESULT_SRC` and `ETHTOOL_A_CABLE_FAULT_LENGTH_SRC` attributes, this update enables drivers to indicate whether the result was derived from TDR or ALCD, improving the clarity and utility of diagnostic information. 
Signed-off-by: Oleksij Rempel Reviewed-by: Andrew Lunn Link: https://patch.msgid.link/20240822120703.1393130-2-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski --- include/uapi/linux/ethtool_netlink.h | 11 +++++++++++ 1 file changed, 11 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/ethtool_netlink.h b/include/uapi/linux/ethtool_netlink.h index 45d8bcdea056..283305f6b063 100644 --- a/include/uapi/linux/ethtool_netlink.h +++ b/include/uapi/linux/ethtool_netlink.h @@ -573,10 +573,20 @@ enum { ETHTOOL_A_CABLE_PAIR_D, }; +/* Information source for specific results. */ +enum { + ETHTOOL_A_CABLE_INF_SRC_UNSPEC, + /* Results provided by the Time Domain Reflectometry (TDR) */ + ETHTOOL_A_CABLE_INF_SRC_TDR, + /* Results provided by the Active Link Cable Diagnostic (ALCD) */ + ETHTOOL_A_CABLE_INF_SRC_ALCD, +}; + enum { ETHTOOL_A_CABLE_RESULT_UNSPEC, ETHTOOL_A_CABLE_RESULT_PAIR, /* u8 ETHTOOL_A_CABLE_PAIR_ */ ETHTOOL_A_CABLE_RESULT_CODE, /* u8 ETHTOOL_A_CABLE_RESULT_CODE_ */ + ETHTOOL_A_CABLE_RESULT_SRC, /* u32 ETHTOOL_A_CABLE_INF_SRC_ */ __ETHTOOL_A_CABLE_RESULT_CNT, ETHTOOL_A_CABLE_RESULT_MAX = (__ETHTOOL_A_CABLE_RESULT_CNT - 1) @@ -586,6 +596,7 @@ enum { ETHTOOL_A_CABLE_FAULT_LENGTH_UNSPEC, ETHTOOL_A_CABLE_FAULT_LENGTH_PAIR, /* u8 ETHTOOL_A_CABLE_PAIR_ */ ETHTOOL_A_CABLE_FAULT_LENGTH_CM, /* u32 */ + ETHTOOL_A_CABLE_FAULT_LENGTH_SRC, /* u32 ETHTOOL_A_CABLE_INF_SRC_ */ __ETHTOOL_A_CABLE_FAULT_LENGTH_CNT, ETHTOOL_A_CABLE_FAULT_LENGTH_MAX = (__ETHTOOL_A_CABLE_FAULT_LENGTH_CNT - 1) -- cgit v1.2.3 From d24dac8eb8111c401b0de40a17760a0a254bcffc Mon Sep 17 00:00:00 2001 From: Simon Horman Date: Thu, 22 Aug 2024 13:57:22 +0100 Subject: packet: Correct spelling in if_packet.h Correct spelling in if_packet.h As reported by codespell. 
Signed-off-by: Simon Horman Acked-by: Willem de Bruijn Link: https://patch.msgid.link/20240822-net-spell-v1-1-3a98971ce2d2@kernel.org Signed-off-by: Jakub Kicinski --- include/uapi/linux/if_packet.h | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h index 9efc42382fdb..1d2718dd9647 100644 --- a/include/uapi/linux/if_packet.h +++ b/include/uapi/linux/if_packet.h @@ -230,8 +230,8 @@ struct tpacket_hdr_v1 { * ts_first_pkt: * Is always the time-stamp when the block was opened. * Case a) ZERO packets - * No packets to deal with but atleast you know the - * time-interval of this block. + * No packets to deal with but at least you know + * the time-interval of this block. * Case b) Non-zero packets * Use the ts of the first packet in the block. * @@ -265,7 +265,8 @@ enum tpacket_versions { - struct tpacket_hdr - pad to TPACKET_ALIGNMENT=16 - struct sockaddr_ll - - Gap, chosen so that packet data (Start+tp_net) alignes to TPACKET_ALIGNMENT=16 + - Gap, chosen so that packet data (Start+tp_net) aligns to + TPACKET_ALIGNMENT=16 - Start+tp_mac: [ Optional MAC header ] - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16. - Pad to align to TPACKET_ALIGNMENT=16 -- cgit v1.2.3 From 70d0bb45fae87a3b08970a318e15f317446a1956 Mon Sep 17 00:00:00 2001 From: Simon Horman Date: Thu, 22 Aug 2024 13:57:33 +0100 Subject: net: Correct spelling in headers Correct spelling in Networking headers. As reported by codespell. 
Signed-off-by: Simon Horman Link: https://patch.msgid.link/20240822-net-spell-v1-12-3a98971ce2d2@kernel.org Signed-off-by: Jakub Kicinski --- include/uapi/linux/in.h | 2 +- include/uapi/linux/inet_diag.h | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h index d358add1611c..5d32d53508d9 100644 --- a/include/uapi/linux/in.h +++ b/include/uapi/linux/in.h @@ -141,7 +141,7 @@ struct in_addr { */ #define IP_PMTUDISC_INTERFACE 4 /* weaker version of IP_PMTUDISC_INTERFACE, which allows packets to get - * fragmented if they exeed the interface mtu + * fragmented if they exceed the interface mtu */ #define IP_PMTUDISC_OMIT 5 diff --git a/include/uapi/linux/inet_diag.h b/include/uapi/linux/inet_diag.h index 50655de04c9b..86bb2e8b17c9 100644 --- a/include/uapi/linux/inet_diag.h +++ b/include/uapi/linux/inet_diag.h @@ -143,7 +143,7 @@ enum { INET_DIAG_SHUTDOWN, /* - * Next extenstions cannot be requested in struct inet_diag_req_v2: + * Next extensions cannot be requested in struct inet_diag_req_v2: * its field idiag_ext has only 8 bits. */ -- cgit v1.2.3 From cda1fba15cb2282b3c364805c9767698f11c3b0e Mon Sep 17 00:00:00 2001 From: Arkadiusz Kubalewski Date: Fri, 23 Aug 2024 00:25:12 +0200 Subject: dpll: add Embedded SYNC feature for a pin Implement and document new pin attributes for providing Embedded SYNC capabilities to the DPLL subsystem users through netlink pin-get do/dump messages. Allow the user to set Embedded SYNC frequency with a pin-set do netlink message.
Reviewed-by: Aleksandr Loktionov Signed-off-by: Arkadiusz Kubalewski Reviewed-by: Jiri Pirko Link: https://patch.msgid.link/20240822222513.255179-2-arkadiusz.kubalewski@intel.com Signed-off-by: Jakub Kicinski --- include/uapi/linux/dpll.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/dpll.h b/include/uapi/linux/dpll.h index 0c13d7f1a1bc..b0654ade7b7e 100644 --- a/include/uapi/linux/dpll.h +++ b/include/uapi/linux/dpll.h @@ -210,6 +210,9 @@ enum dpll_a_pin { DPLL_A_PIN_PHASE_ADJUST, DPLL_A_PIN_PHASE_OFFSET, DPLL_A_PIN_FRACTIONAL_FREQUENCY_OFFSET, + DPLL_A_PIN_ESYNC_FREQUENCY, + DPLL_A_PIN_ESYNC_FREQUENCY_SUPPORTED, + DPLL_A_PIN_ESYNC_PULSE, __DPLL_A_PIN_MAX, DPLL_A_PIN_MAX = (__DPLL_A_PIN_MAX - 1) -- cgit v1.2.3 From d8ea645d6984c84a87032063a0941f15a323831f Mon Sep 17 00:00:00 2001 From: Selvin Xavier Date: Sun, 18 Aug 2024 21:47:26 -0700 Subject: RDMA/bnxt_re: Handle variable WQE support for user applications User library calculates the number of slots required for user applications and it can pass that information to the driver. Driver can use this value and update the HW directly. This mechanism is currently used only for the newly introduced variable size WQEs. Extend the bnxt_re_qp_req structure to pass the Send Queue slot count. Reorganize the code to get the sq_slots before initializing the Send Queue attributes. 
Link: https://patch.msgid.link/r/1724042847-1481-5-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Hongguang Gao Signed-off-by: Selvin Xavier Signed-off-by: Jason Gunthorpe --- include/uapi/rdma/bnxt_re-abi.h | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/rdma/bnxt_re-abi.h b/include/uapi/rdma/bnxt_re-abi.h index e61104f35d73..71140618700a 100644 --- a/include/uapi/rdma/bnxt_re-abi.h +++ b/include/uapi/rdma/bnxt_re-abi.h @@ -118,10 +118,16 @@ struct bnxt_re_resize_cq_req { __aligned_u64 cq_va; }; +enum bnxt_re_qp_mask { + BNXT_RE_QP_REQ_MASK_VAR_WQE_SQ_SLOTS = 0x1, +}; + struct bnxt_re_qp_req { __aligned_u64 qpsva; __aligned_u64 qprva; __aligned_u64 qp_handle; + __aligned_u64 comp_mask; + __u32 sq_slots; }; struct bnxt_re_qp_resp { -- cgit v1.2.3 From 10a104c0debbb19a1e45193d5670510216e339ff Mon Sep 17 00:00:00 2001 From: Selvin Xavier Date: Sun, 18 Aug 2024 21:47:27 -0700 Subject: RDMA/bnxt_re: Enable variable size WQEs for user space applications Add backward compatibility code to enable variable size WQEs only if the user lib supports it. 
Link: https://patch.msgid.link/r/1724042847-1481-6-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Hongguang Gao Signed-off-by: Selvin Xavier Signed-off-by: Jason Gunthorpe --- include/uapi/rdma/bnxt_re-abi.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/uapi') diff --git a/include/uapi/rdma/bnxt_re-abi.h b/include/uapi/rdma/bnxt_re-abi.h index 71140618700a..6821002931c8 100644 --- a/include/uapi/rdma/bnxt_re-abi.h +++ b/include/uapi/rdma/bnxt_re-abi.h @@ -66,6 +66,7 @@ enum bnxt_re_wqe_mode { enum { BNXT_RE_COMP_MASK_REQ_UCNTX_POW2_SUPPORT = 0x01, + BNXT_RE_COMP_MASK_REQ_UCNTX_VAR_WQE_SUPPORT = 0x02, }; struct bnxt_re_uctx_req { -- cgit v1.2.3 From 947697c6f0f75f9866f2f891a102dece7a09a064 Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Thu, 22 Aug 2024 10:18:52 +0530 Subject: uapi: Define GENMASK_U128 This adds GENMASK_U128() and __GENMASK_U128() macros using __BITS_PER_U128 and __int128 data types. These macros will be used in providing support for generating 128 bit masks. The macros wouldn't work in all assembler flavors for reasons described in the comments on top of declarations. Enforce it for more by adding !__ASSEMBLY__ guard. 
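The C sketch below makes the macro arithmetic concrete. It copies _BIT128() and __GENMASK_U128() from the patch and splits the result into 64-bit halves, so the masks can be checked with ordinary integer comparisons. It assumes a compiler with unsigned __int128 (GCC or Clang on a 64-bit target). The helper u128_half() is hypothetical.

```c
#include <assert.h>

/* Copied from the const.h hunk: one bit of a 128-bit unsigned value. */
#define _BIT128(x) ((unsigned __int128)(1) << (x))

/* Copied from the bits.h hunk: bits l..h (inclusive) set. */
#define __GENMASK_U128(h, l) \
	((_BIT128((h)) << 1) - (_BIT128(l)))

/* Hypothetical helper: return the low (high == 0) or high (high == 1)
 * 64-bit half of a 128-bit value, since printf has no 128-bit format. */
static unsigned long long u128_half(unsigned __int128 v, int high)
{
	assert(high == 0 || high == 1);
	return (unsigned long long)(high ? v >> 64 : v);
}
```

The h == 127 case works because shifting _BIT128(127) left by one wraps to 0 in unsigned arithmetic. Subtracting _BIT128(l) from 0 then still produces the correct mask, and the code never performs a shift by 128, which would be undefined. For example, __GENMASK_U128(67, 60) straddles the two halves: the low half is 0xF000000000000000 and the high half is 0xF.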
Cc: Yury Norov Cc: Rasmus Villemoes Cc: Arnd Bergmann Cc: linux-kernel@vger.kernel.org Cc: linux-arch@vger.kernel.org Reviewed-by: Arnd Bergmann Signed-off-by: Anshuman Khandual Signed-off-by: Yury Norov --- include/uapi/linux/bits.h | 3 +++ include/uapi/linux/const.h | 17 +++++++++++++++++ 2 files changed, 20 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/bits.h b/include/uapi/linux/bits.h index 3c2a101986a3..5ee30f882736 100644 --- a/include/uapi/linux/bits.h +++ b/include/uapi/linux/bits.h @@ -12,4 +12,7 @@ (((~_ULL(0)) - (_ULL(1) << (l)) + 1) & \ (~_ULL(0) >> (__BITS_PER_LONG_LONG - 1 - (h)))) +#define __GENMASK_U128(h, l) \ + ((_BIT128((h)) << 1) - (_BIT128(l))) + #endif /* _UAPI_LINUX_BITS_H */ diff --git a/include/uapi/linux/const.h b/include/uapi/linux/const.h index a429381e7ca5..e16be0d37746 100644 --- a/include/uapi/linux/const.h +++ b/include/uapi/linux/const.h @@ -28,6 +28,23 @@ #define _BITUL(x) (_UL(1) << (x)) #define _BITULL(x) (_ULL(1) << (x)) +#if !defined(__ASSEMBLY__) +/* + * Missing asm support + * + * __BIT128() would not work in the asm code, as it shifts an + * 'unsigned __init128' data type as direct representation of + * 128 bit constants is not supported in the gcc compiler, as + * they get silently truncated. + * + * TODO: Please revisit this implementation when gcc compiler + * starts representing 128 bit constants directly like long + * and unsigned long etc. Subsequently drop the comment for + * GENMASK_U128() which would then start supporting asm code.
+ */ +#define _BIT128(x) ((unsigned __int128)(1) << (x)) +#endif + #define __ALIGN_KERNEL(x, a) __ALIGN_KERNEL_MASK(x, (__typeof__(x))(a) - 1) #define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask)) -- cgit v1.2.3 From 57413d8e172c10c90fbd91f98d0f7d8eb27e824c Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 27 Aug 2024 08:50:47 +0200 Subject: fs: sort out the fallocate mode vs flag mess The fallocate system call takes a mode argument, but that argument contains a wild mix of exclusive modes and optional flags. Replace FALLOC_FL_SUPPORTED_MASK with FALLOC_FL_MODE_MASK, which excludes the optional flag bit, so that we can use a switch statement on the value to easily enumerate the cases while getting the check for duplicate modes for free. To make this (and in the future the file system implementations) more readable also add a symbolic name for the 0 mode used to allocate blocks. Signed-off-by: Christoph Hellwig Link: https://lore.kernel.org/r/20240827065123.1762168-4-hch@lst.de Reviewed-by: Darrick J. Wong Reviewed-by: Jan Kara Signed-off-by: Christian Brauner --- include/uapi/linux/falloc.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h index 51398fa57f6c..5810371ed72b 100644 --- a/include/uapi/linux/falloc.h +++ b/include/uapi/linux/falloc.h @@ -2,6 +2,7 @@ #ifndef _UAPI_FALLOC_H_ #define _UAPI_FALLOC_H_ +#define FALLOC_FL_ALLOCATE_RANGE 0x00 /* allocate range */ #define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend size */ #define FALLOC_FL_PUNCH_HOLE 0x02 /* de-allocates range */ #define FALLOC_FL_NO_HIDE_STALE 0x04 /* reserved codepoint */ -- cgit v1.2.3 From b31c9d9dc343146b9f4ce67b4eee748c49296e99 Mon Sep 17 00:00:00 2001 From: Peter Hutterer Date: Tue, 27 Aug 2024 17:19:29 +0900 Subject: HID: hidraw: add HIDIOCREVOKE ioctl There is a need for userspace applications to open HID devices directly.
Use-cases include configuration of gaming mice or direct access to joystick devices. The latter is currently handled by the uaccess tag in systemd, other devices include more custom/local configurations or just sudo. A better approach is what we already have for evdev devices: give the application a file descriptor and revoke it when it may no longer access that device. This patch is the hidraw equivalent to the EVIOCREVOKE ioctl, see commit c7dc65737c9a ("Input: evdev - add EVIOCREVOKE ioctl") for full details. An MR for systemd-logind has been filed here: https://github.com/systemd/systemd/pull/33970 Signed-off-by: Peter Hutterer Link: https://patch.msgid.link/20240827-hidraw-revoke-v5-1-d004a7451aea@kernel.org Signed-off-by: Benjamin Tissoires --- include/uapi/linux/hidraw.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/hidraw.h b/include/uapi/linux/hidraw.h index 33ebad81720a..d5ee269864e0 100644 --- a/include/uapi/linux/hidraw.h +++ b/include/uapi/linux/hidraw.h @@ -46,6 +46,7 @@ struct hidraw_devinfo { /* The first byte of SOUTPUT and GOUTPUT is the report number */ #define HIDIOCSOUTPUT(len) _IOC(_IOC_WRITE|_IOC_READ, 'H', 0x0B, len) #define HIDIOCGOUTPUT(len) _IOC(_IOC_WRITE|_IOC_READ, 'H', 0x0C, len) +#define HIDIOCREVOKE _IOW('H', 0x0D, int) /* Revoke device access */ #define HIDRAW_FIRST_MINOR 0 #define HIDRAW_MAX_DEVICES 64 -- cgit v1.2.3 From ae98dbf43d755b4e111fcd086e53939bef3e9a1a Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Fri, 9 Aug 2024 11:20:45 -0600 Subject: io_uring/kbuf: add support for incremental buffer consumption By default, any recv/read operation that uses provided buffers will consume at least 1 buffer fully (and maybe more, in case of bundles). This adds support for incremental consumption, meaning that an application may add large buffers, and each read/recv will just consume the part of the buffer that it needs. 
For example, let's say an application registers 1MB buffers in a provided buffer ring, for streaming receives. If it gets a short recv, then the full 1MB buffer will be consumed and passed back to the application. With incremental consumption, only the part that was actually used is consumed, and the buffer remains the current one. This means that both the application and the kernel need to keep track of what the current receive point is. Each recv will still pass back a buffer ID and the size consumed; the only difference is that previously the next receive would always use the next buffer in the ring. Now the same buffer ID may be returned for multiple receives, each at an offset into that buffer from where the previous receive left off. Example: Application registers a provided buffer ring, and adds two 32K buffers to the ring. Buffer1 address: 0x1000000 (buffer ID 0) Buffer2 address: 0x2000000 (buffer ID 1) A recv completion is received with the following values: cqe->res 0x1000 (4k bytes received) cqe->flags 0x11 (CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 0) and the application now knows that 4096 bytes of data are available at 0x1000000, the start of that buffer, and that more data from this buffer will be coming. Now the next receive comes in: cqe->res 0x2000 (8k bytes received) cqe->flags 0x11 (CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 0) which tells the application that 8k is available where the last completion left off, at 0x1001000. Next completion is: cqe->res 0x5000 (20k bytes received) cqe->flags 0x1 (CQE_F_BUFFER set, buffer ID 0) and the application now knows that 20k of data is available at 0x1003000, which is where the previous receive ended. CQE_F_BUF_MORE isn't set, as this buffer has now been fully consumed. The next completion is then: cqe->res 0x1000 (4k bytes received) cqe->flags 0x10011 (CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 1) which tells the application that buffer ID 1 is now the current one, hence there's 4k of valid data at 0x2000000.
0x2001000 will be the next receive point for this buffer ID. When a buffer will be used by future completions, IORING_CQE_F_BUF_MORE will be set in cqe->flags. This tells the application that the kernel isn't done with the buffer yet, and that it should expect more completions for this buffer ID. It will only be set for provided buffer rings set up with IOU_PBUF_RING_INC, as that's the only type of buffer ring that will see multiple consecutive completions for the same buffer ID. For any other provided buffer type, any completion that passes back a buffer to the application is final. Once a buffer has been fully consumed, the buffer ring head is incremented and the next receive will indicate the next buffer ID in the CQE cflags. On the send side, the application can manage how much data is sent from an existing buffer by setting sqe->len to the desired send length. An application can request incremental consumption by setting IOU_PBUF_RING_INC in the provided buffer ring registration. Outside of that, provided buffer ring setup and buffer additions are done as before, with no changes there. The only change is that an application may see multiple completions for the same buffer ID, and hence needs to know where the next receive will happen. Note that, like existing provided buffer rings, this should not be used with IOSQE_ASYNC, as both really require the ring to remain locked over the duration of the buffer selection and the operation completion. Otherwise, a whole buffer will be consumed regardless of the size of the IO done.
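The application-side bookkeeping described above can be sketched in C as follows. The names struct inc_buf_state and inc_buf_consume() are hypothetical. Only the flag values and IORING_CQE_BUFFER_SHIFT come from io_uring.h as shown in this series. The checks replay the example completions, with the 8k receive written as res 0x2000 and the final completion's flags written as 0x10011. Those are the values that 8k bytes and CQE_F_BUFFER|CQE_F_BUF_MORE with buffer ID 1 encode.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values as shown in io_uring.h in this series. */
#ifndef IORING_CQE_F_BUFFER
#define IORING_CQE_F_BUFFER     (1U << 0)
#endif
#ifndef IORING_CQE_F_BUF_MORE
#define IORING_CQE_F_BUF_MORE   (1U << 4)
#endif
#ifndef IORING_CQE_BUFFER_SHIFT
#define IORING_CQE_BUFFER_SHIFT 16
#endif

/* Hypothetical bookkeeping for one incrementally consumed buffer group. */
struct inc_buf_state {
	const uint64_t *buf_addr; /* base address of each buffer ID */
	uint32_t nbufs;           /* number of entries in buf_addr */
	uint32_t cur_bid;         /* buffer ID currently being filled */
	uint64_t offset;          /* bytes already consumed from cur_bid */
};

/* Return the address where this CQE's data starts and advance the state.
 * Returns 0 if the CQE carries no buffer or an unknown buffer ID. */
static uint64_t inc_buf_consume(struct inc_buf_state *st, int32_t res,
				uint32_t flags)
{
	uint32_t bid;
	uint64_t addr;

	assert(st && st->buf_addr);
	if (!(flags & IORING_CQE_F_BUFFER) || res < 0)
		return 0;
	bid = flags >> IORING_CQE_BUFFER_SHIFT;
	if (bid >= st->nbufs)
		return 0;
	if (bid != st->cur_bid) {     /* kernel moved on to another buffer */
		st->cur_bid = bid;
		st->offset = 0;
	}
	addr = st->buf_addr[bid] + st->offset;
	if (flags & IORING_CQE_F_BUF_MORE)
		st->offset += (uint64_t)res; /* next data continues here */
	else
		st->offset = 0;              /* buffer fully handed back */
	return addr;
}
```

With buffers at 0x1000000 and 0x2000000, feeding the four example CQEs in order yields the data addresses 0x1000000, 0x1001000, 0x1003000 and 0x2000000. The state is left pointing at offset 0x1000 of buffer ID 1.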
Signed-off-by: Jens Axboe --- include/uapi/linux/io_uring.h | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 042eab793e26..a275f91d2ac0 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -440,11 +440,21 @@ struct io_uring_cqe { * IORING_CQE_F_SOCK_NONEMPTY If set, more data to read after socket recv * IORING_CQE_F_NOTIF Set for notification CQEs. Can be used to distinct * them from sends. + * IORING_CQE_F_BUF_MORE If set, the buffer ID set in the completion will get + * more completions. In other words, the buffer is being + * partially consumed, and will be used by the kernel for + * more completions. This is only set for buffers used via + * the incremental buffer consumption, as provided by + * a ring buffer setup with IOU_PBUF_RING_INC. For any + * other provided buffer type, all completions with a + * buffer passed back is automatically returned to the + * application. */ #define IORING_CQE_F_BUFFER (1U << 0) #define IORING_CQE_F_MORE (1U << 1) #define IORING_CQE_F_SOCK_NONEMPTY (1U << 2) #define IORING_CQE_F_NOTIF (1U << 3) +#define IORING_CQE_F_BUF_MORE (1U << 4) #define IORING_CQE_BUFFER_SHIFT 16 @@ -716,9 +726,17 @@ struct io_uring_buf_ring { * mmap(2) with the offset set as: * IORING_OFF_PBUF_RING | (bgid << IORING_OFF_PBUF_SHIFT) * to get a virtual mapping for the ring. + * IOU_PBUF_RING_INC: If set, buffers consumed from this buffer ring can be + * consumed incrementally. Normally one (or more) buffers + * are fully consumed. With incremental consumptions, it's + * feasible to register big ranges of buffers, and each + * use of it will consume only as much as it needs. This + * requires that both the kernel and application keep + * track of where the current read/recv index is at. 
*/ enum io_uring_register_pbuf_ring_flags { IOU_PBUF_RING_MMAP = 1, + IOU_PBUF_RING_INC = 2, }; /* argument for IORING_(UN)REGISTER_PBUF_RING */ -- cgit v1.2.3 From 433f9d76a01056dfeaefc15167b11e514e56f956 Mon Sep 17 00:00:00 2001 From: Ian Kent Date: Wed, 14 Aug 2024 17:02:31 +0800 Subject: autofs: add per dentry expire timeout Add the ability to set a per-dentry mount expire timeout to autofs. There are two fairly well-known automounter map formats, the autofs format and the amd format (more or less System V and Berkeley). Some time ago Linux autofs added an amd map format parser that implemented a fair amount of the amd functionality. This was done within the autofs infrastructure and some functionality wasn't implemented because it either didn't make sense or required extra kernel changes. The idea was to restrict changes to be within the existing autofs functionality as much as possible and leave changes with a wider scope to be considered later. One of these changes is implementing the amd options: 1) "unmount", expire this mount according to a timeout (same as the current autofs default). 2) "nounmount", don't expire this mount (same as setting the autofs timeout to 0 except only for this specific mount). 3) "utimeout=", expire this mount using the specified timeout (again same as setting the autofs timeout but only for this mount). To implement these options, per-dentry expire timeouts need to be implemented for autofs indirect mounts. This is because all map keys (mounts) for autofs indirect mounts use an expire timeout stored in the autofs mount super block info. structure and all indirect mounts use the same expire timeout. Now I have a request to add the "nounmount" option so I need to add the per-dentry expire handling to the kernel implementation to do this. The implementation uses the trailing path component to identify the mount (and is also used as the autofs map key) which is passed in the autofs_dev_ioctl structure path field.
The expire timeout is passed in autofs_dev_ioctl timeout field (well, of the timeout union). If the passed in timeout is equal to -1 the per-dentry timeout and flag are cleared providing for the "unmount" option. If the timeout is greater than or equal to 0 the timeout is set to the value and the flag is also set. If the dentry timeout is 0 the dentry will not expire by timeout which enables the implementation of the "nounmount" option for the specific mount. When the dentry timeout is greater than zero it allows for the implementation of the "utimeout=" option. Signed-off-by: Ian Kent Link: https://lore.kernel.org/r/20240814090231.963520-1-raven@themaw.net Signed-off-by: Christian Brauner --- include/uapi/linux/auto_fs.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/auto_fs.h b/include/uapi/linux/auto_fs.h index 1f7925afad2d..8081df849743 100644 --- a/include/uapi/linux/auto_fs.h +++ b/include/uapi/linux/auto_fs.h @@ -23,7 +23,7 @@ #define AUTOFS_MIN_PROTO_VERSION 3 #define AUTOFS_MAX_PROTO_VERSION 5 -#define AUTOFS_PROTO_SUBVERSION 5 +#define AUTOFS_PROTO_SUBVERSION 6 /* * The wait_queue_token (autofs_wqt_t) is part of a structure which is passed -- cgit v1.2.3 From 09022bc196d23484a7a5d48cf373f8583e3fcf23 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 7 Aug 2024 20:35:26 +0100 Subject: mm: remove PG_error The PG_error bit is now unused; delete it and free up a bit in page->flags. 
Link: https://lkml.kernel.org/r/20240807193528.1865100-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- include/uapi/linux/kernel-page-flags.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h index 6f2f2720f3ac..ff8032227876 100644 --- a/include/uapi/linux/kernel-page-flags.h +++ b/include/uapi/linux/kernel-page-flags.h @@ -7,7 +7,7 @@ */ #define KPF_LOCKED 0 -#define KPF_ERROR 1 +#define KPF_ERROR 1 /* Now unused */ #define KPF_REFERENCED 2 #define KPF_UPTODATE 3 #define KPF_DIRTY 4 -- cgit v1.2.3 From 181028a0d84cdcc7ac86d05cc49eaa416ce85c8b Mon Sep 17 00:00:00 2001 From: Chandramohan Akula Date: Thu, 29 Aug 2024 08:34:05 -0700 Subject: RDMA/bnxt_re: Share a page to expose per SRQ info with userspace Gen P7 adapters need to share the toggle bits information received in the kernel driver with user space. User space needs this info to arm the SRQ. A user space application can get this page using the UAPI routines. The library will mmap this page and get the toggle bits to be used in the next ARM Doorbell. A hash list is used to map the SRQ ID to the SRQ structure. The SRQ structure is retrieved from the hash list when the library calls the UAPI routine to get the toggle page mapping. Currently the full page is mapped per SRQ.
This can be optimized to enable multiple SRQs from the same application to share the same page at different offsets in the page. Signed-off-by: Chandramohan Akula Signed-off-by: Selvin Xavier Link: https://patch.msgid.link/1724945645-14989-4-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Leon Romanovsky --- include/uapi/rdma/bnxt_re-abi.h | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/rdma/bnxt_re-abi.h b/include/uapi/rdma/bnxt_re-abi.h index 6821002931c8..faa9d62b3b30 100644 --- a/include/uapi/rdma/bnxt_re-abi.h +++ b/include/uapi/rdma/bnxt_re-abi.h @@ -141,8 +141,14 @@ struct bnxt_re_srq_req { __aligned_u64 srq_handle; }; +enum bnxt_re_srq_mask { + BNXT_RE_SRQ_TOGGLE_PAGE_SUPPORT = 0x1, +}; + struct bnxt_re_srq_resp { __u32 srqid; + __u32 rsvd; /* padding */ + __aligned_u64 comp_mask; }; enum bnxt_re_shpg_offt { -- cgit v1.2.3 From 8bfb74ae12fa4cd3c9b49bb5913610b5709bffd7 Mon Sep 17 00:00:00 2001 From: Pablo Neira Ayuso Date: Tue, 3 Sep 2024 01:09:27 +0200 Subject: netfilter: nf_tables: zero timeout means element never times out This patch uses zero as the timeout marker for those elements that never expire when the element is created. If userspace provides no timeout for an element, then the default set timeout applies. However, if no default set timeout is specified and the timeout flag is set, then the timeout extension is allocated and the timeout is set to zero to allow for future updates. Use of zero as a never-timeout marker has been suggested by Phil Sutter. Note that, in older kernels, it is already possible to define elements that never expire by declaring a set with the set timeout flag set and no global set timeout; in this case, new elements with no explicit timeout do not allocate the timeout extension, hence they never expire. This approach makes it complicated to accommodate element timeout updates, because element extensions do not support reallocations.
Therefore, allocate the timeout extension and use the new marker for this case, but do not expose it to userspace to retain backward compatibility in the set listing. Signed-off-by: Pablo Neira Ayuso --- include/uapi/linux/netfilter/nf_tables.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h index 639894ed1b97..d6476ca5d7a6 100644 --- a/include/uapi/linux/netfilter/nf_tables.h +++ b/include/uapi/linux/netfilter/nf_tables.h @@ -436,7 +436,7 @@ enum nft_set_elem_flags { * @NFTA_SET_ELEM_KEY: key value (NLA_NESTED: nft_data) * @NFTA_SET_ELEM_DATA: data value of mapping (NLA_NESTED: nft_data_attributes) * @NFTA_SET_ELEM_FLAGS: bitmask of nft_set_elem_flags (NLA_U32) - * @NFTA_SET_ELEM_TIMEOUT: timeout value (NLA_U64) + * @NFTA_SET_ELEM_TIMEOUT: timeout value, zero means never times out (NLA_U64) * @NFTA_SET_ELEM_EXPIRATION: expiration time (NLA_U64) * @NFTA_SET_ELEM_USERDATA: user data (NLA_BINARY) * @NFTA_SET_ELEM_EXPR: expression (NLA_NESTED: nft_expr_attributes) -- cgit v1.2.3 From 17519819926211e6b2834e00e4554bec0daf22ac Mon Sep 17 00:00:00 2001 From: Joey Gouly Date: Thu, 22 Aug 2024 16:11:03 +0100 Subject: arm64/ptrace: add support for FEAT_POE Add a regset for POE containing POR_EL0. 
Signed-off-by: Joey Gouly Cc: Catalin Marinas Cc: Will Deacon Reviewed-by: Mark Brown Reviewed-by: Catalin Marinas Reviewed-by: Anshuman Khandual Link: https://lore.kernel.org/r/20240822151113.1479789-21-joey.gouly@arm.com Signed-off-by: Will Deacon --- include/uapi/linux/elf.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h index b54b313bcf07..81762ff3c99e 100644 --- a/include/uapi/linux/elf.h +++ b/include/uapi/linux/elf.h @@ -441,6 +441,7 @@ typedef struct elf64_shdr { #define NT_ARM_ZA 0x40c /* ARM SME ZA registers */ #define NT_ARM_ZT 0x40d /* ARM SME ZT registers */ #define NT_ARM_FPMR 0x40e /* ARM floating point mode register */ +#define NT_ARM_POE 0x40f /* ARM POE registers */ #define NT_ARC_V2 0x600 /* ARCv2 accumulator/extra registers */ #define NT_VMCOREDD 0x700 /* Vmcore Device Dump Note */ #define NT_MIPS_DSP 0x800 /* MIPS DSP ASE registers */ -- cgit v1.2.3 From aa16880d9f13c6490e80ad614402c8a6fe6f3efa Mon Sep 17 00:00:00 2001 From: Alexander Mikhalitsyn Date: Tue, 3 Sep 2024 17:16:13 +0200 Subject: fuse: add basic infrastructure to support idmappings Add some preparatory changes in fuse_get_req/fuse_force_creds to handle idmappings. Miklos suggested [1], [2] to change the meaning of the in.h.uid/in.h.gid fields when the daemon declares support for idmapped mounts. In the new semantics, we fill the uid/gid values in the fuse header with the idmapped caller uid/gid (for requests which create new inodes); in all other cases we just send -1 to userspace. No functional changes intended.
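For a daemon that negotiates FUSE_ALLOW_IDMAP, the new header semantics have a practical consequence. The uid/gid fields may hold the FUSE_INVALID_UIDGID sentinel, and that value must never be used as a real owner. Below is a minimal C sketch. It uses a hypothetical trimmed-down struct caller_ids in place of the full struct fuse_in_header, plus a hypothetical helper owner_for_create(). The fuse.h comment added in the follow-up commit says inode-creating requests always carry mapped ids, so the error path here is only a defensive fallback.

```c
#include <assert.h>
#include <stdint.h>

#ifndef FUSE_INVALID_UIDGID
#define FUSE_INVALID_UIDGID ((uint32_t)(-1)) /* as added to fuse.h */
#endif

/* Hypothetical stand-in for the uid/gid fields of struct fuse_in_header. */
struct caller_ids {
	uint32_t uid;
	uint32_t gid;
};

/* Choose the owner of a new inode from the request header. Returns 0 and
 * fills *uid/*gid when both ids are real, or -1 when either one is the
 * FUSE_INVALID_UIDGID sentinel (defensive: creation requests are documented
 * to always carry mapped ids). */
static int owner_for_create(const struct caller_ids *in,
			    uint32_t *uid, uint32_t *gid)
{
	assert(in && uid && gid);
	if (in->uid == FUSE_INVALID_UIDGID || in->gid == FUSE_INVALID_UIDGID)
		return -1;
	*uid = in->uid;
	*gid = in->gid;
	return 0;
}
```

Rejecting the sentinel is only one possible policy. The key point is that (uint32_t)-1 must not be stored on disk as an owner.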
Link: https://lore.kernel.org/all/CAJfpegsVY97_5mHSc06mSw79FehFWtoXT=hhTUK_E-Yhr7OAuQ@mail.gmail.com/ [1] Link: https://lore.kernel.org/all/CAJfpegtHQsEUuFq1k4ZbTD3E1h-GsrN3PWyv7X8cg6sfU_W2Yw@mail.gmail.com/ [2] Signed-off-by: Alexander Mikhalitsyn Signed-off-by: Miklos Szeredi --- include/uapi/linux/fuse.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index d08b99d60f6f..2ccf38181df2 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -984,6 +984,8 @@ struct fuse_fallocate_in { */ #define FUSE_UNIQUE_RESEND (1ULL << 63) +#define FUSE_INVALID_UIDGID ((uint32_t)(-1)) + struct fuse_in_header { uint32_t len; uint32_t opcode; -- cgit v1.2.3 From 16e1503eaf329129170e4e7a078aee17686967a5 Mon Sep 17 00:00:00 2001 From: Alexander Mikhalitsyn Date: Tue, 3 Sep 2024 17:16:25 +0200 Subject: fuse: allow idmapped mounts Now we have everything in place and we can allow idmapped mounts by setting the FS_ALLOW_IDMAP flag. Note that the real availability of idmapped mounts will depend on the fuse daemon. The fuse daemon has to set the FUSE_ALLOW_IDMAP flag in the FUSE_INIT reply. To discuss: - we enable idmapped mount support only if "default_permissions" mode is enabled, because otherwise we would need to deal with UID/GID mappings on the userspace side OR provide the userspace with idmapped req->in.h.uid/req->in.h.gid values, which is probably not something that we want to do. The idmapped mounts philosophy is not about faking caller uid/gid.
Some extra links and examples: - libfuse support https://github.com/mihalicyn/libfuse/commits/idmap_support - fuse-overlayfs support: https://github.com/mihalicyn/fuse-overlayfs/commits/idmap_support - cephfs-fuse conversion example https://github.com/mihalicyn/ceph/commits/fuse_idmap - glusterfs conversion example https://github.com/mihalicyn/glusterfs/commits/fuse_idmap Signed-off-by: Alexander Mikhalitsyn Reviewed-by: Christian Brauner Signed-off-by: Miklos Szeredi --- include/uapi/linux/fuse.h | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 2ccf38181df2..f1e99458e29e 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -217,6 +217,9 @@ * - add backing_id to fuse_open_out, add FOPEN_PASSTHROUGH open flag * - add FUSE_NO_EXPORT_SUPPORT init flag * - add FUSE_NOTIFY_RESEND, add FUSE_HAS_RESEND init flag + * + * 7.41 + * - add FUSE_ALLOW_IDMAP */ #ifndef _LINUX_FUSE_H @@ -252,7 +255,7 @@ #define FUSE_KERNEL_VERSION 7 /** Minor version number of this interface */ -#define FUSE_KERNEL_MINOR_VERSION 40 +#define FUSE_KERNEL_MINOR_VERSION 41 /** The node ID of the root inode */ #define FUSE_ROOT_ID 1 @@ -421,6 +424,7 @@ struct fuse_file_lock { * FUSE_NO_EXPORT_SUPPORT: explicitly disable export support * FUSE_HAS_RESEND: kernel supports resending pending requests, and the high bit * of the request ID indicates resend requests + * FUSE_ALLOW_IDMAP: allow creation of idmapped mounts */ #define FUSE_ASYNC_READ (1 << 0) #define FUSE_POSIX_LOCKS (1 << 1) @@ -466,6 +470,7 @@ struct fuse_file_lock { /* Obsolete alias for FUSE_DIRECT_IO_ALLOW_MMAP */ #define FUSE_DIRECT_IO_RELAX FUSE_DIRECT_IO_ALLOW_MMAP +#define FUSE_ALLOW_IDMAP (1ULL << 40) /** * CUSE INIT request/reply flags @@ -984,6 +989,19 @@ struct fuse_fallocate_in { */ #define FUSE_UNIQUE_RESEND (1ULL << 63) +/** + * This value will be set by the kernel to + * (struct 
fuse_in_header).{uid,gid} fields in + * case when: + * - fuse daemon enabled FUSE_ALLOW_IDMAP + * - idmapping information is not available and uid/gid + * can not be mapped in accordance with an idmapping. + * + * Note: an idmapping information always available + * for inode creation operations like: + * FUSE_MKNOD, FUSE_SYMLINK, FUSE_MKDIR, FUSE_TMPFILE, + * FUSE_CREATE and FUSE_RENAME2 (with RENAME_WHITEOUT). + */ #define FUSE_INVALID_UIDGID ((uint32_t)(-1)) struct fuse_in_header { -- cgit v1.2.3 From 4e893545ef8712d25f3176790ebb95beb073637e Mon Sep 17 00:00:00 2001 From: Mariusz Tkaczyk Date: Wed, 4 Sep 2024 12:48:47 +0200 Subject: PCI/NPEM: Add Native PCIe Enclosure Management support MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Native PCIe Enclosure Management (NPEM, PCIe r6.1 sec 6.28) allows managing LEDs in storage enclosures. NPEM is indication oriented and it does not give direct access to LEDs. Although each indication *could* represent an individual LED, multiple indications could also be represented as a single, multi-color LED or a single LED blinking in a specific interval. The specification leaves that open. Each enabled indication (capability register bit on) is represented as a ledclass_dev which can be controlled through sysfs. For every ledclass device only 2 brightness states are allowed: LED_ON (1) or LED_OFF (0). This corresponds to the NPEM control register (Indication bit on/off). Ledclass devices appear in sysfs as child devices (subdirectory) of PCI device which has an NPEM Extended Capability and indication is enabled in NPEM capability register. For example, these are LEDs created for pcieport "10000:02:05.0" on my setup: leds/ ├── 10000:02:05.0:enclosure:fail ├── 10000:02:05.0:enclosure:locate ├── 10000:02:05.0:enclosure:ok └── 10000:02:05.0:enclosure:rebuild They can be also found in "/sys/class/leds" directory. 
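The mapping from the NPEM capability register to LED devices can be sketched as follows. The model assumes that an LED exists only when the NPEM Capable bit is set and the corresponding indication bit is set. It covers only the four indications shown in the listing above. The bit values are copied from the PCI_NPEM_* definitions this patch adds to pci_regs.h. The table and `npem_supported()` are illustrative and are not the driver's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copied from the PCI_NPEM_* definitions added to pci_regs.h by this patch. */
#define NPEM_CAP_CAPABLE 0x00000001u
#define NPEM_IND_OK      0x00000004u
#define NPEM_IND_LOCATE  0x00000008u
#define NPEM_IND_FAIL    0x00000010u
#define NPEM_IND_REBUILD 0x00000020u

struct npem_ind {
	uint32_t bit;
	const char *name; /* suffix used in "<pci-address>:enclosure:<name>" */
};

static const struct npem_ind npem_inds[] = {
	{ NPEM_IND_OK, "ok" },
	{ NPEM_IND_LOCATE, "locate" },
	{ NPEM_IND_FAIL, "fail" },
	{ NPEM_IND_REBUILD, "rebuild" },
};

/*
 * Hypothetical helper: store the names of indications advertised in an
 * NPEM capability register value. Returns how many were stored, or 0 when
 * the device is not NPEM capable.
 */
static size_t npem_supported(uint32_t cap, const char **names, size_t max)
{
	size_t n = 0;

	assert(names != NULL || max == 0);
	if (!(cap & NPEM_CAP_CAPABLE))
		return 0;
	for (size_t i = 0; i < sizeof(npem_inds) / sizeof(npem_inds[0]) && n < max; i++)
		if (cap & npem_inds[i].bit)
			names[n++] = npem_inds[i].name;
	return n;
}
```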
The parent PCIe device domain/bus/device/function address is used to guarantee uniqueness across the leds subsystem.

To enable/disable a "fail" indication, write to its "brightness" file:

  echo 1 > ./leds/10000:02:05.0:enclosure:fail/brightness
  echo 0 > ./leds/10000:02:05.0:enclosure:fail/brightness

PCIe r6.1, sec 7.9.19.2 defines the possible indications.

Multiple indications for the same parent PCIe device can conflict, and the hardware may update them when processing a new request. To avoid issues, the driver refreshes all indications by reading back the control register.

This driver expects to be the exclusive NPEM extended capability manager. After issuing a new request it waits up to 1 second. It does not check whether the controller is busy before writing, and it assumes the mutex lock protects against concurrent updates.

If _DSM LED management is available, we assume the platform may be using NPEM for its own purposes (see PCI Firmware Spec r3.3 sec 4.7), so the driver does not use NPEM. A future patch will add _DSM support; an info message notes whether NPEM or _DSM is being used.

NPEM is a PCIe extended capability, so it should be registered in pcie_init_capabilities(). That is not possible because of the LED dependency: the parent pci_device must be added earlier for led_classdev_register() to succeed. NPEM does not require configuration on the kernel side, so it is safe to register the LED devices later.
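The `echo` commands above can also be issued from C. The sketch below builds the sysfs path from the naming scheme shown earlier, assuming the LEDs are reached through /sys/class/leds. It writes "1" or "0", the only two brightness states the driver accepts. Both helpers are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the "brightness" path for one NPEM indication,
 * e.g. pci_addr "10000:02:05.0" and ind "fail".
 * Returns 0 on success, or -1 if the buffer is too small.
 */
static int npem_led_path(char *buf, size_t len, const char *pci_addr, const char *ind)
{
	assert(buf != NULL && pci_addr != NULL && ind != NULL);
	int n = snprintf(buf, len, "/sys/class/leds/%s:enclosure:%s/brightness",
			 pci_addr, ind);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write LED_ON ("1") or LED_OFF ("0"); returns 0 on success, -1 on error. */
static int npem_led_set(const char *path, int on)
{
	FILE *f = fopen(path, "w");
	if (f == NULL)
		return -1;
	int ok = fputs(on ? "1" : "0", f) >= 0;
	/* fclose() flushes the write, so its result also reports sysfs errors. */
	return (fclose(f) == 0 && ok) ? 0 : -1;
}
```

Because the driver reads back the control register after each request, a caller that sets several indications should re-read the other `brightness` files rather than assume they are unchanged.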
Link: https://lore.kernel.org/r/20240904104848.23480-3-mariusz.tkaczyk@linux.intel.com Suggested-by: Lukas Wunner Signed-off-by: Mariusz Tkaczyk Signed-off-by: Bjorn Helgaas Tested-by: Stuart Hayes Reviewed-by: Christoph Hellwig Reviewed-by: Ilpo Järvinen --- include/uapi/linux/pci_regs.h | 35 +++++++++++++++++++++++++++++++++++ 1 file changed, 35 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h index 94c00996e633..9751b413f3d6 100644 --- a/include/uapi/linux/pci_regs.h +++ b/include/uapi/linux/pci_regs.h @@ -740,6 +740,7 @@ #define PCI_EXT_CAP_ID_DVSEC 0x23 /* Designated Vendor-Specific */ #define PCI_EXT_CAP_ID_DLF 0x25 /* Data Link Feature */ #define PCI_EXT_CAP_ID_PL_16GT 0x26 /* Physical Layer 16.0 GT/s */ +#define PCI_EXT_CAP_ID_NPEM 0x29 /* Native PCIe Enclosure Management */ #define PCI_EXT_CAP_ID_PL_32GT 0x2A /* Physical Layer 32.0 GT/s */ #define PCI_EXT_CAP_ID_DOE 0x2E /* Data Object Exchange */ #define PCI_EXT_CAP_ID_MAX PCI_EXT_CAP_ID_DOE @@ -1121,6 +1122,40 @@ #define PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_MASK 0x000000F0 #define PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_SHIFT 4 +/* Native PCIe Enclosure Management */ +#define PCI_NPEM_CAP 0x04 /* NPEM capability register */ +#define PCI_NPEM_CAP_CAPABLE 0x00000001 /* NPEM Capable */ + +#define PCI_NPEM_CTRL 0x08 /* NPEM control register */ +#define PCI_NPEM_CTRL_ENABLE 0x00000001 /* NPEM Enable */ + +/* + * Native PCIe Enclosure Management indication bits and Reset command bit + * are corresponding for capability and control registers. 
+ */ +#define PCI_NPEM_CMD_RESET 0x00000002 /* Reset Command */ +#define PCI_NPEM_IND_OK 0x00000004 /* OK */ +#define PCI_NPEM_IND_LOCATE 0x00000008 /* Locate */ +#define PCI_NPEM_IND_FAIL 0x00000010 /* Fail */ +#define PCI_NPEM_IND_REBUILD 0x00000020 /* Rebuild */ +#define PCI_NPEM_IND_PFA 0x00000040 /* Predicted Failure Analysis */ +#define PCI_NPEM_IND_HOTSPARE 0x00000080 /* Hot Spare */ +#define PCI_NPEM_IND_ICA 0x00000100 /* In Critical Array */ +#define PCI_NPEM_IND_IFA 0x00000200 /* In Failed Array */ +#define PCI_NPEM_IND_IDT 0x00000400 /* Device Type */ +#define PCI_NPEM_IND_DISABLED 0x00000800 /* Disabled */ +#define PCI_NPEM_IND_SPEC_0 0x01000000 +#define PCI_NPEM_IND_SPEC_1 0x02000000 +#define PCI_NPEM_IND_SPEC_2 0x04000000 +#define PCI_NPEM_IND_SPEC_3 0x08000000 +#define PCI_NPEM_IND_SPEC_4 0x10000000 +#define PCI_NPEM_IND_SPEC_5 0x20000000 +#define PCI_NPEM_IND_SPEC_6 0x40000000 +#define PCI_NPEM_IND_SPEC_7 0x80000000 + +#define PCI_NPEM_STATUS 0x0c /* NPEM status register */ +#define PCI_NPEM_STATUS_CC 0x00000001 /* Command Completed */ + /* Data Object Exchange */ #define PCI_DOE_CAP 0x04 /* DOE Capabilities Register */ #define PCI_DOE_CAP_INT_SUP 0x00000001 /* Interrupt Support */ -- cgit v1.2.3 From 1083d733eb2624837753046924a9248555a1bfbf Mon Sep 17 00:00:00 2001 From: Ido Schimmel Date: Tue, 3 Sep 2024 16:35:54 +0300 Subject: ipv4: Fix user space build failure due to header change MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit RT_TOS() from include/uapi/linux/in_route.h is defined using IPTOS_TOS_MASK from include/uapi/linux/ip.h. 
This is problematic for files such as include/net/ip_fib.h that want to use RT_TOS(), because kernel compilation fails unless both header files are included:

In file included from ./include/net/ip_fib.h:25,
                 from ./include/net/route.h:27,
                 from ./include/net/lwtunnel.h:9,
                 from net/core/dst.c:24:
./include/net/ip_fib.h: In function ‘fib_dscp_masked_match’:
./include/uapi/linux/in_route.h:31:32: error: ‘IPTOS_TOS_MASK’ undeclared (first use in this function)
   31 | #define RT_TOS(tos)     ((tos)&IPTOS_TOS_MASK)
      |                                ^~~~~~~~~~~~~~
./include/net/ip_fib.h:440:45: note: in expansion of macro ‘RT_TOS’
  440 |         return dscp == inet_dsfield_to_dscp(RT_TOS(fl4->flowi4_tos));

Therefore, the cited commit changed linux/in_route.h to include linux/ip.h. However, as reported by David, this breaks iproute2 compilation due to overlapping definitions between linux/ip.h and /usr/include/netinet/ip.h:

In file included from ../include/uapi/linux/in_route.h:5,
                 from iproute.c:19:
../include/uapi/linux/ip.h:25:9: warning: "IPTOS_TOS" redefined
   25 | #define IPTOS_TOS(tos) ((tos)&IPTOS_TOS_MASK)
      |         ^~~~~~~~~
In file included from iproute.c:17:
/usr/include/netinet/ip.h:222:9: note: this is the location of the previous definition
  222 | #define IPTOS_TOS(tos) ((tos) & IPTOS_TOS_MASK)

Fix by changing include/net/ip_fib.h to include linux/ip.h. Note that usage of RT_TOS() should not spread further in the kernel due to recent work in this area.
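The userspace pattern this fix preserves is sketched below. It assumes glibc and installed kernel uapi headers. <netinet/ip.h> supplies IPTOS_TOS_MASK first, and <linux/in_route.h> is then included without pulling in <linux/ip.h>. The wrapper `route_tos()` is hypothetical and exists only to show what RT_TOS() expands to.

```c
#include <assert.h>
/* glibc's header first, as in iproute.c; it defines IPTOS_TOS_MASK (0x1E). */
#include <netinet/ip.h>
/* After the fix this uapi header no longer includes <linux/ip.h>,
 * so the IPTOS_* macros from glibc are not redefined. */
#include <linux/in_route.h>

/* RT_TOS(tos) expands to ((tos)&IPTOS_TOS_MASK), using glibc's mask here. */
static unsigned int route_tos(unsigned int tos)
{
	return RT_TOS(tos);
}
```

The header itself still uses IPTOS_TOS_MASK without defining it, so code that expands RT_TOS() must provide the mask from somewhere. In the kernel, include/net/ip_fib.h now does that by including linux/ip.h itself.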
Fixes: 1fa3314c14c6 ("ipv4: Centralize TOS matching")
Reported-by: David Ahern
Closes: https://lore.kernel.org/netdev/2f5146ff-507d-4cab-a195-b28c0c9e654e@kernel.org/
Signed-off-by: Ido Schimmel
Reviewed-by: David Ahern
Reviewed-by: Guillaume Nault
Link: https://patch.msgid.link/20240903133554.2807343-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski
---
 include/uapi/linux/in_route.h | 2 --
 1 file changed, 2 deletions(-)

(limited to 'include/uapi')

diff --git a/include/uapi/linux/in_route.h b/include/uapi/linux/in_route.h
index 10bdd7e7107f..0cc2c23b47f8 100644
--- a/include/uapi/linux/in_route.h
+++ b/include/uapi/linux/in_route.h
@@ -2,8 +2,6 @@
 #ifndef _LINUX_IN_ROUTE_H
 #define _LINUX_IN_ROUTE_H
 
-#include <linux/ip.h>
-
 /* IPv4 routing cache flags */
 
 #define RTCF_DEAD RTNH_F_DEAD
--
cgit v1.2.3


From b4fef22c2fb97fa204f0c99c7c7f1c6b422ef0aa Mon Sep 17 00:00:00 2001
From: Aleksa Sarai
Date: Wed, 28 Aug 2024 20:19:42 +1000
Subject: uapi: explain how per-syscall AT_* flags should be allocated

Unfortunately, the way we have gone about adding new AT_* flags has been a little messy. In the beginning, all of the AT_* flags had generic meanings and so it made sense to share the flag bits indiscriminately. However, we inevitably ran into syscalls that needed their own syscall-specific flags. Due to the lack of a planned-out policy, we ended up with the following situations:

 * Existing syscalls adding new features tended to use new AT_* bits, with some effort taken to try to re-use bits for flags that were so obviously syscall-specific that they only make sense for a single syscall (such as the AT_EACCESS/AT_REMOVEDIR/AT_HANDLE_FID triplet).

   Given the constraints of bitflags, this works well in practice, but ideally (to avoid future confusion) we would plan ahead and define a set of "per-syscall bits" ahead of time so that when allocating new bits we don't end up with a complete mish-mash of which bits are supposed to be per-syscall and which aren't.
 * New syscalls dealt with this in several ways:

   - Some syscalls (like renameat2(2), move_mount(2), fsopen(2), and fspick(2)) created their own separate flag spaces that have no overlap with the AT_* flags. Most of these ended up allocating their bits sequentially.

     In the case of move_mount(2) and fspick(2), several flags have identical meanings to AT_* flags but were allocated in their own flag space. This makes sense for syscalls that will never share AT_* flags, but for some syscalls it leads to duplication with AT_* flags in a way that could cause confusion (if renameat2(2) grew a RENAME_EMPTY_PATH, users could easily mistake it for AT_EMPTY_PATH since it is an *at(2) syscall).

   - Some syscalls unfortunately ended up both creating their own flag space and using bits from other flag spaces. The most obvious example is open_tree(2), where the standard usage ends up using flags from *THREE* separate flag spaces:

       open_tree(AT_FDCWD, "/foo", OPEN_TREE_CLONE|O_CLOEXEC|AT_RECURSIVE);

     (Note that O_CLOEXEC is also platform-specific, so several future OPEN_TREE_* bits are also made unusable in one fell swoop.)

It's not entirely clear to me what the "right" choice is for new syscalls. Just saying that all future VFS syscalls should use AT_* flags doesn't seem practical. openat2(2) has RESOLVE_* flags (many of which don't make much sense to burn generic AT_* flags for), and move_mount(2) has separate AT_*-like flags for both the source and target, so separate flags are needed anyway (though it seems possible that renameat2(2) could grow *_EMPTY_PATH flags at some point, and it's a bit of a shame they can't be reused).

But at least for syscalls that _do_ choose to use AT_* flags, we should explicitly state the policy that 0x2ff is currently intended for per-syscall flags, and that new flags should err on the side of overlapping with existing flag bits (so we can extend the scope of generic flags in the future if necessary).
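The allocation policy can be checked mechanically. The sketch below copies flag values from the reorganised <linux/fcntl.h> in this patch. It verifies that the generic flags stay clear of the 0x2ff per-syscall range, and that a candidate per-syscall flag lies entirely inside that range. `AT_PER_SYSCALL_MASK` and `at_flag_is_per_syscall()` are hypothetical names, not part of the uapi.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the reorganised <linux/fcntl.h> in this patch. */
#define AT_SYMLINK_NOFOLLOW 0x100
#define AT_SYMLINK_FOLLOW   0x400
#define AT_NO_AUTOMOUNT     0x800
#define AT_EMPTY_PATH       0x1000
#define AT_STATX_SYNC_TYPE  0x6000
#define AT_RECURSIVE        0x8000
#define AT_REMOVEDIR        0x200
#define AT_RENAME_EXCHANGE  0x0002

/* Hypothetical name for the range reserved for per-syscall flags (0xff | 0x200). */
#define AT_PER_SYSCALL_MASK 0x2ff

static const unsigned int at_generic_flags =
	AT_SYMLINK_NOFOLLOW | AT_SYMLINK_FOLLOW | AT_NO_AUTOMOUNT |
	AT_EMPTY_PATH | AT_STATX_SYNC_TYPE | AT_RECURSIVE;

/* True if a candidate flag uses only bits from the per-syscall range. */
static bool at_flag_is_per_syscall(unsigned int flag)
{
	return flag != 0 && (flag & ~AT_PER_SYSCALL_MASK) == 0;
}
```

A new per-syscall flag that passes this check may still collide with an existing per-syscall flag of another syscall. The commit explicitly prefers such overlap, because it keeps the remaining high bits free for future generic flags.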
And add AT_* aliases for the RENAME_* flags to further cement that renameat2(2) is an *at(2) flag, just with its own per-syscall flags. Suggested-by: Amir Goldstein Reviewed-by: Jeff Layton Reviewed-by: Josef Bacik Signed-off-by: Aleksa Sarai Link: https://lore.kernel.org/r/20240828-exportfs-u64-mount-id-v3-1-10c2c4c16708@cyphar.com Reviewed-by: Jan Kara Signed-off-by: Christian Brauner --- include/uapi/linux/fcntl.h | 80 ++++++++++++++++++++++++++++++++-------------- 1 file changed, 56 insertions(+), 24 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index e55a3314bcb0..38a6d66d9e88 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -90,37 +90,69 @@ #define DN_ATTRIB 0x00000020 /* File changed attibutes */ #define DN_MULTISHOT 0x80000000 /* Don't remove notifier */ +#define AT_FDCWD -100 /* Special value for dirfd used to + indicate openat should use the + current working directory. */ + + +/* Generic flags for the *at(2) family of syscalls. */ + +/* Reserved for per-syscall flags 0xff. */ +#define AT_SYMLINK_NOFOLLOW 0x100 /* Do not follow symbolic + links. */ +/* Reserved for per-syscall flags 0x200 */ +#define AT_SYMLINK_FOLLOW 0x400 /* Follow symbolic links. */ +#define AT_NO_AUTOMOUNT 0x800 /* Suppress terminal automount + traversal. */ +#define AT_EMPTY_PATH 0x1000 /* Allow empty relative + pathname to operate on dirfd + directly. */ /* - * The constants AT_REMOVEDIR and AT_EACCESS have the same value. AT_EACCESS is - * meaningful only to faccessat, while AT_REMOVEDIR is meaningful only to - * unlinkat. The two functions do completely different things and therefore, - * the flags can be allowed to overlap. For example, passing AT_REMOVEDIR to - * faccessat would be undefined behavior and thus treating it equivalent to - * AT_EACCESS is valid undefined behavior. 
+ * These flags are currently statx(2)-specific, but they could be made generic + * in the future and so they should not be used for other per-syscall flags. */ -#define AT_FDCWD -100 /* Special value used to indicate - openat should use the current - working directory. */ -#define AT_SYMLINK_NOFOLLOW 0x100 /* Do not follow symbolic links. */ +#define AT_STATX_SYNC_TYPE 0x6000 /* Type of synchronisation required from statx() */ +#define AT_STATX_SYNC_AS_STAT 0x0000 /* - Do whatever stat() does */ +#define AT_STATX_FORCE_SYNC 0x2000 /* - Force the attributes to be sync'd with the server */ +#define AT_STATX_DONT_SYNC 0x4000 /* - Don't sync attributes with the server */ + +#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */ + +/* + * Per-syscall flags for the *at(2) family of syscalls. + * + * These are flags that are so syscall-specific that a user passing these flags + * to the wrong syscall is so "clearly wrong" that we can safely call such + * usage "undefined behaviour". + * + * For example, the constants AT_REMOVEDIR and AT_EACCESS have the same value. + * AT_EACCESS is meaningful only to faccessat, while AT_REMOVEDIR is meaningful + * only to unlinkat. The two functions do completely different things and + * therefore, the flags can be allowed to overlap. For example, passing + * AT_REMOVEDIR to faccessat would be undefined behavior and thus treating it + * equivalent to AT_EACCESS is valid undefined behavior. + * + * Note for implementers: When picking a new per-syscall AT_* flag, try to + * reuse already existing flags first. This leaves us with as many unused bits + * as possible, so we can use them for generic bits in the future if necessary. + */ + +/* Flags for renameat2(2) (must match legacy RENAME_* flags). */ +#define AT_RENAME_NOREPLACE 0x0001 +#define AT_RENAME_EXCHANGE 0x0002 +#define AT_RENAME_WHITEOUT 0x0004 + +/* Flag for faccessat(2). */ #define AT_EACCESS 0x200 /* Test access permitted for effective IDs, not real IDs. 
*/ +/* Flag for unlinkat(2). */ #define AT_REMOVEDIR 0x200 /* Remove directory instead of unlinking file. */ -#define AT_SYMLINK_FOLLOW 0x400 /* Follow symbolic links. */ -#define AT_NO_AUTOMOUNT 0x800 /* Suppress terminal automount traversal */ -#define AT_EMPTY_PATH 0x1000 /* Allow empty relative pathname */ - -#define AT_STATX_SYNC_TYPE 0x6000 /* Type of synchronisation required from statx() */ -#define AT_STATX_SYNC_AS_STAT 0x0000 /* - Do whatever stat() does */ -#define AT_STATX_FORCE_SYNC 0x2000 /* - Force the attributes to be sync'd with the server */ -#define AT_STATX_DONT_SYNC 0x4000 /* - Don't sync attributes with the server */ - -#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */ +/* Flags for name_to_handle_at(2). */ +#define AT_HANDLE_FID 0x200 /* File handle is needed to compare + object identity and may not be + usable with open_by_handle_at(2). */ -/* Flags for name_to_handle_at(2). We reuse AT_ flag space to save bits... */ -#define AT_HANDLE_FID AT_REMOVEDIR /* file handle is needed to - compare object identity and may not - be usable to open_by_handle_at(2) */ #if defined(__KERNEL__) #define AT_GETATTR_NOSEC 0x80000000 #endif -- cgit v1.2.3 From 4356d575ef0f39a3e8e0ce0c40d84ce900ac3b61 Mon Sep 17 00:00:00 2001 From: Aleksa Sarai Date: Wed, 28 Aug 2024 20:19:43 +1000 Subject: fhandle: expose u64 mount id to name_to_handle_at(2) Now that we provide a unique 64-bit mount ID interface in statx(2), we can now provide a race-free way for name_to_handle_at(2) to provide a file handle and corresponding mount without needing to worry about racing with /proc/mountinfo parsing or having to open a file just to do statx(2). 
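A minimal sketch of the race-free call this patch enables is shown below, using glibc's `name_to_handle_at()` wrapper. It assumes that with AT_HANDLE_MNT_ID_UNIQUE the kernel stores a 64-bit value through the `mount_id` pointer, so the pointer must refer to 64-bit storage. Kernels that predate the flag are expected to reject it, so the helper simply reports `-errno`. `handle_and_unique_mnt_id()` is a hypothetical name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>

/* Value added by this patch; defined locally in case libc headers lack it. */
#ifndef AT_HANDLE_MNT_ID_UNIQUE
#define AT_HANDLE_MNT_ID_UNIQUE 0x001
#endif

/*
 * Hypothetical helper: get a file handle for @path together with the unique
 * 64-bit ID of the mount it belongs to, in a single call.
 * Returns 0 on success or -errno on failure.
 */
static int handle_and_unique_mnt_id(const char *path, uint64_t *mnt_id)
{
	struct file_handle *fh;
	int ret;

	assert(path != NULL && mnt_id != NULL);
	fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
	if (fh == NULL)
		return -ENOMEM;
	fh->handle_bytes = MAX_HANDLE_SZ;

	/* The kernel writes a u64 here when AT_HANDLE_MNT_ID_UNIQUE is set. */
	ret = name_to_handle_at(AT_FDCWD, path, fh, (int *)mnt_id,
				AT_HANDLE_MNT_ID_UNIQUE);
	ret = ret < 0 ? -errno : 0; /* capture errno before free() */
	free(fh);
	return ret;
}
```

The handle itself is discarded here for brevity. A real caller would keep `fh` for a later `open_by_handle_at()` and use the mount ID to select the matching mount file descriptor.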
While this is not necessary if you are using AT_EMPTY_PATH and don't care about an extra statx(2) call, users that pass full paths into name_to_handle_at(2) need to know which mount the file handle comes from (to make sure they don't try to open_by_handle_at a file handle from a different filesystem) and switching to AT_EMPTY_PATH would require allocating a file for every name_to_handle_at(2) call, turning err = name_to_handle_at(-EBADF, "/foo/bar/baz", &handle, &mntid, AT_HANDLE_MNT_ID_UNIQUE); into int fd = openat(-EBADF, "/foo/bar/baz", O_PATH | O_CLOEXEC); err1 = name_to_handle_at(fd, "", &handle, &unused_mntid, AT_EMPTY_PATH); err2 = statx(fd, "", AT_EMPTY_PATH, STATX_MNT_ID_UNIQUE, &statxbuf); mntid = statxbuf.stx_mnt_id; close(fd); Reviewed-by: Jeff Layton Signed-off-by: Aleksa Sarai Link: https://lore.kernel.org/r/20240828-exportfs-u64-mount-id-v3-2-10c2c4c16708@cyphar.com Reviewed-by: Jan Kara Reviewed-by: Josef Bacik Signed-off-by: Christian Brauner --- include/uapi/linux/fcntl.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 38a6d66d9e88..87e2dec79fea 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -152,6 +152,7 @@ #define AT_HANDLE_FID 0x200 /* File handle is needed to compare object identity and may not be usable with open_by_handle_at(2). */ +#define AT_HANDLE_MNT_ID_UNIQUE 0x001 /* Return the u64 unique mount ID. */ #if defined(__KERNEL__) #define AT_GETATTR_NOSEC 0x80000000 -- cgit v1.2.3 From 6fe0593bfc3cfab8b4ec7255152a2be40b2e49a3 Mon Sep 17 00:00:00 2001 From: Erling Ljunggren Date: Wed, 28 Sep 2022 13:21:43 +0200 Subject: media: videodev2.h: add V4L2_CAP_EDID Add capability flag to indicate that the device is an EDID-only device. 
Signed-off-by: Erling Ljunggren Signed-off-by: Hans Verkuil Reviewed-by: Sebastian Fricke Reviewed-by: Ricardo Ribalda Signed-off-by: Mauro Carvalho Chehab --- include/uapi/linux/videodev2.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/videodev2.h b/include/uapi/linux/videodev2.h index 725e86c4bbbd..27239cb64065 100644 --- a/include/uapi/linux/videodev2.h +++ b/include/uapi/linux/videodev2.h @@ -502,6 +502,7 @@ struct v4l2_capability { #define V4L2_CAP_META_CAPTURE 0x00800000 /* Is a metadata capture device */ #define V4L2_CAP_READWRITE 0x01000000 /* read/write systemcalls */ +#define V4L2_CAP_EDID 0x02000000 /* Is an EDID-only device */ #define V4L2_CAP_STREAMING 0x04000000 /* streaming I/O ioctls */ #define V4L2_CAP_META_OUTPUT 0x08000000 /* Is a metadata output device */ -- cgit v1.2.3 From c7a29258737076a7caf9ced6a7d710cce890abe5 Mon Sep 17 00:00:00 2001 From: Hans Verkuil Date: Wed, 12 Aug 2020 14:20:10 +0200 Subject: media: input: serio.h: add SERIO_EXTRON_DA_HD_PLUS Add a new serio ID for the Extron DA HD 4K Plus series of 4K HDMI Distribution Amplifiers. These devices support CEC over the serial port, so a new serio ID is needed to be able to associate the CEC driver. 
Signed-off-by: Hans Verkuil Acked-by: Dmitry Torokhov Signed-off-by: Mauro Carvalho Chehab --- include/uapi/linux/serio.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/serio.h b/include/uapi/linux/serio.h index ed2a96f43ce4..5a2af0942c9f 100644 --- a/include/uapi/linux/serio.h +++ b/include/uapi/linux/serio.h @@ -83,5 +83,6 @@ #define SERIO_PULSE8_CEC 0x40 #define SERIO_RAINSHADOW_CEC 0x41 #define SERIO_FSIA6B 0x42 +#define SERIO_EXTRON_DA_HD_4K_PLUS 0x43 #endif /* _UAPI_SERIO_H */ -- cgit v1.2.3 From e49dacc71ec2621ce4c422cd5605d4d06f7807b0 Mon Sep 17 00:00:00 2001 From: Wouter Verhelst Date: Mon, 12 Aug 2024 15:20:37 +0200 Subject: nbd: implement the WRITE_ZEROES command The NBD protocol defines a message for zeroing out a region of an export Add support to the kernel driver for that message. Signed-off-by: Wouter Verhelst Cc: Eric Blake Reviewed-by: Damien Le Moal Link: https://lore.kernel.org/r/20240812133032.115134-3-w@uter.be Signed-off-by: Jens Axboe --- include/uapi/linux/nbd.h | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/nbd.h b/include/uapi/linux/nbd.h index d75215f2c675..f1d468acfb25 100644 --- a/include/uapi/linux/nbd.h +++ b/include/uapi/linux/nbd.h @@ -42,8 +42,9 @@ enum { NBD_CMD_WRITE = 1, NBD_CMD_DISC = 2, NBD_CMD_FLUSH = 3, - NBD_CMD_TRIM = 4 + NBD_CMD_TRIM = 4, /* userspace defines additional extension commands */ + NBD_CMD_WRITE_ZEROES = 6, }; /* values for flags field, these are server interaction specific. 
*/ @@ -53,11 +54,13 @@ enum { #define NBD_FLAG_SEND_FUA (1 << 3) /* send FUA (forced unit access) */ #define NBD_FLAG_ROTATIONAL (1 << 4) /* device is rotational */ #define NBD_FLAG_SEND_TRIM (1 << 5) /* send trim/discard */ +#define NBD_FLAG_SEND_WRITE_ZEROES (1 << 6) /* supports WRITE_ZEROES */ /* there is a gap here to match userspace */ #define NBD_FLAG_CAN_MULTI_CONN (1 << 8) /* Server supports multiple connections per export. */ /* values for cmd flags in the upper 16 bits of request type */ #define NBD_CMD_FLAG_FUA (1 << 16) /* FUA (forced unit access) op */ +#define NBD_CMD_FLAG_NO_HOLE (1 << 17) /* Do not punch a hole for WRITE_ZEROES */ /* These are client behavior specific flags. */ #define NBD_CFLAG_DESTROY_ON_DISCONNECT (1 << 0) /* delete the nbd device on -- cgit v1.2.3 From 663b0f1e141dc60ce6c09ae6afc5f213b22d13ca Mon Sep 17 00:00:00 2001 From: Philip Yang Date: Fri, 16 Feb 2024 11:00:10 -0500 Subject: drm/amdkfd: Document and define SVM events message macro Document how to use SMI system management interface to enable and receive SVM events. Document SVM event triggers. Define SVM events message string format macro that could be used by user mode for sscanf to parse the event. Add it to uAPI header file to make it obvious that is changing uAPI in future. No functional changes. 
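As an illustration of the `sscanf` use the commit has in mind, the sketch below parses one migrate-end line. The scan string is the KFD_EVENT_FMT_MIGRATE_END format added later in this patch, prefixed with `"%x "`. That prefix reflects an assumption that each line starts with the event ID printed in hex followed by a space. The struct and `parse_migrate_end()` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* "%x " (assumed event-id prefix) + the KFD_EVENT_FMT_MIGRATE_END format. */
#define MIGRATE_END_SCAN "%x %lld -%d @%lx(%lx) %x->%x %d"

struct migrate_end_event {
	unsigned int id;       /* event id */
	long long ns;          /* timestamp in ns since boot */
	int pid;
	unsigned long start;   /* user address, in pages */
	unsigned long size;    /* in pages */
	unsigned int from, to; /* GPU ID, or 0 for system memory */
	int trigger;           /* enum KFD_MIGRATE_TRIGGERS value */
};

/* Hypothetical helper: returns 0 if all 8 fields matched, -1 otherwise. */
static int parse_migrate_end(const char *line, struct migrate_end_event *ev)
{
	assert(line != NULL && ev != NULL);
	int n = sscanf(line, MIGRATE_END_SCAN, &ev->id, &ev->ns, &ev->pid,
		       &ev->start, &ev->size, &ev->from, &ev->to, &ev->trigger);
	return n == 8 ? 0 : -1;
}
```

The literal `-` before `%d` consumes the dash that the format prints in front of the PID, so `pid` comes back positive.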
Signed-off-by: Philip Yang Reviewed-by: James Zhu Signed-off-by: Alex Deucher --- include/uapi/linux/kfd_ioctl.h | 100 +++++++++++++++++++++++++++++++++++------ 1 file changed, 87 insertions(+), 13 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h index 71a7ce5f2d4c..717307d6b5b7 100644 --- a/include/uapi/linux/kfd_ioctl.h +++ b/include/uapi/linux/kfd_ioctl.h @@ -540,26 +540,29 @@ enum kfd_smi_event { KFD_SMI_EVENT_ALL_PROCESS = 64 }; +/* The reason of the page migration event */ enum KFD_MIGRATE_TRIGGERS { - KFD_MIGRATE_TRIGGER_PREFETCH, - KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU, - KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU, - KFD_MIGRATE_TRIGGER_TTM_EVICTION + KFD_MIGRATE_TRIGGER_PREFETCH, /* Prefetch to GPU VRAM or system memory */ + KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU, /* GPU page fault recover */ + KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU, /* CPU page fault recover */ + KFD_MIGRATE_TRIGGER_TTM_EVICTION /* TTM eviction */ }; +/* The reason of user queue evition event */ enum KFD_QUEUE_EVICTION_TRIGGERS { - KFD_QUEUE_EVICTION_TRIGGER_SVM, - KFD_QUEUE_EVICTION_TRIGGER_USERPTR, - KFD_QUEUE_EVICTION_TRIGGER_TTM, - KFD_QUEUE_EVICTION_TRIGGER_SUSPEND, - KFD_QUEUE_EVICTION_CRIU_CHECKPOINT, - KFD_QUEUE_EVICTION_CRIU_RESTORE + KFD_QUEUE_EVICTION_TRIGGER_SVM, /* SVM buffer migration */ + KFD_QUEUE_EVICTION_TRIGGER_USERPTR, /* userptr movement */ + KFD_QUEUE_EVICTION_TRIGGER_TTM, /* TTM move buffer */ + KFD_QUEUE_EVICTION_TRIGGER_SUSPEND, /* GPU suspend */ + KFD_QUEUE_EVICTION_CRIU_CHECKPOINT, /* CRIU checkpoint */ + KFD_QUEUE_EVICTION_CRIU_RESTORE /* CRIU restore */ }; +/* The reason of unmap buffer from GPU event */ enum KFD_SVM_UNMAP_TRIGGERS { - KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY, - KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY_MIGRATE, - KFD_SVM_UNMAP_TRIGGER_UNMAP_FROM_CPU + KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY, /* MMU notifier CPU buffer movement */ + KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY_MIGRATE,/* MMU notifier page migration */ + 
KFD_SVM_UNMAP_TRIGGER_UNMAP_FROM_CPU /* Unmap to free the buffer */ }; #define KFD_SMI_EVENT_MASK_FROM_INDEX(i) (1ULL << ((i) - 1)) @@ -570,6 +573,77 @@ struct kfd_ioctl_smi_events_args { __u32 anon_fd; /* from KFD */ }; +/* + * SVM event tracing via SMI system management interface + * + * Open event file descriptor + * use ioctl AMDKFD_IOC_SMI_EVENTS, pass in gpuid and return a anonymous file + * descriptor to receive SMI events. + * If calling with sudo permission, then file descriptor can be used to receive + * SVM events from all processes, otherwise, to only receive SVM events of same + * process. + * + * To enable the SVM event + * Write event file descriptor with KFD_SMI_EVENT_MASK_FROM_INDEX(event) bitmap + * mask to start record the event to the kfifo, use bitmap mask combination + * for multiple events. New event mask will overwrite the previous event mask. + * KFD_SMI_EVENT_MASK_FROM_INDEX(KFD_SMI_EVENT_ALL_PROCESS) bit requires sudo + * permisson to receive SVM events from all process. + * + * To receive the event + * Application can poll file descriptor to wait for the events, then read event + * from the file into a buffer. Each event is one line string message, starting + * with the event id, then the event specific information. + * + * To decode event information + * The following event format string macro can be used with sscanf to decode + * the specific event information. + * event triggers: the reason to generate the event, defined as enum for unmap, + * eviction and migrate events. + * node, from, to, prefetch_loc, preferred_loc: GPU ID, or 0 for system memory. 
+ * addr: user mode address, in pages + * size: in pages + * pid: the process ID to generate the event + * ns: timestamp in nanosecond-resolution, starts at system boot time but + * stops during suspend + * migrate_update: GPU page fault is recovered by 'M' for migrate, 'U' for update + * rw: 'W' for write page fault, 'R' for read page fault + * rescheduled: 'R' if the queue restore failed and rescheduled to try again + */ +#define KFD_EVENT_FMT_UPDATE_GPU_RESET(reset_seq_num, reset_cause)\ + "%x %s\n", (reset_seq_num), (reset_cause) + +#define KFD_EVENT_FMT_THERMAL_THROTTLING(bitmask, counter)\ + "%llx:%llx\n", (bitmask), (counter) + +#define KFD_EVENT_FMT_VMFAULT(pid, task_name)\ + "%x:%s\n", (pid), (task_name) + +#define KFD_EVENT_FMT_PAGEFAULT_START(ns, pid, addr, node, rw)\ + "%lld -%d @%lx(%x) %c\n", (ns), (pid), (addr), (node), (rw) + +#define KFD_EVENT_FMT_PAGEFAULT_END(ns, pid, addr, node, migrate_update)\ + "%lld -%d @%lx(%x) %c\n", (ns), (pid), (addr), (node), (migrate_update) + +#define KFD_EVENT_FMT_MIGRATE_START(ns, pid, start, size, from, to, prefetch_loc,\ + preferred_loc, migrate_trigger)\ + "%lld -%d @%lx(%lx) %x->%x %x:%x %d\n", (ns), (pid), (start), (size),\ + (from), (to), (prefetch_loc), (preferred_loc), (migrate_trigger) + +#define KFD_EVENT_FMT_MIGRATE_END(ns, pid, start, size, from, to, migrate_trigger)\ + "%lld -%d @%lx(%lx) %x->%x %d\n", (ns), (pid), (start), (size),\ + (from), (to), (migrate_trigger) + +#define KFD_EVENT_FMT_QUEUE_EVICTION(ns, pid, node, evict_trigger)\ + "%lld -%d %x %d\n", (ns), (pid), (node), (evict_trigger) + +#define KFD_EVENT_FMT_QUEUE_RESTORE(ns, pid, node, rescheduled)\ + "%lld -%d %x %c\n", (ns), (pid), (node), (rescheduled) + +#define KFD_EVENT_FMT_UNMAP_FROM_GPU(ns, pid, addr, size, node, unmap_trigger)\ + "%lld -%d @%lx(%lx) %x %d\n", (ns), (pid), (addr), (size),\ + (node), (unmap_trigger) + /************************************************************************************************** * CRIU IOCTLs 
(Checkpoint Restore In Userspace) * -- cgit v1.2.3 From c259acab839e57eab0318f32da4ae803a8d59397 Mon Sep 17 00:00:00 2001 From: Mahesh Bandewar Date: Wed, 4 Sep 2024 07:13:05 -0700 Subject: ptp/ioctl: support MONOTONIC{,_RAW} timestamps for PTP_SYS_OFFSET_EXTENDED The ability to read the PHC (Physical Hardware Clock) alongside multiple system clocks is currently dependent on the specific hardware architecture. This limitation restricts the use of PTP_SYS_OFFSET_PRECISE to certain hardware configurations. The generic soultion which would work across all architectures is to read the PHC along with the latency to perform PHC-read as offered by PTP_SYS_OFFSET_EXTENDED which provides pre and post timestamps. However, these timestamps are currently limited to the CLOCK_REALTIME timebase. Since CLOCK_REALTIME is affected by NTP (or similar time synchronization services), it can experience significant jumps forward or backward. This hinders the precise latency measurements that PTP_SYS_OFFSET_EXTENDED is designed to provide. This problem could be addressed by supporting MONOTONIC_RAW timestamps within PTP_SYS_OFFSET_EXTENDED. Unlike CLOCK_REALTIME or CLOCK_MONOTONIC, the MONOTONIC_RAW timebase is unaffected by NTP adjustments. This enhancement can be implemented by utilizing one of the three reserved words within the PTP_SYS_OFFSET_EXTENDED struct to pass the clock-id for timestamps. The current behavior aligns with clock-id for CLOCK_REALTIME timebase (value of 0), ensuring backward compatibility of the UAPI. Signed-off-by: Mahesh Bandewar Signed-off-by: Vadim Fedorenko Signed-off-by: David S. 
Miller --- include/uapi/linux/ptp_clock.h | 24 ++++++++++++++++++------ 1 file changed, 18 insertions(+), 6 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/ptp_clock.h b/include/uapi/linux/ptp_clock.h index 053b40d642de..18eefa6d93d6 100644 --- a/include/uapi/linux/ptp_clock.h +++ b/include/uapi/linux/ptp_clock.h @@ -155,13 +155,25 @@ struct ptp_sys_offset { struct ptp_clock_time ts[2 * PTP_MAX_SAMPLES + 1]; }; +/* + * ptp_sys_offset_extended - data structure for IOCTL operation + * PTP_SYS_OFFSET_EXTENDED + * + * @n_samples: Desired number of measurements. + * @clockid: clockid of a clock-base used for pre/post timestamps. + * @rsv: Reserved for future use. + * @ts: Array of samples in the form [pre-TS, PHC, post-TS]. The + * kernel provides @n_samples. + * + * Starting from kernel 6.12 and onwards, the first word of the reserved-field + * is used for @clockid. That's backward compatible since previous kernel + * expect all three reserved words (@rsv[3]) to be 0 while the clockid (first + * word in the new structure) for CLOCK_REALTIME is '0'. + */ struct ptp_sys_offset_extended { - unsigned int n_samples; /* Desired number of measurements. */ - unsigned int rsv[3]; /* Reserved for future use. */ - /* - * Array of [system, phc, system] time stamps. The kernel will provide - * 3*n_samples time stamps. - */ + unsigned int n_samples; + __kernel_clockid_t clockid; + unsigned int rsv[2]; struct ptp_clock_time ts[PTP_MAX_SAMPLES][3]; }; -- cgit v1.2.3 From 6cf1c97dad2ebc4de03105cc444b3dfaa83f3dc2 Mon Sep 17 00:00:00 2001 From: zhenwei pi Date: Tue, 23 Apr 2024 11:41:07 +0800 Subject: virtio_balloon: introduce oom-kill invocations When the guest OS runs under critical memory pressure, the guest starts to kill processes. A guest monitor agent may scan 'oom_kill' from /proc/vmstat and report the OOM KILL event. However, the agent itself may be killed, in which case we lose this critical event (and later ones).
For now, we can also grep the guest kernel log for magic words from the host side. Rather than relying on this fragile approach, have virtio balloon report OOM-KILL invocations directly. Acked-by: David Hildenbrand Signed-off-by: zhenwei pi Message-Id: <20240423034109.1552866-3-pizhenwei@bytedance.com> Signed-off-by: Michael S. Tsirkin --- include/uapi/linux/virtio_balloon.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index ddaa45e723c4..b17bbe033697 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -71,7 +71,8 @@ struct virtio_balloon_config { #define VIRTIO_BALLOON_S_CACHES 7 /* Disk caches */ #define VIRTIO_BALLOON_S_HTLB_PGALLOC 8 /* Hugetlb page allocations */ #define VIRTIO_BALLOON_S_HTLB_PGFAIL 9 /* Hugetlb page allocation failures */ -#define VIRTIO_BALLOON_S_NR 10 +#define VIRTIO_BALLOON_S_OOM_KILL 10 /* OOM killer invocations */ +#define VIRTIO_BALLOON_S_NR 11 #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \ VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \ @@ -83,7 +84,8 @@ struct virtio_balloon_config { VIRTIO_BALLOON_S_NAMES_prefix "available-memory", \ VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \ VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \ - VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures" \ + VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \ + VIRTIO_BALLOON_S_NAMES_prefix "oom-kills" \ } #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("") -- cgit v1.2.3 From c5b70a26aac39f09a23fd72f44cfbb3d4d5a14d5 Mon Sep 17 00:00:00 2001 From: zhenwei pi Date: Tue, 23 Apr 2024 11:41:08 +0800 Subject: virtio_balloon: introduce memory allocation stall counter The memory allocation stall counter reflects the performance/latency of memory allocation. Expose this counter to the host side, out of band, via the virtio balloon device.
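The counters added by these two patches travel from guest to host as an array of tagged entries. As a host-side sketch, assuming the VIRTIO 1.0 little-endian layout of `struct virtio_balloon_stat` (a packed 16-bit tag followed by a 64-bit value, 10 bytes per entry), the C below maps tags to the names from `VIRTIO_BALLOON_S_NAMES` and looks up one counter in a received buffer. The `BALLOON_*` constants and the helper functions are hypothetical local names that mirror the UAPI header; they are not part of it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical local mirrors of <linux/virtio_balloon.h> after the two
 * patches above (later kernels append further tags). */
#define BALLOON_S_OOM_KILL    10
#define BALLOON_S_ALLOC_STALL 11
#define BALLOON_S_NR          12

/* Same order as VIRTIO_BALLOON_S_NAMES: array index == stat tag. */
static const char *const balloon_stat_names[BALLOON_S_NR] = {
	"swap-in", "swap-out", "major-faults", "minor-faults",
	"free-memory", "total-memory", "available-memory", "disk-caches",
	"hugetlb-allocations", "hugetlb-failures", "oom-kills", "alloc-stalls",
};

/* Assumed wire layout: packed { le16 tag; le64 val; } = 10 bytes. */
#define BALLOON_STAT_SIZE 10

const char *balloon_stat_name(uint16_t tag)
{
	return tag < BALLOON_S_NR ? balloon_stat_names[tag] : NULL;
}

/* Reverse lookup: returns the tag for a stat name, or -1 if unknown. */
int balloon_stat_tag(const char *name)
{
	for (int i = 0; i < BALLOON_S_NR; i++)
		if (strcmp(balloon_stat_names[i], name) == 0)
			return i;
	return -1;
}

/* Decode an unsigned little-endian integer regardless of host byte order. */
static uint64_t read_le(const uint8_t *p, int nbytes)
{
	uint64_t v = 0;

	assert(nbytes > 0 && nbytes <= 8);
	for (int i = nbytes - 1; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Find one counter in a stats buffer; returns 0 and sets *val, or -1. */
int balloon_find_stat(const uint8_t *buf, size_t len, uint16_t tag,
		      uint64_t *val)
{
	for (size_t off = 0; off + BALLOON_STAT_SIZE <= len;
	     off += BALLOON_STAT_SIZE) {
		if ((uint16_t)read_le(buf + off, 2) == tag) {
			*val = read_le(buf + off + 2, 8);
			return 0;
		}
	}
	return -1;
}
```

A host agent could sample `oom-kills` and `alloc-stalls` this way at each stats update and alert when either counter increases between samples, instead of scraping the guest kernel log.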
Acked-by: David Hildenbrand Signed-off-by: zhenwei pi Message-Id: <20240423034109.1552866-4-pizhenwei@bytedance.com> Signed-off-by: Michael S. Tsirkin --- include/uapi/linux/virtio_balloon.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index b17bbe033697..487b893a160e 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -72,7 +72,8 @@ struct virtio_balloon_config { #define VIRTIO_BALLOON_S_HTLB_PGALLOC 8 /* Hugetlb page allocations */ #define VIRTIO_BALLOON_S_HTLB_PGFAIL 9 /* Hugetlb page allocation failures */ #define VIRTIO_BALLOON_S_OOM_KILL 10 /* OOM killer invocations */ -#define VIRTIO_BALLOON_S_NR 11 +#define VIRTIO_BALLOON_S_ALLOC_STALL 11 /* Stall count of memory allocatoin */ +#define VIRTIO_BALLOON_S_NR 12 #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \ VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \ @@ -85,7 +86,8 @@ struct virtio_balloon_config { VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \ VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \ VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \ - VIRTIO_BALLOON_S_NAMES_prefix "oom-kills" \ + VIRTIO_BALLOON_S_NAMES_prefix "oom-kills", \ + VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls" \ } #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("") -- cgit v1.2.3 From 74c025c5d7e4ac7c7ad269c1ee64da4bdfe4770c Mon Sep 17 00:00:00 2001 From: zhenwei pi Date: Tue, 23 Apr 2024 11:41:09 +0800 Subject: virtio_balloon: introduce memory scan/reclaim info Expose memory scan/reclaim information to the host side via virtio balloon device. 
Now we have a metric to analyze the memory performance:
  y: counter increases
  n: counter does not change
  h: the rate of counter change is high
  l: the rate of counter change is low
  OOM: VIRTIO_BALLOON_S_OOM_KILL
  STALL: VIRTIO_BALLOON_S_ALLOC_STALL
  ASCAN: VIRTIO_BALLOON_S_ASYNC_SCAN
  DSCAN: VIRTIO_BALLOON_S_DIRECT_SCAN
  ARCLM: VIRTIO_BALLOON_S_ASYNC_RECLAIM
  DRCLM: VIRTIO_BALLOON_S_DIRECT_RECLAIM
- OOM[y], STALL[*], ASCAN[*], DSCAN[*], ARCLM[*], DRCLM[*]: the guest runs under really critical memory pressure.
- OOM[n], STALL[h], ASCAN[*], DSCAN[l], ARCLM[*], DRCLM[l]: memory allocation stalls due to a cgroup limit, not global memory pressure.
- OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[h]: memory allocation stalls due to global memory pressure, and performance suffers a lot. A high DRCLM/DSCAN ratio shows quite effective memory reclaim.
- OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[l]: memory allocation stalls due to global memory pressure. The DRCLM/DSCAN ratio is low and the guest OS is thrashing heavily; serious cases lead to poor performance and difficult troubleshooting. For example, sshd may block on memory allocation when accepting new connections, so a user can't log in to the VM via ssh.
- OOM[n], STALL[n], ASCAN[h], DSCAN[n], ARCLM[l], DRCLM[n]: the low ARCLM/ASCAN ratio shows that the guest tries to reclaim more memory but can't. Once more memory is required in the future, it will struggle to reclaim memory.
Acked-by: David Hildenbrand Signed-off-by: zhenwei pi Message-Id: <20240423034109.1552866-5-pizhenwei@bytedance.com> Signed-off-by: Michael S.
Tsirkin --- include/uapi/linux/virtio_balloon.h | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index 487b893a160e..ee35a372805d 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -73,7 +73,11 @@ struct virtio_balloon_config { #define VIRTIO_BALLOON_S_HTLB_PGFAIL 9 /* Hugetlb page allocation failures */ #define VIRTIO_BALLOON_S_OOM_KILL 10 /* OOM killer invocations */ #define VIRTIO_BALLOON_S_ALLOC_STALL 11 /* Stall count of memory allocatoin */ -#define VIRTIO_BALLOON_S_NR 12 +#define VIRTIO_BALLOON_S_ASYNC_SCAN 12 /* Amount of memory scanned asynchronously */ +#define VIRTIO_BALLOON_S_DIRECT_SCAN 13 /* Amount of memory scanned directly */ +#define VIRTIO_BALLOON_S_ASYNC_RECLAIM 14 /* Amount of memory reclaimed asynchronously */ +#define VIRTIO_BALLOON_S_DIRECT_RECLAIM 15 /* Amount of memory reclaimed directly */ +#define VIRTIO_BALLOON_S_NR 16 #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \ VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \ @@ -87,7 +91,11 @@ struct virtio_balloon_config { VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \ VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \ VIRTIO_BALLOON_S_NAMES_prefix "oom-kills", \ - VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls" \ + VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls", \ + VIRTIO_BALLOON_S_NAMES_prefix "async-scans", \ + VIRTIO_BALLOON_S_NAMES_prefix "direct-scans", \ + VIRTIO_BALLOON_S_NAMES_prefix "async-reclaims", \ + VIRTIO_BALLOON_S_NAMES_prefix "direct-reclaims" \ } #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("") -- cgit v1.2.3 From 2f87e9cf0c9e21ab9be1fb2ba8520a1525359497 Mon Sep 17 00:00:00 2001 From: Cindy Lu Date: Wed, 31 Jul 2024 11:16:01 +0800 Subject: vdpa: support set mac address from vdpa tool Add a new UAPI to support setting the MAC address from the vdpa tool. Function
vdpa_nl_cmd_dev_attr_set_doit() will get the new MAC address from the vdpa tool and then set it on the device. The usage is: vdpa dev set name vdpa_name mac **:**:**:**:**:** Here is an example: root@L1# vdpa -jp dev config show vdpa0 { "config": { "vdpa0": { "mac": "82:4d:e9:5d:d7:e6", "link ": "up", "link_announce ": false, "mtu": 1500 } } } root@L1# vdpa dev set name vdpa0 mac 00:11:22:33:44:55 root@L1# vdpa -jp dev config show vdpa0 { "config": { "vdpa0": { "mac": "00:11:22:33:44:55", "link ": "up", "link_announce ": false, "mtu": 1500 } } } Signed-off-by: Cindy Lu Message-Id: <20240731031653.1047692-2-lulu@redhat.com> Signed-off-by: Michael S. Tsirkin Acked-by: Jason Wang --- include/uapi/linux/vdpa.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/vdpa.h b/include/uapi/linux/vdpa.h index 842bf1201ac4..71edf2c70cc3 100644 --- a/include/uapi/linux/vdpa.h +++ b/include/uapi/linux/vdpa.h @@ -19,6 +19,7 @@ enum vdpa_command { VDPA_CMD_DEV_GET, /* can dump */ VDPA_CMD_DEV_CONFIG_GET, /* can dump */ VDPA_CMD_DEV_VSTATS_GET, + VDPA_CMD_DEV_ATTR_SET, }; enum vdpa_attr { -- cgit v1.2.3 From be8e9eb3750639aa5cffb3f764ca080caed41bd0 Mon Sep 17 00:00:00 2001 From: Jason Xing Date: Mon, 9 Sep 2024 09:56:11 +0800 Subject: net-timestamp: introduce SOF_TIMESTAMPING_OPT_RX_FILTER flag Introduce a new flag SOF_TIMESTAMPING_OPT_RX_FILTER in the receive path. Users can set it together with SOF_TIMESTAMPING_SOFTWARE to filter out rx software timestamp reports, especially after a process turns on netstamp_needed_key, which timestamps every incoming skb. Previously, we found that if one application started first and turned on netstamp_needed_key, another application passing only SOF_TIMESTAMPING_SOFTWARE would also receive rx timestamps. This new flag handles that case without breaking existing users.
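A process that wants software TX timestamps, but not the RX software timestamps that another process's netstamp_needed_key would otherwise cause to be reported, could request the new flag alongside SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_TX_SOFTWARE. The C below is a minimal sketch under that assumption. `TS_OPT_RX_FILTER` is a hypothetical local copy of the `(1 << 17)` value from the diff, so the sketch also builds against older UAPI headers, and the helper names are illustrative only.

```c
#include <assert.h>
#include <sys/socket.h>
#include <linux/net_tstamp.h>

/* Hypothetical local mirror of SOF_TIMESTAMPING_OPT_RX_FILTER (1 << 17). */
#define TS_OPT_RX_FILTER (1U << 17)

/* Report software timestamps and generate them on transmit, but filter
 * out software RX timestamps that were generated because some other
 * socket enabled RX timestamping system-wide. */
unsigned int tx_only_sw_ts_flags(void)
{
	return SOF_TIMESTAMPING_SOFTWARE |
	       SOF_TIMESTAMPING_TX_SOFTWARE |
	       TS_OPT_RX_FILTER;
}

/* Returns setsockopt()'s result: 0 on success, -1 with errno set. */
int enable_tx_only_sw_ts(int fd)
{
	int flags = (int)tx_only_sw_ts_flags();

	assert(fd >= 0);
	return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
			  &flags, sizeof(flags));
}
```

On kernels that predate the flag, bit 17 lies outside SOF_TIMESTAMPING_MASK, so setsockopt() would be expected to fail with EINVAL rather than silently ignore the request; a caller can use that to detect support.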
Quoting Willem to explain why we need the flag: "why a process would want to request software timestamp reporting, but not receive software timestamp generation. The only use I see is when the application does request SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_TX_SOFTWARE." Similarly, this new flag can also be used in the hardware case: set together with SOF_TIMESTAMPING_RAW_HARDWARE, it prevents hardware receive timestamps from being reported. A note about the errqueue handling in this patch: the egress path needs careful handling, or reporting the tx timestamp will fail. Both the egress and ingress paths eventually call sock_recv_timestamp(), so we have to distinguish them, and the errqueue is a good indicator of the flow direction. Suggested-by: Willem de Bruijn Signed-off-by: Jason Xing Reviewed-by: Willem de Bruijn Link: https://patch.msgid.link/20240909015612.3856-2-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski --- include/uapi/linux/net_tstamp.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h index a2c66b3d7f0f..858339d1c1c4 100644 --- a/include/uapi/linux/net_tstamp.h +++ b/include/uapi/linux/net_tstamp.h @@ -32,8 +32,9 @@ enum { SOF_TIMESTAMPING_OPT_TX_SWHW = (1<<14), SOF_TIMESTAMPING_BIND_PHC = (1 << 15), SOF_TIMESTAMPING_OPT_ID_TCP = (1 << 16), + SOF_TIMESTAMPING_OPT_RX_FILTER = (1 << 17), - SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_ID_TCP, + SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_RX_FILTER, SOF_TIMESTAMPING_MASK = (SOF_TIMESTAMPING_LAST - 1) | SOF_TIMESTAMPING_LAST }; -- cgit v1.2.3 From 87f10faf166a9114aa0d4132298cad379de16fdd Mon Sep 17 00:00:00 2001 From: Bjorn Helgaas Date: Tue, 27 Aug 2024 18:48:48 -0500 Subject: PCI: Rename CRS Completion Status to RRS PCIe r6.0 changed the abbreviation for "Configuration Request Retry Status" Completion Status from "CRS" to "RRS" and uses the
terminology of "Configuration RRS Software Visibility" instead of "CRS Software Visibility". Align the Linux usage with the r6.0 spec language. No functional change intended. It's confusing to make this change, but I think "RRS" *is* a better abbreviation because it was easy to interpret "CRS" as "Completion Retry Status", which really didn't make any sense. Link: https://lore.kernel.org/r/20240827234848.4429-4-helgaas@kernel.org Signed-off-by: Bjorn Helgaas --- include/uapi/linux/pci_regs.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h index 94c00996e633..f94591f9f5e9 100644 --- a/include/uapi/linux/pci_regs.h +++ b/include/uapi/linux/pci_regs.h @@ -634,9 +634,11 @@ #define PCI_EXP_RTCTL_SENFEE 0x0002 /* System Error on Non-Fatal Error */ #define PCI_EXP_RTCTL_SEFEE 0x0004 /* System Error on Fatal Error */ #define PCI_EXP_RTCTL_PMEIE 0x0008 /* PME Interrupt Enable */ -#define PCI_EXP_RTCTL_CRSSVE 0x0010 /* CRS Software Visibility Enable */ +#define PCI_EXP_RTCTL_RRS_SVE 0x0010 /* Config RRS Software Visibility Enable */ +#define PCI_EXP_RTCTL_CRSSVE PCI_EXP_RTCTL_RRS_SVE /* compatibility */ #define PCI_EXP_RTCAP 0x1e /* Root Capabilities */ -#define PCI_EXP_RTCAP_CRSVIS 0x0001 /* CRS Software Visibility capability */ +#define PCI_EXP_RTCAP_RRS_SV 0x0001 /* Config RRS Software Visibility */ +#define PCI_EXP_RTCAP_CRSVIS PCI_EXP_RTCAP_RRS_SV /* compatibility */ #define PCI_EXP_RTSTA 0x20 /* Root Status */ #define PCI_EXP_RTSTA_PME_RQ_ID 0x0000ffff /* PME Requester ID */ #define PCI_EXP_RTSTA_PME 0x00010000 /* PME status */ -- cgit v1.2.3 From 6ebf2d021a13a77b495b4de15c6834e26b80d08e Mon Sep 17 00:00:00 2001 From: Christian Loehle Date: Tue, 13 Aug 2024 15:43:46 +0100 Subject: sched/deadline: Clarify nanoseconds in uapi Specify the time values of the deadline parameters of deadline, runtime, and period as being in nanoseconds explicitly as they always 
have been. Signed-off-by: Christian Loehle Signed-off-by: Peter Zijlstra (Intel) Acked-by: Juri Lelli Acked-by: Rafael J. Wysocki Link: https://lore.kernel.org/r/20240813144348.1180344-3-christian.loehle@arm.com --- include/uapi/linux/sched/types.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h index 90662385689b..bf6e9ae031c1 100644 --- a/include/uapi/linux/sched/types.h +++ b/include/uapi/linux/sched/types.h @@ -58,9 +58,9 @@ * * This is reflected by the following fields of the sched_attr structure: * - * @sched_deadline representative of the task's deadline - * @sched_runtime representative of the task's runtime - * @sched_period representative of the task's period + * @sched_deadline representative of the task's deadline in nanoseconds + * @sched_runtime representative of the task's runtime in nanoseconds + * @sched_period representative of the task's period in nanoseconds * * Given this task model, there are a multiplicity of scheduling algorithms * and policies, that can be used to ensure all the tasks will make their -- cgit v1.2.3 From 50c52250e2d74b098465841163c18f4b4e9ad430 Mon Sep 17 00:00:00 2001 From: Pavel Begunkov Date: Wed, 11 Sep 2024 17:34:41 +0100 Subject: block: implement async io_uring discard cmd io_uring allows implementing custom file-specific asynchronous operations via the fops->uring_cmd callback, a.k.a. IORING_OP_URING_CMD requests or just io_uring commands. Use it to add support for async discards. Normally, it first tries to queue up bios in a non-blocking context, and if that fails, it retries from a blocking context by returning -EAGAIN to the io_uring core. We always get the result from bios asynchronously by setting a custom bi_end_io callback, at which point we drag the request into the task context to either reissue or complete it and post a completion to the user.
Unlike ioctl(BLKDISCARD), which has stronger guarantees against races, we only make a best-effort attempt to invalidate the page cache; it can race with any reads and writes and leave the page cache stale. These are the same kinds of races we allow for direct writes. Also, apart from cases where discarding is not allowed at all (e.g. discards are not supported or the file/device is read-only), the user should assume that the sector range on disk is no longer valid, even when an error was returned. Suggested-by: Conrad Meyer Signed-off-by: Pavel Begunkov Link: https://lore.kernel.org/r/2b5210443e4fa0257934f73dfafcc18a77cd0e09.1726072086.git.asml.silence@gmail.com Signed-off-by: Jens Axboe --- include/uapi/linux/blkdev.h | 14 ++++++++++++++ 1 file changed, 14 insertions(+) create mode 100644 include/uapi/linux/blkdev.h (limited to 'include/uapi') diff --git a/include/uapi/linux/blkdev.h b/include/uapi/linux/blkdev.h new file mode 100644 index 000000000000..66373cd1a83a --- /dev/null +++ b/include/uapi/linux/blkdev.h @@ -0,0 +1,14 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _UAPI_LINUX_BLKDEV_H +#define _UAPI_LINUX_BLKDEV_H + +#include +#include + +/* + * io_uring block file commands, see IORING_OP_URING_CMD. + * It's a different number space from ioctl(), reuse the block's code 0x12. + */ +#define BLOCK_URING_CMD_DISCARD _IO(0x12, 0) + +#endif -- cgit v1.2.3 From 3efd7ab46d0aebc3e567a9846e79a98dbad3291c Mon Sep 17 00:00:00 2001 From: Mina Almasry Date: Tue, 10 Sep 2024 17:14:46 +0000 Subject: net: netdev netlink api to bind dma-buf to a net device The API takes the dma-buf fd as input and binds it to the net device. The user can specify the rx queues to bind the dma-buf to.
Suggested-by: Stanislav Fomichev Signed-off-by: Mina Almasry Reviewed-by: Donald Hunter Reviewed-by: Jakub Kicinski Link: https://patch.msgid.link/20240910171458.219195-3-almasrymina@google.com Signed-off-by: Jakub Kicinski --- include/uapi/linux/netdev.h | 11 +++++++++++ 1 file changed, 11 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h index 43742ac5b00d..91bf3ecc5f1d 100644 --- a/include/uapi/linux/netdev.h +++ b/include/uapi/linux/netdev.h @@ -173,6 +173,16 @@ enum { NETDEV_A_QSTATS_MAX = (__NETDEV_A_QSTATS_MAX - 1) }; +enum { + NETDEV_A_DMABUF_IFINDEX = 1, + NETDEV_A_DMABUF_QUEUES, + NETDEV_A_DMABUF_FD, + NETDEV_A_DMABUF_ID, + + __NETDEV_A_DMABUF_MAX, + NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1) +}; + enum { NETDEV_CMD_DEV_GET = 1, NETDEV_CMD_DEV_ADD_NTF, @@ -186,6 +196,7 @@ enum { NETDEV_CMD_QUEUE_GET, NETDEV_CMD_NAPI_GET, NETDEV_CMD_QSTATS_GET, + NETDEV_CMD_BIND_RX, __NETDEV_CMD_MAX, NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1) -- cgit v1.2.3 From 8f0b3cc9a4c102c24808c87f1bc943659d7a7f9f Mon Sep 17 00:00:00 2001 From: Mina Almasry Date: Tue, 10 Sep 2024 17:14:53 +0000 Subject: tcp: RX path for devmem TCP In tcp_recvmsg_locked(), detect if the skb being received by the user is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM flag - pass it to tcp_recvmsg_devmem() for custom handling. tcp_recvmsg_devmem() copies any data in the skb header to the linear buffer, and returns a cmsg to the user indicating the number of bytes returned in the linear buffer. tcp_recvmsg_devmem() then loops over the inaccessible devmem skb frags, and returns to the user a cmsg_devmem indicating the location of the data in the dmabuf device memory. cmsg_devmem contains this information: 1. the offset into the dmabuf where the payload starts. 'frag_offset'. 2. the size of the frag. 'frag_size'. 3. an opaque token 'frag_token' to return to the kernel when the buffer is to be released.
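To make the cmsg contract above concrete, here is a hedged sketch of the parsing a MSG_SOCK_DEVMEM receiver might do after recvmsg(). It assumes the asm-generic SCM_DEVMEM_LINEAR/SCM_DEVMEM_DMABUF values (78/79) from this patch, and that both cmsg types carry a `struct dmabuf_cmsg` payload (for the linear case only `frag_size` is meaningful). The struct is mirrored under the hypothetical name `struct devmem_frag_cmsg` so it compiles against older headers; `devmem_parse_cmsgs()` and `struct devmem_summary` are illustrative, not kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SCM_DEVMEM_LINEAR
#define SCM_DEVMEM_LINEAR 78 /* asm-generic value from this patch */
#endif
#ifndef SCM_DEVMEM_DMABUF
#define SCM_DEVMEM_DMABUF 79
#endif

/* Hypothetical local mirror of struct dmabuf_cmsg (<linux/uio.h>). */
struct devmem_frag_cmsg {
	uint64_t frag_offset; /* offset into the dmabuf */
	uint32_t frag_size;   /* bytes in this frag (or in the linear copy) */
	uint32_t frag_token;  /* token to hand back when done with the frag */
	uint32_t dmabuf_id;
	uint32_t flags;
};
static_assert(sizeof(struct devmem_frag_cmsg) == 24,
	      "layout must match struct dmabuf_cmsg");

struct devmem_summary {
	size_t linear_bytes; /* bytes copied into the user's linear buffer */
	size_t dmabuf_bytes; /* bytes left in dmabuf device memory */
	size_t n_tokens;     /* frag tokens stored in the caller's array */
};

/* Walk the control messages of one recvmsg() result. Returns 0 on
 * success, -1 on a truncated cmsg or if the token array is too small. */
int devmem_parse_cmsgs(struct msghdr *msg, uint32_t *tokens,
		       size_t max_tokens, struct devmem_summary *out)
{
	struct cmsghdr *cm;

	assert(msg != NULL && out != NULL);
	memset(out, 0, sizeof(*out));
	for (cm = CMSG_FIRSTHDR(msg); cm != NULL; cm = CMSG_NXTHDR(msg, cm)) {
		struct devmem_frag_cmsg frag;

		if (cm->cmsg_level != SOL_SOCKET ||
		    (cm->cmsg_type != SCM_DEVMEM_LINEAR &&
		     cm->cmsg_type != SCM_DEVMEM_DMABUF))
			continue;
		if (cm->cmsg_len < CMSG_LEN(sizeof(frag)))
			return -1;
		memcpy(&frag, CMSG_DATA(cm), sizeof(frag)); /* avoid unaligned reads */

		if (cm->cmsg_type == SCM_DEVMEM_LINEAR) {
			out->linear_bytes += frag.frag_size;
			continue;
		}
		out->dmabuf_bytes += frag.frag_size;
		if (out->n_tokens == max_tokens)
			return -1;
		tokens[out->n_tokens++] = frag.frag_token;
	}
	return 0;
}
```

The collected `frag_token` values are what the application would later return to the kernel, via the SO_DEVMEM_DONTNEED setsockopt added later in this series, once it has finished reading each frag.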
The pages awaiting freeing are stored in the newly added sk->sk_user_frags, and each page passed to userspace is get_page()'d. This reference is dropped once the userspace indicates that it is done reading this page. All pages are released when the socket is destroyed. Signed-off-by: Willem de Bruijn Signed-off-by: Kaiyuan Zhang Signed-off-by: Mina Almasry Reviewed-by: Pavel Begunkov Reviewed-by: Eric Dumazet Link: https://patch.msgid.link/20240910171458.219195-10-almasrymina@google.com Signed-off-by: Jakub Kicinski --- include/uapi/asm-generic/socket.h | 5 +++++ include/uapi/linux/uio.h | 13 +++++++++++++ 2 files changed, 18 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h index 8ce8a39a1e5f..e993edc9c0ee 100644 --- a/include/uapi/asm-generic/socket.h +++ b/include/uapi/asm-generic/socket.h @@ -135,6 +135,11 @@ #define SO_PASSPIDFD 76 #define SO_PEERPIDFD 77 +#define SO_DEVMEM_LINEAR 78 +#define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR +#define SO_DEVMEM_DMABUF 79 +#define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF + #if !defined(__KERNEL__) #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__)) diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h index 059b1a9147f4..3a22ddae376a 100644 --- a/include/uapi/linux/uio.h +++ b/include/uapi/linux/uio.h @@ -20,6 +20,19 @@ struct iovec __kernel_size_t iov_len; /* Must be size_t (1003.1g) */ }; +struct dmabuf_cmsg { + __u64 frag_offset; /* offset into the dmabuf where the frag starts. + */ + __u32 frag_size; /* size of the frag. */ + __u32 frag_token; /* token representing this frag for + * DEVMEM_DONTNEED. + */ + __u32 dmabuf_id; /* dmabuf id this frag belongs to. */ + __u32 flags; /* Currently unused. Reserved for future + * uses. 
+ */ +}; + /* * UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1) */ -- cgit v1.2.3 From 678f6e28b5f6fc2316f2c0fed8f8903101f1e128 Mon Sep 17 00:00:00 2001 From: Mina Almasry Date: Tue, 10 Sep 2024 17:14:54 +0000 Subject: net: add SO_DEVMEM_DONTNEED setsockopt to release RX frags Add an interface for the user to notify the kernel that it is done reading the devmem dmabuf frags returned as cmsg. The kernel will drop the reference on the frags to make them available for reuse. Signed-off-by: Willem de Bruijn Signed-off-by: Kaiyuan Zhang Signed-off-by: Mina Almasry Reviewed-by: Pavel Begunkov Reviewed-by: Eric Dumazet Link: https://patch.msgid.link/20240910171458.219195-11-almasrymina@google.com Signed-off-by: Jakub Kicinski --- include/uapi/asm-generic/socket.h | 1 + include/uapi/linux/uio.h | 5 +++++ 2 files changed, 6 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h index e993edc9c0ee..3b4e3e815602 100644 --- a/include/uapi/asm-generic/socket.h +++ b/include/uapi/asm-generic/socket.h @@ -139,6 +139,7 @@ #define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR #define SO_DEVMEM_DMABUF 79 #define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF +#define SO_DEVMEM_DONTNEED 80 #if !defined(__KERNEL__) diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h index 3a22ddae376a..649739e0c404 100644 --- a/include/uapi/linux/uio.h +++ b/include/uapi/linux/uio.h @@ -33,6 +33,11 @@ struct dmabuf_cmsg { */ }; +struct dmabuf_token { + __u32 token_start; + __u32 token_count; +}; + /* * UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1) */ -- cgit v1.2.3 From d0caf9876a1c9f844307effb598ad1312d9e0025 Mon Sep 17 00:00:00 2001 From: Mina Almasry Date: Tue, 10 Sep 2024 17:14:57 +0000 Subject: netdev: add dmabuf introspection Add dmabuf information to page_pool stats: $ ./cli.py --spec ../netlink/specs/netdev.yaml --dump page-pool-get ... 
{'dmabuf': 10, 'id': 456, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 455, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 454, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 453, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 452, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 451, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 450, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 449, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, And queue stats: $ ./cli.py --spec ../netlink/specs/netdev.yaml --dump queue-get ... {'dmabuf': 10, 'id': 8, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 9, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 10, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 11, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 12, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 13, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 14, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 15, 'ifindex': 3, 'type': 'rx'}, Suggested-by: Jakub Kicinski Signed-off-by: Mina Almasry Reviewed-by: Jakub Kicinski Link: https://patch.msgid.link/20240910171458.219195-14-almasrymina@google.com Signed-off-by: Jakub Kicinski --- include/uapi/linux/netdev.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h index 91bf3ecc5f1d..7c308f04e7a0 100644 --- a/include/uapi/linux/netdev.h +++ b/include/uapi/linux/netdev.h @@ -93,6 +93,7 @@ enum { NETDEV_A_PAGE_POOL_INFLIGHT, NETDEV_A_PAGE_POOL_INFLIGHT_MEM, NETDEV_A_PAGE_POOL_DETACH_TIME, + NETDEV_A_PAGE_POOL_DMABUF, __NETDEV_A_PAGE_POOL_MAX, NETDEV_A_PAGE_POOL_MAX = (__NETDEV_A_PAGE_POOL_MAX - 1) @@ -131,6 +132,7 @@ enum { NETDEV_A_QUEUE_IFINDEX, NETDEV_A_QUEUE_TYPE, NETDEV_A_QUEUE_NAPI_ID, + 
NETDEV_A_QUEUE_DMABUF, __NETDEV_A_QUEUE_MAX, NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1) -- cgit v1.2.3 From 8f9bf857e43b3f75a098e3af3a6fec2d03203a1e Mon Sep 17 00:00:00 2001 From: Parthiban Veerasooran Date: Mon, 9 Sep 2024 13:55:06 +0530 Subject: net: ethernet: oa_tc6: implement internal PHY initialization The internal PHY is initialized according to the PHY register capability supported by the MAC-PHY. Direct PHY Register Access Capability indicates if PHY registers are directly accessible within the SPI register memory space. Indirect PHY Register Access Capability indicates if PHY registers are indirectly accessible through the MDIO/MDC registers MDIOACCn defined in the OPEN Alliance specification. Currently, only direct register access is supported. Reviewed-by: Andrew Lunn Signed-off-by: Parthiban Veerasooran Link: https://patch.msgid.link/20240909082514.262942-7-Parthiban.Veerasooran@microchip.com Signed-off-by: Jakub Kicinski --- include/uapi/linux/mdio.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/mdio.h b/include/uapi/linux/mdio.h index c0c8ec995b06..f0d3f268240d 100644 --- a/include/uapi/linux/mdio.h +++ b/include/uapi/linux/mdio.h @@ -23,6 +23,7 @@ #define MDIO_MMD_DTEXS 5 /* DTE Extender Sublayer */ #define MDIO_MMD_TC 6 /* Transmission Convergence */ #define MDIO_MMD_AN 7 /* Auto-Negotiation */ +#define MDIO_MMD_POWER_UNIT 13 /* PHY Power Unit */ #define MDIO_MMD_C22EXT 29 /* Clause 22 extension */ #define MDIO_MMD_VEND1 30 /* Vendor specific 1 */ #define MDIO_MMD_VEND2 31 /* Vendor specific 2 */ -- cgit v1.2.3 From 7cc2a6eadcd7a5aa36ac63e6659f5c6138c7f4d2 Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Wed, 11 Sep 2024 13:56:08 -0600 Subject: io_uring: add IORING_REGISTER_COPY_BUFFERS method Buffers can be registered with io_uring, which allows skipping the repeated pin_pages and unpin/unref of pages for each O_DIRECT operation. This reduces the overhead of O_DIRECT IO.
However, registering buffers can take some time. Normally this isn't an issue as it's done at initialization time (and hence less critical), but for cases where rings can be created and destroyed as part of an IO thread pool, registering the same buffers for multiple rings becomes a more time-sensitive proposition. As an example, let's say an application has an IO memory pool of 500G. Initial registration takes: Got 500 huge pages (each 1024MB) Registered 500 pages in 409 msec or about 0.4 seconds. If we go higher to 900 1GB huge pages being registered: Registered 900 pages in 738 msec which is, as expected, a fully linear scaling. Rather than have each ring pin/map/register the same buffer pool, provide an io_uring_register(2) opcode to simply duplicate the buffers that are registered with another ring. Adding the same 900GB of registered buffers to the target ring can then be accomplished in: Copied 900 pages in 17 usec While timing differs a bit, this provides around a 25,000-40,000x speedup for this use case.
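The new opcode can be exercised through the raw io_uring_register(2) syscall. The sketch below is written against this patch's UAPI as shown in the diff: opcode 30 and a 32-byte `struct io_uring_copy_buffers`, mirrored locally under hypothetical names so it compiles against older headers. It assumes the kernel expects `nr_args == 1`, and that `IORING_REGISTER_SRC_REGISTERED` marks `src_fd` as a registered ring index rather than a plain fd; the opcode name may also differ in released kernels, so treat this as illustrative only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_io_uring_register
#define __NR_io_uring_register 427 /* asm-generic number; assumption */
#endif

/* Hypothetical local mirrors of the UAPI additions in this patch. */
#define REGISTER_COPY_BUFFERS   30  /* IORING_REGISTER_COPY_BUFFERS */
#define REGISTER_SRC_REGISTERED 1u  /* IORING_REGISTER_SRC_REGISTERED */

struct copy_buffers_arg {           /* struct io_uring_copy_buffers */
	uint32_t src_fd;
	uint32_t flags;
	uint32_t pad[6];
};
static_assert(sizeof(struct copy_buffers_arg) == 32,
	      "layout must match struct io_uring_copy_buffers");

/*
 * Duplicate the buffers registered on src_ring into dst_ring.
 * If src_is_registered is nonzero, src_ring is passed as a registered
 * ring index instead of a plain fd. Returns 0 on success, or -1 with
 * errno set (e.g. EBADF for a bad dst_ring, ENOSYS without io_uring).
 */
int copy_registered_buffers(int dst_ring, int src_ring, int src_is_registered)
{
	struct copy_buffers_arg arg;

	memset(&arg, 0, sizeof(arg)); /* keep reserved padding zeroed */
	arg.src_fd = (uint32_t)src_ring;
	arg.flags = src_is_registered ? REGISTER_SRC_REGISTERED : 0;

	/* A single argument structure, so nr_args is 1. */
	return (int)syscall(__NR_io_uring_register, (unsigned int)dst_ring,
			    REGISTER_COPY_BUFFERS, &arg, 1);
}
```

In the thread-pool scenario described above, one long-lived "template" ring would register the buffer pool once, and each newly created ring would call `copy_registered_buffers(new_ring_fd, template_ring_fd, 0)` instead of pinning and mapping the same memory again.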
Signed-off-by: Jens Axboe --- include/uapi/linux/io_uring.h | 13 +++++++++++++ 1 file changed, 13 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index a275f91d2ac0..9dc5bb428c8a 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -609,6 +609,9 @@ enum io_uring_register_op { IORING_REGISTER_CLOCK = 29, + /* copy registered buffers from source ring to current ring */ + IORING_REGISTER_COPY_BUFFERS = 30, + /* this goes last */ IORING_REGISTER_LAST, @@ -694,6 +697,16 @@ struct io_uring_clock_register { __u32 __resv[3]; }; +enum { + IORING_REGISTER_SRC_REGISTERED = 1, +}; + +struct io_uring_copy_buffers { + __u32 src_fd; + __u32 flags; + __u32 pad[6]; +}; + struct io_uring_buf { __u64 addr; __u32 len; -- cgit v1.2.3 From b2155807893aac40f1a1cdf43f7fcc270cbfc05a Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Tue, 10 Sep 2024 17:21:42 -0700 Subject: uapi: libc-compat: remove ipx leftovers The uAPI headers for IPX were deleted 3 years ago in commit 6c9b40844751 ("net: Remove net/ipx.h and uapi/linux/ipx.h header files") Delete the leftover defines from libc-compat.h Link: https://patch.msgid.link/20240911002142.1508694-1-kuba@kernel.org Signed-off-by: Jakub Kicinski --- include/uapi/linux/libc-compat.h | 36 ------------------------------------ 1 file changed, 36 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/libc-compat.h b/include/uapi/linux/libc-compat.h index 8254c937c9f4..0eca95ccb41e 100644 --- a/include/uapi/linux/libc-compat.h +++ b/include/uapi/linux/libc-compat.h @@ -140,25 +140,6 @@ #endif /* _NETINET_IN_H */ -/* Coordinate with glibc netipx/ipx.h header. 
*/ -#if defined(__NETIPX_IPX_H) - -#define __UAPI_DEF_SOCKADDR_IPX 0 -#define __UAPI_DEF_IPX_ROUTE_DEFINITION 0 -#define __UAPI_DEF_IPX_INTERFACE_DEFINITION 0 -#define __UAPI_DEF_IPX_CONFIG_DATA 0 -#define __UAPI_DEF_IPX_ROUTE_DEF 0 - -#else /* defined(__NETIPX_IPX_H) */ - -#define __UAPI_DEF_SOCKADDR_IPX 1 -#define __UAPI_DEF_IPX_ROUTE_DEFINITION 1 -#define __UAPI_DEF_IPX_INTERFACE_DEFINITION 1 -#define __UAPI_DEF_IPX_CONFIG_DATA 1 -#define __UAPI_DEF_IPX_ROUTE_DEF 1 - -#endif /* defined(__NETIPX_IPX_H) */ - /* Definitions for xattr.h */ #if defined(_SYS_XATTR_H) #define __UAPI_DEF_XATTR 0 @@ -240,23 +221,6 @@ #define __UAPI_DEF_IP6_MTUINFO 1 #endif -/* Definitions for ipx.h */ -#ifndef __UAPI_DEF_SOCKADDR_IPX -#define __UAPI_DEF_SOCKADDR_IPX 1 -#endif -#ifndef __UAPI_DEF_IPX_ROUTE_DEFINITION -#define __UAPI_DEF_IPX_ROUTE_DEFINITION 1 -#endif -#ifndef __UAPI_DEF_IPX_INTERFACE_DEFINITION -#define __UAPI_DEF_IPX_INTERFACE_DEFINITION 1 -#endif -#ifndef __UAPI_DEF_IPX_CONFIG_DATA -#define __UAPI_DEF_IPX_CONFIG_DATA 1 -#endif -#ifndef __UAPI_DEF_IPX_ROUTE_DEF -#define __UAPI_DEF_IPX_ROUTE_DEF 1 -#endif - /* Definitions for xattr.h */ #ifndef __UAPI_DEF_XATTR #define __UAPI_DEF_XATTR 1 -- cgit v1.2.3 From 9cbed5aab5aeea420d0aa945733bf608449d44fb Mon Sep 17 00:00:00 2001 From: Chiara Meiohas Date: Mon, 9 Sep 2024 20:30:24 +0300 Subject: RDMA/nldev: Add support for RDMA monitoring Introduce a new netlink command to allow rdma event monitoring. The rdma events supported now are IB device registration/unregistration and net device attachment/detachment. 
Example output of rdma monitor and the commands which trigger the events: $ rdma monitor $ rmmod mlx5_ib [UNREGISTER] dev 1 rocep8s0f1 [UNREGISTER] dev 0 rocep8s0f0 $ modprobe mlx5_ib [REGISTER] dev 2 mlx5_0 [NETDEV_ATTACH] dev 2 mlx5_0 port 1 netdev 4 eth2 [REGISTER] dev 3 mlx5_1 [NETDEV_ATTACH] dev 3 mlx5_1 port 1 netdev 5 eth3 $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev [UNREGISTER] dev 2 rocep8s0f0 [REGISTER] dev 4 mlx5_0 [NETDEV_ATTACH] dev 4 mlx5_0 port 30 netdev 4 eth2 $ echo 4 > /sys/class/net/eth2/device/sriov_numvfs [NETDEV_ATTACH] dev 4 rdmap8s0f0 port 2 netdev 7 eth4 [NETDEV_ATTACH] dev 4 rdmap8s0f0 port 3 netdev 8 eth5 [NETDEV_ATTACH] dev 4 rdmap8s0f0 port 4 netdev 9 eth6 [NETDEV_ATTACH] dev 4 rdmap8s0f0 port 5 netdev 10 eth7 [REGISTER] dev 5 mlx5_0 [NETDEV_ATTACH] dev 5 mlx5_0 port 1 netdev 11 eth8 [REGISTER] dev 6 mlx5_0 [NETDEV_ATTACH] dev 6 mlx5_0 port 1 netdev 12 eth9 [REGISTER] dev 7 mlx5_0 [NETDEV_ATTACH] dev 7 mlx5_0 port 1 netdev 13 eth10 [REGISTER] dev 8 mlx5_0 [NETDEV_ATTACH] dev 8 mlx5_0 port 1 netdev 14 eth11 $ echo 0 > /sys/class/net/eth2/device/sriov_numvfs [UNREGISTER] dev 5 rocep8s0f0v0 [UNREGISTER] dev 6 rocep8s0f0v1 [UNREGISTER] dev 7 rocep8s0f0v2 [UNREGISTER] dev 8 rocep8s0f0v3 [NETDEV_DETACH] dev 4 rdmap8s0f0 port 2 [NETDEV_DETACH] dev 4 rdmap8s0f0 port 3 [NETDEV_DETACH] dev 4 rdmap8s0f0 port 4 [NETDEV_DETACH] dev 4 rdmap8s0f0 port 5 Signed-off-by: Chiara Meiohas Signed-off-by: Michael Guralnik Link: https://patch.msgid.link/20240909173025.30422-7-michaelgur@nvidia.com Signed-off-by: Leon Romanovsky --- include/uapi/rdma/rdma_netlink.h | 15 +++++++++++++++ 1 file changed, 15 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h index 2f37568f5556..5f9636d26050 100644 --- a/include/uapi/rdma/rdma_netlink.h +++ b/include/uapi/rdma/rdma_netlink.h @@ -15,6 +15,7 @@ enum { enum { RDMA_NL_GROUP_IWPM = 2, RDMA_NL_GROUP_LS, + RDMA_NL_GROUP_NOTIFY, 
RDMA_NL_NUM_GROUPS }; @@ -305,6 +306,8 @@ enum rdma_nldev_command { RDMA_NLDEV_CMD_DELDEV, + RDMA_NLDEV_CMD_MONITOR, + RDMA_NLDEV_NUM_OPS }; @@ -574,6 +577,8 @@ enum rdma_nldev_attr { RDMA_NLDEV_ATTR_NAME_ASSIGN_TYPE, /* u8 */ + RDMA_NLDEV_ATTR_EVENT_TYPE, /* u8 */ + /* * Always the end */ @@ -624,4 +629,14 @@ enum rdma_nl_name_assign_type { RDMA_NAME_ASSIGN_TYPE_USER = 1, /* Provided by user-space */ }; +/* + * Supported rdma monitoring event types. + */ +enum rdma_nl_notify_event_type { + RDMA_REGISTER_EVENT, + RDMA_UNREGISTER_EVENT, + RDMA_NETDEV_ATTACH_EVENT, + RDMA_NETDEV_DETACH_EVENT, +}; + #endif /* _UAPI_RDMA_NETLINK_H */ -- cgit v1.2.3 From 12fb1153c53bf9b53e299c9775b84fa7838640f7 Mon Sep 17 00:00:00 2001 From: Chiara Meiohas Date: Mon, 9 Sep 2024 20:30:25 +0300 Subject: RDMA/nldev: Expose whether RDMA monitoring is supported Extend the "rdma sys" command to display whether RDMA monitoring is supported. RDMA monitoring is not supported in mlx4 because it does not use the ib_device_set_netdev() API, which sends the RDMA events. 
Example output for a kernel where monitoring is supported: $ rdma sys show netns shared privileged-qkey off monitor on copy-on-fork on Example output for a kernel where monitoring is not supported: $ rdma sys show netns shared privileged-qkey off monitor off copy-on-fork on Signed-off-by: Chiara Meiohas Signed-off-by: Michael Guralnik Link: https://patch.msgid.link/20240909173025.30422-8-michaelgur@nvidia.com Signed-off-by: Leon Romanovsky --- include/uapi/rdma/rdma_netlink.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/uapi') diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h index 5f9636d26050..39be09c0ffbb 100644 --- a/include/uapi/rdma/rdma_netlink.h +++ b/include/uapi/rdma/rdma_netlink.h @@ -579,6 +579,7 @@ enum rdma_nldev_attr { RDMA_NLDEV_ATTR_EVENT_TYPE, /* u8 */ + RDMA_NLDEV_SYS_ATTR_MONITOR_MODE, /* u8 */ /* * Always the end */ -- cgit v1.2.3 From c951a29f6ba52b86223eb00bbcff43142d59a901 Mon Sep 17 00:00:00 2001 From: Ido Schimmel Date: Wed, 11 Sep 2024 12:37:43 +0300 Subject: net: fib_rules: Add DSCP selector attribute The FIB rule TOS selector is implemented differently between IPv4 and IPv6. In IPv4 it is used to match on the three "Type of Service" bits specified in RFC 791, while in IPv6 it is used to match on the six DSCP bits specified in RFC 2474. Add a new FIB rule attribute to allow matching on DSCP. The attribute will be used to implement a 'dscp' selector in ip-rule with consistent behavior between IPv4 and IPv6. For now, set the type of the attribute to 'NLA_REJECT' so that user space will not be able to configure it. This restriction will be lifted once both IPv4 and IPv6 support the new attribute.
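To make the IPv4/IPv6 difference concrete: the DS field (the IPv4 TOS byte or the IPv6 Traffic Class byte) carries the six DSCP bits in its upper bits, with the two ECN bits below them, so a 'dscp' selector can compare `dsfield >> 2` identically for both families. The sketch assumes that RFC 2474 layout. `RFC791_TOS_BITS` is our own reading of RFC 791's delay/throughput/reliability bits, shown only for contrast and not a kernel constant, and `rule_matches_dscp()` is hypothetical rather than the fib_rules implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* RFC 2474: DSCP occupies the upper six bits of the DS field; the low
 * two bits are ECN and play no part in DSCP matching. */
static inline uint8_t dscp_from_dsfield(uint8_t dsfield)
{
	return dsfield >> 2;
}

/* RFC 791 "Type of Service" bits (delay, throughput, reliability),
 * i.e. bits 3-5 counting from the most significant bit. */
#define RFC791_TOS_BITS 0x1C

/* Hypothetical selector: compare all six DSCP bits, so the result is
 * the same whether the byte came from an IPv4 or an IPv6 header. */
static bool rule_matches_dscp(uint8_t dsfield, uint8_t rule_dscp)
{
	assert(rule_dscp < 64); /* DSCP is a 6-bit value */
	return dscp_from_dsfield(dsfield) == rule_dscp;
}
```

For Expedited Forwarding traffic (DSCP 46, DS byte 0xB8) the DSCP comparison matches regardless of the ECN bits, whereas masking the same byte with the RFC 791 bits keeps only 0x18, which is why a TOS-based selector cannot express a DSCP match consistently.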
Signed-off-by: Ido Schimmel Reviewed-by: Guillaume Nault Reviewed-by: David Ahern Link: https://patch.msgid.link/20240911093748.3662015-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski --- include/uapi/linux/fib_rules.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/fib_rules.h b/include/uapi/linux/fib_rules.h index 232df14e1287..a6924dd3aff1 100644 --- a/include/uapi/linux/fib_rules.h +++ b/include/uapi/linux/fib_rules.h @@ -67,6 +67,7 @@ enum { FRA_IP_PROTO, /* ip proto */ FRA_SPORT_RANGE, /* sport */ FRA_DPORT_RANGE, /* dport */ + FRA_DSCP, /* dscp */ __FRA_MAX }; -- cgit v1.2.3 From 636119af94f2fbf3e4458be66a1bc740ba69ce6d Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Sat, 14 Sep 2024 08:51:15 -0600 Subject: io_uring: rename "copy buffers" to "clone buffers" A recent commit added support for copying registered buffers from one ring to another. But that term is a bit confusing, as no copying of buffer data is done here. What is being done is simply cloning the buffer registrations from one ring to another. Rename it while we still can, so that it's more descriptive. No functional changes in this patch. 
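A minimal sketch of how an application might invoke the renamed operation through the raw io_uring_register(2) syscall. The opcode value (30), the IORING_REGISTER_SRC_REGISTERED flag, and the struct layout are mirrored from the io_uring diffs in this log under `SKETCH_` names in case the installed headers predate them. Passing `nr_args` as 1 and zeroing the padding are assumptions about what the kernel validates, and `prepare_clone_args()`/`clone_buffers()` are hypothetical helpers, not liburing API.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Mirrors of the uapi additions; prefer <linux/io_uring.h> when it has them. */
#define SKETCH_IORING_REGISTER_CLONE_BUFFERS  30
#define SKETCH_IORING_REGISTER_SRC_REGISTERED 1

struct sketch_io_uring_clone_buffers {
	uint32_t src_fd; /* source ring: plain fd, or registered-ring index */
	uint32_t flags;
	uint32_t pad[6];
};

_Static_assert(sizeof(struct sketch_io_uring_clone_buffers) == 32,
	       "uapi struct is eight 32-bit words");

/* Fill the argument; reserved padding is assumed to have to be zero. */
static void prepare_clone_args(struct sketch_io_uring_clone_buffers *arg,
			       uint32_t src_fd, int src_is_registered)
{
	assert(arg != NULL);
	memset(arg, 0, sizeof(*arg));
	arg->src_fd = src_fd;
	if (src_is_registered)
		arg->flags |= SKETCH_IORING_REGISTER_SRC_REGISTERED;
}

/* Clone the buffer registrations of the source ring into dst_ring_fd.
 * Returns 0 on success or -errno. */
static int clone_buffers(int dst_ring_fd, uint32_t src_fd, int src_is_registered)
{
#ifdef __NR_io_uring_register
	struct sketch_io_uring_clone_buffers arg;

	prepare_clone_args(&arg, src_fd, src_is_registered);
	if (syscall(__NR_io_uring_register, dst_ring_fd,
		    SKETCH_IORING_REGISTER_CLONE_BUFFERS, &arg, 1) < 0)
		return -errno;
	return 0;
#else
	(void)dst_ring_fd;
	(void)src_fd;
	(void)src_is_registered;
	return -ENOSYS;
#endif
}
```

Only the registrations are shared between the rings; no buffer data is copied, which is the distinction the rename is meant to convey.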
Fixes: 7cc2a6eadcd7 ("io_uring: add IORING_REGISTER_COPY_BUFFERS method") Signed-off-by: Jens Axboe --- include/uapi/linux/io_uring.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 9dc5bb428c8a..1fe79e750470 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -609,8 +609,8 @@ enum io_uring_register_op { IORING_REGISTER_CLOCK = 29, - /* copy registered buffers from source ring to current ring */ - IORING_REGISTER_COPY_BUFFERS = 30, + /* clone registered buffers from source ring to current ring */ + IORING_REGISTER_CLONE_BUFFERS = 30, /* this goes last */ IORING_REGISTER_LAST, @@ -701,7 +701,7 @@ enum { IORING_REGISTER_SRC_REGISTERED = 1, }; -struct io_uring_copy_buffers { +struct io_uring_clone_buffers { __u32 src_fd; __u32 flags; __u32 pad[6]; -- cgit v1.2.3 From 21d52e295ad2afc76bbd105da82a003b96f6ac77 Mon Sep 17 00:00:00 2001 From: Tahera Fahimi Date: Wed, 4 Sep 2024 18:13:55 -0600 Subject: landlock: Add abstract UNIX socket scoping MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Introduce a new "scoped" member to landlock_ruleset_attr that can specify LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET to restrict connection to abstract UNIX sockets from a process outside of the socket's domain. Two hooks are implemented to enforce these restrictions: unix_stream_connect and unix_may_send. 
Closes: https://github.com/landlock-lsm/linux/issues/7 Signed-off-by: Tahera Fahimi Link: https://lore.kernel.org/r/5f7ad85243b78427242275b93481cfc7c127764b.1725494372.git.fahimitahera@gmail.com [mic: Fix commit message formatting, improve documentation, simplify hook_unix_may_send(), and cosmetic fixes including rename of LANDLOCK_SCOPED_ABSTRACT_UNIX_SOCKET] Co-developed-by: Mickaël Salaün Signed-off-by: Mickaël Salaün --- include/uapi/linux/landlock.h | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h index 2c8dbc74b955..70edd17bafdc 100644 --- a/include/uapi/linux/landlock.h +++ b/include/uapi/linux/landlock.h @@ -44,6 +44,12 @@ struct landlock_ruleset_attr { * flags`_). */ __u64 handled_access_net; + /** + * @scoped: Bitmask of scopes (cf. `Scope flags`_) + * restricting a Landlock domain from accessing outside + * resources (e.g. IPCs). + */ + __u64 scoped; }; /* @@ -274,4 +280,25 @@ struct landlock_net_port_attr { #define LANDLOCK_ACCESS_NET_BIND_TCP (1ULL << 0) #define LANDLOCK_ACCESS_NET_CONNECT_TCP (1ULL << 1) /* clang-format on */ + +/** + * DOC: scope + * + * Scope flags + * ~~~~~~~~~~~ + * + * These flags enable to isolate a sandboxed process from a set of IPC actions. + * Setting a flag for a ruleset will isolate the Landlock domain to forbid + * connections to resources outside the domain. + * + * Scopes: + * + * - %LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET: Restrict a sandboxed process from + * connecting to an abstract UNIX socket created by a process outside the + * related Landlock domain (e.g. a parent domain or a non-sandboxed process). 
+ */ +/* clang-format off */ +#define LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET (1ULL << 0) +/* clang-format on*/ + #endif /* _UAPI_LINUX_LANDLOCK_H */ -- cgit v1.2.3 From 54a6e6bbf3bef25c8eb65619edde70af49bd3db0 Mon Sep 17 00:00:00 2001 From: Tahera Fahimi Date: Fri, 6 Sep 2024 15:30:03 -0600 Subject: landlock: Add signal scoping MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Currently, a sandboxed process is not restricted from sending a signal (e.g. SIGKILL) to a process outside the sandbox environment. A sandboxed process's ability to send signals should be scoped the same way abstract UNIX sockets are scoped. Therefore, we extend the "scoped" field in a ruleset with LANDLOCK_SCOPE_SIGNAL to specify that a ruleset will deny sending any signal from within a sandboxed process to its parent (i.e. any parent sandbox or non-sandboxed processes). This patch adds file_set_fowner and file_free_security hooks to set and release a pointer to the file owner's domain. This pointer, fown_domain in landlock_file_security, will be used in file_send_sigiotask to check whether the process can send a signal. The ruleset_with_unknown_scope test is updated to support LANDLOCK_SCOPE_SIGNAL. This depends on two new changes: - commit 1934b212615d ("file: reclaim 24 bytes from f_owner"): replace container_of(fown, struct file, f_owner) with fown->file. - commit 26f204380a3c ("fs: Fix file_set_fowner LSM hook inconsistencies"): lock before calling the hook.
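The two scope flags can be combined in a single ruleset. A minimal sketch, assuming the `scoped` field and flag values from the Landlock diffs in this log (mirrored locally under `SKETCH_` names) and assuming scoping is reported as Landlock ABI version 6; `supported_scopes()` and `sandbox_scopes_only()` are hypothetical helpers. It follows the usual landlock_create_ruleset(2)/landlock_restrict_self(2) sequence, setting PR_SET_NO_NEW_PRIVS first so an unprivileged process may restrict itself.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Mirrors of the uapi values; prefer <linux/landlock.h> when new enough. */
#define SKETCH_LANDLOCK_CREATE_RULESET_VERSION     (1U << 0)
#define SKETCH_LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET (1ULL << 0)
#define SKETCH_LANDLOCK_SCOPE_SIGNAL               (1ULL << 1)

struct sketch_landlock_ruleset_attr {
	uint64_t handled_access_fs;
	uint64_t handled_access_net;
	uint64_t scoped;
};

_Static_assert(sizeof(struct sketch_landlock_ruleset_attr) == 24,
	       "three 64-bit fields");

/* Assumption: both scopes are available from Landlock ABI 6 onward. */
static uint64_t supported_scopes(long abi)
{
	assert(abi > 0); /* ABI versions start at 1 */
	if (abi < 6)
		return 0;
	return SKETCH_LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET |
	       SKETCH_LANDLOCK_SCOPE_SIGNAL;
}

/* Stop the calling thread from connecting to abstract UNIX sockets or
 * signalling processes outside its new domain. Returns 0 or -errno. */
static int sandbox_scopes_only(void)
{
#if defined(SYS_landlock_create_ruleset) && defined(SYS_landlock_restrict_self)
	struct sketch_landlock_ruleset_attr attr = { 0 };
	long abi, fd;
	int err = 0;

	abi = syscall(SYS_landlock_create_ruleset, NULL, 0,
		      SKETCH_LANDLOCK_CREATE_RULESET_VERSION);
	if (abi < 0)
		return -errno;
	attr.scoped = supported_scopes(abi);
	if (!attr.scoped)
		return -EOPNOTSUPP;

	fd = syscall(SYS_landlock_create_ruleset, &attr, sizeof(attr), 0);
	if (fd < 0)
		return -errno;
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
	    syscall(SYS_landlock_restrict_self, fd, 0))
		err = -errno; /* capture before close() can clobber errno */
	close((int)fd);
	return err;
#else
	return -ENOSYS;
#endif
}
```

Because this ruleset handles no filesystem or network rights, the resulting domain restricts only the two IPC scopes; a real sandbox would normally also fill in `handled_access_fs` and `handled_access_net`.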
Signed-off-by: Tahera Fahimi Closes: https://github.com/landlock-lsm/linux/issues/8 Link: https://lore.kernel.org/r/df2b4f880a2ed3042992689a793ea0951f6798a5.1725657727.git.fahimitahera@gmail.com [mic: Update landlock_get_current_domain()'s return type, improve and fix locking in hook_file_set_fowner(), simplify and fix sleepable call and locking issue in hook_file_send_sigiotask() and rebase on the latest VFS tree, simplify hook_task_kill() and quickly return when not sandboxed, improve comments, rename LANDLOCK_SCOPED_SIGNAL] Co-developed-by: Mickaël Salaün Signed-off-by: Mickaël Salaün --- include/uapi/linux/landlock.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h index 70edd17bafdc..33745642f787 100644 --- a/include/uapi/linux/landlock.h +++ b/include/uapi/linux/landlock.h @@ -296,9 +296,12 @@ struct landlock_net_port_attr { * - %LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET: Restrict a sandboxed process from * connecting to an abstract UNIX socket created by a process outside the * related Landlock domain (e.g. a parent domain or a non-sandboxed process). + * - %LANDLOCK_SCOPE_SIGNAL: Restrict a sandboxed process from sending a signal + * to another process outside the domain. */ /* clang-format off */ #define LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET (1ULL << 0) +#define LANDLOCK_SCOPE_SIGNAL (1ULL << 1) /* clang-format on*/ #endif /* _UAPI_LINUX_LANDLOCK_H */ -- cgit v1.2.3 From f761fcdd289d07e8547fef7ac76c3760fc7803f2 Mon Sep 17 00:00:00 2001 From: Dongliang Cui Date: Wed, 18 Sep 2024 07:40:05 +0900 Subject: exfat: Implement sops->shutdown and ioctl We found that when writing a large file through buffer write, if the disk is inaccessible, exFAT does not return an error normally, which leads to the writing process not stopping properly. To easily reproduce this issue, you can follow the steps below: 1. format a device to exFAT and then mount (with a full disk erase) 2. 
dd if=/dev/zero of=/exfat_mount/test.img bs=1M count=8192 3. eject the device You may find that the dd process does not stop immediately and may continue for a long time. The root cause of this issue is that during the buffered write process, exFAT does not need to access the disk to look up directory entries or the FAT table (whereas FAT would) every time data is written. Instead, exFAT simply marks the buffer as dirty and returns, delegating the writeback operation to the writeback process. If the disk cannot be accessed at this time, the error is only returned to the writeback process; the original process never receives it, so it cannot be reported to user space. When the disk cannot be accessed normally, an error should be returned to stop the writing process. Implement sops->shutdown and ioctl to shut down the file system when the underlying block device is marked dead. Signed-off-by: Dongliang Cui Signed-off-by: Zhiguo Niu Signed-off-by: Namjae Jeon --- include/uapi/linux/exfat.h | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) create mode 100644 include/uapi/linux/exfat.h (limited to 'include/uapi') diff --git a/include/uapi/linux/exfat.h b/include/uapi/linux/exfat.h new file mode 100644 index 000000000000..46d95b16fc4b --- /dev/null +++ b/include/uapi/linux/exfat.h @@ -0,0 +1,25 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +/* + * Copyright (C) 2024 Unisoc Technologies Co., Ltd.
*/ + +#ifndef _UAPI_LINUX_EXFAT_H +#define _UAPI_LINUX_EXFAT_H +#include +#include + +/* + * exfat-specific ioctl commands + */ + +#define EXFAT_IOC_SHUTDOWN _IOR('X', 125, __u32) + +/* + * Flags used by EXFAT_IOC_SHUTDOWN + */ + +#define EXFAT_GOING_DOWN_DEFAULT 0x0 /* default with full sync */ +#define EXFAT_GOING_DOWN_FULLSYNC 0x1 /* going down with full sync*/ +#define EXFAT_GOING_DOWN_NOSYNC 0x2 /* going down */ + +#endif /* _UAPI_LINUX_EXFAT_H */ -- cgit v1.2.3 From 2fae6bb7be320270801b3c3b040189bd7daa8056 Mon Sep 17 00:00:00 2001 From: Jiqian Chen Date: Tue, 24 Sep 2024 14:14:37 +0800 Subject: xen/privcmd: Add new syscall to get gsi from dev On PVH dom0, when passing a device through to domU, QEMU and the xl tools need the GSI number to do pirq mapping (see xen_pt_realize->xc_physdev_map_pirq in QEMU and pci_add_dm_done->xc_physdev_map_pirq in xl). However, the current code reads the GSI number from /sys/bus/pci/devices//irq, which is wrong: an IRQ is not the same as a GSI, since they live in different number spaces, so pirq mapping fails. The current Linux code also provides no way for userspace to obtain the GSI. To address this, record the GSI of pcistub devices when pcistub is initialized, and add a new ioctl to privcmd so that userspace can query the GSI when it needs it.
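A minimal userspace sketch of the query this patch enables, assuming the `struct privcmd_pcidev_get_gsi` layout and ioctl number from the diff (mirrored locally under `SKETCH_` names) and the usual /dev/xen/privcmd device node. The SBDF packing in `make_sbdf()` (segment in the upper 16 bits, then bus, then device/function) is an assumption based on conventional PCI encoding rather than something the patch spells out, and `query_gsi()` is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Mirrors of the privcmd uapi additions; prefer <xen/privcmd.h>. */
struct sketch_privcmd_pcidev_get_gsi {
	uint32_t sbdf; /* in: PCI segment/bus/device/function */
	uint32_t gsi;  /* out: filled in by the kernel */
};

#define SKETCH_IOCTL_PRIVCMD_PCIDEV_GET_GSI \
	_IOC(_IOC_NONE, 'P', 10, sizeof(struct sketch_privcmd_pcidev_get_gsi))

/* Assumed encoding: segment in bits 31-16, bus in 15-8, devfn in 7-0. */
static uint32_t make_sbdf(uint16_t seg, uint8_t bus, uint8_t dev, uint8_t fn)
{
	assert(dev < 32 && fn < 8);
	return ((uint32_t)seg << 16) | ((uint32_t)bus << 8) |
	       ((uint32_t)dev << 3) | fn;
}

/* Ask privcmd for the GSI of a pcistub-owned device; GSI or -errno. */
static long query_gsi(uint32_t sbdf)
{
	struct sketch_privcmd_pcidev_get_gsi arg = { .sbdf = sbdf, .gsi = 0 };
	long ret;
	int fd = open("/dev/xen/privcmd", O_RDWR | O_CLOEXEC);

	if (fd < 0)
		return -errno;
	if (ioctl(fd, SKETCH_IOCTL_PRIVCMD_PCIDEV_GET_GSI, &arg) < 0)
		ret = -errno;
	else
		ret = (long)arg.gsi;
	close(fd);
	return ret;
}
```

The returned GSI, rather than the value in the device's sysfs `irq` file, is what the toolstack would then pass on for pirq mapping.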
Signed-off-by: Jiqian Chen Signed-off-by: Huang Rui Signed-off-by: Jiqian Chen Reviewed-by: Stefano Stabellini Message-ID: <20240924061437.2636766-4-Jiqian.Chen@amd.com> Signed-off-by: Juergen Gross --- include/uapi/xen/privcmd.h | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/xen/privcmd.h b/include/uapi/xen/privcmd.h index 8b8c5d1420fe..8e2c8fd44764 100644 --- a/include/uapi/xen/privcmd.h +++ b/include/uapi/xen/privcmd.h @@ -126,6 +126,11 @@ struct privcmd_ioeventfd { __u8 pad[2]; }; +struct privcmd_pcidev_get_gsi { + __u32 sbdf; + __u32 gsi; +}; + /* * @cmd: IOCTL_PRIVCMD_HYPERCALL * @arg: &privcmd_hypercall_t @@ -157,5 +162,7 @@ struct privcmd_ioeventfd { _IOW('P', 8, struct privcmd_irqfd) #define IOCTL_PRIVCMD_IOEVENTFD \ _IOW('P', 9, struct privcmd_ioeventfd) +#define IOCTL_PRIVCMD_PCIDEV_GET_GSI \ + _IOC(_IOC_NONE, 'P', 10, sizeof(struct privcmd_pcidev_get_gsi)) #endif /* __LINUX_PUBLIC_PRIVCMD_H__ */ -- cgit v1.2.3
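Returning to the exFAT patch earlier in this log: a sketch of how a tool might trigger the new shutdown ioctl on a mounted exFAT filesystem. The request number and flag values are mirrored from include/uapi/linux/exfat.h under `SKETCH_` names. Passing the flag word by pointer follows the `_IOR(..., __u32)` declaration (the same encoding ext4 uses for its own shutdown ioctl); that the caller needs CAP_SYS_ADMIN and that the kernel rejects other flag values are assumptions. `exfat_shutdown()` and `exfat_shutdown_flag_valid()` are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Mirrors of include/uapi/linux/exfat.h; prefer <linux/exfat.h>. */
#define SKETCH_EXFAT_IOC_SHUTDOWN        _IOR('X', 125, uint32_t)
#define SKETCH_EXFAT_GOING_DOWN_DEFAULT  0x0 /* default with full sync */
#define SKETCH_EXFAT_GOING_DOWN_FULLSYNC 0x1 /* going down with full sync */
#define SKETCH_EXFAT_GOING_DOWN_NOSYNC   0x2 /* going down */

/* Client-side check that flags is one of the three defined values. */
static int exfat_shutdown_flag_valid(uint32_t flags)
{
	return flags == SKETCH_EXFAT_GOING_DOWN_DEFAULT ||
	       flags == SKETCH_EXFAT_GOING_DOWN_FULLSYNC ||
	       flags == SKETCH_EXFAT_GOING_DOWN_NOSYNC;
}

/* Shut down the exFAT filesystem containing path. Returns 0 or -errno. */
static int exfat_shutdown(const char *path, uint32_t flags)
{
	uint32_t arg = flags; /* the kernel reads the flag word via the pointer */
	int fd, err = 0;

	assert(exfat_shutdown_flag_valid(flags));
	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -errno;
	if (ioctl(fd, SKETCH_EXFAT_IOC_SHUTDOWN, &arg) < 0)
		err = -errno;
	close(fd);
	return err;
}
```

A writer like the `dd` in the reproduction steps would then see errors instead of continuing to dirty page-cache buffers after the device is gone.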