path: root/include
2025-11-26 drm/plane: Add COLOR PIPELINE property (Harry Wentland)
We're adding a new enum COLOR PIPELINE property. This property will have entries for each COLOR PIPELINE by referencing the DRM object ID of the first drm_colorop of the pipeline. 0 disables the entire COLOR PIPELINE. Userspace can use this to discover the available color pipelines, as well as set the desired one. The color pipelines are programmed via properties on the actual drm_colorop objects. Reviewed-by: Simon Ser <contact@emersion.fr> Signed-off-by: Alex Hung <alex.hung@amd.com> Signed-off-by: Harry Wentland <harry.wentland@amd.com> Reviewed-by: Daniel Stone <daniels@collabora.com> Reviewed-by: Melissa Wen <mwen@igalia.com> Reviewed-by: Sebastian Wick <sebastian.wick@redhat.com> Signed-off-by: Simon Ser <contact@emersion.fr> Link: https://patch.msgid.link/20251115000237.3561250-11-alex.hung@amd.com
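Not part of the commit, but as an illustration: a minimal sketch of how userspace might select a pipeline through libdrm's atomic API. The property ID and colorop object ID are assumed to have been discovered beforehand; names are placeholders.

    #include <errno.h>
    #include <stdint.h>
    #include <xf86drm.h>
    #include <xf86drmMode.h>

    /* Point a plane's COLOR PIPELINE property at the first drm_colorop of
     * the desired pipeline; 0 disables the pipeline entirely. */
    int select_color_pipeline(int fd, uint32_t plane_id,
                              uint32_t pipeline_prop_id,
                              uint64_t first_colorop_id)
    {
            drmModeAtomicReq *req = drmModeAtomicAlloc();
            int ret;

            if (!req)
                    return -ENOMEM;
            ret = drmModeAtomicAddProperty(req, plane_id, pipeline_prop_id,
                                           first_colorop_id);
            if (ret >= 0)
                    ret = drmModeAtomicCommit(fd, req, 0, NULL);
            drmModeAtomicFree(req);
            return ret < 0 ? ret : 0;
    }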
2025-11-26 drm/colorop: Add NEXT property (Harry Wentland)
We'll construct color pipelines out of drm_colorop by chaining them via the NEXT pointer. NEXT will point to the next drm_colorop in the pipeline, or be 0 if we're at the end of the pipeline. Reviewed-by: Simon Ser <contact@emersion.fr> Signed-off-by: Alex Hung <alex.hung@amd.com> Signed-off-by: Harry Wentland <harry.wentland@amd.com> Reviewed-by: Daniel Stone <daniels@collabora.com> Reviewed-by: Melissa Wen <mwen@igalia.com> Reviewed-by: Sebastian Wick <sebastian.wick@redhat.com> Signed-off-by: Simon Ser <contact@emersion.fr> Link: https://patch.msgid.link/20251115000237.3561250-9-alex.hung@amd.com
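An illustrative, hedged sketch (not from the series) of how userspace could walk a pipeline by following NEXT; it assumes colorop properties are readable through the generic drmModeObjectGetProperties() interface:

    #include <stdint.h>
    #include <string.h>
    #include <xf86drm.h>
    #include <xf86drmMode.h>

    /* Follow the NEXT property from colorop to colorop until it reads 0. */
    void walk_color_pipeline(int fd, uint32_t colorop_id)
    {
            while (colorop_id) {
                    drmModeObjectProperties *props =
                            drmModeObjectGetProperties(fd, colorop_id,
                                                       DRM_MODE_OBJECT_ANY);
                    uint64_t next = 0;

                    if (!props)
                            return;
                    for (uint32_t i = 0; i < props->count_props; i++) {
                            drmModePropertyRes *p =
                                    drmModeGetProperty(fd, props->props[i]);
                            if (p && !strcmp(p->name, "NEXT"))
                                    next = props->prop_values[i];
                            drmModeFreeProperty(p);
                    }
                    drmModeFreeObjectProperties(props);
                    colorop_id = (uint32_t)next;
            }
    }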
2025-11-26 drm/colorop: Add BYPASS property (Harry Wentland)
We want to be able to bypass each colorop at all times. Introduce a new BYPASS boolean property for this. Reviewed-by: Simon Ser <contact@emersion.fr> Reviewed-by: Louis Chauvet <louis.chauvet@bootlin.com> Signed-off-by: Alex Hung <alex.hung@amd.com> Signed-off-by: Harry Wentland <harry.wentland@amd.com> Reviewed-by: Daniel Stone <daniels@collabora.com> Reviewed-by: Melissa Wen <mwen@igalia.com> Reviewed-by: Sebastian Wick <sebastian.wick@redhat.com> Signed-off-by: Simon Ser <contact@emersion.fr> Link: https://patch.msgid.link/20251115000237.3561250-8-alex.hung@amd.com
2025-11-26 drm/colorop: Add 1D Curve subtype (Harry Wentland)
Add a new drm_colorop type, DRM_COLOROP_1D_CURVE, with two subtypes: DRM_COLOROP_1D_CURVE_SRGB_EOTF and DRM_COLOROP_1D_CURVE_SRGB_INV_EOTF. Reviewed-by: Simon Ser <contact@emersion.fr> Reviewed-by: Louis Chauvet <louis.chauvet@bootlin.com> Signed-off-by: Harry Wentland <harry.wentland@amd.com> Co-developed-by: Alex Hung <alex.hung@amd.com> Signed-off-by: Alex Hung <alex.hung@amd.com> Reviewed-by: Daniel Stone <daniels@collabora.com> Reviewed-by: Melissa Wen <mwen@igalia.com> Reviewed-by: Sebastian Wick <sebastian.wick@redhat.com> Signed-off-by: Simon Ser <contact@emersion.fr> Link: https://patch.msgid.link/20251115000237.3561250-7-alex.hung@amd.com
2025-11-26 drm/colorop: Add TYPE property (Harry Wentland)
Add a read-only TYPE property. The TYPE specifies the colorop type, such as enumerated curve, 1D LUT, CTM, 3D LUT, PWL LUT, etc. For now we're only introducing an enumerated 1D LUT type to illustrate the concept. Reviewed-by: Simon Ser <contact@emersion.fr> Reviewed-by: Louis Chauvet <louis.chauvet@bootlin.com> Signed-off-by: Alex Hung <alex.hung@amd.com> Signed-off-by: Harry Wentland <harry.wentland@amd.com> Reviewed-by: Daniel Stone <daniels@collabora.com> Reviewed-by: Melissa Wen <mwen@igalia.com> Reviewed-by: Sebastian Wick <sebastian.wick@redhat.com> Signed-off-by: Simon Ser <contact@emersion.fr> Link: https://patch.msgid.link/20251115000237.3561250-6-alex.hung@amd.com
2025-11-26 drm/colorop: Introduce new drm_colorop mode object (Harry Wentland)
This patch introduces a new drm_colorop mode object. This object represents color transformations and can be used to define color pipelines. We also introduce the drm_colorop_state here, as well as various helpers and state tracking bits. Reviewed-by: Simon Ser <contact@emersion.fr> Signed-off-by: Alex Hung <alex.hung@amd.com> Signed-off-by: Harry Wentland <harry.wentland@amd.com> Reviewed-by: Daniel Stone <daniels@collabora.com> Reviewed-by: Melissa Wen <mwen@igalia.com> Reviewed-by: Sebastian Wick <sebastian.wick@redhat.com> Signed-off-by: Simon Ser <contact@emersion.fr> Link: https://patch.msgid.link/20251115000237.3561250-5-alex.hung@amd.com
2025-11-26 regulator: Use container_of_const() when all types are (Mark Brown)
Merge series from Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>: Use container_of_const(), which is preferred over container_of(), when the argument 'ptr' and returned pointer are already const, for better code safety and readability. Some drivers already have const everywhere, so container_of_const can be directly used. In a few other drivers, the final pointer can be constified that way.
2025-11-26 drm: Add helper for conversion from signed-magnitude (Harry Wentland)
CTM values are defined as signed-magnitude values. Add a helper that converts from a CTM signed-magnitude fixed point value to the two's-complement value used by drm_fixed. Reviewed-by: Sebastian Wick <sebastian.wick@redhat.com> Reviewed-by: Louis Chauvet <louis.chauvet@bootlin.com> Signed-off-by: Harry Wentland <harry.wentland@amd.com> Reviewed-by: Daniel Stone <daniels@collabora.com> Reviewed-by: Melissa Wen <mwen@igalia.com> Signed-off-by: Simon Ser <contact@emersion.fr> Link: https://patch.msgid.link/20251115000237.3561250-2-alex.hung@amd.com
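For illustration, a minimal sketch of the conversion such a helper performs, assuming the usual DRM CTM encoding (sign in bit 63, S31.32 magnitude below); the function name is illustrative, not the one added by the patch:

    #include <linux/bits.h>
    #include <linux/types.h>

    /* Convert a sign-magnitude CTM entry to the two's-complement fixed
     * point representation used by drm_fixed. */
    static inline s64 ctm_sign_mag_to_s64(u64 v)
    {
            s64 mag = v & ~BIT_ULL(63);     /* drop the sign bit */

            return (v & BIT_ULL(63)) ? -mag : mag;
    }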
2025-11-26 io_uring: Introduce getsockname io_uring cmd (Gabriel Krisman Bertazi)
Introduce a socket-specific io_uring_cmd to support getsockname/getpeername via io_uring. I made this an io_uring_cmd instead of a new operation to avoid polluting the command namespace with what is exclusively a socket operation. In addition, since we don't need to conform to existing interfaces, this merges getsockname/getpeername into a single operation, since the implementation is pretty much the same. This has been frequently requested, for instance at [1] and more recently in the project Discord channel. The main use-case is to support fixed socket file descriptors. [1] https://github.com/axboe/liburing/issues/1356 Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
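A heavily hedged userspace sketch of issuing such a command with liburing: the opcode name SOCKET_URING_OP_GETSOCKNAME and the way the address buffer is passed are guesses based on the description above, not the actual uAPI.

    #include <liburing.h>
    #include <sys/socket.h>

    /* Issue the (assumed) getsockname socket command and wait for it. */
    int uring_getsockname(struct io_uring *ring, int sockfd,
                          struct sockaddr_storage *addr)
    {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
            struct io_uring_cqe *cqe;
            int ret;

            if (!sqe)
                    return -EBUSY;
            io_uring_prep_rw(IORING_OP_URING_CMD, sqe, sockfd, NULL, 0, 0);
            sqe->cmd_op = SOCKET_URING_OP_GETSOCKNAME;      /* assumed name */
            sqe->addr = (unsigned long)addr;                /* assumed ABI */
            io_uring_submit(ring);
            ret = io_uring_wait_cqe(ring, &cqe);
            if (!ret) {
                    ret = cqe->res;
                    io_uring_cqe_seen(ring, cqe);
            }
            return ret;
    }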
2025-11-26 socket: Split out a getsockname helper for io_uring (Gabriel Krisman Bertazi)
Similar to getsockopt, split out a helper to check security and issue the operation from the main handler that can be used by io_uring. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-26 socket: Unify getsockname and getpeername implementation (Gabriel Krisman Bertazi)
They are already implemented by the same get_name hook at the protocol level. Bring the unification one level up to reduce code duplication in preparation for supporting these as io_uring operations. Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-26 function_graph: Enable funcgraph-args and funcgraph-retaddr to work simultaneously (pengdonglin)
Currently, the funcgraph-args and funcgraph-retaddr features are mutually exclusive. This patch resolves this limitation by allowing funcgraph-retaddr to have an args array. To verify the change, use perf to trace vfs_write with both options enabled: Before: # perf ftrace -G vfs_write --graph-opts args,retaddr ...... down_read() { /* <-n_tty_write+0xa3/0x540 */ __cond_resched(); /* <-down_read+0x12/0x160 */ preempt_count_add(); /* <-down_read+0x3b/0x160 */ preempt_count_sub(); /* <-down_read+0x8b/0x160 */ } After: # perf ftrace -G vfs_write --graph-opts args,retaddr ...... down_read(sem=0xffff8880100bea78) { /* <-n_tty_write+0xa3/0x540 */ __cond_resched(); /* <-down_read+0x12/0x160 */ preempt_count_add(val=1); /* <-down_read+0x3b/0x160 */ preempt_count_sub(val=1); /* <-down_read+0x8b/0x160 */ } Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Xiaoqin Zhang <zhangxiaoqin@xiaomi.com> Link: https://patch.msgid.link/20251125093425.2563849-1-dolinux.peng@gmail.com Signed-off-by: pengdonglin <pengdonglin@xiaomi.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-26 Merge branch 'iommufd_dmabuf' into k.o-iommufd/for-next (Jason Gunthorpe)
Jason Gunthorpe says: ==================== This series is the start of adding full DMABUF support to iommufd. Currently it is limited to only work with VFIO's DMABUF exporter. It sits on top of Leon's series to add a DMABUF exporter to VFIO: https://lore.kernel.org/all/20251120-dmabuf-vfio-v9-0-d7f71607f371@nvidia.com/ The existing IOMMU_IOAS_MAP_FILE is enhanced to detect DMABUF fd's, but otherwise works the same as it does today for a memfd. The user can select a slice of the FD to map into the ioas and if the underlying alignment requirements are met it will be placed in the iommu_domain. Though limited, it is enough to allow a VMM like QEMU to connect MMIO BAR memory from VFIO to an iommu_domain controlled by iommufd. This is used for PCI Peer to Peer support in VMs, and is the last feature that the VFIO type 1 container has that iommufd couldn't do. The VFIO type1 version extracts raw PFNs from VMAs, which has no lifetime control and is a use-after-free security problem. Instead iommufd relies on revocable DMABUFs. Whenever VFIO thinks there should be no access to the MMIO it can shoot down the mapping in iommufd, which will unmap it from the iommu_domain. There is no automatic remap; this is a safety protocol so the kernel doesn't get stuck. Userspace is expected to know it is doing something that will revoke the dmabuf and map/unmap it around the activity. E.g. when QEMU goes to issue FLR it should do the map/unmap to iommufd. Since DMABUF is missing some key general features for this use case it relies on a "private interconnect" between VFIO and iommufd via the vfio_pci_dma_buf_iommufd_map() call. The call confirms the DMABUF has revoke semantics and delivers a phys_addr for the memory suitable for use with iommu_map(). Medium term there is a desire to expand the supported DMABUFs to include GPU drivers to support DPDK/SPDK type use cases, so future series will work to add a general concept of revoke and a general negotiation of interconnect to remove vfio_pci_dma_buf_iommufd_map(). I also plan another series to modify iommufd's vfio_compat to transparently pull a dmabuf out of a VFIO VMA to emulate more of the uAPI of type1. The latest series for interconnect negotiation to exchange a phys_addr is: https://lore.kernel.org/r/20251027044712.1676175-1-vivek.kasireddy@intel.com And the discussion for design of revoke is here: https://lore.kernel.org/dri-devel/20250114173103.GE5556@nvidia.com/ ==================== Based on a shared branch with vfio.
* iommufd_dmabuf:
    iommufd/selftest: Add some tests for the dmabuf flow
    iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE
    iommufd: Have iopt_map_file_pages convert the fd to a file
    iommufd: Have pfn_reader process DMABUF iopt_pages
    iommufd: Allow MMIO pages in a batch
    iommufd: Allow a DMABUF to be revoked
    iommufd: Do not map/unmap revoked DMABUFs
    iommufd: Add DMABUF to iopt_pages
    vfio/pci: Add vfio_pci_dma_buf_iommufd_map()
    vfio/nvgrace: Support get_dmabuf_phys
    vfio/pci: Add dma-buf export support for MMIO regions
    vfio/pci: Enable peer-to-peer DMA transactions by default
    vfio/pci: Share the core device pointer while invoking feature functions
    vfio: Export vfio device get and put registration helpers
    dma-buf: provide phys_vec to scatter-gather mapping routine
    PCI/P2PDMA: Document DMABUF model
    PCI/P2PDMA: Provide an access to pci_p2pdma_map_type() function
    PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation
    PCI/P2PDMA: Simplify bus address mapping API
    PCI/P2PDMA: Separate the mmap() support from the core logic
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-11-26 iommu/arm-smmu-v3-iommufd: Allow attaching nested domain for GBPA cases (Nicolin Chen)
A vDEVICE has been a hard requirement for attaching a nested domain to the device. This makes sense when installing a guest STE, since a vSID must be present and given to the kernel during the vDEVICE allocation. But, when CR0.SMMUEN is disabled, the VM doesn't really need a vSID to program the vSMMU behavior as GBPA will take effect, in which case the vSTE in the nested domain could have carried the bypass or abort configuration in the GBPA register. Thus, having such a hard requirement doesn't work well for GBPA. Skip vmaster allocation in arm_smmu_attach_prepare_vmaster() for an abort or bypass vSTE. Note that a device on this attachment won't report vevents. Update the uAPI doc accordingly. Link: https://patch.msgid.link/r/20251103172755.2026145-1-nicolinc@nvidia.com Tested-by: Shameer Kolothum <skolothumtho@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Pranjal Shrivastava <praan@google.com> Tested-by: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-11-26 drivers: hid: renegotiate resolution multipliers with device after reset (Benedek Kupper)
The scroll resolution multipliers are set in the context of hidinput_connect(), which is only called at probe time: when the host changes the value on the device with a SET_REPORT(FEATURE), and the device accepts it, these multipliers are stored on the host side, and used to calculate the final scroll event values sent to userspace. After a USB suspend, the resume operation on many hubs and chipsets involves a USB reset signal as well. A reset on the device side clears all previous state information, including the value of the multiplier report. This reset is not handled by the multiplier handling logic, so what ends up happening is the host is still expecting high-resolution scroll events, but the device is reset to default resolution, making the effective, user-perceived scroll speed incredibly slow. The solution is to renegotiate the multiplier selection after each reset. This is not the only bug related to the high-resolution scrolling implementation in the kernel (the other one is https://bugzilla.kernel.org/show_bug.cgi?id=220144), but for this one, there is no device-side workaround, leading to poor user experience with our product: https://github.com/UltimateHackingKeyboard/firmware/issues/1155 https://github.com/UltimateHackingKeyboard/firmware/issues/1261 https://github.com/UltimateHackingKeyboard/firmware/pull/1355 This patch was tested by an affected user and has been reported to fix the issue (see discussion in 1355). Signed-off-by: Benedek Kupper <kupper.benedek@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2025-11-26 mod_devicetable: Bump auxiliary_device_id name size (Raag Jadav)
We have an upcoming driver named "intel_ehl_pse_io". This creates an auxiliary child device for its GPIO sub-functionality, which matches against "intel_ehl_pse_io.gpio-elkhartlake" and overshoots the current maximum limit of 32 bytes for the auxiliary device id string. Bump the size to 40 bytes to satisfy such cases. Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Raag Jadav <raag.jadav@intel.com> Link: https://patch.msgid.link/20251106052838.433673-1-raag.jadav@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
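A quick check of the arithmetic (illustrative, not from the patch):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            /* "modname.devname": 16 + 1 + 16 = 33 chars, 34 bytes with the
             * terminating NUL; over the old 32-byte limit, within 40. */
            const char *id = "intel_ehl_pse_io.gpio-elkhartlake";

            printf("%zu bytes needed\n", strlen(id) + 1);
            return 0;
    }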
2025-11-26 sysfs: simplify attribute definition macros (Thomas Weißschuh)
Define the macros in terms of each other. This makes them easier to understand and also will make it easier to implement the transition machinery for 'const struct attribute'. __ATTR_RO_MODE() can't be implemented in terms of __ATTR() as not all attributes have a .store callback. The same issue theoretically exists for __ATTR_WO(), but practically that does not occur today. Reorder __ATTR_RO() below __ATTR_RO_MODE() to keep the order of the macro definition consistent with respect to each other. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://patch.msgid.link/20251029-sysfs-const-attr-prep-v5-7-ea7d745acff4@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
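For illustration, a simplified sketch of the layering the commit describes (the real sysfs.h definitions also wrap the mode in VERIFY_OCTAL_PERMISSIONS(); this is not the exact upstream hunk):

    /* Build the attribute initializer once, with an explicit mode... */
    #define __ATTR_RO_MODE(_name, _mode) {                          \
            .attr = { .name = __stringify(_name), .mode = _mode },  \
            .show = _name##_show,                                   \
    }

    /* ...and express __ATTR_RO() in terms of it instead of open-coding
     * the initializer a second time. */
    #define __ATTR_RO(_name) __ATTR_RO_MODE(_name, 0444)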
2025-11-26 sysfs: attribute_group: enable const variants of is_visible() (Thomas Weißschuh)
When constifying instances of struct attribute, for consistency the corresponding .is_visible() callback should be adapted, too. Introduce a temporary transition mechanism until all callbacks are converted. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://patch.msgid.link/20251029-sysfs-const-attr-prep-v5-4-ea7d745acff4@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-26 sysfs: introduce __SYSFS_FUNCTION_ALTERNATIVE() (Thomas Weißschuh)
For the constification phase of 'struct attribute' various callback struct members will need to exist in both const and non-const variants. Keeping both members in a union avoids memory and CPU overhead but will be detected and trapped by Control Flow Integrity (CFI). By deciding between a struct and a union depending on whether CFI is enabled, most configurations can avoid this overhead. Code using these callbacks will still need to be updated to handle both members explicitly. In the union case the compiler will recognize that testing for one union member is enough and optimize away the code for the other one. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://patch.msgid.link/20251029-sysfs-const-attr-prep-v5-3-ea7d745acff4@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
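A hedged sketch of the idea; the config symbol and the macro's exact shape are assumptions based on the description:

    /* With CFI, an indirect call through a function pointer of a
     * mismatching prototype traps, so the two callback variants must be
     * distinct struct members; without CFI they can share storage. */
    #ifdef CONFIG_CFI_CLANG
    #define __SYSFS_FUNCTION_ALTERNATIVE(fn_legacy, fn_const)       \
            struct { fn_legacy; fn_const; }
    #else
    #define __SYSFS_FUNCTION_ALTERNATIVE(fn_legacy, fn_const)       \
            union { fn_legacy; fn_const; }
    #endif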
2025-11-26 sysfs: transparently handle const pointers in ATTRIBUTE_GROUPS() (Thomas Weißschuh)
To ease the constification process of 'struct attribute', transparently handle the const pointers in ATTRIBUTE_GROUPS(). A cast is used instead of assigning to .attrs_new as it keeps the macro smaller. As both members are aliased to each other the result is identical. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://patch.msgid.link/20251029-sysfs-const-attr-prep-v5-2-ea7d745acff4@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-26 sysfs: attribute_group: allow registration of const attribute (Thomas Weißschuh)
To be able to constify instances of struct attribute it has to be possible to add them to struct attribute_group. The current type of the attrs member however is not compatible with that. Introduce a union that allows registration of both const and non-const attributes to enable a piecewise transition. As both union member types are compatible no logic needs to be adapted. Technically it is now possible to register a const struct attribute and receive it as a mutable pointer in the callbacks. This is a soundness issue. But this same soundness issue already exists today in sysfs_create_file(). Also the struct definition and callback implementation are always closely linked and are meant to be moved to const in lockstep. Similar to commit 906c508afdca ("sysfs: attribute_group: allow registration of const bin_attribute") Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://patch.msgid.link/20251029-sysfs-const-attr-prep-v5-1-ea7d745acff4@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
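A simplified sketch of the transition union described above; both members alias the same storage, and .attrs_new is the name the ATTRIBUTE_GROUPS() patch in this series refers to:

    struct attribute_group {
            /* ... other members elided for brevity ... */
            union {
                    struct attribute                **attrs;
                    const struct attribute * const  *attrs_new;
            };
    };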
2025-11-26 virt: acrn: split acrn_mmio_dev_res out of acrn_mmiodev (Randy Dunlap)
Add struct acrn_mmio_dev_res before struct acrn_mmio_dev. The former is used in the latter and breaking them up provides better kernel-doc documentation for the struct members. Suggested-by: Fei Li <fei1.li@intel.com> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Fei Li <fei1.li@intel.com> Link: https://patch.msgid.link/20251028040409.868254-1-rdunlap@infradead.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-26 comedi: kcomedilib: Add loop checking variants of open and close (Ian Abbott)
Add `comedi_open_from(path, from)` and `comedi_close_from(dev, from)` as variants of the existing `comedi_open(path)` and `comedi_close(dev)`. The additional `from` parameter is a minor device number that tells the function that the COMEDI device is being opened or closed from another COMEDI device if the value is in the range [0, `COMEDI_NUM_BOARD_MINORS`-1]. In that case the function will refuse to open the device if it would lead to a chain of devices opening each other. (It will also impose a limit on the number of simultaneous opens from one device to another because we need to count those.) The new functions are intended to be used by the "comedi_bond" driver, which is the only driver that uses the existing `comedi_open()` and `comedi_close()` functions. The new functions will be used to avoid some possible deadlock situations. Replace the existing, exported `comedi_open()` and `comedi_close()` functions with inline wrapper functions that call the newly exported `comedi_open_from()` and `comedi_close_from()` functions. Signed-off-by: Ian Abbott <abbotti@mev.co.uk> Link: https://patch.msgid.link/20251027153748.4569-2-abbotti@mev.co.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
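A sketch of what the inline wrappers could look like; using COMEDI_NUM_BOARD_MINORS as the "not opened from another COMEDI device" sentinel is an assumption (any value outside [0, COMEDI_NUM_BOARD_MINORS-1] disables the loop check):

    static inline struct comedi_device *comedi_open(const char *path)
    {
            /* Out-of-range 'from': not opened from another device. */
            return comedi_open_from(path, COMEDI_NUM_BOARD_MINORS);
    }

    static inline int comedi_close(struct comedi_device *dev)
    {
            return comedi_close_from(dev, COMEDI_NUM_BOARD_MINORS);
    }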
2025-11-26 comedi: Add reference counting for Comedi command handling (Ian Abbott)
For interrupts from badly behaved hardware (as emulated by Syzbot), it is possible for the Comedi core functions that manage the progress of asynchronous data acquisition to be called from driver ISRs while no asynchronous command has been set up, which can cause problems such as invalid pointer dereferencing or dividing by zero. To help protect against that, introduce new functions to maintain a reference counter for asynchronous commands that are being set up. `comedi_get_is_subdevice_running(s)` will check if a command has been set up on a subdevice and is still marked as running, and if so will increment the reference counter and return `true`, otherwise it will return `false` without modifying the reference counter. `comedi_put_is_subdevice_running(s)` will decrement the reference counter and set a completion event when decremented to 0. Change the `do_cmd_ioctl()` function (responsible for setting up the asynchronous command) to reinitialize the completion event and set the reference counter to 1 before it marks the subdevice as running. Change the `do_become_nonbusy()` function (responsible for destroying a completed command) to call `comedi_put_is_subdevice_running(s)` and wait for the completion event after marking the subdevice as not running. Because the subdevice normally gets marked as not running before the call to `do_become_nonbusy()` (and may also be called when the Comedi device is being detached from the low-level driver), add a new flag `COMEDI_SRF_BUSY` to the set of subdevice run-flags that indicates that an asynchronous command was set up and will need to be destroyed. This flag is set by `do_cmd_ioctl()` and cleared and checked by `do_become_nonbusy()`. Subsequent patches will change the Comedi core functions that are called from low-level drivers for asynchronous command handling to make use of the `comedi_get_is_subdevice_running()` and `comedi_put_is_subdevice_running()` functions, and will modify the ISRs of some of these low-level drivers if they dereference the subdevice's `async` pointer directly. Signed-off-by: Ian Abbott <abbotti@mev.co.uk> Link: https://patch.msgid.link/20251023133001.8439-2-abbotti@mev.co.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
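An illustrative sketch of the ISR pattern the subsequent patches would adopt; the driver and subdevice names are placeholders:

    static irqreturn_t example_isr(int irq, void *d)
    {
            struct comedi_device *dev = d;
            struct comedi_subdevice *s = dev->read_subdev;

            /* Only proceed if a command is set up and running; this pins
             * the command until the matching put below. */
            if (!comedi_get_is_subdevice_running(s))
                    return IRQ_NONE;

            /* ... safe to dereference s->async and advance the
             * acquisition here ... */

            comedi_put_is_subdevice_running(s);
            return IRQ_HANDLED;
    }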
2025-11-26 drm/ttm: rework pipelined eviction fence handling (Pierre-Eric Pelloux-Prayer)
Until now ttm stored a single pipelined eviction fence, which means drivers had to use a single entity for these evictions. To lift this requirement, this commit allows up to 8 entities to be used. Ideally a dma_resv object would have been used as a container of the eviction fences, but the locking rules make it complex. dma_resv objects all have the same ww_class, which means "Attempting to lock more mutexes after ww_acquire_done." is an error. One alternative considered was to introduce a 2nd ww_class for specific resvs to hold a single "transient" lock (= the resv lock would only be held for a short period, without taking any other locks). The other option is to statically reserve a fence array, and extend the existing code to deal with N fences, instead of 1. The driver is still responsible for reserving the correct number of fence slots. Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Link: https://lore.kernel.org/r/20251121101315.3585-20-pierre-eric.pelloux-prayer@amd.com Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com>
2025-11-26 can: netlink: add PWM netlink interface (Vincent Mailhol)
When the TMS is switched on, the node uses PWM (Pulse Width Modulation) during the data phase instead of the classic NRZ (Non Return to Zero) encoding. PWM is configured by three parameters: - PWMS: Pulse Width Modulation Short phase - PWML: Pulse Width Modulation Long phase - PWMO: Pulse Width Modulation Offset time For each of these parameters, define three IFLA symbols: - IFLA_CAN_PWM_PWM*_MIN: the minimum allowed value. - IFLA_CAN_PWM_PWM*_MAX: the maximum allowed value. - IFLA_CAN_PWM_PWM*: the runtime value. This results in a total of nine IFLA symbols which are all nested in a parent IFLA_CAN_XL_PWM symbol. IFLA_CAN_PWM_PWM*_MIN and IFLA_CAN_PWM_PWM*_MAX define the range of allowed values and will match the value statically configured by the device in struct can_pwm_const. IFLA_CAN_PWM_PWM* match the runtime values stored in struct can_pwm. Those parameters may only be configured when the TMS mode is on. If the PWMS, PWML and PWMO parameters are provided, check that all the needed parameters are present using can_validate_pwm(), then check their value using can_validate_pwm_bittiming(). PWMO defaults to zero if omitted. Otherwise, if CAN_CTRLMODE_XL_TMS is true but none of the PWM parameters are provided, calculate them using can_calc_pwm(). Signed-off-by: Vincent Mailhol <mailhol@kernel.org> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20251126-canxl-v8-11-e7e3eb74f889@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-11-26 can: calc_bittiming: add PWM calculation (Vincent Mailhol)
Perform the PWM calculation according to CiA recommendations. Note that for data bitrates greater than 5 Mbps, tqmin is less than CAN_PWM_NS_MAX (which is defined to 200 nanoseconds); consequently, the result of the division: DIV_ROUND_UP(xl_ns, CAN_PWM_NS_MAX) is one and thus the for loop automatically stops on the first iteration, giving a single PWM symbol per bit as expected. Because of that, there is no actual need for a separate conditional branch for when the data bitrate is greater than 5 Mbps. Signed-off-by: Vincent Mailhol <mailhol@kernel.org> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20251126-canxl-v8-10-e7e3eb74f889@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
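An illustrative reading of the arithmetic described above; the helper name is made up, and CAN_PWM_NS_MAX = 200 ns per the text:

    /* Number of PWM symbols per XL data bit: how many 200 ns chunks are
     * needed to cover one bit time. At 5 Mbit/s the bit time is exactly
     * 200 ns, so this yields 1 with no special case needed. */
    static u32 pwm_symbols_per_bit(u32 bit_time_ns)
    {
            return DIV_ROUND_UP(bit_time_ns, CAN_PWM_NS_MAX);
    }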
2025-11-26 can: bittiming: add PWM validation (Vincent Mailhol)
Add can_validate_pwm() to validate the values pwms, pwml and pwmo. Error messages are added to each of the checks to inform the user about what went wrong. Refer to those error messages to understand the validation logic. The boundary values CAN_PWM_DECODE_NS (the transceiver minimum decoding margin) and CAN_PWM_NS_MAX (the maximum PWM symbol duration) are hardcoded for the moment. Note that a transceiver capable of bitrates higher than 20 Mbps may be able to handle a CAN_PWM_DECODE_NS below 5 ns. If such transceivers become commercially available, this code could be revisited to make this parameter configurable. For now, leave it static. Signed-off-by: Vincent Mailhol <mailhol@kernel.org> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20251126-canxl-v8-9-e7e3eb74f889@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-11-26 can: bittiming: add PWM parameters (Vincent Mailhol)
In CAN XL, higher data bit rates require the CAN transceiver to switch its operation mode to use Pulse-Width Modulation (PWM) transmission mode instead of the classic dominant/recessive transmission mode. The PWM parameters are: - PWMS: pulse width modulation short phase - PWML: pulse width modulation long phase - PWMO: pulse width modulation offset CiA 612-2 specifies PWMS and PWML to be at least 1 (arguably, PWML shall be at least 2 to respect the PWMS < PWML rule). PWMO's minimum is expected to always be zero. It is added more for consistency than anything else. Add struct can_pwm_const so that the different devices can provide their minimum and maximum values. When TMS is on, the runtime PWMS, PWML and PWMO are needed (either calculated or provided by the user): add struct can_pwm to store these. TDC and PWM can not be used at the same time (TDC can only be used when TMS is off and PWM only when TMS is on). struct can_pwm is thus put together with struct can_tdc inside a union to save some space. The netlink logic will be added in an upcoming change. Signed-off-by: Vincent Mailhol <mailhol@kernel.org> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20251126-canxl-v8-8-e7e3eb74f889@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-11-26 can: dev: can_dev_dropped_skb: drop CC/FD frames in CANXL-only mode (Oliver Hartkopp)
The error-signalling (ES) is a mandatory functionality for CAN CC and CAN FD to report CAN frame format violations by sending an error-frame signal on the bus. A so-called 'mixed-mode' is intended to have (XL-tolerant) CAN FD nodes and CAN XL nodes on one CAN segment, where the FD-controllers can talk CC/FD and the XL-controllers can talk CC/FD/XL. This mixed-mode utilizes the error-signalling for sending CC/FD/XL frames. The CANXL-only mode disables the error-signalling in the CAN XL controller. This mode does not allow CC/FD frames to be sent but additionally offers a CAN XL transceiver mode switching (TMS). Configured with CAN_CTRLMODE_FD and CAN_CTRLMODE_XL this leads to:
    FD=0 XL=0  CC-only mode        (ES=1)
    FD=1 XL=0  FD/CC mixed-mode    (ES=1)
    FD=1 XL=1  XL/FD/CC mixed-mode (ES=1)
    FD=0 XL=1  XL-only mode        (ES=0, TMS optional)
The helper function can_dev_in_xl_only_mode() determines the required value to disable error signalling in the CAN XL controller. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20251126-canxl-v8-7-e7e3eb74f889@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
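Going by the mode table above, a plausible sketch of the helper; the actual implementation may differ:

    static inline bool can_dev_in_xl_only_mode(const struct can_priv *priv)
    {
            /* XL-only mode: XL enabled with FD (and thus CC/FD traffic)
             * disabled; this is the ES=0 row of the table. */
            return (priv->ctrlmode & (CAN_CTRLMODE_FD | CAN_CTRLMODE_XL)) ==
                   CAN_CTRLMODE_XL;
    }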
2025-11-26 can: netlink: add CAN_CTRLMODE_XL_TMS flag (Vincent Mailhol)
The Transceiver Mode Switching (TMS) indicates whether the CAN XL controller shall use the PWM or NRZ encoding during the data phase. The term "transceiver mode switching" is used in both ISO 11898-1 and CiA 612-2 (although only the latter one uses the abbreviation TMS). We adopt the same naming convention here for consistency. Add the CAN_CTRLMODE_XL_TMS flag to the list of the CAN control modes. Add can_validate_xl_flags() to check the coherency of the TMS flag. That function will be reused in upcoming changes to validate the other CAN XL flags. Signed-off-by: Vincent Mailhol <mailhol@kernel.org> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20251126-canxl-v8-6-e7e3eb74f889@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-11-26 can: netlink: add initial CAN XL support (Vincent Mailhol)
CAN XL uses bittiming parameters different from Classical CAN and CAN FD. Thus, all the data bittiming parameters, including TDC, need to be duplicated for CAN XL. Add the CAN XL netlink interface for all the features which are common with CAN FD. Any new CAN XL specific features are added later on. The first time CAN XL is activated, the MTU is set by default to CANXL_MAX_MTU. The user may then configure a custom MTU within the CANXL_MIN_MTU to CANXL_MAX_MTU range, in which case, the custom MTU value will be kept as long as CAN XL remains active. Signed-off-by: Vincent Mailhol <mailhol@kernel.org> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20251126-canxl-v8-5-e7e3eb74f889@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-11-26 can: netlink: add CAN_CTRLMODE_RESTRICTED (Vincent Mailhol)
ISO 11898-1:2024 adds a new restricted operation mode. This mode is added as a mandatory feature for nodes which support CAN XL and is retrofitted as optional for legacy nodes (i.e. the ones which only support Classical CAN and CAN FD). The restricted operation mode is nearly the same as the listen only mode: the node can not send data frames or remote frames and can not send dominant bits if an error occurs. The only exception is that the node shall still send the acknowledgment bit. A second niche exception is that the node may still send a data frame containing a time reference message if the node is a primary time provider, but because the time provider feature is not yet implemented in the kernel, this second exception is not relevant to us at the moment. Add the CAN_CTRLMODE_RESTRICTED control mode flag and update the can_dev_dropped_skb() helper function accordingly. Finally, bail out if both CAN_CTRLMODE_LISTENONLY and CAN_CTRLMODE_RESTRICTED are provided. Signed-off-by: Vincent Mailhol <mailhol@kernel.org> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20251126-canxl-v8-4-e7e3eb74f889@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-11-26 can: dev: can_dev_dropped_skb: drop CAN FD skbs if FD is off (Vincent Mailhol)
Currently, the CAN FD skb validation logic is based on the MTU: the interface is deemed FD capable if and only if its MTU is greater or equal to CANFD_MTU. This logic is showing its limit with the introduction of CAN XL. For example, consider the two scenarios below: 1. An interface configured with CAN FD on and CAN XL on 2. An interface configured with CAN FD off and CAN XL on In those two scenarios, the interfaces would have the same MTU: CANXL_MTU, making it impossible to differentiate which one has CAN FD turned on and which one has it off. Because of this limitation, the only non-UAPI-breaking workaround is to do the check at the device level using the can_priv->ctrlmode flags. Unfortunately, the virtual interfaces (vcan, vxcan), which do not have a can_priv, are left behind. Add a check on the CAN_CTRLMODE_FD flag in can_dev_dropped_skb() and drop FD frames whenever the feature is turned off. Signed-off-by: Vincent Mailhol <mailhol@kernel.org> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20251126-canxl-v8-3-e7e3eb74f889@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
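A sketch of the shape of the added check (not the exact upstream hunk); it assumes a device with a valid can_priv, since virtual interfaces without one keep the MTU-based behaviour:

    /* Inside can_dev_dropped_skb(): an FD frame is not allowed when the
     * device has CAN_CTRLMODE_FD switched off in its ctrlmode flags. */
    static bool fd_frame_not_allowed(struct net_device *dev,
                                     struct sk_buff *skb)
    {
            struct can_priv *priv = netdev_priv(dev);

            return can_is_canfd_skb(skb) &&
                   !(priv->ctrlmode & CAN_CTRLMODE_FD);
    }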
2025-11-26 can: bittiming: apply NL_SET_ERR_MSG() to can_calc_bittiming() (Vincent Mailhol)
When CONFIG_CAN_CALC_BITTIMING is disabled, the can_calc_bittiming() function cannot be used and the user needs to provide all the bittiming parameters. Currently, can_calc_bittiming() prints an error message to the kernel log. Instead, use NL_SET_ERR_MSG() to make it return the error message through the netlink interface so that the user can directly see it. Signed-off-by: Vincent Mailhol <mailhol@kernel.org> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20251126-canxl-v8-2-e7e3eb74f889@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-11-26 Merge tag 'kvm-x86-svm-6.19' of https://github.com/kvm-x86/linux into HEAD (Paolo Bonzini)
KVM SVM changes for 6.19:
- Fix a few missing "VMCB dirty" bugs.
- Fix the worst of KVM's lack of EFER.LMSLE emulation.
- Add AVIC support for addressing 4k vCPUs in x2AVIC mode.
- Fix incorrect handling of selective CR0 writes when checking intercepts during emulation of L2 instructions.
- Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on VMRUN and #VMEXIT.
- Fix a bug where KVM could corrupt the guest code stream when re-injecting a soft interrupt if the guest patched the underlying code after the VM-Exit, e.g. when Linux patches code with a temporary INT3.
- Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to userspace, and extend KVM "support" to all policy bits that don't require any actual support from KVM.
2025-11-26 Merge tag 'kvm-x86-tdx-6.19' of https://github.com/kvm-x86/linux into HEAD (Paolo Bonzini)
KVM TDX changes for 6.19:
- Overhaul the TDX code to address systemic races where KVM (acting on behalf of userspace) could inadvertently trigger lock contention in the TDX-Module, which KVM was either working around in weird, ugly ways, or was simply oblivious to (as proven by Yan tripping several KVM_BUG_ON()s with clever selftests).
- Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a vCPU if creating said vCPU failed partway through.
- Fix a few sparse warnings (bad annotation, 0 != NULL).
- Use struct_size() to simplify copying capabilities to userspace.
2025-11-26 Merge tag 'kvm-x86-gmem-6.19' of https://github.com/kvm-x86/linux into HEAD (Paolo Bonzini)
KVM guest_memfd changes for 6.19:
- Add NUMA mempolicy support for guest_memfd, and clean up a variety of rough edges in guest_memfd along the way.
- Define a CLASS to automatically handle get+put when grabbing a guest_memfd from a memslot to make it harder to leak references.
- Enhance KVM selftests to make it easier to develop and debug selftests like those added for guest_memfd NUMA support, e.g. where test and/or KVM bugs often result in hard-to-debug SIGBUS errors.
- Misc cleanups.
2025-11-25 tcp: remove icsk->icsk_retransmit_timer (Eric Dumazet)
Now that sk->sk_timer is no longer used by TCP keepalive, we can use its storage for the TCP and MPTCP retransmit timers for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25 tcp: introduce icsk->icsk_keepalive_timer (Eric Dumazet)
sk->sk_timer has been used for TCP keepalives. Keepalive timers are not in the fast path; we want to use sk->sk_timer storage for retransmit timers, for better cache locality. Create icsk->icsk_keepalive_timer and change the keepalive code to no longer use sk->sk_timer. Added space is reclaimed in the following patch. This includes changes to MPTCP, which was also using sk_timer. Alias icsk->mptcp_tout_timer and icsk->icsk_keepalive_timer for inet_sk_diag_fill()'s sake. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25 net: move sk_dst_pending_confirm and sk_pacing_status to sock_read_tx group (Eric Dumazet)
These two fields are mostly read in the TCP tx path; move them to a more appropriate group for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25 tcp: rename icsk_timeout() to tcp_timeout_expires() (Eric Dumazet)
In preparation for the sk->tcp_timeout_timer introduction, rename the icsk_timeout() helper and change its argument to a plain 'const struct sock *sk'. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25 tools: ynl-gen: add regeneration comment (Asbjørn Sloth Tønnesen)
Add a comment on regeneration to the generated files. The comment is placed after the YNL-GEN line[1], as to not interfere with ynl-regen.sh's detection logic. [1] and after the optional YNL-ARG line. Link: https://lore.kernel.org/r/aR5m174O7pklKrMR@zx2c4.com/ Suggested-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251120174429.390574-3-ast@fiberby.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25 bpf: Introduce internal bpf_map_check_op_flags helper function (Leon Hwang)
Introduce a helper to unify map flags checking for the lookup_elem, update_elem, lookup_batch and update_batch APIs. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20251125145857.98134-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
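A hedged sketch of what such a helper plausibly checks; the body is an assumption based on the existing BPF_F_LOCK semantics, not the actual patch:

    static int bpf_map_check_op_flags(struct bpf_map *map, u64 flags,
                                      u64 allowed_flags)
    {
            if (flags & ~allowed_flags)
                    return -EINVAL;

            /* BPF_F_LOCK is only valid on maps with a spin lock field. */
            if ((flags & BPF_F_LOCK) &&
                !btf_record_has_field(map->record, BPF_SPIN_LOCK))
                    return -EINVAL;

            return 0;
    }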
2025-11-25 hfs/hfsplus: move on-disk layout declarations into hfs_common.h (Viacheslav Dubeyko)
Currently, HFS declares the on-disk layout's metadata structures in fs/hfs/hfs.h and HFS+ declares them in fs/hfsplus/hfsplus_raw.h. However, the HFS and HFS+ on-disk layouts have some similarity and overlap in their declarations. As a result, fs/hfs/hfs.h and fs/hfsplus/hfsplus_raw.h contain multiple duplicated declarations. Moreover, both the HFS and HFS+ drivers contain completely similar functionality implemented in multiple places. This patch moves the on-disk layout declarations from fs/hfs/hfs.h and fs/hfsplus/hfsplus_raw.h into include/linux/hfs_common.h with the goal of excluding the duplication in declarations. Also, this patch prepares the basis for creating a hfslib that can aggregate common functionality without the need to duplicate the same code in the HFS and HFS+ drivers. Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com> cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> cc: Yangtao Li <frank.li@vivo.com> cc: linux-fsdevel@vger.kernel.org Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
2025-11-25 sched/mmcid: Switch over to the new mechanism (Thomas Gleixner)
Now that all pieces are in place, change the implementations of sched_mm_cid_fork() and sched_mm_cid_exit() to adhere to the new strict ownership scheme and switch context_switch() over to use the new mm_cid_schedin() functionality. The common case is that there is no mode change required, which makes fork() and exit() just update the user count and the constraints. In case a new user would exceed the CID space limit, the fork() context handles the transition to per CPU mode with mm::mm_cid::mutex held. exit() handles the transition back to per task mode when the user count drops below the switch-back threshold. fork() might also be forced to handle a deferred switch back to per task mode, when an affinity change increased the number of allowed CPUs enough. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172550.280380631@linutronix.de
2025-11-25 sched/mmcid: Implement deferred mode change (Thomas Gleixner)
When affinity changes cause an increase of the number of CPUs allowed for tasks which are related to a MM, that might result in a situation where the ownership mode can go back from per CPU mode to per task mode. As affinity changes happen with the runqueue lock held, there is no way to do the actual mode change and required fixup right there. Add the infrastructure to defer it to a workqueue. The scheduled work can race with a fork() or exit(). Whatever happens first takes care of it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172550.216484739@linutronix.de
2025-11-25 irqwork: Move data struct to a types header (Thomas Gleixner)
... to avoid header recursion hell. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172550.152813625@linutronix.de
2025-11-25 sched/mmcid: Provide CID ownership mode fixup functions (Thomas Gleixner)
CIDs are either owned by tasks or by CPUs. The ownership mode depends on the number of tasks related to a MM and the number of CPUs on which these tasks are theoretically allowed to run. Theoretically because that number is the superset of CPU affinities of all tasks, which only grows and never shrinks. Switching to per CPU mode happens when the user count becomes greater than the maximum number of CIDs, which is calculated by: opt_cids = min(mm_cid::nr_cpus_allowed, mm_cid::users); max_cids = min(1.25 * opt_cids, nr_cpu_ids); The +25% allowance is useful for tight CPU masks in scenarios where only a few threads are created and destroyed to avoid frequent mode switches. Though this allowance shrinks, the closer opt_cids becomes to nr_cpu_ids, which is the (unfortunate) hard ABI limit. At the point of switching to per CPU mode the new user is not yet visible in the system, so the task which initiated the fork() runs the fixup function: mm_cid_fixup_tasks_to_cpus() walks the thread list and either transfers each task's owned CID to the CPU the task runs on or drops it into the CID pool if a task is not on a CPU at that point in time. Tasks which schedule in before the task walk reaches them do the handover in mm_cid_schedin(). When mm_cid_fixup_tasks_to_cpus() completes it's guaranteed that no task related to that MM owns a CID anymore. Switching back to task mode happens when the user count goes below the threshold which was recorded on the per CPU mode switch: pcpu_thrs = min(opt_cids - (opt_cids / 4), nr_cpu_ids / 2); This threshold is updated when an affinity change increases the number of allowed CPUs for the MM, which might cause a switch back to per task mode. If the switch back was initiated by an exiting task, then that task runs the fixup function. If it was initiated by an affinity change, then it's run either in the deferred update function in context of a workqueue or by a task which forks a new one or by a task which exits. Whatever happens first. mm_cid_fixup_cpus_to_tasks() walks through the possible CPUs and either transfers the CPU owned CIDs to a related task which runs on the CPU or drops them into the pool. Tasks which schedule in on a CPU which the walk did not cover yet do the handover themselves. This transition from CPU to per task ownership happens in two phases: 1) mm::mm_cid.transit contains MM_CID_TRANSIT. This is OR'ed on the task CID and denotes that the CID is only temporarily owned by the task. When it schedules out the task drops the CID back into the pool if this bit is set. 2) The initiating context walks the per CPU space and after completion clears mm::mm_cid.transit. After that point the CIDs are strictly task owned again. This two phase transition is required to prevent CID space exhaustion during the transition, as a direct transfer of ownership would fail if two tasks are scheduled in on the same CPU before the fixup freed per CPU CIDs. When mm_cid_fixup_cpus_to_tasks() completes it's guaranteed that no CID related to that MM is owned by a CPU anymore. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172550.088189028@linutronix.de
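A worked example of the formulas above (illustrative numbers): with mm_cid::nr_cpus_allowed = 8 and 10 users, opt_cids = min(8, 10) = 8 and max_cids = 8 + 8/4 = 10, so the 11th user triggers the switch to per CPU mode. The recorded switch-back threshold is then pcpu_thrs = min(8 - 8/4, nr_cpu_ids/2) = 6 on a machine with at least 12 CPUs, i.e. the MM returns to per task mode once the user count drops below 6.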
2025-11-25 sched/mmcid: Provide new scheduler CID mechanism (Thomas Gleixner)
The MM CID management has two fundamental requirements: 1) It has to guarantee that at no given point in time the same CID is used by concurrent tasks in userspace. 2) The CID space must not exceed the number of possible CPUs in a system. While most allocators (glibc, tcmalloc, jemalloc) do not care about that, there seems to be at least some LTTng library depending on it. The CID space compaction itself is not a functional correctness requirement, it is only a useful optimization mechanism to reduce the memory footprint of unused user space pools. The optimal CID space is: min(nr_tasks, nr_cpus_allowed); where @nr_tasks is the number of actual user space threads associated to the mm and @nr_cpus_allowed is the superset of all task affinities. It is grow-only, as it would be insane to take a racy snapshot of all task affinities when the affinity of one task changes just to redo it 2 milliseconds later when the next task changes its affinity. That means that as long as the number of tasks is lower than or equal to the number of CPUs allowed, each task owns a CID. If the number of tasks exceeds the number of CPUs allowed, it switches to per CPU mode, where the CPUs own the CIDs and the tasks borrow them as long as they are scheduled in. For transition periods CIDs can go beyond the optimal space as long as they don't go beyond the number of possible CPUs. The current upstream implementation adds overhead into task migration to keep the CID with the task. It also has to do the CID space consolidation work from a task work in the exit to user space path. As that work is assigned to a random task related to a MM, this can inflict unwanted exit latencies. Implement the context switch parts of a strict ownership mechanism to address this. This removes most of the work from the task which schedules out. Only during the transition from per CPU to per task ownership is it required to drop the CID when leaving the CPU to prevent CID space exhaustion. Other than that, scheduling out is just a single check and branch. The task which schedules in has to check whether: 1) The ownership mode changed 2) The CID is within the optimal CID space In stable situations this results in zero work. The only short disruption is when ownership mode changes or when the associated CID is not in the optimal CID space. The latter only happens when tasks exit and therefore the optimal CID space shrinks. The mechanism is strictly optimized for the common case where no change happens. The only case where it actually causes a temporary one-time spike is on mode changes, when and only when a lot of tasks related to a MM schedule at exactly the same time and eventually have to compete on allocating a CID from the bitmap. In the sysbench test case which triggered the spinlock contention in the initial CID code, __schedule() drops significantly in perf top on a 128 core (256 threads) machine when running sysbench with 255 threads, which fits into the task mode limit of 256 together with the parent thread:
    Upstream  rseq/perf branch  +CID rework
    0.42%     0.37%             0.32%        [k] __schedule
Increasing the number of threads to 256, which puts the test process into per CPU mode, looks about the same. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172550.023984859@linutronix.de