summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2025-09-29slab: sheaf prefilling for guaranteed allocationsVlastimil Babka
Add functions for efficient guaranteed allocations e.g. in a critical section that cannot sleep, when the exact number of allocations is not known beforehand, but an upper limit can be calculated. kmem_cache_prefill_sheaf() returns a sheaf containing at least given number of objects. kmem_cache_alloc_from_sheaf() will allocate an object from the sheaf and is guaranteed not to fail until depleted. kmem_cache_return_sheaf() is for giving the sheaf back to the slab allocator after the critical section. This will also attempt to refill it to cache's sheaf capacity for better efficiency of sheaves handling, but it's not stricly necessary to succeed. kmem_cache_refill_sheaf() can be used to refill a previously obtained sheaf to requested size. If the current size is sufficient, it does nothing. If the requested size exceeds cache's sheaf_capacity and the sheaf's current capacity, the sheaf will be replaced with a new one, hence the indirect pointer parameter. kmem_cache_sheaf_size() can be used to query the current size. The implementation supports requesting sizes that exceed cache's sheaf_capacity, but it is not efficient - such "oversize" sheaves are allocated fresh in kmem_cache_prefill_sheaf() and flushed and freed immediately by kmem_cache_return_sheaf(). kmem_cache_refill_sheaf() might be especially ineffective when replacing a sheaf with a new one of a larger capacity. It is therefore better to size cache's sheaf_capacity accordingly to make oversize sheaves exceptional. CONFIG_SLUB_STATS counters are added for sheaf prefill and return operations. A prefill or return is considered _fast when it is able to grab or return a percpu spare sheaf (even if the sheaf needs a refill to satisfy the request, as those should amortize over time), and _slow otherwise (when the barn or even sheaf allocation/freeing has to be involved). sheaf_prefill_oversize is provided to determine how many prefills were oversize (counter for oversize returns is not necessary as all oversize refills result in oversize returns). When slub_debug is enabled for a cache with sheaves, no percpu sheaves exist for it, but the prefill functionality is still provided simply by all prefilled sheaves becoming oversize. If percpu sheaves are not created for a cache due to not passing the sheaf_capacity argument on cache creation, the prefills also work through oversize sheaves, but there's a WARN_ON_ONCE() to indicate the omission. Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-28lib/string_choices: Add str_assert_deassert() helperLad Prabhakar
Add str_assert_deassert() helper to return "assert" or "deassert" string literal depending on the boolean argument. Also add the inversed variant str_deassert_assert(). Suggested-by: Philipp Zabel <p.zabel@pengutronix.de> Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Reviewed-by: Andy Shevchenko <andy@kernel.org> Link: https://lore.kernel.org/r/20250923095229.2149740-1-prabhakar.mahadev-lad.rj@bp.renesas.com Signed-off-by: Kees Cook <kees@kernel.org>
2025-09-28mm: convert folio_page() back to a macroDavid Hildenbrand
In commit 73b3294b1152 ("mm: simplify folio_page() and folio_page_idx()") we converted folio_page() into a static inline function. However briefly afterwards in commit a847b17009ec ("mm: constify highmem related functions for improved const-correctness") we had to add some nasty const-away casting to make the compiler happy when checking const correctness. So let's just convert it back to a simple macro so the compiler can check const correctness properly. There is the alternative of using a _Generic() similar to page_folio(), but there is not a lot of benefit compared to just using a simple macro. Link: https://lkml.kernel.org/r/20250923140058.2020023-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-28mm: silence data-race in update_hiwater_rssLance Yang
KCSAN reports a data race on mm_cluster.hiwater_rss, which can be accessed concurrently from various paths like page migration and memory unmapping without synchronization. Since hiwater_rss is a statistical field for accounting purposes, this data race is benign. Annotate both the read and write accesses with data_race() to make KCSAN happy. Link: https://lkml.kernel.org/r/20250926092426.43312-1-lance.yang@linux.dev Signed-off-by: Lance Yang <lance.yang@linux.dev> Reported-by: syzbot+60192c8877d0bc92a92b@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-mm/68d6364e.050a0220.3390a8.000d.GAE@google.com Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Marco Elver <elver@google.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-28mm/ksm: fix incorrect KSM counter handling in mm_struct during forkDonet Tom
Patch series "mm/ksm: Fix incorrect accounting of KSM counters during fork", v3. The first patch in this series fixes the incorrect accounting of KSM counters such as ksm_merging_pages, ksm_rmap_items, and the global ksm_zero_pages during fork. The following patch add a selftest to verify the ksm_merging_pages counter was updated correctly during fork. Test Results ============ Without the first patch ----------------------- # [RUN] test_fork_ksm_merging_page_count not ok 10 ksm_merging_page in child: 32 With the first patch -------------------- # [RUN] test_fork_ksm_merging_page_count ok 10 ksm_merging_pages is not inherited after fork This patch (of 2): Currently, the KSM-related counters in `mm_struct`, such as `ksm_merging_pages`, `ksm_rmap_items`, and `ksm_zero_pages`, are inherited by the child process during fork. This results in inconsistent accounting. When a process uses KSM, identical pages are merged and an rmap item is created for each merged page. The `ksm_merging_pages` and `ksm_rmap_items` counters are updated accordingly. However, after a fork, these counters are copied to the child while the corresponding rmap items are not. As a result, when the child later triggers an unmerge, there are no rmap items present in the child, so the counters remain stale, leading to incorrect accounting. A similar issue exists with `ksm_zero_pages`, which maintains both a global counter and a per-process counter. During fork, the per-process counter is inherited by the child, but the global counter is not incremented. Since the child also references zero pages, the global counter should be updated as well. Otherwise, during zero-page unmerge, both the global and per-process counters are decremented, causing the global counter to become inconsistent. To fix this, ksm_merging_pages and ksm_rmap_items are reset to 0 during fork, and the global ksm_zero_pages counter is updated with the per-process ksm_zero_pages value inherited by the child. This ensures that KSM statistics remain accurate and reflect the activity of each process correctly. Link: https://lkml.kernel.org/r/cover.1758648700.git.donettom@linux.ibm.com Link: https://lkml.kernel.org/r/7b9870eb67ccc0d79593940d9dbd4a0b39b5d396.1758648700.git.donettom@linux.ibm.com Fixes: 7609385337a4 ("ksm: count ksm merging pages for each process") Fixes: cb4df4cae4f2 ("ksm: count allocated ksm rmap_items for each process") Fixes: e2942062e01d ("ksm: count all zero pages placed by KSM") Signed-off-by: Donet Tom <donettom@linux.ibm.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Aboorva Devarajan <aboorvad@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: xu xin <xu.xin16@zte.com.cn> Cc: <stable@vger.kernel.org> [6.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-28mm/page_vma_mapped: track if the page is mapped across page table boundaryKiryl Shutsemau
Patch series "mm: Improve mlock tracking for large folios", v3. The patchset includes several fixes and improvements related to mlock tracking of large folios. The main objective is to reduce the undercount of Mlocked memory in /proc/meminfo and improve the accuracy of the statistics. Patches 1-2: These patches address a minor race condition in folio_referenced_one() related to mlock_vma_folio(). Currently, mlock_vma_folio() is called on large folio without the page table lock, which can result in a race condition with unmap (i.e. MADV_DONTNEED). This can lead to partially mapped folios on the unevictable LRU list. While not a significant issue, I do not believe backporting is necessary. Patch 3: This patch adds mlocking logic similar to folio_referenced_one() to try_to_unmap_one(), allowing for mlocking of large folios where possible. Patch 4-5: These patches modifies finish_fault() and faultaround to map in the entire folio when possible, enabling efficient mlocking upon addition to the rmap. Patch 6: This patch makes rmap mlock large folios if they are fully mapped, addressing the primary source of mlock undercount for large folios. This patch (of 6): Add a PVMW_PGTABLE_CROSSSED flag that page_vma_mapped_walk() will set if the page is mapped across page table boundary. Unlike other PVMW_* flags, this one is result of page_vma_mapped_walk() and not set by the caller. folio_referenced_one() will use it to detect if it safe to mlock the folio. [akpm@linux-foundation.org: s/CROSSSED/CROSSED/] Link: https://lkml.kernel.org/r/20250923110711.690639-1-kirill@shutemov.name Link: https://lkml.kernel.org/r/20250923110711.690639-2-kirill@shutemov.name Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-28Merge tag 'mm-hotfixes-stable-2025-09-27-22-35' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "7 hotfixes. 4 are cc:stable and the remainder address post-6.16 issues or aren't considered necessary for -stable kernels. 6 of these fixes are for MM. All singletons, please see the changelogs for details" * tag 'mm-hotfixes-stable-2025-09-27-22-35' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: include/linux/pgtable.h: convert arch_enter_lazy_mmu_mode() and friends to static inlines mm/damon/sysfs: do not ignore callback's return value in damon_sysfs_damon_call() mailmap: add entry for Bence Csókás fs/proc/task_mmu: check p->vec_buf for NULL kmsan: fix out-of-bounds access to shadow memory mm/hugetlb: fix copy_hugetlb_page_range() to use ->pt_share_count mm/hugetlb: fix folio is still mapped when deleted
2025-09-28net/mlx5: IFC add balance ID and LAG per MP group bitsMark Bloch
Add interface definitions for load balance ID and LAG per multiplane group functionality. This patch introduces the hardware capability bits needed to support balance ID in multiplane LAG configurations. The new fields include: - load_balance_id: 4-bit field for balance identifier. - lag_per_mp_group: capability bit for LAG per multiplane group support. These interface additions are prerequisites for implementing balance ID support in the MLX5 driver. Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Shay Drori <shayd@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1758521191-814350-3-git-send-email-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-28net/mlx5: Add IFC bit for TIR/SQ order capabilityTariq Toukan
Before this cap, firmware requested a certain creation order between TIR objects and SQs of the same transport domain to properly support the self loopback prevention feature. If order is not preserved, explicit modify_tir operations are necessary after the opening of the SQs. When set, this cap bit indicates that this firmware requirement / limitation no longer holds. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1758521191-814350-2-git-send-email-tariqt@nvidia.com Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-27Have cc(1) catch attempts to modify ->f_pathAl Viro
There are very few places that have cause to do that - all in core VFS now, and all done to files that are not yet opened (or visible to anybody else, for that matter). Let's turn f_path into a union of struct path __f_path and const struct path f_path. It's C, not C++ - 6.5.2.3[4] in C99 and later explicitly allows that kind of type-punning. That way any attempts to bypass these checks will be either very easy to catch, or (if the bastards get sufficiently creative to make it hard to spot with grep alone) very clearly malicious - and still catchable with a bit of instrumentation for sparse. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-27Merge branches 'work.path' and 'work.mount' into work.f_pathAl Viro
2025-09-28Merge tag 'i2c-host-6.18' of ↵Wolfram Sang
git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux into i2c/for-mergewindow i2c-host for v6.18 - Add support for MediaTek MT6878 I2C - Drop support for S3C2410
2025-09-27bpf: Enforce expected_attach_type for tailcall compatibilityDaniel Borkmann
Yinhao et al. recently reported: Our fuzzer tool discovered an uninitialized pointer issue in the bpf_prog_test_run_xdp() function within the Linux kernel's BPF subsystem. This leads to a NULL pointer dereference when a BPF program attempts to deference the txq member of struct xdp_buff object. The test initializes two programs of BPF_PROG_TYPE_XDP: progA acts as the entry point for bpf_prog_test_run_xdp() and its expected_attach_type can neither be of be BPF_XDP_DEVMAP nor BPF_XDP_CPUMAP. progA calls into a slot of a tailcall map it owns. progB's expected_attach_type must be BPF_XDP_DEVMAP to pass xdp_is_valid_access() validation. The program returns struct xdp_md's egress_ifindex, and the latter is only allowed to be accessed under mentioned expected_attach_type. progB is then inserted into the tailcall which progA calls. The underlying issue goes beyond XDP though. Another example are programs of type BPF_PROG_TYPE_CGROUP_SOCK_ADDR. sock_addr_is_valid_access() as well as sock_addr_func_proto() have different logic depending on the programs' expected_attach_type. Similarly, a program attached to BPF_CGROUP_INET4_GETPEERNAME should not be allowed doing a tailcall into a program which calls bpf_bind() out of BPF which is only enabled for BPF_CGROUP_INET4_CONNECT. In short, specifying expected_attach_type allows to open up additional functionality or restrictions beyond what the basic bpf_prog_type enables. The use of tailcalls must not violate these constraints. Fix it by enforcing expected_attach_type in __bpf_prog_map_compatible(). Note that we only enforce this for tailcall maps, but not for BPF devmaps or cpumaps: There, the programs are invoked through dev_map_bpf_prog_run*() and cpu_map_bpf_prog_run*() which set up a new environment / context and therefore these situations are not prone to this issue. Fixes: 5e43f899b03a ("bpf: Check attach type at prog load time") Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20250926171201.188490-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-27PM: runtime: Update kerneldoc return codesBrian Norris
APIs based on __pm_runtime_idle() (pm_runtime_idle(), pm_request_idle()) do not return 1 when already suspended. They return -EAGAIN. This is already covered in the docs, so the entry for "1" is redundant and conflicting. (pm_runtime_put() and pm_runtime_put_sync() were previously incorrect, but that's fixed in "PM: runtime: pm_runtime_put{,_sync}() returns 1 when already suspended", to ensure consistency with APIs like pm_runtime_put_autosuspend().) RPM_GET_PUT APIs based on __pm_runtime_suspend() do return 1 when already suspended, but the language is a little unclear -- it's not really an "error", so it seems better to list as a clarification before the 0/success case. Additionally, they only actually return 1 when the refcount makes it to 0; if the usage counter is still non-zero, we return 0. pm_runtime_put(), etc., also don't appear at first like they can ever see "-EAGAIN: Runtime PM usage_count non-zero", because in non-racy conditions, pm_runtime_put() would drop its reference count, see it's non-zero, and return early (in __pm_runtime_idle()). However, it's possible to race with another actor that increments the usage_count afterward, since rpm_idle() is protected by a separate lock; in such a case, we may see -EAGAIN. Because this case is only seen in the presence of concurrent actors, it makes sense to clarify that this is when "usage_count **became** non-zero", by way of some racing actor. Lastly, pm_runtime_put_sync_suspend() duplicated some -EAGAIN language. Fix that. Fixes: 271ff96d6066 ("PM: runtime: Document return values of suspend-related API functions") Link: https://lore.kernel.org/linux-pm/aJ5pkEJuixTaybV4@google.com/ Signed-off-by: Brian Norris <briannorris@chromium.org> Reviewed-by: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: 6.17+ <stable@vger.kernel.org> # 6.17+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-09-27bpf: Remove duplicate crypto/sha2.h headerJiapeng Chong
./include/linux/bpf.h: crypto/sha2.h is included more than once. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=25501 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Acked-by: Quentin Monnet <qmo@kernel.org> Link: https://lore.kernel.org/r/20250926095240.3397539-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-26net: airoha: npu: Add a NPU callback to initialize flow statsLorenzo Bianconi
Introduce a NPU callback to initialize flow stats and remove NPU stats initialization from airoha_npu_get routine. Add num_stats_entries to airoha_npu_ppe_stats_setup routine. This patch makes the code more readable since NPU statistic are now initialized on demand by the NPU consumer (at the moment NPU statistic are configured just by the airoha_eth driver). Moreover this patch allows the NPU consumer (PPE module) to explicitly enable/disable NPU flow stats. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250924-airoha-npu-init-stats-callback-v1-1-88bdf3c941b2@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-26ethtool: add FEC bins histogram reportVadim Fedorenko
IEEE 802.3ck-2022 defines counters for FEC bins and 802.3df-2024 clarifies it a bit further. Implement reporting interface through as addition to FEC stats available in ethtool. Drivers can leave bin counter uninitialized if per-lane values are provided. In this case the core will recalculate summ for the bin. Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Link: https://patch.msgid.link/20250924124037.1508846-2-vadim.fedorenko@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-26ptr_ring: drop duplicated tail zeroing codeMichael S. Tsirkin
We have some rather subtle code around zeroing tail entries, minimizing cache bouncing. Let's put it all in one place. Doing this also reduces the text size slightly, e.g. for drivers/vhost/net.o Before: text: 15,114 bytes After: text: 15,082 bytes Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://patch.msgid.link/adb9d941de4a2b619ddb2be271a9939849e70687.1758690291.git.mst@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-26Merge tag 'linux-can-next-for-6.18-20250924' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== pull-request: can-next 2025-09-25 this is a pull request of 48 patches for net-next/main, which supersedes tags/linux-can-next-for-6.18-20250923. The 1st patch is by Xichao Zhao and converts ns_to_ktime() to us_to_ktime() in the m_can driver. Vincent Mailhol contributes 2 patches: Updating the MAINTAINERS and mailmap files to Vincent's new email address and sorting the includes in the CAN helper library alphabeticaly. Stéphane Grosjean's patch modifies all peak CAN drivers and the mailmap to reflect Stéphane's new email address. 4 patches by Biju Das update the CAN-FD handling in the rcar_canfd driver. Followed by 11 patches by Geert Uytterhoeven updating and improving the rcar_can driver. Stefan Mätje contributes 2 patches for the esd_usb driver updating the error messages. The next 3 patch series are all by Vincent Mailhol: 3 patches to optimize the size of struct raw_sock and struct uniqframe. 4 patches which rework the CAN MTU logic as preparation for CAN-XL interfaces. And finally 20 patches that prepare and refactor the CAN netlink code for the upcoming CAN-XL support. * tag 'linux-can-next-for-6.18-20250924' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next: (48 commits) can: netlink: add userland error messages can: dev: add can_get_ctrlmode_str() can: calc_bittiming: make can_calc_tdco() FD agnostic can: netlink: make can_tdc_fill_info() FD agnostic can: netlink: add can_bitrate_const_fill_info() can: netlink: add can_bittiming_const_fill_info() can: netlink: add can_bittiming_fill_info() can: netlink: add can_data_bittiming_get_size() can: netlink: make can_tdc_get_size() FD agnostic can: netlink: add can_ctrlmode_changelink() can: netlink: add can_dtb_changelink() can: netlink: make can_tdc_changelink() FD agnostic can: netlink: remove useless check in can_tdc_changelink() can: netlink: refactor CAN_CTRLMODE_TDC_{AUTO,MANUAL} flag reset logic can: netlink: add can_validate_databittiming() can: netlink: add can_validate_tdc() can: netlink: refactor can_validate_bittiming() can: netlink: document which symbols are FD specific can: dev: make can_get_relative_tdco() FD agnostic and move it to bittiming.h can: dev: move struct data_bittiming_params to linux/can/bittiming.h ... ==================== Link: https://patch.msgid.link/20250925121332.848157-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-26Merge tag 'wireless-next-2025-09-25' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Johannes Berg says: ==================== Quite a bit more things, including pull requests from drivers: - mt76: MLO support, HW restart improvements - rtw88/89: small features, prep for RTL8922DE support - ath10k: GTK rekey fixes - cfg80211/mac80211: - additions for more NAN support - S1G channel representation cleanup * tag 'wireless-next-2025-09-25' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (167 commits) wifi: libertas: add WQ_UNBOUND to alloc_workqueue users Revert "wifi: libertas: WQ_PERCPU added to alloc_workqueue users" wifi: libertas: WQ_PERCPU added to alloc_workqueue users wifi: cfg80211: fix width unit in cfg80211_radio_chandef_valid() wifi: ath11k: HAL SRNG: don't deinitialize and re-initialize again wifi: ath12k: enforce CPU endian format for all QMI data wifi: ath12k: Use 1KB Cache Flush Command for QoS TID Descriptors wifi: ath12k: Fix flush cache failure during RX queue update wifi: ath12k: Add Retry Mechanism for REO RX Queue Update Failures wifi: ath12k: Refactor REO command to use ath12k_dp_rx_tid_rxq wifi: ath12k: Refactor RX TID buffer cleanup into helper function wifi: ath12k: Refactor RX TID deletion handling into helper function wifi: ath12k: Increase DP_REO_CMD_RING_SIZE to 256 wifi: cfg80211: remove IEEE80211_CHAN_{1,2,4,8,16}MHZ flags wifi: rtw89: avoid circular locking dependency in ser_state_run() wifi: rtw89: fix leak in rtw89_core_send_nullfunc() wifi: rtw89: avoid possible TX wait initialization race wifi: rtw89: fix use-after-free in rtw89_core_tx_kick_off_and_wait() wifi: ath12k: Fix peer lookup in ath12k_dp_mon_rx_deliver_msdu() wifi: mac80211: fix Rx packet handling when pubsta information is not available ... ==================== Link: https://patch.msgid.link/20250925232341.4544-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-26of: base: Add of_get_next_child_with_prefix() stubBjorn Helgaas
1fcc67e3a354 ("of: base: Add for_each_child_of_node_with_prefix()") added of_get_next_child_with_prefix() but did not add a stub for the !CONFIG_OF case. Add a of_get_next_child_with_prefix() stub so users of for_each_child_of_node_with_prefix() can be built for compile testing even when !CONFIG_OF. Fixes: 1fcc67e3a354 ("of: base: Add for_each_child_of_node_with_prefix()") Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
2025-09-26Merge tag 'sched-urgent-2025-09-26' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Fix two dl_server regressions: a race that can end up leaving the dl_server stuck, and a dl_server throttling bug causing lag to fair tasks" * tag 'sched-urgent-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Fix dl_server behaviour sched/deadline: Fix dl_server getting stuck
2025-09-26PM: hibernate: Fix pm_hibernation_mode_is_suspend() build breakageRafael J. Wysocki
Commit 495c8d35035e ("PM: hibernate: Add pm_hibernation_mode_is_suspend()") that introduced pm_hibernation_mode_is_suspend() did not define it in the case when CONFIG_HIBERNATION is unset, but CONFIG_SUSPEND is set. Subsequent commit 0a6e9e098fcc ("drm/amd: Fix hybrid sleep") made the amdgpu driver use that function which led to kernel build breakage in the case mentioned above [1]. Address this by using appropriate #ifdeffery around the definition of pm_hibernation_mode_is_suspend(). Fixes: 0a6e9e098fcc ("drm/amd: Fix hybrid sleep") Reported-by: KernelCI bot <bot@kernelci.org> Closes: https://groups.io/g/kernelci-results/topic/regression_pm_testing/115439919 [1] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
2025-09-26slab: add opt-in caching layer of percpu sheavesVlastimil Babka
Specifying a non-zero value for a new struct kmem_cache_args field sheaf_capacity will setup a caching layer of percpu arrays called sheaves of given capacity for the created cache. Allocations from the cache will allocate via the percpu sheaves (main or spare) as long as they have no NUMA node preference. Frees will also put the object back into one of the sheaves. When both percpu sheaves are found empty during an allocation, an empty sheaf may be replaced with a full one from the per-node barn. If none are available and the allocation is allowed to block, an empty sheaf is refilled from slab(s) by an internal bulk alloc operation. When both percpu sheaves are full during freeing, the barn can replace a full one with an empty one, unless over a full sheaves limit. In that case a sheaf is flushed to slab(s) by an internal bulk free operation. Flushing sheaves and barns is also wired to the existing cpu flushing and cache shrinking operations. The sheaves do not distinguish NUMA locality of the cached objects. If an allocation is requested with kmem_cache_alloc_node() (or a mempolicy with strict_numa mode enabled) with a specific node (not NUMA_NO_NODE), the sheaves are bypassed. The bulk operations exposed to slab users also try to utilize the sheaves as long as the necessary (full or empty) sheaves are available on the cpu or in the barn. Once depleted, they will fallback to bulk alloc/free to slabs directly to avoid double copying. The sheaf_capacity value is exported in sysfs for observability. Sysfs CONFIG_SLUB_STATS counters alloc_cpu_sheaf and free_cpu_sheaf count objects allocated or freed using the sheaves (and thus not counting towards the other alloc/free path counters). Counters sheaf_refill and sheaf_flush count objects filled or flushed from or to slab pages, and can be used to assess how effective the caching is. The refill and flush operations will also count towards the usual alloc_fastpath/slowpath, free_fastpath/slowpath and other counters for the backing slabs. For barn operations, barn_get and barn_put count how many full sheaves were get from or put to the barn, the _fail variants count how many such requests could not be satisfied mainly because the barn was either empty or full. While the barn also holds empty sheaves to make some operations easier, these are not as critical to mandate own counters. Finally, there are sheaf_alloc/sheaf_free counters. Access to the percpu sheaves is protected by local_trylock() when potential callers include irq context, and local_lock() otherwise (such as when we already know the gfp flags allow blocking). The trylock failures should be rare and we can easily fallback. Each per-NUMA-node barn has a spin_lock. When slub_debug is enabled for a cache with sheaf_capacity also specified, the latter is ignored so that allocations and frees reach the slow path where debugging hooks are processed. Similarly, we ignore it with CONFIG_SLUB_TINY which prefers low memory usage to performance. [boot failure: https://lore.kernel.org/all/583eacf5-c971-451a-9f76-fed0e341b815@linux.ibm.com/ ] Reported-and-tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-26locking/local_lock: Expose dep_map in local_trylock_t.Alexei Starovoitov
lockdep_is_held() macro assumes that "struct lockdep_map dep_map;" is a top level field of any lock that participates in LOCKDEP. Make it so for local_trylock_t. Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-26Merge branches 'apple/dart', 'ti/omap', 'riscv', 'intel/vt-d' and ↵Joerg Roedel
'amd/amd-vi' into next
2025-09-25bpf: Add lookup_and_delete_elem for BPF_MAP_STACK_TRACETao Chen
The stacktrace map can be easily full, which will lead to failure in obtaining the stack. In addition to increasing the size of the map, another solution is to delete the stack_id after looking it up from the user, so extend the existing bpf_map_lookup_and_delete_elem() functionality to stacktrace map types. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250925175030.1615837-1-chen.dylane@linux.dev
2025-09-25include/linux/pgtable.h: convert arch_enter_lazy_mmu_mode() and friends to ↵Andrew Morton
static inlines commit c519c3c0a113 ("mm/kasan: avoid lazy MMU mode hazards") introduced the use of arch_enter_lazy_mmu_mode(), which results in the compiler complaining about "statement has no effect", when __HAVE_ARCH_LAZY_MMU_MODE is not defined in include/linux/pgtable.h The exact warning/error is: In file included from ./include/linux/kasan.h:37, from mm/kasan/shadow.c:14: mm/kasan/shadow.c: In function kasan_populate_vmalloc_pte: ./include/linux/pgtable.h:247:41: error: statement with no effect [-Werror=unused-value] 247 | #define arch_enter_lazy_mmu_mode() (LAZY_MMU_DEFAULT) | ^ mm/kasan/shadow.c:322:9: note: in expansion of macro arch_enter_lazy_mmu_mode> 322 | arch_enter_lazy_mmu_mode(); | ^~~~~~~~~~~~~~~~~~~~~~~~ switching these "functions" to static inlines fixes this up. Fixes: c519c3c0a113 ("mm/kasan: avoid lazy MMU mode hazards") Reported-by: Balbir Singh <balbirs@nvidia.com> Closes: https://lkml.kernel.org/r/20250912235515.367061-1-balbirs@nvidia.com Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-25mm/hugetlb: fix copy_hugetlb_page_range() to use ->pt_share_countJane Chu
commit 59d9094df3d79 ("mm: hugetlb: independent PMD page table shared count") introduced ->pt_share_count dedicated to hugetlb PMD share count tracking, but omitted fixing copy_hugetlb_page_range(), leaving the function relying on page_count() for tracking that no longer works. When lazy page table copy for hugetlb is disabled, that is, revert commit bcd51a3c679d ("hugetlb: lazy page table copies in fork()") fork()'ing with hugetlb PMD sharing quickly lockup - [ 239.446559] watchdog: BUG: soft lockup - CPU#75 stuck for 27s! [ 239.446611] RIP: 0010:native_queued_spin_lock_slowpath+0x7e/0x2e0 [ 239.446631] Call Trace: [ 239.446633] <TASK> [ 239.446636] _raw_spin_lock+0x3f/0x60 [ 239.446639] copy_hugetlb_page_range+0x258/0xb50 [ 239.446645] copy_page_range+0x22b/0x2c0 [ 239.446651] dup_mmap+0x3e2/0x770 [ 239.446654] dup_mm.constprop.0+0x5e/0x230 [ 239.446657] copy_process+0xd17/0x1760 [ 239.446660] kernel_clone+0xc0/0x3e0 [ 239.446661] __do_sys_clone+0x65/0xa0 [ 239.446664] do_syscall_64+0x82/0x930 [ 239.446668] ? count_memcg_events+0xd2/0x190 [ 239.446671] ? syscall_trace_enter+0x14e/0x1f0 [ 239.446676] ? syscall_exit_work+0x118/0x150 [ 239.446677] ? arch_exit_to_user_mode_prepare.constprop.0+0x9/0xb0 [ 239.446681] ? clear_bhb_loop+0x30/0x80 [ 239.446684] ? clear_bhb_loop+0x30/0x80 [ 239.446686] entry_SYSCALL_64_after_hwframe+0x76/0x7e There are two options to resolve the potential latent issue: 1. warn against PMD sharing in copy_hugetlb_page_range(), 2. fix it. This patch opts for the second option. While at it, simplify the comment, the details are not actually relevant anymore. Link: https://lkml.kernel.org/r/20250916004520.1604530-1-jane.chu@oracle.com Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count") Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-25i2c: boardinfo: Annotate code used in init phase onlyHeiner Kallweit
Annotate two places in boardinfo code: - __i2c_first_dynamic_bus_num is set in init phase. Annotate it as __ro_after_init to prevent later changes. - i2c_register_board_info() is used in init phase only, so annotate it as __init, allowing to free the memory after init phase. This is safe, see comment: "done in board-specific init code near arch_initcall() time" Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
2025-09-25irqchip: Add driver for the RPMI system MSI service groupAnup Patel
The RPMI specification defines a system MSI service group which allows application processors to receive MSIs upon system events such as graceful shutdown/reboot request, CPU hotplug event, memory hotplug event, etc. Add an irqchip driver for the RISC-V RPMI system MSI service group to directly receive system MSIs in Linux kernel. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Anup Patel <apatel@ventanamicro.com> Link: https://lore.kernel.org/r/20250818040920.272664-14-apatel@ventanamicro.com Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-09-25PM: hibernate: Add pm_hibernation_mode_is_suspend()Mario Limonciello (AMD)
Some drivers have different flows for hibernation and suspend. If the driver opportunistically will skip thaw() then it needs a hint to know what is happening after the hibernate. Introduce a new symbol pm_hibernation_mode_is_suspend() that drivers can call to determine if suspending the system for this purpose. Tested-by: Ionut Nechita <ionut_n2001@yahoo.com> Tested-by: Kenneth Crudup <kenny@panix.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-09-25clk: Add clock driver for the RISC-V RPMI clock service groupRahul Pathak
The RPMI specification defines a clock service group which can be accessed via SBI MPXY extension or dedicated S-mode RPMI transport. Add mailbox client based clock driver for the RISC-V RPMI clock service group. Reviewed-by: Stephen Boyd <sboyd@kernel.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Co-developed-by: Anup Patel <apatel@ventanamicro.com> Signed-off-by: Anup Patel <apatel@ventanamicro.com> Signed-off-by: Rahul Pathak <rpathak@ventanamicro.com> Link: https://lore.kernel.org/r/20250818040920.272664-11-apatel@ventanamicro.com [pjw@kernel.org: converted rpmi_clkrate_u64 macro to a function; replaced bare constant with a macro] Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-09-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.17-rc8). Conflicts: drivers/net/can/spi/hi311x.c 6b6968084721 ("can: hi311x: fix null pointer dereference when resuming from sleep before interface was enabled") 27ce71e1ce81 ("net: WQ_PERCPU added to alloc_workqueue users") https://lore.kernel.org/72ce7599-1b5b-464a-a5de-228ff9724701@kernel.org net/smc/smc_loopback.c drivers/dibs/dibs_loopback.c a35c04de2565 ("net/smc: fix warning in smc_rx_splice() when calling get_page()") cc21191b584c ("dibs: Move data path to dibs layer") https://lore.kernel.org/74368a5c-48ac-4f8e-a198-40ec1ed3cf5f@kernel.org Adjacent changes: drivers/net/dsa/lantiq/lantiq_gswip.c c0054b25e2f1 ("net: dsa: lantiq_gswip: move gswip_add_single_port_br() call to port_setup()") 7a1eaef0a791 ("net: dsa: lantiq_gswip: support model-specific mac_select_pcs()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-25Merge tag 'thunderbolt-for-v6.18-rc1' of ↵Greg Kroah-Hartman
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/westeri/thunderbolt into usb-next Mika writes: thunderbolt: Changes for v6.18 merge window This includes following USB4/Thunderbolt changes for the v6.18 merge window: - HMAC hashing improvements - Switch to use Linux Foundation IDs for XDomain discovery - Use is_pciehp instead of is_hotplug_bridge - Fixes for various kernel-doc issues - Fix use-after-free in DP tunneling error path. I'm sending the UAF fix with this pull request because it came quite late and I would like to give it some exposure before it lands the mainline. All these except the UAF fix have been in linux-next with no reported issues. * tag 'thunderbolt-for-v6.18-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/westeri/thunderbolt: (33 commits) thunderbolt: Fix use-after-free in tb_dp_dprx_work thunderbolt: Update thunderbolt.h header file thunderbolt: Update xdomain.c function documentation thunderbolt: Update usb4_port.c function documentation thunderbolt: Update usb4.c function documentation thunderbolt: Update tunnel.h function documentation thunderbolt: Update tunnel.c function documentation thunderbolt: Update tmu.c function documentation thunderbolt: Add missing documentation in tb.h thunderbolt: Update tb.h function documentation thunderbolt: Update tb.c function documentation thunderbolt: Update switch.c function documentation thunderbolt: Update retimer.c function documentation thunderbolt: Update property.c function documentation thunderbolt: Update path.c function documentation thunderbolt: Update nvm.c function documentation thunderbolt: Add missing documentation in nhi_regs.h ring_desc structure thunderbolt: Update nhi.c function documentation thunderbolt: Update lc.c function documentation thunderbolt: Update eeprom.c function documentation ...
2025-09-25Merge tag 'net-6.17-rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from Bluetooth, IPsec and CAN. No known regressions at this point. Current release - regressions: - xfrm: xfrm_alloc_spi shouldn't use 0 as SPI Previous releases - regressions: - xfrm: fix offloading of cross-family tunnels - bluetooth: fix several races leading to UaFs - dsa: lantiq_gswip: fix FDB entries creation for the CPU port - eth: - tun: update napi->skb after XDP process - mlx: fix UAF in flow counter release Previous releases - always broken: - core: forbid FDB status change while nexthop is in a group - smc: fix warning in smc_rx_splice() when calling get_page() - can: provide missing ndo_change_mtu(), to prevent buffer overflow. - eth: - i40e: fix VF config validation - broadcom: fix support for PTP_EXTTS_REQUEST2 ioctl" * tag 'net-6.17-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (40 commits) octeontx2-pf: Fix potential use after free in otx2_tc_add_flow() net: dsa: lantiq_gswip: suppress -EINVAL errors for bridge FDB entries added to the CPU port net: dsa: lantiq_gswip: move gswip_add_single_port_br() call to port_setup() libie: fix string names for AQ error codes net/mlx5e: Fix missing FEC RS stats for RS_544_514_INTERLEAVED_QUAD net/mlx5: HWS, ignore flow level for multi-dest table net/mlx5: fs, fix UAF in flow counter release selftests: fib_nexthops: Add test cases for FDB status change selftests: fib_nexthops: Fix creation of non-FDB nexthops nexthop: Forbid FDB status change while nexthop is in a group net: allow alloc_skb_with_frags() to use MAX_SKB_FRAGS bnxt_en: correct offset handling for IPv6 destination address ptp: document behavior of PTP_STRICT_FLAGS broadcom: fix support for PTP_EXTTS_REQUEST2 ioctl broadcom: fix support for PTP_PEROUT_DUTY_CYCLE Bluetooth: MGMT: Fix possible UAFs Bluetooth: hci_event: Fix UAF in hci_acl_create_conn_sync Bluetooth: hci_event: Fix UAF in hci_conn_tx_dequeue Bluetooth: hci_sync: Fix hci_resume_advertising_sync Bluetooth: Fix build after header cleanup ...
2025-09-25platform/chrome: update pwm fan control host commandsSung-Chi Li
Update cros_ec_commands.h to include definitions for getting PWM fan duty, getting and setting the fan control mode. Signed-off-by: Sung-Chi Li <lschyi@chromium.org> Acked-by: Tzung-Bi Shih <tzungbi@kernel.org> Reviewed-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20250911-cros_ec_fan-v6-1-a1446cc098af@google.com Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2025-09-25Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio fixes from Michael Tsirkin: "virtio,vhost: last minute fixes More small fixes. Most notably this fixes crashes and hangs in vhost-net" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: MAINTAINERS, mailmap: Update address for Peter Hilber virtio_config: clarify output parameters uapi: vduse: fix typo in comment vhost: Take a reference on the task in struct vhost_task. vhost-net: flush batched before enabling notifications Revert "vhost/net: Defer TX queue re-enable until after sendmsg" vhost-net: unbreak busy polling vhost-scsi: fix argument order in tport allocation error message
2025-09-25net: gso: restore ids of outer ip headers correctlyRichard Gobert
Currently, NETIF_F_TSO_MANGLEID indicates that the inner-most ID can be mangled. Outer IDs can always be mangled. Make GSO preserve outer IDs by default, with NETIF_F_TSO_MANGLEID allowing both inner and outer IDs to be mangled. This commit also modifies a few drivers that use SKB_GSO_FIXEDID directly. Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Edward Cree <ecree.xilinx@gmail.com> # for sfc Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250923085908.4687-4-richardbgobert@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-25sched: Fix some typos in include/linux/preempt.hMenglong Dong
There are some typos in the comments of migrate in include/linux/preempt.h: elegible -> eligible it's -> its migirate_disable -> migrate_disable abritrary -> arbitrary Just fix them. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-09-25sched: Make migrate_{en,dis}able() inlineMenglong Dong
For now, migrate_enable and migrate_disable are global, which makes them become hotspots in some case. Take BPF for example, the function calling to migrate_enable and migrate_disable in BPF trampoline can introduce significant overhead, and following is the 'perf top' of FENTRY's benchmark (./tools/testing/selftests/bpf/bench trig-fentry): 54.63% bpf_prog_2dcccf652aac1793_bench_trigger_fentry [k] bpf_prog_2dcccf652aac1793_bench_trigger_fentry 10.43% [kernel] [k] migrate_enable 10.07% bpf_trampoline_6442517037 [k] bpf_trampoline_6442517037 8.06% [kernel] [k] __bpf_prog_exit_recur 4.11% libc.so.6 [.] syscall 2.15% [kernel] [k] entry_SYSCALL_64 1.48% [kernel] [k] memchr_inv 1.32% [kernel] [k] fput 1.16% [kernel] [k] _copy_to_user 0.73% [kernel] [k] bpf_prog_test_run_raw_tp So in this commit, we make migrate_enable/migrate_disable inline to obtain better performance. The struct rq is defined internally in kernel/sched/sched.h, and the field "nr_pinned" is accessed in migrate_enable/migrate_disable, which makes it hard to make them inline. Alexei Starovoitov suggests to generate the offset of "nr_pinned" in [1], so we can define the migrate_enable/migrate_disable in include/linux/sched.h and access "this_rq()->nr_pinned" with "(void *)this_rq() + RQ_nr_pinned". The offset of "nr_pinned" is generated in include/generated/rq-offsets.h by kernel/sched/rq-offsets.c. Generally speaking, we move the definition of migrate_enable and migrate_disable to include/linux/sched.h from kernel/sched/core.c. The calling to __set_cpus_allowed_ptr() is leaved in ___migrate_enable(). The "struct rq" is not available in include/linux/sched.h, so we can't access the "runqueues" with this_cpu_ptr(), as the compilation will fail in this_cpu_ptr() -> raw_cpu_ptr() -> __verify_pcpu_ptr(): typeof((ptr) + 0) So we introduce the this_rq_raw() and access the runqueues with arch_raw_cpu_ptr/PERCPU_PTR directly. The variable "runqueues" is not visible in the kernel modules, and export it is not a good idea. As Peter Zijlstra advised in [2], we define and export migrate_enable/migrate_disable in kernel/sched/core.c too, and use them for the modules. Before this patch, the performance of BPF FENTRY is: fentry : 113.030 ± 0.149M/s fentry : 112.501 ± 0.187M/s fentry : 112.828 ± 0.267M/s fentry : 115.287 ± 0.241M/s After this patch, the performance of BPF FENTRY increases to: fentry : 143.644 ± 0.670M/s fentry : 149.764 ± 0.362M/s fentry : 149.642 ± 0.156M/s fentry : 145.263 ± 0.221M/s Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/bpf/CAADnVQ+5sEDKHdsJY5ZsfGDO_1SEhhQWHrt2SMBG5SYyQ+jt7w@mail.gmail.com/ [1] Link: https://lore.kernel.org/all/20250819123214.GH4067720@noisy.programming.kicks-ass.net/ [2]
2025-09-25rcu: Replace preempt.h with sched.h in include/linux/rcupdate.hMenglong Dong
In the next commit, we will move the definition of migrate_enable() and migrate_disable() to linux/sched.h. However, migrate_enable/migrate_disable will be used in commit 1b93c03fb319 ("rcu: add rcu_read_lock_dont_migrate()") in bpf-next tree. In order to fix potential compiling error, replace linux/preempt.h with linux/sched.h in include/linux/rcupdate.h. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-09-25sched/deadline: Fix dl_server behaviourPeter Zijlstra
John reported undesirable behaviour with the dl_server since commit: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling"). When starving fair tasks on purpose (starting spinning FIFO tasks), his fair workload, which often goes (briefly) idle, would delay fair invocations for a second, running one invocation per second was both unexpected and terribly slow. The reason this happens is that when dl_se->server_pick_task() returns NULL, indicating no runnable tasks, it would yield, pushing any later jobs out a whole period (1 second). Instead simply stop the server. This should restore behaviour in that a later wakeup (which restarts the server) will be able to continue running (subject to the CBS wakeup rules). Notably, this does not re-introduce the behaviour cccb45d7c4295 set out to solve, any start/stop cycle is naturally throttled by the timer period (no active cancel). Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling") Reported-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: John Stultz <jstultz@google.com>
2025-09-25sched/deadline: Fix dl_server getting stuckPeter Zijlstra
John found it was easy to hit lockup warnings when running locktorture on a 2 CPU VM, which he bisected down to: commit cccb45d7c429 ("sched/deadline: Less agressive dl_server handling"). While debugging it seems there is a chance where we end up with the dl_server dequeued, with dl_se->dl_server_active. This causes dl_server_start() to return without enqueueing the dl_server, thus it fails to run when RT tasks starve the cpu. When this happens, dl_server_timer() catches the '!dl_se->server_has_tasks(dl_se)' case, which then calls replenish_dl_entity() and dl_server_stopped() and finally return HRTIMER_NO_RESTART. This ends in no new timer and also no enqueue, leaving the dl_server 'dead', allowing starvation. What should have happened is for the bandwidth timer to start the zero-laxity timer, which in turn would enqueue the dl_server and cause dl_se->server_pick_task() to be called -- which will stop the dl_server if no fair tasks are observed for a whole period. IOW, it is totally irrelevant if there are fair tasks at the moment of bandwidth refresh. This removes all dl_se->server_has_tasks() users, so remove the whole thing. Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling") Reported-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: John Stultz <jstultz@google.com>
2025-09-25ns: move ns type into struct ns_commonChristian Brauner
It's misplaced in struct proc_ns_operations and ns->ops might be NULL if the namespace is compiled out but we still want to know the type of the namespace for the initial namespace struct. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-25nstree: make struct ns_tree privateChristian Brauner
Don't expose it directly. There's no need to do that. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-25accel/habanalabs: add HL_GET_P_STATE passthrough typeAriel Aviad
Add a new passthrough type HL_GET_P_STATE to the cpucp generic ioctl to allow userspace to read the device performance state via firmware. Signed-off-by: Ariel Aviad <ariel.aviad@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
2025-09-25accel/habanalabs: add generic message type to get error countersVitaly Margolin
Add a new CPUCP generic message type to retrieve HBM, SRAM and critical error counters from the device. Signed-off-by: Vitaly Margolin <vitaly.margolin@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
2025-09-25once: fix race by moving DO_ONCE to separate sectionQi Xi
The commit c2c60ea37e5b ("once: use __section(".data.once")") moved DO_ONCE's ___done variable to .data.once section, which conflicts with DO_ONCE_LITE() that also uses the same section. This creates a race condition when clear_warn_once is used: Thread 1 (DO_ONCE) Thread 2 (DO_ONCE) __do_once_start read ___done (false) acquire once_lock execute func __do_once_done write ___done (true) __do_once_start release once_lock // Thread 3 clear_warn_once reset ___done read ___done (false) acquire once_lock execute func schedule once_work __do_once_done once_deferred: OK write ___done (true) static_branch_disable release once_lock schedule once_work once_deferred: BUG_ON(!static_key_enabled) DO_ONCE_LITE() in once_lite.h is used by WARN_ON_ONCE() and other warning macros. Keep its ___done flag in the .data..once section and allow resetting by clear_warn_once, as originally intended. In contrast, DO_ONCE() is used for functions like get_random_once() and relies on its ___done flag for internal synchronization. We should not reset DO_ONCE() by clear_warn_once. Fix it by isolating DO_ONCE's ___done into a separate .data..do_once section, shielding it from clear_warn_once. Fixes: c2c60ea37e5b ("once: use __section(".data.once")") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Qi Xi <xiqi2@huawei.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-09-24byteorder: Add memcpy_to_le32() and memcpy_from_le32()Anup Patel
Add common memcpy APIs for copying u32 array to/from __le32 array. Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Acked-by: Jassi Brar <jassisinghbrar@gmail.com> Link: https://lore.kernel.org/r/20250818040920.272664-7-apatel@ventanamicro.com Signed-off-by: Paul Walmsley <pjw@kernel.org>