summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2025-09-13printk/nbcon: use panic_on_this_cpu() helperJinchao Wang
nbcon_context_try_acquire() compared panic_cpu directly with smp_processor_id(). This open-coded check is now provided by panic_on_this_cpu(). Switch to panic_on_this_cpu() to simplify the code and improve readability. Link: https://lkml.kernel.org/r/20250825022947.1596226-7-wangjinchao600@gmail.com Signed-off-by: Jinchao Wang <wangjinchao600@gmail.com> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Young <dyoung@redhat.com> Cc: Doug Anderson <dianders@chromium.org> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: John Ogness <john.ogness@linutronix.de> Cc: Kees Cook <kees@kernel.org> Cc: Li Huafei <lihuafei1@huawei.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Luo Gengkun <luogengkun@huaweicloud.com> Cc: Max Kellermann <max.kellermann@ionos.com> Cc: Nam Cao <namcao@linutronix.de> Cc: oushixiong <oushixiong@kylinos.cn> Cc: Petr Mladek <pmladek@suse.com> Cc: Qianqiang Liu <qianqiang.liu@163.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Sohil Mehta <sohil.mehta@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Zimemrmann <tzimmermann@suse.de> Cc: Thorsten Blum <thorsten.blum@linux.dev> Cc: Ville Syrjala <ville.syrjala@linux.intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: Yunhui Cui <cuiyunhui@bytedance.com> Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13panic: use panic_try_start() in vpanic()Jinchao Wang
vpanic() had open-coded logic to claim panic_cpu with atomic_try_cmpxchg. This is already handled by panic_try_start(). Switch to panic_try_start() and use panic_on_other_cpu() for the fallback path. This removes duplicate code and makes panic handling consistent across functions. Link: https://lkml.kernel.org/r/20250825022947.1596226-6-wangjinchao600@gmail.com Signed-off-by: Jinchao Wang <wangjinchao600@gmail.com> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Young <dyoung@redhat.com> Cc: Doug Anderson <dianders@chromium.org> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: John Ogness <john.ogness@linutronix.de> Cc: Kees Cook <kees@kernel.org> Cc: Li Huafei <lihuafei1@huawei.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Luo Gengkun <luogengkun@huaweicloud.com> Cc: Max Kellermann <max.kellermann@ionos.com> Cc: Nam Cao <namcao@linutronix.de> Cc: oushixiong <oushixiong@kylinos.cn> Cc: Petr Mladek <pmladek@suse.com> Cc: Qianqiang Liu <qianqiang.liu@163.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Sohil Mehta <sohil.mehta@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Zimemrmann <tzimmermann@suse.de> Cc: Thorsten Blum <thorsten.blum@linux.dev> Cc: Ville Syrjala <ville.syrjala@linux.intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: Yunhui Cui <cuiyunhui@bytedance.com> Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13panic: use panic_try_start() in nmi_panic()Jinchao Wang
nmi_panic() duplicated the logic to claim panic_cpu with atomic_try_cmpxchg. This is already wrapped in panic_try_start(). Replace the open-coded logic with panic_try_start(), and use panic_on_other_cpu() for the fallback path. This removes duplication and keeps panic handling code consistent. Link: https://lkml.kernel.org/r/20250825022947.1596226-5-wangjinchao600@gmail.com Signed-off-by: Jinchao Wang <wangjinchao600@gmail.com> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Young <dyoung@redhat.com> Cc: Doug Anderson <dianders@chromium.org> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: John Ogness <john.ogness@linutronix.de> Cc: Kees Cook <kees@kernel.org> Cc: Li Huafei <lihuafei1@huawei.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Luo Gengkun <luogengkun@huaweicloud.com> Cc: Max Kellermann <max.kellermann@ionos.com> Cc: Nam Cao <namcao@linutronix.de> Cc: oushixiong <oushixiong@kylinos.cn> Cc: Petr Mladek <pmladek@suse.com> Cc: Qianqiang Liu <qianqiang.liu@163.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Sohil Mehta <sohil.mehta@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Zimemrmann <tzimmermann@suse.de> Cc: Thorsten Blum <thorsten.blum@linux.dev> Cc: Ville Syrjala <ville.syrjala@linux.intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: Yunhui Cui <cuiyunhui@bytedance.com> Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13crash_core: use panic_try_start() in crash_kexec()Jinchao Wang
crash_kexec() had its own code to exclude parallel execution by setting panic_cpu. This is already handled by panic_try_start(). Switch to panic_try_start() to remove the duplication and keep the logic consistent. Link: https://lkml.kernel.org/r/20250825022947.1596226-4-wangjinchao600@gmail.com Signed-off-by: Jinchao Wang <wangjinchao600@gmail.com> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Young <dyoung@redhat.com> Cc: Doug Anderson <dianders@chromium.org> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: John Ogness <john.ogness@linutronix.de> Cc: Kees Cook <kees@kernel.org> Cc: Li Huafei <lihuafei1@huawei.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Luo Gengkun <luogengkun@huaweicloud.com> Cc: Max Kellermann <max.kellermann@ionos.com> Cc: Nam Cao <namcao@linutronix.de> Cc: oushixiong <oushixiong@kylinos.cn> Cc: Petr Mladek <pmladek@suse.com> Cc: Qianqiang Liu <qianqiang.liu@163.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Sohil Mehta <sohil.mehta@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Zimemrmann <tzimmermann@suse.de> Cc: Thorsten Blum <thorsten.blum@linux.dev> Cc: Ville Syrjala <ville.syrjala@linux.intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: Yunhui Cui <cuiyunhui@bytedance.com> Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13panic: introduce helper functions for panic stateJinchao Wang
Patch series "panic: introduce panic status function family", v2. This series introduces a family of helper functions to manage panic state and updates existing code to use them. Before this series, panic state helpers were scattered and inconsistent. For example, panic_in_progress() was defined in printk/printk.c, not in panic.c or panic.h. As a result, developers had to look in unexpected places to understand or re-use panic state logic. Other checks were open- coded, duplicating logic across panic, crash, and watchdog paths. The new helpers centralize the functionality in panic.c/panic.h: - panic_try_start() - panic_reset() - panic_in_progress() - panic_on_this_cpu() - panic_on_other_cpu() Patches 1–8 add the helpers and convert panic/crash and printk/nbcon code to use them. Patch 9 fixes a bug in the watchdog subsystem by skipping checks when a panic is in progress, avoiding interference with the panic CPU. Together, this makes panic state handling simpler, more discoverable, and more robust. This patch (of 9): This patch introduces four new helper functions to abstract the management of the panic_cpu variable. These functions will be used in subsequent patches to refactor existing code. The direct use of panic_cpu can be error-prone and ambiguous, as it requires manual checks to determine which CPU is handling the panic. The new helpers clarify intent: panic_try_start(): Atomically sets the current CPU as the panicking CPU. panic_reset(): Reset panic_cpu to PANIC_CPU_INVALID. panic_in_progress(): Checks if a panic has been triggered. panic_on_this_cpu(): Returns true if the current CPU is the panic originator. panic_on_other_cpu(): Returns true if a panic is on another CPU. This change lays the groundwork for improved code readability and robustness in the panic handling subsystem. Link: https://lkml.kernel.org/r/20250825022947.1596226-1-wangjinchao600@gmail.com Link: https://lkml.kernel.org/r/20250825022947.1596226-2-wangjinchao600@gmail.com Signed-off-by: Jinchao Wang <wangjinchao600@gmail.com> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Young <dyoung@redhat.com> Cc: Doug Anderson <dianders@chromium.org> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: John Ogness <john.ogness@linutronix.de> Cc: Kees Cook <kees@kernel.org> Cc: Li Huafei <lihuafei1@huawei.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Luo Gengkun <luogengkun@huaweicloud.com> Cc: Max Kellermann <max.kellermann@ionos.com> Cc: Nam Cao <namcao@linutronix.de> Cc: oushixiong <oushixiong@kylinos.cn> Cc: Petr Mladek <pmladek@suse.com> Cc: Qianqiang Liu <qianqiang.liu@163.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Sohil Mehta <sohil.mehta@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Zimemrmann <tzimmermann@suse.de> Cc: Thorsten Blum <thorsten.blum@linux.dev> Cc: Ville Syrjala <ville.syrjala@linux.intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: Yunhui Cui <cuiyunhui@bytedance.com> Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com>b Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13panic: clean up message about deprecated 'panic_print' parameterPetr Mladek
Remove duplication of the message about deprecated 'panic_print' parameter. Also make the wording more direct. Make it clear that the new parameters already exist and should be used instead. Link: https://lkml.kernel.org/r/20250825025701.81921-5-feng.tang@linux.alibaba.com Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Tested-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Feng Tang <feng.tang@linux.alibaba.com> Cc: Askar Safin <safinaskar@zohomail.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13panic: add note that 'panic_print' parameter is deprecatedFeng Tang
Just like for 'panic_print's systcl interface, add similar note for setup of kernel cmdline parameter and parameter under /sys/module/kernel/. Also add __core_param_cb() macro, which enables to add special get/set operation for a kernel parameter. Link: https://lkml.kernel.org/r/20250825025701.81921-4-feng.tang@linux.alibaba.com Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> Suggested-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Askar Safin <safinaskar@zohomail.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <lance.yang@linux.dev> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13kexec_core: remove redundant 0 value initializationLiao Yuanhong
The kimage struct is already zeroed by kzalloc(). It's redundant to initialize image->head to 0. Link: https://lkml.kernel.org/r/20250825123307.306634-1-liaoyuanhong@vivo.com Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13fork: kill the pointless lower_32_bits() in create_io_thread(), ↵Oleg Nesterov
kernel_thread(), and user_mode_thread() Unlike sys_clone(), these helpers have only in kernel users which should pass the correct "flags" argument. lower_32_bits(flags) just adds the unnecessary confusion and doesn't allow to use the CLONE_ flags which don't fit into 32 bits. create_io_thread() looks especially confusing because: - "flags" is a compile-time constant, so lower_32_bits() simply has no effect - .exit_signal = (lower_32_bits(flags) & CSIGNAL) is harmless but doesn't look right, copy_process(CLONE_THREAD) will ignore this argument anyway. None of these helpers actually need CLONE_UNTRACED or "& ~CSIGNAL", but their presence does not add any confusion and improves code clarity. Link: https://lkml.kernel.org/r/20250820163946.GA18549@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Cc: Christian Brauner <brauner@kernel.org> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13fork: remove #ifdef CONFIG_LOCKDEP in copy_process()Tio Zhang
lockdep_init_task() is defined as an empty when CONFIG_LOCKDEP is not set. So the #ifdef here is redundant, remove it. Link: https://lkml.kernel.org/r/20250820101826.GA2484@didi-ThinkCentre-M930t-N000 Signed-off-by: Tio Zhang <tiozhang@didiglobal.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13watchdog/softlockup: fix incorrect CPU utilization output during softlockupZhenguoYao
Since we use 16-bit precision, the raw data will undergo integer division, which may sometimes result in data loss. This can lead to slightly inaccurate CPU utilization calculations. Under normal circumstances, this isn't an issue. However, when CPU utilization reaches 100%, the calculated result might exceed 100%. For example, with raw data like the following: sample_period 400000134 new_stat 83648414036 old_stat 83247417494 sample_period=400000134/2^24=23 new_stat=83648414036/2^24=4985 old_stat=83247417494/2^24=4961 util=105% Below log will output: CPU#3 Utilization every 0s during lockup: #1: 0% system, 0% softirq, 105% hardirq, 0% idle #2: 0% system, 0% softirq, 105% hardirq, 0% idle #3: 0% system, 0% softirq, 100% hardirq, 0% idle #4: 0% system, 0% softirq, 105% hardirq, 0% idle #5: 0% system, 0% softirq, 105% hardirq, 0% idle To avoid confusion, we enforce a 100% display cap when calculations exceed this threshold. We also round to the nearest multiple of 16.8 milliseconds to improve the accuracy. [yaozhenguo1@gmail.com: make get_16bit_precision() more accurate, fix comment layout] Link: https://lkml.kernel.org/r/20250818081438.40540-1-yaozhenguo@jd.com Link: https://lkml.kernel.org/r/20250812082510.32291-1-yaozhenguo@jd.com Signed-off-by: ZhenguoYao <yaozhenguo1@gmail.com> Cc: Bitao Hu <yaoma@linux.alibaba.com> Cc: Li Huafei <lihuafei1@huawei.com> Cc: Max Kellermann <max.kellermann@ionos.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13watchdog/softlockup: fix wrong output when watchdog_thresh < 3ZhenguoYao
When watchdog_thresh is below 3, sample_period will be less than 1 second. So the following output will print when softlockup: CPU#3 Utilization every 0s during lockup Fix this by changing time unit from seconds to milliseconds. Link: https://lkml.kernel.org/r/20250812074132.27810-1-yaozhenguo@jd.com Signed-off-by: ZhenguoYao <yaozhenguo1@gmail.com> Cc: Bitao Hu <yaoma@linux.alibaba.com> Cc: Li Huafei <lihuafei1@huawei.com> Cc: Max Kellermann <max.kellermann@ionos.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13kcov: use write memory barrier after memcpy() in kcov_move_area()Soham Bagchi
KCOV Remote uses two separate memory buffers, one private to the kernel space (kcov_remote_areas) and the second one shared between user and kernel space (kcov->area). After every pair of kcov_remote_start() and kcov_remote_stop(), the coverage data collected in the kcov_remote_areas is copied to kcov->area so the user can read the collected coverage data. This memcpy() is located in kcov_move_area(). The load/store pattern on the kernel-side [1] is: ``` /* dst_area === kcov->area, dst_area[0] is where the count is stored */ dst_len = READ_ONCE(*(unsigned long *)dst_area); ... memcpy(dst_entries, src_entries, ...); ... WRITE_ONCE(*(unsigned long *)dst_area, dst_len + entries_moved); ``` And for the user [2]: ``` /* cover is equivalent to kcov->area */ n = __atomic_load_n(&cover[0], __ATOMIC_RELAXED); ``` Without a write-memory barrier, the atomic load for the user can potentially read fresh values of the count stored at cover[0], but continue to read stale coverage data from the buffer itself. Hence, we recommend adding a write-memory barrier between the memcpy() and the WRITE_ONCE() in kcov_move_area(). Link: https://lkml.kernel.org/r/20250728184318.1839137-1-soham.bagchi@utah.edu Link: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/kernel/kcov.c?h=master#n978 [1] Link: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/dev-tools/kcov.rst#n364 [2] Signed-off-by: Soham Bagchi <soham.bagchi@utah.edu> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13x86/kexec: carry forward the boot DTB on kexecBrian Mak
Currently, the kexec_file_load syscall on x86 does not support passing a device tree blob to the new kernel. Some embedded x86 systems use device trees. On these systems, failing to pass a device tree to the new kernel causes a boot failure. To add support for this, we copy the behavior of ARM64 and PowerPC and copy the current boot's device tree blob for use in the new kernel. We do this on x86 by passing the device tree blob as a setup_data entry in accordance with the x86 boot protocol. This behavior is gated behind the KEXEC_FILE_FORCE_DTB flag. Link: https://lkml.kernel.org/r/20250805211527.122367-3-makb@juniper.net Signed-off-by: Brian Mak <makb@juniper.net> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Young <dyoung@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13hung_task: dump blocker task if it is not hungMasami Hiramatsu (Google)
Dump the lock blocker task if it is not hung because if the blocker task is also hung, it should be dumped by the detector. This will de-duplicate the same stackdumps if the blocker task is also blocked by another task (and hung). Link: https://lkml.kernel.org/r/175391351423.688839.11917911323784986774.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Suggested-by: Sergey Senozhatsky <senozhatsky@chromium.org> Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Lance Yang <lance.yang@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13kho: make sure kho_scratch argument is fully consumedPratyush Yadav
When specifying fixed sized scratch areas, the parser only parses the three scratch sizes and ignores the rest of the argument. This means the argument can have any bogus trailing characters. For example, "kho_scratch=256M,512M,512Mfoobar" results in successful parsing: [ 0.000000] KHO: scratch areas: lowmem: 256MiB global: 512MiB pernode: 512MiB It is generally a good idea to parse arguments as strictly as possible. In addition, if bogus trailing characters are allowed in the kho_scratch argument, it is possible that some people might end up using them and later extensions to the argument format will cause unexpected breakages. Make sure the argument is fully consumed after all three scratch sizes are parsed. With this change, the bogus argument "kho_scratch=256M,512M,512Mfoobar" results in: [ 0.000000] Malformed early option 'kho_scratch' Link: https://lkml.kernel.org/r/20250826123817.64681-1-pratyush@kernel.org Signed-off-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Pratyush Yadav <pratyush@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGEDavid Hildenbrand
Patch series "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised", v5. This will allow individual processes to opt-out of THP = "always" into THP = "madvise", without affecting other workloads on the system. This has been extensively discussed on the mailing list and has been summarized very well by David in the first patch which also includes the links to alternatives, please refer to the first patch commit message for the motivation for this series. Patch 1 adds the PR_THP_DISABLE_EXCEPT_ADVISED flag to implement this, along with the MMF changes. Patch 2 is a cleanup patch for tva_flags that will allow the forced collapse case to be transmitted to vma_thp_disabled (which is done in patch 3). Patch 4 adds documentation for PR_SET_THP_DISABLE/PR_GET_THP_DISABLE. Patches 6-7 implement the selftests for PR_SET_THP_DISABLE for completely disabling THPs (old behaviour) and only enabling it at advise (PR_THP_DISABLE_EXCEPT_ADVISED). This patch (of 7): People want to make use of more THPs, for example, moving from the "never" system policy to "madvise", or from "madvise" to "always". While this is great news for every THP desperately waiting to get allocated out there, apparently there are some workloads that require a bit of care during that transition: individual processes may need to opt-out from this behavior for various reasons, and this should be permitted without needing to make all other workloads on the system similarly opt-out. The following scenarios are imaginable: (1) Switch from "none" system policy to "madvise"/"always", but keep THPs disabled for selected workloads. (2) Stay at "none" system policy, but enable THPs for selected workloads, making only these workloads use the "madvise" or "always" policy. (3) Switch from "madvise" system policy to "always", but keep the "madvise" policy for selected workloads: allocate THPs only when advised. (4) Stay at "madvise" system policy, but enable THPs even when not advised for selected workloads -- "always" policy. Once can emulate (2) through (1), by setting the system policy to "madvise"/"always" while disabling THPs for all processes that don't want THPs. It requires configuring all workloads, but that is a user-space problem to sort out. (4) can be emulated through (3) in a similar way. Back when (1) was relevant in the past, as people started enabling THPs, we added PR_SET_THP_DISABLE, so relevant workloads that were not ready yet (i.e., used by Redis) were able to just disable THPs completely. Redis still implements the option to use this interface to disable THPs completely. With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a workload -- a process, including fork+exec'ed process hierarchy. That essentially made us support (1): simply disable THPs for all workloads that are not ready for THPs yet, while still enabling THPs system-wide. The quest for handling (3) and (4) started, but current approaches (completely new prctl, options to set other policies per process, alternatives to prctl -- mctrl, cgroup handling) don't look particularly promising. Likely, the future will use bpf or something similar to implement better policies, in particular to also make better decisions about THP sizes to use, but this will certainly take a while as that work just started. Long story short: a simple enable/disable is not really suitable for the future, so we're not willing to add completely new toggles. While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs completely for these processes, this is a step backwards, because these processes can no longer allocate THPs in regions where THPs were explicitly advised: regions flagged as VM_HUGEPAGE. Apparently, that imposes a problem for relevant workloads, because "not THPs" is certainly worse than "THPs only when advised". Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless not explicitly advised by the app through MAD_HUGEPAGE"? *maybe*, but this would change the documented semantics quite a bit, and the versatility to use it for debugging purposes, so I am not 100% sure that is what we want -- although it would certainly be much easier. So instead, as an easy way forward for (3) and (4), add an option to make PR_SET_THP_DISABLE disable *less* THPs for a process. In essence, this patch: (A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3 of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0). prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED). (B) Makes prctl(PR_GET_THP_DISABLE) return 3 if PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling. Previously, it would return 1 if THPs were disabled completely. Now it returns the set flags as well: 3 if PR_THP_DISABLE_EXCEPT_ADVISED was set. (C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express the semantics clearly. Fortunately, there are only two instances outside of prctl() code. (D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs with VM_HUGEPAGE" -- essentially "thp=madvise" behavior Fortunately, we only have to extend vma_thp_disabled(). (E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are disabled completely Only indicating that THPs are disabled when they are really disabled completely, not only partially. For now, we don't add another interface to obtained whether THPs are disabled partially (PR_THP_DISABLE_EXCEPT_ADVISED was set). If ever required, we could add a new entry. The documented semantics in the man page for PR_SET_THP_DISABLE "is inherited by a child created via fork(2) and is preserved across execve(2)" is maintained. This behavior, for example, allows for disabling THPs for a workload through the launching process (e.g., systemd where we fork() a helper process to then exec()). For now, MADV_COLLAPSE will *fail* in regions without VM_HUGEPAGE and VM_NOHUGEPAGE. As MADV_COLLAPSE is a clear advise that user space thinks a THP is a good idea, we'll enable that separately next (requiring a bit of cleanup first). There is currently not way to prevent that a process will not issue PR_SET_THP_DISABLE itself to re-enable THP. There are not really known users for re-enabling it, and it's against the purpose of the original interface. So if ever required, we could investigate just forbidding to re-enable them, or make this somehow configurable. Link: https://lkml.kernel.org/r/20250815135549.130506-1-usamaarif642@gmail.com Link: https://lkml.kernel.org/r/20250815135549.130506-2-usamaarif642@gmail.com Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Usama Arif <usamaarif642@gmail.com> Tested-by: Usama Arif <usamaarif642@gmail.com> Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Usama Arif <usamaarif642@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yafang <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13mm: convert remaining users to mm_flags_*() accessorsLorenzo Stoakes
As part of the effort to move to mm->flags becoming a bitmap field, convert existing users to making use of the mm_flags_*() accessors which will, when the conversion is complete, be the only means of accessing mm_struct flags. No functional change intended. Link: https://lkml.kernel.org/r/cc67a56f9a8746a8ec7d9791853dc892c1c33e0b.1755012943.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Ben Segall <bsegall@google.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dev Jain <dev.jain@arm.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Kees Cook <kees@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Mariano Pache <npache@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Namhyung kim <namhyung@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13mm: update fork mm->flags initialisation to use bitmapLorenzo Stoakes
We now need to account for flag initialisation on fork. We retain the existing logic as much as we can, but dub the existing flag mask legacy. These flags are therefore required to fit in the first 32-bits of the flags field. However, further flag propagation upon fork can be implemented in mm_init() on a per-flag basis. We ensure we clear the entire bitmap prior to setting it, and use __mm_flags_get_word() and __mm_flags_set_word() to manipulate these legacy fields efficiently. Link: https://lkml.kernel.org/r/9fb8954a7a0f0184f012a8e66f8565bcbab014ba.1755012943.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Ben Segall <bsegall@google.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dev Jain <dev.jain@arm.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Kees Cook <kees@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Mariano Pache <npache@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Namhyung kim <namhyung@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13mm: convert uprobes to mm_flags_*() accessorsLorenzo Stoakes
As part of the effort to move to mm->flags becoming a bitmap field, convert existing users to making use of the mm_flags_*() accessors which will, when the conversion is complete, be the only means of accessing mm_struct flags. No functional change intended. Link: https://lkml.kernel.org/r/1d4fe5963904cc0c707da1f53fbfe6471d3eff10.1755012943.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Ben Segall <bsegall@google.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dev Jain <dev.jain@arm.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Kees Cook <kees@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Mariano Pache <npache@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Namhyung kim <namhyung@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13mm: convert prctl to mm_flags_*() accessorsLorenzo Stoakes
As part of the effort to move to mm->flags becoming a bitmap field, convert existing users to making use of the mm_flags_*() accessors which will, when the conversion is complete, be the only means of accessing mm_struct flags. No functional change intended. Link: https://lkml.kernel.org/r/b64f07b94822d02beb88d0d21a6a85f9ee45fc69.1755012943.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Ben Segall <bsegall@google.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dev Jain <dev.jain@arm.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Kees Cook <kees@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Mariano Pache <npache@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Namhyung kim <namhyung@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13kho: allow scratch areas with zero sizeMike Rapoport (Microsoft)
Patch series "kho: fixes and cleanups", v3. These are small KHO and KHO test fixes and cleanups. This patch (of 3): Parsing of kho_scratch parameter treats zero size as an invalid value, although it should be fine for user to request zero sized scratch area for some types if scratch memory, when for example there is no need to create scratch area in the low memory. Treat zero as a valid value for a scratch area size but reject kho_scratch parameter that defines no scratch memory at all. Link: https://lkml.kernel.org/r/20250811082510.4154080-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20250811082510.4154080-2-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13mm: replace (20 - PAGE_SHIFT) with common macros for pages<->MB conversionYe Liu
Replace repeated (20 - PAGE_SHIFT) calculations with standard macros: - MB_TO_PAGES(mb) converts MB to page count - PAGES_TO_MB(pages) converts pages to MB No functional change. [akpm@linux-foundation.org: remove arc's private PAGES_TO_MB, remove its unused PAGES_TO_KB] [akpm@linux-foundation.org: don't include mm.h due to include file ordering mess] Link: https://lkml.kernel.org/r/20250718024134.1304745-1-ye.liu@linux.dev Signed-off-by: Ye Liu <liuye@kylinos.cn> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Ben Segall <bsegall@google.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lai jiangshan <jiangshanlai@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13mm: memory-tiering: fix PGPROMOTE_CANDIDATE countingRuan Shiyang
Goto-san reported confusing pgpromote statistics where the pgpromote_success count significantly exceeded pgpromote_candidate. On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB): # Enable demotion only echo 1 > /sys/kernel/mm/numa/demotion_enabled numactl -m 0-1 memhog -r200 3500M >/dev/null & pid=$! sleep 2 numactl memhog -r100 2500M >/dev/null & sleep 10 kill -9 $pid # terminate the 1st memhog # Enable promotion echo 2 > /proc/sys/kernel/numa_balancing After a few seconds, we observeed `pgpromote_candidate < pgpromote_success` $ grep -e pgpromote /proc/vmstat pgpromote_success 2579 pgpromote_candidate 0 In this scenario, after terminating the first memhog, the conditions for pgdat_free_space_enough() are quickly met, and triggers promotion. However, these migrated pages are only counted for in PGPROMOTE_SUCCESS, not in PGPROMOTE_CANDIDATE. To solve these confusing statistics, introduce PGPROMOTE_CANDIDATE_NRL to count the missed promotion pages. And also, not counting these pages into PGPROMOTE_CANDIDATE is to avoid changing the existing algorithm or performance of the promotion rate limit. Link: https://lkml.kernel.org/r/20250901090122.124262-1-ruansy.fnst@fujitsu.com Link: https://lkml.kernel.org/r/20250729035101.1601407-1-ruansy.fnst@fujitsu.com Fixes: c6833e10008f ("memory tiering: rate limit NUMA migration throughput") Co-developed-by: Li Zhijian <lizhijian@fujitsu.com> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Signed-off-by: Ruan Shiyang <ruansy.fnst@fujitsu.com> Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com> Suggested-by: Huang Ying <ying.huang@linux.alibaba.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13rseq: Protect event mask against membarrier IPIThomas Gleixner
rseq_need_restart() reads and clears task::rseq_event_mask with preemption disabled to guard against the scheduler. But membarrier() uses an IPI and sets the PREEMPT bit in the event mask from the IPI, which leaves that RMW operation unprotected. Use guard(irq) if CONFIG_MEMBARRIER is enabled to fix that. Fixes: 2a36ab717e8f ("rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Boqun Feng <boqun.feng@gmail.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: stable@vger.kernel.org
2025-09-13padata: WQ_PERCPU added to alloc_workqueue usersMarco Crivellari
Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt-in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they’re needed and reducing noise when CPUs are isolated. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they’re needed and reducing noise when CPUs are isolated. This patch adds a new WQ_PERCPU flag to explicitly request the use of the per-CPU behavior. Both flags coexist for one release cycle to allow callers to transition their calls. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND must now use WQ_PERCPU. All existing users have been updated accordingly. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-09-13padata: replace use of system_unbound_wq with system_dfl_wqMarco Crivellari
Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. system_unbound_wq should be the default workqueue so as not to enforce locality constraints for random work whenever it's not required. Adding system_dfl_wq to encourage its use when unbound work should be used. queue_work() / queue_delayed_work() / mod_delayed_work() will now use the new unbound wq: whether the user still use the old wq a warn will be printed along with a wq redirect to the new one. The old system_unbound_wq will be kept for a few release cycles. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-09-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.17-rc6). Conflicts: net/netfilter/nft_set_pipapo.c net/netfilter/nft_set_pipapo_avx2.c c4eaca2e1052 ("netfilter: nft_set_pipapo: don't check genbit from packetpath lookups") 84c1da7b38d9 ("netfilter: nft_set_pipapo: use avx2 algorithm for insertions too") Only trivial adjacent changes (in a doc and a Makefile). Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-12dma-mapping: export new dma_*map_phys() interfaceLeon Romanovsky
Introduce new DMA mapping functions dma_map_phys() and dma_unmap_phys() that operate directly on physical addresses instead of page+offset parameters. This provides a more efficient interface for drivers that already have physical addresses available. The new functions are implemented as the primary mapping layer, with the existing dma_map_page_attrs()/dma_map_resource() and dma_unmap_page_attrs()/dma_unmap_resource() functions converted to simple wrappers around the phys-based implementations. In case dma_map_page_attrs(), the struct page is converted to physical address with help of page_to_phys() function and dma_map_resource() provides physical address as is together with addition of DMA_ATTR_MMIO attribute. The old page-based API is preserved in mapping.c to ensure that existing code won't be affected by changing EXPORT_SYMBOL to EXPORT_SYMBOL_GPL variant for dma_*map_phys(). Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/54cc52af91777906bbe4a386113437ba0bcfba9c.1757423202.git.leonro@nvidia.com
2025-09-12dma-mapping: implement DMA_ATTR_MMIO for dma_(un)map_page_attrs()Leon Romanovsky
Make dma_map_page_attrs() and dma_map_page_attrs() respect DMA_ATTR_MMIO. DMA_ATR_MMIO makes the functions behave the same as dma_(un)map_resource(): - No swiotlb is possible - Legacy dma_ops arches use ops->map_resource() - No kmsan - No arch_dma_map_phys_direct() The prior patches have made the internal functions called here support DMA_ATTR_MMIO. This is also preparation for turning dma_map_resource() into an inline calling dma_map_phys(DMA_ATTR_MMIO) to consolidate the flows. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/3660e2c78ea409d6c483a215858fb3af52cd0ed3.1757423202.git.leonro@nvidia.com
2025-09-12kmsan: convert kmsan_handle_dma to use physical addressesLeon Romanovsky
Convert the KMSAN DMA handling function from page-based to physical address-based interface. The refactoring renames kmsan_handle_dma() parameters from accepting (struct page *page, size_t offset, size_t size) to (phys_addr_t phys, size_t size). The existing semantics where callers are expected to provide only kmap memory is continued here. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/3557cbaf66e935bc794f37d2b891ef75cbf2c80c.1757423202.git.leonro@nvidia.com
2025-09-12dma-mapping: convert dma_direct_*map_page to be phys_addr_t basedLeon Romanovsky
Convert the DMA direct mapping functions to accept physical addresses directly instead of page+offset parameters. The functions were already operating on physical addresses internally, so this change eliminates the redundant page-to-physical conversion at the API boundary. The functions dma_direct_map_page() and dma_direct_unmap_page() are renamed to dma_direct_map_phys() and dma_direct_unmap_phys() respectively, with their calling convention changed from (struct page *page, unsigned long offset) to (phys_addr_t phys). Architecture-specific functions arch_dma_map_page_direct() and arch_dma_unmap_page_direct() are similarly renamed to arch_dma_map_phys_direct() and arch_dma_unmap_phys_direct(). The is_pci_p2pdma_page() checks are replaced with DMA_ATTR_MMIO checks to allow integration with dma_direct_map_resource and dma_direct_map_phys() is extended to support MMIO path either. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/bb15a22f76dc2e26683333ff54e789606cfbfcf0.1757423202.git.leonro@nvidia.com
2025-09-12iommu/dma: rename iommu_dma_*map_page to iommu_dma_*map_physLeon Romanovsky
Rename the IOMMU DMA mapping functions to better reflect their actual calling convention. The functions iommu_dma_map_page() and iommu_dma_unmap_page() are renamed to iommu_dma_map_phys() and iommu_dma_unmap_phys() respectively, as they already operate on physical addresses rather than page structures. The calling convention changes from accepting (struct page *page, unsigned long offset) to (phys_addr_t phys), which eliminates the need for page-to-physical address conversion within the functions. This renaming prepares for the broader DMA API conversion from page-based to physical address-based mapping throughout the kernel. All callers are updated to pass physical addresses directly, including dma_map_page_attrs(), scatterlist mapping functions, and DMA page allocation helpers. The change simplifies the code by removing the page_to_phys() + offset calculation that was previously done inside the IOMMU functions. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/ed172f95f8f57782beae04f782813366894e98df.1757423202.git.leonro@nvidia.com
2025-09-12dma-mapping: rename trace_dma_*map_page to trace_dma_*map_physLeon Romanovsky
As a preparation for following map_page -> map_phys API conversion, let's rename trace_dma_*map_page() to be trace_dma_*map_phys(). Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/c0c02d7d8bd4a148072d283353ba227516a76682.1757423202.git.leonro@nvidia.com
2025-09-12dma-debug: refactor to use physical addresses for page mappingLeon Romanovsky
Convert the DMA debug infrastructure from page-based to physical address-based mapping as a preparation to rely on physical address for DMA mapping routines. The refactoring renames debug_dma_map_page() to debug_dma_map_phys() and changes its signature to accept a phys_addr_t parameter instead of struct page and offset. Similarly, debug_dma_unmap_page() becomes debug_dma_unmap_phys(). A new dma_debug_phy type is introduced to distinguish physical address mappings from other debug entry types. All callers throughout the codebase are updated to pass physical addresses directly, eliminating the need for page-to-physical conversion in the debug layer. This refactoring eliminates the need to convert between page pointers and physical addresses in the debug layer, making the code more efficient and consistent with the DMA mapping API's physical address focus. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> [mszyprow: added a fixup] Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/56d1a6769b68dfcbf8b26a75a7329aeb8e3c3b6a.1757423202.git.leonro@nvidia.com Link: https://lore.kernel.org/all/20250910052618.GH341237@unreal/
2025-09-12Merge tag 'dma-mapping-6.17-2025-09-09' into HEADMarek Szyprowski
dma-mapping fix for Linux 6.17 - one more fix for DMA API debugging infrastructure (Baochen Qiang)
2025-09-11bpf: Report arena faults to BPF stderrPuranjay Mohan
Begin reporting arena page faults and the faulting address to BPF program's stderr, this patch adds support in the arm64 and x86-64 JITs, support for other archs can be added later. The fault handlers receive the 32 bit address in the arena region so the upper 32 bits of user_vm_start is added to it before printing the address. This is what the user would expect to see as this is what is printed by bpf_printk() is you pass it an address returned by bpf_arena_alloc_pages(); Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250911145808.58042-4-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-11bpf: core: introduce main_prog_aux for stream accessPuranjay Mohan
BPF streams are only valid for the main programs, to make it easier to access streams from subprogs, introduce main_prog_aux in struct bpf_prog_aux. prog->aux->main_prog_aux = prog->aux, for main programs and prog->aux->main_prog_aux = main_prog->aux, for subprograms. Make bpf_prog_find_from_stack() use the added main_prog_aux to return the mainprog when a subprog is found on the stack. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250911145808.58042-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc5Alexei Starovoitov
Cross-merge BPF and other fixes after downstream PR. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-11Merge tag 'pm-6.17-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix a nasty hibernation regression introduced during the 6.16 cycle, an issue related to energy model management occurring on Intel hybrid systems where some CPUs are offline to start with, and two regressions in the amd-pstate driver: - Restore a pm_restrict_gfp_mask() call in hibernation_snapshot() that was removed incorrectly during the 6.16 development cycle (Rafael Wysocki) - Introduce a function for registering a perf domain without triggering a system-wide CPU capacity update and make the intel_pstate driver use it to avoid reocurring unsuccessful attempts to update capacities of all CPUs in the system (Rafael Wysocki) - Fix setting of CPPC.min_perf in the active mode with performance governor in the amd-pstate driver to restore its expected behavior changed recently (Gautham Shenoy) - Avoid mistakenly setting EPP to 0 in the amd-pstate driver after system resume as a result of recent code changes (Mario Limonciello)" * tag 'pm-6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: hibernate: Restrict GFP mask in hibernation_snapshot() PM: EM: Add function for registering a PD without capacity update cpufreq/amd-pstate: Fix a regression leading to EPP 0 after resume cpufreq/amd-pstate: Fix setting of CPPC.min_perf in active mode for performance governor
2025-09-11entry: Add arch_irqentry_exit_need_resched() for arm64Jinjie Ruan
Compared to the generic entry code, ARM64 does additional checks when deciding to reschedule on return from interrupt. So introduce arch_irqentry_exit_need_resched() in the need_resched() condition of the generic raw_irqentry_exit_cond_resched(), with a NOP default. This will allow ARM64 to implement the architecture specific version for switching over to the generic entry code. Suggested-by: Ada Couprie Diaz <ada.coupriediaz@arm.com> Suggested-by: Mark Rutland <mark.rutland@arm.com> Suggested-by: Kevin Brodsky <kevin.brodsky@arm.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2025-09-11Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds
Pull bpf fixes from Alexei Starovoitov: "A number of fixes accumulated due to summer vacations - Fix out-of-bounds dynptr write in bpf_crypto_crypt() kfunc which was misidentified as a security issue (Daniel Borkmann) - Update the list of BPF selftests maintainers (Eduard Zingerman) - Fix selftests warnings with icecc compiler (Ilya Leoshkevich) - Disable XDP/cpumap direct return optimization (Jesper Dangaard Brouer) - Fix unexpected get_helper_proto() result in unusual configuration BPF_SYSCALL=y and BPF_EVENTS=n (Jiri Olsa) - Allow fallback to interpreter when JIT support is limited (KaFai Wan) - Fix rqspinlock and choose trylock fallback for NMI waiters. Pick the simplest fix. More involved fix is targeted bpf-next (Kumar Kartikeya Dwivedi) - Fix cleanup when tcp_bpf_send_verdict() fails to allocate psock->cork (Kuniyuki Iwashima) - Disallow bpf_timer in PREEMPT_RT for now. Proper solution is being discussed for bpf-next. (Leon Hwang) - Fix XSK cq descriptor production (Maciej Fijalkowski) - Tell memcg to use allow_spinning=false path in bpf_timer_init() to avoid lockup in cgroup_file_notify() (Peilin Ye) - Fix bpf_strnstr() to handle suffix match cases (Rong Tao)" * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Skip timer cases when bpf_timer is not supported bpf: Reject bpf_timer for PREEMPT_RT tcp_bpf: Call sk_msg_free() when tcp_bpf_send_verdict() fails to allocate psock->cork. bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init() bpf: Allow fall back to interpreter for programs with stack size <= 512 rqspinlock: Choose trylock fallback for NMI waiters xsk: Fix immature cq descriptor production bpf: Update the list of BPF selftests maintainers selftests/bpf: Add tests for bpf_strnstr selftests/bpf: Fix "expression result unused" warnings with icecc bpf: Fix bpf_strnstr() to handle suffix match cases better selftests/bpf: Extend crypto_sanity selftest with invalid dst buffer bpf: Fix out-of-bounds dynptr write in bpf_crypto_crypt bpf: Check the helper function is valid in get_helper_proto bpf, cpumap: Disable page_pool direct xdp_return need larger scope
2025-09-11Merge branches 'pm-sleep' and 'pm-em'Rafael J. Wysocki
Merge a hibernation regression fix and an fix related to energy model management for 6.17-rc6 * pm-sleep: PM: hibernate: Restrict GFP mask in hibernation_snapshot() * pm-em: PM: EM: Add function for registering a PD without capacity update
2025-09-10audit: fix skb leak when audit rate limit is exceededGerald Yang
When configuring a small audit rate limit in /etc/audit/rules.d/audit.rules: -a always,exit -F arch=b64 -S openat -S truncate -S ftruncate -F exit=-EACCES -F auid>=1000 -F auid!=4294967295 -k access -r 100 And then repeatedly triggering permission denied as a normal user: while :; do cat /proc/1/environ; done We can see the messages in kernel log: [ 2531.862184] audit: rate limit exceeded The unreclaimable slab objects start to leak quickly. With kmemleak enabled, many call traces appear like: unreferenced object 0xffff99144b13f600 (size 232): comm "cat", pid 1100, jiffies 4294739144 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace (crc 8540ec4f): kmemleak_alloc+0x4a/0x90 kmem_cache_alloc_node+0x2ea/0x390 __alloc_skb+0x174/0x1b0 audit_log_start+0x198/0x3d0 audit_log_proctitle+0x32/0x160 audit_log_exit+0x6c6/0x780 __audit_syscall_exit+0xee/0x140 syscall_exit_work+0x12b/0x150 syscall_exit_to_user_mode_prepare+0x39/0x80 syscall_exit_to_user_mode+0x11/0x260 do_syscall_64+0x8c/0x180 entry_SYSCALL_64_after_hwframe+0x78/0x80 This shows that the skb allocated in audit_log_start() and queued onto skb_list is never freed. In audit_log_end(), each skb is dequeued from skb_list and passed to __audit_log_end(). However, when the audit rate limit is exceeded, __audit_log_end() simply prints "rate limit exceeded" and returns without processing the skb. Since the skb is already removed from skb_list, audit_buffer_free() cannot free it later, leading to a memory leak. Fix this by freeing the skb when the rate limit is exceeded. Fixes: eb59d494eebd ("audit: add record for multiple task security contexts") Signed-off-by: Gerald Yang <gerald.yang@canonical.com> [PM: fixes tag, subj tweak] Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-09-10bpf: Reject bpf_timer for PREEMPT_RTLeon Hwang
When enable CONFIG_PREEMPT_RT, the kernel will warn when run timer selftests by './test_progs -t timer': BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48 In order to avoid such warning, reject bpf_timer in verifier when PREEMPT_RT is enabled. Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20250910125740.52172-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-10Merge tag 'trace-v6.17-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Remove redundant __GFP_NOWARN flag is kmalloc As now __GFP_NOWARN is part of __GFP_NOWAIT, it can be removed from kmalloc as it is redundant. - Use copy_from_user_nofault() instead of _inatomic() for trace markers The trace_marker files are written to to allow user space to quickly write into the tracing ring buffer. Back in 2016, the get_user_pages_fast() and the kmap() logic was replaced by a __copy_from_user_inatomic(), but didn't properly disable page faults around it. Since the time this was added, copy_from_user_nofault() was added which does the required page fault disabling for us. - Fix the assembly markup in the ftrace direct sample code The ftrace direct sample code (which is also used for selftests), had the size directive between the "leave" and the "ret" instead of after the ret. This caused objtool to think the code was unreachable. - Only call unregister_pm_notifier() on outer most fgraph registration There was an error path in register_ftrace_graph() that did not call unregister_pm_notifier() on error, so it was added in the error path. The problem with that fix, is that register_pm_notifier() is only called by the initial user of fgraph. If that succeeds, but another fgraph registration were to fail, then unregister_pm_notifier() would be called incorrectly. - Fix a crash in osnoise when zero size cpumask is passed in If a zero size CPU mask is passed in, the kmalloc() would return ZERO_SIZE_PTR which is not checked, and the code would continue thinking it had real memory and crash. If zero is passed in as the size of the write, simply return 0. - Fix possible warning in trace_pid_write() If while processing a series of numbers passed to the "set_event_pid" file, and one of the updates fails to allocate (triggered by a fault injection), it can cause a warning to trigger. Check the return value of the call to trace_pid_list_set() and break out early with an error code if it fails. * tag 'trace-v6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Silence warning when chunk allocation fails in trace_pid_write tracing/osnoise: Fix null-ptr-deref in bitmap_parselist() trace/fgraph: Fix error handling ftrace/samples: Fix function size computation tracing: Fix tracing_marker may trigger page fault during preempt_disable trace: Remove redundant __GFP_NOWARN
2025-09-10PM: hibernate: Restrict GFP mask in hibernation_snapshot()Rafael J. Wysocki
Commit 12ffc3b1513e ("PM: Restrict swap use to later in the suspend sequence") incorrectly removed a pm_restrict_gfp_mask() call from hibernation_snapshot(), so memory allocations involving swap are not prevented from being carried out in this code path any more which may lead to serious breakage. The symptoms of such breakage have become visible after adding a shrink_shmem_memory() call to hibernation_snapshot() in commit 2640e819474f ("PM: hibernate: shrink shmem pages after dev_pm_ops.prepare()") which caused this problem to be much more likely to manifest itself. However, since commit 2640e819474f was initially present in the DRM tree that did not include commit 12ffc3b1513e, the symptoms of this issue were not visible until merge commit 260f6f4fda93 ("Merge tag 'drm-next-2025-07-30' of https://gitlab.freedesktop.org/drm/kernel") that exposed it through an entirely reasonable merge conflict resolution. Fixes: 12ffc3b1513e ("PM: Restrict swap use to later in the suspend sequence") Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220555 Reported-by: Todd Brandt <todd.e.brandt@linux.intel.com> Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com> Cc: 6.16+ <stable@vger.kernel.org> # 6.16+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
2025-09-10cgroup: replace global percpu_rwsem with per threadgroup resem when writing ↵Yi Tao
to cgroup.procs The static usage pattern of creating a cgroup, enabling controllers, and then seeding it with CLONE_INTO_CGROUP doesn't require write locking cgroup_threadgroup_rwsem and thus doesn't benefit from this patch. To avoid affecting other users, the per threadgroup rwsem is only used when the favordynmods is enabled. As computer hardware advances, modern systems are typically equipped with many CPU cores and large amounts of memory, enabling the deployment of numerous applications. On such systems, container creation and deletion become frequent operations, making cgroup process migration no longer a cold path. This leads to noticeable contention with common process operations such as fork, exec, and exit. To alleviate the contention between cgroup process migration and operations like process fork, this patch modifies lock to take the write lock on signal_struct->group_rwsem when writing pid to cgroup.procs/threads instead of holding a global write lock. Cgroup process migration has historically relied on signal_struct->group_rwsem to protect thread group integrity. In commit <1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem"), this was changed to a global cgroup_threadgroup_rwsem. The advantage of using a global lock was simplified handling of process group migrations. This patch retains the use of the global lock for protecting process group migration, while reducing contention by using per thread group lock during cgroup.procs/threads writes. The locking behavior is as follows: write cgroup.procs/threads | process fork,exec,exit | process group migration ------------------------------------------------------------------------------ cgroup_lock() | down_read(&g_rwsem) | cgroup_lock() down_write(&p_rwsem) | down_read(&p_rwsem) | down_write(&g_rwsem) critical section | critical section | critical section up_write(&p_rwsem) | up_read(&p_rwsem) | up_write(&g_rwsem) cgroup_unlock() | up_read(&g_rwsem) | cgroup_unlock() g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes signal_struct->group_rwsem. This patch eliminates contention between cgroup migration and fork operations for threads that belong to different thread groups, thereby reducing the long-tail latency of cgroup migrations and lowering system load. With this patch, under heavy fork and exec interference, the long-tail latency of cgroup migration has been reduced from milliseconds to microseconds. Under heavy cgroup migration interference, the multi-CPU score of the spawn test case in UnixBench increased by 9%. tj: Update comment in cgroup_favor_dynmods() and switch WARN_ONCE() to pr_warn_once(). Signed-off-by: Yi Tao <escape@linux.alibaba.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-10cgroup: relocate cgroup_attach_lock within cgroup_procs_write_startYi Tao
Later patches will introduce a new parameter `task` to cgroup_attach_lock, thus adjusting the position of cgroup_attach_lock within cgroup_procs_write_start. Between obtaining the threadgroup leader via PID and acquiring the cgroup attach lock, the threadgroup leader may change, which could lead to incorrect cgroup migration. Therefore, after acquiring the cgroup attach lock, we check whether the threadgroup leader has changed, and if so, retry the operation. tj: Minor comment adjustments. Signed-off-by: Yi Tao <escape@linux.alibaba.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-10cgroup: refactor the cgroup_attach_lock code to make it clearerYi Tao
Dynamic cgroup migration involving threadgroup locks can be in one of two states: no lock held, or holding the global lock. Explicitly declaring the different lock modes to make the code easier to understand and facilitates future extensions of the lock modes. Signed-off-by: Yi Tao <escape@linux.alibaba.com> Signed-off-by: Tejun Heo <tj@kernel.org>