summaryrefslogtreecommitdiff
path: root/arch/riscv/mm
AgeCommit message (Collapse)Author
2025-10-27riscv: ptdump: use seq_puts() in pt_dump_seq_puts() macroJosephine Pfeiffer
The pt_dump_seq_puts() macro incorrectly uses seq_printf() instead of seq_puts(). This is both a performance issue and conceptually wrong, as the macro name suggests plain string output (puts) but the implementation uses formatted output (printf). The macro is used in ptdump.c:301 to output a newline character. Using seq_printf() adds unnecessary overhead for format string parsing when outputting this constant string. This bug was introduced in commit 59c4da8640cc ("riscv: Add support to dump the kernel page tables") in 2020, which copied the implementation pattern from other architectures that had the same bug. Fixes: 59c4da8640cc ("riscv: Add support to dump the kernel page tables") Signed-off-by: Josephine Pfeiffer <hi@josie.lol> Link: https://lore.kernel.org/r/20251018170451.3355496-1-hi@josie.lol Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-10-02Merge tag 'mm-stable-2025-10-01-19-00' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "mm, swap: improve cluster scan strategy" from Kairui Song improves performance and reduces the failure rate of swap cluster allocation - "support large align and nid in Rust allocators" from Vitaly Wool permits Rust allocators to set NUMA node and large alignment when perforning slub and vmalloc reallocs - "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets for virtual address spaces for ops-level DAMOS filters - "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren Baghdasaryan reduces mmap_lock contention during reads of /proc/pid/maps - "mm/mincore: minor clean up for swap cache checking" from Kairui Song performs some cleanup in the swap code - "mm: vm_normal_page*() improvements" from David Hildenbrand provides code cleanup in the pagemap code - "add persistent huge zero folio support" from Pankaj Raghav provides a block layer speedup by optionalls making the huge_zero_pagepersistent, instead of releasing it when its refcount falls to zero - "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to the recently added Kexec Handover feature - "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap. To end the constant struggle with space shortage on 32-bit conflicting with 64-bit's needs - "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap code - "selftests/mm: Fix false positives and skip unsupported tests" from Donet Tom fixes a few things in our selftests code - "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised" from David Hildenbrand "allows individual processes to opt-out of THP=always into THP=madvise, without affecting other workloads on the system". It's a long story - the [1/N] changelog spells out the considerations - "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on the memdesc project. Please see https://kernelnewbies.org/MatthewWilcox/Memdescs and https://blogs.oracle.com/linux/post/introducing-memdesc - "Tiny optimization for large read operations" from Chi Zhiling improves the efficiency of the pagecache read path - "Better split_huge_page_test result check" from Zi Yan improves our folio splitting selftest code - "test that rmap behaves as expected" from Wei Yang adds some rmap selftests - "remove write_cache_pages()" from Christoph Hellwig removes that function and converts its two remaining callers - "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD selftests issues - "introduce kernel file mapped folios" from Boris Burkov introduces the concept of "kernel file pages". Using these permits btrfs to account its metadata pages to the root cgroup, rather than to the cgroups of random inappropriate tasks - "mm/pageblock: improve readability of some pageblock handling" from Wei Yang provides some readability improvements to the page allocator code - "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON to understand arm32 highmem - "tools: testing: Use existing atomic.h for vma/maple tests" from Brendan Jackman performs some code cleanups and deduplication under tools/testing/ - "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes a couple of 32-bit issues in tools/testing/radix-tree.c - "kasan: unify kasan_enabled() and remove arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific initialization code into a common arch-neutral implementation - "mm: remove zpool" from Johannes Weiner removes zspool - an indirection layer which now only redirects to a single thing (zsmalloc) - "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a couple of cleanups in the fork code - "mm: remove nth_page()" from David Hildenbrand makes rather a lot of adjustments at various nth_page() callsites, eventually permitting the removal of that undesirable helper function - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun creates a KASAN read-only mode for ARM, using that architecture's memory tagging feature. It is felt that a read-only mode KASAN is suitable for use in production systems rather than debug-only - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does some tidying in the hugetlb folio allocation code - "mm: establish const-correctness for pointer parameters" from Max Kellermann makes quite a number of the MM API functions more accurate about the constness of their arguments. This was getting in the way of subsystems (in this case CEPH) when they attempt to improving their own const/non-const accuracy - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of code sites which were confused over when to use free_pages() vs __free_pages() - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the mapletree code accessible to Rust. Required by nouveau and by its forthcoming successor: the new Rust Nova driver - "selftests/mm: split_huge_page_test: split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and some cleanups to the thp selftesting code - "mm, swap: introduce swap table as swap cache (phase I)" from Chris Li and Kairui Song is the first step along the path to implementing "swap tables" - a new approach to swap allocation and state tracking which is expected to yield speed and space improvements. This patchset itself yields a 5-20% performance benefit in some situations - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc layer to clean up the ptdesc code a little - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some issues in our 5-level pagetable selftesting code - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan addresses a couple of minor issues in relatively new memory allocation profiling feature - "Small cleanups" from Matthew Wilcox has a few cleanups in preparation for more memdesc work - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in furtherance of supporting arm highmem - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad Anjum adds that compiler check to selftests code and fixes the fallout, by removing dead code - "Improvements to Victim Process Thawing and OOM Reaper Traversal Order" from zhongjinji makes a number of improvements in the OOM killer: mainly thawing a more appropriate group of victim threads so they can release resources - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park is a bunch of small and unrelated fixups for DAMON - "mm/damon: define and use DAMON initialization check function" from SeongJae Park implement reliability and maintainability improvements to a recently-added bug fix - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from SeongJae Park provides additional transparency to userspace clients of the DAMON_STAT information - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes some constraints on khubepaged's collapsing of anon VMAs. It also increases the success rate of MADV_COLLAPSE against an anon vma - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards removal of file_operations.mmap(). This patchset concentrates upon clearing up the treatment of stacked filesystems - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau provides some fixes and improvements to mlock's tracking of large folios. /proc/meminfo's "Mlocked" field became more accurate - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from Donet Tom fixes several user-visible KSM stats inaccuracies across forks and adds selftest code to verify these counters - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses some potential but presently benign issues in KSM's mm_slot handling * tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits) mm: swap: check for stable address space before operating on the VMA mm: convert folio_page() back to a macro mm/khugepaged: use start_addr/addr for improved readability hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list alloc_tag: fix boot failure due to NULL pointer dereference mm: silence data-race in update_hiwater_rss mm/memory-failure: don't select MEMORY_ISOLATION mm/khugepaged: remove definition of struct khugepaged_mm_slot mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL hugetlb: increase number of reserving hugepages via cmdline selftests/mm: add fork inheritance test for ksm_merging_pages counter mm/ksm: fix incorrect KSM counter handling in mm_struct during fork drivers/base/node: fix double free in register_one_node() mm: remove PMD alignment constraint in execmem_vmalloc() mm/memory_hotplug: fix typo 'esecially' -> 'especially' mm/rmap: improve mlock tracking for large folios mm/filemap: map entire large folio faultaround mm/fault: try to map the entire file folio in finish_fault() mm/rmap: mlock large folios in try_to_unmap_one() mm/rmap: fix a mlock race condition in folio_referenced_one() ...
2025-09-21riscv: stop calling page_address() in free_pages()Vishal Moola (Oracle)
free_pages() should be used when we only have a virtual address. We should call __free_pages() directly on our page instead. Link: https://lkml.kernel.org/r/20250903185921.1785167-5-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Justin Sanders <justin@coraid.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: SeongJae Park <sj@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21kasan: call kasan_init_generic in kasan_initSabyrzhan Tasbolatov
Call kasan_init_generic() which handles Generic KASAN initialization. For architectures that do not select ARCH_DEFER_KASAN, this will be a no-op for the runtime flag but will print the initialization banner. For SW_TAGS and HW_TAGS modes, their respective init functions will handle the flag enabling, if they are enabled/implemented. Link: https://lkml.kernel.org/r/20250810125746.1105476-3-snovitoll@gmail.com Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217049 Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com> [riscv] Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390] Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Alexander Potapenko <glider@google.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: David Gow <davidgow@google.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@loongson.cn> Cc: Marco Elver <elver@google.com> Cc: Qing Zhang <zhangqing@loongson.cn> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-19riscv: errata: Fix the PAUSE Opcode for MIPS P8700Djordje Todorovic
Add ERRATA_MIPS and ERRATA_MIPS_P8700_PAUSE_OPCODE configs. Handle errata for the MIPS PAUSE instruction. Signed-off-by: Djordje Todorovic <djordje.todorovic@htecgroup.com> Signed-off-by: Aleksandar Rikalo <arikalo@gmail.com> Signed-off-by: Raj Vishwanathan4 <rvishwanathan@mips.com> Signed-off-by: Aleksa Paunovic <aleksa.paunovic@htecgroup.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20250724-p8700-pause-v5-7-a6cbbe1c3412@htecgroup.com [pjw@kernel.org: updated to apply and compile; fixed a checkpatch issue] Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-09-18riscv: mm: Use mmu-type from FDT to limit SATP modeJunhui Liu
Some RISC-V implementations may hang when attempting to write an unsupported SATP mode, even though the latest RISC-V specification states such writes should have no effect. To avoid this issue, the logic for selecting SATP mode has been refined: The kernel now determines the SATP mode limit by taking the minimum of the value specified by the kernel command line (noXlvl) and the "mmu-type" property in the device tree (FDT). If only one is specified, use that. - If the resulting limit is sv48 or higher, the kernel will probe SATP modes from this limit downward until a supported mode is found. - If the limit is sv39, the kernel will directly use sv39 without probing. This ensures SATP mode selection is safe and compatible with both hardware and user configuration, minimizing the risk of hangs. Signed-off-by: Junhui Liu <junhui.liu@pigmoral.tech> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250722-satp-from-fdt-v1-2-5ba22218fa5f@pigmoral.tech Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-09-17riscv: mm: Return intended SATP mode for noXlvl optionsJunhui Liu
Change the return value of match_noXlvl() to return the SATP mode that will be used, rather than the mode being disabled. This enables unified logic for return value judgement with the function that obtains mmu-type from the fdt, avoiding extra conversion. This only changes the naming, with no functional impact. Signed-off-by: Junhui Liu <junhui.liu@pigmoral.tech> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250722-satp-from-fdt-v1-1-5ba22218fa5f@pigmoral.tech Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-09-13mm: introduce memdesc_flags_tMatthew Wilcox (Oracle)
Patch series "Add and use memdesc_flags_t". At some point struct page will be separated from struct slab and struct folio. This is a step towards that by introducing a type for the 'flags' word of all three structures. This gives us a certain amount of type safety by establishing that some of these unsigned longs are different from other unsigned longs in that they contain things like node ID, section number and zone number in the upper bits. That lets us have functions that can be easily called by anyone who has a slab, folio or page (but not easily by anyone else) to get the node or zone. There's going to be some unusual merge problems with this as some odd bits of the kernel decide they want to print out the flags value or something similar by writing page->flags and now they'll need to write page->flags.f instead. That's most of the churn here. Maybe we should be removing these things from the debug output? This patch (of 11): Wrap the unsigned long flags in a typedef. In upcoming patches, this will provide a strong hint that you can't just pass a random unsigned long to functions which take this as an argument. [willy@infradead.org: s/flags/flags.f/ in several architectures] Link: https://lkml.kernel.org/r/aKMgPRLD-WnkPxYm@casper.infradead.org [nicola.vetrini@gmail.com: mips: fix compilation error] Link: https://lore.kernel.org/lkml/CA+G9fYvkpmqGr6wjBNHY=dRp71PLCoi2341JxOudi60yqaeUdg@mail.gmail.com/ Link: https://lkml.kernel.org/r/20250825214245.1838158-1-nicola.vetrini@gmail.com Link: https://lkml.kernel.org/r/20250805172307.1302730-1-willy@infradead.org Link: https://lkml.kernel.org/r/20250805172307.1302730-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-03Merge tag 'mm-nonmm-stable-2025-08-03-12-47' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "Significant patch series in this pull request: - "squashfs: Remove page->mapping references" (Matthew Wilcox) gets us closer to being able to remove page->mapping - "relayfs: misc changes" (Jason Xing) does some maintenance and minor feature addition work in relayfs - "kdump: crashkernel reservation from CMA" (Jiri Bohac) switches us from static preallocation of the kdump crashkernel's working memory over to dynamic allocation. So the difficulty of a-priori estimation of the second kernel's needs is removed and the first kernel obtains extra memory - "generalize panic_print's dump function to be used by other kernel parts" (Feng Tang) implements some consolidation and rationalization of the various ways in which a failing kernel splats information at the operator * tag 'mm-nonmm-stable-2025-08-03-12-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (80 commits) tools/getdelays: add backward compatibility for taskstats version kho: add test for kexec handover delaytop: enhance error logging and add PSI feature description samples: Kconfig: fix spelling mistake "instancess" -> "instances" fat: fix too many log in fat_chain_add() scripts/spelling.txt: add notifer||notifier to spelling.txt xen/xenbus: fix typo "notifer" net: mvneta: fix typo "notifer" drm/xe: fix typo "notifer" cxl: mce: fix typo "notifer" KVM: x86: fix typo "notifer" MAINTAINERS: add maintainers for delaytop ucount: use atomic_long_try_cmpxchg() in atomic_long_inc_below() ucount: fix atomic_long_inc_below() argument type kexec: enable CMA based contiguous allocation stackdepot: make max number of pools boot-time configurable lib/xxhash: remove unused functions init/Kconfig: restore CONFIG_BROKEN help text lib/raid6: update recov_rvv.c zero page usage docs: update docs after introducing delaytop ...
2025-07-31Merge tag 'mm-stable-2025-07-30-15-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "As usual, many cleanups. The below blurbiage describes 42 patchsets. 21 of those are partially or fully cleanup work. "cleans up", "cleanup", "maintainability", "rationalizes", etc. I never knew the MM code was so dirty. "mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes) addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly mapped VMAs were not eligible for merging with existing adjacent VMAs. "mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park) adds a new kernel module which simplifies the setup and usage of DAMON in production environments. "stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig) is a cleanup to the writeback code which removes a couple of pointers from struct writeback_control. "drivers/base/node.c: optimization and cleanups" (Donet Tom) contains largely uncorrelated cleanups to the NUMA node setup and management code. "mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman) does some maintenance work on the userfaultfd code. "Readahead tweaks for larger folios" (Ryan Roberts) implements some tuneups for pagecache readahead when it is reading into order>0 folios. "selftests/mm: Tweaks to the cow test" (Mark Brown) provides some cleanups and consistency improvements to the selftests code. "Optimize mremap() for large folios" (Dev Jain) does that. A 37% reduction in execution time was measured in a memset+mremap+munmap microbenchmark. "Remove zero_user()" (Matthew Wilcox) expunges zero_user() in favor of the more modern memzero_page(). "mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand) addresses some warts which David noticed in the huge page code. These were not known to be causing any issues at this time. "mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD" (SeongJae Park) provides some cleanup and consolidation work in DAMON. "use vm_flags_t consistently" (Lorenzo Stoakes) uses vm_flags_t in places where we were inappropriately using other types. "mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy) increases the reliability of large page allocation in the memfd code. "mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple) removes several now-unneeded PFN_* flags. "mm/damon: decouple sysfs from core" (SeongJae Park) implememnts some cleanup and maintainability work in the DAMON sysfs layer. "madvise cleanup" (Lorenzo Stoakes) does quite a lot of cleanup/maintenance work in the madvise() code. "madvise anon_name cleanups" (Vlastimil Babka) provides additional cleanups on top or Lorenzo's effort. "Implement numa node notifier" (Oscar Salvador) creates a standalone notifier for NUMA node memory state changes. Previously these were lumped under the more general memory on/offline notifier. "Make MIGRATE_ISOLATE a standalone bit" (Zi Yan) cleans up the pageblock isolation code and fixes a potential issue which doesn't seem to cause any problems in practice. "selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park) adds additional drgn- and python-based DAMON selftests which are more comprehensive than the existing selftest suite. "Misc rework on hugetlb faulting path" (Oscar Salvador) fixes a rather obscure deadlock in the hugetlb fault code and follows that fix with a series of cleanups. "cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport) rationalizes and cleans up the highmem-specific code in the CMA allocator. "mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand) provides cleanups and future-preparedness to the migration code. "mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park) adds some tracepoints to some DAMON auto-tuning code. "mm/damon: fix misc bugs in DAMON modules" (SeongJae Park) does that. "mm/damon: misc cleanups" (SeongJae Park) also does what it claims. "mm: folio_pte_batch() improvements" (David Hildenbrand) cleans up the large folio PTE batching code. "mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park) facilitates dynamic alteration of DAMON's inter-node allocation policy. "Remove unmap_and_put_page()" (Vishal Moola) provides a couple of page->folio conversions. "mm: per-node proactive reclaim" (Davidlohr Bueso) implements a per-node control of proactive reclaim - beyond the current memcg-based implementation. "mm/damon: remove damon_callback" (SeongJae Park) replaces the damon_callback interface with a more general and powerful damon_call()+damos_walk() interface. "mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes) implements a number of mremap cleanups (of course) in preparation for adding new mremap() functionality: newly permit the remapping of multiple VMAs when the user is specifying MREMAP_FIXED. It still excludes some specialized situations where this cannot be performed reliably. "drop hugetlb_free_pgd_range()" (Anthony Yznaga) switches some sparc hugetlb code over to the generic version and removes the thus-unneeded hugetlb_free_pgd_range(). "mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park) augments the present userspace-requested update of DAMON sysfs monitoring files. Automatic update is now provided, along with a tunable to control the update interval. "Some randome fixes and cleanups to swapfile" (Kemeng Shi) does what is claims. "mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand) provides (and uses) a means by which debug-style functions can grab a copy of a pageframe and inspect it locklessly without tripping over the races inherent in operating on the live pageframe directly. "use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan) addresses the large contention issues which can be triggered by reads from that procfs file. Latencies are reduced by more than half in some situations. The series also introduces several new selftests for the /proc/pid/maps interface. "__folio_split() clean up" (Zi Yan) cleans up __folio_split()! "Optimize mprotect() for large folios" (Dev Jain) provides some quite large (>3x) speedups to mprotect() when dealing with large folios. "selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian) does some cleanup work in the selftests code. "tools/testing: expand mremap testing" (Lorenzo Stoakes) extends the mremap() selftest in several ways, including adding more checking of Lorenzo's recently added "permit mremap() move of multiple VMAs" feature. "selftests/damon/sysfs.py: test all parameters" (SeongJae Park) extends the DAMON sysfs interface selftest so that it tests all possible user-requested parameters. Rather than the present minimal subset" * tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits) MAINTAINERS: add missing headers to mempory policy & migration section MAINTAINERS: add missing file to cgroup section MAINTAINERS: add MM MISC section, add missing files to MISC and CORE MAINTAINERS: add missing zsmalloc file MAINTAINERS: add missing files to page alloc section MAINTAINERS: add missing shrinker files MAINTAINERS: move memremap.[ch] to hotplug section MAINTAINERS: add missing mm_slot.h file THP section MAINTAINERS: add missing interval_tree.c to memory mapping section MAINTAINERS: add missing percpu-internal.h file to per-cpu section mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info() selftests/damon: introduce _common.sh to host shared function selftests/damon/sysfs.py: test runtime reduction of DAMON parameters selftests/damon/sysfs.py: test non-default parameters runtime commit selftests/damon/sysfs.py: generalize DAMON context commit assertion selftests/damon/sysfs.py: generalize monitoring attributes commit assertion selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion selftests/damon/sysfs.py: test DAMOS filters commitment selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion selftests/damon/sysfs.py: test DAMOS destinations commitment ...
2025-07-24mm: remove arch_flush_tlb_batched_pending() arch helperRyan Roberts
Since commit 4b634918384c ("arm64/mm: Close theoretical race where stale TLB entry remains valid"), all arches that use tlbbatch for reclaim (arm64, riscv, x86) implement arch_flush_tlb_batched_pending() with a flush_tlb_mm(). So let's simplify by removing the unnecessary abstraction and doing the flush_tlb_mm() directly in flush_tlb_batched_pending(). This effectively reverts commit db6c1f6f236d ("mm/tlbbatch: introduce arch_flush_tlb_batched_pending()"). Link: https://lkml.kernel.org/r/20250609103132.447370-1-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Suggested-by: Will Deacon <will@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Will Deacon <will@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-19Add a new optional ",cma" suffix to the crashkernel= command line optionJiri Bohac
Patch series "kdump: crashkernel reservation from CMA", v5. This series implements a way to reserve additional crash kernel memory using CMA. Currently, all the memory for the crash kernel is not usable by the 1st (production) kernel. It is also unmapped so that it can't be corrupted by the fault that will eventually trigger the crash. This makes sense for the memory actually used by the kexec-loaded crash kernel image and initrd and the data prepared during the load (vmcoreinfo, ...). However, the reserved space needs to be much larger than that to provide enough run-time memory for the crash kernel and the kdump userspace. Estimating the amount of memory to reserve is difficult. Being too careful makes kdump likely to end in OOM, being too generous takes even more memory from the production system. Also, the reservation only allows reserving a single contiguous block (or two with the "low" suffix). I've seen systems where this fails because the physical memory is fragmented. By reserving additional crashkernel memory from CMA, the main crashkernel reservation can be just large enough to fit the kernel and initrd image, minimizing the memory taken away from the production system. Most of the run-time memory for the crash kernel will be memory previously available to userspace in the production system. As this memory is no longer wasted, the reservation can be done with a generous margin, making kdump more reliable. Kernel memory that we need to preserve for dumping is normally not allocated from CMA, unless it is explicitly allocated as movable. Currently this is only the case for memory ballooning and zswap. Such movable memory will be missing from the vmcore. User data is typically not dumped by makedumpfile. When dumping of user data is intended this new CMA reservation cannot be used. There are five patches in this series: The first adds a new ",cma" suffix to the recenly introduced generic crashkernel parsing code. parse_crashkernel() takes one more argument to store the cma reservation size. The second patch implements reserve_crashkernel_cma() which performs the reservation. If the requested size is not available in a single range, multiple smaller ranges will be reserved. The third patch updates Documentation/, explicitly mentioning the potential DMA corruption of the CMA-reserved memory. The fourth patch adds a short delay before booting the kdump kernel, allowing pending DMA transfers to finish. The fifth patch enables the functionality for x86 as a proof of concept. There are just three things every arch needs to do: - call reserve_crashkernel_cma() - include the CMA-reserved ranges in the physical memory map - exclude the CMA-reserved ranges from the memory available through /proc/vmcore by excluding them from the vmcoreinfo PT_LOAD ranges. Adding other architectures is easy and I can do that as soon as this series is merged. With this series applied, specifying crashkernel=100M craskhernel=1G,cma on the command line will make a standard crashkernel reservation of 100M, where kexec will load the kernel and initrd. An additional 1G will be reserved from CMA, still usable by the production system. The crash kernel will have 1.1G memory available. The 100M can be reliably predicted based on the size of the kernel and initrd. The new cma suffix is completely optional. When no crashkernel=size,cma is specified, everything works as before. This patch (of 5): Add a new cma_size parameter to parse_crashkernel(). When not NULL, call __parse_crashkernel to parse the CMA reservation size from "crashkernel=size,cma" and store it in cma_size. Set cma_size to NULL in all calls to parse_crashkernel(). Link: https://lkml.kernel.org/r/aEqnxxfLZMllMC8I@dwarf.suse.cz Link: https://lkml.kernel.org/r/aEqoQckgoTQNULnh@dwarf.suse.cz Signed-off-by: Jiri Bohac <jbohac@suse.cz> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Donald Dutile <ddutile@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Philipp Rudo <prudo@redhat.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Tao Liu <ltao@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/ptdump: take the memory hotplug lock inside ptdump_walk_pgd()Anshuman Khandual
Memory hot remove unmaps and tears down various kernel page table regions as required. The ptdump code can race with concurrent modifications of the kernel page tables. When leaf entries are modified concurrently, the dump code may log stale or inconsistent information for a VA range, but this is otherwise not harmful. But when intermediate levels of kernel page table are freed, the dump code will continue to use memory that has been freed and potentially reallocated for another purpose. In such cases, the ptdump code may dereference bogus addresses, leading to a number of potential problems. To avoid the above mentioned race condition, platforms such as arm64, riscv and s390 take memory hotplug lock, while dumping kernel page table via the sysfs interface /sys/kernel/debug/kernel_page_tables. Similar race condition exists while checking for pages that might have been marked W+X via /sys/kernel/debug/kernel_page_tables/check_wx_pages which in turn calls ptdump_check_wx(). Instead of solving this race condition again, let's just move the memory hotplug lock inside generic ptdump_check_wx() which will benefit both the scenarios. Drop get_online_mems() and put_online_mems() combination from all existing platform ptdump code paths. Link: https://lkml.kernel.org/r/20250620052427.2092093-1-anshuman.khandual@arm.com Fixes: bbd6ec605c0f ("arm64/mm: Enable memory hot remove") Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/pagewalk: split walk_page_range_novma() into kernel/user partsLorenzo Stoakes
walk_page_range_novma() is rather confusing - it supports two modes, one used often, the other used only for debugging. The first mode is the common case of traversal of kernel page tables, which is what nearly all callers use this for. Secondly it provides an unusual debugging interface that allows for the traversal of page tables in a userland range of memory even for that memory which is not described by a VMA. It is far from certain that such page tables should even exist, but perhaps this is precisely why it is useful as a debugging mechanism. As a result, this is utilised by ptdump only. Historically, things were reversed - ptdump was the only user, and other parts of the kernel evolved to use the kernel page table walking here. Since we have some complicated and confusing locking rules for the novma case, it makes sense to separate the two usages into their own functions. Doing this also provide self-documentation as to the intent of the caller - are they doing something rather unusual or are they simply doing a standard kernel page table walk? We therefore establish two separate functions - walk_page_range_debug() for this single usage, and walk_kernel_page_table_range() for general kernel page table walking. The walk_page_range_debug() function is currently used to traverse both userland and kernel mappings, so we maintain this and in the case of kernel mappings being traversed, we have walk_page_range_debug() invoke walk_kernel_page_table_range() internally. We additionally make walk_page_range_debug() internal to mm. Link: https://lkml.kernel.org/r/20250605135104.90720-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Barry Song <baohua@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Stafford Horne <shorne@gmail.com> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09riscv: mm: Add page fault trace pointsNam Cao
Add page fault trace points, which are useful to implement RV monitor that watches page faults. Signed-off-by: Nam Cao <namcao@linutronix.de> Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-06Merge tag 'riscv-for-linus-6.16-mw1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V updates from Palmer Dabbelt: - Support for the FWFT SBI extension, which is part of SBI 3.0 and a dependency for many new SBI and ISA extensions - Support for getrandom() in the VDSO - Support for mseal - Optimized routines for raid6 syndrome and recovery calculations - kexec_file() supports loading Image-formatted kernel binaries - Improvements to the instruction patching framework to allow for atomic instruction patching, along with rules as to how systems need to behave in order to function correctly - Support for a handful of new ISA extensions: Svinval, Zicbop, Zabha, some SiFive vendor extensions - Various fixes and cleanups, including: misaligned access handling, perf symbol mangling, module loading, PUD THPs, and improved uaccess routines * tag 'riscv-for-linus-6.16-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (69 commits) riscv: uaccess: Only restore the CSR_STATUS SUM bit RISC-V: vDSO: Wire up getrandom() vDSO implementation riscv: enable mseal sysmap for RV64 raid6: Add RISC-V SIMD syndrome and recovery calculations riscv: mm: Add support for Svinval extension RISC-V: Documentation: Add enough title underlines to CMODX riscv: Improve Kconfig help for RISCV_ISA_V_PREEMPTIVE MAINTAINERS: Update Atish's email address riscv: uaccess: do not do misaligned accesses in get/put_user() riscv: process: use unsigned int instead of unsigned long for put_user() riscv: make unsafe user copy routines use existing assembly routines riscv: hwprobe: export Zabha extension riscv: Make regs_irqs_disabled() more clear perf symbols: Ignore mapping symbols on riscv RISC-V: Kconfig: Fix help text of CMDLINE_EXTEND riscv: module: Optimize PLT/GOT entry counting riscv: Add support for PUD THP riscv: xchg: Prefetch the destination word for sc.w riscv: Add ARCH_HAS_PREFETCH[W] support with Zicbop riscv: Add support for Zicbop ...
2025-06-05Merge tag 'riscv-mw2-6.16-rc1' of ↵Palmer Dabbelt
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/alexghiti/linux into for-next riscv patches for 6.16-rc1, part 2 * Performance improvements - Add support for vdso getrandom - Implement raid6 calculations using vectors - Introduce svinval tlb invalidation * Cleanup - A bunch of deduplication of the macros we use for manipulating instructions * Misc - Introduce a kunit test for kprobes - Add support for mseal as riscv fits the requirements (thanks to Lorenzo for making sure of that :)) [Palmer: There was a rebase between part 1 and part 2, so I've had to do some more git surgery here... at least two rounds of surgery...] * alex-pr-2: (866 commits) RISC-V: vDSO: Wire up getrandom() vDSO implementation riscv: enable mseal sysmap for RV64 raid6: Add RISC-V SIMD syndrome and recovery calculations riscv: mm: Add support for Svinval extension riscv: Add kprobes KUnit test riscv: kprobes: Remove duplication of RV_EXTRACT_ITYPE_IMM riscv: kprobes: Remove duplication of RV_EXTRACT_UTYPE_IMM riscv: kprobes: Remove duplication of RV_EXTRACT_RD_REG riscv: kprobes: Remove duplication of RVC_EXTRACT_BTYPE_IMM riscv: kprobes: Remove duplication of RVC_EXTRACT_C2_RS1_REG riscv: kproves: Remove duplication of RVC_EXTRACT_JTYPE_IMM riscv: kprobes: Remove duplication of RV_EXTRACT_BTYPE_IMM riscv: kprobes: Remove duplication of RV_EXTRACT_RS1_REG riscv: kprobes: Remove duplication of RV_EXTRACT_JTYPE_IMM riscv: kprobes: Move branch_funct3 to insn.h riscv: kprobes: Move branch_rs2_idx to insn.h Linux 6.15-rc6 Input: xpad - fix xpad_device sorting Input: xpad - add support for several more controllers Input: xpad - fix Share button on Xbox One controllers ...
2025-06-05riscv: mm: Add support for Svinval extensionMayuresh Chitale
The Svinval extension splits SFENCE.VMA instruction into finer-grained invalidation and ordering operations and is mandatory for RVA23S64 profile. When Svinval is enabled the local_flush_tlb_range_threshold_asid function should use the following sequence to optimize the tlb flushes instead of a simple sfence.vma: sfence.w.inval svinval.vma . . svinval.vma sfence.inval.ir The maximum number of consecutive svinval.vma instructions that can be executed in local_flush_tlb_range_threshold_asid function is limited to 64. This is required to avoid soft lockups and the approach is similar to that used in arm64. Signed-off-by: Mayuresh Chitale <mchitale@ventanamicro.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20240702102637.9074-1-mchitale@ventanamicro.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-05Merge patch series "riscv: Add Zicbop & prefetchw support"Alexandre Ghiti
Alexandre Ghiti <alexghiti@rivosinc.com> says: I found this lost series developed by Guo so here is a respin with the comments on v2 applied. This patch series adds Zicbop support and then enables the Linux prefetch features. * patches from https://lore.kernel.org/r/20250421142441.395849-1-alexghiti@rivosinc.com: riscv: xchg: Prefetch the destination word for sc.w riscv: Add ARCH_HAS_PREFETCH[W] support with Zicbop riscv: Add support for Zicbop riscv: Introduce Zicbop instructions Link: https://lore.kernel.org/r/20250421142441.395849-1-alexghiti@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-05Merge patch series "riscv: ftrace: atmoic patching and preempt improvements"Alexandre Ghiti
Andy Chiu <andybnac@gmail.com> says: This series makes atomic code patching in ftrace possible and eliminates the need of the stop_machine dance. The major difference of this version is that we merge the CALL_OPS support from Puranjay [1] and make direct calls available for practical uses such as BPF. Thanks for the time reviewing the series and suggestions, we hope this version gets a step closer to happening in the upstream. Please reference the link to v3 below for more introductory view of the implementation [2] Added patch: 2, 4, 10, 11, 12 Modified patch: 5, 6 Unchanged patch: 1, 3, 7, 8, 9 (1, 8 has commit msg modified) Special thanks to Björn for his efforts on testing and guiding the series! [1]: https://lore.kernel.org/lkml/20240306165904.108141-1-puranjay12@gmail.com/ [2]: https://lore.kernel.org/linux-riscv/20241127172908.17149-1-andybnac@gmail.com/ * patches from https://lore.kernel.org/r/20250407180838.42877-1-andybnac@gmail.com: riscv: Documentation: add a description about dynamic ftrace riscv: ftrace: support direct call using call_ops riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS riscv: ftrace: support PREEMPT riscv: add a data fence for CMODX in the kernel mode riscv: vector: Support calling schedule() for preemptible Vector riscv: ftrace: do not use stop_machine to update code riscv: ftrace: prepare ftrace for atomic code patching kernel: ftrace: export ftrace_sync_ipi riscv: ftrace: align patchable functions to 4 Byte boundary riscv: ftrace factor out code defined by !WITH_ARG riscv: ftrace: support fastcc in Clang for WITH_ARGS Link: https://lore.kernel.org/r/20250407180838.42877-1-andybnac@gmail.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
2025-06-05riscv: Add support for PUD THPAlexandre Ghiti
Add the necessary page table functions to deal with PUD THP, this enables the use of PUD pfnmap. Link: https://lore.kernel.org/r/20250321123954.225097-1-alexghiti@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-05riscv: Add support for ZicbopAlexandre Ghiti
Zicbop introduces cache blocks prefetching instructions, add the necessary support for the kernel to use it in the coming commits. Co-developed-by: Guo Ren <guoren@kernel.org> Signed-off-by: Guo Ren <guoren@kernel.org> Tested-by: Andrea Parri <parri.andrea@gmail.com> Link: https://lore.kernel.org/r/20250421142441.395849-3-alexghiti@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-05riscv: add a data fence for CMODX in the kernel modeAndy Chiu
RISC-V spec explicitly calls out that a local fence.i is not enough for the code modification to be visble from a remote hart. In fact, it states: To make a store to instruction memory visible to all RISC-V harts, the writing hart also has to execute a data FENCE before requesting that all remote RISC-V harts execute a FENCE.I. Although current riscv drivers for IPI use ordered MMIO when sending IPIs in order to synchronize the action between previous csd writes, riscv does not restrict itself to any particular flavor of IPI. Any driver or firmware implementation that does not order data writes before the IPI may pose a risk for code-modifying race. Thus, add a fence here to order data writes before making the IPI. Signed-off-by: Andy Chiu <andybnac@gmail.com> Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20250407180838.42877-8-andybnac@gmail.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-03Merge tag 'bitmap-for-6.16-rc1' of https://github.com/norov/linuxLinus Torvalds
Pull bitmap updates from Yury Norov: - dead code cleanups for cpumasks and nodemasks (me) - fixed-width flavors of GENMASK() and BIT() (Vincent, Lucas and me) - FIELD_MODIFY() helper (Luo) - for_each_node_with_cpus() optimization (me) - bitmap-str fixes (Andy) * tag 'bitmap-for-6.16-rc1' of https://github.com/norov/linux: topology: make for_each_node_with_cpus() O(N) bitfield: Add FIELD_MODIFY() helper bitmap-str: Add missing header(s) bitmap-str: Get rid of 'extern' for function prototypes build_bug.h: more user friendly error messages in BUILD_BUG_ON_ZERO() test_bits: add tests for BIT_U*() test_bits: add tests for GENMASK_U*() drm/i915: Convert REG_GENMASK*() to fixed-width GENMASK_U*() bits: introduce fixed-type BIT_U*() bits: introduce fixed-type GENMASK_U*() bits: add comments and newlines to #if, #else and #endif directives cpumask: drop cpumask_assign_cpu() riscv: switch set_icache_stale_mask() to using non-atomic assign_cpu() cpumask: add non-atomic __assign_cpu() nodemask: drop nodes_shift
2025-05-11riscv: mm: call PUD/P4D ctor in special kernel pgtable allocKevin Brodsky
Constructors for PUD/P4D-level pgtables were recently introduced. They should be called for all pgtables; make sure they are called for special kernel mappings created by create_pgd_mapping() too. While at it also switch to using pagetable_alloc() like in alloc_{pte,pmd}_late(). Link: https://lkml.kernel.org/r/20250408095222.860601-13-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Linus Waleij <linus.walleij@linaro.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport <rppt@kernel.org> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Will Deacon <will@kernel.org> Cc: <x86@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11riscv: mm: clarify ctor mm argument in alloc_{pte,pmd}_lateKevin Brodsky
pagetable_{pte,pmd}_ctor(mm, ptdesc) skip the ptlock initialisation if mm is &init_mm. To avoid unnecessary overhead, it is therefore preferable to pass the actual mm associated to the PTE/PMD. Unfortunately, this proves challenging for alloc_{pte,pmd}_late() as the associated mm is not available at the point where they are called - in fact not even top-level functions like create_pgd_mapping() are passed the mm. As a result they both call the ctor with NULL as mm; this is safe but potentially wasteful. This is not a new situation, but let's add a couple of comments to clarify it. Link: https://lkml.kernel.org/r/20250408095222.860601-11-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Linus Waleij <linus.walleij@linaro.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport <rppt@kernel.org> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Will Deacon <will@kernel.org> Cc: <x86@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: pass mm down to pagetable_{pte,pmd}_ctorKevin Brodsky
Patch series "Always call constructor for kernel page tables", v2. There has been much confusion around exactly when page table constructors/destructors (pagetable_*_[cd]tor) are supposed to be called. They were initially introduced for user PTEs only (to support split page table locks), then at the PMD level for the same purpose. Accounting was added later on, starting at the PTE level and then moving to higher levels (PMD, PUD). Finally, with my earlier series "Account page tables at all levels" [1], the ctor/dtor is run for all levels, all the way to PGD. I thought this was the end of the story, and it hopefully is for user pgtables, but I was wrong for what concerns kernel pgtables. The current situation there makes very little sense: * At the PTE level, the ctor/dtor is not called (at least in the generic implementation). Specific helpers are used for kernel pgtables at this level (pte_{alloc,free}_kernel()) and those have never called the ctor/dtor, most likely because they were initially irrelevant in the kernel case. * At all other levels, the ctor/dtor is normally called. This is potentially wasteful at the PMD level (more on that later). This series aims to ensure that the ctor/dtor is always called for kernel pgtables, as it already is for user pgtables. Besides consistency, the main motivation is to guarantee that ctor/dtor hooks are systematically called; this makes it possible to insert hooks to protect page tables [2], for instance. There is however an extra challenge: split locks are not used for kernel pgtables, and it would therefore be wasteful to initialise them (ptlock_init()). It is worth clarifying exactly when split locks are used. They clearly are for user pgtables, but as illustrated in commit 61444cde9170 ("ARM: 8591/1: mm: use fully constructed struct pages for EFI pgd allocations"), they also are for special page tables like efi_mm. The one case where split locks are definitely unused is pgtables owned by init_mm; this is consistent with the behaviour of apply_to_pte_range(). The approach chosen in this series is therefore to pass the mm associated to the pgtables being constructed to pagetable_{pte,pmd}_ctor() (patch 1), and skip ptlock_init() if mm == &init_mm (patch 3 and 7). This makes it possible to call the PTE ctor/dtor from pte_{alloc,free}_kernel() without unintended consequences (patch 3). As a result the accounting functions are now called at all levels for kernel pgtables, and split locks are never initialised. In configurations where ptlocks are dynamically allocated (32-bit, PREEMPT_RT, etc.) and ARCH_ENABLE_SPLIT_PMD_PTLOCK is selected, this series results in the removal of a kmem_cache allocation for every kernel PMD. Additionally, for certain architectures that do not use <asm-generic/pgalloc.h> such as s390, the same optimisation occurs at the PTE level. === Things get more complicated when it comes to special pgtable allocators (patch 8-12). All architectures need such allocators to create initial kernel pgtables; we are not concerned with those as the ctor cannot be called so early in the boot sequence. However, those allocators may also be used later in the boot sequence or during normal operations. There are two main use-cases: 1. Mapping EFI memory: efi_mm (arm, arm64, riscv) 2. arch_add_memory(): init_mm The ctor is already explicitly run (at the PTE/PMD level) in the first case, as required for pgtables that are not associated with init_mm. However the same allocators may also be used for the second use-case (or others), and this is where it gets messy. Patch 1 calls the ctor with NULL as mm in those situations, as the actual mm isn't available. Practically this means that ptlocks will be unconditionally initialised. This is fine on arm - create_mapping_late() is only used for the EFI mapping. On arm64, __create_pgd_mapping() is also used by arch_add_memory(); patch 8/9/11 ensure that ctors are called at all levels with the appropriate mm. The situation is similar on riscv, but propagating the mm down to the ctor would require significant refactoring. Since they are already called unconditionally, this series leaves riscv no worse off - patch 10 adds comments to clarify the situation. From a cursory look at other architectures implementing arch_add_memory(), s390 and x86 may also need a similar treatment to add constructor calls. This is to be taken care of in a future version or as a follow-up. === The complications in those special pgtable allocators beg the question: does it really make sense to treat efi_mm and init_mm differently in e.g. apply_to_pte_range()? Maybe what we really need is a way to tell if an mm corresponds to user memory or not, and never use split locks for non-user mm's. Feedback and suggestions welcome! This patch (of 12): In preparation for calling constructors for all kernel page tables while eliding unnecessary ptlock initialisation, let's pass down the associated mm to the PTE/PMD level ctors. (These are the two levels where ptlocks are used.) In most cases the mm is already around at the point of calling the ctor so we simply pass it down. This is however not the case for special page table allocators: * arch/arm/mm/mmu.c * arch/arm64/mm/mmu.c * arch/riscv/mm/init.c In those cases, the page tables being allocated are either for standard kernel memory (init_mm) or special page directories, which may not be associated to any mm. For now let's pass NULL as mm; this will be refined where possible in future patches. No functional change in this patch. Link: https://lore.kernel.org/linux-mm/20250103184415.2744423-1-kevin.brodsky@arm.com/ [1] Link: https://lore.kernel.org/linux-hardening/20250203101839.1223008-1-kevin.brodsky@arm.com/ [2] Link: https://lkml.kernel.org/r/20250408095222.860601-1-kevin.brodsky@arm.com Link: https://lkml.kernel.org/r/20250408095222.860601-2-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390] Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Linus Waleij <linus.walleij@linaro.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport <rppt@kernel.org> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: <x86@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm/ptdump: split note_page() into level specific callbacksAnshuman Khandual
Patch series "mm/ptdump: Drop assumption that pxd_val() is u64", v2. Last argument passed down in note_page() is u64 assuming pxd_val() returned value (all page table levels) is 64 bit - which might not be the case going ahead when D128 page tables is enabled on arm64 platform. Besides pxd_val() is very platform specific and its type should not be assumed in generic MM. A similar problem exists for effective_prot(), although it is restricted to x86 platform. This series splits note_page() and effective_prot() into individual page table level specific callbacks which accepts corresponding pxd_t page table entry as an argument instead and later on all subscribing platforms could derive pxd_val() from the table entries as required and proceed as before. Define ptdesc_t type which describes the basic page table descriptor layout on arm64 platform. Subsequently all level specific pxxval_t descriptors are derived from ptdesc_t thus establishing a common original format, which can also be appropriate for page table entries, masks and protection values etc which are used at all page table levels. This patch (of 3): Last argument passed down in note_page() is u64 assuming pxd_val() returned value (all page table levels) is 64 bit - which might not be the case going ahead when D128 page tables is enabled on arm64 platform. Besides pxd_val() is very platform specific and its type should not be assumed in generic MM. Split note_page() into individual page table level specific callbacks which accepts corresponding pxd_t argument instead and then subscribing platforms just derive pxd_val() from the entries as required and proceed as earlier. Also add a note_page_flush() callback for flushing the last page table page that was being handled earlier via level = -1. Link: https://lkml.kernel.org/r/20250407053113.746295-1-anshuman.khandual@arm.com Link: https://lkml.kernel.org/r/20250407053113.746295-2-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-29riscv: switch set_icache_stale_mask() to using non-atomic assign_cpu()Yury Norov
The atomic cpumask_assign_cpu() follows non-atomic cpumask_setall(), which makes the whole operation non-atomic. Fix this by relaxing to non-atomic __assign_cpu(). Fixes: 7c1e5b9690b0e14 ("riscv: Disable preemption while handling PR_RISCV_CTX_SW_FENCEI_OFF") Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
2025-04-04Merge tag 'riscv-for-linus-6.15-mw1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V updates from Palmer Dabbelt: - The sub-architecture selection Kconfig system has been cleaned up, the documentation has been improved, and various detections have been fixed - The vector-related extensions dependencies are now validated when parsing from device tree and in the DT bindings - Misaligned access probing can be overridden via a kernel command-line parameter, along with various fixes to misalign access handling - Support for relocatable !MMU kernels builds - Support for hpge pfnmaps, which should improve TLB utilization - Support for runtime constants, which improves the d_hash() performance - Support for bfloat16, Zicbom, Zaamo, Zalrsc, Zicntr, Zihpm - Various fixes, including: - We were missing a secondary mmu notifier call when flushing the tlb which is required for IOMMU - Fix ftrace panics by saving the registers as expected by ftrace - Fix a couple of stimecmp usage related to cpu hotplug - purgatory_start is now aligned as per the STVEC requirements - A fix for hugetlb when calculating the size of non-present PTEs * tag 'riscv-for-linus-6.15-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (65 commits) riscv: Add norvc after .option arch in runtime const riscv: Make sure toolchain supports zba before using zba instructions riscv/purgatory: 4B align purgatory_start riscv/kexec_file: Handle R_RISCV_64 in purgatory relocator selftests: riscv: fix v_exec_initval_nolibc.c riscv: Fix hugetlb retrieval of number of ptes in case of !present pte riscv: print hartid on bringup riscv: Add norvc after .option arch in runtime const riscv: Remove CONFIG_PAGE_OFFSET riscv: Support CONFIG_RELOCATABLE on riscv32 asm-generic: Always define Elf_Rel and Elf_Rela riscv: Support CONFIG_RELOCATABLE on NOMMU riscv: Allow NOMMU kernels to access all of RAM riscv: Remove duplicate CONFIG_PAGE_OFFSET definition RISC-V: errata: Use medany for relocatable builds dt-bindings: riscv: document vector crypto requirements dt-bindings: riscv: add vector sub-extension dependencies dt-bindings: riscv: d requires f RISC-V: add f & d extension validation checks RISC-V: add vector crypto extension validation checks ...
2025-04-03Merge tag 'riscv-mw2-6.15-rc1' of ↵Palmer Dabbelt
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/alexghiti/linux into for-next riscv patches for 6.15-rc1, part 2 * A bunch of fixes: - 2 fixes in the purgatory code which prevented kexec to work - Workaround an issue with gcc-15 * tag 'riscv-mw2-6.15-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/alexghiti/linux: riscv: Add norvc after .option arch in runtime const riscv: Make sure toolchain supports zba before using zba instructions riscv/purgatory: 4B align purgatory_start riscv/kexec_file: Handle R_RISCV_64 in purgatory relocator selftests: riscv: fix v_exec_initval_nolibc.c riscv: Fix hugetlb retrieval of number of ptes in case of !present pte riscv: print hartid on bringup dt-bindings: riscv: document vector crypto requirements dt-bindings: riscv: add vector sub-extension dependencies dt-bindings: riscv: d requires f RISC-V: add f & d extension validation checks RISC-V: add vector crypto extension validation checks RISC-V: add vector extension validation checks
2025-04-01Merge tag 'mm-nonmm-stable-2025-03-30-18-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - The series "powerpc/crash: use generic crashkernel reservation" from Sourabh Jain changes powerpc's kexec code to use more of the generic layers. - The series "get_maintainer: report subsystem status separately" from Vlastimil Babka makes some long-requested improvements to the get_maintainer output. - The series "ucount: Simplify refcounting with rcuref_t" from Sebastian Siewior cleans up and optimizing the refcounting in the ucount code. - The series "reboot: support runtime configuration of emergency hw_protection action" from Ahmad Fatoum improves the ability for a driver to perform an emergency system shutdown or reboot. - The series "Converge on using secs_to_jiffies() part two" from Easwar Hariharan performs further migrations from msecs_to_jiffies() to secs_to_jiffies(). - The series "lib/interval_tree: add some test cases and cleanup" from Wei Yang permits more userspace testing of kernel library code, adds some more tests and performs some cleanups. - The series "hung_task: Dump the blocking task stacktrace" from Masami Hiramatsu arranges for the hung_task detector to dump the stack of the blocking task and not just that of the blocked task. - The series "resource: Split and use DEFINE_RES*() macros" from Andy Shevchenko provides some cleanups to the resource definition macros. - Plus the usual shower of singleton patches - please see the individual changelogs for details. * tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits) mailmap: consolidate email addresses of Alexander Sverdlin fs/procfs: fix the comment above proc_pid_wchan() relay: use kasprintf() instead of fixed buffer formatting resource: replace open coded variant of DEFINE_RES() resource: replace open coded variants of DEFINE_RES_*_NAMED() resource: replace open coded variant of DEFINE_RES_NAMED_DESC() resource: split DEFINE_RES_NAMED_DESC() out of DEFINE_RES_NAMED() samples: add hung_task detector mutex blocking sample hung_task: show the blocker task if the task is hung on mutex kexec_core: accept unaccepted kexec segments' destination addresses watchdog/perf: optimize bytes copied and remove manual NUL-termination lib/interval_tree: fix the comment of interval_tree_span_iter_next_gap() lib/interval_tree: skip the check before go to the right subtree lib/interval_tree: add test case for span iteration lib/interval_tree: add test case for interval_tree_iter_xxx() helpers lib/rbtree: add random seed lib/rbtree: split tests lib/rbtree: enable userland test suite for rbtree related data structure checkpatch: describe --min-conf-desc-length scripts/gdb/symbols: determine KASLR offset on s390 ...
2025-04-01riscv: Fix hugetlb retrieval of number of ptes in case of !present pteAlexandre Ghiti
Ryan sent a fix [1] for arm64 that applies to riscv too: in some hugetlb functions, we must not use the pte value to get the size of a mapping because the pte may not be present. So use the already present size parameter for huge_pte_clear() and the newly introduced size parameter for huge_ptep_get_and_clear(). And make sure to gather A/D bits only on present ptes. Fixes: 82a1a1f3bfb6 ("riscv: mm: support Svnapot in hugetlb page") Link: https://lore.kernel.org/all/20250217140419.1702389-1-ryan.roberts@arm.com/ [1] Link: https://lore.kernel.org/r/20250317072551.572169-1-alexghiti@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
2025-03-26Merge patch series "riscv: Relocatable NOMMU kernels"Palmer Dabbelt
Samuel Holland <samuel.holland@sifive.com> says: Currently, RISC-V NOMMU kernels are linked at CONFIG_PAGE_OFFSET, and since they are not relocatable, must be loaded at this address as well. CONFIG_PAGE_OFFSET is not a user-visible Kconfig option, so its value is not obvious, and users must patch the kernel source if they want to load it at a different address. Make NOMMU kernels more portable by making them relocatable by default. This allows a single kernel binary to work when loaded at any address. * b4-shazam-merge: riscv: Remove CONFIG_PAGE_OFFSET riscv: Support CONFIG_RELOCATABLE on riscv32 asm-generic: Always define Elf_Rel and Elf_Rela riscv: Support CONFIG_RELOCATABLE on NOMMU riscv: Allow NOMMU kernels to access all of RAM riscv: Remove duplicate CONFIG_PAGE_OFFSET definition Link: https://lore.kernel.org/r/20241026171441.3047904-1-samuel.holland@sifive.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2025-03-26riscv: Remove CONFIG_PAGE_OFFSETSamuel Holland
The current definition of CONFIG_PAGE_OFFSET is problematic for a couple of reasons: 1) The value is misleading for normal 64-bit kernels, where it is overridden at runtime if Sv48 or Sv39 is chosen. This is especially the case for XIP kernels, which always use Sv39. 2) The option is not user-visible, but for NOMMU kernels it must be a valid RAM address, and for !RELOCATABLE it must additionally be the exact address where the kernel is loaded. Fix both of these by removing the option. 1) For MMU kernels, drop the indirection through Kconfig. Additionally, for XIP, drop the indirection through kernel_map. 2) For NOMMU kernels, use the user-visible physical RAM base if provided. Otherwise, force the kernel to be relocatable. Signed-off-by: Samuel Holland <samuel.holland@sifive.com> Reviewed-by: Jesse Taube <mr.bossman075@gmail.com> Link: https://lore.kernel.org/r/20241026171441.3047904-7-samuel.holland@sifive.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2025-03-26riscv: Support CONFIG_RELOCATABLE on riscv32Samuel Holland
When adjusted to use the correctly-sized ELF types, relocate_kernel() works on riscv32 as well. The caveat about crossing an intermediate page table boundary does not apply to riscv32, since for Sv32 the early kernel mapping uses only PGD entries. Since KASLR is not yet supported on riscv32, this option is mostly useful for NOMMU. Signed-off-by: Samuel Holland <samuel.holland@sifive.com> Link: https://lore.kernel.org/r/20241026171441.3047904-6-samuel.holland@sifive.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2025-03-26riscv: Support CONFIG_RELOCATABLE on NOMMUSamuel Holland
Move relocate_kernel() out of the CONFIG_MMU block so it can be called from the NOMMU version of setup_vm(). Set some offsets in kernel_map so relocate_kernel() does not need to be modified. Relocatable NOMMU kernels can be loaded to any physical memory address; they no longer depend on CONFIG_PAGE_OFFSET. Signed-off-by: Samuel Holland <samuel.holland@sifive.com> Link: https://lore.kernel.org/r/20241026171441.3047904-4-samuel.holland@sifive.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2025-03-18riscv: mm: Don't use %pK through printkThomas Weißschuh
Restricted pointers ("%pK") are not meant to be used through printk(). It can unintentionally expose security sensitive, raw pointer values. Use regular pointer formatting instead. Link: https://lore.kernel.org/lkml/20250113171731-dc10e3c1-da64-4af0-b767-7c7070468023@linutronix.de/ Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20250217-restricted-pointers-riscv-v1-1-72a078076a76@linutronix.de Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
2025-03-18riscv: Fix a comment typo in set_mm_asid()Chin Yik Ming
s/verion/version Signed-off-by: Chin Yik Ming <yikming2222@gmail.com> Reviewed-by: Charlie Jenkins <charlie@rivosinc.com> Link: https://lore.kernel.org/r/20241114212725.4172401-1-yikming2222@gmail.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
2025-03-18riscv: Call secondary mmu notifier when flushing the tlbAlexandre Ghiti
This is required to allow the IOMMU driver to correctly flush its own TLB. Reviewed-by: Clément Léger <cleger@rivosinc.com> Reviewed-by: Samuel Holland <samuel.holland@sifive.com> Link: https://lore.kernel.org/r/20250113142424.30487-1-alexghiti@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
2025-03-17arch, mm: make releasing of memory to page allocator more explicitMike Rapoport (Microsoft)
The point where the memory is released from memblock to the buddy allocator is hidden inside arch-specific mem_init()s and the call to memblock_free_all() is needlessly duplicated in every artiste cure and after introduction of arch_mm_preinit() hook, mem_init() implementation on many architecture only contains the call to memblock_free_all(). Pull memblock_free_all() call into mm_core_init() and drop mem_init() on relevant architectures to make it more explicit where the free memory is released from memblock to the buddy allocator and to reduce code duplication in architecture specific code. Link: https://lkml.kernel.org/r/20250313135003.836600-14-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86] Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Tested-by: Mark Brown <broonie@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Guo Ren (csky) <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Weinberger <richard@nod.at> Cc: Russel King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17arch, mm: introduce arch_mm_preinitMike Rapoport (Microsoft)
Currently, implementation of mem_init() in every architecture consists of one or more of the following: * initializations that must run before page allocator is active, for instance swiotlb_init() * a call to memblock_free_all() to release all the memory to the buddy allocator * initializations that must run after page allocator is ready and there is no arch-specific hook other than mem_init() for that, like for example register_page_bootmem_info() in x86 and sparc64 or simple setting of mem_init_done = 1 in several architectures * a bunch of semi-related stuff that apparently had no better place to live, for example a ton of BUILD_BUG_ON()s in parisc. Introduce arch_mm_preinit() that will be the first thing called from mm_core_init(). On architectures that have initializations that must happen before the page allocator is ready, move those into arch_mm_preinit() along with the code that does not depend on ordering with page allocator setup. On several architectures this results in reduction of mem_init() to a single call to memblock_free_all() that allows its consolidation next. Link: https://lkml.kernel.org/r/20250313135003.836600-13-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86] Tested-by: Mark Brown <broonie@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Guo Ren (csky) <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Weinberger <richard@nod.at> Cc: Russel King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17arch, mm: set high_memory in free_area_init()Mike Rapoport (Microsoft)
high_memory defines upper bound on the directly mapped memory. This bound is defined by the beginning of ZONE_HIGHMEM when a system has high memory and by the end of memory otherwise. All this is known to generic memory management initialization code that can set high_memory while initializing core mm structures. Add a generic calculation of high_memory to free_area_init() and remove per-architecture calculation except for the architectures that set and use high_memory earlier than that. Link: https://lkml.kernel.org/r/20250313135003.836600-11-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86] Tested-by: Mark Brown <broonie@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Guo Ren (csky) <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Weinberger <richard@nod.at> Cc: Russel King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17arch, mm: set max_mapnr when allocating memory map for FLATMEMMike Rapoport (Microsoft)
max_mapnr is essentially the size of the memory map for systems that use FLATMEM. There is no reason to calculate it in each and every architecture when it's anyway calculated in alloc_node_mem_map(). Drop setting of max_mapnr from architecture code and set it once in alloc_node_mem_map(). While on it, move definition of mem_map and max_mapnr to mm/mm_init.c so there won't be two copies for MMU and !MMU variants. Link: https://lkml.kernel.org/r/20250313135003.836600-10-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86] Tested-by: Mark Brown <broonie@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Guo Ren (csky) <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Weinberger <richard@nod.at> Cc: Russel King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: rename GENERIC_PTDUMP and PTDUMP_COREAnshuman Khandual
Platforms subscribe into generic ptdump implementation via GENERIC_PTDUMP. But generic ptdump gets enabled via PTDUMP_CORE. These configs combination is confusing as they sound very similar and does not differentiate between platform's feature subscription and feature enablement for ptdump. Rename the configs as ARCH_HAS_PTDUMP and PTDUMP making it more clear and improve readability. Link: https://lkml.kernel.org/r/20250226122404.1927473-6-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> (powerpc) Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Cc: Will Deacon <will@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marc Zyngier <maz@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Steven Price <steven.price@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16crash: remove an unused argument from reserve_crashkernel_generic()Sourabh Jain
cmdline argument is not used in reserve_crashkernel_generic() so remove it. Correspondingly, all the callers have been updated as well. No functional change intended. Link: https://lkml.kernel.org/r/20250131113830.925179-3-sourabhjain@linux.ibm.com Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: support tlbbatch flush for a range of PTEsBarry Song
This patch lays the groundwork for supporting batch PTE unmapping in try_to_unmap_one(). It introduces range handling for TLB batch flushing, with the range currently set to the size of PAGE_SIZE. The function __flush_tlb_range_nosync() is architecture-specific and is only used within arch/arm64. This function requires the mm structure instead of the vma structure. To allow its reuse by arch_tlbbatch_add_pending(), which operates with mm but not vma, this patch modifies the argument of __flush_tlb_range_nosync() to take mm as its parameter. Link: https://lkml.kernel.org/r/20250214093015.51024-3-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: Will Deacon <will@kernel.org> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shaoqin Huang <shahuang@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chis Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kairui Song <kasong@tencent.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mauricio Faria de Oliveira <mfo@canonical.com> Cc: Tangquan Zheng <zhengtangquan@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-27mm: hugetlb: Add huge page size param to huge_ptep_get_and_clear()Ryan Roberts
In order to fix a bug, arm64 needs to be told the size of the huge page for which the huge_pte is being cleared in huge_ptep_get_and_clear(). Provide for this by adding an `unsigned long sz` parameter to the function. This follows the same pattern as huge_pte_clear() and set_huge_pte_at(). This commit makes the required interface modifications to the core mm as well as all arches that implement this function (arm64, loongarch, mips, parisc, powerpc, riscv, s390, sparc). The actual arm64 bug will be fixed in a separate commit. Cc: stable@vger.kernel.org Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit") Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> # riscv Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> # s390 Link: https://lore.kernel.org/r/20250226120656.2400136-2-ryan.roberts@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2025-01-31Merge tag 'riscv-for-linus-6.14-mw1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V updates from Palmer Dabbelt: - The PH1520 pinctrl and dwmac drivers are enabeled in defconfig - A redundant AQRL barrier has been removed from the futex cmpxchg implementation - Support for the T-Head vector extensions, which includes exposing these extensions to userspace on systems that implement them - Some more page table information is now printed on die() and systems that cause PA overflows * tag 'riscv-for-linus-6.14-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: riscv: add a warning when physical memory address overflows riscv/mm/fault: add show_pte() before die() riscv: Add ghostwrite vulnerability selftests: riscv: Support xtheadvector in vector tests selftests: riscv: Fix vector tests riscv: hwprobe: Document thead vendor extensions and xtheadvector extension riscv: hwprobe: Add thead vendor extension probing riscv: vector: Support xtheadvector save/restore riscv: Add xtheadvector instruction definitions riscv: csr: Add CSR encodings for CSR_VXRM/CSR_VXSAT RISC-V: define the elements of the VCSR vector CSR riscv: vector: Use vlenb from DT for thead riscv: Add thead and xtheadvector as a vendor extension riscv: dts: allwinner: Add xtheadvector to the D1/D1s devicetree dt-bindings: cpus: add a thead vlen register length property dt-bindings: riscv: Add xtheadvector ISA extension description RISC-V: Mark riscv_v_init() as __init riscv: defconfig: drop RT_GROUP_SCHED=y riscv/futex: Optimize atomic cmpxchg riscv: defconfig: enable pinctrl and dwmac support for TH1520
2025-01-29riscv: add a warning when physical memory address overflowsYunhui Cui
The part of physical memory that exceeds the size of the linear mapping will be discarded. When the system starts up normally, a warning message will be printed to prevent confusion caused by the mismatch between the system memory and the actual physical memory. Signed-off-by: Yunhui Cui <cuiyunhui@bytedance.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20240814062625.19794-1-cuiyunhui@bytedance.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>