|
TCR2_ELx.E0POE is set during smp_init().
However, this bit is not reprogrammed when the CPU is suspended and
later resumes via cpu_resume(): __cpu_setup() does not re-enable E0POE,
and there is no save/restore logic for the TCR2_ELx system register.
As a result, the E0POE feature no longer works after cpu_resume().
To address this, save and restore TCR2_EL1 in the cpu_suspend()/cpu_resume()
path, rather than adding the logic to __cpu_setup(), to accommodate
possible future extensions of the TCR2_ELx register.
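As a rough sketch of the idea (the helper names and hook points below are
illustrative only, not the actual patch; the real change lives in the arm64
suspend/resume code and uses the kernel-internal sysreg accessors):
  /*
   * Hedged sketch: capture TCR2_EL1 on suspend and write it back on
   * resume so that E0POE (and any future TCR2_ELx bits) survive the
   * power cycle. tcr2_suspend_save()/tcr2_resume_restore() are made-up
   * names for illustration.
   */
  static u64 saved_tcr2;

  static void tcr2_suspend_save(void)
  {
  	if (system_supports_poe())
  		saved_tcr2 = read_sysreg_s(SYS_TCR2_EL1);
  }

  static void tcr2_resume_restore(void)
  {
  	if (system_supports_poe()) {
  		write_sysreg_s(saved_tcr2, SYS_TCR2_EL1);
  		isb();
  	}
  }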
Fixes: bf83dae90fbc ("arm64: enable the Permission Overlay Extension for EL0")
Cc: <stable@vger.kernel.org> # 6.12.x
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Running KASAN KUnit tests with {HW,SW}_TAGS mode triggers a fault in
change_memory_common():
Call trace:
change_memory_common+0x168/0x210 (P)
set_memory_ro+0x20/0x48
vmalloc_helpers_tags+0xe8/0x338
kunit_try_run_case+0x74/0x188
kunit_generic_run_threadfn_adapter+0x30/0x70
kthread+0x11c/0x200
ret_from_fork+0x10/0x20
---[ end trace 0000000000000000 ]---
# vmalloc_helpers_tags: try faulted
not ok 67 vmalloc_helpers_tags
Commit a06494adb7ef ("arm64: mm: use untagged address to calculate page index")
fixed a KASAN warning in the BPF subsystem by adding kasan_reset_tag() to
the index calculation. In the execmem flow:
bpf_prog_pack_alloc()
-> bpf_jit_alloc_exec()
-> execmem_alloc()
The returned address from execmem_vmalloc/execmem_cache_alloc is passed
through kasan_reset_tag(), so start has no tag while area->addr still
retains the original tag. The fix correctly handled this case by resetting
the tag on area->addr:
(start - (unsigned long)kasan_reset_tag(area->addr)) >> PAGE_SHIFT
However, in normal vmalloc paths, both start and area->addr have matching
tags (or no tags). Resetting only area->addr causes a tag mismatch in
the subtraction (a tagged start minus an untagged area->addr), resulting
in an incorrect index.
Fix this by resetting tags on both addresses in the index calculation.
This ensures correct results regardless of the tag state of either address.
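A minimal sketch of the resulting calculation (the variable name is
illustrative; the surrounding change_memory_common() context is omitted):
  /* Reset the tag on both addresses so the subtraction is tag-consistent. */
  unsigned long pgidx = ((unsigned long)kasan_reset_tag((void *)start) -
  		       (unsigned long)kasan_reset_tag(area->addr)) >> PAGE_SHIFT;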
Tested with KASAN KUnit tests under CONFIG_KASAN_GENERIC,
CONFIG_KASAN_SW_TAGS, and CONFIG_KASAN_HW_TAGS - all pass. Also verified
the original BPF KASAN warning from [1] is still fixed.
[1] https://lore.kernel.org/all/20251118164115.GA3977565@ax162/
Fixes: a06494adb7ef ("arm64: mm: use untagged address to calculate page index")
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
TL;DR: checking force_pte_mapping() in arch_kfence_init_pool() is
sufficient
Commit ce2b3a50ad92 ("arm64: mm: Don't sleep in
split_kernel_leaf_mapping() when in atomic context") recently added
an arm64 implementation of arch_kfence_init_pool() to ensure that
the KFENCE pool is PTE-mapped. Assuming that the pool was not
initialised early, block splitting is necessary if the linear
mapping is not fully PTE-mapped, in other words if
force_pte_mapping() is false.
arch_kfence_init_pool() currently makes another check: whether
BBML2-noabort is supported, i.e. whether we are *able* to split
block mappings. This check is however unnecessary, because
force_pte_mapping() is always true if KFENCE is enabled and
BBML2-noabort is not supported. This must be the case by design,
since KFENCE requires PTE-mapped pages in all cases. We can
therefore remove that check.
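A hedged sketch of the simplified hook (the split helper name is made up;
only the force_pte_mapping() check reflects the point above):
  bool arch_kfence_init_pool(void)
  {
  	/*
  	 * If the linear map is force-PTE-mapped - always the case when
  	 * KFENCE is enabled but BBML2-noabort is absent - there is
  	 * nothing to split.
  	 */
  	if (force_pte_mapping())
  		return true;

  	/* Otherwise split the pool's block mappings down to PTEs. */
  	return kfence_split_pool(__kfence_pool, KFENCE_POOL_SIZE) == 0;
  }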
The situation is different in split_kernel_leaf_mapping(), as that
function is called unconditionally regardless of the configuration.
If BBML2-noabort is not supported, it cannot do anything and bails
out. If force_pte_mapping() is true, there is nothing to do and it
also bails out, but these are independent checks.
Commit 53357f14f924 ("arm64: mm: Tidy up force_pte_mapping()")
grouped these checks into a helper, split_leaf_mapping_possible().
This isn't so helpful, as only split_kernel_leaf_mapping() should
check both. Revert the parts of that commit that introduced the
helper, reintroducing the more accurate comments in
split_kernel_leaf_mapping().
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"__vmalloc()/kvmalloc() and no-block support" (Uladzislau Rezki)
Rework the vmalloc() code to support non-blocking allocations
(GFP_ATOMIC, GFP_NOWAIT)
"ksm: fix exec/fork inheritance" (xu xin)
Fix a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not
inherited across fork/exec
"mm/zswap: misc cleanup of code and documentations" (SeongJae Park)
Some light maintenance work on the zswap code
"mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" (Mauricio Faria de Oliveira)
Enhance the /sys/kernel/debug/page_owner debug feature by adding
unique identifiers to differentiate the various stack traces so
that userspace monitoring tools can better match stack traces over
time
"mm/page_alloc: pcp->batch cleanups" (Joshua Hahn)
Minor alterations to the page allocator's per-cpu-pages feature
"Improve UFFDIO_MOVE scalability by removing anon_vma lock" (Lokesh Gidra)
Address a scalability issue in userfaultfd's UFFDIO_MOVE operation
"kasan: cleanups for kasan_enabled() checks" (Sabyrzhan Tasbolatov)
"drivers/base/node: fold node register and unregister functions" (Donet Tom)
Clean up the NUMA node handling code a little
"mm: some optimizations for prot numa" (Kefeng Wang)
Cleanups and small optimizations to the NUMA allocation hinting
code
"mm/page_alloc: Batch callers of free_pcppages_bulk" (Joshua Hahn)
Address long lock hold times at boot on large machines. These were
causing (harmless) softlockup warnings
"optimize the logic for handling dirty file folios during reclaim" (Baolin Wang)
Remove some now-unnecessary work from page reclaim
"mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" (SeongJae Park)
Enhance the DAMOS auto-tuning feature
"mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" (Quanmin Yan)
Fix DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace
configuration
"expand mmap_prepare functionality, port more users" (Lorenzo Stoakes)
Enhance the new(ish) file_operations.mmap_prepare() method and port
additional callsites from the old ->mmap() over to ->mmap_prepare()
"Fix stale IOTLB entries for kernel address space" (Lu Baolu)
Fix a bug (and possible security issue on non-x86) in the IOMMU
code. In some situations the IOMMU could be left hanging onto a
stale kernel pagetable entry
"mm/huge_memory: cleanup __split_unmapped_folio()" (Wei Yang)
Clean up and optimize the folio splitting code
"mm, swap: misc cleanup and bugfix" (Kairui Song)
Some cleanups and a minor fix in the swap discard code
"mm/damon: misc documentation fixups" (SeongJae Park)
"mm/damon: support pin-point targets removal" (SeongJae Park)
Permit userspace to remove a specific monitoring target in the
middle of the current targets list
"mm: MISC follow-up patches for linux/pgalloc.h" (Harry Yoo)
A couple of cleanups related to mm header file inclusion
"mm/swapfile.c: select swap devices of default priority round robin" (Baoquan He)
Improve the selection of swap devices for NUMA machines
"mm: Convert memory block states (MEM_*) macros to enums" (Israel Batista)
Change the memory block labels from macros to enums so they will
appear in kernel debug info
"ksm: perform a range-walk to jump over holes in break_ksm" (Pedro Demarchi Gomes)
Address an inefficiency when KSM unmerges an address range
"mm/damon/tests: fix memory bugs in kunit tests" (SeongJae Park)
Fix leaks and unhandled malloc() failures in DAMON userspace unit
tests
"some cleanups for pageout()" (Baolin Wang)
Clean up a couple of minor things in the page scanner's
writeback-for-eviction code
"mm/hugetlb: refactor sysfs/sysctl interfaces" (Hui Zhu)
Move hugetlb's sysfs/sysctl handling code into a new file
"introduce VM_MAYBE_GUARD and make it sticky" (Lorenzo Stoakes)
Make the VMA guard regions available in /proc/pid/smaps and
improve the mergeability of guarded VMAs
"mm: perform guard region install/remove under VMA lock" (Lorenzo Stoakes)
Reduce mmap lock contention for callers performing VMA guard region
operations
"vma_start_write_killable" (Matthew Wilcox)
Start work on permitting applications to be killed when they are
waiting on a read_lock on the VMA lock
"mm/damon/tests: add more tests for online parameters commit" (SeongJae Park)
Add additional userspace testing of DAMON's "commit" feature
"mm/damon: misc cleanups" (SeongJae Park)
"make VM_SOFTDIRTY a sticky VMA flag" (Lorenzo Stoakes)
Address the possible loss of a VMA's VM_SOFTDIRTY flag when that
VMA is merged with another
"mm: support device-private THP" (Balbir Singh)
Introduce support for Transparent Huge Page (THP) migration in zone
device-private memory
"Optimize folio split in memory failure" (Zi Yan)
"mm/huge_memory: Define split_type and consolidate split support checks" (Wei Yang)
Some more cleanups in the folio splitting code
"mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" (Lorenzo Stoakes)
Clean up our handling of pagetable leaf entries by introducing the
concept of 'software leaf entries', of type softleaf_t
"reparent the THP split queue" (Muchun Song)
Reparent the THP split queue to its parent memcg. This is in
preparation for addressing the long-standing "dying memcg" problem,
wherein dead memcgs linger for too long, consuming memory
resources
"unify PMD scan results and remove redundant cleanup" (Wei Yang)
A little cleanup in the hugepage collapse code
"zram: introduce writeback bio batching" (Sergey Senozhatsky)
Improve zram writeback efficiency by introducing batched bio
writeback support
"memcg: cleanup the memcg stats interfaces" (Shakeel Butt)
Clean up our handling of the interrupt safety of some memcg stats
"make vmalloc gfp flags usage more apparent" (Vishal Moola)
Clean up vmalloc's handling of incoming GFP flags
"mm: Add soft-dirty and uffd-wp support for RISC-V" (Chunyan Zhang)
Teach soft dirty and userfaultfd write protect tracking to use
RISC-V's Svrsw60t59b extension
"mm: swap: small fixes and comment cleanups" (Youngjun Park)
Fix a small bug and clean up some of the swap code
"initial work on making VMA flags a bitmap" (Lorenzo Stoakes)
Start work on converting the vma struct's flags to a bitmap, so we
stop running out of them, especially on 32-bit
"mm/swapfile: fix and cleanup swap list iterations" (Youngjun Park)
Address a possible bug in the swap discard code and clean things
up a little
[ This merge also reverts commit ebb9aeb980e5 ("vfio/nvgrace-gpu:
register device memory for poison handling") because it looks
broken to me, I've asked for clarification - Linus ]
* tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
mm: fix vma_start_write_killable() signal handling
mm/swapfile: use plist_for_each_entry in __folio_throttle_swaprate
mm/swapfile: fix list iteration when next node is removed during discard
fs/proc/task_mmu.c: fix make_uffd_wp_huge_pte() huge pte handling
mm/kfence: add reboot notifier to disable KFENCE on shutdown
memcg: remove inc/dec_lruvec_kmem_state helpers
selftests/mm/uffd: initialize char variable to Null
mm: fix DEBUG_RODATA_TEST indentation in Kconfig
mm: introduce VMA flags bitmap type
tools/testing/vma: eliminate dependency on vma->__vm_flags
mm: simplify and rename mm flags function for clarity
mm: declare VMA flags by bit
zram: fix a spelling mistake
mm/page_alloc: optimize lowmem_reserve max lookup using its semantic monotonicity
mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
pagemap: update BUDDY flag documentation
mm: swap: remove scan_swap_map_slots() references from comments
mm: swap: change swap_alloc_slow() to void
mm, swap: remove redundant comment for read_swap_cache_async
mm, swap: use SWP_SOLIDSTATE to determine if swap is rotational
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"These are the arm64 updates for 6.19.
The biggest part is the Arm MPAM driver under drivers/resctrl/.
There's a patch touching mm/ to handle spurious faults for huge pmd
(similar to the pte version). The corresponding arm64 part allows us
to avoid the TLB maintenance if a (huge) page is reused after a write
fault. There's EFI refactoring to allow runtime services with
preemption enabled and the rest is the usual perf/PMU updates and
several cleanups/typos.
Summary:
Core features:
- Basic Arm MPAM (Memory system resource Partitioning And Monitoring)
driver under drivers/resctrl/ which makes use of the fs/resctrl/ API
Perf and PMU:
- Avoid cycle counter on multi-threaded CPUs
- Extend CSPMU device probing and add additional filtering support
for NVIDIA implementations
- Add support for the PMUs on the NoC S3 interconnect
- Add additional compatible strings for new Cortex and C1 CPUs
- Add support for data source filtering to the SPE driver
- Add support for i.MX8QM and "DB" PMU in the imx PMU driver
Memory management:
- Avoid broadcast TLBI if page reused in write fault
- Elide TLB invalidation if the old PTE was not valid
- Drop redundant cpu_set_*_tcr_t0sz() macros
- Propagate pgtable_alloc() errors outside of __create_pgd_mapping()
- Propagate return value from __change_memory_common()
ACPI and EFI:
- Call EFI runtime services without disabling preemption
- Remove unused ACPI function
Miscellaneous:
- ptrace support to disable streaming on SME-only systems
- Improve sysreg generation to include a 'Prefix' descriptor
- Replace __ASSEMBLY__ with __ASSEMBLER__
- Align register dumps in the kselftest zt-test
- Remove some no longer used macros/functions
- Various spelling corrections"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (94 commits)
arm64/mm: Document why linear map split failure upon vm_reset_perms is not problematic
arm64/pageattr: Propagate return value from __change_memory_common
arm64/sysreg: Remove unused define ARM64_FEATURE_FIELD_BITS
KVM: arm64: selftests: Consider all 7 possible levels of cache
KVM: arm64: selftests: Remove ARM64_FEATURE_FIELD_BITS and its last user
arm64: atomics: lse: Remove unused parameters from ATOMIC_FETCH_OP_AND macros
Documentation/arm64: Fix the typo of register names
ACPI: GTDT: Get rid of acpi_arch_timer_mem_init()
perf: arm_spe: Add support for filtering on data source
perf: Add perf_event_attr::config4
perf/imx_ddr: Add support for PMU in DB (system interconnects)
perf/imx_ddr: Get and enable optional clks
perf/imx_ddr: Move ida_alloc() from ddr_perf_init() to ddr_perf_probe()
dt-bindings: perf: fsl-imx-ddr: Add compatible string for i.MX8QM, i.MX8QXP and i.MX8DXL
arm64: remove duplicate ARCH_HAS_MEM_ENCRYPT
arm64: mm: use untagged address to calculate page index
MAINTAINERS: new entry for MPAM Driver
arm_mpam: Add kunit tests for props_mismatch()
arm_mpam: Add kunit test for bitmap reset
arm_mpam: Add helper to reset saved mbwu state
...
|
|
* for-next/set_memory:
: Fix + documentation for the arm64 change_memory_common()
arm64/mm: Document why linear map split failure upon vm_reset_perms is not problematic
arm64/pageattr: Propagate return value from __change_memory_common
|
|
* for-next/sysreg:
: arm64 sysreg updates/cleanups
arm64/sysreg: Remove unused define ARM64_FEATURE_FIELD_BITS
KVM: arm64: selftests: Consider all 7 possible levels of cache
KVM: arm64: selftests: Remove ARM64_FEATURE_FIELD_BITS and its last user
arm64/sysreg: Add ICH_VMCR_EL2
arm64/sysreg: Move generation of RES0/RES1/UNKN to function
arm64/sysreg: Support feature-specific fields with 'Prefix' descriptor
arm64/sysreg: Fix checks for incomplete sysreg definitions
arm64/sysreg: Replace TCR_EL1 field macros
|
|
'for-next/efi-preempt', 'for-next/assembler-macro', 'for-next/typos', 'for-next/sme-ptrace-disable', 'for-next/local-tlbi-page-reused', 'for-next/mpam', 'for-next/acpi' and 'for-next/documentation', remote-tracking branch 'arm64/for-next/perf' into for-next/core
* arm64/for-next/perf:
perf: arm_spe: Add support for filtering on data source
perf: Add perf_event_attr::config4
perf/imx_ddr: Add support for PMU in DB (system interconnects)
perf/imx_ddr: Get and enable optional clks
perf/imx_ddr: Move ida_alloc() from ddr_perf_init() to ddr_perf_probe()
dt-bindings: perf: fsl-imx-ddr: Add compatible string for i.MX8QM, i.MX8QXP and i.MX8DXL
arch_topology: Provide a stub topology_core_has_smt() for !CONFIG_GENERIC_ARCH_TOPOLOGY
perf/arm-ni: Fix and optimise register offset calculation
perf: arm_pmuv3: Add new Cortex and C1 CPU PMUs
perf: arm_cspmu: fix error handling in arm_cspmu_impl_unregister()
perf/arm-ni: Add NoC S3 support
perf/arm_cspmu: nvidia: Add pmevfiltr2 support
perf/arm_cspmu: nvidia: Add revision id matching
perf/arm_cspmu: Add pmpidr support
perf/arm_cspmu: Add callback to reset filter config
perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores
* for-next/misc:
: Miscellaneous patches
arm64: atomics: lse: Remove unused parameters from ATOMIC_FETCH_OP_AND macros
arm64: remove duplicate ARCH_HAS_MEM_ENCRYPT
arm64: mm: use untagged address to calculate page index
arm64: mm: make linear mapping permission update more robust for patial range
arm64/mm: Elide TLB flush in certain pte protection transitions
arm64/mm: Rename try_pgd_pgtable_alloc_init_mm
arm64/mm: Allow __create_pgd_mapping() to propagate pgtable_alloc() errors
arm64: add unlikely hint to MTE async fault check in el0_svc_common
arm64: acpi: add newline to deferred APEI warning
arm64: entry: Clean out some indirection
arm64/mm: Ensure PGD_SIZE is aligned to 64 bytes when PA_BITS = 52
arm64/mm: Drop cpu_set_[default|idmap]_tcr_t0sz()
arm64: remove unused ARCH_PFN_OFFSET
arm64: use SOFTIRQ_ON_OWN_STACK for enabling softirq stack
arm64: Remove assertion on CONFIG_VMAP_STACK
* for-next/kselftest:
: arm64 kselftest patches
kselftest/arm64: Align zt-test register dumps
* for-next/efi-preempt:
: arm64: Make EFI calls preemptible
arm64/efi: Call EFI runtime services without disabling preemption
arm64/efi: Move uaccess en/disable out of efi_set_pgd()
arm64/efi: Drop efi_rt_lock spinlock from EFI arch wrapper
arm64/fpsimd: Permit kernel mode NEON with IRQs off
arm64/fpsimd: Don't warn when EFI execution context is preemptible
efi/runtime-wrappers: Keep track of the efi_runtime_lock owner
efi: Add missing static initializer for efi_mm::cpus_allowed_lock
* for-next/assembler-macro:
: arm64: Replace __ASSEMBLY__ with __ASSEMBLER__ in headers
arm64: Replace __ASSEMBLY__ with __ASSEMBLER__ in non-uapi headers
arm64: Replace __ASSEMBLY__ with __ASSEMBLER__ in uapi headers
* for-next/typos:
: Random typo/spelling fixes
arm64: Fix double word in comments
arm64: Fix typos and spelling errors in comments
* for-next/sme-ptrace-disable:
: Support disabling streaming mode via ptrace on SME only systems
kselftest/arm64: Cover disabling streaming mode without SVE in fp-ptrace
kselftst/arm64: Test NT_ARM_SVE FPSIMD format writes on non-SVE systems
arm64/sme: Support disabling streaming mode via ptrace on SME only systems
* for-next/local-tlbi-page-reused:
: arm64, mm: avoid TLBI broadcast if page reused in write fault
arm64, tlbflush: don't TLBI broadcast if page reused in write fault
mm: add spurious fault fixing support for huge pmd
* for-next/mpam: (34 commits)
: Basic Arm MPAM driver (more to follow)
MAINTAINERS: new entry for MPAM Driver
arm_mpam: Add kunit tests for props_mismatch()
arm_mpam: Add kunit test for bitmap reset
arm_mpam: Add helper to reset saved mbwu state
arm_mpam: Use long MBWU counters if supported
arm_mpam: Probe for long/lwd mbwu counters
arm_mpam: Consider overflow in bandwidth counter state
arm_mpam: Track bandwidth counter state for power management
arm_mpam: Add mpam_msmon_read() to read monitor value
arm_mpam: Add helpers to allocate monitors
arm_mpam: Probe and reset the rest of the features
arm_mpam: Allow configuration to be applied and restored during cpu online
arm_mpam: Use a static key to indicate when mpam is enabled
arm_mpam: Register and enable IRQs
arm_mpam: Extend reset logic to allow devices to be reset any time
arm_mpam: Add a helper to touch an MSC from any CPU
arm_mpam: Reset MSC controls from cpuhp callbacks
arm_mpam: Merge supported features during mpam_enable() into mpam_class
arm_mpam: Probe the hardware features resctrl supports
arm_mpam: Add helpers for managing the locking around the mon_sel registers
...
* for-next/acpi:
: arm64 acpi updates
ACPI: GTDT: Get rid of acpi_arch_timer_mem_init()
* for-next/documentation:
: arm64 Documentation updates
Documentation/arm64: Fix the typo of register names
|
|
Consider the following code path:
(1) vmalloc -> (2) set_vm_flush_reset_perms -> (3) set_memory_ro/set_memory_rox
-> .... (4) use the mapping .... -> (5) vfree -> (6) vm_reset_perms
-> (7) set_area_direct_map.
Or, it may happen that we encounter a failure at (3) and jump directly to (5).
In both cases, (7) may fail due to a linear map split failure. But we care
about its success *only* for the region which was successfully changed by
(3), and such a region is guaranteed to be pte-mapped.
The TL;DR is that (7) will surely succeed for the regions we care about.
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
The rodata=on security measure requires that any code path which does
vmalloc -> set_memory_ro/set_memory_rox must protect the linear map alias
too. Therefore, if such a call fails, we must abort set_memory_* and the
caller must take appropriate action; currently we suppress the error, and
there is a real chance of such an error arising after commit a166563e7ec3
("arm64: mm: support large block mapping when rodata=full"). Therefore,
propagate any error to the caller.
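A minimal sketch of the propagation (the argument names are illustrative of
the arm64 pageattr code, not copied from the patch):
  /* Propagate any failure instead of swallowing it. */
  ret = __change_memory_common(start, size, set_mask, clear_mask);
  if (ret)
  	return ret;	/* let the set_memory_*() caller take action */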
Fixes: a166563e7ec3 ("arm64: mm: support large block mapping when rodata=full")
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Nathan Chancellor reported the below bug:
[ 0.149929] BUG: KASAN: invalid-access in change_memory_common+0x258/0x2d0
[ 0.151006] Read of size 8 at addr f96680000268a000 by task swapper/0/1
[ 0.152031]
[ 0.152274] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.18.0-rc1-00012-g37cb0aab9068 #1 PREEMPT
[ 0.152288] Hardware name: linux,dummy-virt (DT)
[ 0.152292] Call trace:
[ 0.152295] show_stack+0x18/0x30 (C)
[ 0.152309] dump_stack_lvl+0x60/0x80
[ 0.152320] print_report+0x480/0x498
[ 0.152331] kasan_report+0xac/0xf0
[ 0.152343] kasan_check_range+0x90/0xb0
[ 0.152353] __hwasan_load8_noabort+0x20/0x34
[ 0.152364] change_memory_common+0x258/0x2d0
[ 0.152375] set_memory_ro+0x18/0x24
[ 0.152386] bpf_prog_pack_alloc+0x200/0x2e8
[ 0.152397] bpf_jit_binary_pack_alloc+0x78/0x188
[ 0.152409] bpf_int_jit_compile+0xa4c/0xc74
[ 0.152420] bpf_prog_select_runtime+0x1c0/0x2bc
[ 0.152430] bpf_prepare_filter+0x5a4/0x7c0
[ 0.152443] bpf_prog_create+0xa4/0x100
[ 0.152454] ptp_classifier_init+0x80/0xd0
[ 0.152465] sock_init+0x12c/0x178
[ 0.152474] do_one_initcall+0xa0/0x260
[ 0.152484] kernel_init_freeable+0x2d8/0x358
[ 0.152495] kernel_init+0x20/0x140
[ 0.152510] ret_from_fork+0x10/0x20
This is because the KASAN-tagged address was used when calculating the page
index; the untagged address should be used instead.
Fixes: 37cb0aab9068 ("arm64: mm: make linear mapping permission update more robust for patial range")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
A multi-threaded customer workload with a large memory footprint uses
fork()/exec() to run some external programs every few tens of seconds.
When running the workload on an arm64 server machine, quite a few CPU
cycles are observed to be spent in the TLB flushing functions; when
running it on an x86_64 server machine, they are not. This causes the
performance on arm64 to be much worse than that on x86_64.
While the workload is running, fork()/exec() write-protects all pages
in the parent process, so memory writes in the parent process cause
write protection faults. Then the page fault handler
will make the PTE/PDE writable if the page can be reused, which is
almost always true in the workload. On arm64, to avoid the write
protection fault on other CPUs, the page fault handler flushes the TLB
globally with TLBI broadcast after changing the PTE/PDE. However, this
isn't always necessary. Firstly, it's safe to leave some stale
read-only TLB entries as long as they are eventually flushed.
Secondly, it's quite possible that the original read-only PTE/PDEs
aren't cached in remote TLBs at all if the memory footprint is large.
In fact, on x86_64, the page fault handler doesn't flush the remote
TLB in this situation, which benefits the performance a lot.
To improve the performance on arm64, make the write protection fault
handler flush the TLB locally instead of globally via TLBI broadcast
after making the PTE/PDE writable. If there are stale read-only TLB
entries in the remote CPUs, the page fault handler on these CPUs will
regard the page fault as spurious and flush the stale TLB entries.
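A conceptual sketch of the remote-CPU side, as it might sit in a write-fault
path; the helper names and parameters mirror the generic spurious-fault
pattern rather than the arm64 patch itself ('entry' stands for the freshly
re-read PTE value):
  if ((vmf->flags & FAULT_FLAG_WRITE) && pte_write(entry)) {
  	/*
  	 * Another CPU already made the PTE writable: drop the stale
  	 * local TLB entry and let the access be retried.
  	 */
  	flush_tlb_fix_spurious_fault(vmf->vma, vmf->address, vmf->pte);
  	return 0;
  }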
To test the patchset, usemem.c from vm-scalability
(https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git)
was modified to call fork()/exec() periodically. To mimic the behavior of
the customer workload, run usemem with 4 threads, access 100GB memory,
and call fork()/exec() every 40 seconds. Test results show that with
the patchset the usemem score improves by ~40.6%. The cycles% of the TLB
flush functions drops from ~50.5% to ~0.3% in the perf profile.
Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Will Deacon <will@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
The recent fix to properly initialize the tags of the huge zero folio
had an unfortunate not-so-subtle side effect: it caused the actual
*contents* of the huge zero folio to not be initialized at all when the
hardware didn't support the memory tagging.
The reason was the unfortunate semantics of tag_clear_highpage(): on
hardware that didn't do the tagging, it would silently just not do
anything at all. And since the tagging is done only on arm64 with MTE
support, that silent no-op is what most hardware got.
It wasn't necessarily immediately obvious since the huge zero page isn't
necessarily very heavily used - or because it might already be zero
because all-zeroes is the most common pattern. But it ends up causing
random odd user space failures when you do hit it.
The unfortunate semantics have been around for a while, but became a
real bug only when we started actively using __GFP_ZEROTAGS in the
generic get_huge_zero_folio() function - before that, it had only ever
been used in code that checked that the hardware supported it.
Fix this by simply changing the semantics of tag_clear_highpage() to
return whether it actually successfully did something or not. While at
it, also make it initialize multiple pages in one go, since that's
actually what the only caller wants it to do and it simplifies the whole
logic.
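A sketch of the caller-side behaviour this implies, with illustrative names
(the actual fix also batches multiple pages per call):
  /*
   * If the tagged-clear helper reports it did nothing (no MTE on this
   * hardware), the page contents still have to be zeroed explicitly.
   */
  if (!tag_clear_highpage(page))
  	clear_highpage(page);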
Fixes: adfb6609c680 ("mm/huge_memory: initialise the tags of the huge zero folio")
Link: https://lore.kernel.org/all/20251117082023.90176-1-00107082@163.com/
Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reported-and-tested-by: David Wang <00107082@163.com>
Reported-and-tested-by: Carlos Llamas <cmllamas@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
INVALID_PHYS_ADDR has very similar definitions across the code base.
Hence just move that into the header <linux/mm.h> for more generic usage.
Also drop the now-redundant local definitions.
Link: https://lkml.kernel.org/r/20251021025638.2420216-1-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit fcf8dda8cc48 ("arm64: pageattr: Explicitly bail out when changing
permissions for vmalloc_huge mappings") made the permission update for a
partial range more robust. But the linear mapping permission update
still assumes the whole range is updated, iterating from the first page
all the way to the last page of the area.
Make it more robust by updating the linear mapping permission starting
from the page mapped by the start address, for numpages pages.
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
With the BUG_ON() in pgd_pgtable_alloc_init_mm() moved up to a higher layer,
the gfp flags are the only difference between try_pgd_pgtable_alloc_init_mm()
and pgd_pgtable_alloc_init_mm(). Hence rename the "try" version
to pgd_pgtable_alloc_init_mm_gfp().
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Linu Cherian <linu.cherian@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
arch_add_memory() is used to hotplug memory into a system but as a part
of its implementation it calls __create_pgd_mapping(), which uses
pgtable_alloc() in order to build intermediate page tables. As this path
was initially only used during early boot, pgtable_alloc() is designed to
BUG_ON() on failure. However, in the event that memory hotplug is
attempted when the system's memory is extremely tight and the allocation
fails, this would panic the system, which is not desirable. Hence update
__create_pgd_mapping() and all its callers to be non-void and propagate
-ENOMEM on allocation failure to allow the system to fail gracefully.
But during early boot, if there is an allocation failure we still want the
system to panic, hence create a wrapper around __create_pgd_mapping()
called early_create_pgd_mapping() which panics if the return value is
non-zero. All the init calls are updated to use this wrapper rather than
the modified __create_pgd_mapping() to preserve the original behaviour.
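A sketch of the wrapper named above; the parameter list only approximates
the in-tree __create_pgd_mapping() signature:
  static void __init early_create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
  					    unsigned long virt, phys_addr_t size,
  					    pgprot_t prot,
  					    phys_addr_t (*pgtable_alloc)(int),
  					    int flags)
  {
  	/* Early boot: an allocation failure here is fatal by design. */
  	if (__create_pgd_mapping(pgdir, phys, virt, size, prot,
  				 pgtable_alloc, flags))
  		panic("Failed to create page tables\n");
  }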
Fixes: 4ab215061554 ("arm64: Add memory hotplug support")
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Chaitanya S Prakash <chaitanyas.prakash@arm.com>
Signed-off-by: Linu Cherian <linu.cherian@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
This just replaces all used TCR_EL1 field macros with the tools-sysreg
variant based fields and subsequently drops them from the header
(pgtable-hwdef.h), while retaining the ones used by KVM (represented via
the sysreg tools format).
Cc: Will Deacon <will@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Although the comment clearly states the PGD table's alignment requirement
(when PA_BITS = 52), the subsequent BUILD_BUG_ON() tests a size comparison
against 64 bytes instead. So change it into an actual alignment test.
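Something along these lines (the exact expression in the tree may differ;
this only illustrates an alignment test versus a size comparison):
  /* Require 64-byte alignment of PGD_SIZE rather than an exact size. */
  BUILD_BUG_ON(!IS_ALIGNED(PGD_SIZE, 64));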
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"There's more here than I would ideally like at this stage, but there's
been a steady trickle of fixes and some of them took a few rounds of
review.
The bulk of the changes are fixing some fallout from the recent BBM
level two support which allows the linear map to be split from block
to page mappings at runtime, but inadvertently led to sleeping in
atomic context on some paths where the linear map was already mapped
with page granularity. The fix is simply to avoid splitting in those
cases but the implementation of that is a little involved.
The other interesting fix is addressing a catastrophic performance
issue with our per-cpu atomics discovered by Paul in the SRCU locking
code but which took some interactions with the hardware folks to
resolve.
Summary:
- Avoid sleeping in atomic context when changing linear map
permissions for DEBUG_PAGEALLOC or KFENCE
- Rework printing of Spectre mitigation status to avoid hardlockup
when enabling per-task mitigations on the context-switch path
- Reject kernel modules when instruction patching fails either due to
the DWARF-based SCS patching or because of an alternatives callback
residing outside of the core kernel text
- Propagate error when updating kernel memory permissions in kprobes
- Drop pointless, incorrect message when enabling the ACPI SPCR
console
- Use value-returning LSE instructions for per-cpu atomics to reduce
latency in SRCU locking routines"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: Reject modules with internal alternative callbacks
arm64: Fail module loading if dynamic SCS patching fails
arm64: proton-pack: Fix hard lockup due to print in scheduler context
arm64: proton-pack: Drop print when !CONFIG_MITIGATE_SPECTRE_BRANCH_HISTORY
arm64: mm: Tidy up force_pte_mapping()
arm64: mm: Optimize range_split_to_ptes()
arm64: mm: Don't sleep in split_kernel_leaf_mapping() when in atomic context
arm64: kprobes: check the return value of set_memory_rox()
arm64: acpi: Drop message logging SPCR default console
Revert "ACPI: Suppress misleading SPCR console message when SPCR table is absent"
arm64: Use load LSE atomics for the non-return per-CPU atomic operations
|
|
On arm64 with MTE enabled, a page mapped as Normal Tagged (PROT_MTE) in
user space will need to have its allocation tags initialised. This is
normally done in the arm64 set_pte_at() after checking the memory
attributes. Such a page is also marked with the PG_mte_tagged flag to avoid
subsequent clearing. Since this relies on having a struct page,
pte_special() mappings are ignored.
Commit d82d09e48219 ("mm/huge_memory: mark PMD mappings of the huge zero
folio special") maps the huge zero folio special and the arm64
set_pmd_at() will no longer zero the tags. There is no guarantee that the
tags are zero, especially if parts of this huge page have been previously
tagged.
It's fairly easy to detect this by regularly dropping the caches to
force the reallocation of the huge zero folio.
Allocate the huge zero folio with the __GFP_ZEROTAGS flag. In addition,
do not warn in the arm64 __access_remote_tags() when reading tags from the
huge zero page.
I bundled the arm64 change in here as well since they are both related to
the commit mapping the huge zero folio as special.
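A sketch of the allocation-side change described above; the exact GFP mask
used by get_huge_zero_folio() may differ:
  /* Ask the allocator to zero the MTE tags along with the data. */
  folio = folio_alloc((GFP_TRANSHUGE | __GFP_ZERO | __GFP_ZEROTAGS) &
  		    ~__GFP_MOVABLE, HPAGE_PMD_ORDER);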
[catalin.marinas@arm.com: handle arch mte_zero_clear_page_tags() code issuing MTE instructions]
Link: https://lkml.kernel.org/r/aQi8dA_QpXM8XqrE@arm.com
Link: https://lkml.kernel.org/r/20251031170133.280742-1-catalin.marinas@arm.com
Fixes: d82d09e48219 ("mm/huge_memory: mark PMD mappings of the huge zero folio special")
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Tested-by: Beleswar Padhi <b-padhi@ti.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Aishwarya TCV <aishwarya.tcv@arm.com>
Cc: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Tidy up the implementation of force_pte_mapping() to make it easier to
read and introduce the split_leaf_mapping_possible() helper to reduce
code duplication in split_kernel_leaf_mapping() and
arch_kfence_init_pool().
Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Yang Shi <yang@os.amperecomputing.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Enter lazy_mmu mode while splitting a range of memory to pte mappings.
This causes barriers, which would otherwise be emitted after every pte
(and pmd/pud) write, to be deferred until exiting lazy_mmu mode.
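A sketch of the bracketing this describes; the split helper and its
arguments are hypothetical, the lazy_mmu enter/leave pair is the point:
  arch_enter_lazy_mmu_mode();		/* defer the per-PTE dsb/isb barriers */
  ret = range_split_to_ptes(start, end);	/* illustrative signature */
  arch_leave_lazy_mmu_mode();		/* emit the deferred barriers once, here */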
For large systems, this is expected to significantly speed up fallback
to pte-mapping the linear map for the case where the boot CPU has
BBML2_NOABORT, but secondary CPUs do not. I haven't directly measured
it, but this is equivalent to commit 1fcb7cea8a5f ("arm64: mm: Batch dsb
and isb when populating pgtables").
Note that for the path from arch_kfence_init_pool(), we may sleep while
allocating memory inside the lazy_mmu mode. Sleeping is not allowed by
generic code inside lazy_mmu, but we know that the arm64 implementation
is sleep-safe. So this is ok and follows the same pattern already used
by split_kernel_leaf_mapping().
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Yang Shi <yang@os.amperecomputing.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
It has been reported that split_kernel_leaf_mapping() is trying to sleep
in non-sleepable context. It does this when acquiring the
pgtable_split_lock mutex, when either CONFIG_DEBUG_PAGEALLOC or
CONFIG_KFENCE is enabled; these features change linear map permissions within
softirq context during memory allocation and/or freeing. All other paths
into this function are called from sleepable context and so are safe.
But it turns out that the memory for which these 2 features may attempt
to modify the permissions is always mapped by pte, so there is no need
to attempt to split the mapping. So let's exit early in these cases and
avoid attempting to take the mutex.
There is one wrinkle to this approach: late-initialized kfence allocates
its pool from the buddy allocator, which may be block mapped. So we must
hook that allocation and convert it to pte mappings up front. Previously this was
done as a side-effect of kfence protecting all the individual pages in
its pool at init-time, but this no longer works due to the added early
exit path in split_kernel_leaf_mapping().
So instead, do this via the existing arch_kfence_init_pool() arch hook,
and reuse the existing linear_map_split_to_ptes() infrastructure.
Closes: https://lore.kernel.org/all/f24b9032-0ec9-47b1-8b95-c0eeac7a31c5@roeck-us.net/
Fixes: a166563e7ec3 ("arm64: mm: support large block mapping when rodata=full")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <groeck@google.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Yang Shi <yang@os.amperecomputing.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The arm64 copy_highpage() assumes that the destination page is newly
allocated and not MTE-tagged (PG_mte_tagged unset) and warns
accordingly. However, following commit 060913999d7a ("mm: migrate:
support poisoned recover from migrate folio"), folio_mc_copy() is called
before __folio_migrate_mapping(). If the latter fails (-EAGAIN), the
copy will be done again to the same destination page. Since
copy_highpage() already set the PG_mte_tagged flag, this second copy
will warn.
Replace the WARN_ON_ONCE(page already tagged) in the arm64
copy_highpage() with a comment.
Reported-by: syzbot+d1974fc28545a3e6218b@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/68dda1ae.a00a0220.102ee.0065.GAE@google.com
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: stable@vger.kernel.org # 6.12.x
Reviewed-by: Yang Shi <yang@os.amperecomputing.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
- Preserve old 'tt_core' UAPI for Hisilicon L3C PMU driver
- Ensure linear alias of kprobes instruction page is not writable
- Fix kernel stack unwinding from BPF
- Fix build warnings from the Fujitsu uncore PMU documentation
- Fix hang with deferred 'struct page' initialisation and MTE
- Consolidate KPTI page-table re-writing code
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: mte: Do not flag the zero page as PG_mte_tagged
docs: perf: Fujitsu: Fix htmldocs build warnings and errors
arm64: mm: Move KPTI helpers to mmu.c
tracing: Fix the bug where bpf_get_stackid returns -EFAULT on the ARM64
arm64: kprobes: call set_memory_rox() for kprobe page
drivers/perf: hisi: Add tt_core_deprecated for compatibility
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "mm, swap: improve cluster scan strategy" from Kairui Song improves
performance and reduces the failure rate of swap cluster allocation
- "support large align and nid in Rust allocators" from Vitaly Wool
permits Rust allocators to set NUMA node and large alignment when
performing slub and vmalloc reallocs
- "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend
DAMOS_STAT's handling of the DAMON operations sets for virtual
address spaces for ops-level DAMOS filters
- "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren
Baghdasaryan reduces mmap_lock contention during reads of
/proc/pid/maps
- "mm/mincore: minor clean up for swap cache checking" from Kairui Song
performs some cleanup in the swap code
- "mm: vm_normal_page*() improvements" from David Hildenbrand provides
code cleanup in the pagemap code
- "add persistent huge zero folio support" from Pankaj Raghav provides
a block layer speedup by optionally making the
huge_zero_page persistent, instead of releasing it when its refcount
falls to zero
- "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to
the recently added Kexec Handover feature
- "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo
Stoakes turns mm_struct.flags into a bitmap, to end the constant
struggle with space shortage on 32-bit conflicting with 64-bit's
needs
- "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap
code
- "selftests/mm: Fix false positives and skip unsupported tests" from
Donet Tom fixes a few things in our selftests code
- "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised"
from David Hildenbrand "allows individual processes to opt-out of
THP=always into THP=madvise, without affecting other workloads on the
system".
It's a long story - the [1/N] changelog spells out the considerations
- "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on
the memdesc project. Please see
https://kernelnewbies.org/MatthewWilcox/Memdescs and
https://blogs.oracle.com/linux/post/introducing-memdesc
- "Tiny optimization for large read operations" from Chi Zhiling
improves the efficiency of the pagecache read path
- "Better split_huge_page_test result check" from Zi Yan improves our
folio splitting selftest code
- "test that rmap behaves as expected" from Wei Yang adds some rmap
selftests
- "remove write_cache_pages()" from Christoph Hellwig removes that
function and converts its two remaining callers
- "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD
selftests issues
- "introduce kernel file mapped folios" from Boris Burkov introduces
the concept of "kernel file pages". Using these permits btrfs to
account its metadata pages to the root cgroup, rather than to the
cgroups of random inappropriate tasks
- "mm/pageblock: improve readability of some pageblock handling" from
Wei Yang provides some readability improvements to the page allocator
code
- "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON
to understand arm32 highmem
- "tools: testing: Use existing atomic.h for vma/maple tests" from
Brendan Jackman performs some code cleanups and deduplication under
tools/testing/
- "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes
a couple of 32-bit issues in tools/testing/radix-tree.c
- "kasan: unify kasan_enabled() and remove arch-specific
implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific
initialization code into a common arch-neutral implementation
- "mm: remove zpool" from Johannes Weiner removes zspool - an
indirection layer which now only redirects to a single thing
(zsmalloc)
- "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a
couple of cleanups in the fork code
- "mm: remove nth_page()" from David Hildenbrand makes rather a lot of
adjustments at various nth_page() callsites, eventually permitting
the removal of that undesirable helper function
- "introduce kasan.write_only option in hw-tags" from Yeoreum Yun
creates a KASAN write-only mode for ARM, using that architecture's
memory tagging feature. It is felt that a write-only mode KASAN is
suitable for use in production systems rather than debug-only
- "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does
some tidying in the hugetlb folio allocation code
- "mm: establish const-correctness for pointer parameters" from Max
Kellermann makes quite a number of the MM API functions more accurate
about the constness of their arguments. This was getting in the way
of subsystems (in this case CEPH) when they attempt to improving
their own const/non-const accuracy
- "Cleanup free_pages() misuse" from Vishal Moola fixes a number of
code sites which were confused over when to use free_pages() vs
__free_pages()
- "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the
mapletree code accessible to Rust. Required by nouveau and by its
forthcoming successor: the new Rust Nova driver
- "selftests/mm: split_huge_page_test: split_pte_mapped_thp
improvements" from David Hildenbrand adds a fix and some cleanups to
the thp selftesting code
- "mm, swap: introduce swap table as swap cache (phase I)" from Chris
Li and Kairui Song is the first step along the path to implementing
"swap tables" - a new approach to swap allocation and state tracking
which is expected to yield speed and space improvements. This
patchset itself yields a 5-20% performance benefit in some situations
- "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc
layer to clean up the ptdesc code a little
- "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some
issues in our 5-level pagetable selftesting code
- "Minor fixes for memory allocation profiling" from Suren Baghdasaryan
addresses a couple of minor issues in the relatively new memory
allocation profiling feature
- "Small cleanups" from Matthew Wilcox has a few cleanups in
preparation for more memdesc work
- "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from
Quanmin Yan makes some changes to DAMON in furtherance of supporting
arm highmem
- "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad
Anjum adds that compiler check to selftests code and fixes the
fallout, by removing dead code
- "Improvements to Victim Process Thawing and OOM Reaper Traversal
Order" from zhongjinji makes a number of improvements in the OOM
killer: mainly thawing a more appropriate group of victim threads so
they can release resources
- "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park
is a bunch of small and unrelated fixups for DAMON
- "mm/damon: define and use DAMON initialization check function" from
SeongJae Park implements reliability and maintainability improvements
to a recently-added bug fix
- "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from
SeongJae Park provides additional transparency to userspace clients
of the DAMON_STAT information
- "Expand scope of khugepaged anonymous collapse" from Dev Jain removes
some constraints on khugepaged's collapsing of anon VMAs. It also
increases the success rate of MADV_COLLAPSE against an anon vma
- "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()"
from Lorenzo Stoakes moves us further towards removal of
file_operations.mmap(). This patchset concentrates upon clearing up
the treatment of stacked filesystems
- "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau
provides some fixes and improvements to mlock's tracking of large
folios. /proc/meminfo's "Mlocked" field became more accurate
- "mm/ksm: Fix incorrect accounting of KSM counters during fork" from
Donet Tom fixes several user-visible KSM stats inaccuracies across
forks and adds selftest code to verify these counters
- "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses
some potential but presently benign issues in KSM's mm_slot handling
* tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits)
mm: swap: check for stable address space before operating on the VMA
mm: convert folio_page() back to a macro
mm/khugepaged: use start_addr/addr for improved readability
hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list
alloc_tag: fix boot failure due to NULL pointer dereference
mm: silence data-race in update_hiwater_rss
mm/memory-failure: don't select MEMORY_ISOLATION
mm/khugepaged: remove definition of struct khugepaged_mm_slot
mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL
hugetlb: increase number of reserving hugepages via cmdline
selftests/mm: add fork inheritance test for ksm_merging_pages counter
mm/ksm: fix incorrect KSM counter handling in mm_struct during fork
drivers/base/node: fix double free in register_one_node()
mm: remove PMD alignment constraint in execmem_vmalloc()
mm/memory_hotplug: fix typo 'esecially' -> 'especially'
mm/rmap: improve mlock tracking for large folios
mm/filemap: map entire large folio faultaround
mm/fault: try to map the entire file folio in finish_fault()
mm/rmap: mlock large folios in try_to_unmap_one()
mm/rmap: fix a mlock race condition in folio_referenced_one()
...
|
|
create_kpti_ng_temp_pgd() is currently defined (as an alias) in
mmu.c without a matching declaration in a header; instead, cpufeature.c
makes its own declaration. This is clearly not pretty, and as commit
ceca927c86e6 ("arm64: mm: Fix CFI failure due to kpti_ng_pgd_alloc
function signature") showed, it also makes it very easy for the
prototypes to go out of sync.
All this would be much simpler if kpti_install_ng_mappings() and
associated functions lived in mmu.c, where they logically belong.
This is what this patch does:
- Move kpti_install_ng_mappings() and associated functions from
cpufeature.c to mmu.c, add a declaration to <asm/mmu.h>
- Remove create_kpti_ng_temp_pgd() and just call
__create_pgd_mapping_locked() directly instead
- Mark all these functions __init
- Move __initdata after kpti_ng_temp_alloc (as suggested by
checkpatch)
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com>
[will: Fix conflicts with init_idmap_kpti_bbml2_flag()]
Signed-off-by: Will Deacon <will@kernel.org>
|
|
* for-next/mm:
arm64: map [_text, _stext) virtual address range non-executable+read-only
arm64: Enable vmalloc-huge with ptdump
arm64: mm: split linear mapping if BBML2 unsupported on secondary CPUs
arm64: mm: support large block mapping when rodata=full
arm64: Enable permission change on arm64 kernel block mappings
arm64/Kconfig: Remove CONFIG_RODATA_FULL_DEFAULT_ENABLED
arm64: mm: Rework the 'rodata=' options
arm64: mm: Represent physical memory with phys_addr_t and resource_size_t
arm64: mm: Make map_fdt() return mapped pointer
arm64: mm: Cast start/end markers to char *, not u64
|
|
Since the referenced fixes commit, the kernel's .text section is only
mapped starting from _stext; the region [_text, _stext) is omitted. As a
result, other vmalloc/vmap allocations may use the virtual addresses
nominally in the range [_text, _stext). This address reuse confuses
multiple things:
1. crash_prepare_elf64_headers() sets up a segment in /proc/vmcore
mapping the entire range [_text, _end) to
[__pa_symbol(_text), __pa_symbol(_end)). Reading an address in
[_text, _stext) from /proc/vmcore therefore gives the incorrect
result.
2. Tools doing symbolization (either by reading /proc/kallsyms or based
on the vmlinux ELF file) will incorrectly identify vmalloc/vmap
allocations in [_text, _stext) as kernel symbols.
In practice, both of these issues affect the drgn debugger.
Specifically, there were cases where the vmap IRQ stacks for some CPUs
were allocated in [_text, _stext). As a result, drgn could not get the
stack trace for a crash in an IRQ handler because the core dump
contained invalid data for the IRQ stack address. The stack addresses
were also symbolized as being in the _text symbol.
Fix this by bringing back the mapping of [_text, _stext), but now make
it non-executable and read-only. This prevents other allocations from
using it while still achieving the original goal of not mapping
unpredictable data as executable. Other than the changed protection,
this is effectively a revert of the fixes commit.
Fixes: e2a073dde921 ("arm64: omit [_text, _stext) from permanent kernel mapping")
Cc: stable@vger.kernel.org
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Our goal is to move towards enabling vmalloc-huge by default on arm64 so
as to reduce TLB pressure. Therefore, we need a way to analyze the portion
of block mappings in vmalloc space we can get on a production system; this
can be done through ptdump, but currently we disable vmalloc-huge if
CONFIG_PTDUMP_DEBUGFS is on. The reason is that lazy freeing of kernel
pagetables via vmap_try_huge_pxd() may race with ptdump, so ptdump
may dereference a bogus address.
To solve this, we need to synchronize ptdump_walk() and ptdump_check_wx()
with pud_free_pmd_page() and pmd_free_pte_page().
Since this race is very unlikely to happen in practice, we do not want to
penalize the vmalloc pagetable teardown path by taking the init_mm
mmap_lock. Therefore, we use static keys. ptdump_walk() and
ptdump_check_wx() are the pagetable walkers; they will enable the static
key - upon observing that, the vmalloc pagetable teardown path will get
patched in with an mmap_read_lock/unlock sequence. A combination of the
patched-in mmap_read_lock/unlock, the acquire semantics of
static_branch_inc(), and the barriers in __flush_tlb_kernel_pgtable()
ensures that ptdump will never get a hold on the address of a freed PMD
or PTE table.
We can verify the correctness of the algorithm via the following litmus
test (thanks to James Houghton and Will Deacon):
AArch64 ptdump
Variant=Ifetch
{
uint64_t pud=0xa110c;
uint64_t pmd;
0:X0=label:"P1:L0"; 0:X1=instr:"NOP"; 0:X2=lock; 0:X3=pud; 0:X4=pmd;
1:X1=0xdead; 1:X2=lock; 1:X3=pud; 1:X4=pmd;
}
P0 | P1 ;
(* static_key_enable *) | (* pud_free_pmd_page *) ;
STR W1, [X0] | LDR X9, [X3] ;
DC CVAU,X0 | STR XZR, [X3] ;
DSB ISH | DSB ISH ;
IC IVAU,X0 | ISB ;
DSB ISH | ;
ISB | (* static key *) ;
| L0: ;
(* mmap_lock *) | B out1 ;
Lwlock: | ;
MOV W7, #1 | (* mmap_lock *) ;
SWPA W7, W8, [X2] | Lrlock: ;
| MOV W7, #1 ;
| SWPA W7, W8, [X2] ;
(* walk pgtable *) | ;
LDR X9, [X3] | (* mmap_unlock *) ;
CBZ X9, out0 | STLR WZR, [X2] ;
EOR X10, X9, X9 | ;
LDR X11, [X4, X10] | out1: ;
| EOR X10, X9, X9 ;
out0: | STR X1, [X4, X10] ;
exists (0:X8=0 /\ 1:X8=0 /\ (* Lock acquisitions succeed *)
0:X9=0xa110c /\ (* P0 sees the valid PUD ...*)
0:X11=0xdead) (* ... but the freed PMD *)
For an approximate written proof of why this algorithm works, please read
the code comment in [1], which is now removed for the sake of simplicity.
mm selftests pass. No issues were observed while running test_vmalloc.sh
(which stresses the vmalloc subsystem) in parallel with
cat /sys/kernel/debug/{kernel_page_tables,check_wx_pages} in a loop.
Link: https://lore.kernel.org/all/20250723161827.15802-1-dev.jain@arm.com/ [1]
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
free_pages() should be used when we only have a virtual address. We
should call __free_pages() directly on our page instead.
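For illustration, the distinction in the generic page allocator API (not the
patched call site):
    /* free_pages() takes a kernel virtual address and an order: */
    unsigned long addr = __get_free_pages(GFP_KERNEL, 0);
    free_pages(addr, 0);

    /* When we already hold the struct page, skip the address round trip
     * and free it directly: */
    struct page *page = alloc_pages(GFP_KERNEL, 0);
    __free_pages(page, 0);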
Link: https://lkml.kernel.org/r/20250903185921.1785167-7-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Justin Sanders <justin@coraid.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Call kasan_init_generic() which handles Generic KASAN initialization. For
architectures that do not select ARCH_DEFER_KASAN, this will be a no-op
for the runtime flag but will print the initialization banner.
For SW_TAGS and HW_TAGS modes, their respective init functions will handle
the flag enabling, if they are enabled/implemented.
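The per-architecture change then amounts to something like the sketch below;
the file location and surrounding code are placeholders:
    /* arch/<arch>/mm/kasan_init.c -- tail of the arch kasan_init() */
    void __init kasan_init(void)
    {
        /* ... arch-specific shadow memory setup ... */

        /* Generic KASAN hand-off: enables the runtime flag where
         * ARCH_DEFER_KASAN is selected and prints the init banner. */
        kasan_init_generic();
    }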
Link: https://lkml.kernel.org/r/20250810125746.1105476-3-snovitoll@gmail.com
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217049
Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com>
Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com> [riscv]
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390]
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Gow <davidgow@google.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@loongson.cn>
Cc: Marco Elver <elver@google.com>
Cc: Qing Zhang <zhangqing@loongson.cn>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The kernel linear mapping is painted at a very early stage of system boot.
The cpufeature has not been finalized yet at this point. So the linear
mapping is determined by the capability of boot CPU only. If the boot
CPU supports BBML2, large block mappings will be used for linear
mapping.
But the secondary CPUs may not support BBML2, so once cpufeature is
finalized on all CPUs, repaint the linear mapping if large block mappings
are used and the secondary CPUs don't support BBML2.
If the boot CPU doesn't support BBML2, or the secondary CPUs have the
same BBML2 capability as the boot CPU, repainting the linear mapping
is not needed.
Repainting is implemented by the boot CPU, which we know supports BBML2,
so it is safe for the live mapping size to change for this CPU. The
linear map region is walked using the pagewalk API and any discovered
large leaf mappings are split to pte mappings using the existing helper
functions. Since the repainting is performed inside of a stop_machine(),
we must use GFP_ATOMIC to allocate the extra intermediate pgtables. But
since we are still early in boot, it is expected that there is plenty of
memory available so we will never need to sleep for reclaim, and so
GFP_ATOMIC is acceptable here.
The secondary CPUs are all put into a waiting area with the idmap in
TTBR0 and reserved map in TTBR1 while this is performed since they
cannot be allowed to observe any size changes on the live mappings. Some
of this infrastructure is reused from the kpti case. Specifically we
share the same flag (was __idmap_kpti_flag, now idmap_kpti_bbml2_flag)
since it means we don't have to reserve any extra pgtable memory to
idmap the extra flag.
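The decision logic is roughly the following sketch;
linear_map_uses_block_mappings() and linear_map_split_to_ptes() are
illustrative names, not necessarily the ones used by the patch:
    void __init linear_map_maybe_repaint(void)
    {
        /* Block mappings were only used if the boot CPU had BBML2 */
        if (!linear_map_uses_block_mappings())
            return;

        /* cpufeature is now finalized: if every CPU has BBML2-noabort,
         * the block mappings can stay. */
        if (system_supports_bbml2_noabort())
            return;

        /* Repaint on the boot CPU; secondaries are parked on the
         * idmap/reserved tables so they never observe a live mapping
         * changing size. GFP_ATOMIC is fine this early in boot. */
        stop_machine(linear_map_split_to_ptes, NULL, cpu_online_mask);
    }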
Co-developed-by: Yang Shi <yang@os.amperecomputing.com>
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
When rodata=full is specified, the kernel linear mapping has to be mapped
at PTE level, since large block mappings can't be split due to the
break-before-make rule on ARM64.
This resulted in a couple of problems:
- performance degradation
- more TLB pressure
- memory waste for kernel page table
With FEAT_BBM level 2 support, splitting a large block mapping into
smaller ones no longer requires making the page table entry invalid.
This allows the kernel to split large block mappings on the fly.
Add kernel page table split support and use large block mapping by
default when FEAT_BBM level 2 is supported for rodata=full. When
changing permissions for kernel linear mapping, the page table will be
split to smaller size.
Machines without FEAT_BBM level 2 will fall back to a PTE-mapped kernel
linear mapping when rodata=full.
With this we saw a significant performance boost in some benchmarks and
much lower memory consumption on my AmpereOne machine (192 cores, 1P)
with 256GB memory.
* Memory use after boot
Before:
MemTotal: 258988984 kB
MemFree: 254821700 kB
After:
MemTotal: 259505132 kB
MemFree: 255410264 kB
Around 500MB more memory is free to use. The larger the machine, the
more memory is saved.
* Memcached
We saw performance degradation when running Memcached benchmark with
rodata=full vs rodata=on. Our profiling pointed to kernel TLB pressure.
With this patchset we saw ops/sec increase by around 3.5% and P99
latency drop by around 9.6%.
The gain mainly came from reduced kernel TLB misses. The kernel TLB
MPKI is reduced by 28.5%.
The benchmark data is now on par with rodata=on too.
* Disk encryption (dm-crypt) benchmark
Ran fio benchmark with the below command on a 128G ramdisk (ext4) with
disk encryption (by dm-crypt).
fio --directory=/data --random_generator=lfsr --norandommap \
--randrepeat 1 --status-interval=999 --rw=write --bs=4k --loops=1 \
--ioengine=sync --iodepth=1 --numjobs=1 --fsync_on_close=1 \
--group_reporting --thread --name=iops-test-job --eta-newline=1 \
--size 100G
IOPS increased by 90% - 150% (the variance is high, but the worst result
of the good case is still around 90% better than the best result of the
bad case). The bandwidth is increased and the avg clat is reduced
proportionally.
* Sequential file read
Read a 100G file sequentially on XFS (xfs_io read with the page cache
populated). The bandwidth is increased by 150%.
Co-developed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
This patch paves the path to enable huge mappings in vmalloc space and
linear map space by default on arm64. For this we must ensure that we
can handle any permission games on the kernel (init_mm) pagetable.
Previously, __change_memory_common() used apply_to_page_range() which
does not support changing permissions for block mappings. We move away
from this by using the pagewalk API, similar to what riscv does right
now. It is the responsibility of the caller to ensure that the range
over which permissions are being changed falls on leaf mapping
boundaries. For systems with BBML2, this will be handled in future
patches by dynamically splitting the mappings when required.
Unlike apply_to_page_range(), the pagewalk API currently enforces the
init_mm.mmap_lock to be held. To avoid the unnecessary bottleneck of the
mmap_lock for our usecase, this patch extends this generic API to be
used locklessly, so as to retain the existing behaviour for changing
permissions. Apart from this reason, it is noted at [1] that KFENCE can
manipulate kernel pgtable entries during softirqs. It does this by
calling set_memory_valid() -> __change_memory_common(). This being a
non-sleepable context, we cannot take the init_mm mmap lock.
Add comments to highlight the conditions under which we can use the
lockless variant - no underlying VMA, and the user having exclusive
control over the range, thus guaranteeing no concurrent access.
We require that the start and end of a given range do not partially
overlap block mappings, or cont mappings. Return -EINVAL in case a
partial block mapping is detected in any of the PGD/P4D/PUD/PMD levels;
add a corresponding comment in update_range_prot() to warn that
eliminating such a condition is the responsibility of the caller.
Note that the pte-level callback may change permissions for a whole
contpte block, and that will be done one pte at a time, as opposed to an
atomic operation for the block mappings. This is fine as any access will
decode either the old or the new permission until the TLBI.
apply_to_page_range() currently performs all pte level callbacks while
in lazy mmu mode. Since arm64 can optimize performance by batching
barriers when modifying kernel pgtables in lazy mmu mode, we would like
to continue to benefit from this optimisation. Unfortunately
walk_kernel_page_table_range() does not use lazy mmu mode. However,
since the pagewalk framework is not allocating any memory, we can safely
bracket the whole operation inside lazy mmu mode ourselves. Therefore,
wrap the call to walk_kernel_page_table_range() with the lazy MMU
helpers.
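The resulting shape of update_range_prot() is roughly as below; the
walk_kernel_page_table_range() argument list and the ops/private pointers
are approximated here rather than quoted from the series:
    static int update_range_prot(unsigned long start, unsigned long end,
                                 const struct mm_walk_ops *ops, void *priv)
    {
        int ret;

        /* Lockless walk: no underlying VMA, and the caller has
         * exclusive control over the range. */
        arch_enter_lazy_mmu_mode();    /* batch barriers on arm64 */
        ret = walk_kernel_page_table_range(start, end, ops, NULL, priv);
        arch_leave_lazy_mmu_mode();

        return ret;
    }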
Link: https://lore.kernel.org/linux-arm-kernel/89d0ad18-4772-4d8f-ae8a-7c48d26a927e@arm.com/ [1]
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Yang Shi <yshi@os.amperecomputing.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Now that 'rodata=full' has been removed in favour of parity with x86,
CONFIG_RODATA_FULL_DEFAULT_ENABLED no longer serves a useful purpose.
Remove it.
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
This is a type-correctness cleanup to MMU/boot code that replaces
several instances of void * and u64 with phys_addr_t (to represent
addresses) and resource_size_t (to represent sizes) to emphasize that
the code in question concerns physical memory specifically.
The rationale for this change is to improve clarity and readability in
a few modules that handle both types of address (physical and virtual),
where differentiation is essential.
I have left u64 in cases where the address may be either physical or
virtual, where the address is exclusively virtual but used in heavy
pointer arithmetic, and in cases I may have overlooked. I do not
necessarily consider u64 the ideal type in those situations, but it
avoids breaking existing semantics in this cleanup.
This patch provably has no effect at runtime: I have verified that
.text of vmlinux is identical after this change.
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Patch series "Add and use memdesc_flags_t".
At some point struct page will be separated from struct slab and struct
folio. This is a step towards that by introducing a type for the 'flags'
word of all three structures. This gives us a certain amount of type
safety by establishing that some of these unsigned longs are different
from other unsigned longs in that they contain things like node ID,
section number and zone number in the upper bits. That lets us have
functions that can be easily called by anyone who has a slab, folio or
page (but not easily by anyone else) to get the node or zone.
There are going to be some unusual merge problems with this, as some odd
bits of the kernel decide they want to print out the flags value (or
something similar) by writing page->flags, and they'll now need to write
page->flags.f instead. That's most of the churn here. Maybe we should be
removing these things from the debug output?
This patch (of 11):
Wrap the unsigned long flags in a typedef. In upcoming patches, this will
provide a strong hint that you can't just pass a random unsigned long to
functions which take this as an argument.
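The typedef itself is a one-member struct wrapping the flags word (the member
name 'f' follows from the page->flags.f churn mentioned above):
    typedef struct {
        unsigned long f;
    } memdesc_flags_t;

    /* e.g. a raw read that used to be "page->flags" becomes: */
    unsigned long raw = page->flags.f;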
[willy@infradead.org: s/flags/flags.f/ in several architectures]
Link: https://lkml.kernel.org/r/aKMgPRLD-WnkPxYm@casper.infradead.org
[nicola.vetrini@gmail.com: mips: fix compilation error]
Link: https://lore.kernel.org/lkml/CA+G9fYvkpmqGr6wjBNHY=dRp71PLCoi2341JxOudi60yqaeUdg@mail.gmail.com/
Link: https://lkml.kernel.org/r/20250825214245.1838158-1-nicola.vetrini@gmail.com
Link: https://lkml.kernel.org/r/20250805172307.1302730-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20250805172307.1302730-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Seen during KPTI initialization:
CFI failure at create_kpti_ng_temp_pgd+0x124/0xce8 (target: kpti_ng_pgd_alloc+0x0/0x14; expected type: 0xd61b88b6)
The call site is alloc_init_pud() at arch/arm64/mm/mmu.c:
pud_phys = pgtable_alloc(TABLE_PUD);
alloc_init_pud() has the prototype:
static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
phys_addr_t phys, pgprot_t prot,
phys_addr_t (*pgtable_alloc)(enum pgtable_type),
int flags)
where the pgtable_alloc() prototype is declared.
The target (kpti_ng_pgd_alloc) is used in arch/arm64/kernel/cpufeature.c:
create_kpti_ng_temp_pgd(kpti_ng_temp_pgd, __pa(alloc), KPTI_NG_TEMP_VA,
PAGE_SIZE, PAGE_KERNEL, kpti_ng_pgd_alloc, 0);
which is an alias for __create_pgd_mapping_locked() with prototype:
extern __alias(__create_pgd_mapping_locked)
void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys,
unsigned long virt,
phys_addr_t size, pgprot_t prot,
phys_addr_t (*pgtable_alloc)(enum pgtable_type),
int flags);
__create_pgd_mapping_locked() passes the function pointer down:
__create_pgd_mapping_locked() -> alloc_init_p4d() -> alloc_init_pud()
But the target function (kpti_ng_pgd_alloc) has the wrong signature:
static phys_addr_t __init kpti_ng_pgd_alloc(int shift);
The "int" should be "enum pgtable_type".
To make "enum pgtable_type" available to cpufeature.c, move
enum pgtable_type definition from arch/arm64/mm/mmu.c to
arch/arm64/include/asm/mmu.h.
Adjust kpti_ng_pgd_alloc to use "enum pgtable_type" instead of "int".
The function behavior remains identical (parameter is unused).
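The signature change, in short:
    /* Before: mismatches phys_addr_t (*)(enum pgtable_type), so the
     * indirect call through the pgtable_alloc pointer trips kCFI: */
    static phys_addr_t __init kpti_ng_pgd_alloc(int shift);

    /* After: matches the callback type; the (unused) parameter keeps
     * its meaning: */
    static phys_addr_t __init kpti_ng_pgd_alloc(enum pgtable_type type);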
Fixes: c64f46ee1377 ("arm64: mm: use enum to identify pgtable level instead of *_SHIFT")
Cc: <stable@vger.kernel.org> # 6.16.x
Signed-off-by: Kees Cook <kees@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250829190721.it.373-kees@kernel.org
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more MM updates from Andrew Morton:
"Significant patch series in this pull request:
- "mseal cleanups" (Lorenzo Stoakes)
Some mseal cleaning with no intended functional change.
- "Optimizations for khugepaged" (David Hildenbrand)
Improve khugepaged throughput by batching PTE operations for large
folios. This gain is mainly for arm64.
- "x86: enable EXECMEM_ROX_CACHE for ftrace and kprobes" (Mike Rapoport)
A bugfix, additional debug code and cleanups to the execmem code.
- "mm/shmem, swap: bugfix and improvement of mTHP swap in" (Kairui Song)
Bugfixes, cleanups and performance improvements to the mTHP swapin
code"
* tag 'mm-stable-2025-08-03-12-35' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (38 commits)
mm: mempool: fix crash in mempool_free() for zero-minimum pools
mm: correct type for vmalloc vm_flags fields
mm/shmem, swap: fix major fault counting
mm/shmem, swap: rework swap entry and index calculation for large swapin
mm/shmem, swap: simplify swapin path and result handling
mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO
mm/shmem, swap: tidy up swap entry splitting
mm/shmem, swap: tidy up THP swapin checks
mm/shmem, swap: avoid redundant Xarray lookup during swapin
x86/ftrace: enable EXECMEM_ROX_CACHE for ftrace allocations
x86/kprobes: enable EXECMEM_ROX_CACHE for kprobes allocations
execmem: drop writable parameter from execmem_fill_trapping_insns()
execmem: add fallback for failures in vmalloc(VM_ALLOW_HUGE_VMAP)
execmem: move execmem_force_rw() and execmem_restore_rox() before use
execmem: rework execmem_cache_free()
execmem: introduce execmem_alloc_rw()
execmem: drop unused execmem_update_copy()
mm: fix a UAF when vma->mm is freed after vma->vm_refcnt got dropped
mm/rmap: add anon_vma lifetime debug check
mm: remove mm/io-mapping.c
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
"Significant patch series in this pull request:
- "squashfs: Remove page->mapping references" (Matthew Wilcox) gets
us closer to being able to remove page->mapping
- "relayfs: misc changes" (Jason Xing) does some maintenance and
minor feature addition work in relayfs
- "kdump: crashkernel reservation from CMA" (Jiri Bohac) switches
us from static preallocation of the kdump crashkernel's working
memory over to dynamic allocation. So the difficulty of a-priori
estimation of the second kernel's needs is removed and the first
kernel obtains extra memory
- "generalize panic_print's dump function to be used by other
kernel parts" (Feng Tang) implements some consolidation and
rationalization of the various ways in which a failing kernel
splats information at the operator
* tag 'mm-nonmm-stable-2025-08-03-12-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (80 commits)
tools/getdelays: add backward compatibility for taskstats version
kho: add test for kexec handover
delaytop: enhance error logging and add PSI feature description
samples: Kconfig: fix spelling mistake "instancess" -> "instances"
fat: fix too many log in fat_chain_add()
scripts/spelling.txt: add notifer||notifier to spelling.txt
xen/xenbus: fix typo "notifer"
net: mvneta: fix typo "notifer"
drm/xe: fix typo "notifer"
cxl: mce: fix typo "notifer"
KVM: x86: fix typo "notifer"
MAINTAINERS: add maintainers for delaytop
ucount: use atomic_long_try_cmpxchg() in atomic_long_inc_below()
ucount: fix atomic_long_inc_below() argument type
kexec: enable CMA based contiguous allocation
stackdepot: make max number of pools boot-time configurable
lib/xxhash: remove unused functions
init/Kconfig: restore CONFIG_BROKEN help text
lib/raid6: update recov_rvv.c zero page usage
docs: update docs after introducing delaytop
...
|
|
Several functions refer to the unfortunately named 'vm_flags' field when
referencing vmalloc flags, which happens to be precisely the same name
used for VMA flags.
As a result these were erroneously changed to use the vm_flags_t type
(which currently is a typedef equivalent to unsigned long).
Currently this has no impact, but in future when vm_flags_t changes this
will result in issues, so change the type to unsigned long to account for
this.
[lorenzo.stoakes@oracle.com: fixup very disguised vmalloc flags parameter]
Link: https://lkml.kernel.org/r/e74dd8de-7e60-47ab-8a45-2c851f3c5d26@lucifer.local
Link: https://lkml.kernel.org/r/20250729114906.55347-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: Harry Yoo <harry.yoo@oracle.com>
Closes: https://lore.kernel.org/all/aIgSpAnU8EaIcqd9@hyeyoo/
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Optimizations for khugepaged", v4.
If the underlying folio mapped by the ptes is large, we can process those
ptes in a batch using folio_pte_batch().
For arm64 specifically, this results in a 16x reduction in the number of
ptep_get() calls, since on a contig block, ptep_get() on arm64 will
iterate through all 16 entries to collect a/d bits. Next, ptep_clear()
will cause a TLBI for every contig block in the range via
contpte_try_unfold(). Instead, use clear_ptes() to only do the TLBI at
the first and last contig block of the range.
For split folios, there will be no pte batching; the batch size returned
by folio_pte_batch() will be 1. For pagetable split folios, the ptes will
still point to the same large folio; for arm64, this results in the
optimization described above, and for other arches, a minor improvement is
expected due to a reduction in the number of function calls and batching
atomic operations.
This patch (of 3):
Let's add variants to be used where "full" does not apply -- which will
be the majority of cases in the future. "full" really only applies if
we are about to tear down a full MM.
Use get_and_clear_ptes() in existing code, clear_ptes() users will
be added next.
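A plausible shape for the new variants, as a sketch assuming they simply
forward to the existing *_full_ptes() helpers with full == 0:
    static inline pte_t get_and_clear_ptes(struct mm_struct *mm,
            unsigned long addr, pte_t *ptep, unsigned int nr)
    {
        return get_and_clear_full_ptes(mm, addr, ptep, nr, /* full = */ 0);
    }

    static inline void clear_ptes(struct mm_struct *mm,
            unsigned long addr, pte_t *ptep, unsigned int nr)
    {
        clear_full_ptes(mm, addr, ptep, nr, /* full = */ 0);
    }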
Link: https://lkml.kernel.org/r/20250724052301.23844-2-dev.jain@arm.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"As usual, many cleanups. The below blurbiage describes 42 patchsets.
21 of those are partially or fully cleanup work. "cleans up",
"cleanup", "maintainability", "rationalizes", etc.
I never knew the MM code was so dirty.
"mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes)
addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly
mapped VMAs were not eligible for merging with existing adjacent
VMAs.
"mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park)
adds a new kernel module which simplifies the setup and usage of
DAMON in production environments.
"stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig)
is a cleanup to the writeback code which removes a couple of
pointers from struct writeback_control.
"drivers/base/node.c: optimization and cleanups" (Donet Tom)
contains largely uncorrelated cleanups to the NUMA node setup and
management code.
"mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman)
does some maintenance work on the userfaultfd code.
"Readahead tweaks for larger folios" (Ryan Roberts)
implements some tuneups for pagecache readahead when it is reading
into order>0 folios.
"selftests/mm: Tweaks to the cow test" (Mark Brown)
provides some cleanups and consistency improvements to the
selftests code.
"Optimize mremap() for large folios" (Dev Jain)
does that. A 37% reduction in execution time was measured in a
memset+mremap+munmap microbenchmark.
"Remove zero_user()" (Matthew Wilcox)
expunges zero_user() in favor of the more modern memzero_page().
"mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand)
addresses some warts which David noticed in the huge page code.
These were not known to be causing any issues at this time.
"mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD" (SeongJae Park)
provides some cleanup and consolidation work in DAMON.
"use vm_flags_t consistently" (Lorenzo Stoakes)
uses vm_flags_t in places where we were inappropriately using other
types.
"mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy)
increases the reliability of large page allocation in the memfd
code.
"mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple)
removes several now-unneeded PFN_* flags.
"mm/damon: decouple sysfs from core" (SeongJae Park)
implements some cleanup and maintainability work in the DAMON
sysfs layer.
"madvise cleanup" (Lorenzo Stoakes)
does quite a lot of cleanup/maintenance work in the madvise() code.
"madvise anon_name cleanups" (Vlastimil Babka)
provides additional cleanups on top of Lorenzo's effort.
"Implement numa node notifier" (Oscar Salvador)
creates a standalone notifier for NUMA node memory state changes.
Previously these were lumped under the more general memory
on/offline notifier.
"Make MIGRATE_ISOLATE a standalone bit" (Zi Yan)
cleans up the pageblock isolation code and fixes a potential issue
which doesn't seem to cause any problems in practice.
"selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park)
adds additional drgn- and python-based DAMON selftests which are
more comprehensive than the existing selftest suite.
"Misc rework on hugetlb faulting path" (Oscar Salvador)
fixes a rather obscure deadlock in the hugetlb fault code and
follows that fix with a series of cleanups.
"cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport)
rationalizes and cleans up the highmem-specific code in the CMA
allocator.
"mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand)
provides cleanups and future-preparedness to the migration code.
"mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park)
adds some tracepoints to some DAMON auto-tuning code.
"mm/damon: fix misc bugs in DAMON modules" (SeongJae Park)
does that.
"mm/damon: misc cleanups" (SeongJae Park)
also does what it claims.
"mm: folio_pte_batch() improvements" (David Hildenbrand)
cleans up the large folio PTE batching code.
"mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park)
facilitates dynamic alteration of DAMON's inter-node allocation
policy.
"Remove unmap_and_put_page()" (Vishal Moola)
provides a couple of page->folio conversions.
"mm: per-node proactive reclaim" (Davidlohr Bueso)
implements a per-node control of proactive reclaim - beyond the
current memcg-based implementation.
"mm/damon: remove damon_callback" (SeongJae Park)
replaces the damon_callback interface with a more general and
powerful damon_call()+damos_walk() interface.
"mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes)
implements a number of mremap cleanups (of course) in preparation
for adding new mremap() functionality: newly permit the remapping
of multiple VMAs when the user is specifying MREMAP_FIXED. It still
excludes some specialized situations where this cannot be performed
reliably.
"drop hugetlb_free_pgd_range()" (Anthony Yznaga)
switches some sparc hugetlb code over to the generic version and
removes the thus-unneeded hugetlb_free_pgd_range().
"mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park)
augments the present userspace-requested update of DAMON sysfs
monitoring files. Automatic update is now provided, along with a
tunable to control the update interval.
"Some randome fixes and cleanups to swapfile" (Kemeng Shi)
does what it claims.
"mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand)
provides (and uses) a means by which debug-style functions can grab
a copy of a pageframe and inspect it locklessly without tripping
over the races inherent in operating on the live pageframe
directly.
"use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan)
addresses the large contention issues which can be triggered by
reads from that procfs file. Latencies are reduced by more than
half in some situations. The series also introduces several new
selftests for the /proc/pid/maps interface.
"__folio_split() clean up" (Zi Yan)
cleans up __folio_split()!
"Optimize mprotect() for large folios" (Dev Jain)
provides some quite large (>3x) speedups to mprotect() when dealing
with large folios.
"selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian)
does some cleanup work in the selftests code.
"tools/testing: expand mremap testing" (Lorenzo Stoakes)
extends the mremap() selftest in several ways, including adding
more checking of Lorenzo's recently added "permit mremap() move of
multiple VMAs" feature.
"selftests/damon/sysfs.py: test all parameters" (SeongJae Park)
extends the DAMON sysfs interface selftest so that it tests all
possible user-requested parameters, rather than the present minimal
subset"
* tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits)
MAINTAINERS: add missing headers to mempory policy & migration section
MAINTAINERS: add missing file to cgroup section
MAINTAINERS: add MM MISC section, add missing files to MISC and CORE
MAINTAINERS: add missing zsmalloc file
MAINTAINERS: add missing files to page alloc section
MAINTAINERS: add missing shrinker files
MAINTAINERS: move memremap.[ch] to hotplug section
MAINTAINERS: add missing mm_slot.h file THP section
MAINTAINERS: add missing interval_tree.c to memory mapping section
MAINTAINERS: add missing percpu-internal.h file to per-cpu section
mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info()
selftests/damon: introduce _common.sh to host shared function
selftests/damon/sysfs.py: test runtime reduction of DAMON parameters
selftests/damon/sysfs.py: test non-default parameters runtime commit
selftests/damon/sysfs.py: generalize DAMON context commit assertion
selftests/damon/sysfs.py: generalize monitoring attributes commit assertion
selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
selftests/damon/sysfs.py: test DAMOS filters commitment
selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion
selftests/damon/sysfs.py: test DAMOS destinations commitment
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"A quick summary: perf support for Branch Record Buffer Extensions
(BRBE), typical PMU hardware updates, small additions to MTE for
store-only tag checking and exposing non-address bits to signal
handlers, HAVE_LIVEPATCH enabled on arm64, VMAP_STACK forced on.
There is also a TLBI optimisation on hardware that does not require
break-before-make when changing the user PTEs between contiguous and
non-contiguous.
More details:
Perf and PMU updates:
- Add support for new (v3) Hisilicon SLLC and DDRC PMUs
- Add support for Arm-NI PMU integrations that share interrupts
between clock domains within a given instance
- Allow SPE to be configured with a lower sample period than the
minimum recommendation advertised by PMSIDR_EL1.Interval
- Add support for Arm's "Branch Record Buffer Extension" (BRBE)
- Adjust the perf watchdog period according to cpu frequency changes
- Minor driver fixes and cleanups
Hardware features:
- Support for MTE store-only checking (FEAT_MTE_STORE_ONLY)
- Support for reporting the non-address bits during a synchronous MTE
tag check fault (FEAT_MTE_TAGGED_FAR)
- Optimise the TLBI when folding/unfolding contiguous PTEs on
hardware with FEAT_BBM (break-before-make) level 2 and no TLB
conflict aborts
Software features:
- Enable HAVE_LIVEPATCH after implementing arch_stack_walk_reliable()
and using the text-poke API for late module relocations
- Force VMAP_STACK always on and change arm64_efi_rt_init() to use
arch_alloc_vmap_stack() in order to avoid KASAN false positives
ACPI:
- Improve SPCR handling and messaging on systems lacking an SPCR
table
Debug:
- Simplify the debug exception entry path
- Drop redundant DBG_MDSCR_* macros
Kselftests:
- Cleanups and improvements for SME, SVE and FPSIMD tests
Miscellaneous:
- Optimise loop to reduce redundant operations in contpte_ptep_get()
- Remove ISB when resetting POR_EL0 during signal handling
- Mark the kernel as tainted on SEA and SError panic
- Remove redundant gcs_free() call"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (93 commits)
arm64/gcs: task_gcs_el0_enable() should use passed task
arm64: Kconfig: Keep selects somewhat alphabetically ordered
arm64: signal: Remove ISB when resetting POR_EL0
kselftest/arm64: Handle attempts to disable SM on SME only systems
kselftest/arm64: Fix SVE write data generation for SME only systems
kselftest/arm64: Test SME on SME only systems in fp-ptrace
kselftest/arm64: Test FPSIMD format data writes via NT_ARM_SVE in fp-ptrace
kselftest/arm64: Allow sve-ptrace to run on SME only systems
arm64/mm: Drop redundant addr increment in set_huge_pte_at()
kselftest/arm4: Provide local defines for AT_HWCAP3
arm64: Mark kernel as tainted on SAE and SError panic
arm64/gcs: Don't call gcs_free() when releasing task_struct
drivers/perf: hisi: Support PMUs with no interrupt
drivers/perf: hisi: Relax the event number check of v2 PMUs
drivers/perf: hisi: Add support for HiSilicon SLLC v3 PMU driver
drivers/perf: hisi: Use ACPI driver_data to retrieve SLLC PMU information
drivers/perf: hisi: Add support for HiSilicon DDRC v3 PMU driver
drivers/perf: hisi: Simplify the probe process for each DDRC version
perf/arm-ni: Support sharing IRQs within an NI instance
perf/arm-ni: Consolidate CPU affinity handling
...
|
|
Override the generic definition of modify_prot_start_ptes() to use
get_and_clear_full_ptes(). This helper does a TLBI only for the starting
and ending contpte block of the range, whereas the current implementation
will call ptep_get_and_clear() for every contpte block, thus doing a TLBI
on every contpte block. Therefore, we have a performance win.
The arm64 definition of pte_accessible() allows us to batch in the
errata specific case:
#define pte_accessible(mm, pte) \
(mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
All ptes are obviously present in the folio batch, and they are also valid.
Override the generic definition of modify_prot_commit_ptes() to simply use
set_ptes() to map the new ptes into the pagetable.
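A hedged sketch of the two overrides; the exact arm64 definitions in the
series may differ in detail (e.g. around errata handling), and the argument
lists follow the generic helpers as I understand them:
    pte_t modify_prot_start_ptes(struct vm_area_struct *vma, unsigned long addr,
                                 pte_t *ptep, unsigned int nr)
    {
        /* One TLBI at the first and last contpte block of the range,
         * instead of one per contpte block. */
        return get_and_clear_full_ptes(vma->vm_mm, addr, ptep, nr, 0);
    }

    void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
                                 pte_t *ptep, pte_t old_pte, pte_t pte,
                                 unsigned int nr)
    {
        /* Simply map the new ptes; no further per-pte TLBI is needed here. */
        set_ptes(vma->vm_mm, addr, ptep, pte, nr);
    }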
Link: https://lkml.kernel.org/r/20250718090244.21092-8-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Cc: Zhenhua Huang <quic_zhenhuah@quicinc.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
'for-next/misc', 'for-next/acpi', 'for-next/debug-entry', 'for-next/feat_mte_tagged_far', 'for-next/kselftest', 'for-next/mdscr-cleanup' and 'for-next/vmap-stack', remote-tracking branch 'arm64/for-next/perf' into for-next/core
* arm64/for-next/perf: (23 commits)
drivers/perf: hisi: Support PMUs with no interrupt
drivers/perf: hisi: Relax the event number check of v2 PMUs
drivers/perf: hisi: Add support for HiSilicon SLLC v3 PMU driver
drivers/perf: hisi: Use ACPI driver_data to retrieve SLLC PMU information
drivers/perf: hisi: Add support for HiSilicon DDRC v3 PMU driver
drivers/perf: hisi: Simplify the probe process for each DDRC version
perf/arm-ni: Support sharing IRQs within an NI instance
perf/arm-ni: Consolidate CPU affinity handling
perf/cxlpmu: Fix typos in cxl_pmu.c comments and documentation
perf/cxlpmu: Remove unintended newline from IRQ name format string
perf/cxlpmu: Fix devm_kcalloc() argument order in cxl_pmu_probe()
perf: arm_spe: Relax period restriction
perf: arm_pmuv3: Add support for the Branch Record Buffer Extension (BRBE)
KVM: arm64: nvhe: Disable branch generation in nVHE guests
arm64: Handle BRBE booting requirements
arm64/sysreg: Add BRBE registers and fields
perf/arm: Add missing .suppress_bind_attrs
perf/arm-cmn: Reduce stack usage during discovery
perf: imx9_perf: make the read-only array mask static const
perf/arm-cmn: Broaden module description for wider interconnect support
...
* for-next/livepatch:
: Support for HAVE_LIVEPATCH on arm64
arm64: Kconfig: Keep selects somewhat alphabetically ordered
arm64: Implement HAVE_LIVEPATCH
arm64: stacktrace: Implement arch_stack_walk_reliable()
arm64: stacktrace: Check kretprobe_find_ret_addr() return value
arm64/module: Use text-poke API for late relocations.
* for-next/user-contig-bbml2:
: Optimise the TLBI when folding/unfolding contiguous PTEs on hardware with BBML2 and no TLB conflict aborts
arm64/mm: Elide tlbi in contpte_convert() under BBML2
iommu/arm: Add BBM Level 2 smmu feature
arm64: Add BBM Level 2 cpu feature
arm64: cpufeature: Introduce MATCH_ALL_EARLY_CPUS capability type
* for-next/misc:
: Miscellaneous arm64 patches
arm64/gcs: task_gcs_el0_enable() should use passed task
arm64: signal: Remove ISB when resetting POR_EL0
arm64/mm: Drop redundant addr increment in set_huge_pte_at()
arm64: Mark kernel as tainted on SAE and SError panic
arm64/gcs: Don't call gcs_free() when releasing task_struct
arm64: fix unnecessary rebuilding when CONFIG_DEBUG_EFI=y
arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
arm64: pi: use 'targets' instead of extra-y in Makefile
* for-next/acpi:
: Various ACPI arm64 changes
ACPI: Suppress misleading SPCR console message when SPCR table is absent
ACPI: Return -ENODEV from acpi_parse_spcr() when SPCR support is disabled
* for-next/debug-entry:
: Simplify the debug exception entry path
arm64: debug: remove debug exception registration infrastructure
arm64: debug: split bkpt32 exception entry
arm64: debug: split brk64 exception entry
arm64: debug: split hardware watchpoint exception entry
arm64: debug: split single stepping exception entry
arm64: debug: refactor reinstall_suspended_bps()
arm64: debug: split hardware breakpoint exception entry
arm64: entry: Add entry and exit functions for debug exceptions
arm64: debug: remove break/step handler registration infrastructure
arm64: debug: call step handlers statically
arm64: debug: call software breakpoint handlers statically
arm64: refactor aarch32_break_handler()
arm64: debug: clean up single_step_handler logic
* for-next/feat_mte_tagged_far:
: Support for reporting the non-address bits during a synchronous MTE tag check fault
kselftest/arm64/mte: Add mtefar tests on check_mmap_options
kselftest/arm64/mte: Refactor check_mmap_option test
kselftest/arm64/mte: Add verification for address tag in signal handler
kselftest/arm64/mte: Add address tag related macro and function
kselftest/arm64/mte: Check MTE_FAR feature is supported
kselftest/arm64/mte: Register mte signal handler with SA_EXPOSE_TAGBITS
kselftest/arm64: Add MTE_FAR hwcap test
KVM: arm64: Expose FEAT_MTE_TAGGED_FAR feature to guest
arm64: Report address tag when FEAT_MTE_TAGGED_FAR is supported
arm64/cpufeature: Add FEAT_MTE_TAGGED_FAR feature
* for-next/kselftest:
: Kselftest updates for arm64
kselftest/arm64: Handle attempts to disable SM on SME only systems
kselftest/arm64: Fix SVE write data generation for SME only systems
kselftest/arm64: Test SME on SME only systems in fp-ptrace
kselftest/arm64: Test FPSIMD format data writes via NT_ARM_SVE in fp-ptrace
kselftest/arm64: Allow sve-ptrace to run on SME only systems
kselftest/arm4: Provide local defines for AT_HWCAP3
kselftest/arm64: Specify SVE data when testing VL set in sve-ptrace
kselftest/arm64: Fix test for streaming FPSIMD write in sve-ptrace
kselftest/arm64: Fix check for setting new VLs in sve-ptrace
kselftest/arm64: Convert tpidr2 test to use kselftest.h
* for-next/mdscr-cleanup:
: Drop redundant DBG_MDSCR_* macros
KVM: selftests: Change MDSCR_EL1 register holding variables as uint64_t
arm64/debug: Drop redundant DBG_MDSCR_* macros
* for-next/vmap-stack:
: Force VMAP_STACK on arm64
arm64: remove CONFIG_VMAP_STACK checks from entry code
arm64: remove CONFIG_VMAP_STACK checks from SDEI stack handling
arm64: remove CONFIG_VMAP_STACK checks from stacktrace overflow logic
arm64: remove CONFIG_VMAP_STACK conditionals from traps overflow stack
arm64: remove CONFIG_VMAP_STACK conditionals from irq stack setup
arm64: Remove CONFIG_VMAP_STACK conditionals from THREAD_SHIFT and THREAD_ALIGN
arm64: efi: Remove CONFIG_VMAP_STACK check
arm64: Mandate VMAP_STACK
arm64: efi: Fix KASAN false positive for EFI runtime stack
arm64/ptrace: Fix stack-out-of-bounds read in regs_get_kernel_stack_nth()
arm64/gcs: Don't call gcs_free() during flush_gcs()
arm64: Restrict pagetable teardown to avoid false warning
docs: arm64: Fix ICC_SRE_EL2 register typo in booting.rst
|
|
Patch series "kdump: crashkernel reservation from CMA", v5.
This series implements a way to reserve additional crash kernel memory
using CMA.
Currently, none of the memory reserved for the crash kernel is usable by the
1st (production) kernel. It is also unmapped so that it can't be corrupted by
the fault that will eventually trigger the crash. This makes sense for
the memory actually used by the kexec-loaded crash kernel image and initrd
and the data prepared during the load (vmcoreinfo, ...). However, the
reserved space needs to be much larger than that to provide enough
run-time memory for the crash kernel and the kdump userspace. Estimating
the amount of memory to reserve is difficult. Being too careful makes
kdump likely to end in OOM, being too generous takes even more memory from
the production system. Also, the reservation only allows reserving a
single contiguous block (or two with the "low" suffix). I've seen systems
where this fails because the physical memory is fragmented.
By reserving additional crashkernel memory from CMA, the main crashkernel
reservation can be just large enough to fit the kernel and initrd image,
minimizing the memory taken away from the production system. Most of the
run-time memory for the crash kernel will be memory previously available
to userspace in the production system. As this memory is no longer
wasted, the reservation can be done with a generous margin, making kdump
more reliable. Kernel memory that we need to preserve for dumping is
normally not allocated from CMA, unless it is explicitly allocated as
movable. Currently this is only the case for memory ballooning and zswap.
Such movable memory will be missing from the vmcore. User data is
typically not dumped by makedumpfile. When dumping of user data is
intended this new CMA reservation cannot be used.
There are five patches in this series:
The first adds a new ",cma" suffix to the recenly introduced generic
crashkernel parsing code. parse_crashkernel() takes one more argument to
store the cma reservation size.
The second patch implements reserve_crashkernel_cma() which performs the
reservation. If the requested size is not available in a single range,
multiple smaller ranges will be reserved.
The third patch updates Documentation/, explicitly mentioning the
potential DMA corruption of the CMA-reserved memory.
The fourth patch adds a short delay before booting the kdump kernel,
allowing pending DMA transfers to finish.
The fifth patch enables the functionality for x86 as a proof of
concept. There are just three things every arch needs to do:
- call reserve_crashkernel_cma()
- include the CMA-reserved ranges in the physical memory map
- exclude the CMA-reserved ranges from the memory available
through /proc/vmcore by excluding them from the vmcoreinfo
PT_LOAD ranges.
Adding other architectures is easy and I can do that as soon as this
series is merged.
With this series applied, specifying
crashkernel=100M crashkernel=1G,cma
on the command line will make a standard crashkernel reservation
of 100M, where kexec will load the kernel and initrd.
An additional 1G will be reserved from CMA, still usable by the production
system. The crash kernel will have 1.1G memory available. The 100M can
be reliably predicted based on the size of the kernel and initrd.
The new cma suffix is completely optional. When no
crashkernel=size,cma is specified, everything works as before.
This patch (of 5):
Add a new cma_size parameter to parse_crashkernel(). When not NULL, call
__parse_crashkernel to parse the CMA reservation size from
"crashkernel=size,cma" and store it in cma_size.
Set cma_size to NULL in all calls to parse_crashkernel().
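Illustrative call shape after this change (the position of the new argument
is approximated here, and memblock_phys_mem_size()/boot_command_line stand in
for whatever the arch caller already passes):
    unsigned long long crash_size, crash_base, low_size, cma_size;
    bool high = false;
    int ret;

    /* Arch code that wants the CMA portion: */
    ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
                            &crash_size, &crash_base,
                            &low_size, &cma_size, &high);

    /* Existing callers simply pass NULL for cma_size: */
    ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
                            &crash_size, &crash_base,
                            &low_size, NULL, &high);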
Link: https://lkml.kernel.org/r/aEqnxxfLZMllMC8I@dwarf.suse.cz
Link: https://lkml.kernel.org/r/aEqoQckgoTQNULnh@dwarf.suse.cz
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Donald Dutile <ddutile@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: Pingfan Liu <piliu@redhat.com>
Cc: Tao Liu <ltao@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The 'addr' need not be incremented in the loop because it is not used
subsequently.
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20250716035432.293682-1-anshuman.khandual@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|