| Age | Commit message (Collapse) | Author |
|
These definitions are useful for relocating the kernel image as well,
regardless of the type of relocations used for modules.
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://lore.kernel.org/r/20241026171441.3047904-5-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
|
When arm64 is built with LTO, it upgrades READ_ONCE() to ldar / ldapr
(load-acquire) to avoid issues that can be caused by the compiler
optimizing away implicit address dependencies.
Unlike plain loads, these load-acquire instructions actually require an
aligned address.
For now, fix it by removing the READ_ONCE() that the buggy commit
introduced.
Fixes: ece69af2ede1 ("rwonce: handle KCSAN like KASAN in read_word_at_a_time()")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://lore.kernel.org/r/20250326203926.GA10484@ax162
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:
- Add support for running as the root partition in Hyper-V (Microsoft
Hypervisor) by exposing /dev/mshv (Nuno and various people)
- Add support for CPU offlining in Hyper-V (Hamza Mahfooz)
- Misc fixes and cleanups (Roman Kisel, Tianyu Lan, Wei Liu, Michael
Kelley, Thorsten Blum)
* tag 'hyperv-next-signed-20250324' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (24 commits)
x86/hyperv: fix an indentation issue in mshyperv.h
x86/hyperv: Add comments about hv_vpset and var size hypercall input args
Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
hyperv: Add definitions for root partition driver to hv headers
x86: hyperv: Add mshv_handler() irq handler and setup function
Drivers: hv: Introduce per-cpu event ring tail
Drivers: hv: Export some functions for use by root partition module
acpi: numa: Export node_to_pxm()
hyperv: Introduce hv_recommend_using_aeoi()
arm64/hyperv: Add some missing functions to arm64
x86/mshyperv: Add support for extended Hyper-V features
hyperv: Log hypercall status codes as strings
x86/hyperv: Fix check of return value from snp_set_vmsa()
x86/hyperv: Add VTL mode callback for restarting the system
x86/hyperv: Add VTL mode emergency restart callback
hyperv: Remove unused union and structs
hyperv: Add CONFIG_MSHV_ROOT to gate root partition support
hyperv: Change hv_root_partition into a function
hyperv: Convert hypercall statuses to linux error codes
drivers/hv: add CPU offlining support
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull VDSO infrastructure updates from Thomas Gleixner:
- Consolidate the VDSO storage
The VDSO data storage and data layout has been largely architecture
specific for historical reasons. That increases the maintenance
effort and causes inconsistencies over and over.
There is no real technical reason for architecture specific layouts
and implementations. The architecture specific details can easily be
integrated into a generic layout, which also reduces the amount of
duplicated code for managing the mappings.
Convert all architectures over to a unified layout and common mapping
infrastructure. This splits the VDSO data layout into subsystem
specific blocks, timekeeping, random and architecture parts, which
provides a better structure and allows to improve and update the
functionalities without conflict and interaction.
- Rework the timekeeping data storage
The current implementation is designed for exposing system
timekeeping accessors, which was good enough at the time when it was
designed.
PTP and Time Sensitive Networking (TSN) change that as there are
requirements to expose independent PTP clocks, which are not related
to system timekeeping.
Replace the monolithic data storage by a structured layout, which
allows to add support for independent PTP clocks on top while reusing
both the data structures and the time accessor implementations.
* tag 'timers-vdso-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits)
sparc/vdso: Always reject undefined references during linking
x86/vdso: Always reject undefined references during linking
vdso: Rework struct vdso_time_data and introduce struct vdso_clock
vdso: Move architecture related data before basetime data
powerpc/vdso: Prepare introduction of struct vdso_clock
arm64/vdso: Prepare introduction of struct vdso_clock
x86/vdso: Prepare introduction of struct vdso_clock
time/namespace: Prepare introduction of struct vdso_clock
vdso/namespace: Rename timens_setup_vdso_data() to reflect new vdso_clock struct
vdso/vsyscall: Prepare introduction of struct vdso_clock
vdso/gettimeofday: Prepare helper functions for introduction of struct vdso_clock
vdso/gettimeofday: Prepare do_coarse_timens() for introduction of struct vdso_clock
vdso/gettimeofday: Prepare do_coarse() for introduction of struct vdso_clock
vdso/gettimeofday: Prepare do_hres_timens() for introduction of struct vdso_clock
vdso/gettimeofday: Prepare do_hres() for introduction of struct vdso_clock
vdso/gettimeofday: Prepare introduction of struct vdso_clock
vdso/helpers: Prepare introduction of struct vdso_clock
vdso/datapage: Define vdso_clock to prepare for multiple PTP clocks
vdso: Make vdso_time_data cacheline aligned
arm64: Make asm/cache.h compatible with vDSO
...
|
|
read_word_at_a_time() is allowed to read out of bounds by straddling the
end of an allocation (and the caller is expected to then mask off
out-of-bounds data). This works as long as the caller guarantees that the
access won't hit a pagefault (either by ensuring that addr is aligned or by
explicitly checking where the next page boundary is).
Such out-of-bounds data could include things like KASAN redzones, adjacent
allocations that are concurrently written to, or simply an adjacent struct
field that is concurrently updated. KCSAN should ignore racy reads of OOB
data that is not actually used, just like KASAN, so (similar to the code
above) change read_word_at_a_time() to use __no_sanitize_or_inline instead
of __no_kasan_or_inline, and explicitly inform KCSAN that we're reading
the first byte.
We do have an instrument_read() helper that calls into both KASAN and
KCSAN, but I'm instead open-coding that here to avoid having to pull the
entire instrumented.h header into rwonce.h.
Also, since this read can be racy by design, we should technically do
READ_ONCE(), so add that.
Fixes: dfd402a4c4ba ("kcsan: Add Kernel Concurrency Sanitizer infrastructure")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/fpu updates from Ingo Molnar:
- Improve crypto performance by making kernel-mode FPU reliably usable
in softirqs ((Eric Biggers)
- Fully optimize out WARN_ON_FPU() (Eric Biggers)
- Initial steps to support Support Intel APX (Advanced Performance
Extensions) (Chang S. Bae)
- Fix KASAN for arch_dup_task_struct() (Benjamin Berg)
- Refine and simplify the FPU magic number check during signal return
(Chang S. Bae)
- Fix inconsistencies in guest FPU xfeatures (Chao Gao, Stanislav
Spassov)
- selftests/x86/xstate: Introduce common code for testing extended
states (Chang S. Bae)
- Misc fixes and cleanups (Borislav Petkov, Colin Ian King, Uros
Bizjak)
* tag 'x86-fpu-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu/xstate: Fix inconsistencies in guest FPU xfeatures
x86/fpu: Clarify the "xa" symbolic name used in the XSTATE* macros
x86/fpu: Use XSAVE{,OPT,C,S} and XRSTOR{,S} mnemonics in xstate.h
x86/fpu: Improve crypto performance by making kernel-mode FPU reliably usable in softirqs
x86/fpu/xstate: Simplify print_xstate_features()
x86/fpu: Refine and simplify the magic number check during signal return
selftests/x86/xstate: Fix spelling mistake "hader" -> "header"
x86/fpu: Avoid copying dynamic FP state from init_task in arch_dup_task_struct()
vmlinux.lds.h: Remove entry to place init_task onto init_stack
selftests/x86/avx: Add AVX tests
selftests/x86/xstate: Clarify supported xstates
selftests/x86/xstate: Consolidate test invocations into a single entry
selftests/x86/xstate: Introduce signal ABI test
selftests/x86/xstate: Refactor ptrace ABI test
selftests/x86/xstate: Refactor context switching test
selftests/x86/xstate: Enumerate and name xstate components
selftests/x86/xstate: Refactor XSAVE helpers for general use
selftests/x86: Consolidate redundant signal helper functions
x86/fpu: Fix guest FPU state buffer allocation size
x86/fpu: Fully optimize out WARN_ON_FPU()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core x86 updates from Ingo Molnar:
"x86 CPU features support:
- Generate the <asm/cpufeaturemasks.h> header based on build config
(H. Peter Anvin, Xin Li)
- x86 CPUID parsing updates and fixes (Ahmed S. Darwish)
- Introduce the 'setcpuid=' boot parameter (Brendan Jackman)
- Enable modifying CPU bug flags with '{clear,set}puid=' (Brendan
Jackman)
- Utilize CPU-type for CPU matching (Pawan Gupta)
- Warn about unmet CPU feature dependencies (Sohil Mehta)
- Prepare for new Intel Family numbers (Sohil Mehta)
Percpu code:
- Standardize & reorganize the x86 percpu layout and related cleanups
(Brian Gerst)
- Convert the stackprotector canary to a regular percpu variable
(Brian Gerst)
- Add a percpu subsection for cache hot data (Brian Gerst)
- Unify __pcpu_op{1,2}_N() macros to __pcpu_op_N() (Uros Bizjak)
- Construct __percpu_seg_override from __percpu_seg (Uros Bizjak)
MM:
- Add support for broadcast TLB invalidation using AMD's INVLPGB
instruction (Rik van Riel)
- Rework ROX cache to avoid writable copy (Mike Rapoport)
- PAT: restore large ROX pages after fragmentation (Kirill A.
Shutemov, Mike Rapoport)
- Make memremap(MEMREMAP_WB) map memory as encrypted by default
(Kirill A. Shutemov)
- Robustify page table initialization (Kirill A. Shutemov)
- Fix flush_tlb_range() when used for zapping normal PMDs (Jann Horn)
- Clear _PAGE_DIRTY for kernel mappings when we clear _PAGE_RW
(Matthew Wilcox)
KASLR:
- x86/kaslr: Reduce KASLR entropy on most x86 systems, to support PCI
BAR space beyond the 10TiB region (CONFIG_PCI_P2PDMA=y) (Balbir
Singh)
CPU bugs:
- Implement FineIBT-BHI mitigation (Peter Zijlstra)
- speculation: Simplify and make CALL_NOSPEC consistent (Pawan Gupta)
- speculation: Add a conditional CS prefix to CALL_NOSPEC (Pawan
Gupta)
- RFDS: Exclude P-only parts from the RFDS affected list (Pawan
Gupta)
System calls:
- Break up entry/common.c (Brian Gerst)
- Move sysctls into arch/x86 (Joel Granados)
Intel LAM support updates: (Maciej Wieczor-Retman)
- selftests/lam: Move cpu_has_la57() to use cpuinfo flag
- selftests/lam: Skip test if LAM is disabled
- selftests/lam: Test get_user() LAM pointer handling
AMD SMN access updates:
- Add SMN offsets to exclusive region access (Mario Limonciello)
- Add support for debugfs access to SMN registers (Mario Limonciello)
- Have HSMP use SMN through AMD_NODE (Yazen Ghannam)
Power management updates: (Patryk Wlazlyn)
- Allow calling mwait_play_dead with an arbitrary hint
- ACPI/processor_idle: Add FFH state handling
- intel_idle: Provide the default enter_dead() handler
- Eliminate mwait_play_dead_cpuid_hint()
Build system:
- Raise the minimum GCC version to 8.1 (Brian Gerst)
- Raise the minimum LLVM version to 15.0.0 (Nathan Chancellor)
Kconfig: (Arnd Bergmann)
- Add cmpxchg8b support back to Geode CPUs
- Drop 32-bit "bigsmp" machine support
- Rework CONFIG_GENERIC_CPU compiler flags
- Drop configuration options for early 64-bit CPUs
- Remove CONFIG_HIGHMEM64G support
- Drop CONFIG_SWIOTLB for PAE
- Drop support for CONFIG_HIGHPTE
- Document CONFIG_X86_INTEL_MID as 64-bit-only
- Remove old STA2x11 support
- Only allow CONFIG_EISA for 32-bit
Headers:
- Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI and non-UAPI
headers (Thomas Huth)
Assembly code & machine code patching:
- x86/alternatives: Simplify alternative_call() interface (Josh
Poimboeuf)
- x86/alternatives: Simplify callthunk patching (Peter Zijlstra)
- KVM: VMX: Use named operands in inline asm (Josh Poimboeuf)
- x86/hyperv: Use named operands in inline asm (Josh Poimboeuf)
- x86/traps: Cleanup and robustify decode_bug() (Peter Zijlstra)
- x86/kexec: Merge x86_32 and x86_64 code using macros from
<asm/asm.h> (Uros Bizjak)
- Use named operands in inline asm (Uros Bizjak)
- Improve performance by using asm_inline() for atomic locking
instructions (Uros Bizjak)
Earlyprintk:
- Harden early_serial (Peter Zijlstra)
NMI handler:
- Add an emergency handler in nmi_desc & use it in
nmi_shootdown_cpus() (Waiman Long)
Miscellaneous fixes and cleanups:
- by Ahmed S. Darwish, Andy Shevchenko, Ard Biesheuvel, Artem
Bityutskiy, Borislav Petkov, Brendan Jackman, Brian Gerst, Dan
Carpenter, Dr. David Alan Gilbert, H. Peter Anvin, Ingo Molnar,
Josh Poimboeuf, Kevin Brodsky, Mike Rapoport, Lukas Bulwahn, Maciej
Wieczor-Retman, Max Grobecker, Patryk Wlazlyn, Pawan Gupta, Peter
Zijlstra, Philip Redkin, Qasim Ijaz, Rik van Riel, Thomas Gleixner,
Thorsten Blum, Tom Lendacky, Tony Luck, Uros Bizjak, Vitaly
Kuznetsov, Xin Li, liuye"
* tag 'x86-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (211 commits)
zstd: Increase DYNAMIC_BMI2 GCC version cutoff from 4.8 to 11.0 to work around compiler segfault
x86/asm: Make asm export of __ref_stack_chk_guard unconditional
x86/mm: Only do broadcast flush from reclaim if pages were unmapped
perf/x86/intel, x86/cpu: Replace Pentium 4 model checks with VFM ones
perf/x86/intel, x86/cpu: Simplify Intel PMU initialization
x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in non-UAPI headers
x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI headers
x86/locking/atomic: Improve performance by using asm_inline() for atomic locking instructions
x86/asm: Use asm_inline() instead of asm() in clwb()
x86/asm: Use CLFLUSHOPT and CLWB mnemonics in <asm/special_insns.h>
x86/hweight: Use asm_inline() instead of asm()
x86/hweight: Use ASM_CALL_CONSTRAINT in inline asm()
x86/hweight: Use named operands in inline asm()
x86/stackprotector/64: Only export __ref_stack_chk_guard on CONFIG_SMP
x86/head/64: Avoid Clang < 17 stack protector in startup code
x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h>
x86/runtime-const: Add the RUNTIME_CONST_PTR assembly macro
x86/cpu/intel: Limit the non-architectural constant_tsc model checks
x86/mm/pat: Replace Intel x86_model checks with VFM ones
x86/cpu/intel: Fix fast string initialization for extended Families
...
|
|
Pull documentation updates from Jonathan Corbet:
"It has been a reasonably busy cycle for docs...
- Significant changes throughout the tree to bring Python code up to
current standards and raise the minimum Python required to 3.9
Much of this is preparatory to replacing the ancient Perl
scripts/kernel-doc horror with a slightly less horrifying Python
implementation, expected for 6.16
- Update the minimum Sphinx required to 3.4.3, allowing us to remove
a bunch of older compatibility code
- Rework and improve the generation of the ABI documentation
(All of the above done by Mauro)
- Lots of translation updates. Alex Shi and Yanteng Si are taking on
responsibility for the Chinese translations going forward; that
work will still get to you via docs-next
- Try to standardize the format for indicating a developer's
affiliation in commit tags
- Clarify the TAB's role in CoC enforcement actions
- Try to spell out the rules for when a commit tag can name another
developer without their explicit permission
Plus lots of other typo fixes and updates"
* tag 'docs-6.15' of git://git.lwn.net/linux: (98 commits)
docs/zh_CN: fix spelling mistake
docs/Chinese: change the disclaimer words
docs/zh_CN: Add snp-tdx-threat-model index Chinese translation
docs: driver-api: firmware: clarify userspace requirements
docs: clarify rules wrt tagging other people
docs: Remove outdated highuid.rst documentation
Documentation: dma-buf: heaps: Add heap name definitions
docs/.../submit-checklist: Use Documentation/admin-guide/abi.rst for cross-ref of README
docs: Correct installation instruction
Documentation: kcsan: fix "Plain Accesses and Data Races" URL in kcsan.rst
Documentation/CoC: Spell out the TAB role in enforcement decisions
Documentation: ocxl.rst: Update consortium site
scripts: get_feat.pl: substitute s390x with s390
scripts/kernel-doc: drop dead code for Wcontents_before_sections
scripts/kernel-doc: don't add not needed new lines
docs: driver-api/infiniband.rst: fix Kerneldoc markup
drivers: firewire: firewire-cdev.h: fix identation on a kernel-doc markup
drivers: media: intel-ipu3.h: fix identation on a kernel-doc markup
include/asm-generic/io.h: fix kerneldoc markup
Docs/arch/arm64: Fix spelling in amu.rst
...
|
|
Add mshv_handler() to process messages related to managing guest
partitions such as intercepts, doorbells, and scheduling messages.
In a (non-nested) root partition, the same interrupt vector is shared
between the vmbus and mshv_root drivers.
Introduce a stub for mshv_handler() and call it in
sysvec_hyperv_callback alongside vmbus_handler().
Even though both handlers will be called for every Hyper-V interrupt,
the messages for each driver are delivered to different offsets
within the SYNIC message page, so they won't step on each other.
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Reviewed-by: Tianyu Lan <tiala@microsoft.com>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Link: https://lore.kernel.org/r/1741980536-3865-9-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1741980536-3865-9-git-send-email-nunodasneves@linux.microsoft.com>
|
|
Factor out the check for enabling auto eoi, to be reused in root
partition code.
No functional changes.
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Tianyu Lan <tiala@microsoft.com>
Link: https://lore.kernel.org/r/1741980536-3865-5-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1741980536-3865-5-git-send-email-nunodasneves@linux.microsoft.com>
|
|
These non-nested msr and fast hypercall functions are present in x86,
but they must be available in both architectures for the root partition
driver code.
While at it, remove the redundant 'extern' keywords from the
hv_do_hypercall() variants in asm-generic/mshyperv.h.
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Reviewed-by: Roman Kisel <romank@linux.microsoft.com>
Link: https://lore.kernel.org/r/1741980536-3865-4-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1741980536-3865-4-git-send-email-nunodasneves@linux.microsoft.com>
|
|
Extend the "ms_hyperv_info" structure to include a new field,
"ext_features", for capturing extended Hyper-V features.
Update the "ms_hyperv_init_platform" function to retrieve these features
using the cpuid instruction and include them in the informational output.
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Reviewed-by: Roman Kisel <romank@linux.microsoft.com>
Reviewed-by: Tianyu Lan <tiala@microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/1741980536-3865-3-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1741980536-3865-3-git-send-email-nunodasneves@linux.microsoft.com>
|
|
Introduce hv_status_printk() macros as a convenience to log hypercall
errors, formatting them with the status code (HV_STATUS_*) as a raw hex
value and also as a string, which saves some time while debugging.
Create a table of HV_STATUS_ codes with strings and mapped errnos, and
use it for hv_result_to_string() and hv_result_to_errno().
Use the new hv_status_printk()s in hv_proc.c, hyperv-iommu.c, and
irqdomain.c hypercalls to aid debugging in the root partition.
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Link: https://lore.kernel.org/r/1741980536-3865-2-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1741980536-3865-2-git-send-email-nunodasneves@linux.microsoft.com>
|
|
CONFIG_MSHV_ROOT allows kernels built to run as a normal Hyper-V guest
to exclude the root partition code, which is expected to grow
significantly over time.
This option is a tristate so future driver code can be built as a
(m)odule, allowing faster development iteration cycles.
If CONFIG_MSHV_ROOT is disabled, don't compile hv_proc.c, and stub
hv_root_partition() to return false unconditionally. This allows the
compiler to optimize away root partition code blocks since they will
be disabled at compile time.
In the case of booting as root partition *without* CONFIG_MSHV_ROOT
enabled, print a critical error (the kernel will likely crash).
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/1740167795-13296-4-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1740167795-13296-4-git-send-email-nunodasneves@linux.microsoft.com>
|
|
Introduce four new kfuncs, bpf_res_spin_lock, and bpf_res_spin_unlock,
and their irqsave/irqrestore variants, which wrap the rqspinlock APIs.
bpf_res_spin_lock returns a conditional result, depending on whether the
lock was acquired (NULL is returned when lock acquisition succeeds,
non-NULL upon failure). The memory pointed to by the returned pointer
upon failure can be dereferenced after the NULL check to obtain the
error code.
Instead of using the old bpf_spin_lock type, introduce a new type with
the same layout, and the same alignment, but a different name to avoid
type confusion.
Preemption is disabled upon successful lock acquisition, however IRQs
are not. Special kfuncs can be introduced later to allow disabling IRQs
when taking a spin lock. Resilient locks are safe against AA deadlocks,
hence not disabling IRQs currently does not allow violation of kernel
safety.
__irq_flag annotation is used to accept IRQ flags for the IRQ-variants,
with the same semantics as existing bpf_local_irq_{save, restore}.
These kfuncs will require additional verifier-side support in subsequent
commits, to allow programs to hold multiple locks at the same time.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-23-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Ensure that the rqspinlock code is only built when the BPF subsystem is
compiled in. Depending on queued spinlock support, we may or may not end
up building the queued spinlock slowpath, and instead fallback to the
test-and-set implementation. Also add entries to MAINTAINERS file.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-18-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Introduce helper macros that wrap around the rqspinlock slow path and
provide an interface analogous to the raw_spin_lock API. Note that
in case of error conditions, preemption and IRQ disabling is
automatically unrolled before returning the error back to the caller.
Ensure that in absence of CONFIG_QUEUED_SPINLOCKS support, we fallback
to the test-and-set implementation.
Add some comments describing the subtle memory ordering logic during
unlock, and why it's safe.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-17-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
We ripped out PV and virtualization related bits from rqspinlock in an
earlier commit, however, a fair lock performs poorly within a virtual
machine when the lock holder is preempted. As such, retain the
virt_spin_lock fallback to test and set lock, but with timeout and
deadlock detection. We can do this by simply depending on the
resilient_tas_spin_lock implementation from the previous patch.
We don't integrate support for CONFIG_PARAVIRT_SPINLOCKS yet, as that
requires more involved algorithmic changes and introduces more
complexity. It can be done when the need arises in the future.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-15-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Include a test-and-set fallback when queued spinlock support is not
available. Introduce a rqspinlock type to act as a fallback when
qspinlock support is absent.
Include ifdef guards to ensure the slow path in this file is only
compiled when CONFIG_QUEUED_SPINLOCKS=y. Subsequent patches will add
further logic to ensure fallback to the test-and-set implementation
when queued spinlock support is unavailable on an architecture.
Unlike other waiting loops in rqspinlock code, the one for test-and-set
has no theoretical upper bound under contention, therefore we need a
longer timeout than usual. Bump it up to a second in this case.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-14-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
While the timeout logic provides guarantees for the waiter's forward
progress, the time until a stalling waiter unblocks can still be long.
The default timeout of 1/4 sec can be excessively long for some use
cases. Additionally, custom timeouts may exacerbate recovery time.
Introduce logic to detect common cases of deadlocks and perform quicker
recovery. This is done by dividing the time from entry into the locking
slow path until the timeout into intervals of 1 ms. Then, after each
interval elapses, deadlock detection is performed, while also polling
the lock word to ensure we can quickly break out of the detection logic
and proceed with lock acquisition.
A 'held_locks' table is maintained per-CPU where the entry at the bottom
denotes a lock being waited for or already taken. Entries coming before
it denote locks that are already held. The current CPU's table can thus
be looked at to detect AA deadlocks. The tables from other CPUs can be
looked at to discover ABBA situations. Finally, when a matching entry
for the lock being taken on the current CPU is found on some other CPU,
a deadlock situation is detected. This function can take a long time,
therefore the lock word is constantly polled in each loop iteration to
ensure we can preempt detection and proceed with lock acquisition, using
the is_lock_released check.
We set 'spin' member of rqspinlock_timeout struct to 0 to trigger
deadlock checks immediately to perform faster recovery.
Note: Extending lock word size by 4 bytes to record owner CPU can allow
faster detection for ABBA. It is typically the owner which participates
in a ABBA situation. However, to keep compatibility with existing lock
words in the kernel (struct qspinlock), and given deadlocks are a rare
event triggered by bugs, we choose to favor compatibility over faster
detection.
The release_held_lock_entry function requires an smp_wmb, while the
release store on unlock will provide the necessary ordering for us. Add
comments to document the subtleties of why this is correct. It is
possible for stores to be reordered still, but in the context of the
deadlock detection algorithm, a release barrier is sufficient and
needn't be stronger for unlock's case.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-13-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The pending bit is used to avoid queueing in case the lock is
uncontended, and has demonstrated benefits for the 2 contender scenario,
esp. on x86. In case the pending bit is acquired and we wait for the
locked bit to disappear, we may get stuck due to the lock owner not
making progress. Hence, this waiting loop must be protected with a
timeout check.
To perform a graceful recovery once we decide to abort our lock
acquisition attempt in this case, we must unset the pending bit since we
own it. All waiters undoing their changes and exiting gracefully allows
the lock word to be restored to the unlocked state once all participants
(owner, waiters) have been recovered, and the lock remains usable.
Hence, set the pending bit back to zero before returning to the caller.
Introduce a lockevent (rqspinlock_lock_timeout) to capture timeout
event statistics.
Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-10-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Introduce policy macro RES_CHECK_TIMEOUT which can be used to detect
when the timeout has expired for the slow path to return an error. It
depends on being passed two variables initialized to 0: ts, ret. The
'ts' parameter is of type rqspinlock_timeout.
This macro resolves to the (ret) expression so that it can be used in
statements like smp_cond_load_acquire to break the waiting loop
condition.
The 'spin' member is used to amortize the cost of checking time by
dispatching to the implementation every 64k iterations. The
'timeout_end' member is used to keep track of the timestamp that denotes
the end of the waiting period. The 'ret' parameter denotes the status of
the timeout, and can be checked in the slow path to detect timeouts
after waiting loops.
The 'duration' member is used to store the timeout duration for each
waiting loop. The default timeout value defined in the header
(RES_DEF_TIMEOUT) is 0.25 seconds.
This macro will be used as a condition for waiting loops in the slow
path. Since each waiting loop applies a fresh timeout using the same
rqspinlock_timeout, we add a new RES_RESET_TIMEOUT as well to ensure the
values can be easily reinitialized to the default state.
Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-8-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This header contains the public declarations usable in the rest of the
kernel for rqspinlock.
Let's also type alias qspinlock to rqspinlock_t to ensure consistent use
of the new lock type. We want to remove dependence on the qspinlock type
in later patches as we need to provide a test-and-set fallback, hence
begin abstracting away from now onwards.
Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Move the definition of the struct mcs_spinlock from the private
mcs_spinlock.h header in kernel/locking to the mcs_spinlock.h
asm-generic header, since we will need to reference it from the
qspinlock.h header in subsequent commits.
Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
max_mapnr is essentially the size of the memory map for systems that use
FLATMEM. There is no reason to calculate it in each and every architecture
when it's anyway calculated in alloc_node_mem_map().
Drop setting of max_mapnr from architecture code and set it once in
alloc_node_mem_map().
While on it, move definition of mem_map and max_mapnr to mm/mm_init.c so
there won't be two copies for MMU and !MMU variants.
Link: https://lkml.kernel.org/r/20250313135003.836600-10-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86]
Tested-by: Mark Brown <broonie@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Guo Ren (csky) <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russel King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
ioremap_prot() currently accepts pgprot_val parameter as an unsigned long,
thus implicitly assuming that pgprot_val and pgprot_t could never be
bigger than unsigned long. But this assumption soon will not be true on
arm64 when using D128 pgtables. In 128 bit page table configuration,
unsigned long is 64 bit, but pgprot_t is 128 bit.
Passing platform abstracted pgprot_t argument is better as compared to
size based data types. Let's change the parameter to directly pass
pgprot_t like another similar helper generic_ioremap_prot().
Without this change in place, D128 configuration does not work on arm64 as
the top 64 bits gets silently stripped when passing the protection value
to this function.
Link: https://lkml.kernel.org/r/20250218101954.415331-1-anshuman.khandual@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Co-developed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In commit d7f861b9c43a ("mm/mmu_gather: add __tlb_remove_folio_pages()"),
helper function __tlb_remove_folio_pages_size() was added. And based on
the helper, wrapper functions __tlb_remove_folio_pages() and
__tlb_remove_page_size() are created and used by upper level functions.
So let's update the code comment to reflect the current code about
tlb_remove_page()/tlb_remove_page_size(), etc.
Link: https://lkml.kernel.org/r/20250211034348.39531-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Nobody is using __tlb_remove_page() now, clean it up.
And also remove the code comment above tlb_remove_page() because it's not
meaningful any more.
Link: https://lkml.kernel.org/r/Z6si0/A/zzEF/bFJ@MiWiFi-R3L-srv
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The patch introduces __percpu_qual define and repurposes __percpu tag as a
named address space qualifier using the new define.
Arches can now conditionally define __percpu_qual as their named address
space qualifier for percpu variables.
Link: https://lkml.kernel.org/r/20250127160709.80604-6-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Nadav Amit <nadav.amit@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use TYPEOF_UNQUAL() to declare variables as a corresponding type without
named address space qualifier to avoid "`__seg_gs' specified for auto
variable `var'" errors.
Link: https://lkml.kernel.org/r/20250127160709.80604-4-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Nadav Amit <nadav.amit@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
To support multiple PTP clocks, the VDSO data structure needs to be
reworked. All clock specific data will end up in struct vdso_clock and in
struct vdso_time_data there will be an array of VDSO clocks.
Now that all preparatory changes are in place:
Split the clock related struct members into a separate struct
vdso_clock. Make sure all users are aware, that vdso_time_data is no longer
initialized as an array and vdso_clock is now the array inside
vdso_data. Remove the vdso_clock define, which mapped it to vdso_time_data
for the transition.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250303-vdso-clock-v1-19-c1b5c69a166f@linutronix.de
|
|
Make __per_cpu_hot_end marker point to the end of the percpu cache
hot data, not to the end of the percpu cache hot section.
This fixes CONFIG_MPENTIUM4 case where X86_L1_CACHE_SHIFT
is set to 7 (128 bytes).
Also update assert message accordingly.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Link: https://lore.kernel.org/r/20250304173455.89361-1-ubizjak@gmail.com
Closes: https://lore.kernel.org/lkml/Z8a-NVJs-pm5W-mG@gmail.com/
|
|
Add a subsection to the percpu data for frequently accessed variables
that should remain cached on each processor. These varables should not
be accessed from other processors to avoid cacheline bouncing.
This will replace the pcpu_hot struct on x86, and open up similar
functionality to other architectures and the kernel core.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250303165246.2175811-2-brgerst@gmail.com
|
|
Kerneldoc requires a "-" after the name of a function for it
to be recognized as a function.
Add it.
Fix those kernel-doc warnings:
include/asm-generic/io.h:1215: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* memset_io Set a range of I/O memory to a constant value
include/asm-generic/io.h:1227: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* memcpy_fromio Copy a block of data from I/O memory
include/asm-generic/io.h:1239: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* memcpy_toio Copy a block of data into I/O memory
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/066968c00196ed88f6dc97e3d317926fc4ab7d52.1740387599.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"Ryan's been hard at work finding and fixing mm bugs in the arm64 code,
so here's a small crop of fixes for -rc5.
The main changes are to fix our zapping of non-present PTEs for
hugetlb entries created using the contiguous bit in the page-table
rather than a block entry at the level above. Prior to these fixes, we
were pulling the contiguous bit back out of the PTE in order to
determine the size of the hugetlb page but this is clearly bogus if
the thing isn't present and consequently both the clearing of the
PTE(s) and the TLB invalidation were unreliable.
Although the problem was found by code inspection, we really don't
want this sitting around waiting to trigger and the changes are CC'd
to stable accordingly.
Note that the diffstat looks a lot worse than it really is;
huge_ptep_get_and_clear() now takes a size argument from the core code
and so all the arch implementations of that have been updated in a
pretty mechanical fashion.
- Fix a sporadic boot failure due to incorrect randomization of the
linear map on systems that support it
- Fix the zapping (both clearing the entries *and* invalidating the
TLB) of hugetlb PTEs constructed using the contiguous bit"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: hugetlb: Fix flush_hugetlb_tlb_range() invalidation level
arm64: hugetlb: Fix huge_ptep_get_and_clear() for non-present ptes
mm: hugetlb: Add huge page size param to huge_ptep_get_and_clear()
arm64/mm: Fix Boot panic on Ampere Altra
|
|
There are two incompatible sets of definitions of these eight functions:
On 64-bit architectures setting CONFIG_HAS_IOPORT, they turn into
either pair of 32-bit PIO (inl/outl) accesses or a single 64-bit MMIO
(readq/writeq). On other 64-bit architectures, they are always split
into 32-bit accesses.
Depending on which header gets included in a driver, there are
additionally definitions for ioread64()/iowrite64() that are
expected to produce a 64-bit register MMIO access on all 64-bit
architectures.
To separate the conflicting definitions, make the version in
include/linux/io-64-nonatomic-*.h visible on all architectures
but pick the one from lib/iomap.c on architectures that set
CONFIG_GENERIC_IOMAP in place of the default fallback.
Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fixes from Ingo Molnar:
"Fix an objtool false positive, and objtool related build warnings that
happens on PIE-enabled architectures such as LoongArch"
* tag 'objtool-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Add bch2_trans_unlocked_or_in_restart_error() to bcachefs noreturns
objtool: Fix C jump table annotations for Clang
vmlinux.lds: Ensure that const vars with relocations are mapped R/O
|
|
In order to fix a bug, arm64 needs to be told the size of the huge page
for which the huge_pte is being cleared in huge_ptep_get_and_clear().
Provide for this by adding an `unsigned long sz` parameter to the
function. This follows the same pattern as huge_pte_clear() and
set_huge_pte_at().
This commit makes the required interface modifications to the core mm as
well as all arches that implement this function (arm64, loongarch, mips,
parisc, powerpc, riscv, s390, sparc). The actual arm64 bug will be fixed
in a separate commit.
Cc: stable@vger.kernel.org
Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit")
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> # riscv
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> # s390
Link: https://lore.kernel.org/r/20250226120656.2400136-2-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Since commit 0eb5085c3874 ("arch: remove ARCH_TASK_STRUCT_ON_STACK")
there is no option that would allow placing task_struct on the stack.
Remove the unused linker script entry.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241217202745.1402932-2-benjamin@sipsolutions.net
|
|
In the kernel, there are architectures (x86, arm64) that perform
boot-time relocation (for KASLR) without relying on PIE codegen. In this
case, all const global objects are emitted into .rodata, including const
objects with fields that will be fixed up by the boot-time relocation
code. This implies that .rodata (and .text in some cases) need to be
writable at boot, but they will usually be mapped read-only as soon as
the boot completes.
When using PIE codegen, the compiler will emit const global objects into
.data.rel.ro rather than .rodata if the object contains fields that need
such fixups at boot-time. This permits the linker to annotate such
regions as requiring read-write access only at load time, but not at
execution time (in user space), while keeping .rodata truly const (in
user space, this is important for reducing the CoW footprint of dynamic
executables).
This distinction does not matter for the kernel, but it does imply that
const data will end up in writable memory if the .data.rel.ro sections
are not treated in a special way, as they will end up in the writable
.data segment by default.
So emit .data.rel.ro into the .rodata segment.
Cc: stable@vger.kernel.org
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250221135704.431269-5-ardb+git@google.com
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
|
|
Introduce hv_curr_partition_type to store the partition type
as an enum.
Right now this is limited to guest or root partition, but there will
be other kinds in future and the enum is easily extensible.
Set up hv_curr_partition_type early in Hyper-V initialization with
hv_identify_partition_type(). hv_root_partition() just queries this
value, and shouldn't be called before that.
Making this check into a function sets the stage for adding a config
option to gate the compilation of root partition code. In particular,
hv_root_partition() can be stubbed out always be false if root
partition support isn't desired.
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/1740167795-13296-3-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1740167795-13296-3-git-send-email-nunodasneves@linux.microsoft.com>
|
|
Return linux-friendly error codes from hypercall helper functions,
which allows them to be used more flexibly.
Introduce hv_result_to_errno() for this purpose, which also handles
the special value U64_MAX returned from hv_do_hypercall().
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/1740167795-13296-2-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1740167795-13296-2-git-send-email-nunodasneves@linux.microsoft.com>
|
|
All users of the time releated parts of the vDSO are now using the generic
storage implementation. Remove the therefore unnecessary compatibility
accessor functions and symbols.
Co-developed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250204-vdso-store-rng-v3-18-13a4669dfc8c@linutronix.de
|
|
All users of the random vDSO are using the generic storage
implementation. Remove the now unnecessary compatibility accessor
functions and symbols.
Co-developed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250204-vdso-store-rng-v3-17-13a4669dfc8c@linutronix.de
|
|
Extend the generic vDSO data storage with a page for the random state data.
The random state data is stored in a dedicated page, as the existing
storage page is only meant for time-related, time-namespace-aware data.
This simplifies to access logic to not need to handle time namespaces
anymore and also frees up more space in the time-related page.
In case further generic vDSO data store is required it can be added to
the random state page.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250204-vdso-store-rng-v3-6-13a4669dfc8c@linutronix.de
|
|
Historically each architecture defined their own way to store the vDSO
data page. Add a generic mechanism to provide storage for that page.
Furthermore this generic storage will be extended to also provide
uniform storage for *non*-time-related data, like the random state or
architecture-specific data. These will have their own pages and data
structures, so rename 'vdso_data' into 'vdso_time_data' to make that
split clear from the name.
Also introduce a new consistent naming scheme for the symbols related to
the vDSO, which makes it clear if the symbol is accessible from
userspace or kernel space and the type of data behind the symbol.
The generic fault handler contains an optimization to prefault the vvar
page when the timens page is accessed. This was lifted from s390 and x86.
Co-developed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250204-vdso-store-rng-v3-5-13a4669dfc8c@linutronix.de
|
|
__per_cpu_load is now always equal to __per_cpu_start.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250123190747.745588-15-brgerst@gmail.com
|
|
x86-64 was the last user.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250123190747.745588-14-brgerst@gmail.com
|
|
x86-64 was the last user.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250123190747.745588-13-brgerst@gmail.com
|