summaryrefslogtreecommitdiff
path: root/arch/x86/entry
AgeCommit message (Collapse)Author
9 hoursx86: keep legacy generated vdso files around in .gitignore fileLinus Torvalds
Commit 93d73005bff4 ("x86/entry/vdso: Rename vdso_image_* to vdso*_image") updated the vdso .gitignore file with the new filenames, which is certainly not incorrect. However, while adding new generated names is obviously the right thing to do, you should *not* immediately remove the old filenames from the .gitignore file when things move around or get renamed, because people still have those old generated files in their build trees - and they haven't suddenly become valid files to commit to the repository just because they were moved or renamed. While it's mostly just a slight visual nuisance for 'git status' that can be fixed up with a clean build tree, it can become more serious than that: see for example commit 04a3389b3535 ("Remove stale generated 'genheaders' file"). That commit removed up a stale generated file that had been carelessly committed by a kernel developer because it wasn't properly ignored any more and thus showed up as a new file in their tree. Fixes: 93d73005bff4 ("x86/entry/vdso: Rename vdso_image_* to vdso*_image") Cc: Peter Anvin <hpa@zytor.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
10 hoursMerge tag 'x86_entry_for_7.0-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 entry code updates from Dave Hansen: "This is entirely composed of a set of long overdue VDSO cleanups. They makes the VDSO build much more logical and zap quite a bit of old cruft. It also results in a coveted net-code-removal diffstat" * tag 'x86_entry_for_7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/entry/vdso: Add vdso2c to .gitignore x86/entry/vdso32: Omit '.cfi_offset eflags' for LLVM < 16 MAINTAINERS: Adjust vdso file entry in INTEL SGX x86/entry/vdso/selftest: Update location of vgetrandom-chacha.S x86/entry/vdso: Fix filtering of vdso compiler flags x86/entry/vdso: Update the object paths for "make vdso_install" x86/entry/vdso32: When using int $0x80, use it directly x86/cpufeature: Replace X86_FEATURE_SYSENTER32 with X86_FEATURE_SYSFAST32 x86/vdso: Abstract out vdso system call internals x86/entry/vdso: Include GNU_PROPERTY and GNU_STACK PHDRs x86/entry/vdso32: Remove open-coded DWARF in sigreturn.S x86/entry/vdso32: Remove SYSCALL_ENTER_KERNEL macro in sigreturn.S x86/entry/vdso32: Don't rely on int80_landing_pad for adjusting ip x86/entry/vdso: Refactor the vdso build x86/entry/vdso: Move vdso2c to arch/x86/tools x86/entry/vdso: Rename vdso_image_* to vdso*_image
10 hoursMerge tag 'x86_paravirt_for_v7.0_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 paravirt updates from Borislav Petkov: - A nice cleanup to the paravirt code containing a unification of the paravirt clock interface, taming the include hell by splitting the pv_ops structure and removing of a bunch of obsolete code (Juergen Gross) * tag 'x86_paravirt_for_v7.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) x86/paravirt: Use XOR r32,r32 to clear register in pv_vcpu_is_preempted() x86/paravirt: Remove trailing semicolons from alternative asm templates x86/pvlocks: Move paravirt spinlock functions into own header x86/paravirt: Specify pv_ops array in paravirt macros x86/paravirt: Allow pv-calls outside paravirt.h objtool: Allow multiple pv_ops arrays x86/xen: Drop xen_mmu_ops x86/xen: Drop xen_cpu_ops x86/xen: Drop xen_irq_ops x86/paravirt: Move pv_native_*() prototypes to paravirt.c x86/paravirt: Introduce new paravirt-base.h header x86/paravirt: Move paravirt_sched_clock() related code into tsc.c x86/paravirt: Use common code for paravirt_steal_clock() riscv/paravirt: Use common code for paravirt_steal_clock() loongarch/paravirt: Use common code for paravirt_steal_clock() arm64/paravirt: Use common code for paravirt_steal_clock() arm/paravirt: Use common code for paravirt_steal_clock() sched: Move clock related paravirt code to kernel/sched paravirt: Remove asm/paravirt_api_clock.h x86/paravirt: Move thunk macros to paravirt_types.h ...
12 hoursMerge tag 'timers-vdso-2026-02-09' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull VDSO updates from Thomas Gleixner: - Provide the missing 64-bit variant of clock_getres() This allows the extension of CONFIG_COMPAT_32BIT_TIME to the vDSO and finally the removal of 32-bit time types from the kernel and UAPI. - Remove the useless and broken getcpu_cache from the VDSO The intention was to provide a trivial way to retrieve the CPU number from the VDSO, but as the VDSO data is per process there is no way to make it work. - Switch get/put_unaligned() from packed struct to memcpy() The packed struct violates strict aliasing rules which requires to pass -fno-strict-aliasing to the compiler. As this are scalar values __builtin_memcpy() turns them into simple loads and stores - Use __typeof_unqual__() for __unqual_scalar_typeof() The get/put_unaligned() changes triggered a new sparse warning when __beNN types are used with get/put_unaligned() as sparse builds add a special 'bitwise' attribute to them which prevents sparse to evaluate the Generic in __unqual_scalar_typeof(). Newer sparse versions support __typeof_unqual__() which avoids the problem, but requires a recent sparse install. So this adds a sanity check to sparse builds, which validates that sparse is available and capable of handling it. - Force inline __cvdso_clock_getres_common() Compilers sometimes un-inline agressively, which results in function call overhead and problems with automatic stack variable initialization. Interestingly enough the force inlining results in smaller code than the un-inlined variant produced by GCC when optimizing for size. * tag 'timers-vdso-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: vdso/gettimeofday: Force inlining of __cvdso_clock_getres_common() x86/percpu: Make CONFIG_USE_X86_SEG_SUPPORT work with sparse compiler: Use __typeof_unqual__() for __unqual_scalar_typeof() powerpc/vdso: Provide clock_getres_time64() tools headers: Remove unneeded ignoring of warnings in unaligned.h tools headers: Update the linux/unaligned.h copy with the kernel sources vdso: Switch get/put_unaligned() from packed struct to memcpy() parisc: Inline a type punning version of get_unaligned_le32() vdso: Remove struct getcpu_cache MIPS: vdso: Provide getres_time64() for 32-bit ABIs arm64: vdso32: Provide clock_getres_time64() ARM: VDSO: Provide clock_getres_time64() ARM: VDSO: Patch out __vdso_clock_getres() if unavailable x86/vdso: Provide clock_getres_time64() for x86-32 selftests: vDSO: vdso_test_abi: Add test for clock_getres_time64() selftests: vDSO: vdso_test_abi: Use UAPI system call numbers selftests: vDSO: vdso_config: Add configurations for clock_getres_time64() vdso: Add prototype for __vdso_clock_getres_time64()
16 hoursMerge tag 'sched-core-2026-02-09' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Scheduler Kconfig space updates: - Further consolidate configurable preemption modes (Peter Zijlstra) Reduce the number of architectures that are allowed to offer PREEMPT_NONE and PREEMPT_VOLUNTARY, reducing the number of preemption models from four to just two: 'full' and 'lazy' on up-to-date architectures (arm64, loongarch, powerpc, riscv, s390, x86). None and voluntary are only available as legacy features on platforms that don't implement lazy preemption yet, or which don't even support preemption. The goal is to eventually remove cond_resched() and voluntary preemption altogether. RSEQ based 'scheduler time slice extension' support (Thomas Gleixner and Peter Zijlstra): This allows a thread to request a time slice extension when it enters a critical section to avoid contention on a resource when the thread is scheduled out inside of the critical section. - Add fields and constants for time slice extension - Provide static branch for time slice extensions - Add statistics for time slice extensions - Add prctl() to enable time slice extensions - Implement sys_rseq_slice_yield() - Implement syscall entry work for time slice extensions - Implement time slice extension enforcement timer - Reset slice extension when scheduled - Implement rseq_grant_slice_extension() - entry: Hook up rseq time slice extension - selftests: Implement time slice extension test - Allow registering RSEQ with slice extension - Move slice_ext_nsec to debugfs - Lower default slice extension - selftests/rseq: Add rseq slice histogram script Scheduler performance/scalability improvements: - Update rq->avg_idle when a task is moved to an idle CPU, which improves the scalability of various workloads (Shubhang Kaushik) - Reorder fields in 'struct rq' for better caching (Blake Jones) - Fair scheduler SMP NOHZ balancing code speedups (Shrikanth Hegde): - Move checking for nohz cpus after time check - Change likelyhood of nohz.nr_cpus - Remove nohz.nr_cpus and use weight of cpumask instead - Avoid false sharing for sched_clock_irqtime (Wangyang Guo) - Cleanups (Yury Norov): - Drop useless cpumask_empty() in find_energy_efficient_cpu() - Simplify task_numa_find_cpu() - Use cpumask_weight_and() in sched_balance_find_dst_group() DL scheduler updates: - Add a deadline server for sched_ext tasks (by Andrea Righi and Joel Fernandes, with fixes by Peter Zijlstra) RT scheduler updates: - Skip currently executing CPU in rto_next_cpu() (Chen Jinghuang) Entry code updates and performance improvements (Jinjie Ruan) This is part of the scheduler tree in this cycle due to inter- dependencies with the RSEQ based time slice extension work: - Remove unused syscall argument from syscall_trace_enter() - Rework syscall_exit_to_user_mode_work() for architecture reuse - Add arch_ptrace_report_syscall_entry/exit() - Inline syscall_exit_work() and syscall_trace_enter() Scheduler core updates (Peter Zijlstra): - Rework sched_class::wakeup_preempt() and rq_modified_*() - Avoid rq->lock bouncing in sched_balance_newidle() - Rename rcu_dereference_check_sched_domain() => rcu_dereference_sched_domain() - <linux/compiler_types.h>: Add the __signed_scalar_typeof() helper Fair scheduler updates/refactoring (Peter Zijlstra and Ingo Molnar): - Fold the sched_avg update - Change rcu_dereference_check_sched_domain() to rcu-sched - Switch to rcu_dereference_all() - Remove superfluous rcu_read_lock() - Limit hrtick work - Join two #ifdef CONFIG_FAIR_GROUP_SCHED blocks - Clean up comments in 'struct cfs_rq' - Separate se->vlag from se->vprot - Rename cfs_rq::avg_load to cfs_rq::sum_weight - Rename cfs_rq::avg_vruntime to ::sum_w_vruntime & helper functions - Introduce and use the vruntime_cmp() and vruntime_op() wrappers for wrapped-signed aritmetics - Sort out 'blocked_load*' namespace noise Scheduler debugging code updates: - Export hidden tracepoints to modules (Gabriele Monaco) - Convert copy_from_user() + kstrtouint() to kstrtouint_from_user() (Fushuai Wang) - Add assertions to QUEUE_CLASS (Peter Zijlstra) - hrtimer: Fix tracing oddity (Thomas Gleixner) Misc fixes and cleanups: - Re-evaluate scheduling when migrating queued tasks out of throttled cgroups (Zicheng Qu) - Remove task_struct->faults_disabled_mapping (Christoph Hellwig) - Fix math notation errors in avg_vruntime comment (Zhan Xusheng) - sched/cpufreq: Use %pe format for PTR_ERR() printing (zenghongling)" * tag 'sched-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits) sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups sched/cpufreq: Use %pe format for PTR_ERR() printing sched/rt: Skip currently executing CPU in rto_next_cpu() sched/clock: Avoid false sharing for sched_clock_irqtime selftests/sched_ext: Add test for DL server total_bw consistency selftests/sched_ext: Add test for sched_ext dl_server sched/debug: Fix dl_server (re)start conditions sched/debug: Add support to change sched_ext server params sched_ext: Add a DL server for sched_ext tasks sched/debug: Stop and start server based on if it was active sched/debug: Fix updating of ppos on server write ops sched/deadline: Clear the defer params entry: Inline syscall_exit_work() and syscall_trace_enter() entry: Add arch_ptrace_report_syscall_entry/exit() entry: Rework syscall_exit_to_user_mode_work() for architecture reuse entry: Remove unused syscall argument from syscall_trace_enter() sched: remove task_struct->faults_disabled_mapping sched: Update rq->avg_idle when a task is moved to an idle CPU selftests/rseq: Add rseq slice histogram script hrtimer: Fix trace oddity ...
2026-01-24x86/entry/vdso32: Omit '.cfi_offset eflags' for LLVM < 16Nathan Chancellor
After commit: 884961618ee5 ("x86/entry/vdso32: Remove open-coded DWARF in sigreturn.S") building arch/x86/entry/vdso/vdso32/sigreturn.S with LLVM 15 fails with: <instantiation>:18:20: error: invalid register name .cfi_offset eflags, 64 ^ arch/x86/entry/vdso/vdso32/sigreturn.S:33:2: note: while in macro instantiation STARTPROC_SIGNAL_FRAME 8 ^ Support for eflags as an argument to .cfi_offset was added in the LLVM 16 development cycle: https://github.com/llvm/llvm-project/commit/67bd3c58c0c7389e39c5a2f4d3b1a30459ccf5b7 [1] Only add this .cfi_offset directive if it is supported by the assembler to clear up the error. [ mingo: Tidied up the changelog and the comment a bit ] Fixes: 884961618ee5 ("x86/entry/vdso32: Remove open-coded DWARF in sigreturn.S") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: H. Peter Anvin (Intel) <hpa@zytor.com> Link: https://patch.msgid.link/20260123-x86-vdso32-wa-llvm-15-cfi-offset-eflags-v1-1-0f412e3516a4@kernel.org
2026-01-22rseq: Implement sys_rseq_slice_yield()Thomas Gleixner
Provide a new syscall which has the only purpose to yield the CPU after the kernel granted a time slice extension. sched_yield() is not suitable for that because it unconditionally schedules, but the end of the time slice extension is not required to schedule when the task was already preempted. This also allows to have a strict check for termination to catch user space invoking random syscalls including sched_yield() from a time slice extension region. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Link: https://patch.msgid.link/20251215155708.929634896@linutronix.de
2026-01-16x86/entry/vdso: Fix filtering of vdso compiler flagsH. Peter Anvin
This fixes several typos in the filtering of compiler flags for vdso, discovered by Chris Mason using an AI script: 1. "-fno-PIE" was written as "fno-PIE". 2. "CC_PLUGINS_FLAGS" was written as "CC_PLUGIN_FLAGS" To the best of my knowledge, none of these actually had any real impact on the build at this time but they are genuine bugs which could break things at any point in the future. Chris's script also found that "CONFIG_X86_USER_SHADOW_STACK" was missing "CONFIG_", but it needs a different fix. [ dhansen: remove CONFIG_X86_USER_SHADOW_STACK munging, add mention in changelog. ] Closes: https://lore.kernel.org/20260116035807.2307742-1-clm@meta.com Fixes: 693c819fedcd ("x86/entry/vdso: Refactor the vdso build") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://patch.msgid.link/20260116204057.386268-3-hpa@zytor.com
2026-01-14vdso: Remove struct getcpu_cacheThomas Weißschuh
The cache parameter of getcpu() is useless nowadays for various reasons. * It is never passed by userspace for either the vDSO or syscalls. * It is never used by the kernel. * It could not be made to work on the current vDSO architecture. * The structure definition is not part of the UAPI headers. * vdso_getcpu() is superseded by restartable sequences in any case. Remove the struct and its header. As a side-effect this gets rid of an unwanted inclusion of the linux/ header namespace from vDSO code. [ tglx: Adapt to s390 upstream changes */ Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390 Link: https://patch.msgid.link/20251230-getcpu_cache-v3-1-fb9c5f880ebe@linutronix.de
2026-01-13x86/entry/vdso32: When using int $0x80, use it directlyH. Peter Anvin
When neither sysenter32 nor syscall32 is available (on either FRED-capable 64-bit hardware or old 32-bit hardware), there is no reason to do a bunch of stack shuffling in __kernel_vsyscall. Unfortunately, just overwriting the initial "push" instructions will mess up the CFI annotations, so suffer the 3-byte NOP if not applicable. Similarly, inline the int $0x80 when doing inline system calls in the vdso instead of calling __kernel_vsyscall. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://patch.msgid.link/20251216212606.1325678-11-hpa@zytor.com
2026-01-13x86/cpufeature: Replace X86_FEATURE_SYSENTER32 with X86_FEATURE_SYSFAST32H. Peter Anvin
In most cases, the use of "fast 32-bit system call" depends either on X86_FEATURE_SEP or X86_FEATURE_SYSENTER32 || X86_FEATURE_SYSCALL32. However, nearly all the logic for both is identical. Define X86_FEATURE_SYSFAST32 which indicates that *either* SYSENTER32 or SYSCALL32 should be used, for either 32- or 64-bit kernels. This defaults to SYSENTER; use SYSCALL if the SYSCALL32 bit is also set. As this removes ALL existing uses of X86_FEATURE_SYSENTER32, which is a kernel-only synthetic feature bit, simply remove it and replace it with X86_FEATURE_SYSFAST32. This leaves an unused alternative for a true 32-bit kernel, but that should really not matter in any way. The clearing of X86_FEATURE_SYSCALL32 can be removed once the patches for automatically clearing disabled features has been merged. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://patch.msgid.link/20251216212606.1325678-10-hpa@zytor.com
2026-01-13x86/entry/vdso: Include GNU_PROPERTY and GNU_STACK PHDRsH. Peter Anvin
Currently the vdso doesn't include .note.gnu.property or a GNU noexec stack annotation (the -z noexecstack in the linker script is ineffective because we specify PHDRs explicitly.) The motivation is that the dynamic linker currently do not check these. However, this is a weak excuse: the vdso*.so are also supposed to be usable at link libraries, and there is no reason why the dynamic linker might not want or need to check these in the future, so add them back in -- it is trivial enough. Use symbolic constants for the PHDR permission flags. [ v4: drop unrelated formatting changes ] Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://patch.msgid.link/20251216212606.1325678-8-hpa@zytor.com
2026-01-13x86/entry/vdso32: Remove open-coded DWARF in sigreturn.SH. Peter Anvin
The vdso32 sigreturn.S contains open-coded DWARF bytecode, which includes a hack for gdb to not try to step back to a previous call instruction when backtracing from a signal handler. Neither of those are necessary anymore: the backtracing issue is handled by ".cfi_entry simple" and ".cfi_signal_frame", both of which have been supported for a very long time now, which allows the remaining frame to be built using regular .cfi annotations. Add a few more register offsets to the signal frame just for good measure. Replace the nop on fallthrough of the system call (which should never, ever happen) with a ud2a trap. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://patch.msgid.link/20251216212606.1325678-7-hpa@zytor.com
2026-01-13x86/entry/vdso32: Remove SYSCALL_ENTER_KERNEL macro in sigreturn.SH. Peter Anvin
A macro SYSCALL_ENTER_KERNEL was defined in sigreturn.S, with the ability of overriding it. The override capability, however, is not used anywhere, and the macro name is potentially confusing because it seems to imply that sysenter/syscall could be used here, which is NOT true: the sigreturn system calls MUST use int $0x80. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://patch.msgid.link/20251216212606.1325678-6-hpa@zytor.com
2026-01-13x86/entry/vdso32: Don't rely on int80_landing_pad for adjusting ipH. Peter Anvin
There is no fundamental reason to use the int80_landing_pad symbol to adjust ip when moving the vdso. If ip falls within the vdso, and the vdso is moved, we should change the ip accordingly, regardless of mode or location within the vdso. This *currently* can only happen on 32 bits, but there isn't any reason not to do so generically. Note that if this is ever possible from a vdso-internal call, then the user space stack will also needed to be adjusted (as well as the shadow stack, if enabled.) Fortunately this is not currently the case. At the moment, we don't even consider other threads when moving the vdso. The assumption is that it is only used by process freeze/thaw for migration, where this is not an issue. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://patch.msgid.link/20251216212606.1325678-5-hpa@zytor.com
2026-01-13x86/entry/vdso: Refactor the vdso buildH. Peter Anvin
- Separate out the vdso sources into common, vdso32, and vdso64 directories. - Build the 32- and 64-bit vdsos in their respective subdirectories; this greatly simplifies the build flags handling. - Unify the mangling of Makefile flags between the 32- and 64-bit vdso code as much as possible; all common rules are put in arch/x86/entry/vdso/common/Makefile.include. The remaining is very simple for 32 bits; the 64-bit one is only slightly more complicated because it contains the x32 generation rule. - Define __DISABLE_EXPORTS when building the vdso. This need seems to have been masked by different ordering compile flags before. - Change CONFIG_X86_64 to BUILD_VDSO32_64 in vdso32/system_call.S, to make it compatible with including fake_32bit_build.h. - The -fcf-protection= option was "leaking" from the kernel build, for reasons that was not clear to me. Furthermore, several distributions ship with it set to a default value other than "-fcf-protection=none". Make it match the configuration options for *user space*. Note that this patch may seem large, but the vast majority of it is simply code movement. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://patch.msgid.link/20251216212606.1325678-4-hpa@zytor.com
2026-01-13x86/entry/vdso: Move vdso2c to arch/x86/toolsH. Peter Anvin
It is generally better to build tools in arch/x86/tools to keep host cflags proliferation down, and to reduce makefile sequencing issues. Move the vdso build tool vdso2c into arch/x86/tools in preparation for refactoring the vdso makefiles. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://patch.msgid.link/20251216212606.1325678-3-hpa@zytor.com
2026-01-13x86/entry/vdso: Rename vdso_image_* to vdso*_imageH. Peter Anvin
The vdso .so files are named vdso*.so. These structures are binary images and descriptions of these files, so it is more consistent for them to have a naming that more directly mirrors the filenames. It is also very slightly more compact (by one character...) and simplifies the Makefile just a little bit. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://patch.msgid.link/20251216212606.1325678-2-hpa@zytor.com
2026-01-13x86/vdso: Provide clock_getres_time64() for x86-32Thomas Weißschuh
For consistency with __vdso_clock_gettime64() there should also be a 64-bit variant of clock_getres(). This will allow the extension of CONFIG_COMPAT_32BIT_TIME to the vDSO and finally the removal of 32-bit time types from the kernel and UAPI. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20251223-vdso-compat-time32-v1-5-97ea7a06a543@linutronix.de
2026-01-12x86/paravirt: Remove not needed includes of paravirt.hJuergen Gross
In some places asm/paravirt.h is included without really being needed. Remove the related #include statements. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260105110520.21356-2-jgross@suse.com
2025-12-17perf/x86/core: Register a new vector for handling mediated guest PMIsSean Christopherson
Wire up system vector 0xf5 for handling PMIs (i.e. interrupts delivered through the LVTPC) while running KVM guests with a mediated PMU. Perf currently delivers all PMIs as NMIs, e.g. so that events that trigger while IRQs are disabled aren't delayed and generate useless records, but due to the multiplexing of NMIs throughout the system, correctly identifying NMIs for a mediated PMU is practically infeasible. To (greatly) simplify identifying guest mediated PMU PMIs, perf will switch the CPU's LVTPC between PERF_GUEST_MEDIATED_PMI_VECTOR and NMI when guest PMU context is loaded/put. I.e. PMIs that are generated by the CPU while the guest is active will be identified purely based on the IRQ vector. Route the vector through perf, e.g. as opposed to letting KVM attach a handler directly a la posted interrupt notification vectors, as perf owns the LVTPC and thus is the rightful owner of PERF_GUEST_MEDIATED_PMI_VECTOR. Functionally, having KVM directly own the vector would be fine (both KVM and perf will be completely aware of when a mediated PMU is active), but would lead to an undesirable split in ownership: perf would be responsible for installing the vector, but not handling the resulting IRQs. Add a new perf_guest_info_callbacks hook (and static call) to allow KVM to register its handler with perf when running guests with mediated PMUs. Note, because KVM always runs guests with host IRQs enabled, there is no danger of a PMI being delayed from the guest's perspective due to using a regular IRQ instead of an NMI. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-9-seanjc@google.com
2025-12-02Merge tag 'x86_entry_for_6.19-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 entry update from Dave Hansen: "This one is pretty trivial: fix a badly-named FRED data structure member" * tag 'x86_entry_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/fred: Fix 64bit identifier in fred_ss
2025-12-02Merge tag 'x86_misc_for_6.19-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 updates from Dave Hansen: "The most significant are some changes to ensure that symbols exported for KVM are used only by KVM modules themselves, along with some related cleanups. In true x86/misc fashion, the other patch is completely unrelated and just enhances an existing pr_warn() to make it clear to users how they have tainted their kernel when something is mucking with MSRs. Summary: - Make MSR-induced taint easier for users to track down - Restrict KVM-specific exports to KVM itself" * tag 'x86_misc_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Restrict KVM-induced symbol exports to KVM modules where obvious/possible x86/mm: Drop unnecessary export of "ptdump_walk_pgd_level_debugfs" x86/mtrr: Drop unnecessary export of "mtrr_state" x86/bugs: Drop unnecessary export of "x86_spec_ctrl_base" x86/msr: Add CPU_OUT_OF_SPEC taint name to "unrecognized" pr_warn(msg)
2025-12-02Merge tag 'core-rseq-2025-11-30' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull rseq updates from Thomas Gleixner: "A large overhaul of the restartable sequences and CID management: The recent enablement of RSEQ in glibc resulted in regressions which are caused by the related overhead. It turned out that the decision to invoke the exit to user work was not really a decision. More or less each context switch caused that. There is a long list of small issues which sums up nicely and results in a 3-4% regression in I/O benchmarks. The other detail which caused issues due to extra work in context switch and task migration is the CID (memory context ID) management. It also requires to use a task work to consolidate the CID space, which is executed in the context of an arbitrary task and results in sporadic uncontrolled exit latencies. The rewrite addresses this by: - Removing deprecated and long unsupported functionality - Moving the related data into dedicated data structures which are optimized for fast path processing. - Caching values so actual decisions can be made - Replacing the current implementation with a optimized inlined variant. - Separating fast and slow path for architectures which use the generic entry code, so that only fault and error handling goes into the TIF_NOTIFY_RESUME handler. - Rewriting the CID management so that it becomes mostly invisible in the context switch path. That moves the work of switching modes into the fork/exit path, which is a reasonable tradeoff. That work is only required when a process creates more threads than the cpuset it is allowed to run on or when enough threads exit after that. An artificial thread pool benchmarks which triggers this did not degrade, it actually improved significantly. The main effect in migration heavy scenarios is that runqueue lock held time and therefore contention goes down significantly" * tag 'core-rseq-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits) sched/mmcid: Switch over to the new mechanism sched/mmcid: Implement deferred mode change irqwork: Move data struct to a types header sched/mmcid: Provide CID ownership mode fixup functions sched/mmcid: Provide new scheduler CID mechanism sched/mmcid: Introduce per task/CPU ownership infrastructure sched/mmcid: Serialize sched_mm_cid_fork()/exit() with a mutex sched/mmcid: Provide precomputed maximal value sched/mmcid: Move initialization out of line signal: Move MMCID exit out of sighand lock sched/mmcid: Convert mm CID mask to a bitmap cpumask: Cache num_possible_cpus() sched/mmcid: Use cpumask_weighted_or() cpumask: Introduce cpumask_weighted_or() sched/mmcid: Prevent pointless work in mm_update_cpus_allowed() sched/mmcid: Move scheduler code out of global header sched: Fixup whitespace damage sched/mmcid: Cacheline align MM CID storage sched/mmcid: Use proper data structures sched/mmcid: Revert the complex CID management ...
2025-12-01Merge tag 'core-bugs-2025-12-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull bug handling infrastructure updates from Ingo Molnar: "Core updates: - Improve WARN(), which has vararg printf like arguments, to work with the x86 #UD based WARN-optimizing infrastructure by hiding the format in the bug_table and replacing this first argument with the address of the bug-table entry, while making the actual function that's called a UD1 instruction (Peter Zijlstra) - Introduce the CONFIG_DEBUG_BUGVERBOSE_DETAILED Kconfig switch (Ingo Molnar, s390 support by Heiko Carstens) Fixes and cleanups: - bugs/s390: Remove private WARN_ON() implementation (Heiko Carstens) - <asm/bugs.h>: Make i386 use GENERIC_BUG_RELATIVE_POINTERS (Peter Zijlstra)" * tag 'core-bugs-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits) x86/bugs: Make i386 use GENERIC_BUG_RELATIVE_POINTERS x86/bug: Fix BUG_FORMAT vs KASLR x86_64/bug: Inline the UD1 x86/bug: Implement WARN_ONCE() x86_64/bug: Implement __WARN_printf() x86/bug: Use BUG_FORMAT for DEBUG_BUGVERBOSE_DETAILED x86/bug: Add BUG_FORMAT basics bug: Allow architectures to provide __WARN_printf() bug: Implement WARN_ON() using __WARN_FLAGS() bug: Add report_bug_entry() bug: Add BUG_FORMAT_ARGS infrastructure bug: Clean up CONFIG_GENERIC_BUG_RELATIVE_POINTERS bug: Add BUG_FORMAT infrastructure x86: Rework __bug_table helpers bugs/s390: Remove private WARN_ON() implementation bugs/core: Reorganize fields in the first line of WARNING output, add ->comm[] output bugs/sh: Concatenate 'cond_str' with '__FILE__' in __WARN_FLAGS(), to extend WARN_ON/BUG_ON output bugs/parisc: Concatenate 'cond_str' with '__FILE__' in __WARN_FLAGS(), to extend WARN_ON/BUG_ON output bugs/riscv: Concatenate 'cond_str' with '__FILE__' in __BUG_FLAGS(), to extend WARN_ON/BUG_ON output bugs/riscv: Pass in 'cond_str' to __BUG_FLAGS() ...
2025-11-24x86_64/bug: Implement __WARN_printf()Peter Zijlstra
The basic idea is to have __WARN_printf() be a vararg function such that the compiler can do the optimal calling convention for us. This function body will be a #UD and then set up a va_list in the exception from pt_regs. But because the trap will be in a called function, the bug_entry must be passed in. Have that be the first argument, with the format tucked away inside the bug_entry. The comments should clarify the real fun details. The big downside is that all WARNs will now show: RIP: 0010:__WARN_trap:+0 One possible solution is to simply discard the top frame when unwinding. A follow up patch takes care of this slightly differently by abusing the x86 static_call implementation. This changes (with the next patches): WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET, "corrupted preempt_count: %s/%d/0x%x\n", from: cmpl $2, %ecx #, _7 jne .L1472 ... .L1472: cmpb $0, __already_done.11(%rip) je .L1513 ... .L1513 movb $1, __already_done.11(%rip) movl 1424(%r14), %edx # _15->pid, _15->pid leaq 1912(%r14), %rsi #, _17 movq $.LC43, %rdi #, call __warn_printk # ud2 .pushsection __bug_table,"aw" 2: .long 1b - . # bug_entry::bug_addr .long .LC1 - . # bug_entry::file .word 5093 # bug_entry::line .word 2313 # bug_entry::flags .org 2b + 12 .popsection .pushsection .discard.annotate_insn,"M", @progbits, 8 .long 1b - . .long 8 # ANNOTYPE_REACHABLE .popsection into: cmpl $2, %ecx #, _7 jne .L1442 #, ... .L1442: lea (2f)(%rip), %rdi 1: .pushsection __bug_table,"aw" 2: .long 1b - . # bug_entry::bug_addr .long .LC43 - . # bug_entry::format .long .LC1 - . # bug_entry::file .word 5093 # bug_entry::line .word 2323 # bug_entry::flags .org 2b + 16 .popsection movl 1424(%r14), %edx # _19->pid, _19->pid leaq 1912(%r14), %rsi #, _13 ud1 (%edx), %rdi Notably, by pushing everything into the exception handler it can take care of the ONCE thing. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251110115758.213813530@infradead.org
2025-11-12x86: Restrict KVM-induced symbol exports to KVM modules where obvious/possibleSean Christopherson
Extend KVM's export macro framework to provide EXPORT_SYMBOL_FOR_KVM(), and use the helper macro to export symbols for KVM throughout x86 if and only if KVM will build one or more modules, and only for those modules. To avoid unnecessary exports when CONFIG_KVM=m but kvm.ko will not be built (because no vendor modules are selected), let arch code #define EXPORT_SYMBOL_FOR_KVM to suppress/override the exports. Note, the set of symbols to restrict to KVM was generated by manual search and audit; any "misses" are due to human error, not some grand plan. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251112173944.1380633-5-seanjc%40google.com
2025-11-04entry: Remove syscall_enter_from_user_mode_prepare()Thomas Gleixner
Open code the only user in the x86 syscall code and reduce the zoo of functions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084306.652839989@linutronix.de
2025-11-03arch: hookup listns() system callChristian Brauner
Add the listns() system call to all architectures. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-20-2e6f823ebdc0@kernel.org Tested-by: syzbot@syzkaller.appspotmail.com Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-13x86/fred: Fix 64bit identifier in fred_ssAndrew Cooper
FRED can only be enabled in Long Mode. This is the 64bit mode (as opposed to compatibility mode) identifier, rather than being something hard-wired at 1. No functional change. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Xin Li (Intel) <xin@zytor.com> Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com> Acked-by: H. Peter Anvin (Intel) <hpa@zytor.com>
2025-10-11Merge tag 'x86_core_for_v6.18_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull more x86 updates from Borislav Petkov: - Remove a bunch of asm implementing condition flags testing in KVM's emulator in favor of int3_emulate_jcc() which is written in C - Replace KVM fastops with C-based stubs which avoids problems with the fastop infra related to latter not adhering to the C ABI due to their special calling convention and, more importantly, bypassing compiler control-flow integrity checking because they're written in asm - Remove wrongly used static branches and other ugliness accumulated over time in hyperv's hypercall implementation with a proper static function call to the correct hypervisor call variant - Add some fixes and modifications to allow running FRED-enabled kernels in KVM even on non-FRED hardware - Add kCFI improvements like validating indirect calls and prepare for enabling kCFI with GCC. Add cmdline params documentation and other code cleanups - Use the single-byte 0xd6 insn as the official #UD single-byte undefined opcode instruction as agreed upon by both x86 vendors - Other smaller cleanups and touchups all over the place * tag 'x86_core_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) x86,retpoline: Optimize patch_retpoline() x86,ibt: Use UDB instead of 0xEA x86/cfi: Remove __noinitretpoline and __noretpoline x86/cfi: Add "debug" option to "cfi=" bootparam x86/cfi: Standardize on common "CFI:" prefix for CFI reports x86/cfi: Document the "cfi=" bootparam options x86/traps: Clarify KCFI instruction layout compiler_types.h: Move __nocfi out of compiler-specific header objtool: Validate kCFI calls x86/fred: KVM: VMX: Always use FRED for IRQs when CONFIG_X86_FRED=y x86/fred: Play nice with invoking asm_fred_entry_from_kvm() on non-FRED hardware x86/fred: Install system vector handlers even if FRED isn't fully enabled x86/hyperv: Use direct call to hypercall-page x86/hyperv: Clean up hv_do_hypercall() KVM: x86: Remove fastops KVM: x86: Convert em_salc() to C KVM: x86: Introduce EM_ASM_3WCL KVM: x86: Introduce EM_ASM_1SRC2 KVM: x86: Introduce EM_ASM_2CL KVM: x86: Introduce EM_ASM_2W ...
2025-10-11Merge tag 'x86_cleanups_for_v6.18_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Borislav Petkov: - Simplify inline asm flag output operands now that the minimum compiler version supports the =@ccCOND syntax - Remove a bunch of AS_* Kconfig symbols which detect assembler support for various instruction mnemonics now that the minimum assembler version supports them all - The usual cleanups all over the place * tag 'x86_cleanups_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/asm: Remove code depending on __GCC_ASM_FLAG_OUTPUTS__ x86/sgx: Use ENCLS mnemonic in <kernel/cpu/sgx/encls.h> x86/mtrr: Remove license boilerplate text with bad FSF address x86/asm: Use RDPKRU and WRPKRU mnemonics in <asm/special_insns.h> x86/idle: Use MONITORX and MWAITX mnemonics in <asm/mwait.h> x86/entry/fred: Push __KERNEL_CS directly x86/kconfig: Remove CONFIG_AS_AVX512 crypto: x86 - Remove CONFIG_AS_VPCLMULQDQ crypto: X86 - Remove CONFIG_AS_VAES crypto: x86 - Remove CONFIG_AS_GFNI x86/kconfig: Drop unused and needless config X86_64_SMP
2025-10-04Merge tag 'x86_entry_for_6.18-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 entry updates from Dave Hansen: "A pair of x86/entry updates. The FRED one adjusts the kernel to the latest spec. The spec change prevents attackers from abusing kernel entry points. The second one came about because of the LASS work[1]. It moves the vsyscall emulation code away from depending on X86_PF_INSTR which is not available on some CPUs. Those CPUs are pretty obscure these days, but this still seems like the right thing to do. It also makes this code consistent with some things that the LASS code is going to do. - Use RIP instead of X86_PF_INSTR for vsyscall emulation - Remove ENDBR64 from FRED entry points" Link: https://lore.kernel.org/lkml/20250620135325.3300848-1-kirill.shutemov@linux.intel.com/ [1] * tag 'x86_entry_for_6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/fred: Remove ENDBR64 from FRED entry points x86/vsyscall: Do not require X86_PF_INSTR to emulate vsyscall
2025-08-22x86/entry/fred: Push __KERNEL_CS directlyXin Li (Intel)
Push __KERNEL_CS directly, rather than moving it into RAX and then pushing RAX. No functional changes. Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250822071644.1405268-1-xin@zytor.com
2025-08-21uprobes/x86: Add uprobe syscall to speed up uprobeJiri Olsa
Adding new uprobe syscall that calls uprobe handlers for given 'breakpoint' address. The idea is that the 'breakpoint' address calls the user space trampoline which executes the uprobe syscall. The syscall handler reads the return address of the initial call to retrieve the original 'breakpoint' address. With this address we find the related uprobe object and call its consumers. Adding the arch_uprobe_trampoline_mapping function that provides uprobe trampoline mapping. This mapping is backed with one global page initialized at __init time and shared by the all the mapping instances. We do not allow to execute uprobe syscall if the caller is not from uprobe trampoline mapping. The uprobe syscall ensures the consumer (bpf program) sees registers values in the state before the trampoline was called. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20250720112133.244369-10-jolsa@kernel.org
2025-08-18x86/fred: Play nice with invoking asm_fred_entry_from_kvm() on non-FRED hardwareJosh Poimboeuf
Modify asm_fred_entry_from_kvm() to allow it to be invoked by KVM even when FRED isn't fully enabled, e.g. when running with CONFIG_X86_FRED=y on non-FRED hardware. This will allow forcing KVM to always use the FRED entry points for 64-bit kernels, which in turn will eliminate a rather gross non-CFI indirect call that KVM uses to trampoline IRQs by doing IDT lookups. The point of asm_fred_entry_from_kvm() is to bridge between C (vmx:handle_external_interrupt_irqoff()) and more C (__fred_entry_from_kvm()) while changing the calling context to appear like an interrupt (pt_regs). Making the whole thing bound by C ABI. All that remains for non-FRED hardware is to restore RSP (to undo the redzone and alignment). However the trivial change would result in code like: push %rbp mov %rsp, %rbp sub $REDZONE, %rsp and $MASK, %rsp PUSH_AND_CLEAR_REGS push %rbp POP_REGS pop %rbp <-- *objtool fail* mov %rbp, %rsp pop %rbp ret And this will confuse objtool something wicked -- it gets confused by the extra pop %rbp, not realizing the push and pop preserve the value. Rather than trying to each objtool about this, recognise that since the code is bound by C ABI on both ends and interrupts are not allowed to change pt_regs (only exceptions are) it is sufficient to PUSH_REGS in order to create pt_regs, but there is no reason to POP_REGS -- provided the callee-saved registers are preserved. So avoid clearing callee-saved regs and skip POP_REGS. [Original patch by Sean; much of this version by Josh; Changelog, comments and final form by Peterz] Originally-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lkml.kernel.org/r/20250714103441.245417052@infradead.org
2025-08-13x86/fred: Remove ENDBR64 from FRED entry pointsXin Li (Intel)
The FRED specification has been changed in v9.0 to state that there is no need for FRED event handlers to begin with ENDBR64, because in the presence of supervisor indirect branch tracking, FRED event delivery does not enter the WAIT_FOR_ENDBRANCH state. As a result, remove ENDBR64 from FRED entry points. Then add ANNOTATE_NOENDBR to indicate that FRED entry points will never be used for indirect calls to suppress an objtool warning. This change implies that any indirect CALL/JMP to FRED entry points causes #CP in the presence of supervisor indirect branch tracking. Credit goes to Jennifer Miller <jmill@asu.edu> and other contributors from Arizona State University whose research shows that placing ENDBR at entry points has negative value thus led to this change. Note: This is obviously an incompatible change to the FRED architecture. But, it's OK because there no FRED systems out in the wild today. All production hardware and late pre-production hardware will follow the FRED v9 spec and be compatible with this approach. [ dhansen: add note to changelog about incompatibility ] Fixes: 14619d912b65 ("x86/fred: FRED entry/exit and dispatch code") Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Link: https://lore.kernel.org/linux-hardening/Z60NwR4w%2F28Z7XUa@ubun/ Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/20250716063320.1337818-1-xin%40zytor.com
2025-08-13x86/vsyscall: Do not require X86_PF_INSTR to emulate vsyscallKirill A. Shutemov
emulate_vsyscall() expects to see X86_PF_INSTR in PFEC on a vsyscall page fault, but the CPU does not report X86_PF_INSTR if neither X86_FEATURE_NX nor X86_FEATURE_SMEP are enabled. X86_FEATURE_NX should be enabled on nearly all 64-bit CPUs, except for early P4 processors that did not support this feature. Instead of explicitly checking for X86_PF_INSTR, compare the fault address to RIP. On machines with X86_FEATURE_NX enabled, issue a warning if RIP is equal to fault address but X86_PF_INSTR is absent. [ dhansen: flesh out code comments ] Originally-by: Dave Hansen <dave.hansen@intel.com> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Link: https://lore.kernel.org/all/bd81a98b-f8d4-4304-ac55-d4151a1a77ab@intel.com Link: https://lore.kernel.org/all/20250624145918.2720487-1-kirill.shutemov%40linux.intel.com
2025-07-28Merge tag 'hardening-v6.17-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: - Introduce and start using TRAILING_OVERLAP() helper for fixing embedded flex array instances (Gustavo A. R. Silva) - mux: Convert mux_control_ops to a flex array member in mux_chip (Thorsten Blum) - string: Group str_has_prefix() and strstarts() (Andy Shevchenko) - Remove KCOV instrumentation from __init and __head (Ritesh Harjani, Kees Cook) - Refactor and rename stackleak feature to support Clang - Add KUnit test for seq_buf API - Fix KUnit fortify test under LTO * tag 'hardening-v6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (22 commits) sched/task_stack: Add missing const qualifier to end_of_stack() kstack_erase: Support Clang stack depth tracking kstack_erase: Add -mgeneral-regs-only to silence Clang warnings init.h: Disable sanitizer coverage for __init and __head kstack_erase: Disable kstack_erase for all of arm compressed boot code x86: Handle KCOV __init vs inline mismatches arm64: Handle KCOV __init vs inline mismatches s390: Handle KCOV __init vs inline mismatches arm: Handle KCOV __init vs inline mismatches mips: Handle KCOV __init vs inline mismatch powerpc/mm/book3s64: Move kfence and debug_pagealloc related calls to __init section configs/hardening: Enable CONFIG_INIT_ON_FREE_DEFAULT_ON configs/hardening: Enable CONFIG_KSTACK_ERASE stackleak: Split KSTACK_ERASE_CFLAGS from GCC_PLUGINS_CFLAGS stackleak: Rename stackleak_track_stack to __sanitizer_cov_stack_depth stackleak: Rename STACKLEAK to KSTACK_ERASE seq_buf: Introduce KUnit tests string: Group str_has_prefix() and strstarts() kunit/fortify: Add back "volatile" for sizeof() constants acpi: nfit: intel: avoid multiple -Wflex-array-member-not-at-end warnings ...
2025-07-28Merge tag 'vfs-6.17-rc1.fileattr' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull fileattr updates from Christian Brauner: "This introduces the new file_getattr() and file_setattr() system calls after lengthy discussions. Both system calls serve as successors and extensible companions to the FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR system calls which have started to show their age in addition to being named in a way that makes it easy to conflate them with extended attribute related operations. These syscalls allow userspace to set filesystem inode attributes on special files. One of the usage examples is the XFS quota projects. XFS has project quotas which could be attached to a directory. All new inodes in these directories inherit project ID set on parent directory. The project is created from userspace by opening and calling FS_IOC_FSSETXATTR on each inode. This is not possible for special files such as FIFO, SOCK, BLK etc. Therefore, some inodes are left with empty project ID. Those inodes then are not shown in the quota accounting but still exist in the directory. This is not critical but in the case when special files are created in the directory with already existing project quota, these new inodes inherit extended attributes. This creates a mix of special files with and without attributes. Moreover, special files with attributes don't have a possibility to become clear or change the attributes. This, in turn, prevents userspace from re-creating quota project on these existing files. In addition, these new system calls allow the implementation of additional attributes that we couldn't or didn't want to fit into the legacy ioctls anymore" * tag 'vfs-6.17-rc1.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: tighten a sanity check in file_attr_to_fileattr() tree-wide: s/struct fileattr/struct file_kattr/g fs: introduce file_getattr and file_setattr syscalls fs: prepare for extending file_get/setattr() fs: make vfs_fileattr_[get|set] return -EOPNOTSUPP selinux: implement inode_file_[g|s]etattr hooks lsm: introduce new hooks for setting/getting inode fsxattr fs: split fileattr related helpers into separate file
2025-07-21stackleak: Split KSTACK_ERASE_CFLAGS from GCC_PLUGINS_CFLAGSKees Cook
In preparation for Clang stack depth tracking for KSTACK_ERASE, split the stackleak-specific cflags out of GCC_PLUGINS_CFLAGS into KSTACK_ERASE_CFLAGS. Link: https://lore.kernel.org/r/20250717232519.2984886-3-kees@kernel.org Signed-off-by: Kees Cook <kees@kernel.org>
2025-07-21stackleak: Rename STACKLEAK to KSTACK_ERASEKees Cook
In preparation for adding Clang sanitizer coverage stack depth tracking that can support stack depth callbacks: - Add the new top-level CONFIG_KSTACK_ERASE option which will be implemented either with the stackleak GCC plugin, or with the Clang stack depth callback support. - Rename CONFIG_GCC_PLUGIN_STACKLEAK as needed to CONFIG_KSTACK_ERASE, but keep it for anything specific to the GCC plugin itself. - Rename all exposed "STACKLEAK" names and files to "KSTACK_ERASE" (named for what it does rather than what it protects against), but leave as many of the internals alone as possible to avoid even more churn. While here, also split "prev_lowest_stack" into CONFIG_KSTACK_ERASE_METRICS, since that's the only place it is referenced from. Suggested-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250717232519.2984886-1-kees@kernel.org Signed-off-by: Kees Cook <kees@kernel.org>
2025-07-02fs: introduce file_getattr and file_setattr syscallsAndrey Albershteyn
Introduce file_getattr() and file_setattr() syscalls to manipulate inode extended attributes. The syscalls takes pair of file descriptor and pathname. Then it operates on inode opened accroding to openat() semantics. The struct file_attr is passed to obtain/change extended attributes. This is an alternative to FS_IOC_FSSETXATTR ioctl with a difference that file don't need to be open as we can reference it with a path instead of fd. By having this we can manipulated inode extended attributes not only on regular files but also on special ones. This is not possible with FS_IOC_FSSETXATTR ioctl as with special files we can not call ioctl() directly on the filesystem inode using fd. This patch adds two new syscalls which allows userspace to get/set extended inode attributes on special files by using parent directory and a path - *at() like syscall. CC: linux-api@vger.kernel.org CC: linux-fsdevel@vger.kernel.org CC: linux-xfs@vger.kernel.org Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Link: https://lore.kernel.org/20250630-xattrat-syscall-v6-6-c4e3bc35227b@kernel.org Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-16x86/bugs: Rename MDS machinery to something more genericBorislav Petkov (AMD)
It will be used by other x86 mitigations. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
2025-05-26Merge tag 'x86-entry-2025-05-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 vdso updates from Ingo Molnar: "Two changes to simplify the x86 vDSO code a bit" * tag 'x86-entry-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/vdso: Remove redundant #ifdeffery around in_ia32_syscall() x86/vdso: Remove #ifdeffery around page setup variants
2025-05-17x86/mm/64: Make 5-level paging support unconditionalKirill A. Shutemov
Both Intel and AMD CPUs support 5-level paging, which is expected to become more widely adopted in the future. All major x86 Linux distributions have the feature enabled. Remove CONFIG_X86_5LEVEL and related #ifdeffery for it to make it more readable. Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250516123306.3812286-4-kirill.shutemov@linux.intel.com
2025-05-09x86/its: Align RETs in BHB clear sequence to avoid thunkingPawan Gupta
The software mitigation for BHI is to execute BHB clear sequence at syscall entry, and possibly after a cBPF program. ITS mitigation thunks RETs in the lower half of the cacheline. This causes the RETs in the BHB clear sequence to be thunked as well, adding unnecessary branches to the BHB clear sequence. Since the sequence is in hot path, align the RET instructions in the sequence to avoid thunking. This is how disassembly clear_bhb_loop() looks like after this change: 0x44 <+4>: mov $0x5,%ecx 0x49 <+9>: call 0xffffffff81001d9b <clear_bhb_loop+91> 0x4e <+14>: jmp 0xffffffff81001de5 <clear_bhb_loop+165> 0x53 <+19>: int3 ... 0x9b <+91>: call 0xffffffff81001dce <clear_bhb_loop+142> 0xa0 <+96>: ret 0xa1 <+97>: int3 ... 0xce <+142>: mov $0x5,%eax 0xd3 <+147>: jmp 0xffffffff81001dd6 <clear_bhb_loop+150> 0xd5 <+149>: nop 0xd6 <+150>: sub $0x1,%eax 0xd9 <+153>: jne 0xffffffff81001dd3 <clear_bhb_loop+147> 0xdb <+155>: sub $0x1,%ecx 0xde <+158>: jne 0xffffffff81001d9b <clear_bhb_loop+91> 0xe0 <+160>: ret 0xe1 <+161>: int3 0xe2 <+162>: int3 0xe3 <+163>: int3 0xe4 <+164>: int3 0xe5 <+165>: lfence 0xe8 <+168>: pop %rbp 0xe9 <+169>: ret Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-04-22x86/vdso: Remove redundant #ifdeffery around in_ia32_syscall()Thomas Weißschuh
The #ifdefs only guard code that is also guarded by in_ia32_syscall(), which already contains the same #ifdef itself. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <kees@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Link: https://lore.kernel.org/r/20240910-x86-vdso-ifdef-v1-2-877c9df9b081@linutronix.de
2025-04-22x86/vdso: Remove #ifdeffery around page setup variantsThomas Weißschuh
Replace the open-coded ifdefs in C sources files with IS_ENABLED(). This makes the code easier to read and enables the compiler to typecheck also the disabled parts, before optimizing them away. To make this work, also remove the ifdefs from declarations of used variables. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <kees@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Link: https://lore.kernel.org/r/20240910-x86-vdso-ifdef-v1-1-877c9df9b081@linutronix.de
2025-04-09x86/bugs: Use SBPB in write_ibpb() if applicableJosh Poimboeuf
write_ibpb() does IBPB, which (among other things) flushes branch type predictions on AMD. If the CPU has SRSO_NO, or if the SRSO mitigation has been disabled, branch type flushing isn't needed, in which case the lighter-weight SBPB can be used. The 'x86_pred_cmd' variable already keeps track of whether IBPB or SBPB should be used. Use that instead of hardcoding IBPB. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/17c5dcd14b29199b75199d67ff7758de9d9a4928.1744148254.git.jpoimboe@kernel.org