path: root/arch
Age  Commit message  Author
2025-11-06  arm64: dts: ti: k3-am65: disable "mcu_cpsw" in SoC file and enable in board file  [Siddharth Vadapalli]
Following the existing convention of disabling nodes in the SoC file and enabling only the required ones in the board file, disable the "mcu_cpsw" node in the SoC file "k3-am65-mcu.dtsi" and enable it in the board file "k3-am654-base-board.dts". Also, now that "mcu_cpsw" is disabled in the SoC file, disabling it in "k3-am65-iot2050-common.dtsi" is no longer required. Hence, remove the corresponding section. Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com> Link: https://patch.msgid.link/20251015111344.3639415-3-s-vadapalli@ti.com Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
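A minimal sketch of the convention described above, using the node and file names from this commit (in the real SoC dtsi the property sits inside the full node definition rather than a label override, and the node carries many more properties):
  /* k3-am65-mcu.dtsi (SoC file): ship the node disabled */
  &mcu_cpsw {
          status = "disabled";
  };

  /* k3-am654-base-board.dts (board file): enable it only where the board wires it up */
  &mcu_cpsw {
          status = "okay";
  };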
2025-11-06  arm64: dts: ti: k3-am62: disable "cpsw3g" in SoC file and enable in board file  [Siddharth Vadapalli]
Following the existing convention of disabling nodes in the SoC file and enabling only the required ones in the board file, disable the "cpsw3g" node in the SoC file "k3-am62-main.dtsi" and enable it in the board (or board include) files:
 a) k3-am62-lp-sk.dts
 b) k3-am62-phycore-som.dtsi
 c) k3-am625-beagleplay.dts
 d) k3-am625-sk-common.dtsi
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com> Link: https://patch.msgid.link/20251015111344.3639415-2-s-vadapalli@ti.com Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
2025-11-06  arm64: dts: ti: k3-am62p5-sk: Set wakeup-source system-states  [Markus Schneider-Pargmann (TI.com)]
The CANUART pins of mcu_mcan0, mcu_mcan1, mcu_uart0 and wkup_uart0 are powered during Partial-IO and I/O Only + DDR and are capable of waking up the system in these states. Specify the states in which these units can do a wakeup on this board. Note that the UARTs are not capable of wakeup in Partial-IO because of a UART mux on the board not being powered during Partial-IO. Add pinctrl definitions for mcu_mcan0 and mcu_mcan1 for wakeup from Partial-IO. Add these as wakeup pinctrl entries for both devices. Signed-off-by: Markus Schneider-Pargmann (TI.com) <msp@baylibre.com> Link: https://patch.msgid.link/20251103-topic-am62-dt-partialio-v6-15-v5-6-b8d9ff5f2742@baylibre.com Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
2025-11-06  arm64: dts: ti: k3-am62a7-sk: Set wakeup-source system-states  [Markus Schneider-Pargmann (TI.com)]
The CANUART pins of mcu_mcan0, mcu_mcan1, mcu_uart0 and wkup_uart0 are powered during Partial-IO and I/O Only + DDR and are capable of waking up the system in these states. Specify the states in which these units can do a wakeup on this board. Note that the UARTs are not capable of wakeup in Partial-IO because of a UART mux on the board not being powered during Partial-IO. Add pinctrl definitions for mcu_mcan0 and mcu_mcan1 for wakeup from Partial-IO. Add these as wakeup pinctrl entries for both devices. Signed-off-by: Markus Schneider-Pargmann (TI.com) <msp@baylibre.com> Link: https://patch.msgid.link/20251103-topic-am62-dt-partialio-v6-15-v5-5-b8d9ff5f2742@baylibre.com Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
2025-11-06  arm64: dts: ti: k3-am62-lp-sk: Set wakeup-source system-states  [Markus Schneider-Pargmann (TI.com)]
The CANUART pins of mcu_mcan0, mcu_mcan1, mcu_uart0 and wkup_uart0 are powered during Partial-IO and I/O Only + DDR and are capable of waking up the system in these states. Specify the states in which these units can do a wakeup on this board. Note that the UARTs are not capable of wakeup in Partial-IO because of a UART mux on the board not being powered during Partial-IO. As I/O Only + DDR is not supported on AM62x, the UARTs are not added in this patch. Add pinctrl definitions for mcu_mcan0 and mcu_mcan1 for wakeup from Partial-IO. Add these as wakeup pinctrl entries for both devices. Signed-off-by: Markus Schneider-Pargmann (TI.com) <msp@baylibre.com> Link: https://patch.msgid.link/20251103-topic-am62-dt-partialio-v6-15-v5-4-b8d9ff5f2742@baylibre.com Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
2025-11-06  arm64: dts: ti: k3-am62p: Define possible system states  [Markus Schneider-Pargmann (TI.com)]
Add the system states that are available on TI AM62P SoCs. Signed-off-by: Markus Schneider-Pargmann (TI.com) <msp@baylibre.com> Link: https://patch.msgid.link/20251103-topic-am62-dt-partialio-v6-15-v5-3-b8d9ff5f2742@baylibre.com Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
2025-11-06  arm64: dts: ti: k3-am62a: Define possible system states  [Markus Schneider-Pargmann (TI.com)]
Add the system states that are available on TI AM62A SoCs. Signed-off-by: Markus Schneider-Pargmann (TI.com) <msp@baylibre.com> Link: https://patch.msgid.link/20251103-topic-am62-dt-partialio-v6-15-v5-2-b8d9ff5f2742@baylibre.com Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
2025-11-06  arm64: dts: ti: k3-am62: Define possible system states  [Markus Schneider-Pargmann (TI.com)]
Add the system states that are available on TI AM62 SoCs. Signed-off-by: Markus Schneider-Pargmann (TI.com) <msp@baylibre.com> Link: https://patch.msgid.link/20251103-topic-am62-dt-partialio-v6-15-v5-1-b8d9ff5f2742@baylibre.com Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
2025-11-06  arm64: dts: ti: k3-am62p-j722s-common-main: move audio_refclk here  [Michael Walle]
Since commit 9dee9cb2df08 ("arm64: dts: ti: k3-j722s-main: fix the audio refclk source") the clock nodes of the am62p and j722s are the same. Move them into the common dtsi. Please note that for the j722s the nodes are renamed from clock@ to clock-controller@. Suggested-by: Udit Kumar <u-kumar1@ti.com> Signed-off-by: Michael Walle <mwalle@kernel.org> Link: https://patch.msgid.link/20251103152826.1608309-1-mwalle@kernel.org Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
2025-11-06  arm64: dts: ti: k3-*: Replace rgmii-rxid with rgmii-id for CPSW ports  [Siddharth Vadapalli]
The MAC Ports across all of the CPSW instances (CPSW2G, CPSW3G, CPSW5G and CPSW9G) present in various K3 SoCs only support the 'RGMII-ID' mode. This correction has been implemented/enforced by the updates to:
 a) Device-Tree binding for CPSW [0]
 b) Driver for CPSW [1]
 c) Driver for CPSW MAC Port's GMII [2]
To complete the transition from 'RGMII-RXID' to 'RGMII-ID', update the 'phy-mode' property for all CPSW ports by replacing 'rgmii-rxid' with 'rgmii-id'.
[0]: commit 9b357ea52523 ("dt-bindings: net: ti: k3-am654-cpsw-nuss: update phy-mode in example")
[1]: commit ca13b249f291 ("net: ethernet: ti: am65-cpsw: fixup PHY mode for fixed RGMII TX delay")
[2]: commit a22d3b0d49d4 ("phy: ti: gmii-sel: Always write the RGMII ID setting")
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com> Tested-by: Matthias Schiffer <matthias.schiffer@ew.tq-group.com> # k3-am642-tqma64xxl-mbax4xxl Tested-by: Francesco Dolcini <francesco.dolcini@toradex.com> # Toradex Verdin AM62P Link: https://patch.msgid.link/20251025073802.1790437-1-s-vadapalli@ti.com Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
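On each affected MAC port node the change amounts to a one-line 'phy-mode' update, sketched below (the 'cpsw_port1' label is only an example; port labels vary across boards):
  &cpsw_port1 {
          phy-mode = "rgmii-id";  /* was "rgmii-rxid" */
  };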
2025-11-06  arm64: dts: ti: k3-am642-tqma64xxl: add boot phase tags  [Matthias Schiffer]
Similar to other AM64x-based boards, add boot phase tags to make the Device Trees usable for firmware/bootloaders without modification. Supported boot devices are eMMC/SD card, SPI-NOR and USB (both mass storage and DFU). The I2C EEPROM is included to allow the firmware to select the correct RAM configuration for different TQMa64xxL variants. Signed-off-by: Matthias Schiffer <matthias.schiffer@ew.tq-group.com> Link: https://patch.msgid.link/20251105141726.39579-1-matthias.schiffer@ew.tq-group.com Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
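Boot phase tags are the generic 'bootph-*' devicetree properties; as an illustrative sketch (the node below is an arbitrary example, not taken from this patch), tagging a node marks it for inclusion in the early boot stages' device trees:
  &sdhci0 {
          bootph-all;
  };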
2025-11-05  crypto: s390/sha3 - Remove superseded SHA-3 code  [Eric Biggers]
The SHA-3 library now utilizes the same s390 SHA-3 acceleration capabilities as the arch/s390/crypto/ SHA-3 crypto_shash algorithms. Moreover, crypto/sha3.c now uses the SHA-3 library. The result is that all SHA-3 APIs are now s390-accelerated without any need for the old SHA-3 code in arch/s390/crypto/. Remove this superseded code. Also update the s390 defconfig and debug_defconfig files to enable CONFIG_CRYPTO_SHA3 instead of CONFIG_CRYPTO_SHA3_256_S390 and CONFIG_CRYPTO_SHA3_512_S390. This makes it so that the s390-optimized SHA-3 continues to be built when either of these defconfigs is used. Tested-by: Harald Freudenberger <freude@linux.ibm.com> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20251026055032.1413733-16-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-11-05  lib/crypto: arm64/sha3: Migrate optimized code into library  [Eric Biggers]
Instead of exposing the arm64-optimized SHA-3 code via arm64-specific crypto_shash algorithms, just implement the sha3_absorb_blocks() and sha3_keccakf() library functions. This is much simpler, it makes the SHA-3 library functions arm64-optimized, and it fixes the longstanding issue where the arm64-optimized SHA-3 code was disabled by default. SHA-3 still remains available through crypto_shash, but individual architectures no longer need to handle it. Note: to see the diff from arch/arm64/crypto/sha3-ce-glue.c to lib/crypto/arm64/sha3.h, view this commit with 'git show -M10'. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20251026055032.1413733-10-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-11-05  crypto: arm64/sha3 - Update sha3_ce_transform() to prepare for library  [Eric Biggers]
- Use size_t lengths, to match the library.
- Pass the block size instead of digest size, and add support for the block size that SHAKE128 uses. This allows the code to be used with SHAKE128 and SHAKE256, which don't have the concept of a digest size. SHAKE256 has the same block size as SHA3-256, but SHAKE128 has a unique block size. Thus, there are now 5 supported block sizes.
Don't bother changing the "glue" code arm64_sha3_update() too much, as it gets deleted when the SHA-3 code is migrated into lib/crypto/ anyway. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20251026055032.1413733-9-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-11-05  bpf, x86: add support for indirect jumps  [Anton Protopopov]
Add support for a new instruction BPF_JMP|BPF_X|BPF_JA, SRC=0, DST=Rx, off=0, imm=0 which does an indirect jump to a location stored in Rx. The register Rx should have type PTR_TO_INSN. This new type ensures that the Rx register contains a value (or a range of values) loaded from a correct jump table – a map of type instruction array. For example, for a C switch LLVM will generate the following code:
  0: r3 = r1                   # "switch (r3)"
  1: if r3 > 0x13 goto +0x666  # check r3 boundaries
  2: r3 <<= 0x3                # adjust to an index in array of addresses
  3: r1 = 0xbeef ll            # r1 is PTR_TO_MAP_VALUE, r1->map_ptr=M
  5: r1 += r3                  # r1 inherits boundaries from r3
  6: r1 = *(u64 *)(r1 + 0x0)   # r1 now has type PTR_TO_INSN
  7: gotox r1                  # jit will generate proper code
Here the gotox instruction corresponds to one particular map. It is possible, however, to have a gotox instruction whose target can be loaded from different maps, e.g.
  0: r1 &= 0x1
  1: r2 <<= 0x3
  2: r3 = 0x0 ll               # load from map M_1
  4: r3 += r2
  5: if r1 == 0x0 goto +0x4
  6: r1 <<= 0x3
  7: r3 = 0x0 ll               # load from map M_2
  9: r3 += r1
  A: r1 = *(u64 *)(r3 + 0x0)
  B: gotox r1                  # jump to target loaded from M_1 or M_2
During the check_cfg stage the verifier will collect all the maps which point to inside the subprog being verified. When building the CFG, the high 16 bits of the insn_state are used, so this patch (theoretically) supports jump tables of up to 2^16 slots. During the later stage, in check_indirect_jump, it is checked that the register Rx was loaded from a particular instruction array. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251105090410.1250500-9-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-05  bpf, x86: allow indirect jumps to r8...r15  [Anton Protopopov]
Currently the emit_indirect_jump() function only accepts one of the RAX, RCX, ..., RBP registers as the destination. Make it accept R8, R9, ..., R15 as well, and make callers pass BPF registers, not native registers. This is required to enable indirect jump support in eBPF. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251105090410.1250500-8-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-05  bpf, x86: add new map type: instructions array  [Anton Protopopov]
On the bpf(BPF_PROG_LOAD) syscall user-supplied BPF programs are translated by the verifier into "xlated" BPF programs. During this process the original instruction offsets might be adjusted and/or individual instructions might be replaced by new sets of instructions, or deleted. Add a new BPF map type which is aimed at keeping track of how, for a given program, the original instructions were relocated during the verification. Also, besides keeping track of the original -> xlated mapping, make the x86 JIT build the xlated -> jitted mapping for every instruction listed in an instruction array. This is required for every future application of instruction arrays: static keys, indirect jumps and indirect calls. A map of the BPF_MAP_TYPE_INSN_ARRAY type must be created with u32 keys and values of size 8. The values have different semantics for userspace and for BPF space. For userspace a value consists of two u32 values – xlated and jitted offsets. On the BPF side the value is a real pointer to a jitted instruction. On map creation/initialization, before loading the program, each element of the map should be initialized to point to an instruction offset within the program. Before the program load such maps should be frozen. After the program verification xlated and jitted offsets can be read via the bpf(2) syscall. If a tracked instruction is removed by the verifier, then the xlated offset is set to (u32)-1 which is considered to be too big for a valid BPF program offset. One such map can, obviously, be used to track one and only one BPF program. If the verification process was unsuccessful, then the same map can be re-used to verify the program with a different log level. However, if the program was loaded fine, then such a map, being frozen in any case, can't be reused by other programs even after the program is released. Example. Consider the following original and xlated programs:
  Original prog:                  Xlated prog:
   0: r1 = 0x0                     0: r1 = 0
   1: *(u32 *)(r10 - 0x4) = r1     1: *(u32 *)(r10 -4) = r1
   2: r2 = r10                     2: r2 = r10
   3: r2 += -0x4                   3: r2 += -4
   4: r1 = 0x0 ll                  4: r1 = map[id:88]
   6: call 0x1                     6: r1 += 272
                                   7: r0 = *(u32 *)(r2 +0)
                                   8: if r0 >= 0x1 goto pc+3
                                   9: r0 <<= 3
                                  10: r0 += r1
                                  11: goto pc+1
                                  12: r0 = 0
   7: r6 = r0                     13: r6 = r0
   8: if r6 == 0x0 goto +0x2      14: if r6 == 0x0 goto pc+4
   9: call 0x76                   15: r0 = 0xffffffff8d2079c0
                                  17: r0 = *(u64 *)(r0 +0)
  10: *(u64 *)(r6 + 0x0) = r0     18: *(u64 *)(r6 +0) = r0
  11: r0 = 0x0                    19: r0 = 0x0
  12: exit                        20: exit
An instruction array map containing, e.g., instructions [0,4,7,12] will be translated by the verifier to [0,4,13,20]. A map with index 5 (the middle of a 16-byte instruction) or indexes greater than 12 (outside the program boundaries) would be rejected. The functionality provided by this patch will be extended in subsequent patches to implement BPF Static Keys, indirect jumps, and indirect calls. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251105090410.1250500-2-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-05  x86/fgraph,bpf: Fix stack ORC unwind from kprobe_multi return probe  [Jiri Olsa]
Currently we don't get a stack trace via the ORC unwinder on top of the fgraph exit handler. We can see that when generating a stacktrace from a kretprobe_multi bpf program, which is based on fprobe/fgraph. The reason is that the ORC unwind code won't get past the return_to_handler callback installed by the fgraph return probe machinery. Solve this by creating the stack frame in return_to_handler that ftrace_graph_ret_addr expects, to recover the original return address and continue with the unwind. Also update the pt_regs data with cs/flags/rsp, which are needed for successful stack retrieval from the ebpf bpf_get_stackid helper:
 - in get_perf_callchain we check user_mode(regs), so CS has to be set
 - in perf_callchain_kernel we call perf_hw_regs(regs), so EFLAGS/FIXED has to be unset
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20251104215405.168643-3-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-05Revert "perf/x86: Always store regs->ip in perf_callchain_kernel()"Jiri Olsa
This reverts commit 83f44ae0f8afcc9da659799db8693f74847e66b3. Currently we store the initial stacktrace entry twice for non-HW pt_regs, which affects callers that fail the perf_hw_regs(regs) condition in perf_callchain_kernel. It's easy to reproduce with this bpftrace:
  # bpftrace -e 'tracepoint:sched:sched_process_exec { print(kstack()); }'
  Attaching 1 probe...
        bprm_execve+1767
        bprm_execve+1767
        do_execveat_common.isra.0+425
        __x64_sys_execve+56
        do_syscall_64+133
        entry_SYSCALL_64_after_hwframe+118
When perf_callchain_kernel calls unwind_start with first_frame, AFAICS we do not skip regs->ip, but it's added as part of the unwind process. Hence revert the extra perf_callchain_store for the non-HW regs leg. I was not able to bisect this, so I'm not really sure why this was needed in v5.2 and why it's not working anymore, but I could see double entries as far back as v5.10. I did the test for both ORC and framepointer unwind with and without this fix, and except for the initial entry the stacktraces are the same. Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20251104215405.168643-2-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-05  ARM: dts: omap: am335x-mba335x: Fix stray '/*' in comment  [Nathan Chancellor]
When preprocessing arch/arm/boot/dts/ti/omap/am335x-mba335x.dts with clang, there are a couple of warnings about '/*' within a block comment.
  arch/arm/boot/dts/ti/omap/am335x-mba335x.dts:260:7: warning: '/*' within block comment [-Wcomment]
    260 |     /* /* gpmc_csn3.gpio2_0 - interrupt */
        |        ^
  arch/arm/boot/dts/ti/omap/am335x-mba335x.dts:267:7: warning: '/*' within block comment [-Wcomment]
    267 |     /* /* gpmc_ben1.gpio1_28 - interrupt */
        |        ^
Remove the duplicate '/*' to clear up the warnings. Fixes: 5267fcd180b1 ("ARM: dts: omap: Add support for TQMa335x/MBa335x") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/r/20251105-omap-mba335x-fix-clang-comment-warning-v2-1-f8a0003e1003@kernel.org Signed-off-by: Kevin Hilman <khilman@baylibre.com>
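For reference, the fixed comments simply drop the stray inner '/*' (reconstructed from the warnings above):
  /* gpmc_csn3.gpio2_0 - interrupt */
  /* gpmc_ben1.gpio1_28 - interrupt */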
2025-11-05  ARM: dts: omap: am335x-tqma335x/mba335x: Fix MicIn routing  [Alexander Stein]
'Mic Jack' is connected to IN3_L, and 'Mic Bias' is connected to 'Mic Jack'. Adjust the routing accordingly. Fixes: 5267fcd180b1 ("ARM: dts: omap: Add support for TQMa335x/MBa335x") Signed-off-by: Alexander Stein <alexander.stein@ew.tq-group.com> Link: https://lore.kernel.org/r/20251105083422.1010825-1-alexander.stein@ew.tq-group.com Signed-off-by: Kevin Hilman <khilman@baylibre.com>
2025-11-06  arm64: dts: rockchip: add missing clocks for cpu cores on rk356x  [Heiko Stuebner]
All cpu cores are supplied by the same clock, but all except the first core are missing that clock reference - add the missing ones. Reviewed-by: Diederik de Haas <diederik@cknow-tech.com> Signed-off-by: Heiko Stuebner <heiko@sntech.de> Link: https://patch.msgid.link/20251103234926.416137-4-heiko@sntech.de
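A minimal sketch of the addition, shown in label-override form for brevity (the 'scmi_clk' label and the literal clock id 0 mirror what the first cpu node already uses on rk356x; the adjacent "use SCMI clock id" commit replaces the literal 0 with the binding's named constant):
  &cpu1 {
          clocks = <&scmi_clk 0>;
  };

  &cpu2 {
          clocks = <&scmi_clk 0>;
  };

  &cpu3 {
          clocks = <&scmi_clk 0>;
  };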
2025-11-06  arm64: dts: rockchip: use SCMI clock id for cpu clock on rk356x  [Heiko Stuebner]
Instead of hard-coding 0, use the more descriptive ID from the binding to reference the SCMI clock for the cpu on rk356x. Reviewed-by: Diederik de Haas <diederik@cknow-tech.com> Signed-off-by: Heiko Stuebner <heiko@sntech.de> Link: https://patch.msgid.link/20251103234926.416137-3-heiko@sntech.de
2025-11-05  x86/mce/amd: Define threshold restart function for banks  [Yazen Ghannam]
Prepare for CMCI storm support by moving the common bank/block iterator code to a helper function. Include a parameter to switch the interrupt enable. This will be used by the CMCI storm handling function. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Link: https://lore.kernel.org/20251104-wip-mca-updates-v8-0-66c8eacf67b9@amd.com
2025-11-05  x86/mce/amd: Remove redundant reset_block()  [Yazen Ghannam]
Many of the checks in reset_block() are done again in the block reset function. So drop the redundant checks. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20251104-wip-mca-updates-v8-0-66c8eacf67b9@amd.com
2025-11-05  KVM: nSVM: Avoid incorrect injection of SVM_EXIT_CR0_SEL_WRITE  [Yosry Ahmed]
When emulating L2 instructions, svm_check_intercept() checks whether a write to CR0 should trigger a synthesized #VMEXIT with SVM_EXIT_CR0_SEL_WRITE. However, it does not check whether L1 enabled the intercept for SVM_EXIT_WRITE_CR0, which has higher priority according to the APM (24593—Rev. 3.42—March 2024, Table 15-7): When both selective and non-selective CR0-write intercepts are active at the same time, the non-selective intercept takes priority. With respect to exceptions, the priority of this intercept is the same as the generic CR0-write intercept. Make sure L1 does NOT intercept SVM_EXIT_WRITE_CR0 before checking if SVM_EXIT_CR0_SEL_WRITE needs to be injected. Opportunistically tweak the "not CR0" logic to explicitly bail early so that it's more obvious that only CR0 has a selective intercept, and that modifying icpt_info.exit_code is functionally necessary so that the call to nested_svm_exit_handled() checks the correct exit code. Fixes: cfec82cb7d31 ("KVM: SVM: Add intercept check for emulated cr accesses") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251024192918.3191141-4-yosry.ahmed@linux.dev [sean: isolate non-CR0 write logic, tweak comments accordingly] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  KVM: nSVM: Propagate SVM_EXIT_CR0_SEL_WRITE correctly for LMSW emulation  [Yosry Ahmed]
When emulating L2 instructions, svm_check_intercept() checks whether a write to CR0 should trigger a synthesized #VMEXIT with SVM_EXIT_CR0_SEL_WRITE. For MOV-to-CR0, SVM_EXIT_CR0_SEL_WRITE is only triggered if any bit other than CR0.MP and CR0.TS is updated. However, according to the APM (24593—Rev. 3.42—March 2024, Table 15-7): The LMSW instruction treats the selective CR0-write intercept as a non-selective intercept (i.e., it intercepts regardless of the value being written). Skip checking the changed bits for x86_intercept_lmsw and always inject SVM_EXIT_CR0_SEL_WRITE. Fixes: cfec82cb7d31 ("KVM: SVM: Add intercept check for emulated cr accesses") Cc: stable@vger.kernel.org Reported-by: Matteo Rizzo <matteorizzo@google.com> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251024192918.3191141-3-yosry.ahmed@linux.dev Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  KVM: nSVM: Remove redundant cases in nested_svm_intercept()  [Yosry Ahmed]
Both the CRx and DRx cases are doing exactly what the default case is doing; remove them. No functional change intended. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251024192918.3191141-2-yosry.ahmed@linux.dev Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  x86/mce/amd: Support SMCA Corrected Error Interrupt  [Yazen Ghannam]
AMD systems optionally support MCA thresholding which provides the ability for hardware to send an interrupt when a set error threshold is reached. This feature counts errors of all severities, but it is commonly used to report correctable errors with an interrupt rather than polling. Scalable MCA systems allow the platform to take control of this feature. In this case, the OS will not see the feature configuration and control bits in the MCA_MISC* registers. The OS will not receive the MCA thresholding interrupt, and it will need to poll for correctable errors. A "corrected error interrupt" will be available on Scalable MCA systems. This will be used in the same configuration where the platform controls MCA thresholding. However, the platform will now be able to send the MCA thresholding interrupt to the OS. Check for, and enable, this feature during per-CPU SMCA init. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/20251104-wip-mca-updates-v8-0-66c8eacf67b9@amd.com
2025-11-05  KVM: TDX: Fix list_add corruption during vcpu_load()  [Yan Zhao]
During vCPU creation, a vCPU may be destroyed immediately after kvm_arch_vcpu_create() (e.g., due to a vCPU id conflict). However, the vcpu_load() inside kvm_arch_vcpu_create() may have associated the vCPU with a pCPU via "list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu))" before invoking tdx_vcpu_free(). Though there's no need to invoke tdh_vp_flush() on the vCPU, failing to dissociate the vCPU from the pCPU (i.e., "list_del(&to_tdx(vcpu)->cpu_list)") will cause list corruption of the per-pCPU list associated_tdvcpus. Then, a later list_add() during vcpu_load() would detect the list corruption and print a calltrace as shown below. Dissociate a vCPU from its associated pCPU in tdx_vcpu_free() for the vCPUs destroyed immediately after creation, which must be in the VCPU_TD_STATE_UNINITIALIZED state.
  kernel BUG at lib/list_debug.c:29!
  Oops: invalid opcode: 0000 [#2] SMP NOPTI
  RIP: 0010:__list_add_valid_or_report+0x82/0xd0
  Call Trace:
   <TASK>
   tdx_vcpu_load+0xa8/0x120
   vt_vcpu_load+0x25/0x30
   kvm_arch_vcpu_load+0x81/0x300
   vcpu_load+0x55/0x90
   kvm_arch_vcpu_create+0x24f/0x330
   kvm_vm_ioctl_create_vcpu+0x1b1/0x53
   kvm_vm_ioctl+0xc2/0xa60
   __x64_sys_ioctl+0x9a/0xf0
   x64_sys_call+0x10ee/0x20d0
   do_syscall_64+0xc3/0x470
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes: d789fa6efac9 ("KVM: TDX: Handle vCPU dissociation") Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-29-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  KVM: TDX: Bug the VM if extending the initial measurement fails  [Sean Christopherson]
WARN and terminate the VM if TDH_MR_EXTEND fails, as extending the measurement should fail if and only if there is a KVM bug, or if the S-EPT mapping is invalid. Now that KVM makes all state transitions mutually exclusive via tdx_vm_state_guard, it should be impossible for S-EPT mappings to be removed between kvm_tdp_mmu_map_private_pfn() and tdh_mr_extend(). Holding slots_lock prevents zaps due to memslot updates, filemap_invalidate_lock() prevents zaps due to guest_memfd PUNCH_HOLE, vcpu->mutex locks prevents updates from other vCPUs, kvm->lock prevents VM-scoped ioctls from creating havoc (e.g. by creating new vCPUs), and all usage of kvm_zap_gfn_range() is mutually exclusive with S-EPT entries that can be used for the initial image. For kvm_zap_gfn_range(), the call from sev.c is obviously mutually exclusive, TDX disallows KVM_X86_QUIRK_IGNORE_GUEST_PAT so the same goes for kvm_noncoherent_dma_assignment_start_or_stop(), and __kvm_set_or_clear_apicv_inhibit() is blocked by virtue of holding all VM and vCPU mutexes (and the APIC page has its own KVM-internal memslot that is never created for TDX VMs, and so can't possibly be used for the initial image, which means that too is mutually exclusive irrespective of locking). Opportunistically return early if the region doesn't need to be measured in order to reduce line lengths and avoid wraps. Similarly, immediately and explicitly return if TDH_MR_EXTEND fails to make it clear that KVM needs to bail entirely if extending the measurement fails. Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-28-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  KVM: TDX: Guard VM state transitions with "all" the locks  [Sean Christopherson]
Acquire kvm->lock, kvm->slots_lock, and all vcpu->mutex locks when servicing ioctls that (a) transition the TD to a new state, i.e. when doing INIT or FINALIZE, or (b) are only valid if the TD is in a specific state, i.e. when initializing a vCPU or memory region. Acquiring "all" the locks fixes several KVM_BUG_ON() situations where a SEAMCALL can fail due to racing actions, e.g. if tdh_vp_create() contends with either tdh_mr_extend() or tdh_mr_finalize(). For all intents and purposes, the paths in question are fully serialized, i.e. there's no reason to try and allow anything remotely interesting to happen. Smack 'em with a big hammer instead of trying to be "nice". Acquire kvm->lock to prevent VM-wide things from happening, slots_lock to prevent kvm_mmu_zap_all_fast(), and _all_ vCPU mutexes to prevent vCPUs from interfering. Use the recently-renamed kvm_arch_vcpu_unlocked_ioctl() to service the vCPU-scoped ioctls to avoid a lock inversion problem, e.g. due to taking vcpu->mutex outside kvm->lock. See also commit ecf371f8b02d ("KVM: SVM: Reject SEV{-ES} intra host migration if vCPU creation is in-flight"), which fixed a similar bug with SEV intra-host migration where an in-flight vCPU creation could race with a VM-wide state transition. Define a fancy new CLASS to handle the lock+check => unlock logic with guard()-like syntax:
  CLASS(tdx_vm_state_guard, guard)(kvm);
  if (IS_ERR(guard))
          return PTR_ERR(guard);
to simplify juggling the many locks. Note! Take kvm->slots_lock *after* all vcpu->mutex locks, as per KVM's soon-to-be-documented lock ordering rules[1]. Link: https://lore.kernel.org/all/20251016235538.171962-1-seanjc@google.com [1] Reported-by: Yan Zhao <yan.y.zhao@intel.com> Closes: https://lore.kernel.org/all/aLFiPq1smdzN3Ary@yzhao56-desk.sh.intel.com Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-27-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  KVM: TDX: Don't copy "cmd" back to userspace for KVM_TDX_CAPABILITIES  [Sean Christopherson]
Don't copy the kvm_tdx_cmd structure back to userspace when handling KVM_TDX_CAPABILITIES, as tdx_get_capabilities() doesn't modify hw_error or any other fields. Opportunistically hoist the call to tdx_get_capabilities() outside of the kvm->lock critical section, as getting the capabilities doesn't touch the VM in any way, e.g. doesn't even take @kvm. Suggested-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-26-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  KVM: TDX: Use guard() to acquire kvm->lock in tdx_vm_ioctl()  [Sean Christopherson]
Use guard() in tdx_vm_ioctl() to tidy up the code a small amount, but more importantly to minimize the diff of a future change, which will use guard-like semantics to acquire and release multiple locks. No functional change intended. Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-25-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  KVM: TDX: Convert INIT_MEM_REGION and INIT_VCPU to "unlocked" vCPU ioctl  [Sean Christopherson]
Handle the KVM_TDX_INIT_MEM_REGION and KVM_TDX_INIT_VCPU vCPU sub-ioctls in the unlocked variant, i.e. outside of vcpu->mutex, in anticipation of taking kvm->lock along with all other vCPU mutexes, at which point the sub-ioctls _must_ start without vcpu->mutex held. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-24-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  KVM: TDX: Add tdx_get_cmd() helper to get and validate sub-ioctl command  [Sean Christopherson]
Add a helper to copy a kvm_tdx_cmd structure from userspace and verify that must-be-zero fields are indeed zero. No functional change intended. Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-23-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  KVM: TDX: Add macro to retry SEAMCALLs when forcing vCPUs out of guest  [Sean Christopherson]
Add a macro to handle kicking vCPUs out of the guest and retrying SEAMCALLs on TDX_OPERAND_BUSY instead of providing small helpers to be used by each SEAMCALL. Wrapping the SEAMCALLs in a macro makes it a little harder to tease out which SEAMCALL is being made, but significantly reduces the amount of copy+paste code, and makes it all but impossible to leave an elevated wait_for_sept_zap. Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-22-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  Merge tag 'rust-fixes-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux  [Linus Torvalds]
Pull rust fixes from Miguel Ojeda:
 - Fix/workaround a couple Rust 1.91.0 build issues when sanitizers are enabled due to extra checking performed by the compiler and an upstream issue already fixed for Rust 1.93.0
 - Fix future Rust 1.93.0 builds by supporting the stabilized name for the 'no-jump-tables' flag
 - Fix a couple private/broken intra-doc links uncovered by the future move of pin-init to 'syn'
* tag 'rust-fixes-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux:
  rust: kbuild: support `-Cjump-tables=n` for Rust 1.93.0
  rust: kbuild: workaround `rustdoc` doctests modifier bug
  rust: kbuild: treat `build_error` and `rustdoc` as kernel objects
  rust: condvar: fix broken intra-doc link
  rust: devres: fix private intra-doc link
2025-11-05  KVM: TDX: Assert that mmu_lock is held for write when removing S-EPT entries  [Sean Christopherson]
Unconditionally assert that mmu_lock is held for write when removing S-EPT entries, not just when removing S-EPT entries triggers certain conditions, e.g. needs to do TDH_MEM_TRACK or kick vCPUs out of the guest. Conditionally asserting implies that it's safe to hold mmu_lock for read when those paths aren't hit, which is simply not true, as KVM doesn't support removing S-EPT entries under read-lock. Only two paths lead to remove_external_spte(), and both paths assert that mmu_lock is held for write (tdp_mmu_set_spte() via lockdep, and handle_removed_pt() via KVM_BUG_ON()). Deliberately leave lockdep assertions in the "no vCPUs" helpers to document that wait_for_sept_zap is guarded by holding mmu_lock for write, and keep the conditional assert in tdx_track() as well, but with a comment to help explain why holding mmu_lock for write matters (above and beyond tdx_sept_remove_private_spte()'s requirements). Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-21-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  KVM: TDX: Derive error argument names from the local variable names  [Sean Christopherson]
When printing SEAMCALL errors, use the name of the variable holding an error parameter instead of the register from whence it came, so that flows which use descriptive variable names will similarly print descriptive error messages. Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-20-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  KVM: TDX: Combine KVM_BUG_ON + pr_tdx_error() into TDX_BUG_ON()  [Sean Christopherson]
Add TDX_BUG_ON() macros (with varying numbers of arguments) to deduplicate the myriad flows that do KVM_BUG_ON()/WARN_ON_ONCE() followed by a call to pr_tdx_error(). In addition to reducing boilerplate copy+paste code, this also helps ensure that KVM provides consistent handling of SEAMCALL errors. Opportunistically convert a handful of bare WARN_ON_ONCE() paths to the equivalent of KVM_BUG_ON(), i.e. have them terminate the VM. If a SEAMCALL error is fatal enough to WARN on, it's fatal enough to terminate the TD. Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  KVM: TDX: Fold tdx_sept_zap_private_spte() into tdx_sept_remove_private_spte()  [Sean Christopherson]
Do TDH_MEM_RANGE_BLOCK directly in tdx_sept_remove_private_spte() instead of using a one-off helper now that the nr_premapped tracking is gone. Opportunistically drop the WARN on hugepages, which was dead code (see the KVM_BUG_ON() in tdx_sept_remove_private_spte()). No functional change intended. Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-18-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  KVM: TDX: ADD pages to the TD image while populating mirror EPT entries  [Sean Christopherson]
When populating the initial memory image for a TDX guest, ADD pages to the TD as part of establishing the mappings in the mirror EPT, as opposed to creating the mappings and then doing ADD after the fact. Doing ADD in the S-EPT callbacks eliminates the need to track "premapped" pages, as the mirror EPT (M-EPT) and S-EPT are always synchronized, e.g. if ADD fails, KVM reverts to the previous M-EPT entry (guaranteed to be !PRESENT). Eliminating the hole where the M-EPT can have a mapping that doesn't exist in the S-EPT in turn obviates the need to handle errors that are unique to encountering a missing S-EPT entry (see tdx_is_sept_zap_err_due_to_premap()). Keeping the M-EPT and S-EPT synchronized also eliminates the need to check for unconsumed "premap" entries during tdx_td_finalize(), as there simply can't be any such entries. Dropping that check in particular reduces the overall cognitive load, as the management of nr_premapped with respect to removal of S-EPT is _very_ subtle. E.g. successful removal of an S-EPT entry after it completed ADD doesn't adjust nr_premapped, but it's not clear why that's "ok" but having half-baked entries is not (it's not truly "ok" in that removing pages from the image will likely prevent the guest from booting, but from KVM's perspective it's "ok"). Doing ADD in the S-EPT path requires passing an argument via a scratch field, but the current approach of tracking the number of "premapped" pages effectively does the same. And the "premapped" counter is much more dangerous, as it doesn't have a singular lock to protect its usage, since nr_premapped can be modified as soon as mmu_lock is dropped, at least in theory. I.e. nr_premapped is guarded by slots_lock, but only for "happy" paths. Note, this approach was used/tried at various points in TDX development, but was ultimately discarded due to a desire to avoid stashing temporary state in kvm_tdx. But as above, KVM ended up with such state anyways, and fully committing to using temporary state provides better access rules (100% guarded by slots_lock), and makes several edge cases flat out impossible. Note #2, continue to extend the measurement outside of mmu_lock, as it's a slow operation (typically 16 SEAMCALLs per page whose data is included in the measurement), and doesn't *need* to be done under mmu_lock, e.g. for consistency purposes. However, MR.EXTEND isn't _that_ slow, e.g. ~1ms latency to measure a full page, so if it needs to be done under mmu_lock in the future, e.g. because KVM gains a flow that can remove S-EPT entries during KVM_TDX_INIT_MEM_REGION, then extending the measurement can also be moved into the S-EPT mapping path (again, only if absolutely necessary). P.S. _If_ MR.EXTEND is moved into the S-EPT path, take care not to return an error up the stack if TDH_MR_EXTEND fails, as removing the M-EPT entry but not the S-EPT entry would result in inconsistent state! Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller  [Sean Christopherson]
Fold tdx_mem_page_record_premap_cnt() into tdx_sept_set_private_spte() as providing a one-off helper for effectively three lines of code is at best a wash, and splitting the code makes the comment for smp_rmb() _extremely_ confusing as the comment talks about reading kvm->arch.pre_fault_allowed before kvm_tdx->state, but the immediately visible code does the exact opposite. Opportunistically rewrite the comments to more explicitly explain who is checking what, as well as _why_ the ordering matters. No functional change intended. Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-16-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent  [Sean Christopherson]
Use atomic64_dec_return() when decrementing the number of "pre-mapped" S-EPT pages to ensure that the count can't go negative without KVM noticing. In theory, checking for '0' and then decrementing in a separate operation could miss a 0=>-1 transition. In practice, such a condition is impossible because nr_premapped is protected by slots_lock, i.e. doesn't actually need to be an atomic (that wart will be addressed shortly). Don't bother trying to keep the count non-negative, as the KVM_BUG_ON() ensures the VM is dead, i.e. there's no point in trying to limp along. Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  KVM: TDX: Avoid a double-KVM_BUG_ON() in tdx_sept_zap_private_spte()  [Sean Christopherson]
Return -EIO immediately from tdx_sept_zap_private_spte() if the number of to-be-added pages underflows, so that the following "KVM_BUG_ON(err, kvm)" isn't also triggered. Isolating the check from the "is premap error" if-statement will also allow adding a lockdep assertion that premap errors are encountered if and only if slots_lock is held. Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  KVM: TDX: WARN if mirror SPTE doesn't have full RWX when creating S-EPT mapping  [Sean Christopherson]
Pass in the mirror_spte to kvm_x86_ops.set_external_spte() to provide symmetry with .remove_external_spte(), and assert in TDX that the mirror SPTE is shadow-present with full RWX permissions (the TDX-Module doesn't allow the hypervisor to control protections). Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  KVM: x86/mmu: Drop the return code from kvm_x86_ops.remove_external_spte()  [Sean Christopherson]
Drop the return code from kvm_x86_ops.remove_external_spte(), a.k.a. tdx_sept_remove_private_spte(), as KVM simply does a KVM_BUG_ON() on failure, and that KVM_BUG_ON() is redundant since all error paths in TDX also do a KVM_BUG_ON(). Opportunistically pass the spte instead of the pfn, as the API is clearly about removing an spte. Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  KVM: TDX: Fold tdx_sept_drop_private_spte() into tdx_sept_remove_private_spte()  [Sean Christopherson]
Fold tdx_sept_drop_private_spte() into tdx_sept_remove_private_spte() as a step towards having "remove" be the one and only function that deals with removing/zapping/dropping a SPTE, e.g. to avoid having to differentiate between "zap", "drop", and "remove". Eliminating the "drop" helper also gets rid of what is effectively dead code due to redundant checks, e.g. on an HKID being assigned. No functional change intended. Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition  [Sean Christopherson]
Return -EIO when a KVM_BUG_ON() is tripped, as KVM's ABI is to return -EIO when a VM has been killed due to a KVM bug, not -EINVAL. Note, many (all?) of the affected paths never propagate the error code to userspace, i.e. this is about internal consistency more than anything else. Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>