Age Commit message Author
12 daysxfrm: iptfs: fix ABBA deadlock in iptfs_destroy_state()Tristan Madani
iptfs_destroy_state() calls hrtimer_cancel() while holding a spinlock that the timer callback also acquires, leading to an ABBA deadlock on SMP systems. For the output timer (iptfs_timer): - iptfs_destroy_state() holds x->lock, calls hrtimer_cancel() - iptfs_delay_timer() callback takes x->lock For the drop timer (drop_timer): - iptfs_destroy_state() holds drop_lock, calls hrtimer_cancel() - iptfs_drop_timer() callback takes drop_lock Both timers use HRTIMER_MODE_REL_SOFT, so their callbacks run in softirq context. When hrtimer_cancel() is called for a soft timer that is currently executing on another CPU, hrtimer_cancel_wait_running() spins on softirq_expiry_lock -- the same lock held by the softirq running the callback. If the callback is blocked waiting for the spinlock held by the caller of hrtimer_cancel(), a circular dependency forms: CPU 0: holds lock_A -> waits for softirq_expiry_lock CPU 1: holds softirq_expiry_lock -> waits for lock_A Fix by calling hrtimer_cancel() before acquiring the respective locks. hrtimer_cancel() is safe to call without holding any lock and will wait for any in-progress callback to complete. For the output timer, the lock is still acquired afterwards to drain the packet queue. For the drop timer, the lock/unlock pair is removed entirely since it only existed to serialize with the timer callback, which hrtimer_cancel() already guarantees. Found by source code audit. Fixes: 4b3faf610cc6 ("xfrm: iptfs: add new iptfs xfrm mode impl") Cc: Christian Hopps <chopps@labn.net> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: stable@vger.kernel.org Signed-off-by: Tristan Madani <tristan@talencesecurity.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
12 daysarm64: arch_timer: reuse arch_timer_read_cnt{p,v}ct_el0() helpersBreno Leitao
__arch_counter_get_cntpct() and __arch_counter_get_cntvct() open-code the same ECV-aware ALTERNATIVE block that arch_timer_read_cntpct_el0() and arch_timer_read_cntvct_el0() already provide in the same header. The two pairs are byte-for-byte identical except for the trailing arch_counter_enforce_ordering() the __arch_counter_get_* variants add. Replace the duplicated inline assembly in __arch_counter_get_cntpct() and __arch_counter_get_cntvct() with calls to the corresponding helpers. This mirrors commit 00b39d150986 ("arm64: vdso: Use __arch_counter_get_cntvct()"), which removed similar duplication from the vDSO, and keeps the system-counter read sequence in a single place, reducing assembly code in the kernel. No functional change: the resulting inline assembly, alternatives, and clobbers are unchanged; only the source-level expression of the read moves into the existing helper. Verified by rebuilding the consumers of these helpers before and after the change and comparing the resulting disassembly: - arch/arm64/kernel/vdso/vdso.so (final linked vDSO): bit-identical (same sha256 across rebuilds) - arch/arm64/kernel/vdso/vgettimeofday.o: identical disassembly - arch/arm64/lib/delay.o: identical disassembly - drivers/clocksource/arm_arch_timer.o: same 50 functions with byte-identical instruction streams; only difference is function ordering inside .text and NOP padding, with no opcodes added or removed. Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Will Deacon <will@kernel.org>
12 daysKVM: arm64: Reassign nested_mmus array behind mmu_lockHyunwoo Kim
kvm->arch.nested_mmus[] is walked under kvm->mmu_lock, including from the MMU notifier path (kvm_unmap_gfn_range() -> kvm_nested_s2_unmap()), which can run at any time. kvm_vcpu_init_nested() reallocates the array and frees the old buffer while holding only kvm->arch.config_lock, so such a walker can reference the freed array. Allocate the new array outside of mmu_lock, as the allocation can sleep. Under the lock, copy the existing entries, fix up the back pointers and reassign the array. Free the old buffer after dropping the lock, as kvfree() can sleep as well. Fixes: 4f128f8e1aaac ("KVM: arm64: nv: Support multiple nested Stage-2 mmu structures") Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Reviewed-by: Oliver Upton <oupton@kernel.org> Link: https://patch.msgid.link/aiKIVVeIr1aAB1yp@v4bel Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org
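The ordering described above (allocate outside the lock, copy and publish under it, free after dropping it) can be sketched in plain user-space C. This is only an illustration under stated assumptions, not the KVM code: `struct table`, `struct entry`, `struct pgt`, `table_grow()` and the `locked` flag standing in for mmu_lock are all hypothetical names, and the toy flag provides no real mutual exclusion.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct entry;

/* Hypothetical object that keeps a back pointer into the array. */
struct pgt {
    struct entry *mmu;
};

struct entry {
    struct pgt *pgt;       /* may be NULL for unused slots */
    int id;
};

struct table {
    struct entry *slots;
    size_t nr;
    int locked;            /* toy stand-in for mmu_lock, not a real lock */
};

static void table_lock(struct table *t)   { assert(!t->locked); t->locked = 1; }
static void table_unlock(struct table *t) { assert(t->locked);  t->locked = 0; }

/* Grow the array to new_nr entries. Returns 0 on success, -1 on failure. */
static int table_grow(struct table *t, size_t new_nr)
{
    struct entry *fresh, *old;
    size_t i;

    if (new_nr <= t->nr)
        return 0;

    /* 1. Allocate outside the lock: the allocation may sleep. */
    fresh = calloc(new_nr, sizeof(*fresh));
    if (!fresh)
        return -1;

    /* 2. Under the lock: copy entries, fix back pointers, publish. */
    table_lock(t);
    old = t->slots;
    if (t->nr)
        memcpy(fresh, old, t->nr * sizeof(*fresh));
    for (i = 0; i < t->nr; i++)
        if (fresh[i].pgt)
            fresh[i].pgt->mmu = &fresh[i];
    t->slots = fresh;
    t->nr = new_nr;
    table_unlock(t);

    /* 3. Free the old buffer only after dropping the lock. */
    free(old);
    return 0;
}
```

A walker that takes the same lock then sees either the old array or the new one in full, and the old buffer is not freed until it has been unpublished.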
12 daysKVM: arm64: Restore POR_EL0 access to host EL0Joey Gouly
CPTR_EL2.E0POE was being cleared in __deactivate_cptr_traps_vhe(), which meant that any accesses to POR_EL0 from host EL0 would trap and be reported to userspace as an illegal instruction. This would happen after running any VM, regardless of whether it used POE. Signed-off-by: Joey Gouly <joey.gouly@arm.com> Link: https://sashiko.dev/#/patchset/20260602155430.2088142-1-maz@kernel.org?part=1 Link: https://patch.msgid.link/20260604105434.2297268-1-joey.gouly@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org
12 daysarm64/mm: Rename ptdesc_tAnshuman Khandual
ptdesc_t sounds very similar to the core MM struct ptdesc which is actually the memory descriptor for page table allocations. Hence rename this typedef element as ptval_t instead for better clarity and separation. Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: linux-efi@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Suggested-by: David Hildenbrand (Arm) <david@kernel.org> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
12 daysarm64: mm: Defer remap of linear alias of data/bssArd Biesheuvel
Marking the linear alias of data/bss invalid involves calling set_memory_valid(), which calls split_kernel_leaf_mapping() under the hood. On BBML2_NOABORT capable systems, this may result in the need to allocate page tables at a time when the generic memory allocation APIs are not yet available, resulting in a splat like WARNING: arch/arm64/mm/mmu.c:821 at split_kernel_leaf_mapping+0x15c/0x170, CPU#0: swapper/0 Modules linked in: CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 7.1.0-rc6 #1 PREEMPT(undef) pstate: a04000c9 (NzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : split_kernel_leaf_mapping+0x15c/0x170 lr : update_range_prot+0x40/0x128 sp : ffffc99ad3863c80 ... Call trace: split_kernel_leaf_mapping+0x15c/0x170 (P) update_range_prot+0x40/0x128 set_memory_valid+0x94/0xe0 mark_linear_data_alias_valid+0x54/0x68 map_mem+0x1fc/0x240 paging_init+0x48/0x210 setup_arch+0x274/0x338 start_kernel+0x98/0x538 __primary_switched+0x88/0x98 as reported by CKI automated testing. So defer the boot-time call to mark_linear_data_alias_valid() to a later time when page allocations can be made normally. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
12 daysKVM: arm64: Omit tag sync on stage-2 mappings of the zero pageArd Biesheuvel
Commit f620d66af316 ("arm64: mte: Do not flag the zero page as PG_mte_tagged") removed the PG_mte_tagged flag from the zero page, but missed a KVM code path that may set this flag on the zero page when it is used in a stage-2 CoW mapping of anonymous memory. So disregard the zero page explicitly in sanitise_mte_tags(). Fixes: f620d66af316 ("arm64: mte: Do not flag the zero page as PG_mte_tagged") Cc: stable@vger.kernel.org # 5.10.x Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
12 daysarm64: Avoid double evaluation of __ptep_get()Ard Biesheuvel
Sashiko warns that the new pte_valid_noncont() macro is used in a manner where the argument (which performs a READ_ONCE() of the descriptor) is evaluated twice. Drop the macro that we just added, and move the check into the newly added users. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
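The hazard named above, a macro argument with a side effect or a READ_ONCE()-style load being evaluated twice, can be shown with a small stand-alone C sketch. All names here (`read_desc()`, `VALID_NONCONT_MACRO`, `valid_noncont_once()`, the `DESC_*` bits) are hypothetical and only mimic the shape of the problem; they are not the arm64 definitions.

```c
#include <assert.h>
#include <stdbool.h>

static int reads;                 /* counts simulated descriptor reads */
static unsigned long slot = 0x3;  /* toy descriptor: valid and contiguous */

/* Stand-in for an accessor that loads the descriptor from memory. */
static unsigned long read_desc(void)
{
    reads++;
    return slot;
}

#define DESC_VALID 0x1UL
#define DESC_CONT  0x2UL

/* Hazard: 'd' appears twice, so an argument like read_desc() can run twice
 * and the two halves of the test may see different values. */
#define VALID_NONCONT_MACRO(d) (((d) & DESC_VALID) && !((d) & DESC_CONT))

/* Safer shape: read once into a local, then test that snapshot. */
static bool valid_noncont_once(void)
{
    unsigned long d = read_desc();

    return (d & DESC_VALID) && !(d & DESC_CONT);
}
```

Calling `VALID_NONCONT_MACRO(read_desc())` on the toy descriptor performs two reads, while `valid_noncont_once()` performs one.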
12 dayskasan: Move generic KASAN page tables out of BSS tooArd Biesheuvel
Make sure that all KASAN page tables are emitted into the .pgtbl section (provided that the arch has one - otherwise, fall back to page aligned BSS) This is needed because BSS itself is no longer accessible via the linear map on arm64. Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: kasan-dev@googlegroups.com Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
12 daysarm64: Rename page table BSS section to .bss..pgtblArd Biesheuvel
Rename the .pgdir.bss section to .bss..pgtbl so that the compiler will notice the leading ".bss" and mark it as NOBITS by default (rather than PROGBITS, which would take up space in Image binary, forcing all of the preceding BSS to be emitted into the image as well). This supersedes the NOLOAD linker directive, which achieves the same thing, and can therefore be dropped. Also, rename .pgdir to .pgtbl to be more generic, as page tables of various levels will reside here. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
12 daysRevert "drm/i915/backlight: Remove try_vesa_interface"Suraj Kandpal
This reverts commit 40d2f5820951dee818d05c14677277048bd85f9f. Removing the try_vesa_interface gate caused a backlight regression on panels whose VBT correctly reports INTEL_BACKLIGHT_DISPLAY_DDI and whose PWM path is the actual backlight control, but whose DPCD optimistically advertises DP_EDP_BACKLIGHT_AUX_ENABLE_CAP / _BRIGHTNESS_AUX_SET_CAP. After the commit such panels silently bind to the VESA AUX backlight funcs; AUX writes complete but the panel ignores them, leaving brightness stuck (no-op backlight). Observed on at least KBL and TGL eDP setups. Signed-off-by: Suraj Kandpal <suraj.kandpal@intel.com> Reviewed-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com> Link: https://patch.msgid.link/20260517024709.1016121-1-suraj.kandpal@intel.com (cherry picked from commit f30fddb4402313aa5301a74d721638d343395269) Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
12 daysx86/process: Convert rdmsr() to rdmsrq() in arch_post_acpi_subsys_init() to address W=1 warningHyeongJun An
arch_post_acpi_subsys_init() reads MSR_K8_INT_PENDING_MSG with rdmsr() into a lo/hi pair but only uses the low 32 bits: K8_INTP_C1E_ACTIVE_MASK (0x18000000) lies entirely within them. The 'hi' half is never consumed, which triggers a -Wunused-but-set-variable warning under W=1: arch/x86/kernel/process.c: In function 'arch_post_acpi_subsys_init': arch/x86/kernel/process.c:972:17: warning: variable 'hi' set but not used Read the full MSR into a single u64 with rdmsrq() and test the mask against it, dropping the now-unnecessary lo/hi variables. No functional change intended. Signed-off-by: HyeongJun An <sammiee5311@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Jürgen Groß <jgross@suse.com> Link: https://patch.msgid.link/20260604150052.3337246-1-sammiee5311@gmail.com
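Why the conversion is behavior-neutral can be checked with a short user-space sketch: a mask that fits in the low 32 bits gives the same answer whether it is tested against the low half or against the full 64-bit value. The mask value is the one quoted in the message above; the function names are hypothetical and no MSR is actually read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask value as quoted in the commit message. */
#define K8_INTP_C1E_ACTIVE_MASK 0x18000000u

/* Old shape: the value arrives as a lo/hi pair and 'hi' is never used. */
static bool c1e_active_lohi(uint32_t lo, uint32_t hi)
{
    (void)hi;
    return (lo & K8_INTP_C1E_ACTIVE_MASK) != 0;
}

/* New shape: one 64-bit value tested against the same mask. */
static bool c1e_active_u64(uint64_t val)
{
    return (val & K8_INTP_C1E_ACTIVE_MASK) != 0;
}

/* Sanity helper: both shapes must agree for any register contents. */
static void check_same(uint32_t lo, uint32_t hi)
{
    uint64_t val = ((uint64_t)hi << 32) | lo;

    assert(c1e_active_lohi(lo, hi) == c1e_active_u64(val));
}
```

Because the unsigned mask is zero-extended when combined with the 64-bit value, bits set in the high half never influence the result.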
12 daysKVM: arm64: Take the SRCU lock for page table walks in fault injection and AT emulationHyunwoo Kim
walk_s1() and kvm_walk_nested_s2() expect to be called while holding kvm->srcu to guard against memslot changes. While this is generally the case, __kvm_at_s12() and __kvm_find_s1_desc_level() call into the respective walkers without taking kvm->srcu. Fix by acquiring kvm->srcu prior to the table walk in both instances. Cc: stable@vger.kernel.org Fixes: 50f77dc87f13 ("KVM: arm64: Populate level on S1PTW SEA injection") Fixes: be04cebf3e78 ("KVM: arm64: nv: Add emulation of AT S12E{0,1}{R,W}") Suggested-by: Oliver Upton <oupton@kernel.org> Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Reviewed-by: Oliver Upton <oupton@kernel.org> Link: https://patch.msgid.link/aiAZfdeyanIvP8SD@v4bel Signed-off-by: Marc Zyngier <maz@kernel.org>
12 daysKVM: arm64: vgic-its: Drop the translation cache reference only for the erased entryHyunwoo Kim
vgic_its_invalidate_cache() walks the per-ITS translation cache with xa_for_each() and drops the cache's reference on each entry with vgic_put_irq(). It puts the iterated pointer, though, rather than the value returned by xa_erase(). The function is called from contexts that do not exclude one another: the ITS command handlers hold its_lock, the GITS_CTLR write path holds cmd_lock, and the path that clears EnableLPIs in a redistributor's GICR_CTLR holds neither. Two or more of them can drain the same cache concurrently, and if each one observes the same entry, erases it and then puts it, the single reference the cache holds on that entry is dropped more than once. The entry can then be freed while an ITE still maps it. xa_erase() is atomic and returns the previous entry, so put only the entry that this context actually removed. The cache reference is then dropped exactly once per entry even when the invalidations run concurrently, and the behavior is unchanged when only one context runs. Fixes: 8201d1028caa ("KVM: arm64: vgic-its: Maintain a translation cache per ITS") Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Reviewed-by: Oliver Upton <oupton@kernel.org> Link: https://patch.msgid.link/ah2c5lu4JbUg7dj-@v4bel Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org
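The "put only what you erased" rule can be illustrated with a single-threaded toy in C. This is a sketch under stated assumptions: `struct cache`, `cache_erase()`, `irq_put()` and `cache_invalidate()` are hypothetical, the fixed-size slot array merely imitates an erase that returns the previous entry, and nothing here is atomic the way xa_erase() is.

```c
#include <assert.h>
#include <stddef.h>

struct irq {
    int refcount;
};

#define CACHE_SLOTS 4

struct cache {
    struct irq *slot[CACHE_SLOTS];
};

/* Toy analogue of an erase that returns the previous entry (or NULL). */
static struct irq *cache_erase(struct cache *c, size_t idx)
{
    struct irq *prev = c->slot[idx];

    c->slot[idx] = NULL;
    return prev;
}

static void irq_put(struct irq *irq)
{
    assert(irq->refcount > 0);
    irq->refcount--;
}

/* Drain the cache, dropping a reference only for entries this call removed. */
static void cache_invalidate(struct cache *c)
{
    size_t i;

    for (i = 0; i < CACHE_SLOTS; i++) {
        struct irq *removed = cache_erase(c, i);

        if (removed)
            irq_put(removed);
    }
}
```

Running the drain twice over the same cache drops the cache's reference once, because the second pass gets NULL back from the erase and skips the put.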
12 daysirqchip/irq-realtek-rtl: Add multicore supportMarkus Stockhausen
The Realtek interrupt driver currently supports only single core systems. So the higher end devices like RTL839x and RTL930x with dual VPEs must be driven with NR_CPU=1. Enhance the driver to support multicore (dual VPE) systems. For this: - Extend the register map for multiple cores - Search for multiple CPU cores in the devicetree - Improve the register helpers to support multiple cores - Add an affinity setter - Enhance the IRQ handler for multiple cores Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260604182506.1113440-3-markus.stockhausen@gmx.de
12 daysirqchip/irq-realtek-rtl: Add/simplify register helpersMarkus Stockhausen
The Realtek interrupt controller has two important registers that are used by the driver in several places - GIMR: global interrupt mask register - IRR: Interrupt routing registers The usage of these registers is very inconsistent. GIMR is addressed directly while IRR has a helper that needs a macro as an input. Harmonize this by providing consistent helpers that improve code readability. The callers of these helpers use classic lock/unlock functions and sometimes use the wrong locking helper. E.g. irqsave variants are used in mask/unmask although not needed. Adapt and fix the surrounding call locations. Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260604182506.1113440-2-markus.stockhausen@gmx.de
12 daysx86/resctrl: Only check Intel systems for SNCTony Luck
topology_num_nodes_per_package() reports values greater than one on certain AMD systems resulting in resctrl's Intel model specific SNC detection printing the confusing message: "CoD enabled system? Resctrl not supported" Add a check for Intel systems before looking at the topology. [ reinette: Add Closes tag, fix tag typos, rework changelog ] Fixes: 59674fc9d0bf ("x86/resctrl: Fix SNC detection") Reported-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://patch.msgid.link/9849330f45ac86344cc5ac54df2d313906d70bc4.1780634584.git.reinette.chatre@intel.com Closes: https://lore.kernel.org/lkml/37ac0376-43a3-4283-a3d5-4d57b3bec578@amd.com/
12 daysMerge tag 'ib-gpio-add-gpiod-is-single-ended-for-v7.2' into i2c/i2c-hostAndi Shyti
Immutable branch between the GPIO and I2C trees for v7.2-rc1 - add the gpiod_is_single_ended() helper function
12 daysiomap: introduce IOMAP_F_ZERO_TAIL flagNamjae Jeon
In filesystems that maintain a separate Valid Data Length, such as exFAT and NTFS, a partial write may start at or beyond the current valid_size and extend it. In this case, the region after the previous valid_size but within the same filesystem block is considered unwritten. This patch introduces IOMAP_F_ZERO_TAIL. When this flag is set in iomap, __iomap_write_begin() will zero only the tail portion while preserving any valid data before it in the same block. Without this tail zeroing, stale data in the unwritten portion of the block can remain in the page cache. Subsequent reads can then return incorrect contents from that region. Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Link: https://patch.msgid.link/20260518114705.9601-2-linkinjeon@kernel.org Acked-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
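The tail-zeroing behavior described above can be sketched on a plain byte buffer. This is a simplified illustration, not the iomap implementation: `zero_block_tail()` and its parameters are hypothetical, and the block is modeled as an in-memory array rather than a page cache folio.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Zero the bytes in [valid_off, block_size) of one block while leaving
 * [0, valid_off), the data that is already valid, untouched.
 */
static void zero_block_tail(unsigned char *block, size_t block_size,
                            size_t valid_off)
{
    assert(valid_off <= block_size);
    memset(block + valid_off, 0, block_size - valid_off);
}
```

With an 8-byte block whose first 3 bytes are valid, the call clears bytes 3 through 7, so a later read of that region returns zeros instead of stale contents.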
12 daysrust: ptr: remove implicit index projection syntaxGary Guo
All users have been converted to use keyworded index projection syntax to explicitly state their intention when doing index projection. Reviewed-by: Alexandre Courbot <acourbot@nvidia.com> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Signed-off-by: Gary Guo <gary@garyguo.net> Acked-by: Danilo Krummrich <dakr@kernel.org> Link: https://patch.msgid.link/20260602-projection-syntax-rework-v2-6-6989470f5440@garyguo.net Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
12 daysgpu: nova-core: convert to keyworded projection syntaxGary Guo
Use "build" to denote that the index bounds checking here is performed at build time. Reviewed-by: Alexandre Courbot <acourbot@nvidia.com> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Signed-off-by: Gary Guo <gary@garyguo.net> Acked-by: Danilo Krummrich <dakr@kernel.org> Link: https://patch.msgid.link/20260602-projection-syntax-rework-v2-5-6989470f5440@garyguo.net Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
12 daysrust: dma: update to keyworded index projection syntaxGary Guo
Demonstrate the preferred syntax of index projection in DMA documentation and examples. A few `[i]?` cases are converted to demonstrate the new variant. Reviewed-by: Alice Ryhl <aliceryhl@google.com> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Reviewed-by: Alexandre Courbot <acourbot@nvidia.com> Signed-off-by: Gary Guo <gary@garyguo.net> Acked-by: Danilo Krummrich <dakr@kernel.org> Link: https://patch.msgid.link/20260602-projection-syntax-rework-v2-4-6989470f5440@garyguo.net Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
12 daysrust: ptr: add panicking index projection variantGary Guo
There have been a few cases where the programmer knows that the indices are in bounds but the compiler cannot deduce that. This is also compiler-version-dependent, so using build indexing here can be problematic. On the other hand, it is also not ideal to use the fallible variant, as it adds an error handling path that is never hit. Add a new panicking index projection for this scenario. Like all panicking operations, this should be used carefully only in cases where the user knows the index is going to be in bounds, and panicking would indicate something is catastrophically wrong. To signify this, require users to explicitly denote the type of index being used. The existing two types of index projections also gain the keyworded version, which will be the recommended way going forward. The keyworded syntax also paves the way of perhaps adding more flavors in the future, e.g. `unsafe` index projection. However, unless the code is extremely performance sensitive and bounds checking cannot be tolerated, the panicking variant is safer and should be preferred, so it will be left to the future when demand arises. Signed-off-by: Gary Guo <gary@garyguo.net> Reviewed-by: Alexandre Courbot <acourbot@nvidia.com> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Danilo Krummrich <dakr@kernel.org> Link: https://patch.msgid.link/20260602-projection-syntax-rework-v2-3-6989470f5440@garyguo.net [ Fixed broken intra-doc link. Added a few extra intra-doc links. Reworded some docs slightly. - Miguel ] Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
12 daysrust: ptr: use `match` instead of `unwrap_or_else` for `build_index`Gary Guo
Use `match` to avoid potential inlining issues of the `unwrap_or_else` function. Suggested-by: Alice Ryhl <aliceryhl@google.com> Link: https://lore.kernel.org/rust-for-linux/aeCKlut-88SbNsyW@google.com/ Signed-off-by: Gary Guo <gary@garyguo.net> Reviewed-by: Alexandre Courbot <acourbot@nvidia.com> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Danilo Krummrich <dakr@kernel.org> Link: https://patch.msgid.link/20260602-projection-syntax-rework-v2-2-6989470f5440@garyguo.net Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
12 daysrust: ptr: rename `ProjectIndex::index` to `build_index`Gary Guo
The corresponding `SliceIndex` trait in Rust uses `index` to mean the panicking variant, which is also being added to `ProjectIndex`. Hence rename our custom `build_error!` index variant to `build_index`. Suggested-by: Alexandre Courbot <acourbot@nvidia.com> Link: https://lore.kernel.org/rust-for-linux/DI5LLN2V3XCS.34H4CG99N4MPA@nvidia.com Signed-off-by: Gary Guo <gary@garyguo.net> Reviewed-by: Alexandre Courbot <acourbot@nvidia.com> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Danilo Krummrich <dakr@kernel.org> Link: https://patch.msgid.link/20260602-projection-syntax-rework-v2-1-6989470f5440@garyguo.net [ Reworded docs slightly. - Miguel ] Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
12 daysALSA: seq: dummy: fix UMP event stack overreadKyle Zeng
The dummy sequencer port forwards events by copying an incoming struct snd_seq_event into a stack temporary, rewriting source and destination, and dispatching the temporary to subscribers. That legacy event storage is smaller than struct snd_seq_ump_event. When a UMP event reaches the dummy client, the copy leaves the UMP flag set but only provides legacy-sized stack storage. The subscriber delivery path then uses snd_seq_event_packet_size() and copies a UMP-sized packet from that stack object, reading past the end of the temporary. Use the existing union __snd_seq_event storage and copy the packet size reported for the incoming event before rewriting the common routing fields. This preserves the full UMP packet for UMP events while keeping legacy event handling unchanged. Fixes: 32cb23a0f911 ("ALSA: seq: dummy: Allow UMP conversion") Signed-off-by: Kyle Zeng <kylebot@openai.com> Link: https://patch.msgid.link/20260605080204.32045-1-kylebot@openai.com Signed-off-by: Takashi Iwai <tiwai@suse.de>
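The fix pattern, a temporary large enough for either layout plus a copy sized by what the incoming event reports, can be sketched as follows. The struct layouts, the `EV_FLAG_UMP` bit and the function names are hypothetical stand-ins chosen for illustration; they are not the ALSA definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EV_FLAG_UMP 0x1u   /* hypothetical "this is a UMP event" flag */

/* Both layouts share the same leading routing fields. */
struct legacy_event {
    unsigned int flags;
    int source;
    int dest;
    unsigned char data[12];
};

struct ump_event {
    unsigned int flags;
    int source;
    int dest;
    unsigned char ump[16];   /* larger payload than the legacy layout */
};

/* Storage big enough for whichever layout arrives. */
union any_event {
    struct legacy_event legacy;
    struct ump_event ump;
};

static size_t event_packet_size(const union any_event *ev)
{
    return (ev->legacy.flags & EV_FLAG_UMP) ? sizeof(struct ump_event)
                                            : sizeof(struct legacy_event);
}

/* Copy the full packet first, then rewrite the common routing fields. */
static size_t forward_event(const union any_event *in, union any_event *tmp,
                            int new_source, int new_dest)
{
    size_t len = event_packet_size(in);

    assert(len <= sizeof(*tmp));
    memcpy(tmp, in, len);
    tmp->legacy.source = new_source;
    tmp->legacy.dest = new_dest;
    return len;
}
```

Since the temporary is the union, a consumer that later copies the reported packet size never reads past the end of the object, and legacy events still copy only the legacy-sized prefix.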
12 daysMerge patch series "proc: protect ptrace_may_access() with exec_update_lock"Christian Brauner
Jann Horn <jannh@google.com> says: My understanding is that procfs is effectively maintained by the VFS maintainers (though scripts/get_maintainer.pl claims that there are no maintainers for procfs because the VFS entry only claims files directly in fs/, and the procfs entry has no maintainers listed on it). In procfs, most uses of ptrace_may_access() should use exec_update_lock to avoid TOCTOU issues with concurrent privileged execve() (like setuid binary execution). This series doesn't fix all the remaining issues in procfs, but it fixes the easy cases for now; I will probably follow up with fixes for the gnarlier cases later unless someone else wants to do that. I have checked that procfs files still work with these changes and that CONFIG_PROVE_LOCKING=y doesn't generate any warnings. (checkpatch complains about missing argument names in proc_op::proc_get_link, but that was already the case before my patch.) * patches from https://patch.msgid.link/20260518-procfs-lockfix-part1-v1-0-5c3d20e0ac33@google.com: proc: protect ptrace_may_access() with exec_update_lock (FD links) proc: protect ptrace_may_access() with exec_update_lock (part 1) Link: https://patch.msgid.link/20260518-procfs-lockfix-part1-v1-0-5c3d20e0ac33@google.com Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
12 daysproc: protect ptrace_may_access() with exec_update_lock (FD links)Jann Horn
proc_pid_get_link() and proc_pid_readlink() currently look up the task from the pid once, then do the ptrace access check on that task, then look up the task from the pid a second time to do the actual access. That's racy in several ways. To fix it, pass the task to the ->proc_get_link() handler, and instead of proc_fd_access_allowed(), introduce a new helper call_proc_get_link() that looks up and locks the task, does the access check, and calls ->proc_get_link(). Fixes: 778c1144771f ("[PATCH] proc: Use sane permission checks on the /proc/<pid>/fd/ symlinks") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn <jannh@google.com> Link: https://patch.msgid.link/20260518-procfs-lockfix-part1-v1-2-5c3d20e0ac33@google.com Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
12 daysproc: protect ptrace_may_access() with exec_update_lock (part 1)Jann Horn
Fix the easy cases where procfs currently calls ptrace_may_access() without exec_update_lock protection, where the fix is to simply add the extra lock or use mm_access(): - do_task_stat(): grab exec_update_lock - proc_pid_wchan(): grab exec_update_lock - proc_map_files_lookup(): use mm_access() instead of get_task_mm() - proc_map_files_readdir(): use mm_access() instead of get_task_mm() - proc_ns_get_link(): grab exec_update_lock - proc_ns_readlink(): grab exec_update_lock Fixes: f83ce3e6b02d ("proc: avoid information leaks to non-privileged processes") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn <jannh@google.com> Link: https://patch.msgid.link/20260518-procfs-lockfix-part1-v1-1-5c3d20e0ac33@google.com Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
12 dayssimple_lookup(): use d_splice_alias() for ->lookup() return valueAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
12 daysecryptfs: use d_splice_alias() for ->lookup() return valueAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
12 daysconfigfs_lookup(): switch to d_splice_alias()Al Viro
more idiomatic Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
12 daystracefs: use d_splice_alias() in ->lookup() instancesAl Viro
d_add() is not wrong there (inodes are freshly allocated), but d_splice_alias() is more idiomatic. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
12 daysmake cursors NORCUAl Viro
All it requires is making sure that d_walk() will skip *all* CURSOR dentries, even if somebody passes it one as an argument. Cursors are negative and unhashed all along, never get added to LRU or to shrink lists and no RCU references via ->d_sib are possible for those - dentry_unlist() makes sure that no killed dentry has ->d_sib.next left pointing to a cursor. Seeing that a cursor is allocated every time we open a directory on autofs, debugfs, devpts, etc., avoiding an RCU delay when such opened files get closed is attractive... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
12 daysnfs: get rid of fake root dentriesAl Viro
... just grab the reference to the (real) root we are about to return for the first mount of this superblock and be done with that. Once upon a time dentry tree eviction at fs shutdown used to break if ->s_root had been spliced on top of something; that hadn't been the case for years now, and these fake root dentries violate a bunch of invariants. Let's get rid of them... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
12 dayswind ->s_roots via ->d_sib instead of ->d_hashAl Viro
shrink_dcache_for_umount() is supposed to handle the possibility of some of the dentries to be evicted being in other threads' shrink lists; it either kills them, leaving an empty husk to be freed by the owner of shrink list whenever it gets around to that, or it waits for the eviction in progress to get completed. That relies upon dentry remaining attached to the tree until the eviction reaches dentry_unlist() and its ->d_sib gets removed from the list. Unfortunately, the secondary roots are linked via ->d_hash, rather than ->d_sib and they become removed from that list before their inode references are dropped. If shrink_dentry_list() from another thread ends up evicting one of the secondary roots and gets to that point in dentry_kill() when shrink_dcache_for_umount() is looking for secondary roots, the latter will *not* notice anything, possibly leading to warnings about busy inodes at umount time and all kinds of breakage after that. Moreover, shrink_dcache_for_umount() walks the list of secondary roots with no protection whatsoever, so it might end up calling dget() on a dentry that already passed through lockref_mark_dead(&dentry->d_lockref); ending up with corrupted refcount and possible UAF. AFAICS, the most straightforward way to deal with that would be to have secondary roots linked via ->d_sib rather than ->d_hash; then they would remain on the list until killed, and we could use d_add_waiter() machinery to wait for eviction in progress. Changes:
* secondary roots look the same as ->s_root from d_unhashed() and d_unlinked() POV now.
* secondary roots are represented as "no parent, but on ->d_sib" instead of "no parent, but on ->d_hash".
* since ->d_sib is a plain hlist, we protect it with per-superblock spinlock (sb->s_roots_lock) instead of the LSB of the head pointer (for non-root dentries it would be protected by ->d_lock of parent).
* __d_obtain_alias() uses ->d_sib for linkage when allocating a secondary root.
* d_splice_alias_ops() detects splicing of a secondary root and removes it from the list before calling __d_move().
* dentry_unlist() detects eviction of a secondary root and removes it from the list; no need to play the games for d_walk() sake, since the latter is not going to look for the next sibling of those anyway.
* ___d_drop() doesn't care about ->s_roots anymore.
* shrink_dcache_for_umount() uses proper locking for access to the list of secondary roots and if it runs into one that is in the middle of eviction waits for that to finish.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
12 daysshrink_dentry_tree(): unify the calls of shrink_dentry_list()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
12 daysshrinking rcu_read_lock() scope in d_alloc_parallel()Al Viro
The current use of rcu_read_lock() in d_alloc_parallel() is fairly opaque - the single large scope serves two purposes. We start with lookup in normal hash, and there the rcu_read_lock() scope puts __d_lookup_rcu() and subsequent lockref_get_not_dead() into the same RCU read-side critical area. If no match is found, we proceed to lock the hash chain of in-lookup hash and scan that for a match. If we find a match, we want to grab it and wait for lookup in progress to finish. Since the bitlock we use for these hash chains has to nest inside ->d_lock, we need to unlock the chain first and use lockref_get_not_dead() on the match. That has to be done without breaking the RCU read-side critical area, and we use the same rcu_read_lock() scope to bridge over. The thing is, after having grabbed the reference (and it is very unlikely to fail) we proceed to grab ->d_lock - d_wait_lookup() and __d_lookup_unhash()/__d_wake_in_lookup_waiters() are using that for serialization. That makes lockref_get_not_dead() pointless - trying to avoid grabbing ->d_lock for refcount increment, only to grab it anyway immediately after that. If we grab ->d_lock first and replace lockref_get_not_dead() with direct check for sign and increment if non-negative we can move rcu_read_unlock() to immediately after grabbing ->d_lock. Moreover, we don't need the RCU read-side critical area to stay contiguous from the earlier __d_lookup_rcu() onward - we can just as well terminate the earlier one ASAP and call rcu_read_lock() again only after having found a match (if any) in the in-lookup hash chain. That makes the entire thing easier to follow and the purpose of those rcu_read_lock() calls easier to describe - the first scope is for __d_lookup_rcu() + lockref_get_not_dead(), the second one bridges over from the bitlock scope to the ->d_lock scope on the match found in in-lookup hash. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
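The "direct check for sign and increment if non-negative" step mentioned above can be sketched in isolation. This is a toy model with hypothetical names (`struct obj`, `get_ref_locked()`); the `locked` flag only stands in for holding ->d_lock, and a negative count is assumed to mark a dead object, as the message implies.

```c
#include <assert.h>
#include <stdbool.h>

struct obj {
    int locked;   /* toy stand-in for holding the per-object lock */
    int count;    /* reference count; negative is assumed to mean "dead" */
};

/*
 * With the lock already held, take a reference unless the object is dead.
 * Returns true if the count was incremented.
 */
static bool get_ref_locked(struct obj *o)
{
    assert(o->locked);
    if (o->count < 0)
        return false;
    o->count++;
    return true;
}
```

Once the lock is going to be taken anyway, the plain check and increment under it does the job that a lockless get-unless-dead would otherwise do.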
12 daysd_walk(): shrink rcu_read_lock() scopeAl Viro
we only need it to bridge over from ->d_lock scope of child to ->d_lock scope of parent; dropping ->d_lock at rename_retry doesn't need to be in rcu_read_lock() scope. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
12 daysdocument dentry_kill()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
12 daysadjust calling conventions of lock_for_kill(), fold __dentry_kill() into dentry_kill()Al Viro
Pull dropping ->d_lock on lock_for_kill() failure into lock_for_kill() itself. That reduces dentry_kill() to
	if (!lock_for_kill(dentry))
		return NULL;
	return __dentry_kill(dentry);
at which point it's easier to move that if (...) into the beginning of __dentry_kill() itself and rename it into dentry_kill(). Document the new calling conventions of lock_for_kill(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
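A minimal userspace sketch of the resulting shape, under stated assumptions: the toy_* names are hypothetical, locks are plain flags, the eviction work is elided, and returning the parent on success is an assumption of the model rather than something this message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_dentry {
	bool d_locked;		/* models ->d_lock */
	bool killable;		/* would lock_for_kill() succeed? */
	struct toy_dentry *parent;
};

/* New convention: on failure the helper drops ->d_lock itself. */
static bool toy_lock_for_kill(struct toy_dentry *d)
{
	assert(d->d_locked);
	if (!d->killable) {
		d->d_locked = false;
		return false;
	}
	return true;
}

/* The former __dentry_kill(), with the check moved to its start. */
static struct toy_dentry *toy_dentry_kill(struct toy_dentry *d)
{
	if (!toy_lock_for_kill(d))
		return NULL;
	d->d_locked = false;	/* eviction itself is not modelled */
	return d->parent;	/* assumption: next candidate to look at */
}
```

Either way the caller gets control back with ->d_lock no longer held, so it has nothing to clean up on the failure path.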
12 daysDocument rcu_read_lock() use in select_collect2()Al Viro
If select_collect2() finds something that is neither busy nor can be moved to shrink list, it needs to return that to caller's caller (shrink_dcache_tree()) ASAP and do so without grabbing references (among other things, it might be already dying, in which case refcount can't be incremented). We are called inside a ->d_lock scope, but that scope is going to be terminated as soon as we return to caller (d_walk()); ->d_lock will be retaken by shrink_dcache_tree(), but we need to bridge between these scopes, turning them into contiguous RCU read-side critical area. We do that with rcu_read_lock() scope - it spans from unbalanced rcu_read_lock() in select_collect2() to unbalanced rcu_read_unlock() in shrink_dcache_tree(). That works, but it really needs to be documented; it's rather unidiomatic and it had caused quite a bit of confusion - some of it in form of patches "fixing" the damn thing. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
12 daysShift rcu_read_{,un}lock() inside fast_dput()Al Viro
Shrink rcu_read_lock() scopes surrounding fast_dput() calls. Both callers are immediately preceded and followed by rcu_read_lock()/rcu_read_unlock() resp. Shrink that down into fast_dput() itself; in case when fast_dput() ends up grabbing ->d_lock, we can pull rcu_read_unlock() up to right after spin_lock(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
12 dayssimplify safety for lock_for_kill() slowpathAl Viro
rcu_read_lock() scopes in dentry eviction machinery are too wide and badly structured; we end up with too many of those, quite a few essentially identical. Worse, quite a few of the functions involved are not neutral wrt that, making them harder to reason about.

rcu_read_lock() scope is not the only thing establishing an RCU read-side critical area - spin_lock scope does the same and they can be mixed - the sequence
	rcu_read_lock()
	...
	spin_lock()
	...
	rcu_read_unlock()
	...
	rcu_read_lock()
	...
	spin_unlock()
	...
	rcu_read_unlock()
is an unbroken RCU read-side critical area. Use of that observation allows us to simplify things.

First of all, lock_for_kill() relies upon being in an unbroken RCU read-side critical area. It's always called with ->d_lock held, and normally returns without having ever dropped that spinlock. We would not need rcu_read_lock() at all, if not for the slow path - if trylock of inode->i_lock fails, we need to drop and retake ->d_lock.

Having all calls of lock_for_kill() inside an rcu_read_lock() scope takes care of that, but to show that lock_for_kill() slow path is safe, we need to demonstrate such rcu_read_lock() scope for any call chain leading to lock_for_kill(). Which is not fun, seeing that there are 10 such scopes, with 5 distinct beginnings between them.

Case 1: opens in dput(), proceeds through fast_dput() grabbing ->d_lock, returning false into dput() and there a call of finish_dput() which calls dentry_kill(), which calls lock_for_kill(); ends in dentry_kill(), either right after lock_for_kill() success or right after dropping ->d_lock on lock_for_kill() failure. ->d_lock is held continuously all the way into lock_for_kill().

Case 2: opens in finish_dput(), where we proceed to the same call of dentry_kill() as in case 1. ->d_lock is held since before the beginning of the scope and all the way into lock_for_kill().

Case 3: opens in select_collect2(), proceeds through the return to d_walk() and to shrink_dcache_tree() where we grab ->d_lock and proceed to call shrink_kill(), which calls dentry_kill(), then as in the previous scopes.

Case 4: opens in shrink_dentry_list(), followed by call of shrink_kill(), then same as in case 3. ->d_lock is held since before the beginning of the scope and all the way into lock_for_kill().

Case 5: opens in shrink_kill(), where it's immediately followed by call of dentry_kill(), then same as in the previous scopes. ->d_lock is held since before the beginning of the scope all the way into lock_for_kill().

Note that in cases 2, 4 and 5 the slow path of lock_for_kill() is the only part of rcu_read_lock() scope that is not covered by spinlock scopes. In case 1 we have the area in fast_dput() as well and in case 3 - the return path from select_collect2() and chunk in shrink_dcache_tree() up to grabbing ->d_lock.

Seeing that the reasons we need rcu_read_lock() in these additional areas are completely unrelated to lock_for_kill() slow path, the things get much more straightforward with
	* explicit rcu_read_lock() scope surrounding the area in slow path of lock_for_kill() where ->d_lock is not held
	* shrink_dcache_tree() dropping rcu_read_lock() as soon as it has grabbed ->d_lock.
	* dput() dropping rcu_read_lock() just before calling finish_dput().
	* rcu_read_lock() calls in finish_dput(), shrink_kill() and shrink_dentry_list() are removed, along with rcu_read_unlock() calls in dentry_kill().

RCU read-side critical areas are unchanged by that, safety of lock_for_kill() slow path is trivial to verify and a bunch of rcu_read_lock() scopes are either gone or become easier to describe.

Update the comments on locking conventions and memory safety considerations, including the NORCU case.

Incidentally, all calls of fast_dput() are immediately preceded by rcu_read_lock() and followed by rcu_read_unlock() now, which will allow us to simplify those on the next step...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
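The observation in the message above, that overlapping rcu_read_lock() and spinlock scopes add up to one unbroken RCU read-side critical area, can be checked on a toy counting model. This is a sketch only: it counts open scopes and records how often the count falls back to zero; it says nothing about how the kernel actually implements RCU, and all toy_* names are hypothetical.

```c
#include <assert.h>

/* Number of currently open scopes (rcu_read_lock() or spinlock)
 * keeping the modelled read-side critical area alive. */
static int toy_depth;
/* How many times the critical area has ended so far. */
static int toy_ends;

static void toy_enter(void)
{
	toy_depth++;
}

static void toy_leave(void)
{
	assert(toy_depth > 0);
	if (--toy_depth == 0)
		toy_ends++;	/* nothing holds the area open any more */
}

/* The interleaved sequence quoted in the commit message. */
static int toy_mixed_sequence(void)
{
	toy_depth = toy_ends = 0;
	toy_enter();	/* rcu_read_lock() */
	toy_enter();	/* spin_lock() */
	toy_leave();	/* rcu_read_unlock() */
	toy_enter();	/* rcu_read_lock() */
	toy_leave();	/* spin_unlock() */
	toy_leave();	/* rcu_read_unlock() */
	return toy_ends;
}

/* For contrast: two scopes that do not overlap. */
static int toy_split_sequence(void)
{
	toy_depth = toy_ends = 0;
	toy_enter();	/* rcu_read_lock() */
	toy_leave();	/* rcu_read_unlock() */
	toy_enter();	/* spin_lock() */
	toy_leave();	/* spin_unlock() */
	return toy_ends;
}
```

In the model the interleaved sequence ends exactly once, at the final unlock, while the non-overlapping one ends twice, i.e. there is a gap between the two scopes.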
12 daysfold lock_for_kill() and __dentry_kill() into common helperAl Viro
There are two callers of lock_for_kill() and both are followed by the same sequence of actions: * in case of failure, drop ->d_lock, do rcu_read_unlock() and go away * in case of success, do rcu_read_unlock() followed by passing dentry to __dentry_kill(); if the latter returns NULL, go away. All calls of __dentry_kill() are paired with lock_for_kill() now; let's turn that sequence into a new helper (dentry_kill()) and switch to using it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
12 daysfold lock_for_kill() into shrink_kill()Al Viro
Both callers have exact same shape. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
12 daysshrink_dentry_list(): start with removing from shrink listAl Viro
Currently we leave dentry on the list until we are done with lock_for_kill(). That guarantees that it won't have been even scheduled for removal until we remove it from the list and drop ->d_lock.

We grab ->d_lock and rcu_read_lock() and call lock_for_kill(). There are four possible cases:
	1) lock_for_kill() has succeeded; dentry and its inode (if any) are locked, dentry refcount is zero and we can remove it from shrink list and feed it to shrink_kill().
	2) lock_for_kill() fails since dentry has become busy. Nothing to do, rcu_read_unlock(), remove from shrink list, drop ->d_lock and move on.
	3) lock_for_kill() fails since dentry is currently being killed - already entered __dentry_kill(), but hasn't reached dentry_unlist() yet. Nothing to do, we should just do rcu_read_unlock(), remove from shrink list so that whoever's executing __dentry_kill() would free it once they are done, drop ->d_lock and move on - same actions as in case (2).
	4) lock_for_kill() fails since dentry has been killed (reached dentry_unlist(), DCACHE_DENTRY_KILLED set in ->d_flags). In that case whoever had been killing it had already seen it on our shrink list and skipped freeing it. At that point it's just a passive chunk of memory; rcu_read_unlock(), remove from the list, drop ->d_lock and use dentry_free() to schedule freeing.

While that works, there's a simpler way to do it:
	* grab ->d_lock
	* remove dentry from our shrink list
	* if DCACHE_DENTRY_KILLED is already set, drop ->d_lock, call dentry_free() and move on.
	* otherwise grab rcu_read_lock() and call lock_for_kill()
	* if lock_for_kill() succeeds, feed dentry to shrink_kill(), otherwise drop the locks and move on.

The end result is equivalent to the old variant. The only difference arises if at the time we grab ->d_lock dentry had refcount 0 and lock_for_kill() had failed spin_trylock() and had to drop and regain ->d_lock. Otherwise nobody can observe at which point within the unbroken ->d_lock scope dentry had been removed from the shrink list - all accesses to ->d_lru are under ->d_lock.

If ->d_lock had been dropped and regained, it is possible for another thread to feed that dentry to __dentry_kill(); if it doesn't get to dentry_unlist() before we regain ->d_lock, behaviour is still identical - it's case (3) and by the time __dentry_kill() would've gotten around to checking if the victim is on shrink list, it would've been already removed from ours.

If __dentry_kill() from another thread *does* get to dentry_unlist(), in the old variant we would have __dentry_kill() leave calling dentry_free() to us and in the new one __dentry_kill() would've called dentry_free() itself. Since we are under rcu_read_lock(), we are guaranteed that actual freeing won't happen until we get around to rcu_read_unlock().

IOW, the new variant is still safe wrt UAF, if not for the same reason as the old one, and overall result is the same; the only difference is which thread ends up scheduling the actual freeing of dentry.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
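The new per-dentry ordering listed above can be sketched as a small decision function. This is a userspace model with hypothetical names (toy_dentry, toy_shrink_one(), the TOY_* outcomes); locking is reduced to comments and the outcome of lock_for_kill() is supplied as an input flag.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_outcome { TOY_FREED, TOY_KILLED, TOY_SKIPPED };

struct toy_dentry {
	bool on_shrink_list;
	bool killed;		/* models DCACHE_DENTRY_KILLED */
	bool lock_for_kill_ok;	/* would lock_for_kill() succeed? */
};

static enum toy_outcome toy_shrink_one(struct toy_dentry *d)
{
	/* ->d_lock would be grabbed here (not modelled) */
	assert(d->on_shrink_list);
	d->on_shrink_list = false;	/* removal now comes first */

	if (d->killed)
		return TOY_FREED;	/* drop ->d_lock, dentry_free() */

	/* rcu_read_lock() and lock_for_kill() would go here */
	if (d->lock_for_kill_ok)
		return TOY_KILLED;	/* feed it to shrink_kill() */

	return TOY_SKIPPED;		/* drop the locks and move on */
}
```

In every branch the dentry is already off the shrink list, which is why the busy and being-killed cases collapse into a single "move on" path in this model.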
12 daysd_prune_aliases(): make sure to skip NORCU aliasesAl Viro
Either they are busy (in which case they won't be moved to shrink list anyway) or they have a zero refcount, in which case we really shouldn't mess with them - whoever had dropped the refcount to zero is on the way to evicting and freeing them. That way we are guaranteed that only the thread that has dropped refcount of NORCU dentry to zero might call lock_for_kill() and __dentry_kill() for those. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
12 dayskill d_dispose_if_unused()Al Viro
Rename to_shrink_list() into __move_to_shrink_list(), document and export it. Switch d_dispose_if_unused() users to that and kill d_dispose_if_unused() itself. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
12 daysmake to_shrink_list() return whether it has moved dentry to listAl Viro
... and make it check the refcount for being zero in addition to dentry not being on a shrink list already. Simplifies the callers... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
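A rough model of the changed helper, assuming only what the message says: it moves the dentry when the refcount is zero and the dentry is not already on a shrink list, and reports whether it did so. The struct and the toy_to_shrink_list() name are hypothetical, and the list itself is reduced to a flag.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_dentry {
	int count;		/* refcount */
	bool on_shrink_list;
};

/* Returns true only if the dentry was actually moved. */
static bool toy_to_shrink_list(struct toy_dentry *d)
{
	if (d->count != 0 || d->on_shrink_list)
		return false;
	d->on_shrink_list = true;
	return true;
}
```

A caller can then write "if (toy_to_shrink_list(d)) ..." instead of repeating both checks itself.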