path: root/include/linux
Age  Commit message  Author
2025-11-05  srcu: Make SRCU-fast readers enforce use of SRCU-fast definition/init  (Paul E. McKenney)
This commit makes CONFIG_PROVE_RCU=y kernels enforce the new rule that srcu_struct structures that are passed to srcu_read_lock_fast() and other SRCU-fast read-side markers be either initialized with init_srcu_struct_fast() on the one hand or defined with DEFINE_SRCU_FAST() or DEFINE_STATIC_SRCU_FAST() on the other. This eliminates the read-side test that was formerly included in srcu_read_lock_fast() and friends, speeding these primitives up by about 25% (admittedly only about half of a nanosecond, but when tracing on fastpaths...) Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: <bpf@vger.kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-05  srcu: Require special srcu_struct define/init for SRCU-fast readers  (Paul E. McKenney)
This commit adds CONFIG_PROVE_RCU=y checking to enforce the new rule that srcu_struct structures passed to srcu_read_lock_fast() and other SRCU-fast read-side markers be either initialized with init_srcu_struct_fast() on the one hand or defined using either DEFINE_SRCU_FAST() or DEFINE_STATIC_SRCU_FAST(). This will enable removal of the non-debug read-side checks from srcu_read_lock_fast() and friends, which on my laptop provides a 25% speedup (which admittedly amounts to about half a nanosecond, but when tracing fastpaths...) Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: <bpf@vger.kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-05  srcu: Create a DEFINE_SRCU_FAST()  (Paul E. McKenney)
This commit creates DEFINE_SRCU_FAST() and DEFINE_STATIC_SRCU_FAST() macros that are similar to DEFINE_SRCU() and DEFINE_STATIC_SRCU(), but which create srcu_struct structures that are usable only by readers initiated by srcu_read_lock_fast() and friends. This commit does make DEFINE_SRCU_FAST() available to modules, in which case the per-CPU srcu_data structures are not created at compile time, but rather at module-load time. This means that the ->srcu_reader_flavor field of the srcu_data structure is not available. Therefore, this commit instead creates an ->srcu_reader_flavor field in the srcu_struct structure, adds arguments to the DEFINE_SRCU()-related macros to initialize this new field, and extends the checks in the __srcu_check_read_flavor() function to include this new field. This commit also allows dynamically allocated srcu_struct structures to be marked for SRCU-fast readers. It does so by defining a new init_srcu_struct_fast() function that marks the specified srcu_struct structure for use by srcu_read_lock_fast() and friends. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: <bpf@vger.kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
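As a quick illustration of the two initialization styles described in these SRCU commits (a sketch only: the reader cookie type and macro usage follow the commit text and the current SRCU-fast API, and the my_ names are hypothetical):

	/* Statically defined srcu_struct marked for SRCU-fast readers. */
	DEFINE_STATIC_SRCU_FAST(my_srcu);

	static void my_reader(void)
	{
		struct srcu_ctr __percpu *scp;

		scp = srcu_read_lock_fast(&my_srcu);
		/* ... read-side critical section ... */
		srcu_read_unlock_fast(&my_srcu, scp);
	}

	/* Dynamically allocated srcu_struct marked for SRCU-fast readers. */
	static int my_setup(struct srcu_struct *ssp)
	{
		return init_srcu_struct_fast(ssp);
	}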
2025-11-05  srcu: Create an srcu_expedite_current() function  (Paul E. McKenney)
This commit creates an srcu_expedite_current() function that expedites the current (and possibly the next) SRCU grace period for the specified srcu_struct structure. This functionality will be inherited by RCU Tasks Trace courtesy of its mapping to SRCU fast. If the current SRCU grace period is already waiting, that wait will complete before the expediting takes effect. If there is no SRCU grace period in flight, this function might well create one. [ paulmck: Apply Zqiang feedback for PREEMPT_RT use. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <bpf@vger.kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
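As a hedged usage sketch (not taken from the patch itself), an updater that wants the grace period it is about to wait on to complete sooner might do something like the following, where my_srcu is a hypothetical SRCU-fast srcu_struct:

	static void my_wait_expedited(void)
	{
		srcu_expedite_current(&my_srcu);	/* expedite the current/next GP */
		synchronize_srcu(&my_srcu);		/* this wait now completes sooner */
	}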
2025-11-05  coredump: mark struct mm_struct as const  (Christian Brauner)
We don't actually modify it. Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-7-b447b82f2c9b@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  crypto: hisilicon - qm updates BAR configuration  (Longfang Liu)
On new platforms greater than QM_HW_V3, the configuration region for the live migration function of the accelerator device is no longer placed in the VF, but is instead placed in the PF. Therefore, the configuration region of the live migration function needs to be opened when the QM driver is loaded. When the QM driver is uninstalled, the driver needs to clear this configuration. Signed-off-by: Longfang Liu <liulongfang@huawei.com> Reviewed-by: Shameer Kolothum <shameerkolothum@gmail.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Link: https://lore.kernel.org/r/20251030015744.131771-2-liulongfang@huawei.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-05  fs: add super_write_guard  (Christian Brauner)
Link: https://patch.msgid.link/20251104-work-guards-v1-1-5108ac78a171@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  fs: inline current_umask() and move it to fs_struct.h  (Mateusz Guzik)
There is no good reason to have this as a func call, other than avoiding the churn of adding fs_struct.h as needed. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://patch.msgid.link/20251104170448.630414-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
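For reference, the helper being inlined is a one-liner; a sketch of the inlined form, assuming it simply reads the umask from current->fs as the out-of-line version did:

	/* in fs_struct.h (sketch) */
	static inline int current_umask(void)
	{
		return current->fs->umask;
	}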
2025-11-05  fs: add fs/super.h header  (Christian Brauner)
Split out super block associated functions into a separate header. Link: https://patch.msgid.link/20251104-work-fs-header-v1-3-fb39a2efe39e@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  fs: add fs/super_types.h header  (Christian Brauner)
Split out super block associated structures into a separate header. Link: https://patch.msgid.link/20251104-work-fs-header-v1-2-fb39a2efe39e@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  Merge tag 'platform-drivers-x86-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86  (Linus Torvalds)
Pull x86 platform driver fixes from Ilpo Järvinen:
 "Fixes and New Hotkey Support:
  - input + dell-wmi-base: Electronic privacy screen on/off hotkey support
  - int3472: Fix unregister double free
  - wireless-hotkey: Fix Kconfig typo"
* tag 'platform-drivers-x86-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
  platform: x86: Kconfig: fix minor typo in help for WIRELESS_HOTKEY
  platform/x86: dell-wmi-base: Handle electronic privacy screen on/off events
  Input: Add keycodes for electronic privacy screen on/off hotkeys
  MAINTAINERS: Update int3472 maintainers
  platform/x86: int3472: Fix double free of GPIO device during unregister
2025-11-05  KVM: Rename kvm_arch_vcpu_async_ioctl() to kvm_arch_vcpu_unlocked_ioctl()  (Sean Christopherson)
Rename the "async" ioctl API to "unlocked" so that upcoming usage in x86's TDX code doesn't result in a massive misnomer. To avoid having to retry SEAMCALLs, TDX needs to acquire kvm->lock *and* all vcpu->mutex locks, and acquiring all of those locks after/inside the current vCPU's mutex is a non-starter. However, TDX also needs to acquire the vCPU's mutex and load the vCPU, i.e. the handling is very much not async to the vCPU. No functional change intended. Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  KVM: Make support for kvm_arch_vcpu_async_ioctl() mandatory  (Sean Christopherson)
Implement kvm_arch_vcpu_async_ioctl() "natively" in x86 and arm64 instead of relying on an #ifdef'd stub, and drop HAVE_KVM_VCPU_ASYNC_IOCTL in anticipation of using the API on x86. Once x86 uses the API, providing a stub for one architecture and having all other architectures opt-in requires more code than simply implementing the API in the lone holdout. Eliminating the Kconfig will also reduce churn if the API is renamed in the future (spoiler alert). No functional change intended. Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05  block: introduce blkdev_report_zones_cached()  (Damien Le Moal)
Introduce the function blkdev_report_zones_cached() to provide a fast zone report built using the blkdev_get_zone_info() function, which gets zone information from a disk zones_cond array or zone write plugs. For a large capacity SMR drive, such a fast zone report can complete in a few milliseconds, compared to completion times of several seconds when the report is obtained from the device. The zone report is built in the same manner as with the regular blkdev_report_zones() function, that is, the first zone reported is the one containing the specified start sector and the report is limited to the specified number of zones (nr_zones argument). The information for each zone in the report is obtained using blkdev_get_zone_info(). For zoned devices that do not use zone write plug resources, using blkdev_get_zone_info() is inefficient as the zone report would be very slow, generated one zone at a time. To avoid this, blkdev_report_zones_cached() falls back to calling blkdev_do_report_zones() to execute a regular zone report. In this case, the .report_active field of struct blk_report_zones_args is set to true to report zone conditions using the BLK_ZONE_COND_ACTIVE condition in place of the implicit open, explicit open and closed conditions. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
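A hedged usage sketch: assuming the cached variant keeps the blkdev_report_zones() calling convention (start sector, number of zones, report_zones_cb callback plus private data), a caller interested only in cached zone state might look like this; the callback and wrapper names are hypothetical:

	static int count_active_cb(struct blk_zone *zone, unsigned int idx,
				   void *data)
	{
		unsigned int *active = data;

		/* cached reports fold imp/exp-open and closed into ACTIVE */
		if (zone->cond == BLK_ZONE_COND_ACTIVE)
			(*active)++;
		return 0;
	}

	static int count_active_zones(struct block_device *bdev,
				      unsigned int nr_zones)
	{
		unsigned int active = 0;
		int ret;

		ret = blkdev_report_zones_cached(bdev, 0, nr_zones,
						 count_active_cb, &active);
		return ret < 0 ? ret : active;
	}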
2025-11-05  block: introduce blkdev_get_zone_info()  (Damien Le Moal)
Introduce the function blkdev_get_zone_info() to obtain the information for a single zone from cached zone data, that is, either from the zone write plug for the target zone if it exists, or from the disk zones_cond array otherwise. Since sequential zones that do not have a zone write plug are either full, empty or in a bad state (read-only or offline), the zone write pointer can be inferred from the zone condition cached in the disk zones_cond array. For sequential zones that have a zone write plug, the zone condition and zone write pointer are obtained from the condition and write pointer offset managed with the zone write plug. This allows obtaining the information for a zone much more quickly than having to execute a report zones command on the device. blkdev_get_zone_info() falls back to using a regular zone report if the target zone is flagged as needing an update with the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag, or if the target device does not use zone write plugs (i.e. a device mapper device). In this case, the new function blkdev_report_zone_fallback() is used and the zone condition is reported consistently with the cached report, that is, the BLK_ZONE_COND_ACTIVE condition is used in place of the implicit open, explicit open and closed conditions. This is achieved by adding the .report_active field to struct blk_report_zones_args and by having disk_report_zone() set the correct zone condition if .report_active is true. In preparation for using blkdev_get_zone_info() in upcoming file system changes, also export this function as a GPL symbol. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05  block: use zone condition to determine conventional zones  (Damien Le Moal)
The conv_zones_bitmap field of struct gendisk is used to define a bitmap to identify the conventional zones of a zoned block device. The bit for a zone is set in this bitmap if the zone is a conventional one, that is, if the zone type is BLK_ZONE_TYPE_CONVENTIONAL. For such zone, this always corresponds to the zone condition BLK_ZONE_COND_NOT_WP. In other words, conv_zones_bitmap tracks a single condition of the zones of a zoned block device. In preparation for tracking more zone conditions, change conv_zones_bitmap into an array of zone conditions, using 1 byte per zone. This increases the memory usage from 1 bit per zone to 1 byte per zone, that is, from 16 KiB to about 100 KiB for a 30 TB SMR HDD with 256 MiB zones. This is a trade-off to allow fast cached report zones later on top of this change. Rename the conv_zones_bitmap field of struct gendisk to zones_cond. Add a blk_revalidate_zone_cond() function to initialize the zones_cond array of a disk during device scan and to update it on device revalidation. Move the allocation of the zones_cond array to disk_revalidate_zone_resources(), making sure that this array is always allocated, even for devices that do not need zone write plugs (zone resources), to ensure that bdev_zone_is_seq() can be re-implemented to use the zone condition array in place of the conv zones bitmap. Finally, the function bdev_zone_is_seq() is rewritten to use a test on the condition of the target zone. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
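The reworked helper itself is not quoted in the log, but the idea reads roughly like the sketch below, which assumes zones_cond is the per-zone condition byte array described above, indexed by zone number (zone sizes are already required to be a power of two); the real implementation may differ in detail:

	static inline bool bdev_zone_is_seq(struct block_device *bdev,
					    sector_t sector)
	{
		struct gendisk *disk = bdev->bd_disk;
		unsigned int zno = sector >> ilog2(bdev_zone_sectors(bdev));

		/* sequential means anything other than the not-write-pointer condition */
		return disk->zones_cond && disk->zones_cond[zno] != BLK_ZONE_COND_NOT_WP;
	}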
2025-11-05  block: introduce disk_report_zone()  (Damien Le Moal)
Commit b76b840fd933 ("dm: Fix dm-zoned-reclaim zone write pointer alignment") introduced an indirect call for the callback function of a report zones executed with blkdev_report_zones(). This is necessary so that the function disk_zone_wplug_sync_wp_offset() can be called to refresh a zone write plug zone write pointer offset after a write error. However, this solution makes the path that zone information follows harder to understand. Clean this up by introducing the new blk_report_zones_args structure to define a zone report callback and its private data, and by introducing the helper function disk_report_zone(), which calls both disk_zone_wplug_sync_wp_offset() and the zone report user callback function for all zones of a zone report. This helper function must be called by all block device drivers that implement the report zones block operation in order to correctly report zone information. All block device drivers supporting the report_zones block operation are updated to use this new scheme. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05  regulator: Add support for MediaTek MT6363 SPMI PMIC Regulators  (AngeloGioacchino Del Regno)
Add a driver for the regulators found on the MT6363 PMIC, fully controlled by SPMI interface. This PMIC regulates voltage with an input range of 2.6-5.0V, and features 10 buck converters and 26 LDOs. Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Link: https://patch.msgid.link/20251027110527.21002-5-angelogioacchino.delregno@collabora.com Signed-off-by: Mark Brown <broonie@kernel.org>
2025-11-05  module: Add compile-time check for embedded NUL characters  (Kees Cook)
Long ago, the kernel module license checks were bypassed by embedding a NUL character in the MODULE_LICENSE() string[1]. By using a string like "GPL\0proprietary text", the kernel would only read "GPL" due to C string termination at the NUL byte, allowing proprietary modules to avoid kernel tainting and access GPL-only symbols. The MODULE_INFO() macro stores these strings in the .modinfo ELF section, and get_next_modinfo() uses strcmp()-family functions which stop at the first NUL. This split the embedded string into two separate .modinfo entries, with only the first part being processed by license_is_gpl_compatible(). Add a compile-time check using static_assert that compares the full string length (sizeof - 1) against __builtin_strlen(), which stops at the first NUL. If they differ, compilation fails with a clear error message. While this check can still be circumvented by modifying the ELF binary post-compilation, it prevents accidental embedded NULs and forces intentional abuse to require deliberate binary manipulation rather than simple source-level tricks. Build tested with test modules containing both valid and invalid license strings. The check correctly rejects: MODULE_LICENSE("GPL\0proprietary") while accepting normal declarations: MODULE_LICENSE("GPL") Link: https://lwn.net/Articles/82305/ [1] Suggested-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Kees Cook <kees@kernel.org> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Tested-by: Daniel Gomez <da.gomez@samsung.com> Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
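The mechanism is small enough to sketch. Assuming a helper wired into the MODULE_INFO() path (the macro name here is illustrative, not the one actually added to the module headers), the compile-time comparison described above amounts to:

	/* Fails the build if 'string' contains an embedded NUL byte. */
	#define ASSERT_NO_EMBEDDED_NUL(string)					\
		static_assert(__builtin_strlen(string) == sizeof(string) - 1,	\
			      "modinfo string contains an embedded NUL")

	ASSERT_NO_EMBEDDED_NUL("GPL");			/* builds */
	/* ASSERT_NO_EMBEDDED_NUL("GPL\0proprietary");	   fails to build */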
2025-11-05  block: add __must_check attribute to sb_min_blocksize()  (Yongpeng Yang)
When sb_min_blocksize() returns 0 and the return value is not checked, it may lead to a situation where sb->s_blocksize is 0 when accessing the filesystem super block. After commit a64e5a596067bd ("bdev: add back PAGE_SIZE block size validation for sb_set_blocksize()"), this becomes more likely to happen when the block device’s logical_block_size is larger than PAGE_SIZE and the filesystem is unformatted. Add the __must_check attribute to ensure callers always check the return value. Cc: stable@vger.kernel.org # v6.15 Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Link: https://patch.msgid.link/20251104125009.2111925-6-yangyongpeng.storage@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
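For context, a sketch of the calling pattern the annotation enforces in a fill_super implementation (the filesystem name is hypothetical):

	static int examplefs_fill_super(struct super_block *sb, void *data,
					int silent)
	{
		/* sb_min_blocksize() returns 0 if the block size cannot be set */
		if (!sb_min_blocksize(sb, 1024))
			return -EINVAL;

		/* ... safe to read the on-disk super block via sb->s_blocksize ... */
		return 0;
	}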
2025-11-05  entry,unwind/deferred: Fix unwind_reset_info() placement  (Peter Zijlstra)
Stephen reported that on KASAN builds he's seeing:
  vmlinux.o: warning: objtool: user_exc_vmm_communication+0x15a: call to __kasan_check_read() leaves .noinstr.text section
  vmlinux.o: warning: objtool: exc_debug_user+0x182: call to __kasan_check_read() leaves .noinstr.text section
  vmlinux.o: warning: objtool: exc_int3+0x123: call to __kasan_check_read() leaves .noinstr.text section
  vmlinux.o: warning: objtool: noist_exc_machine_check+0x17a: call to __kasan_check_read() leaves .noinstr.text section
  vmlinux.o: warning: objtool: fred_exc_machine_check+0x17e: call to __kasan_check_read() leaves .noinstr.text section
This turns out to be atomic ops from unwind_reset_info() that have explicit instrumentation. Place unwind_reset_info() in the preceding instrumentation_begin() section.
Fixes: c6439bfaabf2 ("Merge tag 'trace-deferred-unwind-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251105100014.GY4068168@noisy.programming.kicks-ass.net
2025-11-05  dma-mapping: Allow use of DMA_BIT_MASK(64) in global scope  (James Clark)
Clang doesn't like that (1ULL<<(64)) overflows when initializing a global scope variable, even if that part of the ternary isn't used when n = 64. The same initialization can be done without warnings in function scopes, and GCC doesn't mind either way. The build failure that highlighted this was already fixed in a different way [1], which also has detailed links to the Clang issues. However it's not going to be long before the same thing happens again, so it's better to fix the root cause. Fix it by using GENMASK_ULL() which does exactly the same thing, is much more readable anyway, and doesn't have a shift that overflows. [1]: https://lore.kernel.org/all/20250918-mmp-pdma-simplify-dma-addressing-v1-1-5c2be2b85696@riscstar.com/ Signed-off-by: James Clark <james.clark@linaro.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20251030-james-fix-dma_bit_mask-v1-1-ad1ce7cfab6e@linaro.org
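The fix is essentially a one-liner; roughly, paraphrasing the change described above rather than quoting the patch:

	/*
	 * Old definition; even when n == 64 the unused ternary arm still
	 * contains 1ULL << 64, which Clang rejects in global-scope
	 * constant initializers:
	 *
	 *	#define DMA_BIT_MASK(n)	(((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))
	 *
	 * New definition, same value for 1 <= n <= 64, no overflowing shift:
	 */
	#define DMA_BIT_MASK(n)	GENMASK_ULL((n) - 1, 0)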
2025-11-05  iomap: add IOMAP_DIO_FSBLOCK_ALIGNED flag  (Qu Wenruo)
Btrfs requires all of its bios to be fs block aligned, normally it's totally fine but with the incoming block size larger than page size (bs > ps) support, the requirement is no longer met for direct IOs. Because iomap_dio_bio_iter() calls bio_iov_iter_get_pages(), only requiring alignment to be bdev_logical_block_size(). In the real world that value is either 512 or 4K, on 4K page sized systems it means bio_iov_iter_get_pages() can break the bio at any page boundary, breaking btrfs' requirement for bs > ps cases. To address this problem, introduce a new public iomap dio flag, IOMAP_DIO_FSBLOCK_ALIGNED. When calling __iomap_dio_rw() with that new flag, iomap_dio::flags will inherit that new flag, and iomap_dio_bio_iter() will take fs block size into the calculation of the alignment, and pass the alignment to bio_iov_iter_get_pages(), respecting the fs block size requirement. The initial user of this flag will be btrfs, which needs to calculate the checksum for direct read and thus requires the biovec to be fs block aligned for the incoming bs > ps support. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> [hch: also align pos/len, incorporate the trace flags from Darrick] Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251031131045.1613229-2-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  iomap: optional zero range dirty folio processing  (Brian Foster)
The only way zero range can currently process unwritten mappings with dirty pagecache is to check whether the range is dirty before mapping lookup and then flush when at least one underlying mapping is unwritten. This ordering is required to prevent iomap lookup from racing with folio writeback and reclaim. Since zero range can skip ranges of unwritten mappings that are clean in cache, this operation can be improved by allowing the filesystem to provide a set of dirty folios that require zeroing. In turn, rather than flush or iterate file offsets, zero range can iterate on folios in the batch and advance over clean or uncached ranges in between. Add a folio_batch in struct iomap and provide a helper for filesystems to populate the batch at lookup time. Update the folio lookup path to return the next folio in the batch, if provided, and advance the iter if the folio starts beyond the current offset. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  filemap: add helper to look up dirty folios in a range  (Brian Foster)
Add a new filemap_get_folios_dirty() helper to look up existing dirty folios in a range and add them to a folio_batch. This is to support optimization of certain iomap operations that only care about dirty folios in a target range. For example, zero range only zeroes the subset of dirty pages over unwritten mappings, seek hole/data may use similar logic in the future, etc. Note that the helper is intended for use under internal fs locks. Therefore it trylocks folios in order to filter out clean folios. This loosely follows the logic from filemap_range_has_writeback(). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
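The exact prototype is not shown in this log; assuming it follows the existing filemap_get_folios() pattern (mapping, in/out start index, end index, folio_batch), a caller holding its own filesystem lock might use it roughly as follows (function and variable names are illustrative):

	static void collect_dirty_folios(struct address_space *mapping,
					 loff_t start, loff_t length)
	{
		struct folio_batch fbatch;
		pgoff_t index = start >> PAGE_SHIFT;
		pgoff_t end = (start + length - 1) >> PAGE_SHIFT;

		folio_batch_init(&fbatch);
		/* assumed signature, modeled on filemap_get_folios() */
		while (filemap_get_folios_dirty(mapping, &index, end, &fbatch)) {
			/* ... hand the dirty folios to the zero-range path ... */
			folio_batch_release(&fbatch);
		}
	}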
2025-11-05  iomap: make iomap_read_folio() a void return  (Joanne Koong)
No errors are propagated in iomap_read_folio(). Change iomap_read_folio() to a void return to make this clearer to callers. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  iomap: add caller-provided callbacks for read and readahead  (Joanne Koong)
Add caller-provided callbacks for read and readahead so that it can be used generically, especially by filesystems that are not block-based. In particular, this:
* Modifies the read and readahead interface to take in a struct iomap_read_folio_ctx that is publicly defined as:
	struct iomap_read_folio_ctx {
		const struct iomap_read_ops *ops;
		struct folio *cur_folio;
		struct readahead_control *rac;
		void *read_ctx;
	};
  where struct iomap_read_ops is defined as:
	struct iomap_read_ops {
		int (*read_folio_range)(const struct iomap_iter *iter,
					struct iomap_read_folio_ctx *ctx,
					size_t len);
		void (*read_submit)(struct iomap_read_folio_ctx *ctx);
	};
  read_folio_range() reads in the folio range and is required by the caller to provide. read_submit() is optional and is used for submitting any pending read requests.
* Modifies existing filesystems that use iomap for read and readahead to use the new API, through the new statically inlined helpers iomap_bio_read_folio() and iomap_bio_readahead(). There is no change in functionality for those filesystems.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
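Given the ops layout quoted above, a non-block-based filesystem would plug in something along these lines (the myfs_ names are hypothetical):

	static int myfs_read_folio_range(const struct iomap_iter *iter,
					 struct iomap_read_folio_ctx *ctx,
					 size_t len)
	{
		/* read 'len' bytes into ctx->cur_folio at the current iter position */
		return 0;
	}

	static void myfs_read_submit(struct iomap_read_folio_ctx *ctx)
	{
		/* optional: kick off any reads queued via ctx->read_ctx */
	}

	static const struct iomap_read_ops myfs_read_ops = {
		.read_folio_range	= myfs_read_folio_range,
		.read_submit		= myfs_read_submit,
	};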
2025-11-05  spi: tegra210-quad: Improve timeout handling under  (Mark Brown)
Merge series from Vishwaroop A <va@nvidia.com>: This patch series addresses timeout handling issues in the Tegra QSPI driver that occur under high system load conditions. We've observed that when CPUs are saturated (due to error injection, RAS firmware activity, or general CPU contention), QSPI interrupt handlers can be delayed, causing spurious transfer failures even though the hardware completed the operation successfully. These changes have been tested in production environments under various high load scenarios including RAS testing and CPU saturation workloads.
2025-11-05  fs: rename fs_types.h to fs_dirent.h  (Christian Brauner)
We will split out a bunch of types into a separate header. So free up the appropriate name for it. Link: https://patch.msgid.link/20251104-work-fs-header-v1-1-fb39a2efe39e@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  iommupt: Add the Intel VT-d second stage page table format  (Jason Gunthorpe)
The VT-d second stage format is almost the same as the x86 PAE format, except the bit encodings in the PTE are different and a few new PTE features, like force coherency, are present. Among all the formats it is unique in not having a designated present bit.
Comparing the performance of several operations to the existing version:
iommu_map()
  pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
  2^12, 53,66 , 50,64 , 21.21
  2^21, 59,70 , 56,67 , 16.16
  2^30, 54,66 , 52,63 , 17.17
  256*2^12, 384,524 , 337,516 , 34.34
  256*2^21, 387,632 , 336,626 , 46.46
  256*2^30, 376,629 , 323,623 , 48.48
iommu_unmap()
  pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
  2^12, 67,86 , 63,84 , 25.25
  2^21, 64,84 , 59,80 , 26.26
  2^30, 59,78 , 56,74 , 24.24
  256*2^12, 216,335 , 198,317 , 37.37
  256*2^21, 245,350 , 232,344 , 32.32
  256*2^30, 248,345 , 226,339 , 33.33
Cc: Tina Zhang <tina.zhang@intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05  iommupt: Use the incoherent start/stop functions for PT_FEAT_DMA_INCOHERENT  (Jason Gunthorpe)
This is the first step to supporting an incoherent walker, start and stop the incoherence around the allocation and frees of the page table memory. The iommu_pages API maps this to dma_map/unmap_single(), or arch cache flushing calls. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05  iommupt: Add a kunit test for the IOMMU implementation  (Jason Gunthorpe)
This intends to have high coverage of the page table format functions and the IOMMU implementation itself, exercising the various corner cases.
The kunit tests can be run in the kunit framework, using commands like:
  tools/testing/kunit/kunit.py run --build_dir build_kunit_arm64 --arch arm64 --make_options LLVM=-19 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
  tools/testing/kunit/kunit.py run --build_dir build_kunit_uml --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
  tools/testing/kunit/kunit.py run --build_dir build_kunit_x86_64 --arch x86_64 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
  tools/testing/kunit/kunit.py run --build_dir build_kunit_i386 --arch i386 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
  tools/testing/kunit/kunit.py run --build_dir build_kunit_i386pae --arch i386 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig --kconfig_add CONFIG_X86_PAE=y
There are several interesting corner cases on the 32 bit platforms that need checking. Like the generic tests, these are run on the format's configuration list using kunit "params". This also checks the core iommu parts of the page table code as it enters the logic through a mock iommu_domain.
The following are checked:
- PT_FEAT_DYNAMIC_TOP properly adds levels one by one
- Every page size can be iommu_map()'d, and mapping creates that size
- iommu_iova_to_phys() works with every page size
- Test converting OA -> non present -> OA when the two OAs overlap and free table levels
- Test that unmap stops at holes, unmap doesn't split, and unmap returns the right values for partial unmap requests
- Randomly map/unmap. Checks map with random sizes, that map fails when hitting collisions doing nothing, unmap/map with random intersections and full unmap of random sizes. Also checks iommu_iova_to_phys() with random sizes
- Check for memory leaks by monitoring NR_SECONDARY_PAGETABLE
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05  iommu/amd: Remove AMD io_pgtable support  (Jason Gunthorpe)
None of this is used anymore, delete it. Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05  iommupt: Add the x86 64 bit page table format  (Jason Gunthorpe)
This is used by x86 CPUs and can be used in AMD/VT-d x86 IOMMUs. When an x86 IOMMU is running SVA the MM will be using this format. This implementation follows the AMD v2 io-pgtable version.
There is nothing remarkable here, the format can have 4 or 5 levels and has limited support for different page sizes. No contiguous pages support.
x86 uses a sign extension mechanism where the top bits of the VA must match the sign bit. The core code supports this through PT_FEAT_SIGN_EXTEND which creates an upper and a lower VA range. All the new operations will work correctly in both spaces, however currently there is no way to report the upper space to other layers. Future patches can improve that.
In principle this can support 3 page table levels matching the 32 bit PAE table format, but no iommu driver needs this. The focus is on the modern 64 bit 4 and 5 level formats.
Comparing the performance of several operations to the existing version:
iommu_map()
  pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
  2^12, 71,61 , 66,58 , -13.13
  2^21, 66,60 , 61,55 , -10.10
  2^30, 59,56 , 56,54 , -3.03
  256*2^12, 392,1360 , 345,1289 , 73.73
  256*2^21, 383,1159 , 335,1145 , 70.70
  256*2^30, 378,965 , 331,892 , 62.62
iommu_unmap()
  pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
  2^12, 77,71 , 73,68 , -7.07
  2^21, 76,70 , 70,66 , -6.06
  2^30, 69,66 , 66,63 , -4.04
  256*2^12, 225,899 , 210,870 , 75.75
  256*2^21, 262,722 , 248,710 , 65.65
  256*2^30, 251,643 , 244,634 , 61.61
The small -ve values in the iommu_unmap() are due to the core code calling iommu_pgsize() before invoking the domain op. This is unnecessary with this implementation. Future work optimizes this and gets to 2%, 4%, 3%.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05  iommufd: Change the selftest to use iommupt instead of xarray  (Jason Gunthorpe)
The iommufd self test uses an xarray to store the pfns and their orders to emulate a page table. Make it act more like a real iommu driver by replacing the xarray with an iommupt based page table. The new AMDv1 mock format behaves similarly to the xarray. Add set_dirty() as an iommu_pt operation to allow the test suite to simulate HW dirty. Userspace can select between several formats including the normal AMDv1 format and a special MOCK_IOMMUPT_HUGE variation for testing huge page dirty tracking. To make the dirty tracking test work the page table must only store exactly 2M huge pages, otherwise the logic the test uses fails. They cannot be broken up or combined. Aside from aligning the selftest with a real page table implementation, this helps test the iommupt code itself. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05  iommupt: Add a mock pagetable format for iommufd selftest to use  (Jason Gunthorpe)
The iommufd self test uses an xarray to store the pfns and their orders to emulate a page table. Slightly modify the amdv1 page table to create a real page table that has similar properties:
- 2k base granule to simulate something like a 4k page table on a 64K PAGE_SIZE ARM system
- Contiguous page support for every PFN order
- Dirty tracking
AMDv1 is the closest format, as it is the only one that already supports every page size. Tweak it to have only 5 levels and an 11 bit base granule and compile it separately as a format variant.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05  iommupt: Add read_and_clear_dirty op  (Jason Gunthorpe)
IOMMU HW now supports updating a dirty bit in an entry when a DMA writes to the entry's VA range. iommufd has a uAPI to read and clear the dirty bits from the tables. This is a trivial recursive descent algorithm to read and optionally clear the dirty bits. The format needs a function to tell if a contiguous entry is dirty, and a function to clear a contiguous entry back to clean. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05  iommupt: Add map_pages op  (Jason Gunthorpe)
map is slightly complicated because it has to handle a number of special edge cases:
- Overmapping a previously shared, but now empty, table level with an OA. Requires validating and freeing the possibly empty tables
- Doing the above across an entire to-be-created contiguous entry
- Installing a new shared table level concurrently with another thread
- Expanding the table by adding more top levels
Table expansion is a unique feature of AMDv1, this version is quite similar except we handle racing concurrent lockless map. The table top pointer and starting level are encoded in a single uintptr_t which ensures we can READ_ONCE() without tearing. Any op will do the READ_ONCE() and use that fixed point as its starting point. Concurrent expansion is handled with a table global spinlock.
When inserting a new table entry, map checks that the entire portion of the table is empty. This includes freeing any empty lower tables that will be overwritten by an OA. A separate free list is used while checking and collecting all the empty lower tables so that writing the new entry is uninterrupted: either the new entry fully writes or nothing changes.
A special fast path for PAGE_SIZE is implemented that does a direct walk to the leaf level and installs a single entry. This gives ~15% improvement for iommu_map() when mapping lists of single pages.
This version sits under the iommu_domain_ops as map_pages() but does not require the external page size calculation. The implementation is actually map_range() and can do arbitrary ranges, internally handling all the validation and supporting any arrangement of page sizes. A future series can optimize iommu_map() to take advantage of this.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05  iommupt: Add unmap_pages op  (Jason Gunthorpe)
unmap_pages removes mappings and any fully contained interior tables from the given range. This follows the now-standard iommu_domain API definition where it does not split up larger page sizes into smaller. The caller must perform unmap only on ranges created by map or it must have somehow otherwise determined safe cut points (eg iommufd/vfio use iova_to_phys to scan for them). Future work will provide 'cut' which explicitly does the page size split if the HW can support it. unmap is implemented with a recursive descent of the tree. If the caller provides a VA range that spans an entire table item then the table memory can be freed as well. If an entire table item can be freed then this version will also check the leaf-only level of the tree to ensure that all entries are present to generate -EINVAL. Many of the existing drivers don't do this extra check. This version sits under the iommu_domain_ops as unmap_pages() but does not require the external page size calculation. The implementation is actually unmap_range() and can do arbitrary ranges, internally handling all the validation and supporting any arrangement of page sizes. A future series can optimize __iommu_unmap() to take advantage of this. Freed page table memory is batched up in the gather and will be freed in the driver's iotlb_sync() callback after the IOTLB flush completes. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05  iommupt: Add iova_to_phys op  (Jason Gunthorpe)
iova_to_phys is a performance path for the DMA API and iommufd, implement it using an unrolled get_user_pages() like function waterfall scheme. The implementation itself is fairly trivial. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05  iommupt: Add the AMD IOMMU v1 page table format  (Jason Gunthorpe)
AMD IOMMU v1 is unique in supporting contiguous pages with a variable size and it can decode the full 64 bit VA space. Unlike other x86 page tables this explicitly does not do sign extension as part of allowing the entire 64 bit VA space to be supported. The general design is quite similar to the x86 PAE format, except with a 6th level and quite different PTE encoding.
This format is the only one that uses the PT_FEAT_DYNAMIC_TOP feature in the existing code as the existing AMDv1 code starts out with a 3 level table and adds levels on the fly if more IOVA is needed.
Comparing the performance of several operations to the existing version:
iommu_map()
  pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
  2^12, 65,64 , 62,61 , -1.01
  2^13, 70,66 , 67,62 , -8.08
  2^14, 73,69 , 71,65 , -9.09
  2^15, 78,75 , 75,71 , -5.05
  2^16, 89,89 , 86,84 , -2.02
  2^17, 128,121 , 124,112 , -10.10
  2^18, 175,175 , 170,163 , -4.04
  2^19, 264,306 , 261,279 , 6.06
  2^20, 444,525 , 438,489 , 10.10
  2^21, 60,62 , 58,59 , 1.01
  256*2^12, 381,1833 , 367,1795 , 79.79
  256*2^21, 375,1623 , 356,1555 , 77.77
  256*2^30, 356,1338 , 349,1277 , 72.72
iommu_unmap()
  pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
  2^12, 76,89 , 71,86 , 17.17
  2^13, 79,89 , 75,86 , 12.12
  2^14, 78,90 , 74,86 , 13.13
  2^15, 82,89 , 74,86 , 13.13
  2^16, 79,89 , 74,86 , 13.13
  2^17, 81,89 , 77,87 , 11.11
  2^18, 90,92 , 87,89 , 2.02
  2^19, 91,93 , 88,90 , 2.02
  2^20, 96,95 , 91,92 , 1.01
  2^21, 72,88 , 68,85 , 20.20
  256*2^12, 372,6583 , 364,6251 , 94.94
  256*2^21, 398,6032 , 392,5758 , 93.93
  256*2^30, 396,5665 , 389,5258 , 92.92
The ~5-17x speedup when working with multi-PTE map/unmaps is because the AMD implementation rewalks the entire table on every new PTE while this version retains its position. The same speedup will be seen with dirty operations as well.
The old implementation triggers a compiler optimization that ends up generating a "rep stos" memset for contiguous PTEs. Since AMD can have contiguous PTEs that span 2Kbytes of table this is a huge win compared to a normal movq loop. It is why the unmap side has a fairly flat runtime as the contiguous PTE size increases. This version makes it explicit with a memset64() call.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05  iommupt: Add the basic structure of the iommu implementation  (Jason Gunthorpe)
The existing IOMMU page table implementations duplicate all of the working algorithms for each format. By using the generic page table API a single C version of the IOMMU algorithms can be created and re-used for all of the different formats used in the drivers.
The implementation will provide a single C version of the iommu domain operations: iova_to_phys, map, unmap, and read_and_clear_dirty. Further, adding new algorithms and techniques becomes easy to do across the entire fleet of drivers and formats.
The C functions are drop in compatible with the existing iommu_domain_ops using the IOMMU_PT_DOMAIN_OPS() macro. Each per-format implementation compilation unit will produce exported symbols following the pattern pt_iommu_FMT_map_pages() which the macro directly maps to the iommu_domain_ops members. This avoids the additional function pointer indirection like io-pgtable has.
The top level struct used by the drivers is pt_iommu_table_FMT. It contains the other structs to allow container_of() to move between the driver, iommu page table, generic page table, and generic format layers.
	struct pt_iommu_table_amdv1 {
		struct pt_iommu {
			struct iommu_domain domain;
		} iommu;
		struct pt_amdv1 {
			struct pt_common common;
		} amdpt;
	};
The driver is expected to union the pt_iommu_table_FMT with its own existing domain struct:
	struct driver_domain {
		union {
			struct iommu_domain domain;
			struct pt_iommu_table_amdv1 amdv1;
		};
	};
	PT_IOMMU_CHECK_DOMAIN(struct driver_domain, amdv1, domain);
To create an alias to avoid renaming 'domain' in a lot of driver code. This allows all the layers to access all the necessary functions to implement their different roles with no change to any of the existing iommu core code.
Implement the basic starting point: pt_iommu_init(), get_info() and deinit().
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05  genpt: Generic Page Table base API  (Jason Gunthorpe)
The generic API is intended to be separated from the implementation of page table algorithms. It contains only accessors for walking and manipulating the table and helpers that are useful for building an implementation. Memory management is not in the generic API, but part of the implementation.
Using a multi-compilation approach the implementation module would include headers in this order:
	common.h
	defs_FMT.h
	pt_defs.h
	FMT.h
	pt_common.h
	IMPLEMENTATION.h
Where each compilation unit would have a combination of FMT and IMPLEMENTATION to produce a per-format per-implementation module.
The API is designed so that the format headers have minimal logic, and default implementations are provided if the format doesn't include one. Generally formats provide their code via an inline function using the pattern:
	static inline FMTpt_XX(..) {}
	#define pt_XX FMTpt_XX
The common code then enforces a function signature so that there is no drift in function arguments, or accidental polymorphic functions (as has been slightly troublesome in mm). Use of function-like #defines is avoided in the format even though many of the functions are small enough.
Provide kdocs for the API surface.
This is enough to implement the 8 initial format variations with all of their features:
* Entries comprised of contiguous blocks of IO PTEs for larger page sizes (AMDv1, ARMv8)
* Multi-level tables, up to 6 levels. Runtime selected top level
* The size of the top table level can be selected at runtime (ARM's concatenated tables)
* The number of levels in the table can optionally increase dynamically during map (AMDv1)
* Optional leaf entries at any level
* 32 bit/64 bit virtual and output addresses, using every bit
* Sign extended addressing (x86)
* Dirty tracking
A basic simple format takes about 200 lines to declare the required inline functions.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-04  net: Convert struct sockaddr to fixed-size "sa_data[14]"  (Kees Cook)
Revert struct sockaddr from flexible array to fixed 14-byte "sa_data", to solve over 36,000 -Wflex-array-member-not-at-end warnings, since struct sockaddr is embedded within many network structs. With socket/proto sockaddr-based internal APIs switched to use struct sockaddr_unsized, there should be no more uses of struct sockaddr that depend on reading beyond the end of struct sockaddr::sa_data that might trigger bounds checking. Comparing an x86_64 "allyesconfig" vmlinux build before and after this patch showed no new "ud1" instructions from CONFIG_UBSAN_BOUNDS nor any new "field-spanning" memcpy CONFIG_FORTIFY_SOURCE instrumentations. Cc: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-8-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
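For reference, the reverted layout is the classic fixed-size definition, sketched here without the kernel's surrounding attributes:

	struct sockaddr {
		sa_family_t	sa_family;	/* address family, AF_xxx */
		char		sa_data[14];	/* back to a fixed 14-byte array */
	};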
2025-11-04  bpf: Convert bpf_sock_addr_kern "uaddr" to sockaddr_unsized  (Kees Cook)
Change struct bpf_sock_addr_kern to use sockaddr_unsized for the "uaddr" field instead of sockaddr. This improves type safety in the BPF cgroup socket address filtering code. The casting in __cgroup_bpf_run_filter_sock_addr() is updated to match the new type, removing an unnecessary cast in the initialization and updating the conditional assignment to use the appropriate sockaddr_unsized cast. Additionally rename the "unspec" variable to "storage" to better align with its usage. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-7-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04  bpf: Convert cgroup sockaddr filters to use sockaddr_unsized consistently  (Kees Cook)
Update BPF cgroup sockaddr filtering infrastructure to use sockaddr_unsized consistently throughout the call chain, removing redundant explicit casts from callers. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-6-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04  net: Remove struct sockaddr from net.h  (Kees Cook)
Now that struct sockaddr is no longer used by net.h, remove it. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-4-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04  net: Convert proto_ops connect() callbacks to use sockaddr_unsized  (Kees Cook)
Update all struct proto_ops connect() callback function prototypes from "struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the compiler about object sizes. Calls into struct proto handlers gain casts that will be removed in the struct proto conversion patch. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-3-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04  net: Convert proto_ops bind() callbacks to use sockaddr_unsized  (Kees Cook)
Update all struct proto_ops bind() callback function prototypes from "struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the compiler about object sizes. Calls into struct proto handlers gain casts that will be removed in the struct proto conversion patch. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-2-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
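A sketch of what a converted callback looks like under this series; the protocol, its handler, and the length check are hypothetical, and struct sockaddr_unsized is introduced by the 'net: Add struct sockaddr_unsized' entry below:

	static int myproto_bind(struct socket *sock,
				struct sockaddr_unsized *uaddr, int addr_len)
	{
		/* interpret uaddr based on addr_len, not sizeof(*uaddr) */
		if (addr_len < sizeof(struct sockaddr_in))
			return -EINVAL;

		/* cast into struct proto; removed again by the later proto conversion */
		return my_proto_bind(sock->sk, (struct sockaddr *)uaddr, addr_len);
	}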
2025-11-04  net: Add struct sockaddr_unsized for sockaddr of unknown length  (Kees Cook)
Add flexible sockaddr structure to support addresses longer than the traditional 14-byte struct sockaddr::sa_data limitation without requiring the full 128-byte sa_data of struct sockaddr_storage. This allows the network APIs to pass around a pointer to an object that isn't lying to the compiler about how big it is, but must be accompanied by its actual size as an additional parameter.
It's possible we may want to migrate to including the size with the struct in the future, e.g.:
	struct sockaddr_unsized {
		u16 sa_data_len;
		u16 sa_family;
		u8 sa_data[] __counted_by(sa_data_len);
	};
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-1-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>