| Age | Commit message | Author |
|
Implement .db_vector_count() and .db_vector_mask() so NTB core/clients can
map doorbell events to per-vector work and avoid the thundering-herd
behavior.
pci-epf-vntb reserves two slots in db_count: slot 0 for link events and
slot 1 which is historically unused. Therefore the number of doorbell
vectors is (db_count - 2).
Report vectors as 0..N-1 and return BIT_ULL(db_vector) for the
corresponding doorbell bit. Build db_valid_mask from a validated vector
count so out-of-range db_count values cannot create invalid shifts.
Signed-off-by: Koichiro Den <den@valinux.co.jp>
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260513024923.451765-8-den@valinux.co.jp
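
A minimal user-space sketch of the vector mapping described above, not the
driver's actual code. All sketch_* names and the SKETCH_* macros are
hypothetical; the choice to report zero vectors when db_count is out of
range is an assumption, since the message only says the count is validated.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's BIT_ULL(); only valid for n < 64. */
#define SKETCH_BIT_ULL(n) (1ULL << (n))

/* Slots 0 and 1 are reserved, per the commit message. */
#define SKETCH_RESERVED_SLOTS 2u

/* Number of usable doorbell vectors; 0 when db_count is out of range. */
static unsigned int sketch_db_vector_count(unsigned int db_count)
{
	if (db_count <= SKETCH_RESERVED_SLOTS)
		return 0;
	if (db_count - SKETCH_RESERVED_SLOTS > 64)
		return 0; /* assumption: reject rather than clamp */
	return db_count - SKETCH_RESERVED_SLOTS;
}

/* Doorbell bit for vector 0..N-1, or 0 for an invalid vector. */
static uint64_t sketch_db_vector_mask(unsigned int db_count, unsigned int vec)
{
	if (vec >= sketch_db_vector_count(db_count))
		return 0;
	assert(vec < 64); /* guaranteed by the count check above */
	return SKETCH_BIT_ULL(vec);
}
```

With db_count = 6 the sketch reports four vectors, and vector 3 maps to
bit 3 of the doorbell mask.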
|
|
In pci-epf-vntb, db_count represents the total number of doorbell slots
exposed to the peer, including:
- slot #0 reserved for link events, and
- slot #1 historically unused (kept for compatibility).
Only the remaining slots correspond to actual doorbell bits. The current
db_valid_mask() exposes all slots as valid doorbells.
Limit db_valid_mask() to the real doorbell bits by returning
BIT_ULL(db_count - 2) - 1, and guard against db_count < 2.
Fixes: e35f56bb0330 ("PCI: endpoint: Support NTB transfer between RC and EP")
Signed-off-by: Koichiro Den <den@valinux.co.jp>
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260513024923.451765-7-den@valinux.co.jp
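
The mask arithmetic above can be sketched in plain C as follows. This is
an illustration only: sketch_db_valid_mask() is a hypothetical name, and
saturating to an all-ones mask for 64 or more doorbells is an assumption
made to keep the shift defined, not something the message states.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Valid doorbell bits for a given db_count, i.e. BIT_ULL(db_count - 2) - 1
 * with the two reserved slots excluded.
 */
static uint64_t sketch_db_valid_mask(unsigned int db_count)
{
	unsigned int n;

	if (db_count < 2)
		return 0; /* guard from the commit message */

	n = db_count - 2;
	if (n >= 64)
		return UINT64_MAX; /* assumption: saturate, avoid a 64-bit shift */

	assert(n < 64);
	return (1ULL << n) - 1;
}
```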
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver updates from Ilpo Järvinen:
- amd/hfi: Add support for dynamic ranking tables (version 3)
- amd/pmc:
- Add PMC driver support for AMD 1Ah M80H SoC
- Delay suspend for some Lenovo Laptops to avoid keyboard and lid
switch problems after s2idle
- arm64: qcom-hamoa-ec: Add Hamoa/Purwa/Glymur EC driver
- asus-armoury: add support for G614PR, GA402NJ, GA403UM, and FX608JPR
- asus-wmi: add keystone dongle support
- dell-dw5826e: Add reset driver for DW5826e
- dell-laptop: Fix rollback path
- hp-wmi:
- Add support for Omen 16-ap0xxx (board ID 8D26) and board ID 8B2F
- intel-hid:
- Add HP ProBook x360 440 G1 5 button array support
- Prevent racing ACPI notify handlers
- intel/pmc:
- Add Nova Lake support
- Rate-limit LTR scale-factor warning
- intel-uncore-freq:
- Expose instance ID in the sysfs
- Fix current_freq_khz after CPU hotplug
- intel/vsec: Restore BAR fallback for header walk
- ISST: Restore SST-PP control to all domains
- lenovo-wmi-*:
- Add more CPU tunable attributes
- Add GPU tunable attributes
- Add WMI battery charge limiting
- oxpec: add support for OneXPlayer Super X
- sel3350-platform: Retain LED state on load and unload
- surface: SAM: Add support for Surface Pro 12in
- uniwill-laptop: Add support for battery charge modes
- tools/power/x86/intel-speed-select: Harden daemon pidfile open
- Major refactoring efforts:
- ACPI driver to platform driver conversion
- Converting drivers to use the improved WMI API
- Miscellaneous cleanups / refactoring / improvements
* tag 'platform-drivers-x86-v7.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (115 commits)
platform/x86/intel/pmc: Add NVL PCI IDs for SSRAM telemetry discovery
platform/x86/intel/pmc/ssram: Make PMT registration optional
platform/x86/intel/pmc/ssram: Add ACPI discovery scaffolding
platform/x86/intel/pmc/ssram: Switch to static array with per-index probe state
platform/x86/intel/pmc/ssram: Refactor DEVID/PWRMBASE extraction into helper
platform/x86/intel/pmc/ssram: Add PCI platform data
platform/x86/intel/pmc/ssram: Rename probe and PCI ID table for consistency
platform/x86/intel/pmc: Add ACPI PWRM telemetry driver for Nova Lake S
platform/x86/intel/pmc: Add PMC SSRAM Kconfig description
platform/x86/intel/pmt: Unify header fetch and add ACPI source
platform/x86/intel/pmt: Cache the telemetry discovery header
platform/x86/intel/pmt: Pass discovery index instead of resource
platform/x86/intel/pmt/telemetry: Move overlap check to post-decode hook
platform/x86/intel/pmt/crashlog: Split init into pre-decode
platform/x86/intel/pmt: Add pre/post decode hooks around header parsing
modpost: Handle malformed WMI GUID strings
platform/wmi: Make sysfs attributes const
platform/wmi: Make wmi_bus_class const
hwmon: (dell-smm) Use new buffer-based WMI API
platform/x86: dell-ddv: Use new buffer-based WMI API
...
|
|
Pull NVMe fixes from Keith:
"- Apple A11 quirk for sharing tags across admin and IO queues (Nick)
- Target fix for short AUTH_RECEIVE buffers (Michael)
- Target fix for SQ refcount leak (Wentao)
- Target RDMA handling inline data with nonzero offset (Bryam)
- Target TCP fix handling the TCP_CLOSING state (Maurizio)
- FC abort fixes in early initialization (Mohamed)
- Controller device teardown fixes (Maurizio, John)
- Allocate the target ana_state with the port (Rosen)
- Quieten sparse and sysfs symbol warnings (John)"
* tag 'nvme-7.2-2026-06-23' of git://git.infradead.org/nvme:
nvmet-tcp: handle TCP_CLOSING state in nvmet_tcp_state_change
nvmet-auth: reject short AUTH_RECEIVE buffers
nvme-fc: Do not cancel requests in io target before it is initialized
nvme: make nvme_add_ns{_head}_cdev return void
nvme: make some sysfs diagnostic structures static
nvmet-rdma: handle inline data with a nonzero offset
nvme: target: allocate ana_state with port
nvme: fix crash and memory leak during invalid cdev teardown
nvmet: fix refcount leak in nvmet_sq_create()
nvme: quieten sparse warning in valid LBA size check
nvme-apple: Prevent shared tags across queues on Apple A11
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jassibrar/mailbox
Pull mailbox updates from Jassi Brar:
"Core:
- add debugfs support for used channels
- fix resource leak on startup failure
- propagate tx error codes
- clarify blocking mode thread support
Drivers:
- exynos: remove unused register definitions
- imx: refactor IRQ handlers, migrate to devm helpers, and other
minor improvements
- mpfs: fix syscon presence check in inbox ISR
- mtk-adsp: fix use-after-free during device teardown
- qcom: add dt-bindings for QCOM Maili, Hawi, Shikra APCS, and Nord
CPUCP platform support"
* tag 'mailbox-v7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jassibrar/mailbox: (23 commits)
mailbox: imx: Don't force-thread the primary handler
mailbox: imx: Move the RXDB part of the mailbox into the threaded handler
mailbox: imx: Move the RX part of the mailbox into the threaded handler
mailbox: imx: Start splitting the IRQ handler in primary and threaded handler
mailbox: imx: Use channel index instead of zero in imx_mu_specific_rx()
mailbox: imx: use devm_of_platform_populate()
mailbox: imx: Use devm_pm_runtime_enable()
mailbox: imx: Add a channel shutdown field
mailbox: imx: Forward the timeout/ error in imx_mu_generic_tx()
dt-bindings: mailbox: qcom: Add IPCC support for Maili Platform
mailbox: add list of used channels to debugfs
mailbox: don't free the channel if the startup callback failed
mailbox: Make mbox_send_message() return error code when tx fails
mailbox: Clarify multi-thread is not supported in blocking mode
mailbox: mtk-adsp: fix UAF during device teardown
mailbox: qcom: Unify user-visible "Qualcomm" name
mailbox: exynos: Drop unused register definitions
dt-bindings: mailbox: qcom: Add IPCC support for Hawi Platform
dt-bindings: mailbox: qcom,cpucp-mbox: Add Hawi compatible
dt-bindings: mailbox: qcom: Add Shikra APCS compatible
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd
Pull tpm updates from Jarkko Sakkinen:
"Only bug fixes"
* tag 'for-next-tpm-7.2-rc1-fixed' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
tpm: fix event_size output in tpm1_binary_bios_measurements_show
tpm: tpm_crb_ffa: revert defered_probed when tpm_crb_ffa is built-in
tpm: tpm2-sessions: wait for async KPP completion in tpm_buf_append_salt
tpm: tpm_tis: Add settle time for some TPMs
tpm: tpm_tis: store entire did_vid
tpm_crb: Check ACPI_COMPANION() against NULL during probe
tpm: tpm_tis_spi: Use wait_woken() in wait_for_tmp_stat()
tpm: Initialize name_size_alg for non-NULL name in tpm_buf_append_name()
tpm: restore timeout for key creation commands
tpm: svsm: constify tpm_chip_ops
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull more kselftest updates from Shuah Khan:
"Docs:
- remove obsolete wiki link from kselftest.rst
ftrace:
- drop invalid top-level local in test_ownership
- Fix trace_marker_raw test on 64K page kernels"
* tag 'linux_kselftest-next-7.2-rc1-second' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
docs: kselftest: remove link to obsolete wiki
selftests/ftrace: Fix trace_marker_raw test on 64K page kernels
selftests/ftrace: Drop invalid top-level local in test_ownership
|
|
registration
On helper registration, the maximum number of expectations cannot exceed
NF_CT_EXPECT_MAX_CNT (255), but zero can be specified, in which case
nf_conntrack_expect_max applies. Turn zero into NF_CT_EXPECT_MAX_CNT;
otherwise, expectation LRU eviction on insertion is disabled.
Moreover, extend this sanity check to all expectation classes.
This max_expected policy has only been tunable since userspace helpers
became available, so set the Fixes: tag to the commit that adds such
infrastructure.
Remove the check for p->max_expected given this field must always
be non-zero after this patch.
Fixes: 12f7a505331e ("netfilter: add user-space connection tracking helper infrastructure")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
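
A small sketch of the normalization rule described in this message, under
the assumption that an over-limit request is rejected with an error. The
sketch_* names are hypothetical and the value 255 is the one quoted above.

```c
#include <stddef.h>

/* Value of NF_CT_EXPECT_MAX_CNT as quoted in the commit message. */
#define SKETCH_EXPECT_MAX_CNT 255u

/*
 * Normalize a requested per-class expectation limit.
 * Returns 0 and stores the effective limit in *out, or -1 if the request
 * exceeds the cap. Zero is turned into the cap so the stored limit is
 * never zero.
 */
static int sketch_normalize_max_expected(unsigned int requested,
					 unsigned int *out)
{
	if (out == NULL || requested > SKETCH_EXPECT_MAX_CNT)
		return -1;

	*out = requested ? requested : SKETCH_EXPECT_MAX_CNT;
	return 0;
}
```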
|
|
Userspace passes '5000' in case user asks for 5 seconds.
Allowing for sub-second expectation lifetimes makes sense to me, so
fix up the kernel side instead of munging nft to send a value rounded
up to next second.
Also note that this violates nft convention of passing integers in
network byte order, but we can't change this anymore.
Fixes: 857b46027d6f ("netfilter: nft_ct: add ct expectations support")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
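
To illustrate the unit handling, here is a sketch that treats the value
from userspace as milliseconds and converts it to timer ticks, rounding
up so that a short non-zero timeout does not become zero. The tick rate
SKETCH_HZ and the helper name are hypothetical; the message does not say
which kernel helper the fix uses.

```c
#include <stdint.h>

/* Hypothetical tick rate, chosen only for the example. */
#define SKETCH_HZ 250u

/* Convert milliseconds to ticks, rounding up to the next whole tick. */
static uint64_t sketch_msecs_to_ticks(uint32_t msecs)
{
	return ((uint64_t)msecs * SKETCH_HZ + 999u) / 1000u;
}
```

With this tick rate, the '5000' sent for a five second timeout becomes
1250 ticks, and a 1 ms timeout still yields one tick.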
|
|
Run expectation eviction if no helper is specified to deal with the
nft_ct expectation support.
Cap the maximum expectation limit per master conntrack to
NF_CT_EXPECT_MAX_CNT (255).
Fixes: 857b46027d6f ("netfilter: nft_ct: add ct expectations support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Store master conntrack tuple in the expectation since exp->master might
refer to a different conntrack when accessed from rcu read side lock
area due to typesafe rcu rules.
Fixes: 02a3231b6d82 ("netfilter: nf_conntrack_expect: store netns and zone in expectation")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
IRC Direct client-to-client requires plaintext. IRC over TLS should be
preferred, making this helper ineffective. Add a deprecation warning and
update the help text to better reflect that this is needed for the DCC
extension, not IRC itself.
PPTP is esoteric these days and it is the only helper that requires the
destroy callback in the conntrack helper API.
Removal would simplify the conntrack core.
Both helpers are IPv4 only.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
In davinci_gpio_irq_setup(), after successfully creating an IRQ domain
with irq_domain_create_legacy(), a subsequent devm_kzalloc() failure
in the bank loop causes the function to return -ENOMEM without
removing the IRQ domain.
Unlike devm-managed resources, irq_domain_create_legacy() does not
auto-clean up on probe failure, so the domain is leaked.
Fix by calling irq_domain_remove() before returning on allocation
failure.
Fixes: b5cf3fd827d2 ("gpio: davinci: Redesign driver to accommodate ngpios in one gpio chip")
Signed-off-by: Qingshuang Fu <fuqingshuang@kylinos.cn>
Link: https://patch.msgid.link/20260623023106.117229-1-fffsqian@163.com
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
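
The unwinding pattern can be sketched in user space as below. The IRQ
domain is modelled by a simple counter and the allocation failure is
injected through a parameter; every sketch_* name is hypothetical and
none of this is the driver's code.

```c
#include <errno.h>

/* Number of "domains" created and not yet removed. */
static int sketch_live_domains;

static void sketch_domain_create(void)
{
	sketch_live_domains++;
}

static void sketch_domain_remove(void)
{
	sketch_live_domains--;
}

/*
 * fail_at is the bank index at which the per-bank allocation is made to
 * fail, or -1 for no failure.
 */
static int sketch_irq_setup(int nbanks, int fail_at)
{
	int i;

	sketch_domain_create();

	for (i = 0; i < nbanks; i++) {
		if (i == fail_at) {
			/* Not managed automatically, so undo it by hand. */
			sketch_domain_remove();
			return -ENOMEM;
		}
	}

	return 0; /* the domain stays alive on success */
}
```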
|
|
tegra_gpio_direction_input() and tegra_gpio_direction_output() already
program the GPIO controller direction registers directly. The additional
pinctrl_gpio_direction_input/output() calls do not add a Tegra pinctrl
operation, because the Tegra pinmux ops provide GPIO request/free
handling but no gpio_set_direction hook.
The extra call still enters the pinctrl core and takes pctldev->mutex.
Shared GPIO users can call the direction path while holding their
per-line spinlock, so this otherwise redundant pinctrl direction call can
sleep in an atomic context.
This was found by our static analysis tool and then confirmed by manual
review of tegra_gpio_probe(), the Tegra GPIO direction callbacks and the
Tegra pinctrl ops. The reviewed path has a default non-sleeping
struct gpio_chip while the direction callback still enters the pinctrl
mutex path.
A directed runtime validation kept the same non-sleeping chip registration
and drove:
gpio_shared_proxy_direction_output()
gpiod_direction_output_raw_commit()
tegra_gpio_direction_output()
pinctrl_gpio_direction_output()
Lockdep reported a sleep-in-atomic warning with the shared GPIO spinlock
held and pinctrl_get_device_gpio_range() plus tegra_gpio_direction_output()
on the stack.
Do not mark the whole chip as can_sleep to paper over this: can_sleep
describes whether get()/set() may sleep, and Tegra value access is MMIO.
Remove the redundant pinctrl direction calls and keep pinctrl involvement
in the existing request/free path.
Fixes: 11da90541283 ("gpio: tegra: Fix offset of pinctrl calls")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
Link: https://patch.msgid.link/20260619152439.1239561-1-runyu.xiao@seu.edu.cn
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
|
|
sendmsg()/sendto() with MSG_FASTOPEN is a combination of connect(2) and
write(2): it opens the connection in the SYN. apparmor_socket_sendmsg()
only checks AA_MAY_SEND, so a profile that grants send but denies connect
lets a confined task open an outbound TCP/MPTCP connection that connect(2)
would have refused, bypassing connect mediation.
Mediate the implicit connect when MSG_FASTOPEN is set and a destination
is supplied. Add it to apparmor_socket_sendmsg() (not the shared
aa_sock_msg_perm() helper, which recvmsg also uses) and call aa_sk_perm()
directly, mirroring the selinux and tomoyo fixes. sk_is_tcp() does not
cover MPTCP fast open, so the SOCK_STREAM/IPPROTO_MPTCP arm is explicit.
Fixes: cf60af03ca4e ("net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN)")
Cc: stable@vger.kernel.org
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
Signed-off-by: John Johansen <john.johansen@canonical.com>
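
The permission logic can be pictured with the following sketch. It is not
AppArmor code: the permission bits and function name are hypothetical,
SKETCH_MSG_FASTOPEN is a local constant set to the Linux value of
MSG_FASTOPEN, and the TCP/MPTCP socket type check is left out.

```c
#include <stdbool.h>

/* Hypothetical permission bits for the sketch. */
#define SKETCH_MAY_SEND     0x1u
#define SKETCH_MAY_CONNECT  0x2u

/* Local copy of the Linux MSG_FASTOPEN flag value. */
#define SKETCH_MSG_FASTOPEN 0x20000000u

/*
 * A send always needs the send permission. A fast-open send with a
 * destination also opens a connection, so it needs connect as well.
 */
static bool sketch_sendmsg_allowed(unsigned int granted,
				   unsigned int msg_flags, bool has_dest)
{
	unsigned int needed = SKETCH_MAY_SEND;

	if ((msg_flags & SKETCH_MSG_FASTOPEN) && has_dest)
		needed |= SKETCH_MAY_CONNECT;

	return (granted & needed) == needed;
}
```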
|
|
This feature allows resetting the helper of an existing conntrack, but it
is not safe. Making it safe requires a synchronize_rcu() call after
resetting the helper, which is going to be expensive for a large batch of
conntrack entries. It would also need to call the .destroy callback to
release the GRE/PPTP mappings.
This feature antedates the creation of the conntrack-tools and I cannot
find a good use-case for this. Given that I cannot find any user in the
netfilter.org userspace tree, I prefer to remove this feature.
Fixes: c1d10adb4a52 ("[NETFILTER]: Add ctnetlink port for nf_conntrack")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Add a test queueing from bridge family.
This was lacking: we queued from inet for ipv4 and ipv6 but
we had no bridge queue test so far.
Given that the kernel MUST validate that the in/out ports are still part
of a bridge device on reinject, add a test case for this before
adding this check.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
xtables targets return netfilter verdicts: NF_ACCEPT, NF_DROP, and so
on. ebtables targets return incompatible verdicts: EBT_ACCEPT,
EBT_DROP, ... We cannot allow fallback to NFPROTO_UNSPEC.
ebtables doesn't permit this since
11ff7288beb2 ("netfilter: ebtables: reject non-bridge targets")
but that commit missed the nft_compat layer.
Reported-by: Ren Wei <n05ec@lzu.edu.cn>
Reported-by: Wyatt Feng <bronzed_45_vested@icloud.com>
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Zhengchuan Liang <zcliangcn@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Fixes: 0ca743a55991 ("netfilter: nf_tables: add compatibility layer for x_tables")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
collision test
The existing test covered a scenario where a delayed INIT_ACK chunk
updates the vtag in conntrack after the association has already been
established.
A similar issue can occur with a delayed SCTP INIT chunk.
Add a new simultaneous-open test case where the client's INIT is
delayed, allowing conntrack to establish the association based on
the server-initiated handshake.
When the stale INIT arrives later, it may get recorded and cause a
following INIT_ACK from the peer to be accepted instead of dropped.
This INIT_ACK overwrites the vtag in conntrack, causing subsequent
SCTP DATA chunks to be considered as invalid and then dropped by
nft rules matching on ct state invalid.
This test verifies such stale INIT chunks do not cause problems.
Signed-off-by: Yi Chen <yiche.cy@gmail.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
nft_synproxy_eval_v4() and nft_synproxy_eval_v6() already take a
whole-object READ_ONCE() snapshot of the shared priv->info state before
building the SYNACK reply, but nft_synproxy_tcp_options() still masks
opts->options with priv->info.options from the live shared object.
When a named synproxy object is updated concurrently with SYN traffic,
the eval path can then mix mss and timestamp handling from the local
snapshot with an options mask taken from a newer configuration, so one
SYNACK no longer reflects a coherent synproxy configuration.
Use info->options so nft_synproxy_tcp_options() stays on the same local
snapshot that the eval path already copied from priv->info.
Fixes: ee394f96ad75 ("netfilter: nft_synproxy: add synproxy stateful object support")
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
- use correct names in kernel-doc comments
- add missing struct members to kernel-doc comments
Warning: include/linux/netfilter/x_tables.h:41 struct member 'targinfo' not described in 'xt_action_param'
Warning: include/linux/netfilter/x_tables.h:41 Excess struct member 'targetinfo' description in 'xt_action_param'
Warning: include/linux/netfilter/x_tables.h:90 struct member 'family' not described in 'xt_mtchk_param'
Warning: include/linux/netfilter/x_tables.h:90 struct member 'nft_compat' not described in 'xt_mtchk_param'
Warning: include/linux/netfilter/x_tables.h:101 expecting prototype for struct xt_mdtor_param. Prototype was for struct xt_mtdtor_param instead
Warning: include/linux/netfilter/x_tables.h:121 struct member 'net' not described in 'xt_tgchk_param'
Warning: include/linux/netfilter/x_tables.h:121 struct member 'table' not described in 'xt_tgchk_param'
Warning: include/linux/netfilter/x_tables.h:121 struct member 'target' not described in 'xt_tgchk_param'
Warning: include/linux/netfilter/x_tables.h:121 struct member 'targinfo' not described in 'xt_tgchk_param'
Warning: include/linux/netfilter/x_tables.h:121 struct member 'hook_mask' not described in 'xt_tgchk_param'
Warning: include/linux/netfilter/x_tables.h:121 struct member 'family' not described in 'xt_tgchk_param'
Warning: include/linux/netfilter/x_tables.h:121 struct member 'nft_compat' not described in 'xt_tgchk_param'
Warning: include/linux/netfilter/x_tables.h:345 expecting prototype for xt_recseq(). Prototype was for DECLARE_PER_CPU() instead
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Add sanity check for iph->ihl field in nf_flow_ip4_tunnel_proto() before
using it to compute the header size, avoiding out-of-bounds access with
malformed IP headers.
While at it, use iph->protocol instead of the hardcoded IPPROTO_IPIP
constant when setting ctx->tun.proto and reference ctx->tun.hdr_size
when updating ctx->offset.
Fixes: ab427db178858 ("netfilter: flowtable: Add IPIP rx sw acceleration")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
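
A sketch of the kind of ihl validation described above, written against a
raw byte buffer instead of the kernel's struct iphdr. The function name is
hypothetical and the exact checks in the patch may differ.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the IPv4 header length in bytes, or -1 if the header is
 * malformed or does not fit in the buffer.
 */
static int sketch_ipv4_hdr_len(const uint8_t *pkt, size_t len)
{
	unsigned int ihl;

	if (len < 20)            /* shorter than a minimal IPv4 header */
		return -1;
	if ((pkt[0] >> 4) != 4)  /* version field must be 4 */
		return -1;

	ihl = pkt[0] & 0x0f;     /* header length in 32-bit words */
	if (ihl < 5)             /* below the 20-byte minimum */
		return -1;
	if ((size_t)ihl * 4 > len)
		return -1;           /* options would run past the buffer */

	return (int)(ihl * 4);
}
```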
|
|
Commit 69894e5b4c5e ("netfilter: nft_connlimit: update the count if add
was skipped") introduced a regression where packets for valid
connections are dropped when using connlimit for soft-limiting
scenarios.
The issue occurs when a new connection reuses a socket currently in
the TIME_WAIT state. In this scenario, the connection tracking entry
is evaluated as already confirmed. Previously, __nf_conncount_add()
assumed that if a connection was confirmed and did not originate from
the loopback interface, it should skip the addition and return -EEXIST.
Skipping the addition triggers a garbage collection run that cleans up
the TIME_WAIT connection. Consequently, the active connection count
drops to 0, which xt_connlimit mishandles, leading to the false rejection
of the perfectly valid new connection.
Fix this by replacing the interface check with protocol-agnostic state
checks. We now skip the tree insertion and preserve the lockless garbage
collection optimization only if the connection is IPS_ASSURED. This
allows early-confirmed setup packets (such as reused TIME_WAIT sockets
or locally generated SYN-ACKs) to be properly evaluated and counted
without falsely dropping. The goto check_connections path is maintained
to ensure these setup packets are deduplicated correctly.
This has been tested with slowhttptest and HTTP server configured
locally to ensure we are not breaking soft-limiting scenarios for local
or external connections. In addition, it was tested with an OVS zone
limit too.
Fixes: 69894e5b4c5e ("netfilter: nft_connlimit: update the count if add was skipped")
Reported-by: Alejandro Olivan Alvarez <alejandro.olivan.alvarez@gmail.com>
Closes: https://lore.kernel.org/netfilter-devel/177349610461.3071718.4083978280323144323@eldamar.lan/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
We ran into the KASAN splat below, which is mostly uninteresting, aside
from having nf_nat_register_fn() in the call chain as the cause of the
offending access:
==================================================================
BUG: KASAN: slab-out-of-bounds in nf_nat_register_fn+0x5f9/0x640
Read of size 8 at addr ffff890031e54c20 by task iptables/9510
CPU: 0 UID: 0 PID: 9510 Comm: iptables Not tainted 6.18.18-grsec-full-20260320181326 #1 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Call Trace:
<TASK>
[…] dump_stack_lvl+0xee/0x160 ffff88004117eeb8
[…] print_report+0x6e/0x640 ffff88004117eee0
[…] ? __phys_addr+0x8e/0x140 ffff88004117eef0
[…] ? kasan_addr_to_slab+0x51/0xe0 ffff88004117ef08
[…] ? complete_report_info+0xec/0x1c0 ffff88004117ef20
[…] ? nf_nat_register_fn+0x5f9/0x640 ffff88004117ef48
[…] kasan_report+0xbc/0x140 ffff88004117ef50
[…] ? nf_nat_register_fn+0x5f9/0x640 ffff88004117ef90
[…] nf_nat_register_fn+0x5f9/0x640 ffff88004117eff8
[…] ? nf_nat_icmp_reply_translation+0x6e0/0x6e0 ffff88004117f070
[…] nf_tables_register_hook.part.0+0xa0/0x220 ffff88004117f080
[…] nf_tables_addchain.constprop.0+0x1054/0x1fc0 ffff88004117f0b8
[…] ? nft_chain_lookup.part.0+0x4ce/0xac0 ffff88004117f130
[…] ? nf_tables_abort+0x3d80/0x3d80 ffff88004117f190
[…] ? nf_tables_dumpreset_obj+0x100/0x100 ffff88004117f1c8
[…] ? nft_table_lookup.part.0+0x255/0x300 ffff88004117f310
[…] ? nf_tables_newchain+0x21a4/0x2fa0 ffff88004117f358
[…] nf_tables_newchain+0x21a4/0x2fa0 ffff88004117f360
[…] ? nf_tables_addchain.constprop.0+0x1fc0/0x1fc0 ffff88004117f458
[…] ? nla_get_range_signed+0x4a0/0x4a0 ffff88004117f488
[…] ? lock_acquire+0x16f/0x320 ffff88004117f490
[…] ? find_held_lock+0x3b/0xe0 ffff88004117f4b0
[…] ? __nla_parse+0x45/0x80 ffff88004117f500
[…] nfnetlink_rcv_batch+0xbca/0x19a0 ffff88004117f550
[…] ? nfnetlink_net_exit_batch+0x120/0x120 ffff88004117f618
[…] ? __sanitizer_cov_trace_switch+0x63/0xe0 ffff88004117f720
[…] ? gr_acl_handle_mmap+0x1c4/0x320 ffff88004117f7c0
[…] ? nla_get_range_signed+0x4a0/0x4a0 ffff88004117f7e8
[…] ? gr_is_capable+0x6f/0xe0 ffff88004117f830
[…] ? __nla_parse+0x45/0x80 ffff88004117f860
[…] ? skb_pull+0x103/0x1a0 ffff88004117f880
[…] nfnetlink_rcv+0x3db/0x4a0 ffff88004117f8b0
[…] ? nfnetlink_rcv_batch+0x19a0/0x19a0 ffff88004117f8d8
[…] ? netlink_lookup+0xe2/0x240 ffff88004117f900
[…] netlink_unicast+0x74b/0xb00 ffff88004117f930
[…] ? netlink_attachskb+0xb20/0xb20 ffff88004117f980
[…] ? __check_object_size+0x3e/0xaa0 ffff88004117f998
[…] ? security_netlink_send+0x51/0x160 ffff88004117f9c8
[…] netlink_sendmsg+0xa03/0x1200 ffff88004117f9f8
[…] ? netlink_unicast+0xb00/0xb00 ffff88004117fa70
[…] ? netlink_unicast+0xb00/0xb00 ffff88004117fac8
[…] ? ____sys_sendmsg+0xe2a/0x1040 ffff88004117faf8
[…] ____sys_sendmsg+0xe2a/0x1040 ffff88004117fb00
[…] ? kernel_recvmsg+0x300/0x300 ffff88004117fb60
[…] ? reacquire_held_locks+0xe9/0x260 ffff88004117fbc8
[…] ___sys_sendmsg+0x138/0x200 ffff88004117fbf8
[…] ? do_recvmmsg+0x7e0/0x7e0 ffff88004117fc30
[…] ? lockdep_hardirqs_on_prepare+0x101/0x1e0 ffff88004117fc50
[…] ? lock_acquire+0x16f/0x320 ffff88004117fd20
[…] ? lock_acquire+0x16f/0x320 ffff88004117fd58
[…] ? find_held_lock+0x3b/0xe0 ffff88004117fd70
[…] __sys_sendmsg+0x17a/0x260 ffff88004117fdc8
[…] ? __sys_sendmsg_sock+0x80/0x80 ffff88004117fdf0
[…] ? syscall_trace_enter+0x15e/0x2c0 ffff88004117fe98
[…] do_syscall_64+0x7d/0x400 ffff88004117fec8
[…] entry_SYSCALL_64_safe_stack+0x4a/0x60 ffff88004117fef8
</TASK>
==================================================================
The out-of-bounds report, though, is a red herring as it is for an
access that shouldn't have happened in the first place.
When nf_nat_init() fails to register its BPF kfuncs, it'll unwind and,
among others, call unregister_pernet_subsys() to deregister its per-net
ops. This makes the previously allocated net id available for reuse by
the next caller of register_pernet_subsys(), in our case, synproxy.
However, 'nat_net_id' will still hold the previously allocated value.
If nf_nat.o gets built as a module, all this doesn't matter. A failed
initialization routine makes the module fail to load and any dependent
module won't be able to load either. However, if nf_nat.o is built-in,
a failing init won't /completely/ make its functionality unavailable to
dependent modules, namely the code and static data is still there, free
to be called by modules like nft_chain_nat.ko.
Case in point, nft_chain_nat registers hooks that'll call into nf_nat
which, in our case, failed to initialize and therefore won't have a
valid net id nor related net_nat object any more.
Code in nf_nat, namely nf_nat_register_fn() and nf_nat_unregister_fn(),
still makes use of the reallocated net id, which leads to a type confusion
as the call to net_generic() will no longer return memory belonging to an
object suited to fit 'struct nat_net' but 'struct synproxy_net' instead.
The latter is only 24 bytes on 64-bit systems, much smaller than struct
nat_net which is 176 bytes, perfectly explaining the OOB KASAN report.
Detect and handle a failed nf_nat_init() by testing the 'nf_nat_hook'
pointer which will be reset to NULL on initialization errors to prevent
the usage of an invalid nat_net pointer.
As this check is only needed when nf_nat.o is built-in, guard it by
'#ifndef MODULE...'.
Fixes: cbc1dd5b659f ("netfilter: nf_nat: Fix possible memory leak in nf_nat_init()")
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Prepare input updates for 7.2 merge window.
|
|
The MMS134S and MMS136 touch controllers have an event size of 6 bytes
rather than 8 bytes. When __mms114_read_reg() reads the touch data
packet from the device into the touch buffer, the events are packed
tightly at 6-byte intervals. However, the driver iterates through the
events using standard C array indexing (touch[index]), where each
element is sizeof(struct mms114_touch) (8 bytes) apart. As a result, any
touch events beyond the first one are read from incorrect offsets and
parsed improperly.
Fix this by explicitly calculating the byte offset for each touch event
based on the device's specific event size.
Fixes: 53fefdd1d3a3 ("Input: mms114 - support MMS136")
Fixes: ab108678195f ("Input: mms114 - support MMS134S")
Reported-by: sashiko-bot@kernel.org
Assisted-by: Antigravity:gemini-3.5-flash
Reviewed-by: Bryam Vargas <hexlabsecurity@proton.me>
Link: https://patch.msgid.link/20260616050912.1531241-1-dmitry.torokhov@gmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
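
The offset calculation can be sketched as follows. This is an
illustration with a hypothetical helper name, operating on a plain byte
buffer; it is not the driver's code.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return a pointer to event number `index` in a tightly packed buffer,
 * using the device's event size instead of sizeof() of a C struct.
 * Returns NULL if the event does not fit entirely inside the buffer.
 */
static const uint8_t *sketch_touch_event(const uint8_t *buf, size_t buf_len,
					 size_t event_size, size_t index)
{
	if (event_size == 0 || index >= buf_len / event_size)
		return NULL;

	return buf + index * event_size;
}
```

For a 12-byte packet from a device with 6-byte events, event 1 starts at
byte 6. Indexing with an 8-byte stride would instead point at byte 8, in
the middle of that event.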
|
|
The Elan I2C touchpad driver queries the device for its physical
dimensions and trace counts to calculate the device resolution and width.
However, if the device firmware or device tree provides invalid zero
values for x_traces or y_traces, it results in a fatal division-by-zero
exception leading to a kernel panic during device probe.
Add checks to ensure these parameters are non-zero before performing
the division. If invalid trace values are detected, fall back to a safe
default of 1.
Additionally, prevent an arithmetic underflow in the touch reporting
logic. Previously, if the calculated or fallback width was smaller than
ETP_FWIDTH_REDUCE (90), the subtraction would underflow, resulting in a
massive unsigned integer being reported to userspace. Clamp the adjusted
width to a minimum of 0 to safely handle small physical dimensions and
fallback scenarios.
Completing the probe with safe fallback values ensures the sysfs nodes
are created, keeping the firmware update path intact so a recovery
firmware can be flashed to the device.
Fixes: 6696777c6506 ("Input: add driver for Elan I2C/SMbus touchpad")
Fixes: e3a9a1290688 ("Input: elan_i2c - do not query the info if they are provided")
Signed-off-by: Ranjan Kumar <kumarranja@chromium.org>
Link: https://patch.msgid.link/20260612060339.3829666-1-kumarranja@chromium.org
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
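
Both guards can be sketched in a few lines of C. The helper names are
hypothetical, the value 90 is ETP_FWIDTH_REDUCE as quoted above, and the
sketch assumes the fallback of 1 is applied to the trace count before
dividing.

```c
/* ETP_FWIDTH_REDUCE as quoted in the commit message. */
#define SKETCH_FWIDTH_REDUCE 90u

/* Width of one trace; a zero trace count falls back to 1. */
static unsigned int sketch_trace_width(unsigned int max_coord,
				       unsigned int traces)
{
	if (traces == 0)
		traces = 1; /* assumption: the fallback applies to the divisor */

	return max_coord / traces;
}

/* Reduced width, clamped at 0 instead of wrapping around. */
static unsigned int sketch_adjusted_width(unsigned int width)
{
	return width > SKETCH_FWIDTH_REDUCE ? width - SKETCH_FWIDTH_REDUCE : 0;
}
```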
|
|
Memoryless force-feedback devices use a timer to manage playback of
effects. When a driver for such a device is unbound (or the device is
unregistered for other reasons), the driver typically frees its private
data synchronously. However, the input_dev structure (and its associated
force-feedback structures, including the timer) is only freed when the
last user closes the corresponding device node.
If userspace keeps the device node open while the device is unregistered
(e.g., during driver unbind), the force-feedback timer can still fire
after the driver's private data has been freed.
Introduce a new 'stop' callback to struct ff_device, and call it from
input_unregister_device() before the device is deleted. Implement this
callback for memoryless devices and synchronously shut down the timer to
ensure it is stopped and cannot be rearmed once unregistration happens.
Assisted-by: Gemini:gemini-3.1-pro
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
|
|
iforce_process_packet() handles a status report (packet id 0x02) by
taking a force-feedback effect index straight from the device wire and
using it to address the per-effect state array:
i = data[1] & 0x7f;
if (data[1] & 0x80) {
if (!test_and_set_bit(FF_CORE_IS_PLAYED,
iforce->core_effects[i].flags))
...
} else if (test_and_clear_bit(FF_CORE_IS_PLAYED,
iforce->core_effects[i].flags)) {
...
}
The index is masked only with 0x7f, so it ranges 0..127, but
core_effects[] holds only IFORCE_EFFECTS_MAX (32) entries. For an index
of 32..127 the test_and_set_bit()/test_and_clear_bit() is an
out-of-bounds single-bit read-modify-write past the array. core_effects[]
is the second-to-last member of struct iforce, so the write lands in the
trailing members and beyond the embedding kzalloc()'d iforce_serio /
iforce_usb object.
data[1] is unvalidated device payload on both transports (the USB
interrupt endpoint and serio), and the status path is not gated on force
feedback being present, so a malicious or counterfeit device can set or
clear a bit at an attacker-chosen offset past the object.
Reject an out-of-range index instead of indexing with it. Bound against
the array dimension IFORCE_EFFECTS_MAX rather than dev->ff->max_effects so
the check guarantees memory safety regardless of how many effects the
device registered. A legitimate "effect started/stopped" status always
carries an index below IFORCE_EFFECTS_MAX, so well-formed devices are
unaffected; the neighbouring mark_core_as_ready() loop is already bounded
and is left untouched.
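A minimal userspace sketch of the bound check described above. The struct, the flag bit and the function name are simplified stand-ins (assumptions), and plain bit operations replace the kernel's atomic test_and_set_bit()/test_and_clear_bit(). Only IFORCE_EFFECTS_MAX (32) and the 0x7f/0x80 masks come from the message.

```c
#include <stdbool.h>
#include <stdint.h>

#define IFORCE_EFFECTS_MAX 32	/* array dimension quoted in the commit message */

/* Simplified stand-in for the per-effect state. */
struct core_effect {
	unsigned long flags;
};

static struct core_effect core_effects[IFORCE_EFFECTS_MAX];

/*
 * Hypothetical status handler: returns false when the device-supplied
 * index is out of range, true once the "is played" bit has been updated.
 */
static bool handle_status(uint8_t status_byte)
{
	unsigned int i = status_byte & 0x7f;	/* 0..127 from the wire */

	if (i >= IFORCE_EFFECTS_MAX)		/* reject instead of indexing */
		return false;

	if (status_byte & 0x80)
		core_effects[i].flags |= 1ul;	/* effect started */
	else
		core_effects[i].flags &= ~1ul;	/* effect stopped */
	return true;
}
```

Checking against the array dimension, rather than a per-device effect count, keeps the access in bounds whatever the device reports.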
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260613-b4-disp-4828d263-v1-1-02320e1a89dd@proton.me
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
|
|
goodix_ts_read_input_report() copies the number of touch points reported
by the device into an on-stack buffer
u8 point_data[2 + GOODIX_MAX_CONTACT_SIZE * GOODIX_MAX_CONTACTS];
which is sized for at most GOODIX_MAX_CONTACTS (10) contacts. The only
runtime check bounds the per-interrupt count against ts->max_touch_num,
but that value is taken verbatim from a 4-bit field of the device
configuration block and is never clamped:
ts->max_touch_num = ts->config[MAX_CONTACTS_LOC] & 0x0f;
The nibble can be 0..15, so a malfunctioning, malicious or counterfeit
controller (or an attacker tampering with the I2C bus) can advertise up
to 15 contacts. goodix_ts_read_input_report() then accepts a touch_num
of up to 15 and the second goodix_i2c_read() writes
ts->contact_size * (touch_num - 1) bytes past the one-contact header into
point_data - up to 30 bytes (45 with the 9-byte report format) beyond the
92-byte buffer: a stack out-of-bounds write.
Clamp max_touch_num to GOODIX_MAX_CONTACTS, the number of contacts
point_data[] is sized for, when reading it from the configuration.
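The arithmetic above can be checked with a short sketch. The helper names are hypothetical. GOODIX_MAX_CONTACT_SIZE = 9 is inferred from the 92-byte buffer (2 + 9 * 10), and the 8-byte contact size from the 30-byte overrun figure. Both are assumptions drawn from the numbers in the message rather than from the driver source.

```c
#include <stddef.h>

#define GOODIX_MAX_CONTACTS	10	/* from the commit message */
#define GOODIX_MAX_CONTACT_SIZE	9	/* inferred: 2 + 9 * 10 = 92 bytes */
#define POINT_DATA_SIZE	(2 + GOODIX_MAX_CONTACT_SIZE * GOODIX_MAX_CONTACTS)

/* Hypothetical helper: read the 4-bit field and clamp it to the buffer's capacity. */
static unsigned int clamp_max_touch_num(unsigned char config_byte)
{
	unsigned int n = config_byte & 0x0f;	/* 0..15 */

	return n > GOODIX_MAX_CONTACTS ? GOODIX_MAX_CONTACTS : n;
}

/* Bytes a report with touch_num contacts occupies in point_data. */
static size_t report_bytes(unsigned int contact_size, unsigned int touch_num)
{
	return 2 + (size_t)contact_size * touch_num;
}
```

With 15 contacts the report needs 122 bytes (8-byte contacts) or 137 bytes (9-byte contacts), which exceeds the 92-byte buffer by 30 and 45 bytes respectively. After clamping, the largest report is exactly 92 bytes.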
Fixes: a7ac7c95d468 ("Input: goodix - use max touch number from device config")
Cc: stable@vger.kernel.org
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
Reviewed-by: Hans de Goede <johannes.goede@oss.qualcomm.com>
Link: https://patch.msgid.link/20260612-b4-disp-6844625d-v1-1-df0aed080c9d@proton.me
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
|
|
BPF LSM programs can currently attach to xfrm_decode_session(). That
hook may return an error, but security_skb_classify_flow() calls it
from a void path and triggers BUG_ON() if an error is returned.
Disable BPF attachment to the hook to prevent a BPF LSM program from
turning packet classification into a full panic.
Fixes: 9e4e01dfd325 ("bpf: lsm: Implement attach, detach and execution")
Signed-off-by: Bradley Morgan <include@grrlz.net>
Link: https://lore.kernel.org/r/20260619130305.27779-1-include@grrlz.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"The most notable change is the removal of the fscache backend: it has
been deprecated for almost two years, mainly because EROFS file-backed
mounts and fanotify pre-content hooks (together with erofs-utils) now
provide better functionality and simpler codebase. In addition,
fscache has depended on netfslib for years, which is undesirable for
EROFS since it is a local filesystem. More details in [1].
In addition, sparse support has been added to the pcluster layout,
which is helpful for large sparse AI datasets, and map requests for
chunk-based inodes have been optimized to be more efficient as well.
There are also the usual fixes and cleanups.
Summary:
- Report more consecutive chunks of the same type for
each iomap request
- Add sparse support for the pcluster layout
- Update the EROFS documentation overview
- Remove the deprecated fscache backend
- Various fixes and cleanups"
Link: https://lore.kernel.org/r/20260622013622.934174-1-hsiangkao@linux.alibaba.com [1]
* tag 'erofs-for-7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: handle 48-bit blocks_hi for compressed inodes
erofs: remove fscache backend entirely
erofs: simplify RCU read critical sections
erofs: add sparse support to pcluster layout
erofs: add folio order to trace_erofs_read_folio
erofs: introduce erofs_map_chunks()
erofs: call erofs_exit_ishare() before rcu_barrier()
erofs: update the overview of the documentation
erofs: clean up erofs_ishare_fill_inode()
|
|
KCSAN reports a race between break_stripe_batch_list() and
raid5_make_request() on sh->dev[i].flags (plain word write vs. atomic bit op).
One possible scenario is:
CPU1                                                    CPU2
break_stripe_batch_list(sh1)
-> handle sh2
-> lock(sh2)
-> sh2->batch_head = NULL
-> unlock(sh2)
-> test_and_clear_bit(R5_Overlap, sh2->dev[i].flags)
-> wake_up_bit(sh2->dev[i].flags)
                                                        raid5_make_request()
                                                        -> add_all_stripe_bios(sh2)
                                                        -> lock(sh2)
                                                        -> stripe_bio_overlaps(sh2) returns true
                                                           (batch_head is NULL, so the new bio
                                                           overlaps an existing bio on sh2)
                                                        -> set_bit(R5_Overlap, sh2->dev[i].flags)
                                                        -> unlock(sh2)
                                                        -> wait_on_bit(sh2->dev[i].flags)
-> sh2->dev[i].flags = sh1->dev[i].flags & ~R5_Overlap
No wake_up_bit() follows, so CPU2 could sit in wait_on_bit() forever.
Fix by:
- Expanding the protected zone.
- Using a snapshot of the batch head's device flags when head_sh->stripe_lock is not held.
- Moving sh/head_sh->batch_head = NULL to the end of the protected zone, so that
any concurrent add_all_stripe_bios() that grabs sh->stripe_lock now either:
- sees batch_head != NULL and is rejected by stripe_bio_overlaps()
under the lock (no R5_Overlap wait), or
- sees batch_head == NULL only after dev[i].flags has already been
set and the prior R5_Overlap waiters woken.
KCSAN report:
================================================
BUG: KCSAN: data-race in break_stripe_batch_list / raid5_make_request
write (marked) to 0xffff8e89c8117548 of 8 bytes by task 4042 on cpu 0:
raid5_make_request+0xea0/0x2930
md_handle_request+0x4a2/0xa40
md_submit_bio+0x109/0x1a0
__submit_bio+0x2ec/0x390
submit_bio_noacct_nocheck+0x457/0x710
submit_bio_noacct+0x2a7/0xc20
submit_bio+0x56/0x250
blkdev_direct_IO+0x54c/0xda0
blkdev_write_iter+0x38f/0x570
aio_write+0x22b/0x490
io_submit_one+0xa51/0xf70
__x64_sys_io_submit+0xf7/0x220
x64_sys_call+0x1907/0x1c60
do_syscall_64+0x130/0x570
entry_SYSCALL_64_after_hwframe+0x76/0x7e
read to 0xffff8e89c8117548 of 8 bytes by task 4010 on cpu 5:
break_stripe_batch_list+0x249/0x480
handle_stripe_clean_event+0x720/0x9b0
handle_stripe+0x32fb/0x4500
handle_active_stripes.isra.0+0x6e0/0xa50
raid5d+0x7e0/0xba0
md_thread+0x15a/0x2d0
kthread+0x1e3/0x220
ret_from_fork+0x37a/0x410
ret_from_fork_asm+0x1a/0x30
value changed: 0x0000000000000019 -> 0x0000000000000099 --> R5_Overlap
Fixes: fb642b92c267 ("md/raid5: duplicate some more handle_stripe_clean_event code in break_stripe_batch_list")
Signed-off-by: Chen Cheng <chencheng@fnnas.com>
Link: https://patch.msgid.link/20260619041013.1207148-1-chencheng@fnnas.com
Signed-off-by: Yu Kuai <yukuai@fygo.io>
|
|
This patch just suppresses KCSAN noise. No functional change.
RAID-5 can group multiple full-stripe writes (stripe_heads) into a
batch (batch_list), with one head_sh leading them.
break_stripe_batch_list() is called when the batch is finished or when
a stripe has to be dropped out of the batch.
break_stripe_batch_list() reads stripe state several times while
request paths can update those state words concurrently with
lockless bitops, which KCSAN reports.
Use a snapshot to guarantee that the value used for
warning, copying, and handle checks is internally consistent
at the moment it is read.
KCSAN report:
==============================================
BUG: KCSAN: data-race in __add_stripe_bio / break_stripe_batch_list
write (marked) to 0xffff8e89d4f0b988 of 8 bytes by task 4323 on cpu 3:
__add_stripe_bio+0x35e/0x400
raid5_make_request+0x6ac/0x2930
md_handle_request+0x4a2/0xa40
md_submit_bio+0x109/0x1a0
__submit_bio+0x2ec/0x390
submit_bio_noacct_nocheck+0x457/0x710
submit_bio_noacct+0x2a7/0xc20
submit_bio+0x56/0x250
blkdev_direct_IO+0x54c/0xda0
blkdev_write_iter+0x38f/0x570
aio_write+0x22b/0x490
io_submit_one+0xa51/0xf70
read to 0xffff8e89d4f0b988 of 8 bytes by task 4290 on cpu 4:
break_stripe_batch_list+0x3ce/0x480
handle_stripe_clean_event+0x720/0x9b0
handle_stripe+0x32fb/0x4500
handle_active_stripes.isra.0+0x6e0/0xa50
raid5d+0x7e0/0xba0
Signed-off-by: Chen Cheng <chencheng@fnnas.com>
Link: https://patch.msgid.link/20260618134748.1168360-1-chencheng@fnnas.com
Signed-off-by: Yu Kuai <yukuai@fygo.io>
|
|
The net-next-hw spinners on netdev.bots.linux.dev observe failing
so-txtime-py tests. A review of stdout shows most failures to be
due to exceeding the 4ms grace period. All I saw were within 8ms.
So increase to that.
Double the bounds from 4 to 8ms. This is still small enough to
differentiate the delays programmed by the test, 10 and 20ms.
Fixes: 5c6baef3885c ("selftests: drv-net: convert so_txtime to drv-net")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/netdev/20260610170651.1b644001@kernel.org/
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260621200137.1564776-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In airoha_qdma_set_chan_tx_sched(), the loop clearing queue mask was
using AIROHA_NUM_TX_RING (32) instead of AIROHA_NUM_QOS_QUEUES (8).
Each channel has 8 queues, and TXQ_DISABLE_CHAN_QUEUE_MASK(channel, i)
computes BIT(i + (channel * 8)). With i ranging 0..31, this causes:
- channel 0: clears bit 0..31 (all 4 channels) instead of 0..7
- channel 1: clears bit 8..31 (channels 1-3) instead of 8..15
- channel 2: clears bit 16..31 (channels 2-3) instead of 16..23
- channel 3: clears bit 24..31 (channel 3 only) - correct by accident
While BIT(32+) on arm64 produces 64-bit values that are truncated to 0 in
the u32 mask parameter, the loop still incorrectly clears queues beyond
queue 7 of the requested channel.
Even though this is functionally harmless (the register resets to 0
and is only ever cleared, never set — so clearing extra bits is a
no-op), the loop bound is semantically wrong and should be fixed for
correctness and clarity.
Fix by using AIROHA_NUM_QOS_QUEUES (8) as the loop upper bound.
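The per-channel bit ranges listed above can be reproduced with a small sketch. The function name is hypothetical, and the bit computation follows the BIT(i + (channel * 8)) formula and the u32 truncation described in the message. No register access is modelled.

```c
#include <stdint.h>

#define AIROHA_NUM_QOS_QUEUES	8	/* queues per channel, from the message */
#define AIROHA_NUM_TX_RING	32	/* the incorrect loop bound, from the message */

/*
 * Hypothetical helper: the set of bits the loop touches for one channel,
 * as seen through a u32 mask (bits 32 and above are truncated away).
 */
static uint32_t cleared_bits(unsigned int channel, unsigned int loop_bound)
{
	uint32_t mask = 0;

	for (unsigned int i = 0; i < loop_bound; i++)
		mask |= (uint32_t)(UINT64_C(1) << (i + channel * 8));
	return mask;
}
```

For channel 1, a bound of AIROHA_NUM_TX_RING yields 0xffffff00 (bits 8..31), whereas AIROHA_NUM_QOS_QUEUES yields 0x0000ff00 (bits 8..15).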
Fixes: ef1ca9271313 ("net: airoha: Add sched HTB offload support")
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Wayen Yan <win847@gmail.com>
Link: https://patch.msgid.link/178187479434.2400840.1312143943526335838@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
If the allocation of fp[i].tpa_info fails, the error path will not free
the struct bnx2x_fastpath allocated earlier, as it is not linked to the
bp structure yet. Fix that by linking it immediately after allocation.
Cc: stable@vger.kernel.org
Fixes: 15192a8cf8a8 ("bnx2x: Split the FP structure")
Signed-off-by: Abdun Nihaal <nihaal@cse.iitm.ac.in>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260620062402.89549-1-nihaal@cse.iitm.ac.in
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When CONFIG_IP_MULTIPLE_TABLES is enabled but no rule is added,
fib_lookup() performs route lookup directly on two tables.
Since the first lookup does not properly bail out, the result
of an error route in the merged local/main table could be
overwritten by another route in the default table:
# unshare -n
# ip link set lo up
# ip route add 192.168.0.0/24 dev lo table 253
# ip route add unreachable 192.168.0.0/24
# ip route get 192.168.0.1
192.168.0.1 dev lo table default uid 0
cache <local>
Once a random rule is added, the error route is respected:
# ip rule add table 0
# ip rule del table 0
# ip route get 192.168.0.1
RTNETLINK answers: No route to host
Let's fix the inconsistent behaviour.
Fixes: f4530fa574df ("ipv4: Avoid overhead when no custom FIB rules are installed.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260619212753.3367244-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Kernel selftests wait 1.25x of the promised stats refresh time
(as read from ethtool -c). bnxt reports 1sec by default, but
the stats update process has two steps. First the device DMAs the
new values, then the service task updates the full-width
SW counters. So the worst case delay is actually 2x.
Note that the behavior is different for ring stats and port stats.
Port stats are fetched synchronously by the service worker, so
there's no risk of doubling up the delay there.
The problem of stale stats impacts not only tests but real workloads
which monitor egress bandwidth of a NIC. The inaccuracy causes double
counting in the next cycle and spurious overload alarms.
Try to read from the DMA buffer more aggressively, to mitigate
timing issues between DMA and service task. The SW update should
be cheap.
Fixes: 51f307856b60 ("bnxt_en: Allow statistics DMA to be configurable using ethtool -C.")
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20260619191538.104165-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
fib6_nh_mtu_change() re-fetches idev via __in6_dev_get(arg->dev) and
dereferences idev->cnf.mtu6 without a NULL check. addrconf_ifdown()
clears dev->ip6_ptr with RCU_INIT_POINTER() after rt6_disable_ip() has
released tb6_lock, so the RA-driven MTU walk can observe a NULL idev and
oops. The caller rt6_mtu_change_route() guards its own __in6_dev_get(),
but this re-fetch is unguarded; nexthop-backed routes survive
addrconf_ifdown()'s flush, so the walk still reaches it after ip6_ptr is
nulled.
Return 0 when idev is NULL, matching rt6_mtu_change_route() and the
fib6_mtu() fix in commit 5ad509c1fdad ("ipv6: Fix null-ptr-deref in
fib6_mtu().").
Oops: general protection fault, ... KASAN: null-ptr-deref in range
[0x00000000000002a8-0x00000000000002af]
RIP: 0010:fib6_nh_mtu_change+0x203/0x990
rt6_mtu_change_route+0x141/0x1d0
__fib6_clean_all+0xd0/0x160
rt6_mtu_change+0xb4/0x100
ndisc_router_discovery+0x24b5/0x2cb0
icmpv6_rcv+0x12e9/0x1710
ipv6_rcv+0x39b/0x410
Fixes: c0b220cf7d80 ("ipv6: Refactor exception functions")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260619045334.2427073-1-xmei5@asu.edu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Each PPP control protocol (LCP/IPCP/IPV6CP) embedded in struct ppp
registers a timer via timer_setup(). That struct ppp is the
hdlc->state allocation, which detach_hdlc_protocol() frees with kfree()
in both teardown paths: unregister_hdlc_device() and the re-attach inside
attach_hdlc_protocol().
The ppp proto never registered a .detach callback, so
detach_hdlc_protocol() performs no timer synchronization before the
kfree(). The only cancel, timer_delete(&proto->timer) in ppp_cp_event(),
is partial (it does not wait for a running callback) and only runs on the
->CLOSED transition; ppp_stop()/ppp_close() do not sync either. A
ppp_timer callback already executing (blocked on ppp->lock) survives the
kfree and then dereferences proto->state / ppp->lock in freed memory,
leading to a use-after-free.
Fix this by adding a .detach helper that calls timer_shutdown_sync() on
every per-proto timer. detach_hdlc_protocol() invokes proto->detach(dev)
before kfree(hdlc->state), so timer_shutdown_sync()
now runs on both free paths.
timer_shutdown_sync() is used instead of timer_delete_sync() because the
keepalive path re-arms the timer through add_timer()/mod_timer() and
shutdown blocks any re-activation during teardown.
Initialize the per-protocol timers in ppp_ioctl() when the protocol is
attached, and remove the now-redundant timer_setup() from ppp_start(), so
that the timers are initialized exactly once at attach time and
ppp_timer_release() never operates on uninitialized timer_list
structures. attach_hdlc_protocol() uses kmalloc() (not kzalloc), so
struct ppp's protos[i].timer is uninitialized garbage until the first
timer_setup(); without this init-at-attach, attaching the PPP protocol
without ever bringing the device up would leave timer_shutdown_sync()
operating on uninitialized memory in .detach. Moving the init out of
ppp_start() (which only runs on NETDEV_UP) into the attach path makes the
initialization unconditional and avoids initializing the same timer_list
twice.
This bug was found by static analysis.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Fan Wu <fanwu01@zju.edu.cn>
Link: https://patch.msgid.link/20260617020518.116319-1-fanwu01@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
icssg_ndo_get_stats64() unconditionally calls emac_get_stat_by_name()
with FW PA stat names regardless of whether the PA stats block is
present on the hardware. emac_get_stat_by_name() already guards the
PA stats lookup with `if (emac->prueth->pa_stats)`; when that pointer
is NULL the lookup falls through to netdev_err() and returns -EINVAL.
Because ndo_get_stats64 is polled regularly by the networking stack
this produces thousands of log entries of the form:
icssg-prueth icssg1-eth end0: Invalid stats FW_RX_ERROR
A secondary consequence is that the int(-EINVAL) return value is
implicitly widened to a near-ULLONG_MAX unsigned value when accumulated
into the __u64 fields of rtnl_link_stats64, silently corrupting the
rx_errors, rx_dropped and tx_dropped counters reported by `ip -s link`.
Every other PA-aware code path in the driver is already guarded with
the same `if (emac->prueth->pa_stats)` check. Apply the same guard
here.
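The counter corruption described above follows from C's integer conversions and can be shown in isolation. The function names below are hypothetical stand-ins, not the driver's code. Only the -EINVAL return and the 64-bit unsigned accumulator come from the message.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical lookup: fails with -EINVAL when the PA stats block is absent. */
static int lookup_stat(int pa_stats_present, int value)
{
	return pa_stats_present ? value : -EINVAL;
}

/* Unguarded accumulation: a negative int is converted to a huge u64. */
static uint64_t rx_errors_unguarded(int pa_stats_present, int value)
{
	uint64_t rx_errors = 0;

	rx_errors += lookup_stat(pa_stats_present, value);
	return rx_errors;
}

/* Guarded accumulation: skip the lookup when the block is absent. */
static uint64_t rx_errors_guarded(int pa_stats_present, int value)
{
	uint64_t rx_errors = 0;

	if (pa_stats_present)
		rx_errors += lookup_stat(pa_stats_present, value);
	return rx_errors;
}
```

Without the guard, a missing PA stats block turns the counter into 2^64 - EINVAL. With the guard it stays at 0.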
Fixes: 0d15a26b247d ("net: ti: icssg-prueth: Add ICSSG FW Stats")
Signed-off-by: Philippe Schenker <philippe.schenker@impulsing.ch>
Reviewed-by: Simon Horman <horms@kernel.org>
Cc: danishanwar@ti.com
Cc: rogerq@kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260618093037.3448858-1-dev@pschenker.ch
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Tristan Madani says:
====================
Fix stale register bounds on LSM retval context load
From: Tristan Madani <tristan@talencesecurity.com>
check_mem_access() calls __mark_reg_s32_range() to narrow a register to
the LSM hook retval range, but the intersection preserves stale bounds
from prior instructions. Add mark_reg_unknown() before narrowing (same
pattern as the else branch) and a selftest that catches the mismatch.
Changes in v3:
- Add selftest demonstrating the issue (Eduard Zingerman)
- No code change in patch 1 from v2
====================
Link: https://patch.msgid.link/20260622230123.3695446-1-tristmd@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add a verifier test that catches the stale-bounds issue fixed in the
previous patch. The test sets r6 = 0 to create known bounds, then loads
the LSM hook return value into r6 from the context. Without the fix,
the verifier intersects the retval range with the stale bounds and
incorrectly narrows r6 to a single value, pruning the fall-through
branch as dead code and missing the div-by-zero.
Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Tristan Madani <tristan@talencesecurity.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260622230123.3695446-3-tristmd@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
When the BPF verifier processes a context load of an LSM hook return
value, it calls __mark_reg_s32_range() to narrow the register to the
hook's valid range. However, __mark_reg_s32_range() intersects the new
range with the register's existing bounds using max_t()/min_t() rather
than replacing them.
If the destination register carries stale bounds from a prior instruction
(e.g. BPF_MOV64_IMM), the intersection can produce a range narrower than
reality. The verifier then believes it knows the register's exact value,
while at runtime the actual hook return value is loaded, creating a
verifier/runtime mismatch that can be used to bypass BPF memory safety
checks.
The else branch already calls mark_reg_unknown() to reset register state
before any narrowing. Apply the same reset in the is_retval path so
stale bounds are cleared before __mark_reg_s32_range() intersects.
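The effect of intersecting with stale bounds can be illustrated with a toy range model. This is a sketch under stated assumptions, not the verifier's code: the struct and helpers are hypothetical, and the hook range [-4095, 0] used in the usage note is only an example value.

```c
#include <stdint.h>

/* Toy model of a register's signed 32-bit bounds. */
struct s32_range {
	int32_t min;
	int32_t max;
};

static struct s32_range make_range(int32_t min, int32_t max)
{
	struct s32_range r = { min, max };

	return r;
}

/* Narrow by intersection, as the max/min-based narrowing described above does. */
static struct s32_range intersect(struct s32_range cur, struct s32_range hook)
{
	return make_range(cur.min > hook.min ? cur.min : hook.min,
			  cur.max < hook.max ? cur.max : hook.max);
}

/* Model of the load: optionally forget the old bounds before narrowing. */
static struct s32_range load_retval(struct s32_range reg, struct s32_range hook,
				    int reset_first)
{
	if (reset_first)
		reg = make_range(INT32_MIN, INT32_MAX);	/* "unknown" register */
	return intersect(reg, hook);
}
```

Starting from stale bounds [0, 0] (for example after a move of the immediate 0), narrowing without a reset leaves [0, 0], although the loaded value may be anything in the hook's range. With the reset the result is the full hook range.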
Fixes: 5d99e198be27 ("bpf, lsm: Add check for BPF LSM return value")
Cc: stable@vger.kernel.org
Signed-off-by: Tristan Madani <tristan@talencesecurity.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260622230123.3695446-2-tristmd@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In ofdpa_port_fdb(), hash_del() only unlinks the node from the
hash table, but does not free it.
Fix this by adding kfree(found) after the !found == removing check,
where the pointer value is no longer needed.
Found by Coccinelle kfree script.
Cc: <stable+noautosel@kernel.org> # rocker is a test harness, it's never loaded on production systems
Signed-off-by: Ziran Zhang <zhangcoder@yeah.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260616013245.7098-1-zhangcoder@yeah.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Yiyang Chen says:
====================
bpf: Guard conntrack opts error writes
The conntrack lookup/allocation kfuncs expose an opts/opts__sz pair.
The verifier checks the caller-provided opts__sz range, but the wrappers
currently write opts->error after internal errors even when opts__sz is too
small to include that field.
Patch 1 writes opts->error only when opts__sz includes it, and uses a
single helper to fold ERR_PTR returns into the kfunc ABI result while keeping
the local nfct result variable in each wrapper.
Patch 2 adds a bpf_nf regression check that keeps a guard in opts->error
while passing opts__sz covering only netns_id.
The regression check follows the existing bpf_nf test shape. Before the
fix, the guard is overwritten with -EINVAL even though opts__sz covers only
the first four bytes of the options object. After the fix, the kfunc still
returns NULL for the invalid size, but the guard remains intact.
Validation, rebased and tested on bpf-next master e771677c937d
("Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd"):
git diff --check origin/master..HEAD: OK
scripts/checkpatch.pl --strict on 1/2 and 2/2: OK
make O=/root/ebpf-verifier-bug-detection/kernel-build/bpf-next \
net/netfilter/nf_conntrack_bpf.o: OK
Focused QEMU direct-runner against XDP and TC lookup/alloc paths:
unpatched bpf-next e771677c937d: guard overwritten with -EINVAL
patched v2 007dfd0341cd: guard preserved as 0x12345678
QEMU upstream bpf_nf selftest with CONFIG_NF_CONNTRACK_MARK,
CONFIG_NF_CONNTRACK_ZONES, and legacy iptables enabled:
./test_progs -t bpf_nf -vv: OK
git am of exported 1/2 and 2/2 on a fresh worktree at base: OK
range-diff between branch commits and git-am result: equivalent
Changes in v2:
- Rebased onto current bpf-next master.
- Reworked patch 1 to use bpf_ct_opts_result() for the ERR_PTR-to-NULL
conversion and guarded opts->error write, as suggested by Alexei.
- Kept the local nfct result variable in each wrapper before returning
through bpf_ct_opts_result().
- Added matching Fixes tags to the selftest patch so the regression test
can be backported with the fix.
v1: https://lore.kernel.org/bpf/cover.1781586477.git.chenyy23@mails.tsinghua.edu.cn/
====================
Link: https://patch.msgid.link/cover.1781765747.git.chenyy23@mails.tsinghua.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add a conntrack kfunc regression check for opts__sz values that do not
cover opts->error. The BPF program initializes opts->error with a guard
value, calls the lookup and allocation kfuncs with opts__sz set to
sizeof(opts->netns_id), and verifies that the guard is still intact
after the kfunc returns NULL.
Without the conntrack wrapper guard, the kfunc error path overwrites
that guard with -EINVAL even though the verifier checked only the first
four bytes of the options object.
Fixes: b4c2b9593a1c ("net/netfilter: Add unstable CT lookup helpers for XDP and TC-BPF")
Fixes: d7e79c97c00c ("net: netfilter: Add kfuncs to allocate and insert CT")
Signed-off-by: Yiyang Chen <chenyy23@mails.tsinghua.edu.cn>
Link: https://lore.kernel.org/r/007dfd0341cd84560e4795a2a951cc56d4adff1d.1781765747.git.chenyy23@mails.tsinghua.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The conntrack lookup and allocation kfuncs take an opts pointer
together with an opts__sz argument. The verifier checks only the memory
range described by opts__sz, but the wrappers unconditionally write
opts->error whenever the internal lookup or allocation helper returns an
error.
For an invalid size smaller than the end of opts->error, that write can
land outside the verifier-checked range. Keep returning NULL for invalid
arguments, but only report the error through opts->error when the
supplied size includes the field.
This preserves error reporting for the supported 12-byte and 16-byte
layouts, and for other invalid sizes that still include opts->error.
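The size-guarded error write can be sketched as below. The struct is a simplified, hypothetical layout holding only the two leading fields, and report_error() is an illustrative name rather than the helper used by the patch. The 0x12345678 guard value mirrors the one quoted in the series cover letter.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical options layout: only the two leading fields. */
struct ct_opts {
	int32_t netns_id;
	int32_t error;
};

/* First byte offset past the error field. */
#define OPTS_ERROR_END (offsetof(struct ct_opts, error) + sizeof(int32_t))

/* Report err through opts->error only if the caller-declared size covers it. */
static void report_error(struct ct_opts *opts, uint32_t opts_sz, int err)
{
	if (opts_sz >= OPTS_ERROR_END)
		opts->error = err;
}

/* Seed a guard in opts->error, simulate a failing call, return what is left. */
static int32_t error_after_failure(uint32_t opts_sz)
{
	struct ct_opts opts = { .netns_id = 0, .error = 0x12345678 };

	report_error(&opts, opts_sz, -EINVAL);
	return opts.error;
}
```

A declared size of 4 bytes covers only netns_id, so the guard survives. Any size of 8 bytes or more includes the error field and receives -EINVAL.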
Fixes: b4c2b9593a1c ("net/netfilter: Add unstable CT lookup helpers for XDP and TC-BPF")
Fixes: d7e79c97c00c ("net: netfilter: Add kfuncs to allocate and insert CT")
Signed-off-by: Yiyang Chen <chenyy23@mails.tsinghua.edu.cn>
Link: https://lore.kernel.org/r/9535e781fe14449b1d4e9bbc3baa7566a93bf512.1781765747.git.chenyy23@mails.tsinghua.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
[BUG]
Our fuzz testing triggered a blkcg use-after-free issue:
BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0
Call Trace:
...
blkcg_deactivate_policy+0x244/0x4d0
ioc_rqos_exit+0x44/0xe0
rq_qos_exit+0xba/0x120
__del_gendisk+0x50b/0x800
del_gendisk+0xff/0x190
...
[CAUSE]
process1 process2
cgroup_rmdir
...
css_killed_work_fn
offline_css
...
blkcg_destroy_blkgs
...
__blkg_release
css_put(&blkg->blkcg->css)
blkg_free
INIT_WORK(xxx, blkg_free_workfn)
schedule_work
css_put
...
blkcg_css_free
kfree(blkcg)--------blkcg has been freed!!!
====================================schedule_work
blkg_free_workfn
__del_gendisk
rq_qos_exit
ioc_rqos_exit
blkcg_deactivate_policy
mutex_lock(&q->blkcg_mutex)
spin_lock_irq(&q->queue_lock)
list_for_each_entry(blkg, xxx)
blkcg = blkg->blkcg
spin_lock(&blkcg->lock)-------UAF!!!
mutex_lock(&q->blkcg_mutex)
spin_lock_irq(&q->queue_lock)
/* Only then is the blkg removed from the list */
list_del_init(&blkg->q_node)
As a result, a blkg can still be reachable through q->blkg_list while
its ->blkcg has already been freed.
[FIX]
Fix this by deferring the blkcg css_put() until after the blkg has been
unlinked from q->blkg_list in blkg_free_workfn(). This ensures that the
blkcg outlives every blkg still reachable through q->blkg_list, so any
iterator holding q->queue_lock is guaranteed to observe a valid
blkg->blkcg.
While at it, move css_tryget_online() from blkg_create() into blkg_alloc()
so that the css reference is owned by the alloc/free pair rather than
straddling layers:
blkg_alloc() <-> blkg_free()
blkg_create() <-> blkg_destroy()
Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Suggested-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai@fygo.io>
Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>
Link: https://patch.msgid.link/20260616011746.2451461-1-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|