| Age | Commit message (Collapse) | Author |
|
Provide the actual decision function, which decides whether a time slice
extension is granted in the exit to user mode path when NEED_RESCHED is
evaluated.
The decision is made in two stages. First an inline quick check to avoid
going into the actual decision function. This checks whether:
#1 the functionality is enabled
#2 the exit is a return from interrupt to user mode
#3 any TIF bit, which causes extra work is set. That includes TIF_RSEQ,
which means the task was already scheduled out.
The slow path, which implements the actual user space ABI, is invoked
when:
A) #1 is true, #2 is true and #3 is false
It checks whether user space requested a slice extension by setting
the request bit in the rseq slice_ctrl field. If so, it grants the
extension and stores the slice expiry time, so that the actual exit
code can double check whether the slice is already exhausted before
going back.
B) #1 - #3 are true _and_ a slice extension was granted in a previous
loop iteration
In this case the grant is revoked.
In case that the user space access faults or invalid state is detected, the
task is terminated with SIGSEGV.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155709.195303303@linutronix.de
|
|
When a time slice extension was granted in the need_resched() check on exit
to user space, the task can still be scheduled out in one of the other
pending work items. When it gets scheduled back in, and need_resched() is
not set, then the stale grant would be preserved, which is just wrong.
RSEQ already keeps track of that and sets TIF_RSEQ, which invokes the
critical section and ID update mechanisms.
Utilize them and clear the user space slice control member of struct rseq
unconditionally within the existing user access sections. That's just an
unconditional store more in that path.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251215155709.131081527@linutronix.de
|
|
If a time slice extension is granted and the reschedule delayed, the kernel
has to ensure that user space cannot abuse the extension and exceed the
maximum granted time.
It was suggested to implement this via the existing hrtick() timer in the
scheduler, but that turned out to be problematic for several reasons:
1) It creates a dependency on CONFIG_SCHED_HRTICK, which can be disabled
independently of CONFIG_HIGHRES_TIMERS
2) HRTICK usage in the scheduler can be runtime disabled or is only used
for certain aspects of scheduling.
3) The function is calling into the scheduler code and that might have
unexpected consequences when this is invoked due to a time slice
enforcement expiry. Especially when the task managed to clear the
grant via sched_yield(0).
It would be possible to address #2 and #3 by storing state in the
scheduler, but that is extra complexity and fragility for no value.
Implement a dedicated per CPU hrtimer instead, which is solely used for the
purpose of time slice enforcement.
The timer is armed when an extension was granted right before actually
returning to user mode in rseq_exit_to_user_mode_restart().
It is disarmed, when the task relinquishes the CPU. This is expensive as
the timer is probably the first expiring timer on the CPU, which means it
has to reprogram the hardware. But that's less expensive than going through
a full hrtimer interrupt cycle for nothing.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251215155709.068329497@linutronix.de
|
|
The kernel sets SYSCALL_WORK_RSEQ_SLICE when it grants a time slice
extension. This allows to handle the rseq_slice_yield() syscall, which is
used by user space to relinquish the CPU after finishing the critical
section for which it requested an extension.
In case the kernel state is still GRANTED, the kernel resets both kernel
and user space state with a set of sanity checks. If the kernel state is
already cleared, then this raced against the timer or some other interrupt
and just clears the work bit.
Doing it in syscall entry work allows to catch misbehaving user space,
which issues an arbitrary syscall, i.e. not rseq_slice_yield(), from the
critical section. Contrary to the initial strict requirement to use
rseq_slice_yield() arbitrary syscalls are not considered a violation of the
ABI contract anymore to allow onion architecture applications, which cannot
control the code inside a critical section, to utilize this as well.
If the code detects inconsistent user space that result in a SIGSEGV for
the application.
If the grant was still active and the task was not preempted yet, the work
code reschedules immediately before continuing through the syscall.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155709.005777059@linutronix.de
|
|
Provide a new syscall which has the only purpose to yield the CPU after the
kernel granted a time slice extension.
sched_yield() is not suitable for that because it unconditionally
schedules, but the end of the time slice extension is not required to
schedule when the task was already preempted. This also allows to have a
strict check for termination to catch user space invoking random syscalls
including sched_yield() from a time slice extension region.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20251215155708.929634896@linutronix.de
|
|
Implement a prctl() so that tasks can enable the time slice extension
mechanism. This fails, when time slice extensions are disabled at compile
time or on the kernel command line and when no rseq pointer is registered
in the kernel.
That allows to implement a single trivial check in the exit to user mode
hotpath, to decide whether the whole mechanism needs to be invoked.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.858717691@linutronix.de
|
|
Extend the quick statistics with time slice specific fields.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.795202254@linutronix.de
|
|
Guard the time slice extension functionality with a static key, which can
be disabled on the kernel command line.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.733429292@linutronix.de
|
|
Aside of a Kconfig knob add the following items:
- Two flag bits for the rseq user space ABI, which allow user space to
query the availability and enablement without a syscall.
- A new member to the user space ABI struct rseq, which is going to be
used to communicate request and grant between kernel and user space.
- A rseq state struct to hold the kernel state of this
- Documentation of the new mechanism
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.669472597@linutronix.de
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
Subject: netfilter: updates for net-next
1) Speed up nftables transactions after earlier transaction failed.
Due to a (harmeless) bug we remained in slow paranoia mode until
a successful transaction completes.
2) Allow generic tracker to resolve clashes, this avoids very rare
packet drops. From Yuto Hamaguchi.
3) Increase the cleanup budget to 64 entries in nf_conncount to reap
more entries in one go, from Fernando Fernandez Mancera.
4) Allow icmp trackers to resolve clashes, this avoids very rare
initial packet drop with test cases that have high-frequency pings.
After this all trackers except tcp and sctp allow clash resolution.
5) Disentangle netfilter headers, don't include nftables/xtables headers
in subsystems that are unrelated.
6) Don't rely on implicit includes coming from nf_conntrack_proto_gre.h.
7) Allow nfnetlink_queue nfq instance struct to get accounted via memcg,
from Scott Mitchell.
8) Reject bogus xt target/match data upfront via netlink policiy in
nft_compat interface rather than relying on x_tables API to do it.
9) Fix nf_conncount breakage when trying to limit loopback flows via
prerouting rule, from Fernando Fernandez Mancera.
This is a recent breakage but not seen as urgent enough to rush this
via net tree at this late stage in development cycle.
10) Fix a possible off-by-one when parsing tcp option in xtables tcpmss
match. Also handled via -next due to late stage in development
cycle.
* tag 'nf-next-26-01-20' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: xt_tcpmss: check remaining length before reading optlen
netfilter: nf_conncount: fix tracking of connections from localhost
netfilter: nft_compat: add more restrictions on netlink attributes
netfilter: nfnetlink_queue: nfqnl_instance GFP_ATOMIC -> GFP_KERNEL_ACCOUNT allocation
netfilter: nf_conntrack: don't rely on implicit includes
netfilter: don't include xt and nftables.h in unrelated subsystems
netfilter: nf_conntrack: enable icmp clash support
netfilter: nf_conncount: increase the connection clean up limit to 64
netfilter: nf_conntrack: Add allow_clash to generic protocol handler
netfilter: nf_tables: reset table validation state on abort
====================
Link: https://patch.msgid.link/20260120191803.22208-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Some drivers of MAC + tightly integrated PCS (example: SJA1105 + XPCS
covered by same reset domain) need to perform resets at runtime.
The reset is triggered by the MAC driver, and it needs to restore its
and the PCS' registers, all invisible to phylink.
However, there is a desire to simplify the API through which the MAC and
the PCS interact, so this becomes challenging.
Phylink holds all the necessary state to help with this operation, and
can offer two helpers which walk the MAC and PCS drivers again through
the callbacks required during a destructive reset operation. The
procedure is as follows:
Before reset, MAC driver calls phylink_replay_link_begin():
- Triggers phylink mac_link_down() and pcs_link_down() methods
After reset, MAC driver calls phylink_replay_link_end():
- Triggers phylink mac_config() -> pcs_config() -> mac_link_up() ->
pcs_link_up() methods.
MAC and PCS registers are restored with no other custom driver code.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20260119121954.1624535-3-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The Mediatek LynxI PCS is used from the MT7530 DSA driver (where it does
not have an OF presence) and from mtk_eth_soc, where it does
(Documentation/devicetree/bindings/net/pcs/mediatek,sgmiisys.yaml
informs of a combined clock provider + SGMII PCS "SGMIISYS" syscon
block).
Currently, mtk_eth_soc parses the SGMIISYS OF node for the
"mediatek,pnswap" property and sets a bit in the "flags" argument of
mtk_pcs_lynxi_create() if set.
I'd like to deprecate "mediatek,pnswap" in favour of a property which
takes the current phy-mode into consideration. But this is only known at
mtk_pcs_lynxi_config() time, and not known at mtk_pcs_lynxi_create(),
when the SGMIISYS OF node is parsed.
To achieve that, we must pass the OF node of the PCS, if it exists, to
mtk_pcs_lynxi_create(), and let the PCS take a reference on it and
handle property parsing whenever it wants.
Use the fwnode API which is more general than OF (in case we ever need
to describe the PCS using some other format). This API should be NULL
tolerant, so add no particular tests for the mt7530 case.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20260119091220.1493761-5-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
clang does not inline this helper in GRO fast path.
We can save space and cpu cycles.
$ scripts/bloat-o-meter -t vmlinux.0 vmlinux.1
add/remove: 0/2 grow/shrink: 2/0 up/down: 156/-218 (-62)
Function old new delta
tcp6_gro_complete 227 311 +84
tcp4_gro_complete 325 397 +72
__pfx___skb_incr_checksum_unnecessary 32 - -32
__skb_incr_checksum_unnecessary 186 - -186
Total: Before=22592724, After=22592662, chg -0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260120164903.1912995-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We can change tcp_rsk() and inet_rsk() to propagate their argument
const qualifier thanks to container_of_const().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260120125353.1470456-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
'add-devm_clk_bulk_get_optional_enable-helper-and-use-in-axi-ethernet-driver'
Suraj Gupta says:
====================
Add devm_clk_bulk_get_optional_enable() helper and use in AXI Ethernet driver
This patch series introduces a new managed clock framework helper function
and demonstrates its usage in AXI ethernet driver.
Device drivers frequently need to get optional bulk clocks, prepare them,
and enable them during probe, while ensuring automatic cleanup on device
unbind. Currently, this requires three separate operations with manual
cleanup handling.
The new devm_clk_bulk_get_optional_enable() helper combines these
operations into a single managed call, eliminating boilerplate code and
following the established pattern of devm_clk_bulk_get_all_enabled().
====================
Link: https://patch.msgid.link/20260116192725.972966-1-suraj.gupta2@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a new managed clock framework helper function that combines getting
optional bulk clocks and enabling them in a single operation.
The devm_clk_bulk_get_optional_enable() function simplifies the common
pattern where drivers need to get optional bulk clocks, prepare and enable
them, and have them automatically disabled/unprepared and freed when the
device is unbound.
This new API follows the established pattern of
devm_clk_bulk_get_all_enabled() and reduces boilerplate code in drivers
that manage multiple optional clocks.
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Suraj Gupta <suraj.gupta2@amd.com>
Reviewed-by: Brian Masney <bmasney@redhat.com>
Acked-by: Stephen Boyd <sboyd@kernel.org>
Link: https://patch.msgid.link/20260116192725.972966-2-suraj.gupta2@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
A previous commit got rid of any use of this member, but forgot to
remove it. Kill it.
Fixes: f4bb2f65bb81 ("io_uring/eventfd: move ctx->evfd_last_cq_tail into io_ev_fd")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into soc/drivers
Qualcomm driver updates for v6.20
Support multiple wait queues in the SCM firmware interface and provide
discovery of the wait queue interrupt to deal with the cases where
bootloader didn't patch the DeviceTree with the IRQ information.
Refactor the MDT loader and the SCM driver's peripheral authentication
service interface and introduce support for passing a remoteproc
resource table to the firmware. The remoteproc patches that uses this
and uses this to configure the IOMMU are included here due to
bidirectional dependencies. The end result is remoteproc support on the
Glymur platform.
Enable QSEECOM and thereby UEFI variable access, on the Surface Pro 11.
Make the QMI interface endianness aware, to support ath1Xk on big endian
machines.
Add the Glymur support in LLCC driver.
* tag 'qcom-drivers-for-6.20' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux: (33 commits)
soc: qcom: preserve CPU endianness for QMI_DATA_LEN
soc: qcom: fix QMI encoding/decoding for basic elements
soc: qcom: check QMI basic element error codes
soc: qcom: ubwc: add missing include
remoteproc: qcom: pas: Enable Secure PAS support with IOMMU managed by Linux
remoteproc: pas: Extend parse_fw callback to fetch resources via SMC call
firmware: qcom_scm: Add qcom_scm_pas_get_rsc_table() to get resource table
firmware: qcom_scm: Add SHM bridge handling for PAS when running without QHEE
firmware: qcom_scm: Refactor qcom_scm_pas_init_image()
firmware: qcom_scm: Add a prep version of auth_and_reset function
soc: qcom: mdtloader: Remove qcom_mdt_pas_init() from exported symbols
soc: qcom: mdtloader: Add PAS context aware qcom_mdt_pas_load() function
remoteproc: pas: Replace metadata context with PAS context structure
firmware: qcom_scm: Introduce PAS context allocator helper function
firmware: qcom_scm: Rename peripheral as pas_id
firmware: qcom_scm: Remove redundant piece of code
dt-bindings: remoteproc: qcom,pas: Add iommus property
soc: qcom: cmd-db: Use devm_memremap() to fix memory leak in cmd_db_dev_probe
soc: qcom: pmic_glink_altmode: Consume TBT3/USB4 mode notifications
dt-bindings: qcom,pdc: document the Milos Power Domain Controller
...
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux into soc/drivers
i.MX drivers changes for 6.20:
- A few changes from Peng Fan adding dump syslog support for i.MX
System Manager firmware driver, cleaning up soc-imx9 driver, fixing
error handling for soc-imx8m driver
* tag 'imx-drivers-6.20' of https://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux:
soc: imx8m: Fix error handling for clk_prepare_enable()
soc: imx: Spport i.MX9[4,52]
soc: imx: Use dev_err_probe() for i.MX9
soc: imx: Use device-managed APIs for i.MX9
firmware: imx: sm-misc: Dump syslog info
firmware: arm_scmi: imx: Support getting syslog of MISC protocol
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux into soc/drivers
Samsung SoC drivers for v6.20
1. Several improvements in Exynos ChipID Socinfo driver and finally
adding Google GS101 SoC support.
2. Few cleanups from old code.
3. Documenting Axis Artpec-9 SoC PMU (Power Management Unit).
* tag 'samsung-drivers-6.20' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux:
ARM: s3c: remove a leftover hwmon-s3c.h header file
dt-bindings: soc: samsung: exynos-pmu: Drop unnecessary select schema
soc: samsung: exynos-chipid: add google,gs101-otp support
soc: samsung: exynos-chipid: downgrade dev_info to dev_dbg for soc info
soc: samsung: exynos-chipid: rename method
dt-bindings: nvmem: add google,gs101-otp
soc: samsung: exynos-chipid: use dev_err_probe where appropiate
soc: samsung: exynos-chipid: use devm action to unregister soc device
dt-bindings: samsung: exynos-pmu: Add compatible for ARTPEC-9 SoC
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jenswi/linux-tee into soc/drivers
TEE sysfs for 6.20
- Add an optional generic sysfs attribute for TEE revision
- Implement revision reporting for OP-TEE using both SMC and FF-A ABIs
* tag 'tee-sysfs-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/jenswi/linux-tee:
tee: optee: store OS revision for TEE core
tee: add revision sysfs attribute
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jenswi/linux-tee into soc/drivers
TEE bus callback for 6.20
- Move from generic device_driver to TEE bus-specific callbacks
- Add module_tee_client_driver() and registration helpers to reduce
boilerplate
- Convert several client drivers (TPM, KEYS, firmware, EFI, hwrng,
and RTC)
- Update documentation and fix kernel-doc warnings
* tag 'tee-bus-callback-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/jenswi/linux-tee:
tpm/tpm_ftpm_tee: Fix kdoc after function renames
tpm/tpm_ftpm_tee: Make use of tee bus methods
tpm/tpm_ftpm_tee: Make use of tee specific driver registration
KEYS: trusted: Make use of tee bus methods
KEYS: trusted: Migrate to use tee specific driver registration function
firmware: tee_bnxt: Make use of tee bus methods
firmware: tee_bnxt: Make use of module_tee_client_driver()
firmware: arm_scmi: Make use of tee bus methods
firmware: arm_scmi: optee: Make use of module_tee_client_driver()
efi: stmm: Make use of tee bus methods
efi: stmm: Make use of module_tee_client_driver()
hwrng: optee - Make use of tee bus methods
hwrng: optee - Make use of module_tee_client_driver()
rtc: optee: Make use of tee bus methods
rtc: optee: Migrate to use tee specific driver registration function
tee: Adapt documentation to cover recent additions
tee: Add probe, remove and shutdown bus callbacks to tee_client_driver
tee: Add some helpers to reduce boilerplate for tee client drivers
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-devel into soc/drivers
Renesas driver updates for v6.20 (take two)
- Add and use for_each_of_imap_item() iterator,
- Add support for the RZ/N1 GPIO Interrupt Multiplexer.
* tag 'renesas-drivers-for-v6.20-tag2' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-devel:
soc: renesas: Add support for RZ/N1 GPIO Interrupt Multiplexer
irqchip/renesas-rza1: Use for_each_of_imap_item iterator
irqchip/ls-extirq: Use for_each_of_imap_item iterator
of: unittest: Add a test case for for_each_of_imap_item iterator
of/irq: Introduce for_each_of_imap_item
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
This reverts commit 5f0bf80cc5e04d31eeb201683e0b477c24bd18e7.
This was asked to be reverted as it is not the correct way to do this.
Fixes: 5f0bf80cc5e0 ("mmc: rtsx_pci: add quirk to disable MMC_CAP_AGGRESSIVE_PM for RTS525A")
Cc: Matthew Schwartz <matthew.schwartz@linux.dev>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This reverts commit eac85fbd0867c25ac517f58fae401d65c627edff.
This is not the correct change, so revert it for now.
Fixes: eac85fbd0867 ("mmc: rtsx: reset power state on suspend")
Cc: Matthew Schwartz <matthew.schwartz@linux.dev>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Add a helper to allow an existing bio to be resubmitted without
having to re-add the payload.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
The IOMMU code operates on physical addresses which can be outside
of system RAM.
Add a new function page_ext_get_from_phys() to abstract the logic of
checking the address and returning the page_ext.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
The PMF driver retrieves NPU metrics data from the PMFW. Introduce a new
interface to make NPU metrics accessible to other drivers like AMDXDNA
driver, which can access and utilize this information as needed.
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Co-developed-by: Patil Rajesh Reddy <Patil.Reddy@amd.com>
Signed-off-by: Patil Rajesh Reddy <Patil.Reddy@amd.com>
Signed-off-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
[lizhi: save return value of is_npu_metrics_supported() and return it]
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260115173448.403826-1-lizhi.hou@amd.com
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
|
|
The hibernate resume sequence involves loading a resume kernel that is just
used for loading the hibernate image before shifting back to the existing
kernel.
During that hibernate resume sequence the resume kernel may have loaded
the ccp driver. If this happens the resume kernel will also have called
PSP_CMD_TEE_RING_INIT but it will never have called
PSP_CMD_TEE_RING_DESTROY.
This is problematic because the existing kernel needs to re-initialize the
ring. One could argue that the existing kernel should call destroy
as part of restore() but there is no guarantee that the resume kernel did
or didn't load the ccp driver. There is also no callback opportunity for
the resume kernel to destroy before handing back control to the existing
kernel.
Similar problems could potentially exist with the use of kdump and
crash handling. I actually reproduced this issue like this:
1) rmmod ccp
2) hibernate the system
3) resume the system
4) modprobe ccp
The resume kernel will have loaded ccp but never destroyed and then when
I try to modprobe it fails.
Because of these possible cases add a flow that checks the error code from
the PSP_CMD_TEE_RING_INIT call and tries to call PSP_CMD_TEE_RING_DESTROY
if it failed. If this succeeds then call PSP_CMD_TEE_RING_INIT again.
Fixes: f892a21f51162 ("crypto: ccp - use generic power management")
Reported-by: Lars Francke <lars.francke@gmail.com>
Closes: https://lore.kernel.org/platform-driver-x86/CAD-Ua_gfJnQSo8ucS_7ZwzuhoBRJ14zXP7s8b-zX3ZcxcyWePw@mail.gmail.com/
Tested-by: Yijun Shen <Yijun.Shen@Dell.com>
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Reviewed-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://patch.msgid.link/20260116041132.153674-6-superm1@kernel.org
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
|
|
The HDMI 2.1 specification introduced the Fixed Rate Link (FRL) mode,
aiming to replace the older Transition-Minimized Differential Signaling
(TMDS) mode used in previous HDMI versions to support much higher
bandwidths (up to 48 Gbps) for modern video and audio formats.
FRL has been designed to support ultra high resolution formats at high
refresh rates like 8K@60Hz or 4K@120Hz, and eliminates the need for
dynamic bandwidth adjustments, which reduces latency. It operates with
3 or 4 lanes at different link rates: 3Gbps, 6Gbps, 8Gbps, 10Gbps or
12Gbps.
Add support for configuring the FRL mode for HDMI PHYs.
Signed-off-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com>
Link: https://patch.msgid.link/20260113-phy-hdptx-frl-v6-1-8d5f97419c0b@collabora.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
|
|
__sprint_symbol() might access an invalid pointer when
kallsyms_lookup_buildid() returns a symbol found by
ftrace_mod_address_lookup().
The ftrace lookup function must set both @modname and @modbuildid the same
way as module_address_lookup().
Link: https://lkml.kernel.org/r/20251128135920.217303-7-pmladek@suse.com
Fixes: 9294523e3768 ("module: add printk formats to add module build ID to stacktraces")
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
bpf_address_lookup() has been used only in kallsyms_lookup_buildid(). It
was supposed to set @modname and @modbuildid when the symbol was in a
module.
But it always just cleared @modname because BPF symbols were never in a
module. And it did not clear @modbuildid because the pointer was not
passed.
The wrapper is no longer needed. Both @modname and @modbuildid are now
always initialized to NULL in kallsyms_lookup_buildid().
Remove the wrapper and rename __bpf_address_lookup() to
bpf_address_lookup() because this variant is used everywhere.
[akpm@linux-foundation.org: fix loongarch]
Link: https://lkml.kernel.org/r/20251128135920.217303-6-pmladek@suse.com
Fixes: 9294523e3768 ("module: add printk formats to add module build ID to stacktraces")
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Aaron Tomlin <atomlin@atomlin.com>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add a helper function for reading the optional "build_id" member of struct
module. It is going to be used also in ftrace_mod_address_lookup().
Use "#ifdef" instead of "#if IS_ENABLED()" to match the declaration of the
optional field in struct module.
Link: https://lkml.kernel.org/r/20251128135920.217303-4-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Cc: Aaron Tomlin <atomlin@atomlin.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Add ARRAY_END(), and use it to fix off-by-one bugs", v6.
Add ARRAY_END(), and use it to fix off-by-one bugs
ARRAY_END() is a macro to calculate a pointer to one past the last element
of an array argument. This is a very common pointer, which is used to
iterate over all elements of an array:
for (T *p = a; p < ARRAY_END(a); p++)
...
Of course, this pointer should never be dereferenced. A pointer one past
the last element of an array should not be dereferenced; it's perfectly
fine to hold such a pointer --and a good thing to do--, but the only thing
it should be used for is comparing it with other pointers derived from the
same array.
Due to how special these pointers are, it would be good to use consistent
naming. It's common to name such a pointer 'end' --in fact, we have many
such cases in the kernel--. C++ even standardized this name with
std::end(). Let's try naming such pointers 'end', and try also avoid
using 'end' for pointers that are not the result of ARRAY_END().
It has been incorrectly suggested that these pointers are dangerous, and
that they should never be used, suggesting to use something like
#define ARRAY_LAST(a) ((a) + ARRAY_SIZE(a) - 1)
for (T *p = a; p <= ARRAY_LAST(a); p++)
...
This is bogus, as it doesn't scale down to arrays of 0 elements. In the
case of an array of 0 elements, ARRAY_LAST() would underflow the pointer,
which not only it can't be dereferenced, it can't even be held (it
produces Undefined Behavior). That would be a footgun. Such arrays don't
exist per the ISO C standard; however, GCC supports them as an extension
(with partial support, though; GCC has a few bugs which need to be fixed).
This patch set fixes a few places where it was intended to use the array
end (that is, one past the last element), but accidentally a pointer to
the last element was used instead, thus wasting one byte.
It also replaces other places where the array end was correctly calculated
with ARRAY_SIZE(), by using the simpler ARRAY_END().
Also, there was one drivers/ file that already defined this macro. We
remove that definition, to not conflict with this one.
This patch (of 4):
ARRAY_END() returns a pointer one past the end of the last element in the
array argument. This pointer is useful for iterating over the elements of
an array:
for (T *p = a, p < ARRAY_END(a); p++)
...
Link: https://lkml.kernel.org/r/cover.1765449750.git.alx@kernel.org
Link: https://lkml.kernel.org/r/5973cfb674192bc8e533485dbfb54e3062896be1.1765449750.git.alx@kernel.org
Signed-off-by: Alejandro Colomar <alx@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Christopher Bazley <chris.bazley.wg14@gmail.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Marco Elver <elver@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Remove <linux/hex.h> from <linux/kernel.h> and update all users/callers of
hex.h interfaces to directly #include <linux/hex.h> as part of the process
of putting kernel.h on a diet.
Removing hex.h from kernel.h means that 36K C source files don't have to
pay the price of parsing hex.h for the roughly 120 C source files that
need it.
This change has been build-tested with allmodconfig on most ARCHes. Also,
all users/callers of <linux/hex.h> in the entire source tree have been
updated if needed (if not already #included).
Link: https://lkml.kernel.org/r/20251215005206.2362276-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Several helper functions for managing node and zone states have become
obsolete and no longer have any callers within the kernel.
inc_node_state()
inc_zone_state()
dec_zone_state()
This commit removes the dead code.
Link: https://lkml.kernel.org/r/20251225210213.2553-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This reintroduces a concept removed by: commit d6cb41cc44c6 ("mm, hugetlb:
remove hugepages_treat_as_movable sysctl")
This sysctl provides flexibility between ZONE_MOVABLE use cases:
1) onlining memory in ZONE_MOVABLE to maintain hotplug compatibility
2) onlining memory in ZONE_MOVABLE to make hugepage allocate reliable
When ZONE_MOVABLE is used to make huge page allocation more reliable,
disallowing gigantic pages memory in this region is pointless. If hotplug
is not a requirement, we can loosen the restrictions to allow 1GB gigantic
pages in ZONE_MOVABLE.
Since 1GB can be difficult to migrate / has impacts on compaction /
defragmentation, we don't enable this by default. Notably, 1GB pages can
only be migrated if another 1GB page is available - so hot-unplug will
fail if such a page cannot be found.
However, since there are scenarios where gigantic pages are migratable, we
should allow use of these on movable regions.
When not valid 1GB is available for migration, hot-unplug will retry
indefinitely (or until interrupted). For example:
echo 0 > node0/hugepages/..-1GB/nr_hugepages # clear node0 1GB pages
echo 1 > node1/hugepages/..-1GB/nr_hugepages # reserve node1 1GB page
./alloc_huge_node1 & # Allocate a 1GB page on node1
./node1_offline & # attempt to offline all node1 memory
echo 1 > node0/hugepages/..-1GB/nr_hugepages # reserve node0 1GB page
In this example, node1_offline will block indefinitely until the final
step, when a node0 1GB page is made available.
Note: Boot-time CMA is not possible for driver-managed hotplug memory, as
CMA requires the memory to be registered as SystemRAM at boot time.
Additionally, 1GB huge pages are not supported by THP.
Link: https://lkml.kernel.org/r/20251221125603.2364174-1-gourry@gourry.net
Signed-off-by: Gregory Price <gourry@gourry.net>
Suggested-by: David Rientjes <rientjes@google.com>
Link: https://lore.kernel.org/all/20180201193132.Hk7vI_xaU%25akpm@linux-foundation.org/
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "David Hildenbrand (Red Hat)" <david@kernel.org>
Cc: Gregory Price <gourry@gourry.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
commit d24062914837 ("fork: use __mt_dup() to duplicate maple tree in
dup_mmap()"), removed the only user and mas_expected_entries has been
removed, since commit e3852a1213ffc ("maple_tree: Drop bulk insert
support"). Also cleanup the mas_expected_entries in maple_tree.h.
No functional change.
Link: https://lkml.kernel.org/r/20251106110929.3522073-1-guanwentao@uniontech.com
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Cheng Nie <niecheng1@uniontech.com>
Cc: Guan Wentao <guanwentao@uniontech.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jann Horn <jannh@google.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The current description of contexts where it's invalid to make GFP_ATOMIC
and GFP_NOWAIT calls is rather vague.
Replace this with a direct description of the actual contexts of concern
and refer to the RT docs where this is explained more discursively.
While rejigging this prose, also move the documentation of GFP_NOWAIT to
the GFP_NOWAIT section.
Link: https://lore.kernel.org/all/d912480a-5229-4efe-9336-b31acded30f5@suse.cz/
Link: https://lkml.kernel.org/r/20251219-b4-gfp_atomic-comment-v2-1-4c4ce274c2b6@google.com
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use bool for 'hashdist' and replace simple_strtoul() with kstrtobool() for
parsing the 'hashdist=' boot parameter. Unlike simple_strtoul(), which
returns an unsigned long, kstrtobool() converts the string directly to
bool and avoids implicit casting.
Check the return value of kstrtobool() and reject invalid values. This
adds error handling while preserving behavior for existing values, and
removes use of the deprecated simple_strtoul() helper. The current code
silently sets 'hashdist = 0' if parsing fails, instead of leaving the
default value (HASHDIST_DEFAULT) unchanged.
Additionally, kstrtobool() accepts common boolean strings such as "on" and
"off".
Link: https://lkml.kernel.org/r/20251217110214.50807-1-thorsten.blum@linux.dev
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
struct maple_alloc is deprecated after the maple tree conversion to
sheaves, remove the references from the header file.
Link: https://lkml.kernel.org/r/20251203224511.469978-1-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Laptop mode was introduced to save battery, by delaying and consolidating
writes and thereby maximize the time rotating hard drives wouldn't have to
spin.
Luckily, rotating hard drives, with their high spin-up times and power
draw, are a thing of the past for battery-powered devices. Reclaim has
also since changed to not write single filesystem pages anymore, and
regular filesystem writeback is lumpy by design.
The juice doesn't appear worth the squeeze anymore. The footprint of the
feature is small, but nevertheless it's a complicating factor in mm,
block, filesystems. Developers don't think about it, and it likely hasn't
been tested with new reclaim and writeback changes in years.
Let's sunset it. Keep the sysctl with a deprecation warning around for a
few more cycles, but remove all functionality behind it.
[akpm@linux-foundation.org: fix Documentation/admin-guide/laptops/index.rst]
Link: https://lkml.kernel.org/r/20251216185201.GH905277@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
There are DAMOS use cases that require user-space centric control of its
activation and deactivation. Having the control plane on the user-space,
or using DAMOS as a way for monitoring results collection are such
examples.
DAMON parameters online commit, DAMOS quotas and watermarks can be useful
for this purpose. However, those features work only at the
sub-DAMON-snapshot level. In some use cases, the DAMON-snapshot level
control is required. For example, in DAMOS-based monitoring results
collection use case, the user online-installs a DAMOS scheme with
DAMOS_STAT action, wait it be applied to whole regions of a single
DAMON-snapshot, retrieves the stats and tried regions information, and
online-uninstall the scheme. It is efficient to ensure the lifetime of
the scheme as no more no less one snapshot consumption.
To support such use cases, introduce a new DAMOS core API per-scheme
parameter, namely max_nr_snapshots. As the name implies, it is the upper
limit of nr_snapshots, which is a DAMOS stat that represents the number of
DAMON-snapshots that the scheme has fully applied. If the limit is set
with a non-zero value and nr_snapshots reaches or exceeds the limit, the
scheme is deactivated.
Link: https://lkml.kernel.org/r/20251216080128.42991-8-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 0e92c2ee9f45 ("mm/damon/schemes: account scheme actions that
successfully applied") has replaced ->stat_count and ->stat_sz of 'struct
damos' with ->stat. The commit mistakenly did not update the related
kernel doc comment, though. Update the comment.
Link: https://lkml.kernel.org/r/20251216080128.42991-7-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon: introduce {,max_}nr_snapshots and tracepoint for
damos stats".
Introduce three changes for improving DAMOS stat's provided information,
deterministic control, and reading usability.
DAMOS provides stats that are important for understanding its behavior.
It lacks information about how many DAMON-generated monitoring output
snapshots it has worked on. Add a new stat, nr_snapshots, to show the
information.
Users can control DAMOS schemes in multiple ways. Using the online
parameters commit feature, they can install and uninstall DAMOS schemes
whenever they want while keeping DAMON runs. DAMOS quotas and watermarks
can be used for manually or automatically turning on/off or adjusting the
aggressiveness of the scheme. DAMOS filters can be used for applying the
scheme to specific memory entities based on their types and locations.
Some users want their DAMOS scheme to be applied to only specific number
of DAMON snapshots, for more deterministic control. One example use case
is tracepoint based snapshot reading. Add a new knob, max_nr_snapshots,
to support this. If the nr_snapshots parameter becomes same to or greater
than the value of this parameter, the scheme is deactivated.
Users can read DAMOS stats via DAMON's sysfs interface. For deep level
investigations on environments having advanced tools like perf and
bpftrace, exposing the stats via a tracepoint can be useful. Implement a
new tracepoint, namely damon:damos_stat_after_apply_interval.
First five patches (patches 1-5) of this series implement the new stat,
nr_snapshots, on the core layer (patch 1), expose on DAMON sysfs user
interface (patch 2), and update documents (patches 3-5).
Following six patches (patches 6-11) are for the new stat based DAMOS
deactivation (max_nr_snapshots). The first one (patch 6) of this group
updates a kernel-doc comment before making further changes. Then an
implementation of it on the core layer (patch 7), an introduction of a new
DAMON sysfs interface file for users of the feature (patch 8), and three
updates of the documents (patches 9-11) follow.
The final one (patch 12) introduces the new tracepoint that exposes the
DAMOS stat values for each scheme apply interval.
This patch (of 12):
DAMON generates monitoring results snapshots for every sampling interval.
DAMOS applies given schemes on the regions of the snapshots, for every
apply interval of the scheme.
DAMOS stat informs a given scheme has tried to how many memory entities
and applied, in the region and byte level. In some use cases including
user-space oriented tuning and investigations, it is useful to know that
in the DAMON-snapshot level. Introduce a new stat, namely nr_snapshots
for DAMON core API callers.
[sj@kernel.org: fix wrong list_is_last() call in damons_is_last_region()]
Link: https://lkml.kernel.org/r/20260114152049.99727-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20251216080128.42991-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20251216080128.42991-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In addition to slab objects, this function is used for resolving non-slab
kernel pointers. This has caused confusion in recent refactoring work.
Rename it to mem_cgroup_from_virt(), sticking with terminology established
by the virt_to_<foo>() converters.
Link: https://lore.kernel.org/linux-mm/20251113161424.GB3465062@cmpxchg.org/
Link: https://lkml.kernel.org/r/20251210154301.720133-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The mem_cgroup_size helper is used only in apply_proportional_protection
to read the current memory usage. Its semantics are unclear and
inconsistent with other sites, which directly call page_counter_read for
the same purpose.
Remove this helper and get its usage via mem_cgroup_protection for
clarity. Additionally, rename the local variable 'cgroup_size' to 'usage'
to better reflect its meaning.
No functional changes intended.
Link: https://lkml.kernel.org/r/20251211013019.2080004-3-chenridong@huaweicloud.com
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Lu Jialin <lujialin4@huawei.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use batch clearing in clear_contig_highpages() instead of clearing a
single page at a time. Exposing larger ranges enables the processor to
optimize based on extent.
To do this we just switch to using clear_user_highpages() which would in
turn use clear_user_pages() or clear_pages().
Batched clearing, when running under non-preemptible models, however, has
latency considerations. In particular, we need periodic invocations of
cond_resched() to keep to reasonable preemption latencies. This is a
problem because the clearing primitives do not, or might not be able to,
call cond_resched() to check if preemption is needed.
So, limit the worst case preemption latency by doing the clearing in units
of no more than PROCESS_PAGES_NON_PREEMPT_BATCH pages. (Preemptible
models already define away most of cond_resched(), so the batch size is
ignored when running under those.)
PROCESS_PAGES_NON_PREEMPT_BATCH: for architectures with "fast" clear-pages
(ones that define clear_pages()), we define it as 32MB worth of pages.
This is meant to be large enough to allow the processor to optimize the
operation and yet small enough that we see reasonable preemption latency
for when this optimization is not possible (ex. slow microarchitectures,
memory bandwidth saturation.)
This specific value also allows for a cacheline allocation elision
optimization (which might help unrelated applications by not evicting
potentially useful cache lines) that kicks in recent generations of AMD
Zen processors at around LLC-size (32MB is a typical size).
At the same time 32MB is small enough that even with poor clearing
bandwidth (say ~10GBps), time to clear 32MB should be well below the
scheduler's default warning threshold
(sysctl_resched_latency_warn_ms=100).
"Slow" architectures (don't have clear_pages()) will continue to use the
base value (single page).
Performance
==
Testing a demand fault workload shows a decent improvement in bandwidth
with pg-sz=1GB. Bandwidth with pg-sz=2MB stays flat.
$ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
contiguous-pages batched-pages
(GBps +- %stdev) (GBps +- %stdev)
pg-sz=2MB 23.58 +- 1.95% 25.34 +- 1.18% + 7.50% preempt=*
pg-sz=1GB 25.09 +- 0.79% 39.22 +- 2.32% + 56.31% preempt=none|voluntary
pg-sz=1GB 25.71 +- 0.03% 52.73 +- 0.20% [#] +110.16% preempt=full|lazy
[#] We perform much better with preempt=full|lazy because, not
needing explicit invocations of cond_resched() we can clear the
full extent (pg-sz=1GB) as a single unit which the processor
can optimize for.
(Unless otherwise noted, all numbers are on AMD Genoa (EPYC 9J13);
region-size=64GB, local node; 2.56 GHz, boost=0.)
Analysis
==
pg-sz=1GB: the improvement we see falls in two buckets depending on the
batch size in use.
For batch-size=32MB the number of cachelines allocated (L1-dcache-loads)
-- which stay relatively flat for smaller batches, start to drop off
because cacheline allocation elision kicks in. And as can be seen below,
at batch-size=1GB, we stop allocating cachelines almost entirely. (Not
visible here but from testing with intermediate sizes, the allocation
change kicks in only at batch-size=32MB and ramps up from there.)
contigous-pages 6,949,417,798 L1-dcache-loads # 883.599 M/sec ( +- 0.01% ) (35.75%)
3,226,709,573 L1-dcache-load-misses # 46.43% of all L1-dcache accesses ( +- 0.05% ) (35.75%)
batched,32MB 2,290,365,772 L1-dcache-loads # 471.171 M/sec ( +- 0.36% ) (35.72%)
1,144,426,272 L1-dcache-load-misses # 49.97% of all L1-dcache accesses ( +- 0.58% ) (35.70%)
batched,1GB 63,914,157 L1-dcache-loads # 17.464 M/sec ( +- 8.08% ) (35.73%)
22,074,367 L1-dcache-load-misses # 34.54% of all L1-dcache accesses ( +- 16.70% ) (35.70%)
The dropoff is also visible in L2 prefetch hits (miss numbers are
on similar lines):
contiguous-pages 3,464,861,312 l2_pf_hit_l2.all # 437.722 M/sec ( +- 0.74% ) (15.69%)
batched,32MB 883,750,087 l2_pf_hit_l2.all # 181.223 M/sec ( +- 1.18% ) (15.71%)
batched,1GB 8,967,943 l2_pf_hit_l2.all # 2.450 M/sec ( +- 17.92% ) (15.77%)
This largely decouples the frontend from the backend since the clearing
operation does not need to wait on loads from memory (we still need
cacheline ownership but that's a shorter path). This is most visible if
we rerun the test above with (boost=1, 3.66 GHz).
$ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
contiguous-pages batched-pages
(GBps +- %stdev) (GBps +- %stdev)
pg-sz=2MB 26.08 +- 1.72% 26.13 +- 0.92% - preempt=*
pg-sz=1GB 26.99 +- 0.62% 48.85 +- 2.19% + 80.99% preempt=none|voluntary
pg-sz=1GB 27.69 +- 0.18% 75.18 +- 0.25% +171.50% preempt=full|lazy
Comparing the batched-pages numbers from the boost=0 ones and these: for a
clock-speed gain of 42% we gain 24.5% for batch-size=32MB and 42.5% for
batch-size=1GB. In comparison the baseline contiguous-pages case and both
the pg-sz=2MB ones are largely backend bound so gain no more than ~10%.
Other platforms tested, Intel Icelakex (Oracle X9) and ARM64 Neoverse-N1
(Ampere Altra) both show an improvement of ~35% for pg-sz=2MB|1GB. The
first goes from around 8GBps to 11GBps and the second from 32GBps to 44
GBPs.
[ankur.a.arora@oracle.com: move the unit computation and make it a const
Link: https://lkml.kernel.org/r/20260108060406.1693853-1-ankur.a.arora@oracle.com
Link: https://lkml.kernel.org/r/20260107072009.1615991-8-ankur.a.arora@oracle.com
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzessutek Wilk <konrad.wilk@oracle.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Define clear_user_highpages() which uses the range clearing primitive,
clear_user_pages(). We can safely use this when CONFIG_HIGHMEM is
disabled and if the architecture does not have clear_user_highpage.
The first is needed to ensure that contiguous page ranges stay contiguous
which precludes intermediate maps via HIGMEM. The second, because if the
architecture has clear_user_highpage(), it likely needs flushing magic
when clearing the page, magic that we aren't privy to.
For both of those cases, just fallback to a loop around
clear_user_highpage().
Link: https://lkml.kernel.org/r/20260107072009.1615991-4-ankur.a.arora@oracle.com
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzessutek Wilk <konrad.wilk@oracle.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Introduce clear_pages(), to be overridden by architectures that support
more efficient clearing of consecutive pages.
Also introduce clear_user_pages(), however, we will not expect this
function to be overridden anytime soon.
As we do for clear_user_page(), define clear_user_pages() only if the
architecture does not define clear_user_highpage().
That is because if the architecture does define clear_user_highpage(),
then it likely needs some flushing magic when clearing user pages or
highpages. This means we can get away without defining
clear_user_pages(), since, much like its single page sibling, its only
potential user is the generic clear_user_highpages() which should instead
be using clear_user_highpage().
Link: https://lkml.kernel.org/r/20260107072009.1615991-3-ankur.a.arora@oracle.com
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzessutek Wilk <konrad.wilk@oracle.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|