From 3fa805c37dd4d3e72ae5c58800f3f46ab3ca1f70 Mon Sep 17 00:00:00 2001 From: Breno Leitao Date: Fri, 10 Oct 2025 03:36:50 -0700 Subject: vmcoreinfo: track and log recoverable hardware errors MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Introduce a generic infrastructure for tracking recoverable hardware errors (HW errors that are visible to the OS but does not cause a panic) and record them for vmcore consumption. This aids post-mortem crash analysis tools by preserving a count and timestamp for the last occurrence of such errors. On the other side, correctable errors, which the OS typically remains unaware of because the underlying hardware handles them transparently, are less relevant for crash dump and therefore are NOT tracked in this infrastructure. Add centralized logging for sources of recoverable hardware errors based on the subsystem it has been notified. hwerror_data is write-only at kernel runtime, and it is meant to be read from vmcore using tools like crash/drgn. For example, this is how it looks like when opening the crashdump from drgn. >>> prog['hwerror_data'] (struct hwerror_info[1]){ { .count = (int)844, .timestamp = (time64_t)1752852018, }, ... This helps fleet operators quickly triage whether a crash may be influenced by hardware recoverable errors (which executes a uncommon code path in the kernel), especially when recoverable errors occurred shortly before a panic, such as the bug fixed by commit ee62ce7a1d90 ("page_pool: Track DMA-mapped pages and unmap them when destroying the pool") This is not intended to replace full hardware diagnostics but provides a fast way to correlate hardware events with kernel panics quickly. Rare machine check exceptions—like those indicated by mce_flags.p5 or mce_flags.winchip—are not accounted for in this method, as they fall outside the intended usage scope for this feature's user base. [leitao@debian.org: add hw-recoverable-errors to toctree] Link: https://lkml.kernel.org/r/20251127-vmcoreinfo_fix-v1-1-26f5b1c43da9@debian.org Link: https://lkml.kernel.org/r/20251010-vmcore_hw_error-v5-1-636ede3efe44@debian.org Signed-off-by: Breno Leitao Suggested-by: Tony Luck Suggested-by: Shuai Xue Reviewed-by: Shuai Xue Reviewed-by: Hanjun Guo [APEI] Cc: Bjorn Helgaas Cc: Bob Moore Cc: Borislav Betkov Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: James Morse Cc: Konrad Rzessutek Wilk Cc: Len Brown Cc: Mahesh Salgaonkar Cc: Mauro Carvalho Chehab Cc: "Oliver O'Halloran" Cc: Omar Sandoval Cc: Thomas Gleinxer Signed-off-by: Andrew Morton --- include/uapi/linux/vmcore.h | 9 +++++++++ 1 file changed, 9 insertions(+) (limited to 'include/uapi/linux/vmcore.h') diff --git a/include/uapi/linux/vmcore.h b/include/uapi/linux/vmcore.h index 3e9da91866ff..2ba89fafa518 100644 --- a/include/uapi/linux/vmcore.h +++ b/include/uapi/linux/vmcore.h @@ -15,4 +15,13 @@ struct vmcoredd_header { __u8 dump_name[VMCOREDD_MAX_NAME_BYTES]; /* Device dump's name */ }; +enum hwerr_error_type { + HWERR_RECOV_CPU, + HWERR_RECOV_MEMORY, + HWERR_RECOV_PCI, + HWERR_RECOV_CXL, + HWERR_RECOV_OTHERS, + HWERR_RECOV_MAX, +}; + #endif /* _UAPI_VMCORE_H */ -- cgit v1.2.3