| author | Linus Torvalds <torvalds@linux-foundation.org> | 2026-02-12 16:33:05 -0800 |
|---|---|---|
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2026-02-12 16:33:05 -0800 |
| commit | e812928be2ee1c2744adf20ed04e0ce1e2fc5c13 (patch) | |
| tree | d2685be8adaca1d097adf407b333d913d74c2582 /Documentation/driver-api | |
| parent | cebcffe666cc82e68842e27852a019ca54072cb7 (diff) | |
| parent | 63fbf275fa9f18f7020fb8acf54fa107e51d0f23 (diff) | |
Merge tag 'cxl-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull CXL updates from Dave Jiang:
- Introduce cxl_memdev_attach and pave way for soft reserved handling,
type2 accelerator enabling, and LSA 2.0 enabling. All these series
require the endpoint driver to settle before continuing the memdev
driver probe.
- Address CXL port error protocol handling and reporting.
The large patch series was split into three parts. The first two
parts are included here with the final part coming later.
The first part consists of a series of code refactoring to PCI AER
sub-system that addresses CXL and also CXL RAS code to prepare for
port error handling.
The second part refactors the CXL code to move management of
component registers to cxl_port objects to allow all CXL AER errors
to be handled through the cxl_port hierarchy.
- Provide AMD Zen5 platform address translation for CXL using ACPI
PRMT. This includes a conventions document to explain why this is
needed and how it's implemented.
- Misc CXL patches of fixes, cleanups, and updates. Including CXL
address translation for unaligned MOD3 regions.
[ TLA service: CXL is "Compute Express Link" ]
* tag 'cxl-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (59 commits)
cxl: Disable HPA/SPA translation handlers for Normalized Addressing
cxl/region: Factor out code into cxl_region_setup_poison()
cxl/atl: Lock decoders that need address translation
cxl: Enable AMD Zen5 address translation using ACPI PRMT
cxl/acpi: Prepare use of EFI runtime services
cxl: Introduce callback for HPA address ranges translation
cxl/region: Use region data to get the root decoder
cxl/region: Add @hpa_range argument to function cxl_calc_interleave_pos()
cxl/region: Separate region parameter setup and region construction
cxl: Simplify cxl_root_ops allocation and handling
cxl/region: Store HPA range in struct cxl_region
cxl/region: Store root decoder in struct cxl_region
cxl/region: Rename misleading variable name @hpa to @hpa_range
Documentation/driver-api/cxl: ACPI PRM Address Translation Support and AMD Zen5 enablement
cxl, doc: Moving conventions in separate files
cxl, doc: Remove isonum.txt inclusion
cxl/port: Unify endpoint and switch port lookup
cxl/port: Move endpoint component register management to cxl_port
cxl/port: Map Port RAS registers
cxl/port: Move dport RAS setup to dport add time
...
Diffstat (limited to 'Documentation/driver-api')
| -rw-r--r-- | Documentation/driver-api/cxl/conventions.rst | 178 |
| -rw-r--r-- | Documentation/driver-api/cxl/conventions/cxl-atl.rst | 304 |
| -rw-r--r-- | Documentation/driver-api/cxl/conventions/cxl-lmh.rst | 135 |
| -rw-r--r-- | Documentation/driver-api/cxl/conventions/template.rst | 37 |
| -rw-r--r-- | Documentation/driver-api/cxl/index.rst | 1 |
| -rw-r--r-- | Documentation/driver-api/cxl/platform/bios-and-efi.rst | 23 |
| -rw-r--r-- | Documentation/driver-api/cxl/platform/device-hotplug.rst | 130 |
7 files changed, 637 insertions, 171 deletions
diff --git a/Documentation/driver-api/cxl/conventions.rst b/Documentation/driver-api/cxl/conventions.rst
index e37336d7b116..0d2e07279ad9 100644
--- a/Documentation/driver-api/cxl/conventions.rst
+++ b/Documentation/driver-api/cxl/conventions.rst
@@ -1,9 +1,7 @@
 .. SPDX-License-Identifier: GPL-2.0
-.. include:: <isonum.txt>
 
-=======================================
 Compute Express Link: Linux Conventions
-=======================================
+#######################################
 
 There exists shipping platforms that bend or break CXL specification
 expectations. Record the details and the rationale for those deviations.
@@ -11,172 +9,10 @@ Borrow the ACPI Code First template format to capture the assumptions
 and tradeoffs such that multiple platform implementations can follow the
 same convention.
 
-<(template) Title>
-==================
+.. toctree::
+   :maxdepth: 1
+   :caption: Contents
 
-Document
---------
-CXL Revision <rev>, Version <ver>
-
-License
--------
-SPDX-License Identifier: CC-BY-4.0
-
-Creator/Contributors
---------------------
-
-Summary of the Change
----------------------
-
-<Detail the conflict with the specification and where available the
-assumptions and tradeoffs taken by the hardware platform.>
-
-
-Benefits of the Change
------------------------
-
-<Detail what happens if platforms and Linux do not adopt this
-convention.>
-
-References
------------
-
-Detailed Description of the Change
------------------------------------
-
-<Propose spec language that corrects the conflict.>
-
-
-Resolve conflict between CFMWS, Platform Memory Holes, and Endpoint Decoders
-============================================================================
-
-Document
---------
-
-CXL Revision 3.2, Version 1.0
-
-License
--------
-
-SPDX-License Identifier: CC-BY-4.0
-
-Creator/Contributors
---------------------
-
-- Fabio M. De Francesco, Intel
-- Dan J. Williams, Intel
-- Mahesh Natu, Intel
-
-Summary of the Change
----------------------
-
-According to the current Compute Express Link (CXL) Specifications (Revision
-3.2, Version 1.0), the CXL Fixed Memory Window Structure (CFMWS) describes zero
-or more Host Physical Address (HPA) windows associated with each CXL Host
-Bridge. Each window represents a contiguous HPA range that may be interleaved
-across one or more targets, including CXL Host Bridges. Each window has a set
-of restrictions that govern its usage. It is the Operating System-directed
-configuration and Power Management (OSPM) responsibility to utilize each window
-for the specified use.
-
-Table 9-22 of the current CXL Specifications states that the Window Size field
-contains the total number of consecutive bytes of HPA this window describes.
-This value must be a multiple of the Number of Interleave Ways (NIW) * 256 MB.
-
-Platform Firmware (BIOS) might reserve physical addresses below 4 GB where a
-memory gap such as the Low Memory Hole for PCIe MMIO may exist. In such cases,
-the CFMWS Range Size may not adhere to the NIW * 256 MB rule.
-
-The HPA represents the actual physical memory address space that the CXL devices
-can decode and respond to, while the System Physical Address (SPA), a related
-but distinct concept, represents the system-visible address space that users can
-direct transaction to and so it excludes reserved regions.
-
-BIOS publishes CFMWS to communicate the active SPA ranges that, on platforms
-with LMH's, map to a strict subset of the HPA. The SPA range trims out the hole,
-resulting in lost capacity in the Endpoints with no SPA to map to that part of
-the HPA range that intersects the hole.
-
-E.g, an x86 platform with two CFMWS and an LMH starting at 2 GB:
-
- +--------+------------+-------------------+------------------+-------------------+------+
- | Window | CFMWS Base | CFMWS Size        | HDM Decoder Base | HDM Decoder Size  | Ways |
- +========+============+===================+==================+===================+======+
- | 0      | 0 GB       | 2 GB              | 0 GB             | 3 GB              | 12   |
- +--------+------------+-------------------+------------------+-------------------+------+
- | 1      | 4 GB       | NIW*256MB Aligned | 4 GB             | NIW*256MB Aligned | 12   |
- +--------+------------+-------------------+------------------+-------------------+------+
-
-HDM decoder base and HDM decoder size represent all the 12 Endpoint Decoders of
-a 12 ways region and all the intermediate Switch Decoders. They are configured
-by the BIOS according to the NIW * 256MB rule, resulting in a HPA range size of
-3GB. Instead, the CFMWS Base and CFMWS Size are used to configure the Root
-Decoder HPA range that results smaller (2GB) than that of the Switch and
-Endpoint Decoders in the hierarchy (3GB).
-
-This creates 2 issues which lead to a failure to construct a region:
-
-1) A mismatch in region size between root and any HDM decoder. The root decoders
-   will always be smaller due to the trim.
-
-2) The trim causes the root decoder to violate the (NIW * 256MB) rule.
-
-This change allows a region with a base address of 0GB to bypass these checks to
-allow for region creation with the trimmed root decoder address range.
-
-This change does not allow for any other arbitrary region to violate these
-checks - it is intended exclusively to enable x86 platforms which map CXL memory
-under 4GB.
-
-Despite the HDM decoders covering the PCIE hole HPA region, it is expected that
-the platform will never route address accesses to the CXL complex because the
-root decoder only covers the trimmed region (which excludes this). This is
-outside the ability of Linux to enforce.
-
-On the example platform, only the first 2GB will be potentially usable, but
-Linux, aiming to adhere to the current specifications, fails to construct
-Regions and attach Endpoint and intermediate Switch Decoders to them.
-
-There are several points of failure that due to the expectation that the Root
-Decoder HPA size, that is equal to the CFMWS from which it is configured, has
-to be greater or equal to the matching Switch and Endpoint HDM Decoders.
-
-In order to succeed with construction and attachment, Linux must construct a
-Region with Root Decoder HPA range size, and then attach to that all the
-intermediate Switch Decoders and Endpoint Decoders that belong to the hierarchy
-regardless of their range sizes.
-
-Benefits of the Change
------------------------
-
-Without the change, the OSPM wouldn't match intermediate Switch and Endpoint
-Decoders with Root Decoders configured with CFMWS HPA sizes that don't align
-with the NIW * 256MB constraint, and so it leads to lost memdev capacity.
-
-This change allows the OSPM to construct Regions and attach intermediate Switch
-and Endpoint Decoders to them, so that the addressable part of the memory
-devices total capacity is made available to the users.
-
-References
------------
-
-Compute Express Link Specification Revision 3.2, Version 1.0
-<https://www.computeexpresslink.org/>
-
-Detailed Description of the Change
------------------------------------
-
-The description of the Window Size field in table 9-22 needs to account for
-platforms with Low Memory Holes, where SPA ranges might be subsets of the
-endpoints HPA. Therefore, it has to be changed to the following:
-
-"The total number of consecutive bytes of HPA this window represents. This value
-shall be a multiple of NIW * 256 MB.
-
-On platforms that reserve physical addresses below 4 GB, such as the Low Memory
-Hole for PCIe MMIO on x86, an instance of CFMWS whose Base HPA range is 0 might
-have a size that doesn't align with the NIW * 256 MB constraint.
-
-Note that the matching intermediate Switch Decoders and the Endpoint Decoders
-HPA range sizes must still align to the above-mentioned rule, but the memory
-capacity that exceeds the CFMWS window size won't be accessible.".
+   conventions/cxl-lmh.rst
+   conventions/cxl-atl.rst
+   conventions/template.rst
diff --git a/Documentation/driver-api/cxl/conventions/cxl-atl.rst b/Documentation/driver-api/cxl/conventions/cxl-atl.rst
new file mode 100644
index 000000000000..3a36a84743d0
--- /dev/null
+++ b/Documentation/driver-api/cxl/conventions/cxl-atl.rst
@@ -0,0 +1,304 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+ACPI PRM CXL Address Translation
+================================
+
+Document
+--------
+
+CXL Revision 3.2, Version 1.0
+
+License
+-------
+
+SPDX-License Identifier: CC-BY-4.0
+
+Creator/Contributors
+--------------------
+
+- Robert Richter, AMD et al.
+
+Summary of the Change
+---------------------
+
+The CXL Fixed Memory Window Structures (CFMWS) describe zero or more Host
+Physical Address (HPA) windows associated with one or more CXL Host Bridges.
+Each HPA range of a CXL Host Bridge is represented by a CFMWS entry. An HPA
+range may include addresses currently assigned to CXL.mem devices, or an OS may
+assign ranges from an address window to a device.
+
+Host-managed Device Memory is Device-attached memory that is mapped to system
+coherent address space and accessible to the Host using standard write-back
+semantics. The managed address range is configured in the CXL HDM Decoder
+registers of the device. An HDM Decoder in a device is responsible for
+converting HPA into DPA by stripping off specific address bits.
+
+CXL devices and CXL bridges use the same HPA space. It is common across all
+components that belong to the same host domain. The view of the address region
+must be consistent on the CXL.mem path between the Host and the Device.
+
+This is described in the *CXL 3.2 specification* (Table 1-1, 3.3.1,
+8.2.4.20, 9.13.1, 9.18.1.3). [#cxl-spec-3.2]_
+
+Depending on the interconnect architecture of the platform, components attached
+to a host may not share the same host physical address space. Those platforms
+need address translation to convert an HPA between the host and the attached
+component, such as a CXL device. The translation mechanism is host-specific and
+implementation dependent.
+
+For example, x86 AMD platforms use a Data Fabric that manages access to physical
+memory. Devices have their own memory space and can be configured to use
+'Normalized addresses' different from System Physical Addresses (SPA). Address
+translation is then needed. For details, see
+:doc:`x86 AMD Address Translation </admin-guide/RAS/address-translation>`.
+
+Those AMD platforms provide PRM [#prm-spec]_ handlers in firmware to perform
+various types of address translation, including for CXL endpoints. AMD Zen5
+systems implement the ACPI PRM CXL Address Translation firmware call. The ACPI
+PRM handler has a specific GUID to uniquely identify platforms with support for
+Normalized addressing. This is documented in the *ACPI v6.5 Porting Guide*
+(Address Translation - CXL DPA to System Physical Address). [#amd-ppr-58088]_
+
+When in Normalized address mode, HDM decoder address ranges must be configured
+and handled differently. Hardware addresses used in the HDM decoder
+configurations of an endpoint are not SPA and need to be translated from the
+address range of the endpoint to that of the CXL host bridge. This is especially
+important for finding an endpoint's associated CXL Host Bridge and HPA window
+described in the CFMWS. Additionally, the interleave decoding is done by the
+Data Fabric and the endpoint does not perform decoding when converting HPA to
+DPA. Instead, interleaving is switched off for the endpoint (1-way). Finally,
+address translation might also be needed to inspect the endpoint's hardware
+addresses, such as during profiling, tracing, or error handling.
+
+For example, with Normalized addressing the HDM decoders could look as follows::
+
+                        -------------------------------
+                        | Root Decoder (CFMWS)        |
+                        | SPA Range: 0x850000000      |
+                        | Size: 0x8000000000 (512 GB) |
+                        | Interleave Ways: 1          |
+                        -------------------------------
+                                       |
+                                       v
+                        -------------------------------
+                        | Host Bridge Decoder (HDM)   |
+                        | SPA Range: 0x850000000      |
+                        | Size: 0x8000000000 (512 GB) |
+                        | Interleave Ways: 4          |
+                        | Targets: endpoint5,8,11,13  |
+                        | Granularity: 256            |
+                        -------------------------------
+                                       |
+          -----------------------------+------------------------------
+          |                   |                   |                   |
+          v                   v                   v                   v
+ ------------------- ------------------- ------------------- -------------------
+ | endpoint5       | | endpoint8       | | endpoint11      | | endpoint13      |
+ | decoder5.0      | | decoder8.0      | | decoder11.0     | | decoder13.0     |
+ | PCIe:           | | PCIe:           | | PCIe:           | | PCIe:           |
+ | 0000:e2:00.0    | | 0000:e3:00.0    | | 0000:e4:00.0    | | 0000:e1:00.0    |
+ | DPA:            | | DPA:            | | DPA:            | | DPA:            |
+ | Start: 0x0      | | Start: 0x0      | | Start: 0x0      | | Start: 0x0      |
+ | Size:           | | Size:           | | Size:           | | Size:           |
+ | 0x2000000000    | | 0x2000000000    | | 0x2000000000    | | 0x2000000000    |
+ | (128 GB)        | | (128 GB)        | | (128 GB)        | | (128 GB)        |
+ | Interleaving:   | | Interleaving:   | | Interleaving:   | | Interleaving:   |
+ | Ways: 1         | | Ways: 1         | | Ways: 1         | | Ways: 1         |
+ | Gran: 256       | | Gran: 256       | | Gran: 256       | | Gran: 256       |
+ ------------------- ------------------- ------------------- -------------------
+          |                   |                   |                   |
+          v                   v                   v                   v
+         DPA                 DPA                 DPA                 DPA
+
+This shows the representation in sysfs:
+
+.. code-block:: none
+
+ /sys/bus/cxl/devices/endpoint5/decoder5.0/interleave_granularity:256
+ /sys/bus/cxl/devices/endpoint5/decoder5.0/interleave_ways:1
+ /sys/bus/cxl/devices/endpoint5/decoder5.0/size:0x2000000000
+ /sys/bus/cxl/devices/endpoint5/decoder5.0/start:0x0
+ /sys/bus/cxl/devices/endpoint8/decoder8.0/interleave_granularity:256
+ /sys/bus/cxl/devices/endpoint8/decoder8.0/interleave_ways:1
+ /sys/bus/cxl/devices/endpoint8/decoder8.0/size:0x2000000000
+ /sys/bus/cxl/devices/endpoint8/decoder8.0/start:0x0
+ /sys/bus/cxl/devices/endpoint11/decoder11.0/interleave_granularity:256
+ /sys/bus/cxl/devices/endpoint11/decoder11.0/interleave_ways:1
+ /sys/bus/cxl/devices/endpoint11/decoder11.0/size:0x2000000000
+ /sys/bus/cxl/devices/endpoint11/decoder11.0/start:0x0
+ /sys/bus/cxl/devices/endpoint13/decoder13.0/interleave_granularity:256
+ /sys/bus/cxl/devices/endpoint13/decoder13.0/interleave_ways:1
+ /sys/bus/cxl/devices/endpoint13/decoder13.0/size:0x2000000000
+ /sys/bus/cxl/devices/endpoint13/decoder13.0/start:0x0
+
+Note that the endpoint interleaving configurations use direct mapping (1-way).
+
+With PRM calls, the kernel can determine the following mappings:
+
+.. code-block:: none
+
+ cxl decoder5.0: address mapping found for 0000:e2:00.0 (hpa -> spa):
+   0x0+0x2000000000 -> 0x850000000+0x8000000000 ways:4 granularity:256
+ cxl decoder8.0: address mapping found for 0000:e3:00.0 (hpa -> spa):
+   0x0+0x2000000000 -> 0x850000000+0x8000000000 ways:4 granularity:256
+ cxl decoder11.0: address mapping found for 0000:e4:00.0 (hpa -> spa):
+   0x0+0x2000000000 -> 0x850000000+0x8000000000 ways:4 granularity:256
+ cxl decoder13.0: address mapping found for 0000:e1:00.0 (hpa -> spa):
+   0x0+0x2000000000 -> 0x850000000+0x8000000000 ways:4 granularity:256
+
+The corresponding CXL host bridge (HDM) decoders and root decoder (CFMWS) match
+the calculated endpoint mappings shown:
+
+.. code-block:: none
+
+ /sys/bus/cxl/devices/port1/decoder1.0/interleave_granularity:256
+ /sys/bus/cxl/devices/port1/decoder1.0/interleave_ways:4
+ /sys/bus/cxl/devices/port1/decoder1.0/size:0x8000000000
+ /sys/bus/cxl/devices/port1/decoder1.0/start:0x850000000
+ /sys/bus/cxl/devices/port1/decoder1.0/target_list:0,1,2,3
+ /sys/bus/cxl/devices/port1/decoder1.0/target_type:expander
+ /sys/bus/cxl/devices/root0/decoder0.0/interleave_granularity:256
+ /sys/bus/cxl/devices/root0/decoder0.0/interleave_ways:1
+ /sys/bus/cxl/devices/root0/decoder0.0/size:0x8000000000
+ /sys/bus/cxl/devices/root0/decoder0.0/start:0x850000000
+ /sys/bus/cxl/devices/root0/decoder0.0/target_list:7
+
+The following changes to the specification are needed:
+
+* Allow a CXL device to be in an HPA space other than the host's address space.
+
+* Allow the platform to use implementation-specific address translation when
+  crossing memory domains on the CXL.mem path between the host and the device.
+
+* Define a PRM handler method for converting device addresses to SPAs.
+
+* Specify that the platform shall provide the PRM handler method to the
+  Operating System to detect Normalized addressing and for determining Endpoint
+  SPA ranges and interleaving configurations.
+
+* Add reference to:
+
+  | Platform Runtime Mechanism Specification, Version 1.1 – November 2020
+  | https://uefi.org/sites/default/files/resources/PRM_Platform_Runtime_Mechanism_1_1_release_candidate.pdf
+
+Benefits of the Change
+----------------------
+
+Without the change, the Operating System may be unable to determine the memory
+region and Root Decoder for an Endpoint and its corresponding HDM decoder.
+Region creation would fail. Platforms with a different interconnect architecture
+would fail to set up and use CXL.
+
+References
+----------
+
+.. [#cxl-spec-3.2] Compute Express Link Specification, Revision 3.2, Version 1.0,
+   https://www.computeexpresslink.org/
+
+.. [#amd-ppr-58088] AMD Family 1Ah Models 00h–0Fh and Models 10h–1Fh,
+   ACPI v6.5 Porting Guide, Publication # 58088,
+   https://www.amd.com/en/search/documentation/hub.html
+
+.. [#prm-spec] Platform Runtime Mechanism, Version: 1.1,
+   https://uefi.org/sites/default/files/resources/PRM_Platform_Runtime_Mechanism_1_1_release_candidate.pdf
+
+Detailed Description of the Change
+----------------------------------
+
+The following describes the necessary changes to the *CXL 3.2 specification*
+[#cxl-spec-3.2]_:
+
+Add the following reference to the table:
+
+Table 1-2. Reference Documents
+
++----------------------------+-------------------+---------------------------+
+| Document                   | Chapter Reference | Document No./Location     |
++============================+===================+===========================+
+| Platform Runtime Mechanism | Chapter 8, 9      | https://www.uefi.org/acpi |
+| Version: 1.1               |                   |                           |
++----------------------------+-------------------+---------------------------+
+
+Add the following paragraphs to the end of the section:
+
+**8.2.4.20 CXL HDM Decoder Capability Structure**
+
+"A device may use an HPA space that is not common to other components of the
+host domain. The platform is responsible for address translation when crossing
+HPA spaces. The Operating System must determine the interleaving configuration
+and perform address translation to the HPA ranges of the HDM decoders as needed.
+The translation mechanism is host-specific and implementation dependent.
+
+The platform indicates support of independent HPA spaces and the need for
+address translation by providing a Platform Runtime Mechanism (PRM) handler. The
+OS shall use that handler to perform the necessary translations from the DPA
+space to the HPA space. The handler is defined in Section 9.18.4 *PRM Handler
+for CXL DPA to System Physical Address Translation*."
+
+Add the following section and sub-section including tables:
+
+**9.18.4 PRM Handler for CXL DPA to System Physical Address Translation**
+
+"A platform may be configured to use 'Normalized addresses'. Host physical
+address (HPA) spaces are component-specific and differ from system physical
+addresses (SPAs). The endpoint has its own physical address space. All requests
+presented to the device already use Device Physical Addresses (DPAs). The CXL
+endpoint decoders have interleaving disabled (1-way interleaving) and the device
+does not perform HPA decoding to determine a DPA.
+
+The platform provides a PRM handler for CXL DPA to System Physical Address
+Translation. The PRM handler translates a Device Physical Address (DPA) to a
+System Physical Address (SPA) for a specified CXL endpoint. In the address space
+of the host, SPA and HPA are equivalent, and the OS shall use this handler to
+determine the HPA that corresponds to a device address, for example when
+configuring HDM decoders on platforms with Normalized addressing. The GUID and
+the parameter buffer format of the handler are specified in section 9.18.4.1. If
+the OS identifies the PRM handler, the platform supports Normalized addressing
+and the OS must perform DPA address translation as needed."
+
+**9.18.4.1 PRM Handler Invocation**
+
+"The OS calls the PRM handler for CXL DPA to System Physical Address Translation
+using the direct invocation mechanism. Details of calling a PRM handler are
+described in the Platform Runtime Mechanism (PRM) specification.
+
+The PRM handler is identified by the following GUID:
+
+    EE41B397-25D4-452C-AD54-48C6E3480B94
+
+The caller allocates and prepares a Parameter Buffer, then passes the PRM
+handler GUID and a pointer to the Parameter Buffer to invoke the handler. The
+Parameter Buffer is described in Table 9-32."
+
+**Table 9-32. PRM Parameter Buffer used for CXL DPA to System Physical Address Translation**
+
++-------------+-----------+------------------------------------------------------------------------+
+| Byte Offset | Length in | Description                                                            |
+|             | Bytes     |                                                                        |
++=============+===========+========================================================================+
+| 00h         | 8         | **CXL Device Physical Address (DPA)**: CXL DPA (e.g., from             |
+|             |           | CXL Component Event Log)                                               |
++-------------+-----------+------------------------------------------------------------------------+
+| 08h         | 4         | **CXL Endpoint SBDF**:                                                 |
+|             |           |                                                                        |
+|             |           | - Byte 3 - PCIe Segment                                                |
+|             |           | - Byte 2 - Bus Number                                                  |
+|             |           | - Byte 1:                                                              |
+|             |           |   - Device Number Bits[7:3]                                            |
+|             |           |   - Function Number Bits[2:0]                                          |
+|             |           | - Byte 0 - RESERVED (MBZ)                                              |
+|             |           |                                                                        |
++-------------+-----------+------------------------------------------------------------------------+
+| 0Ch         | 8         | **Output Buffer**: Virtual Address Pointer to the buffer,              |
+|             |           | as defined in Table 9-33.                                              |
++-------------+-----------+------------------------------------------------------------------------+
+
+**Table 9-33. PRM Output Buffer used for CXL DPA to System Physical Address Translation**
+
++-------------+-----------+------------------------------------------------------------------------+
+| Byte Offset | Length in | Description                                                            |
+|             | Bytes     |                                                                        |
++=============+===========+========================================================================+
+| 00h         | 8         | **System Physical Address (SPA)**: The SPA converted                   |
+|             |           | from the CXL DPA.                                                      |
++-------------+-----------+------------------------------------------------------------------------+
diff --git a/Documentation/driver-api/cxl/conventions/cxl-lmh.rst b/Documentation/driver-api/cxl/conventions/cxl-lmh.rst
new file mode 100644
index 000000000000..baece5c35345
--- /dev/null
+++ b/Documentation/driver-api/cxl/conventions/cxl-lmh.rst
@@ -0,0 +1,135 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Resolve conflict between CFMWS, Platform Memory Holes, and Endpoint Decoders
+============================================================================
+
+Document
+--------
+
+CXL Revision 3.2, Version 1.0
+
+License
+-------
+
+SPDX-License Identifier: CC-BY-4.0
+
+Creator/Contributors
+--------------------
+
+- Fabio M. De Francesco, Intel
+- Dan J. Williams, Intel
+- Mahesh Natu, Intel
+
+Summary of the Change
+---------------------
+
+According to the current Compute Express Link (CXL) Specifications (Revision
+3.2, Version 1.0), the CXL Fixed Memory Window Structure (CFMWS) describes zero
+or more Host Physical Address (HPA) windows associated with each CXL Host
+Bridge. Each window represents a contiguous HPA range that may be interleaved
+across one or more targets, including CXL Host Bridges. Each window has a set
+of restrictions that govern its usage. It is the Operating System-directed
+configuration and Power Management (OSPM) responsibility to utilize each window
+for the specified use.
+
+Table 9-22 of the current CXL Specifications states that the Window Size field
+contains the total number of consecutive bytes of HPA this window describes.
+This value must be a multiple of the Number of Interleave Ways (NIW) * 256 MB.
+
+Platform Firmware (BIOS) might reserve physical addresses below 4 GB where a
+memory gap such as the Low Memory Hole for PCIe MMIO may exist. In such cases,
+the CFMWS Range Size may not adhere to the NIW * 256 MB rule.
+
+The HPA represents the actual physical memory address space that the CXL devices
+can decode and respond to, while the System Physical Address (SPA), a related
+but distinct concept, represents the system-visible address space that users can
+direct transaction to and so it excludes reserved regions.
+
+BIOS publishes CFMWS to communicate the active SPA ranges that, on platforms
+with LMH's, map to a strict subset of the HPA. The SPA range trims out the hole,
+resulting in lost capacity in the Endpoints with no SPA to map to that part of
+the HPA range that intersects the hole.
+
+E.g, an x86 platform with two CFMWS and an LMH starting at 2 GB:
+
+ +--------+------------+-------------------+------------------+-------------------+------+
+ | Window | CFMWS Base | CFMWS Size        | HDM Decoder Base | HDM Decoder Size  | Ways |
+ +========+============+===================+==================+===================+======+
+ | 0      | 0 GB       | 2 GB              | 0 GB             | 3 GB              | 12   |
+ +--------+------------+-------------------+------------------+-------------------+------+
+ | 1      | 4 GB       | NIW*256MB Aligned | 4 GB             | NIW*256MB Aligned | 12   |
+ +--------+------------+-------------------+------------------+-------------------+------+
+
+HDM decoder base and HDM decoder size represent all the 12 Endpoint Decoders of
+a 12 ways region and all the intermediate Switch Decoders. They are configured
+by the BIOS according to the NIW * 256MB rule, resulting in a HPA range size of
+3GB. Instead, the CFMWS Base and CFMWS Size are used to configure the Root
+Decoder HPA range that results smaller (2GB) than that of the Switch and
+Endpoint Decoders in the hierarchy (3GB).
+
+This creates 2 issues which lead to a failure to construct a region:
+
+1) A mismatch in region size between root and any HDM decoder. The root decoders
+   will always be smaller due to the trim.
+
+2) The trim causes the root decoder to violate the (NIW * 256MB) rule.
+
+This change allows a region with a base address of 0GB to bypass these checks to
+allow for region creation with the trimmed root decoder address range.
+
+This change does not allow for any other arbitrary region to violate these
+checks - it is intended exclusively to enable x86 platforms which map CXL memory
+under 4GB.
+
+Despite the HDM decoders covering the PCIE hole HPA region, it is expected that
+the platform will never route address accesses to the CXL complex because the
+root decoder only covers the trimmed region (which excludes this). This is
+outside the ability of Linux to enforce.
+
+On the example platform, only the first 2GB will be potentially usable, but
+Linux, aiming to adhere to the current specifications, fails to construct
+Regions and attach Endpoint and intermediate Switch Decoders to them.
+
+There are several points of failure that due to the expectation that the Root
+Decoder HPA size, that is equal to the CFMWS from which it is configured, has
+to be greater or equal to the matching Switch and Endpoint HDM Decoders.
+
+In order to succeed with construction and attachment, Linux must construct a
+Region with Root Decoder HPA range size, and then attach to that all the
+intermediate Switch Decoders and Endpoint Decoders that belong to the hierarchy
+regardless of their range sizes.
+
+Benefits of the Change
+----------------------
+
+Without the change, the OSPM wouldn't match intermediate Switch and Endpoint
+Decoders with Root Decoders configured with CFMWS HPA sizes that don't align
+with the NIW * 256MB constraint, and so it leads to lost memdev capacity.
+
+This change allows the OSPM to construct Regions and attach intermediate Switch
+and Endpoint Decoders to them, so that the addressable part of the memory
+devices total capacity is made available to the users.
+
+References
+----------
+
+Compute Express Link Specification Revision 3.2, Version 1.0
+<https://www.computeexpresslink.org/>
+
+Detailed Description of the Change
+----------------------------------
+
+The description of the Window Size field in table 9-22 needs to account for
+platforms with Low Memory Holes, where SPA ranges might be subsets of the
+endpoints HPA. Therefore, it has to be changed to the following:
+
+"The total number of consecutive bytes of HPA this window represents. This value
+shall be a multiple of NIW * 256 MB.
+
+On platforms that reserve physical addresses below 4 GB, such as the Low Memory
+Hole for PCIe MMIO on x86, an instance of CFMWS whose Base HPA range is 0 might
+have a size that doesn't align with the NIW * 256 MB constraint.
+
+Note that the matching intermediate Switch Decoders and the Endpoint Decoders
+HPA range sizes must still align to the above-mentioned rule, but the memory
+capacity that exceeds the CFMWS window size won't be accessible.".
diff --git a/Documentation/driver-api/cxl/conventions/template.rst b/Documentation/driver-api/cxl/conventions/template.rst
new file mode 100644
index 000000000000..ff2fcf1b5e24
--- /dev/null
+++ b/Documentation/driver-api/cxl/conventions/template.rst
@@ -0,0 +1,37 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. :: Template Title here:
+
+Template File
+=============
+
+Document
+--------
+CXL Revision <rev>, Version <ver>
+
+License
+-------
+SPDX-License Identifier: CC-BY-4.0
+
+Creator/Contributors
+--------------------
+
+Summary of the Change
+---------------------
+
+<Detail the conflict with the specification and where available the
+assumptions and tradeoffs taken by the hardware platform.>
+
+Benefits of the Change
+----------------------
+
+<Detail what happens if platforms and Linux do not adopt this
+convention.>
+
+References
+----------
+
+Detailed Description of the Change
+----------------------------------
+
+<Propose spec language that corrects the conflict.>
diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
index ec8aae9ec0d4..3dfae1d310ca 100644
--- a/Documentation/driver-api/cxl/index.rst
+++ b/Documentation/driver-api/cxl/index.rst
@@ -30,6 +30,7 @@ that have impacts on each other. The docs here break up configurations steps.
    platform/acpi
    platform/cdat
    platform/example-configs
+   platform/device-hotplug
 
 .. toctree::
    :maxdepth: 2
diff --git a/Documentation/driver-api/cxl/platform/bios-and-efi.rst b/Documentation/driver-api/cxl/platform/bios-and-efi.rst
index a9aa0ccd92af..a4b44c018f09 100644
--- a/Documentation/driver-api/cxl/platform/bios-and-efi.rst
+++ b/Documentation/driver-api/cxl/platform/bios-and-efi.rst
@@ -29,6 +29,29 @@ at :doc:`ACPI Tables <acpi>`.
 on physical memory region size and alignment, memory holes, HDM interleave,
 and what linux expects of HDM decoders trying to work with these features.
 
+
+Linux Expectations of BIOS/EFI Software
+=======================================
+Linux expects BIOS/EFI software to construct sufficient ACPI tables (such as
+CEDT, SRAT, HMAT, etc) and platform-specific configurations (such as HPA spaces
+and host-bridge interleave configurations) to allow the Linux driver to
+subsequently configure the devices in the CXL fabric at runtime.
+
+Programming of HDM decoders and switch ports is not required, and may be
+deferred to the CXL driver based on admin policy (e.g. udev rules).
+
+Some platforms may require pre-programming HDM decoders and locking them
+due to quirks (see: Zen5 address translation), but this is not the normal,
+"expected" configuration path. This should be avoided if possible.
+
+Some platforms may wish to pre-configure these resources to bring memory
+up without requiring CXL driver support. These platform vendors should
+test their configurations with the existing CXL driver and provide driver
+support for their auto-configurations if features like RAS are required.
+
+Platforms requiring boot-time programming and/or locking of CXL fabric
+components may prevent features, such as device hot-plug, from working.
+
 UEFI Settings
 =============
 If your platform supports it, the :code:`uefisettings` command can be used to
diff --git a/Documentation/driver-api/cxl/platform/device-hotplug.rst b/Documentation/driver-api/cxl/platform/device-hotplug.rst
new file mode 100644
index 000000000000..e4a065fdd3ec
--- /dev/null
+++ b/Documentation/driver-api/cxl/platform/device-hotplug.rst
@@ -0,0 +1,130 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
+CXL Device Hotplug
+==================
+
+Device hotplug refers to *physical* hotplug of a device (addition or removal
+of a physical device from the machine).
+
+BIOS/EFI software is expected to configure sufficient resources **at boot
+time** to allow hotplugged devices to be configured by software (such as
+proximity domains, HPA regions, and host-bridge configurations).
+
+BIOS/EFI is not expected (**nor suggested**) to configure hotplugged
+devices at hotplug time (i.e. HDM decoders should be left unprogrammed).
+
+This document covers some examples of those resources, but should not
+be considered exhaustive.
+
+Hot-Remove
+==========
+Hot removal of a device typically requires careful removal of software
+constructs (memory regions, associated drivers) which manage these devices.
+
+Hard-removing a CXL.mem device without carefully tearing down driver stacks
+is likely to cause the system to machine-check (or at least SIGBUS if memory
+access is limited to user space).
+
+Memory Device Hot-Add
+=====================
+A device present at boot may be associated with a CXL Fixed Memory Window
+reported in :doc:`CEDT<acpi/cedt>`. That CFMWS may match the size of the
+device, but the construction of the CEDT CFMWS is platform-defined.
+
+Hot-adding a memory device requires this pre-defined, **static** CFMWS to
+have sufficient HPA space to describe that device.
+
+There are a few common scenarios to consider.
+
+Single-Endpoint Memory Device Present at Boot
+---------------------------------------------
+A device present at boot likely had its capacity reported in the
+:doc:`CEDT<acpi/cedt>`. If a device is removed and a new device hotplugged,
+the capacity of the new device will be limited to the original CFMWS capacity.
+
+Adding capacity larger than the original device will cause memory region
+creation to fail if the region size is greater than the CFMWS size.
+
+The CFMWS is **static** and cannot be adjusted. Platforms which may expect
+different sized devices to be hotplugged must allocate sufficient CFMWS space
+**at boot time** to cover all future expected devices.
+
+Multi-Endpoint Memory Device Present at Boot
+--------------------------------------------
+Non-switch-based Multi-Endpoint devices are outside the scope of what the
+CXL specification describes, but they are technically possible. We describe
+them here for instructive reasons only - this does not imply Linux support.
+
+A hot-plug capable CXL memory device, such as one which presents multiple
+expanders as a single large-capacity device, should report the **maximum
+possible capacity** for the device at boot. ::
+
+                  HB0
+                  RP0
+                   |
+     [Multi-Endpoint Memory Device]
+              _____|_____
+              |         |
+         [Endpoint0]  [Empty]
+
+
+Limiting the size to the capacity preset at boot will limit hot-add support
+to replacing capacity that was present at boot.
+
+No CXL Device Present at Boot
+-----------------------------
+When no CXL memory device is present on boot, some platforms omit the CFMWS
+in the :doc:`CEDT<acpi/cedt>`. When this occurs, hot-add is not possible.
+
+This describes the base case for any given device not being present at boot.
+If a future possible device is not described in the CEDT at boot, hot-add
+of that device is either limited or not possible.
+
+For a platform to support hot-add of a full memory device, it must allocate
+a CEDT CFMWS region with sufficient memory capacity to cover all future
+potentially added capacity (along with any relevant CEDT CHBS entry).
+
+To support memory hotplug directly on the host bridge/root port, or on a switch
+downstream of the host bridge, a platform must construct a CEDT CFMWS at boot
+with sufficient resources to support the max possible (or expected) hotplug
+memory capacity. ::
+
+         HB0                 HB1
+      RP0    RP1             RP2
+       |      |               |
+     Empty  Empty            USP
+                      ________|________
+                      |    |     |    |
+                     DSP  DSP   DSP  DSP
+                      |    |     |    |
+                          All Empty
+
+For example, a BIOS/EFI may expose an option to configure a CEDT CFMWS with
+a pre-configured amount of memory capacity (per host bridge, or host bridge
+interleave set), even if no device is attached to Root Ports or Downstream
+Ports at boot (as depicted in the figure above).
+
+
+Interleave Sets
+===============
+
+Host Bridge Interleave
+----------------------
+Host-bridge interleaved memory regions are defined **statically** in the
+:doc:`CEDT<acpi/cedt>`. To apply cross-host-bridge interleave, a CFMWS entry
+describing that interleave must have been provided **at boot**. Hotplugged
+devices cannot add host-bridge interleave capabilities at hotplug time.
+
+See the :doc:`Flexible CEDT Configuration<example-configurations/flexible>`
+example to see how a platform can provide this kind of flexibility regarding
+hotplugged memory devices. BIOS/EFI software should consider options to
+present flexible CEDT configurations with hotplug support.
+
+HDM Interleave
+--------------
+Decoder-applied interleave can flexibly handle hotplugged devices, as decoders
+can be re-programmed after hotplug.
+
+To add or remove a device to/from an existing HDM-applied interleaved region,
+that region must be torn down and re-created.
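
The two sizing rules described in the patch above (the Window Size being a multiple of NIW * 256 MB, and a hot-added region having to fit within the static CFMWS) can be sketched as a small model. This is an illustration only, not code from the Linux CXL driver: the function names and parameters are hypothetical, and "256 MB" is assumed to mean 2^28 bytes.

```python
# Illustrative model only: these names are hypothetical and do not exist
# in the Linux CXL driver.
MIB = 1 << 20
GIB = 1 << 30

# Assumption: the "256 MB" granule in the Window Size rule means 2**28 bytes.
CFMWS_GRANULE = 256 * MIB


def window_size_is_aligned(window_size: int, niw: int) -> bool:
    """Return True if a CFMWS window size is a multiple of NIW * 256 MB."""
    if niw <= 0:
        raise ValueError("number of interleave ways must be positive")
    return window_size % (niw * CFMWS_GRANULE) == 0


def region_fits_window(window_size: int, allocated: int, region_size: int) -> bool:
    """Return True if a new region fits in the HPA space left in a static CFMWS.

    window_size: total HPA bytes described by the CFMWS (fixed at boot)
    allocated:   HPA bytes already consumed by existing regions
    region_size: HPA bytes needed by the region being created
    """
    return region_size <= window_size - allocated


if __name__ == "__main__":
    # A 2 GiB window trimmed by a low memory hole cannot be a multiple of
    # 3 * 256 MiB, which is the misalignment case the convention addresses.
    print(window_size_is_aligned(2 * GIB, niw=3))
    # Replacing a 64 GiB device with a 128 GiB one exceeds a 64 GiB window.
    print(region_fits_window(64 * GIB, allocated=0, region_size=128 * GIB))
```

Because the CFMWS is fixed at boot, the second check can only pass for a larger replacement device if firmware reserved the extra HPA space up front.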
