summaryrefslogtreecommitdiff
path: root/Documentation/x86/x86_64
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@linux-foundation.org>2019-05-10 13:24:53 -0400
committerLinus Torvalds <torvalds@linux-foundation.org>2019-05-10 13:24:53 -0400
commit1fb3b526df3bd7647e7854915ae6b22299408baf (patch)
treeb1fd6c9eaad0b1aff2b1df28ef1d5efcd6d7532a /Documentation/x86/x86_64
parente290e6af1d22c3f5225c9d46faabdde80e27aef2 (diff)
parentafbd4d42470e91470bc59040094b89cd717530bd (diff)
Merge tag 'docs-5.2a' of git://git.lwn.net/linux
Pull more documentation updates from Jonathan Corbet: "Some late arriving documentation changes. In particular, this contains the conversion of the x86 docs to RST, which has been in the works for some time but needed a couple of final tweaks" * tag 'docs-5.2a' of git://git.lwn.net/linux: (29 commits) Documentation: x86: convert x86_64/machinecheck to reST Documentation: x86: convert x86_64/cpu-hotplug-spec to reST Documentation: x86: convert x86_64/fake-numa-for-cpusets to reST Documentation: x86: convert x86_64/5level-paging.txt to reST Documentation: x86: convert x86_64/mm.txt to reST Documentation: x86: convert x86_64/uefi.txt to reST Documentation: x86: convert x86_64/boot-options.txt to reST Documentation: x86: convert i386/IO-APIC.txt to reST Documentation: x86: convert usb-legacy-support.txt to reST Documentation: x86: convert orc-unwinder.txt to reST Documentation: x86: convert resctrl_ui.txt to reST Documentation: x86: convert microcode.txt to reST Documentation: x86: convert pti.txt to reST Documentation: x86: convert amd-memory-encryption.txt to reST Documentation: x86: convert intel_mpx.txt to reST Documentation: x86: convert protection-keys.txt to reST Documentation: x86: convert pat.txt to reST Documentation: x86: convert mtrr.txt to reST Documentation: x86: convert tlb.txt to reST Documentation: x86: convert zero-page.txt to reST ...
Diffstat (limited to 'Documentation/x86/x86_64')
-rw-r--r--Documentation/x86/x86_64/5level-paging.rst (renamed from Documentation/x86/x86_64/5level-paging.txt)16
-rw-r--r--Documentation/x86/x86_64/boot-options.rst335
-rw-r--r--Documentation/x86/x86_64/boot-options.txt278
-rw-r--r--Documentation/x86/x86_64/cpu-hotplug-spec.rst (renamed from Documentation/x86/x86_64/cpu-hotplug-spec)5
-rw-r--r--Documentation/x86/x86_64/fake-numa-for-cpusets.rst (renamed from Documentation/x86/x86_64/fake-numa-for-cpusets)25
-rw-r--r--Documentation/x86/x86_64/index.rst16
-rw-r--r--Documentation/x86/x86_64/machinecheck.rst (renamed from Documentation/x86/x86_64/machinecheck)12
-rw-r--r--Documentation/x86/x86_64/mm.rst161
-rw-r--r--Documentation/x86/x86_64/mm.txt153
-rw-r--r--Documentation/x86/x86_64/uefi.rst (renamed from Documentation/x86/x86_64/uefi.txt)30
10 files changed, 575 insertions, 456 deletions
diff --git a/Documentation/x86/x86_64/5level-paging.txt b/Documentation/x86/x86_64/5level-paging.rst
index 2432a5ef86d9..ab88a4514163 100644
--- a/Documentation/x86/x86_64/5level-paging.txt
+++ b/Documentation/x86/x86_64/5level-paging.rst
@@ -1,5 +1,11 @@
-== Overview ==
+.. SPDX-License-Identifier: GPL-2.0
+==============
+5-level paging
+==============
+
+Overview
+========
Original x86-64 was limited by 4-level paing to 256 TiB of virtual address
space and 64 TiB of physical address space. We are already bumping into
this limit: some vendors offers servers with 64 TiB of memory today.
@@ -16,16 +22,17 @@ QEMU 2.9 and later support 5-level paging.
Virtual memory layout for 5-level paging is described in
Documentation/x86/x86_64/mm.txt
-== Enabling 5-level paging ==
+Enabling 5-level paging
+=======================
CONFIG_X86_5LEVEL=y enables the feature.
Kernel with CONFIG_X86_5LEVEL=y still able to boot on 4-level hardware.
In this case additional page table level -- p4d -- will be folded at
runtime.
-== User-space and large virtual address space ==
-
+User-space and large virtual address space
+==========================================
On x86, 5-level paging enables 56-bit userspace virtual address space.
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
@@ -58,4 +65,3 @@ One important case we need to handle here is interaction with MPX.
MPX (without MAWA extension) cannot handle addresses above 47-bit, so we
need to make sure that MPX cannot be enabled we already have VMA above
the boundary and forbid creating such VMAs once MPX is enabled.
-
diff --git a/Documentation/x86/x86_64/boot-options.rst b/Documentation/x86/x86_64/boot-options.rst
new file mode 100644
index 000000000000..2f69836b8445
--- /dev/null
+++ b/Documentation/x86/x86_64/boot-options.rst
@@ -0,0 +1,335 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================
+AMD64 Specific Boot Options
+===========================
+
+There are many others (usually documented in driver documentation), but
+only the AMD64 specific ones are listed here.
+
+Machine check
+=============
+Please see Documentation/x86/x86_64/machinecheck for sysfs runtime tunables.
+
+ mce=off
+ Disable machine check
+ mce=no_cmci
+ Disable CMCI(Corrected Machine Check Interrupt) that
+ Intel processor supports. Usually this disablement is
+ not recommended, but it might be handy if your hardware
+ is misbehaving.
+ Note that you'll get more problems without CMCI than with
+ due to the shared banks, i.e. you might get duplicated
+ error logs.
+ mce=dont_log_ce
+ Don't make logs for corrected errors. All events reported
+ as corrected are silently cleared by OS.
+ This option will be useful if you have no interest in any
+ of corrected errors.
+ mce=ignore_ce
+ Disable features for corrected errors, e.g. polling timer
+ and CMCI. All events reported as corrected are not cleared
+ by OS and remained in its error banks.
+ Usually this disablement is not recommended, however if
+ there is an agent checking/clearing corrected errors
+ (e.g. BIOS or hardware monitoring applications), conflicting
+ with OS's error handling, and you cannot deactivate the agent,
+ then this option will be a help.
+ mce=no_lmce
+ Do not opt-in to Local MCE delivery. Use legacy method
+ to broadcast MCEs.
+ mce=bootlog
+ Enable logging of machine checks left over from booting.
+ Disabled by default on AMD Fam10h and older because some BIOS
+ leave bogus ones.
+ If your BIOS doesn't do that it's a good idea to enable though
+ to make sure you log even machine check events that result
+ in a reboot. On Intel systems it is enabled by default.
+ mce=nobootlog
+ Disable boot machine check logging.
+ mce=tolerancelevel[,monarchtimeout] (number,number)
+ tolerance levels:
+ 0: always panic on uncorrected errors, log corrected errors
+ 1: panic or SIGBUS on uncorrected errors, log corrected errors
+ 2: SIGBUS or log uncorrected errors, log corrected errors
+ 3: never panic or SIGBUS, log all errors (for testing only)
+ Default is 1
+ Can be also set using sysfs which is preferable.
+ monarchtimeout:
+ Sets the time in us to wait for other CPUs on machine checks. 0
+ to disable.
+ mce=bios_cmci_threshold
+ Don't overwrite the bios-set CMCI threshold. This boot option
+ prevents Linux from overwriting the CMCI threshold set by the
+ bios. Without this option, Linux always sets the CMCI
+ threshold to 1. Enabling this may make memory predictive failure
+ analysis less effective if the bios sets thresholds for memory
+ errors since we will not see details for all errors.
+ mce=recovery
+ Force-enable recoverable machine check code paths
+
+ nomce (for compatibility with i386)
+ same as mce=off
+
+ Everything else is in sysfs now.
+
+APICs
+=====
+
+ apic
+ Use IO-APIC. Default
+
+ noapic
+ Don't use the IO-APIC.
+
+ disableapic
+ Don't use the local APIC
+
+ nolapic
+ Don't use the local APIC (alias for i386 compatibility)
+
+ pirq=...
+ See Documentation/x86/i386/IO-APIC.txt
+
+ noapictimer
+ Don't set up the APIC timer
+
+ no_timer_check
+ Don't check the IO-APIC timer. This can work around
+ problems with incorrect timer initialization on some boards.
+
+ apicpmtimer
+ Do APIC timer calibration using the pmtimer. Implies
+ apicmaintimer. Useful when your PIT timer is totally broken.
+
+Timing
+======
+
+ notsc
+ Deprecated, use tsc=unstable instead.
+
+ nohpet
+ Don't use the HPET timer.
+
+Idle loop
+=========
+
+ idle=poll
+ Don't do power saving in the idle loop using HLT, but poll for rescheduling
+ event. This will make the CPUs eat a lot more power, but may be useful
+ to get slightly better performance in multiprocessor benchmarks. It also
+ makes some profiling using performance counters more accurate.
+ Please note that on systems with MONITOR/MWAIT support (like Intel EM64T
+ CPUs) this option has no performance advantage over the normal idle loop.
+ It may also interact badly with hyperthreading.
+
+Rebooting
+=========
+
+ reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] [, [w]arm | [c]old]
+ bios
+ Use the CPU reboot vector for warm reset
+ warm
+ Don't set the cold reboot flag
+ cold
+ Set the cold reboot flag
+ triple
+ Force a triple fault (init)
+ kbd
+ Use the keyboard controller. cold reset (default)
+ acpi
+ Use the ACPI RESET_REG in the FADT. If ACPI is not configured or
+ the ACPI reset does not work, the reboot path attempts the reset
+ using the keyboard controller.
+ efi
+ Use efi reset_system runtime service. If EFI is not configured or
+ the EFI reset does not work, the reboot path attempts the reset using
+ the keyboard controller.
+
+ Using warm reset will be much faster especially on big memory
+ systems because the BIOS will not go through the memory check.
+ Disadvantage is that not all hardware will be completely reinitialized
+ on reboot so there may be boot problems on some systems.
+
+ reboot=force
+ Don't stop other CPUs on reboot. This can make reboot more reliable
+ in some cases.
+
+Non Executable Mappings
+=======================
+
+ noexec=on|off
+ on
+ Enable(default)
+ off
+ Disable
+
+NUMA
+====
+
+ numa=off
+ Only set up a single NUMA node spanning all memory.
+
+ numa=noacpi
+ Don't parse the SRAT table for NUMA setup
+
+ numa=fake=<size>[MG]
+ If given as a memory unit, fills all system RAM with nodes of
+ size interleaved over physical nodes.
+
+ numa=fake=<N>
+ If given as an integer, fills all system RAM with N fake nodes
+ interleaved over physical nodes.
+
+ numa=fake=<N>U
+ If given as an integer followed by 'U', it will divide each
+ physical node into N emulated nodes.
+
+ACPI
+====
+
+ acpi=off
+ Don't enable ACPI
+ acpi=ht
+ Use ACPI boot table parsing, but don't enable ACPI interpreter
+ acpi=force
+ Force ACPI on (currently not needed)
+ acpi=strict
+ Disable out of spec ACPI workarounds.
+ acpi_sci={edge,level,high,low}
+ Set up ACPI SCI interrupt.
+ acpi=noirq
+ Don't route interrupts
+ acpi=nocmcff
+ Disable firmware first mode for corrected errors. This
+ disables parsing the HEST CMC error source to check if
+ firmware has set the FF flag. This may result in
+ duplicate corrected error reports.
+
+PCI
+===
+
+ pci=off
+ Don't use PCI
+ pci=conf1
+ Use conf1 access.
+ pci=conf2
+ Use conf2 access.
+ pci=rom
+ Assign ROMs.
+ pci=assign-busses
+ Assign busses
+ pci=irqmask=MASK
+ Set PCI interrupt mask to MASK
+ pci=lastbus=NUMBER
+ Scan up to NUMBER busses, no matter what the mptable says.
+ pci=noacpi
+ Don't use ACPI to set up PCI interrupt routing.
+
+IOMMU (input/output memory management unit)
+===========================================
+Multiple x86-64 PCI-DMA mapping implementations exist, for example:
+
+ 1. <lib/dma-direct.c>: use no hardware/software IOMMU at all
+ (e.g. because you have < 3 GB memory).
+ Kernel boot message: "PCI-DMA: Disabling IOMMU"
+
+ 2. <arch/x86/kernel/amd_gart_64.c>: AMD GART based hardware IOMMU.
+ Kernel boot message: "PCI-DMA: using GART IOMMU"
+
+ 3. <arch/x86_64/kernel/pci-swiotlb.c> : Software IOMMU implementation. Used
+ e.g. if there is no hardware IOMMU in the system and it is need because
+ you have >3GB memory or told the kernel to us it (iommu=soft))
+ Kernel boot message: "PCI-DMA: Using software bounce buffering
+ for IO (SWIOTLB)"
+
+ 4. <arch/x86_64/pci-calgary.c> : IBM Calgary hardware IOMMU. Used in IBM
+ pSeries and xSeries servers. This hardware IOMMU supports DMA address
+ mapping with memory protection, etc.
+ Kernel boot message: "PCI-DMA: Using Calgary IOMMU"
+
+::
+
+ iommu=[<size>][,noagp][,off][,force][,noforce]
+ [,memaper[=<order>]][,merge][,fullflush][,nomerge]
+ [,noaperture][,calgary]
+
+General iommu options:
+
+ off
+ Don't initialize and use any kind of IOMMU.
+ noforce
+ Don't force hardware IOMMU usage when it is not needed. (default).
+ force
+ Force the use of the hardware IOMMU even when it is
+ not actually needed (e.g. because < 3 GB memory).
+ soft
+ Use software bounce buffering (SWIOTLB) (default for
+ Intel machines). This can be used to prevent the usage
+ of an available hardware IOMMU.
+
+iommu options only relevant to the AMD GART hardware IOMMU:
+
+ <size>
+ Set the size of the remapping area in bytes.
+ allowed
+ Overwrite iommu off workarounds for specific chipsets.
+ fullflush
+ Flush IOMMU on each allocation (default).
+ nofullflush
+ Don't use IOMMU fullflush.
+ memaper[=<order>]
+ Allocate an own aperture over RAM with size 32MB<<order.
+ (default: order=1, i.e. 64MB)
+ merge
+ Do scatter-gather (SG) merging. Implies "force" (experimental).
+ nomerge
+ Don't do scatter-gather (SG) merging.
+ noaperture
+ Ask the IOMMU not to touch the aperture for AGP.
+ noagp
+ Don't initialize the AGP driver and use full aperture.
+ panic
+ Always panic when IOMMU overflows.
+ calgary
+ Use the Calgary IOMMU if it is available
+
+iommu options only relevant to the software bounce buffering (SWIOTLB) IOMMU
+implementation:
+
+ swiotlb=<pages>[,force]
+ <pages>
+ Prereserve that many 128K pages for the software IO bounce buffering.
+ force
+ Force all IO through the software TLB.
+
+Settings for the IBM Calgary hardware IOMMU currently found in IBM
+pSeries and xSeries machines
+
+ calgary=[64k,128k,256k,512k,1M,2M,4M,8M]
+ Set the size of each PCI slot's translation table when using the
+ Calgary IOMMU. This is the size of the translation table itself
+ in main memory. The smallest table, 64k, covers an IO space of
+ 32MB; the largest, 8MB table, can cover an IO space of 4GB.
+ Normally the kernel will make the right choice by itself.
+ calgary=[translate_empty_slots]
+ Enable translation even on slots that have no devices attached to
+ them, in case a device will be hotplugged in the future.
+ calgary=[disable=<PCI bus number>]
+ Disable translation on a given PHB. For
+ example, the built-in graphics adapter resides on the first bridge
+ (PCI bus number 0); if translation (isolation) is enabled on this
+ bridge, X servers that access the hardware directly from user
+ space might stop working. Use this option if you have devices that
+ are accessed from userspace directly on some PCI host bridge.
+ panic
+ Always panic when IOMMU overflows
+
+
+Miscellaneous
+=============
+
+ nogbpages
+ Do not use GB pages for kernel direct mappings.
+ gbpages
+ Use GB pages for kernel direct mappings.
diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
deleted file mode 100644
index abc53886655e..000000000000
--- a/Documentation/x86/x86_64/boot-options.txt
+++ /dev/null
@@ -1,278 +0,0 @@
-AMD64 specific boot options
-
-There are many others (usually documented in driver documentation), but
-only the AMD64 specific ones are listed here.
-
-Machine check
-
- Please see Documentation/x86/x86_64/machinecheck for sysfs runtime tunables.
-
- mce=off
- Disable machine check
- mce=no_cmci
- Disable CMCI(Corrected Machine Check Interrupt) that
- Intel processor supports. Usually this disablement is
- not recommended, but it might be handy if your hardware
- is misbehaving.
- Note that you'll get more problems without CMCI than with
- due to the shared banks, i.e. you might get duplicated
- error logs.
- mce=dont_log_ce
- Don't make logs for corrected errors. All events reported
- as corrected are silently cleared by OS.
- This option will be useful if you have no interest in any
- of corrected errors.
- mce=ignore_ce
- Disable features for corrected errors, e.g. polling timer
- and CMCI. All events reported as corrected are not cleared
- by OS and remained in its error banks.
- Usually this disablement is not recommended, however if
- there is an agent checking/clearing corrected errors
- (e.g. BIOS or hardware monitoring applications), conflicting
- with OS's error handling, and you cannot deactivate the agent,
- then this option will be a help.
- mce=no_lmce
- Do not opt-in to Local MCE delivery. Use legacy method
- to broadcast MCEs.
- mce=bootlog
- Enable logging of machine checks left over from booting.
- Disabled by default on AMD Fam10h and older because some BIOS
- leave bogus ones.
- If your BIOS doesn't do that it's a good idea to enable though
- to make sure you log even machine check events that result
- in a reboot. On Intel systems it is enabled by default.
- mce=nobootlog
- Disable boot machine check logging.
- mce=tolerancelevel[,monarchtimeout] (number,number)
- tolerance levels:
- 0: always panic on uncorrected errors, log corrected errors
- 1: panic or SIGBUS on uncorrected errors, log corrected errors
- 2: SIGBUS or log uncorrected errors, log corrected errors
- 3: never panic or SIGBUS, log all errors (for testing only)
- Default is 1
- Can be also set using sysfs which is preferable.
- monarchtimeout:
- Sets the time in us to wait for other CPUs on machine checks. 0
- to disable.
- mce=bios_cmci_threshold
- Don't overwrite the bios-set CMCI threshold. This boot option
- prevents Linux from overwriting the CMCI threshold set by the
- bios. Without this option, Linux always sets the CMCI
- threshold to 1. Enabling this may make memory predictive failure
- analysis less effective if the bios sets thresholds for memory
- errors since we will not see details for all errors.
- mce=recovery
- Force-enable recoverable machine check code paths
-
- nomce (for compatibility with i386): same as mce=off
-
- Everything else is in sysfs now.
-
-APICs
-
- apic Use IO-APIC. Default
-
- noapic Don't use the IO-APIC.
-
- disableapic Don't use the local APIC
-
- nolapic Don't use the local APIC (alias for i386 compatibility)
-
- pirq=... See Documentation/x86/i386/IO-APIC.txt
-
- noapictimer Don't set up the APIC timer
-
- no_timer_check Don't check the IO-APIC timer. This can work around
- problems with incorrect timer initialization on some boards.
- apicpmtimer
- Do APIC timer calibration using the pmtimer. Implies
- apicmaintimer. Useful when your PIT timer is totally
- broken.
-
-Timing
-
- notsc
- Deprecated, use tsc=unstable instead.
-
- nohpet
- Don't use the HPET timer.
-
-Idle loop
-
- idle=poll
- Don't do power saving in the idle loop using HLT, but poll for rescheduling
- event. This will make the CPUs eat a lot more power, but may be useful
- to get slightly better performance in multiprocessor benchmarks. It also
- makes some profiling using performance counters more accurate.
- Please note that on systems with MONITOR/MWAIT support (like Intel EM64T
- CPUs) this option has no performance advantage over the normal idle loop.
- It may also interact badly with hyperthreading.
-
-Rebooting
-
- reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] [, [w]arm | [c]old]
- bios Use the CPU reboot vector for warm reset
- warm Don't set the cold reboot flag
- cold Set the cold reboot flag
- triple Force a triple fault (init)
- kbd Use the keyboard controller. cold reset (default)
- acpi Use the ACPI RESET_REG in the FADT. If ACPI is not configured or the
- ACPI reset does not work, the reboot path attempts the reset using
- the keyboard controller.
- efi Use efi reset_system runtime service. If EFI is not configured or the
- EFI reset does not work, the reboot path attempts the reset using
- the keyboard controller.
-
- Using warm reset will be much faster especially on big memory
- systems because the BIOS will not go through the memory check.
- Disadvantage is that not all hardware will be completely reinitialized
- on reboot so there may be boot problems on some systems.
-
- reboot=force
-
- Don't stop other CPUs on reboot. This can make reboot more reliable
- in some cases.
-
-Non Executable Mappings
-
- noexec=on|off
-
- on Enable(default)
- off Disable
-
-NUMA
-
- numa=off Only set up a single NUMA node spanning all memory.
-
- numa=noacpi Don't parse the SRAT table for NUMA setup
-
- numa=fake=<size>[MG]
- If given as a memory unit, fills all system RAM with nodes of
- size interleaved over physical nodes.
-
- numa=fake=<N>
- If given as an integer, fills all system RAM with N fake nodes
- interleaved over physical nodes.
-
- numa=fake=<N>U
- If given as an integer followed by 'U', it will divide each
- physical node into N emulated nodes.
-
-ACPI
-
- acpi=off Don't enable ACPI
- acpi=ht Use ACPI boot table parsing, but don't enable ACPI
- interpreter
- acpi=force Force ACPI on (currently not needed)
-
- acpi=strict Disable out of spec ACPI workarounds.
-
- acpi_sci={edge,level,high,low} Set up ACPI SCI interrupt.
-
- acpi=noirq Don't route interrupts
-
- acpi=nocmcff Disable firmware first mode for corrected errors. This
- disables parsing the HEST CMC error source to check if
- firmware has set the FF flag. This may result in
- duplicate corrected error reports.
-
-PCI
-
- pci=off Don't use PCI
- pci=conf1 Use conf1 access.
- pci=conf2 Use conf2 access.
- pci=rom Assign ROMs.
- pci=assign-busses Assign busses
- pci=irqmask=MASK Set PCI interrupt mask to MASK
- pci=lastbus=NUMBER Scan up to NUMBER busses, no matter what the mptable says.
- pci=noacpi Don't use ACPI to set up PCI interrupt routing.
-
-IOMMU (input/output memory management unit)
-
- Multiple x86-64 PCI-DMA mapping implementations exist, for example:
-
- 1. <lib/dma-direct.c>: use no hardware/software IOMMU at all
- (e.g. because you have < 3 GB memory).
- Kernel boot message: "PCI-DMA: Disabling IOMMU"
-
- 2. <arch/x86/kernel/amd_gart_64.c>: AMD GART based hardware IOMMU.
- Kernel boot message: "PCI-DMA: using GART IOMMU"
-
- 3. <arch/x86_64/kernel/pci-swiotlb.c> : Software IOMMU implementation. Used
- e.g. if there is no hardware IOMMU in the system and it is need because
- you have >3GB memory or told the kernel to us it (iommu=soft))
- Kernel boot message: "PCI-DMA: Using software bounce buffering
- for IO (SWIOTLB)"
-
- 4. <arch/x86_64/pci-calgary.c> : IBM Calgary hardware IOMMU. Used in IBM
- pSeries and xSeries servers. This hardware IOMMU supports DMA address
- mapping with memory protection, etc.
- Kernel boot message: "PCI-DMA: Using Calgary IOMMU"
-
- iommu=[<size>][,noagp][,off][,force][,noforce]
- [,memaper[=<order>]][,merge][,fullflush][,nomerge]
- [,noaperture][,calgary]
-
- General iommu options:
- off Don't initialize and use any kind of IOMMU.
- noforce Don't force hardware IOMMU usage when it is not needed.
- (default).
- force Force the use of the hardware IOMMU even when it is
- not actually needed (e.g. because < 3 GB memory).
- soft Use software bounce buffering (SWIOTLB) (default for
- Intel machines). This can be used to prevent the usage
- of an available hardware IOMMU.
-
- iommu options only relevant to the AMD GART hardware IOMMU:
- <size> Set the size of the remapping area in bytes.
- allowed Overwrite iommu off workarounds for specific chipsets.
- fullflush Flush IOMMU on each allocation (default).
- nofullflush Don't use IOMMU fullflush.
- memaper[=<order>] Allocate an own aperture over RAM with size 32MB<<order.
- (default: order=1, i.e. 64MB)
- merge Do scatter-gather (SG) merging. Implies "force"
- (experimental).
- nomerge Don't do scatter-gather (SG) merging.
- noaperture Ask the IOMMU not to touch the aperture for AGP.
- noagp Don't initialize the AGP driver and use full aperture.
- panic Always panic when IOMMU overflows.
- calgary Use the Calgary IOMMU if it is available
-
- iommu options only relevant to the software bounce buffering (SWIOTLB) IOMMU
- implementation:
- swiotlb=<pages>[,force]
- <pages> Prereserve that many 128K pages for the software IO
- bounce buffering.
- force Force all IO through the software TLB.
-
- Settings for the IBM Calgary hardware IOMMU currently found in IBM
- pSeries and xSeries machines:
-
- calgary=[64k,128k,256k,512k,1M,2M,4M,8M]
- calgary=[translate_empty_slots]
- calgary=[disable=<PCI bus number>]
- panic Always panic when IOMMU overflows
-
- 64k,...,8M - Set the size of each PCI slot's translation table
- when using the Calgary IOMMU. This is the size of the translation
- table itself in main memory. The smallest table, 64k, covers an IO
- space of 32MB; the largest, 8MB table, can cover an IO space of
- 4GB. Normally the kernel will make the right choice by itself.
-
- translate_empty_slots - Enable translation even on slots that have
- no devices attached to them, in case a device will be hotplugged
- in the future.
-
- disable=<PCI bus number> - Disable translation on a given PHB. For
- example, the built-in graphics adapter resides on the first bridge
- (PCI bus number 0); if translation (isolation) is enabled on this
- bridge, X servers that access the hardware directly from user
- space might stop working. Use this option if you have devices that
- are accessed from userspace directly on some PCI host bridge.
-
-Miscellaneous
-
- nogbpages
- Do not use GB pages for kernel direct mappings.
- gbpages
- Use GB pages for kernel direct mappings.
diff --git a/Documentation/x86/x86_64/cpu-hotplug-spec b/Documentation/x86/x86_64/cpu-hotplug-spec.rst
index 3c23e0587db3..8d1c91f0c880 100644
--- a/Documentation/x86/x86_64/cpu-hotplug-spec
+++ b/Documentation/x86/x86_64/cpu-hotplug-spec.rst
@@ -1,5 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================================
Firmware support for CPU hotplug under Linux/x86-64
----------------------------------------------------
+===================================================
Linux/x86-64 supports CPU hotplug now. For various reasons Linux wants to
know in advance of boot time the maximum number of CPUs that could be plugged
diff --git a/Documentation/x86/x86_64/fake-numa-for-cpusets b/Documentation/x86/x86_64/fake-numa-for-cpusets.rst
index 4b09f18831f8..74fbb78b3c67 100644
--- a/Documentation/x86/x86_64/fake-numa-for-cpusets
+++ b/Documentation/x86/x86_64/fake-numa-for-cpusets.rst
@@ -1,5 +1,12 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Fake NUMA For CPUSets
+=====================
+
+:Author: David Rientjes <rientjes@cs.washington.edu>
+
Using numa=fake and CPUSets for Resource Management
-Written by David Rientjes <rientjes@cs.washington.edu>
This document describes how the numa=fake x86_64 command-line option can be used
in conjunction with cpusets for coarse memory management. Using this feature,
@@ -20,7 +27,7 @@ you become more familiar with using this combination for resource control,
you'll determine a better setup to minimize the number of nodes you have to deal
with.
-A machine may be split as follows with "numa=fake=4*512," as reported by dmesg:
+A machine may be split as follows with "numa=fake=4*512," as reported by dmesg::
Faking node 0 at 0000000000000000-0000000020000000 (512MB)
Faking node 1 at 0000000020000000-0000000040000000 (512MB)
@@ -34,7 +41,7 @@ A machine may be split as follows with "numa=fake=4*512," as reported by dmesg:
Now following the instructions for mounting the cpusets filesystem from
Documentation/cgroup-v1/cpusets.txt, you can assign fake nodes (i.e. contiguous memory
-address spaces) to individual cpusets:
+address spaces) to individual cpusets::
[root@xroads /]# mkdir exampleset
[root@xroads /]# mount -t cpuset none exampleset
@@ -47,7 +54,7 @@ Now this cpuset, 'ddset', will only allowed access to fake nodes 0 and 1 for
memory allocations (1G).
You can now assign tasks to these cpusets to limit the memory resources
-available to them according to the fake nodes assigned as mems:
+available to them according to the fake nodes assigned as mems::
[root@xroads /exampleset/ddset]# echo $$ > tasks
[root@xroads /exampleset/ddset]# dd if=/dev/zero of=tmp bs=1024 count=1G
@@ -57,9 +64,13 @@ Notice the difference between the system memory usage as reported by
/proc/meminfo between the restricted cpuset case above and the unrestricted
case (i.e. running the same 'dd' command without assigning it to a fake NUMA
cpuset):
- Unrestricted Restricted
- MemTotal: 3091900 kB 3091900 kB
- MemFree: 42113 kB 1513236 kB
+
+ ======== ============ ==========
+ Name Unrestricted Restricted
+ ======== ============ ==========
+ MemTotal 3091900 kB 3091900 kB
+ MemFree 42113 kB 1513236 kB
+ ======== ============ ==========
This allows for coarse memory management for the tasks you assign to particular
cpusets. Since cpusets can form a hierarchy, you can create some pretty
diff --git a/Documentation/x86/x86_64/index.rst b/Documentation/x86/x86_64/index.rst
new file mode 100644
index 000000000000..d6eaaa5a35fc
--- /dev/null
+++ b/Documentation/x86/x86_64/index.rst
@@ -0,0 +1,16 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+x86_64 Support
+==============
+
+.. toctree::
+ :maxdepth: 2
+
+ boot-options
+ uefi
+ mm
+ 5level-paging
+ fake-numa-for-cpusets
+ cpu-hotplug-spec
+ machinecheck
diff --git a/Documentation/x86/x86_64/machinecheck b/Documentation/x86/x86_64/machinecheck.rst
index d0648a74fceb..e189168406fa 100644
--- a/Documentation/x86/x86_64/machinecheck
+++ b/Documentation/x86/x86_64/machinecheck.rst
@@ -1,5 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
-Configurable sysfs parameters for the x86-64 machine check code.
+===============================================================
+Configurable sysfs parameters for the x86-64 machine check code
+===============================================================
Machine checks report internal hardware error conditions detected
by the CPU. Uncorrected errors typically cause a machine check
@@ -16,14 +19,13 @@ log then mcelog should run to collect and decode machine check entries
from /dev/mcelog. Normally mcelog should be run regularly from a cronjob.
Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
-(N = CPU number)
+(N = CPU number).
The directory contains some configurable entries:
-Entries:
-
bankNctl
-(N bank number)
+ (N bank number)
+
64bit Hex bitmask enabling/disabling specific subevents for bank N
When a bit in the bitmask is zero then the respective
subevent will not be reported.
diff --git a/Documentation/x86/x86_64/mm.rst b/Documentation/x86/x86_64/mm.rst
new file mode 100644
index 000000000000..267fc4808945
--- /dev/null
+++ b/Documentation/x86/x86_64/mm.rst
@@ -0,0 +1,161 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================
+Memory Managment
+================
+
+Complete virtual memory map with 4-level page tables
+====================================================
+
+.. note::
+
+ - Negative addresses such as "-23 TB" are absolute addresses in bytes, counted down
+ from the top of the 64-bit address space. It's easier to understand the layout
+ when seen both in absolute addresses and in distance-from-top notation.
+
+ For example 0xffffe90000000000 == -23 TB, it's 23 TB lower than the top of the
+ 64-bit address space (ffffffffffffffff).
+
+ Note that as we get closer to the top of the address space, the notation changes
+ from TB to GB and then MB/KB.
+
+ - "16M TB" might look weird at first sight, but it's an easier to visualize size
+ notation than "16 EB", which few will recognize at first sight as 16 exabytes.
+ It also shows it nicely how incredibly large 64-bit address space is.
+
+::
+
+ ========================================================================================================================
+ Start addr | Offset | End addr | Size | VM area description
+ ========================================================================================================================
+ | | | |
+ 0000000000000000 | 0 | 00007fffffffffff | 128 TB | user-space virtual memory, different per mm
+ __________________|____________|__________________|_________|___________________________________________________________
+ | | | |
+ 0000800000000000 | +128 TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
+ | | | | virtual memory addresses up to the -128 TB
+ | | | | starting offset of kernel mappings.
+ __________________|____________|__________________|_________|___________________________________________________________
+ |
+ | Kernel-space virtual memory, shared between all processes:
+ ____________________________________________________________|___________________________________________________________
+ | | | |
+ ffff800000000000 | -128 TB | ffff87ffffffffff | 8 TB | ... guard hole, also reserved for hypervisor
+ ffff880000000000 | -120 TB | ffff887fffffffff | 0.5 TB | LDT remap for PTI
+ ffff888000000000 | -119.5 TB | ffffc87fffffffff | 64 TB | direct mapping of all physical memory (page_offset_base)
+ ffffc88000000000 | -55.5 TB | ffffc8ffffffffff | 0.5 TB | ... unused hole
+ ffffc90000000000 | -55 TB | ffffe8ffffffffff | 32 TB | vmalloc/ioremap space (vmalloc_base)
+ ffffe90000000000 | -23 TB | ffffe9ffffffffff | 1 TB | ... unused hole
+ ffffea0000000000 | -22 TB | ffffeaffffffffff | 1 TB | virtual memory map (vmemmap_base)
+ ffffeb0000000000 | -21 TB | ffffebffffffffff | 1 TB | ... unused hole
+ ffffec0000000000 | -20 TB | fffffbffffffffff | 16 TB | KASAN shadow memory
+ __________________|____________|__________________|_________|____________________________________________________________
+ |
+ | Identical layout to the 56-bit one from here on:
+ ____________________________________________________________|____________________________________________________________
+ | | | |
+ fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole
+ | | | | vaddr_end for KASLR
+ fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping
+ fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole
+ ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks
+ ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole
+ ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space
+ ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | ... unused hole
+ ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0
+ ffffffff80000000 |-2048 MB | | |
+ ffffffffa0000000 |-1536 MB | fffffffffeffffff | 1520 MB | module mapping space
+ ffffffffff000000 | -16 MB | | |
+ FIXADDR_START | ~-11 MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
+ ffffffffff600000 | -10 MB | ffffffffff600fff | 4 kB | legacy vsyscall ABI
+ ffffffffffe00000 | -2 MB | ffffffffffffffff | 2 MB | ... unused hole
+ __________________|____________|__________________|_________|___________________________________________________________
+
+
+Complete virtual memory map with 5-level page tables
+====================================================
+
+.. note::
+
+ - With 56-bit addresses, user-space memory gets expanded by a factor of 512x,
+ from 0.125 PB to 64 PB. All kernel mappings shift down to the -64 PB starting
+ offset and many of the regions expand to support the much larger physical
+ memory supported.
+
+::
+
+ ========================================================================================================================
+ Start addr | Offset | End addr | Size | VM area description
+ ========================================================================================================================
+ | | | |
+ 0000000000000000 | 0 | 00ffffffffffffff | 64 PB | user-space virtual memory, different per mm
+ __________________|____________|__________________|_________|___________________________________________________________
+ | | | |
+ 0100000000000000 | +64 PB | feffffffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical
+ | | | | virtual memory addresses up to the -64 PB
+ | | | | starting offset of kernel mappings.
+ __________________|____________|__________________|_________|___________________________________________________________
+ |
+ | Kernel-space virtual memory, shared between all processes:
+ ____________________________________________________________|___________________________________________________________
+ | | | |
+ ff00000000000000 | -64 PB | ff0fffffffffffff | 4 PB | ... guard hole, also reserved for hypervisor
+ ff10000000000000 | -60 PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI
+ ff11000000000000 | -59.75 PB | ff90ffffffffffff | 32 PB | direct mapping of all physical memory (page_offset_base)
+ ff91000000000000 | -27.75 PB | ff9fffffffffffff | 3.75 PB | ... unused hole
+ ffa0000000000000 | -24 PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base)
+ ffd2000000000000 | -11.5 PB | ffd3ffffffffffff | 0.5 PB | ... unused hole
+ ffd4000000000000 | -11 PB | ffd5ffffffffffff | 0.5 PB | virtual memory map (vmemmap_base)
+ ffd6000000000000 | -10.5 PB | ffdeffffffffffff | 2.25 PB | ... unused hole
+ ffdf000000000000 | -8.25 PB | fffffbffffffffff | ~8 PB | KASAN shadow memory
+ __________________|____________|__________________|_________|____________________________________________________________
+ |
+ | Identical layout to the 47-bit one from here on:
+ ____________________________________________________________|____________________________________________________________
+ | | | |
+ fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole
+ | | | | vaddr_end for KASLR
+ fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping
+ fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole
+ ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks
+ ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole
+ ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space
+ ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | ... unused hole
+ ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0
+ ffffffff80000000 |-2048 MB | | |
+ ffffffffa0000000 |-1536 MB | fffffffffeffffff | 1520 MB | module mapping space
+ ffffffffff000000 | -16 MB | | |
+ FIXADDR_START | ~-11 MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
+ ffffffffff600000 | -10 MB | ffffffffff600fff | 4 kB | legacy vsyscall ABI
+ ffffffffffe00000 | -2 MB | ffffffffffffffff | 2 MB | ... unused hole
+ __________________|____________|__________________|_________|___________________________________________________________
+
+Architecture defines a 64-bit virtual address. Implementations can support
+less. Currently supported are 48- and 57-bit virtual addresses. Bits 63
+through to the most-significant implemented bit are sign extended.
+This causes hole between user space and kernel addresses if you interpret them
+as unsigned.
+
+The direct mapping covers all memory in the system up to the highest
+memory address (this means in some cases it can also include PCI memory
+holes).
+
+vmalloc space is lazily synchronized into the different PML4/PML5 pages of
+the processes using the page fault handler, with init_top_pgt as
+reference.
+
+We map EFI runtime services in the 'efi_pgd' PGD in a 64Gb large virtual
+memory window (this size is arbitrary, it can be raised later if needed).
+The mappings are not part of any other kernel PGD and are only available
+during EFI runtime calls.
+
+Note that if CONFIG_RANDOMIZE_MEMORY is enabled, the direct mapping of all
+physical memory, vmalloc/ioremap space and virtual memory map are randomized.
+Their order is preserved but their base will be offset early at boot time.
+
+Be very careful vs. KASLR when changing anything here. The KASLR address
+range must not overlap with anything except the KASAN shadow area, which is
+correct as KASAN disables KASLR.
+
+For both 4- and 5-level layouts, the STACKLEAK_POISON value in the last 2MB
+hole: ffffffffffff4111
diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
deleted file mode 100644
index 6cbe652d7a49..000000000000
--- a/Documentation/x86/x86_64/mm.txt
+++ /dev/null
@@ -1,153 +0,0 @@
-====================================================
-Complete virtual memory map with 4-level page tables
-====================================================
-
-Notes:
-
- - Negative addresses such as "-23 TB" are absolute addresses in bytes, counted down
- from the top of the 64-bit address space. It's easier to understand the layout
- when seen both in absolute addresses and in distance-from-top notation.
-
- For example 0xffffe90000000000 == -23 TB, it's 23 TB lower than the top of the
- 64-bit address space (ffffffffffffffff).
-
- Note that as we get closer to the top of the address space, the notation changes
- from TB to GB and then MB/KB.
-
- - "16M TB" might look weird at first sight, but it's an easier to visualize size
- notation than "16 EB", which few will recognize at first sight as 16 exabytes.
- It also shows it nicely how incredibly large 64-bit address space is.
-
-========================================================================================================================
- Start addr | Offset | End addr | Size | VM area description
-========================================================================================================================
- | | | |
- 0000000000000000 | 0 | 00007fffffffffff | 128 TB | user-space virtual memory, different per mm
-__________________|____________|__________________|_________|___________________________________________________________
- | | | |
- 0000800000000000 | +128 TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
- | | | | virtual memory addresses up to the -128 TB
- | | | | starting offset of kernel mappings.
-__________________|____________|__________________|_________|___________________________________________________________
- |
- | Kernel-space virtual memory, shared between all processes:
-____________________________________________________________|___________________________________________________________
- | | | |
- ffff800000000000 | -128 TB | ffff87ffffffffff | 8 TB | ... guard hole, also reserved for hypervisor
- ffff880000000000 | -120 TB | ffff887fffffffff | 0.5 TB | LDT remap for PTI
- ffff888000000000 | -119.5 TB | ffffc87fffffffff | 64 TB | direct mapping of all physical memory (page_offset_base)
- ffffc88000000000 | -55.5 TB | ffffc8ffffffffff | 0.5 TB | ... unused hole
- ffffc90000000000 | -55 TB | ffffe8ffffffffff | 32 TB | vmalloc/ioremap space (vmalloc_base)
- ffffe90000000000 | -23 TB | ffffe9ffffffffff | 1 TB | ... unused hole
- ffffea0000000000 | -22 TB | ffffeaffffffffff | 1 TB | virtual memory map (vmemmap_base)
- ffffeb0000000000 | -21 TB | ffffebffffffffff | 1 TB | ... unused hole
- ffffec0000000000 | -20 TB | fffffbffffffffff | 16 TB | KASAN shadow memory
-__________________|____________|__________________|_________|____________________________________________________________
- |
- | Identical layout to the 56-bit one from here on:
-____________________________________________________________|____________________________________________________________
- | | | |
- fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole
- | | | | vaddr_end for KASLR
- fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping
- fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole
- ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks
- ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole
- ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space
- ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | ... unused hole
- ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0
- ffffffff80000000 |-2048 MB | | |
- ffffffffa0000000 |-1536 MB | fffffffffeffffff | 1520 MB | module mapping space
- ffffffffff000000 | -16 MB | | |
- FIXADDR_START | ~-11 MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
- ffffffffff600000 | -10 MB | ffffffffff600fff | 4 kB | legacy vsyscall ABI
- ffffffffffe00000 | -2 MB | ffffffffffffffff | 2 MB | ... unused hole
-__________________|____________|__________________|_________|___________________________________________________________
-
-
-====================================================
-Complete virtual memory map with 5-level page tables
-====================================================
-
-Notes:
-
- - With 56-bit addresses, user-space memory gets expanded by a factor of 512x,
- from 0.125 PB to 64 PB. All kernel mappings shift down to the -64 PB starting
- offset and many of the regions expand to support the much larger physical
- memory supported.
-
-========================================================================================================================
- Start addr | Offset | End addr | Size | VM area description
-========================================================================================================================
- | | | |
- 0000000000000000 | 0 | 00ffffffffffffff | 64 PB | user-space virtual memory, different per mm
-__________________|____________|__________________|_________|___________________________________________________________
- | | | |
- 0100000000000000 | +64 PB | feffffffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical
- | | | | virtual memory addresses up to the -64 PB
- | | | | starting offset of kernel mappings.
-__________________|____________|__________________|_________|___________________________________________________________
- |
- | Kernel-space virtual memory, shared between all processes:
-____________________________________________________________|___________________________________________________________
- | | | |
- ff00000000000000 | -64 PB | ff0fffffffffffff | 4 PB | ... guard hole, also reserved for hypervisor
- ff10000000000000 | -60 PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI
- ff11000000000000 | -59.75 PB | ff90ffffffffffff | 32 PB | direct mapping of all physical memory (page_offset_base)
- ff91000000000000 | -27.75 PB | ff9fffffffffffff | 3.75 PB | ... unused hole
- ffa0000000000000 | -24 PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base)
- ffd2000000000000 | -11.5 PB | ffd3ffffffffffff | 0.5 PB | ... unused hole
- ffd4000000000000 | -11 PB | ffd5ffffffffffff | 0.5 PB | virtual memory map (vmemmap_base)
- ffd6000000000000 | -10.5 PB | ffdeffffffffffff | 2.25 PB | ... unused hole
- ffdf000000000000 | -8.25 PB | fffffbffffffffff | ~8 PB | KASAN shadow memory
-__________________|____________|__________________|_________|____________________________________________________________
- |
- | Identical layout to the 47-bit one from here on:
-____________________________________________________________|____________________________________________________________
- | | | |
- fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole
- | | | | vaddr_end for KASLR
- fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping
- fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole
- ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks
- ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole
- ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space
- ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | ... unused hole
- ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0
- ffffffff80000000 |-2048 MB | | |
- ffffffffa0000000 |-1536 MB | fffffffffeffffff | 1520 MB | module mapping space
- ffffffffff000000 | -16 MB | | |
- FIXADDR_START | ~-11 MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
- ffffffffff600000 | -10 MB | ffffffffff600fff | 4 kB | legacy vsyscall ABI
- ffffffffffe00000 | -2 MB | ffffffffffffffff | 2 MB | ... unused hole
-__________________|____________|__________________|_________|___________________________________________________________
-
-Architecture defines a 64-bit virtual address. Implementations can support
-less. Currently supported are 48- and 57-bit virtual addresses. Bits 63
-through to the most-significant implemented bit are sign extended.
-This causes hole between user space and kernel addresses if you interpret them
-as unsigned.
-
-The direct mapping covers all memory in the system up to the highest
-memory address (this means in some cases it can also include PCI memory
-holes).
-
-vmalloc space is lazily synchronized into the different PML4/PML5 pages of
-the processes using the page fault handler, with init_top_pgt as
-reference.
-
-We map EFI runtime services in the 'efi_pgd' PGD in a 64Gb large virtual
-memory window (this size is arbitrary, it can be raised later if needed).
-The mappings are not part of any other kernel PGD and are only available
-during EFI runtime calls.
-
-Note that if CONFIG_RANDOMIZE_MEMORY is enabled, the direct mapping of all
-physical memory, vmalloc/ioremap space and virtual memory map are randomized.
-Their order is preserved but their base will be offset early at boot time.
-
-Be very careful vs. KASLR when changing anything here. The KASLR address
-range must not overlap with anything except the KASAN shadow area, which is
-correct as KASAN disables KASLR.
-
-For both 4- and 5-level layouts, the STACKLEAK_POISON value in the last 2MB
-hole: ffffffffffff4111
diff --git a/Documentation/x86/x86_64/uefi.txt b/Documentation/x86/x86_64/uefi.rst
index a5e2b4fdb170..88c3ba32546f 100644
--- a/Documentation/x86/x86_64/uefi.txt
+++ b/Documentation/x86/x86_64/uefi.rst
@@ -1,5 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
General note on [U]EFI x86_64 support
--------------------------------------
+=====================================
The nomenclature EFI and UEFI are used interchangeably in this document.
@@ -14,29 +17,42 @@ with EFI firmware and specifications are listed below.
3. x86_64 platform with EFI/UEFI firmware.
-Mechanics:
+Mechanics
---------
-- Build the kernel with the following configuration.
+
+- Build the kernel with the following configuration::
+
CONFIG_FB_EFI=y
CONFIG_FRAMEBUFFER_CONSOLE=y
+
If EFI runtime services are expected, the following configuration should
- be selected.
+ be selected::
+
CONFIG_EFI=y
CONFIG_EFI_VARS=y or m # optional
+
- Create a VFAT partition on the disk
- Copy the following to the VFAT partition:
+
elilo bootloader with x86_64 support, elilo configuration file,
kernel image built in first step and corresponding
initrd. Instructions on building elilo and its dependencies
can be found in the elilo sourceforge project.
+
- Boot to EFI shell and invoke elilo choosing the kernel image built
in first step.
- If some or all EFI runtime services don't work, you can try following
kernel command line parameters to turn off some or all EFI runtime
services.
- noefi turn off all EFI runtime services
- reboot_type=k turn off EFI reboot runtime service
+
+ noefi
+ turn off all EFI runtime services
+ reboot_type=k
+ turn off EFI reboot runtime service
+
- If the EFI memory map has additional entries not in the E820 map,
you can include those entries in the kernels memory map of available
physical RAM by using the following kernel command line parameter.
- add_efi_memmap include EFI memory map of available physical RAM
+
+ add_efi_memmap
+ include EFI memory map of available physical RAM