diff options
| author | Dmitry Torokhov <dmitry.torokhov@gmail.com> | 2026-04-19 18:28:57 -0700 |
|---|---|---|
| committer | Dmitry Torokhov <dmitry.torokhov@gmail.com> | 2026-04-19 18:28:57 -0700 |
| commit | f4b369c6fe0ceaba2da2daff8c9eb415f85926dd (patch) | |
| tree | 30465d0a429b2c224685b5d8e804bf053c4d129a /Documentation/admin-guide | |
| parent | ff14dafde15c11403fac61367a34fea08926e9ee (diff) | |
| parent | 2ca45e57ea027fffe3350ae5e21ad9cecb0dce74 (diff) | |
Merge branch 'next' into for-linus
Prepare input updates for 7.1 merge window.
Diffstat (limited to 'Documentation/admin-guide')
73 files changed, 1857 insertions, 1464 deletions
diff --git a/Documentation/admin-guide/LSM/Smack.rst b/Documentation/admin-guide/LSM/Smack.rst index 6d44f4fdbf59..c5ed775f2d10 100644 --- a/Documentation/admin-guide/LSM/Smack.rst +++ b/Documentation/admin-guide/LSM/Smack.rst @@ -601,10 +601,15 @@ specification. Task Attribute ~~~~~~~~~~~~~~ -The Smack label of a process can be read from /proc/<pid>/attr/current. A -process can read its own Smack label from /proc/self/attr/current. A +The Smack label of a process can be read from ``/proc/<pid>/attr/current``. A +process can read its own Smack label from ``/proc/self/attr/current``. A privileged process can change its own Smack label by writing to -/proc/self/attr/current but not the label of another process. +``/proc/self/attr/current`` but not the label of another process. + +Format of writing is : only the label or the label followed by one of the +3 trailers: ``\n`` (by common agreement for ``/proc/...`` interfaces), +``\0`` (because some applications incorrectly include it), +``\n\0`` (because we think some applications may incorrectly include it). File Attribute ~~~~~~~~~~~~~~ @@ -696,6 +701,11 @@ sockets. A privileged program may set this to match the label of another task with which it hopes to communicate. +UNIX domain socket (UDS) with a BSD address functions both as a file in a +filesystem and as a socket. As a file, it carries the SMACK64 attribute. This +attribute is not involved in Smack security enforcement and is immutably +assigned the label "*". + Smack Netlabel Exceptions ~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/Documentation/admin-guide/LSM/ipe.rst b/Documentation/admin-guide/LSM/ipe.rst index dc7088451f9d..a756d8158531 100644 --- a/Documentation/admin-guide/LSM/ipe.rst +++ b/Documentation/admin-guide/LSM/ipe.rst @@ -95,7 +95,20 @@ languages when these scripts are invoked by passing these program files to the interpreter. This is because the way interpreters execute these files; the scripts themselves are not evaluated as executable code through one of IPE's hooks, but they are merely text files that are read -(as opposed to compiled executables) [#interpreters]_. +(as opposed to compiled executables). However, with the introduction of the +``AT_EXECVE_CHECK`` flag (:doc:`AT_EXECVE_CHECK </userspace-api/check_exec>`), +interpreters can use it to signal the kernel that a script file will be executed, +and request the kernel to perform LSM security checks on it. + +IPE's EXECUTE operation enforcement differs between compiled executables and +interpreted scripts: For compiled executables, enforcement is triggered +automatically by the kernel during ``execve()``, ``execveat()``, ``mmap()`` +and ``mprotect()`` syscalls when loading executable content. For interpreted +scripts, enforcement requires explicit interpreter integration using +``execveat()`` with ``AT_EXECVE_CHECK`` flag. Unlike exec syscalls that IPE +intercepts during the execution process, this mechanism needs the interpreter +to take the initiative, and existing interpreters won't be automatically +supported unless the signal call is added. Threat Model ------------ @@ -806,8 +819,6 @@ A: .. [#digest_cache_lsm] https://lore.kernel.org/lkml/20240415142436.2545003-1-roberto.sassu@huaweicloud.com/ -.. [#interpreters] There is `some interest in solving this issue <https://lore.kernel.org/lkml/20220321161557.495388-1-mic@digikod.net/>`_. - .. [#devdoc] Please see :doc:`the design docs </security/ipe>` for more on this topic. diff --git a/Documentation/admin-guide/LSM/landlock.rst b/Documentation/admin-guide/LSM/landlock.rst index 9e61607def08..9923874e2156 100644 --- a/Documentation/admin-guide/LSM/landlock.rst +++ b/Documentation/admin-guide/LSM/landlock.rst @@ -6,7 +6,7 @@ Landlock: system-wide management ================================ :Author: Mickaël Salaün -:Date: March 2025 +:Date: January 2026 Landlock can leverage the audit framework to log events. @@ -38,6 +38,37 @@ AUDIT_LANDLOCK_ACCESS domain=195ba459b blockers=fs.refer path="/usr/bin" dev="vda2" ino=351 domain=195ba459b blockers=fs.make_reg,fs.refer path="/usr/local" dev="vda2" ino=365 + + The ``blockers`` field uses dot-separated prefixes to indicate the type of + restriction that caused the denial: + + **fs.*** - Filesystem access rights (ABI 1+): + - fs.execute, fs.write_file, fs.read_file, fs.read_dir + - fs.remove_dir, fs.remove_file + - fs.make_char, fs.make_dir, fs.make_reg, fs.make_sock + - fs.make_fifo, fs.make_block, fs.make_sym + - fs.refer (ABI 2+) + - fs.truncate (ABI 3+) + - fs.ioctl_dev (ABI 5+) + + **net.*** - Network access rights (ABI 4+): + - net.bind_tcp - TCP port binding was denied + - net.connect_tcp - TCP connection was denied + + **scope.*** - IPC scoping restrictions (ABI 6+): + - scope.abstract_unix_socket - Abstract UNIX socket connection denied + - scope.signal - Signal sending denied + + Multiple blockers can appear in a single event (comma-separated) when + multiple access rights are missing. For example, creating a regular file + in a directory that lacks both ``make_reg`` and ``refer`` rights would show + ``blockers=fs.make_reg,fs.refer``. + + The object identification fields (path, dev, ino for filesystem; opid, + ocomm for signals) depend on the type of access being blocked and provide + context about what resource was involved in the denial. + + AUDIT_LANDLOCK_DOMAIN This record type describes the status of a Landlock domain. The ``status`` field can be either ``allocated`` or ``deallocated``. @@ -86,7 +117,7 @@ This command generates two events, each identified with a unique serial number following a timestamp (``msg=audit(1729738800.268:30)``). The first event (serial ``30``) contains 4 records. The first record (``type=LANDLOCK_ACCESS``) shows an access denied by the domain `1a6fdc66f`. -The cause of this denial is signal scopping restriction +The cause of this denial is signal scoping restriction (``blockers=scope.signal``). The process that would have receive this signal is the init process (``opid=1 ocomm="systemd"``). diff --git a/Documentation/admin-guide/RAS/main.rst b/Documentation/admin-guide/RAS/main.rst index 447bfde509fb..5a45db32c49b 100644 --- a/Documentation/admin-guide/RAS/main.rst +++ b/Documentation/admin-guide/RAS/main.rst @@ -406,24 +406,8 @@ index of the MC:: |->mc2 .... -Under each ``mcX`` directory each ``csrowX`` is again represented by a -``csrowX``, where ``X`` is the csrow index:: - - .../mc/mc0/ - | - |->csrow0 - |->csrow2 - |->csrow3 - .... - -Notice that there is no csrow1, which indicates that csrow0 is composed -of a single ranked DIMMs. This should also apply in both Channels, in -order to have dual-channel mode be operational. Since both csrow2 and -csrow3 are populated, this indicates a dual ranked set of DIMMs for -channels 0 and 1. - -Within each of the ``mcX`` and ``csrowX`` directories are several EDAC -control and attribute files. +Within each of the ``mcX`` directory are several EDAC control and +attribute files. ``mcX`` directories ------------------- @@ -569,7 +553,7 @@ this ``X`` memory module: - Unbuffered-DDR .. [#f5] On some systems, the memory controller doesn't have any logic - to identify the memory module. On such systems, the directory is called ``rankX`` and works on a similar way as the ``csrowX`` directories. + to identify the memory module. On such systems, the directory is called ``rankX``. On modern Intel memory controllers, the memory controller identifies the memory modules directly. On such systems, the directory is called ``dimmX``. @@ -577,126 +561,6 @@ this ``X`` memory module: symlinks inside the sysfs mapping that are automatically created by the sysfs subsystem. Currently, they serve no purpose. -``csrowX`` directories ----------------------- - -When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the ``csrowX`` -directories. As this API doesn't work properly for Rambus, FB-DIMMs and -modern Intel Memory Controllers, this is being deprecated in favor of -``dimmX`` directories. - -In the ``csrowX`` directories are EDAC control and attribute files for -this ``X`` instance of csrow: - - -- ``ue_count`` - Total Uncorrectable Errors count attribute file - - This attribute file displays the total count of uncorrectable - errors that have occurred on this csrow. If panic_on_ue is set - this counter will not have a chance to increment, since EDAC - will panic the system. - - -- ``ce_count`` - Total Correctable Errors count attribute file - - This attribute file displays the total count of correctable - errors that have occurred on this csrow. This count is very - important to examine. CEs provide early indications that a - DIMM is beginning to fail. This count field should be - monitored for non-zero values and report such information - to the system administrator. - - -- ``size_mb`` - Total memory managed by this csrow attribute file - - This attribute file displays, in count of megabytes, the memory - that this csrow contains. - - -- ``mem_type`` - Memory Type attribute file - - This attribute file will display what type of memory is currently - on this csrow. Normally, either buffered or unbuffered memory. - Examples: - - - Registered-DDR - - Unbuffered-DDR - - -- ``edac_mode`` - EDAC Mode of operation attribute file - - This attribute file will display what type of Error detection - and correction is being utilized. - - -- ``dev_type`` - Device type attribute file - - This attribute file will display what type of DRAM device is - being utilized on this DIMM. - Examples: - - - x1 - - x2 - - x4 - - x8 - - -- ``ch0_ce_count`` - Channel 0 CE Count attribute file - - This attribute file will display the count of CEs on this - DIMM located in channel 0. - - -- ``ch0_ue_count`` - Channel 0 UE Count attribute file - - This attribute file will display the count of UEs on this - DIMM located in channel 0. - - -- ``ch0_dimm_label`` - Channel 0 DIMM Label control file - - - This control file allows this DIMM to have a label assigned - to it. With this label in the module, when errors occur - the output can provide the DIMM label in the system log. - This becomes vital for panic events to isolate the - cause of the UE event. - - DIMM Labels must be assigned after booting, with information - that correctly identifies the physical slot with its - silk screen label. This information is currently very - motherboard specific and determination of this information - must occur in userland at this time. - - -- ``ch1_ce_count`` - Channel 1 CE Count attribute file - - - This attribute file will display the count of CEs on this - DIMM located in channel 1. - - -- ``ch1_ue_count`` - Channel 1 UE Count attribute file - - - This attribute file will display the count of UEs on this - DIMM located in channel 0. - - -- ``ch1_dimm_label`` - Channel 1 DIMM Label control file - - This control file allows this DIMM to have a label assigned - to it. With this label in the module, when errors occur - the output can provide the DIMM label in the system log. - This becomes vital for panic events to isolate the - cause of the UE event. - - DIMM Labels must be assigned after booting, with information - that correctly identifies the physical slot with its - silk screen label. This information is currently very - motherboard specific and determination of this information - must occur in userland at this time. - System Logging -------------- diff --git a/Documentation/admin-guide/README.rst b/Documentation/admin-guide/README.rst index 05301f03b717..77fec1de6dc8 100644 --- a/Documentation/admin-guide/README.rst +++ b/Documentation/admin-guide/README.rst @@ -53,7 +53,7 @@ Documentation these typically contain kernel-specific installation notes for some drivers for example. Please read the :ref:`Documentation/process/changes.rst <changes>` file, as it - contains information about the problems, which may result by upgrading + contains information about the problems which may result from upgrading your kernel. Installing the kernel source diff --git a/Documentation/admin-guide/aoe/index.rst b/Documentation/admin-guide/aoe/index.rst index d71c5df15922..564354bbce57 100644 --- a/Documentation/admin-guide/aoe/index.rst +++ b/Documentation/admin-guide/aoe/index.rst @@ -8,10 +8,3 @@ ATA over Ethernet (AoE) aoe todo examples - -.. only:: subproject and html - - Indices - ======= - - * :ref:`genindex` diff --git a/Documentation/admin-guide/auxdisplay/index.rst b/Documentation/admin-guide/auxdisplay/index.rst index e466f0595248..31eae08255fd 100644 --- a/Documentation/admin-guide/auxdisplay/index.rst +++ b/Documentation/admin-guide/auxdisplay/index.rst @@ -7,10 +7,3 @@ Auxiliary Display Support ks0108.rst cfag12864b.rst - -.. only:: subproject and html - - Indices - ======= - - * :ref:`genindex` diff --git a/Documentation/admin-guide/bcache.rst b/Documentation/admin-guide/bcache.rst index 6fdb495ac466..f71f349553e4 100644 --- a/Documentation/admin-guide/bcache.rst +++ b/Documentation/admin-guide/bcache.rst @@ -17,8 +17,7 @@ The latest bcache kernel code can be found from mainline Linux kernel: It's designed around the performance characteristics of SSDs - it only allocates in erase block sized buckets, and it uses a hybrid btree/log to track cached extents (which can be anywhere from a single sector to the bucket size). It's -designed to avoid random writes at all costs; it fills up an erase block -sequentially, then issues a discard before reusing it. +designed to avoid random writes at all costs. Both writethrough and writeback caching are supported. Writeback defaults to off, but can be switched on and off arbitrarily at runtime. Bcache goes to @@ -618,19 +617,11 @@ bucket_size cache_replacement_policy One of either lru, fifo or random. -discard - Boolean; if on a discard/TRIM will be issued to each bucket before it is - reused. Defaults to off, since SATA TRIM is an unqueued command (and thus - slow). - freelist_percent Size of the freelist as a percentage of nbuckets. Can be written to to increase the number of buckets kept on the freelist, which lets you artificially reduce the size of the cache at runtime. Mostly for testing - purposes (i.e. testing how different size caches affect your hit rate), but - since buckets are discarded when they move on to the freelist will also make - the SSD's garbage collection easier by effectively giving it more reserved - space. + purposes (i.e. testing how different size caches affect your hit rate). io_errors Number of errors that have occurred, decayed by io_error_halflife. diff --git a/Documentation/admin-guide/blockdev/zoned_loop.rst b/Documentation/admin-guide/blockdev/zoned_loop.rst index 64dcfde7450a..6aa865424ac3 100644 --- a/Documentation/admin-guide/blockdev/zoned_loop.rst +++ b/Documentation/admin-guide/blockdev/zoned_loop.rst @@ -68,30 +68,43 @@ The options available for the add command can be listed by reading the In more details, the options that can be used with the "add" command are as follows. -================ =========================================================== -id Device number (the X in /dev/zloopX). - Default: automatically assigned. -capacity_mb Device total capacity in MiB. This is always rounded up to - the nearest higher multiple of the zone size. - Default: 16384 MiB (16 GiB). -zone_size_mb Device zone size in MiB. Default: 256 MiB. -zone_capacity_mb Device zone capacity (must always be equal to or lower than - the zone size. Default: zone size. -conv_zones Total number of conventioanl zones starting from sector 0. - Default: 8. -base_dir Path to the base directory where to create the directory - containing the zone files of the device. - Default=/var/local/zloop. - The device directory containing the zone files is always - named with the device ID. E.g. the default zone file - directory for /dev/zloop0 is /var/local/zloop/0. -nr_queues Number of I/O queues of the zoned block device. This value is - always capped by the number of online CPUs - Default: 1 -queue_depth Maximum I/O queue depth per I/O queue. - Default: 64 -buffered_io Do buffered IOs instead of direct IOs (default: false) -================ =========================================================== +=================== ========================================================= +id Device number (the X in /dev/zloopX). + Default: automatically assigned. +capacity_mb Device total capacity in MiB. This is always rounded up + to the nearest higher multiple of the zone size. + Default: 16384 MiB (16 GiB). +zone_size_mb Device zone size in MiB. Default: 256 MiB. +zone_capacity_mb Device zone capacity (must always be equal to or lower + than the zone size. Default: zone size. +conv_zones Total number of conventioanl zones starting from + sector 0 + Default: 8 +base_dir Path to the base directory where to create the directory + containing the zone files of the device. + Default=/var/local/zloop. + The device directory containing the zone files is always + named with the device ID. E.g. the default zone file + directory for /dev/zloop0 is /var/local/zloop/0. +nr_queues Number of I/O queues of the zoned block device. This + value is always capped by the number of online CPUs + Default: 1 +queue_depth Maximum I/O queue depth per I/O queue. + Default: 64 +buffered_io Do buffered IOs instead of direct IOs (default: false) +zone_append Enable or disable a zloop device native zone append + support. + Default: 1 (enabled). + If native zone append support is disabled, the block layer + will emulate this operation using regular write + operations. +ordered_zone_append Enable zloop mitigation of zone append reordering. + Default: disabled. + This is useful for testing file systems file data mapping + (extents), as when enabled, this can significantly reduce + the number of data extents needed to for a file data + mapping. +=================== ========================================================= 3) Deleting a Zoned Device -------------------------- @@ -121,7 +134,7 @@ MB and a zone capacity of 63 MB:: $ modprobe zloop $ mkdir -p /var/local/zloop/0 - $ echo "add capacity_mb=2048,zone_size_mb=64,zone_capacity=63MB" > /dev/zloop-control + $ echo "add capacity_mb=2048,zone_size_mb=64,zone_capacity_mb=63" > /dev/zloop-control For the device created (/dev/zloop0), the zone backing files are all created under the default base directory (/var/local/zloop):: diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index 3e273c1bb749..451fa00d3004 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -214,6 +214,9 @@ mem_limit WO specifies the maximum amount of memory ZRAM can writeback_limit WO specifies the maximum amount of write IO zram can write out to backing device as 4KB unit writeback_limit_enable RW show and set writeback_limit feature +writeback_batch_size RW show and set maximum number of in-flight + writeback operations +compressed_writeback RW show and set compressed writeback feature comp_algorithm RW show and change the compression algorithm algorithm_params WO setup compression algorithm parameters compact WO trigger memory compaction @@ -222,7 +225,6 @@ backing_dev RW set up backend storage for zram to write out idle WO mark allocated slot as idle ====================== ====== =============================================== - User space is advised to use the following files to read the device statistics. File /sys/block/zram<id>/stat @@ -434,6 +436,26 @@ system reboot, echo 1 > /sys/block/zramX/reset) so keeping how many of writeback happened until you reset the zram to allocate extra writeback budget in next setting is user's job. +By default zram stores written back pages in decompressed (raw) form, which +means that writeback operation involves decompression of the page before +writing it to the backing device. This behavior can be changed by enabling +`compressed_writeback` feature, which causes zram to write compressed pages +to the backing device, thus avoiding decompression overhead. To enable +this feature, execute:: + + $ echo yes > /sys/block/zramX/compressed_writeback + +Note that this feature should be configured before the `zramX` device is +initialized. + +Depending on backing device storage type, writeback operation may benefit +from a higher number of in-flight write requests (batched writes). The +number of maximum in-flight writeback operations can be configured via +`writeback_batch_size` attribute. To change the default value (which is 32), +execute:: + + $ echo 64 > /sys/block/zramX/writeback_batch_size + If admin wants to measure writeback count in a certain period, they could know it via /sys/block/zram0/bd_stat's 3rd column. diff --git a/Documentation/admin-guide/bootconfig.rst b/Documentation/admin-guide/bootconfig.rst index 7a86042c9b6d..f712758472d5 100644 --- a/Documentation/admin-guide/bootconfig.rst +++ b/Documentation/admin-guide/bootconfig.rst @@ -20,18 +20,26 @@ Config File Syntax The boot config syntax is a simple structured key-value. Each key consists of dot-connected-words, and key and value are connected by ``=``. The value -has to be terminated by semi-colon (``;``) or newline (``\n``). -For array value, array entries are separated by comma (``,``). :: - - KEY[.WORD[...]] = VALUE[, VALUE2[...]][;] - -Unlike the kernel command line syntax, spaces are OK around the comma and ``=``. +string has to be terminated by the following delimiters described below. Each key word must contain only alphabets, numbers, dash (``-``) or underscore (``_``). And each value only contains printable characters or spaces except for delimiters such as semi-colon (``;``), new-line (``\n``), comma (``,``), hash (``#``) and closing brace (``}``). +If the ``=`` is followed by whitespace up to one of these delimiters, the +key is assigned an empty value. + +For arrays, the array values are comma (``,``) separated, and comments and +line breaks with newline (``\n``) are allowed between array values for +readability. Thus the first entry of the array must be on the same line as +the key.:: + + KEY[.WORD[...]] = VALUE[, VALUE2[...]][;] + +Unlike the kernel command line syntax, white spaces (including tabs) are +ignored around the comma and ``=``. + If you want to use those delimiters in a value, you can use either double- quotes (``"VALUE"``) or single-quotes (``'VALUE'``) to quote it. Note that you can not escape these quotes. @@ -138,8 +146,8 @@ This is parsed as below:: foo = value bar = 1, 2, 3 -Note that you can not put a comment between value and delimiter(``,`` or -``;``). This means following config has a syntax error :: +Note that you can NOT put a comment or a newline between value and delimiter +(``,`` or ``;``). This means following config has a syntax error :: key = 1 # comment ,2 diff --git a/Documentation/admin-guide/bug-hunting.rst b/Documentation/admin-guide/bug-hunting.rst index 7da0504388ec..3901b43c96df 100644 --- a/Documentation/admin-guide/bug-hunting.rst +++ b/Documentation/admin-guide/bug-hunting.rst @@ -52,14 +52,14 @@ line is usually required to identify and handle the bug. Along this chapter, we'll refer to "Oops" for all kinds of stack traces that need to be analyzed. If the kernel is compiled with ``CONFIG_DEBUG_INFO``, you can enhance the -quality of the stack trace by using file:`scripts/decode_stacktrace.sh`. +quality of the stack trace by using ``scripts/decode_stacktrace.sh``. Modules linked in ----------------- Modules that are tainted or are being loaded or unloaded are marked with "(...)", where the taint flags are described in -file:`Documentation/admin-guide/tainted-kernels.rst`, "being loaded" is +Documentation/admin-guide/tainted-kernels.rst, "being loaded" is annotated with "+", and "being unloaded" is annotated with "-". @@ -235,7 +235,7 @@ Dave Miller):: mov 0x8(%ebp), %ebx ! %ebx = skb->sk mov 0x13c(%ebx), %eax ! %eax = inet_sk(sk)->opt -file:`scripts/decodecode` can be used to automate most of this, depending +``scripts/decodecode`` can be used to automate most of this, depending on what CPU architecture is being debugged. Reporting the bug diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst index 493a8e386700..b5f3873b7d3a 100644 --- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst +++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst @@ -77,7 +77,7 @@ control group and enforces the limit during page fault. Since HugeTLB doesn't support page reclaim, enforcing the limit at page fault time implies that, the application will get SIGBUS signal if it tries to fault in HugeTLB pages beyond its limit. Therefore the application needs to know exactly how many -HugeTLB pages it uses before hand, and the sysadmin needs to make sure that +HugeTLB pages it uses beforehand, and the sysadmin needs to make sure that there are enough available on the machine for all the users to avoid processes getting SIGBUS. @@ -91,23 +91,23 @@ getting SIGBUS. hugetlb.<hugepagesize>.rsvd.usage_in_bytes hugetlb.<hugepagesize>.rsvd.failcnt -The HugeTLB controller allows to limit the HugeTLB reservations per control +The HugeTLB controller allows limiting the HugeTLB reservations per control group and enforces the controller limit at reservation time and at the fault of HugeTLB memory for which no reservation exists. Since reservation limits are -enforced at reservation time (on mmap or shget), reservation limits never causes -the application to get SIGBUS signal if the memory was reserved before hand. For +enforced at reservation time (on mmap or shget), reservation limits never cause +the application to get SIGBUS signal if the memory was reserved beforehand. For MAP_NORESERVE allocations, the reservation limit behaves the same as the fault limit, enforcing memory usage at fault time and causing the application to receive a SIGBUS if it's crossing its limit. Reservation limits are superior to page fault limits described above, since reservation limits are enforced at reservation time (on mmap or shget), and -never causes the application to get SIGBUS signal if the memory was reserved -before hand. This allows for easier fallback to alternatives such as +never cause the application to get SIGBUS signal if the memory was reserved +beforehand. This allows for easier fallback to alternatives such as non-HugeTLB memory for example. In the case of page fault accounting, it's very -hard to avoid processes getting SIGBUS since the sysadmin needs precisely know -the HugeTLB usage of all the tasks in the system and make sure there is enough -pages to satisfy all requests. Avoiding tasks getting SIGBUS on overcommited +hard to avoid processes getting SIGBUS since the sysadmin needs to precisely know +the HugeTLB usage of all the tasks in the system and make sure there are enough +pages to satisfy all requests. Avoiding tasks getting SIGBUS on overcommitted systems is practically impossible with page fault accounting. diff --git a/Documentation/admin-guide/cgroup-v1/index.rst b/Documentation/admin-guide/cgroup-v1/index.rst index 99fbc8a64ba9..14897a8d32b3 100644 --- a/Documentation/admin-guide/cgroup-v1/index.rst +++ b/Documentation/admin-guide/cgroup-v1/index.rst @@ -22,10 +22,3 @@ Control Groups version 1 net_prio pids rdma - -.. only:: subproject and html - - Indices - ======= - - * :ref:`genindex` diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index d6b1db8cc7eb..7db63c002922 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -311,9 +311,8 @@ Lock order is as follows:: folio_lock mm->page_table_lock or split pte_lock - folio_memcg_lock (memcg->move_lock) - mapping->i_pages lock - lruvec->lru_lock. + mapping->i_pages lock + lruvec->lru_lock. Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by lruvec->lru_lock; the folio LRU flag is cleared before diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 0e6c67ac585a..91beaa6798ce 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -53,7 +53,8 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou 5-2. Memory 5-2-1. Memory Interface Files 5-2-2. Usage Guidelines - 5-2-3. Memory Ownership + 5-2-3. Reclaim Protection + 5-2-4. Memory Ownership 5-3. IO 5-3-1. IO Interface Files 5-3-2. Writeback @@ -736,9 +737,6 @@ combinations are invalid and should be rejected. Also, if the resource is mandatory for execution of processes, process migrations may be rejected. -"cpu.rt.max" hard-allocates realtime slices and is an example of this -type. - Interface Files =============== @@ -1317,7 +1315,7 @@ PAGE_SIZE multiple when read back. smaller overages. Effective min boundary is limited by memory.min values of - all ancestor cgroups. If there is memory.min overcommitment + ancestor cgroups. If there is memory.min overcommitment (child cgroup or cgroups are requiring more protected memory than parent will allow), then each child cgroup will get the part of parent's protection proportional to its @@ -1326,9 +1324,6 @@ PAGE_SIZE multiple when read back. Putting more memory than generally available under this protection is discouraged and may lead to constant OOMs. - If a memory cgroup is not populated with processes, - its memory.min is ignored. - memory.low A read-write single value file which exists on non-root cgroups. The default is "0". @@ -1343,7 +1338,7 @@ PAGE_SIZE multiple when read back. smaller overages. Effective low boundary is limited by memory.low values of - all ancestor cgroups. If there is memory.low overcommitment + ancestor cgroups. If there is memory.low overcommitment (child cgroup or cgroups are requiring more protected memory than parent will allow), then each child cgroup will get the part of parent's protection proportional to its @@ -1515,6 +1510,10 @@ The following nested keys are defined. oom_group_kill The number of times a group OOM has occurred. + sock_throttled + The number of times network sockets associated with + this cgroup are throttled. + memory.events.local Similar to memory.events but the fields in the file are local to the cgroup i.e. not hierarchical. The file modified event @@ -1934,6 +1933,27 @@ memory - is necessary to determine whether a workload needs more memory; unfortunately, memory pressure monitoring mechanism isn't implemented yet. +Reclaim Protection +~~~~~~~~~~~~~~~~~~ + +The protection configured with "memory.low" or "memory.min" applies relatively +to the target of the reclaim (i.e. any of memory cgroup limits, proactive +memory.reclaim or global reclaim apparently located in the root cgroup). +The protection value configured for B applies unchanged to the reclaim +targeting A (i.e. caused by competition with the sibling E):: + + root - ... - A - B - C + \ ` D + ` E + +When the reclaim targets ancestors of A, the effective protection of B is +capped by the protection value configured for A (and any other intermediate +ancestors between A and the target). + +To express indifference about relative sibling protection, it is suggested to +use memory_recursiveprot. Configuring all descendants of a parent with finite +protection to "max" works but it may unnecessarily skew memory.events:low +field. Memory Ownership ~~~~~~~~~~~~~~~~ @@ -2538,10 +2558,10 @@ Cpuset Interface Files Users can manually set it to a value that is different from "cpuset.cpus". One constraint in setting it is that the list of CPUs must be exclusive with respect to "cpuset.cpus.exclusive" - of its sibling. If "cpuset.cpus.exclusive" of a sibling cgroup - isn't set, its "cpuset.cpus" value, if set, cannot be a subset - of it to leave at least one CPU available when the exclusive - CPUs are taken away. + and "cpuset.cpus.exclusive.effective" of its siblings. Another + constraint is that it cannot be a superset of "cpuset.cpus" + of its sibling in order to leave at least one CPU available to + that sibling when the exclusive CPUs are taken away. For a parent cgroup, any one of its exclusive CPUs can only be distributed to at most one of its child cgroups. Having an @@ -2561,9 +2581,9 @@ Cpuset Interface Files of this file will always be a subset of its parent's "cpuset.cpus.exclusive.effective" if its parent is not the root cgroup. It will also be a subset of "cpuset.cpus.exclusive" - if it is set. If "cpuset.cpus.exclusive" is not set, it is - treated to have an implicit value of "cpuset.cpus" in the - formation of local partition. + if it is set. This file should only be non-empty if either + "cpuset.cpus.exclusive" is set or when the current cpuset is + a valid partition root. cpuset.cpus.isolated A read-only and root cgroup only multiple values file. @@ -2595,13 +2615,22 @@ Cpuset Interface Files There are two types of partitions - local and remote. A local partition is one whose parent cgroup is also a valid partition root. A remote partition is one whose parent cgroup is not a - valid partition root itself. Writing to "cpuset.cpus.exclusive" - is optional for the creation of a local partition as its - "cpuset.cpus.exclusive" file will assume an implicit value that - is the same as "cpuset.cpus" if it is not set. Writing the - proper "cpuset.cpus.exclusive" values down the cgroup hierarchy - before the target partition root is mandatory for the creation - of a remote partition. + valid partition root itself. + + Writing to "cpuset.cpus.exclusive" is optional for the creation + of a local partition as its "cpuset.cpus.exclusive" file will + assume an implicit value that is the same as "cpuset.cpus" if it + is not set. Writing the proper "cpuset.cpus.exclusive" values + down the cgroup hierarchy before the target partition root is + mandatory for the creation of a remote partition. + + Not all the CPUs requested in "cpuset.cpus.exclusive" can be + used to form a new partition. Only those that were present + in its parent's "cpuset.cpus.exclusive.effective" control + file can be used. For partitions created without setting + "cpuset.cpus.exclusive", exclusive CPUs specified in sibling's + "cpuset.cpus.exclusive" or "cpuset.cpus.exclusive.effective" + also cannot be used. Currently, a remote partition cannot be created under a local partition. All the ancestors of a remote partition root except @@ -2609,6 +2638,10 @@ Cpuset Interface Files The root cgroup is always a partition root and its state cannot be changed. All other non-root cgroups start out as "member". + Even though the "cpuset.cpus.exclusive*" and "cpuset.cpus" + control files are not present in the root cgroup, they are + implicitly the same as the "/sys/devices/system/cpu/possible" + sysfs file. When set to "root", the current cgroup is the root of a new partition or scheduling domain. The set of exclusive CPUs is @@ -2793,7 +2826,7 @@ DMEM Interface Files HugeTLB ------- -The HugeTLB controller allows to limit the HugeTLB usage per control group and +The HugeTLB controller allows limiting the HugeTLB usage per control group and enforces the controller limit during page fault. HugeTLB Interface Files diff --git a/Documentation/admin-guide/cifs/index.rst b/Documentation/admin-guide/cifs/index.rst index fad5268635f5..58ab58a71a82 100644 --- a/Documentation/admin-guide/cifs/index.rst +++ b/Documentation/admin-guide/cifs/index.rst @@ -12,10 +12,3 @@ CIFS todo changes authors - -.. only:: subproject and html - - Indices - ======= - - * :ref:`genindex` diff --git a/Documentation/admin-guide/device-mapper/dm-raid.rst b/Documentation/admin-guide/device-mapper/dm-raid.rst index bb17e26e3c1b..3780f6e6b6bb 100644 --- a/Documentation/admin-guide/device-mapper/dm-raid.rst +++ b/Documentation/admin-guide/device-mapper/dm-raid.rst @@ -20,10 +20,10 @@ The target is named "raid" and it accepts the following parameters:: raid0 RAID0 striping (no resilience) raid1 RAID1 mirroring raid4 RAID4 with dedicated last parity disk - raid5_n RAID5 with dedicated last parity disk supporting takeover + raid5_n RAID5 with dedicated last parity disk supporting takeover from/to raid1 Same as raid4 - - Transitory layout + - Transitory layout for takeover from/to raid1 raid5_la RAID5 left asymmetric - rotating parity 0 with data continuation @@ -48,8 +48,8 @@ The target is named "raid" and it accepts the following parameters:: raid6_n_6 RAID6 with dedicate parity disks - parity and Q-syndrome on the last 2 disks; - layout for takeover from/to raid4/raid5_n - raid6_la_6 Same as "raid_la" plus dedicated last Q-syndrome disk + layout for takeover from/to raid0/raid4/raid5_n + raid6_la_6 Same as "raid_la" plus dedicated last Q-syndrome disk supporting takeover from/to raid5 - layout for takeover from raid5_la from/to raid6 raid6_ra_6 Same as "raid5_ra" dedicated last Q-syndrome disk @@ -173,9 +173,9 @@ The target is named "raid" and it accepts the following parameters:: The delta_disks option value (-251 < N < +251) triggers device removal (negative value) or device addition (positive value) to any reshape supporting raid levels 4/5/6 and 10. - RAID levels 4/5/6 allow for addition of devices (metadata - and data device tuple), raid10_near and raid10_offset only - allow for device addition. raid10_far does not support any + RAID levels 4/5/6 allow for addition and removal of devices + (metadata and data device tuple), raid10_near and raid10_offset + only allow for device addition. raid10_far does not support any reshaping at all. A minimum of devices have to be kept to enforce resilience, which is 3 devices for raid4/5 and 4 devices for raid6. @@ -372,6 +372,72 @@ to safely enable discard support for RAID 4/5/6: 'devices_handle_discards_safely' +Takeover/Reshape Support +------------------------ +The target natively supports these two types of MDRAID conversions: + +o Takeover: Converts an array from one RAID level to another + +o Reshape: Changes the internal layout while maintaining the current RAID level + +Each operation is only valid under specific constraints imposed by the existing array's layout and configuration. + + +Takeover: +linear -> raid1 with N >= 2 mirrors +raid0 -> raid4 (add dedicated parity device) +raid0 -> raid5 (add dedicated parity device) +raid0 -> raid10 with near layout and N >= 2 mirror groups (raid0 stripes have to become first member within mirror groups) +raid1 -> linear +raid1 -> raid5 with 2 mirrors +raid4 -> raid5 w/ rotating parity +raid5 with dedicated parity device -> raid4 +raid5 -> raid6 (with dedicated Q-syndrome) +raid6 (with dedicated Q-syndrome) -> raid5 +raid10 with near layout and even number of disks -> raid0 (select any in-sync device from each mirror group) + +Reshape: +linear: not possible +raid0: not possible +raid1: change number of mirrors +raid4: add and remove stripes (minimum 3), change stripesize +raid5: add and remove stripes (minimum 3, special case 2 for raid1 takeover), change rotating parity algorithms, change stripesize +raid6: add and remove stripes (minimum 4), change rotating syndrome algorithms, change stripesize +raid10 near: add stripes (minimum 4), change stripesize, no stripe removal possible, change to offset layout +raid10 offset: add stripes, change stripesize, no stripe removal possible, change to near layout +raid10 far: not possible + +Table line examples: + +### raid1 -> raid5 +# +# 2 devices limitation in raid1. +# raid5 personality is able to just map 2 like raid1. +# Reshape after takeover to change to full raid5 layout + + 0 1960886272 raid raid1 3 0 region_size 2048 2 /dev/dm-0 /dev/dm-1 /dev/dm-2 /dev/dm-3 + +# dm-0 and dm-2 are e.g. 4MiB large metadata devices, dm-1 and dm-3 have to be at least 1960886272 big. +# +# Table line to takeover to raid5 + + 0 1960886272 raid raid5 3 0 region_size 2048 2 /dev/dm-0 /dev/dm-1 /dev/dm-2 /dev/dm-3 + +# Add required out-of-place reshape space to the beginniong of the given 2 data devices, +# allocate another metadata/data device tuple with the same sizes for the parity space +# and zero the first 4K of the metadata device. +# +# Example table of the out-of-place reshape space addition for one data device, e.g. dm-1 + + 0 8192 linear 8:0 0 1960903888 # <- must be free space segment + 8192 1960886272 linear 8:0 0 2048 # previous data segment + +# Mapping table for e.g. raid5_rs reshape causing the size of the raid device to double-fold once the reshape finishes. +# Check the status output (e.g. "dmsetup status $RaidDev") for progress. + + 0 $((2 * 1960886272)) raid raid5 7 0 region_size 2048 data_offset 8192 delta_disk 1 2 /dev/dm-0 /dev/dm-1 /dev/dm-2 /dev/dm-3 + + Version History --------------- diff --git a/Documentation/admin-guide/device-mapper/index.rst b/Documentation/admin-guide/device-mapper/index.rst index f1c1f4b824ba..030d854628ac 100644 --- a/Documentation/admin-guide/device-mapper/index.rst +++ b/Documentation/admin-guide/device-mapper/index.rst @@ -40,10 +40,3 @@ Device Mapper verity writecache zero - -.. only:: subproject and html - - Indices - ======= - - * :ref:`genindex` diff --git a/Documentation/admin-guide/device-mapper/verity.rst b/Documentation/admin-guide/device-mapper/verity.rst index 8c3f1f967a3c..3ecab1cff9c6 100644 --- a/Documentation/admin-guide/device-mapper/verity.rst +++ b/Documentation/admin-guide/device-mapper/verity.rst @@ -236,8 +236,10 @@ is available at the cryptsetup project's wiki page Status ====== -V (for Valid) is returned if every check performed so far was valid. -If any check failed, C (for Corruption) is returned. +1. V (for Valid) is returned if every check performed so far was valid. + If any check failed, C (for Corruption) is returned. +2. Number of corrected blocks by Forward Error Correction. + '-' if Forward Error Correction is not enabled. Example ======= diff --git a/Documentation/admin-guide/devices.rst b/Documentation/admin-guide/devices.rst index e3776d77374b..b103ba52776a 100644 --- a/Documentation/admin-guide/devices.rst +++ b/Documentation/admin-guide/devices.rst @@ -97,9 +97,12 @@ It is recommended that these links exist on all systems: /dev/bttv0 video0 symbolic Backward compatibility /dev/radio radio0 symbolic Backward compatibility /dev/i2o* /dev/i2o/* symbolic Backward compatibility -/dev/scd? sr? hard Alternate SCSI CD-ROM name =============== =============== =============== =============================== +Suggested earlier ``/dev/scd?`` alternative names for ``/dev/sr?`` +CD-ROM and other optical drives (using SCSI commands) were removed +in ``udev`` version 174 that was released in 2011. + Locally defined links +++++++++++++++++++++ @@ -112,7 +115,6 @@ exist, they should have the following uses. /dev/mouse mouse port symbolic Current mouse device /dev/tape tape device symbolic Current tape device /dev/cdrom CD-ROM device symbolic Current CD-ROM device -/dev/cdwriter CD-writer symbolic Current CD-writer device /dev/scanner scanner symbolic Current scanner device /dev/modem modem port symbolic Current dialout device /dev/root root device symbolic Current root filesystem @@ -126,8 +128,8 @@ exists, ``/dev/modem`` should point to the appropriate primary TTY device For SCSI devices, ``/dev/tape`` and ``/dev/cdrom`` should point to the *cooked* devices (``/dev/st*`` and ``/dev/sr*``, respectively), whereas -``/dev/cdwriter`` and /dev/scanner should point to the appropriate generic -SCSI devices (/dev/sg*). +``/dev/scanner`` should point to the appropriate generic +SCSI device (``/dev/sg*``). ``/dev/mouse`` may point to a primary serial TTY device, a hardware mouse device, or a socket for a mouse driver program (e.g. ``/dev/gpmdata``). diff --git a/Documentation/admin-guide/devices.txt b/Documentation/admin-guide/devices.txt index 94c98be1329a..440633642fea 100644 --- a/Documentation/admin-guide/devices.txt +++ b/Documentation/admin-guide/devices.txt @@ -352,7 +352,7 @@ 216 = /dev/fujitsu/apanel Fujitsu/Siemens application panel 217 = /dev/ni/natmotn National Instruments Motion 218 = /dev/kchuid Inter-process chuid control - 219 = /dev/modems/mwave MWave modem firmware upload + 219 = 220 = /dev/mptctl Message passing technology (MPT) control 221 = /dev/mvista/hssdsi Montavista PICMG hot swap system driver 222 = /dev/mvista/hasi Montavista PICMG high availability @@ -389,11 +389,11 @@ ... 11 block SCSI CD-ROM devices - 0 = /dev/scd0 First SCSI CD-ROM - 1 = /dev/scd1 Second SCSI CD-ROM + 0 = /dev/sr0 First SCSI CD-ROM + 1 = /dev/sr1 Second SCSI CD-ROM ... - The prefix /dev/sr (instead of /dev/scd) has been deprecated. + In the past the prefix /dev/scd (instead of /dev/sr) was used and even recommended. 12 char QIC-02 tape 2 = /dev/ntpqic11 QIC-11, no rewind-on-close diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst b/Documentation/admin-guide/dynamic-debug-howto.rst index 7c036590cd07..095a63892257 100644 --- a/Documentation/admin-guide/dynamic-debug-howto.rst +++ b/Documentation/admin-guide/dynamic-debug-howto.rst @@ -223,12 +223,13 @@ The flags are:: f Include the function name s Include the source file name l Include line number + d Include call trace For ``print_hex_dump_debug()`` and ``print_hex_dump_bytes()``, only the ``p`` flag has meaning, other flags are ignored. -Note the regexp ``^[-+=][fslmpt_]+$`` matches a flags specification. -To clear all flags at once, use ``=_`` or ``-fslmpt``. +Note the regexp ``^[-+=][fslmptd_]+$`` matches a flags specification. +To clear all flags at once, use ``=_`` or ``-fslmptd``. Debug messages during Boot Process diff --git a/Documentation/admin-guide/efi-stub.rst b/Documentation/admin-guide/efi-stub.rst index 090f3a185e18..f8e7407698bd 100644 --- a/Documentation/admin-guide/efi-stub.rst +++ b/Documentation/admin-guide/efi-stub.rst @@ -79,6 +79,9 @@ because the image we're executing is interpreted by the EFI shell, which understands relative paths, whereas the rest of the command line is passed to bzImage.efi. +.. hint:: + It is also possible to provide an initrd using a Linux-specific UEFI + protocol at boot time. See :ref:`pe-coff-entry-point` for details. The "dtb=" option ----------------- diff --git a/Documentation/admin-guide/gpio/index.rst b/Documentation/admin-guide/gpio/index.rst index 712f379731cb..082646851029 100644 --- a/Documentation/admin-guide/gpio/index.rst +++ b/Documentation/admin-guide/gpio/index.rst @@ -12,10 +12,3 @@ GPIO gpio-sim gpio-virtuser Obsolete APIs <obsolete> - -.. only:: subproject and html - - Indices - ======= - - * :ref:`genindex` diff --git a/Documentation/admin-guide/hw-vuln/l1d_flush.rst b/Documentation/admin-guide/hw-vuln/l1d_flush.rst index 210020bc3f56..35dc25159b28 100644 --- a/Documentation/admin-guide/hw-vuln/l1d_flush.rst +++ b/Documentation/admin-guide/hw-vuln/l1d_flush.rst @@ -31,7 +31,7 @@ specifically opt into the feature to enable it. Mitigation ---------- -When PR_SET_L1D_FLUSH is enabled for a task a flush of the L1D cache is +When PR_SPEC_L1D_FLUSH is enabled for a task a flush of the L1D cache is performed when the task is scheduled out and the incoming task belongs to a different process and therefore to a different address space. diff --git a/Documentation/admin-guide/hw-vuln/spectre.rst b/Documentation/admin-guide/hw-vuln/spectre.rst index 991f12adef8d..4bb8549bee82 100644 --- a/Documentation/admin-guide/hw-vuln/spectre.rst +++ b/Documentation/admin-guide/hw-vuln/spectre.rst @@ -406,7 +406,7 @@ The possible values in this file are: - Single threaded indirect branch prediction (STIBP) status for protection between different hyper threads. This feature can be controlled through - prctl per process, or through kernel command line options. This is x86 + prctl per process, or through kernel command line options. This is an x86 only feature. For more details see below. ==================== ======================================================== diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index 259d79fbeb94..b734f8a2a2c4 100644 --- a/Documentation/admin-guide/index.rst +++ b/Documentation/admin-guide/index.rst @@ -189,10 +189,3 @@ A few hard-to-categorize and generally obsolete documents. ldm unicode - -.. only:: subproject and html - - Indices - ======= - - * :ref:`genindex` diff --git a/Documentation/admin-guide/initrd.rst b/Documentation/admin-guide/initrd.rst index 67bbad8806e8..6c1660a4c5cc 100644 --- a/Documentation/admin-guide/initrd.rst +++ b/Documentation/admin-guide/initrd.rst @@ -297,7 +297,7 @@ as follows: 8) now the system is bootable and additional installation tasks can be performed -The key role of initrd here is to re-use the configuration data during +The key role of initrd here is to reuse the configuration data during normal system operation without requiring the use of a bloated "generic" kernel or re-compiling or re-linking the kernel. diff --git a/Documentation/admin-guide/kdump/index.rst b/Documentation/admin-guide/kdump/index.rst index 8e2ebd0383cd..cf5d7c868b74 100644 --- a/Documentation/admin-guide/kdump/index.rst +++ b/Documentation/admin-guide/kdump/index.rst @@ -11,10 +11,3 @@ information. kdump vmcoreinfo - -.. only:: subproject and html - - Indices - ======= - - * :ref:`genindex` diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst index 7b011eb116a7..7587caadbae1 100644 --- a/Documentation/admin-guide/kdump/kdump.rst +++ b/Documentation/admin-guide/kdump/kdump.rst @@ -591,7 +591,7 @@ with /sys/kernel/config/crash_dm_crypt_keys for setup, cat /sys/kernel/config/crash_dm_crypt_keys/count 2 - # To support CPU/memory hot-plugging, re-use keys already saved to reserved + # To support CPU/memory hot-plugging, reuse keys already saved to reserved # memory echo true > /sys/kernel/config/crash_dm_crypt_key/reuse diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst index 7bf8cc7df6b5..02a725536cc5 100644 --- a/Documentation/admin-guide/kernel-parameters.rst +++ b/Documentation/admin-guide/kernel-parameters.rst @@ -110,102 +110,7 @@ The parameters listed below are only valid if certain kernel build options were enabled and if respective hardware is present. This list should be kept in alphabetical order. The text in square brackets at the beginning of each description states the restrictions within which a parameter -is applicable:: - - ACPI ACPI support is enabled. - AGP AGP (Accelerated Graphics Port) is enabled. - ALSA ALSA sound support is enabled. - APIC APIC support is enabled. - APM Advanced Power Management support is enabled. - APPARMOR AppArmor support is enabled. - ARM ARM architecture is enabled. - ARM64 ARM64 architecture is enabled. - AX25 Appropriate AX.25 support is enabled. - CLK Common clock infrastructure is enabled. - CMA Contiguous Memory Area support is enabled. - DRM Direct Rendering Management support is enabled. - DYNAMIC_DEBUG Build in debug messages and enable them at runtime - EARLY Parameter processed too early to be embedded in initrd. - EDD BIOS Enhanced Disk Drive Services (EDD) is enabled - EFI EFI Partitioning (GPT) is enabled - EVM Extended Verification Module - FB The frame buffer device is enabled. - FTRACE Function tracing enabled. - GCOV GCOV profiling is enabled. - HIBERNATION HIBERNATION is enabled. - HW Appropriate hardware is enabled. - HYPER_V HYPERV support is enabled. - IMA Integrity measurement architecture is enabled. - IP_PNP IP DHCP, BOOTP, or RARP is enabled. - IPV6 IPv6 support is enabled. - ISAPNP ISA PnP code is enabled. - ISDN Appropriate ISDN support is enabled. - ISOL CPU Isolation is enabled. - JOY Appropriate joystick support is enabled. - KGDB Kernel debugger support is enabled. - KVM Kernel Virtual Machine support is enabled. - LIBATA Libata driver is enabled - LOONGARCH LoongArch architecture is enabled. - LOOP Loopback device support is enabled. - LP Printer support is enabled. - M68k M68k architecture is enabled. - These options have more detailed description inside of - Documentation/arch/m68k/kernel-options.rst. - MDA MDA console support is enabled. - MIPS MIPS architecture is enabled. - MOUSE Appropriate mouse support is enabled. - MSI Message Signaled Interrupts (PCI). - MTD MTD (Memory Technology Device) support is enabled. - NET Appropriate network support is enabled. - NFS Appropriate NFS support is enabled. - NUMA NUMA support is enabled. - OF Devicetree is enabled. - PARISC The PA-RISC architecture is enabled. - PCI PCI bus support is enabled. - PCIE PCI Express support is enabled. - PCMCIA The PCMCIA subsystem is enabled. - PNP Plug & Play support is enabled. - PPC PowerPC architecture is enabled. - PPT Parallel port support is enabled. - PS2 Appropriate PS/2 support is enabled. - PV_OPS A paravirtualized kernel is enabled. - RAM RAM disk support is enabled. - RDT Intel Resource Director Technology. - RISCV RISCV architecture is enabled. - S390 S390 architecture is enabled. - SCSI Appropriate SCSI support is enabled. - A lot of drivers have their options described inside - the Documentation/scsi/ sub-directory. - SDW SoundWire support is enabled. - SECURITY Different security models are enabled. - SELINUX SELinux support is enabled. - SERIAL Serial support is enabled. - SH SuperH architecture is enabled. - SMP The kernel is an SMP kernel. - SPARC Sparc architecture is enabled. - SUSPEND System suspend states are enabled. - SWSUSP Software suspend (hibernation) is enabled. - TPM TPM drivers are enabled. - UMS USB Mass Storage support is enabled. - USB USB support is enabled. - USBHID USB Human Interface Device support is enabled. - V4L Video For Linux support is enabled. - VGA The VGA console has been enabled. - VMMIO Driver for memory mapped virtio devices is enabled. - VT Virtual terminal support is enabled. - WDT Watchdog support is enabled. - X86-32 X86-32, aka i386 architecture is enabled. - X86-64 X86-64 architecture is enabled. - X86 Either 32-bit or 64-bit x86 (same as X86-32+X86-64) - X86_UV SGI UV support is enabled. - XEN Xen support is enabled - XTENSA xtensa architecture is enabled. - -In addition, the following text indicates that the option:: - - BOOT Is a boot loader parameter. - BUGS= Relates to possible processor bugs on the said processor. - KNL Is a kernel start-up parameter. +is applicable. Parameters denoted with BOOT are actually interpreted by the boot loader, and have no meaning to the kernel directly. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 6c42061ca20e..03a550630644 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1,3 +1,102 @@ + ACPI ACPI support is enabled. + AGP AGP (Accelerated Graphics Port) is enabled. + ALSA ALSA sound support is enabled. + APIC APIC support is enabled. + APM Advanced Power Management support is enabled. + APPARMOR AppArmor support is enabled. + ARM ARM architecture is enabled. + ARM64 ARM64 architecture is enabled. + AX25 Appropriate AX.25 support is enabled. + CLK Common clock infrastructure is enabled. + CMA Contiguous Memory Area support is enabled. + DRM Direct Rendering Management support is enabled. + DYNAMIC_DEBUG Build in debug messages and enable them at runtime + EARLY Parameter processed too early to be embedded in initrd. + EDD BIOS Enhanced Disk Drive Services (EDD) is enabled + EFI EFI Partitioning (GPT) is enabled + EVM Extended Verification Module + FB The frame buffer device is enabled. + FTRACE Function tracing enabled. + GCOV GCOV profiling is enabled. + HIBERNATION HIBERNATION is enabled. + HW Appropriate hardware is enabled. + HYPER_V HYPERV support is enabled. + IMA Integrity measurement architecture is enabled. + IP_PNP IP DHCP, BOOTP, or RARP is enabled. + IPV6 IPv6 support is enabled. + ISAPNP ISA PnP code is enabled. + ISDN Appropriate ISDN support is enabled. + ISOL CPU Isolation is enabled. + JOY Appropriate joystick support is enabled. + KGDB Kernel debugger support is enabled. + KVM Kernel Virtual Machine support is enabled. + LIBATA Libata driver is enabled + LOONGARCH LoongArch architecture is enabled. + LOOP Loopback device support is enabled. + LP Printer support is enabled. + M68k M68k architecture is enabled. + These options have more detailed description inside of + Documentation/arch/m68k/kernel-options.rst. + MDA MDA console support is enabled. + MIPS MIPS architecture is enabled. + MOUSE Appropriate mouse support is enabled. + MSI Message Signaled Interrupts (PCI). + MTD MTD (Memory Technology Device) support is enabled. + NET Appropriate network support is enabled. + NFS Appropriate NFS support is enabled. + NUMA NUMA support is enabled. + OF Devicetree is enabled. + PARISC The PA-RISC architecture is enabled. + PCI PCI bus support is enabled. + PCIE PCI Express support is enabled. + PCMCIA The PCMCIA subsystem is enabled. + PNP Plug & Play support is enabled. + PPC PowerPC architecture is enabled. + PPT Parallel port support is enabled. + PS2 Appropriate PS/2 support is enabled. + PV_OPS A paravirtualized kernel is enabled. + RAM RAM disk support is enabled. + RDT Intel Resource Director Technology. + RISCV RISCV architecture is enabled. + S390 S390 architecture is enabled. + SCSI Appropriate SCSI support is enabled. + A lot of drivers have their options described inside + the Documentation/scsi/ sub-directory. + SDW SoundWire support is enabled. + SECURITY Different security models are enabled. + SELINUX SELinux support is enabled. + SERIAL Serial support is enabled. + SH SuperH architecture is enabled. + SMP The kernel is an SMP kernel. + SPARC Sparc architecture is enabled. + SUSPEND System suspend states are enabled. + SWSUSP Software suspend (hibernation) is enabled. + TPM TPM drivers are enabled. + UMS USB Mass Storage support is enabled. + USB USB support is enabled. + NVME NVMe support is enabled + USBHID USB Human Interface Device support is enabled. + V4L Video For Linux support is enabled. + VGA The VGA console has been enabled. + VMMIO Driver for memory mapped virtio devices is enabled. + VT Virtual terminal support is enabled. + WDT Watchdog support is enabled. + X86-32 X86-32, aka i386 architecture is enabled. + X86-64 X86-64 architecture is enabled. + X86 Either 32-bit or 64-bit x86 (same as X86-32+X86-64) + X86_UV SGI UV support is enabled. + XEN Xen support is enabled + XTENSA xtensa architecture is enabled. + +In addition, the following text indicates that the option + + BOOT Is a boot loader parameter. + BUGS= Relates to possible processor bugs on the said processor. + KNL Is a kernel start-up parameter. + + +Kernel parameters + accept_memory= [MM] Format: { eager | lazy } default: lazy @@ -27,6 +126,8 @@ may result in duplicate corrected error reports. nospcr -- disable console in ACPI SPCR table as default _serial_ console on ARM64 + spcr -- enable console in ACPI SPCR table as + default _serial_ console on x86 For ARM64, ONLY "acpi=off", "acpi=on", "acpi=force" or "acpi=nospcr" are available For RISCV64, ONLY "acpi=off", "acpi=on" or "acpi=force" @@ -669,6 +770,14 @@ nokmem -- Disable kernel memory accounting. nobpf -- Disable BPF memory accounting. + check_pages= [MM,EARLY] Enable sanity checking of pages after + allocations / before freeing. This adds checks to catch + double-frees, use-after-frees, and other sources of + page corruption by inspecting page internals (flags, + mapcount/refcount, memcg_data, etc.). + Format: { "0" | "1" } + Default: 0 (1 if CONFIG_DEBUG_VM is set) + checkreqprot= [SELINUX] Set initial checkreqprot flag value. Format: { "0" | "1" } See security/selinux/Kconfig help text. @@ -1013,7 +1122,7 @@ It will be ignored when crashkernel=X,high is not used or memory reserved is below 4G. crashkernel=size[KMG],cma - [KNL, X86] Reserve additional crash kernel memory from + [KNL, X86, ppc] Reserve additional crash kernel memory from CMA. This reservation is usable by the first system's userspace memory and kernel movable allocations (memory balloon, zswap). Pages allocated from this memory range @@ -1113,12 +1222,8 @@ debugfs= [KNL,EARLY] This parameter enables what is exposed to userspace and debugfs internal clients. - Format: { on, no-mount, off } + Format: { on, off } on: All functions are enabled. - no-mount: - Filesystem is not registered but kernel clients can - access APIs and a crashkernel can be used to read - its content. There is nothing to mount. off: Filesystem is not registered and clients get a -EPERM as result when trying to register files or directories within debugfs. @@ -1268,6 +1373,13 @@ For details see: Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst + dm_verity.keyring_unsealed= + [KNL] When set to 1, leave the dm-verity keyring + unsealed after initialization so userspace can + provision keys. Once the keyring is restricted + it becomes active and is searched during signature + verification. + driver_async_probe= [KNL] List of driver names to be probed asynchronously. * matches with all driver names. If * is specified, the @@ -1867,6 +1979,9 @@ param "no_hash_pointers" is an alias for this mode. + For controlling hashing dynamically at runtime, + use the "kernel.kptr_restrict" sysctl instead. + hashdist= [KNL,NUMA] Large hashes allocated during boot are distributed across NUMA nodes. Defaults on for 64-bit NUMA, off otherwise. @@ -1907,6 +2022,16 @@ /sys/power/pm_test). Only available when CONFIG_PM_DEBUG is set. Default value is 5. + hibernate_compression_threads= + [HIBERNATION] + Set the number of threads used for compressing or decompressing + hibernation images. + + Format: <integer> + Default: 3 + Minimum: 1 + Example: hibernate_compression_threads=4 + highmem=nn[KMG] [KNL,BOOT,EARLY] forces the highmem zone to have an exact size of <nn>. This works even on boxes that have no highmem otherwise. This also works to reduce highmem @@ -2010,14 +2135,20 @@ the added memory block itself do not be affected. hung_task_panic= - [KNL] Should the hung task detector generate panics. - Format: 0 | 1 + [KNL] Number of hung tasks to trigger kernel panic. + Format: <int> + + When set to a non-zero value, a kernel panic will be triggered if + the number of detected hung tasks reaches this value. - A value of 1 instructs the kernel to panic when a - hung task is detected. The default value is controlled - by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time - option. The value selected by this boot parameter can - be changed later by the kernel.hung_task_panic sysctl. + 0: don't panic + 1: panic immediately on first hung task + N: panic after N hung tasks are detected in a single scan + + The default value is controlled by the + CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time option. The value + selected by this boot parameter can be changed later by the + kernel.hung_task_panic sysctl. hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC) terminal devices. Valid values: 0..8 @@ -2557,6 +2688,15 @@ 1 - Bypass the IOMMU for DMA. unset - Use value of CONFIG_IOMMU_DEFAULT_PASSTHROUGH. + iommu.debug_pagealloc= + [KNL,EARLY] When CONFIG_IOMMU_DEBUG_PAGEALLOC is set, this + parameter enables the feature at boot time. By default, it + is disabled and the system behaves the same way as a kernel + built without CONFIG_IOMMU_DEBUG_PAGEALLOC. + Format: { "0" | "1" } + 0 - Sanitizer disabled. + 1 - Sanitizer enabled, expect runtime overhead. + io7= [HW] IO7 for Marvel-based Alpha systems See comment before marvel_specify_io7 in arch/alpha/kernel/core_marvel.c. @@ -2799,6 +2939,41 @@ for Movable pages. "nn[KMGTPE]", "nn%", and "mirror" are exclusive, so you cannot specify multiple forms. + kfence.burst= [MM,KFENCE] The number of additional successive + allocations to be attempted through KFENCE for each + sample interval. + Format: <unsigned integer> + Default: 0 + + kfence.check_on_panic= + [MM,KFENCE] Whether to check all KFENCE-managed objects' + canaries on panic. + Format: <bool> + Default: false + + kfence.deferrable= + [MM,KFENCE] Whether to use a deferrable timer to trigger + allocations. This avoids forcing CPU wake-ups if the + system is idle, at the risk of a less predictable + sample interval. + Format: <bool> + Default: CONFIG_KFENCE_DEFERRABLE + + kfence.sample_interval= + [MM,KFENCE] KFENCE's sample interval in milliseconds. + Format: <unsigned integer> + 0 - Disable KFENCE. + >0 - Enabled KFENCE with given sample interval. + Default: CONFIG_KFENCE_SAMPLE_INTERVAL + + kfence.skip_covered_thresh= + [MM,KFENCE] If pool utilization reaches this threshold + (pool usage%), KFENCE limits currently covered + allocations of the same source from further filling + up the pool. + Format: <unsigned integer> + Default: 75 + kgdbdbgp= [KGDB,HW,EARLY] kgdb over EHCI usb debug port. Format: <Controller#>[,poll interval] The controller # is the number of the ehci usb debug @@ -2926,6 +3101,26 @@ Default is Y (on). + kvm.enable_pmu=[KVM,X86] + If enabled, KVM will virtualize PMU functionality based + on the virtual CPU model defined by userspace. This + can be overridden on a per-VM basis via + KVM_CAP_PMU_CAPABILITY. + + If disabled, KVM will not virtualize PMU functionality, + e.g. MSRs, PMCs, PMIs, etc., even if userspace defines + a virtual CPU model that contains PMU assets. + + Note, KVM's vPMU support implicitly requires running + with an in-kernel local APIC, e.g. to deliver PMIs to + the guest. Running without an in-kernel local APIC is + not supported, though KVM will allow such a combination + (with severely degraded functionality). + + See also enable_mediated_pmu. + + Default is Y (on). + kvm.enable_virt_at_load=[KVM,ARM64,LOONGARCH,MIPS,RISCV,X86] If enabled, KVM will enable virtualization in hardware when KVM is loaded, and disable virtualization when KVM @@ -2972,6 +3167,35 @@ If the value is 0 (the default), KVM will pick a period based on the ratio, such that a page is zapped after 1 hour on average. + kvm-{amd,intel}.enable_mediated_pmu=[KVM,AMD,INTEL] + If enabled, KVM will provide a mediated virtual PMU, + instead of the default perf-based virtual PMU (if + kvm.enable_pmu is true and PMU is enumerated via the + virtual CPU model). + + With a perf-based vPMU, KVM operates as a user of perf, + i.e. emulates guest PMU counters using perf events. + KVM-created perf events are managed by perf as regular + (guest-only) events, e.g. are scheduled in/out, contend + for hardware resources, etc. Using a perf-based vPMU + allows guest and host usage of the PMU to co-exist, but + incurs non-trivial overhead and can result in silently + dropped guest events (due to resource contention). + + With a mediated vPMU, hardware PMU state is context + switched around the world switch to/from the guest. + KVM mediates which events the guest can utilize, but + gives the guest direct access to all other PMU assets + when possible (KVM may intercept some accesses if the + virtual CPU model provides a subset of hardware PMU + functionality). Using a mediated vPMU significantly + reduces PMU virtualization overhead and eliminates lost + guest events, but is mutually exclusive with using perf + to profile KVM guests and adds latency to most VM-Exits + (to context switch PMU state). + + Default is N (off). + kvm-amd.nested= [KVM,AMD] Control nested virtualization feature in KVM/SVM. Default is 1 (enabled). @@ -3294,6 +3518,11 @@ * [no]logdir: Enable or disable access to the general purpose log directory. + * max_sec=<sectors>: Set the transfer size limit, in + number of 512-byte sectors, to the value specified in + <sectors>. The value specified in <sectors> has to be + a non-zero positive integer. + * max_sec_128: Set transfer size limit to 128 sectors. * max_sec_1024: Set or clear transfer size limit to @@ -3319,7 +3548,10 @@ If there are multiple matching configurations changing the same attribute, the last one is used. - load_ramdisk= [RAM] [Deprecated] + liveupdate= [KNL,EARLY] + Format: <bool> + Enable Live Update Orchestrator (LUO). + Default: off. lockd.nlm_grace_period=P [NFS] Assign grace period. Format: <integer> @@ -3880,6 +4112,7 @@ spectre_v2_user=off [X86] srbds=off [X86,INTEL] ssbd=force-off [ARM64] + tsa=off [X86,AMD] tsx_async_abort=off [X86] vmscape=off [X86] @@ -4326,8 +4559,10 @@ Note that this argument takes precedence over the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL option. - noinitrd [RAM] Tells the kernel not to load any configured - initial RAM disk. + noinitrd [Deprecated,RAM] Tells the kernel not to load any configured + initial RAM disk. Currently this parameter applies to + initrd only, not to initramfs. But it applies to both + in EFI mode. nointremap [X86-64,Intel-IOMMU,EARLY] Do not enable interrupt remapping. @@ -4427,7 +4662,7 @@ nosmt [KNL,MIPS,PPC,EARLY] Disable symmetric multithreading (SMT). Equivalent to smt=1. - [KNL,X86,PPC,S390] Disable symmetric multithreading (SMT). + [KNL,LOONGARCH,X86,PPC,S390] Disable symmetric multithreading (SMT). nosmt=force: Force disable SMT, cannot be undone via the sysfs control file. @@ -4553,6 +4788,18 @@ This can be set from sysctl after boot. See Documentation/admin-guide/sysctl/vm.rst for details. + nvme.quirks= [NVME] A list of quirk entries to augment the built-in + nvme quirk list. List entries are separated by a + '-' character. + Each entry has the form VendorID:ProductID:quirk_names. + The IDs are 4-digits hex numbers and quirk_names is a + list of quirk names separated by commas. A quirk name + can be prefixed by '^', meaning that the specified + quirk must be disabled. + + Example: + nvme.quirks=7710:2267:bogus_nid,^identify_cns-9900:7711:broken_msi + ohci1394_dma=early [HW,EARLY] enable debugging via the ohci1394 driver. See Documentation/core-api/debugging-via-ohci1394.rst for more info. @@ -4635,6 +4882,21 @@ panic_on_warn=1 panic() instead of WARN(). Useful to cause kdump on a WARN(). + panic_force_cpu= + [KNL,SMP] Force panic handling to execute on a specific CPU. + Format: <cpu number> + Some platforms require panic handling to occur on a + specific CPU for the crash kernel to function correctly. + This can be due to firmware limitations, interrupt routing + constraints, or platform-specific requirements where only + a particular CPU can safely enter the crash kernel. + When set, panic() will redirect execution to the specified + CPU before proceeding with the normal panic and kexec flow. + If the target CPU is offline or unavailable, panic proceeds + on the current CPU. + This option should only be used for systems with the above + constraints as it might cause the panic operation to be less reliable. + panic_print= Bitmask for printing system info when panic happens. User can chose combination of the following bits: bit 0: print all tasks info @@ -5284,8 +5546,6 @@ Param: <number> - step/bucket size as a power of 2 for statistical time based profiling. - prompt_ramdisk= [RAM] [Deprecated] - prot_virt= [S390] enable hosting protected virtual machines isolated from the hypervisor (if hardware supports that). If enabled, the default kernel base address @@ -5342,7 +5602,7 @@ ramdisk_size= [RAM] Sizes of RAM disks in kilobytes See Documentation/admin-guide/blockdev/ramdisk.rst. - ramdisk_start= [RAM] RAM disk image start address + ramdisk_start= [Deprecated,RAM] RAM disk image start address random.trust_cpu=off [KNL,EARLY] Disable trusting the use of the CPU's @@ -6131,13 +6391,6 @@ dynamically) adjusted. This parameter is intended for use in testing. - rcupdate.rcu_task_ipi_delay= [KNL] - Set time in jiffies during which RCU tasks will - avoid sending IPIs, starting with the beginning - of a given grace period. Setting a large - number avoids disturbing real-time workloads, - but lengthens grace periods. - rcupdate.rcu_task_lazy_lim= [KNL] Number of callbacks on a given CPU that will cancel laziness on that CPU. Use -1 to disable @@ -6181,14 +6434,6 @@ of zero will disable batching. Batching is always disabled for synchronize_rcu_tasks(). - rcupdate.rcu_tasks_trace_lazy_ms= [KNL] - Set timeout in milliseconds RCU Tasks - Trace asynchronous callback batching for - call_rcu_tasks_trace(). A negative value - will take the default. A value of zero will - disable batching. Batching is always disabled - for synchronize_rcu_tasks_trace(). - rcupdate.rcu_self_test= [KNL] Run the RCU early boot self tests @@ -6207,9 +6452,14 @@ rdt= [HW,X86,RDT] Turn on/off individual RDT features. List is: cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp, - mba, smba, bmec, abmc. + mba, smba, bmec, abmc, sdciae, energy[:guid], + perf[:guid]. E.g. to turn on cmt and turn off mba use: rdt=cmt,!mba + To turn off all energy telemetry monitoring and ensure that + perf telemetry monitoring associated with guid 0x12345 + is enabled use: + rdt=!energy,perf:0x12345 reboot= [KNL] Format (x86 or x86_64): @@ -6404,7 +6654,7 @@ that don't. off - no mitigation - auto - automatically select a migitation + auto - automatically select a mitigation auto,nosmt - automatically select a mitigation, disabling SMT if necessary for the full mitigation (only on Zen1 @@ -6453,6 +6703,14 @@ replacement properties are not found. See the Kconfig entry for RISCV_ISA_FALLBACK. + riscv_nousercfi= + all Disable user CFI ABI to userspace even if cpu extension + are available. + bcfi Disable user backward CFI ABI to userspace even if + the shadow stack extension is available. + fcfi Disable user forward CFI ABI to userspace even if the + landing pad extension is available. + ro [KNL] Mount root device read-only on boot rodata= [KNL,EARLY] @@ -6482,6 +6740,11 @@ rootflags= [KNL] Set root filesystem mount option string + rseq_slice_ext= [KNL] RSEQ based time slice extension + Format: boolean + Control enablement of RSEQ based time slice extension. + Default is 'on'. + initramfs_options= [KNL] Specify mount options for for the initramfs mount. @@ -6500,6 +6763,10 @@ Memory area to be used by remote processor image, managed by CMA. + rseq_debug= [KNL] Enable or disable restartable sequence + debug mode. Defaults to CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE. + Format: <bool> + rt_group_sched= [KNL] Enable or disable SCHED_RR/FIFO group scheduling when CONFIG_RT_GROUP_SCHED=y. Defaults to !CONFIG_RT_GROUP_SCHED_DEFAULT_DISABLED. @@ -6812,12 +7079,12 @@ softlockup_panic= [KNL] Should the soft-lockup detector generate panics. - Format: 0 | 1 + Format: <int> - A value of 1 instructs the soft-lockup detector - to panic the machine when a soft-lockup occurs. It is - also controlled by the kernel.softlockup_panic sysctl - and CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC, which is the + A value of non-zero instructs the soft-lockup detector + to panic the machine when a soft-lockup duration exceeds + N thresholds. It is also controlled by the kernel.softlockup_panic + sysctl and CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC, which is the respective build-time switch to that functionality. softlockup_all_cpu_backtrace= @@ -7150,7 +7417,7 @@ limit. Default value is 8191 pools. stacktrace [FTRACE] - Enabled the stack tracer on boot up. + Enable the stack tracer on boot up. stacktrace_filter=[function-list] [FTRACE] Limit the functions that the stack tracer @@ -7192,6 +7459,9 @@ them frequently to increase the rate of SLB faults on kernel addresses. + no_slb_preload [PPC,EARLY] + Disables slb preloading for userspace. + sunrpc.min_resvport= sunrpc.max_resvport= [NFS,SUNRPC] @@ -7630,6 +7900,7 @@ - "tee" - "caam" - "dcp" + - "pkwm" If not specified then it defaults to iterating through the trust source list starting with TPM and assigns the first trust source as a backend which is initialized @@ -7925,6 +8196,9 @@ p = USB_QUIRK_SHORT_SET_ADDRESS_REQ_TIMEOUT (Reduce timeout of the SET_ADDRESS request from 5000 ms to 500 ms); + q = USB_QUIRK_FORCE_ONE_CONFIG (Device + claims zero configurations, + forcing to 1); Example: quirks=0781:5580:bk,0a5c:5834:gij usbhid.mousepoll= @@ -8211,7 +8485,16 @@ CONFIG_WQ_WATCHDOG. It sets the number times of the stall to trigger panic. - The default is 0, which disables the panic on stall. + The default is set by CONFIG_BOOTPARAM_WQ_STALL_PANIC, + which is 0 (disabled) if not configured. + + workqueue.panic_on_stall_time=<uint> + Panic when a workqueue stall has been continuous for + the specified number of seconds. Unlike panic_on_stall + which counts accumulated stall events, this triggers + based on the duration of a single continuous stall. + + The default is 0, which disables the time-based panic. workqueue.cpu_intensive_thresh_us= Per-cpu work items which run for longer than this @@ -8289,6 +8572,11 @@ save/restore/migration must be enabled to handle larger domains. + xen_console_io [XEN,EARLY] + Boolean option to enable/disable the usage of the Xen + console_io hypercalls to read and write to the console. + Mostly useful for debugging and development. + xen_emul_unplug= [HW,X86,XEN,EARLY] Unplug Xen emulated devices Format: [unplug0,][unplug1] diff --git a/Documentation/admin-guide/laptops/alienware-wmi.rst b/Documentation/admin-guide/laptops/alienware-wmi.rst index 27a32a8057da..e532c60db8e2 100644 --- a/Documentation/admin-guide/laptops/alienware-wmi.rst +++ b/Documentation/admin-guide/laptops/alienware-wmi.rst @@ -105,7 +105,7 @@ information. Manual fan control on the other hand, is not exposed directly by the AWCC interface. Instead it let's us control a fan `boost` value. This `boost` value -has the following aproximate behavior over the fan pwm: +has the following approximate behavior over the fan pwm: :: diff --git a/Documentation/admin-guide/laptops/index.rst b/Documentation/admin-guide/laptops/index.rst index db842b629303..c0b911d05c59 100644 --- a/Documentation/admin-guide/laptops/index.rst +++ b/Documentation/admin-guide/laptops/index.rst @@ -10,10 +10,10 @@ Laptop Drivers alienware-wmi asus-laptop disk-shock-protection - laptop-mode lg-laptop samsung-galaxybook sony-laptop sonypi thinkpad-acpi toshiba_haps + uniwill-laptop diff --git a/Documentation/admin-guide/laptops/laptop-mode.rst b/Documentation/admin-guide/laptops/laptop-mode.rst deleted file mode 100644 index 66eb9cd918b5..000000000000 --- a/Documentation/admin-guide/laptops/laptop-mode.rst +++ /dev/null @@ -1,770 +0,0 @@ -=============================================== -How to conserve battery power using laptop-mode -=============================================== - -Document Author: Bart Samwel (bart@samwel.tk) - -Date created: January 2, 2004 - -Last modified: December 06, 2004 - -Introduction ------------- - -Laptop mode is used to minimize the time that the hard disk needs to be spun up, -to conserve battery power on laptops. It has been reported to cause significant -power savings. - -.. Contents - - * Introduction - * Installation - * Caveats - * The Details - * Tips & Tricks - * Control script - * ACPI integration - * Monitoring tool - - -Installation ------------- - -To use laptop mode, you don't need to set any kernel configuration options -or anything. Simply install all the files included in this document, and -laptop mode will automatically be started when you're on battery. For -your convenience, a tarball containing an installer can be downloaded at: - - http://www.samwel.tk/laptop_mode/laptop_mode/ - -To configure laptop mode, you need to edit the configuration file, which is -located in /etc/default/laptop-mode on Debian-based systems, or in -/etc/sysconfig/laptop-mode on other systems. - -Unfortunately, automatic enabling of laptop mode does not work for -laptops that don't have ACPI. On those laptops, you need to start laptop -mode manually. To start laptop mode, run "laptop_mode start", and to -stop it, run "laptop_mode stop". (Note: The laptop mode tools package now -has experimental support for APM, you might want to try that first.) - - -Caveats -------- - -* The downside of laptop mode is that you have a chance of losing up to 10 - minutes of work. If you cannot afford this, don't use it! The supplied ACPI - scripts automatically turn off laptop mode when the battery almost runs out, - so that you won't lose any data at the end of your battery life. - -* Most desktop hard drives have a very limited lifetime measured in spindown - cycles, typically about 50.000 times (it's usually listed on the spec sheet). - Check your drive's rating, and don't wear down your drive's lifetime if you - don't need to. - -* If you mount some of your ext3 filesystems with the -n option, then - the control script will not be able to remount them correctly. You must set - DO_REMOUNTS=0 in the control script, otherwise it will remount them with the - wrong options -- or it will fail because it cannot write to /etc/mtab. - -* If you have your filesystems listed as type "auto" in fstab, like I did, then - the control script will not recognize them as filesystems that need remounting. - You must list the filesystems with their true type instead. - -* It has been reported that some versions of the mutt mail client use file access - times to determine whether a folder contains new mail. If you use mutt and - experience this, you must disable the noatime remounting by setting the option - DO_REMOUNT_NOATIME to 0 in the configuration file. - - -The Details ------------ - -Laptop mode is controlled by the knob /proc/sys/vm/laptop_mode. This knob is -present for all kernels that have the laptop mode patch, regardless of any -configuration options. When the knob is set, any physical disk I/O (that might -have caused the hard disk to spin up) causes Linux to flush all dirty blocks. The -result of this is that after a disk has spun down, it will not be spun up -anymore to write dirty blocks, because those blocks had already been written -immediately after the most recent read operation. The value of the laptop_mode -knob determines the time between the occurrence of disk I/O and when the flush -is triggered. A sensible value for the knob is 5 seconds. Setting the knob to -0 disables laptop mode. - -To increase the effectiveness of the laptop_mode strategy, the laptop_mode -control script increases dirty_expire_centisecs and dirty_writeback_centisecs in -/proc/sys/vm to about 10 minutes (by default), which means that pages that are -dirtied are not forced to be written to disk as often. The control script also -changes the dirty background ratio, so that background writeback of dirty pages -is not done anymore. Combined with a higher commit value (also 10 minutes) for -ext3 filesystem (also done automatically by the control script), -this results in concentration of disk activity in a small time interval which -occurs only once every 10 minutes, or whenever the disk is forced to spin up by -a cache miss. The disk can then be spun down in the periods of inactivity. - - -Configuration -------------- - -The laptop mode configuration file is located in /etc/default/laptop-mode on -Debian-based systems, or in /etc/sysconfig/laptop-mode on other systems. It -contains the following options: - -MAX_AGE: - -Maximum time, in seconds, of hard drive spindown time that you are -comfortable with. Worst case, it's possible that you could lose this -amount of work if your battery fails while you're in laptop mode. - -MINIMUM_BATTERY_MINUTES: - -Automatically disable laptop mode if the remaining number of minutes of -battery power is less than this value. Default is 10 minutes. - -AC_HD/BATT_HD: - -The idle timeout that should be set on your hard drive when laptop mode -is active (BATT_HD) and when it is not active (AC_HD). The defaults are -20 seconds (value 4) for BATT_HD and 2 hours (value 244) for AC_HD. The -possible values are those listed in the manual page for "hdparm" for the -"-S" option. - -HD: - -The devices for which the spindown timeout should be adjusted by laptop mode. -Default is /dev/hda. If you specify multiple devices, separate them by a space. - -READAHEAD: - -Disk readahead, in 512-byte sectors, while laptop mode is active. A large -readahead can prevent disk accesses for things like executable pages (which are -loaded on demand while the application executes) and sequentially accessed data -(MP3s). - -DO_REMOUNTS: - -The control script automatically remounts any mounted journaled filesystems -with appropriate commit interval options. When this option is set to 0, this -feature is disabled. - -DO_REMOUNT_NOATIME: - -When remounting, should the filesystems be remounted with the noatime option? -Normally, this is set to "1" (enabled), but there may be programs that require -access time recording. - -DIRTY_RATIO: - -The percentage of memory that is allowed to contain "dirty" or unsaved data -before a writeback is forced, while laptop mode is active. Corresponds to -the /proc/sys/vm/dirty_ratio sysctl. - -DIRTY_BACKGROUND_RATIO: - -The percentage of memory that is allowed to contain "dirty" or unsaved data -after a forced writeback is done due to an exceeding of DIRTY_RATIO. Set -this nice and low. This corresponds to the /proc/sys/vm/dirty_background_ratio -sysctl. - -Note that the behaviour of dirty_background_ratio is quite different -when laptop mode is active and when it isn't. When laptop mode is inactive, -dirty_background_ratio is the threshold percentage at which background writeouts -start taking place. When laptop mode is active, however, background writeouts -are disabled, and the dirty_background_ratio only determines how much writeback -is done when dirty_ratio is reached. - -DO_CPU: - -Enable CPU frequency scaling when in laptop mode. (Requires CPUFreq to be setup. -See Documentation/admin-guide/pm/cpufreq.rst for more info. Disabled by default.) - -CPU_MAXFREQ: - -When on battery, what is the maximum CPU speed that the system should use? Legal -values are "slowest" for the slowest speed that your CPU is able to operate at, -or a value listed in /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies. - - -Tips & Tricks -------------- - -* Bartek Kania reports getting up to 50 minutes of extra battery life (on top - of his regular 3 to 3.5 hours) using a spindown time of 5 seconds (BATT_HD=1). - -* You can spin down the disk while playing MP3, by setting disk readahead - to 8MB (READAHEAD=16384). Effectively, the disk will read a complete MP3 at - once, and will then spin down while the MP3 is playing. (Thanks to Bartek - Kania.) - -* Drew Scott Daniels observed: "I don't know why, but when I decrease the number - of colours that my display uses it consumes less battery power. I've seen - this on powerbooks too. I hope that this is a piece of information that - might be useful to the Laptop Mode patch or its users." - -* In syslog.conf, you can prefix entries with a dash `-` to omit syncing the - file after every logging. When you're using laptop-mode and your disk doesn't - spin down, this is a likely culprit. - -* Richard Atterer observed that laptop mode does not work well with noflushd - (http://noflushd.sourceforge.net/), it seems that noflushd prevents laptop-mode - from doing its thing. - -* If you're worried about your data, you might want to consider using a USB - memory stick or something like that as a "working area". (Be aware though - that flash memory can only handle a limited number of writes, and overuse - may wear out your memory stick pretty quickly. Do _not_ use journalling - filesystems on flash memory sticks.) - - -Configuration file for control and ACPI battery scripts -------------------------------------------------------- - -This allows the tunables to be changed for the scripts via an external -configuration file - -It should be installed as /etc/default/laptop-mode on Debian, and as -/etc/sysconfig/laptop-mode on Red Hat, SUSE, Mandrake, and other work-alikes. - -Config file:: - - # Maximum time, in seconds, of hard drive spindown time that you are - # comfortable with. Worst case, it's possible that you could lose this - # amount of work if your battery fails you while in laptop mode. - #MAX_AGE=600 - - # Automatically disable laptop mode when the number of minutes of battery - # that you have left goes below this threshold. - MINIMUM_BATTERY_MINUTES=10 - - # Read-ahead, in 512-byte sectors. You can spin down the disk while playing MP3/OGG - # by setting the disk readahead to 8MB (READAHEAD=16384). Effectively, the disk - # will read a complete MP3 at once, and will then spin down while the MP3/OGG is - # playing. - #READAHEAD=4096 - - # Shall we remount journaled fs. with appropriate commit interval? (1=yes) - #DO_REMOUNTS=1 - - # And shall we add the "noatime" option to that as well? (1=yes) - #DO_REMOUNT_NOATIME=1 - - # Dirty synchronous ratio. At this percentage of dirty pages the process - # which - # calls write() does its own writeback - #DIRTY_RATIO=40 - - # - # Allowed dirty background ratio, in percent. Once DIRTY_RATIO has been - # exceeded, the kernel will wake flusher threads which will then reduce the - # amount of dirty memory to dirty_background_ratio. Set this nice and low, - # so once some writeout has commenced, we do a lot of it. - # - #DIRTY_BACKGROUND_RATIO=5 - - # kernel default dirty buffer age - #DEF_AGE=30 - #DEF_UPDATE=5 - #DEF_DIRTY_BACKGROUND_RATIO=10 - #DEF_DIRTY_RATIO=40 - #DEF_XFS_AGE_BUFFER=15 - #DEF_XFS_SYNC_INTERVAL=30 - #DEF_XFS_BUFD_INTERVAL=1 - - # This must be adjusted manually to the value of HZ in the running kernel - # on 2.4, until the XFS people change their 2.4 external interfaces to work in - # centisecs. This can be automated, but it's a work in progress that still - # needs# some fixes. On 2.6 kernels, XFS uses USER_HZ instead of HZ for - # external interfaces, and that is currently always set to 100. So you don't - # need to change this on 2.6. - #XFS_HZ=100 - - # Should the maximum CPU frequency be adjusted down while on battery? - # Requires CPUFreq to be setup. - # See Documentation/admin-guide/pm/cpufreq.rst for more info - #DO_CPU=0 - - # When on battery what is the maximum CPU speed that the system should - # use? Legal values are "slowest" for the slowest speed that your - # CPU is able to operate at, or a value listed in: - # /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies - # Only applicable if DO_CPU=1. - #CPU_MAXFREQ=slowest - - # Idle timeout for your hard drive (man hdparm for valid values, -S option) - # Default is 2 hours on AC (AC_HD=244) and 20 seconds for battery (BATT_HD=4). - #AC_HD=244 - #BATT_HD=4 - - # The drives for which to adjust the idle timeout. Separate them by a space, - # e.g. HD="/dev/hda /dev/hdb". - #HD="/dev/hda" - - # Set the spindown timeout on a hard drive? - #DO_HD=1 - - -Control script --------------- - -Please note that this control script works for the Linux 2.4 and 2.6 series (thanks -to Kiko Piris). - -Control script:: - - #!/bin/bash - - # start or stop laptop_mode, best run by a power management daemon when - # ac gets connected/disconnected from a laptop - # - # install as /sbin/laptop_mode - # - # Contributors to this script: Kiko Piris - # Bart Samwel - # Micha Feigin - # Andrew Morton - # Herve Eychenne - # Dax Kelson - # - # Original Linux 2.4 version by: Jens Axboe - - ############################################################################# - - # Source config - if [ -f /etc/default/laptop-mode ] ; then - # Debian - . /etc/default/laptop-mode - elif [ -f /etc/sysconfig/laptop-mode ] ; then - # Others - . /etc/sysconfig/laptop-mode - fi - - # Don't raise an error if the config file is incomplete - # set defaults instead: - - # Maximum time, in seconds, of hard drive spindown time that you are - # comfortable with. Worst case, it's possible that you could lose this - # amount of work if your battery fails you while in laptop mode. - MAX_AGE=${MAX_AGE:-'600'} - - # Read-ahead, in kilobytes - READAHEAD=${READAHEAD:-'4096'} - - # Shall we remount journaled fs. with appropriate commit interval? (1=yes) - DO_REMOUNTS=${DO_REMOUNTS:-'1'} - - # And shall we add the "noatime" option to that as well? (1=yes) - DO_REMOUNT_NOATIME=${DO_REMOUNT_NOATIME:-'1'} - - # Shall we adjust the idle timeout on a hard drive? - DO_HD=${DO_HD:-'1'} - - # Adjust idle timeout on which hard drive? - HD="${HD:-'/dev/hda'}" - - # spindown time for HD (hdparm -S values) - AC_HD=${AC_HD:-'244'} - BATT_HD=${BATT_HD:-'4'} - - # Dirty synchronous ratio. At this percentage of dirty pages the process which - # calls write() does its own writeback - DIRTY_RATIO=${DIRTY_RATIO:-'40'} - - # cpu frequency scaling - # See Documentation/admin-guide/pm/cpufreq.rst for more info - DO_CPU=${CPU_MANAGE:-'0'} - CPU_MAXFREQ=${CPU_MAXFREQ:-'slowest'} - - # - # Allowed dirty background ratio, in percent. Once DIRTY_RATIO has been - # exceeded, the kernel will wake flusher threads which will then reduce the - # amount of dirty memory to dirty_background_ratio. Set this nice and low, - # so once some writeout has commenced, we do a lot of it. - # - DIRTY_BACKGROUND_RATIO=${DIRTY_BACKGROUND_RATIO:-'5'} - - # kernel default dirty buffer age - DEF_AGE=${DEF_AGE:-'30'} - DEF_UPDATE=${DEF_UPDATE:-'5'} - DEF_DIRTY_BACKGROUND_RATIO=${DEF_DIRTY_BACKGROUND_RATIO:-'10'} - DEF_DIRTY_RATIO=${DEF_DIRTY_RATIO:-'40'} - DEF_XFS_AGE_BUFFER=${DEF_XFS_AGE_BUFFER:-'15'} - DEF_XFS_SYNC_INTERVAL=${DEF_XFS_SYNC_INTERVAL:-'30'} - DEF_XFS_BUFD_INTERVAL=${DEF_XFS_BUFD_INTERVAL:-'1'} - - # This must be adjusted manually to the value of HZ in the running kernel - # on 2.4, until the XFS people change their 2.4 external interfaces to work in - # centisecs. This can be automated, but it's a work in progress that still needs - # some fixes. On 2.6 kernels, XFS uses USER_HZ instead of HZ for external - # interfaces, and that is currently always set to 100. So you don't need to - # change this on 2.6. - XFS_HZ=${XFS_HZ:-'100'} - - ############################################################################# - - KLEVEL="$(uname -r | - { - IFS='.' read a b c - echo $a.$b - } - )" - case "$KLEVEL" in - "2.4"|"2.6") - ;; - *) - echo "Unhandled kernel version: $KLEVEL ('uname -r' = '$(uname -r)')" >&2 - exit 1 - ;; - esac - - if [ ! -e /proc/sys/vm/laptop_mode ] ; then - echo "Kernel is not patched with laptop_mode patch." >&2 - exit 1 - fi - - if [ ! -w /proc/sys/vm/laptop_mode ] ; then - echo "You do not have enough privileges to enable laptop_mode." >&2 - exit 1 - fi - - # Remove an option (the first parameter) of the form option=<number> from - # a mount options string (the rest of the parameters). - parse_mount_opts () { - OPT="$1" - shift - echo ",$*," | sed \ - -e 's/,'"$OPT"'=[0-9]*,/,/g' \ - -e 's/,,*/,/g' \ - -e 's/^,//' \ - -e 's/,$//' - } - - # Remove an option (the first parameter) without any arguments from - # a mount option string (the rest of the parameters). - parse_nonumber_mount_opts () { - OPT="$1" - shift - echo ",$*," | sed \ - -e 's/,'"$OPT"',/,/g' \ - -e 's/,,*/,/g' \ - -e 's/^,//' \ - -e 's/,$//' - } - - # Find out the state of a yes/no option (e.g. "atime"/"noatime") in - # fstab for a given filesystem, and use this state to replace the - # value of the option in another mount options string. The device - # is the first argument, the option name the second, and the default - # value the third. The remainder is the mount options string. - # - # Example: - # parse_yesno_opts_wfstab /dev/hda1 atime atime defaults,noatime - # - # If fstab contains, say, "rw" for this filesystem, then the result - # will be "defaults,atime". - parse_yesno_opts_wfstab () { - L_DEV="$1" - OPT="$2" - DEF_OPT="$3" - shift 3 - L_OPTS="$*" - PARSEDOPTS1="$(parse_nonumber_mount_opts $OPT $L_OPTS)" - PARSEDOPTS1="$(parse_nonumber_mount_opts no$OPT $PARSEDOPTS1)" - # Watch for a default atime in fstab - FSTAB_OPTS="$(awk '$1 == "'$L_DEV'" { print $4 }' /etc/fstab)" - if echo "$FSTAB_OPTS" | grep "$OPT" > /dev/null ; then - # option specified in fstab: extract the value and use it - if echo "$FSTAB_OPTS" | grep "no$OPT" > /dev/null ; then - echo "$PARSEDOPTS1,no$OPT" - else - # no$OPT not found -- so we must have $OPT. - echo "$PARSEDOPTS1,$OPT" - fi - else - # option not specified in fstab -- choose the default. - echo "$PARSEDOPTS1,$DEF_OPT" - fi - } - - # Find out the state of a numbered option (e.g. "commit=NNN") in - # fstab for a given filesystem, and use this state to replace the - # value of the option in another mount options string. The device - # is the first argument, and the option name the second. The - # remainder is the mount options string in which the replacement - # must be done. - # - # Example: - # parse_mount_opts_wfstab /dev/hda1 commit defaults,commit=7 - # - # If fstab contains, say, "commit=3,rw" for this filesystem, then the - # result will be "rw,commit=3". - parse_mount_opts_wfstab () { - L_DEV="$1" - OPT="$2" - shift 2 - L_OPTS="$*" - PARSEDOPTS1="$(parse_mount_opts $OPT $L_OPTS)" - # Watch for a default commit in fstab - FSTAB_OPTS="$(awk '$1 == "'$L_DEV'" { print $4 }' /etc/fstab)" - if echo "$FSTAB_OPTS" | grep "$OPT=" > /dev/null ; then - # option specified in fstab: extract the value, and use it - echo -n "$PARSEDOPTS1,$OPT=" - echo ",$FSTAB_OPTS," | sed \ - -e 's/.*,'"$OPT"'=//' \ - -e 's/,.*//' - else - # option not specified in fstab: set it to 0 - echo "$PARSEDOPTS1,$OPT=0" - fi - } - - deduce_fstype () { - MP="$1" - # My root filesystem unfortunately has - # type "unknown" in /etc/mtab. If we encounter - # "unknown", we try to get the type from fstab. - cat /etc/fstab | - grep -v '^#' | - while read FSTAB_DEV FSTAB_MP FSTAB_FST FSTAB_OPTS FSTAB_DUMP FSTAB_DUMP ; do - if [ "$FSTAB_MP" = "$MP" ]; then - echo $FSTAB_FST - exit 0 - fi - done - } - - if [ $DO_REMOUNT_NOATIME -eq 1 ] ; then - NOATIME_OPT=",noatime" - fi - - case "$1" in - start) - AGE=$((100*$MAX_AGE)) - XFS_AGE=$(($XFS_HZ*$MAX_AGE)) - echo -n "Starting laptop_mode" - - if [ -d /proc/sys/vm/pagebuf ] ; then - # (For 2.4 and early 2.6.) - # This only needs to be set, not reset -- it is only used when - # laptop mode is enabled. - echo $XFS_AGE > /proc/sys/vm/pagebuf/lm_flush_age - echo $XFS_AGE > /proc/sys/fs/xfs/lm_sync_interval - elif [ -f /proc/sys/fs/xfs/lm_age_buffer ] ; then - # (A couple of early 2.6 laptop mode patches had these.) - # The same goes for these. - echo $XFS_AGE > /proc/sys/fs/xfs/lm_age_buffer - echo $XFS_AGE > /proc/sys/fs/xfs/lm_sync_interval - elif [ -f /proc/sys/fs/xfs/age_buffer ] ; then - # (2.6.6) - # But not for these -- they are also used in normal - # operation. - echo $XFS_AGE > /proc/sys/fs/xfs/age_buffer - echo $XFS_AGE > /proc/sys/fs/xfs/sync_interval - elif [ -f /proc/sys/fs/xfs/age_buffer_centisecs ] ; then - # (2.6.7 upwards) - # And not for these either. These are in centisecs, - # not USER_HZ, so we have to use $AGE, not $XFS_AGE. - echo $AGE > /proc/sys/fs/xfs/age_buffer_centisecs - echo $AGE > /proc/sys/fs/xfs/xfssyncd_centisecs - echo 3000 > /proc/sys/fs/xfs/xfsbufd_centisecs - fi - - case "$KLEVEL" in - "2.4") - echo 1 > /proc/sys/vm/laptop_mode - echo "30 500 0 0 $AGE $AGE 60 20 0" > /proc/sys/vm/bdflush - ;; - "2.6") - echo 5 > /proc/sys/vm/laptop_mode - echo "$AGE" > /proc/sys/vm/dirty_writeback_centisecs - echo "$AGE" > /proc/sys/vm/dirty_expire_centisecs - echo "$DIRTY_RATIO" > /proc/sys/vm/dirty_ratio - echo "$DIRTY_BACKGROUND_RATIO" > /proc/sys/vm/dirty_background_ratio - ;; - esac - if [ $DO_REMOUNTS -eq 1 ]; then - cat /etc/mtab | while read DEV MP FST OPTS DUMP PASS ; do - PARSEDOPTS="$(parse_mount_opts "$OPTS")" - if [ "$FST" = 'unknown' ]; then - FST=$(deduce_fstype $MP) - fi - case "$FST" in - "ext3") - PARSEDOPTS="$(parse_mount_opts commit "$OPTS")" - mount $DEV -t $FST $MP -o remount,$PARSEDOPTS,commit=$MAX_AGE$NOATIME_OPT - ;; - "xfs") - mount $DEV -t $FST $MP -o remount,$OPTS$NOATIME_OPT - ;; - esac - if [ -b $DEV ] ; then - blockdev --setra $(($READAHEAD * 2)) $DEV - fi - done - fi - if [ $DO_HD -eq 1 ] ; then - for THISHD in $HD ; do - /sbin/hdparm -S $BATT_HD $THISHD > /dev/null 2>&1 - /sbin/hdparm -B 1 $THISHD > /dev/null 2>&1 - done - fi - if [ $DO_CPU -eq 1 -a -e /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq ]; then - if [ $CPU_MAXFREQ = 'slowest' ]; then - CPU_MAXFREQ=`cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq` - fi - echo $CPU_MAXFREQ > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq - fi - echo "." - ;; - stop) - U_AGE=$((100*$DEF_UPDATE)) - B_AGE=$((100*$DEF_AGE)) - echo -n "Stopping laptop_mode" - echo 0 > /proc/sys/vm/laptop_mode - if [ -f /proc/sys/fs/xfs/age_buffer -a ! -f /proc/sys/fs/xfs/lm_age_buffer ] ; then - # These need to be restored, if there are no lm_*. - echo $(($XFS_HZ*$DEF_XFS_AGE_BUFFER)) > /proc/sys/fs/xfs/age_buffer - echo $(($XFS_HZ*$DEF_XFS_SYNC_INTERVAL)) > /proc/sys/fs/xfs/sync_interval - elif [ -f /proc/sys/fs/xfs/age_buffer_centisecs ] ; then - # These need to be restored as well. - echo $((100*$DEF_XFS_AGE_BUFFER)) > /proc/sys/fs/xfs/age_buffer_centisecs - echo $((100*$DEF_XFS_SYNC_INTERVAL)) > /proc/sys/fs/xfs/xfssyncd_centisecs - echo $((100*$DEF_XFS_BUFD_INTERVAL)) > /proc/sys/fs/xfs/xfsbufd_centisecs - fi - case "$KLEVEL" in - "2.4") - echo "30 500 0 0 $U_AGE $B_AGE 60 20 0" > /proc/sys/vm/bdflush - ;; - "2.6") - echo "$U_AGE" > /proc/sys/vm/dirty_writeback_centisecs - echo "$B_AGE" > /proc/sys/vm/dirty_expire_centisecs - echo "$DEF_DIRTY_RATIO" > /proc/sys/vm/dirty_ratio - echo "$DEF_DIRTY_BACKGROUND_RATIO" > /proc/sys/vm/dirty_background_ratio - ;; - esac - if [ $DO_REMOUNTS -eq 1 ] ; then - cat /etc/mtab | while read DEV MP FST OPTS DUMP PASS ; do - # Reset commit and atime options to defaults. - if [ "$FST" = 'unknown' ]; then - FST=$(deduce_fstype $MP) - fi - case "$FST" in - "ext3") - PARSEDOPTS="$(parse_mount_opts_wfstab $DEV commit $OPTS)" - PARSEDOPTS="$(parse_yesno_opts_wfstab $DEV atime atime $PARSEDOPTS)" - mount $DEV -t $FST $MP -o remount,$PARSEDOPTS - ;; - "xfs") - PARSEDOPTS="$(parse_yesno_opts_wfstab $DEV atime atime $OPTS)" - mount $DEV -t $FST $MP -o remount,$PARSEDOPTS - ;; - esac - if [ -b $DEV ] ; then - blockdev --setra 256 $DEV - fi - done - fi - if [ $DO_HD -eq 1 ] ; then - for THISHD in $HD ; do - /sbin/hdparm -S $AC_HD $THISHD > /dev/null 2>&1 - /sbin/hdparm -B 255 $THISHD > /dev/null 2>&1 - done - fi - if [ $DO_CPU -eq 1 -a -e /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq ]; then - echo `cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq` > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq - fi - echo "." - ;; - *) - echo "Usage: $0 {start|stop}" 2>&1 - exit 1 - ;; - - esac - - exit 0 - - -ACPI integration ----------------- - -Dax Kelson submitted this so that the ACPI acpid daemon will -kick off the laptop_mode script and run hdparm. The part that -automatically disables laptop mode when the battery is low was -written by Jan Topinski. - -/etc/acpi/events/ac_adapter:: - - event=ac_adapter - action=/etc/acpi/actions/ac.sh %e - -/etc/acpi/events/battery:: - - event=battery.* - action=/etc/acpi/actions/battery.sh %e - -/etc/acpi/actions/ac.sh:: - - #!/bin/bash - - # ac on/offline event handler - - status=`awk '/^state: / { print $2 }' /proc/acpi/ac_adapter/$2/state` - - case $status in - "on-line") - /sbin/laptop_mode stop - exit 0 - ;; - "off-line") - /sbin/laptop_mode start - exit 0 - ;; - esac - - -/etc/acpi/actions/battery.sh:: - - #! /bin/bash - - # Automatically disable laptop mode when the battery almost runs out. - - BATT_INFO=/proc/acpi/battery/$2/state - - if [[ -f /proc/sys/vm/laptop_mode ]] - then - LM=`cat /proc/sys/vm/laptop_mode` - if [[ $LM -gt 0 ]] - then - if [[ -f $BATT_INFO ]] - then - # Source the config file only now that we know we need - if [ -f /etc/default/laptop-mode ] ; then - # Debian - . /etc/default/laptop-mode - elif [ -f /etc/sysconfig/laptop-mode ] ; then - # Others - . /etc/sysconfig/laptop-mode - fi - MINIMUM_BATTERY_MINUTES=${MINIMUM_BATTERY_MINUTES:-'10'} - - ACTION="`cat $BATT_INFO | grep charging | cut -c 26-`" - if [[ ACTION -eq "discharging" ]] - then - PRESENT_RATE=`cat $BATT_INFO | grep "present rate:" | sed "s/.* \([0-9][0-9]* \).*/\1/" ` - REMAINING=`cat $BATT_INFO | grep "remaining capacity:" | sed "s/.* \([0-9][0-9]* \).*/\1/" ` - fi - if (($REMAINING * 60 / $PRESENT_RATE < $MINIMUM_BATTERY_MINUTES)) - then - /sbin/laptop_mode stop - fi - else - logger -p daemon.warning "You are using laptop mode and your battery interface $BATT_INFO is missing. This may lead to loss of data when the battery runs out. Check kernel ACPI support and /proc/acpi/battery folder, and edit /etc/acpi/battery.sh to set BATT_INFO to the correct path." - fi - fi - fi - - -Monitoring tool ---------------- - -Bartek Kania submitted this, it can be used to measure how much time your disk -spends spun up/down. See tools/laptop/dslm/dslm.c diff --git a/Documentation/admin-guide/laptops/thinkpad-acpi.rst b/Documentation/admin-guide/laptops/thinkpad-acpi.rst index 4ab0fef7d440..03951ed6b628 100644 --- a/Documentation/admin-guide/laptops/thinkpad-acpi.rst +++ b/Documentation/admin-guide/laptops/thinkpad-acpi.rst @@ -54,6 +54,7 @@ detailed description): - Setting keyboard language - WWAN Antenna type - Auxmac + - Hardware damage detection capability A compatibility table by model and feature is maintained on the web site, http://ibm-acpi.sf.net/. I appreciate any success or failure @@ -1576,6 +1577,42 @@ percentage level, above which charging will stop. The exact semantics of the attributes may be found in Documentation/ABI/testing/sysfs-class-power. +Hardware damage detection capability +------------------------------------ + +sysfs attributes: hwdd_status, hwdd_detail + +Thinkpads are adding the ability to detect and report hardware damage. +Add new sysfs interface to identify the damaged device status. +Initial support is available for the USB-C replaceable connector. + +The command to check device damaged status is:: + + cat /sys/devices/platform/thinkpad_acpi/hwdd_status + +This value displays status of device damaged. + +- 0 = Not Damaged +- 1 = Damaged + +The command to check location of damaged device is:: + + cat /sys/devices/platform/thinkpad_acpi/hwdd_detail + +This value displays location of damaged device having 1 line per damaged "item". +For example: + +if no damage is detected: + +- No damage detected + +if damage detected: + +- TYPE-C: Base, Right side, Center port + +The property is read-only. If feature is not supported then sysfs +attribute is not created. + Multiple Commands, Module Parameters ------------------------------------ diff --git a/Documentation/admin-guide/laptops/toshiba_haps.rst b/Documentation/admin-guide/laptops/toshiba_haps.rst index d28b6c3f2849..0226225b82e1 100644 --- a/Documentation/admin-guide/laptops/toshiba_haps.rst +++ b/Documentation/admin-guide/laptops/toshiba_haps.rst @@ -43,7 +43,7 @@ RSSS Shuts down the HDD protection interface for a few seconds, ==== ===================================================================== Note: - The presence of Solid State Drives (SSD) can make this driver to fail loading, + The presence of Solid State Drives (SSD) can cause this driver to fail loading, given the fact that such drives have no movable parts, and thus, not requiring any "protection" as well as failing during the evaluation of the _STA method found under this device. diff --git a/Documentation/admin-guide/laptops/uniwill-laptop.rst b/Documentation/admin-guide/laptops/uniwill-laptop.rst new file mode 100644 index 000000000000..aff5f57a6bd4 --- /dev/null +++ b/Documentation/admin-guide/laptops/uniwill-laptop.rst @@ -0,0 +1,60 @@ +.. SPDX-License-Identifier: GPL-2.0+ + +Uniwill laptop extra features +============================= + +On laptops manufactured by Uniwill (either directly or as ODM), the ``uniwill-laptop`` driver +handles various platform-specific features. + +Module Loading +-------------- + +The ``uniwill-laptop`` driver relies on a DMI table to automatically load on supported devices. +When using the ``force`` module parameter, this DMI check will be omitted, allowing the driver +to be loaded on unsupported devices for testing purposes. + +Hotkeys +------- + +Usually the FN keys work without a special driver. However as soon as the ``uniwill-laptop`` driver +is loaded, the FN keys need to be handled manually. This is done automatically by the driver itself. + +Keyboard settings +----------------- + +The ``uniwill-laptop`` driver allows the user to enable/disable: + + - the FN lock and super key of the integrated keyboard + - the touchpad toggle functionality of the integrated touchpad + +See Documentation/ABI/testing/sysfs-driver-uniwill-laptop for details. + +Hwmon interface +--------------- + +The ``uniwill-laptop`` driver supports reading of the CPU and GPU temperature and supports up to +two fans. Userspace applications can access sensor readings over the hwmon sysfs interface. + +Platform profile +---------------- + +Support for changing the platform performance mode is currently not implemented. + +Battery Charging Control +------------------------ + +The ``uniwill-laptop`` driver supports controlling the battery charge limit. This happens over +the standard ``charge_control_end_threshold`` power supply sysfs attribute. All values +between 1 and 100 percent are supported. + +Additionally the driver signals the presence of battery charging issues through the standard +``health`` power supply sysfs attribute. + +Lightbar +-------- + +The ``uniwill-laptop`` driver exposes the lightbar found on some models as a standard multicolor +LED class device. The default name of this LED class device is ``uniwill:multicolor:status``. + +See Documentation/ABI/testing/sysfs-driver-uniwill-laptop for details on how to control the various +animation modes of the lightbar. diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst index deed823eab01..dc7eab191caa 100644 --- a/Documentation/admin-guide/md.rst +++ b/Documentation/admin-guide/md.rst @@ -238,6 +238,16 @@ All md devices contain: the number of devices in a raid4/5/6, or to support external metadata formats which mandate such clipping. + logical_block_size + Configure the array's logical block size in bytes. This attribute + is only supported for 1.x meta. Write the value before starting + array. The final array LBS uses the maximum between this + configuration and LBS of all combined devices. Note that + LBS cannot exceed PAGE_SIZE before RAID supports folio. + WARNING: Arrays created on new kernel cannot be assembled at old + kernel due to padding check, Set module parameter 'check_new_feature' + to false to bypass, but data loss may occur. + reshape_position This is either ``none`` or a sector number within the devices of the array where ``reshape`` is up to. If this is set, the three diff --git a/Documentation/admin-guide/media/mali-c55-graph.dot b/Documentation/admin-guide/media/mali-c55-graph.dot new file mode 100644 index 000000000000..0775ba42bf4c --- /dev/null +++ b/Documentation/admin-guide/media/mali-c55-graph.dot @@ -0,0 +1,19 @@ +digraph board { + rankdir=TB + n00000001 [label="{{} | mali-c55 tpg\n/dev/v4l-subdev0 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green] + n00000001:port0 -> n00000003:port0 [style=dashed] + n00000003 [label="{{<port0> 0} | mali-c55 isp\n/dev/v4l-subdev1 | {<port1> 1 | <port2> 2}}", shape=Mrecord, style=filled, fillcolor=green] + n00000003:port1 -> n00000007:port0 [style=bold] + n00000003:port2 -> n00000007:port2 [style=bold] + n00000003:port1 -> n0000000b:port0 [style=bold] + n00000007 [label="{{<port0> 0 | <port2> 2} | mali-c55 resizer fr\n/dev/v4l-subdev2 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n00000007:port1 -> n0000000e [style=bold] + n0000000b [label="{{<port0> 0} | mali-c55 resizer ds\n/dev/v4l-subdev3 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n0000000b:port1 -> n00000012 [style=bold] + n0000000e [label="mali-c55 fr\n/dev/video0", shape=box, style=filled, fillcolor=yellow] + n00000012 [label="mali-c55 ds\n/dev/video1", shape=box, style=filled, fillcolor=yellow] + n00000022 [label="{{<port0> 0} | csi2-rx\n/dev/v4l-subdev4 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n00000022:port1 -> n00000003:port0 + n00000027 [label="{{} | imx415 1-001a\n/dev/v4l-subdev5 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green] + n00000027:port0 -> n00000022:port0 [style=bold] +}
\ No newline at end of file diff --git a/Documentation/admin-guide/media/mali-c55.rst b/Documentation/admin-guide/media/mali-c55.rst new file mode 100644 index 000000000000..315f982000c4 --- /dev/null +++ b/Documentation/admin-guide/media/mali-c55.rst @@ -0,0 +1,413 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================== +ARM Mali-C55 Image Signal Processor driver +========================================== + +Introduction +============ + +This file documents the driver for ARM's Mali-C55 Image Signal Processor. The +driver is located under drivers/media/platform/arm/mali-c55. + +The Mali-C55 ISP receives data in either raw Bayer format or RGB/YUV format from +sensors through either a parallel interface or a memory bus before processing it +and outputting it through an internal DMA engine. Two output pipelines are +possible (though one may not be fitted, depending on the implementation). These +are referred to as "Full resolution" and "Downscale", but the naming is historic +and both pipes are capable of cropping/scaling operations. The full resolution +pipe is also capable of outputting RAW data, bypassing much of the ISP's +processing. The downscale pipe cannot output RAW data. An integrated test +pattern generator can be used to drive the ISP and produce image data in the +absence of a connected camera sensor. The driver module is named mali_c55, and +is enabled through the CONFIG_VIDEO_MALI_C55 config option. + +The driver implements V4L2, Media Controller and V4L2 Subdevice interfaces and +expects camera sensors connected to the ISP to have V4L2 subdevice interfaces. + +Mali-C55 ISP hardware +===================== + +A high level functional view of the Mali-C55 ISP is presented below. The ISP +takes input from either a live source or through a DMA engine for memory input, +depending on the SoC integration.:: + + +---------+ +----------+ +--------+ + | Sensor |--->| CSI-2 Rx | "Full Resolution" | DMA | + +---------+ +----------+ |\ Output +--->| Writer | + | | \ | +--------+ + | | \ +----------+ +------+---> Streaming I/O + +------------+ +------->| | | | | + | | | |-->| Mali-C55 |--+ + | DMA Reader |--------------->| | | ISP | | + | | | / | | | +---> Streaming I/O + +------------+ | / +----------+ | | + |/ +------+ + | +--------+ + +--->| DMA | + "Downscaled" | Writer | + Output +--------+ + +Media Controller Topology +========================= + +An example of the ISP's topology (as implemented in a system with an IMX415 +camera sensor and generic CSI-2 receiver) is below: + + +.. kernel-figure:: mali-c55-graph.dot + :alt: mali-c55-graph.dot + :align: center + +The driver has 4 V4L2 subdevices: + +- `mali_c55 isp`: Responsible for configuring input crop and color space + conversion +- `mali_c55 tpg`: The test pattern generator, emulating a camera sensor. +- `mali_c55 resizer fr`: The Full-Resolution pipe resizer +- `mali_c55 resizer ds`: The Downscale pipe resizer + +The driver has 3 V4L2 video devices: + +- `mali-c55 fr`: The full-resolution pipe's capture device +- `mali-c55 ds`: The downscale pipe's capture device +- `mali-c55 3a stats`: The 3A statistics capture device + +Frame sequences are synchronised across to two capture devices, meaning if one +pipe is started later than the other the sequence numbers returned in its +buffers will match those of the other pipe rather than starting from zero. + +Idiosyncrasies +-------------- + +**mali-c55 isp** +The `mali-c55 isp` subdevice has a single sink pad to which all sources of data +should be connected. The active source is selected by enabling the appropriate +media link and disabling all others. The ISP has two source pads, reflecting the +different paths through which it can internally route data. Tap points within +the ISP allow users to divert data to avoid processing by some or all of the +hardware's processing steps. The diagram below is intended only to highlight how +the bypassing works and is not a true reflection of those processing steps; for +a high-level functional block diagram see ARM's developer page for the +ISP [3]_:: + + +--------------------------------------------------------------+ + | Possible Internal ISP Data Routes | + | +------------+ +----------+ +------------+ | + +---+ | | | | | Colour | +---+ + | 0 |--+-->| Processing |->| Demosaic |->| Space |--->| 1 | + +---+ | | | | | | Conversion | +---+ + | | +------------+ +----------+ +------------+ | + | | +---+ + | +---------------------------------------------------| 2 | + | +---+ + | | + +--------------------------------------------------------------+ + + +.. flat-table:: + :header-rows: 1 + + * - Pad + - Direction + - Purpose + + * - 0 + - sink + - Data input, connected to the TPG and camera sensors + + * - 1 + - source + - RGB/YUV data, connected to the FR and DS V4L2 subdevices + + * - 2 + - source + - RAW bayer data, connected to the FR V4L2 subdevices + +The ISP is limited to both input and output resolutions between 640x480 and +8192x8192, and this is reflected in the ISP and resizer subdevice's .set_fmt() +operations. + +**mali-c55 resizer fr** +The `mali-c55 resizer fr` subdevice has two _sink_ pads to reflect the different +insertion points in the hardware (either RAW or demosaiced data): + +.. flat-table:: + :header-rows: 1 + + * - Pad + - Direction + - Purpose + + * - 0 + - sink + - Data input connected to the ISP's demosaiced stream. + + * - 1 + - source + - Data output connected to the capture video device + + * - 2 + - sink + - Data input connected to the ISP's raw data stream + +The data source in use is selected through the routing API; two routes each of a +single stream are available: + +.. flat-table:: + :header-rows: 1 + + * - Sink Pad + - Source Pad + - Purpose + + * - 0 + - 1 + - Demosaiced data route + + * - 2 + - 1 + - Raw data route + + +If the demosaiced route is active then the FR pipe is only capable of output +in RGB/YUV formats. If the raw route is active then the output reflects the +input (which may be either Bayer or RGB/YUV data). + +Using the driver to capture video +================================= + +Using the media controller APIs we can configure the input source and ISP to +capture images in a variety of formats. In the examples below, configuring the +media graph is done with the v4l-utils [1]_ package's media-ctl utility. +Capturing the images is done with yavta [2]_. + +Configuring the input source +---------------------------- + +The first step is to set the input source that we wish by enabling the correct +media link. Using the example topology above, we can select the TPG as follows: + +.. code-block:: none + + media-ctl -l "'lte-csi2-rx':1->'mali-c55 isp':0[0]" + media-ctl -l "'mali-c55 tpg':0->'mali-c55 isp':0[1]" + +Configuring which video devices will stream data +------------------------------------------------ + +The driver will wait for all video devices to have their VIDIOC_STREAMON ioctl +called before it tells the sensor to start streaming. To facilitate this we need +to enable links to the video devices that we want to use. In the example below +we enable the links to both of the image capture video devices + +.. code-block:: none + + media-ctl -l "'mali-c55 resizer fr':1->'mali-c55 fr':0[1]" + media-ctl -l "'mali-c55 resizer ds':1->'mali-c55 ds':0[1]" + +Capturing bayer data from the source and processing to RGB/YUV +-------------------------------------------------------------- + +To capture 1920x1080 bayer data from the source and push it through the ISP's +full processing pipeline, we configure the data formats appropriately on the +source, ISP and resizer subdevices and set the FR resizer's routing to select +processed data. The media bus format on the resizer's source pad will be either +RGB121212_1X36 or YUV10_1X30, depending on whether you want to capture RGB or +YUV. The ISP's debayering block outputs RGB data natively, setting the source +pad format to YUV10_1X30 enables the colour space conversion block. + +In this example we target RGB565 output, so select RGB121212_1X36 as the resizer +source pad's format: + +.. code-block:: none + + # Set formats on the TPG and ISP + media-ctl -V "'mali-c55 tpg':0[fmt:SRGGB20_1X20/1920x1080]" + media-ctl -V "'mali-c55 isp':0[fmt:SRGGB20_1X20/1920x1080]" + media-ctl -V "'mali-c55 isp':1[fmt:SRGGB20_1X20/1920x1080]" + + # Set routing on the FR resizer + media-ctl -R "'mali-c55 resizer fr'[0/0->1/0[1],2/0->1/0[0]]" + + # Set format on the resizer, must be done AFTER the routing. + media-ctl -V "'mali-c55 resizer fr':1[fmt:RGB121212_1X36/1920x1080]" + +The downscale output can also be used to stream data at the same time. In this +case since only processed data can be captured through the downscale output no +routing need be set: + +.. code-block:: none + + # Set format on the resizer + media-ctl -V "'mali-c55 resizer ds':1[fmt:RGB121212_1X36/1920x1080]" + +Following which images can be captured from both the FR and DS output's video +devices (simultaneously, if desired): + +.. code-block:: none + + yavta -f RGB565 -s 1920x1080 -c10 /dev/video0 + yavta -f RGB565 -s 1920x1080 -c10 /dev/video1 + +Cropping the image +~~~~~~~~~~~~~~~~~~ + +Both the full resolution and downscale pipes can crop to a minimum resolution of +640x480. To crop the image simply configure the resizer's sink pad's crop and +compose rectangles and set the format on the video device: + +.. code-block:: none + + media-ctl -V "'mali-c55 resizer fr':0[fmt:RGB121212_1X36/1920x1080 crop:(480,270)/640x480 compose:(0,0)/640x480]" + media-ctl -V "'mali-c55 resizer fr':1[fmt:RGB121212_1X36/640x480]" + yavta -f RGB565 -s 640x480 -c10 /dev/video0 + +Downscaling the image +~~~~~~~~~~~~~~~~~~~~~ + +Both the full resolution and downscale pipes can downscale the image by up to 8x +provided the minimum 640x480 output resolution is adhered to. For the best image +result the scaling ratio for each direction should be the same. To configure +scaling we use the compose rectangle on the resizer's sink pad: + +.. code-block:: none + + media-ctl -V "'mali-c55 resizer fr':0[fmt:RGB121212_1X36/1920x1080 crop:(0,0)/1920x1080 compose:(0,0)/640x480]" + media-ctl -V "'mali-c55 resizer fr':1[fmt:RGB121212_1X36/640x480]" + yavta -f RGB565 -s 640x480 -c10 /dev/video0 + +Capturing images in YUV formats +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If we need to output YUV data rather than RGB the color space conversion block +needs to be active, which is achieved by setting MEDIA_BUS_FMT_YUV10_1X30 on the +resizer's source pad. We can then configure a capture format like NV12 (here in +its multi-planar variant) + +.. code-block:: none + + media-ctl -V "'mali-c55 resizer fr':1[fmt:YUV10_1X30/1920x1080]" + yavta -f NV12M -s 1920x1080 -c10 /dev/video0 + +Capturing RGB data from the source and processing it with the resizers +---------------------------------------------------------------------- + +The Mali-C55 ISP can work with sensors capable of outputting RGB data. In this +case although none of the image quality blocks would be used it can still +crop/scale the data in the usual way. For this reason RGB data input to the ISP +still goes through the ISP subdevice's pad 1 to the resizer. + +To achieve this, the ISP's sink pad's format is set to +MEDIA_BUS_FMT_RGB202020_1X60 - this reflects the format that data must be in to +work with the ISP. Converting the camera sensor's output to that format is the +responsibility of external hardware. + +In this example we ask the test pattern generator to give us RGB data instead of +bayer. + +.. code-block:: none + + media-ctl -V "'mali-c55 tpg':0[fmt:RGB202020_1X60/1920x1080]" + media-ctl -V "'mali-c55 isp':0[fmt:RGB202020_1X60/1920x1080]" + +Cropping or scaling the data can be done in exactly the same way as outlined +earlier. + +Capturing raw data from the source and outputting it unmodified +----------------------------------------------------------------- + +The ISP can additionally capture raw data from the source and output it on the +full resolution pipe only, completely unmodified. In this case the downscale +pipe can still process the data normally and be used at the same time. + +To configure raw bypass the FR resizer's subdevice's routing table needs to be +configured, followed by formats in the appropriate places: + +.. code-block:: none + + media-ctl -R "'mali-c55 resizer fr'[0/0->1/0[0],2/0->1/0[1]]" + media-ctl -V "'mali-c55 isp':0[fmt:RGB202020_1X60/1920x1080]" + media-ctl -V "'mali-c55 resizer fr':2[fmt:RGB202020_1X60/1920x1080]" + media-ctl -V "'mali-c55 resizer fr':1[fmt:RGB202020_1X60/1920x1080]" + + # Set format on the video device and stream + yavta -f RGB565 -s 1920x1080 -c10 /dev/video0 + +.. _mali-c55-3a-stats: + +Capturing ISP Statistics +======================== + +The ISP is capable of producing statistics for consumption by image processing +algorithms running in userspace. These statistics can be captured by queueing +buffers to the `mali-c55 3a stats` V4L2 Device whilst the ISP is streaming. Only +the :ref:`V4L2_META_FMT_MALI_C55_STATS <v4l2-meta-fmt-mali-c55-stats>` +format is supported, so no format-setting need be done: + +.. code-block:: none + + # We assume the media graph has been configured to support RGB565 capture + # from the mali-c55 fr V4L2 Device, which is at /dev/video0. The statistics + # V4L2 device is at /dev/video3 + + yavta -f RGB565 -s 1920x1080 -c32 /dev/video0 && \ + yavta -c10 -F /dev/video3 + +The layout of the buffer is described by :c:type:`mali_c55_stats_buffer`, +but broadly statistics are generated to support three image processing +algorithms; AEXP (Auto-Exposure), AWB (Auto-White Balance) and AF (Auto-Focus). +These stats can be drawn from various places in the Mali C55 ISP pipeline, known +as "tap points". This high-level block diagram is intended to explain where in +the processing flow the statistics can be drawn from:: + + +--> AEXP-2 +----> AEXP-1 +--> AF-0 + | +----> AF-1 | + | | | + +---------+ | +--------------+ | +--------------+ | + | Input +-+-->+ Digital Gain +---+-->+ Black Level +---+---+ + +---------+ +--------------+ +--------------+ | + +-----------------------------------------------------------------+ + | + | +--------------+ +---------+ +----------------+ + +-->| Sinter Noise +-+ White +--+--->| Lens Shading +--+---------------+ + | Reduction | | Balance | | | | | | + +--------------+ +---------+ | +----------------+ | | + +---> AEXP-0 (A) +--> AEXP-0 (B) | + +--------------------------------------------------------------------------+ + | + | +----------------+ +--------------+ +----------------+ + +-->| Tone mapping +-+--->| Demosaicing +->+ Purple Fringe +-+-----------+ + | | | +--------------+ | Correction | | | + +----------------+ +-> AEXP-IRIDIX +----------------+ +---> AWB-0 | + +----------------------------------------------------------------------------+ + | +-------------+ +-------------+ + +------------------->| Colour +---+--->| Output | + | Correction | | | Pipelines | + +-------------+ | +-------------+ + +--> AWB-1 + +By default all statistics are drawn from the 0th tap point for each algorithm; +I.E. AEXP statistics from AEXP-0 (A), AWB statistics from AWB-0 and AF +statistics from AF-0. This is configurable for AEXP and AWB statsistics through +programming the ISP's parameters. + +.. _mali-c55-3a-params: + +Programming ISP Parameters +========================== + +The ISP can be programmed with various parameters from userspace to apply to the +hardware before and during video stream. This allows userspace to dynamically +change values such as black level, white balance and lens shading gains and so +on. + +The buffer format and how to populate it are described by the +:ref:`V4L2_META_FMT_MALI_C55_PARAMS <v4l2-meta-fmt-mali-c55-params>` format, +which should be set as the data format for the `mali-c55 3a params` video node. + +References +========== +.. [1] https://git.linuxtv.org/v4l-utils.git/ +.. [2] https://git.ideasonboard.org/yavta.git +.. [3] https://developer.arm.com/Processors/Mali-C55 diff --git a/Documentation/admin-guide/media/mgb4.rst b/Documentation/admin-guide/media/mgb4.rst index 5ac69b833a7a..0a8a56e837f7 100644 --- a/Documentation/admin-guide/media/mgb4.rst +++ b/Documentation/admin-guide/media/mgb4.rst @@ -31,9 +31,11 @@ Global (PCI card) parameters | 0 - No module present | 1 - FPDL3 - | 2 - GMSL (one serializer, two daisy chained deserializers) - | 3 - GMSL (one serializer, two deserializers) - | 4 - GMSL (two deserializers with two daisy chain outputs) + | 2 - GMSL3 (one serializer, two daisy chained deserializers) + | 3 - GMSL3 (one serializer, two deserializers) + | 4 - GMSL3 (two deserializers with two daisy chain outputs) + | 6 - GMSL1 + | 8 - GMSL3 coax **module_version** (R): Module version number. Zero in case of a missing module. @@ -42,7 +44,8 @@ Global (PCI card) parameters Firmware type. | 1 - FPDL3 - | 2 - GMSL + | 2 - GMSL3 + | 3 - GMSL1 **fw_version** (R): Firmware version number. diff --git a/Documentation/admin-guide/media/platform-cardlist.rst b/Documentation/admin-guide/media/platform-cardlist.rst index 1230ae4037ad..63f4b19c3628 100644 --- a/Documentation/admin-guide/media/platform-cardlist.rst +++ b/Documentation/admin-guide/media/platform-cardlist.rst @@ -18,8 +18,6 @@ am437x-vpfe TI AM437x VPFE aspeed-video Aspeed AST2400 and AST2500 atmel-isc ATMEL Image Sensor Controller (ISC) atmel-isi ATMEL Image Sensor Interface (ISI) -c8sectpfe SDR platform devices -c8sectpfe SDR platform devices cafe_ccic Marvell 88ALP01 (Cafe) CMOS Camera Controller cdns-csi2rx Cadence MIPI-CSI2 RX Controller cdns-csi2tx Cadence MIPI-CSI2 TX Controller diff --git a/Documentation/admin-guide/media/radio-cardlist.rst b/Documentation/admin-guide/media/radio-cardlist.rst index a82a146bf912..cec724256812 100644 --- a/Documentation/admin-guide/media/radio-cardlist.rst +++ b/Documentation/admin-guide/media/radio-cardlist.rst @@ -30,7 +30,6 @@ radio-terratec TerraTec ActiveRadio ISA Standalone radio-timb Enable the Timberdale radio driver radio-trust Trust FM radio card radio-typhoon Typhoon Radio (a.k.a. EcoRadio) -radio-wl1273 Texas Instruments WL1273 I2C FM Radio fm_drv ISA radio devices fm_drv ISA radio devices radio-zoltrix Zoltrix Radio diff --git a/Documentation/admin-guide/media/rkcif-rk3568-vicap.dot b/Documentation/admin-guide/media/rkcif-rk3568-vicap.dot new file mode 100644 index 000000000000..3fac59335459 --- /dev/null +++ b/Documentation/admin-guide/media/rkcif-rk3568-vicap.dot @@ -0,0 +1,8 @@ +digraph board { + rankdir=TB + n00000001 [label="{{<port0> 0} | rkcif-dvp0\n/dev/v4l-subdev0 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n00000001:port1 -> n00000004 + n00000004 [label="rkcif-dvp0-id0\n/dev/video0", shape=box, style=filled, fillcolor=yellow] + n00000025 [label="{{} | it6801 2-0048\n/dev/v4l-subdev1 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green] + n00000025:port0 -> n00000001:port0 +} diff --git a/Documentation/admin-guide/media/rkcif.rst b/Documentation/admin-guide/media/rkcif.rst new file mode 100644 index 000000000000..2558c121abc4 --- /dev/null +++ b/Documentation/admin-guide/media/rkcif.rst @@ -0,0 +1,79 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================= +Rockchip Camera Interface (CIF) +========================================= + +Introduction +============ + +The Rockchip Camera Interface (CIF) is featured in many Rockchip SoCs in +different variants. +The different variants are combinations of common building blocks, such as + +* INTERFACE blocks of different types, namely + + * the Digital Video Port (DVP, a parallel data interface) + * the interface block for the MIPI CSI-2 receiver + +* CROP units + +* MIPI CSI-2 receiver (not available on all variants): This unit is referred + to as MIPI CSI HOST in the Rockchip documentation. + Technically, it is a separate hardware block, but it is strongly coupled to + the CIF and therefore included here. + +* MUX units (not available on all variants) that pass the video data to an + image signal processor (ISP) + +* SCALE units (not available on all variants) + +* DMA engines that transfer video data into system memory using a + double-buffering mechanism called ping-pong mode + +* Support for four streams per INTERFACE block (not available on all + variants), e.g., for MIPI CSI-2 Virtual Channels (VCs) + +This document describes the different variants of the CIF, their hardware +layout, as well as their representation in the media controller centric rkcif +device driver, which is located under drivers/media/platform/rockchip/rkcif. + +Variants +======== + +Rockchip PX30 Video Input Processor (VIP) +----------------------------------------- + +The PX30 Video Input Processor (VIP) features a digital video port that accepts +parallel video data or BT.656. +Since these protocols do not feature multiple streams, the VIP has one DMA +engine that transfers the input video data into system memory. + +The rkcif driver represents this hardware variant by exposing one V4L2 subdevice +(the DVP INTERFACE/CROP block) and one V4L2 device (the DVP DMA engine). + +Rockchip RK3568 Video Capture (VICAP) +------------------------------------- + +The RK3568 Video Capture (VICAP) unit features a digital video port and a MIPI +CSI-2 receiver that can receive video data independently. +The DVP accepts parallel video data, BT.656 and BT.1120. +Since the BT.1120 protocol may feature more than one stream, the RK3568 VICAP +DVP features four DMA engines that can capture different streams. +Similarly, the RK3568 VICAP MIPI CSI-2 receiver features four DMA engines to +handle different Virtual Channels (VCs). + +The rkcif driver represents this hardware variant by exposing up the following +V4L2 subdevices: + +* rkcif-dvp0: INTERFACE/CROP block for the DVP + +and the following video devices: + +* rkcif-dvp0-id0: The support for multiple streams on the DVP is not yet + implemented, as it is hard to find test hardware. Thus, this video device + represents the first DMA engine of the RK3568 DVP. + +.. kernel-figure:: rkcif-rk3568-vicap.dot + :alt: Topology of the RK3568 Video Capture (VICAP) unit + :align: center diff --git a/Documentation/admin-guide/media/v4l-drivers.rst b/Documentation/admin-guide/media/v4l-drivers.rst index 3bac5165b134..393f83e8dc4d 100644 --- a/Documentation/admin-guide/media/v4l-drivers.rst +++ b/Documentation/admin-guide/media/v4l-drivers.rst @@ -19,12 +19,14 @@ Video4Linux (V4L) driver-specific documentation ipu3 ipu6-isys ivtv + mali-c55 mgb4 omap3isp philips qcom_camss raspberrypi-pisp-be rcar-fdp1 + rkcif rkisp1 raspberrypi-rp1-cfe saa7134 diff --git a/Documentation/admin-guide/mm/damon/lru_sort.rst b/Documentation/admin-guide/mm/damon/lru_sort.rst index 7b0775d281b4..20a8378d5a94 100644 --- a/Documentation/admin-guide/mm/damon/lru_sort.rst +++ b/Documentation/admin-guide/mm/damon/lru_sort.rst @@ -79,6 +79,43 @@ of parametrs except ``enabled`` again. Once the re-reading is done, this parameter is set as ``N``. If invalid parameters are found while the re-reading, DAMON_LRU_SORT will be disabled. +active_mem_bp +------------- + +Desired active to [in]active memory ratio in bp (1/10,000). + +While keeping the caps that set by other quotas, DAMON_LRU_SORT automatically +increases and decreases the effective level of the quota aiming the LRU +[de]prioritizations of the hot and cold memory resulting in this active to +[in]active memory ratio. Value zero means disabling this auto-tuning feature. + +Disabled by default. + +Auto-tune monitoring intervals +------------------------------ + +If this parameter is set as ``Y``, DAMON_LRU_SORT automatically tunes DAMON's +sampling and aggregation intervals. The auto-tuning aims to capture meaningful +amount of access events in each DAMON-snapshot, while keeping the sampling +interval 5 milliseconds in minimum, and 10 seconds in maximum. Setting this as +``N`` disables the auto-tuning. + +Disabled by default. + +filter_young_pages +------------------ + +Filter [non-]young pages accordingly for LRU [de]prioritizations. + +If this is set, check page level access (youngness) once again before each +LRU [de]prioritization operation. LRU prioritization operation is skipped +if the page has not accessed since the last check (not young). LRU +deprioritization operation is skipped if the page has accessed since the +last check (young). The feature is enabled or disabled if this parameter is +set as ``Y`` or ``N``, respectively. + +Disabled by default. + hot_thres_access_freq --------------------- @@ -211,6 +248,28 @@ End of target memory region in physical address. The end physical address of memory region that DAMON_LRU_SORT will do work against. By default, biggest System RAM is used as the region. +addr_unit +--------- + +A scale factor for memory addresses and bytes. + +This parameter is for setting and getting the :ref:`address unit +<damon_design_addr_unit>` parameter of the DAMON instance for DAMON_RECLAIM. + +``monitor_region_start`` and ``monitor_region_end`` should be provided in this +unit. For example, let's suppose ``addr_unit``, ``monitor_region_start`` and +``monitor_region_end`` are set as ``1024``, ``0`` and ``10``, respectively. +Then DAMON_LRU_SORT will work for 10 KiB length of physical address range that +starts from address zero (``[0 * 1024, 10 * 1024)`` in bytes). + +Stat parameters having ``bytes_`` prefix are also in this unit. For example, +let's suppose values of ``addr_unit``, ``bytes_lru_sort_tried_hot_regions`` and +``bytes_lru_sorted_hot_regions`` are ``1024``, ``42``, and ``32``, +respectively. Then it means DAMON_LRU_SORT tried to LRU-sort 42 KiB of hot +memory and successfully LRU-sorted 32 KiB of the memory in total. + +If unsure, use only the default value (``1``) and forget about this. + kdamond_pid ----------- diff --git a/Documentation/admin-guide/mm/damon/reclaim.rst b/Documentation/admin-guide/mm/damon/reclaim.rst index af05ae617018..8eba3da8dcee 100644 --- a/Documentation/admin-guide/mm/damon/reclaim.rst +++ b/Documentation/admin-guide/mm/damon/reclaim.rst @@ -232,6 +232,28 @@ The end physical address of memory region that DAMON_RECLAIM will do work against. That is, DAMON_RECLAIM will find cold memory regions in this region and reclaims. By default, biggest System RAM is used as the region. +addr_unit +--------- + +A scale factor for memory addresses and bytes. + +This parameter is for setting and getting the :ref:`address unit +<damon_design_addr_unit>` parameter of the DAMON instance for DAMON_RECLAIM. + +``monitor_region_start`` and ``monitor_region_end`` should be provided in this +unit. For example, let's suppose ``addr_unit``, ``monitor_region_start`` and +``monitor_region_end`` are set as ``1024``, ``0`` and ``10``, respectively. +Then DAMON_RECLAIM will work for 10 KiB length of physical address range that +starts from address zero (``[0 * 1024, 10 * 1024)`` in bytes). + +``bytes_reclaim_tried_regions`` and ``bytes_reclaimed_regions`` are also in +this unit. For example, let's suppose values of ``addr_unit``, +``bytes_reclaim_tried_regions`` and ``bytes_reclaimed_regions`` are ``1024``, +``42``, and ``32``, respectively. Then it means DAMON_RECLAIM tried to reclaim +42 KiB memory and successfully reclaimed 32 KiB memory in total. + +If unsure, use only the default value (``1``) and forget about this. + skip_anon --------- diff --git a/Documentation/admin-guide/mm/damon/stat.rst b/Documentation/admin-guide/mm/damon/stat.rst index 4c517c2c219a..e5a5a2c4f803 100644 --- a/Documentation/admin-guide/mm/damon/stat.rst +++ b/Documentation/admin-guide/mm/damon/stat.rst @@ -10,6 +10,8 @@ on the system's entire physical memory using DAMON, and provides simplified access monitoring results statistics, namely idle time percentiles and estimated memory bandwidth. +.. _damon_stat_monitoring_accuracy_overhead: + Monitoring Accuracy and Overhead ================================ @@ -17,9 +19,11 @@ DAMON_STAT uses monitoring intervals :ref:`auto-tuning <damon_design_monitoring_intervals_autotuning>` to make its accuracy high and overhead minimum. It auto-tunes the intervals aiming 4 % of observable access events to be captured in each snapshot, while limiting the resulting sampling -events to be 5 milliseconds in minimum and 10 seconds in maximum. On a few +interval to be 5 milliseconds in minimum and 10 seconds in maximum. On a few production server systems, it resulted in consuming only 0.x % single CPU time, -while capturing reasonable quality of access patterns. +while capturing reasonable quality of access patterns. The tuning-resulting +intervals can be retrieved via ``aggr_interval_us`` :ref:`parameter +<damon_stat_aggr_interval_us>`. Interface: Module Parameters ============================ @@ -41,6 +45,18 @@ You can enable DAMON_STAT by setting the value of this parameter as ``Y``. Setting it as ``N`` disables DAMON_STAT. The default value is set by ``CONFIG_DAMON_STAT_ENABLED_DEFAULT`` build config option. +.. _damon_stat_aggr_interval_us: + +aggr_interval_us +---------------- + +Auto-tuned aggregation time interval in microseconds. + +Users can read the aggregation interval of DAMON that is being used by the +DAMON instance for DAMON_STAT. It is :ref:`auto-tuned +<damon_stat_monitoring_accuracy_overhead>` and therefore the value is +dynamically changed. + estimated_memory_bandwidth -------------------------- @@ -58,12 +74,13 @@ memory_idle_ms_percentiles Per-byte idle time (milliseconds) percentiles of the system. DAMON_STAT calculates how long each byte of the memory was not accessed until -now (idle time), based on the current DAMON results snapshot. If DAMON found a -region of access frequency (nr_accesses) larger than zero, every byte of the -region gets zero idle time. If a region has zero access frequency -(nr_accesses), how long the region was keeping the zero access frequency (age) -becomes the idle time of every byte of the region. Then, DAMON_STAT exposes -the percentiles of the idle time values via this read-only parameter. Reading -the parameter returns 101 idle time values in milliseconds, separated by comma. +now (idle time), based on the current DAMON results snapshot. For regions +having access frequency (nr_accesses) larger than zero, how long the current +access frequency level was kept multiplied by ``-1`` becomes the idlee time of +every byte of the region. If a region has zero access frequency (nr_accesses), +how long the region was keeping the zero access frequency (age) becomes the +idle time of every byte of the region. Then, DAMON_STAT exposes the +percentiles of the idle time values via this read-only parameter. Reading the +parameter returns 101 idle time values in milliseconds, separated by comma. Each value represents 0-th, 1st, 2nd, 3rd, ..., 99th and 100th percentile idle times. diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index eae534bc1bee..b0f3969b6b3b 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -6,6 +6,11 @@ Detailed Usages DAMON provides below interfaces for different users. +- *Special-purpose DAMON modules.* + :ref:`This <damon_modules_special_purpose>` is for people who are building, + distributing, and/or administrating the kernel with special-purpose DAMON + usages. Using this, users can use DAMON's major features for the given + purposes in build, boot, or runtime in simple ways. - *DAMON user space tool.* `This <https://github.com/damonitor/damo>`_ is for privileged people such as system administrators who want a just-working human-friendly interface. @@ -67,7 +72,7 @@ comma (","). │ │ │ │ │ │ │ intervals_goal/access_bp,aggrs,min_sample_us,max_sample_us │ │ │ │ │ │ nr_regions/min,max │ │ │ │ │ :ref:`targets <sysfs_targets>`/nr_targets - │ │ │ │ │ │ :ref:`0 <sysfs_target>`/pid_target + │ │ │ │ │ │ :ref:`0 <sysfs_target>`/pid_target,obsolete_target │ │ │ │ │ │ │ :ref:`regions <sysfs_regions>`/nr_regions │ │ │ │ │ │ │ │ :ref:`0 <sysfs_region>`/start,end │ │ │ │ │ │ │ │ ... @@ -81,13 +86,13 @@ comma (","). │ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms,effective_bytes │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil │ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals - │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value,nid + │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value,nid,path │ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low │ │ │ │ │ │ │ :ref:`{core_,ops_,}filters <sysfs_filters>`/nr_filters │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx,min,max │ │ │ │ │ │ │ :ref:`dests <damon_sysfs_dests>`/nr_dests │ │ │ │ │ │ │ │ 0/id,weight - │ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,sz_ops_filter_passed,qt_exceeds + │ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,sz_ops_filter_passed,qt_exceeds,nr_snapshots,max_nr_snapshots │ │ │ │ │ │ │ :ref:`tried_regions <sysfs_schemes_tried_regions>`/total_bytes │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age,sz_filter_passed │ │ │ │ │ │ │ │ ... @@ -134,7 +139,8 @@ Users can write below commands for the kdamond to the ``state`` file. - ``on``: Start running. - ``off``: Stop running. - ``commit``: Read the user inputs in the sysfs files except ``state`` file - again. + again. Monitoring :ref:`target region <sysfs_regions>` inputs are also be + ignored if no target region is specified. - ``update_tuned_intervals``: Update the contents of ``sample_us`` and ``aggr_us`` files of the kdamond with the auto-tuning applied ``sampling interval`` and ``aggregation interval`` for the files. Please refer to @@ -264,13 +270,20 @@ to ``N-1``. Each directory represents each monitoring target. targets/<N>/ ------------ -In each target directory, one file (``pid_target``) and one directory -(``regions``) exist. +In each target directory, two files (``pid_target`` and ``obsolete_target``) +and one directory (``regions``) exist. If you wrote ``vaddr`` to the ``contexts/<N>/operations``, each target should be a process. You can specify the process to DAMON by writing the pid of the process to the ``pid_target`` file. +Users can selectively remove targets in the middle of the targets array by +writing non-zero value to ``obsolete_target`` file and committing it (writing +``commit`` to ``state`` file). DAMON will remove the matching targets from its +internal targets array. Users are responsible to construct target directories +again, so that those correctly represent the changed internal targets array. + + .. _sysfs_regions: targets/<N>/regions @@ -289,6 +302,11 @@ In the beginning, this directory has only one file, ``nr_regions``. Writing a number (``N``) to the file creates the number of child directories named ``0`` to ``N-1``. Each directory represents each initial monitoring target region. +If ``nr_regions`` is zero when committing new DAMON parameters online (writing +``commit`` to ``state`` file of :ref:`kdamond <sysfs_kdamond>`), the commit +logic ignores the target regions. In other words, the current monitoring +results for the target are preserved. + .. _sysfs_region: regions/<N>/ @@ -402,9 +420,9 @@ number (``N``) to the file creates the number of child directories named ``0`` to ``N-1``. Each directory represents each goal and current achievement. Among the multiple feedback, the best one is used. -Each goal directory contains four files, namely ``target_metric``, -``target_value``, ``current_value`` and ``nid``. Users can set and get the -four parameters for the quota auto-tuning goals that specified on the +Each goal directory contains five files, namely ``target_metric``, +``target_value``, ``current_value`` ``nid`` and ``path``. Users can set and +get the five parameters for the quota auto-tuning goals that specified on the :ref:`design doc <damon_design_damos_quotas_auto_tuning>` by writing to and reading from each of the files. Note that users should further write ``commit_schemes_quota_goals`` to the ``state`` file of the :ref:`kdamond @@ -530,10 +548,14 @@ online analysis or tuning of the schemes. Refer to :ref:`design doc The statistics can be retrieved by reading the files under ``stats`` directory (``nr_tried``, ``sz_tried``, ``nr_applied``, ``sz_applied``, -``sz_ops_filter_passed``, and ``qt_exceeds``), respectively. The files are not -updated in real time, so you should ask DAMON sysfs interface to update the -content of the files for the stats by writing a special keyword, -``update_schemes_stats`` to the relevant ``kdamonds/<N>/state`` file. +``sz_ops_filter_passed``, ``qt_exceeds``, ``nr_snapshots`` and +``max_nr_snapshots``), respectively. + +The files are not updated in real time by default. Users should ask DAMON +sysfs interface to periodically update those using ``refresh_ms``, or do a one +time update by writing a special keyword, ``update_schemes_stats`` to the +relevant ``kdamonds/<N>/state`` file. Refer to :ref:`kdamond directory +<sysfs_kdamond>` for more details. .. _sysfs_schemes_tried_regions: diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index ebc83ca20fdc..bbb563cba5d2 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -39,7 +39,6 @@ the Linux memory management. shrinker_debugfs slab soft-dirty - swap_numa transhuge userfaultfd zswap diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst index 33c886f3d198..0207f8725142 100644 --- a/Documentation/admin-guide/mm/memory-hotplug.rst +++ b/Documentation/admin-guide/mm/memory-hotplug.rst @@ -603,17 +603,18 @@ ZONE_MOVABLE, especially when fine-tuning zone ratios: memory for metadata and page tables in the direct map; having a lot of offline memory blocks is not a typical case, though. -- Memory ballooning without balloon compaction is incompatible with - ZONE_MOVABLE. Only some implementations, such as virtio-balloon and - pseries CMM, fully support balloon compaction. +- Memory ballooning without support for balloon memory migration is incompatible + with ZONE_MOVABLE. Only some implementations, such as virtio-balloon and + pseries CMM, fully support balloon memory migration. - Further, the CONFIG_BALLOON_COMPACTION kernel configuration option might be + Further, the CONFIG_BALLOON_MIGRATION kernel configuration option might be disabled. In that case, balloon inflation will only perform unmovable allocations and silently create a zone imbalance, usually triggered by inflation requests from the hypervisor. -- Gigantic pages are unmovable, resulting in user space consuming a - lot of unmovable memory. +- Gigantic pages are unmovable when an architecture does not support + huge page migration and/or the ``movable_gigantic_pages`` sysctl is false. + See Documentation/admin-guide/sysctl/vm.rst for more info on this sysctl. - Huge pages are unmovable when an architectures does not support huge page migration, resulting in a similar issue as with gigantic pages. @@ -672,6 +673,15 @@ block might fail: - Concurrent activity that operates on the same physical memory area, such as allocating gigantic pages, can result in temporary offlining failures. +- When an admin sets the ``movable_gigantic_pages`` sysctl to true, gigantic + pages are allowed in ZONE_MOVABLE. This only allows migratable gigantic + pages to be allocated; however, if there are no eligible destination gigantic + pages at offline, the offlining operation will fail. + + Users leveraging ``movable_gigantic_pages`` should weigh the value of + ZONE_MOVABLE for increasing the reliability of gigantic page allocation + against the potential loss of hot-unplug reliability. + - Out of memory when dissolving huge pages, especially when HugeTLB Vmemmap Optimization (HVO) is enabled. diff --git a/Documentation/admin-guide/mm/nommu-mmap.rst b/Documentation/admin-guide/mm/nommu-mmap.rst index 530fed08de2c..8a1949b3690f 100644 --- a/Documentation/admin-guide/mm/nommu-mmap.rst +++ b/Documentation/admin-guide/mm/nommu-mmap.rst @@ -38,7 +38,7 @@ and it's also much more restricted in the latter case: In the no-MMU case: - - If one exists, the kernel will re-use an existing mapping to the + - If one exists, the kernel will reuse an existing mapping to the same segment of the same file if that has compatible permissions, even if this was created by another process. diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst index e60e9211fd9b..c57e61b5d8aa 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -115,7 +115,8 @@ Short descriptions to the page flags A free memory block managed by the buddy system allocator. The buddy system organizes free memory in blocks of various orders. An order N block has 2^N physically contiguous pages, with the BUDDY flag - set for and _only_ for the first page. + set for all pages. + Before 4.6 only the first page of the block had the flag set. 15 - COMPOUND_HEAD A compound page with order N consists of 2^N physically contiguous pages. A compound page with order 2 takes the form of "HTTT", where H donates its diff --git a/Documentation/admin-guide/mm/swap_numa.rst b/Documentation/admin-guide/mm/swap_numa.rst deleted file mode 100644 index 2e630627bcee..000000000000 --- a/Documentation/admin-guide/mm/swap_numa.rst +++ /dev/null @@ -1,78 +0,0 @@ -=========================================== -Automatically bind swap device to numa node -=========================================== - -If the system has more than one swap device and swap device has the node -information, we can make use of this information to decide which swap -device to use in get_swap_pages() to get better performance. - - -How to use this feature -======================= - -Swap device has priority and that decides the order of it to be used. To make -use of automatically binding, there is no need to manipulate priority settings -for swap devices. e.g. on a 2 node machine, assume 2 swap devices swapA and -swapB, with swapA attached to node 0 and swapB attached to node 1, are going -to be swapped on. Simply swapping them on by doing:: - - # swapon /dev/swapA - # swapon /dev/swapB - -Then node 0 will use the two swap devices in the order of swapA then swapB and -node 1 will use the two swap devices in the order of swapB then swapA. Note -that the order of them being swapped on doesn't matter. - -A more complex example on a 4 node machine. Assume 6 swap devices are going to -be swapped on: swapA and swapB are attached to node 0, swapC is attached to -node 1, swapD and swapE are attached to node 2 and swapF is attached to node3. -The way to swap them on is the same as above:: - - # swapon /dev/swapA - # swapon /dev/swapB - # swapon /dev/swapC - # swapon /dev/swapD - # swapon /dev/swapE - # swapon /dev/swapF - -Then node 0 will use them in the order of:: - - swapA/swapB -> swapC -> swapD -> swapE -> swapF - -swapA and swapB will be used in a round robin mode before any other swap device. - -node 1 will use them in the order of:: - - swapC -> swapA -> swapB -> swapD -> swapE -> swapF - -node 2 will use them in the order of:: - - swapD/swapE -> swapA -> swapB -> swapC -> swapF - -Similaly, swapD and swapE will be used in a round robin mode before any -other swap devices. - -node 3 will use them in the order of:: - - swapF -> swapA -> swapB -> swapC -> swapD -> swapE - - -Implementation details -====================== - -The current code uses a priority based list, swap_avail_list, to decide -which swap device to use and if multiple swap devices share the same -priority, they are used round robin. This change here replaces the single -global swap_avail_list with a per-numa-node list, i.e. for each numa node, -it sees its own priority based list of available swap devices. Swap -device's priority can be promoted on its matching node's swap_avail_list. - -The current swap device's priority is set as: user can set a >=0 value, -or the system will pick one starting from -1 then downwards. The priority -value in the swap_avail_list is the negated value of the swap device's -due to plist being sorted from low to high. The new policy doesn't change -the semantics for priority >=0 cases, the previous starting from -1 then -downwards now becomes starting from -2 then downwards and -1 is reserved -as the promoted value. So if multiple swap devices are attached to the same -node, they will all be promoted to priority -1 on that node's plist and will -be used round robin before any other swap devices. diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 1654211cc6cf..5fbc3d89bb07 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -381,6 +381,11 @@ hugepage allocation policy for the tmpfs mount by using the kernel parameter four valid policies for tmpfs (``always``, ``within_size``, ``advise``, ``never``). The tmpfs mount default policy is ``never``. +Additionally, Kconfig options are available to set the default hugepage +policies for shmem (``CONFIG_TRANSPARENT_HUGEPAGE_SHMEM_HUGE_*``) and tmpfs +(``CONFIG_TRANSPARENT_HUGEPAGE_TMPFS_HUGE_*``) at build time. Refer to the +Kconfig help for more details. + In the same manner as ``thp_anon`` controls each supported anonymous THP size, ``thp_shmem`` controls each supported shmem THP size. ``thp_shmem`` has the same format as ``thp_anon``, but also supports the policy diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst index 283d77217c6f..2464425c783d 100644 --- a/Documentation/admin-guide/mm/zswap.rst +++ b/Documentation/admin-guide/mm/zswap.rst @@ -59,11 +59,11 @@ returned by the allocation routine and that handle must be mapped before being accessed. The compressed memory pool grows on demand and shrinks as compressed pages are freed. The pool is not preallocated. -When a swap page is passed from swapout to zswap, zswap maintains a mapping -of the swap entry, a combination of the swap type and swap offset, to the -zsmalloc handle that references that compressed swap page. This mapping is -achieved with a red-black tree per swap type. The swap offset is the search -key for the tree nodes. +When a swap page is passed from swapout to zswap, zswap maintains a mapping of +the swap entry, a combination of the swap type and swap offset, to the zsmalloc +handle that references that compressed swap page. This mapping is achieved +with an xarray per swap type. The swap offset is the search key for the xarray +nodes. During a page fault on a PTE that is a swap entry, the swapin code calls the zswap load function to decompress the page into the page allocated by the page diff --git a/Documentation/admin-guide/module-signing.rst b/Documentation/admin-guide/module-signing.rst index a8667a777490..7f2f127dc76f 100644 --- a/Documentation/admin-guide/module-signing.rst +++ b/Documentation/admin-guide/module-signing.rst @@ -28,10 +28,12 @@ trusted userspace bits. This facility uses X.509 ITU-T standard certificates to encode the public keys involved. The signatures are not themselves encoded in any industrial standard -type. The built-in facility currently only supports the RSA & NIST P-384 ECDSA -public key signing standard (though it is pluggable and permits others to be -used). The possible hash algorithms that can be used are SHA-2 and SHA-3 of -sizes 256, 384, and 512 (the algorithm is selected by data in the signature). +type. The built-in facility currently only supports the RSA, NIST P-384 ECDSA +and NIST FIPS-204 ML-DSA public key signing standards (though it is pluggable +and permits others to be used). For RSA and ECDSA, the possible hash +algorithms that can be used are SHA-2 and SHA-3 of sizes 256, 384, and 512 (the +algorithm is selected by data in the signature); ML-DSA does its own hashing, +but is allowed to be used with a SHA512 hash for signed attributes. ========================== @@ -146,9 +148,9 @@ into vmlinux) using parameters in the:: file (which is also generated if it does not already exist). -One can select between RSA (``MODULE_SIG_KEY_TYPE_RSA``) and ECDSA -(``MODULE_SIG_KEY_TYPE_ECDSA``) to generate either RSA 4k or NIST -P-384 keypair. +One can select between RSA (``MODULE_SIG_KEY_TYPE_RSA``), ECDSA +(``MODULE_SIG_KEY_TYPE_ECDSA``) and ML-DSA (``MODULE_SIG_KEY_TYPE_MLDSA_*``) to +generate an RSA 4k, a NIST P-384 keypair or an ML-DSA 44, 65 or 87 keypair. It is strongly recommended that you provide your own x509.genkey file. diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst index 738d7b4dc33a..dbe6d23a5d67 100644 --- a/Documentation/admin-guide/pm/cpufreq.rst +++ b/Documentation/admin-guide/pm/cpufreq.rst @@ -439,7 +439,7 @@ This governor exposes only one tunable: ``rate_limit_us`` Minimum time (in microseconds) that has to pass between two consecutive runs of governor computations (default: 1.5 times the scaling driver's - transition latency or the maximum 2ms). + transition latency or 1ms if the driver does not provide a latency value). The purpose of this tunable is to reduce the scheduler context overhead of the governor which might be excessive without it. diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst index 0c090b076224..be4c1120e3f0 100644 --- a/Documentation/admin-guide/pm/cpuidle.rst +++ b/Documentation/admin-guide/pm/cpuidle.rst @@ -580,6 +580,15 @@ the given CPU as the upper limit for the exit latency of the idle states that they are allowed to select for that CPU. They should never select any idle states with exit latency beyond that limit. +While the above CPU QoS constraints apply to CPU idle time management, user +space may also request a CPU system wakeup latency QoS limit, via the +`cpu_wakeup_latency` file. This QoS constraint is respected when selecting a +suitable idle state for the CPUs, while entering the system-wide suspend-to-idle +sleep state, but also to the regular CPU idle time management. + +Note that, the management of the `cpu_wakeup_latency` file works according to +the 'cpu_dma_latency' file from user space point of view. Moreover, the unit +is also microseconds. Idle States Control Via Kernel Command Line =========================================== diff --git a/Documentation/admin-guide/pm/intel_idle.rst b/Documentation/admin-guide/pm/intel_idle.rst index ed6f055d4b14..188d52cd26e8 100644 --- a/Documentation/admin-guide/pm/intel_idle.rst +++ b/Documentation/admin-guide/pm/intel_idle.rst @@ -260,6 +260,17 @@ mode to off when the CPU is in any one of the available idle states. This may help performance of a sibling CPU at the expense of a slightly higher wakeup latency for the idle CPU. +The ``table`` argument allows customization of idle state latency and target +residency. The syntax is a comma-separated list of ``name:latency:residency`` +entries, where ``name`` is the idle state name, ``latency`` is the exit latency +in microseconds, and ``residency`` is the target residency in microseconds. It +is not necessary to specify all idle states; only those to be customized. For +example, ``C1:1:3,C6:50:100`` sets the exit latency and target residency for +C1 and C6 to 1/3 and 50/100 microseconds, respectively. Remaining idle states +keep their default values. The driver verifies that deeper idle states have +higher latency and target residency than shallower ones. Also, target +residency cannot be smaller than exit latency. If any of these conditions is +not met, the driver ignores the entire ``table`` parameter. .. _intel-idle-core-and-package-idle-states: diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst index 26e702c7016e..fde967b0c2e0 100644 --- a/Documentation/admin-guide/pm/intel_pstate.rst +++ b/Documentation/admin-guide/pm/intel_pstate.rst @@ -48,8 +48,9 @@ only way to pass early-configuration-time parameters to it is via the kernel command line. However, its configuration can be adjusted via ``sysfs`` to a great extent. In some configurations it even is possible to unregister it via ``sysfs`` which allows another ``CPUFreq`` scaling driver to be loaded and -registered (see `below <status_attr_>`_). +registered (see :ref:`below <status_attr>`). +.. _operation_modes: Operation Modes =============== @@ -62,6 +63,8 @@ a certain performance scaling algorithm. Which of them will be in effect depends on what kernel command line options are used and on the capabilities of the processor. +.. _active_mode: + Active Mode ----------- @@ -94,6 +97,8 @@ Which of the P-state selection algorithms is used by default depends on the Namely, if that option is set, the ``performance`` algorithm will be used by default, and the other one will be used by default if it is not set. +.. _active_mode_hwp: + Active Mode With HWP ~~~~~~~~~~~~~~~~~~~~ @@ -123,7 +128,7 @@ Energy-Performance Bias (EPB) knob (otherwise), which means that the processor's internal P-state selection logic is expected to focus entirely on performance. This will override the EPP/EPB setting coming from the ``sysfs`` interface -(see `Energy vs Performance Hints`_ below). Moreover, any attempts to change +(see :ref:`energy_performance_hints` below). Moreover, any attempts to change the EPP/EPB to a value different from 0 ("performance") via ``sysfs`` in this configuration will be rejected. @@ -192,6 +197,8 @@ This is the default P-state selection algorithm if the :c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option is not set. +.. _passive_mode: + Passive Mode ------------ @@ -289,12 +296,12 @@ Unlike ``_PSS`` objects in the ACPI tables, ``intel_pstate`` always exposes the entire range of available P-states, including the whole turbo range, to the ``CPUFreq`` core and (in the passive mode) to generic scaling governors. This generally causes turbo P-states to be set more often when ``intel_pstate`` is -used relative to ACPI-based CPU performance scaling (see `below <acpi-cpufreq_>`_ -for more information). +used relative to ACPI-based CPU performance scaling (see +:ref:`below <acpi-cpufreq>` for more information). Moreover, since ``intel_pstate`` always knows what the real turbo threshold is (even if the Configurable TDP feature is enabled in the processor), its -``no_turbo`` attribute in ``sysfs`` (described `below <no_turbo_attr_>`_) should +``no_turbo`` attribute in ``sysfs`` (described :ref:`below <no_turbo_attr>`) should work as expected in all cases (that is, if set to disable turbo P-states, it always should prevent ``intel_pstate`` from using them). @@ -307,12 +314,12 @@ pieces of information on it to be known, including: * The minimum supported P-state. - * The maximum supported `non-turbo P-state <turbo_>`_. + * The maximum supported :ref:`non-turbo P-state <turbo>`. * Whether or not turbo P-states are supported at all. - * The maximum supported `one-core turbo P-state <turbo_>`_ (if turbo P-states - are supported). + * The maximum supported :ref:`one-core turbo P-state <turbo>` (if turbo + P-states are supported). * The scaling formula to translate the driver's internal representation of P-states into frequencies and the other way around. @@ -400,10 +407,10 @@ Energy-Aware Scheduling Support If ``CONFIG_ENERGY_MODEL`` has been set during kernel configuration and ``intel_pstate`` runs on a hybrid processor without SMT, in addition to enabling -`CAS <CAS_>`_ it registers an Energy Model for the processor. This allows the +:ref:`CAS` it registers an Energy Model for the processor. This allows the Energy-Aware Scheduling (EAS) support to be enabled in the CPU scheduler if ``schedutil`` is used as the ``CPUFreq`` governor which requires ``intel_pstate`` -to operate in the `passive mode <Passive Mode_>`_. +to operate in the :ref:`passive mode <passive_mode>`. The Energy Model registered by ``intel_pstate`` is artificial (that is, it is based on abstract cost values and it does not include any real power numbers) @@ -432,6 +439,8 @@ the ``energy_model`` directory in ``debugfs`` (typlically mounted on User Space Interface in ``sysfs`` ================================= +.. _global_attributes: + Global Attributes ----------------- @@ -444,8 +453,8 @@ argument is passed to the kernel in the command line. ``max_perf_pct`` Maximum P-state the driver is allowed to set in percent of the - maximum supported performance level (the highest supported `turbo - P-state <turbo_>`_). + maximum supported performance level (the highest supported :ref:`turbo + P-state <turbo>`). This attribute will not be exposed if the ``intel_pstate=per_cpu_perf_limits`` argument is present in the kernel @@ -453,8 +462,8 @@ argument is passed to the kernel in the command line. ``min_perf_pct`` Minimum P-state the driver is allowed to set in percent of the - maximum supported performance level (the highest supported `turbo - P-state <turbo_>`_). + maximum supported performance level (the highest supported :ref:`turbo + P-state <turbo>`). This attribute will not be exposed if the ``intel_pstate=per_cpu_perf_limits`` argument is present in the kernel @@ -463,18 +472,18 @@ argument is passed to the kernel in the command line. ``num_pstates`` Number of P-states supported by the processor (between 0 and 255 inclusive) including both turbo and non-turbo P-states (see - `Turbo P-states Support`_). + :ref:`turbo`). This attribute is present only if the value exposed by it is the same for all of the CPUs in the system. The value of this attribute is not affected by the ``no_turbo`` - setting described `below <no_turbo_attr_>`_. + setting described :ref:`below <no_turbo_attr>`. This attribute is read-only. ``turbo_pct`` - Ratio of the `turbo range <turbo_>`_ size to the size of the entire + Ratio of the :ref:`turbo range <turbo>` size to the size of the entire range of supported P-states, in percent. This attribute is present only if the value exposed by it is the same @@ -486,7 +495,7 @@ argument is passed to the kernel in the command line. ``no_turbo`` If set (equal to 1), the driver is not allowed to set any turbo P-states - (see `Turbo P-states Support`_). If unset (equal to 0, which is the + (see :ref:`turbo`). If unset (equal to 0, which is the default), turbo P-states can be set by the driver. [Note that ``intel_pstate`` does not support the general ``boost`` attribute (supported by some other scaling drivers) which is replaced @@ -495,11 +504,11 @@ argument is passed to the kernel in the command line. This attribute does not affect the maximum supported frequency value supplied to the ``CPUFreq`` core and exposed via the policy interface, but it affects the maximum possible value of per-policy P-state limits - (see `Interpretation of Policy Attributes`_ below for details). + (see :ref:`policy_attributes_interpretation` below for details). ``hwp_dynamic_boost`` This attribute is only present if ``intel_pstate`` works in the - `active mode with the HWP feature enabled <Active Mode With HWP_>`_ in + :ref:`active mode with the HWP feature enabled <active_mode_hwp>` in the processor. If set (equal to 1), it causes the minimum P-state limit to be increased dynamically for a short time whenever a task previously waiting on I/O is selected to run on a given logical CPU (the purpose @@ -514,12 +523,12 @@ argument is passed to the kernel in the command line. Operation mode of the driver: "active", "passive" or "off". "active" - The driver is functional and in the `active mode - <Active Mode_>`_. + The driver is functional and in the :ref:`active mode + <active_mode>`. "passive" - The driver is functional and in the `passive mode - <Passive Mode_>`_. + The driver is functional and in the :ref:`passive mode + <passive_mode>`. "off" The driver is not functional (it is not registered as a scaling @@ -547,13 +556,15 @@ argument is passed to the kernel in the command line. attribute to "1" enables the energy-efficiency optimizations and setting to "0" disables them. +.. _policy_attributes_interpretation: + Interpretation of Policy Attributes ----------------------------------- The interpretation of some ``CPUFreq`` policy attributes described in Documentation/admin-guide/pm/cpufreq.rst is special with ``intel_pstate`` as the current scaling driver and it generally depends on the driver's -`operation mode <Operation Modes_>`_. +:ref:`operation mode <operation_modes>`. First of all, the values of the ``cpuinfo_max_freq``, ``cpuinfo_min_freq`` and ``scaling_cur_freq`` attributes are produced by applying a processor-specific @@ -562,9 +573,10 @@ Also, the values of the ``scaling_max_freq`` and ``scaling_min_freq`` attributes are capped by the frequency corresponding to the maximum P-state that the driver is allowed to set. -If the ``no_turbo`` `global attribute <no_turbo_attr_>`_ is set, the driver is -not allowed to use turbo P-states, so the maximum value of ``scaling_max_freq`` -and ``scaling_min_freq`` is limited to the maximum non-turbo P-state frequency. +If the ``no_turbo`` :ref:`global attribute <no_turbo_attr>` is set, the driver +is not allowed to use turbo P-states, so the maximum value of +``scaling_max_freq`` and ``scaling_min_freq`` is limited to the maximum +non-turbo P-state frequency. Accordingly, setting ``no_turbo`` causes ``scaling_max_freq`` and ``scaling_min_freq`` to go down to that value if they were above it before. However, the old values of ``scaling_max_freq`` and ``scaling_min_freq`` will be @@ -576,7 +588,7 @@ and ``scaling_min_freq`` corresponds to the maximum supported turbo P-state, which also is the value of ``cpuinfo_max_freq`` in either case. Next, the following policy attributes have special meaning if -``intel_pstate`` works in the `active mode <Active Mode_>`_: +``intel_pstate`` works in the :ref:`active mode <active_mode>`: ``scaling_available_governors`` List of P-state selection algorithms provided by ``intel_pstate``. @@ -597,20 +609,22 @@ processor: Shows the base frequency of the CPU. Any frequency above this will be in the turbo frequency range. -The meaning of these attributes in the `passive mode <Passive Mode_>`_ is the +The meaning of these attributes in the :ref:`passive mode <passive_mode>` is the same as for other scaling drivers. Additionally, the value of the ``scaling_driver`` attribute for ``intel_pstate`` depends on the operation mode of the driver. Namely, it is either -"intel_pstate" (in the `active mode <Active Mode_>`_) or "intel_cpufreq" (in the -`passive mode <Passive Mode_>`_). +"intel_pstate" (in the :ref:`active mode <active_mode>`) or "intel_cpufreq" +(in the :ref:`passive mode <passive_mode>`). + +.. _pstate_limits_coordination: Coordination of P-State Limits ------------------------------ ``intel_pstate`` allows P-state limits to be set in two ways: with the help of -the ``max_perf_pct`` and ``min_perf_pct`` `global attributes -<Global Attributes_>`_ or via the ``scaling_max_freq`` and ``scaling_min_freq`` +the ``max_perf_pct`` and ``min_perf_pct`` :ref:`global attributes +<global_attributes>` or via the ``scaling_max_freq`` and ``scaling_min_freq`` ``CPUFreq`` policy attributes. The coordination between those limits is based on the following rules, regardless of the current operation mode of the driver: @@ -632,17 +646,18 @@ on the following rules, regardless of the current operation mode of the driver: 3. The global and per-policy limits can be set independently. -In the `active mode with the HWP feature enabled <Active Mode With HWP_>`_, the +In the :ref:`active mode with the HWP feature enabled <active_mode_hwp>`, the resulting effective values are written into hardware registers whenever the limits change in order to request its internal P-state selection logic to always set P-states within these limits. Otherwise, the limits are taken into account -by scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver -every time before setting a new P-state for a CPU. +by scaling governors (in the :ref:`passive mode <passive_mode>`) and by the +driver every time before setting a new P-state for a CPU. Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command line argument is passed to the kernel, ``max_perf_pct`` and ``min_perf_pct`` are not exposed at all and the only way to set the limits is by using the policy attributes. +.. _energy_performance_hints: Energy vs Performance Hints --------------------------- @@ -702,9 +717,9 @@ output. On those systems each ``_PSS`` object returns a list of P-states supported by the corresponding CPU which basically is a subset of the P-states range that can be used by ``intel_pstate`` on the same system, with one exception: the whole -`turbo range <turbo_>`_ is represented by one item in it (the topmost one). By -convention, the frequency returned by ``_PSS`` for that item is greater by 1 MHz -than the frequency of the highest non-turbo P-state listed by it, but the +:ref:`turbo range <turbo>` is represented by one item in it (the topmost one). +By convention, the frequency returned by ``_PSS`` for that item is greater by +1 MHz than the frequency of the highest non-turbo P-state listed by it, but the corresponding P-state representation (following the hardware specification) returned for it matches the maximum supported turbo P-state (or is the special value 255 meaning essentially "go as high as you can get"). @@ -730,18 +745,18 @@ benefit from running at turbo frequencies will be given non-turbo P-states instead. One more issue related to that may appear on systems supporting the -`Configurable TDP feature <turbo_>`_ allowing the platform firmware to set the -turbo threshold. Namely, if that is not coordinated with the lists of P-states -returned by ``_PSS`` properly, there may be more than one item corresponding to -a turbo P-state in those lists and there may be a problem with avoiding the -turbo range (if desirable or necessary). Usually, to avoid using turbo -P-states overall, ``acpi-cpufreq`` simply avoids using the topmost state listed -by ``_PSS``, but that is not sufficient when there are other turbo P-states in -the list returned by it. +:ref:`Configurable TDP feature <turbo>` allowing the platform firmware to set +the turbo threshold. Namely, if that is not coordinated with the lists of +P-states returned by ``_PSS`` properly, there may be more than one item +corresponding to a turbo P-state in those lists and there may be a problem with +avoiding the turbo range (if desirable or necessary). Usually, to avoid using +turbo P-states overall, ``acpi-cpufreq`` simply avoids using the topmost state +listed by ``_PSS``, but that is not sufficient when there are other turbo +P-states in the list returned by it. Apart from the above, ``acpi-cpufreq`` works like ``intel_pstate`` in the -`passive mode <Passive Mode_>`_, except that the number of P-states it can set -is limited to the ones listed by the ACPI ``_PSS`` objects. +:ref:`passive mode <passive_mode>`, except that the number of P-states it can +set is limited to the ones listed by the ACPI ``_PSS`` objects. Kernel Command Line Options for ``intel_pstate`` @@ -756,11 +771,11 @@ of them have to be prepended with the ``intel_pstate=`` prefix. processor is supported by it. ``active`` - Register ``intel_pstate`` in the `active mode <Active Mode_>`_ to start - with. + Register ``intel_pstate`` in the :ref:`active mode <active_mode>` to + start with. ``passive`` - Register ``intel_pstate`` in the `passive mode <Passive Mode_>`_ to + Register ``intel_pstate`` in the :ref:`passive mode <passive_mode>` to start with. ``force`` @@ -793,12 +808,12 @@ of them have to be prepended with the ``intel_pstate=`` prefix. and this option has no effect. ``per_cpu_perf_limits`` - Use per-logical-CPU P-State limits (see `Coordination of P-state - Limits`_ for details). + Use per-logical-CPU P-State limits (see + :ref:`pstate_limits_coordination` for details). ``no_cas`` - Do not enable `capacity-aware scheduling <CAS_>`_ which is enabled by - default on hybrid systems without SMT. + Do not enable :ref:`capacity-aware scheduling <CAS>` which is enabled + by default on hybrid systems without SMT. Diagnostics and Tuning ====================== @@ -810,7 +825,7 @@ There are two static trace events that can be used for ``intel_pstate`` diagnostics. One of them is the ``cpu_frequency`` trace event generally used by ``CPUFreq``, and the other one is the ``pstate_sample`` trace event specific to ``intel_pstate``. Both of them are triggered by ``intel_pstate`` only if -it works in the `active mode <Active Mode_>`_. +it works in the :ref:`active mode <active_mode>`. The following sequence of shell commands can be used to enable them and see their output (if the kernel is generally configured to support event tracing):: @@ -822,7 +837,7 @@ their output (if the kernel is generally configured to support event tracing):: gnome-terminal--4510 [001] ..s. 1177.680733: pstate_sample: core_busy=107 scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618 freq=2474476 cat-5235 [002] ..s. 1177.681723: cpu_frequency: state=2900000 cpu_id=2 -If ``intel_pstate`` works in the `passive mode <Passive Mode_>`_, the +If ``intel_pstate`` works in the :ref:`passive mode <passive_mode>`, the ``cpu_frequency`` trace event will be triggered either by the ``schedutil`` scaling governor (for the policies it is attached to), or by the ``CPUFreq`` core (for the policies with other scaling governors). diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index f3ee807b5d8b..9aed74e65cf4 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -397,13 +397,14 @@ a hung task is detected. hung_task_panic =============== -Controls the kernel's behavior when a hung task is detected. +When set to a non-zero value, a kernel panic will be triggered if the +number of hung tasks found during a single scan reaches this value. This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. -= ================================================= += ======================================================= 0 Continue operation. This is the default behavior. -1 Panic immediately. -= ================================================= +N Panic when N hung tasks are found during a single scan. += ======================================================= hung_task_check_count @@ -421,6 +422,11 @@ the system boot. This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. +hung_task_sys_info +================== +A comma separated list of extra system information to be dumped when +hung task is detected, for example, "tasks,mem,timers,locks,...". +Refer 'panic_sys_info' section below for more details. hung_task_timeout_secs ====================== @@ -515,6 +521,15 @@ default), only processes with the CAP_SYS_ADMIN capability may create io_uring instances. +kernel_sys_info +=============== +A comma separated list of extra system information to be dumped when +soft/hard lockup is detected, for example, "tasks,mem,timers,locks,...". +Refer 'panic_sys_info' section below for more details. + +It serves as the default kernel control knob, which will take effect +when a kernel module calls sys_info() with parameter==0. + kexec_load_disabled =================== @@ -576,6 +591,14 @@ if leaking kernel pointer values to unprivileged users is a concern. When ``kptr_restrict`` is set to 2, kernel pointers printed using %pK will be replaced with 0s regardless of privileges. +For disabling these security restrictions early at boot time (and once +for all), use the ``hash_pointers`` boot parameter instead. + +softlockup_sys_info & hardlockup_sys_info +========================================= +A comma separated list of extra system information to be dumped when +soft/hard lockup is detected, for example, "tasks,mem,timers,locks,...". +Refer 'panic_sys_info' section below for more details. modprobe ======== @@ -910,8 +933,8 @@ to 'panic_print'. Possible values are: ============= =================================================== tasks print all tasks info mem print system memory info -timer print timers info -lock print locks info if CONFIG_LOCKDEP is on +timers print timers info +locks print locks info if CONFIG_LOCKDEP is on ftrace print ftrace buffer all_bt print all CPUs backtrace (if available in the arch) blocked_tasks print only tasks in uninterruptible (blocked) state @@ -1215,12 +1238,6 @@ that support this feature. == =========================================================================== -real-root-dev -============= - -See Documentation/admin-guide/initrd.rst. - - reboot-cmd (SPARC only) ======================= diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst index 2ef50828aff1..3b2ad61995d4 100644 --- a/Documentation/admin-guide/sysctl/net.rst +++ b/Documentation/admin-guide/sysctl/net.rst @@ -40,8 +40,8 @@ Table : Subdirectories in /proc/sys/net bridge Bridging rose X.25 PLP layer core General parameter tipc TIPC ethernet Ethernet protocol unix Unix domain sockets - ipv4 IP version 4 x25 X.25 protocol - ipv6 IP version 6 + ipv4 IP version 4 vsock VSOCK sockets + ipv6 IP version 6 x25 X.25 protocol ========= =================== = ========== =================== 1. /proc/sys/net/core - Network core options @@ -212,6 +212,14 @@ mem_pcpu_rsv Per-cpu reserved forward alloc cache size in page units. Default 1MB per CPU. +bypass_prot_mem +--------------- + +Skip charging socket buffers to the global per-protocol memory +accounting controlled by net.ipv4.tcp_mem, net.ipv4.udp_mem, etc. + +Default: 0 (off) + rmem_default ------------ @@ -295,24 +303,33 @@ netdev_max_backlog Maximum number of packets, queued on the INPUT side, when the interface receives packets faster than kernel can process them. +qdisc_max_burst +------------------ + +Maximum number of packets that can be temporarily stored before +reaching qdisc. + +Default: 1000 + netdev_rss_key -------------- -RSS (Receive Side Scaling) enabled drivers use a 40 bytes host key that is -randomly generated. +RSS (Receive Side Scaling) enabled drivers use a host key that +is randomly generated. Some user space might need to gather its content even if drivers do not provide ethtool -x support yet. :: myhost:~# cat /proc/sys/net/core/netdev_rss_key - 84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8: ... (52 bytes total) + 84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8: ... (256 bytes total) -File contains nul bytes if no driver ever called netdev_rss_key_fill() function. +File contains all nul bytes if no driver ever called netdev_rss_key_fill() +function. Note: - /proc/sys/net/core/netdev_rss_key contains 52 bytes of key, - but most drivers only use 40 bytes of it. + /proc/sys/net/core/netdev_rss_key contains 256 bytes of key, + but many drivers only use 40 or 52 bytes of it. :: @@ -347,9 +364,9 @@ skb_defer_max ------------- Max size (in skbs) of the per-cpu list of skbs being freed -by the cpu which allocated them. Used by TCP stack so far. +by the cpu which allocated them. -Default: 64 +Default: 128 optmem_max ---------- @@ -406,6 +423,23 @@ to SOCK_TXREHASH_DEFAULT (i. e. not overridden by setsockopt). If set to 1 (default), hash rethink is performed on listening socket. If set to 0, hash rethink is not performed. +txq_reselection_ms +------------------ + +Controls how often (in ms) a busy connected flow can select another tx queue. + +A resection is desirable when/if user thread has migrated and XPS +would select a different queue. Same can occur without XPS +if the flow hash has changed. + +But switching txq can introduce reorders, especially if the +old queue is under high pressure. Modern TCP stacks deal +well with reorders if they happen not too often. + +To disable this feature, set the value to 0. + +Default : 1000 + gro_normal_batch ---------------- @@ -517,3 +551,54 @@ originally may have been issued in the correct sequential order. If named_timeout is nonzero, failed topology updates will be placed on a defer queue until another event arrives that clears the error, or until the timeout expires. Value is in milliseconds. + +6. /proc/sys/net/vsock - VSOCK sockets +-------------------------------------- + +VSOCK sockets (AF_VSOCK) provide communication between virtual machines and +their hosts. The behavior of VSOCK sockets in a network namespace is determined +by the namespace's mode (``global`` or ``local``), which controls how CIDs +(Context IDs) are allocated and how sockets interact across namespaces. + +ns_mode +------- + +Read-only. Reports the current namespace's mode, set at namespace creation +and immutable thereafter. + +Values: + + - ``global`` - the namespace shares system-wide CID allocation and + its sockets can reach any VM or socket in any global namespace. + Sockets in this namespace cannot reach sockets in local + namespaces. + - ``local`` - the namespace has private CID allocation and its + sockets can only connect to VMs or sockets within the same + namespace. + +The init_net mode is always ``global``. + +child_ns_mode +------------- + +Controls what mode newly created child namespaces will inherit. At namespace +creation, ``ns_mode`` is inherited from the parent's ``child_ns_mode``. The +initial value matches the namespace's own ``ns_mode``. + +Values: + + - ``global`` - child namespaces will share system-wide CID allocation + and their sockets will be able to reach any VM or socket in any + global namespace. + - ``local`` - child namespaces will have private CID allocation and + their sockets will only be able to connect within their own + namespace. + +The first write to ``child_ns_mode`` locks its value. Subsequent writes of the +same value succeed, but writing a different value returns ``-EBUSY``. + +Changing ``child_ns_mode`` only affects namespaces created after the change; +it does not modify the current namespace or any existing children. + +A namespace with ``ns_mode`` set to ``local`` cannot change +``child_ns_mode`` to ``global`` (returns ``-EPERM``). diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 4d71211fdad8..97e12359775c 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -41,7 +41,6 @@ Currently, these files are in /proc/sys/vm: - extfrag_threshold - highmem_is_dirtyable - hugetlb_shm_group -- laptop_mode - legacy_va_layout - lowmem_reserve_ratio - max_map_count @@ -54,6 +53,7 @@ Currently, these files are in /proc/sys/vm: - mmap_min_addr - mmap_rnd_bits - mmap_rnd_compat_bits +- movable_gigantic_pages - nr_hugepages - nr_hugepages_mempolicy - nr_overcommit_hugepages @@ -231,6 +231,8 @@ eventually gets pushed out to disk. This tunable is used to define when dirty inode is old enough to be eligible for writeback by the kernel flusher threads. And, it is also used as the interval to wakeup dirtytime_writeback thread. +Setting this to zero disables periodic dirtytime writeback. + dirty_writeback_centisecs ========================= @@ -363,13 +365,6 @@ hugetlb_shm_group contains group id that is allowed to create SysV shared memory segment using hugetlb page. -laptop_mode -=========== - -laptop_mode is a knob that controls "laptop mode". All the things that are -controlled by this knob are discussed in Documentation/admin-guide/laptops/laptop-mode.rst. - - legacy_va_layout ================ @@ -494,6 +489,10 @@ memory allocations. The default value depends on CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT. +When CONFIG_MEM_ALLOC_PROFILING_DEBUG=y, this control is read-only to avoid +warnings produced by allocations made while profiling is disabled and freed +when it's enabled. + memory_failure_early_kill ========================= @@ -624,6 +623,33 @@ This value can be changed after boot using the /proc/sys/vm/mmap_rnd_compat_bits tunable +movable_gigantic_pages +====================== + +This parameter controls whether gigantic pages may be allocated from +ZONE_MOVABLE. If set to non-zero, gigantic pages can be allocated +from ZONE_MOVABLE. ZONE_MOVABLE memory may be created via the kernel +boot parameter `kernelcore` or via memory hotplug as discussed in +Documentation/admin-guide/mm/memory-hotplug.rst. + +Support may depend on specific architecture. + +Note that using ZONE_MOVABLE gigantic pages make memory hotremove unreliable. + +Memory hot-remove operations will block indefinitely until the admin reserves +sufficient gigantic pages to service migration requests associated with the +memory offlining process. As HugeTLB gigantic page reservation is a manual +process (via `nodeN/hugepages/.../nr_hugepages` interfaces) this may not be +obvious when just attempting to offline a block of memory. + +Additionally, as multiple gigantic pages may be reserved on a single block, +it may appear that gigantic pages are available for migration when in reality +they are in the process of being removed. For example if `memoryN` contains +two gigantic pages, one reserved and one allocated, and an admin attempts to +offline that block, this operations may hang indefinitely unless another +reserved gigantic page is available on another block `memoryM`. + + nr_hugepages ============ diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst index a0cc017e4424..ed1f8f1e86c5 100644 --- a/Documentation/admin-guide/tainted-kernels.rst +++ b/Documentation/admin-guide/tainted-kernels.rst @@ -186,6 +186,6 @@ More detailed explanation for tainting 18) ``N`` if an in-kernel test, such as a KUnit test, has been run. - 19) ``J`` if userpace opened /dev/fwctl/* and performed a FWTCL_RPC_DEBUG_WRITE + 19) ``J`` if userspace opened /dev/fwctl/* and performed a FWTCL_RPC_DEBUG_WRITE to use the devices debugging features. Device debugging features could cause the device to malfunction in undefined ways. diff --git a/Documentation/admin-guide/thermal/index.rst b/Documentation/admin-guide/thermal/index.rst index 193b7b01a87d..e48bc0a1951b 100644 --- a/Documentation/admin-guide/thermal/index.rst +++ b/Documentation/admin-guide/thermal/index.rst @@ -6,3 +6,4 @@ Thermal Subsystem :maxdepth: 1 intel_powerclamp + intel_thermal_throttle diff --git a/Documentation/admin-guide/thermal/intel_thermal_throttle.rst b/Documentation/admin-guide/thermal/intel_thermal_throttle.rst new file mode 100644 index 000000000000..f4fbf9d5a4ec --- /dev/null +++ b/Documentation/admin-guide/thermal/intel_thermal_throttle.rst @@ -0,0 +1,91 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + +======================================= +Intel thermal throttle events reporting +======================================= + +:Author: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> + +Introduction +------------ + +Intel processors have built in automatic and adaptive thermal monitoring +mechanisms that force the processor to reduce its power consumption in order +to operate within predetermined temperature limits. + +Refer to section "THERMAL MONITORING AND PROTECTION" in the "Intel® 64 and +IA-32 Architectures Software Developer’s Manual Volume 3 (3A, 3B, 3C, & 3D): +System Programming Guide" for more details. + +In general, there are two mechanisms to control the core temperature of the +processor. They are called "Thermal Monitor 1 (TM1) and Thermal Monitor 2 (TM2)". + +The status of the temperature sensor that triggers the thermal monitor (TM1/TM2) +is indicated through the "thermal status flag" and "thermal status log flag" in +MSR_IA32_THERM_STATUS for core level and MSR_IA32_PACKAGE_THERM_STATUS for +package level. + +Thermal Status flag, bit 0 — When set, indicates that the processor core +temperature is currently at the trip temperature of the thermal monitor and that +the processor power consumption is being reduced via either TM1 or TM2, depending +on which is enabled. When clear, the flag indicates that the core temperature is +below the thermal monitor trip temperature. This flag is read only. + +Thermal Status Log flag, bit 1 — When set, indicates that the thermal sensor has +tripped since the last power-up or reset or since the last time that software +cleared this flag. This flag is a sticky bit; once set it remains set until +cleared by software or until a power-up or reset of the processor. The default +state is clear. + +It is possible that when user reads MSR_IA32_THERM_STATUS or +MSR_IA32_PACKAGE_THERM_STATUS, TM1/TM2 is not active. In this case, +"Thermal Status flag" will read "0" and the "Thermal Status Log flag" will be set +to show any previous "TM1/TM2" activation. But since it needs to be cleared by +the software, it can't show the number of occurrences of "TM1/TM2" activations. + +Hence, Linux provides counters of how many times the "Thermal Status flag" was +set. Also presents how long the "Thermal Status flag" was active in milliseconds. +Using these counters, users can check if the performance was limited because of +thermal events. It is recommended to read from sysfs instead of directly reading +MSRs as the "Thermal Status Log flag" is reset by the driver to implement rate +control. + +Sysfs Interface +--------------- + +Thermal throttling events are presented for each CPU under +"/sys/devices/system/cpu/cpuX/thermal_throttle/", where "X" is the CPU number. + +All these counters are read-only. They can't be reset to 0. So, they can potentially +overflow after reaching the maximum 64 bit unsigned integer. + +``core_throttle_count`` + Shows the number of times "Thermal Status flag" changed from 0 to 1 for this + CPU since OS boot and thermal vector is initialized. This is a 64 bit counter. + +``package_throttle_count`` + Shows the number of times "Thermal Status flag" changed from 0 to 1 for the + package containing this CPU since OS boot and thermal vector is initialized. + Package status is broadcast to all CPUs; all CPUs in the package increment + this count. This is a 64-bit counter. + +``core_throttle_max_time_ms`` + Shows the maximum amount of time for which "Thermal Status flag" has been + set to 1 for this CPU at the core level since OS boot and thermal vector + is initialized. + +``package_throttle_max_time_ms`` + Shows the maximum amount of time for which "Thermal Status flag" has been + set to 1 for the package containing this CPU since OS boot and thermal + vector is initialized. + +``core_throttle_total_time_ms`` + Shows the cumulative time for which "Thermal Status flag" has been + set to 1 for this CPU for core level since OS boot and thermal vector + is initialized. + +``package_throttle_total_time_ms`` + Shows the cumulative time for which "Thermal Status flag" has been set + to 1 for the package containing this CPU since OS boot and thermal vector + is initialized. diff --git a/Documentation/admin-guide/thunderbolt.rst b/Documentation/admin-guide/thunderbolt.rst index 102c693c8f81..89df26553aa0 100644 --- a/Documentation/admin-guide/thunderbolt.rst +++ b/Documentation/admin-guide/thunderbolt.rst @@ -203,10 +203,10 @@ host controller or a device, it is important that the firmware can be upgraded to the latest where possible bugs in it have been fixed. Typically OEMs provide this firmware from their support site. -There is also a central site which has links where to download firmware -for some machines: - - `Thunderbolt Updates <https://thunderbolttechnology.net/updates>`_ +Currently, recommended method of updating firmware is through "fwupd" tool. +It uses LVFS (Linux Vendor Firmware Service) portal by default to get the +latest firmware from hardware vendors and updates connected devices if found +compatible. For details refer to: https://github.com/fwupd/fwupd. Before you upgrade firmware on a device, host or retimer, please make sure it is a suitable upgrade. Failing to do that may render the device @@ -215,18 +215,40 @@ tools! Host NVM upgrade on Apple Macs is not supported. -Once the NVM image has been downloaded, you need to plug in a -Thunderbolt device so that the host controller appears. It does not -matter which device is connected (unless you are upgrading NVM on a -device - then you need to connect that particular device). +Fwupd is installed by default. If you don't have it on your system, simply +use your distro package manager to get it. + +To see possible updates through fwupd, you need to plug in a Thunderbolt +device so that the host controller appears. It does not matter which +device is connected (unless you are upgrading NVM on a device - then you +need to connect that particular device). Note an OEM-specific method to power the controller up ("force power") may be available for your system in which case there is no need to plug in a Thunderbolt device. -After that we can write the firmware to the non-active parts of the NVM -of the host or device. As an example here is how Intel NUC6i7KYK (Skull -Canyon) Thunderbolt controller NVM is upgraded:: +Updating firmware using fwupd is straightforward - refer to official +readme on fwupd github. + +If firmware image is written successfully, the device shortly disappears. +Once it comes back, the driver notices it and initiates a full power +cycle. After a while device appears again and this time it should be +fully functional. + +Device of interest should display new version under "Current version" +and "Update State: Success" in fwupd's interface. + +Upgrading firmware manually +--------------------------------------------------------------- +If possible, use fwupd to updated the firmware. However, if your device OEM +has not uploaded the firmware to LVFS, but it is available for download +from their side, you can use method below to directly upgrade the +firmware. + +Manual firmware update can be done with 'dd' tool. To update firmware +using this method, you need to write it to the non-active parts of NVM +of the host or device. Example on how to update Intel NUC6i7KYK +(Skull Canyon) Thunderbolt controller NVM:: # dd if=KYK_TBT_FW_0018.bin of=/sys/bus/thunderbolt/devices/0-0/nvm_non_active0/nvmem @@ -235,10 +257,8 @@ upgrade process as follows:: # echo 1 > /sys/bus/thunderbolt/devices/0-0/nvm_authenticate -If no errors are returned, the host controller shortly disappears. Once -it comes back the driver notices it and initiates a full power cycle. -After a while the host controller appears again and this time it should -be fully functional. +If no errors are returned, device should behave as described in previous +section. We can verify that the new NVM firmware is active by running the following commands:: @@ -350,7 +370,7 @@ is built-in to the kernel image, there is no need to do anything. The driver will create one virtual ethernet interface per Thunderbolt port which are named like ``thunderbolt0`` and so on. From this point -you can either use standard userspace tools like ``ifconfig`` to +you can either use standard userspace tools like ``ip`` to configure the interface or let your GUI handle it automatically. Forcing power diff --git a/Documentation/admin-guide/workload-tracing.rst b/Documentation/admin-guide/workload-tracing.rst index d6313890ee41..35963491b9f1 100644 --- a/Documentation/admin-guide/workload-tracing.rst +++ b/Documentation/admin-guide/workload-tracing.rst @@ -196,11 +196,11 @@ Let’s checkout the latest Linux repository and build cscope database:: cscope -R -p10 # builds cscope.out database before starting browse session cscope -d -p10 # starts browse session on cscope.out database -Note: Run "cscope -R -p10" to build the database and c"scope -d -p10" to -enter into the browsing session. cscope by default cscope.out database. -To get out of this mode press ctrl+d. -p option is used to specify the -number of file path components to display. -p10 is optimal for browsing -kernel sources. +Note: Run "cscope -R -p10" to build the database and "cscope -d -p10" to +enter into the browsing session. cscope by default uses the cscope.out +database. To get out of this mode press ctrl+d. -p option is used to +specify the number of file path components to display. -p10 is optimal +for browsing kernel sources. What is perf and how do we use it? ================================== diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst index c85cd327af28..746ea60eed3f 100644 --- a/Documentation/admin-guide/xfs.rst +++ b/Documentation/admin-guide/xfs.rst @@ -215,6 +215,14 @@ When mounting an XFS filesystem, the following options are accepted. inconsistent namespace presentation during or after a failover event. + errortag=tagname + When specified, enables the error inject tag named "tagname" with the + default frequency. Can be specified multiple times to enable multiple + errortags. Specifying this option on remount will reset the error tag + to the default value if it was set to any other value before. + This option is only supported when CONFIG_XFS_DEBUG is enabled, and + will not be reflected in /proc/self/mounts. + Deprecation of V4 Format ======================== |
