diff options
Diffstat (limited to 'Documentation/admin-guide')
20 files changed, 477 insertions, 126 deletions
diff --git a/Documentation/admin-guide/bug-hunting.rst b/Documentation/admin-guide/bug-hunting.rst index 30858757c9f2..7da0504388ec 100644 --- a/Documentation/admin-guide/bug-hunting.rst +++ b/Documentation/admin-guide/bug-hunting.rst @@ -252,7 +252,7 @@ For example, if you find a bug at the gspca's sonixj.c file, you can get its maintainers with:: $ ./scripts/get_maintainer.pl --bug -f drivers/media/usb/gspca/sonixj.c - Hans Verkuil <hverkuil@xs4all.nl> (odd fixer:GSPCA USB WEBCAM DRIVER,commit_signer:1/1=100%) + Hans Verkuil <hverkuil@kernel.org> (odd fixer:GSPCA USB WEBCAM DRIVER,commit_signer:1/1=100%) Mauro Carvalho Chehab <mchehab@kernel.org> (maintainer:MEDIA INPUT INFRASTRUCTURE (V4L/DVB),commit_signer:1/1=100%) Tejun Heo <tj@kernel.org> (commit_signer:1/1=100%) Bhaktipriya Shridhar <bhaktipriya96@gmail.com> (commit_signer:1/1=100%,authored:1/1=100%,added_lines:4/4=100%,removed_lines:9/9=100%) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 51c0bc4c2dc5..0e6c67ac585a 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -15,6 +15,9 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou .. CONTENTS + [Whenever any new section is added to this document, please also add + an entry here.] + 1. Introduction 1-1. Terminology 1-2. What is cgroup? @@ -25,9 +28,10 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou 2-2-2. Threads 2-3. [Un]populated Notification 2-4. Controlling Controllers - 2-4-1. Enabling and Disabling - 2-4-2. Top-down Constraint - 2-4-3. No Internal Process Constraint + 2-4-1. Availability + 2-4-2. Enabling and Disabling + 2-4-3. Top-down Constraint + 2-4-4. No Internal Process Constraint 2-5. Delegation 2-5-1. Model of Delegation 2-5-2. Delegation Containment @@ -61,14 +65,15 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou 5-4-1. PID Interface Files 5-5. Cpuset 5.5-1. Cpuset Interface Files - 5-6. Device + 5-6. Device controller 5-7. RDMA 5-7-1. RDMA Interface Files 5-8. DMEM + 5-8-1. DMEM Interface Files 5-9. HugeTLB 5.9-1. HugeTLB Interface Files 5-10. Misc - 5.10-1 Miscellaneous cgroup Interface Files + 5.10-1 Misc Interface Files 5.10-2 Migration and Ownership 5-11. Others 5-11-1. perf_event @@ -1001,6 +1006,24 @@ All cgroup core files are prefixed with "cgroup." Total number of dying cgroup subsystems (e.g. memory cgroup) at and beneath the current cgroup. + cgroup.stat.local + A read-only flat-keyed file which exists in non-root cgroups. + The following entry is defined: + + frozen_usec + Cumulative time that this cgroup has spent between freezing and + thawing, regardless of whether by self or ancestor groups. + NB: (not) reaching "frozen" state is not accounted here. + + Using the following ASCII representation of a cgroup's freezer + state, :: + + 1 _____ + frozen 0 __/ \__ + ab cd + + the duration being measured is the span between a and c. + cgroup.freeze A read-write single value file which exists on non-root cgroups. Allowed values are "0" and "1". The default is "0". diff --git a/Documentation/admin-guide/hw-vuln/attack_vector_controls.rst b/Documentation/admin-guide/hw-vuln/attack_vector_controls.rst index 5964901d66e3..d0bdbd81dcf9 100644 --- a/Documentation/admin-guide/hw-vuln/attack_vector_controls.rst +++ b/Documentation/admin-guide/hw-vuln/attack_vector_controls.rst @@ -218,6 +218,7 @@ SRSO X X X X SSB X TAA X X X X * (Note 2) TSA X X X X +VMSCAPE X =============== ============== ============ ============= ============== ============ ======== Notes: diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst index 89ca636081b7..55d747511f83 100644 --- a/Documentation/admin-guide/hw-vuln/index.rst +++ b/Documentation/admin-guide/hw-vuln/index.rst @@ -26,3 +26,4 @@ are configurable at compile, boot or run time. rsb old_microcode indirect-target-selection + vmscape diff --git a/Documentation/admin-guide/hw-vuln/vmscape.rst b/Documentation/admin-guide/hw-vuln/vmscape.rst new file mode 100644 index 000000000000..d9b9a2b6c114 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/vmscape.rst @@ -0,0 +1,110 @@ +.. SPDX-License-Identifier: GPL-2.0 + +VMSCAPE +======= + +VMSCAPE is a vulnerability that may allow a guest to influence the branch +prediction in host userspace. It particularly affects hypervisors like QEMU. + +Even if a hypervisor may not have any sensitive data like disk encryption keys, +guest-userspace may be able to attack the guest-kernel using the hypervisor as +a confused deputy. + +Affected processors +------------------- + +The following CPU families are affected by VMSCAPE: + +**Intel processors:** + - Skylake generation (Parts without Enhanced-IBRS) + - Cascade Lake generation - (Parts affected by ITS guest/host separation) + - Alder Lake and newer (Parts affected by BHI) + +Note that, BHI affected parts that use BHB clearing software mitigation e.g. +Icelake are not vulnerable to VMSCAPE. + +**AMD processors:** + - Zen series (families 0x17, 0x19, 0x1a) + +** Hygon processors:** + - Family 0x18 + +Mitigation +---------- + +Conditional IBPB +---------------- + +Kernel tracks when a CPU has run a potentially malicious guest and issues an +IBPB before the first exit to userspace after VM-exit. If userspace did not run +between VM-exit and the next VM-entry, no IBPB is issued. + +Note that the existing userspace mitigation against Spectre-v2 is effective in +protecting the userspace. They are insufficient to protect the userspace VMMs +from a malicious guest. This is because Spectre-v2 mitigations are applied at +context switch time, while the userspace VMM can run after a VM-exit without a +context switch. + +Vulnerability enumeration and mitigation is not applied inside a guest. This is +because nested hypervisors should already be deploying IBPB to isolate +themselves from nested guests. + +SMT considerations +------------------ + +When Simultaneous Multi-Threading (SMT) is enabled, hypervisors can be +vulnerable to cross-thread attacks. For complete protection against VMSCAPE +attacks in SMT environments, STIBP should be enabled. + +The kernel will issue a warning if SMT is enabled without adequate STIBP +protection. Warning is not issued when: + +- SMT is disabled +- STIBP is enabled system-wide +- Intel eIBRS is enabled (which implies STIBP protection) + +System information and options +------------------------------ + +The sysfs file showing VMSCAPE mitigation status is: + + /sys/devices/system/cpu/vulnerabilities/vmscape + +The possible values in this file are: + + * 'Not affected': + + The processor is not vulnerable to VMSCAPE attacks. + + * 'Vulnerable': + + The processor is vulnerable and no mitigation has been applied. + + * 'Mitigation: IBPB before exit to userspace': + + Conditional IBPB mitigation is enabled. The kernel tracks when a CPU has + run a potentially malicious guest and issues an IBPB before the first + exit to userspace after VM-exit. + + * 'Mitigation: IBPB on VMEXIT': + + IBPB is issued on every VM-exit. This occurs when other mitigations like + RETBLEED or SRSO are already issuing IBPB on VM-exit. + +Mitigation control on the kernel command line +---------------------------------------------- + +The mitigation can be controlled via the ``vmscape=`` command line parameter: + + * ``vmscape=off``: + + Disable the VMSCAPE mitigation. + + * ``vmscape=ibpb``: + + Enable conditional IBPB mitigation (default when CONFIG_MITIGATION_VMSCAPE=y). + + * ``vmscape=force``: + + Force vulnerability detection and mitigation even on processors that are + not known to be affected. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 86f395f2933b..74ca438d2d6d 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2606,6 +2606,11 @@ for it. Intended to get systems with badly broken firmware running. + irqhandler.duration_warn_us= [KNL] + Warn if an IRQ handler exceeds the specified duration + threshold in microseconds. Useful for identifying + long-running IRQs in the system. + irqpoll [HW] When an interrupt is not handled search all handlers for it. Also check all handlers each timer @@ -3767,8 +3772,16 @@ mga= [HW,DRM] - microcode.force_minrev= [X86] - Format: <bool> + microcode= [X86] Control the behavior of the microcode loader. + Available options, comma separated: + + base_rev=X - with <X> with format: <u32> + Set the base microcode revision of each thread when in + debug mode. + + dis_ucode_ldr: disable the microcode loader + + force_minrev: Enable or disable the microcode minimal revision enforcement for the runtime microcode loader. @@ -3829,6 +3842,7 @@ srbds=off [X86,INTEL] ssbd=force-off [ARM64] tsx_async_abort=off [X86] + vmscape=off [X86] Exceptions: This does not have any effect on @@ -6154,7 +6168,7 @@ rdt= [HW,X86,RDT] Turn on/off individual RDT features. List is: cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp, - mba, smba, bmec. + mba, smba, bmec, abmc. E.g. to turn on cmt and turn off mba use: rdt=cmt,!mba @@ -6405,8 +6419,9 @@ rodata= [KNL,EARLY] on Mark read-only kernel memory as read-only (default). off Leave read-only kernel memory writable for debugging. - full Mark read-only kernel memory and aliases as read-only - [arm64] + noalias Mark read-only kernel memory as read-only but retain + writable aliases in the direct map for regions outside + of the kernel image. [arm64] rockchip.usb_uart [EARLY] @@ -6428,6 +6443,9 @@ rootflags= [KNL] Set root filesystem mount option string + initramfs_options= [KNL] + Specify mount options for for the initramfs mount. + rootfstype= [KNL] Set root filesystem type rootwait [KNL] Wait (indefinitely) for root device to show up. @@ -8041,6 +8059,16 @@ vmpoff= [KNL,S390] Perform z/VM CP command after power off. Format: <command> + vmscape= [X86] Controls mitigation for VMscape attacks. + VMscape attacks can leak information from a userspace + hypervisor to a guest via speculative side-channels. + + off - disable the mitigation + ibpb - use Indirect Branch Prediction Barrier + (IBPB) mitigation (default) + force - force vulnerability detection even on + unaffected processors + vsyscall= [X86-64,EARLY] Controls the behavior of vsyscalls (i.e. calls to fixed addresses of 0xffffffffff600x00 from legacy diff --git a/Documentation/admin-guide/laptops/lg-laptop.rst b/Documentation/admin-guide/laptops/lg-laptop.rst index 67fd6932cef4..c4dd534f91ed 100644 --- a/Documentation/admin-guide/laptops/lg-laptop.rst +++ b/Documentation/admin-guide/laptops/lg-laptop.rst @@ -48,8 +48,8 @@ This value is reset to 100 when the kernel boots. Fan mode -------- -Writing 1/0 to /sys/devices/platform/lg-laptop/fan_mode disables/enables -the fan silent mode. +Writing 0/1/2 to /sys/devices/platform/lg-laptop/fan_mode sets fan mode to +Optimal/Silent/Performance respectively. USB charge diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst index 4ff2cc291d18..1c2eacc94758 100644 --- a/Documentation/admin-guide/md.rst +++ b/Documentation/admin-guide/md.rst @@ -347,6 +347,54 @@ All md devices contain: active-idle like active, but no writes have been seen for a while (safe_mode_delay). + consistency_policy + This indicates how the array maintains consistency in case of unexpected + shutdown. It can be: + + none + Array has no redundancy information, e.g. raid0, linear. + + resync + Full resync is performed and all redundancy is regenerated when the + array is started after unclean shutdown. + + bitmap + Resync assisted by a write-intent bitmap. + + journal + For raid4/5/6, journal device is used to log transactions and replay + after unclean shutdown. + + ppl + For raid5 only, Partial Parity Log is used to close the write hole and + eliminate resync. + + The accepted values when writing to this file are ``ppl`` and ``resync``, + used to enable and disable PPL. + + uuid + This indicates the UUID of the array in the following format: + xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx + + bitmap_type + [RW] When read, this file will display the current and available + bitmap for this array. The currently active bitmap will be enclosed + in [] brackets. Writing an bitmap name or ID to this file will switch + control of this array to that new bitmap. Note that writing a new + bitmap for created array is forbidden. + + none + No bitmap + bitmap + The default internal bitmap + llbitmap + The lockless internal bitmap + +If bitmap_type is not none, then additional bitmap attributes bitmap/xxx or +llbitmap/xxx will be created after md device KOBJ_CHANGE event. + +If bitmap_type is bitmap, then the md device will also contain: + bitmap/location This indicates where the write-intent bitmap for the array is stored. @@ -401,35 +449,23 @@ All md devices contain: once the array becomes non-degraded, and this fact has been recorded in the metadata. - consistency_policy - This indicates how the array maintains consistency in case of unexpected - shutdown. It can be: - - none - Array has no redundancy information, e.g. raid0, linear. - - resync - Full resync is performed and all redundancy is regenerated when the - array is started after unclean shutdown. - - bitmap - Resync assisted by a write-intent bitmap. +If bitmap_type is llbitmap, then the md device will also contain: - journal - For raid4/5/6, journal device is used to log transactions and replay - after unclean shutdown. + llbitmap/bits + This is read-only, show status of bitmap bits, the number of each + value. - ppl - For raid5 only, Partial Parity Log is used to close the write hole and - eliminate resync. - - The accepted values when writing to this file are ``ppl`` and ``resync``, - used to enable and disable PPL. + llbitmap/metadata + This is read-only, show bitmap metadata, include chunksize, chunkshift, + chunks, offset and daemon_sleep. - uuid - This indicates the UUID of the array in the following format: - xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx + llbitmap/daemon_sleep + This is read-write, time in seconds that daemon function will be + triggered to clear dirty bits. + llbitmap/barrier_idle + This is read-write, time in seconds that page barrier will be idled, + means dirty bits in the page will be cleared. As component devices are added to an md array, they appear in the ``md`` directory as new directories named:: diff --git a/Documentation/admin-guide/media/i2c-cardlist.rst b/Documentation/admin-guide/media/i2c-cardlist.rst index 1825a0bb47bd..fff962558cd5 100644 --- a/Documentation/admin-guide/media/i2c-cardlist.rst +++ b/Documentation/admin-guide/media/i2c-cardlist.rst @@ -91,7 +91,6 @@ ov5647 OmniVision OV5647 sensor ov5670 OmniVision OV5670 sensor ov5675 OmniVision OV5675 sensor ov5695 OmniVision OV5695 sensor -ov6650 OmniVision OV6650 sensor ov7251 OmniVision OV7251 sensor ov7640 OmniVision OV7640 sensor ov7670 OmniVision OV7670 sensor diff --git a/Documentation/admin-guide/media/ivtv.rst b/Documentation/admin-guide/media/ivtv.rst index 101f16d0263e..8b65ac3f5321 100644 --- a/Documentation/admin-guide/media/ivtv.rst +++ b/Documentation/admin-guide/media/ivtv.rst @@ -3,7 +3,7 @@ The ivtv driver =============== -Author: Hans Verkuil <hverkuil@xs4all.nl> +Author: Hans Verkuil <hverkuil@kernel.org> This is a v4l2 device driver for the Conexant cx23415/6 MPEG encoder/decoder. The cx23415 can do both encoding and decoding, the cx23416 can only do MPEG diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst index ede14b679d02..ec8c34b2d32f 100644 --- a/Documentation/admin-guide/mm/damon/start.rst +++ b/Documentation/admin-guide/mm/damon/start.rst @@ -175,4 +175,4 @@ Below command makes every memory region of size >=4K that has not accessed for $ sudo damo start --damos_access_rate 0 0 --damos_sz_region 4K max \ --damos_age 60s max --damos_action pageout \ - <pid of your workload> + --target_pid <pid of your workload> diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index ff3a2dda1f02..2cae60b6f3ca 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -61,7 +61,7 @@ comma (","). │ :ref:`kdamonds <sysfs_kdamonds>`/nr_kdamonds │ │ :ref:`0 <sysfs_kdamond>`/state,pid,refresh_ms │ │ │ :ref:`contexts <sysfs_contexts>`/nr_contexts - │ │ │ │ :ref:`0 <sysfs_context>`/avail_operations,operations + │ │ │ │ :ref:`0 <sysfs_context>`/avail_operations,operations,addr_unit │ │ │ │ │ :ref:`monitoring_attrs <sysfs_monitoring_attrs>`/ │ │ │ │ │ │ intervals/sample_us,aggr_us,update_us │ │ │ │ │ │ │ intervals_goal/access_bp,aggrs,min_sample_us,max_sample_us @@ -188,9 +188,9 @@ details). At the moment, only one context per kdamond is supported, so only contexts/<N>/ ------------- -In each context directory, two files (``avail_operations`` and ``operations``) -and three directories (``monitoring_attrs``, ``targets``, and ``schemes``) -exist. +In each context directory, three files (``avail_operations``, ``operations`` +and ``addr_unit``) and three directories (``monitoring_attrs``, ``targets``, +and ``schemes``) exist. DAMON supports multiple types of :ref:`monitoring operations <damon_design_configurable_operations_set>`, including those for virtual address @@ -205,6 +205,9 @@ You can set and get what type of monitoring operations DAMON will use for the context by writing one of the keywords listed in ``avail_operations`` file and reading from the ``operations`` file. +``addr_unit`` file is for setting and getting the :ref:`address unit +<damon_design_addr_unit>` parameter of the operations set. + .. _sysfs_monitoring_attrs: contexts/<N>/monitoring_attrs/ diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 370fba113460..1654211cc6cf 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -225,6 +225,42 @@ to "always" or "madvise"), and it'll be automatically shutdown when PMD-sized THP is disabled (when both the per-size anon control and the top-level control are "never") +process THP controls +-------------------- + +A process can control its own THP behaviour using the ``PR_SET_THP_DISABLE`` +and ``PR_GET_THP_DISABLE`` pair of prctl(2) calls. The THP behaviour set using +``PR_SET_THP_DISABLE`` is inherited across fork(2) and execve(2). These calls +support the following arguments:: + + prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0): + This will disable THPs completely for the process, irrespective + of global THP controls or madvise(..., MADV_COLLAPSE) being used. + + prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, 0, 0): + This will disable THPs for the process except when the usage of THPs is + advised. Consequently, THPs will only be used when: + - Global THP controls are set to "always" or "madvise" and + madvise(..., MADV_HUGEPAGE) or madvise(..., MADV_COLLAPSE) is used. + - Global THP controls are set to "never" and madvise(..., MADV_COLLAPSE) + is used. This is the same behavior as if THPs would not be disabled on + a process level. + Note that MADV_COLLAPSE is currently always rejected if + madvise(..., MADV_NOHUGEPAGE) is set on an area. + + prctl(PR_SET_THP_DISABLE, 0, 0, 0, 0): + This will re-enable THPs for the process, as if they were never disabled. + Whether THPs will actually be used depends on global THP controls and + madvise() calls. + + prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0): + This returns a value whose bits indicate how THP-disable is configured: + Bits + 1 0 Value Description + |0|0| 0 No THP-disable behaviour specified. + |0|1| 1 THP is entirely disabled for this process. + |1|1| 3 THP-except-advised mode is set for this process. + Khugepaged controls ------------------- @@ -383,6 +419,8 @@ option: ``huge=``. It can have following values: always Attempt to allocate huge pages every time we need a new page; + Always try PMD-sized huge pages first, and fall back to smaller-sized + huge pages if the PMD-sized huge page allocation fails; never Do not allocate huge pages. Note that ``madvise(..., MADV_COLLAPSE)`` @@ -390,7 +428,9 @@ never is specified everywhere; within_size - Only allocate huge page if it will be fully within i_size. + Only allocate huge page if it will be fully within i_size; + Always try PMD-sized huge pages first, and fall back to smaller-sized + huge pages if the PMD-sized huge page allocation fails; Also respect madvise() hints; advise diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst index fd3370aa43fe..283d77217c6f 100644 --- a/Documentation/admin-guide/mm/zswap.rst +++ b/Documentation/admin-guide/mm/zswap.rst @@ -53,26 +53,17 @@ Zswap receives pages for compression from the swap subsystem and is able to evict pages from its own compressed pool on an LRU basis and write them back to the backing swap device in the case that the compressed pool is full. -Zswap makes use of zpool for the managing the compressed memory pool. Each -allocation in zpool is not directly accessible by address. Rather, a handle is +Zswap makes use of zsmalloc for the managing the compressed memory pool. Each +allocation in zsmalloc is not directly accessible by address. Rather, a handle is returned by the allocation routine and that handle must be mapped before being accessed. The compressed memory pool grows on demand and shrinks as compressed -pages are freed. The pool is not preallocated. By default, a zpool -of type selected in ``CONFIG_ZSWAP_ZPOOL_DEFAULT`` Kconfig option is created, -but it can be overridden at boot time by setting the ``zpool`` attribute, -e.g. ``zswap.zpool=zsmalloc``. It can also be changed at runtime using the sysfs -``zpool`` attribute, e.g.:: - - echo zsmalloc > /sys/module/zswap/parameters/zpool - -The zsmalloc type zpool has a complex compressed page storage method, and it -can achieve great storage densities. +pages are freed. The pool is not preallocated. When a swap page is passed from swapout to zswap, zswap maintains a mapping -of the swap entry, a combination of the swap type and swap offset, to the zpool -handle that references that compressed swap page. This mapping is achieved -with a red-black tree per swap type. The swap offset is the search key for the -tree nodes. +of the swap entry, a combination of the swap type and swap offset, to the +zsmalloc handle that references that compressed swap page. This mapping is +achieved with a red-black tree per swap type. The swap offset is the search +key for the tree nodes. During a page fault on a PTE that is a swap entry, the swapin code calls the zswap load function to decompress the page into the page allocated by the page @@ -96,11 +87,11 @@ attribute, e.g.:: echo lzo > /sys/module/zswap/parameters/compressor -When the zpool and/or compressor parameter is changed at runtime, any existing -compressed pages are not modified; they are left in their own zpool. When a -request is made for a page in an old zpool, it is uncompressed using its -original compressor. Once all pages are removed from an old zpool, the zpool -and its compressor are freed. +When the compressor parameter is changed at runtime, any existing compressed +pages are not modified; they are left in their own pool. When a request is +made for a page in an old pool, it is uncompressed using its original +compressor. Once all pages are removed from an old pool, the pool and its +compressor are freed. Some of the pages in zswap are same-value filled pages (i.e. contents of the page have same value or repetitive pattern). These pages include zero-filled diff --git a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst index cb376f335f40..167f9281fbf5 100644 --- a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst +++ b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst @@ -16,8 +16,8 @@ provides the following two features: - one 64-bit counter for Time Based Analysis (RX/TX data throughput and time spent in each low-power LTSSM state) and -- one 32-bit counter for Event Counting (error and non-error events for - a specified lane) +- one 32-bit counter per event for Event Counting (error and non-error + events for a specified lane) Note: There is no interrupt for counter overflow. diff --git a/Documentation/admin-guide/perf/fujitsu_uncore_pmu.rst b/Documentation/admin-guide/perf/fujitsu_uncore_pmu.rst new file mode 100644 index 000000000000..46595b788d3a --- /dev/null +++ b/Documentation/admin-guide/perf/fujitsu_uncore_pmu.rst @@ -0,0 +1,110 @@ +.. SPDX-License-Identifier: GPL-2.0-only + +================================================ +Fujitsu Uncore Performance Monitoring Unit (PMU) +================================================ + +This driver supports the Uncore MAC PMUs and the Uncore PCI PMUs found +in Fujitsu chips. +Each MAC PMU on these chips is exposed as a uncore perf PMU with device name +mac_iod<iod>_mac<mac>_ch<ch>. +And each PCI PMU on these chips is exposed as a uncore perf PMU with device name +pci_iod<iod>_pci<pci>. + +The driver provides a description of its available events and configuration +options in sysfs, see /sys/bus/event_sources/devices/mac_iod<iod>_mac<mac>_ch<ch>/ +and /sys/bus/event_sources/devices/pci_iod<iod>_pci<pci>/. +This driver exports: +- formats, used by perf user space and other tools to configure events +- events, used by perf user space and other tools to create events + symbolically, e.g.: + perf stat -a -e mac_iod0_mac0_ch0/event=0x21/ ls + perf stat -a -e pci_iod0_pci0/event=0x24/ ls +- cpumask, used by perf user space and other tools to know on which CPUs + to open the events + +This driver supports the following events for MAC: +- cycles + This event counts MAC cycles at MAC frequency. +- read-count + This event counts the number of read requests to MAC. +- read-count-request + This event counts the number of read requests including retry to MAC. +- read-count-return + This event counts the number of responses to read requests to MAC. +- read-count-request-pftgt + This event counts the number of read requests including retry with PFTGT + flag. +- read-count-request-normal + This event counts the number of read requests including retry without PFTGT + flag. +- read-count-return-pftgt-hit + This event counts the number of responses to read requests which hit the + PFTGT buffer. +- read-count-return-pftgt-miss + This event counts the number of responses to read requests which miss the + PFTGT buffer. +- read-wait + This event counts outstanding read requests issued by DDR memory controller + per cycle. +- write-count + This event counts the number of write requests to MAC (including zero write, + full write, partial write, write cancel). +- write-count-write + This event counts the number of full write requests to MAC (not including + zero write). +- write-count-pwrite + This event counts the number of partial write requests to MAC. +- memory-read-count + This event counts the number of read requests from MAC to memory. +- memory-write-count + This event counts the number of full write requests from MAC to memory. +- memory-pwrite-count + This event counts the number of partial write requests from MAC to memory. +- ea-mac + This event counts energy consumption of MAC. +- ea-memory + This event counts energy consumption of memory. +- ea-memory-mac-write + This event counts the number of write requests from MAC to memory. +- ea-ha + This event counts energy consumption of HA. + + 'ea' is the abbreviation for 'Energy Analyzer'. + +Examples for use with perf:: + + perf stat -e mac_iod0_mac0_ch0/ea-mac/ ls + +And, this driver supports the following events for PCI: +- pci-port0-cycles + This event counts PCI cycles at PCI frequency in port0. +- pci-port0-read-count + This event counts read transactions for data transfer in port0. +- pci-port0-read-count-bus + This event counts read transactions for bus usage in port0. +- pci-port0-write-count + This event counts write transactions for data transfer in port0. +- pci-port0-write-count-bus + This event counts write transactions for bus usage in port0. +- pci-port1-cycles + This event counts PCI cycles at PCI frequency in port1. +- pci-port1-read-count + This event counts read transactions for data transfer in port1. +- pci-port1-read-count-bus + This event counts read transactions for bus usage in port1. +- pci-port1-write-count + This event counts write transactions for data transfer in port1. +- pci-port1-write-count-bus + This event counts write transactions for bus usage in port1. +- ea-pci + This event counts energy consumption of PCI. + + 'ea' is the abbreviation for 'Energy Analyzer'. + +Examples for use with perf:: + + perf stat -e pci_iod0_pci0/ea-pci/ ls + +Given that these are uncore PMUs the driver does not support sampling, therefore +"perf record" will not work. Per-task perf sessions are not supported. diff --git a/Documentation/admin-guide/perf/hisi-pmu.rst b/Documentation/admin-guide/perf/hisi-pmu.rst index 48992a0b8e94..c4c2cbbf88cb 100644 --- a/Documentation/admin-guide/perf/hisi-pmu.rst +++ b/Documentation/admin-guide/perf/hisi-pmu.rst @@ -18,9 +18,10 @@ HiSilicon SoC uncore PMU driver Each device PMU has separate registers for event counting, control and interrupt, and the PMU driver shall register perf PMU drivers like L3C, HHA and DDRC etc. The available events and configuration options shall -be described in the sysfs, see: +be described in the sysfs, see:: + +/sys/bus/event_source/devices/hisi_sccl{X}_<l3c{Y}/hha{Y}/ddrc{Y}> -/sys/bus/event_source/devices/hisi_sccl{X}_<l3c{Y}/hha{Y}/ddrc{Y}>. The "perf list" command shall list the available events from sysfs. Each L3C, HHA and DDRC is registered as a separate PMU with perf. The PMU @@ -112,6 +113,50 @@ uring channel. It is 2 bits. Some important codes are as follows: - 2'b00: default value, count the events which sent to the both uring and uring_ext channel; +6. ch: NoC PMU supports filtering the event counts of certain transaction +channel with this option. The current supported channels are as follows: + +- 3'b010: Request channel +- 3'b100: Snoop channel +- 3'b110: Response channel +- 3'b111: Data channel + +7. tt_en: NoC PMU supports counting only transactions that have tracetag set +if this option is set. See the 2nd list for more information about tracetag. + +For HiSilicon uncore PMU v3 whose identifier is 0x40, some uncore PMUs are +further divided into parts for finer granularity of tracing, each part has its +own dedicated PMU, and all such PMUs together cover the monitoring job of events +on particular uncore device. Such PMUs are described in sysfs with name format +slightly changed:: + +/sys/bus/event_source/devices/hisi_sccl{X}_<l3c{Y}_{Z}/ddrc{Y}_{Z}/noc{Y}_{Z}> + +Z is the sub-id, indicating different PMUs for part of hardware device. + +Usage of most PMUs with different sub-ids are identical. Specially, L3C PMU +provides ``ext`` option to allow exploration of even finer granual statistics +of L3C PMU. L3C PMU driver uses that as hint of termination when delivering +perf command to hardware: + +- ext=0: Default, could be used with event names. +- ext=1 and ext=2: Must be used with event codes, event names are not supported. + +An example of perf command could be:: + + $# perf stat -a -e hisi_sccl0_l3c1_0/rd_spipe/ sleep 5 + +or:: + + $# perf stat -a -e hisi_sccl0_l3c1_0/event=0x1,ext=1/ sleep 5 + +As above, ``hisi_sccl0_l3c1_0`` locates PMU of Super CPU CLuster 0, L3 cache 1 +pipe0. + +First command locates the first part of L3C since ``ext=0`` is implied by +default. Second command issues the counting on another part of L3C with the +event ``0x1``. + Users could configure IDs to count data come from specific CCL/ICL, by setting srcid_cmd & srcid_msk, and data desitined for specific CCL/ICL by setting tgtid_cmd & tgtid_msk. A set bit in srcid_msk/tgtid_msk means the PMU will not diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst index 072b510385c4..47d9a3df6329 100644 --- a/Documentation/admin-guide/perf/index.rst +++ b/Documentation/admin-guide/perf/index.rst @@ -29,3 +29,4 @@ Performance monitor support cxl ampere_cspmu mrvl-pem-pmu + fujitsu_uncore_pmu diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst index 7b0c4291c686..2ef50828aff1 100644 --- a/Documentation/admin-guide/sysctl/net.rst +++ b/Documentation/admin-guide/sysctl/net.rst @@ -222,6 +222,8 @@ rmem_max The maximum receive socket buffer size in bytes. +Default: 4194304 + rps_default_mask ---------------- @@ -247,6 +249,8 @@ wmem_max The maximum send socket buffer size in bytes. +Default: 4194304 + message_burst and message_cost ------------------------------ diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst index a18328a5fb93..c85cd327af28 100644 --- a/Documentation/admin-guide/xfs.rst +++ b/Documentation/admin-guide/xfs.rst @@ -34,22 +34,6 @@ When mounting an XFS filesystem, the following options are accepted. to the file. Specifying a fixed ``allocsize`` value turns off the dynamic behaviour. - attr2 or noattr2 - The options enable/disable an "opportunistic" improvement to - be made in the way inline extended attributes are stored - on-disk. When the new form is used for the first time when - ``attr2`` is selected (either when setting or removing extended - attributes) the on-disk superblock feature bit field will be - updated to reflect this format being in use. - - The default behaviour is determined by the on-disk feature - bit indicating that ``attr2`` behaviour is active. If either - mount option is set, then that becomes the new default used - by the filesystem. - - CRC enabled filesystems always use the ``attr2`` format, and so - will reject the ``noattr2`` mount option if it is set. - discard or nodiscard (default) Enable/disable the issuing of commands to let the block device reclaim space freed by the filesystem. This is @@ -75,12 +59,6 @@ When mounting an XFS filesystem, the following options are accepted. across the entire filesystem rather than just on directories configured to use it. - ikeep or noikeep (default) - When ``ikeep`` is specified, XFS does not delete empty inode - clusters and keeps them around on disk. When ``noikeep`` is - specified, empty inode clusters are returned to the free - space pool. - inode32 or inode64 (default) When ``inode32`` is specified, it indicates that XFS limits inode creation to locations which will not result in inode @@ -253,9 +231,8 @@ latest version and try again. The deprecation will take place in two parts. Support for mounting V4 filesystems can now be disabled at kernel build time via Kconfig option. -The option will default to yes until September 2025, at which time it -will be changed to default to no. In September 2030, support will be -removed from the codebase entirely. +These options were changed to default to no in September 2025. In +September 2030, support will be removed from the codebase entirely. Note: Distributors may choose to withdraw V4 format support earlier than the dates listed above. @@ -268,8 +245,6 @@ Deprecated Mount Options ============================ ================ Mounting with V4 filesystem September 2030 Mounting ascii-ci filesystem September 2030 -ikeep/noikeep September 2025 -attr2/noattr2 September 2025 ============================ ================ @@ -285,6 +260,8 @@ Removed Mount Options osyncisdsync/osyncisosync v4.0 barrier v4.19 nobarrier v4.19 + ikeep/noikeep v6.18 + attr2/noattr2 v6.18 =========================== ======= sysctls @@ -312,9 +289,6 @@ The following sysctls are available for the XFS filesystem: removes unused preallocation from clean inodes and releases the unused space back to the free pool. - fs.xfs.speculative_cow_prealloc_lifetime - This is an alias for speculative_prealloc_lifetime. - fs.xfs.error_level (Min: 0 Default: 3 Max: 11) A volume knob for error reporting when internal errors occur. This will generate detailed messages & backtraces for filesystem @@ -341,17 +315,6 @@ The following sysctls are available for the XFS filesystem: This option is intended for debugging only. - fs.xfs.irix_symlink_mode (Min: 0 Default: 0 Max: 1) - Controls whether symlinks are created with mode 0777 (default) - or whether their mode is affected by the umask (irix mode). - - fs.xfs.irix_sgid_inherit (Min: 0 Default: 0 Max: 1) - Controls files created in SGID directories. - If the group ID of the new file does not match the effective group - ID or one of the supplementary group IDs of the parent dir, the - ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl - is set. - fs.xfs.inherit_sync (Min: 0 Default: 1 Max: 1) Setting this to "1" will cause the "sync" flag set by the **xfs_io(8)** chattr command on a directory to be @@ -387,24 +350,20 @@ The following sysctls are available for the XFS filesystem: Deprecated Sysctls ================== -=========================================== ================ - Name Removal Schedule -=========================================== ================ -fs.xfs.irix_sgid_inherit September 2025 -fs.xfs.irix_symlink_mode September 2025 -fs.xfs.speculative_cow_prealloc_lifetime September 2025 -=========================================== ================ - +None currently. Removed Sysctls =============== -============================= ======= - Name Removed -============================= ======= - fs.xfs.xfsbufd_centisec v4.0 - fs.xfs.age_buffer_centisecs v4.0 -============================= ======= +========================================== ======= + Name Removed +========================================== ======= + fs.xfs.xfsbufd_centisec v4.0 + fs.xfs.age_buffer_centisecs v4.0 + fs.xfs.irix_symlink_mode v6.18 + fs.xfs.irix_sgid_inherit v6.18 + fs.xfs.speculative_cow_prealloc_lifetime v6.18 +========================================== ======= Error handling ============== |