| Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode loading updates from Borislav Petkov:
- Add infrastructure to be able to debug the microcode loader in a guest
- Refresh Intel old microcode revisions
* tag 'x86_microcode_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode: Add microcode loader debugging functionality
x86/microcode: Add microcode= cmdline parsing
x86/microcode/intel: Refresh the revisions that determine old_microcode
|
|
HEAD
KVM SEV-SNP CipherText Hiding support for 6.18
Add support for SEV-SNP's CipherText Hiding, an opt-in feature that prevents
unauthorized CPU accesses from reading the ciphertext of SNP guest private
memory, e.g. to attempt an offline attack. Instead of ciphertext, the CPU
will always read back all FFs when CipherText Hiding is enabled.
Add new module parameter to the KVM module to enable CipherText Hiding and
control the number of ASIDs that can be used for VMs with CipherText Hiding,
which is in effect the number of SNP VMs. When CipherText Hiding is enabled,
the shared SEV-ES/SEV-SNP ASID space is split into separate ranges for SEV-ES
and SEV-SNP guests, i.e. ASIDs that can be used for CipherText Hiding cannot
be used to run SEV-ES guests.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- Extensive cpuset code cleanup and refactoring work with no functional
changes: CPU mask computation logic refactoring, introducing new
helpers, removing redundant code paths, and improving error handling
for better maintainability.
- A few bug fixes to cpuset including fixes for partition creation
failures when isolcpus is in use, missing error returns, and null
pointer access prevention in free_tmpmasks().
- Core cgroup changes include replacing the global percpu_rwsem with
per-threadgroup rwsem when writing to cgroup.procs for better
scalability, workqueue conversions to use WQ_PERCPU and
system_percpu_wq to prepare for workqueue default switching from
percpu to unbound, and removal of unused code including the
post_attach callback.
- New cgroup.stat.local time accounting feature that tracks frozen time
duration.
- Misc changes including selftests updates (new freezer time tests and
backward compatibility fixes), documentation sync, string function
safety improvements, and 64-bit division fixes.
* tag 'cgroup-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (39 commits)
cpuset: remove is_prs_invalid helper
cpuset: remove impossible warning in update_parent_effective_cpumask
cpuset: remove redundant special case for null input in node mask update
cpuset: fix missing error return in update_cpumask
cpuset: Use new excpus for nocpu error check when enabling root partition
cpuset: fix failure to enable isolated partition when containing isolcpus
Documentation: cgroup-v2: Sync manual toctree
cpuset: use partition_cpus_change for setting exclusive cpus
cpuset: use parse_cpulist for setting cpus.exclusive
cpuset: introduce partition_cpus_change
cpuset: refactor cpus_allowed_validate_change
cpuset: refactor out validate_partition
cpuset: introduce cpus_excl_conflict and mems_excl_conflict helpers
cpuset: refactor CPU mask buffer parsing logic
cpuset: Refactor exclusive CPU mask computation logic
cpuset: change return type of is_partition_[in]valid to bool
cpuset: remove unused assignment to trialcs->partition_root_state
cpuset: move the root cpuset write check earlier
cgroup/cpuset: Remove redundant rcu_read_lock/unlock() in spin_lock
cgroup: Remove redundant rcu_read_lock/unlock() in spin_lock
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"There's good stuff across the board, including some nice mm
improvements for CPUs with the 'noabort' BBML2 feature and a clever
patch to allow ptdump to play nicely with block mappings in the
vmalloc area.
Confidential computing:
- Add support for accepting secrets from firmware (e.g. ACPI CCEL)
and mapping them with appropriate attributes.
CPU features:
- Advertise atomic floating-point instructions to userspace
- Extend Spectre workarounds to cover additional Arm CPU variants
- Extend list of CPUs that support break-before-make level 2 and
guarantee not to generate TLB conflict aborts for changes of
mapping granularity (BBML2_NOABORT)
- Add GCS support to our uprobes implementation.
Documentation:
- Remove bogus SME documentation concerning register state when
entering/exiting streaming mode.
Entry code:
- Switch over to the generic IRQ entry code (GENERIC_IRQ_ENTRY)
- Micro-optimise syscall entry path with a compiler branch hint.
Memory management:
- Enable huge mappings in vmalloc space even when kernel page-table
dumping is enabled
- Tidy up the types used in our early MMU setup code
- Rework rodata= for closer parity with the behaviour on x86
- For CPUs implementing BBML2_NOABORT, utilise block mappings in the
linear map even when rodata= applies to virtual aliases
- Don't re-allocate the virtual region between '_text' and '_stext',
as doing so confused tools parsing /proc/vmcore.
Miscellaneous:
- Clean-up Kconfig menuconfig text for architecture features
- Avoid redundant bitmap_empty() during determination of supported
SME vector lengths
- Re-enable warnings when building the 32-bit vDSO object
- Avoid breaking our eggs at the wrong end.
Perf and PMUs:
- Support for v3 of the Hisilicon L3C PMU
- Support for Hisilicon's MN and NoC PMUs
- Support for Fujitsu's Uncore PMU
- Support for SPE's extended event filtering feature
- Preparatory work to enable data source filtering in SPE
- Support for multiple lanes in the DWC PCIe PMU
- Support for i.MX94 in the IMX DDR PMU driver
- MAINTAINERS update (Thank you, Yicong)
- Minor driver fixes (PERF_IDX2OFF() overflow, CMN register offsets).
Selftests:
- Add basic LSFE check to the existing hwcaps test
- Support nolibc in GCS tests
- Extend SVE ptrace test to pass unsupported regsets and invalid
vector lengths
- Minor cleanups (typos, cosmetic changes).
System registers:
- Fix ID_PFR1_EL1 definition
- Fix incorrect signedness of some fields in ID_AA64MMFR4_EL1
- Sync TCR_EL1 definition with the latest Arm ARM (L.b)
- Be stricter about the input fed into our AWK sysreg generator
script
- Typo fixes and removal of redundant definitions.
ACPI, EFI and PSCI:
- Decouple Arm's "Software Delegated Exception Interface" (SDEI)
support from the ACPI GHES code so that it can be used by platforms
booted with device-tree
- Remove unnecessary per-CPU tracking of the FPSIMD state across EFI
runtime calls
- Fix a node refcount imbalance in the PSCI device-tree code.
CPU Features:
- Ensure register sanitisation is applied to fields in ID_AA64MMFR4
- Expose AIDR_EL1 to userspace via sysfs, primarily so that KVM
guests can reliably query the underlying CPU types from the VMM
- Re-enabling of SME support (CONFIG_ARM64_SME) as a result of fixes
to our context-switching, signal handling and ptrace code"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (93 commits)
arm64: cpufeature: Remove duplicate asm/mmu.h header
arm64: Kconfig: Make CPU_BIG_ENDIAN depend on BROKEN
perf/dwc_pcie: Fix use of uninitialized variable
arm/syscalls: mark syscall invocation as likely in invoke_syscall
Documentation: hisi-pmu: Add introduction to HiSilicon V3 PMU
Documentation: hisi-pmu: Fix of minor format error
drivers/perf: hisi: Add support for L3C PMU v3
drivers/perf: hisi: Refactor the event configuration of L3C PMU
drivers/perf: hisi: Extend the field of tt_core
drivers/perf: hisi: Extract the event filter check of L3C PMU
drivers/perf: hisi: Simplify the probe process of each L3C PMU version
drivers/perf: hisi: Export hisi_uncore_pmu_isr()
drivers/perf: hisi: Relax the event ID check in the framework
perf: Fujitsu: Add the Uncore PMU driver
arm64: map [_text, _stext) virtual address range non-executable+read-only
arm64/sysreg: Update TCR_EL1 register
arm64: Enable vmalloc-huge with ptdump
arm64: cpufeature: add Neoverse-V3AE to BBML2 allow list
arm64: errata: Apply workarounds for Neoverse-V3AE
arm64: cputype: Add Neoverse-V3AE definitions
...
|
|
Pull xfs updates from Carlos Maiolino:
"For this merge window, there are really no new features, but there are
a few things worth to emphasize:
- Deprecated for years already, the (no)attr2 and (no)ikeep mount
options have been removed for good
- Several cleanups (specially from typedefs) and bug fixes
- Improvements made in the online repair reap calculations
- online fsck is now enabled by default"
* tag 'xfs-merge-6.18' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (53 commits)
xfs: rework datasync tracking and execution
xfs: rearrange code in xfs_inode_item_precommit
xfs: scrub: use kstrdup_const() for metapath scan setups
xfs: use bt_nr_sectors in xfs_dax_translate_range
xfs: track the number of blocks in each buftarg
xfs: constify xfs_errortag_random_default
xfs: improve default maximum number of open zones
xfs: improve zone statistics message
xfs: centralize error tag definitions
xfs: remove pointless externs in xfs_error.h
xfs: remove the expr argument to XFS_TEST_ERROR
xfs: remove xfs_errortag_set
xfs: remove xfs_errortag_get
xfs: move the XLOG_REG_ constants out of xfs_log_format.h
xfs: adjust the hint based zone allocation policy
xfs: refactor hint based zone allocation
fs: add an enum for number of life time hints
xfs: fix log CRC mismatches between i386 and other architectures
xfs: rename the old_crc variable in xlog_recover_process
xfs: remove the unused xfs_log_iovec_t typedef
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual selections of misc updates for this cycle.
Features:
- Add "initramfs_options" parameter to set initramfs mount options.
This allows to add specific mount options to the rootfs to e.g.,
limit the memory size
- Add RWF_NOSIGNAL flag for pwritev2()
Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE
signal from being raised when writing on disconnected pipes or
sockets. The flag is handled directly by the pipe filesystem and
converted to the existing MSG_NOSIGNAL flag for sockets
- Allow to pass pid namespace as procfs mount option
Ever since the introduction of pid namespaces, procfs has had very
implicit behaviour surrounding them (the pidns used by a procfs
mount is auto-selected based on the mounting process's active
pidns, and the pidns itself is basically hidden once the mount has
been constructed)
This implicit behaviour has historically meant that userspace was
required to do some special dances in order to configure the pidns
of a procfs mount as desired. Examples include:
* In order to bypass the mnt_too_revealing() check, Kubernetes
creates a procfs mount from an empty pidns so that user
namespaced containers can be nested (without this, the nested
containers would fail to mount procfs)
But this requires forking off a helper process because you cannot
just one-shot this using mount(2)
* Container runtimes in general need to fork into a container
before configuring its mounts, which can lead to security issues
in the case of shared-pidns containers (a privileged process in
the pidns can interact with your container runtime process)
While SUID_DUMP_DISABLE and user namespaces make this less of an
issue, the strict need for this due to a minor uAPI wart is kind
of unfortunate
Things would be much easier if there was a way for userspace to
just specify the pidns they want. So this pull request contains
changes to implement a new "pidns" argument which can be set
using fsconfig(2):
fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
or classic mount(2) / mount(8):
// mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
Cleanups:
- Remove the last references to EXPORT_OP_ASYNC_LOCK
- Make file_remove_privs_flags() static
- Remove redundant __GFP_NOWARN when GFP_NOWAIT is used
- Use try_cmpxchg() in start_dir_add()
- Use try_cmpxchg() in sb_init_done_wq()
- Replace offsetof() with struct_size() in ioctl_file_dedupe_range()
- Remove vfs_ioctl() export
- Replace rwlock() with spinlock in epoll code as rwlock causes
priority inversion on preempt rt kernels
- Make ns_entries in fs/proc/namespaces const
- Use a switch() statement() in init_special_inode() just like we do
in may_open()
- Use struct_size() in dir_add() in the initramfs code
- Use str_plural() in rd_load_image()
- Replace strcpy() with strscpy() in find_link()
- Rename generic_delete_inode() to inode_just_drop() and
generic_drop_inode() to inode_generic_drop()
- Remove unused arguments from fcntl_{g,s}et_rw_hint()
Fixes:
- Document @name parameter for name_contains_dotdot() helper
- Fix spelling mistake
- Always return zero from replace_fd() instead of the file descriptor
number
- Limit the size for copy_file_range() in compat mode to prevent a
signed overflow
- Fix debugfs mount options not being applied
- Verify the inode mode when loading it from disk in minixfs
- Verify the inode mode when loading it from disk in cramfs
- Don't trigger automounts with RESOLVE_NO_XDEV
If openat2() was called with RESOLVE_NO_XDEV it didn't traverse
through automounts, but could still trigger them
- Add FL_RECLAIM flag to show_fl_flags() macro so it appears in
tracepoints
- Fix unused variable warning in rd_load_image() on s390
- Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD
- Use ns_capable_noaudit() when determining net sysctl permissions
- Don't call path_put() under namespace semaphore in listmount() and
statmount()"
* tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits)
fcntl: trim arguments
listmount: don't call path_put() under namespace semaphore
statmount: don't call path_put() under namespace semaphore
pid: use ns_capable_noaudit() when determining net sysctl permissions
fs: rename generic_delete_inode() and generic_drop_inode()
init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD
initramfs: Replace strcpy() with strscpy() in find_link()
initrd: Use str_plural() in rd_load_image()
initramfs: Use struct_size() helper to improve dir_add()
initrd: Fix unused variable warning in rd_load_image() on s390
fs: use the switch statement in init_special_inode()
fs/proc/namespaces: make ns_entries const
filelock: add FL_RECLAIM to show_fl_flags() macro
eventpoll: Replace rwlock with spinlock
selftests/proc: add tests for new pidns APIs
procfs: add "pidns" mount option
pidns: move is-ancestor logic to helper
openat2: don't trigger automounts with RESOLVE_NO_XDEV
namei: move cross-device check to __traverse_mounts
namei: remove LOOKUP_NO_XDEV check from handle_mounts
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fixes from Ilpo Järvinen:
"Fixes and New HW Supoort
- amd/pmc: Use 8042 quirk for Stellaris Slim Gen6 AMD
- dell: Set USTT mode according to BIOS after reboot
- dell-lis3lv02d: Add Latitude E6530
- lg-laptop: Fix setting the fan mode"
* tag 'platform-drivers-x86-v6.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86: lg-laptop: Fix WMAB call in fan_mode_store()
platform/x86: dell-lis3lv02d: Add Latitude E6530
platform/x86/dell: Set USTT mode according to BIOS after reboot
platform/x86/amd/pmc: Add Stellaris Slim Gen6 AMD to spurious 8042 quirks list
|
|
Running "make htmldocs" generates the following build errors and
warnings for fujitsu_uncore_pmu.rst:
Documentation/admin-guide/perf/fujitsu_uncore_pmu.rst:20: ERROR: Unexpected indentation.
Documentation/admin-guide/perf/fujitsu_uncore_pmu.rst:23: WARNING: Block quote ends without a blank line; unexpected unindent.
Documentation/admin-guide/perf/fujitsu_uncore_pmu.rst:28: ERROR: Unexpected indentation.
Documentation/admin-guide/perf/fujitsu_uncore_pmu.rst:29: WARNING: Block quote ends without a blank line; unexpected unindent.
Documentation/admin-guide/perf/fujitsu_uncore_pmu.rst:81: ERROR: Unexpected indentation.
Documentation/admin-guide/perf/fujitsu_uncore_pmu.rst:82: WARNING: Block quote ends without a blank line; unexpected unindent.
Add blank line before bullet lists, block quotes to fix build
errors, resolve warnings and properly render perf commands as
code blocks.
Signed-off-by: Gopi Krishna Menon <krishnagopi487@gmail.com>
Reviewed-by: Koichi Okuno <fj2767dz@fujitsu.com>
Fixes: bad11557eed2 ("perf: Fujitsu: Add the Uncore PMU driver")
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Previously tt_core is defined as config1:0-7 which may not cover all
the CPUs sharing L3C on platforms with more than 8 CPUs in a L3C. In
order to support such platforms extend tt_core to 16 bits, since no
spare space in config1, tt_core was moved to config2:0-15.
Though linux expects the users to retrieve the control encoding from
sysfs first for each option, it's possible if user doesn't follow
this and hardcoded tt_core in config1. So add an option
tt_core_deprecated for config1:0-7 for backward compatibility.
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
When WMAB is called to set the fan mode, the new mode is read from either
bits 0-1 or bits 4-5 (depending on the value of some other EC register).
Thus when WMAB is called with bits 4-5 zeroed and called again with
bits 0-1 zeroed, the second call undoes the effect of the first call.
This causes writes to /sys/devices/platform/lg-laptop/fan_mode to have
no effect (and causes reads to always report a status of zero).
Fix this by calling WMAB once, with the mode set in bits 0,1 and 4,5.
When the fan mode is returned from WMAB it always has this form, so
there is no need to preserve the other bits. As a bonus, the driver
now supports the "Performance" fan mode seen in the LG-provided Windows
control app, which provides less aggressive CPU throttling but louder
fan noise and shorter battery life.
Also, correct the documentation to reflect that 0 corresponds to the
default mode (what the Windows app calls "Optimal") and 1 corresponds
to the silent mode.
Fixes: dbf0c5a6b1f8 ("platform/x86: Add LG Gram laptop special features driver")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=204913#c4
Signed-off-by: Daniel Lee <dany97@live.ca>
Link: https://patch.msgid.link/MN2PR06MB55989CB10E91C8DA00EE868DDC1CA@MN2PR06MB5598.namprd06.prod.outlook.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
|
|
* for-next/perf: (29 commits)
perf/dwc_pcie: Fix use of uninitialized variable
Documentation: hisi-pmu: Add introduction to HiSilicon V3 PMU
Documentation: hisi-pmu: Fix of minor format error
drivers/perf: hisi: Add support for L3C PMU v3
drivers/perf: hisi: Refactor the event configuration of L3C PMU
drivers/perf: hisi: Extend the field of tt_core
drivers/perf: hisi: Extract the event filter check of L3C PMU
drivers/perf: hisi: Simplify the probe process of each L3C PMU version
drivers/perf: hisi: Export hisi_uncore_pmu_isr()
drivers/perf: hisi: Relax the event ID check in the framework
perf: Fujitsu: Add the Uncore PMU driver
perf/arm-cmn: Fix CMN S3 DTM offset
perf: arm_spe: Prevent overflow in PERF_IDX2OFF()
coresight: trbe: Prevent overflow in PERF_IDX2OFF()
MAINTAINERS: Remove myself from HiSilicon PMU maintainers
drivers/perf: hisi: Add support for HiSilicon MN PMU driver
drivers/perf: hisi: Add support for HiSilicon NoC PMU
perf: arm_pmuv3: Factor out PMCCNTR_EL0 use conditions
arm64/boot: Enable EL2 requirements for SPE_FEAT_FDS
arm64/boot: Factor out a macro to check SPE version
...
|
|
Some of HiSilicon V3 PMU hardware is divided into parts to fulfill the
job of monitoring specific parts of a device. Add description on that
as well as the newly added ext option for L3C PMU.
Acked-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Yushan Wang <wangyushan12@huawei.com>
Reviewed-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The inline path of sysfs should be placed in literal blocks to make
documentation look better.
Acked-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Acked-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Yushan Wang <wangyushan12@huawei.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
This adds a new dynamic PMU to the Perf Events framework to program and
control the Uncore PMUs in Fujitsu chips.
This driver exports formatting and event information to sysfs so it can
be used by the perf user space tools with the syntaxes:
perf stat -e pci_iod0_pci0/ea-pci/ ls
perf stat -e pci_iod0_pci0/event=0x80/ ls
perf stat -e mac_iod0_mac0_ch0/ea-mac/ ls
perf stat -e mac_iod0_mac0_ch0/event=0x80/ ls
FUJITSU-MONAKA PMU Events Specification v1.1 URL:
https://github.com/fujitsu/FUJITSU-MONAKA
Reviewed-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Koichi Okuno <fj2767dz@fujitsu.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The example command doesn't work [1] on the latest DAMON user-space tool,
since --damos_action option is updated to receive multiple arguments, and
hence cannot know if the final argument is for deductible monitoring
target or an argument for --damos_action option. Add --target_pid option
to let damo understand it is for target pid.
Link: https://lkml.kernel.org/r/20250916032339.115817-5-sj@kernel.org
Link: https://github.com/damonitor/damo/pull/32 [2]
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
After commit acd7ccb284b8 ("mm: shmem: add large folio support for
tmpfs"), we have extended tmpfs to allow any sized large folios, rather
than just PMD-sized large folios.
The strategy discussed previously was:
: Considering that tmpfs already has the 'huge=' option to control the
: PMD-sized large folios allocation, we can extend the 'huge=' option to
: allow any sized large folios. The semantics of the 'huge=' mount option
: are:
:
: huge=never: no any sized large folios
: huge=always: any sized large folios
: huge=within_size: like 'always' but respect the i_size
: huge=advise: like 'always' if requested with madvise()
:
: Note: for tmpfs mmap() faults, due to the lack of a write size hint, still
: allocate the PMD-sized huge folios if huge=always/within_size/advise is
: set.
:
: Moreover, the 'deny' and 'force' testing options controlled by
: '/sys/kernel/mm/transparent_hugepage/shmem_enabled', still retain the same
: semantics. The 'deny' can disable any sized large folios for tmpfs, while
: the 'force' can enable PMD sized large folios for tmpfs.
This means that when tmpfs is mounted with 'huge=always' or
'huge=within_size', tmpfs will allow getting a highest order hint based on
the size of write() and fallocate() paths. It will then try each
allowable large order, rather than continually attempting to allocate
PMD-sized large folios as before.
However, this might break some user scenarios for those who want to use
PMD-sized large folios, such as the i915 driver which did not supply a
write size hint when allocating shmem [1].
Moreover, Hugh also complained that this will cause a regression in userspace
with 'huge=always' or 'huge=within_size'.
So, let's revisit the strategy for tmpfs large page allocation. A simple fix
would be to always try PMD-sized large folios first, and if that fails, fall
back to smaller large folios. This approach differs from the strategy for
large folio allocation used by other file systems, however, tmpfs is somewhat
different from other file systems, as quoted from David's opinion:
: There were opinions in the past that tmpfs should just behave like any
: other fs, and I think that's what we tried to satisfy here: use the write
: size as an indication.
:
: I assume there will be workloads where either approach will be beneficial.
: I also assume that workloads that use ordinary fs'es could benefit from
: the same strategy (start with PMD), while others will clearly not.
Link: https://lkml.kernel.org/r/10e7ac6cebe6535c137c064d5c5a235643eebb4a.1756888965.git.baolin.wang@linux.alibaba.com
Link: https://lore.kernel.org/lkml/0d734549d5ed073c80b11601da3abdd5223e1889.1753689802.git.baolin.wang@linux.alibaba.com/ [1]
Fixes: acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
With zswap using zsmalloc directly, there are no more in-tree users of
this code. Remove it.
With zpool gone, zsmalloc is now always a simple dependency and no
longer something the user needs to configure. Hide CONFIG_ZSMALLOC
from the user and have zswap and zram pull it in as needed.
Link: https://lkml.kernel.org/r/20250829162212.208258-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: SeongJae Park <sj@kernel.org>
Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.se>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Manually-arranged toctree comment in cgroup v2 documentation is a rather
out-of-sync with actual contents: a few sections are missing and/or
named differently.
Sync the toctree.
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Acked-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Adds the support for HiSilicon NoC (Network on Chip) PMU which
will be used to monitor the events on the system bus. The PMU
device will be named after the SCL ID (either Super CPU cluster
or Super IO cluster) and the index ID, just similar to other
HiSilicon Uncore PMUs. Below PMU formats are provided besides
the event:
- ch: the transaction channel (data, request, response, etc) which
can be used to filter the counting.
- tt_en: tracetag filtering enable. Just as other HiSilicon Uncore
PMUs the NoC PMU supports only counting the transactions with
tracetag.
The NoC PMU doesn't have an interrupt to indicate the overflow.
However we have a 64 bit counter which is large enough and it's
nearly impossible to overflow.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
While Designware PCIe PMU allows to count only one time based event
at a time, it allows to count all the lane events simultaneously.
After the patch one is able to count a group of lane events:
$ perf stat -e '{dwc_rootport/tx_memory_write,lane=1/,dwc_rootport/rx_memory_read,lane=0/}' dd if=/dev/nvme0n1 of=/dev/null bs=1M count=1
Earlier the events wouldn't have been counted successfully.
Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
As per admin guide documentation, "rodata=on" should be the default on
platforms. Documentation/admin-guide/kernel-parameters.txt describes
these options as
rodata= [KNL,EARLY]
on Mark read-only kernel memory as read-only (default).
off Leave read-only kernel memory writable for debugging.
full Mark read-only kernel memory and aliases as read-only
[arm64]
But on arm64 platform, RODATA_FULL_DEFAULT_ENABLED is enabled by default,
so "rodata=full" is the default instead.
For parity with other architectures, namely x86, rework 'rodata=on' to
match the current "full" behaviour and replace 'rodata=full' with a new
'rodata=noalias' option which retains writable aliases in the direct map
for memory regions outside of the kernel image.
Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Commit 21d59d00221e4e ("xfs: remove deprecated sysctl knobs") moves
recently-removed sysctls to the removed sysctls table but fails to
extend the table, hence triggering Sphinx warning:
Documentation/admin-guide/xfs.rst:365: ERROR: Malformed table.
Text in column margin in table line 8.
============================= =======
Name Removed
============================= =======
fs.xfs.xfsbufd_centisec v4.0
fs.xfs.age_buffer_centisecs v4.0
fs.xfs.irix_symlink_mode v6.18
fs.xfs.irix_sgid_inherit v6.18
fs.xfs.speculative_cow_prealloc_lifetime v6.18
============================= ======= [docutils]
Extend "Name" column of the table to fit the now-longest sysctl, which
is fs.xfs.speculative_cow_prealloc_lifetime.
Fixes: 21d59d00221e ("xfs: remove deprecated sysctl knobs")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/linux-next/20250908180406.32124fb7@canb.auug.org.au/
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Ciphertext hiding prevents host accesses from reading the ciphertext of
SNP guest private memory. Instead of reading ciphertext, the host reads
will see constant default values (0xff).
The SEV ASID space is split into SEV and SEV-ES/SEV-SNP ASID ranges.
Enabling ciphertext hiding further splits the SEV-ES/SEV-SNP ASID space
into separate ASID ranges for SEV-ES and SEV-SNP guests.
Add a new off-by-default kvm-amd module parameter to enable ciphertext
hiding and allow the admin to configure the SEV-ES and SEV-SNP ASID
ranges. Simply cap the maximum SEV-SNP ASID as appropriate, i.e. don't
reject loading KVM or disable ciphertest hiding for a too-big value, as
KVM's general approach for module params is to sanitize inputs based on
hardware/kernel support, not burn the world down. This also allows the
admin to use -1u to assign all SEV-ES/SNP ASIDs to SNP without needing
dedicated handling in KVM.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/95abc49edfde36d4fb791570ea2a4be6ad95fd0d.1755721927.git.ashish.kalra@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a kernel command-line parameter to enable or disable the exposure of
the ABMC (Assignable Bandwidth Monitoring Counters) hardware feature to
resctrl.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
|
|
User reported current document about SYS_INFO_PANIC_CONSOLE_REPLAY is
confusing, that people could expect all user space console messages to be
replayed.
Specify that only 'kernel' messages will be replayed to solve the confusion.
Link: https://lkml.kernel.org/r/20250825025701.81921-3-feng.tang@linux.alibaba.com
Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com>
Reported-by: Askar Safin <safinaskar@zohomail.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Document addr_unit DAMON sysfs file on DAMON usage document.
Link: https://lkml.kernel.org/r/20250828171242.59810-10-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: ze zuo <zuoze1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This includes the PR_SET_THP_DISABLE/PR_GET_THP_DISABLE pair of prctl
calls as well the newly introduced PR_THP_DISABLE_EXCEPT_ADVISED flag for
the PR_SET_THP_DISABLE prctl call.
Link: https://lkml.kernel.org/r/20250815135549.130506-5-usamaarif642@gmail.com
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yafang <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use attack vector controls to select whether VMSCAPE requires mitigation,
similar to other bugs.
Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
|
|
Cross-merge networking fixes after downstream PR (net-6.17-rc6).
Conflicts:
net/netfilter/nft_set_pipapo.c
net/netfilter/nft_set_pipapo_avx2.c
c4eaca2e1052 ("netfilter: nft_set_pipapo: don't check genbit from packetpath lookups")
84c1da7b38d9 ("netfilter: nft_set_pipapo: use avx2 algorithm for insertions too")
Only trivial adjacent changes (in a doc and a Makefile).
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull vmescape mitigation fixes from Dave Hansen:
"Mitigate vmscape issue with indirect branch predictor flushes.
vmscape is a vulnerability that essentially takes Spectre-v2 and
attacks host userspace from a guest. It particularly affects
hypervisors like QEMU.
Even if a hypervisor may not have any sensitive data like disk
encryption keys, guest-userspace may be able to attack the
guest-kernel using the hypervisor as a confused deputy.
There are many ways to mitigate vmscape using the existing Spectre-v2
defenses like IBRS variants or the IBPB flushes. This series focuses
solely on IBPB because it works universally across vendors and all
vulnerable processors. Further work doing vendor and model-specific
optimizations can build on top of this if needed / wanted.
Do the normal issue mitigation dance:
- Add the CPU bug boilerplate
- Add a list of vulnerable CPUs
- Use IBPB to flush the branch predictors after running guests"
* tag 'vmscape-for-linus-20250904' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vmscape: Add old Intel CPUs to affected list
x86/vmscape: Warn when STIBP is disabled with SMT
x86/bugs: Move cpu_bugs_smt_update() down
x86/vmscape: Enable the mitigation
x86/vmscape: Add conditional IBPB mitigation
x86/vmscape: Enumerate VMSCAPE bug
Documentation/hw-vuln: Add VMSCAPE documentation
|
|
The ov6650 driver was introduced in v2.6.37 to support the OMAP1-based
Amstrad Delta video phone. The platform still has a board file in the
kernel, but support for the camera was dropped in commit ce548396a433
("media: mach-omap1: board-ams-delta.c: remove soc_camera dependencies")
in v5.9. The driver has been unused since as it has received neither
ACPI nor DT support.
The ov6650 driver is one of the last sensor drivers calling
clk_set_rate(). This is deprecated, and calls to the function are being
removed to avoid cargo-cult. As the driver is unlikely to ever be used
again, drop it instead of trying to avoid call clk_set_rate().
Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Reviewed-by: Mehdi Djait <mehdi.djait@linux.intel.com>
Signed-off-by: Hans Verkuil <hverkuil+cisco@kernel.org>
|
|
Replace hverkuil@xs4all.nl by hverkuil@kernel.org.
Signed-off-by: Hans Verkuil <hverkuil@kernel.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
|
|
Redundant data is used to enhance data fault tolerance, and the storage
method for redundant data vary depending on the RAID levels. And it's
important to maintain the consistency of redundant data.
Bitmap is used to record which data blocks have been synchronized and which
ones need to be resynchronized or recovered. Each bit in the bitmap
represents a segment of data in the array. When a bit is set, it indicates
that the multiple redundant copies of that data segment may not be
consistent. Data synchronization can be performed based on the bitmap after
power failure or readding a disk. If there is no bitmap, a full disk
synchronization is required.
Due to known performance issues with md-bitmap and the unreasonable
implementations:
- self-managed IO submitting like filemap_write_page();
- global spin_lock
I have decided not to continue optimizing based on the current bitmap
implementation, this new bitmap is invented without locking from IO fast
path and can be used with fast disks.
For designs and details, see the comments in drivers/md-llbitmap.c.
Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-12-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Li Nan <linan122@huawei.com>
|
|
Currently bitmap_ops is registered while allocating mddev, this is fine
when there is only one bitmap_ops.
Delay setting bitmap_ops until creating bitmap, so that user can choose
which bitmap to use before running the array.
Link: https://lore.kernel.org/linux-raid/20250721171557.34587-7-yukuai@kernel.org
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
|
|
The api will be used by mdadm to set bitmap_type while creating new array
or assembling array, prepare to add a new bitmap.
Currently available options are:
cat /sys/block/md0/md/bitmap_type
none [bitmap]
Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-6-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Li Nan <linan122@huawei.com>
|
|
Pull for-6.17-fixes to receive 79f919a89c9d ("cgroup: split
cgroup_destroy_wq into 3 workqueues") to resolve its conflict with
7fa33aa3b001 ("cgroup: WQ_PERCPU added to alloc_workqueue users"). The
latter adds WQ_PERCPU when creating cgroup_destroy_wq and the former splits
the workqueue into three. Resolve by applying WQ_PERCPU to the three split
workqueues.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
These sysctl knobs were scheduled for removal in September 2025. That
time has come, so remove them.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
|
|
These four mount options were scheduled for removal in September 2025,
so remove them now.
Cc: preichl@redhat.com
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
|
|
We promised to turn off these old features by default in September 2025.
Do so now.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
|
|
Cross-merge networking fixes after downstream PR (net-6.17-rc5).
No conflicts.
Adjacent changes:
include/net/sock.h
c51613fa276f ("net: add sk->sk_drop_counters")
5d6b58c932ec ("net: lockless sock_i_ino()")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add "debug" option for "cfi=" bootparam to get details on early CFI
initialization steps so future Kees can find breakage easier.
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250904034656.3670313-5-kees@kernel.org
|
|
The kernel-parameters.txt didn't have a section for the cfi= options.
Add it.
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20250904034656.3670313-3-kees@kernel.org
|
|
Instead of adding ad-hoc debugging glue to the microcode loader each
time I need it, add debugging functionality which is not built by
default.
Simulate all patch handling the loader does except the actual loading of
the microcode patch into the hardware.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250820135043.19048-3-bp@kernel.org
|
|
Add a "microcode=" command line argument after which all options can be
passed in a comma-separated list.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com>
Link: https://lore.kernel.org/20250820135043.19048-2-bp@kernel.org
|
|
There is an obvious mistake in nfsroot.rst where pxelinux was wrongly
written as pxeliunx. Fix it.
Signed-off-by: Zenghui Yu <zenghui.yu@linux.dev>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20250827144738.43361-1-zenghui.yu@linux.dev
|
|
Introduce a mechanism to detect and warn about prolonged interrupt handlers.
With a new command-line parameter (irqhandler.duration_warn_us=), users can
configure the duration threshold in microseconds when a warning in such
format should be emitted:
"[CPU14] long duration of IRQ[159:bad_irq_handler [long_irq]], took: 1330 us"
The implementation uses local_clock() to measure the execution duration of the
generic IRQ per-CPU event handler.
Signed-off-by: Wladislav Wiebe <wladislav.wiebe@nokia.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jiri Slaby <jirislaby@kernel.org>
Link: https://lore.kernel.org/all/20250804093525.851-1-wladislav.wiebe@nokia.com
|
|
Changed "write-throuth" to "write-through".
Signed-off-by: Vivek Alurkar <primalkenja@gmail.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20250821051622.8341-2-primalkenja@gmail.com
|
|
Cross-merge networking fixes after downstream PR (net-6.17-rc4).
No conflicts.
Adjacent changes:
drivers/net/ethernet/intel/idpf/idpf_txrx.c
02614eee26fb ("idpf: do not linearize big TSO packets")
6c4e68480238 ("idpf: remove obsolete stashing code")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Attack vector controls for SSB were missed in the initial attack vector series.
The default mitigation for SSB requires user-space opt-in so it is only
relevant for user->user attacks. Check with attack vector controls when
the command is auto - i.e., no explicit user selection has been done.
Fixes: 2d31d2874663 ("x86/bugs: Define attack vectors relevant for each bug")
Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250819192200.2003074-5-david.kaplan@amd.com
|
|
This patch introduces dm-pcache, a new DM target that places a DAX-
capable persistent-memory device in front of any slower block device and
uses it as a high-throughput, low-latency cache.
Design highlights
-----------------
- DAX data path – data is copied directly between DRAM and the pmem
mapping, bypassing the block layer’s overhead.
- Segmented, crash-consistent layout
- all layout metadata are dual-replicated CRC-protected.
- atomic kset flushes; key replay on mount guarantees cache integrity
even after power loss.
- Striped multi-tree index
- Multi‑tree indexing for high parallelism.
- overlap-resolution logic ensures non-intersecting cached extents.
- Background services
- write-back worker flushes dirty keys in order, preserving backing-device
crash consistency. This is important for checkpoint in cloud storage.
- garbage collector reclaims clean segments when utilisation exceeds a
tunable threshold.
- Data integrity – optional CRC32 on cached payload; metadata always protected.
Comparison with existing block-level caches
---------------------------------------------------------------------------------------------------------------------------------
| Feature | pcache (this patch) | bcache | dm-writecache |
|----------------------------------|---------------------------------|------------------------------|---------------------------|
| pmem access method | DAX | bio (block I/O) | DAX |
| Write latency (4 K rand-write) | ~5 µs | ~20 µs | ~5 µs |
| Concurrency | multi subtree index | global index tree | single tree + wc_lock |
| IOPS (4K randwrite, 32 numjobs) | 2.1 M | 352 K | 283 K |
| Read-cache support | YES | YES | NO |
| Deployment | no re-format of backend | backend devices must be | no re-format of backend |
| | | reformatted | |
| Write-back ordering | log-structured; | no ordering guarantee | no ordering guarantee |
| | preserves app-IO-order | | |
| Data integrity checks | metadata + data CRC(optional) | metadata CRC only | none |
---------------------------------------------------------------------------------------------------------------------------------
Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|