linux-toradex.git/kernel/fork.c, branch v6.4-rc1

Merge tag 'iommu-updates-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

2023-04-30T20:00:38+00:00

Pull iommu updates from Joerg Roedel:

 - Convert to platform remove callback returning void

 - Extend changing default domain to normal group

 - Intel VT-d updates:
     - Remove VT-d virtual command interface and IOASID
     - Allow the VT-d driver to support non-PRI IOPF
     - Remove PASID supervisor request support
     - Various small and misc cleanups

 - ARM SMMU updates:
     - Device-tree binding updates:
         * Allow Qualcomm GPU SMMUs to accept relevant clock properties
         * Document Qualcomm 8550 SoC as implementing an MMU-500
         * Favour new "qcom,smmu-500" binding for Adreno SMMUs

     - Fix S2CR quirk detection on non-architectural Qualcomm SMMU
       implementations

     - Acknowledge SMMUv3 PRI queue overflow when consuming events

     - Document (in a comment) why ATS is disabled for bypass streams

 - AMD IOMMU updates:
     - 5-level page-table support
     - NUMA awareness for memory allocations

 - Unisoc driver: Support for reattaching an existing domain

 - Rockchip driver: Add missing set_platform_dma_ops callback

 - Mediatek driver: Adjust the dma-ranges

 - Various other small fixes and cleanups

* tag 'iommu-updates-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (82 commits)
  iommu: Remove iommu_group_get_by_id()
  iommu: Make iommu_release_device() static
  iommu/vt-d: Remove BUG_ON in dmar_insert_dev_scope()
  iommu/vt-d: Remove a useless BUG_ON(dev->is_virtfn)
  iommu/vt-d: Remove BUG_ON in map/unmap()
  iommu/vt-d: Remove BUG_ON when domain->pgd is NULL
  iommu/vt-d: Remove BUG_ON in handling iotlb cache invalidation
  iommu/vt-d: Remove BUG_ON on checking valid pfn range
  iommu/vt-d: Make size of operands same in bitwise operations
  iommu/vt-d: Remove PASID supervisor request support
  iommu/vt-d: Use non-privileged mode for all PASIDs
  iommu/vt-d: Remove extern from function prototypes
  iommu/vt-d: Do not use GFP_ATOMIC when not needed
  iommu/vt-d: Remove unnecessary checks in iopf disabling path
  iommu/vt-d: Move PRI handling to IOPF feature path
  iommu/vt-d: Move pfsid and ats_qdep calculation to device probe path
  iommu/vt-d: Move iopf code from SVA to IOPF enabling path
  iommu/vt-d: Allow SVA with device-specific IOPF
  dmaengine: idxd: Add enable/disable device IOPF feature
  arm64: dts: mt8186: Add dma-ranges for the parent "soc" node
  ...

Merge tag 'trace-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

2023-04-28T22:57:53+00:00

Pull tracing updates from Steven Rostedt:

- User events are finally ready!

After lots of collaboration between various parties, we finally
locked down on a stable interface for user events that can also work
with user space only tracing.

This is implemented by telling the kernel (or user space library, but
that part is user space only and not part of this patch set), where
the variable is that the application uses to know if something is
listening to the trace.

There's also an interface to tell the kernel about these events,
which will show up in the /sys/kernel/tracing/events/user_events/
directory, where it can be enabled.

When it's enabled, the kernel will update the variable, to tell the
application to start writing to the kernel.

See https://lwn.net/Articles/927595/

- Cleaned up the direct trampolines code to simplify arm64 addition of
direct trampolines.

Direct trampolines use the ftrace interface but instead of jumping to
the ftrace trampoline, applications (mostly BPF) can register their
own trampoline for performance reasons.

- Some updates to the fprobe infrastructure. fprobes are more efficient
than kprobes, as it does not need to save all the registers that
kprobes on ftrace do. More work needs to be done before the fprobes
will be exposed as dynamic events.

- More updates to references to the obsolete path of
/sys/kernel/debug/tracing for the new /sys/kernel/tracing path.

- Add a seq_buf_do_printk() helper to seq_bufs, to print a large buffer
line by line instead of all at once.

There are users in production kernels that have a large data dump
that originally used printk() directly, but the data dump was larger
than what printk() allowed as a single print.

Using seq_buf() to do the printing fixes that.

- Add /sys/kernel/tracing/touched_functions that shows all functions
that was every traced by ftrace or a direct trampoline. This is used
for debugging issues where a traced function could have caused a
crash by a bpf program or live patching.

- Add a "fields" option that is similar to "raw" but outputs the fields
of the events. It's easier to read by humans.

- Some minor fixes and clean ups.

* tag 'trace-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (41 commits)
ring-buffer: Sync IRQ works before buffer destruction
tracing: Add missing spaces in trace_print_hex_seq()
ring-buffer: Ensure proper resetting of atomic variables in ring_buffer_reset_online_cpus
recordmcount: Fix memory leaks in the uwrite function
tracing/user_events: Limit max fault-in attempts
tracing/user_events: Prevent same address and bit per process
tracing/user_events: Ensure bit is cleared on unregister
tracing/user_events: Ensure write index cannot be negative
seq_buf: Add seq_buf_do_printk() helper
tracing: Fix print_fields() for __dyn_loc/__rel_loc
tracing/user_events: Set event filter_type from type
ring-buffer: Clearly check null ptr returned by rb_set_head_page()
tracing: Unbreak user events
tracing/user_events: Use print_format_fields() for trace output
tracing/user_events: Align structs with tabs for readability
tracing/user_events: Limit global user_event count
tracing/user_events: Charge event allocs to cgroups
tracing/user_events: Update documentation for ABI
tracing/user_events: Use write ABI in example
tracing/user_events: Add ABI self-test
...

Merge tag 'sched-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2023-04-28T21:53:30+00:00

Pull scheduler updates from Ingo Molnar:

 - Allow unprivileged PSI poll()ing

 - Fix performance regression introduced by mm_cid

 - Improve livepatch stalls by adding livepatch task switching to
   cond_resched(). This resolves livepatching busy-loop stalls with
   certain CPU-bound kthreads

 - Improve sched_move_task() performance on autogroup configs

 - On core-scheduling CPUs, avoid selecting throttled tasks to run

 - Misc cleanups, fixes and improvements

* tag 'sched-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/clock: Fix local_clock() before sched_clock_init()
  sched/rt: Fix bad task migration for rt tasks
  sched: Fix performance regression introduced by mm_cid
  sched/core: Make sched_dynamic_mutex static
  sched/psi: Allow unprivileged polling of N*2s period
  sched/psi: Extract update_triggers side effect
  sched/psi: Rename existing poll members in preparation
  sched/psi: Rearrange polling code in preparation
  sched/fair: Fix inaccurate tally of ttwu_move_affine
  vhost: Fix livepatch timeouts in vhost_worker()
  livepatch,sched: Add livepatch task switching to cond_resched()
  livepatch: Skip task_call_func() for current task
  livepatch: Convert stack entries array to percpu
  sched: Interleave cfs bandwidth timers for improved single thread performance at low utilization
  sched/core: Reduce cost of sched_move_task when config autogroup
  sched/core: Avoid selecting the task that is throttled to run when core-sched enable
  sched/topology: Make sched_energy_mutex,update static

Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

2023-04-28T02:42:02+00:00

Pull MM updates from Andrew Morton:

- Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.

- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
Raghav.

- zsmalloc performance improvements from Sergey Senozhatsky.

- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.

- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page()
- make __filemap_get_folio()'s return value more useful

- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.

- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.

- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).

- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.

- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
rather than exclusive, and has fixed a bunch of errors which were
caused by its unintuitive meaning.

- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.

- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.

- Cleanups to do_fault_around() by Lorenzo Stoakes.

- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.

- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.

- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.

- Lorenzo has also contributed further cleanups of vma_merge().

- Chaitanya Prakash provides some fixes to the mmap selftesting code.

- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.

- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.

- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.

- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.

- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.

- Yosry Ahmed has improved the performance of memcg statistics
flushing.

- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.

- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.

- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.

- Pankaj Raghav has made page_endio() unneeded and has removed it.

- Peter Xu contributed some rationalizations of the userfaultfd
selftests.

- Yosry Ahmed has fixed an issue around memcg's page recalim
accounting.

- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.

- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.

- Peter Xu fixes a few issues with uffd-wp at fork() time.

- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.

* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
shmem: restrict noswap option to initial user namespace
mm/khugepaged: fix conflicting mods to collapse_file()
sparse: remove unnecessary 0 values from rc
mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
maple_tree: fix allocation in mas_sparse_area()
mm: do not increment pgfault stats when page fault handler retries
zsmalloc: allow only one active pool compaction context
selftests/mm: add new selftests for KSM
mm: add new KSM process and sysfs knobs
mm: add new api to enable ksm per process
mm: shrinkers: fix debugfs file permissions
mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
migrate_pages_batch: fix statistics for longterm pin retry
userfaultfd: use helper function range_in_vma()
lib/show_mem.c: use for_each_populated_zone() simplify code
mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
fs/buffer: convert create_page_buffers to folio_create_buffers
fs/buffer: add folio_create_empty_buffers helper
...

Merge tag 'v6.4/pidfd.file' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

2023-04-24T20:03:42+00:00

Pull pidfd updates from Christian Brauner:
 "This adds a new pidfd_prepare() helper which allows the caller to
  reserve a pidfd number and allocates a new pidfd file that stashes the
  provided struct pid.

  It should be avoided installing a file descriptor into a task's file
  descriptor table just to close it again via close_fd() in case an
  error occurs. The fd has been visible to userspace and might already
  be in use. Instead, a file descriptor should be reserved but not
  installed into the caller's file descriptor table.

  If another failure path is hit then the reserved file descriptor and
  file can just be put without any userspace visible side-effects. And
  if all failure paths are cleared the file descriptor and file can be
  installed into the task's file descriptor table.

  This helper is now used in all places that open coded this
  functionality before. For example, this is currently done during
  copy_process() and fanotify used pidfd_create(), which returns a pidfd
  that has already been made visibile in the caller's file descriptor
  table, but then closed it using close_fd().

  In one of the next merge windows there is also new functionality
  coming to unix domain sockets that will have to rely on
  pidfd_prepare()"

* tag 'v6.4/pidfd.file' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  fanotify: use pidfd_prepare()
  fork: use pidfd_prepare()
  pid: add pidfd_prepare()

Merge tag 'v6.4/kernel.user_worker' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

2023-04-24T19:52:35+00:00

Pull user work thread updates from Christian Brauner:
"This contains the work generalizing the ability to create a kernel
worker from a userspace process.

Such user workers will run with the same credentials as the userspace
process they were created from providing stronger security and
accounting guarantees than the traditional override_creds() approach
ever could've hoped for.

The original work was heavily based and optimzed for the needs of
io_uring which was the first user. However, as it quickly turned out
the ability to create user workers inherting properties from a
userspace process is generally useful.

The vhost subsystem currently creates workers using the kthread api.
The consequences of using the kthread api are that RLIMITs don't work
correctly as they are inherited from khtreadd. This leads to bugs
where more workers are created than would be allowed by the RLIMITs of
the userspace process in lieu of which workers are created.

Problems like this disappear with user workers created from the
userspace processes for which they perform the work. In addition,
providing this api allows vhost to remove additional complexity. For
example, cgroup and mm sharing will just work out of the box with user
workers based on the relevant userspace process instead of manually
ensuring the correct cgroup and mm contexts are used.

So the vhost subsystem should simply be made to use the same mechanism
as io_uring. To this end the original mechanism used for
create_io_thread() is generalized into user workers:

- Introduce PF_USER_WORKER as a generic indicator that a given task
is a user worker, i.e., a kernel task that was created from a
userspace process. Now a PF_IO_WORKER thread is just a specialized
version of PF_USER_WORKER. So io_uring io workers raise both flags.

- Make copy_process() available to core kernel code

- Extend struct kernel_clone_args with the following bitfields
allowing to indicate to copy_process():
- to create a user worker (raise PF_USER_WORKER)
- to not inherit any files from the userspace process
- to ignore signals

After all generic changes are in place the vhost subsystem implements
a new dedicated vhost api based on user workers. Finally, vhost is
switched to rely on the new api moving it off of kthreads.

Thanks to Mike for sticking it out and making it through this rather
arduous journey"

* tag 'v6.4/kernel.user_worker' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
vhost: use vhost_tasks for worker threads
vhost: move worker thread fields to new struct
vhost_task: Allow vhost layer to use copy_process
fork: allow kernel code to call copy_process
fork: Add kernel_clone_args flag to ignore signals
fork: add kernel_clone_args flag to not dup/clone files
fork/vm: Move common PF_IO_WORKER behavior to new flag
kernel: Make io_thread and kthread bit fields
kthread: Pass in the thread's name during creation
kernel: Allow a kernel thread's name to be set in copy_process
csky: Remove kernel_thread declaration

sched: Fix performance regression introduced by mm_cid

2023-04-21T11:24:20+00:00

Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
sysbench regression reported by Aaron Lu.

Keep track of the currently allocated mm_cid for each mm/cpu rather than
freeing them immediately on context switch. This eliminates most atomic
operations when context switching back and forth between threads
belonging to different memory spaces in multi-threaded scenarios (many
processes, each with many threads). The per-mm/per-cpu mm_cid values are
serialized by their respective runqueue locks.

Thread migration is handled by introducing invocation to
sched_mm_cid_migrate_to() (with destination runqueue lock held) in
activate_task() for migrating tasks. If the destination cpu's mm_cid is
unset, and if the source runqueue is not actively using its mm_cid, then
the source cpu's mm_cid is moved to the destination cpu on migration.

Introduce a task-work executed periodically, similarly to NUMA work,
which delays reclaim of cid values when they are unused for a period of
time.

Keep track of the allocation time for each per-cpu cid, and let the task
work clear them when they are observed to be older than
SCHED_MM_CID_PERIOD_NS and unused. This task work also clears all
mm_cids which are greater or equal to the Hamming weight of the mm
cidmask to keep concurrency ids compact.

Because we want to ensure the mm_cid converges towards the smaller
values as migrations happen, the prior optimization that was done when
context switching between threads belonging to the same mm is removed,
because it could delay the lazy release of the destination runqueue
mm_cid after it has been replaced by a migration. Removing this prior
optimization is not an issue performance-wise because the introduced
per-mm/per-cpu mm_cid tracking also covers this more specific case.

Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID")
Reported-by: Aaron Lu 
Signed-off-by: Mathieu Desnoyers 
Signed-off-by: Peter Zijlstra (Intel) 
Tested-by: Aaron Lu 
Link: https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/

sync mm-stable with mm-hotfixes-stable to pick up depended-upon upstream changes

2023-04-18T21:53:49+00:00

mm: fix memory leak on mm_init error handling

2023-04-18T21:22:12+00:00

commit f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
introduces a memory leak by missing a call to destroy_context() when a
percpu_counter fails to allocate.

Before introducing the per-cpu counter allocations, init_new_context() was
the last call that could fail in mm_init(), and thus there was no need to
ever invoke destroy_context() in the error paths.  Adding the following
percpu counter allocations adds error paths after init_new_context(),
which means its associated destroy_context() needs to be called when
percpu counters fail to allocate.

Link: https://lkml.kernel.org/r/20230330133822.66271-1-mathieu.desnoyers@efficios.com
Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
Signed-off-by: Mathieu Desnoyers 
Acked-by: Shakeel Butt 
Cc: Marek Szyprowski 
Cc: 
Signed-off-by: Andrew Morton

sync mm-stable with mm-hotfixes-stable to pick up depended-upon upstream changes

2023-04-16T19:31:58+00:00