linux-toradex.git/fs/proc/task_nommu.c, branch v6.7

Merge tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

2023-10-30T19:14:19+00:00

Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual fses.

Features:

- Rename and export helpers that get write access to a mount. They
are used in overlayfs to get write access to the upper mount.

- Print the pretty name of the root device on boot failure. This
helps in scenarios where we would usually only print
"unknown-block(1,2)".

- Add an internal SB_I_NOUMASK flag. This is another part in the
endless POSIX ACL saga in a way.

When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip
the umask because if the relevant inode has POSIX ACLs set it might
take the umask from there. But if the inode doesn't have any POSIX
ACLs set then we apply the umask in the filesystem itself. So we end
up with:

(1) no SB_POSIXACL -> strip umask in vfs
(2) SB_POSIXACL -> strip umask in filesystem

The umask semantics associated with SB_POSIXACL allowed filesystems
that don't even support POSIX ACLs at all to raise SB_POSIXACL
purely to avoid umask stripping. That specifically means NFS v4 and
Overlayfs. NFS v4 does it because it delegates this to the server
and Overlayfs because it needs to delegate umask stripping to the
upper filesystem, i.e., the filesystem used as the writable layer.

This went so far that SB_POSIXACL is raised even on kernels that
don't even have POSIX ACL support at all.

Stop this blatant abuse and add SB_I_NOUMASK which is an internal
superblock flag that filesystems can raise to opt out of umask
handling. That should really only be the two mentioned above. It's
not that we want any filesystems to do this. Ideally we have all
umask handling always in the vfs.

- Make overlayfs use SB_I_NOUMASK too.

- Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in
IS_POSIXACL() if the kernel doesn't have support for it. This is a
very old patch but it's only possible to do this now with the wider
cleanup that was done.

- Follow-up work on fake path handling from last cycle. Citing mostly
from Amir:

When overlayfs was first merged, overlayfs files of regular files
and directories, the ones that are installed in file table, had a
"fake" path, namely, f_path is the overlayfs path and f_inode is
the "real" inode on the underlying filesystem.

In v6.5, we took another small step by introducing the
backing_file container and the file_real_path() helper. This change
allowed vfs and filesystem code to get the "real" path of an
overlayfs backing file. With this change, we were able to make
fsnotify work correctly and report events on the "real" filesystem
objects that were accessed via overlayfs.

This method works fine, but it still leaves the vfs vulnerable to
new code that is not aware of files with fake path. A recent
example is commit db1d1e8b9867 ("IMA: use vfs_getattr_nosec to get
the i_version"). This commit uses direct referencing to f_path in
IMA code that otherwise uses file_inode() and file_dentry() to
reference the filesystem objects that it is measuring.

This contains work to switch things around: instead of having
filesystem code opt-in to get the "real" path, have generic code
opt-in for the "fake" path in the few places that it is needed.

It is far more likely that new filesystems code that does not use
the file_dentry() and file_real_path() helpers will end up causing
crashes or averting LSM/audit rules if we keep the "fake" path
exposed by default.

This change already makes file_dentry() moot, but for now we did
not change this helper; we just added a WARN_ON() in ovl_d_real() to
catch if we have made any wrong assumptions.

After the dust settles on this change, we can make file_dentry() a
plain accessor and we can drop the inode argument to ->d_real().

- Switch struct file to SLAB_TYPESAFE_BY_RCU. This looks like a small
change but it really isn't and I would like to see everyone on
their tippie toes for any possible bugs from this work.

Essentially we've been doing most of what SLAB_TYPESAFE_BY_RCU does
for files for a very long time because of the nasty interactions
with the SCM_RIGHTS file descriptor garbage collection. So
extending it makes a lot of sense but it is a subtle change. There
are almost no places that fiddle with file rcu semantics directly
and the ones that did mess around with struct file internal under
rcu have been made to stop doing that because it really was always
dodgy.

I forgot to put in the link tag for this change and the discussion
in the commit so adding it into the merge message:

https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com

Cleanups:

- Various smaller pipe cleanups including the removal of a spin lock
that was only used to protect against writes without pipe_lock()
from O_NOTIFICATION_PIPE aka watch queues. As that was never
implemented remove the additional locking from pipe_write().

- Annotate struct watch_filter with the new __counted_by attribute.

- Clarify do_unlinkat() cleanup so that it doesn't look like an extra
iput() is done that would cause issues.

- Simplify file cleanup when the file has never been opened.

- Use module helper instead of open-coding it.

- Predict error unlikely for stale retry.

- Use WRITE_ONCE() for mount expiry field instead of just commenting
that one hopes the compiler doesn't get smart.

Fixes:

- Fix readahead on block devices.

- Fix writeback when lazytime is enabled and inodes whose timestamp
is the only thing that changed reside on wb->b_dirty_time. This
caused excessively large zombie memory cgroup when lazytime was
enabled as such inodes weren't handled fast enough.

- Convert BUG_ON() to WARN_ON_ONCE() in open_last_lookups()"

* tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (26 commits)
file, i915: fix file reference for mmap_singleton()
vfs: Convert BUG_ON to WARN_ON_ONCE in open_last_lookups
writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs
chardev: Simplify usage of try_module_get()
ovl: rely on SB_I_NOUMASK
fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n
fs: store real path instead of fake path in backing file f_path
fs: create helper file_user_path() for user displayed mapped file path
fs: get mnt_writers count for an open backing file's real path
vfs: stop counting on gcc not messing with mnt_expiry_mark if not asked
vfs: predict the error in retry_estale as unlikely
backing file: free directly
vfs: fix readahead(2) on block devices
io_uring: use files_lookup_fd_locked()
file: convert to SLAB_TYPESAFE_BY_RCU
vfs: shave work on failed file open
fs: simplify misleading code to remove ambiguity regarding ihold()/iput()
watch_queue: Annotate struct watch_filter with __counted_by
fs/pipe: use spinlock in pipe_read() only if there is a watch_queue
fs/pipe: remove unnecessary spinlock from pipe_write()
...
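
The umask rule from the SB_I_NOUMASK item in the merge message above can be modelled in user space. This is only a sketch under stated assumptions: `struct sb_model`, the `MODEL_*` flag bits, and both helper functions are hypothetical stand-ins and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for illustration; not the kernel's values. */
#define MODEL_SB_POSIXACL   (1u << 0)	/* s_flags: POSIX ACLs enabled */
#define MODEL_SB_I_NOUMASK  (1u << 0)	/* s_iflags: opt out of umask handling */

struct sb_model {
	unsigned int s_flags;
	unsigned int s_iflags;
};

/* True when the generic layer itself should strip the umask. */
static bool model_vfs_strips_umask(const struct sb_model *sb)
{
	assert(sb != NULL);
	if (sb->s_iflags & MODEL_SB_I_NOUMASK)
		return false;	/* delegated, e.g. to a server or an upper layer */
	if (sb->s_flags & MODEL_SB_POSIXACL)
		return false;	/* the filesystem strips it after checking ACLs */
	return true;
}

/* Mode as it would be handed on to the filesystem in this model. */
static unsigned int model_prepare_mode(const struct sb_model *sb,
				       unsigned int mode, unsigned int umask)
{
	return model_vfs_strips_umask(sb) ? (mode & ~umask) : mode;
}
```

With a umask of 022 and a requested mode of 0666, only a superblock with neither flag set gets the umask applied by the generic layer in this model; the other two cases pass the mode through unchanged.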

fs: create helper file_user_path() for user displayed mapped file path

2023-10-19T09:03:15+00:00

Overlayfs uses backing files with "fake" overlayfs f_path and "real"
underlying f_inode, in order to use underlying inode aops for mapped
files and to display the overlayfs path in /proc/<pid>/maps.

In preparation for storing the overlayfs "fake" path instead of the
underlying "real" path in struct backing_file, define a noop helper
file_user_path() that returns f_path for now.

Use the new helper in procfs and kernel logs whenever a path of a
mapped file is displayed to users.

Signed-off-by: Amir Goldstein 
Link: https://lore.kernel.org/r/20231009153712.1566422-3-amir73il@gmail.com
Signed-off-by: Christian Brauner

proc: nommu: fix empty /proc/<pid>/maps

2023-09-19T20:21:34+00:00

On no-MMU, /proc/<pid>/maps reads as an empty file.  This happens because
find_vma(mm, 0) always returns NULL (assuming no vma actually contains the
zero address, which is normally the case).

To fix this bug and improve the maintainability in the future, this patch
makes the no-MMU implementation as similar as possible to the MMU
implementation.

The only remaining differences are the lack of hold/release_task_mempolicy
and the extra code to shoehorn the gate vma into the iterator.

This has been tested on top of 6.5.3 on an STM32F746.

Link: https://lkml.kernel.org/r/20230915160055.971059-2-ben.wolsieffer@hefring.com
Fixes: 0c563f148043 ("proc: remove VMA rbtree use from nommu")
Signed-off-by: Ben Wolsieffer 
Cc: Davidlohr Bueso 
Cc: Giulio Benetti 
Cc: Liam R. Howlett 
Cc: Matthew Wilcox (Oracle) 
Cc: Oleg Nesterov 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton
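
The empty-file symptom described above can be reproduced with a small user-space model. This is a sketch, not the patch itself: `struct vma_model` and the three `model_*` functions are hypothetical, and the only behaviour taken from the commit message is that a lookup at address 0 returns NULL unless some mapping contains address 0.

```c
#include <assert.h>
#include <stddef.h>

struct vma_model {
	unsigned long start, end;	/* half-open range [start, end) */
};

/* Lookup that matches only a mapping containing addr. */
static const struct vma_model *
model_find_containing(const struct vma_model *v, size_t n, unsigned long addr)
{
	assert(v != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (v[i].start <= addr && addr < v[i].end)
			return &v[i];
	return NULL;
}

/* Iterator-style lookup: first mapping ending above addr (v sorted). */
static const struct vma_model *
model_find_first_from(const struct vma_model *v, size_t n, unsigned long addr)
{
	assert(v != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (v[i].end > addr)
			return &v[i];
	return NULL;
}

/* Count the mappings a listing would show when it starts at address 0. */
static size_t model_count_listed(const struct vma_model *v, size_t n)
{
	size_t count = 0;
	const struct vma_model *cur = model_find_first_from(v, n, 0);

	while (cur != NULL) {
		count++;
		cur = model_find_first_from(v, n, cur->end);
	}
	return count;
}
```

Starting the walk with the "containing" lookup at address 0 yields nothing for a typical layout, which matches the empty output; starting with the iterator-style lookup visits every mapping.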

proc: nommu: /proc/<pid>/maps: release mmap read lock

2023-09-19T20:21:34+00:00

The no-MMU implementation of /proc/<pid>/maps doesn't normally release
the mmap read lock, because it uses !IS_ERR_OR_NULL(_vml) to determine
whether to release the lock.  Since _vml is NULL when the end of the
mappings is reached, the lock is not released.

Reading /proc/1/maps twice doesn't cause a hang because it only
takes the read lock, which can be taken multiple times and therefore
doesn't show any problem if the lock isn't released. Instead, you need
to perform some operation that attempts to take the write lock after
reading /proc/<pid>/maps. To actually reproduce the bug, compile the
following code as 'proc_maps_bug':

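#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int argc, char *argv[]) {
        void *buf;
        /* Give the shell time to read /proc/<pid>/maps first. */
        sleep(1);
        /* mmap() needs the write lock, so it blocks if the read lock leaked. */
        buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        puts("mmap returned");
        return 0;
}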

Then, run:

  ./proc_maps_bug & cat /proc/$!/maps; fg

Without this patch, mmap() will hang and the command will never
complete.
	
This code was incorrectly adapted from the MMU implementation, which at
the time released the lock in m_next() before returning the last entry.

The MMU implementation has diverged further from the no-MMU version since
then, so this patch brings their locking and error handling into sync,
fixing the bug and hopefully avoiding similar issues in the future.

Link: https://lkml.kernel.org/r/20230914163019.4050530-2-ben.wolsieffer@hefring.com
Fixes: 47fecca15c09 ("fs/proc/task_nommu.c: don't use priv->task->mm")
Signed-off-by: Ben Wolsieffer 
Acked-by: Oleg Nesterov 
Cc: Giulio Benetti 
Cc: Greg Ungerer 
Cc: 
Signed-off-by: Andrew Morton
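
The locking mistake described in this commit message can be illustrated with a single-threaded model. It is a sketch under assumptions: the read lock is reduced to a counter, every name below is hypothetical, and the buggy stop step checks only for NULL where the message mentions !IS_ERR_OR_NULL().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mm_model {
	int readers;	/* outstanding read locks */
};

static void model_read_lock(struct mm_model *mm)
{
	mm->readers++;
}

static void model_read_unlock(struct mm_model *mm)
{
	assert(mm->readers > 0);
	mm->readers--;
}

/* Buggy stop step: unlocks only if the last entry pointer is non-NULL. */
static void model_stop_buggy(struct mm_model *mm, const void *v)
{
	if (v != NULL)
		model_read_unlock(mm);
}

/* Fixed stop step: unlocks whenever the start step took the lock. */
static void model_stop_fixed(struct mm_model *mm, bool locked)
{
	if (locked)
		model_read_unlock(mm);
}

/* Read all n entries; returns the number of read locks still held. */
static int model_read_maps(size_t n, bool fixed)
{
	static const int entry = 0;
	struct mm_model mm = { 0 };
	size_t pos = 0;
	const void *v;

	model_read_lock(&mm);			/* start step */
	v = pos < n ? &entry : NULL;
	while (v != NULL) {			/* show/next steps */
		pos++;
		v = pos < n ? &entry : NULL;	/* NULL once the end is reached */
	}
	if (fixed)
		model_stop_fixed(&mm, true);
	else
		model_stop_buggy(&mm, v);
	return mm.readers;
}
```

Because the entry pointer is NULL exactly when the walk has finished, the buggy variant leaves one read lock behind after every complete read, and a later writer would wait on it forever.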

mm: factor out VMA stack and heap checks

2023-08-21T20:37:31+00:00

Patch series "mm: convert to vma_is_initial_heap/stack()", v3.

Add vma_is_initial_stack() and vma_is_initial_heap() helpers and use them
to simplify code.


This patch (of 4):

Factor out VMA stack and heap checks and name them vma_is_initial_stack()
and vma_is_initial_heap() for general use.

Link: https://lkml.kernel.org/r/20230728050043.59880-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20230728050043.59880-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang 
Reviewed-by: David Hildenbrand 
Acked-by: Peter Zijlstra (Intel) 
Cc: Christian Göttsche 
Cc: Alex Deucher 
Cc: Arnaldo Carvalho de Melo 
Cc: Christian Göttsche 
Cc: Christian König 
Cc: Daniel Vetter 
Cc: David Airlie 
Cc: Eric Paris 
Cc: Felix Kuehling 
Cc: "Pan, Xinhui" 
Cc: Paul Moore 
Cc: Stephen Smalley 
Signed-off-by: Andrew Morton
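
A user-space sketch of what such helpers might look like follows. The struct layouts are simplified stand-ins, and the exact range comparisons are an assumption based on the usual "does this mapping overlap the brk area / contain the initial stack pointer" checks; the commit message itself does not spell them out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (assumption). */
struct mm_model {
	unsigned long start_brk, brk;	/* initial heap range */
	unsigned long start_stack;	/* initial stack address */
};

struct vma_model {
	unsigned long vm_start, vm_end;
	const struct mm_model *vm_mm;
};

/* True if the mapping overlaps the [start_brk, brk] heap range. */
static bool model_vma_is_initial_heap(const struct vma_model *vma)
{
	assert(vma->vm_mm != NULL);
	return vma->vm_start <= vma->vm_mm->brk &&
	       vma->vm_end >= vma->vm_mm->start_brk;
}

/* True if the mapping covers the initial stack address. */
static bool model_vma_is_initial_stack(const struct vma_model *vma)
{
	assert(vma->vm_mm != NULL);
	return vma->vm_start <= vma->vm_mm->start_stack &&
	       vma->vm_end >= vma->vm_mm->start_stack;
}
```

A caller that labels mappings (for example as "[heap]" or "[stack]") can then use the two predicates instead of repeating the comparisons inline.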

mm: nommu: correct the range of mmap_sem_read_lock in task_mem()

2023-06-23T23:59:32+00:00

During the seq_printf, the mmap_sem_read_lock protection is not
required.

Link: https://lkml.kernel.org/r/20230622040152.1173-1-lipeifeng@oppo.com
Signed-off-by: lipeifeng 
Cc: David Hildenbrand 
Cc: Liam R. Howlett 
Cc: Matthew Wilcox 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton

mm/nommu: factor out check for NOMMU shared mappings into is_nommu_shared_mapping()

2023-01-19T01:12:56+00:00

Patch series "mm/nommu: don't use VM_MAYSHARE for MAP_PRIVATE mappings".

Trying to reduce the confusion around VM_SHARED and VM_MAYSHARE first
requires !CONFIG_MMU to stop using VM_MAYSHARE for MAP_PRIVATE mappings. 
CONFIG_MMU only sets VM_MAYSHARE for MAP_SHARED mappings.

This paves the way for further VM_MAYSHARE and VM_SHARED cleanups: for
example, renaming VM_MAYSHARE to VM_MAP_SHARED to make it clearer what it
actually means.

Let's first get the weird case out of the way and not use VM_MAYSHARE in
MAP_PRIVATE mappings, using a new VM_MAYOVERLAY flag instead.


This patch (of 3):

We want to stop using VM_MAYSHARE in private mappings to pave the way for
clarifying the semantics of VM_MAYSHARE vs.  VM_SHARED and reduce the
confusion.  While CONFIG_MMU uses VM_MAYSHARE to represent MAP_SHARED,
!CONFIG_MMU also sets VM_MAYSHARE for selected R/O private file mappings
that are an effective overlay of a file mapping.

Let's factor out all relevant VM_MAYSHARE checks in !CONFIG_MMU code into
is_nommu_shared_mapping() first.

Note that whenever VM_SHARED is set, VM_MAYSHARE must be set as well
(unless there is a serious BUG).  So there is no need to test for
VM_SHARED manually.

No functional change intended.

Link: https://lkml.kernel.org/r/20230102160856.500584-1-david@redhat.com
Link: https://lkml.kernel.org/r/20230102160856.500584-2-david@redhat.com
Signed-off-by: David Hildenbrand 
Cc: Arnd Bergmann 
Cc: David Hildenbrand 
Cc: Greg Kroah-Hartman 
Cc: Jens Axboe 
Cc: Nicolas Pitre 
Cc: Pavel Begunkov 
Signed-off-by: Andrew Morton
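
The invariant stated in this message (VM_SHARED implies VM_MAYSHARE, so testing VM_MAYSHARE alone is enough) can be sketched as follows. The `MODEL_*` flag values and both function names are assumptions chosen for illustration; the real helper lives in kernel headers and is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits (assumption; see the kernel headers for the real ones). */
#define MODEL_VM_SHARED    0x08ul
#define MODEL_VM_MAYSHARE  0x80ul

/* VM_SHARED without VM_MAYSHARE would indicate a bug elsewhere. */
static bool model_flags_consistent(unsigned long flags)
{
	return !(flags & MODEL_VM_SHARED) || (flags & MODEL_VM_MAYSHARE);
}

/* Single test on VM_MAYSHARE, relying on the invariant checked above. */
static bool model_is_nommu_shared_mapping(unsigned long flags)
{
	assert(model_flags_consistent(flags));
	return (flags & MODEL_VM_MAYSHARE) != 0;
}
```

Centralising the test in one predicate is what lets a later patch in the series change the underlying flag without touching every call site.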

proc: remove VMA rbtree use from nommu

2022-09-27T02:46:16+00:00

These users of the rbtree should probably have been walks of the linked
list, but convert them to use walks of the maple tree.

Link: https://lkml.kernel.org/r/20220906194824.2110408-17-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) 
Signed-off-by: Liam R. Howlett 
Acked-by: Vlastimil Babka 
Reviewed-by: Davidlohr Bueso 
Tested-by: Yu Zhao 
Cc: Catalin Marinas 
Cc: David Hildenbrand 
Cc: David Howells 
Cc: SeongJae Park 
Cc: Sven Schnelle 
Cc: Will Deacon 
Signed-off-by: Andrew Morton

mmap locking API: use coccinelle to convert mmap_sem rwsem call sites

2020-06-09T16:39:14+00:00

This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)

Signed-off-by: Michel Lespinasse 
Signed-off-by: Andrew Morton 
Reviewed-by: Daniel Jordan 
Reviewed-by: Laurent Dufour 
Reviewed-by: Vlastimil Babka 
Cc: Davidlohr Bueso 
Cc: David Rientjes 
Cc: Hugh Dickins 
Cc: Jason Gunthorpe 
Cc: Jerome Glisse 
Cc: John Hubbard 
Cc: Liam Howlett 
Cc: Matthew Wilcox 
Cc: Peter Zijlstra 
Cc: Ying Han 
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds
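
The effect of the conversion is that callers pass the mm itself rather than a pointer to its rwsem. The toy model below shows that calling convention in user space. It is single-threaded, reduces the lock to a reader count and a writer flag, and covers only the init, trylock, and unlock names from the rule above; `struct mm_model` and all function bodies are assumptions, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for mm_struct; the lock state is hidden behind the API. */
struct mm_model {
	int readers;
	bool writer;
};

static void mmap_init_lock(struct mm_model *mm)
{
	mm->readers = 0;
	mm->writer = false;
}

static bool mmap_read_trylock(struct mm_model *mm)
{
	if (mm->writer)
		return false;
	mm->readers++;
	return true;
}

static void mmap_read_unlock(struct mm_model *mm)
{
	assert(mm->readers > 0);
	mm->readers--;
}

static bool mmap_write_trylock(struct mm_model *mm)
{
	if (mm->writer || mm->readers > 0)
		return false;
	mm->writer = true;
	return true;
}

static void mmap_write_unlock(struct mm_model *mm)
{
	assert(mm->writer);
	mm->writer = false;
}
```

Since no caller names the lock field any more, the lock implementation can change behind these wrappers without another tree-wide edit.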

proc: use down_read_killable mmap_sem for /proc/pid/maps

2019-07-12T18:05:46+00:00

Do not remain stuck forever if something goes wrong.  Using a killable
lock permits cleanup of stuck tasks and simplifies investigation.

This function is also used for /proc/pid/smaps.

Link: http://lkml.kernel.org/r/156007493160.3335.14447544314127417266.stgit@buzz
Signed-off-by: Konstantin Khlebnikov 
Reviewed-by: Roman Gushchin 
Reviewed-by: Cyrill Gorcunov 
Reviewed-by: Kirill Tkhai 
Acked-by: Michal Hocko 
Cc: Alexey Dobriyan 
Cc: Al Viro 
Cc: Matthew Wilcox 
Cc: Michal Koutný 
Cc: Oleg Nesterov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds
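
The calling pattern for a killable lock differs from a plain one in that acquisition can fail and the caller must propagate the error without unlocking. A minimal user-space sketch, assuming a hypothetical `struct mm_model` whose `fatal_signal_pending` field stands in for the kernel's signal state (the real lock fails only while it would otherwise have to wait):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct mm_model {
	int readers;
	bool fatal_signal_pending;	/* stand-in for the task's signal state */
};

/* Returns 0 with the lock held, or -EINTR with the lock not held. */
static int model_read_lock_killable(struct mm_model *mm)
{
	if (mm->fatal_signal_pending)
		return -EINTR;
	mm->readers++;
	return 0;
}

static void model_read_unlock(struct mm_model *mm)
{
	assert(mm->readers > 0);
	mm->readers--;
}

/* Maps-style reader: returns the entry count or a negative errno. */
static int model_read_maps(struct mm_model *mm, int n_entries)
{
	int err = model_read_lock_killable(mm);

	if (err)
		return err;	/* lock was never taken, so nothing to undo */
	/* ... walk the entries while holding the read lock ... */
	model_read_unlock(mm);
	return n_entries;
}
```

The important detail is the early return: on failure the reader must not call the unlock path, otherwise the lock count would become unbalanced.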