diff options
Diffstat (limited to 'Documentation/filesystems')
| -rw-r--r-- | Documentation/filesystems/locking.rst | 79 | ||||
| -rw-r--r-- | Documentation/filesystems/mandatory-locking.rst | 188 |
2 files changed, 53 insertions, 214 deletions
diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst index 2183fd8cc350..2a75dd5da7b5 100644 --- a/Documentation/filesystems/locking.rst +++ b/Documentation/filesystems/locking.rst @@ -271,19 +271,19 @@ prototypes:: locking rules: All except set_page_dirty and freepage may block -====================== ======================== ========= -ops PageLocked(page) i_rwsem -====================== ======================== ========= +====================== ======================== ========= =============== +ops PageLocked(page) i_rwsem invalidate_lock +====================== ======================== ========= =============== writepage: yes, unlocks (see below) -readpage: yes, unlocks +readpage: yes, unlocks shared writepages: set_page_dirty no -readahead: yes, unlocks -readpages: no +readahead: yes, unlocks shared +readpages: no shared write_begin: locks the page exclusive write_end: yes, unlocks exclusive bmap: -invalidatepage: yes +invalidatepage: yes exclusive releasepage: yes freepage: yes direct_IO: @@ -295,7 +295,7 @@ is_partially_uptodate: yes error_remove_page: yes swap_activate: no swap_deactivate: no -====================== ======================== ========= +====================== ======================== ========= =============== ->write_begin(), ->write_end() and ->readpage() may be called from the request handler (/dev/loop). @@ -378,7 +378,10 @@ keep it that way and don't breed new callers. ->invalidatepage() is called when the filesystem must attempt to drop some or all of the buffers from the page when it is being truncated. It returns zero on success. If ->invalidatepage is zero, the kernel uses -block_invalidatepage() instead. +block_invalidatepage() instead. The filesystem must exclusively acquire +invalidate_lock before invalidating page cache in truncate / hole punch path +(and thus calling into ->invalidatepage) to block races between page cache +invalidation and page cache filling functions (fault, read, ...). ->releasepage() is called when the kernel is about to try to drop the buffers from the page in preparation for freeing it. It returns zero to @@ -506,6 +509,7 @@ prototypes:: ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); ssize_t (*read_iter) (struct kiocb *, struct iov_iter *); ssize_t (*write_iter) (struct kiocb *, struct iov_iter *); + int (*iopoll) (struct kiocb *kiocb, bool spin); int (*iterate) (struct file *, struct dir_context *); int (*iterate_shared) (struct file *, struct dir_context *); __poll_t (*poll) (struct file *, struct poll_table_struct *); @@ -518,12 +522,6 @@ prototypes:: int (*fsync) (struct file *, loff_t start, loff_t end, int datasync); int (*fasync) (int, struct file *, int); int (*lock) (struct file *, int, struct file_lock *); - ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, - loff_t *); - ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, - loff_t *); - ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, - void __user *); ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int); unsigned long (*get_unmapped_area)(struct file *, unsigned long, @@ -536,6 +534,14 @@ prototypes:: size_t, unsigned int); int (*setlease)(struct file *, long, struct file_lock **, void **); long (*fallocate)(struct file *, int, loff_t, loff_t); + void (*show_fdinfo)(struct seq_file *m, struct file *f); + unsigned (*mmap_capabilities)(struct file *); + ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, + loff_t, size_t, unsigned int); + loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + loff_t len, unsigned int remap_flags); + int (*fadvise)(struct file *, loff_t, loff_t, int); locking rules: All may block. @@ -570,6 +576,25 @@ in sys_read() and friends. the lease within the individual filesystem to record the result of the operation +->fallocate implementation must be really careful to maintain page cache +consistency when punching holes or performing other operations that invalidate +page cache contents. Usually the filesystem needs to call +truncate_inode_pages_range() to invalidate relevant range of the page cache. +However the filesystem usually also needs to update its internal (and on disk) +view of file offset -> disk block mapping. Until this update is finished, the +filesystem needs to block page faults and reads from reloading now-stale page +cache contents from the disk. Since VFS acquires mapping->invalidate_lock in +shared mode when loading pages from disk (filemap_fault(), filemap_read(), +readahead paths), the fallocate implementation must take the invalidate_lock to +prevent reloading. + +->copy_file_range and ->remap_file_range implementations need to serialize +against modifications of file data while the operation is running. For +blocking changes through write(2) and similar operations inode->i_rwsem can be +used. To block changes to file contents via a memory mapping during the +operation, the filesystem must take mapping->invalidate_lock to coordinate +with ->page_mkwrite. + dquot_operations ================ @@ -627,11 +652,11 @@ pfn_mkwrite: yes access: yes ============= ========= =========================== -->fault() is called when a previously not present pte is about -to be faulted in. The filesystem must find and return the page associated -with the passed in "pgoff" in the vm_fault structure. If it is possible that -the page may be truncated and/or invalidated, then the filesystem must lock -the page, then ensure it is not already truncated (the page lock will block +->fault() is called when a previously not present pte is about to be faulted +in. The filesystem must find and return the page associated with the passed in +"pgoff" in the vm_fault structure. If it is possible that the page may be +truncated and/or invalidated, then the filesystem must lock invalidate_lock, +then ensure the page is not already truncated (invalidate_lock will block subsequent truncate), and then return with VM_FAULT_LOCKED, and the page locked. The VM will unlock the page. @@ -644,12 +669,14 @@ page table entry. Pointer to entry associated with the page is passed in "pte" field in vm_fault structure. Pointers to entries for other offsets should be calculated relative to "pte". -->page_mkwrite() is called when a previously read-only pte is -about to become writeable. The filesystem again must ensure that there are -no truncate/invalidate races, and then return with the page locked. If -the page has been truncated, the filesystem should not look up a new page -like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which -will cause the VM to retry the fault. +->page_mkwrite() is called when a previously read-only pte is about to become +writeable. The filesystem again must ensure that there are no +truncate/invalidate races or races with operations such as ->remap_file_range +or ->copy_file_range, and then return with the page locked. Usually +mapping->invalidate_lock is suitable for proper serialization. If the page has +been truncated, the filesystem should not look up a new page like the ->fault() +handler, but simply return with VM_FAULT_NOPAGE, which will cause the VM to +retry the fault. ->pfn_mkwrite() is the same as page_mkwrite but when the pte is VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is diff --git a/Documentation/filesystems/mandatory-locking.rst b/Documentation/filesystems/mandatory-locking.rst deleted file mode 100644 index 9ce73544a8f0..000000000000 --- a/Documentation/filesystems/mandatory-locking.rst +++ /dev/null @@ -1,188 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -===================================================== -Mandatory File Locking For The Linux Operating System -===================================================== - - Andy Walker <andy@lysaker.kvaerner.no> - - 15 April 1996 - - (Updated September 2007) - -0. Why you should avoid mandatory locking ------------------------------------------ - -The Linux implementation is prey to a number of difficult-to-fix race -conditions which in practice make it not dependable: - - - The write system call checks for a mandatory lock only once - at its start. It is therefore possible for a lock request to - be granted after this check but before the data is modified. - A process may then see file data change even while a mandatory - lock was held. - - Similarly, an exclusive lock may be granted on a file after - the kernel has decided to proceed with a read, but before the - read has actually completed, and the reading process may see - the file data in a state which should not have been visible - to it. - - Similar races make the claimed mutual exclusion between lock - and mmap similarly unreliable. - -1. What is mandatory locking? ------------------------------- - -Mandatory locking is kernel enforced file locking, as opposed to the more usual -cooperative file locking used to guarantee sequential access to files among -processes. File locks are applied using the flock() and fcntl() system calls -(and the lockf() library routine which is a wrapper around fcntl().) It is -normally a process' responsibility to check for locks on a file it wishes to -update, before applying its own lock, updating the file and unlocking it again. -The most commonly used example of this (and in the case of sendmail, the most -troublesome) is access to a user's mailbox. The mail user agent and the mail -transfer agent must guard against updating the mailbox at the same time, and -prevent reading the mailbox while it is being updated. - -In a perfect world all processes would use and honour a cooperative, or -"advisory" locking scheme. However, the world isn't perfect, and there's -a lot of poorly written code out there. - -In trying to address this problem, the designers of System V UNIX came up -with a "mandatory" locking scheme, whereby the operating system kernel would -block attempts by a process to write to a file that another process holds a -"read" -or- "shared" lock on, and block attempts to both read and write to a -file that a process holds a "write " -or- "exclusive" lock on. - -The System V mandatory locking scheme was intended to have as little impact as -possible on existing user code. The scheme is based on marking individual files -as candidates for mandatory locking, and using the existing fcntl()/lockf() -interface for applying locks just as if they were normal, advisory locks. - -.. Note:: - - 1. In saying "file" in the paragraphs above I am actually not telling - the whole truth. System V locking is based on fcntl(). The granularity of - fcntl() is such that it allows the locking of byte ranges in files, in - addition to entire files, so the mandatory locking rules also have byte - level granularity. - - 2. POSIX.1 does not specify any scheme for mandatory locking, despite - borrowing the fcntl() locking scheme from System V. The mandatory locking - scheme is defined by the System V Interface Definition (SVID) Version 3. - -2. Marking a file for mandatory locking ---------------------------------------- - -A file is marked as a candidate for mandatory locking by setting the group-id -bit in its file mode but removing the group-execute bit. This is an otherwise -meaningless combination, and was chosen by the System V implementors so as not -to break existing user programs. - -Note that the group-id bit is usually automatically cleared by the kernel when -a setgid file is written to. This is a security measure. The kernel has been -modified to recognize the special case of a mandatory lock candidate and to -refrain from clearing this bit. Similarly the kernel has been modified not -to run mandatory lock candidates with setgid privileges. - -3. Available implementations ----------------------------- - -I have considered the implementations of mandatory locking available with -SunOS 4.1.x, Solaris 2.x and HP-UX 9.x. - -Generally I have tried to make the most sense out of the behaviour exhibited -by these three reference systems. There are many anomalies. - -All the reference systems reject all calls to open() for a file on which -another process has outstanding mandatory locks. This is in direct -contravention of SVID 3, which states that only calls to open() with the -O_TRUNC flag set should be rejected. The Linux implementation follows the SVID -definition, which is the "Right Thing", since only calls with O_TRUNC can -modify the contents of the file. - -HP-UX even disallows open() with O_TRUNC for a file with advisory locks, not -just mandatory locks. That would appear to contravene POSIX.1. - -mmap() is another interesting case. All the operating systems mentioned -prevent mandatory locks from being applied to an mmap()'ed file, but HP-UX -also disallows advisory locks for such a file. SVID actually specifies the -paranoid HP-UX behaviour. - -In my opinion only MAP_SHARED mappings should be immune from locking, and then -only from mandatory locks - that is what is currently implemented. - -SunOS is so hopeless that it doesn't even honour the O_NONBLOCK flag for -mandatory locks, so reads and writes to locked files always block when they -should return EAGAIN. - -I'm afraid that this is such an esoteric area that the semantics described -below are just as valid as any others, so long as the main points seem to -agree. - -4. Semantics ------------- - -1. Mandatory locks can only be applied via the fcntl()/lockf() locking - interface - in other words the System V/POSIX interface. BSD style - locks using flock() never result in a mandatory lock. - -2. If a process has locked a region of a file with a mandatory read lock, then - other processes are permitted to read from that region. If any of these - processes attempts to write to the region it will block until the lock is - released, unless the process has opened the file with the O_NONBLOCK - flag in which case the system call will return immediately with the error - status EAGAIN. - -3. If a process has locked a region of a file with a mandatory write lock, all - attempts to read or write to that region block until the lock is released, - unless a process has opened the file with the O_NONBLOCK flag in which case - the system call will return immediately with the error status EAGAIN. - -4. Calls to open() with O_TRUNC, or to creat(), on a existing file that has - any mandatory locks owned by other processes will be rejected with the - error status EAGAIN. - -5. Attempts to apply a mandatory lock to a file that is memory mapped and - shared (via mmap() with MAP_SHARED) will be rejected with the error status - EAGAIN. - -6. Attempts to create a shared memory map of a file (via mmap() with MAP_SHARED) - that has any mandatory locks in effect will be rejected with the error status - EAGAIN. - -5. Which system calls are affected? ------------------------------------ - -Those which modify a file's contents, not just the inode. That gives read(), -write(), readv(), writev(), open(), creat(), mmap(), truncate() and -ftruncate(). truncate() and ftruncate() are considered to be "write" actions -for the purposes of mandatory locking. - -The affected region is usually defined as stretching from the current position -for the total number of bytes read or written. For the truncate calls it is -defined as the bytes of a file removed or added (we must also consider bytes -added, as a lock can specify just "the whole file", rather than a specific -range of bytes.) - -Note 3: I may have overlooked some system calls that need mandatory lock -checking in my eagerness to get this code out the door. Please let me know, or -better still fix the system calls yourself and submit a patch to me or Linus. - -6. Warning! ------------ - -Not even root can override a mandatory lock, so runaway processes can wreak -havoc if they lock crucial files. The way around it is to change the file -permissions (remove the setgid bit) before trying to read or write to it. -Of course, that might be a bit tricky if the system is hung :-( - -7. The "mand" mount option --------------------------- -Mandatory locking is disabled on all filesystems by default, and must be -administratively enabled by mounting with "-o mand". That mount option -is only allowed if the mounting task has the CAP_SYS_ADMIN capability. - -Since kernel v4.5, it is possible to disable mandatory locking -altogether by setting CONFIG_MANDATORY_FILE_LOCKING to "n". A kernel -with this disabled will reject attempts to mount filesystems with the -"mand" mount option with the error status EPERM. |
