diff options
Diffstat (limited to 'Documentation/filesystems')
38 files changed, 13259 insertions, 0 deletions
diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX new file mode 100644 index 000000000000..bcfbab899b37 --- /dev/null +++ b/Documentation/filesystems/00-INDEX @@ -0,0 +1,50 @@ +00-INDEX + - this file (info on some of the filesystems supported by linux). +Locking + - info on locking rules as they pertain to Linux VFS. +adfs.txt + - info and mount options for the Acorn Advanced Disc Filing System. +affs.txt + - info and mount options for the Amiga Fast File System. +bfs.txt + - info for the SCO UnixWare Boot Filesystem (BFS). +cifs.txt + - description of the CIFS filesystem +coda.txt + - description of the CODA filesystem. +cramfs.txt + - info on the cram filesystem for small storage (ROMs etc) +devfs/ + - directory containing devfs documentation. +ext2.txt + - info, mount options and specifications for the Ext2 filesystem. +fat_cvf.txt + - info on the Compressed Volume Files extension to the FAT filesystem +hpfs.txt + - info and mount options for the OS/2 HPFS. +isofs.txt + - info and mount options for the ISO 9660 (CDROM) filesystem. +jfs.txt + - info and mount options for the JFS filesystem. +ncpfs.txt + - info on Novell Netware(tm) filesystem using NCP protocol. +ntfs.txt + - info and mount options for the NTFS filesystem (Windows NT). +proc.txt + - info on Linux's /proc filesystem. +romfs.txt + - Description of the ROMFS filesystem. +smbfs.txt + - info on using filesystems with the SMB protocol (Windows 3.11 and NT) +sysv-fs.txt + - info on the SystemV/V7/Xenix/Coherent filesystem. +udf.txt + - info and mount options for the UDF filesystem. +ufs.txt + - info on the ufs filesystem. +vfat.txt + - info on using the VFAT filesystem used in Windows NT and Windows 95 +vfs.txt + - Overview of the Virtual File System +xfs.txt + - info and mount options for the XFS filesystem. diff --git a/Documentation/filesystems/Exporting b/Documentation/filesystems/Exporting new file mode 100644 index 000000000000..31047e0fe14b --- /dev/null +++ b/Documentation/filesystems/Exporting @@ -0,0 +1,176 @@ + +Making Filesystems Exportable +============================= + +Most filesystem operations require a dentry (or two) as a starting +point. Local applications have a reference-counted hold on suitable +dentrys via open file descriptors or cwd/root. However remote +applications that access a filesystem via a remote filesystem protocol +such as NFS may not be able to hold such a reference, and so need a +different way to refer to a particular dentry. As the alternative +form of reference needs to be stable across renames, truncates, and +server-reboot (among other things, though these tend to be the most +problematic), there is no simple answer like 'filename'. + +The mechanism discussed here allows each filesystem implementation to +specify how to generate an opaque (out side of the filesystem) byte +string for any dentry, and how to find an appropriate dentry for any +given opaque byte string. +This byte string will be called a "filehandle fragment" as it +corresponds to part of an NFS filehandle. + +A filesystem which supports the mapping between filehandle fragments +and dentrys will be termed "exportable". + + + +Dcache Issues +------------- + +The dcache normally contains a proper prefix of any given filesystem +tree. This means that if any filesystem object is in the dcache, then +all of the ancestors of that filesystem object are also in the dcache. +As normal access is by filename this prefix is created naturally and +maintained easily (by each object maintaining a reference count on +its parent). + +However when objects are included into the dcache by interpreting a +filehandle fragment, there is no automatic creation of a path prefix +for the object. This leads to two related but distinct features of +the dcache that are not needed for normal filesystem access. + +1/ The dcache must sometimes contain objects that are not part of the + proper prefix. i.e that are not connected to the root. +2/ The dcache must be prepared for a newly found (via ->lookup) directory + to already have a (non-connected) dentry, and must be able to move + that dentry into place (based on the parent and name in the + ->lookup). This is particularly needed for directories as + it is a dcache invariant that directories only have one dentry. + +To implement these features, the dcache has: + +a/ A dentry flag DCACHE_DISCONNECTED which is set on + any dentry that might not be part of the proper prefix. + This is set when anonymous dentries are created, and cleared when a + dentry is noticed to be a child of a dentry which is in the proper + prefix. + +b/ A per-superblock list "s_anon" of dentries which are the roots of + subtrees that are not in the proper prefix. These dentries, as + well as the proper prefix, need to be released at unmount time. As + these dentries will not be hashed, they are linked together on the + d_hash list_head. + +c/ Helper routines to allocate anonymous dentries, and to help attach + loose directory dentries at lookup time. They are: + d_alloc_anon(inode) will return a dentry for the given inode. + If the inode already has a dentry, one of those is returned. + If it doesn't, a new anonymous (IS_ROOT and + DCACHE_DISCONNECTED) dentry is allocated and attached. + In the case of a directory, care is taken that only one dentry + can ever be attached. + d_splice_alias(inode, dentry) will make sure that there is a + dentry with the same name and parent as the given dentry, and + which refers to the given inode. + If the inode is a directory and already has a dentry, then that + dentry is d_moved over the given dentry. + If the passed dentry gets attached, care is taken that this is + mutually exclusive to a d_alloc_anon operation. + If the passed dentry is used, NULL is returned, else the used + dentry is returned. This corresponds to the calling pattern of + ->lookup. + + +Filesystem Issues +----------------- + +For a filesystem to be exportable it must: + + 1/ provide the filehandle fragment routines described below. + 2/ make sure that d_splice_alias is used rather than d_add + when ->lookup finds an inode for a given parent and name. + Typically the ->lookup routine will end: + if (inode) + return d_splice(inode, dentry); + d_add(dentry, inode); + return NULL; + } + + + + A file system implementation declares that instances of the filesystem +are exportable by setting the s_export_op field in the struct +super_block. This field must point to a "struct export_operations" +struct which could potentially be full of NULLs, though normally at +least get_parent will be set. + + The primary operations are decode_fh and encode_fh. +decode_fh takes a filehandle fragment and tries to find or create a +dentry for the object referred to by the filehandle. +encode_fh takes a dentry and creates a filehandle fragment which can +later be used to find/create a dentry for the same object. + +decode_fh will probably make use of "find_exported_dentry". +This function lives in the "exportfs" module which a filesystem does +not need unless it is being exported. So rather that calling +find_exported_dentry directly, each filesystem should call it through +the find_exported_dentry pointer in it's export_operations table. +This field is set correctly by the exporting agent (e.g. nfsd) when a +filesystem is exported, and before any export operations are called. + +find_exported_dentry needs three support functions from the +filesystem: + get_name. When given a parent dentry and a child dentry, this + should find a name in the directory identified by the parent + dentry, which leads to the object identified by the child dentry. + If no get_name function is supplied, a default implementation is + provided which uses vfs_readdir to find potential names, and + matches inode numbers to find the correct match. + + get_parent. When given a dentry for a directory, this should return + a dentry for the parent. Quite possibly the parent dentry will + have been allocated by d_alloc_anon. + The default get_parent function just returns an error so any + filehandle lookup that requires finding a parent will fail. + ->lookup("..") is *not* used as a default as it can leave ".." + entries in the dcache which are too messy to work with. + + get_dentry. When given an opaque datum, this should find the + implied object and create a dentry for it (possibly with + d_alloc_anon). + The opaque datum is whatever is passed down by the decode_fh + function, and is often simply a fragment of the filehandle + fragment. + decode_fh passes two datums through find_exported_dentry. One that + should be used to identify the target object, and one that can be + used to identify the object's parent, should that be necessary. + The default get_dentry function assumes that the datum contains an + inode number and a generation number, and it attempts to get the + inode using "iget" and check it's validity by matching the + generation number. A filesystem should only depend on the default + if iget can safely be used this way. + +If decode_fh and/or encode_fh are left as NULL, then default +implementations are used. These defaults are suitable for ext2 and +extremely similar filesystems (like ext3). + +The default encode_fh creates a filehandle fragment from the inode +number and generation number of the target together with the inode +number and generation number of the parent (if the parent is +required). + +The default decode_fh extract the target and parent datums from the +filehandle assuming the format used by the default encode_fh and +passed them to find_exported_dentry. + + +A filehandle fragment consists of an array of 1 or more 4byte words, +together with a one byte "type". +The decode_fh routine should not depend on the stated size that is +passed to it. This size may be larger than the original filehandle +generated by encode_fh, in which case it will have been padded with +nuls. Rather, the encode_fh routine should choose a "type" which +indicates the decode_fh how much of the filehandle is valid, and how +it should be interpreted. + + diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking new file mode 100644 index 000000000000..a934baeeb33a --- /dev/null +++ b/Documentation/filesystems/Locking @@ -0,0 +1,515 @@ + The text below describes the locking rules for VFS-related methods. +It is (believed to be) up-to-date. *Please*, if you change anything in +prototypes or locking protocols - update this file. And update the relevant +instances in the tree, don't leave that to maintainers of filesystems/devices/ +etc. At the very least, put the list of dubious cases in the end of this file. +Don't turn it into log - maintainers of out-of-the-tree code are supposed to +be able to use diff(1). + Thing currently missing here: socket operations. Alexey? + +--------------------------- dentry_operations -------------------------- +prototypes: + int (*d_revalidate)(struct dentry *, int); + int (*d_hash) (struct dentry *, struct qstr *); + int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); + int (*d_delete)(struct dentry *); + void (*d_release)(struct dentry *); + void (*d_iput)(struct dentry *, struct inode *); + +locking rules: + none have BKL + dcache_lock rename_lock ->d_lock may block +d_revalidate: no no no yes +d_hash no no no yes +d_compare: no yes no no +d_delete: yes no yes no +d_release: no no no yes +d_iput: no no no yes + +--------------------------- inode_operations --------------------------- +prototypes: + int (*create) (struct inode *,struct dentry *,int, struct nameidata *); + struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameid +ata *); + int (*link) (struct dentry *,struct inode *,struct dentry *); + int (*unlink) (struct inode *,struct dentry *); + int (*symlink) (struct inode *,struct dentry *,const char *); + int (*mkdir) (struct inode *,struct dentry *,int); + int (*rmdir) (struct inode *,struct dentry *); + int (*mknod) (struct inode *,struct dentry *,int,dev_t); + int (*rename) (struct inode *, struct dentry *, + struct inode *, struct dentry *); + int (*readlink) (struct dentry *, char __user *,int); + int (*follow_link) (struct dentry *, struct nameidata *); + void (*truncate) (struct inode *); + int (*permission) (struct inode *, int, struct nameidata *); + int (*setattr) (struct dentry *, struct iattr *); + int (*getattr) (struct vfsmount *, struct dentry *, struct kstat *); + int (*setxattr) (struct dentry *, const char *,const void *,size_t,int); + ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); + ssize_t (*listxattr) (struct dentry *, char *, size_t); + int (*removexattr) (struct dentry *, const char *); + +locking rules: + all may block, none have BKL + i_sem(inode) +lookup: yes +create: yes +link: yes (both) +mknod: yes +symlink: yes +mkdir: yes +unlink: yes (both) +rmdir: yes (both) (see below) +rename: yes (all) (see below) +readlink: no +follow_link: no +truncate: yes (see below) +setattr: yes +permission: no +getattr: no +setxattr: yes +getxattr: no +listxattr: no +removexattr: yes + Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_sem on +victim. + cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem. + ->truncate() is never called directly - it's a callback, not a +method. It's called by vmtruncate() - library function normally used by +->setattr(). Locking information above applies to that call (i.e. is +inherited from ->setattr() - vmtruncate() is used when ATTR_SIZE had been +passed). + +See Documentation/filesystems/directory-locking for more detailed discussion +of the locking scheme for directory operations. + +--------------------------- super_operations --------------------------- +prototypes: + struct inode *(*alloc_inode)(struct super_block *sb); + void (*destroy_inode)(struct inode *); + void (*read_inode) (struct inode *); + void (*dirty_inode) (struct inode *); + int (*write_inode) (struct inode *, int); + void (*put_inode) (struct inode *); + void (*drop_inode) (struct inode *); + void (*delete_inode) (struct inode *); + void (*put_super) (struct super_block *); + void (*write_super) (struct super_block *); + int (*sync_fs)(struct super_block *sb, int wait); + void (*write_super_lockfs) (struct super_block *); + void (*unlockfs) (struct super_block *); + int (*statfs) (struct super_block *, struct kstatfs *); + int (*remount_fs) (struct super_block *, int *, char *); + void (*clear_inode) (struct inode *); + void (*umount_begin) (struct super_block *); + int (*show_options)(struct seq_file *, struct vfsmount *); + ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t); + ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); + +locking rules: + All may block. + BKL s_lock s_umount +alloc_inode: no no no +destroy_inode: no +read_inode: no (see below) +dirty_inode: no (must not sleep) +write_inode: no +put_inode: no +drop_inode: no !!!inode_lock!!! +delete_inode: no +put_super: yes yes no +write_super: no yes read +sync_fs: no no read +write_super_lockfs: ? +unlockfs: ? +statfs: no no no +remount_fs: no yes maybe (see below) +clear_inode: no +umount_begin: yes no no +show_options: no (vfsmount->sem) +quota_read: no no no (see below) +quota_write: no no no (see below) + +->read_inode() is not a method - it's a callback used in iget(). +->remount_fs() will have the s_umount lock if it's already mounted. +When called from get_sb_single, it does NOT have the s_umount lock. +->quota_read() and ->quota_write() functions are both guaranteed to +be the only ones operating on the quota file by the quota code (via +dqio_sem) (unless an admin really wants to screw up something and +writes to quota files with quotas on). For other details about locking +see also dquot_operations section. + +--------------------------- file_system_type --------------------------- +prototypes: + struct super_block *(*get_sb) (struct file_system_type *, int, + const char *, void *); + void (*kill_sb) (struct super_block *); +locking rules: + may block BKL +get_sb yes yes +kill_sb yes yes + +->get_sb() returns error or a locked superblock (exclusive on ->s_umount). +->kill_sb() takes a write-locked superblock, does all shutdown work on it, +unlocks and drops the reference. + +--------------------------- address_space_operations -------------------------- +prototypes: + int (*writepage)(struct page *page, struct writeback_control *wbc); + int (*readpage)(struct file *, struct page *); + int (*sync_page)(struct page *); + int (*writepages)(struct address_space *, struct writeback_control *); + int (*set_page_dirty)(struct page *page); + int (*readpages)(struct file *filp, struct address_space *mapping, + struct list_head *pages, unsigned nr_pages); + int (*prepare_write)(struct file *, struct page *, unsigned, unsigned); + int (*commit_write)(struct file *, struct page *, unsigned, unsigned); + sector_t (*bmap)(struct address_space *, sector_t); + int (*invalidatepage) (struct page *, unsigned long); + int (*releasepage) (struct page *, int); + int (*direct_IO)(int, struct kiocb *, const struct iovec *iov, + loff_t offset, unsigned long nr_segs); + +locking rules: + All except set_page_dirty may block + + BKL PageLocked(page) +writepage: no yes, unlocks (see below) +readpage: no yes, unlocks +sync_page: no maybe +writepages: no +set_page_dirty no no +readpages: no +prepare_write: no yes +commit_write: no yes +bmap: yes +invalidatepage: no yes +releasepage: no yes +direct_IO: no + + ->prepare_write(), ->commit_write(), ->sync_page() and ->readpage() +may be called from the request handler (/dev/loop). + + ->readpage() unlocks the page, either synchronously or via I/O +completion. + + ->readpages() populates the pagecache with the passed pages and starts +I/O against them. They come unlocked upon I/O completion. + + ->writepage() is used for two purposes: for "memory cleansing" and for +"sync". These are quite different operations and the behaviour may differ +depending upon the mode. + +If writepage is called for sync (wbc->sync_mode != WBC_SYNC_NONE) then +it *must* start I/O against the page, even if that would involve +blocking on in-progress I/O. + +If writepage is called for memory cleansing (sync_mode == +WBC_SYNC_NONE) then its role is to get as much writeout underway as +possible. So writepage should try to avoid blocking against +currently-in-progress I/O. + +If the filesystem is not called for "sync" and it determines that it +would need to block against in-progress I/O to be able to start new I/O +against the page the filesystem should redirty the page with +redirty_page_for_writepage(), then unlock the page and return zero. +This may also be done to avoid internal deadlocks, but rarely. + +If the filesytem is called for sync then it must wait on any +in-progress I/O and then start new I/O. + +The filesystem should unlock the page synchronously, before returning +to the caller. + +Unless the filesystem is going to redirty_page_for_writepage(), unlock the page +and return zero, writepage *must* run set_page_writeback() against the page, +followed by unlocking it. Once set_page_writeback() has been run against the +page, write I/O can be submitted and the write I/O completion handler must run +end_page_writeback() once the I/O is complete. If no I/O is submitted, the +filesystem must run end_page_writeback() against the page before returning from +writepage. + +That is: after 2.5.12, pages which are under writeout are *not* locked. Note, +if the filesystem needs the page to be locked during writeout, that is ok, too, +the page is allowed to be unlocked at any point in time between the calls to +set_page_writeback() and end_page_writeback(). + +Note, failure to run either redirty_page_for_writepage() or the combination of +set_page_writeback()/end_page_writeback() on a page submitted to writepage +will leave the page itself marked clean but it will be tagged as dirty in the +radix tree. This incoherency can lead to all sorts of hard-to-debug problems +in the filesystem like having dirty inodes at umount and losing written data. + + ->sync_page() locking rules are not well-defined - usually it is called +with lock on page, but that is not guaranteed. Considering the currently +existing instances of this method ->sync_page() itself doesn't look +well-defined... + + ->writepages() is used for periodic writeback and for syscall-initiated +sync operations. The address_space should start I/O against at least +*nr_to_write pages. *nr_to_write must be decremented for each page which is +written. The address_space implementation may write more (or less) pages +than *nr_to_write asks for, but it should try to be reasonably close. If +nr_to_write is NULL, all dirty pages must be written. + +writepages should _only_ write pages which are present on +mapping->io_pages. + + ->set_page_dirty() is called from various places in the kernel +when the target page is marked as needing writeback. It may be called +under spinlock (it cannot block) and is sometimes called with the page +not locked. + + ->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some +filesystems and by the swapper. The latter will eventually go away. All +instances do not actually need the BKL. Please, keep it that way and don't +breed new callers. + + ->invalidatepage() is called when the filesystem must attempt to drop +some or all of the buffers from the page when it is being truncated. It +returns zero on success. If ->invalidatepage is zero, the kernel uses +block_invalidatepage() instead. + + ->releasepage() is called when the kernel is about to try to drop the +buffers from the page in preparation for freeing it. It returns zero to +indicate that the buffers are (or may be) freeable. If ->releasepage is zero, +the kernel assumes that the fs has no private interest in the buffers. + + Note: currently almost all instances of address_space methods are +using BKL for internal serialization and that's one of the worst sources +of contention. Normally they are calling library functions (in fs/buffer.c) +and pass foo_get_block() as a callback (on local block-based filesystems, +indeed). BKL is not needed for library stuff and is usually taken by +foo_get_block(). It's an overkill, since block bitmaps can be protected by +internal fs locking and real critical areas are much smaller than the areas +filesystems protect now. + +----------------------- file_lock_operations ------------------------------ +prototypes: + void (*fl_insert)(struct file_lock *); /* lock insertion callback */ + void (*fl_remove)(struct file_lock *); /* lock removal callback */ + void (*fl_copy_lock)(struct file_lock *, struct file_lock *); + void (*fl_release_private)(struct file_lock *); + + +locking rules: + BKL may block +fl_insert: yes no +fl_remove: yes no +fl_copy_lock: yes no +fl_release_private: yes yes + +----------------------- lock_manager_operations --------------------------- +prototypes: + int (*fl_compare_owner)(struct file_lock *, struct file_lock *); + void (*fl_notify)(struct file_lock *); /* unblock callback */ + void (*fl_copy_lock)(struct file_lock *, struct file_lock *); + void (*fl_release_private)(struct file_lock *); + void (*fl_break)(struct file_lock *); /* break_lease callback */ + +locking rules: + BKL may block +fl_compare_owner: yes no +fl_notify: yes no +fl_copy_lock: yes no +fl_release_private: yes yes +fl_break: yes no + + Currently only NFSD and NLM provide instances of this class. None of the +them block. If you have out-of-tree instances - please, show up. Locking +in that area will change. +--------------------------- buffer_head ----------------------------------- +prototypes: + void (*b_end_io)(struct buffer_head *bh, int uptodate); + +locking rules: + called from interrupts. In other words, extreme care is needed here. +bh is locked, but that's all warranties we have here. Currently only RAID1, +highmem, fs/buffer.c, and fs/ntfs/aops.c are providing these. Block devices +call this method upon the IO completion. + +--------------------------- block_device_operations ----------------------- +prototypes: + int (*open) (struct inode *, struct file *); + int (*release) (struct inode *, struct file *); + int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long); + int (*media_changed) (struct gendisk *); + int (*revalidate_disk) (struct gendisk *); + +locking rules: + BKL bd_sem +open: yes yes +release: yes yes +ioctl: yes no +media_changed: no no +revalidate_disk: no no + +The last two are called only from check_disk_change(). + +--------------------------- file_operations ------------------------------- +prototypes: + loff_t (*llseek) (struct file *, loff_t, int); + ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); + ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t); + ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); + ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, + loff_t); + int (*readdir) (struct file *, void *, filldir_t); + unsigned int (*poll) (struct file *, struct poll_table_struct *); + int (*ioctl) (struct inode *, struct file *, unsigned int, + unsigned long); + long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); + long (*compat_ioctl) (struct file *, unsigned int, unsigned long); + int (*mmap) (struct file *, struct vm_area_struct *); + int (*open) (struct inode *, struct file *); + int (*flush) (struct file *); + int (*release) (struct inode *, struct file *); + int (*fsync) (struct file *, struct dentry *, int datasync); + int (*aio_fsync) (struct kiocb *, int datasync); + int (*fasync) (int, struct file *, int); + int (*lock) (struct file *, int, struct file_lock *); + ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, + loff_t *); + ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, + loff_t *); + ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, + void __user *); + ssize_t (*sendpage) (struct file *, struct page *, int, size_t, + loff_t *, int); + unsigned long (*get_unmapped_area)(struct file *, unsigned long, + unsigned long, unsigned long, unsigned long); + int (*check_flags)(int); + int (*dir_notify)(struct file *, unsigned long); +}; + +locking rules: + All except ->poll() may block. + BKL +llseek: no (see below) +read: no +aio_read: no +write: no +aio_write: no +readdir: no +poll: no +ioctl: yes (see below) +unlocked_ioctl: no (see below) +compat_ioctl: no +mmap: no +open: maybe (see below) +flush: no +release: no +fsync: no (see below) +aio_fsync: no +fasync: yes (see below) +lock: yes +readv: no +writev: no +sendfile: no +sendpage: no +get_unmapped_area: no +check_flags: no +dir_notify: no + +->llseek() locking has moved from llseek to the individual llseek +implementations. If your fs is not using generic_file_llseek, you +need to acquire and release the appropriate locks in your ->llseek(). +For many filesystems, it is probably safe to acquire the inode +semaphore. Note some filesystems (i.e. remote ones) provide no +protection for i_size so you will need to use the BKL. + +->open() locking is in-transit: big lock partially moved into the methods. +The only exception is ->open() in the instances of file_operations that never +end up in ->i_fop/->proc_fops, i.e. ones that belong to character devices +(chrdev_open() takes lock before replacing ->f_op and calling the secondary +method. As soon as we fix the handling of module reference counters all +instances of ->open() will be called without the BKL. + +Note: ext2_release() was *the* source of contention on fs-intensive +loads and dropping BKL on ->release() helps to get rid of that (we still +grab BKL for cases when we close a file that had been opened r/w, but that +can and should be done using the internal locking with smaller critical areas). +Current worst offender is ext2_get_block()... + +->fasync() is a mess. This area needs a big cleanup and that will probably +affect locking. + +->readdir() and ->ioctl() on directories must be changed. Ideally we would +move ->readdir() to inode_operations and use a separate method for directory +->ioctl() or kill the latter completely. One of the problems is that for +anything that resembles union-mount we won't have a struct file for all +components. And there are other reasons why the current interface is a mess... + +->ioctl() on regular files is superceded by the ->unlocked_ioctl() that +doesn't take the BKL. + +->read on directories probably must go away - we should just enforce -EISDIR +in sys_read() and friends. + +->fsync() has i_sem on inode. + +--------------------------- dquot_operations ------------------------------- +prototypes: + int (*initialize) (struct inode *, int); + int (*drop) (struct inode *); + int (*alloc_space) (struct inode *, qsize_t, int); + int (*alloc_inode) (const struct inode *, unsigned long); + int (*free_space) (struct inode *, qsize_t); + int (*free_inode) (const struct inode *, unsigned long); + int (*transfer) (struct inode *, struct iattr *); + int (*write_dquot) (struct dquot *); + int (*acquire_dquot) (struct dquot *); + int (*release_dquot) (struct dquot *); + int (*mark_dirty) (struct dquot *); + int (*write_info) (struct super_block *, int); + +These operations are intended to be more or less wrapping functions that ensure +a proper locking wrt the filesystem and call the generic quota operations. + +What filesystem should expect from the generic quota functions: + + FS recursion Held locks when called +initialize: yes maybe dqonoff_sem +drop: yes - +alloc_space: ->mark_dirty() - +alloc_inode: ->mark_dirty() - +free_space: ->mark_dirty() - +free_inode: ->mark_dirty() - +transfer: yes - +write_dquot: yes dqonoff_sem or dqptr_sem +acquire_dquot: yes dqonoff_sem or dqptr_sem +release_dquot: yes dqonoff_sem or dqptr_sem +mark_dirty: no - +write_info: yes dqonoff_sem + +FS recursion means calling ->quota_read() and ->quota_write() from superblock +operations. + +->alloc_space(), ->alloc_inode(), ->free_space(), ->free_inode() are called +only directly by the filesystem and do not call any fs functions only +the ->mark_dirty() operation. + +More details about quota locking can be found in fs/dquot.c. + +--------------------------- vm_operations_struct ----------------------------- +prototypes: + void (*open)(struct vm_area_struct*); + void (*close)(struct vm_area_struct*); + struct page *(*nopage)(struct vm_area_struct*, unsigned long, int *); + +locking rules: + BKL mmap_sem +open: no yes +close: no yes +nopage: no yes + +================================================================================ + Dubious stuff + +(if you break something or notice that it is broken and do not fix it yourself +- at least put it here) + +ipc/shm.c::shm_delete() - may need BKL. +->read() and ->write() in many drivers are (probably) missing BKL. +drivers/sgi/char/graphics.c::sgi_graphics_nopage() - may need BKL. diff --git a/Documentation/filesystems/adfs.txt b/Documentation/filesystems/adfs.txt new file mode 100644 index 000000000000..060abb0c7004 --- /dev/null +++ b/Documentation/filesystems/adfs.txt @@ -0,0 +1,57 @@ +Mount options for ADFS +---------------------- + + uid=nnn All files in the partition will be owned by + user id nnn. Default 0 (root). + gid=nnn All files in the partition willbe in group + nnn. Default 0 (root). + ownmask=nnn The permission mask for ADFS 'owner' permissions + will be nnn. Default 0700. + othmask=nnn The permission mask for ADFS 'other' permissions + will be nnn. Default 0077. + +Mapping of ADFS permissions to Linux permissions +------------------------------------------------ + + ADFS permissions consist of the following: + + Owner read + Owner write + Other read + Other write + + (In older versions, an 'execute' permission did exist, but this + does not hold the same meaning as the Linux 'execute' permission + and is now obsolete). + + The mapping is performed as follows: + + Owner read -> -r--r--r-- + Owner write -> --w--w---w + Owner read and filetype UnixExec -> ---x--x--x + These are then masked by ownmask, eg 700 -> -rwx------ + Possible owner mode permissions -> -rwx------ + + Other read -> -r--r--r-- + Other write -> --w--w--w- + Other read and filetype UnixExec -> ---x--x--x + These are then masked by othmask, eg 077 -> ----rwxrwx + Possible other mode permissions -> ----rwxrwx + + Hence, with the default masks, if a file is owner read/write, and + not a UnixExec filetype, then the permissions will be: + + -rw------- + + However, if the masks were ownmask=0770,othmask=0007, then this would + be modified to: + -rw-rw---- + + There is no restriction on what you can do with these masks. You may + wish that either read bits give read access to the file for all, but + keep the default write protection (ownmask=0755,othmask=0577): + + -rw-r--r-- + + You can therefore tailor the permission translation to whatever you + desire the permissions should be under Linux. diff --git a/Documentation/filesystems/affs.txt b/Documentation/filesystems/affs.txt new file mode 100644 index 000000000000..30c9738590f4 --- /dev/null +++ b/Documentation/filesystems/affs.txt @@ -0,0 +1,219 @@ +Overview of Amiga Filesystems +============================= + +Not all varieties of the Amiga filesystems are supported for reading and +writing. The Amiga currently knows six different filesystems: + +DOS\0 The old or original filesystem, not really suited for + hard disks and normally not used on them, either. + Supported read/write. + +DOS\1 The original Fast File System. Supported read/write. + +DOS\2 The old "international" filesystem. International means that + a bug has been fixed so that accented ("international") letters + in file names are case-insensitive, as they ought to be. + Supported read/write. + +DOS\3 The "international" Fast File System. Supported read/write. + +DOS\4 The original filesystem with directory cache. The directory + cache speeds up directory accesses on floppies considerably, + but slows down file creation/deletion. Doesn't make much + sense on hard disks. Supported read only. + +DOS\5 The Fast File System with directory cache. Supported read only. + +All of the above filesystems allow block sizes from 512 to 32K bytes. +Supported block sizes are: 512, 1024, 2048 and 4096 bytes. Larger blocks +speed up almost everything at the expense of wasted disk space. The speed +gain above 4K seems not really worth the price, so you don't lose too +much here, either. + +The muFS (multi user File System) equivalents of the above file systems +are supported, too. + +Mount options for the AFFS +========================== + +protect If this option is set, the protection bits cannot be altered. + +setuid[=uid] This sets the owner of all files and directories in the file + system to uid or the uid of the current user, respectively. + +setgid[=gid] Same as above, but for gid. + +mode=mode Sets the mode flags to the given (octal) value, regardless + of the original permissions. Directories will get an x + permission if the corresponding r bit is set. + This is useful since most of the plain AmigaOS files + will map to 600. + +reserved=num Sets the number of reserved blocks at the start of the + partition to num. You should never need this option. + Default is 2. + +root=block Sets the block number of the root block. This should never + be necessary. + +bs=blksize Sets the blocksize to blksize. Valid block sizes are 512, + 1024, 2048 and 4096. Like the root option, this should + never be necessary, as the affs can figure it out itself. + +quiet The file system will not return an error for disallowed + mode changes. + +verbose The volume name, file system type and block size will + be written to the syslog when the filesystem is mounted. + +mufs The filesystem is really a muFS, also it doesn't + identify itself as one. This option is necessary if + the filesystem wasn't formatted as muFS, but is used + as one. + +prefix=path Path will be prefixed to every absolute path name of + symbolic links on an AFFS partition. Default = "/". + (See below.) + +volume=name When symbolic links with an absolute path are created + on an AFFS partition, name will be prepended as the + volume name. Default = "" (empty string). + (See below.) + +Handling of the Users/Groups and protection flags +================================================= + +Amiga -> Linux: + +The Amiga protection flags RWEDRWEDHSPARWED are handled as follows: + + - R maps to r for user, group and others. On directories, R implies x. + + - If both W and D are allowed, w will be set. + + - E maps to x. + + - H and P are always retained and ignored under Linux. + + - A is always reset when a file is written to. + +User id and group id will be used unless set[gu]id are given as mount +options. Since most of the Amiga file systems are single user systems +they will be owned by root. The root directory (the mount point) of the +Amiga filesystem will be owned by the user who actually mounts the +filesystem (the root directory doesn't have uid/gid fields). + +Linux -> Amiga: + +The Linux rwxrwxrwx file mode is handled as follows: + + - r permission will set R for user, group and others. + + - w permission will set W and D for user, group and others. + + - x permission of the user will set E for plain files. + + - All other flags (suid, sgid, ...) are ignored and will + not be retained. + +Newly created files and directories will get the user and group ID +of the current user and a mode according to the umask. + +Symbolic links +============== + +Although the Amiga and Linux file systems resemble each other, there +are some, not always subtle, differences. One of them becomes apparent +with symbolic links. While Linux has a file system with exactly one +root directory, the Amiga has a separate root directory for each +file system (for example, partition, floppy disk, ...). With the Amiga, +these entities are called "volumes". They have symbolic names which +can be used to access them. Thus, symbolic links can point to a +different volume. AFFS turns the volume name into a directory name +and prepends the prefix path (see prefix option) to it. + +Example: +You mount all your Amiga partitions under /amiga/<volume> (where +<volume> is the name of the volume), and you give the option +"prefix=/amiga/" when mounting all your AFFS partitions. (They +might be "User", "WB" and "Graphics", the mount points /amiga/User, +/amiga/WB and /amiga/Graphics). A symbolic link referring to +"User:sc/include/dos/dos.h" will be followed to +"/amiga/User/sc/include/dos/dos.h". + +Examples +======== + +Command line: + mount Archive/Amiga/Workbench3.1.adf /mnt -t affs -o loop,verbose + mount /dev/sda3 /Amiga -t affs + +/etc/fstab entry: + /dev/sdb5 /amiga/Workbench affs noauto,user,exec,verbose 0 0 + +IMPORTANT NOTE +============== + +If you boot Windows 95 (don't know about 3.x, 98 and NT) while you +have an Amiga harddisk connected to your PC, it will overwrite +the bytes 0x00dc..0x00df of block 0 with garbage, thus invalidating +the Rigid Disk Block. Sheer luck has it that this is an unused +area of the RDB, so only the checksum doesn't match anymore. +Linux will ignore this garbage and recognize the RDB anyway, but +before you connect that drive to your Amiga again, you must +restore or repair your RDB. So please do make a backup copy of it +before booting Windows! + +If the damage is already done, the following should fix the RDB +(where <disk> is the device name). +DO AT YOUR OWN RISK: + + dd if=/dev/<disk> of=rdb.tmp count=1 + cp rdb.tmp rdb.fixed + dd if=/dev/zero of=rdb.fixed bs=1 seek=220 count=4 + dd if=rdb.fixed of=/dev/<disk> + +Bugs, Restrictions, Caveats +=========================== + +Quite a few things may not work as advertised. Not everything is +tested, though several hundred MB have been read and written using +this fs. For a most up-to-date list of bugs please consult +fs/affs/Changes. + +Filenames are truncated to 30 characters without warning (this +can be changed by setting the compile-time option AFFS_NO_TRUNCATE +in include/linux/amigaffs.h). + +Case is ignored by the affs in filename matching, but Linux shells +do care about the case. Example (with /wb being an affs mounted fs): + rm /wb/WRONGCASE +will remove /mnt/wrongcase, but + rm /wb/WR* +will not since the names are matched by the shell. + +The block allocation is designed for hard disk partitions. If more +than 1 process writes to a (small) diskette, the blocks are allocated +in an ugly way (but the real AFFS doesn't do much better). This +is also true when space gets tight. + +You cannot execute programs on an OFS (Old File System), since the +program files cannot be memory mapped due to the 488 byte blocks. +For the same reason you cannot mount an image on such a filesystem +via the loopback device. + +The bitmap valid flag in the root block may not be accurate when the +system crashes while an affs partition is mounted. There's currently +no way to fix a garbled filesystem without an Amiga (disk validator) +or manually (who would do this?). Maybe later. + +If you mount affs partitions on system startup, you may want to tell +fsck that the fs should not be checked (place a '0' in the sixth field +of /etc/fstab). + +It's not possible to read floppy disks with a normal PC or workstation +due to an incompatibility with the Amiga floppy controller. + +If you are interested in an Amiga Emulator for Linux, look at + +http://www-users.informatik.rwth-aachen.de/~crux/uae.html diff --git a/Documentation/filesystems/afs.txt b/Documentation/filesystems/afs.txt new file mode 100644 index 000000000000..2f4237dfb8c7 --- /dev/null +++ b/Documentation/filesystems/afs.txt @@ -0,0 +1,155 @@ + kAFS: AFS FILESYSTEM + ==================== + +ABOUT +===== + +This filesystem provides a fairly simple AFS filesystem driver. It is under +development and only provides very basic facilities. It does not yet support +the following AFS features: + + (*) Write support. + (*) Communications security. + (*) Local caching. + (*) pioctl() system call. + (*) Automatic mounting of embedded mountpoints. + + +USAGE +===== + +When inserting the driver modules the root cell must be specified along with a +list of volume location server IP addresses: + + insmod rxrpc.o + insmod kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91 + +The first module is a driver for the RxRPC remote operation protocol, and the +second is the actual filesystem driver for the AFS filesystem. + +Once the module has been loaded, more modules can be added by the following +procedure: + + echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells + +Where the parameters to the "add" command are the name of a cell and a list of +volume location servers within that cell. + +Filesystems can be mounted anywhere by commands similar to the following: + + mount -t afs "%cambridge.redhat.com:root.afs." /afs + mount -t afs "#cambridge.redhat.com:root.cell." /afs/cambridge + mount -t afs "#root.afs." /afs + mount -t afs "#root.cell." /afs/cambridge + + NB: When using this on Linux 2.4, the mount command has to be different, + since the filesystem doesn't have access to the device name argument: + + mount -t afs none /afs -ovol="#root.afs." + +Where the initial character is either a hash or a percent symbol depending on +whether you definitely want a R/W volume (hash) or whether you'd prefer a R/O +volume, but are willing to use a R/W volume instead (percent). + +The name of the volume can be suffixes with ".backup" or ".readonly" to +specify connection to only volumes of those types. + +The name of the cell is optional, and if not given during a mount, then the +named volume will be looked up in the cell specified during insmod. + +Additional cells can be added through /proc (see later section). + + +MOUNTPOINTS +=========== + +AFS has a concept of mountpoints. These are specially formatted symbolic links +(of the same form as the "device name" passed to mount). kAFS presents these +to the user as directories that have special properties: + + (*) They cannot be listed. Running a program like "ls" on them will incur an + EREMOTE error (Object is remote). + + (*) Other objects can't be looked up inside of them. This also incurs an + EREMOTE error. + + (*) They can be queried with the readlink() system call, which will return + the name of the mountpoint to which they point. The "readlink" program + will also work. + + (*) They can be mounted on (which symbolic links can't). + + +PROC FILESYSTEM +=============== + +The rxrpc module creates a number of files in various places in the /proc +filesystem: + + (*) Firstly, some information files are made available in a directory called + "/proc/net/rxrpc/". These list the extant transport endpoint, peer, + connection and call records. + + (*) Secondly, some control files are made available in a directory called + "/proc/sys/rxrpc/". Currently, all these files can be used for is to + turn on various levels of tracing. + +The AFS modules creates a "/proc/fs/afs/" directory and populates it: + + (*) A "cells" file that lists cells currently known to the afs module. + + (*) A directory per cell that contains files that list volume location + servers, volumes, and active servers known within that cell. + + +THE CELL DATABASE +================= + +The filesystem maintains an internal database of all the cells it knows and +the IP addresses of the volume location servers for those cells. The cell to +which the computer belongs is added to the database when insmod is performed +by the "rootcell=" argument. + +Further cells can be added by commands similar to the following: + + echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells + echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells + +No other cell database operations are available at this time. + + +EXAMPLES +======== + +Here's what I use to test this. Some of the names and IP addresses are local +to my internal DNS. My "root.afs" partition has a mount point within it for +some public volumes volumes. + +insmod -S /tmp/rxrpc.o +insmod -S /tmp/kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91 + +mount -t afs \%root.afs. /afs +mount -t afs \%cambridge.redhat.com:root.cell. /afs/cambridge.redhat.com/ + +echo add grand.central.org 18.7.14.88:128.2.191.224 > /proc/fs/afs/cells +mount -t afs "#grand.central.org:root.cell." /afs/grand.central.org/ +mount -t afs "#grand.central.org:root.archive." /afs/grand.central.org/archive +mount -t afs "#grand.central.org:root.contrib." /afs/grand.central.org/contrib +mount -t afs "#grand.central.org:root.doc." /afs/grand.central.org/doc +mount -t afs "#grand.central.org:root.project." /afs/grand.central.org/project +mount -t afs "#grand.central.org:root.service." /afs/grand.central.org/service +mount -t afs "#grand.central.org:root.software." /afs/grand.central.org/software +mount -t afs "#grand.central.org:root.user." /afs/grand.central.org/user + +umount /afs/grand.central.org/user +umount /afs/grand.central.org/software +umount /afs/grand.central.org/service +umount /afs/grand.central.org/project +umount /afs/grand.central.org/doc +umount /afs/grand.central.org/contrib +umount /afs/grand.central.org/archive +umount /afs/grand.central.org +umount /afs/cambridge.redhat.com +umount /afs +rmmod kafs +rmmod rxrpc diff --git a/Documentation/filesystems/automount-support.txt b/Documentation/filesystems/automount-support.txt new file mode 100644 index 000000000000..58c65a1713e5 --- /dev/null +++ b/Documentation/filesystems/automount-support.txt @@ -0,0 +1,118 @@ +Support is available for filesystems that wish to do automounting support (such +as kAFS which can be found in fs/afs/). This facility includes allowing +in-kernel mounts to be performed and mountpoint degradation to be +requested. The latter can also be requested by userspace. + + +====================== +IN-KERNEL AUTOMOUNTING +====================== + +A filesystem can now mount another filesystem on one of its directories by the +following procedure: + + (1) Give the directory a follow_link() operation. + + When the directory is accessed, the follow_link op will be called, and + it will be provided with the location of the mountpoint in the nameidata + structure (vfsmount and dentry). + + (2) Have the follow_link() op do the following steps: + + (a) Call do_kern_mount() to call the appropriate filesystem to set up a + superblock and gain a vfsmount structure representing it. + + (b) Copy the nameidata provided as an argument and substitute the dentry + argument into it the copy. + + (c) Call do_add_mount() to install the new vfsmount into the namespace's + mountpoint tree, thus making it accessible to userspace. Use the + nameidata set up in (b) as the destination. + + If the mountpoint will be automatically expired, then do_add_mount() + should also be given the location of an expiration list (see further + down). + + (d) Release the path in the nameidata argument and substitute in the new + vfsmount and its root dentry. The ref counts on these will need + incrementing. + +Then from userspace, you can just do something like: + + [root@andromeda root]# mount -t afs \#root.afs. /afs + [root@andromeda root]# ls /afs + asd cambridge cambridge.redhat.com grand.central.org + [root@andromeda root]# ls /afs/cambridge + afsdoc + [root@andromeda root]# ls /afs/cambridge/afsdoc/ + ChangeLog html LICENSE pdf RELNOTES-1.2.2 + +And then if you look in the mountpoint catalogue, you'll see something like: + + [root@andromeda root]# cat /proc/mounts + ... + #root.afs. /afs afs rw 0 0 + #root.cell. /afs/cambridge.redhat.com afs rw 0 0 + #afsdoc. /afs/cambridge.redhat.com/afsdoc afs rw 0 0 + + +=========================== +AUTOMATIC MOUNTPOINT EXPIRY +=========================== + +Automatic expiration of mountpoints is easy, provided you've mounted the +mountpoint to be expired in the automounting procedure outlined above. + +To do expiration, you need to follow these steps: + + (3) Create at least one list off which the vfsmounts to be expired can be + hung. Access to this list will be governed by the vfsmount_lock. + + (4) In step (2c) above, the call to do_add_mount() should be provided with a + pointer to this list. It will hang the vfsmount off of it if it succeeds. + + (5) When you want mountpoints to be expired, call mark_mounts_for_expiry() + with a pointer to this list. This will process the list, marking every + vfsmount thereon for potential expiry on the next call. + + If a vfsmount was already flagged for expiry, and if its usage count is 1 + (it's only referenced by its parent vfsmount), then it will be deleted + from the namespace and thrown away (effectively unmounted). + + It may prove simplest to simply call this at regular intervals, using + some sort of timed event to drive it. + +The expiration flag is cleared by calls to mntput. This means that expiration +will only happen on the second expiration request after the last time the +mountpoint was accessed. + +If a mountpoint is moved, it gets removed from the expiration list. If a bind +mount is made on an expirable mount, the new vfsmount will not be on the +expiration list and will not expire. + +If a namespace is copied, all mountpoints contained therein will be copied, +and the copies of those that are on an expiration list will be added to the +same expiration list. + + +======================= +USERSPACE DRIVEN EXPIRY +======================= + +As an alternative, it is possible for userspace to request expiry of any +mountpoint (though some will be rejected - the current process's idea of the +rootfs for example). It does this by passing the MNT_EXPIRE flag to +umount(). This flag is considered incompatible with MNT_FORCE and MNT_DETACH. + +If the mountpoint in question is in referenced by something other than +umount() or its parent mountpoint, an EBUSY error will be returned and the +mountpoint will not be marked for expiration or unmounted. + +If the mountpoint was not already marked for expiry at that time, an EAGAIN +error will be given and it won't be unmounted. + +Otherwise if it was already marked and it wasn't referenced, unmounting will +take place as usual. + +Again, the expiration flag is cleared every time anything other than umount() +looks at a mountpoint. diff --git a/Documentation/filesystems/befs.txt b/Documentation/filesystems/befs.txt new file mode 100644 index 000000000000..877a7b1d46ec --- /dev/null +++ b/Documentation/filesystems/befs.txt @@ -0,0 +1,117 @@ +BeOS filesystem for Linux + +Document last updated: Dec 6, 2001 + +WARNING +======= +Make sure you understand that this is alpha software. This means that the +implementation is neither complete nor well-tested. + +I DISCLAIM ALL RESPONSIBILTY FOR ANY POSSIBLE BAD EFFECTS OF THIS CODE! + +LICENSE +===== +This software is covered by the GNU General Public License. +See the file COPYING for the complete text of the license. +Or the GNU website: <http://www.gnu.org/licenses/licenses.html> + +AUTHOR +===== +The largest part of the code written by Will Dyson <will_dyson@pobox.com> +He has been working on the code since Aug 13, 2001. See the changelog for +details. + +Original Author: Makoto Kato <m_kato@ga2.so-net.ne.jp> +His orriginal code can still be found at: +<http://hp.vector.co.jp/authors/VA008030/bfs/> +Does anyone know of a more current email address for Makoto? He doesn't +respond to the address given above... + +Current maintainer: Sergey S. Kostyliov <rathamahata@php4.ru> + +WHAT IS THIS DRIVER? +================== +This module implements the native filesystem of BeOS <http://www.be.com/> +for the linux 2.4.1 and later kernels. Currently it is a read-only +implementation. + +Which is it, BFS or BEFS? +================ +Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS". +But Unixware Boot Filesystem is called bfs, too. And they are already in +the kernel. Because of this nameing conflict, on Linux the BeOS +filesystem is called befs. + +HOW TO INSTALL +============== +step 1. Install the BeFS patch into the source code tree of linux. + +Apply the patchfile to your kernel source tree. +Assuming that your kernel source is in /foo/bar/linux and the patchfile +is called patch-befs-xxx, you would do the following: + + cd /foo/bar/linux + patch -p1 < /path/to/patch-befs-xxx + +if the patching step fails (i.e. there are rejected hunks), you can try to +figure it out yourself (it shouldn't be hard), or mail the maintainer +(Will Dyson <will_dyson@pobox.com>) for help. + +step 2. Configuretion & make kernel + +The linux kernel has many compile-time options. Most of them are beyond the +scope of this document. I suggest the Kernel-HOWTO document as a good general +reference on this topic. <http://www.linux.com/howto/Kernel-HOWTO.html> + +However, to use the BeFS module, you must enable it at configure time. + + cd /foo/bar/linux + make menuconfig (or xconfig) + +The BeFS module is not a standard part of the linux kernel, so you must first +enable support for experimental code under the "Code maturity level" menu. + +Then, under the "Filesystems" menu will be an option called "BeFS +filesystem (experimental)", or something like that. Enable that option +(it is fine to make it a module). + +Save your kernel configuration and then build your kernel. + +step 3. Install + +See the kernel howto <http://www.linux.com/howto/Kernel-HOWTO.html> for +instructions on this critical step. + +USING BFS +========= +To use the BeOS filesystem, use filesystem type 'befs'. + +ex) + mount -t befs /dev/fd0 /beos + +MOUNT OPTIONS +============= +uid=nnn All files in the partition will be owned by user id nnn. +gid=nnn All files in the partition will be in group nnn. +iocharset=xxx Use xxx as the name of the NLS translation table. +debug The driver will output debugging information to the syslog. + +HOW TO GET LASTEST VERSION +========================== + +The latest version is currently available at: +<http://befs-driver.sourceforge.net/> + +ANY KNOWN BUGS? +=========== +As of Jan 20, 2002: + + None + +SPECIAL THANKS +============== +Dominic Giampalo ... Writing "Practical file system design with Be filesystem" +Hiroyuki Yamada ... Testing LinuxPPC. + + + diff --git a/Documentation/filesystems/bfs.txt b/Documentation/filesystems/bfs.txt new file mode 100644 index 000000000000..d2841e0bcf02 --- /dev/null +++ b/Documentation/filesystems/bfs.txt @@ -0,0 +1,57 @@ +BFS FILESYSTEM FOR LINUX +======================== + +The BFS filesystem is used by SCO UnixWare OS for the /stand slice, which +usually contains the kernel image and a few other files required for the +boot process. + +In order to access /stand partition under Linux you obviously need to +know the partition number and the kernel must support UnixWare disk slices +(CONFIG_UNIXWARE_DISKLABEL config option). However BFS support does not +depend on having UnixWare disklabel support because one can also mount +BFS filesystem via loopback: + +# losetup /dev/loop0 stand.img +# mount -t bfs /dev/loop0 /mnt/stand + +where stand.img is a file containing the image of BFS filesystem. +When you have finished using it and umounted you need to also deallocate +/dev/loop0 device by: + +# losetup -d /dev/loop0 + +You can simplify mounting by just typing: + +# mount -t bfs -o loop stand.img /mnt/stand + +this will allocate the first available loopback device (and load loop.o +kernel module if necessary) automatically. If the loopback driver is not +loaded automatically, make sure that your kernel is compiled with kmod +support (CONFIG_KMOD) enabled. Beware that umount will not +deallocate /dev/loopN device if /etc/mtab file on your system is a +symbolic link to /proc/mounts. You will need to do it manually using +"-d" switch of losetup(8). Read losetup(8) manpage for more info. + +To create the BFS image under UnixWare you need to find out first which +slice contains it. The command prtvtoc(1M) is your friend: + +# prtvtoc /dev/rdsk/c0b0t0d0s0 + +(assuming your root disk is on target=0, lun=0, bus=0, controller=0). Then you +look for the slice with tag "STAND", which is usually slice 10. With this +information you can use dd(1) to create the BFS image: + +# umount /stand +# dd if=/dev/rdsk/c0b0t0d0sa of=stand.img bs=512 + +Just in case, you can verify that you have done the right thing by checking +the magic number: + +# od -Ad -tx4 stand.img | more + +The first 4 bytes should be 0x1badface. + +If you have any patches, questions or suggestions regarding this BFS +implementation please contact the author: + +Tigran A. Aivazian <tigran@veritas.com> diff --git a/Documentation/filesystems/cifs.txt b/Documentation/filesystems/cifs.txt new file mode 100644 index 000000000000..49cc923a93e3 --- /dev/null +++ b/Documentation/filesystems/cifs.txt @@ -0,0 +1,51 @@ + This is the client VFS module for the Common Internet File System + (CIFS) protocol which is the successor to the Server Message Block + (SMB) protocol, the native file sharing mechanism for most early + PC operating systems. CIFS is fully supported by current network + file servers such as Windows 2000, Windows 2003 (including + Windows XP) as well by Samba (which provides excellent CIFS + server support for Linux and many other operating systems), so + this network filesystem client can mount to a wide variety of + servers. The smbfs module should be used instead of this cifs module + for mounting to older SMB servers such as OS/2. The smbfs and cifs + modules can coexist and do not conflict. The CIFS VFS filesystem + module is designed to work well with servers that implement the + newer versions (dialects) of the SMB/CIFS protocol such as Samba, + the program written by Andrew Tridgell that turns any Unix host + into a SMB/CIFS file server. + + The intent of this module is to provide the most advanced network + file system function for CIFS compliant servers, including better + POSIX compliance, secure per-user session establishment, high + performance safe distributed caching (oplock), optional packet + signing, large files, Unicode support and other internationalization + improvements. Since both Samba server and this filesystem client support + the CIFS Unix extensions, the combination can provide a reasonable + alternative to NFSv4 for fileserving in some Linux to Linux environments, + not just in Linux to Windows environments. + + This filesystem has an optional mount utility (mount.cifs) that can + be obtained from the project page and installed in the path in the same + directory with the other mount helpers (such as mount.smbfs). + Mounting using the cifs filesystem without installing the mount helper + requires specifying the server's ip address. + + For Linux 2.4: + mount //anything/here /mnt_target -o + user=username,pass=password,unc=//ip_address_of_server/sharename + + For Linux 2.5: + mount //ip_address_of_server/sharename /mnt_target -o user=username, pass=password + + + For more information on the module see the project page at + + http://us1.samba.org/samba/Linux_CIFS_client.html + + For more information on CIFS see: + + http://www.snia.org/tech_activities/CIFS + + or the Samba site: + + http://www.samba.org diff --git a/Documentation/filesystems/coda.txt b/Documentation/filesystems/coda.txt new file mode 100644 index 000000000000..61311356025d --- /dev/null +++ b/Documentation/filesystems/coda.txt @@ -0,0 +1,1673 @@ +NOTE: +This is one of the technical documents describing a component of +Coda -- this document describes the client kernel-Venus interface. + +For more information: + http://www.coda.cs.cmu.edu +For user level software needed to run Coda: + ftp://ftp.coda.cs.cmu.edu + +To run Coda you need to get a user level cache manager for the client, +named Venus, as well as tools to manipulate ACLs, to log in, etc. The +client needs to have the Coda filesystem selected in the kernel +configuration. + +The server needs a user level server and at present does not depend on +kernel support. + + + + + + + + The Venus kernel interface + Peter J. Braam + v1.0, Nov 9, 1997 + + This document describes the communication between Venus and kernel + level filesystem code needed for the operation of the Coda file sys- + tem. This document version is meant to describe the current interface + (version 1.0) as well as improvements we envisage. + ______________________________________________________________________ + + Table of Contents + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 1. Introduction + + 2. Servicing Coda filesystem calls + + 3. The message layer + + 3.1 Implementation details + + 4. The interface at the call level + + 4.1 Data structures shared by the kernel and Venus + 4.2 The pioctl interface + 4.3 root + 4.4 lookup + 4.5 getattr + 4.6 setattr + 4.7 access + 4.8 create + 4.9 mkdir + 4.10 link + 4.11 symlink + 4.12 remove + 4.13 rmdir + 4.14 readlink + 4.15 open + 4.16 close + 4.17 ioctl + 4.18 rename + 4.19 readdir + 4.20 vget + 4.21 fsync + 4.22 inactive + 4.23 rdwr + 4.24 odymount + 4.25 ody_lookup + 4.26 ody_expand + 4.27 prefetch + 4.28 signal + + 5. The minicache and downcalls + + 5.1 INVALIDATE + 5.2 FLUSH + 5.3 PURGEUSER + 5.4 ZAPFILE + 5.5 ZAPDIR + 5.6 ZAPVNODE + 5.7 PURGEFID + 5.8 REPLACE + + 6. Initialization and cleanup + + 6.1 Requirements + + + ______________________________________________________________________ + 0wpage + + 11.. IInnttrroodduuccttiioonn + + + + A key component in the Coda Distributed File System is the cache + manager, _V_e_n_u_s. + + + When processes on a Coda enabled system access files in the Coda + filesystem, requests are directed at the filesystem layer in the + operating system. The operating system will communicate with Venus to + service the request for the process. Venus manages a persistent + client cache and makes remote procedure calls to Coda file servers and + related servers (such as authentication servers) to service these + requests it receives from the operating system. When Venus has + serviced a request it replies to the operating system with appropriate + return codes, and other data related to the request. Optionally the + kernel support for Coda may maintain a minicache of recently processed + requests to limit the number of interactions with Venus. Venus + possesses the facility to inform the kernel when elements from its + minicache are no longer valid. + + This document describes precisely this communication between the + kernel and Venus. The definitions of so called upcalls and downcalls + will be given with the format of the data they handle. We shall also + describe the semantic invariants resulting from the calls. + + Historically Coda was implemented in a BSD file system in Mach 2.6. + The interface between the kernel and Venus is very similar to the BSD + VFS interface. Similar functionality is provided, and the format of + the parameters and returned data is very similar to the BSD VFS. This + leads to an almost natural environment for implementing a kernel-level + filesystem driver for Coda in a BSD system. However, other operating + systems such as Linux and Windows 95 and NT have virtual filesystem + with different interfaces. + + To implement Coda on these systems some reverse engineering of the + Venus/Kernel protocol is necessary. Also it came to light that other + systems could profit significantly from certain small optimizations + and modifications to the protocol. To facilitate this work as well as + to make future ports easier, communication between Venus and the + kernel should be documented in great detail. This is the aim of this + document. + + 0wpage + + 22.. SSeerrvviicciinngg CCooddaa ffiilleessyysstteemm ccaallllss + + The service of a request for a Coda file system service originates in + a process PP which accessing a Coda file. It makes a system call which + traps to the OS kernel. Examples of such calls trapping to the kernel + are _r_e_a_d_, _w_r_i_t_e_, _o_p_e_n_, _c_l_o_s_e_, _c_r_e_a_t_e_, _m_k_d_i_r_, _r_m_d_i_r_, _c_h_m_o_d in a Unix + context. Similar calls exist in the Win32 environment, and are named + _C_r_e_a_t_e_F_i_l_e_, . + + Generally the operating system handles the request in a virtual + filesystem (VFS) layer, which is named I/O Manager in NT and IFS + manager in Windows 95. The VFS is responsible for partial processing + of the request and for locating the specific filesystem(s) which will + service parts of the request. Usually the information in the path + assists in locating the correct FS drivers. Sometimes after extensive + pre-processing, the VFS starts invoking exported routines in the FS + driver. This is the point where the FS specific processing of the + request starts, and here the Coda specific kernel code comes into + play. + + The FS layer for Coda must expose and implement several interfaces. + First and foremost the VFS must be able to make all necessary calls to + the Coda FS layer, so the Coda FS driver must expose the VFS interface + as applicable in the operating system. These differ very significantly + among operating systems, but share features such as facilities to + read/write and create and remove objects. The Coda FS layer services + such VFS requests by invoking one or more well defined services + offered by the cache manager Venus. When the replies from Venus have + come back to the FS driver, servicing of the VFS call continues and + finishes with a reply to the kernel's VFS. Finally the VFS layer + returns to the process. + + As a result of this design a basic interface exposed by the FS driver + must allow Venus to manage message traffic. In particular Venus must + be able to retrieve and place messages and to be notified of the + arrival of a new message. The notification must be through a mechanism + which does not block Venus since Venus must attend to other tasks even + when no messages are waiting or being processed. + + + + + + + Interfaces of the Coda FS Driver + + Furthermore the FS layer provides for a special path of communication + between a user process and Venus, called the pioctl interface. The + pioctl interface is used for Coda specific services, such as + requesting detailed information about the persistent cache managed by + Venus. Here the involvement of the kernel is minimal. It identifies + the calling process and passes the information on to Venus. When + Venus replies the response is passed back to the caller in unmodified + form. + + Finally Venus allows the kernel FS driver to cache the results from + certain services. This is done to avoid excessive context switches + and results in an efficient system. However, Venus may acquire + information, for example from the network which implies that cached + information must be flushed or replaced. Venus then makes a downcall + to the Coda FS layer to request flushes or updates in the cache. The + kernel FS driver handles such requests synchronously. + + Among these interfaces the VFS interface and the facility to place, + receive and be notified of messages are platform specific. We will + not go into the calls exported to the VFS layer but we will state the + requirements of the message exchange mechanism. + + 0wpage + + 33.. TThhee mmeessssaaggee llaayyeerr + + + + At the lowest level the communication between Venus and the FS driver + proceeds through messages. The synchronization between processes + requesting Coda file service and Venus relies on blocking and waking + up processes. The Coda FS driver processes VFS- and pioctl-requests + on behalf of a process P, creates messages for Venus, awaits replies + and finally returns to the caller. The implementation of the exchange + of messages is platform specific, but the semantics have (so far) + appeared to be generally applicable. Data buffers are created by the + FS Driver in kernel memory on behalf of P and copied to user memory in + Venus. + + The FS Driver while servicing P makes upcalls to Venus. Such an + upcall is dispatched to Venus by creating a message structure. The + structure contains the identification of P, the message sequence + number, the size of the request and a pointer to the data in kernel + memory for the request. Since the data buffer is re-used to hold the + reply from Venus, there is a field for the size of the reply. A flags + field is used in the message to precisely record the status of the + message. Additional platform dependent structures involve pointers to + determine the position of the message on queues and pointers to + synchronization objects. In the upcall routine the message structure + is filled in, flags are set to 0, and it is placed on the _p_e_n_d_i_n_g + queue. The routine calling upcall is responsible for allocating the + data buffer; its structure will be described in the next section. + + A facility must exist to notify Venus that the message has been + created, and implemented using available synchronization objects in + the OS. This notification is done in the upcall context of the process + P. When the message is on the pending queue, process P cannot proceed + in upcall. The (kernel mode) processing of P in the filesystem + request routine must be suspended until Venus has replied. Therefore + the calling thread in P is blocked in upcall. A pointer in the + message structure will locate the synchronization object on which P is + sleeping. + + Venus detects the notification that a message has arrived, and the FS + driver allow Venus to retrieve the message with a getmsg_from_kernel + call. This action finishes in the kernel by putting the message on the + queue of processing messages and setting flags to READ. Venus is + passed the contents of the data buffer. The getmsg_from_kernel call + now returns and Venus processes the request. + + At some later point the FS driver receives a message from Venus, + namely when Venus calls sendmsg_to_kernel. At this moment the Coda FS + driver looks at the contents of the message and decides if: + + + +o the message is a reply for a suspended thread P. If so it removes + the message from the processing queue and marks the message as + WRITTEN. Finally, the FS driver unblocks P (still in the kernel + mode context of Venus) and the sendmsg_to_kernel call returns to + Venus. The process P will be scheduled at some point and continues + processing its upcall with the data buffer replaced with the reply + from Venus. + + +o The message is a _d_o_w_n_c_a_l_l. A downcall is a request from Venus to + the FS Driver. The FS driver processes the request immediately + (usually a cache eviction or replacement) and when it finishes + sendmsg_to_kernel returns. + + Now P awakes and continues processing upcall. There are some + subtleties to take account of. First P will determine if it was woken + up in upcall by a signal from some other source (for example an + attempt to terminate P) or as is normally the case by Venus in its + sendmsg_to_kernel call. In the normal case, the upcall routine will + deallocate the message structure and return. The FS routine can proceed + with its processing. + + + + + + + + Sleeping and IPC arrangements + + In case P is woken up by a signal and not by Venus, it will first look + at the flags field. If the message is not yet READ, the process P can + handle its signal without notifying Venus. If Venus has READ, and + the request should not be processed, P can send Venus a signal message + to indicate that it should disregard the previous message. Such + signals are put in the queue at the head, and read first by Venus. If + the message is already marked as WRITTEN it is too late to stop the + processing. The VFS routine will now continue. (-- If a VFS request + involves more than one upcall, this can lead to complicated state, an + extra field "handle_signals" could be added in the message structure + to indicate points of no return have been passed.--) + + + + 33..11.. IImmpplleemmeennttaattiioonn ddeettaaiillss + + The Unix implementation of this mechanism has been through the + implementation of a character device associated with Coda. Venus + retrieves messages by doing a read on the device, replies are sent + with a write and notification is through the select system call on the + file descriptor for the device. The process P is kept waiting on an + interruptible wait queue object. + + In Windows NT and the DPMI Windows 95 implementation a DeviceIoControl + call is used. The DeviceIoControl call is designed to copy buffers + from user memory to kernel memory with OPCODES. The sendmsg_to_kernel + is issued as a synchronous call, while the getmsg_from_kernel call is + asynchronous. Windows EventObjects are used for notification of + message arrival. The process P is kept waiting on a KernelEvent + object in NT and a semaphore in Windows 95. + + 0wpage + + 44.. TThhee iinntteerrffaaccee aatt tthhee ccaallll lleevveell + + + This section describes the upcalls a Coda FS driver can make to Venus. + Each of these upcalls make use of two structures: inputArgs and + outputArgs. In pseudo BNF form the structures take the following + form: + + + struct inputArgs { + u_long opcode; + u_long unique; /* Keep multiple outstanding msgs distinct */ + u_short pid; /* Common to all */ + u_short pgid; /* Common to all */ + struct CodaCred cred; /* Common to all */ + + <union "in" of call dependent parts of inputArgs> + }; + + struct outputArgs { + u_long opcode; + u_long unique; /* Keep multiple outstanding msgs distinct */ + u_long result; + + <union "out" of call dependent parts of inputArgs> + }; + + + + Before going on let us elucidate the role of the various fields. The + inputArgs start with the opcode which defines the type of service + requested from Venus. There are approximately 30 upcalls at present + which we will discuss. The unique field labels the inputArg with a + unique number which will identify the message uniquely. A process and + process group id are passed. Finally the credentials of the caller + are included. + + Before delving into the specific calls we need to discuss a variety of + data structures shared by the kernel and Venus. + + + + + 44..11.. DDaattaa ssttrruuccttuurreess sshhaarreedd bbyy tthhee kkeerrnneell aanndd VVeennuuss + + + The CodaCred structure defines a variety of user and group ids as + they are set for the calling process. The vuid_t and guid_t are 32 bit + unsigned integers. It also defines group membership in an array. On + Unix the CodaCred has proven sufficient to implement good security + semantics for Coda but the structure may have to undergo modification + for the Windows environment when these mature. + + struct CodaCred { + vuid_t cr_uid, cr_euid, cr_suid, cr_fsuid; /* Real, effective, set, fs uid*/ + vgid_t cr_gid, cr_egid, cr_sgid, cr_fsgid; /* same for groups */ + vgid_t cr_groups[NGROUPS]; /* Group membership for caller */ + }; + + + + NNOOTTEE It is questionable if we need CodaCreds in Venus. Finally Venus + doesn't know about groups, although it does create files with the + default uid/gid. Perhaps the list of group membership is superfluous. + + + The next item is the fundamental identifier used to identify Coda + files, the ViceFid. A fid of a file uniquely defines a file or + directory in the Coda filesystem within a _c_e_l_l. (-- A _c_e_l_l is a + group of Coda servers acting under the aegis of a single system + control machine or SCM. See the Coda Administration manual for a + detailed description of the role of the SCM.--) + + + typedef struct ViceFid { + VolumeId Volume; + VnodeId Vnode; + Unique_t Unique; + } ViceFid; + + + + Each of the constituent fields: VolumeId, VnodeId and Unique_t are + unsigned 32 bit integers. We envisage that a further field will need + to be prefixed to identify the Coda cell; this will probably take the + form of a Ipv6 size IP address naming the Coda cell through DNS. + + The next important structure shared between Venus and the kernel is + the attributes of the file. The following structure is used to + exchange information. It has room for future extensions such as + support for device files (currently not present in Coda). + + + + + + + + + + + + + + + + + + + struct coda_vattr { + enum coda_vtype va_type; /* vnode type (for create) */ + u_short va_mode; /* files access mode and type */ + short va_nlink; /* number of references to file */ + vuid_t va_uid; /* owner user id */ + vgid_t va_gid; /* owner group id */ + long va_fsid; /* file system id (dev for now) */ + long va_fileid; /* file id */ + u_quad_t va_size; /* file size in bytes */ + long va_blocksize; /* blocksize preferred for i/o */ + struct timespec va_atime; /* time of last access */ + struct timespec va_mtime; /* time of last modification */ + struct timespec va_ctime; /* time file changed */ + u_long va_gen; /* generation number of file */ + u_long va_flags; /* flags defined for file */ + dev_t va_rdev; /* device special file represents */ + u_quad_t va_bytes; /* bytes of disk space held by file */ + u_quad_t va_filerev; /* file modification number */ + u_int va_vaflags; /* operations flags, see below */ + long va_spare; /* remain quad aligned */ + }; + + + + + 44..22.. TThhee ppiiooccttll iinntteerrffaaccee + + + Coda specific requests can be made by application through the pioctl + interface. The pioctl is implemented as an ordinary ioctl on a + fictitious file /coda/.CONTROL. The pioctl call opens this file, gets + a file handle and makes the ioctl call. Finally it closes the file. + + The kernel involvement in this is limited to providing the facility to + open and close and pass the ioctl message _a_n_d to verify that a path in + the pioctl data buffers is a file in a Coda filesystem. + + The kernel is handed a data packet of the form: + + struct { + const char *path; + struct ViceIoctl vidata; + int follow; + } data; + + + + where + + + struct ViceIoctl { + caddr_t in, out; /* Data to be transferred in, or out */ + short in_size; /* Size of input buffer <= 2K */ + short out_size; /* Maximum size of output buffer, <= 2K */ + }; + + + + The path must be a Coda file, otherwise the ioctl upcall will not be + made. + + NNOOTTEE The data structures and code are a mess. We need to clean this + up. + + We now proceed to document the individual calls: + + 0wpage + + 44..33.. rroooott + + + AArrgguummeennttss + + iinn empty + + oouutt + + struct cfs_root_out { + ViceFid VFid; + } cfs_root; + + + + DDeessccrriippttiioonn This call is made to Venus during the initialization of + the Coda filesystem. If the result is zero, the cfs_root structure + contains the ViceFid of the root of the Coda filesystem. If a non-zero + result is generated, its value is a platform dependent error code + indicating the difficulty Venus encountered in locating the root of + the Coda filesystem. + + 0wpage + + 44..44.. llooookkuupp + + + SSuummmmaarryy Find the ViceFid and type of an object in a directory if it + exists. + + AArrgguummeennttss + + iinn + + struct cfs_lookup_in { + ViceFid VFid; + char *name; /* Place holder for data. */ + } cfs_lookup; + + + + oouutt + + struct cfs_lookup_out { + ViceFid VFid; + int vtype; + } cfs_lookup; + + + + DDeessccrriippttiioonn This call is made to determine the ViceFid and filetype of + a directory entry. The directory entry requested carries name name + and Venus will search the directory identified by cfs_lookup_in.VFid. + The result may indicate that the name does not exist, or that + difficulty was encountered in finding it (e.g. due to disconnection). + If the result is zero, the field cfs_lookup_out.VFid contains the + targets ViceFid and cfs_lookup_out.vtype the coda_vtype giving the + type of object the name designates. + + The name of the object is an 8 bit character string of maximum length + CFS_MAXNAMLEN, currently set to 256 (including a 0 terminator.) + + It is extremely important to realize that Venus bitwise ors the field + cfs_lookup.vtype with CFS_NOCACHE to indicate that the object should + not be put in the kernel name cache. + + NNOOTTEE The type of the vtype is currently wrong. It should be + coda_vtype. Linux does not take note of CFS_NOCACHE. It should. + + 0wpage + + 44..55.. ggeettaattttrr + + + SSuummmmaarryy Get the attributes of a file. + + AArrgguummeennttss + + iinn + + struct cfs_getattr_in { + ViceFid VFid; + struct coda_vattr attr; /* XXXXX */ + } cfs_getattr; + + + + oouutt + + struct cfs_getattr_out { + struct coda_vattr attr; + } cfs_getattr; + + + + DDeessccrriippttiioonn This call returns the attributes of the file identified by + fid. + + EErrrroorrss Errors can occur if the object with fid does not exist, is + unaccessible or if the caller does not have permission to fetch + attributes. + + NNoottee Many kernel FS drivers (Linux, NT and Windows 95) need to acquire + the attributes as well as the Fid for the instantiation of an internal + "inode" or "FileHandle". A significant improvement in performance on + such systems could be made by combining the _l_o_o_k_u_p and _g_e_t_a_t_t_r calls + both at the Venus/kernel interaction level and at the RPC level. + + The vattr structure included in the input arguments is superfluous and + should be removed. + + 0wpage + + 44..66.. sseettaattttrr + + + SSuummmmaarryy Set the attributes of a file. + + AArrgguummeennttss + + iinn + + struct cfs_setattr_in { + ViceFid VFid; + struct coda_vattr attr; + } cfs_setattr; + + + + + oouutt + empty + + DDeessccrriippttiioonn The structure attr is filled with attributes to be changed + in BSD style. Attributes not to be changed are set to -1, apart from + vtype which is set to VNON. Other are set to the value to be assigned. + The only attributes which the FS driver may request to change are the + mode, owner, groupid, atime, mtime and ctime. The return value + indicates success or failure. + + EErrrroorrss A variety of errors can occur. The object may not exist, may + be inaccessible, or permission may not be granted by Venus. + + 0wpage + + 44..77.. aacccceessss + + + SSuummmmaarryy + + AArrgguummeennttss + + iinn + + struct cfs_access_in { + ViceFid VFid; + int flags; + } cfs_access; + + + + oouutt + empty + + DDeessccrriippttiioonn Verify if access to the object identified by VFid for + operations described by flags is permitted. The result indicates if + access will be granted. It is important to remember that Coda uses + ACLs to enforce protection and that ultimately the servers, not the + clients enforce the security of the system. The result of this call + will depend on whether a _t_o_k_e_n is held by the user. + + EErrrroorrss The object may not exist, or the ACL describing the protection + may not be accessible. + + 0wpage + + 44..88.. ccrreeaattee + + + SSuummmmaarryy Invoked to create a file + + AArrgguummeennttss + + iinn + + struct cfs_create_in { + ViceFid VFid; + struct coda_vattr attr; + int excl; + int mode; + char *name; /* Place holder for data. */ + } cfs_create; + + + + + oouutt + + struct cfs_create_out { + ViceFid VFid; + struct coda_vattr attr; + } cfs_create; + + + + DDeessccrriippttiioonn This upcall is invoked to request creation of a file. + The file will be created in the directory identified by VFid, its name + will be name, and the mode will be mode. If excl is set an error will + be returned if the file already exists. If the size field in attr is + set to zero the file will be truncated. The uid and gid of the file + are set by converting the CodaCred to a uid using a macro CRTOUID + (this macro is platform dependent). Upon success the VFid and + attributes of the file are returned. The Coda FS Driver will normally + instantiate a vnode, inode or file handle at kernel level for the new + object. + + + EErrrroorrss A variety of errors can occur. Permissions may be insufficient. + If the object exists and is not a file the error EISDIR is returned + under Unix. + + NNOOTTEE The packing of parameters is very inefficient and appears to + indicate confusion between the system call creat and the VFS operation + create. The VFS operation create is only called to create new objects. + This create call differs from the Unix one in that it is not invoked + to return a file descriptor. The truncate and exclusive options, + together with the mode, could simply be part of the mode as it is + under Unix. There should be no flags argument; this is used in open + (2) to return a file descriptor for READ or WRITE mode. + + The attributes of the directory should be returned too, since the size + and mtime changed. + + 0wpage + + 44..99.. mmkkddiirr + + + SSuummmmaarryy Create a new directory. + + AArrgguummeennttss + + iinn + + struct cfs_mkdir_in { + ViceFid VFid; + struct coda_vattr attr; + char *name; /* Place holder for data. */ + } cfs_mkdir; + + + + oouutt + + struct cfs_mkdir_out { + ViceFid VFid; + struct coda_vattr attr; + } cfs_mkdir; + + + + + DDeessccrriippttiioonn This call is similar to create but creates a directory. + Only the mode field in the input parameters is used for creation. + Upon successful creation, the attr returned contains the attributes of + the new directory. + + EErrrroorrss As for create. + + NNOOTTEE The input parameter should be changed to mode instead of + attributes. + + The attributes of the parent should be returned since the size and + mtime changes. + + 0wpage + + 44..1100.. lliinnkk + + + SSuummmmaarryy Create a link to an existing file. + + AArrgguummeennttss + + iinn + + struct cfs_link_in { + ViceFid sourceFid; /* cnode to link *to* */ + ViceFid destFid; /* Directory in which to place link */ + char *tname; /* Place holder for data. */ + } cfs_link; + + + + oouutt + empty + + DDeessccrriippttiioonn This call creates a link to the sourceFid in the directory + identified by destFid with name tname. The source must reside in the + target's parent, i.e. the source must be have parent destFid, i.e. Coda + does not support cross directory hard links. Only the return value is + relevant. It indicates success or the type of failure. + + EErrrroorrss The usual errors can occur.0wpage + + 44..1111.. ssyymmlliinnkk + + + SSuummmmaarryy create a symbolic link + + AArrgguummeennttss + + iinn + + struct cfs_symlink_in { + ViceFid VFid; /* Directory to put symlink in */ + char *srcname; + struct coda_vattr attr; + char *tname; + } cfs_symlink; + + + + oouutt + none + + DDeessccrriippttiioonn Create a symbolic link. The link is to be placed in the + directory identified by VFid and named tname. It should point to the + pathname srcname. The attributes of the newly created object are to + be set to attr. + + EErrrroorrss + + NNOOTTEE The attributes of the target directory should be returned since + its size changed. + + 0wpage + + 44..1122.. rreemmoovvee + + + SSuummmmaarryy Remove a file + + AArrgguummeennttss + + iinn + + struct cfs_remove_in { + ViceFid VFid; + char *name; /* Place holder for data. */ + } cfs_remove; + + + + oouutt + none + + DDeessccrriippttiioonn Remove file named cfs_remove_in.name in directory + identified by VFid. + + EErrrroorrss + + NNOOTTEE The attributes of the directory should be returned since its + mtime and size may change. + + 0wpage + + 44..1133.. rrmmddiirr + + + SSuummmmaarryy Remove a directory + + AArrgguummeennttss + + iinn + + struct cfs_rmdir_in { + ViceFid VFid; + char *name; /* Place holder for data. */ + } cfs_rmdir; + + + + oouutt + none + + DDeessccrriippttiioonn Remove the directory with name name from the directory + identified by VFid. + + EErrrroorrss + + NNOOTTEE The attributes of the parent directory should be returned since + its mtime and size may change. + + 0wpage + + 44..1144.. rreeaaddlliinnkk + + + SSuummmmaarryy Read the value of a symbolic link. + + AArrgguummeennttss + + iinn + + struct cfs_readlink_in { + ViceFid VFid; + } cfs_readlink; + + + + oouutt + + struct cfs_readlink_out { + int count; + caddr_t data; /* Place holder for data. */ + } cfs_readlink; + + + + DDeessccrriippttiioonn This routine reads the contents of symbolic link + identified by VFid into the buffer data. The buffer data must be able + to hold any name up to CFS_MAXNAMLEN (PATH or NAM??). + + EErrrroorrss No unusual errors. + + 0wpage + + 44..1155.. ooppeenn + + + SSuummmmaarryy Open a file. + + AArrgguummeennttss + + iinn + + struct cfs_open_in { + ViceFid VFid; + int flags; + } cfs_open; + + + + oouutt + + struct cfs_open_out { + dev_t dev; + ino_t inode; + } cfs_open; + + + + DDeessccrriippttiioonn This request asks Venus to place the file identified by + VFid in its cache and to note that the calling process wishes to open + it with flags as in open(2). The return value to the kernel differs + for Unix and Windows systems. For Unix systems the Coda FS Driver is + informed of the device and inode number of the container file in the + fields dev and inode. For Windows the path of the container file is + returned to the kernel. + EErrrroorrss + + NNOOTTEE Currently the cfs_open_out structure is not properly adapted to + deal with the Windows case. It might be best to implement two + upcalls, one to open aiming at a container file name, the other at a + container file inode. + + 0wpage + + 44..1166.. cclloossee + + + SSuummmmaarryy Close a file, update it on the servers. + + AArrgguummeennttss + + iinn + + struct cfs_close_in { + ViceFid VFid; + int flags; + } cfs_close; + + + + oouutt + none + + DDeessccrriippttiioonn Close the file identified by VFid. + + EErrrroorrss + + NNOOTTEE The flags argument is bogus and not used. However, Venus' code + has room to deal with an execp input field, probably this field should + be used to inform Venus that the file was closed but is still memory + mapped for execution. There are comments about fetching versus not + fetching the data in Venus vproc_vfscalls. This seems silly. If a + file is being closed, the data in the container file is to be the new + data. Here again the execp flag might be in play to create confusion: + currently Venus might think a file can be flushed from the cache when + it is still memory mapped. This needs to be understood. + + 0wpage + + 44..1177.. iiooccttll + + + SSuummmmaarryy Do an ioctl on a file. This includes the pioctl interface. + + AArrgguummeennttss + + iinn + + struct cfs_ioctl_in { + ViceFid VFid; + int cmd; + int len; + int rwflag; + char *data; /* Place holder for data. */ + } cfs_ioctl; + + + + oouutt + + + struct cfs_ioctl_out { + int len; + caddr_t data; /* Place holder for data. */ + } cfs_ioctl; + + + + DDeessccrriippttiioonn Do an ioctl operation on a file. The command, len and + data arguments are filled as usual. flags is not used by Venus. + + EErrrroorrss + + NNOOTTEE Another bogus parameter. flags is not used. What is the + business about PREFETCHING in the Venus code? + + + 0wpage + + 44..1188.. rreennaammee + + + SSuummmmaarryy Rename a fid. + + AArrgguummeennttss + + iinn + + struct cfs_rename_in { + ViceFid sourceFid; + char *srcname; + ViceFid destFid; + char *destname; + } cfs_rename; + + + + oouutt + none + + DDeessccrriippttiioonn Rename the object with name srcname in directory + sourceFid to destname in destFid. It is important that the names + srcname and destname are 0 terminated strings. Strings in Unix + kernels are not always null terminated. + + EErrrroorrss + + 0wpage + + 44..1199.. rreeaaddddiirr + + + SSuummmmaarryy Read directory entries. + + AArrgguummeennttss + + iinn + + struct cfs_readdir_in { + ViceFid VFid; + int count; + int offset; + } cfs_readdir; + + + + + oouutt + + struct cfs_readdir_out { + int size; + caddr_t data; /* Place holder for data. */ + } cfs_readdir; + + + + DDeessccrriippttiioonn Read directory entries from VFid starting at offset and + read at most count bytes. Returns the data in data and returns + the size in size. + + EErrrroorrss + + NNOOTTEE This call is not used. Readdir operations exploit container + files. We will re-evaluate this during the directory revamp which is + about to take place. + + 0wpage + + 44..2200.. vvggeett + + + SSuummmmaarryy instructs Venus to do an FSDB->Get. + + AArrgguummeennttss + + iinn + + struct cfs_vget_in { + ViceFid VFid; + } cfs_vget; + + + + oouutt + + struct cfs_vget_out { + ViceFid VFid; + int vtype; + } cfs_vget; + + + + DDeessccrriippttiioonn This upcall asks Venus to do a get operation on an fsobj + labelled by VFid. + + EErrrroorrss + + NNOOTTEE This operation is not used. However, it is extremely useful + since it can be used to deal with read/write memory mapped files. + These can be "pinned" in the Venus cache using vget and released with + inactive. + + 0wpage + + 44..2211.. ffssyynncc + + + SSuummmmaarryy Tell Venus to update the RVM attributes of a file. + + AArrgguummeennttss + + iinn + + struct cfs_fsync_in { + ViceFid VFid; + } cfs_fsync; + + + + oouutt + none + + DDeessccrriippttiioonn Ask Venus to update RVM attributes of object VFid. This + should be called as part of kernel level fsync type calls. The + result indicates if the syncing was successful. + + EErrrroorrss + + NNOOTTEE Linux does not implement this call. It should. + + 0wpage + + 44..2222.. iinnaaccttiivvee + + + SSuummmmaarryy Tell Venus a vnode is no longer in use. + + AArrgguummeennttss + + iinn + + struct cfs_inactive_in { + ViceFid VFid; + } cfs_inactive; + + + + oouutt + none + + DDeessccrriippttiioonn This operation returns EOPNOTSUPP. + + EErrrroorrss + + NNOOTTEE This should perhaps be removed. + + 0wpage + + 44..2233.. rrddwwrr + + + SSuummmmaarryy Read or write from a file + + AArrgguummeennttss + + iinn + + struct cfs_rdwr_in { + ViceFid VFid; + int rwflag; + int count; + int offset; + int ioflag; + caddr_t data; /* Place holder for data. */ + } cfs_rdwr; + + + + + oouutt + + struct cfs_rdwr_out { + int rwflag; + int count; + caddr_t data; /* Place holder for data. */ + } cfs_rdwr; + + + + DDeessccrriippttiioonn This upcall asks Venus to read or write from a file. + + EErrrroorrss + + NNOOTTEE It should be removed since it is against the Coda philosophy that + read/write operations never reach Venus. I have been told the + operation does not work. It is not currently used. + + + 0wpage + + 44..2244.. ooddyymmoouunntt + + + SSuummmmaarryy Allows mounting multiple Coda "filesystems" on one Unix mount + point. + + AArrgguummeennttss + + iinn + + struct ody_mount_in { + char *name; /* Place holder for data. */ + } ody_mount; + + + + oouutt + + struct ody_mount_out { + ViceFid VFid; + } ody_mount; + + + + DDeessccrriippttiioonn Asks Venus to return the rootfid of a Coda system named + name. The fid is returned in VFid. + + EErrrroorrss + + NNOOTTEE This call was used by David for dynamic sets. It should be + removed since it causes a jungle of pointers in the VFS mounting area. + It is not used by Coda proper. Call is not implemented by Venus. + + 0wpage + + 44..2255.. ooddyy__llooookkuupp + + + SSuummmmaarryy Looks up something. + + AArrgguummeennttss + + iinn irrelevant + + + oouutt + irrelevant + + DDeessccrriippttiioonn + + EErrrroorrss + + NNOOTTEE Gut it. Call is not implemented by Venus. + + 0wpage + + 44..2266.. ooddyy__eexxppaanndd + + + SSuummmmaarryy expands something in a dynamic set. + + AArrgguummeennttss + + iinn irrelevant + + oouutt + irrelevant + + DDeessccrriippttiioonn + + EErrrroorrss + + NNOOTTEE Gut it. Call is not implemented by Venus. + + 0wpage + + 44..2277.. pprreeffeettcchh + + + SSuummmmaarryy Prefetch a dynamic set. + + AArrgguummeennttss + + iinn Not documented. + + oouutt + Not documented. + + DDeessccrriippttiioonn Venus worker.cc has support for this call, although it is + noted that it doesn't work. Not surprising, since the kernel does not + have support for it. (ODY_PREFETCH is not a defined operation). + + EErrrroorrss + + NNOOTTEE Gut it. It isn't working and isn't used by Coda. + + + 0wpage + + 44..2288.. ssiiggnnaall + + + SSuummmmaarryy Send Venus a signal about an upcall. + + AArrgguummeennttss + + iinn none + + oouutt + not applicable. + + DDeessccrriippttiioonn This is an out-of-band upcall to Venus to inform Venus + that the calling process received a signal after Venus read the + message from the input queue. Venus is supposed to clean up the + operation. + + EErrrroorrss No reply is given. + + NNOOTTEE We need to better understand what Venus needs to clean up and if + it is doing this correctly. Also we need to handle multiple upcall + per system call situations correctly. It would be important to know + what state changes in Venus take place after an upcall for which the + kernel is responsible for notifying Venus to clean up (e.g. open + definitely is such a state change, but many others are maybe not). + + 0wpage + + 55.. TThhee mmiinniiccaacchhee aanndd ddoowwnnccaallllss + + + The Coda FS Driver can cache results of lookup and access upcalls, to + limit the frequency of upcalls. Upcalls carry a price since a process + context switch needs to take place. The counterpart of caching the + information is that Venus will notify the FS Driver that cached + entries must be flushed or renamed. + + The kernel code generally has to maintain a structure which links the + internal file handles (called vnodes in BSD, inodes in Linux and + FileHandles in Windows) with the ViceFid's which Venus maintains. The + reason is that frequent translations back and forth are needed in + order to make upcalls and use the results of upcalls. Such linking + objects are called ccnnooddeess. + + The current minicache implementations have cache entries which record + the following: + + 1. the name of the file + + 2. the cnode of the directory containing the object + + 3. a list of CodaCred's for which the lookup is permitted. + + 4. the cnode of the object + + The lookup call in the Coda FS Driver may request the cnode of the + desired object from the cache, by passing its name, directory and the + CodaCred's of the caller. The cache will return the cnode or indicate + that it cannot be found. The Coda FS Driver must be careful to + invalidate cache entries when it modifies or removes objects. + + When Venus obtains information that indicates that cache entries are + no longer valid, it will make a downcall to the kernel. Downcalls are + intercepted by the Coda FS Driver and lead to cache invalidations of + the kind described below. The Coda FS Driver does not return an error + unless the downcall data could not be read into kernel memory. + + + 55..11.. IINNVVAALLIIDDAATTEE + + + No information is available on this call. + + + 55..22.. FFLLUUSSHH + + + + AArrgguummeennttss None + + SSuummmmaarryy Flush the name cache entirely. + + DDeessccrriippttiioonn Venus issues this call upon startup and when it dies. This + is to prevent stale cache information being held. Some operating + systems allow the kernel name cache to be switched off dynamically. + When this is done, this downcall is made. + + + 55..33.. PPUURRGGEEUUSSEERR + + + AArrgguummeennttss + + struct cfs_purgeuser_out {/* CFS_PURGEUSER is a venus->kernel call */ + struct CodaCred cred; + } cfs_purgeuser; + + + + DDeessccrriippttiioonn Remove all entries in the cache carrying the Cred. This + call is issued when tokens for a user expire or are flushed. + + + 55..44.. ZZAAPPFFIILLEE + + + AArrgguummeennttss + + struct cfs_zapfile_out { /* CFS_ZAPFILE is a venus->kernel call */ + ViceFid CodaFid; + } cfs_zapfile; + + + + DDeessccrriippttiioonn Remove all entries which have the (dir vnode, name) pair. + This is issued as a result of an invalidation of cached attributes of + a vnode. + + NNOOTTEE Call is not named correctly in NetBSD and Mach. The minicache + zapfile routine takes different arguments. Linux does not implement + the invalidation of attributes correctly. + + + + 55..55.. ZZAAPPDDIIRR + + + AArrgguummeennttss + + struct cfs_zapdir_out { /* CFS_ZAPDIR is a venus->kernel call */ + ViceFid CodaFid; + } cfs_zapdir; + + + + DDeessccrriippttiioonn Remove all entries in the cache lying in a directory + CodaFid, and all children of this directory. This call is issued when + Venus receives a callback on the directory. + + + 55..66.. ZZAAPPVVNNOODDEE + + + + AArrgguummeennttss + + struct cfs_zapvnode_out { /* CFS_ZAPVNODE is a venus->kernel call */ + struct CodaCred cred; + ViceFid VFid; + } cfs_zapvnode; + + + + DDeessccrriippttiioonn Remove all entries in the cache carrying the cred and VFid + as in the arguments. This downcall is probably never issued. + + + 55..77.. PPUURRGGEEFFIIDD + + + SSuummmmaarryy + + AArrgguummeennttss + + struct cfs_purgefid_out { /* CFS_PURGEFID is a venus->kernel call */ + ViceFid CodaFid; + } cfs_purgefid; + + + + DDeessccrriippttiioonn Flush the attribute for the file. If it is a dir (odd + vnode), purge its children from the namecache and remove the file from the + namecache. + + + + 55..88.. RREEPPLLAACCEE + + + SSuummmmaarryy Replace the Fid's for a collection of names. + + AArrgguummeennttss + + struct cfs_replace_out { /* cfs_replace is a venus->kernel call */ + ViceFid NewFid; + ViceFid OldFid; + } cfs_replace; + + + + DDeessccrriippttiioonn This routine replaces a ViceFid in the name cache with + another. It is added to allow Venus during reintegration to replace + locally allocated temp fids while disconnected with global fids even + when the reference counts on those fids are not zero. + + 0wpage + + 66.. IInniittiiaalliizzaattiioonn aanndd cclleeaannuupp + + + This section gives brief hints as to desirable features for the Coda + FS Driver at startup and upon shutdown or Venus failures. Before + entering the discussion it is useful to repeat that the Coda FS Driver + maintains the following data: + + + 1. message queues + + 2. cnodes + + 3. name cache entries + + The name cache entries are entirely private to the driver, so they + can easily be manipulated. The message queues will generally have + clear points of initialization and destruction. The cnodes are + much more delicate. User processes hold reference counts in Coda + filesystems and it can be difficult to clean up the cnodes. + + It can expect requests through: + + 1. the message subsystem + + 2. the VFS layer + + 3. pioctl interface + + Currently the _p_i_o_c_t_l passes through the VFS for Coda so we can + treat these similarly. + + + 66..11.. RReeqquuiirreemmeennttss + + + The following requirements should be accommodated: + + 1. The message queues should have open and close routines. On Unix + the opening of the character devices are such routines. + + +o Before opening, no messages can be placed. + + +o Opening will remove any old messages still pending. + + +o Close will notify any sleeping processes that their upcall cannot + be completed. + + +o Close will free all memory allocated by the message queues. + + + 2. At open the namecache shall be initialized to empty state. + + 3. Before the message queues are open, all VFS operations will fail. + Fortunately this can be achieved by making sure than mounting the + Coda filesystem cannot succeed before opening. + + 4. After closing of the queues, no VFS operations can succeed. Here + one needs to be careful, since a few operations (lookup, + read/write, readdir) can proceed without upcalls. These must be + explicitly blocked. + + 5. Upon closing the namecache shall be flushed and disabled. + + 6. All memory held by cnodes can be freed without relying on upcalls. + + 7. Unmounting the file system can be done without relying on upcalls. + + 8. Mounting the Coda filesystem should fail gracefully if Venus cannot + get the rootfid or the attributes of the rootfid. The latter is + best implemented by Venus fetching these objects before attempting + to mount. + + NNOOTTEE NetBSD in particular but also Linux have not implemented the + above requirements fully. For smooth operation this needs to be + corrected. + + + diff --git a/Documentation/filesystems/cramfs.txt b/Documentation/filesystems/cramfs.txt new file mode 100644 index 000000000000..31f53f0ab957 --- /dev/null +++ b/Documentation/filesystems/cramfs.txt @@ -0,0 +1,76 @@ + + Cramfs - cram a filesystem onto a small ROM + +cramfs is designed to be simple and small, and to compress things well. + +It uses the zlib routines to compress a file one page at a time, and +allows random page access. The meta-data is not compressed, but is +expressed in a very terse representation to make it use much less +diskspace than traditional filesystems. + +You can't write to a cramfs filesystem (making it compressible and +compact also makes it _very_ hard to update on-the-fly), so you have to +create the disk image with the "mkcramfs" utility. + + +Usage Notes +----------- + +File sizes are limited to less than 16MB. + +Maximum filesystem size is a little over 256MB. (The last file on the +filesystem is allowed to extend past 256MB.) + +Only the low 8 bits of gid are stored. The current version of +mkcramfs simply truncates to 8 bits, which is a potential security +issue. + +Hard links are supported, but hard linked files +will still have a link count of 1 in the cramfs image. + +Cramfs directories have no `.' or `..' entries. Directories (like +every other file on cramfs) always have a link count of 1. (There's +no need to use -noleaf in `find', btw.) + +No timestamps are stored in a cramfs, so these default to the epoch +(1970 GMT). Recently-accessed files may have updated timestamps, but +the update lasts only as long as the inode is cached in memory, after +which the timestamp reverts to 1970, i.e. moves backwards in time. + +Currently, cramfs must be written and read with architectures of the +same endianness, and can be read only by kernels with PAGE_CACHE_SIZE +== 4096. At least the latter of these is a bug, but it hasn't been +decided what the best fix is. For the moment if you have larger pages +you can just change the #define in mkcramfs.c, so long as you don't +mind the filesystem becoming unreadable to future kernels. + + +For /usr/share/magic +-------------------- + +0 ulelong 0x28cd3d45 Linux cramfs offset 0 +>4 ulelong x size %d +>8 ulelong x flags 0x%x +>12 ulelong x future 0x%x +>16 string >\0 signature "%.16s" +>32 ulelong x fsid.crc 0x%x +>36 ulelong x fsid.edition %d +>40 ulelong x fsid.blocks %d +>44 ulelong x fsid.files %d +>48 string >\0 name "%.16s" +512 ulelong 0x28cd3d45 Linux cramfs offset 512 +>516 ulelong x size %d +>520 ulelong x flags 0x%x +>524 ulelong x future 0x%x +>528 string >\0 signature "%.16s" +>544 ulelong x fsid.crc 0x%x +>548 ulelong x fsid.edition %d +>552 ulelong x fsid.blocks %d +>556 ulelong x fsid.files %d +>560 string >\0 name "%.16s" + + +Hacker Notes +------------ + +See fs/cramfs/README for filesystem layout and implementation notes. diff --git a/Documentation/filesystems/devfs/ChangeLog b/Documentation/filesystems/devfs/ChangeLog new file mode 100644 index 000000000000..e5aba5246d7c --- /dev/null +++ b/Documentation/filesystems/devfs/ChangeLog @@ -0,0 +1,1977 @@ +/* -*- auto-fill -*- */ +=============================================================================== +Changes for patch v1 + +- creation of devfs + +- modified miscellaneous character devices to support devfs +=============================================================================== +Changes for patch v2 + +- bug fix with manual inode creation +=============================================================================== +Changes for patch v3 + +- bugfixes + +- documentation improvements + +- created a couple of scripts (one to save&restore a devfs and the + other to set up compatibility symlinks) + +- devfs support for SCSI discs. New name format is: sd_hHcCiIlL +=============================================================================== +Changes for patch v4 + +- bugfix for the directory reading code + +- bugfix for compilation with kerneld + +- devfs support for generic hard discs + +- rationalisation of the various watchdog drivers +=============================================================================== +Changes for patch v5 + +- support for mounting directly from entries in the devfs (it doesn't + need to be mounted to do this), including the root filesystem. + Mounting of swap partitions also works. Hence, now if you set + CONFIG_DEVFS_ONLY to 'Y' then you won't be able to access your discs + via ordinary device nodes. Naturally, the default is 'N' so that you + can still use your old device nodes. If you want to mount from devfs + entries, make sure you use: append = "root=/dev/sd_..." in your + lilo.conf. It seems LILO looks for the device number (major&minor) + and writes that into the kernel image :-( + +- support for character memory devices (/dev/null, /dev/zero, /dev/full + and so on). Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> +=============================================================================== +Changes for patch v6 + +- support for subdirectories + +- support for symbolic links (created by devfs_mk_symlink(), no + support yet for creation via symlink(2)) + +- SCSI disc naming now cast in stone, with the format: + /dev/sd/c0b1t2u3 controller=0, bus=1, ID=2, LUN=3, whole disc + /dev/sd/c0b1t2u3p4 controller=0, bus=1, ID=2, LUN=3, 4th partition + +- loop devices now appear in devfs + +- tty devices, console, serial ports, etc. now appear in devfs + Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> + +- bugs with mounting devfs-only devices now fixed +=============================================================================== +Changes for patch v7 + +- SCSI CD-ROMS, tapes and generic devices now appear in devfs +=============================================================================== +Changes for patch v8 + +- bugfix with no-rewind SCSI tapes + +- RAMDISCs now appear in devfs + +- better cleaning up of devfs entries created by various modules + +- interface change to <devfs_register> +=============================================================================== +Changes for patch v9 + +- the v8 patch was corrupted somehow, which would affect the patch for + linux/fs/filesystems.c + I've also fixed the v8 patch file on the WWW + +- MetaDevices (/dev/md*) should now appear in devfs +=============================================================================== +Changes for patch v10 + +- bugfix in meta device support for devfs + +- created this ChangeLog file + +- added devfs support to the floppy driver + +- added support for creating sockets in a devfs +=============================================================================== +Changes for patch v11 + +- added DEVFS_FL_HIDE_UNREG flag + +- incorporated better patch for ttyname() in libc 5.4.43 from H.J. Lu. + +- interface change to <devfs_mk_symlink> + +- support for creating symlinks with symlink(2) + +- parallel port printer (/dev/lp*) now appears in devfs +=============================================================================== +Changes for patch v12 + +- added inode check to <devfs_fill_file> function + +- improved devfs support when mounting from devfs + +- added call to <<release>> operation when removing swap areas on + devfs devices + +- increased NR_SUPER to 128 to support large numbers of devfs mounts + (for chroot(2) gaols) + +- fixed bug in SCSI disc support: was generating incorrect minors if + SCSI ID's did not start at 0 and increase by 1 + +- support symlink traversal when mounting root +=============================================================================== +Changes for patch v13 + +- added devfs support to soundcard driver + Thanks to Eric Dumas <dumas@linux.eu.org> and + C. Scott Ananian <cananian@alumni.princeton.edu> + +- added devfs support to the joystick driver + +- loop driver now has it's own subdirectory "/dev/loop/" + +- created <devfs_get_flags> and <devfs_set_flags> functions + +- fix problem with SCSI disc compatibility names (sd{a,b,c,d,e,f}) + which assumes ID's start at 0 and increase by 1. Also only create + devfs entries for SCSI disc partitions which actually exist + Show new names in partition check + Thanks to Jakub Jelinek <jj@sunsite.ms.mff.cuni.cz> +=============================================================================== +Changes for patch v14 + +- bug fix in floppy driver: would not compile without + CONFIG_DEVFS_FS='Y' + Thanks to Jurgen Botz <jbotz@nova.botz.org> + +- bug fix in loop driver + Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> + +- do not create devfs entries for printers not configured + Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> + +- do not create devfs entries for serial ports not present + Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> + +- ensure <tty_register_devfs> is exported from tty_io.c + Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> + +- allow unregistering of devfs symlink entries + +- fixed bug in SCSI disc naming introduced in last patch version +=============================================================================== +Changes for patch v15 + +- ported to kernel 2.1.81 +=============================================================================== +Changes for patch v16 + +- created <devfs_set_symlink_destination> function + +- moved DEVFS_SUPER_MAGIC into header file + +- added DEVFS_FL_HIDE flag + +- created <devfs_get_maj_min> + +- created <devfs_get_handle_from_inode> + +- fixed bugs in searching by major&minor + +- changed interface to <devfs_unregister>, <devfs_fill_file> and + <devfs_find_handle> + +- fixed inode times when symlink created with symlink(2) + +- change tty driver to do auto-creation of devfs entries + Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> + +- fixed bug in genhd.c: whole disc (non-SCSI) was not registered to + devfs + +- updated libc 5.4.43 patch for ttyname() +=============================================================================== +Changes for patch v17 + +- added CONFIG_DEVFS_TTY_COMPAT + Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> + +- bugfix in devfs support for drivers/char/lp.c + Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> + +- clean up serial driver so that PCMCIA devices unregister correctly + Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> + +- fixed bug in genhd.c: whole disc (non-SCSI) was not registered to + devfs [was missing in patch v16] + +- updated libc 5.4.43 patch for ttyname() [was missing in patch v16] + +- all SCSI devices now registered in /dev/sg + +- support removal of devfs entries via unlink(2) +=============================================================================== +Changes for patch v18 + +- added floppy/?u720 floppy entry + +- fixed kerneld support for entries in devfs subdirectories + +- incorporated latest patch for ttyname() in libc 5.4.43 from H.J. Lu. +=============================================================================== +Changes for patch v19 + +- bug fix when looking up unregistered entries: kerneld was not called + +- fixes for kernel 2.1.86 (now requires 2.1.86) +=============================================================================== +Changes for patch v20 + +- only create available floppy entries + Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> + +- new IDE naming scheme following SCSI format (i.e. /dev/id/c0b0t0u0p1 + instead of /dev/hda1) + Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> + +- new XT disc naming scheme following SCSI format (i.e. /dev/xd/c0t0p1 + instead of /dev/xda1) + Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> + +- new non-standard CD-ROM names (i.e. /dev/sbp/c#t#) + Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> + +- allow symlink traversal when mounting the root filesystem + +- Create entries for MD devices at MD init + Thanks to Christophe Leroy <christophe.leroy5@capway.com> +=============================================================================== +Changes for patch v21 + +- ported to kernel 2.1.91 +=============================================================================== +Changes for patch v22 + +- SCSI host number patch ("scsihosts=" kernel option) + Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> +=============================================================================== +Changes for patch v23 + +- Fixed persistence bug with device numbers for manually created + device files + +- Fixed problem with recreating symlinks with different content + +- Added CONFIG_DEVFS_MOUNT (mount devfs on /dev at boot time) +=============================================================================== +Changes for patch v24 + +- Switched from CONFIG_KERNELD to CONFIG_KMOD: module autoloading + should now work again + +- Hide entries which are manually unlinked + +- Always invalidate devfs dentry cache when registering entries + +- Support removal of devfs directories via rmdir(2) + +- Ensure directories created by <devfs_mk_dir> are visible + +- Default no access for "other" for floppy device +=============================================================================== +Changes for patch v25 + +- Updates to CREDITS file and minor IDE numbering change + Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> + +- Invalidate devfs dentry cache when making directories + +- Invalidate devfs dentry cache when removing entries + +- More informative message if root FS mount fails when devfs + configured + +- Fixed persistence bug with fifos +=============================================================================== +Changes for patch v26 + +- ported to kernel 2.1.97 + +- Changed serial directory from "/dev/serial" to "/dev/tts" and + "/dev/consoles" to "/dev/vc" to be more friendly to new procps +=============================================================================== +Changes for patch v27 + +- Added support for IDE4 and IDE5 + Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> + +- Documented "scsihosts=" boot parameter + +- Print process command when debugging kerneld/kmod + +- Added debugging for register/unregister/change operations + +- Added "devfs=" boot options + +- Hide unregistered entries by default +=============================================================================== +Changes for patch v28 + +- No longer lock/unlock superblock in <devfs_put_super> (cope with + recent VFS interface change) + +- Do not automatically change ownership/protection of /dev/tty + +- Drop negative dentries when they are released + +- Manage dcache more efficiently +=============================================================================== +Changes for patch v29 + +- Added DEVFS_FL_AUTO_DEVNUM flag +=============================================================================== +Changes for patch v30 + +- No longer set unnecessary methods + +- Ported to kernel 2.1.99-pre3 +=============================================================================== +Changes for patch v31 + +- Added PID display to <call_kerneld> debugging message + +- Added "diread" and "diwrite" options + +- Ported to kernel 2.1.102 + +- Fixed persistence problem with permissions +=============================================================================== +Changes for patch v32 + +- Fixed devfs support in drivers/block/md.c +=============================================================================== +Changes for patch v33 + +- Support legacy device nodes + +- Fixed bug where recreated inodes were hidden + +- New IDE naming scheme: everything is under /dev/ide +=============================================================================== +Changes for patch v34 + +- Improved debugging in <get_vfs_inode> + +- Prevent duplicate calls to <devfs_mk_dir> in SCSI layer + +- No longer free old dentries in <devfs_mk_dir> + +- Free all dentries for a given entry when deleting inodes +=============================================================================== +Changes for patch v35 + +- Ported to kernel 2.1.105 (sound driver changes) +=============================================================================== +Changes for patch v36 + +- Fixed sound driver port +=============================================================================== +Changes for patch v37 + +- Minor documentation tweaks +=============================================================================== +Changes for patch v38 + +- More documentation tweaks + +- Fix for sound driver port + +- Removed ttyname-patch (grab libc 5.4.44 instead) + +- Ported to kernel 2.1.107-pre2 (loop driver fix) +=============================================================================== +Changes for patch v39 + +- Ported to kernel 2.1.107 (hd.c hunk broke due to spelling "fixes"). Sigh + +- Removed many #ifdef's, replaced with trickery in include/devfs_fs.h +=============================================================================== +Changes for patch v40 + +- Fix for sound driver port + +- Limit auto-device numbering to majors 128 to 239 +=============================================================================== +Changes for patch v41 + +- Fixed inode times persistence problem +=============================================================================== +Changes for patch v42 + +- Ported to kernel 2.1.108 (drivers/scsi/hosts.c hunk broke) +=============================================================================== +Changes for patch v43 + +- Fixed spelling in <devfs_readlink> debug + +- Fixed bug in <devfs_setup> parsing "dilookup" + +- More #ifdef's removed + +- Supported Sparc keyboard (/dev/kbd) + +- Supported DSP56001 digital signal processor (/dev/dsp56k) + +- Supported Apple Desktop Bus (/dev/adb) + +- Supported Coda network file system (/dev/cfs*) +=============================================================================== +Changes for patch v44 + +- Fixed devfs inode leak when manually recreating inodes + +- Fixed permission persistence problem when recreating inodes +=============================================================================== +Changes for patch v45 + +- Ported to kernel 2.1.110 +=============================================================================== +Changes for patch v46 + +- Ported to kernel 2.1.112-pre1 + +- Removed harmless "unused variable" compiler warning + +- Fixed modes for manually recreated device nodes +=============================================================================== +Changes for patch v47 + +- Added NULL devfs inode warning in <devfs_read_inode> + +- Force all inode nlink values to 1 +=============================================================================== +Changes for patch v48 + +- Added "dimknod" option + +- Set inode nlink to 0 when freeing dentries + +- Added support for virtual console capture devices (/dev/vcs*) + Thanks to Dennis Hou <smilax@mindmeld.yi.org> + +- Fixed modes for manually recreated symlinks +=============================================================================== +Changes for patch v49 + +- Ported to kernel 2.1.113 +=============================================================================== +Changes for patch v50 + +- Fixed bugs in recreated directories and symlinks +=============================================================================== +Changes for patch v51 + +- Improved robustness of rc.devfs script + Thanks to Roderich Schupp <rsch@experteam.de> + +- Fixed bugs in recreated device nodes + +- Fixed bug in currently unused <devfs_get_handle_from_inode> + +- Defined new <devfs_handle_t> type + +- Improved debugging when getting entries + +- Fixed bug where directories could be emptied + +- Ported to kernel 2.1.115 +=============================================================================== +Changes for patch v52 + +- Replaced dummy .epoch inode with .devfsd character device + +- Modified rc.devfs to take account of above change + +- Removed spurious driver warning messages when CONFIG_DEVFS_FS=n + +- Implemented devfsd protocol revision 0 +=============================================================================== +Changes for patch v53 + +- Ported to kernel 2.1.116 (kmod change broke hunk) + +- Updated Documentation/Configure.help + +- Test and tty pattern patch for rc.devfs script + Thanks to Roderich Schupp <rsch@experteam.de> + +- Added soothing message to warning in <devfs_d_iput> +=============================================================================== +Changes for patch v54 + +- Ported to kernel 2.1.117 + +- Fixed default permissions in sound driver + +- Added support for frame buffer devices (/dev/fb*) +=============================================================================== +Changes for patch v55 + +- Ported to kernel 2.1.119 + +- Use GCC extensions for structure initialisations + +- Implemented async open notification + +- Incremented devfsd protocol revision to 1 +=============================================================================== +Changes for patch v56 + +- Ported to kernel 2.1.120-pre3 + +- Moved async open notification to end of <devfs_open> +=============================================================================== +Changes for patch v57 + +- Ported to kernel 2.1.121 + +- Prepended "/dev/" to module load request + +- Renamed <call_kerneld> to <call_kmod> + +- Created sample modules.conf file +=============================================================================== +Changes for patch v58 + +- Fixed typo "AYSNC" -> "ASYNC" +=============================================================================== +Changes for patch v59 + +- Added open flag for files +=============================================================================== +Changes for patch v60 + +- Ported to kernel 2.1.123-pre2 +=============================================================================== +Changes for patch v61 + +- Set i_blocks=0 and i_blksize=1024 in <devfs_read_inode> +=============================================================================== +Changes for patch v62 + +- Ported to kernel 2.1.123 +=============================================================================== +Changes for patch v63 + +- Ported to kernel 2.1.124-pre2 +=============================================================================== +Changes for patch v64 + +- Fixed Unix98 pty support + +- Increased buffer size in <get_partition_list> to avoid crash and + burn +=============================================================================== +Changes for patch v65 + +- More Unix98 pty support fixes + +- Added test for empty <<name>> in <devfs_find_handle> + +- Renamed <generate_path> to <devfs_generate_path> and published + +- Created /dev/root symlink + Thanks to Roderich Schupp <rsch@ExperTeam.de> + with further modifications by me +=============================================================================== +Changes for patch v66 + +- Yet more Unix98 pty support fixes (now tested) + +- Created <devfs_get_fops> + +- Support media change checks when CONFIG_DEVFS_ONLY=y + +- Abolished Unix98-style PTY names for old PTY devices +=============================================================================== +Changes for patch v67 + +- Added inline declaration for dummy <devfs_generate_path> + +- Removed spurious "unable to register... in devfs" messages when + CONFIG_DEVFS_FS=n + +- Fixed misc. devices when CONFIG_DEVFS_FS=n + +- Limit auto-device numbering to majors 144 to 239 +=============================================================================== +Changes for patch v68 + +- Hide unopened virtual consoles from directory listings + +- Added support for video capture devices + +- Ported to kernel 2.1.125 +=============================================================================== +Changes for patch v69 + +- Fix for CONFIG_VT=n +=============================================================================== +Changes for patch v70 + +- Added support for non-OSS/Free sound cards +=============================================================================== +Changes for patch v71 + +- Ported to kernel 2.1.126-pre2 +=============================================================================== +Changes for patch v72 + +- #ifdef's for CONFIG_DEVFS_DISABLE_OLD_NAMES removed +=============================================================================== +Changes for patch v73 + +- CONFIG_DEVFS_DISABLE_OLD_NAMES replaced with "nocompat" boot option + +- CONFIG_DEVFS_BOOT_OPTIONS removed: boot options always available +=============================================================================== +Changes for patch v74 + +- Removed CONFIG_DEVFS_MOUNT and "mount" boot option and replaced with + "nomount" boot option + +- Documentation updates + +- Updated sample modules.conf +=============================================================================== +Changes for patch v75 + +- Updated sample modules.conf + +- Remount devfs after initrd finishes + +- Ported to kernel 2.1.127 + +- Added support for ISDN + Thanks to Christophe Leroy <christophe.leroy5@capway.com> +=============================================================================== +Changes for patch v76 + +- Updated an email address in ChangeLog + +- CONFIG_DEVFS_ONLY replaced with "only" boot option +=============================================================================== +Changes for patch v77 + +- Added DEVFS_FL_REMOVABLE flag + +- Check for disc change when listing directories with removable media + devices + +- Use DEVFS_FL_REMOVABLE in sd.c + +- Ported to kernel 2.1.128 +=============================================================================== +Changes for patch v78 + +- Only call <scan_dir_for_removable> on first call to <devfs_readdir> + +- Ported to kernel 2.1.129-pre5 + +- ISDN support improvements + Thanks to Christophe Leroy <christophe.leroy5@capway.com> +=============================================================================== +Changes for patch v79 + +- Ported to kernel 2.1.130 + +- Renamed miscdevice "apm" to "apm_bios" to be consistent with + devices.txt +=============================================================================== +Changes for patch v80 + +- Ported to kernel 2.1.131 + +- Updated <devfs_rmdir> for VFS change in 2.1.131 +=============================================================================== +Changes for patch v81 + +- Fixed permissions on /dev/ptmx +=============================================================================== +Changes for patch v82 + +- Ported to kernel 2.1.132-pre4 + +- Changed initial permissions on /dev/pts/* + +- Created <devfs_mk_compat> + +- Added "symlinks" boot option + +- Changed devfs_register_blkdev() back to register_blkdev() for IDE + +- Check for partitions on removable media in <devfs_lookup> +=============================================================================== +Changes for patch v83 + +- Fixed support for ramdisc when using string-based root FS name + +- Ported to kernel 2.2.0-pre1 +=============================================================================== +Changes for patch v84 + +- Ported to kernel 2.2.0-pre7 +=============================================================================== +Changes for patch v85 + +- Compile fixes for driver/sound/sound_common.c (non-module) and + drivers/isdn/isdn_common.c + Thanks to Christophe Leroy <christophe.leroy5@capway.com> + +- Added support for registering regular files + +- Created <devfs_set_file_size> + +- Added /dev/cpu/mtrr as an alternative interface to /proc/mtrr + +- Update devfs inodes from entries if not changed through FS +=============================================================================== +Changes for patch v86 + +- Ported to kernel 2.2.0-pre9 +=============================================================================== +Changes for patch v87 + +- Fixed bug when mounting non-devfs devices in a devfs +=============================================================================== +Changes for patch v88 + +- Fixed <devfs_fill_file> to only initialise temporary inodes + +- Trap for NULL fops in <devfs_register> + +- Return -ENODEV in <devfs_fill_file> for non-driver inodes + +- Fixed bug when unswapping non-devfs devices in a devfs +=============================================================================== +Changes for patch v89 + +- Switched to C data types in include/linux/devfs_fs.h + +- Switched from PATH_MAX to DEVFS_PATHLEN + +- Updated Documentation/filesystems/devfs/modules.conf to take account + of reverse scanning (!) by modprobe + +- Ported to kernel 2.2.0 +=============================================================================== +Changes for patch v90 + +- CONFIG_DEVFS_DISABLE_OLD_TTY_NAMES replaced with "nottycompat" boot + option + +- CONFIG_DEVFS_TTY_COMPAT removed: existing "symlinks" boot option now + controls this. This means you must have libc 5.4.44 or later, or a + recent version of libc 6 if you use the "symlinks" option +=============================================================================== +Changes for patch v91 + +- Switch from <devfs_mk_symlink> to <devfs_mk_compat> in + drivers/char/vc_screen.c to fix problems with Midnight Commander +=============================================================================== +Changes for patch v92 + +- Ported to kernel 2.2.2-pre5 +=============================================================================== +Changes for patch v93 + +- Modified <sd_name> in drivers/scsi/sd.c to cope with devices that + don't exist (which happens with new RAID autostart code printk()s) +=============================================================================== +Changes for patch v94 + +- Fixed bug in joystick driver: only first joystick was registered +=============================================================================== +Changes for patch v95 + +- Fixed another bug in joystick driver + +- Fixed <devfsd_read> to not overrun event buffer +=============================================================================== +Changes for patch v96 + +- Ported to kernel 2.2.5-2 + +- Created <devfs_auto_unregister> + +- Fixed bugs: compatibility entries were not unregistered for: + loop driver + floppy driver + RAMDISC driver + IDE tape driver + SCSI CD-ROM driver + SCSI HDD driver +=============================================================================== +Changes for patch v97 + +- Fixed bugs: compatibility entries were not unregistered for: + ALSA sound driver + partitions in generic disc driver + +- Don't return unregistred entries in <devfs_find_handle> + +- Panic in <devfs_unregister> if entry unregistered + +- Don't panic in <devfs_auto_unregister> for duplicates +=============================================================================== +Changes for patch v98 + +- Don't unregister already unregistered entries in <unregister> + +- Register entry in <sd_detect> + +- Unregister entry in <sd_detach> + +- Changed to <devfs_*register_chrdev> in drivers/char/tty_io.c + +- Ported to kernel 2.2.7 +=============================================================================== +Changes for patch v99 + +- Ported to kernel 2.2.8 + +- Fixed bug in drivers/scsi/sd.c when >16 SCSI discs + +- Disable warning messages when unable to read partition table for + removable media +=============================================================================== +Changes for patch v100 + +- Ported to kernel 2.3.1-pre5 + +- Added "oops-on-panic" boot option + +- Improved debugging in <devfs_register> and <devfs_unregister> + +- Register entry in <sr_detect> + +- Unregister entry in <sr_detach> + +- Register entry in <sg_detect> + +- Unregister entry in <sg_detach> + +- Added support for ALSA drivers +=============================================================================== +Changes for patch v101 + +- Ported to kernel 2.3.2 +=============================================================================== +Changes for patch v102 + +- Update serial driver to register PCMCIA entries + Thanks to Roch-Alexandre Nomine-Beguin <roch@samarkand.infini.fr> + +- Updated an email address in ChangeLog + +- Hide virtual console capture entries from directory listings when + corresponding console device is not open +=============================================================================== +Changes for patch v103 + +- Ported to kernel 2.3.3 +=============================================================================== +Changes for patch v104 + +- Added documentation for some functions + +- Added "doc" target to fs/devfs/Makefile + +- Added "v4l" directory for video4linux devices + +- Replaced call to <devfs_unregister> in <sd_detach> with call to + <devfs_register_partitions> + +- Moved registration for sr and sg drivers from detect() to attach() + methods + +- Register entries in <st_attach> and unregister in <st_detach> + +- Work around IDE driver treating CD-ROM as gendisk + +- Use <sed> instead of <tr> in rc.devfs + +- Updated ToDo list + +- Removed "oops-on-panic" boot option: now always Oops +=============================================================================== +Changes for patch v105 + +- Unregister SCSI host from <scsi_host_no_list> in <scsi_unregister> + Thanks to Zoltán Böszörményi <zboszor@mail.externet.hu> + +- Don't save /dev/log in rc.devfs + +- Ported to kernel 2.3.4-pre1 +=============================================================================== +Changes for patch v106 + +- Fixed silly typo in drivers/scsi/st.c + +- Improved debugging in <devfs_register> +=============================================================================== +Changes for patch v107 + +- Added "diunlink" and "nokmod" boot options + +- Removed superfluous warning message in <devfs_d_iput> +=============================================================================== +Changes for patch v108 + +- Remove entries when unloading sound module +=============================================================================== +Changes for patch v109 + +- Ported to kernel 2.3.6-pre2 +=============================================================================== +Changes for patch v110 + +- Took account of change to <d_alloc_root> +=============================================================================== +Changes for patch v111 + +- Created separate event queue for each mounted devfs + +- Removed <devfs_invalidate_dcache> + +- Created new ioctl()s for devfsd + +- Incremented devfsd protocol revision to 3 + +- Fixed bug when re-creating directories: contents were lost + +- Block access to inodes until devfsd updates permissions +=============================================================================== +Changes for patch v112 + +- Modified patch so it applies against 2.3.5 and 2.3.6 + +- Updated an email address in ChangeLog + +- Do not automatically change ownership/protection of /dev/tty<n> + +- Updated sample modules.conf + +- Switched to sending process uid/gid to devfsd + +- Renamed <call_kmod> to <try_modload> + +- Added DEVFSD_NOTIFY_LOOKUP event + +- Added DEVFSD_NOTIFY_CHANGE event + +- Added DEVFSD_NOTIFY_CREATE event + +- Incremented devfsd protocol revision to 4 + +- Moved kernel-specific stuff to include/linux/devfs_fs_kernel.h +=============================================================================== +Changes for patch v113 + +- Ported to kernel 2.3.9 + +- Restricted permissions on some block devices +=============================================================================== +Changes for patch v114 + +- Added support for /dev/netlink + Thanks to Dennis Hou <smilax@mindmeld.yi.org> + +- Return EISDIR rather than EINVAL for read(2) on directories + +- Ported to kernel 2.3.10 +=============================================================================== +Changes for patch v115 + +- Added support for all remaining character devices + Thanks to Dennis Hou <smilax@mindmeld.yi.org> + +- Cleaned up netlink support +=============================================================================== +Changes for patch v116 + +- Added support for /dev/parport%d + Thanks to Tim Waugh <tim@cyberelk.demon.co.uk> + +- Fixed parallel port ATAPI tape driver + +- Fixed Atari SLM laser printer driver +=============================================================================== +Changes for patch v117 + +- Added support for COSA card + Thanks to Dennis Hou <smilax@mindmeld.yi.org> + +- Fixed drivers/char/ppdev.c: missing #include <linux/init.h> + +- Fixed drivers/char/ftape/zftape/zftape-init.c + Thanks to Vladimir Popov <mashgrad@usa.net> +=============================================================================== +Changes for patch v118 + +- Ported to kernel 2.3.15-pre3 + +- Fixed bug in loop driver + +- Unregister /dev/lp%d entries in drivers/char/lp.c + Thanks to Maciej W. Rozycki <macro@ds2.pg.gda.pl> +=============================================================================== +Changes for patch v119 + +- Ported to kernel 2.3.16 +=============================================================================== +Changes for patch v120 + +- Fixed bug in drivers/scsi/scsi.c + +- Added /dev/ppp + Thanks to Dennis Hou <smilax@mindmeld.yi.org> + +- Ported to kernel 2.3.17 +=============================================================================== +Changes for patch v121 + +- Fixed bug in drivers/block/loop.c + +- Ported to kernel 2.3.18 +=============================================================================== +Changes for patch v122 + +- Ported to kernel 2.3.19 +=============================================================================== +Changes for patch v123 + +- Ported to kernel 2.3.20 +=============================================================================== +Changes for patch v124 + +- Ported to kernel 2.3.21 +=============================================================================== +Changes for patch v125 + +- Created <devfs_get_info>, <devfs_set_info>, + <devfs_get_first_child> and <devfs_get_next_sibling> + Added <<dir>> parameter to <devfs_register>, <devfs_mk_compat>, + <devfs_mk_dir> and <devfs_find_handle> + Work sponsored by SGI + +- Fixed apparent bug in COSA driver + +- Re-instated "scsihosts=" boot option +=============================================================================== +Changes for patch v126 + +- Always create /dev/pts if CONFIG_UNIX98_PTYS=y + +- Fixed call to <devfs_mk_dir> in drivers/block/ide-disk.c + Thanks to Dennis Hou <smilax@mindmeld.yi.org> + +- Allow multiple unregistrations + +- Created /dev/scsi hierarchy + Work sponsored by SGI +=============================================================================== +Changes for patch v127 + +Work sponsored by SGI + +- No longer disable devpts if devfs enabled (caveat emptor) + +- Added flags array to struct gendisk and removed code from + drivers/scsi/sd.c + +- Created /dev/discs hierarchy +=============================================================================== +Changes for patch v128 + +Work sponsored by SGI + +- Created /dev/cdroms hierarchy +=============================================================================== +Changes for patch v129 + +Work sponsored by SGI + +- Removed compatibility entries for sound devices + +- Removed compatibility entries for printer devices + +- Removed compatibility entries for video4linux devices + +- Removed compatibility entries for parallel port devices + +- Removed compatibility entries for frame buffer devices +=============================================================================== +Changes for patch v130 + +Work sponsored by SGI + +- Added major and minor number to devfsd protocol + +- Incremented devfsd protocol revision to 5 + +- Removed compatibility entries for SoundBlaster CD-ROMs + +- Removed compatibility entries for netlink devices + +- Removed compatibility entries for SCSI generic devices + +- Removed compatibility entries for SCSI tape devices +=============================================================================== +Changes for patch v131 + +Work sponsored by SGI + +- Support info pointer for all devfs entry types + +- Added <<info>> parameter to <devfs_mk_dir> and <devfs_mk_symlink> + +- Removed /dev/st hierarchy + +- Removed /dev/sg hierarchy + +- Removed compatibility entries for loop devices + +- Removed compatibility entries for IDE tape devices + +- Removed compatibility entries for SCSI CD-ROMs + +- Removed /dev/sr hierarchy +=============================================================================== +Changes for patch v132 + +Work sponsored by SGI + +- Removed compatibility entries for floppy devices + +- Removed compatibility entries for RAMDISCs + +- Removed compatibility entries for meta-devices + +- Removed compatibility entries for SCSI discs + +- Created <devfs_make_root> + +- Removed /dev/sd hierarchy + +- Support "../" when searching devfs namespace + +- Created /dev/ide/host* hierarchy + +- Supported IDE hard discs in /dev/ide/host* hierarchy + +- Removed compatibility entries for IDE discs + +- Removed /dev/ide/hd hierarchy + +- Supported IDE CD-ROMs in /dev/ide/host* hierarchy + +- Removed compatibility entries for IDE CD-ROMs + +- Removed /dev/ide/cd hierarchy +=============================================================================== +Changes for patch v133 + +Work sponsored by SGI + +- Created <devfs_get_unregister_slave> + +- Fixed bug in fs/partitions/check.c when rescanning +=============================================================================== +Changes for patch v134 + +Work sponsored by SGI + +- Removed /dev/sd, /dev/sr, /dev/st and /dev/sg directories + +- Removed /dev/ide/hd directory + +- Exported <devfs_get_parent> + +- Created <devfs_register_tape> and /dev/tapes hierarchy + +- Removed /dev/ide/mt hierarchy + +- Removed /dev/ide/fd hierarchy + +- Ported to kernel 2.3.25 +=============================================================================== +Changes for patch v135 + +Work sponsored by SGI + +- Removed compatibility entries for virtual console capture devices + +- Removed unused <devfs_set_symlink_destination> + +- Removed compatibility entries for serial devices + +- Removed compatibility entries for console devices + +- Do not hide entries from devfsd or children + +- Removed DEVFS_FL_TTY_COMPAT flag + +- Removed "nottycompat" boot option + +- Removed <devfs_mk_compat> +=============================================================================== +Changes for patch v136 + +Work sponsored by SGI + +- Moved BSD pty devices to /dev/pty + +- Added DEVFS_FL_WAIT flag +=============================================================================== +Changes for patch v137 + +Work sponsored by SGI + +- Really fixed bug in fs/partitions/check.c when rescanning + +- Support new "disc" naming scheme in <get_removable_partition> + +- Allow NULL fops in <devfs_register> + +- Removed redundant name functions in SCSI disc and IDE drivers +=============================================================================== +Changes for patch v138 + +Work sponsored by SGI + +- Fixed old bugs in drivers/block/paride/pt.c, drivers/char/tpqic02.c, + drivers/net/wan/cosa.c and drivers/scsi/scsi.c + Thanks to Sergey Kubushin <ksi@ksi-linux.com> + +- Fall back to major table if NULL fops given to <devfs_register> +=============================================================================== +Changes for patch v139 + +Work sponsored by SGI + +- Corrected and moved <get_blkfops> and <get_chrfops> declarations + from arch/alpha/kernel/osf_sys.c to include/linux/fs.h + +- Removed name function from struct gendisk + +- Updated devfs FAQ +=============================================================================== +Changes for patch v140 + +Work sponsored by SGI + +- Ported to kernel 2.3.27 +=============================================================================== +Changes for patch v141 + +Work sponsored by SGI + +- Bug fix in arch/m68k/atari/joystick.c + +- Moved ISDN and capi devices to /dev/isdn +=============================================================================== +Changes for patch v142 + +Work sponsored by SGI + +- Bug fix in drivers/block/ide-probe.c (patch confusion) +=============================================================================== +Changes for patch v143 + +Work sponsored by SGI + +- Bug fix in drivers/block/blkpg.c:partition_name() +=============================================================================== +Changes for patch v144 + +Work sponsored by SGI + +- Ported to kernel 2.3.29 + +- Removed calls to <devfs_register> from cdu31a, cm206, mcd and mcdx + CD-ROM drivers: generic driver handles this now + +- Moved joystick devices to /dev/joysticks +=============================================================================== +Changes for patch v145 + +Work sponsored by SGI + +- Ported to kernel 2.3.30-pre3 + +- Register whole-disc entry even for invalid partition tables + +- Fixed bug in mounting root FS when initrd enabled + +- Fixed device entry leak with IDE CD-ROMs + +- Fixed compile problem with drivers/isdn/isdn_common.c + +- Moved COSA devices to /dev/cosa + +- Support fifos when unregistering + +- Created <devfs_register_series> and used in many drivers + +- Moved Coda devices to /dev/coda + +- Moved parallel port IDE tapes to /dev/pt + +- Moved parallel port IDE generic devices to /dev/pg +=============================================================================== +Changes for patch v146 + +Work sponsored by SGI + +- Removed obsolete DEVFS_FL_COMPAT and DEVFS_FL_TOLERANT flags + +- Fixed compile problem with fs/coda/psdev.c + +- Reinstate change to <devfs_register_blkdev> in + drivers/block/ide-probe.c now that fs/isofs/inode.c is fixed + +- Switched to <devfs_register_blkdev> in drivers/block/floppy.c, + drivers/scsi/sr.c and drivers/block/md.c + +- Moved DAC960 devices to /dev/dac960 +=============================================================================== +Changes for patch v147 + +Work sponsored by SGI + +- Ported to kernel 2.3.32-pre4 +=============================================================================== +Changes for patch v148 + +Work sponsored by SGI + +- Removed kmod support: use devfsd instead + +- Moved miscellaneous character devices to /dev/misc +=============================================================================== +Changes for patch v149 + +Work sponsored by SGI + +- Ensure include/linux/joystick.h is OK for user-space + +- Improved debugging in <get_vfs_inode> + +- Ensure dentries created by devfsd will be cleaned up +=============================================================================== +Changes for patch v150 + +Work sponsored by SGI + +- Ported to kernel 2.3.34 +=============================================================================== +Changes for patch v151 + +Work sponsored by SGI + +- Ported to kernel 2.3.35-pre1 + +- Created <devfs_get_name> +=============================================================================== +Changes for patch v152 + +Work sponsored by SGI + +- Updated sample modules.conf + +- Ported to kernel 2.3.36-pre1 +=============================================================================== +Changes for patch v153 + +Work sponsored by SGI + +- Ported to kernel 2.3.42 + +- Removed <devfs_fill_file> +=============================================================================== +Changes for patch v154 + +Work sponsored by SGI + +- Took account of device number changes for /dev/fb* +=============================================================================== +Changes for patch v155 + +Work sponsored by SGI + +- Ported to kernel 2.3.43-pre8 + +- Moved /dev/tty0 to /dev/vc/0 + +- Moved sequence number formatting from <_tty_make_name> to drivers +=============================================================================== +Changes for patch v156 + +Work sponsored by SGI + +- Fixed breakage in drivers/scsi/sd.c due to recent SCSI changes +=============================================================================== +Changes for patch v157 + +Work sponsored by SGI + +- Ported to kernel 2.3.45 +=============================================================================== +Changes for patch v158 + +Work sponsored by SGI + +- Ported to kernel 2.3.46-pre2 +=============================================================================== +Changes for patch v159 + +Work sponsored by SGI + +- Fixed drivers/block/md.c + Thanks to Mike Galbraith <mikeg@weiden.de> + +- Documentation fixes + +- Moved device registration from <lp_init> to <lp_register> + Thanks to Tim Waugh <twaugh@redhat.com> +=============================================================================== +Changes for patch v160 + +Work sponsored by SGI + +- Fixed drivers/char/joystick/joystick.c + Thanks to Vojtech Pavlik <vojtech@suse.cz> + +- Documentation updates + +- Fixed arch/i386/kernel/mtrr.c if procfs and devfs not enabled + +- Fixed drivers/char/stallion.c +=============================================================================== +Changes for patch v161 + +Work sponsored by SGI + +- Remove /dev/ide when ide-mod is unloaded + +- Fixed bug in drivers/block/ide-probe.c when secondary but no primary + +- Added DEVFS_FL_NO_PERSISTENCE flag + +- Used new DEVFS_FL_NO_PERSISTENCE flag for Unix98 pty slaves + +- Removed unnecessary call to <update_devfs_inode_from_entry> in + <devfs_readdir> + +- Only set auto-ownership for /dev/pty/s* +=============================================================================== +Changes for patch v162 + +Work sponsored by SGI + +- Set inode->i_size to correct size for symlinks + Thanks to Jeremy Fitzhardinge <jeremy@goop.org> + +- Only give lookup() method to directories to comply with new VFS + assumptions + +- Remove unnecessary tests in symlink methods + +- Don't kill existing block ops in <devfs_read_inode> + +- Restore auto-ownership for /dev/pty/m* +=============================================================================== +Changes for patch v163 + +Work sponsored by SGI + +- Don't create missing directories in <devfs_find_handle> + +- Removed Documentation/filesystems/devfs/mk-devlinks + +- Updated Documentation/filesystems/devfs/README +=============================================================================== +Changes for patch v164 + +Work sponsored by SGI + +- Fixed CONFIG_DEVFS breakage in drivers/char/serial.c introduced in + linux-2.3.99-pre6-7 +=============================================================================== +Changes for patch v165 + +Work sponsored by SGI + +- Ported to kernel 2.3.99-pre6 +=============================================================================== +Changes for patch v166 + +Work sponsored by SGI + +- Added CONFIG_DEVFS_MOUNT +=============================================================================== +Changes for patch v167 + +Work sponsored by SGI + +- Updated Documentation/filesystems/devfs/README + +- Updated sample modules.conf +=============================================================================== +Changes for patch v168 + +Work sponsored by SGI + +- Disabled multi-mount capability (use VFS bindings instead) + +- Updated README from master HTML file +=============================================================================== +Changes for patch v169 + +Work sponsored by SGI + +- Removed multi-mount code + +- Removed compatibility macros: VFS has changed too much +=============================================================================== +Changes for patch v170 + +Work sponsored by SGI + +- Updated README from master HTML file + +- Merged devfs inode into devfs entry +=============================================================================== +Changes for patch v171 + +Work sponsored by SGI + +- Updated sample modules.conf + +- Removed dead code in <devfs_register> which used to call + <free_dentries> + +- Ported to kernel 2.4.0-test2-pre3 +=============================================================================== +Changes for patch v172 + +Work sponsored by SGI + +- Changed interface to <devfs_register> + +- Changed interface to <devfs_register_series> +=============================================================================== +Changes for patch v173 + +Work sponsored by SGI + +- Simplified interface to <devfs_mk_symlink> + +- Simplified interface to <devfs_mk_dir> + +- Simplified interface to <devfs_find_handle> +=============================================================================== +Changes for patch v174 + +Work sponsored by SGI + +- Updated README from master HTML file +=============================================================================== +Changes for patch v175 + +Work sponsored by SGI + +- DocBook update for fs/devfs/base.c + Thanks to Tim Waugh <twaugh@redhat.com> + +- Removed stale fs/tunnel.c (was never used or completed) +=============================================================================== +Changes for patch v176 + +Work sponsored by SGI + +- Updated ToDo list + +- Removed sample modules.conf: now distributed with devfsd + +- Updated README from master HTML file + +- Ported to kernel 2.4.0-test3-pre4 (which had devfs-patch-v174) +=============================================================================== +Changes for patch v177 + +- Updated README from master HTML file + +- Documentation cleanups + +- Ensure <devfs_generate_path> terminates string for root entry + Thanks to Tim Jansen <tim@tjansen.de> + +- Exported <devfs_get_name> to modules + +- Make <devfs_mk_symlink> send events to devfsd + +- Cleaned up option processing in <devfs_setup> + +- Fixed bugs in handling symlinks: could leak or cause Oops + +- Cleaned up directory handling by separating fops + Thanks to Alexander Viro <viro@parcelfarce.linux.theplanet.co.uk> +=============================================================================== +Changes for patch v178 + +- Fixed handling of inverted options in <devfs_setup> +=============================================================================== +Changes for patch v179 + +- Adjusted <try_modload> to account for <devfs_generate_path> fix +=============================================================================== +Changes for patch v180 + +- Fixed !CONFIG_DEVFS_FS stub declaration of <devfs_get_info> +=============================================================================== +Changes for patch v181 + +- Answered question posed by Al Viro and removed his comments from <devfs_open> + +- Moved setting of registered flag after other fields are changed + +- Fixed race between <devfsd_close> and <devfsd_notify_one> + +- Global VFS changes added bogus BKL to devfsd_close(): removed + +- Widened locking in <devfs_readlink> and <devfs_follow_link> + +- Replaced <devfsd_read> stack usage with <devfsd_ioctl> kmalloc + +- Simplified locking in <devfsd_ioctl> and fixed memory leak +=============================================================================== +Changes for patch v182 + +- Created <devfs_*alloc_major> and <devfs_*alloc_devnum> + +- Removed broken devnum allocation and use <devfs_alloc_devnum> + +- Fixed old devnum leak by calling new <devfs_dealloc_devnum> + +- Created <devfs_*alloc_unique_number> + +- Fixed number leak for /dev/cdroms/cdrom%d + +- Fixed number leak for /dev/discs/disc%d +=============================================================================== +Changes for patch v183 + +- Fixed bug in <devfs_setup> which could hang boot process +=============================================================================== +Changes for patch v184 + +- Documentation typo fix for fs/devfs/util.c + +- Fixed drivers/char/stallion.c for devfs + +- Added DEVFSD_NOTIFY_DELETE event + +- Updated README from master HTML file + +- Removed #include <asm/segment.h> from fs/devfs/base.c +=============================================================================== +Changes for patch v185 + +- Made <block_semaphore> and <char_semaphore> in fs/devfs/util.c + private + +- Fixed inode table races by removing it and using inode->u.generic_ip + instead + +- Moved <devfs_read_inode> into <get_vfs_inode> + +- Moved <devfs_write_inode> into <devfs_notify_change> +=============================================================================== +Changes for patch v186 + +- Fixed race in <devfs_do_symlink> for uni-processor + +- Updated README from master HTML file +=============================================================================== +Changes for patch v187 + +- Fixed drivers/char/stallion.c for devfs + +- Fixed drivers/char/rocket.c for devfs + +- Fixed bug in <devfs_alloc_unique_number>: limited to 128 numbers +=============================================================================== +Changes for patch v188 + +- Updated major masks in fs/devfs/util.c up to Linus' "no new majors" + proclamation. Block: were 126 now 122 free, char: were 26 now 19 free + +- Updated README from master HTML file + +- Removed remnant of multi-mount support in <devfs_mknod> + +- Removed unused DEVFS_FL_SHOW_UNREG flag +=============================================================================== +Changes for patch v189 + +- Removed nlink field from struct devfs_inode + +- Removed auto-ownership for /dev/pty/* (BSD ptys) and used + DEVFS_FL_CURRENT_OWNER|DEVFS_FL_NO_PERSISTENCE for /dev/pty/s* (just + like Unix98 pty slaves) and made /dev/pty/m* rw-rw-rw- access +=============================================================================== +Changes for patch v190 + +- Updated README from master HTML file + +- Replaced BKL with global rwsem to protect symlink data (quick and + dirty hack) +=============================================================================== +Changes for patch v191 + +- Replaced global rwsem for symlink with per-link refcount +=============================================================================== +Changes for patch v192 + +- Removed unnecessary #ifdef CONFIG_DEVFS_FS from arch/i386/kernel/mtrr.c + +- Ported to kernel 2.4.10-pre11 + +- Set inode->i_mapping->a_ops for block nodes in <get_vfs_inode> +=============================================================================== +Changes for patch v193 + +- Went back to global rwsem for symlinks (refcount scheme no good) +=============================================================================== +Changes for patch v194 + +- Fixed overrun in <devfs_link> by removing function (not needed) + +- Updated README from master HTML file +=============================================================================== +Changes for patch v195 + +- Fixed buffer underrun in <try_modload> + +- Moved down_read() from <search_for_entry_in_dir> to <find_entry> +=============================================================================== +Changes for patch v196 + +- Fixed race in <devfsd_ioctl> when setting event mask + Thanks to Kari Hurtta <hurtta@leija.mh.fmi.fi> + +- Avoid deadlock in <devfs_follow_link> by using temporary buffer +=============================================================================== +Changes for patch v197 + +- First release of new locking code for devfs core (v1.0) + +- Fixed bug in drivers/cdrom/cdrom.c +=============================================================================== +Changes for patch v198 + +- Discard temporary buffer, now use "%s" for dentry names + +- Don't generate path in <try_modload>: use fake entry instead + +- Use "existing" directory in <_devfs_make_parent_for_leaf> + +- Use slab cache rather than fixed buffer for devfsd events +=============================================================================== +Changes for patch v199 + +- Removed obsolete usage of DEVFS_FL_NO_PERSISTENCE + +- Send DEVFSD_NOTIFY_REGISTERED events in <devfs_mk_dir> + +- Fixed locking bug in <devfs_d_revalidate_wait> due to typo + +- Do not send CREATE, CHANGE, ASYNC_OPEN or DELETE events from devfsd + or children +=============================================================================== +Changes for patch v200 + +- Ported to kernel 2.5.1-pre2 +=============================================================================== +Changes for patch v201 + +- Fixed bug in <devfsd_read>: was dereferencing freed pointer +=============================================================================== +Changes for patch v202 + +- Fixed bug in <devfsd_close>: was dereferencing freed pointer + +- Added process group check for devfsd privileges +=============================================================================== +Changes for patch v203 + +- Use SLAB_ATOMIC in <devfsd_notify_de> from <devfs_d_delete> +=============================================================================== +Changes for patch v204 + +- Removed long obsolete rc.devfs + +- Return old entry in <devfs_mk_dir> for 2.4.x kernels + +- Updated README from master HTML file + +- Increment refcount on module in <check_disc_changed> + +- Created <devfs_get_handle> and exported <devfs_put> + +- Increment refcount on module in <devfs_get_ops> + +- Created <devfs_put_ops> and used where needed to fix races + +- Added clarifying comments in response to preliminary EMC code review + +- Added poisoning to <devfs_put> + +- Improved debugging messages + +- Fixed unregister bugs in drivers/md/lvm-fs.c +=============================================================================== +Changes for patch v205 + +- Corrected (made useful) debugging message in <unregister> + +- Moved <kmem_cache_create> in <mount_devfs_fs> to <init_devfs_fs> + +- Fixed drivers/md/lvm-fs.c to create "lvm" entry + +- Added magic number to guard against scribbling drivers + +- Only return old entry in <devfs_mk_dir> if a directory + +- Defined macros for error and debug messages + +- Updated README from master HTML file +=============================================================================== +Changes for patch v206 + +- Added support for multiple Compaq cpqarray controllers + +- Fixed (rare, old) race in <devfs_lookup> +=============================================================================== +Changes for patch v207 + +- Fixed deadlock bug in <devfs_d_revalidate_wait> + +- Tag VFS deletable in <devfs_mk_symlink> if handle ignored + +- Updated README from master HTML file +=============================================================================== +Changes for patch v208 + +- Added KERN_* to remaining messages + +- Cleaned up declaration of <stat_read> + +- Updated README from master HTML file +=============================================================================== +Changes for patch v209 + +- Updated README from master HTML file + +- Removed silently introduced calls to lock_kernel() and + unlock_kernel() due to recent VFS locking changes. BKL isn't + required in devfs + +- Changed <devfs_rmdir> to allow later additions if not yet empty + +- Added calls to <devfs_register_partitions> in drivers/block/blkpc.c + <add_partition> and <del_partition> + +- Fixed bug in <devfs_alloc_unique_number>: was clearing beyond + bitfield + +- Fixed bitfield data type for <devfs_*alloc_devnum> + +- Made major bitfield type and initialiser 64 bit safe +=============================================================================== +Changes for patch v210 + +- Updated fs/devfs/util.c to fix shift warning on 64 bit machines + Thanks to Anton Blanchard <anton@samba.org> + +- Updated README from master HTML file +=============================================================================== +Changes for patch v211 + +- Do not put miscellaneous character devices in /dev/misc if they + specify their own directory (i.e. contain a '/' character) + +- Copied macro for error messages from fs/devfs/base.c to + fs/devfs/util.c and made use of this macro + +- Removed 2.4.x compatibility code from fs/devfs/base.c +=============================================================================== +Changes for patch v212 + +- Added BKL to <devfs_open> because drivers still need it +=============================================================================== +Changes for patch v213 + +- Protected <scan_dir_for_removable> and <get_removable_partition> + from changing directory contents +=============================================================================== +Changes for patch v214 + +- Switched to ISO C structure field initialisers + +- Switch to set_current_state() and move before add_wait_queue() + +- Updated README from master HTML file + +- Fixed devfs entry leak in <devfs_readdir> when *readdir fails +=============================================================================== +Changes for patch v215 + +- Created <devfs_find_and_unregister> + +- Switched many functions from <devfs_find_handle> to + <devfs_find_and_unregister> + +- Switched many functions from <devfs_find_handle> to <devfs_get_handle> +=============================================================================== +Changes for patch v216 + +- Switched arch/ia64/sn/io/hcl.c from <devfs_find_handle> to + <devfs_get_handle> + +- Removed deprecated <devfs_find_handle> +=============================================================================== +Changes for patch v217 + +- Exported <devfs_find_and_unregister> and <devfs_only> to modules + +- Updated README from master HTML file + +- Fixed module unload race in <devfs_open> +=============================================================================== +Changes for patch v218 + +- Removed DEVFS_FL_AUTO_OWNER flag + +- Switched lingering structure field initialiser to ISO C + +- Added locking when setting/clearing flags + +- Documentation fix in fs/devfs/util.c diff --git a/Documentation/filesystems/devfs/README b/Documentation/filesystems/devfs/README new file mode 100644 index 000000000000..54366ecc241f --- /dev/null +++ b/Documentation/filesystems/devfs/README @@ -0,0 +1,1964 @@ +Devfs (Device File System) FAQ + + +Linux Devfs (Device File System) FAQ +Richard Gooch +20-AUG-2002 + + +Document languages: + + + + + + + +----------------------------------------------------------------------------- + +NOTE: the master copy of this document is available online at: + +http://www.atnf.csiro.au/~rgooch/linux/docs/devfs.html +and looks much better than the text version distributed with the +kernel sources. A mirror site is available at: + +http://www.ras.ucalgary.ca/~rgooch/linux/docs/devfs.html + +There is also an optional daemon that may be used with devfs. You can +find out more about it at: + +http://www.atnf.csiro.au/~rgooch/linux/ + +A mailing list is available which you may subscribe to. Send +email +to majordomo@oss.sgi.com with the following line in the +body of the message: +subscribe devfs +To unsubscribe, send the message body: +unsubscribe devfs +instead. The list is archived at + +http://oss.sgi.com/projects/devfs/archive/. + +----------------------------------------------------------------------------- + +Contents + + +What is it? + +Why do it? + +Who else does it? + +How it works + +Operational issues (essential reading) + +Instructions for the impatient +Permissions persistence across reboots +Dealing with drivers without devfs support +All the way with Devfs +Other Issues +Kernel Naming Scheme +Devfsd Naming Scheme +Old Compatibility Names +SCSI Host Probing Issues + + + +Device drivers currently ported + +Allocation of Device Numbers + +Questions and Answers + +Making things work +Alternatives to devfs +What I don't like about devfs +How to report bugs +Strange kernel messages +Compilation problems with devfsd + + +Other resources + +Translations of this document + + +----------------------------------------------------------------------------- + + +What is it? + +Devfs is an alternative to "real" character and block special devices +on your root filesystem. Kernel device drivers can register devices by +name rather than major and minor numbers. These devices will appear in +devfs automatically, with whatever default ownership and +protection the driver specified. A daemon (devfsd) can be used to +override these defaults. Devfs has been in the kernel since 2.3.46. + +NOTE that devfs is entirely optional. If you prefer the old +disc-based device nodes, then simply leave CONFIG_DEVFS_FS=n (the +default). In this case, nothing will change. ALSO NOTE that if you do +enable devfs, the defaults are such that full compatibility is +maintained with the old devices names. + +There are two aspects to devfs: one is the underlying device +namespace, which is a namespace just like any mounted filesystem. The +other aspect is the filesystem code which provides a view of the +device namespace. The reason I make a distinction is because devfs +can be mounted many times, with each mount showing the same device +namespace. Changes made are global to all mounted devfs filesystems. +Also, because the devfs namespace exists without any devfs mounts, you +can easily mount the root filesystem by referring to an entry in the +devfs namespace. + + +The cost of devfs is a small increase in kernel code size and memory +usage. About 7 pages of code (some of that in __init sections) and 72 +bytes for each entry in the namespace. A modest system has only a +couple of hundred device entries, so this costs a few more +pages. Compare this with the suggestion to put /dev on a <a +href="#why-faq-ramdisc">ramdisc. + +On a typical machine, the cost is under 0.2 percent. On a modest +system with 64 MBytes of RAM, the cost is under 0.1 percent. The +accusations of "bloatware" levelled at devfs are not justified. + +----------------------------------------------------------------------------- + + +Why do it? + +There are several problems that devfs addresses. Some of these +problems are more serious than others (depending on your point of +view), and some can be solved without devfs. However, the totality of +these problems really calls out for devfs. + +The choice is a patchwork of inefficient user space solutions, which +are complex and likely to be fragile, or to use a simple and efficient +devfs which is robust. + +There have been many counter-proposals to devfs, all seeking to +provide some of the benefits without actually implementing devfs. So +far there has been an absence of code and no proposed alternative has +been able to provide all the features that devfs does. Further, +alternative proposals require far more complexity in user-space (and +still deliver less functionality than devfs). Some people have the +mantra of reducing "kernel bloat", but don't consider the effects on +user-space. + +A good solution limits the total complexity of kernel-space and +user-space. + + +Major&minor allocation + +The existing scheme requires the allocation of major and minor device +numbers for each and every device. This means that a central +co-ordinating authority is required to issue these device numbers +(unless you're developing a "private" device driver), in order to +preserve uniqueness. Devfs shifts the burden to a namespace. This may +not seem like a huge benefit, but actually it is. Since driver authors +will naturally choose a device name which reflects the functionality +of the device, there is far less potential for namespace conflict. +Solving this requires a kernel change. + +/dev management + +Because you currently access devices through device nodes, these must +be created by the system administrator. For standard devices you can +usually find a MAKEDEV programme which creates all these (hundreds!) +of nodes. This means that changes in the kernel must be reflected by +changes in the MAKEDEV programme, or else the system administrator +creates device nodes by hand. + +The basic problem is that there are two separate databases of +major and minor numbers. One is in the kernel and one is in /dev (or +in a MAKEDEV programme, if you want to look at it that way). This is +duplication of information, which is not good practice. +Solving this requires a kernel change. + +/dev growth + +A typical /dev has over 1200 nodes! Most of these devices simply don't +exist because the hardware is not available. A huge /dev increases the +time to access devices (I'm just referring to the dentry lookup times +and the time taken to read inodes off disc: the next subsection shows +some more horrors). + +An example of how big /dev can grow is if we consider SCSI devices: + +host 6 bits (say up to 64 hosts on a really big machine) +channel 4 bits (say up to 16 SCSI buses per host) +id 4 bits +lun 3 bits +partition 6 bits +TOTAL 23 bits + + +This requires 8 Mega (1024*1024) inodes if we want to store all +possible device nodes. Even if we scrap everything but id,partition +and assume a single host adapter with a single SCSI bus and only one +logical unit per SCSI target (id), that's still 10 bits or 1024 +inodes. Each VFS inode takes around 256 bytes (kernel 2.1.78), so +that's 256 kBytes of inode storage on disc (assuming real inodes take +a similar amount of space as VFS inodes). This is actually not so bad, +because disc is cheap these days. Embedded systems would care about +256 kBytes of /dev inodes, but you could argue that embedded systems +would have hand-tuned /dev directories. I've had to do just that on my +embedded systems, but I would rather just leave it to devfs. + +Another issue is the time taken to lookup an inode when first +referenced. Not only does this take time in scanning through a list in +memory, but also the seek times to read the inodes off disc. +This could be solved in user-space using a clever programme which +scanned the kernel logs and deleted /dev entries which are not +available and created them when they were available. This programme +would need to be run every time a new module was loaded, which would +slow things down a lot. + +There is an existing programme called scsidev which will automatically +create device nodes for SCSI devices. It can do this by scanning files +in /proc/scsi. Unfortunately, to extend this idea to other device +nodes would require significant modifications to existing drivers (so +they too would provide information in /proc). This is a non-trivial +change (I should know: devfs has had to do something similar). Once +you go to this much effort, you may as well use devfs itself (which +also provides this information). Furthermore, such a system would +likely be implemented in an ad-hoc fashion, as different drivers will +provide their information in different ways. + +Devfs is much cleaner, because it (naturally) has a uniform mechanism +to provide this information: the device nodes themselves! + + +Node to driver file_operations translation + +There is an important difference between the way disc-based character +and block nodes and devfs entries make the connection between an entry +in /dev and the actual device driver. + +With the current 8 bit major and minor numbers the connection between +disc-based c&b nodes and per-major drivers is done through a +fixed-length table of 128 entries. The various filesystem types set +the inode operations for c&b nodes to {chr,blk}dev_inode_operations, +so when a device is opened a few quick levels of indirection bring us +to the driver file_operations. + +For miscellaneous character devices a second step is required: there +is a scan for the driver entry with the same minor number as the file +that was opened, and the appropriate minor open method is called. This +scanning is done *every time* you open a device node. Potentially, you +may be searching through dozens of misc. entries before you find your +open method. While not an enormous performance overhead, this does +seem pointless. + +Linux *must* move beyond the 8 bit major and minor barrier, +somehow. If we simply increase each to 16 bits, then the indexing +scheme used for major driver lookup becomes untenable, because the +major tables (one each for character and block devices) would need to +be 64 k entries long (512 kBytes on x86, 1 MByte for 64 bit +systems). So we would have to use a scheme like that used for +miscellaneous character devices, which means the search time goes up +linearly with the average number of major device drivers on your +system. Not all "devices" are hardware, some are higher-level drivers +like KGI, so you can get more "devices" without adding hardware +You can improve this by creating an ordered (balanced:-) +binary tree, in which case your search time becomes log(N). +Alternatively, you can use hashing to speed up the search. +But why do that search at all if you don't have to? Once again, it +seems pointless. + +Note that devfs doesn't use the major&minor system. For devfs +entries, the connection is done when you lookup the /dev entry. When +devfs_register() is called, an internal table is appended which has +the entry name and the file_operations. If the dentry cache doesn't +have the /dev entry already, this internal table is scanned to get the +file_operations, and an inode is created. If the dentry cache already +has the entry, there is *no lookup time* (other than the dentry scan +itself, but we can't avoid that anyway, and besides Linux dentries +cream other OS's which don't have them:-). Furthermore, the number of +node entries in a devfs is only the number of available device +entries, not the number of *conceivable* entries. Even if you remove +unnecessary entries in a disc-based /dev, the number of conceivable +entries remains the same: you just limit yourself in order to save +space. + +Devfs provides a fast connection between a VFS node and the device +driver, in a scalable way. + +/dev as a system administration tool + +Right now /dev contains a list of conceivable devices, most of which I +don't have. Devfs only shows those devices available on my +system. This means that listing /dev is a handy way of checking what +devices are available. + +Major&minor size + +Existing major and minor numbers are limited to 8 bits each. This is +now a limiting factor for some drivers, particularly the SCSI disc +driver, which consumes a single major number. Only 16 discs are +supported, and each disc may have only 15 partitions. Maybe this isn't +a problem for you, but some of us are building huge Linux systems with +disc arrays. With devfs an arbitrary pointer can be associated with +each device entry, which can be used to give an effective 32 bit +device identifier (i.e. that's like having a 32 bit minor +number). Since this is private to the kernel, there are no C library +compatibility issues which you would have with increasing major and +minor number sizes. See the section on "Allocation of Device Numbers" +for details on maintaining compatibility with userspace. + +Solving this requires a kernel change. + +Since writing this, the kernel has been modified so that the SCSI disc +driver has more major numbers allocated to it and now supports up to +128 discs. Since these major numbers are non-contiguous (a result of +unplanned expansion), the implementation is a little more cumbersome +than originally. + +Just like the changes to IPv4 to fix impending limitations in the +address space, people find ways around the limitations. In the long +run, however, solutions like IPv6 or devfs can't be put off forever. + +Read-only root filesystem + +Having your device nodes on the root filesystem means that you can't +operate properly with a read-only root filesystem. This is because you +want to change ownerships and protections of tty devices. Existing +practice prevents you using a CD-ROM as your root filesystem for a +*real* system. Sure, you can boot off a CD-ROM, but you can't change +tty ownerships, so it's only good for installing. + +Also, you can't use a shared NFS root filesystem for a cluster of +discless Linux machines (having tty ownerships changed on a common +/dev is not good). Nor can you embed your root filesystem in a +ROM-FS. + +You can get around this by creating a RAMDISC at boot time, making +an ext2 filesystem in it, mounting it somewhere and copying the +contents of /dev into it, then unmounting it and mounting it over +/dev. + +A devfs is a cleaner way of solving this. + +Non-Unix root filesystem + +Non-Unix filesystems (such as NTFS) can't be used for a root +filesystem because they variously don't support character and block +special files or symbolic links. You can't have a separate disc-based +or RAMDISC-based filesystem mounted on /dev because you need device +nodes before you can mount these. Devfs can be mounted without any +device nodes. Devlinks won't work because symlinks aren't supported. +An alternative solution is to use initrd to mount a RAMDISC initial +root filesystem (which is populated with a minimal set of device +nodes), and then construct a new /dev in another RAMDISC, and finally +switch to your non-Unix root filesystem. This requires clever boot +scripts and a fragile and conceptually complex boot procedure. + +Devfs solves this in a robust and conceptually simple way. + +PTY security + +Current pseudo-tty (pty) devices are owned by root and read-writable +by everyone. The user of a pty-pair cannot change +ownership/protections without being suid-root. + +This could be solved with a secure user-space daemon which runs as +root and does the actual creation of pty-pairs. Such a daemon would +require modification to *every* programme that wants to use this new +mechanism. It also slows down creation of pty-pairs. + +An alternative is to create a new open_pty() syscall which does much +the same thing as the user-space daemon. Once again, this requires +modifications to pty-handling programmes. + +The devfs solution allows a device driver to "tag" certain device +files so that when an unopened device is opened, the ownerships are +changed to the current euid and egid of the opening process, and the +protections are changed to the default registered by the driver. When +the device is closed ownership is set back to root and protections are +set back to read-write for everybody. No programme need be changed. +The devpts filesystem provides this auto-ownership feature for Unix98 +ptys. It doesn't support old-style pty devices, nor does it have all +the other features of devfs. + +Intelligent device management + +Devfs implements a simple yet powerful protocol for communication with +a device management daemon (devfsd) which runs in user space. It is +possible to send a message (either synchronously or asynchronously) to +devfsd on any event, such as registration/unregistration of device +entries, opening and closing devices, looking up inodes, scanning +directories and more. This has many possibilities. Some of these are +already implemented. See: + + +http://www.atnf.csiro.au/~rgooch/linux/ + +Device entry registration events can be used by devfsd to change +permissions of newly-created device nodes. This is one mechanism to +control device permissions. + +Device entry registration/unregistration events can be used to run +programmes or scripts. This can be used to provide automatic mounting +of filesystems when a new block device media is inserted into the +drive. + +Asynchronous device open and close events can be used to implement +clever permissions management. For example, the default permissions on +/dev/dsp do not allow everybody to read from the device. This is +sensible, as you don't want some remote user recording what you say at +your console. However, the console user is also prevented from +recording. This behaviour is not desirable. With asynchronous device +open and close events, you can have devfsd run a programme or script +when console devices are opened to change the ownerships for *other* +device nodes (such as /dev/dsp). On closure, you can run a different +script to restore permissions. An advantage of this scheme over +modifying the C library tty handling is that this works even if your +programme crashes (how many times have you seen the utmp database with +lingering entries for non-existent logins?). + +Synchronous device open events can be used to perform intelligent +device access protections. Before the device driver open() method is +called, the daemon must first validate the open attempt, by running an +external programme or script. This is far more flexible than access +control lists, as access can be determined on the basis of other +system conditions instead of just the UID and GID. + +Inode lookup events can be used to authenticate module autoload +requests. Instead of using kmod directly, the event is sent to +devfsd which can implement an arbitrary authentication before loading +the module itself. + +Inode lookup events can also be used to construct arbitrary +namespaces, without having to resort to populating devfs with symlinks +to devices that don't exist. + +Speculative Device Scanning + +Consider an application (like cdparanoia) that wants to find all +CD-ROM devices on the system (SCSI, IDE and other types), whether or +not their respective modules are loaded. The application must +speculatively open certain device nodes (such as /dev/sr0 for the SCSI +CD-ROMs) in order to make sure the module is loaded. This requires +that all Linux distributions follow the standard device naming scheme +(last time I looked RedHat did things differently). Devfs solves the +naming problem. + +The same application also wants to see which devices are actually +available on the system. With the existing system it needs to read the +/dev directory and speculatively open each /dev/sr* device to +determine if the device exists or not. With a large /dev this is an +inefficient operation, especially if there are many /dev/sr* nodes. A +solution like scsidev could reduce the number of /dev/sr* entries (but +of course that also requires all that inefficient directory scanning). + +With devfs, the application can open the /dev/sr directory +(which triggers the module autoloading if required), and proceed to +read /dev/sr. Since only the available devices will have +entries, there are no inefficencies in directory scanning or device +openings. + +----------------------------------------------------------------------------- + +Who else does it? + +FreeBSD has a devfs implementation. Solaris and AIX each have a +pseudo-devfs (something akin to scsidev but for all devices, with some +unspecified kernel support). BeOS, Plan9 and QNX also have it. SGI's +IRIX 6.4 and above also have a device filesystem. + +While we shouldn't just automatically do something because others do +it, we should not ignore the work of others either. FreeBSD has a lot +of competent people working on it, so their opinion should not be +blithely ignored. + +----------------------------------------------------------------------------- + + +How it works + +Registering device entries + +For every entry (device node) in a devfs-based /dev a driver must call +devfs_register(). This adds the name of the device entry, the +file_operations structure pointer and a few other things to an +internal table. Device entries may be added and removed at any +time. When a device entry is registered, it automagically appears in +any mounted devfs'. + +Inode lookup + +When a lookup operation on an entry is performed and if there is no +driver information for that entry devfs will attempt to call +devfsd. If still no driver information can be found then a negative +dentry is yielded and the next stage operation will be called by the +VFS (such as create() or mknod() inode methods). If driver information +can be found, an inode is created (if one does not exist already) and +all is well. + +Manually creating device nodes + +The mknod() method allows you to create an ordinary named pipe in the +devfs, or you can create a character or block special inode if one +does not already exist. You may wish to create a character or block +special inode so that you can set permissions and ownership. Later, if +a device driver registers an entry with the same name, the +permissions, ownership and times are retained. This is how you can set +the protections on a device even before the driver is loaded. Once you +create an inode it appears in the directory listing. + +Unregistering device entries + +A device driver calls devfs_unregister() to unregister an entry. + +Chroot() gaols + +2.2.x kernels + +The semantics of inode creation are different when devfs is mounted +with the "explicit" option. Now, when a device entry is registered, it +will not appear until you use mknod() to create the device. It doesn't +matter if you mknod() before or after the device is registered with +devfs_register(). The purpose of this behaviour is to support +chroot(2) gaols, where you want to mount a minimal devfs inside the +gaol. Only the devices you specifically want to be available (through +your mknod() setup) will be accessible. + +2.4.x kernels + +As of kernel 2.3.99, the VFS has had the ability to rebind parts of +the global filesystem namespace into another part of the namespace. +This now works even at the leaf-node level, which means that +individual files and device nodes may be bound into other parts of the +namespace. This is like making links, but better, because it works +across filesystems (unlike hard links) and works through chroot() +gaols (unlike symbolic links). + +Because of these improvements to the VFS, the multi-mount capability +in devfs is no longer needed. The administrator may create a minimal +device tree inside a chroot(2) gaol by using VFS bindings. As this +provides most of the features of the devfs multi-mount capability, I +removed the multi-mount support code (after issuing an RFC). This +yielded code size reductions and simplifications. + +If you want to construct a minimal chroot() gaol, the following +command should suffice: + +mount --bind /dev/null /gaol/dev/null + + +Repeat for other device nodes you want to expose. Simple! + +----------------------------------------------------------------------------- + + +Operational issues + + +Instructions for the impatient + +Nobody likes reading documentation. People just want to get in there +and play. So this section tells you quickly the steps you need to take +to run with devfs mounted over /dev. Skip these steps and you will end +up with a nearly unbootable system. Subsequent sections describe the +issues in more detail, and discuss non-essential configuration +options. + +Devfsd +OK, if you're reading this, I assume you want to play with +devfs. First you should ensure that /usr/src/linux contains a +recent kernel source tree. Then you need to compile devfsd, the device +management daemon, available at + +http://www.atnf.csiro.au/~rgooch/linux/. +Because the kernel has a naming scheme +which is quite different from the old naming scheme, you need to +install devfsd so that software and configuration files that use the +old naming scheme will not break. + +Compile and install devfsd. You will be provided with a default +configuration file /etc/devfsd.conf which will provide +compatibility symlinks for the old naming scheme. Don't change this +config file unless you know what you're doing. Even if you think you +do know what you're doing, don't change it until you've followed all +the steps below and booted a devfs-enabled system and verified that it +works. + +Now edit your main system boot script so that devfsd is started at the +very beginning (before any filesystem +checks). /etc/rc.d/rc.sysinit is often the main boot script +on systems with SysV-style boot scripts. On systems with BSD-style +boot scripts it is often /etc/rc. Also check +/sbin/rc. + +NOTE that the line you put into the boot +script should be exactly: + +/sbin/devfsd /dev + +DO NOT use some special daemon-launching +programme, otherwise the boot script may not wait for devfsd to finish +initialising. + +System Libraries +There may still be some problems because of broken software making +assumptions about device names. In particular, some software does not +handle devices which are symbolic links. If you are running a libc 5 +based system, install libc 5.4.44 (if you have libc 5.4.46, go back to +libc 5.4.44, which is actually correct). If you are running a glibc +based system, make sure you have glibc 2.1.3 or later. + +/etc/securetty +PAM (Pluggable Authentication Modules) is supposed to be a flexible +mechanism for providing better user authentication and access to +services. Unfortunately, it's also fragile, complex and undocumented +(check out RedHat 6.1, and probably other distributions as well). PAM +has problems with symbolic links. Append the following lines to your +/etc/securetty file: + +vc/1 +vc/2 +vc/3 +vc/4 +vc/5 +vc/6 +vc/7 +vc/8 + +This will not weaken security. If you have a version of util-linux +earlier than 2.10.h, please upgrade to 2.10.h or later. If you +absolutely cannot upgrade, then also append the following lines to +your /etc/securetty file: + +1 +2 +3 +4 +5 +6 +7 +8 + +This may potentially weaken security by allowing root logins over the +network (a password is still required, though). However, since there +are problems with dealing with symlinks, I'm suspicious of the level +of security offered in any case. + +XFree86 +While not essential, it's probably a good idea to upgrade to XFree86 +4.0, as patches went in to make it more devfs-friendly. If you don't, +you'll probably need to apply the following patch to +/etc/security/console.perms so that ordinary users can run +startx. Note that not all distributions have this file (e.g. Debian), +so if it's not present, don't worry about it. + +--- /etc/security/console.perms.orig Sat Apr 17 16:26:47 1999 ++++ /etc/security/console.perms Fri Feb 25 23:53:55 2000 +@@ -14,7 +14,7 @@ + # man 5 console.perms + + # file classes -- these are regular expressions +-<console>=tty[0-9][0-9]* :[0-9]\.[0-9] :[0-9] ++<console>=tty[0-9][0-9]* vc/[0-9][0-9]* :[0-9]\.[0-9] :[0-9] + + # device classes -- these are shell-style globs + <floppy>=/dev/fd[0-1]* + +If the patch does not apply, then change the line: + +<console>=tty[0-9][0-9]* :[0-9]\.[0-9] :[0-9] + +with: + +<console>=tty[0-9][0-9]* vc/[0-9][0-9]* :[0-9]\.[0-9] :[0-9] + + +Disable devpts +I've had a report of devpts mounted on /dev/pts not working +correctly. Since devfs will also manage /dev/pts, there is no +need to mount devpts as well. You should either edit your +/etc/fstab so devpts is not mounted, or disable devpts from +your kernel configuration. + +Unsupported drivers +Not all drivers have devfs support. If you depend on one of these +drivers, you will need to create a script or tarfile that you can use +at boot time to create device nodes as appropriate. There is a +section which describes this. Another +section lists the drivers which have +devfs support. + +/dev/mouse + +Many disributions configure /dev/mouse to be the mouse device +for XFree86 and GPM. I actually think this is a bad idea, because it +adds another level of indirection. When looking at a config file, if +you see /dev/mouse you're left wondering which mouse +is being referred to. Hence I recommend putting the actual mouse +device (for example /dev/psaux) into your +/etc/X11/XF86Config file (and similarly for the GPM +configuration file). + +Alternatively, use the same technique used for unsupported drivers +described above. + +The Kernel +Finally, you need to make sure devfs is compiled into your kernel. Set +CONFIG_EXPERIMENTAL=y, CONFIG_DEVFS_FS=y and CONFIG_DEVFS_MOUNT=y by +using favourite configuration tool (i.e. make config or +make xconfig) and then make clean and then recompile your kernel and +modules. At boot, devfs will be mounted onto /dev. + +If you encounter problems booting (for example if you forgot a +configuration step), you can pass devfs=nomount at the kernel +boot command line. This will prevent the kernel from mounting devfs at +boot time onto /dev. + +In general, a kernel built with CONFIG_DEVFS_FS=y but without mounting +devfs onto /dev is completely safe, and requires no +configuration changes. One exception to take note of is when +LABEL= directives are used in /etc/fstab. In this +case you will be unable to boot properly. This is because the +mount(8) programme uses /proc/partitions as part of +the volume label search process, and the device names it finds are not +available, because setting CONFIG_DEVFS_FS=y changes the names in +/proc/partitions, irrespective of whether devfs is mounted. + +Now you've finished all the steps required. You're now ready to boot +your shiny new kernel. Enjoy. + +Changing the configuration + +OK, you've now booted a devfs-enabled system, and everything works. +Now you may feel like changing the configuration (common targets are +/etc/fstab and /etc/devfsd.conf). Since you have a +system that works, if you make any changes and it doesn't work, you +now know that you only have to restore your configuration files to the +default and it will work again. + + +Permissions persistence across reboots + +If you don't use mknod(2) to create a device file, nor use chmod(2) or +chown(2) to change the ownerships/permissions, the inode ctime will +remain at 0 (the epoch, 12 am, 1-JAN-1970, GMT). Anything with a ctime +later than this has had it's ownership/permissions changed. Hence, a +simple script or programme may be used to tar up all changed inodes, +prior to shutdown. Although effective, many consider this approach a +kludge. + +A much better approach is to use devfsd to save and restore +permissions. It may be configured to record changes in permissions and +will save them in a database (in fact a directory tree), and restore +these upon boot. This is an efficient method and results in immediate +saving of current permissions (unlike the tar approach, which saves +permissions at some unspecified future time). + +The default configuration file supplied with devfsd has config entries +which you may uncomment to enable persistence management. + +If you decide to use the tar approach anyway, be aware that tar will +first unlink(2) an inode before creating a new device node. The +unlink(2) has the effect of breaking the connection between a devfs +entry and the device driver. If you use the "devfs=only" boot option, +you lose access to the device driver, requiring you to reload the +module. I consider this a bug in tar (there is no real need to +unlink(2) the inode first). + +Alternatively, you can use devfsd to provide more sophisticated +management of device permissions. You can use devfsd to store +permissions for whole groups of devices with a single configuration +entry, rather than the conventional single entry per device entry. + +Permissions database stored in mounted-over /dev + +If you wish to save and restore your device permissions into the +disc-based /dev while still mounting devfs onto /dev +you may do so. This requires a 2.4.x kernel (in fact, 2.3.99 or +later), which has the VFS binding facility. You need to do the +following to set this up: + + + +make sure the kernel does not mount devfs at boot time + + +make sure you have a correct /dev/console entry in your +root file-system (where your disc-based /dev lives) + +create the /dev-state directory + + +add the following lines near the very beginning of your boot +scripts: + +mount --bind /dev /dev-state +mount -t devfs none /dev +devfsd /dev + + + + +add the following lines to your /etc/devfsd.conf file: + +REGISTER ^pt[sy] IGNORE +CREATE ^pt[sy] IGNORE +CHANGE ^pt[sy] IGNORE +DELETE ^pt[sy] IGNORE +REGISTER .* COPY /dev-state/$devname $devpath +CREATE .* COPY $devpath /dev-state/$devname +CHANGE .* COPY $devpath /dev-state/$devname +DELETE .* CFUNCTION GLOBAL unlink /dev-state/$devname +RESTORE /dev-state + +Note that the sample devfsd.conf file contains these lines, +as well as other sample configurations you may find useful. See the +devfsd distribution + + +reboot. + + + + +Permissions database stored in normal directory + +If you are using an older kernel which doesn't support VFS binding, +then you won't be able to have the permissions database in a +mounted-over /dev. However, you can still use a regular +directory to store the database. The sample /etc/devfsd.conf +file above may still be used. You will need to create the +/dev-state directory prior to installing devfsd. If you have +old permissions in /dev, then just copy (or move) the device +nodes over to the new directory. + +Which method is better? + +The best method is to have the permissions database stored in the +mounted-over /dev. This is because you will not need to copy +device nodes over to /dev-state, and because it allows you to +switch between devfs and non-devfs kernels, without requiring you to +copy permissions between /dev-state (for devfs) and +/dev (for non-devfs). + + +Dealing with drivers without devfs support + +Currently, not all device drivers in the kernel have been modified to +use devfs. Device drivers which do not yet have devfs support will not +automagically appear in devfs. The simplest way to create device nodes +for these drivers is to unpack a tarfile containing the required +device nodes. You can do this in your boot scripts. All your drivers +will now work as before. + +Hopefully for most people devfs will have enough support so that they +can mount devfs directly over /dev without losing most functionality +(i.e. losing access to various devices). As of 22-JAN-1998 (devfs +patch version 10) I am now running this way. All the devices I have +are available in devfs, so I don't lose anything. + +WARNING: if your configuration requires the old-style device names +(i.e. /dev/hda1 or /dev/sda1), you must install devfsd and configure +it to maintain compatibility entries. It is almost certain that you +will require this. Note that the kernel creates a compatibility entry +for the root device, so you don't need initrd. + +Note that you no longer need to mount devpts if you use Unix98 PTYs, +as devfs can manage /dev/pts itself. This saves you some RAM, as you +don't need to compile and install devpts. Note that some versions of +glibc have a bug with Unix98 pty handling on devfs systems. Contact +the glibc maintainers for a fix. Glibc 2.1.3 has the fix. + +Note also that apart from editing /etc/fstab, other things will need +to be changed if you *don't* install devfsd. Some software (like the X +server) hard-wire device names in their source. It really is much +easier to install devfsd so that compatibility entries are created. +You can then slowly migrate your system to using the new device names +(for example, by starting with /etc/fstab), and then limiting the +compatibility entries that devfsd creates. + +IF YOU CONFIGURE TO MOUNT DEVFS AT BOOT, MAKE SURE YOU INSTALL DEVFSD +BEFORE YOU BOOT A DEVFS-ENABLED KERNEL! + +Now that devfs has gone into the 2.3.46 kernel, I'm getting a lot of +reports back. Many of these are because people are trying to run +without devfsd, and hence some things break. Please just run devfsd if +things break. I want to concentrate on real bugs rather than +misconfiguration problems at the moment. If people are willing to fix +bugs/false assumptions in other code (i.e. glibc, X server) and submit +that to the respective maintainers, that would be great. + + +All the way with Devfs + +The devfs kernel patch creates a rationalised device tree. As stated +above, if you want to keep using the old /dev naming scheme, +you just need to configure devfsd appopriately (see the man +page). People who prefer the old names can ignore this section. For +those of us who like the rationalised names and an uncluttered +/dev, read on. + +If you don't run devfsd, or don't enable compatibility entry +management, then you will have to configure your system to use the new +names. For example, you will then need to edit your +/etc/fstab to use the new disc naming scheme. If you want to +be able to boot non-devfs kernels, you will need compatibility +symlinks in the underlying disc-based /dev pointing back to +the old-style names for when you boot a kernel without devfs. + +You can selectively decide which devices you want compatibility +entries for. For example, you may only want compatibility entries for +BSD pseudo-terminal devices (otherwise you'll have to patch you C +library or use Unix98 ptys instead). It's just a matter of putting in +the correct regular expression into /dev/devfsd.conf. + +There are other choices of naming schemes that you may prefer. For +example, I don't use the kernel-supplied +names, because they are too verbose. A common misconception is +that the kernel-supplied names are meant to be used directly in +configuration files. This is not the case. They are designed to +reflect the layout of the devices attached and to provide easy +classification. + +If you like the kernel-supplied names, that's fine. If you don't then +you should be using devfsd to construct a namespace more to your +liking. Devfsd has built-in code to construct a +namespace that is both logical and easy to +manage. In essence, it creates a convenient abbreviation of the +kernel-supplied namespace. + +You are of course free to build your own namespace. Devfsd has all the +infrastructure required to make this easy for you. All you need do is +write a script. You can even write some C code and devfsd can load the +shared object as a callable extension. + + +Other Issues + +The init programme +Another thing to take note of is whether your init programme +creates a Unix socket /dev/telinit. Some versions of init +create /dev/telinit so that the telinit programme can +communicate with the init process. If you have such a system you need +to make sure that devfs is mounted over /dev *before* init +starts. In other words, you can't leave the mounting of devfs to +/etc/rc, since this is executed after init. Other +versions of init require a named pipe /dev/initctl +which must exist *before* init starts. Once again, you need to +mount devfs and then create the named pipe *before* init +starts. + +The default behaviour now is not to mount devfs onto /dev at +boot time for 2.3.x and later kernels. You can correct this with the +"devfs=mount" boot option. This solves any problems with init, +and also prevents the dreaded: + +Cannot open initial console + +message. For 2.2.x kernels where you need to apply the devfs patch, +the default is to mount. + +If you have automatic mounting of devfs onto /dev then you +may need to create /dev/initctl in your boot scripts. The +following lines should suffice: + +mknod /dev/initctl p +kill -SIGUSR1 1 # tell init that /dev/initctl now exists + +Alternatively, if you don't want the kernel to mount devfs onto +/dev then you could use the following procedure is a +guideline for how to get around /dev/initctl problems: + +# cd /sbin +# mv init init.real +# cat > init +#! /bin/sh +mount -n -t devfs none /dev +mknod /dev/initctl p +exec /sbin/init.real $* +[control-D] +# chmod a+x init + +Note that newer versions of init create /dev/initctl +automatically, so you don't have to worry about this. + +Module autoloading +You will need to configure devfsd to enable module +autoloading. The following lines should be placed in your +/etc/devfsd.conf file: + +LOOKUP .* MODLOAD + + +As of devfsd-v1.3.10, a generic /etc/modules.devfs +configuration file is installed, which is used by the MODLOAD +action. This should be sufficient for most configurations. If you +require further configuration, edit your /etc/modules.conf +file. The way module autoloading work with devfs is: + + +a process attempts to lookup a device node (e.g. /dev/fred) + + +if that device node does not exist, the full pathname is passed to +devfsd as a string + + +devfsd will pass the string to the modprobe programme (provided the +configuration line shown above is present), and specifies that +/etc/modules.devfs is the configuration file + + +/etc/modules.devfs includes /etc/modules.conf to +access local configurations + +modprobe will search it's configuration files, looking for an alias +that translates the pathname into a module name + + +the translated pathname is then used to load the module. + + +If you wanted a lookup of /dev/fred to load the +mymod module, you would require the following configuration +line in /etc/modules.conf: + +alias /dev/fred mymod + +The /etc/modules.devfs configuration file provides many such +aliases for standard device names. If you look closely at this file, +you will note that some modules require multiple alias configuration +lines. This is required to support module autoloading for old and new +device names. + +Mounting root off a devfs device +If you wish to mount root off a devfs device when you pass the +"devfs=only" boot option, then you need to pass in the +"root=<device>" option to the kernel when booting. If you use +LILO, then you must have this in lilo.conf: + +append = "root=<device>" + +Surprised? Yep, so was I. It turns out if you have (as most people +do): + +root = <device> + + +then LILO will determine the device number of <device> and will +write that device number into a special place in the kernel image +before starting the kernel, and the kernel will use that device number +to mount the root filesystem. So, using the "append" variety ensures +that LILO passes the root filesystem device as a string, which devfs +can then use. + +Note that this isn't an issue if you don't pass "devfs=only". + +TTY issues +The ttyname(3) function in some versions of the C library makes +false assumptions about device entries which are symbolic links. The +tty(1) programme is one that depends on this function. I've +written a patch to libc 5.4.43 which fixes this. This has been +included in libc 5.4.44 and a similar fix is in glibc 2.1.3. + + +Kernel Naming Scheme + +The kernel provides a default naming scheme. This scheme is designed +to make it easy to search for specific devices or device types, and to +view the available devices. Some device types (such as hard discs), +have a directory of entries, making it easy to see what devices of +that class are available. Often, the entries are symbolic links into a +directory tree that reflects the topology of available devices. The +topological tree is useful for finding how your devices are arranged. + +Below is a list of the naming schemes for the most common drivers. A +list of reserved device names is +available for reference. Please send email to +rgooch@atnf.csiro.au to obtain an allocation. Please be +patient (the maintainer is busy). An alternative name may be allocated +instead of the requested name, at the discretion of the maintainer. + +Disc Devices + +All discs, whether SCSI, IDE or whatever, are placed under the +/dev/discs hierarchy: + + /dev/discs/disc0 first disc + /dev/discs/disc1 second disc + + +Each of these entries is a symbolic link to the directory for that +device. The device directory contains: + + disc for the whole disc + part* for individual partitions + + +CD-ROM Devices + +All CD-ROMs, whether SCSI, IDE or whatever, are placed under the +/dev/cdroms hierarchy: + + /dev/cdroms/cdrom0 first CD-ROM + /dev/cdroms/cdrom1 second CD-ROM + + +Each of these entries is a symbolic link to the real device entry for +that device. + +Tape Devices + +All tapes, whether SCSI, IDE or whatever, are placed under the +/dev/tapes hierarchy: + + /dev/tapes/tape0 first tape + /dev/tapes/tape1 second tape + + +Each of these entries is a symbolic link to the directory for that +device. The device directory contains: + + mt for mode 0 + mtl for mode 1 + mtm for mode 2 + mta for mode 3 + mtn for mode 0, no rewind + mtln for mode 1, no rewind + mtmn for mode 2, no rewind + mtan for mode 3, no rewind + + +SCSI Devices + +To uniquely identify any SCSI device requires the following +information: + + controller (host adapter) + bus (SCSI channel) + target (SCSI ID) + unit (Logical Unit Number) + + +All SCSI devices are placed under /dev/scsi (assuming devfs +is mounted on /dev). Hence, a SCSI device with the following +parameters: c=1,b=2,t=3,u=4 would appear as: + + /dev/scsi/host1/bus2/target3/lun4 device directory + + +Inside this directory, a number of device entries may be created, +depending on which SCSI device-type drivers were installed. + +See the section on the disc naming scheme to see what entries the SCSI +disc driver creates. + +See the section on the tape naming scheme to see what entries the SCSI +tape driver creates. + +The SCSI CD-ROM driver creates: + + cd + + +The SCSI generic driver creates: + + generic + + +IDE Devices + +To uniquely identify any IDE device requires the following +information: + + controller + bus (aka. primary/secondary) + target (aka. master/slave) + unit + + +All IDE devices are placed under /dev/ide, and uses a similar +naming scheme to the SCSI subsystem. + +XT Hard Discs + +All XT discs are placed under /dev/xd. The first XT disc has +the directory /dev/xd/disc0. + +TTY devices + +The tty devices now appear as: + + New name Old-name Device Type + -------- -------- ----------- + /dev/tts/{0,1,...} /dev/ttyS{0,1,...} Serial ports + /dev/cua/{0,1,...} /dev/cua{0,1,...} Call out devices + /dev/vc/0 /dev/tty Current virtual console + /dev/vc/{1,2,...} /dev/tty{1...63} Virtual consoles + /dev/vcc/{0,1,...} /dev/vcs{1...63} Virtual consoles + /dev/pty/m{0,1,...} /dev/ptyp?? PTY masters + /dev/pty/s{0,1,...} /dev/ttyp?? PTY slaves + + +RAMDISCS + +The RAMDISCS are placed in their own directory, and are named thus: + + /dev/rd/{0,1,2,...} + + +Meta Devices + +The meta devices are placed in their own directory, and are named +thus: + + /dev/md/{0,1,2,...} + + +Floppy discs + +Floppy discs are placed in the /dev/floppy directory. + +Loop devices + +Loop devices are placed in the /dev/loop directory. + +Sound devices + +Sound devices are placed in the /dev/sound directory +(audio, sequencer, ...). + + +Devfsd Naming Scheme + +Devfsd provides a naming scheme which is a convenient abbreviation of +the kernel-supplied namespace. In some +cases, the kernel-supplied naming scheme is quite convenient, so +devfsd does not provide another naming scheme. The convenience names +that devfsd creates are in fact the same names as the original devfs +kernel patch created (before Linus mandated the Big Name +Change). These are referred to as "new compatibility entries". + +In order to configure devfsd to create these convenience names, the +following lines should be placed in your /etc/devfsd.conf: + +REGISTER .* MKNEWCOMPAT +UNREGISTER .* RMNEWCOMPAT + +This will cause devfsd to create (and destroy) symbolic links which +point to the kernel-supplied names. + +SCSI Hard Discs + +All SCSI discs are placed under /dev/sd (assuming devfs is +mounted on /dev). Hence, a SCSI disc with the following +parameters: c=1,b=2,t=3,u=4 would appear as: + + /dev/sd/c1b2t3u4 for the whole disc + /dev/sd/c1b2t3u4p5 for the 5th partition + /dev/sd/c1b2t3u4p5s6 for the 6th slice in the 5th partition + + +SCSI Tapes + +All SCSI tapes are placed under /dev/st. A similar naming +scheme is used as for SCSI discs. A SCSI tape with the +parameters:c=1,b=2,t=3,u=4 would appear as: + + /dev/st/c1b2t3u4m0 for mode 0 + /dev/st/c1b2t3u4m1 for mode 1 + /dev/st/c1b2t3u4m2 for mode 2 + /dev/st/c1b2t3u4m3 for mode 3 + /dev/st/c1b2t3u4m0n for mode 0, no rewind + /dev/st/c1b2t3u4m1n for mode 1, no rewind + /dev/st/c1b2t3u4m2n for mode 2, no rewind + /dev/st/c1b2t3u4m3n for mode 3, no rewind + + +SCSI CD-ROMs + +All SCSI CD-ROMs are placed under /dev/sr. A similar naming +scheme is used as for SCSI discs. A SCSI CD-ROM with the +parameters:c=1,b=2,t=3,u=4 would appear as: + + /dev/sr/c1b2t3u4 + + +SCSI Generic Devices + +The generic (aka. raw) interface for all SCSI devices are placed under +/dev/sg. A similar naming scheme is used as for SCSI discs. A +SCSI generic device with the parameters:c=1,b=2,t=3,u=4 would appear +as: + + /dev/sg/c1b2t3u4 + + +IDE Hard Discs + +All IDE discs are placed under /dev/ide/hd, using a similar +convention to SCSI discs. The following mappings exist between the new +and the old names: + + /dev/hda /dev/ide/hd/c0b0t0u0 + /dev/hdb /dev/ide/hd/c0b0t1u0 + /dev/hdc /dev/ide/hd/c0b1t0u0 + /dev/hdd /dev/ide/hd/c0b1t1u0 + + +IDE Tapes + +A similar naming scheme is used as for IDE discs. The entries will +appear in the /dev/ide/mt directory. + +IDE CD-ROM + +A similar naming scheme is used as for IDE discs. The entries will +appear in the /dev/ide/cd directory. + +IDE Floppies + +A similar naming scheme is used as for IDE discs. The entries will +appear in the /dev/ide/fd directory. + +XT Hard Discs + +All XT discs are placed under /dev/xd. The first XT disc +would appear as /dev/xd/c0t0. + + +Old Compatibility Names + +The old compatibility names are the legacy device names, such as +/dev/hda, /dev/sda, /dev/rtc and so on. +Devfsd can be configured to create compatibility symlinks so that you +may continue to use the old names in your configuration files and so +that old applications will continue to function correctly. + +In order to configure devfsd to create these legacy names, the +following lines should be placed in your /etc/devfsd.conf: + +REGISTER .* MKOLDCOMPAT +UNREGISTER .* RMOLDCOMPAT + +This will cause devfsd to create (and destroy) symbolic links which +point to the kernel-supplied names. + + +----------------------------------------------------------------------------- + + +Device drivers currently ported + +- All miscellaneous character devices support devfs (this is done + transparently through misc_register()) + +- SCSI discs and generic hard discs + +- Character memory devices (null, zero, full and so on) + Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> + +- Loop devices (/dev/loop?) + +- TTY devices (console, serial ports, terminals and pseudo-terminals) + Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> + +- SCSI tapes (/dev/scsi and /dev/tapes) + +- SCSI CD-ROMs (/dev/scsi and /dev/cdroms) + +- SCSI generic devices (/dev/scsi) + +- RAMDISCS (/dev/ram?) + +- Meta Devices (/dev/md*) + +- Floppy discs (/dev/floppy) + +- Parallel port printers (/dev/printers) + +- Sound devices (/dev/sound) + Thanks to Eric Dumas <dumas@linux.eu.org> and + C. Scott Ananian <cananian@alumni.princeton.edu> + +- Joysticks (/dev/joysticks) + +- Sparc keyboard (/dev/kbd) + +- DSP56001 digital signal processor (/dev/dsp56k) + +- Apple Desktop Bus (/dev/adb) + +- Coda network file system (/dev/cfs*) + +- Virtual console capture devices (/dev/vcc) + Thanks to Dennis Hou <smilax@mindmeld.yi.org> + +- Frame buffer devices (/dev/fb) + +- Video capture devices (/dev/v4l) + + +----------------------------------------------------------------------------- + + +Allocation of Device Numbers + +Devfs allows you to write a driver which doesn't need to allocate a +device number (major&minor numbers) for the internal operation of the +kernel. However, there are a number of userspace programmes that use +the device number as a unique handle for a device. An example is the +find programme, which uses device numbers to determine whether +an inode is on a different filesystem than another inode. The device +number used is the one for the block device which a filesystem is +using. To preserve compatibility with userspace programmes, block +devices using devfs need to have unique device numbers allocated to +them. Furthermore, POSIX specifies device numbers, so some kind of +device number needs to be presented to userspace. + +The simplest option (especially when porting drivers to devfs) is to +keep using the old major and minor numbers. Devfs will take whatever +values are given for major&minor and pass them onto userspace. + +This device number is a 16 bit number, so this leaves plenty of space +for large numbers of discs and partitions. This scheme can also be +used for character devices, in particular the tty devices, which are +currently limited to 256 pseudo-ttys (this limits the total number of +simultaneous xterms and remote logins). Note that the device number +is limited to the range 36864-61439 (majors 144-239), in order to +avoid any possible conflicts with existing official allocations. + +Please note that using dynamically allocated block device numbers may +break the NFS daemons (both user and kernel mode), which expect dev_t +for a given device to be constant over the lifetime of remote mounts. + +A final note on this scheme: since it doesn't increase the size of +device numbers, there are no compatibility issues with userspace. + +----------------------------------------------------------------------------- + + +Questions and Answers + + +Making things work +Alternatives to devfs +What I don't like about devfs +How to report bugs +Strange kernel messages +Compilation problems with devfsd + + + +Making things work + +Here are some common questions and answers. + + + +Devfsd doesn't start + +Make sure you have compiled and installed devfsd +Make sure devfsd is being started from your boot +scripts +Make sure you have configured your kernel to enable devfs (see +below) +Make sure devfs is mounted (see below) + + +Devfsd is not managing all my permissions + +Make sure you are capturing the appropriate events. For example, +device entries created by the kernel generate REGISTER events, +but those created by devfsd generate CREATE events. + + +Devfsd is not capturing all REGISTER events + +See the previous entry: you may need to capture CREATE events. + + +X will not start + +Make sure you followed the steps +outlined above. + + +Why don't my network devices appear in devfs? + +This is not a bug. Network devices have their own, completely separate +namespace. They are accessed via socket(2) and +setsockopt(2) calls, and thus require no device nodes. I have +raised the possibilty of moving network devices into the device +namespace, but have had no response. + + +How can I test if I have devfs compiled into my kernel? + +All filesystems built-in or currently loaded are listed in +/proc/filesystems. If you see a devfs entry, then +you know that devfs was compiled into your kernel. If you have +correctly configured and rebuilt your kernel, then devfs will be +built-in. If you think you've configured it in, but +/proc/filesystems doesn't show it, you've made a mistake. +Common mistakes include: + +Using a 2.2.x kernel without applying the devfs patch (if you +don't know how to patch your kernel, use 2.4.x instead, don't bother +asking me how to patch) +Forgetting to set CONFIG_EXPERIMENTAL=y +Forgetting to set CONFIG_DEVFS_FS=y +Forgetting to set CONFIG_DEVFS_MOUNT=y (if you want devfs +to be automatically mounted at boot) +Editing your .config manually, instead of using make +config or make xconfig +Forgetting to run make dep; make clean after changing the +configuration and before compiling +Forgetting to compile your kernel and modules +Forgetting to install your kernel +Forgetting to install your modules + +Please check twice that you've done all these steps before sending in +a bug report. + + + +How can I test if devfs is mounted on /dev? + +The device filesystem will always create an entry called +".devfsd", which is used to communicate with the daemon. Even +if the daemon is not running, this entry will exist. Testing for the +existence of this entry is the approved method of determining if devfs +is mounted or not. Note that the type of entry (i.e. regular file, +character device, named pipe, etc.) may change without notice. Only +the existence of the entry should be relied upon. + + +When I start devfsd, I see the error: +Error opening file: ".devfsd" No such file or directory? + +This means that devfs is not mounted. Make sure you have devfs mounted. + + +How do I mount devfs? + +First make sure you have devfs compiled into your kernel (see +above). Then you will either need to: + +set CONFIG_DEVFS_MOUNT=y in your kernel config +pass devfs=mount to your boot loader +mount devfs manually in your boot scripts with: +mount -t none devfs /dev + + + +Mount by volume LABEL=<label> doesn't work with +devfs + +Most probably you are not mounting devfs onto /dev. What +happens is that if your kernel config has CONFIG_DEVFS_FS=y +then the contents of /proc/partitions will have the devfs +names (such as scsi/host0/bus0/target0/lun0/part1). The +contents of /proc/partitions are used by mount(8) when +mounting by volume label. If devfs is not mounted on /dev, +then mount(8) will fail to find devices. The solution is to +make sure that devfs is mounted on /dev. See above for how to +do that. + + +I have extra or incorrect entries in /dev + +You may have stale entries in your dev-state area. Check for a +RESTORE configuration line in your devfsd configuration +(typically /etc/devfsd.conf). If you have this line, check +the contents of the specified directory for stale entries. Remove +any entries which are incorrect, then reboot. + + +I get "Unable to open initial console" messages at boot + +This usually happens when you don't have devfs automounted onto +/dev at boot time, and there is no valid +/dev/console entry on your root file-system. Create a valid +/dev/console device node. + + + + + +Alternatives to devfs + +I've attempted to collate all the anti-devfs proposals and explain +their limitations. Under construction. + + +Why not just pass device create/remove events to a daemon? + +Here the suggestion is to develop an API in the kernel so that devices +can register create and remove events, and a daemon listens for those +events. The daemon would then populate/depopulate /dev (which +resides on disc). + +This has several limitations: + + +it only works for modules loaded and unloaded (or devices inserted +and removed) after the kernel has finished booting. Without a database +of events, there is no way the daemon could fully populate +/dev + + +if you add a database to this scheme, the question is then how to +present that database to user-space. If you make it a list of strings +with embedded event codes which are passed through a pipe to the +daemon, then this is only of use to the daemon. I would argue that the +natural way to present this data is via a filesystem (since many of +the events will be of a hierarchical nature), such as devfs. +Presenting the data as a filesystem makes it easy for the user to see +what is available and also makes it easy to write scripts to scan the +"database" + + +the tight binding between device nodes and drivers is no longer +possible (requiring the otherwise perfectly avoidable +table lookups) + + +you cannot catch inode lookup events on /dev which means +that module autoloading requires device nodes to be created. This is a +problem, particularly for drivers where only a few inodes are created +from a potentially large set + + +this technique can't be used when the root FS is mounted +read-only + + + + +Just implement a better scsidev + +This suggestion involves taking the scsidev programme and +extending it to scan for all devices, not just SCSI devices. The +scsidev programme works by scanning /proc/scsi + +Problems: + + +the kernel does not currently provide a list of all devices +available. Not all drivers register entries in /proc or +generate kernel messages + + +there is no uniform mechanism to register devices other than the +devfs API + + +implementing such an API is then the same as the +proposal above + + + + +Put /dev on a ramdisc + +This suggestion involves creating a ramdisc and populating it with +device nodes and then mounting it over /dev. + +Problems: + + + +this doesn't help when mounting the root filesystem, since you +still need a device node to do that + + +if you want to use this technique for the root device node as +well, you need to use initrd. This complicates the booting sequence +and makes it significantly harder to administer and configure. The +initrd is essentially opaque, robbing the system administrator of easy +configuration + + +insufficient information is available to correctly populate the +ramdisc. So we come back to the +proposal above to "solve" this + + +a ramdisc-based solution would take more kernel memory, since the +backing store would be (at best) normal VFS inodes and dentries, which +take 284 bytes and 112 bytes, respectively, for each entry. Compare +that to 72 bytes for devfs + + + + +Do nothing: there's no problem + +Sometimes people can be heard to claim that the existing scheme is +fine. This is what they're ignoring: + + +device number size (8 bits each for major and minor) is a real +limitation, and must be fixed somehow. Systems with large numbers of +SCSI devices, for example, will continue to consume the remaining +unallocated major numbers. USB will also need to push beyond the 8 bit +minor limitation + + +simply increasing the device number size is insufficient. Apart +from causing a lot of pain, it doesn't solve the management issues +of a /dev with thousands or more device nodes + + +ignoring the problem of a huge /dev will not make it go +away, and dismisses the legitimacy of a large number of people who +want a dynamic /dev + + +the standard response then becomes: "write a device management +daemon", which brings us back to the +proposal above + + + + +What I don't like about devfs + +Here are some common complaints about devfs, and some suggestions and +solutions that may make it more palatable for you. I can't please +everybody, but I do try :-) + +I hate the naming scheme + +First, remember that no naming scheme will please everybody. You hate +the scheme, others love it. Who's to say who's right and who's wrong? +Ultimately, the person who writes the code gets to choose, and what +exists now is a combination of the choices made by the +devfs author and the +kernel maintainer (Linus). + +However, not all is lost. If you want to create your own naming +scheme, it is a simple matter to write a standalone script, hack +devfsd, or write a script called by devfsd. You can create whatever +naming scheme you like. + +Further, if you want to remove all traces of the devfs naming scheme +from /dev, you can mount devfs elsewhere (say +/devfs) and populate /dev with links into +/devfs. This population can be automated using devfsd if you +wish. + +You can even use the VFS binding facility to make the links, rather +than using symbolic links. This way, you don't even have to see the +"destination" of these symbolic links. + +Devfs puts policy into the kernel + +There's already policy in the kernel. Device numbers are in fact +policy (why should the kernel dictate what device numbers I use?). +Face it, some policy has to be in the kernel. The real difference +between device names as policy and device numbers as policy is that +no one will use device numbers directly, because device +numbers are devoid of meaning to humans and are ugly. At least with +the devfs device names, (even though you can add your own naming +scheme) some people will use the devfs-supplied names directly. This +offends some people :-) + +Devfs is bloatware + +This is not even remotely true. As shown above, +both code and data size are quite modest. + + +How to report bugs + +If you have (or think you have) a bug with devfs, please follow the +steps below: + + + +make sure you have enabled debugging output when configuring your +kernel. You will need to set (at least) the following config options: + +CONFIG_DEVFS_DEBUG=y +CONFIG_DEBUG_KERNEL=y +CONFIG_DEBUG_SLAB=y + + + +please make sure you have the latest devfs patches applied. The +latest kernel version might not have the latest devfs patches applied +yet (Linus is very busy) + + +save a copy of your complete kernel logs (preferably by +using the dmesg programme) for later inclusion in your bug +report. You may need to use the -s switch to increase the +internal buffer size so you can capture all the boot messages. +Don't edit or trim the dmesg output + + + + +try booting with devfs=dall passed to the kernel boot +command line (read the documentation on your bootloader on how to do +this), and save the result to a file. This may be quite verbose, and +it may overflow the messages buffer, but try to get as much of it as +you can + + +if you get an Oops, run ksymoops to decode it so that the +names of the offending functions are provided. A non-decoded Oops is +pretty useless + + +send a copy of your devfsd configuration file(s) + +send the bug report to me first. +Don't expect that I will see it if you post it to the linux-kernel +mailing list. Include all the information listed above, plus +anything else that you think might be relevant. Put the string +devfs somewhere in the subject line, so my mail filters mark +it as urgent + + + + +Here is a general guide on how to ask questions in a way that greatly +improves your chances of getting a reply: + +http://www.tuxedo.org/~esr/faqs/smart-questions.html. If you have +a bug to report, you should also read + +http://www.chiark.greenend.org.uk/~sgtatham/bugs.html. + + +Strange kernel messages + +You may see devfs-related messages in your kernel logs. Below are some +messages and what they mean (and what you should do about them, if +anything). + + + +devfs_register(fred): could not append to parent, err: -17 + +You need to check what the error code means, but usually 17 means +EEXIST. This means that a driver attempted to create an entry +fred in a directory, but there already was an entry with that +name. This is often caused by flawed boot scripts which untar a bunch +of inodes into /dev, as a way to restore permissions. This +message is harmless, as the device nodes will still +provide access to the driver (unless you use the devfs=only +boot option, which is only for dedicated souls:-). If you want to get +rid of these annoying messages, upgrade to devfsd-v1.3.20 and use the +recommended RESTORE directive to restore permissions. + + +devfs_mk_dir(bill): using old entry in dir: c1808724 "" + +This is similar to the message above, except that a driver attempted +to create a directory named bill, and the parent directory +has an entry with the same name. In this case, to ensure that drivers +continue to work properly, the old entry is re-used and given to the +driver. In 2.5 kernels, the driver is given a NULL entry, and thus, +under rare circumstances, may not create the require device nodes. +The solution is the same as above. + + + + + +Compilation problems with devfsd + +Usually, you can compile devfsd just by typing in +make in the source directory, followed by a make +install (as root). Sometimes, you may have problems, particularly +on broken configurations. + + + +error messages relating to DEVFSD_NOTIFY_DELETE + +This happened because you have an ancient set of kernel headers +installed in /usr/include/linux or /usr/src/linux. +Install kernel 2.4.10 or later. You may need to pass the +KERNEL_DIR variable to make (if you did not install +the new kernel sources as /usr/src/linux), or you may copy +the devfs_fs.h file in the kernel source tree into +/usr/include/linux. + + + + +----------------------------------------------------------------------------- + + +Other resources + + + +Douglas Gilbert has written a useful document at + +http://www.torque.net/sg/devfs_scsi.html which +explores the SCSI subsystem and how it interacts with devfs + + +Douglas Gilbert has written another useful document at + +http://www.torque.net/scsi/SCSI-2.4-HOWTO/ which +discusses the Linux SCSI subsystem in 2.4. + + +Johannes Erdfelt has started a discussion paper on Linux and +hot-swap devices, describing what the requirements are for a scalable +solution and how and why he's used devfs+devfsd. Note that this is an +early draft only, available in plain text form at: + +http://johannes.erdfelt.com/hotswap.txt. +Johannes has promised a HTML version will follow. + + +I presented an invited +paper +at the + +2nd Annual Storage Management Workshop held in Miamia, Florida, +U.S.A. in October 2000. + + + + +----------------------------------------------------------------------------- + + +Translations of this document + +This document has been translated into other languages. + + + + +The document master (in English) by rgooch@atnf.csiro.au is +available at + +http://www.atnf.csiro.au/~rgooch/linux/docs/devfs.html + + + +A Korean translation by viatoris@nownuri.net is available at + +http://your.destiny.pe.kr/devfs/devfs.html + + + + +----------------------------------------------------------------------------- +Most flags courtesy of ITA's +Flags of All Countries +used with permission. diff --git a/Documentation/filesystems/devfs/ToDo b/Documentation/filesystems/devfs/ToDo new file mode 100644 index 000000000000..afd5a8f2c19b --- /dev/null +++ b/Documentation/filesystems/devfs/ToDo @@ -0,0 +1,40 @@ + Device File System (devfs) ToDo List + + Richard Gooch <rgooch@atnf.csiro.au> + + 3-JUL-2000 + +This is a list of things to be done for better devfs support in the +Linux kernel. If you'd like to contribute to the devfs, please have a +look at this list for anything that is unallocated. Also, if there are +items missing (surely), please contact me so I can add them to the +list (preferably with your name attached to them:-). + + +- >256 ptys + Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> + +- Amiga floppy driver (drivers/block/amiflop.c) + +- Atari floppy driver (drivers/block/ataflop.c) + +- SWIM3 (Super Woz Integrated Machine 3) floppy driver (drivers/block/swim3.c) + +- Amiga ZorroII ramdisc driver (drivers/block/z2ram.c) + +- Parallel port ATAPI CD-ROM (drivers/block/paride/pcd.c) + +- Parallel port ATAPI floppy (drivers/block/paride/pf.c) + +- AP1000 block driver (drivers/ap1000/ap.c, drivers/ap1000/ddv.c) + +- Archimedes floppy (drivers/acorn/block/fd1772.c) + +- MFM hard drive (drivers/acorn/block/mfmhd.c) + +- I2O block device (drivers/message/i2o/i2o_block.c) + +- ST-RAM device (arch/m68k/atari/stram.c) + +- Raw devices + diff --git a/Documentation/filesystems/devfs/boot-options b/Documentation/filesystems/devfs/boot-options new file mode 100644 index 000000000000..df3d33b03e0a --- /dev/null +++ b/Documentation/filesystems/devfs/boot-options @@ -0,0 +1,65 @@ +/* -*- auto-fill -*- */ + + Device File System (devfs) Boot Options + + Richard Gooch <rgooch@atnf.csiro.au> + + 18-AUG-2001 + + +When CONFIG_DEVFS_DEBUG is enabled, you can pass several boot options +to the kernel to debug devfs. The boot options are prefixed by +"devfs=", and are separated by commas. Spaces are not allowed. The +syntax looks like this: + +devfs=<option1>,<option2>,<option3> + +and so on. For example, if you wanted to turn on debugging for module +load requests and device registration, you would do: + +devfs=dmod,dreg + +You may prefix "no" to any option. This will invert the option. + + +Debugging Options +================= + +These requires CONFIG_DEVFS_DEBUG to be enabled. +Note that all debugging options have 'd' as the first character. By +default all options are off. All debugging output is sent to the +kernel logs. The debugging options do not take effect until the devfs +version message appears (just prior to the root filesystem being +mounted). + +These are the options: + +dmod print module load requests to <request_module> + +dreg print device register requests to <devfs_register> + +dunreg print device unregister requests to <devfs_unregister> + +dchange print device change requests to <devfs_set_flags> + +dilookup print inode lookup requests + +diget print VFS inode allocations + +diunlink print inode unlinks + +dichange print inode changes + +dimknod print calls to mknod(2) + +dall some debugging turned on + + +Other Options +============= + +These control the default behaviour of devfs. The options are: + +mount mount devfs onto /dev at boot time + +only disable non-devfs device nodes for devfs-capable drivers diff --git a/Documentation/filesystems/directory-locking b/Documentation/filesystems/directory-locking new file mode 100644 index 000000000000..34380d4fbce3 --- /dev/null +++ b/Documentation/filesystems/directory-locking @@ -0,0 +1,113 @@ + Locking scheme used for directory operations is based on two +kinds of locks - per-inode (->i_sem) and per-filesystem (->s_vfs_rename_sem). + + For our purposes all operations fall in 5 classes: + +1) read access. Locking rules: caller locks directory we are accessing. + +2) object creation. Locking rules: same as above. + +3) object removal. Locking rules: caller locks parent, finds victim, +locks victim and calls the method. + +4) rename() that is _not_ cross-directory. Locking rules: caller locks +the parent, finds source and target, if target already exists - locks it +and then calls the method. + +5) link creation. Locking rules: + * lock parent + * check that source is not a directory + * lock source + * call the method. + +6) cross-directory rename. The trickiest in the whole bunch. Locking +rules: + * lock the filesystem + * lock parents in "ancestors first" order. + * find source and target. + * if old parent is equal to or is a descendent of target + fail with -ENOTEMPTY + * if new parent is equal to or is a descendent of source + fail with -ELOOP + * if target exists - lock it. + * call the method. + + +The rules above obviously guarantee that all directories that are going to be +read, modified or removed by method will be locked by caller. + + +If no directory is its own ancestor, the scheme above is deadlock-free. +Proof: + + First of all, at any moment we have a partial ordering of the +objects - A < B iff A is an ancestor of B. + + That ordering can change. However, the following is true: + +(1) if object removal or non-cross-directory rename holds lock on A and + attempts to acquire lock on B, A will remain the parent of B until we + acquire the lock on B. (Proof: only cross-directory rename can change + the parent of object and it would have to lock the parent). + +(2) if cross-directory rename holds the lock on filesystem, order will not + change until rename acquires all locks. (Proof: other cross-directory + renames will be blocked on filesystem lock and we don't start changing + the order until we had acquired all locks). + +(3) any operation holds at most one lock on non-directory object and + that lock is acquired after all other locks. (Proof: see descriptions + of operations). + + Now consider the minimal deadlock. Each process is blocked on +attempt to acquire some lock and already holds at least one lock. Let's +consider the set of contended locks. First of all, filesystem lock is +not contended, since any process blocked on it is not holding any locks. +Thus all processes are blocked on ->i_sem. + + Non-directory objects are not contended due to (3). Thus link +creation can't be a part of deadlock - it can't be blocked on source +and it means that it doesn't hold any locks. + + Any contended object is either held by cross-directory rename or +has a child that is also contended. Indeed, suppose that it is held by +operation other than cross-directory rename. Then the lock this operation +is blocked on belongs to child of that object due to (1). + + It means that one of the operations is cross-directory rename. +Otherwise the set of contended objects would be infinite - each of them +would have a contended child and we had assumed that no object is its +own descendent. Moreover, there is exactly one cross-directory rename +(see above). + + Consider the object blocking the cross-directory rename. One +of its descendents is locked by cross-directory rename (otherwise we +would again have an infinite set of of contended objects). But that +means that cross-directory rename is taking locks out of order. Due +to (2) the order hadn't changed since we had acquired filesystem lock. +But locking rules for cross-directory rename guarantee that we do not +try to acquire lock on descendent before the lock on ancestor. +Contradiction. I.e. deadlock is impossible. Q.E.D. + + + These operations are guaranteed to avoid loop creation. Indeed, +the only operation that could introduce loops is cross-directory rename. +Since the only new (parent, child) pair added by rename() is (new parent, +source), such loop would have to contain these objects and the rest of it +would have to exist before rename(). I.e. at the moment of loop creation +rename() responsible for that would be holding filesystem lock and new parent +would have to be equal to or a descendent of source. But that means that +new parent had been equal to or a descendent of source since the moment when +we had acquired filesystem lock and rename() would fail with -ELOOP in that +case. + + While this locking scheme works for arbitrary DAGs, it relies on +ability to check that directory is a descendent of another object. Current +implementation assumes that directory graph is a tree. This assumption is +also preserved by all operations (cross-directory rename on a tree that would +not introduce a cycle will leave it a tree and link() fails for directories). + + Notice that "directory" in the above == "anything that might have +children", so if we are going to introduce hybrid objects we will need +either to make sure that link(2) doesn't work for them or to make changes +in is_subdir() that would make it work even in presence of such beasts. diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt new file mode 100644 index 000000000000..b5cb9110cc6b --- /dev/null +++ b/Documentation/filesystems/ext2.txt @@ -0,0 +1,383 @@ + +The Second Extended Filesystem +============================== + +ext2 was originally released in January 1993. Written by R\'emy Card, +Theodore Ts'o and Stephen Tweedie, it was a major rewrite of the +Extended Filesystem. It is currently still (April 2001) the predominant +filesystem in use by Linux. There are also implementations available +for NetBSD, FreeBSD, the GNU HURD, Windows 95/98/NT, OS/2 and RISC OS. + +Options +======= + +Most defaults are determined by the filesystem superblock, and can be +set using tune2fs(8). Kernel-determined defaults are indicated by (*). + +bsddf (*) Makes `df' act like BSD. +minixdf Makes `df' act like Minix. + +check Check block and inode bitmaps at mount time + (requires CONFIG_EXT2_CHECK). +check=none, nocheck (*) Don't do extra checking of bitmaps on mount + (check=normal and check=strict options removed) + +debug Extra debugging information is sent to the + kernel syslog. Useful for developers. + +errors=continue Keep going on a filesystem error. +errors=remount-ro Remount the filesystem read-only on an error. +errors=panic Panic and halt the machine if an error occurs. + +grpid, bsdgroups Give objects the same group ID as their parent. +nogrpid, sysvgroups New objects have the group ID of their creator. + +nouid32 Use 16-bit UIDs and GIDs. + +oldalloc Enable the old block allocator. Orlov should + have better performance, we'd like to get some + feedback if it's the contrary for you. +orlov (*) Use the Orlov block allocator. + (See http://lwn.net/Articles/14633/ and + http://lwn.net/Articles/14446/.) + +resuid=n The user ID which may use the reserved blocks. +resgid=n The group ID which may use the reserved blocks. + +sb=n Use alternate superblock at this location. + +user_xattr Enable "user." POSIX Extended Attributes + (requires CONFIG_EXT2_FS_XATTR). + See also http://acl.bestbits.at +nouser_xattr Don't support "user." extended attributes. + +acl Enable POSIX Access Control Lists support + (requires CONFIG_EXT2_FS_POSIX_ACL). + See also http://acl.bestbits.at +noacl Don't support POSIX ACLs. + +nobh Do not attach buffer_heads to file pagecache. + +grpquota,noquota,quota,usrquota Quota options are silently ignored by ext2. + + +Specification +============= + +ext2 shares many properties with traditional Unix filesystems. It has +the concepts of blocks, inodes and directories. It has space in the +specification for Access Control Lists (ACLs), fragments, undeletion and +compression though these are not yet implemented (some are available as +separate patches). There is also a versioning mechanism to allow new +features (such as journalling) to be added in a maximally compatible +manner. + +Blocks +------ + +The space in the device or file is split up into blocks. These are +a fixed size, of 1024, 2048 or 4096 bytes (8192 bytes on Alpha systems), +which is decided when the filesystem is created. Smaller blocks mean +less wasted space per file, but require slightly more accounting overhead, +and also impose other limits on the size of files and the filesystem. + +Block Groups +------------ + +Blocks are clustered into block groups in order to reduce fragmentation +and minimise the amount of head seeking when reading a large amount +of consecutive data. Information about each block group is kept in a +descriptor table stored in the block(s) immediately after the superblock. +Two blocks near the start of each group are reserved for the block usage +bitmap and the inode usage bitmap which show which blocks and inodes +are in use. Since each bitmap is limited to a single block, this means +that the maximum size of a block group is 8 times the size of a block. + +The block(s) following the bitmaps in each block group are designated +as the inode table for that block group and the remainder are the data +blocks. The block allocation algorithm attempts to allocate data blocks +in the same block group as the inode which contains them. + +The Superblock +-------------- + +The superblock contains all the information about the configuration of +the filing system. The primary copy of the superblock is stored at an +offset of 1024 bytes from the start of the device, and it is essential +to mounting the filesystem. Since it is so important, backup copies of +the superblock are stored in block groups throughout the filesystem. +The first version of ext2 (revision 0) stores a copy at the start of +every block group, along with backups of the group descriptor block(s). +Because this can consume a considerable amount of space for large +filesystems, later revisions can optionally reduce the number of backup +copies by only putting backups in specific groups (this is the sparse +superblock feature). The groups chosen are 0, 1 and powers of 3, 5 and 7. + +The information in the superblock contains fields such as the total +number of inodes and blocks in the filesystem and how many are free, +how many inodes and blocks are in each block group, when the filesystem +was mounted (and if it was cleanly unmounted), when it was modified, +what version of the filesystem it is (see the Revisions section below) +and which OS created it. + +If the filesystem is revision 1 or higher, then there are extra fields, +such as a volume name, a unique identification number, the inode size, +and space for optional filesystem features to store configuration info. + +All fields in the superblock (as in all other ext2 structures) are stored +on the disc in little endian format, so a filesystem is portable between +machines without having to know what machine it was created on. + +Inodes +------ + +The inode (index node) is a fundamental concept in the ext2 filesystem. +Each object in the filesystem is represented by an inode. The inode +structure contains pointers to the filesystem blocks which contain the +data held in the object and all of the metadata about an object except +its name. The metadata about an object includes the permissions, owner, +group, flags, size, number of blocks used, access time, change time, +modification time, deletion time, number of links, fragments, version +(for NFS) and extended attributes (EAs) and/or Access Control Lists (ACLs). + +There are some reserved fields which are currently unused in the inode +structure and several which are overloaded. One field is reserved for the +directory ACL if the inode is a directory and alternately for the top 32 +bits of the file size if the inode is a regular file (allowing file sizes +larger than 2GB). The translator field is unused under Linux, but is used +by the HURD to reference the inode of a program which will be used to +interpret this object. Most of the remaining reserved fields have been +used up for both Linux and the HURD for larger owner and group fields, +The HURD also has a larger mode field so it uses another of the remaining +fields to store the extra more bits. + +There are pointers to the first 12 blocks which contain the file's data +in the inode. There is a pointer to an indirect block (which contains +pointers to the next set of blocks), a pointer to a doubly-indirect +block (which contains pointers to indirect blocks) and a pointer to a +trebly-indirect block (which contains pointers to doubly-indirect blocks). + +The flags field contains some ext2-specific flags which aren't catered +for by the standard chmod flags. These flags can be listed with lsattr +and changed with the chattr command, and allow specific filesystem +behaviour on a per-file basis. There are flags for secure deletion, +undeletable, compression, synchronous updates, immutability, append-only, +dumpable, no-atime, indexed directories, and data-journaling. Not all +of these are supported yet. + +Directories +----------- + +A directory is a filesystem object and has an inode just like a file. +It is a specially formatted file containing records which associate +each name with an inode number. Later revisions of the filesystem also +encode the type of the object (file, directory, symlink, device, fifo, +socket) to avoid the need to check the inode itself for this information +(support for taking advantage of this feature does not yet exist in +Glibc 2.2). + +The inode allocation code tries to assign inodes which are in the same +block group as the directory in which they are first created. + +The current implementation of ext2 uses a singly-linked list to store +the filenames in the directory; a pending enhancement uses hashing of the +filenames to allow lookup without the need to scan the entire directory. + +The current implementation never removes empty directory blocks once they +have been allocated to hold more files. + +Special files +------------- + +Symbolic links are also filesystem objects with inodes. They deserve +special mention because the data for them is stored within the inode +itself if the symlink is less than 60 bytes long. It uses the fields +which would normally be used to store the pointers to data blocks. +This is a worthwhile optimisation as it we avoid allocating a full +block for the symlink, and most symlinks are less than 60 characters long. + +Character and block special devices never have data blocks assigned to +them. Instead, their device number is stored in the inode, again reusing +the fields which would be used to point to the data blocks. + +Reserved Space +-------------- + +In ext2, there is a mechanism for reserving a certain number of blocks +for a particular user (normally the super-user). This is intended to +allow for the system to continue functioning even if non-priveleged users +fill up all the space available to them (this is independent of filesystem +quotas). It also keeps the filesystem from filling up entirely which +helps combat fragmentation. + +Filesystem check +---------------- + +At boot time, most systems run a consistency check (e2fsck) on their +filesystems. The superblock of the ext2 filesystem contains several +fields which indicate whether fsck should actually run (since checking +the filesystem at boot can take a long time if it is large). fsck will +run if the filesystem was not cleanly unmounted, if the maximum mount +count has been exceeded or if the maximum time between checks has been +exceeded. + +Feature Compatibility +--------------------- + +The compatibility feature mechanism used in ext2 is sophisticated. +It safely allows features to be added to the filesystem, without +unnecessarily sacrificing compatibility with older versions of the +filesystem code. The feature compatibility mechanism is not supported by +the original revision 0 (EXT2_GOOD_OLD_REV) of ext2, but was introduced in +revision 1. There are three 32-bit fields, one for compatible features +(COMPAT), one for read-only compatible (RO_COMPAT) features and one for +incompatible (INCOMPAT) features. + +These feature flags have specific meanings for the kernel as follows: + +A COMPAT flag indicates that a feature is present in the filesystem, +but the on-disk format is 100% compatible with older on-disk formats, so +a kernel which didn't know anything about this feature could read/write +the filesystem without any chance of corrupting the filesystem (or even +making it inconsistent). This is essentially just a flag which says +"this filesystem has a (hidden) feature" that the kernel or e2fsck may +want to be aware of (more on e2fsck and feature flags later). The ext3 +HAS_JOURNAL feature is a COMPAT flag because the ext3 journal is simply +a regular file with data blocks in it so the kernel does not need to +take any special notice of it if it doesn't understand ext3 journaling. + +An RO_COMPAT flag indicates that the on-disk format is 100% compatible +with older on-disk formats for reading (i.e. the feature does not change +the visible on-disk format). However, an old kernel writing to such a +filesystem would/could corrupt the filesystem, so this is prevented. The +most common such feature, SPARSE_SUPER, is an RO_COMPAT feature because +sparse groups allow file data blocks where superblock/group descriptor +backups used to live, and ext2_free_blocks() refuses to free these blocks, +which would leading to inconsistent bitmaps. An old kernel would also +get an error if it tried to free a series of blocks which crossed a group +boundary, but this is a legitimate layout in a SPARSE_SUPER filesystem. + +An INCOMPAT flag indicates the on-disk format has changed in some +way that makes it unreadable by older kernels, or would otherwise +cause a problem if an old kernel tried to mount it. FILETYPE is an +INCOMPAT flag because older kernels would think a filename was longer +than 256 characters, which would lead to corrupt directory listings. +The COMPRESSION flag is an obvious INCOMPAT flag - if the kernel +doesn't understand compression, you would just get garbage back from +read() instead of it automatically decompressing your data. The ext3 +RECOVER flag is needed to prevent a kernel which does not understand the +ext3 journal from mounting the filesystem without replaying the journal. + +For e2fsck, it needs to be more strict with the handling of these +flags than the kernel. If it doesn't understand ANY of the COMPAT, +RO_COMPAT, or INCOMPAT flags it will refuse to check the filesystem, +because it has no way of verifying whether a given feature is valid +or not. Allowing e2fsck to succeed on a filesystem with an unknown +feature is a false sense of security for the user. Refusing to check +a filesystem with unknown features is a good incentive for the user to +update to the latest e2fsck. This also means that anyone adding feature +flags to ext2 also needs to update e2fsck to verify these features. + +Metadata +-------- + +It is frequently claimed that the ext2 implementation of writing +asynchronous metadata is faster than the ffs synchronous metadata +scheme but less reliable. Both methods are equally resolvable by their +respective fsck programs. + +If you're exceptionally paranoid, there are 3 ways of making metadata +writes synchronous on ext2: + +per-file if you have the program source: use the O_SYNC flag to open() +per-file if you don't have the source: use "chattr +S" on the file +per-filesystem: add the "sync" option to mount (or in /etc/fstab) + +the first and last are not ext2 specific but do force the metadata to +be written synchronously. See also Journaling below. + +Limitations +----------- + +There are various limits imposed by the on-disk layout of ext2. Other +limits are imposed by the current implementation of the kernel code. +Many of the limits are determined at the time the filesystem is first +created, and depend upon the block size chosen. The ratio of inodes to +data blocks is fixed at filesystem creation time, so the only way to +increase the number of inodes is to increase the size of the filesystem. +No tools currently exist which can change the ratio of inodes to blocks. + +Most of these limits could be overcome with slight changes in the on-disk +format and using a compatibility flag to signal the format change (at +the expense of some compatibility). + +Filesystem block size: 1kB 2kB 4kB 8kB + +File size limit: 16GB 256GB 2048GB 2048GB +Filesystem size limit: 2047GB 8192GB 16384GB 32768GB + +There is a 2.4 kernel limit of 2048GB for a single block device, so no +filesystem larger than that can be created at this time. There is also +an upper limit on the block size imposed by the page size of the kernel, +so 8kB blocks are only allowed on Alpha systems (and other architectures +which support larger pages). + +There is an upper limit of 32768 subdirectories in a single directory. + +There is a "soft" upper limit of about 10-15k files in a single directory +with the current linear linked-list directory implementation. This limit +stems from performance problems when creating and deleting (and also +finding) files in such large directories. Using a hashed directory index +(under development) allows 100k-1M+ files in a single directory without +performance problems (although RAM size becomes an issue at this point). + +The (meaningless) absolute upper limit of files in a single directory +(imposed by the file size, the realistic limit is obviously much less) +is over 130 trillion files. It would be higher except there are not +enough 4-character names to make up unique directory entries, so they +have to be 8 character filenames, even then we are fairly close to +running out of unique filenames. + +Journaling +---------- + +A journaling extension to the ext2 code has been developed by Stephen +Tweedie. It avoids the risks of metadata corruption and the need to +wait for e2fsck to complete after a crash, without requiring a change +to the on-disk ext2 layout. In a nutshell, the journal is a regular +file which stores whole metadata (and optionally data) blocks that have +been modified, prior to writing them into the filesystem. This means +it is possible to add a journal to an existing ext2 filesystem without +the need for data conversion. + +When changes to the filesystem (e.g. a file is renamed) they are stored in +a transaction in the journal and can either be complete or incomplete at +the time of a crash. If a transaction is complete at the time of a crash +(or in the normal case where the system does not crash), then any blocks +in that transaction are guaranteed to represent a valid filesystem state, +and are copied into the filesystem. If a transaction is incomplete at +the time of the crash, then there is no guarantee of consistency for +the blocks in that transaction so they are discarded (which means any +filesystem changes they represent are also lost). +Check Documentation/filesystems/ext3.txt if you want to read more about +ext3 and journaling. + +References +========== + +The kernel source file:/usr/src/linux/fs/ext2/ +e2fsprogs (e2fsck) http://e2fsprogs.sourceforge.net/ +Design & Implementation http://e2fsprogs.sourceforge.net/ext2intro.html +Journaling (ext3) ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/ +Hashed Directories http://kernelnewbies.org/~phillips/htree/ +Filesystem Resizing http://ext2resize.sourceforge.net/ +Compression (*) http://www.netspace.net.au/~reiter/e2compr/ + +Implementations for: +Windows 95/98/NT/2000 http://uranus.it.swin.edu.au/~jn/linux/Explore2fs.htm +Windows 95 (*) http://www.yipton.demon.co.uk/content.html#FSDEXT2 +DOS client (*) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/ +OS/2 http://perso.wanadoo.fr/matthieu.willm/ext2-os2/ +RISC OS client ftp://ftp.barnet.ac.uk/pub/acorn/armlinux/iscafs/ + +(*) no longer actively developed/supported (as of Apr 2001) diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt new file mode 100644 index 000000000000..9ab7f446f7ad --- /dev/null +++ b/Documentation/filesystems/ext3.txt @@ -0,0 +1,183 @@ + +Ext3 Filesystem +=============== + +ext3 was originally released in September 1999. Written by Stephen Tweedie +for 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas Dilger, +Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie. + +ext3 is ext2 filesystem enhanced with journalling capabilities. + +Options +======= + +When mounting an ext3 filesystem, the following option are accepted: +(*) == default + +jounal=update Update the ext3 file system's journal to the + current format. + +journal=inum When a journal already exists, this option is + ignored. Otherwise, it specifies the number of + the inode which will represent the ext3 file + system's journal file. + +noload Don't load the journal on mounting. + +data=journal All data are committed into the journal prior + to being written into the main file system. + +data=ordered (*) All data are forced directly out to the main file + system prior to its metadata being committed to + the journal. + +data=writeback Data ordering is not preserved, data may be + written into the main file system after its + metadata has been committed to the journal. + +commit=nrsec (*) Ext3 can be told to sync all its data and metadata + every 'nrsec' seconds. The default value is 5 seconds. + This means that if you lose your power, you will lose, + as much, the latest 5 seconds of work (your filesystem + will not be damaged though, thanks to journaling). This + default value (or any low value) will hurt performance, + but it's good for data-safety. Setting it to 0 will + have the same effect than leaving the default 5 sec. + Setting it to very large values will improve + performance. + +barrier=1 This enables/disables barriers. barrier=0 disables it, + barrier=1 enables it. + +orlov (*) This enables the new Orlov block allocator. It's enabled + by default. + +oldalloc This disables the Orlov block allocator and enables the + old block allocator. Orlov should have better performance, + we'd like to get some feedback if it's the contrary for + you. + +user_xattr (*) Enables POSIX Extended Attributes. It's enabled by + default, however you need to confifure its support + (CONFIG_EXT3_FS_XATTR). This is neccesary if you want + to use POSIX Acces Control Lists support. You can visit + http://acl.bestbits.at to know more about POSIX Extended + attributes. + +nouser_xattr Disables POSIX Extended Attributes. + +acl (*) Enables POSIX Access Control Lists support. This is + enabled by default, however you need to configure + its support (CONFIG_EXT3_FS_POSIX_ACL). If you want + to know more about ACLs visit http://acl.bestbits.at + +noacl This option disables POSIX Access Control List support. + +reservation + +noreservation + +resize= + +bsddf (*) Make 'df' act like BSD. +minixdf Make 'df' act like Minix. + +check=none Don't do extra checking of bitmaps on mount. +nocheck + +debug Extra debugging information is sent to syslog. + +errors=remount-ro(*) Remount the filesystem read-only on an error. +errors=continue Keep going on a filesystem error. +errors=panic Panic and halt the machine if an error occurs. + +grpid Give objects the same group ID as their creator. +bsdgroups + +nogrpid (*) New objects have the group ID of their creator. +sysvgroups + +resgid=n The group ID which may use the reserved blocks. + +resuid=n The user ID which may use the reserved blocks. + +sb=n Use alternate superblock at this location. + +quota Quota options are currently silently ignored. +noquota (see fs/ext3/super.c, line 594) +grpquota +usrquota + + +Specification +============= +ext3 shares all disk implementation with ext2 filesystem, and add +transactions capabilities to ext2. Journaling is done by the +Journaling block device layer. + +Journaling Block Device layer +----------------------------- +The Journaling Block Device layer (JBD) isn't ext3 specific. It was +design to add journaling capabilities on a block device. The ext3 +filesystem code will inform the JBD of modifications it is performing +(Call a transaction). the journal support the transactions start and +stop, and in case of crash, the journal can replayed the transactions +to put the partition on a consistent state fastly. + +handles represent a single atomic update to a filesystem. JBD can +handle external journal on a block device. + +Data Mode +--------- +There's 3 different data modes: + +* writeback mode +In data=writeback mode, ext3 does not journal data at all. This mode +provides a similar level of journaling as XFS, JFS, and ReiserFS in its +default mode - metadata journaling. A crash+recovery can cause +incorrect data to appear in files which were written shortly before the +crash. This mode will typically provide the best ext3 performance. + +* ordered mode +In data=ordered mode, ext3 only officially journals metadata, but it +logically groups metadata and data blocks into a single unit called a +transaction. When it's time to write the new metadata out to disk, the +associated data blocks are written first. In general, this mode +perform slightly slower than writeback but significantly faster than +journal mode. + +* journal mode +data=journal mode provides full data and metadata journaling. All new +data is written to the journal first, and then to its final location. +In the event of a crash, the journal can be replayed, bringing both +data and metadata into a consistent state. This mode is the slowest +except when data needs to be read from and written to disk at the same +time where it outperform all others mode. + +Compatibility +------------- + +Ext2 partitions can be easily convert to ext3, with `tune2fs -j <dev>`. +Ext3 is fully compatible with Ext2. Ext3 partitions can easily be +mounted as Ext2. + +External Tools +============== +see manual pages to know more. + +tune2fs: create a ext3 journal on a ext2 partition with the -j flags +mke2fs: create a ext3 partition with the -j flags +debugfs: ext2 and ext3 file system debugger + +References +========== + +kernel source: file:/usr/src/linux/fs/ext3 + file:/usr/src/linux/fs/jbd + +programs: http://e2fsprogs.sourceforge.net + +useful link: + http://www.zip.com.au/~akpm/linux/ext3/ext3-usage.html + http://www-106.ibm.com/developerworks/linux/library/l-fs7/ + http://www-106.ibm.com/developerworks/linux/library/l-fs8/ diff --git a/Documentation/filesystems/hfs.txt b/Documentation/filesystems/hfs.txt new file mode 100644 index 000000000000..bd0fa7704035 --- /dev/null +++ b/Documentation/filesystems/hfs.txt @@ -0,0 +1,83 @@ + +Macintosh HFS Filesystem for Linux +================================== + +HFS stands for ``Hierarchical File System'' and is the filesystem used +by the Mac Plus and all later Macintosh models. Earlier Macintosh +models used MFS (``Macintosh File System''), which is not supported, +MacOS 8.1 and newer support a filesystem called HFS+ that's similar to +HFS but is extended in various areas. Use the hfsplus filesystem driver +to access such filesystems from Linux. + + +Mount options +============= + +When mounting an HFS filesystem, the following options are accepted: + + creator=cccc, type=cccc + Specifies the creator/type values as shown by the MacOS finder + used for creating new files. Default values: '????'. + + uid=n, gid=n + Specifies the user/group that owns all files on the filesystems. + Default: user/group id of the mounting process. + + dir_umask=n, file_umask=n, umask=n + Specifies the umask used for all files , all directories or all + files and directories. Defaults to the umask of the mounting process. + + session=n + Select the CDROM session to mount as HFS filesystem. Defaults to + leaving that decision to the CDROM driver. This option will fail + with anything but a CDROM as underlying devices. + + part=n + Select partition number n from the devices. Does only makes + sense for CDROMS because they can't be partitioned under Linux. + For disk devices the generic partition parsing code does this + for us. Defaults to not parsing the partition table at all. + + quiet + Ignore invalid mount options instead of complaining. + + +Writing to HFS Filesystems +========================== + +HFS is not a UNIX filesystem, thus it does not have the usual features you'd +expect: + + o You can't modify the set-uid, set-gid, sticky or executable bits or the uid + and gid of files. + o You can't create hard- or symlinks, device files, sockets or FIFOs. + +HFS does on the other have the concepts of multiple forks per file. These +non-standard forks are represented as hidden additional files in the normal +filesystems namespace which is kind of a cludge and makes the semantics for +the a little strange: + + o You can't create, delete or rename resource forks of files or the + Finder's metadata. + o They are however created (with default values), deleted and renamed + along with the corresponding data fork or directory. + o Copying files to a different filesystem will loose those attributes + that are essential for MacOS to work. + + +Creating HFS filesystems +=================================== + +The hfsutils package from Robert Leslie contains a program called +hformat that can be used to create HFS filesystem. See +<http://www.mars.org/home/rob/proj/hfs/> for details. + + +Credits +======= + +The HFS drivers was written by Paul H. Hargrovea (hargrove@sccm.Stanford.EDU) +and is now maintained by Roman Zippel (roman@ardistech.com) at Ardis +Technologies. +Roman rewrote large parts of the code and brought in btree routines derived +from Brad Boyer's hfsplus driver (also maintained by Roman now). diff --git a/Documentation/filesystems/hpfs.txt b/Documentation/filesystems/hpfs.txt new file mode 100644 index 000000000000..33dc360c8e89 --- /dev/null +++ b/Documentation/filesystems/hpfs.txt @@ -0,0 +1,296 @@ +Read/Write HPFS 2.09 +1998-2004, Mikulas Patocka + +email: mikulas@artax.karlin.mff.cuni.cz +homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi + +CREDITS: +Chris Smith, 1993, original read-only HPFS, some code and hpfs structures file + is taken from it +Jacques Gelinas, MSDos mmap, Inspired by fs/nfs/mmap.c (Jon Tombs 15 Aug 1993) +Werner Almesberger, 1992, 1993, MSDos option parser & CR/LF conversion + +Mount options + +uid=xxx,gid=xxx,umask=xxx (default uid=gid=0 umask=default_system_umask) + Set owner/group/mode for files that do not have it specified in extended + attributes. Mode is inverted umask - for example umask 027 gives owner + all permission, group read permission and anybody else no access. Note + that for files mode is anded with 0666. If you want files to have 'x' + rights, you must use extended attributes. +case=lower,asis (default asis) + File name lowercasing in readdir. +conv=binary,text,auto (default binary) + CR/LF -> LF conversion, if auto, decision is made according to extension + - there is a list of text extensions (I thing it's better to not convert + text file than to damage binary file). If you want to change that list, + change it in the source. Original readonly HPFS contained some strange + heuristic algorithm that I removed. I thing it's danger to let the + computer decide whether file is text or binary. For example, DJGPP + binaries contain small text message at the beginning and they could be + misidentified and damaged under some circumstances. +check=none,normal,strict (default normal) + Check level. Selecting none will cause only little speedup and big + danger. I tried to write it so that it won't crash if check=normal on + corrupted filesystems. check=strict means many superfluous checks - + used for debugging (for example it checks if file is allocated in + bitmaps when accessing it). +errors=continue,remount-ro,panic (default remount-ro) + Behaviour when filesystem errors found. +chkdsk=no,errors,always (default errors) + When to mark filesystem dirty so that OS/2 checks it. +eas=no,ro,rw (default rw) + What to do with extended attributes. 'no' - ignore them and use always + values specified in uid/gid/mode options. 'ro' - read extended + attributes but do not create them. 'rw' - create extended attributes + when you use chmod/chown/chgrp/mknod/ln -s on the filesystem. +timeshift=(-)nnn (default 0) + Shifts the time by nnn seconds. For example, if you see under linux + one hour more, than under os/2, use timeshift=-3600. + + +File names + +As in OS/2, filenames are case insensitive. However, shell thinks that names +are case sensitive, so for example when you create a file FOO, you can use +'cat FOO', 'cat Foo', 'cat foo' or 'cat F*' but not 'cat f*'. Note, that you +also won't be able to compile linux kernel (and maybe other things) on HPFS +because kernel creates different files with names like bootsect.S and +bootsect.s. When searching for file thats name has characters >= 128, codepages +are used - see below. +OS/2 ignores dots and spaces at the end of file name, so this driver does as +well. If you create 'a. ...', the file 'a' will be created, but you can still +access it under names 'a.', 'a..', 'a . . . ' etc. + + +Extended attributes + +On HPFS partitions, OS/2 can associate to each file a special information called +extended attributes. Extended attributes are pairs of (key,value) where key is +an ascii string identifying that attribute and value is any string of bytes of +variable length. OS/2 stores window and icon positions and file types there. So +why not use it for unix-specific info like file owner or access rights? This +driver can do it. If you chown/chgrp/chmod on a hpfs partition, extended +attributes with keys "UID", "GID" or "MODE" and 2-byte values are created. Only +that extended attributes those value differs from defaults specified in mount +options are created. Once created, the extended attributes are never deleted, +they're just changed. It means that when your default uid=0 and you type +something like 'chown luser file; chown root file' the file will contain +extended attribute UID=0. And when you umount the fs and mount it again with +uid=luser_uid, the file will be still owned by root! If you chmod file to 444, +extended attribute "MODE" will not be set, this special case is done by setting +read-only flag. When you mknod a block or char device, besides "MODE", the +special 4-byte extended attribute "DEV" will be created containing the device +number. Currently this driver cannot resize extended attributes - it means +that if somebody (I don't know who?) has set "UID", "GID", "MODE" or "DEV" +attributes with different sizes, they won't be rewritten and changing these +values doesn't work. + + +Symlinks + +You can do symlinks on HPFS partition, symlinks are achieved by setting extended +attribute named "SYMLINK" with symlink value. Like on ext2, you can chown and +chgrp symlinks but I don't know what is it good for. chmoding symlink results +in chmoding file where symlink points. These symlinks are just for Linux use and +incompatible with OS/2. OS/2 PmShell symlinks are not supported because they are +stored in very crazy way. They tried to do it so that link changes when file is +moved ... sometimes it works. But the link is partly stored in directory +extended attributes and partly in OS2SYS.INI. I don't want (and don't know how) +to analyze or change OS2SYS.INI. + + +Codepages + +HPFS can contain several uppercasing tables for several codepages and each +file has a pointer to codepage it's name is in. However OS/2 was created in +America where people don't care much about codepages and so multiple codepages +support is quite buggy. I have Czech OS/2 working in codepage 852 on my disk. +Once I booted English OS/2 working in cp 850 and I created a file on my 852 +partition. It marked file name codepage as 850 - good. But when I again booted +Czech OS/2, the file was completely inaccessible under any name. It seems that +OS/2 uppercases the search pattern with its system code page (852) and file +name it's comparing to with its code page (850). These could never match. Is it +really what IBM developers wanted? But problems continued. When I created in +Czech OS/2 another file in that directory, that file was inaccessible too. OS/2 +probably uses different uppercasing method when searching where to place a file +(note, that files in HPFS directory must be sorted) and when searching for +a file. Finally when I opened this directory in PmShell, PmShell crashed (the +funny thing was that, when rebooted, PmShell tried to reopen this directory +again :-). chkdsk happily ignores these errors and only low-level disk +modification saved me. Never mix different language versions of OS/2 on one +system although HPFS was designed to allow that. +OK, I could implement complex codepage support to this driver but I think it +would cause more problems than benefit with such buggy implementation in OS/2. +So this driver simply uses first codepage it finds for uppercasing and +lowercasing no matter what's file codepage index. Usually all file names are in +this codepage - if you don't try to do what I described above :-) + + +Known bugs + +HPFS386 on OS/2 server is not supported. HPFS386 installed on normal OS/2 client +should work. If you have OS/2 server, use only read-only mode. I don't know how +to handle some HPFS386 structures like access control list or extended perm +list, I don't know how to delete them when file is deleted and how to not +overwrite them with extended attributes. Send me some info on these structures +and I'll make it. However, this driver should detect presence of HPFS386 +structures, remount read-only and not destroy them (I hope). + +When there's not enough space for extended attributes, they will be truncated +and no error is returned. + +OS/2 can't access files if the path is longer than about 256 chars but this +driver allows you to do it. chkdsk ignores such errors. + +Sometimes you won't be able to delete some files on a very full filesystem +(returning error ENOSPC). That's because file in non-leaf node in directory tree +(one directory, if it's large, has dirents in tree on HPFS) must be replaced +with another node when deleted. And that new file might have larger name than +the old one so the new name doesn't fit in directory node (dnode). And that +would result in directory tree splitting, that takes disk space. Workaround is +to delete other files that are leaf (probability that the file is non-leaf is +about 1/50) or to truncate file first to make some space. +You encounter this problem only if you have many directories so that +preallocated directory band is full i.e. + number_of_directories / size_of_filesystem_in_mb > 4. + +You can't delete open directories. + +You can't rename over directories (what is it good for?). + +Renaming files so that only case changes doesn't work. This driver supports it +but vfs doesn't. Something like 'mv file FILE' won't work. + +All atimes and directory mtimes are not updated. That's because of performance +reasons. If you extremely wish to update them, let me know, I'll write it (but +it will be slow). + +When the system is out of memory and swap, it may slightly corrupt filesystem +(lost files, unbalanced directories). (I guess all filesystem may do it). + +When compiled, you get warning: function declaration isn't a prototype. Does +anybody know what does it mean? + + +What does "unbalanced tree" message mean? + +Old versions of this driver created sometimes unbalanced dnode trees. OS/2 +chkdsk doesn't scream if the tree is unbalanced (and sometimes creates +unbalanced trees too :-) but both HPFS and HPFS386 contain bug that it rarely +crashes when the tree is not balanced. This driver handles unbalanced trees +correctly and writes warning if it finds them. If you see this message, this is +probably because of directories created with old version of this driver. +Workaround is to move all files from that directory to another and then back +again. Do it in Linux, not OS/2! If you see this message in directory that is +whole created by this driver, it is BUG - let me know about it. + + +Bugs in OS/2 + +When you have two (or more) lost directories pointing each to other, chkdsk +locks up when repairing filesystem. + +Sometimes (I think it's random) when you create a file with one-char name under +OS/2, OS/2 marks it as 'long'. chkdsk then removes this flag saying "Minor fs +error corrected". + +File names like "a .b" are marked as 'long' by OS/2 but chkdsk "corrects" it and +marks them as short (and writes "minor fs error corrected"). This bug is not in +HPFS386. + +Codepage bugs described above. + +If you don't install fixpacks, there are many, many more... + + +History + +0.90 First public release +0.91 Fixed bug that caused shooting to memory when write_inode was called on + open inode (rarely happened) +0.92 Fixed a little memory leak in freeing directory inodes +0.93 Fixed bug that locked up the machine when there were too many filenames + with first 15 characters same + Fixed write_file to zero file when writing behind file end +0.94 Fixed a little memory leak when trying to delete busy file or directory +0.95 Fixed a bug that i_hpfs_parent_dir was not updated when moving files +1.90 First version for 2.1.1xx kernels +1.91 Fixed a bug that chk_sectors failed when sectors were at the end of disk + Fixed a race-condition when write_inode is called while deleting file + Fixed a bug that could possibly happen (with very low probability) when + using 0xff in filenames + Rewritten locking to avoid race-conditions + Mount option 'eas' now works + Fsync no longer returns error + Files beginning with '.' are marked hidden + Remount support added + Alloc is not so slow when filesystem becomes full + Atimes are no more updated because it slows down operation + Code cleanup (removed all commented debug prints) +1.92 Corrected a bug when sync was called just before closing file +1.93 Modified, so that it works with kernels >= 2.1.131, I don't know if it + works with previous versions + Fixed a possible problem with disks > 64G (but I don't have one, so I can't + test it) + Fixed a file overflow at 2G + Added new option 'timeshift' + Changed behaviour on HPFS386: It is now possible to operate on HPFS386 in + read-only mode + Fixed a bug that slowed down alloc and prevented allocating 100% space + (this bug was not destructive) +1.94 Added workaround for one bug in Linux + Fixed one buffer leak + Fixed some incompatibilities with large extended attributes (but it's still + not 100% ok, I have no info on it and OS/2 doesn't want to create them) + Rewritten allocation + Fixed a bug with i_blocks (du sometimes didn't display correct values) + Directories have no longer archive attribute set (some programs don't like + it) + Fixed a bug that it set badly one flag in large anode tree (it was not + destructive) +1.95 Fixed one buffer leak, that could happen on corrupted filesystem + Fixed one bug in allocation in 1.94 +1.96 Added workaround for one bug in OS/2 (HPFS locked up, HPFS386 reported + error sometimes when opening directories in PMSHELL) + Fixed a possible bitmap race + Fixed possible problem on large disks + You can now delete open files + Fixed a nondestructive race in rename +1.97 Support for HPFS v3 (on large partitions) + Fixed a bug that it didn't allow creation of files > 128M (it should be 2G) +1.97.1 Changed names of global symbols + Fixed a bug when chmoding or chowning root directory +1.98 Fixed a deadlock when using old_readdir + Better directory handling; workaround for "unbalanced tree" bug in OS/2 +1.99 Corrected a possible problem when there's not enough space while deleting + file + Now it tries to truncate the file if there's not enough space when deleting + Removed a lot of redundant code +2.00 Fixed a bug in rename (it was there since 1.96) + Better anti-fragmentation strategy +2.01 Fixed problem with directory listing over NFS + Directory lseek now checks for proper parameters + Fixed race-condition in buffer code - it is in all filesystems in Linux; + when reading device (cat /dev/hda) while creating files on it, files + could be damaged +2.02 Woraround for bug in breada in Linux. breada could cause accesses beyond + end of partition +2.03 Char, block devices and pipes are correctly created + Fixed non-crashing race in unlink (Alexander Viro) + Now it works with Japanese version of OS/2 +2.04 Fixed error when ftruncate used to extend file +2.05 Fixed crash when got mount parameters without = + Fixed crash when allocation of anode failed due to full disk + Fixed some crashes when block io or inode allocation failed +2.06 Fixed some crash on corrupted disk structures + Better allocation strategy + Reschedule points added so that it doesn't lock CPU long time + It should work in read-only mode on Warp Server +2.07 More fixes for Warp Server. Now it really works +2.08 Creating new files is not so slow on large disks + An attempt to sync deleted file does not generate filesystem error +2.09 Fixed error on extremly fragmented files + + + vim: set textwidth=80: diff --git a/Documentation/filesystems/isofs.txt b/Documentation/filesystems/isofs.txt new file mode 100644 index 000000000000..f64a10506689 --- /dev/null +++ b/Documentation/filesystems/isofs.txt @@ -0,0 +1,38 @@ +Mount options that are the same as for msdos and vfat partitions. + + gid=nnn All files in the partition will be in group nnn. + uid=nnn All files in the partition will be owned by user id nnn. + umask=nnn The permission mask (see umask(1)) for the partition. + +Mount options that are the same as vfat partitions. These are only useful +when using discs encoded using Microsoft's Joliet extensions. + iocharset=name Character set to use for converting from Unicode to + ASCII. Joliet filenames are stored in Unicode format, but + Unix for the most part doesn't know how to deal with Unicode. + There is also an option of doing UTF8 translations with the + utf8 option. + utf8 Encode Unicode names in UTF8 format. Default is no. + +Mount options unique to the isofs filesystem. + block=512 Set the block size for the disk to 512 bytes + block=1024 Set the block size for the disk to 1024 bytes + block=2048 Set the block size for the disk to 2048 bytes + check=relaxed Matches filenames with different cases + check=strict Matches only filenames with the exact same case + cruft Try to handle badly formatted CDs. + map=off Do not map non-Rock Ridge filenames to lower case + map=normal Map non-Rock Ridge filenames to lower case + map=acorn As map=normal but also apply Acorn extensions if present + mode=xxx Sets the permissions on files to xxx + nojoliet Ignore Joliet extensions if they are present. + norock Ignore Rock Ridge extensions if they are present. + unhide Show hidden files. + session=x Select number of session on multisession CD + sbsector=xxx Session begins from sector xxx + +Recommended documents about ISO 9660 standard are located at: +http://www.y-adagio.com/public/standards/iso_cdromr/tocont.htm +ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf +Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically +identical with ISO 9660.", so it is a valid and gratis substitute of the +official ISO specification. diff --git a/Documentation/filesystems/jfs.txt b/Documentation/filesystems/jfs.txt new file mode 100644 index 000000000000..3e992daf99ad --- /dev/null +++ b/Documentation/filesystems/jfs.txt @@ -0,0 +1,35 @@ +IBM's Journaled File System (JFS) for Linux + +JFS Homepage: http://jfs.sourceforge.net/ + +The following mount options are supported: + +iocharset=name Character set to use for converting from Unicode to + ASCII. The default is to do no conversion. Use + iocharset=utf8 for UTF8 translations. This requires + CONFIG_NLS_UTF8 to be set in the kernel .config file. + iocharset=none specifies the default behavior explicitly. + +resize=value Resize the volume to <value> blocks. JFS only supports + growing a volume, not shrinking it. This option is only + valid during a remount, when the volume is mounted + read-write. The resize keyword with no value will grow + the volume to the full size of the partition. + +nointegrity Do not write to the journal. The primary use of this option + is to allow for higher performance when restoring a volume + from backup media. The integrity of the volume is not + guaranteed if the system abnormally abends. + +integrity Default. Commit metadata changes to the journal. Use this + option to remount a volume where the nointegrity option was + previously specified in order to restore normal behavior. + +errors=continue Keep going on a filesystem error. +errors=remount-ro Default. Remount the filesystem read-only on an error. +errors=panic Panic and halt the machine if an error occurs. + +Please send bugs, comments, cards and letters to shaggy@austin.ibm.com. + +The JFS mailing list can be subscribed to by using the link labeled +"Mail list Subscribe" at our web page http://jfs.sourceforge.net/ diff --git a/Documentation/filesystems/ncpfs.txt b/Documentation/filesystems/ncpfs.txt new file mode 100644 index 000000000000..f12c30c93f2f --- /dev/null +++ b/Documentation/filesystems/ncpfs.txt @@ -0,0 +1,12 @@ +The ncpfs filesystem understands the NCP protocol, designed by the +Novell Corporation for their NetWare(tm) product. NCP is functionally +similar to the NFS used in the TCP/IP community. +To mount a NetWare filesystem, you need a special mount program, which +can be found in the ncpfs package. The home site for ncpfs is +ftp.gwdg.de/pub/linux/misc/ncpfs, but sunsite and its many mirrors +will have it as well. + +Related products are linware and mars_nwe, which will give Linux partial +NetWare server functionality. Linware's home site is +klokan.sh.cvut.cz/pub/linux/linware; mars_nwe can be found on +ftp.gwdg.de/pub/linux/misc/ncpfs. diff --git a/Documentation/filesystems/ntfs.txt b/Documentation/filesystems/ntfs.txt new file mode 100644 index 000000000000..f89b440fad1d --- /dev/null +++ b/Documentation/filesystems/ntfs.txt @@ -0,0 +1,630 @@ +The Linux NTFS filesystem driver +================================ + + +Table of contents +================= + +- Overview +- Web site +- Features +- Supported mount options +- Known bugs and (mis-)features +- Using NTFS volume and stripe sets + - The Device-Mapper driver + - The Software RAID / MD driver + - Limitiations when using the MD driver +- ChangeLog + + +Overview +======== + +Linux-NTFS comes with a number of user-space programs known as ntfsprogs. +These include mkntfs, a full-featured ntfs file system format utility, +ntfsundelete used for recovering files that were unintentionally deleted +from an NTFS volume and ntfsresize which is used to resize an NTFS partition. +See the web site for more information. + +To mount an NTFS 1.2/3.x (Windows NT4/2000/XP/2003) volume, use the file +system type 'ntfs'. The driver currently supports read-only mode (with no +fault-tolerance, encryption or journalling) and very limited, but safe, write +support. + +For fault tolerance and raid support (i.e. volume and stripe sets), you can +use the kernel's Software RAID / MD driver. See section "Using Software RAID +with NTFS" for details. + + +Web site +======== + +There is plenty of additional information on the linux-ntfs web site +at http://linux-ntfs.sourceforge.net/ + +The web site has a lot of additional information, such as a comprehensive +FAQ, documentation on the NTFS on-disk format, informaiton on the Linux-NTFS +userspace utilities, etc. + + +Features +======== + +- This is a complete rewrite of the NTFS driver that used to be in the kernel. + This new driver implements NTFS read support and is functionally equivalent + to the old ntfs driver. +- The new driver has full support for sparse files on NTFS 3.x volumes which + the old driver isn't happy with. +- The new driver supports execution of binaries due to mmap() now being + supported. +- The new driver supports loopback mounting of files on NTFS which is used by + some Linux distributions to enable the user to run Linux from an NTFS + partition by creating a large file while in Windows and then loopback + mounting the file while in Linux and creating a Linux filesystem on it that + is used to install Linux on it. +- A comparison of the two drivers using: + time find . -type f -exec md5sum "{}" \; + run three times in sequence with each driver (after a reboot) on a 1.4GiB + NTFS partition, showed the new driver to be 20% faster in total time elapsed + (from 9:43 minutes on average down to 7:53). The time spent in user space + was unchanged but the time spent in the kernel was decreased by a factor of + 2.5 (from 85 CPU seconds down to 33). +- The driver does not support short file names in general. For backwards + compatibility, we implement access to files using their short file names if + they exist. The driver will not create short file names however, and a + rename will discard any existing short file name. +- The new driver supports exporting of mounted NTFS volumes via NFS. +- The new driver supports async io (aio). +- The new driver supports fsync(2), fdatasync(2), and msync(2). +- The new driver supports readv(2) and writev(2). +- The new driver supports access time updates (including mtime and ctime). + + +Supported mount options +======================= + +In addition to the generic mount options described by the manual page for the +mount command (man 8 mount, also see man 5 fstab), the NTFS driver supports the +following mount options: + +iocharset=name Deprecated option. Still supported but please use + nls=name in the future. See description for nls=name. + +nls=name Character set to use when returning file names. + Unlike VFAT, NTFS suppresses names that contain + unconvertible characters. Note that most character + sets contain insufficient characters to represent all + possible Unicode characters that can exist on NTFS. + To be sure you are not missing any files, you are + advised to use nls=utf8 which is capable of + representing all Unicode characters. + +utf8=<bool> Option no longer supported. Currently mapped to + nls=utf8 but please use nls=utf8 in the future and + make sure utf8 is compiled either as module or into + the kernel. See description for nls=name. + +uid= +gid= +umask= Provide default owner, group, and access mode mask. + These options work as documented in mount(8). By + default, the files/directories are owned by root and + he/she has read and write permissions, as well as + browse permission for directories. No one else has any + access permissions. I.e. the mode on all files is by + default rw------- and for directories rwx------, a + consequence of the default fmask=0177 and dmask=0077. + Using a umask of zero will grant all permissions to + everyone, i.e. all files and directories will have mode + rwxrwxrwx. + +fmask= +dmask= Instead of specifying umask which applies both to + files and directories, fmask applies only to files and + dmask only to directories. + +sloppy=<BOOL> If sloppy is specified, ignore unknown mount options. + Otherwise the default behaviour is to abort mount if + any unknown options are found. + +show_sys_files=<BOOL> If show_sys_files is specified, show the system files + in directory listings. Otherwise the default behaviour + is to hide the system files. + Note that even when show_sys_files is specified, "$MFT" + will not be visible due to bugs/mis-features in glibc. + Further, note that irrespective of show_sys_files, all + files are accessible by name, i.e. you can always do + "ls -l \$UpCase" for example to specifically show the + system file containing the Unicode upcase table. + +case_sensitive=<BOOL> If case_sensitive is specified, treat all file names as + case sensitive and create file names in the POSIX + namespace. Otherwise the default behaviour is to treat + file names as case insensitive and to create file names + in the WIN32/LONG name space. Note, the Linux NTFS + driver will never create short file names and will + remove them on rename/delete of the corresponding long + file name. + Note that files remain accessible via their short file + name, if it exists. If case_sensitive, you will need + to provide the correct case of the short file name. + +errors=opt What to do when critical file system errors are found. + Following values can be used for "opt": + continue: DEFAULT, try to clean-up as much as + possible, e.g. marking a corrupt inode as + bad so it is no longer accessed, and then + continue. + recover: At present only supported is recovery of + the boot sector from the backup copy. + If read-only mount, the recovery is done + in memory only and not written to disk. + Note that the options are additive, i.e. specifying: + errors=continue,errors=recover + means the driver will attempt to recover and if that + fails it will clean-up as much as possible and + continue. + +mft_zone_multiplier= Set the MFT zone multiplier for the volume (this + setting is not persistent across mounts and can be + changed from mount to mount but cannot be changed on + remount). Values of 1 to 4 are allowed, 1 being the + default. The MFT zone multiplier determines how much + space is reserved for the MFT on the volume. If all + other space is used up, then the MFT zone will be + shrunk dynamically, so this has no impact on the + amount of free space. However, it can have an impact + on performance by affecting fragmentation of the MFT. + In general use the default. If you have a lot of small + files then use a higher value. The values have the + following meaning: + Value MFT zone size (% of volume size) + 1 12.5% + 2 25% + 3 37.5% + 4 50% + Note this option is irrelevant for read-only mounts. + + +Known bugs and (mis-)features +============================= + +- The link count on each directory inode entry is set to 1, due to Linux not + supporting directory hard links. This may well confuse some user space + applications, since the directory names will have the same inode numbers. + This also speeds up ntfs_read_inode() immensely. And we haven't found any + problems with this approach so far. If you find a problem with this, please + let us know. + + +Please send bug reports/comments/feedback/abuse to the Linux-NTFS development +list at sourceforge: linux-ntfs-dev@lists.sourceforge.net + + +Using NTFS volume and stripe sets +================================= + +For support of volume and stripe sets, you can either use the kernel's +Device-Mapper driver or the kernel's Software RAID / MD driver. The former is +the recommended one to use for linear raid. But the latter is required for +raid level 5. For striping and mirroring, either driver should work fine. + + +The Device-Mapper driver +------------------------ + +You will need to create a table of the components of the volume/stripe set and +how they fit together and load this into the kernel using the dmsetup utility +(see man 8 dmsetup). + +Linear volume sets, i.e. linear raid, has been tested and works fine. Even +though untested, there is no reason why stripe sets, i.e. raid level 0, and +mirrors, i.e. raid level 1 should not work, too. Stripes with parity, i.e. +raid level 5, unfortunately cannot work yet because the current version of the +Device-Mapper driver does not support raid level 5. You may be able to use the +Software RAID / MD driver for raid level 5, see the next section for details. + +To create the table describing your volume you will need to know each of its +components and their sizes in sectors, i.e. multiples of 512-byte blocks. + +For NT4 fault tolerant volumes you can obtain the sizes using fdisk. So for +example if one of your partitions is /dev/hda2 you would do: + +$ fdisk -ul /dev/hda + +Disk /dev/hda: 81.9 GB, 81964302336 bytes +255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors +Units = sectors of 1 * 512 = 512 bytes + + Device Boot Start End Blocks Id System + /dev/hda1 * 63 4209029 2104483+ 83 Linux + /dev/hda2 4209030 37768814 16779892+ 86 NTFS + /dev/hda3 37768815 46170809 4200997+ 83 Linux + +And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 = +33559785 sectors. + +For Win2k and later dynamic disks, you can for example use the ldminfo utility +which is part of the Linux LDM tools (the latest version at the time of +writing is linux-ldm-0.0.8.tar.bz2). You can download it from: + http://linux-ntfs.sourceforge.net/downloads.html +Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go +into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You +will find the precompiled (i386) ldminfo utility there. NOTE: You will not be +able to compile this yourself easily so use the binary version! + +Then you would use ldminfo in dump mode to obtain the necessary information: + +$ ./ldminfo --dump /dev/hda + +This would dump the LDM database found on /dev/hda which describes all of your +dynamic disks and all the volumes on them. At the bottom you will see the +VOLUME DEFINITIONS section which is all you really need. You may need to look +further above to determine which of the disks in the volume definitions is +which device in Linux. Hint: Run ldminfo on each of your dynamic disks and +look at the Disk Id close to the top of the output for each (the PRIVATE HEADER +section). You can then find these Disk Ids in the VBLK DATABASE section in the +<Disk> components where you will get the LDM Name for the disk that is found in +the VOLUME DEFINITIONS section. + +Note you will also need to enable the LDM driver in the Linux kernel. If your +distribution did not enable it, you will need to recompile the kernel with it +enabled. This will create the LDM partitions on each device at boot time. You +would then use those devices (for /dev/hda they would be /dev/hda1, 2, 3, etc) +in the Device-Mapper table. + +You can also bypass using the LDM driver by using the main device (e.g. +/dev/hda) and then using the offsets of the LDM partitions into this device as +the "Start sector of device" when creating the table. Once again ldminfo would +give you the correct information to do this. + +Assuming you know all your devices and their sizes things are easy. + +For a linear raid the table would look like this (note all values are in +512-byte sectors): + +--- cut here --- +# Offset into Size of this Raid type Device Start sector +# volume device of device +0 1028161 linear /dev/hda1 0 +1028161 3903762 linear /dev/hdb2 0 +4931923 2103211 linear /dev/hdc1 0 +--- cut here --- + +For a striped volume, i.e. raid level 0, you will need to know the chunk size +you used when creating the volume. Windows uses 64kiB as the default, so it +will probably be this unless you changes the defaults when creating the array. + +For a raid level 0 the table would look like this (note all values are in +512-byte sectors): + +--- cut here --- +# Offset Size Raid Number Chunk 1st Start 2nd Start +# into of the type of size Device in Device in +# volume volume stripes device device +0 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0 +--- cut here --- + +If there are more than two devices, just add each of them to the end of the +line. + +Finally, for a mirrored volume, i.e. raid level 1, the table would look like +this (note all values are in 512-byte sectors): + +--- cut here --- +# Ofs Size Raid Log Number Region Should Number Source Start Taget Start +# in of the type type of log size sync? of Device in Device in +# vol volume params mirrors Device Device +0 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0 +--- cut here --- + +If you are mirroring to multiple devices you can specify further targets at the +end of the line. + +Note the "Should sync?" parameter "nosync" means that the two mirrors are +already in sync which will be the case on a clean shutdown of Windows. If the +mirrors are not clean, you can specify the "sync" option instead of "nosync" +and the Device-Mapper driver will then copy the entirey of the "Source Device" +to the "Target Device" or if you specified multipled target devices to all of +them. + +Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1), +and hand it over to dmsetup to work with, like so: + +$ dmsetup create myvolume1 /etc/ntfsvolume1 + +You can obviously replace "myvolume1" with whatever name you like. + +If it all worked, you will now have the device /dev/device-mapper/myvolume1 +which you can then just use as an argument to the mount command as usual to +mount the ntfs volume. For example: + +$ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1 + +(You need to create the directory /mnt/myvol1 first and of course you can use +anything you like instead of /mnt/myvol1 as long as it is an existing +directory.) + +It is advisable to do the mount read-only to see if the volume has been setup +correctly to avoid the possibility of causing damage to the data on the ntfs +volume. + + +The Software RAID / MD driver +----------------------------- + +An alternative to using the Device-Mapper driver is to use the kernel's +Software RAID / MD driver. For which you need to set up your /etc/raidtab +appropriately (see man 5 raidtab). + +Linear volume sets, i.e. linear raid, as well as stripe sets, i.e. raid level +0, have been tested and work fine (though see section "Limitiations when using +the MD driver with NTFS volumes" especially if you want to use linear raid). +Even though untested, there is no reason why mirrors, i.e. raid level 1, and +stripes with parity, i.e. raid level 5, should not work, too. + +You have to use the "persistent-superblock 0" option for each raid-disk in the +NTFS volume/stripe you are configuring in /etc/raidtab as the persistent +superblock used by the MD driver would damange the NTFS volume. + +Windows by default uses a stripe chunk size of 64k, so you probably want the +"chunk-size 64k" option for each raid-disk, too. + +For example, if you have a stripe set consisting of two partitions /dev/hda5 +and /dev/hdb1 your /etc/raidtab would look like this: + +raiddev /dev/md0 + raid-level 0 + nr-raid-disks 2 + nr-spare-disks 0 + persistent-superblock 0 + chunk-size 64k + device /dev/hda5 + raid-disk 0 + device /dev/hdb1 + raid-disl 1 + +For linear raid, just change the raid-level above to "raid-level linear", for +mirrors, change it to "raid-level 1", and for stripe sets with parity, change +it to "raid-level 5". + +Note for stripe sets with parity you will also need to tell the MD driver +which parity algorithm to use by specifying the option "parity-algorithm +which", where you need to replace "which" with the name of the algorithm to +use (see man 5 raidtab for available algorithms) and you will have to try the +different available algorithms until you find one that works. Make sure you +are working read-only when playing with this as you may damage your data +otherwise. If you find which algorithm works please let us know (email the +linux-ntfs developers list linux-ntfs-dev@lists.sourceforge.net or drop in on +IRC in channel #ntfs on the irc.freenode.net network) so we can update this +documentation. + +Once the raidtab is setup, run for example raid0run -a to start all devices or +raid0run /dev/md0 to start a particular md device, in this case /dev/md0. + +Then just use the mount command as usual to mount the ntfs volume using for +example: mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume + +It is advisable to do the mount read-only to see if the md volume has been +setup correctly to avoid the possibility of causing damage to the data on the +ntfs volume. + + +Limitiations when using the Software RAID / MD driver +----------------------------------------------------- + +Using the md driver will not work properly if any of your NTFS partitions have +an odd number of sectors. This is especially important for linear raid as all +data after the first partition with an odd number of sectors will be offset by +one or more sectors so if you mount such a partition with write support you +will cause massive damage to the data on the volume which will only become +apparent when you try to use the volume again under Windows. + +So when using linear raid, make sure that all your partitions have an even +number of sectors BEFORE attempting to use it. You have been warned! + +Even better is to simply use the Device-Mapper for linear raid and then you do +not have this problem with odd numbers of sectors. + + +ChangeLog +========= + +Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog. + +2.1.22: + - Improve handling of ntfs volumes with errors. + - Fix various bugs and race conditions. +2.1.21: + - Fix several race conditions and various other bugs. + - Many internal cleanups, code reorganization, optimizations, and mft + and index record writing code rewritten to fit in with the changes. + - Update Documentation/filesystems/ntfs.txt with instructions on how to + use the Device-Mapper driver with NTFS ftdisk/LDM raid. +2.1.20: + - Fix two stupid bugs introduced in 2.1.18 release. +2.1.19: + - Minor bugfix in handling of the default upcase table. + - Many internal cleanups and improvements. Many thanks to Linus + Torvalds and Al Viro for the help and advice with the sparse + annotations and cleanups. +2.1.18: + - Fix scheduling latencies at mount time. (Ingo Molnar) + - Fix endianness bug in a little traversed portion of the attribute + lookup code. +2.1.17: + - Fix bugs in mount time error code paths. +2.1.16: + - Implement access time updates (including mtime and ctime). + - Implement fsync(2), fdatasync(2), and msync(2) system calls. + - Enable the readv(2) and writev(2) system calls. + - Enable access via the asynchronous io (aio) API by adding support for + the aio_read(3) and aio_write(3) functions. +2.1.15: + - Invalidate quotas when (re)mounting read-write. + NOTE: This now only leave user space journalling on the side. (See + note for version 2.1.13, below.) +2.1.14: + - Fix an NFSd caused deadlock reported by several users. +2.1.13: + - Implement writing of inodes (access time updates are not implemented + yet so mounting with -o noatime,nodiratime is enforced). + - Enable writing out of resident files so you can now overwrite any + uncompressed, unencrypted, nonsparse file as long as you do not + change the file size. + - Add housekeeping of ntfs system files so that ntfsfix no longer needs + to be run after writing to an NTFS volume. + NOTE: This still leaves quota tracking and user space journalling on + the side but they should not cause data corruption. In the worst + case the charged quotas will be out of date ($Quota) and some + userspace applications might get confused due to the out of date + userspace journal ($UsnJrnl). +2.1.12: + - Fix the second fix to the decompression engine from the 2.1.9 release + and some further internals cleanups. +2.1.11: + - Driver internal cleanups. +2.1.10: + - Force read-only (re)mounting of volumes with unsupported volume + flags and various cleanups. +2.1.9: + - Fix two bugs in handling of corner cases in the decompression engine. +2.1.8: + - Read the $MFT mirror and compare it to the $MFT and if the two do not + match, force a read-only mount and do not allow read-write remounts. + - Read and parse the $LogFile journal and if it indicates that the + volume was not shutdown cleanly, force a read-only mount and do not + allow read-write remounts. If the $LogFile indicates a clean + shutdown and a read-write (re)mount is requested, empty $LogFile to + ensure that Windows cannot cause data corruption by replaying a stale + journal after Linux has written to the volume. + - Improve time handling so that the NTFS time is fully preserved when + converted to kernel time and only up to 99 nano-seconds are lost when + kernel time is converted to NTFS time. +2.1.7: + - Enable NFS exporting of mounted NTFS volumes. +2.1.6: + - Fix minor bug in handling of compressed directories that fixes the + erroneous "du" and "stat" output people reported. +2.1.5: + - Minor bug fix in attribute list attribute handling that fixes the + I/O errors on "ls" of certain fragmented files found by at least two + people running Windows XP. +2.1.4: + - Minor update allowing compilation with all gcc versions (well, the + ones the kernel can be compiled with anyway). +2.1.3: + - Major bug fixes for reading files and volumes in corner cases which + were being hit by Windows 2k/XP users. +2.1.2: + - Major bug fixes aleviating the hangs in statfs experienced by some + users. +2.1.1: + - Update handling of compressed files so people no longer get the + frequently reported warning messages about initialized_size != + data_size. +2.1.0: + - Add configuration option for developmental write support. + - Initial implementation of file overwriting. (Writes to resident files + are not written out to disk yet, so avoid writing to files smaller + than about 1kiB.) + - Intercept/abort changes in file size as they are not implemented yet. +2.0.25: + - Minor bugfixes in error code paths and small cleanups. +2.0.24: + - Small internal cleanups. + - Support for sendfile system call. (Christoph Hellwig) +2.0.23: + - Massive internal locking changes to mft record locking. Fixes + various race conditions and deadlocks. + - Fix ntfs over loopback for compressed files by adding an + optimization barrier. (gcc was screwing up otherwise ?) + Thanks go to Christoph Hellwig for pointing these two out: + - Remove now unused function fs/ntfs/malloc.h::vmalloc_nofs(). + - Fix ntfs_free() for ia64 and parisc. +2.0.22: + - Small internal cleanups. +2.0.21: + These only affect 32-bit architectures: + - Check for, and refuse to mount too large volumes (maximum is 2TiB). + - Check for, and refuse to open too large files and directories + (maximum is 16TiB). +2.0.20: + - Support non-resident directory index bitmaps. This means we now cope + with huge directories without problems. + - Fix a page leak that manifested itself in some cases when reading + directory contents. + - Internal cleanups. +2.0.19: + - Fix race condition and improvements in block i/o interface. + - Optimization when reading compressed files. +2.0.18: + - Fix race condition in reading of compressed files. +2.0.17: + - Cleanups and optimizations. +2.0.16: + - Fix stupid bug introduced in 2.0.15 in new attribute inode API. + - Big internal cleanup replacing the mftbmp access hacks by using the + new attribute inode API instead. +2.0.15: + - Bug fix in parsing of remount options. + - Internal changes implementing attribute (fake) inodes allowing all + attribute i/o to go via the page cache and to use all the normal + vfs/mm functionality. +2.0.14: + - Internal changes improving run list merging code and minor locking + change to not rely on BKL in ntfs_statfs(). +2.0.13: + - Internal changes towards using iget5_locked() in preparation for + fake inodes and small cleanups to ntfs_volume structure. +2.0.12: + - Internal cleanups in address space operations made possible by the + changes introduced in the previous release. +2.0.11: + - Internal updates and cleanups introducing the first step towards + fake inode based attribute i/o. +2.0.10: + - Microsoft says that the maximum number of inodes is 2^32 - 1. Update + the driver accordingly to only use 32-bits to store inode numbers on + 32-bit architectures. This improves the speed of the driver a little. +2.0.9: + - Change decompression engine to use a single buffer. This should not + affect performance except perhaps on the most heavy i/o on SMP + systems when accessing multiple compressed files from multiple + devices simultaneously. + - Minor updates and cleanups. +2.0.8: + - Remove now obsolete show_inodes and posix mount option(s). + - Restore show_sys_files mount option. + - Add new mount option case_sensitive, to determine if the driver + treats file names as case sensitive or not. + - Mostly drop support for short file names (for backwards compatibility + we only support accessing files via their short file name if one + exists). + - Fix dcache aliasing issues wrt short/long file names. + - Cleanups and minor fixes. +2.0.7: + - Just cleanups. +2.0.6: + - Major bugfix to make compatible with other kernel changes. This fixes + the hangs/oopses on umount. + - Locking cleanup in directory operations (remove BKL usage). +2.0.5: + - Major buffer overflow bug fix. + - Minor cleanups and updates for kernel 2.5.12. +2.0.4: + - Cleanups and updates for kernel 2.5.11. +2.0.3: + - Small bug fixes, cleanups, and performance improvements. +2.0.2: + - Use default fmask of 0177 so that files are no executable by default. + If you want owner executable files, just use fmask=0077. + - Update for kernel 2.5.9 but preserve backwards compatibility with + kernel 2.5.7. + - Minor bug fixes, cleanups, and updates. +2.0.1: + - Minor updates, primarily set the executable bit by default on files + so they can be executed. +2.0.0: + - Started ChangeLog. + diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting new file mode 100644 index 000000000000..2f388460cbe7 --- /dev/null +++ b/Documentation/filesystems/porting @@ -0,0 +1,266 @@ +Changes since 2.5.0: + +--- +[recommended] + +New helpers: sb_bread(), sb_getblk(), sb_find_get_block(), set_bh(), + sb_set_blocksize() and sb_min_blocksize(). + +Use them. + +(sb_find_get_block() replaces 2.4's get_hash_table()) + +--- +[recommended] + +New methods: ->alloc_inode() and ->destroy_inode(). + +Remove inode->u.foo_inode_i +Declare + struct foo_inode_info { + /* fs-private stuff */ + struct inode vfs_inode; + }; + static inline struct foo_inode_info *FOO_I(struct inode *inode) + { + return list_entry(inode, struct foo_inode_info, vfs_inode); + } + +Use FOO_I(inode) instead of &inode->u.foo_inode_i; + +Add foo_alloc_inode() and foo_destory_inode() - the former should allocate +foo_inode_info and return the address of ->vfs_inode, the latter should free +FOO_I(inode) (see in-tree filesystems for examples). + +Make them ->alloc_inode and ->destroy_inode in your super_operations. + +Keep in mind that now you need explicit initialization of private data - +typically in ->read_inode() and after getting an inode from new_inode(). + +At some point that will become mandatory. + +--- +[mandatory] + +Change of file_system_type method (->read_super to ->get_sb) + +->read_super() is no more. Ditto for DECLARE_FSTYPE and DECLARE_FSTYPE_DEV. + +Turn your foo_read_super() into a function that would return 0 in case of +success and negative number in case of error (-EINVAL unless you have more +informative error value to report). Call it foo_fill_super(). Now declare + +struct super_block foo_get_sb(struct file_system_type *fs_type, + int flags, const char *dev_name, void *data) +{ + return get_sb_bdev(fs_type, flags, dev_name, data, ext2_fill_super); +} + +(or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of +filesystem). + +Replace DECLARE_FSTYPE... with explicit initializer and have ->get_sb set as +foo_get_sb. + +--- +[mandatory] + +Locking change: ->s_vfs_rename_sem is taken only by cross-directory renames. +Most likely there is no need to change anything, but if you relied on +global exclusion between renames for some internal purpose - you need to +change your internal locking. Otherwise exclusion warranties remain the +same (i.e. parents and victim are locked, etc.). + +--- +[informational] + +Now we have the exclusion between ->lookup() and directory removal (by +->rmdir() and ->rename()). If you used to need that exclusion and do +it by internal locking (most of filesystems couldn't care less) - you +can relax your locking. + +--- +[mandatory] + +->lookup(), ->truncate(), ->create(), ->unlink(), ->mknod(), ->mkdir(), +->rmdir(), ->link(), ->lseek(), ->symlink(), ->rename() +and ->readdir() are called without BKL now. Grab it on entry, drop upon return +- that will guarantee the same locking you used to have. If your method or its +parts do not need BKL - better yet, now you can shift lock_kernel() and +unlock_kernel() so that they would protect exactly what needs to be +protected. + +--- +[mandatory] + +BKL is also moved from around sb operations. ->write_super() Is now called +without BKL held. BKL should have been shifted into individual fs sb_op +functions. If you don't need it, remove it. + +--- +[informational] + +check for ->link() target not being a directory is done by callers. Feel +free to drop it... + +--- +[informational] + +->link() callers hold ->i_sem on the object we are linking to. Some of your +problems might be over... + +--- +[mandatory] + +new file_system_type method - kill_sb(superblock). If you are converting +an existing filesystem, set it according to ->fs_flags: + FS_REQUIRES_DEV - kill_block_super + FS_LITTER - kill_litter_super + neither - kill_anon_super +FS_LITTER is gone - just remove it from fs_flags. + +--- +[mandatory] + + FS_SINGLE is gone (actually, that had happened back when ->get_sb() +went in - and hadn't been documented ;-/). Just remove it from fs_flags +(and see ->get_sb() entry for other actions). + +--- +[mandatory] + +->setattr() is called without BKL now. Caller _always_ holds ->i_sem, so +watch for ->i_sem-grabbing code that might be used by your ->setattr(). +Callers of notify_change() need ->i_sem now. + +--- +[recommended] + +New super_block field "struct export_operations *s_export_op" for +explicit support for exporting, e.g. via NFS. The structure is fully +documented at its declaration in include/linux/fs.h, and in +Documentation/filesystems/Exporting. + +Briefly it allows for the definition of decode_fh and encode_fh operations +to encode and decode filehandles, and allows the filesystem to use +a standard helper function for decode_fh, and provide file-system specific +support for this helper, particularly get_parent. + +It is planned that this will be required for exporting once the code +settles down a bit. + +[mandatory] + +s_export_op is now required for exporting a filesystem. +isofs, ext2, ext3, resierfs, fat +can be used as examples of very different filesystems. + +--- +[mandatory] + +iget4() and the read_inode2 callback have been superseded by iget5_locked() +which has the following prototype, + + struct inode *iget5_locked(struct super_block *sb, unsigned long ino, + int (*test)(struct inode *, void *), + int (*set)(struct inode *, void *), + void *data); + +'test' is an additional function that can be used when the inode +number is not sufficient to identify the actual file object. 'set' +should be a non-blocking function that initializes those parts of a +newly created inode to allow the test function to succeed. 'data' is +passed as an opaque value to both test and set functions. + +When the inode has been created by iget5_locked(), it will be returned with +the I_NEW flag set and will still be locked. read_inode has not been +called so the file system still has to finalize the initialization. Once +the inode is initialized it must be unlocked by calling unlock_new_inode(). + +The filesystem is responsible for setting (and possibly testing) i_ino +when appropriate. There is also a simpler iget_locked function that +just takes the superblock and inode number as arguments and does the +test and set for you. + +e.g. + inode = iget_locked(sb, ino); + if (inode->i_state & I_NEW) { + read_inode_from_disk(inode); + unlock_new_inode(inode); + } + +--- +[recommended] + +->getattr() finally getting used. See instances in nfs, minix, etc. + +--- +[mandatory] + +->revalidate() is gone. If your filesystem had it - provide ->getattr() +and let it call whatever you had as ->revlidate() + (for symlinks that +had ->revalidate()) add calls in ->follow_link()/->readlink(). + +--- +[mandatory] + +->d_parent changes are not protected by BKL anymore. Read access is safe +if at least one of the following is true: + * filesystem has no cross-directory rename() + * dcache_lock is held + * we know that parent had been locked (e.g. we are looking at +->d_parent of ->lookup() argument). + * we are called from ->rename(). + * the child's ->d_lock is held +Audit your code and add locking if needed. Notice that any place that is +not protected by the conditions above is risky even in the old tree - you +had been relying on BKL and that's prone to screwups. Old tree had quite +a few holes of that kind - unprotected access to ->d_parent leading to +anything from oops to silent memory corruption. + +--- +[mandatory] + + FS_NOMOUNT is gone. If you use it - just set MS_NOUSER in flags +(see rootfs for one kind of solution and bdev/socket/pipe for another). + +--- +[recommended] + + Use bdev_read_only(bdev) instead of is_read_only(kdev). The latter +is still alive, but only because of the mess in drivers/s390/block/dasd.c. +As soon as it gets fixed is_read_only() will die. + +--- +[mandatory] + +->permission() is called without BKL now. Grab it on entry, drop upon +return - that will guarantee the same locking you used to have. If +your method or its parts do not need BKL - better yet, now you can +shift lock_kernel() and unlock_kernel() so that they would protect +exactly what needs to be protected. + +--- +[mandatory] + +->statfs() is now called without BKL held. BKL should have been +shifted into individual fs sb_op functions where it's not clear that +it's safe to remove it. If you don't need it, remove it. + +--- +[mandatory] + + is_read_only() is gone; use bdev_read_only() instead. + +--- +[mandatory] + + destroy_buffers() is gone; use invalidate_bdev(). + +--- +[mandatory] + + fsync_dev() is gone; use fsync_bdev(). NOTE: lvm breakage is +deliberate; as soon as struct block_device * is propagated in a reasonable +way by that code fixing will become trivial; until then nothing can be +done. diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt new file mode 100644 index 000000000000..cbe85c17176b --- /dev/null +++ b/Documentation/filesystems/proc.txt @@ -0,0 +1,1940 @@ +------------------------------------------------------------------------------ + T H E /proc F I L E S Y S T E M +------------------------------------------------------------------------------ +/proc/sys Terrehon Bowden <terrehon@pacbell.net> October 7 1999 + Bodo Bauer <bb@ricochet.net> + +2.4.x update Jorge Nerin <comandante@zaralinux.com> November 14 2000 +------------------------------------------------------------------------------ +Version 1.3 Kernel version 2.2.12 + Kernel version 2.4.0-test11-pre4 +------------------------------------------------------------------------------ + +Table of Contents +----------------- + + 0 Preface + 0.1 Introduction/Credits + 0.2 Legal Stuff + + 1 Collecting System Information + 1.1 Process-Specific Subdirectories + 1.2 Kernel data + 1.3 IDE devices in /proc/ide + 1.4 Networking info in /proc/net + 1.5 SCSI info + 1.6 Parallel port info in /proc/parport + 1.7 TTY info in /proc/tty + 1.8 Miscellaneous kernel statistics in /proc/stat + + 2 Modifying System Parameters + 2.1 /proc/sys/fs - File system data + 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats + 2.3 /proc/sys/kernel - general kernel parameters + 2.4 /proc/sys/vm - The virtual memory subsystem + 2.5 /proc/sys/dev - Device specific parameters + 2.6 /proc/sys/sunrpc - Remote procedure calls + 2.7 /proc/sys/net - Networking stuff + 2.8 /proc/sys/net/ipv4 - IPV4 settings + 2.9 Appletalk + 2.10 IPX + 2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem + +------------------------------------------------------------------------------ +Preface +------------------------------------------------------------------------------ + +0.1 Introduction/Credits +------------------------ + +This documentation is part of a soon (or so we hope) to be released book on +the SuSE Linux distribution. As there is no complete documentation for the +/proc file system and we've used many freely available sources to write these +chapters, it seems only fair to give the work back to the Linux community. +This work is based on the 2.2.* kernel version and the upcoming 2.4.*. I'm +afraid it's still far from complete, but we hope it will be useful. As far as +we know, it is the first 'all-in-one' document about the /proc file system. It +is focused on the Intel x86 hardware, so if you are looking for PPC, ARM, +SPARC, AXP, etc., features, you probably won't find what you are looking for. +It also only covers IPv4 networking, not IPv6 nor other protocols - sorry. But +additions and patches are welcome and will be added to this document if you +mail them to Bodo. + +We'd like to thank Alan Cox, Rik van Riel, and Alexey Kuznetsov and a lot of +other people for help compiling this documentation. We'd also like to extend a +special thank you to Andi Kleen for documentation, which we relied on heavily +to create this document, as well as the additional information he provided. +Thanks to everybody else who contributed source or docs to the Linux kernel +and helped create a great piece of software... :) + +If you have any comments, corrections or additions, please don't hesitate to +contact Bodo Bauer at bb@ricochet.net. We'll be happy to add them to this +document. + +The latest version of this document is available online at +http://skaro.nightcrawler.com/~bb/Docs/Proc as HTML version. + +If the above direction does not works for you, ypu could try the kernel +mailing list at linux-kernel@vger.kernel.org and/or try to reach me at +comandante@zaralinux.com. + +0.2 Legal Stuff +--------------- + +We don't guarantee the correctness of this document, and if you come to us +complaining about how you screwed up your system because of incorrect +documentation, we won't feel responsible... + +------------------------------------------------------------------------------ +CHAPTER 1: COLLECTING SYSTEM INFORMATION +------------------------------------------------------------------------------ + +------------------------------------------------------------------------------ +In This Chapter +------------------------------------------------------------------------------ +* Investigating the properties of the pseudo file system /proc and its + ability to provide information on the running Linux system +* Examining /proc's structure +* Uncovering various information about the kernel and the processes running + on the system +------------------------------------------------------------------------------ + + +The proc file system acts as an interface to internal data structures in the +kernel. It can be used to obtain information about the system and to change +certain kernel parameters at runtime (sysctl). + +First, we'll take a look at the read-only parts of /proc. In Chapter 2, we +show you how you can use /proc/sys to change settings. + +1.1 Process-Specific Subdirectories +----------------------------------- + +The directory /proc contains (among other things) one subdirectory for each +process running on the system, which is named after the process ID (PID). + +The link self points to the process reading the file system. Each process +subdirectory has the entries listed in Table 1-1. + + +Table 1-1: Process specific entries in /proc +.............................................................................. + File Content + cmdline Command line arguments + cpu Current and last cpu in wich it was executed (2.4)(smp) + cwd Link to the current working directory + environ Values of environment variables + exe Link to the executable of this process + fd Directory, which contains all file descriptors + maps Memory maps to executables and library files (2.4) + mem Memory held by this process + root Link to the root directory of this process + stat Process status + statm Process memory status information + status Process status in human readable form + wchan If CONFIG_KALLSYMS is set, a pre-decoded wchan +.............................................................................. + +For example, to get the status information of a process, all you have to do is +read the file /proc/PID/status: + + >cat /proc/self/status + Name: cat + State: R (running) + Pid: 5452 + PPid: 743 + TracerPid: 0 (2.4) + Uid: 501 501 501 501 + Gid: 100 100 100 100 + Groups: 100 14 16 + VmSize: 1112 kB + VmLck: 0 kB + VmRSS: 348 kB + VmData: 24 kB + VmStk: 12 kB + VmExe: 8 kB + VmLib: 1044 kB + SigPnd: 0000000000000000 + SigBlk: 0000000000000000 + SigIgn: 0000000000000000 + SigCgt: 0000000000000000 + CapInh: 00000000fffffeff + CapPrm: 0000000000000000 + CapEff: 0000000000000000 + + +This shows you nearly the same information you would get if you viewed it with +the ps command. In fact, ps uses the proc file system to obtain its +information. The statm file contains more detailed information about the +process memory usage. Its seven fields are explained in Table 1-2. + + +Table 1-2: Contents of the statm files (as of 2.6.8-rc3) +.............................................................................. + Field Content + size total program size (pages) (same as VmSize in status) + resident size of memory portions (pages) (same as VmRSS in status) + shared number of pages that are shared (i.e. backed by a file) + trs number of pages that are 'code' (not including libs; broken, + includes data segment) + lrs number of pages of library (always 0 on 2.6) + drs number of pages of data/stack (including libs; broken, + includes library text) + dt number of dirty pages (always 0 on 2.6) +.............................................................................. + +1.2 Kernel data +--------------- + +Similar to the process entries, the kernel data files give information about +the running kernel. The files used to obtain this information are contained in +/proc and are listed in Table 1-3. Not all of these will be present in your +system. It depends on the kernel configuration and the loaded modules, which +files are there, and which are missing. + +Table 1-3: Kernel info in /proc +.............................................................................. + File Content + apm Advanced power management info + buddyinfo Kernel memory allocator information (see text) (2.5) + bus Directory containing bus specific information + cmdline Kernel command line + cpuinfo Info about the CPU + devices Available devices (block and character) + dma Used DMS channels + filesystems Supported filesystems + driver Various drivers grouped here, currently rtc (2.4) + execdomains Execdomains, related to security (2.4) + fb Frame Buffer devices (2.4) + fs File system parameters, currently nfs/exports (2.4) + ide Directory containing info about the IDE subsystem + interrupts Interrupt usage + iomem Memory map (2.4) + ioports I/O port usage + irq Masks for irq to cpu affinity (2.4)(smp?) + isapnp ISA PnP (Plug&Play) Info (2.4) + kcore Kernel core image (can be ELF or A.OUT(deprecated in 2.4)) + kmsg Kernel messages + ksyms Kernel symbol table + loadavg Load average of last 1, 5 & 15 minutes + locks Kernel locks + meminfo Memory info + misc Miscellaneous + modules List of loaded modules + mounts Mounted filesystems + net Networking info (see text) + partitions Table of partitions known to the system + pci Depreciated info of PCI bus (new way -> /proc/bus/pci/, + decoupled by lspci (2.4) + rtc Real time clock + scsi SCSI info (see text) + slabinfo Slab pool info + stat Overall statistics + swaps Swap space utilization + sys See chapter 2 + sysvipc Info of SysVIPC Resources (msg, sem, shm) (2.4) + tty Info of tty drivers + uptime System uptime + version Kernel version + video bttv info of video resources (2.4) +.............................................................................. + +You can, for example, check which interrupts are currently in use and what +they are used for by looking in the file /proc/interrupts: + + > cat /proc/interrupts + CPU0 + 0: 8728810 XT-PIC timer + 1: 895 XT-PIC keyboard + 2: 0 XT-PIC cascade + 3: 531695 XT-PIC aha152x + 4: 2014133 XT-PIC serial + 5: 44401 XT-PIC pcnet_cs + 8: 2 XT-PIC rtc + 11: 8 XT-PIC i82365 + 12: 182918 XT-PIC PS/2 Mouse + 13: 1 XT-PIC fpu + 14: 1232265 XT-PIC ide0 + 15: 7 XT-PIC ide1 + NMI: 0 + +In 2.4.* a couple of lines where added to this file LOC & ERR (this time is the +output of a SMP machine): + + > cat /proc/interrupts + + CPU0 CPU1 + 0: 1243498 1214548 IO-APIC-edge timer + 1: 8949 8958 IO-APIC-edge keyboard + 2: 0 0 XT-PIC cascade + 5: 11286 10161 IO-APIC-edge soundblaster + 8: 1 0 IO-APIC-edge rtc + 9: 27422 27407 IO-APIC-edge 3c503 + 12: 113645 113873 IO-APIC-edge PS/2 Mouse + 13: 0 0 XT-PIC fpu + 14: 22491 24012 IO-APIC-edge ide0 + 15: 2183 2415 IO-APIC-edge ide1 + 17: 30564 30414 IO-APIC-level eth0 + 18: 177 164 IO-APIC-level bttv + NMI: 2457961 2457959 + LOC: 2457882 2457881 + ERR: 2155 + +NMI is incremented in this case because every timer interrupt generates a NMI +(Non Maskable Interrupt) which is used by the NMI Watchdog to detect lockups. + +LOC is the local interrupt counter of the internal APIC of every CPU. + +ERR is incremented in the case of errors in the IO-APIC bus (the bus that +connects the CPUs in a SMP system. This means that an error has been detected, +the IO-APIC automatically retry the transmission, so it should not be a big +problem, but you should read the SMP-FAQ. + +In this context it could be interesting to note the new irq directory in 2.4. +It could be used to set IRQ to CPU affinity, this means that you can "hook" an +IRQ to only one CPU, or to exclude a CPU of handling IRQs. The contents of the +irq subdir is one subdir for each IRQ, and one file; prof_cpu_mask + +For example + > ls /proc/irq/ + 0 10 12 14 16 18 2 4 6 8 prof_cpu_mask + 1 11 13 15 17 19 3 5 7 9 + > ls /proc/irq/0/ + smp_affinity + +The contents of the prof_cpu_mask file and each smp_affinity file for each IRQ +is the same by default: + + > cat /proc/irq/0/smp_affinity + ffffffff + +It's a bitmask, in wich you can specify wich CPUs can handle the IRQ, you can +set it by doing: + + > echo 1 > /proc/irq/prof_cpu_mask + +This means that only the first CPU will handle the IRQ, but you can also echo 5 +wich means that only the first and fourth CPU can handle the IRQ. + +The way IRQs are routed is handled by the IO-APIC, and it's Round Robin +between all the CPUs which are allowed to handle it. As usual the kernel has +more info than you and does a better job than you, so the defaults are the +best choice for almost everyone. + +There are three more important subdirectories in /proc: net, scsi, and sys. +The general rule is that the contents, or even the existence of these +directories, depend on your kernel configuration. If SCSI is not enabled, the +directory scsi may not exist. The same is true with the net, which is there +only when networking support is present in the running kernel. + +The slabinfo file gives information about memory usage at the slab level. +Linux uses slab pools for memory management above page level in version 2.2. +Commonly used objects have their own slab pool (such as network buffers, +directory cache, and so on). + +.............................................................................. + +> cat /proc/buddyinfo + +Node 0, zone DMA 0 4 5 4 4 3 ... +Node 0, zone Normal 1 0 0 1 101 8 ... +Node 0, zone HighMem 2 0 0 1 1 0 ... + +Memory fragmentation is a problem under some workloads, and buddyinfo is a +useful tool for helping diagnose these problems. Buddyinfo will give you a +clue as to how big an area you can safely allocate, or why a previous +allocation failed. + +Each column represents the number of pages of a certain order which are +available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in +ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE +available in ZONE_NORMAL, etc... + +.............................................................................. + +meminfo: + +Provides information about distribution and utilization of memory. This +varies by architecture and compile options. The following is from a +16GB PIII, which has highmem enabled. You may not have all of these fields. + +> cat /proc/meminfo + + +MemTotal: 16344972 kB +MemFree: 13634064 kB +Buffers: 3656 kB +Cached: 1195708 kB +SwapCached: 0 kB +Active: 891636 kB +Inactive: 1077224 kB +HighTotal: 15597528 kB +HighFree: 13629632 kB +LowTotal: 747444 kB +LowFree: 4432 kB +SwapTotal: 0 kB +SwapFree: 0 kB +Dirty: 968 kB +Writeback: 0 kB +Mapped: 280372 kB +Slab: 684068 kB +CommitLimit: 7669796 kB +Committed_AS: 100056 kB +PageTables: 24448 kB +VmallocTotal: 112216 kB +VmallocUsed: 428 kB +VmallocChunk: 111088 kB + + MemTotal: Total usable ram (i.e. physical ram minus a few reserved + bits and the kernel binary code) + MemFree: The sum of LowFree+HighFree + Buffers: Relatively temporary storage for raw disk blocks + shouldn't get tremendously large (20MB or so) + Cached: in-memory cache for files read from the disk (the + pagecache). Doesn't include SwapCached + SwapCached: Memory that once was swapped out, is swapped back in but + still also is in the swapfile (if memory is needed it + doesn't need to be swapped out AGAIN because it is already + in the swapfile. This saves I/O) + Active: Memory that has been used more recently and usually not + reclaimed unless absolutely necessary. + Inactive: Memory which has been less recently used. It is more + eligible to be reclaimed for other purposes + HighTotal: + HighFree: Highmem is all memory above ~860MB of physical memory + Highmem areas are for use by userspace programs, or + for the pagecache. The kernel must use tricks to access + this memory, making it slower to access than lowmem. + LowTotal: + LowFree: Lowmem is memory which can be used for everything that + highmem can be used for, but it is also availble for the + kernel's use for its own data structures. Among many + other things, it is where everything from the Slab is + allocated. Bad things happen when you're out of lowmem. + SwapTotal: total amount of swap space available + SwapFree: Memory which has been evicted from RAM, and is temporarily + on the disk + Dirty: Memory which is waiting to get written back to the disk + Writeback: Memory which is actively being written back to the disk + Mapped: files which have been mmaped, such as libraries + Slab: in-kernel data structures cache + CommitLimit: Based on the overcommit ratio ('vm.overcommit_ratio'), + this is the total amount of memory currently available to + be allocated on the system. This limit is only adhered to + if strict overcommit accounting is enabled (mode 2 in + 'vm.overcommit_memory'). + The CommitLimit is calculated with the following formula: + CommitLimit = ('vm.overcommit_ratio' * Physical RAM) + Swap + For example, on a system with 1G of physical RAM and 7G + of swap with a `vm.overcommit_ratio` of 30 it would + yield a CommitLimit of 7.3G. + For more details, see the memory overcommit documentation + in vm/overcommit-accounting. +Committed_AS: The amount of memory presently allocated on the system. + The committed memory is a sum of all of the memory which + has been allocated by processes, even if it has not been + "used" by them as of yet. A process which malloc()'s 1G + of memory, but only touches 300M of it will only show up + as using 300M of memory even if it has the address space + allocated for the entire 1G. This 1G is memory which has + been "committed" to by the VM and can be used at any time + by the allocating application. With strict overcommit + enabled on the system (mode 2 in 'vm.overcommit_memory'), + allocations which would exceed the CommitLimit (detailed + above) will not be permitted. This is useful if one needs + to guarantee that processes will not fail due to lack of + memory once that memory has been successfully allocated. + PageTables: amount of memory dedicated to the lowest level of page + tables. +VmallocTotal: total size of vmalloc memory area + VmallocUsed: amount of vmalloc area which is used +VmallocChunk: largest contigious block of vmalloc area which is free + + +1.3 IDE devices in /proc/ide +---------------------------- + +The subdirectory /proc/ide contains information about all IDE devices of which +the kernel is aware. There is one subdirectory for each IDE controller, the +file drivers and a link for each IDE device, pointing to the device directory +in the controller specific subtree. + +The file drivers contains general information about the drivers used for the +IDE devices: + + > cat /proc/ide/drivers + ide-cdrom version 4.53 + ide-disk version 1.08 + +More detailed information can be found in the controller specific +subdirectories. These are named ide0, ide1 and so on. Each of these +directories contains the files shown in table 1-4. + + +Table 1-4: IDE controller info in /proc/ide/ide? +.............................................................................. + File Content + channel IDE channel (0 or 1) + config Configuration (only for PCI/IDE bridge) + mate Mate name + model Type/Chipset of IDE controller +.............................................................................. + +Each device connected to a controller has a separate subdirectory in the +controllers directory. The files listed in table 1-5 are contained in these +directories. + + +Table 1-5: IDE device information +.............................................................................. + File Content + cache The cache + capacity Capacity of the medium (in 512Byte blocks) + driver driver and version + geometry physical and logical geometry + identify device identify block + media media type + model device identifier + settings device setup + smart_thresholds IDE disk management thresholds + smart_values IDE disk management values +.............................................................................. + +The most interesting file is settings. This file contains a nice overview of +the drive parameters: + + # cat /proc/ide/ide0/hda/settings + name value min max mode + ---- ----- --- --- ---- + bios_cyl 526 0 65535 rw + bios_head 255 0 255 rw + bios_sect 63 0 63 rw + breada_readahead 4 0 127 rw + bswap 0 0 1 r + file_readahead 72 0 2097151 rw + io_32bit 0 0 3 rw + keepsettings 0 0 1 rw + max_kb_per_request 122 1 127 rw + multcount 0 0 8 rw + nice1 1 0 1 rw + nowerr 0 0 1 rw + pio_mode write-only 0 255 w + slow 0 0 1 rw + unmaskirq 0 0 1 rw + using_dma 0 0 1 rw + + +1.4 Networking info in /proc/net +-------------------------------- + +The subdirectory /proc/net follows the usual pattern. Table 1-6 shows the +additional values you get for IP version 6 if you configure the kernel to +support this. Table 1-7 lists the files and their meaning. + + +Table 1-6: IPv6 info in /proc/net +.............................................................................. + File Content + udp6 UDP sockets (IPv6) + tcp6 TCP sockets (IPv6) + raw6 Raw device statistics (IPv6) + igmp6 IP multicast addresses, which this host joined (IPv6) + if_inet6 List of IPv6 interface addresses + ipv6_route Kernel routing table for IPv6 + rt6_stats Global IPv6 routing tables statistics + sockstat6 Socket statistics (IPv6) + snmp6 Snmp data (IPv6) +.............................................................................. + + +Table 1-7: Network info in /proc/net +.............................................................................. + File Content + arp Kernel ARP table + dev network devices with statistics + dev_mcast the Layer2 multicast groups a device is listening too + (interface index, label, number of references, number of bound + addresses). + dev_stat network device status + ip_fwchains Firewall chain linkage + ip_fwnames Firewall chain names + ip_masq Directory containing the masquerading tables + ip_masquerade Major masquerading table + netstat Network statistics + raw raw device statistics + route Kernel routing table + rpc Directory containing rpc info + rt_cache Routing cache + snmp SNMP data + sockstat Socket statistics + tcp TCP sockets + tr_rif Token ring RIF routing table + udp UDP sockets + unix UNIX domain sockets + wireless Wireless interface data (Wavelan etc) + igmp IP multicast addresses, which this host joined + psched Global packet scheduler parameters. + netlink List of PF_NETLINK sockets + ip_mr_vifs List of multicast virtual interfaces + ip_mr_cache List of multicast routing cache +.............................................................................. + +You can use this information to see which network devices are available in +your system and how much traffic was routed over those devices: + + > cat /proc/net/dev + Inter-|Receive |[... + face |bytes packets errs drop fifo frame compressed multicast|[... + lo: 908188 5596 0 0 0 0 0 0 [... + ppp0:15475140 20721 410 0 0 410 0 0 [... + eth0: 614530 7085 0 0 0 0 0 1 [... + + ...] Transmit + ...] bytes packets errs drop fifo colls carrier compressed + ...] 908188 5596 0 0 0 0 0 0 + ...] 1375103 17405 0 0 0 0 0 0 + ...] 1703981 5535 0 0 0 3 0 0 + +In addition, each Channel Bond interface has it's own directory. For +example, the bond0 device will have a directory called /proc/net/bond0/. +It will contain information that is specific to that bond, such as the +current slaves of the bond, the link status of the slaves, and how +many times the slaves link has failed. + +1.5 SCSI info +------------- + +If you have a SCSI host adapter in your system, you'll find a subdirectory +named after the driver for this adapter in /proc/scsi. You'll also see a list +of all recognized SCSI devices in /proc/scsi: + + >cat /proc/scsi/scsi + Attached devices: + Host: scsi0 Channel: 00 Id: 00 Lun: 00 + Vendor: IBM Model: DGHS09U Rev: 03E0 + Type: Direct-Access ANSI SCSI revision: 03 + Host: scsi0 Channel: 00 Id: 06 Lun: 00 + Vendor: PIONEER Model: CD-ROM DR-U06S Rev: 1.04 + Type: CD-ROM ANSI SCSI revision: 02 + + +The directory named after the driver has one file for each adapter found in +the system. These files contain information about the controller, including +the used IRQ and the IO address range. The amount of information shown is +dependent on the adapter you use. The example shows the output for an Adaptec +AHA-2940 SCSI adapter: + + > cat /proc/scsi/aic7xxx/0 + + Adaptec AIC7xxx driver version: 5.1.19/3.2.4 + Compile Options: + TCQ Enabled By Default : Disabled + AIC7XXX_PROC_STATS : Disabled + AIC7XXX_RESET_DELAY : 5 + Adapter Configuration: + SCSI Adapter: Adaptec AHA-294X Ultra SCSI host adapter + Ultra Wide Controller + PCI MMAPed I/O Base: 0xeb001000 + Adapter SEEPROM Config: SEEPROM found and used. + Adaptec SCSI BIOS: Enabled + IRQ: 10 + SCBs: Active 0, Max Active 2, + Allocated 15, HW 16, Page 255 + Interrupts: 160328 + BIOS Control Word: 0x18b6 + Adapter Control Word: 0x005b + Extended Translation: Enabled + Disconnect Enable Flags: 0xffff + Ultra Enable Flags: 0x0001 + Tag Queue Enable Flags: 0x0000 + Ordered Queue Tag Flags: 0x0000 + Default Tag Queue Depth: 8 + Tagged Queue By Device array for aic7xxx host instance 0: + {255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255} + Actual queue depth per device for aic7xxx host instance 0: + {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1} + Statistics: + (scsi0:0:0:0) + Device using Wide/Sync transfers at 40.0 MByte/sec, offset 8 + Transinfo settings: current(12/8/1/0), goal(12/8/1/0), user(12/15/1/0) + Total transfers 160151 (74577 reads and 85574 writes) + (scsi0:0:6:0) + Device using Narrow/Sync transfers at 5.0 MByte/sec, offset 15 + Transinfo settings: current(50/15/0/0), goal(50/15/0/0), user(50/15/0/0) + Total transfers 0 (0 reads and 0 writes) + + +1.6 Parallel port info in /proc/parport +--------------------------------------- + +The directory /proc/parport contains information about the parallel ports of +your system. It has one subdirectory for each port, named after the port +number (0,1,2,...). + +These directories contain the four files shown in Table 1-8. + + +Table 1-8: Files in /proc/parport +.............................................................................. + File Content + autoprobe Any IEEE-1284 device ID information that has been acquired. + devices list of the device drivers using that port. A + will appear by the + name of the device currently using the port (it might not appear + against any). + hardware Parallel port's base address, IRQ line and DMA channel. + irq IRQ that parport is using for that port. This is in a separate + file to allow you to alter it by writing a new value in (IRQ + number or none). +.............................................................................. + +1.7 TTY info in /proc/tty +------------------------- + +Information about the available and actually used tty's can be found in the +directory /proc/tty.You'll find entries for drivers and line disciplines in +this directory, as shown in Table 1-9. + + +Table 1-9: Files in /proc/tty +.............................................................................. + File Content + drivers list of drivers and their usage + ldiscs registered line disciplines + driver/serial usage statistic and status of single tty lines +.............................................................................. + +To see which tty's are currently in use, you can simply look into the file +/proc/tty/drivers: + + > cat /proc/tty/drivers + pty_slave /dev/pts 136 0-255 pty:slave + pty_master /dev/ptm 128 0-255 pty:master + pty_slave /dev/ttyp 3 0-255 pty:slave + pty_master /dev/pty 2 0-255 pty:master + serial /dev/cua 5 64-67 serial:callout + serial /dev/ttyS 4 64-67 serial + /dev/tty0 /dev/tty0 4 0 system:vtmaster + /dev/ptmx /dev/ptmx 5 2 system + /dev/console /dev/console 5 1 system:console + /dev/tty /dev/tty 5 0 system:/dev/tty + unknown /dev/tty 4 1-63 console + + +1.8 Miscellaneous kernel statistics in /proc/stat +------------------------------------------------- + +Various pieces of information about kernel activity are available in the +/proc/stat file. All of the numbers reported in this file are aggregates +since the system first booted. For a quick look, simply cat the file: + + > cat /proc/stat + cpu 2255 34 2290 22625563 6290 127 456 + cpu0 1132 34 1441 11311718 3675 127 438 + cpu1 1123 0 849 11313845 2614 0 18 + intr 114930548 113199788 3 0 5 263 0 4 [... lots more numbers ...] + ctxt 1990473 + btime 1062191376 + processes 2915 + procs_running 1 + procs_blocked 0 + +The very first "cpu" line aggregates the numbers in all of the other "cpuN" +lines. These numbers identify the amount of time the CPU has spent performing +different kinds of work. Time units are in USER_HZ (typically hundredths of a +second). The meanings of the columns are as follows, from left to right: + +- user: normal processes executing in user mode +- nice: niced processes executing in user mode +- system: processes executing in kernel mode +- idle: twiddling thumbs +- iowait: waiting for I/O to complete +- irq: servicing interrupts +- softirq: servicing softirqs + +The "intr" line gives counts of interrupts serviced since boot time, for each +of the possible system interrupts. The first column is the total of all +interrupts serviced; each subsequent column is the total for that particular +interrupt. + +The "ctxt" line gives the total number of context switches across all CPUs. + +The "btime" line gives the time at which the system booted, in seconds since +the Unix epoch. + +The "processes" line gives the number of processes and threads created, which +includes (but is not limited to) those created by calls to the fork() and +clone() system calls. + +The "procs_running" line gives the number of processes currently running on +CPUs. + +The "procs_blocked" line gives the number of processes currently blocked, +waiting for I/O to complete. + + +------------------------------------------------------------------------------ +Summary +------------------------------------------------------------------------------ +The /proc file system serves information about the running system. It not only +allows access to process data but also allows you to request the kernel status +by reading files in the hierarchy. + +The directory structure of /proc reflects the types of information and makes +it easy, if not obvious, where to look for specific data. +------------------------------------------------------------------------------ + +------------------------------------------------------------------------------ +CHAPTER 2: MODIFYING SYSTEM PARAMETERS +------------------------------------------------------------------------------ + +------------------------------------------------------------------------------ +In This Chapter +------------------------------------------------------------------------------ +* Modifying kernel parameters by writing into files found in /proc/sys +* Exploring the files which modify certain parameters +* Review of the /proc/sys file tree +------------------------------------------------------------------------------ + + +A very interesting part of /proc is the directory /proc/sys. This is not only +a source of information, it also allows you to change parameters within the +kernel. Be very careful when attempting this. You can optimize your system, +but you can also cause it to crash. Never alter kernel parameters on a +production system. Set up a development machine and test to make sure that +everything works the way you want it to. You may have no alternative but to +reboot the machine once an error has been made. + +To change a value, simply echo the new value into the file. An example is +given below in the section on the file system data. You need to be root to do +this. You can create your own boot script to perform this every time your +system boots. + +The files in /proc/sys can be used to fine tune and monitor miscellaneous and +general things in the operation of the Linux kernel. Since some of the files +can inadvertently disrupt your system, it is advisable to read both +documentation and source before actually making adjustments. In any case, be +very careful when writing to any of these files. The entries in /proc may +change slightly between the 2.1.* and the 2.2 kernel, so if there is any doubt +review the kernel documentation in the directory /usr/src/linux/Documentation. +This chapter is heavily based on the documentation included in the pre 2.2 +kernels, and became part of it in version 2.2.1 of the Linux kernel. + +2.1 /proc/sys/fs - File system data +----------------------------------- + +This subdirectory contains specific file system, file handle, inode, dentry +and quota information. + +Currently, these files are in /proc/sys/fs: + +dentry-state +------------ + +Status of the directory cache. Since directory entries are dynamically +allocated and deallocated, this file indicates the current status. It holds +six values, in which the last two are not used and are always zero. The others +are listed in table 2-1. + + +Table 2-1: Status files of the directory cache +.............................................................................. + File Content + nr_dentry Almost always zero + nr_unused Number of unused cache entries + age_limit + in seconds after the entry may be reclaimed, when memory is short + want_pages internally +.............................................................................. + +dquot-nr and dquot-max +---------------------- + +The file dquot-max shows the maximum number of cached disk quota entries. + +The file dquot-nr shows the number of allocated disk quota entries and the +number of free disk quota entries. + +If the number of available cached disk quotas is very low and you have a large +number of simultaneous system users, you might want to raise the limit. + +file-nr and file-max +-------------------- + +The kernel allocates file handles dynamically, but doesn't free them again at +this time. + +The value in file-max denotes the maximum number of file handles that the +Linux kernel will allocate. When you get a lot of error messages about running +out of file handles, you might want to raise this limit. The default value is +10% of RAM in kilobytes. To change it, just write the new number into the +file: + + # cat /proc/sys/fs/file-max + 4096 + # echo 8192 > /proc/sys/fs/file-max + # cat /proc/sys/fs/file-max + 8192 + + +This method of revision is useful for all customizable parameters of the +kernel - simply echo the new value to the corresponding file. + +Historically, the three values in file-nr denoted the number of allocated file +handles, the number of allocated but unused file handles, and the maximum +number of file handles. Linux 2.6 always reports 0 as the number of free file +handles -- this is not an error, it just means that the number of allocated +file handles exactly matches the number of used file handles. + +Attempts to allocate more file descriptors than file-max are reported with +printk, look for "VFS: file-max limit <number> reached". + +inode-state and inode-nr +------------------------ + +The file inode-nr contains the first two items from inode-state, so we'll skip +to that file... + +inode-state contains two actual numbers and five dummy values. The numbers +are nr_inodes and nr_free_inodes (in order of appearance). + +nr_inodes +~~~~~~~~~ + +Denotes the number of inodes the system has allocated. This number will +grow and shrink dynamically. + +nr_free_inodes +-------------- + +Represents the number of free inodes. Ie. The number of inuse inodes is +(nr_inodes - nr_free_inodes). + +super-nr and super-max +---------------------- + +Again, super block structures are allocated by the kernel, but not freed. The +file super-max contains the maximum number of super block handlers, where +super-nr shows the number of currently allocated ones. + +Every mounted file system needs a super block, so if you plan to mount lots of +file systems, you may want to increase these numbers. + +aio-nr and aio-max-nr +--------------------- + +aio-nr is the running total of the number of events specified on the +io_setup system call for all currently active aio contexts. If aio-nr +reaches aio-max-nr then io_setup will fail with EAGAIN. Note that +raising aio-max-nr does not result in the pre-allocation or re-sizing +of any kernel data structures. + +2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats +----------------------------------------------------------- + +Besides these files, there is the subdirectory /proc/sys/fs/binfmt_misc. This +handles the kernel support for miscellaneous binary formats. + +Binfmt_misc provides the ability to register additional binary formats to the +Kernel without compiling an additional module/kernel. Therefore, binfmt_misc +needs to know magic numbers at the beginning or the filename extension of the +binary. + +It works by maintaining a linked list of structs that contain a description of +a binary format, including a magic with size (or the filename extension), +offset and mask, and the interpreter name. On request it invokes the given +interpreter with the original program as argument, as binfmt_java and +binfmt_em86 and binfmt_mz do. Since binfmt_misc does not define any default +binary-formats, you have to register an additional binary-format. + +There are two general files in binfmt_misc and one file per registered format. +The two general files are register and status. + +Registering a new binary format +------------------------------- + +To register a new binary format you have to issue the command + + echo :name:type:offset:magic:mask:interpreter: > /proc/sys/fs/binfmt_misc/register + + + +with appropriate name (the name for the /proc-dir entry), offset (defaults to +0, if omitted), magic, mask (which can be omitted, defaults to all 0xff) and +last but not least, the interpreter that is to be invoked (for example and +testing /bin/echo). Type can be M for usual magic matching or E for filename +extension matching (give extension in place of magic). + +Check or reset the status of the binary format handler +------------------------------------------------------ + +If you do a cat on the file /proc/sys/fs/binfmt_misc/status, you will get the +current status (enabled/disabled) of binfmt_misc. Change the status by echoing +0 (disables) or 1 (enables) or -1 (caution: this clears all previously +registered binary formats) to status. For example echo 0 > status to disable +binfmt_misc (temporarily). + +Status of a single handler +-------------------------- + +Each registered handler has an entry in /proc/sys/fs/binfmt_misc. These files +perform the same function as status, but their scope is limited to the actual +binary format. By cating this file, you also receive all related information +about the interpreter/magic of the binfmt. + +Example usage of binfmt_misc (emulate binfmt_java) +-------------------------------------------------- + + cd /proc/sys/fs/binfmt_misc + echo ':Java:M::\xca\xfe\xba\xbe::/usr/local/java/bin/javawrapper:' > register + echo ':HTML:E::html::/usr/local/java/bin/appletviewer:' > register + echo ':Applet:M::<!--applet::/usr/local/java/bin/appletviewer:' > register + echo ':DEXE:M::\x0eDEX::/usr/bin/dosexec:' > register + + +These four lines add support for Java executables and Java applets (like +binfmt_java, additionally recognizing the .html extension with no need to put +<!--applet> to every applet file). You have to install the JDK and the +shell-script /usr/local/java/bin/javawrapper too. It works around the +brokenness of the Java filename handling. To add a Java binary, just create a +link to the class-file somewhere in the path. + +2.3 /proc/sys/kernel - general kernel parameters +------------------------------------------------ + +This directory reflects general kernel behaviors. As I've said before, the +contents depend on your configuration. Here you'll find the most important +files, along with descriptions of what they mean and how to use them. + +acct +---- + +The file contains three values; highwater, lowwater, and frequency. + +It exists only when BSD-style process accounting is enabled. These values +control its behavior. If the free space on the file system where the log lives +goes below lowwater percentage, accounting suspends. If it goes above +highwater percentage, accounting resumes. Frequency determines how often you +check the amount of free space (value is in seconds). Default settings are: 4, +2, and 30. That is, suspend accounting if there is less than 2 percent free; +resume it if we have a value of 3 or more percent; consider information about +the amount of free space valid for 30 seconds + +ctrl-alt-del +------------ + +When the value in this file is 0, ctrl-alt-del is trapped and sent to the init +program to handle a graceful restart. However, when the value is greater that +zero, Linux's reaction to this key combination will be an immediate reboot, +without syncing its dirty buffers. + +[NOTE] + When a program (like dosemu) has the keyboard in raw mode, the + ctrl-alt-del is intercepted by the program before it ever reaches the + kernel tty layer, and it is up to the program to decide what to do with + it. + +domainname and hostname +----------------------- + +These files can be controlled to set the NIS domainname and hostname of your +box. For the classic darkstar.frop.org a simple: + + # echo "darkstar" > /proc/sys/kernel/hostname + # echo "frop.org" > /proc/sys/kernel/domainname + + +would suffice to set your hostname and NIS domainname. + +osrelease, ostype and version +----------------------------- + +The names make it pretty obvious what these fields contain: + + > cat /proc/sys/kernel/osrelease + 2.2.12 + + > cat /proc/sys/kernel/ostype + Linux + + > cat /proc/sys/kernel/version + #4 Fri Oct 1 12:41:14 PDT 1999 + + +The files osrelease and ostype should be clear enough. Version needs a little +more clarification. The #4 means that this is the 4th kernel built from this +source base and the date after it indicates the time the kernel was built. The +only way to tune these values is to rebuild the kernel. + +panic +----- + +The value in this file represents the number of seconds the kernel waits +before rebooting on a panic. When you use the software watchdog, the +recommended setting is 60. If set to 0, the auto reboot after a kernel panic +is disabled, which is the default setting. + +printk +------ + +The four values in printk denote +* console_loglevel, +* default_message_loglevel, +* minimum_console_loglevel and +* default_console_loglevel +respectively. + +These values influence printk() behavior when printing or logging error +messages, which come from inside the kernel. See syslog(2) for more +information on the different log levels. + +console_loglevel +---------------- + +Messages with a higher priority than this will be printed to the console. + +default_message_level +--------------------- + +Messages without an explicit priority will be printed with this priority. + +minimum_console_loglevel +------------------------ + +Minimum (highest) value to which the console_loglevel can be set. + +default_console_loglevel +------------------------ + +Default value for console_loglevel. + +sg-big-buff +----------- + +This file shows the size of the generic SCSI (sg) buffer. At this point, you +can't tune it yet, but you can change it at compile time by editing +include/scsi/sg.h and changing the value of SG_BIG_BUFF. + +If you use a scanner with SANE (Scanner Access Now Easy) you might want to set +this to a higher value. Refer to the SANE documentation on this issue. + +modprobe +-------- + +The location where the modprobe binary is located. The kernel uses this +program to load modules on demand. + +unknown_nmi_panic +----------------- + +The value in this file affects behavior of handling NMI. When the value is +non-zero, unknown NMI is trapped and then panic occurs. At that time, kernel +debugging information is displayed on console. + +NMI switch that most IA32 servers have fires unknown NMI up, for example. +If a system hangs up, try pressing the NMI switch. + +[NOTE] + This function and oprofile share a NMI callback. Therefore this function + cannot be enabled when oprofile is activated. + And NMI watchdog will be disabled when the value in this file is set to + non-zero. + + +2.4 /proc/sys/vm - The virtual memory subsystem +----------------------------------------------- + +The files in this directory can be used to tune the operation of the virtual +memory (VM) subsystem of the Linux kernel. + +vfs_cache_pressure +------------------ + +Controls the tendency of the kernel to reclaim the memory which is used for +caching of directory and inode objects. + +At the default value of vfs_cache_pressure=100 the kernel will attempt to +reclaim dentries and inodes at a "fair" rate with respect to pagecache and +swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer +to retain dentry and inode caches. Increasing vfs_cache_pressure beyond 100 +causes the kernel to prefer to reclaim dentries and inodes. + +dirty_background_ratio +---------------------- + +Contains, as a percentage of total system memory, the number of pages at which +the pdflush background writeback daemon will start writing out dirty data. + +dirty_ratio +----------------- + +Contains, as a percentage of total system memory, the number of pages at which +a process which is generating disk writes will itself start writing out dirty +data. + +dirty_writeback_centisecs +------------------------- + +The pdflush writeback daemons will periodically wake up and write `old' data +out to disk. This tunable expresses the interval between those wakeups, in +100'ths of a second. + +Setting this to zero disables periodic writeback altogether. + +dirty_expire_centisecs +---------------------- + +This tunable is used to define when dirty data is old enough to be eligible +for writeout by the pdflush daemons. It is expressed in 100'ths of a second. +Data which has been dirty in-memory for longer than this interval will be +written out next time a pdflush daemon wakes up. + +legacy_va_layout +---------------- + +If non-zero, this sysctl disables the new 32-bit mmap mmap layout - the kernel +will use the legacy (2.4) layout for all processes. + +lower_zone_protection +--------------------- + +For some specialised workloads on highmem machines it is dangerous for +the kernel to allow process memory to be allocated from the "lowmem" +zone. This is because that memory could then be pinned via the mlock() +system call, or by unavailability of swapspace. + +And on large highmem machines this lack of reclaimable lowmem memory +can be fatal. + +So the Linux page allocator has a mechanism which prevents allocations +which _could_ use highmem from using too much lowmem. This means that +a certain amount of lowmem is defended from the possibility of being +captured into pinned user memory. + +(The same argument applies to the old 16 megabyte ISA DMA region. This +mechanism will also defend that region from allocations which could use +highmem or lowmem). + +The `lower_zone_protection' tunable determines how aggressive the kernel is +in defending these lower zones. The default value is zero - no +protection at all. + +If you have a machine which uses highmem or ISA DMA and your +applications are using mlock(), or if you are running with no swap then +you probably should increase the lower_zone_protection setting. + +The units of this tunable are fairly vague. It is approximately equal +to "megabytes". So setting lower_zone_protection=100 will protect around 100 +megabytes of the lowmem zone from user allocations. It will also make +those 100 megabytes unavaliable for use by applications and by +pagecache, so there is a cost. + +The effects of this tunable may be observed by monitoring +/proc/meminfo:LowFree. Write a single huge file and observe the point +at which LowFree ceases to fall. + +A reasonable value for lower_zone_protection is 100. + +page-cluster +------------ + +page-cluster controls the number of pages which are written to swap in +a single attempt. The swap I/O size. + +It is a logarithmic value - setting it to zero means "1 page", setting +it to 1 means "2 pages", setting it to 2 means "4 pages", etc. + +The default value is three (eight pages at a time). There may be some +small benefits in tuning this to a different value if your workload is +swap-intensive. + +overcommit_memory +----------------- + +This file contains one value. The following algorithm is used to decide if +there's enough memory: if the value of overcommit_memory is positive, then +there's always enough memory. This is a useful feature, since programs often +malloc() huge amounts of memory 'just in case', while they only use a small +part of it. Leaving this value at 0 will lead to the failure of such a huge +malloc(), when in fact the system has enough memory for the program to run. + +On the other hand, enabling this feature can cause you to run out of memory +and thrash the system to death, so large and/or important servers will want to +set this value to 0. + +nr_hugepages and hugetlb_shm_group +---------------------------------- + +nr_hugepages configures number of hugetlb page reserved for the system. + +hugetlb_shm_group contains group id that is allowed to create SysV shared +memory segment using hugetlb page. + +laptop_mode +----------- + +laptop_mode is a knob that controls "laptop mode". All the things that are +controlled by this knob are discussed in Documentation/laptop-mode.txt. + +block_dump +---------- + +block_dump enables block I/O debugging when set to a nonzero value. More +information on block I/O debugging is in Documentation/laptop-mode.txt. + +swap_token_timeout +------------------ + +This file contains valid hold time of swap out protection token. The Linux +VM has token based thrashing control mechanism and uses the token to prevent +unnecessary page faults in thrashing situation. The unit of the value is +second. The value would be useful to tune thrashing behavior. + +2.5 /proc/sys/dev - Device specific parameters +---------------------------------------------- + +Currently there is only support for CDROM drives, and for those, there is only +one read-only file containing information about the CD-ROM drives attached to +the system: + + >cat /proc/sys/dev/cdrom/info + CD-ROM information, Id: cdrom.c 2.55 1999/04/25 + + drive name: sr0 hdb + drive speed: 32 40 + drive # of slots: 1 0 + Can close tray: 1 1 + Can open tray: 1 1 + Can lock tray: 1 1 + Can change speed: 1 1 + Can select disk: 0 1 + Can read multisession: 1 1 + Can read MCN: 1 1 + Reports media changed: 1 1 + Can play audio: 1 1 + + +You see two drives, sr0 and hdb, along with a list of their features. + +2.6 /proc/sys/sunrpc - Remote procedure calls +--------------------------------------------- + +This directory contains four files, which enable or disable debugging for the +RPC functions NFS, NFS-daemon, RPC and NLM. The default values are 0. They can +be set to one to turn debugging on. (The default value is 0 for each) + +2.7 /proc/sys/net - Networking stuff +------------------------------------ + +The interface to the networking parts of the kernel is located in +/proc/sys/net. Table 2-3 shows all possible subdirectories. You may see only +some of them, depending on your kernel's configuration. + + +Table 2-3: Subdirectories in /proc/sys/net +.............................................................................. + Directory Content Directory Content + core General parameter appletalk Appletalk protocol + unix Unix domain sockets netrom NET/ROM + 802 E802 protocol ax25 AX25 + ethernet Ethernet protocol rose X.25 PLP layer + ipv4 IP version 4 x25 X.25 protocol + ipx IPX token-ring IBM token ring + bridge Bridging decnet DEC net + ipv6 IP version 6 +.............................................................................. + +We will concentrate on IP networking here. Since AX15, X.25, and DEC Net are +only minor players in the Linux world, we'll skip them in this chapter. You'll +find some short info on Appletalk and IPX further on in this chapter. Review +the online documentation and the kernel source to get a detailed view of the +parameters for those protocols. In this section we'll discuss the +subdirectories printed in bold letters in the table above. As default values +are suitable for most needs, there is no need to change these values. + +/proc/sys/net/core - Network core options +----------------------------------------- + +rmem_default +------------ + +The default setting of the socket receive buffer in bytes. + +rmem_max +-------- + +The maximum receive socket buffer size in bytes. + +wmem_default +------------ + +The default setting (in bytes) of the socket send buffer. + +wmem_max +-------- + +The maximum send socket buffer size in bytes. + +message_burst and message_cost +------------------------------ + +These parameters are used to limit the warning messages written to the kernel +log from the networking code. They enforce a rate limit to make a +denial-of-service attack impossible. A higher message_cost factor, results in +fewer messages that will be written. Message_burst controls when messages will +be dropped. The default settings limit warning messages to one every five +seconds. + +netdev_max_backlog +------------------ + +Maximum number of packets, queued on the INPUT side, when the interface +receives packets faster than kernel can process them. + +optmem_max +---------- + +Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence +of struct cmsghdr structures with appended data. + +/proc/sys/net/unix - Parameters for Unix domain sockets +------------------------------------------------------- + +There are only two files in this subdirectory. They control the delays for +deleting and destroying socket descriptors. + +2.8 /proc/sys/net/ipv4 - IPV4 settings +-------------------------------------- + +IP version 4 is still the most used protocol in Unix networking. It will be +replaced by IP version 6 in the next couple of years, but for the moment it's +the de facto standard for the internet and is used in most networking +environments around the world. Because of the importance of this protocol, +we'll have a deeper look into the subtree controlling the behavior of the IPv4 +subsystem of the Linux kernel. + +Let's start with the entries in /proc/sys/net/ipv4. + +ICMP settings +------------- + +icmp_echo_ignore_all and icmp_echo_ignore_broadcasts +---------------------------------------------------- + +Turn on (1) or off (0), if the kernel should ignore all ICMP ECHO requests, or +just those to broadcast and multicast addresses. + +Please note that if you accept ICMP echo requests with a broadcast/multi\-cast +destination address your network may be used as an exploder for denial of +service packet flooding attacks to other hosts. + +icmp_destunreach_rate, icmp_echoreply_rate, icmp_paramprob_rate and icmp_timeexeed_rate +--------------------------------------------------------------------------------------- + +Sets limits for sending ICMP packets to specific targets. A value of zero +disables all limiting. Any positive value sets the maximum package rate in +hundredth of a second (on Intel systems). + +IP settings +----------- + +ip_autoconfig +------------- + +This file contains the number one if the host received its IP configuration by +RARP, BOOTP, DHCP or a similar mechanism. Otherwise it is zero. + +ip_default_ttl +-------------- + +TTL (Time To Live) for IPv4 interfaces. This is simply the maximum number of +hops a packet may travel. + +ip_dynaddr +---------- + +Enable dynamic socket address rewriting on interface address change. This is +useful for dialup interface with changing IP addresses. + +ip_forward +---------- + +Enable or disable forwarding of IP packages between interfaces. Changing this +value resets all other parameters to their default values. They differ if the +kernel is configured as host or router. + +ip_local_port_range +------------------- + +Range of ports used by TCP and UDP to choose the local port. Contains two +numbers, the first number is the lowest port, the second number the highest +local port. Default is 1024-4999. Should be changed to 32768-61000 for +high-usage systems. + +ip_no_pmtu_disc +--------------- + +Global switch to turn path MTU discovery off. It can also be set on a per +socket basis by the applications or on a per route basis. + +ip_masq_debug +------------- + +Enable/disable debugging of IP masquerading. + +IP fragmentation settings +------------------------- + +ipfrag_high_trash and ipfrag_low_trash +-------------------------------------- + +Maximum memory used to reassemble IP fragments. When ipfrag_high_thresh bytes +of memory is allocated for this purpose, the fragment handler will toss +packets until ipfrag_low_thresh is reached. + +ipfrag_time +----------- + +Time in seconds to keep an IP fragment in memory. + +TCP settings +------------ + +tcp_ecn +------- + +This file controls the use of the ECN bit in the IPv4 headers, this is a new +feature about Explicit Congestion Notification, but some routers and firewalls +block trafic that has this bit set, so it could be necessary to echo 0 to +/proc/sys/net/ipv4/tcp_ecn, if you want to talk to this sites. For more info +you could read RFC2481. + +tcp_retrans_collapse +-------------------- + +Bug-to-bug compatibility with some broken printers. On retransmit, try to send +larger packets to work around bugs in certain TCP stacks. Can be turned off by +setting it to zero. + +tcp_keepalive_probes +-------------------- + +Number of keep alive probes TCP sends out, until it decides that the +connection is broken. + +tcp_keepalive_time +------------------ + +How often TCP sends out keep alive messages, when keep alive is enabled. The +default is 2 hours. + +tcp_syn_retries +--------------- + +Number of times initial SYNs for a TCP connection attempt will be +retransmitted. Should not be higher than 255. This is only the timeout for +outgoing connections, for incoming connections the number of retransmits is +defined by tcp_retries1. + +tcp_sack +-------- + +Enable select acknowledgments after RFC2018. + +tcp_timestamps +-------------- + +Enable timestamps as defined in RFC1323. + +tcp_stdurg +---------- + +Enable the strict RFC793 interpretation of the TCP urgent pointer field. The +default is to use the BSD compatible interpretation of the urgent pointer +pointing to the first byte after the urgent data. The RFC793 interpretation is +to have it point to the last byte of urgent data. Enabling this option may +lead to interoperatibility problems. Disabled by default. + +tcp_syncookies +-------------- + +Only valid when the kernel was compiled with CONFIG_SYNCOOKIES. Send out +syncookies when the syn backlog queue of a socket overflows. This is to ward +off the common 'syn flood attack'. Disabled by default. + +Note that the concept of a socket backlog is abandoned. This means the peer +may not receive reliable error messages from an over loaded server with +syncookies enabled. + +tcp_window_scaling +------------------ + +Enable window scaling as defined in RFC1323. + +tcp_fin_timeout +--------------- + +The length of time in seconds it takes to receive a final FIN before the +socket is always closed. This is strictly a violation of the TCP +specification, but required to prevent denial-of-service attacks. + +tcp_max_ka_probes +----------------- + +Indicates how many keep alive probes are sent per slow timer run. Should not +be set too high to prevent bursts. + +tcp_max_syn_backlog +------------------- + +Length of the per socket backlog queue. Since Linux 2.2 the backlog specified +in listen(2) only specifies the length of the backlog queue of already +established sockets. When more connection requests arrive Linux starts to drop +packets. When syncookies are enabled the packets are still answered and the +maximum queue is effectively ignored. + +tcp_retries1 +------------ + +Defines how often an answer to a TCP connection request is retransmitted +before giving up. + +tcp_retries2 +------------ + +Defines how often a TCP packet is retransmitted before giving up. + +Interface specific settings +--------------------------- + +In the directory /proc/sys/net/ipv4/conf you'll find one subdirectory for each +interface the system knows about and one directory calls all. Changes in the +all subdirectory affect all interfaces, whereas changes in the other +subdirectories affect only one interface. All directories have the same +entries: + +accept_redirects +---------------- + +This switch decides if the kernel accepts ICMP redirect messages or not. The +default is 'yes' if the kernel is configured for a regular host and 'no' for a +router configuration. + +accept_source_route +------------------- + +Should source routed packages be accepted or declined. The default is +dependent on the kernel configuration. It's 'yes' for routers and 'no' for +hosts. + +bootp_relay +~~~~~~~~~~~ + +Accept packets with source address 0.b.c.d with destinations not to this host +as local ones. It is supposed that a BOOTP relay daemon will catch and forward +such packets. + +The default is 0, since this feature is not implemented yet (kernel version +2.2.12). + +forwarding +---------- + +Enable or disable IP forwarding on this interface. + +log_martians +------------ + +Log packets with source addresses with no known route to kernel log. + +mc_forwarding +------------- + +Do multicast routing. The kernel needs to be compiled with CONFIG_MROUTE and a +multicast routing daemon is required. + +proxy_arp +--------- + +Does (1) or does not (0) perform proxy ARP. + +rp_filter +--------- + +Integer value determines if a source validation should be made. 1 means yes, 0 +means no. Disabled by default, but local/broadcast address spoofing is always +on. + +If you set this to 1 on a router that is the only connection for a network to +the net, it will prevent spoofing attacks against your internal networks +(external addresses can still be spoofed), without the need for additional +firewall rules. + +secure_redirects +---------------- + +Accept ICMP redirect messages only for gateways, listed in default gateway +list. Enabled by default. + +shared_media +------------ + +If it is not set the kernel does not assume that different subnets on this +device can communicate directly. Default setting is 'yes'. + +send_redirects +-------------- + +Determines whether to send ICMP redirects to other hosts. + +Routing settings +---------------- + +The directory /proc/sys/net/ipv4/route contains several file to control +routing issues. + +error_burst and error_cost +-------------------------- + +These parameters are used to limit how many ICMP destination unreachable to +send from the host in question. ICMP destination unreachable messages are +sent when we can not reach the next hop, while trying to transmit a packet. +It will also print some error messages to kernel logs if someone is ignoring +our ICMP redirects. The higher the error_cost factor is, the fewer +destination unreachable and error messages will be let through. Error_burst +controls when destination unreachable messages and error messages will be +dropped. The default settings limit warning messages to five every second. + +flush +----- + +Writing to this file results in a flush of the routing cache. + +gc_elasticity, gc_interval, gc_min_interval_ms, gc_timeout, gc_thresh +--------------------------------------------------------------------- + +Values to control the frequency and behavior of the garbage collection +algorithm for the routing cache. gc_min_interval is deprecated and replaced +by gc_min_interval_ms. + + +max_size +-------- + +Maximum size of the routing cache. Old entries will be purged once the cache +reached has this size. + +max_delay, min_delay +-------------------- + +Delays for flushing the routing cache. + +redirect_load, redirect_number +------------------------------ + +Factors which determine if more ICPM redirects should be sent to a specific +host. No redirects will be sent once the load limit or the maximum number of +redirects has been reached. + +redirect_silence +---------------- + +Timeout for redirects. After this period redirects will be sent again, even if +this has been stopped, because the load or number limit has been reached. + +Network Neighbor handling +------------------------- + +Settings about how to handle connections with direct neighbors (nodes attached +to the same link) can be found in the directory /proc/sys/net/ipv4/neigh. + +As we saw it in the conf directory, there is a default subdirectory which +holds the default values, and one directory for each interface. The contents +of the directories are identical, with the single exception that the default +settings contain additional options to set garbage collection parameters. + +In the interface directories you'll find the following entries: + +base_reachable_time, base_reachable_time_ms +------------------------------------------- + +A base value used for computing the random reachable time value as specified +in RFC2461. + +Expression of base_reachable_time, which is deprecated, is in seconds. +Expression of base_reachable_time_ms is in milliseconds. + +retrans_time, retrans_time_ms +----------------------------- + +The time between retransmitted Neighbor Solicitation messages. +Used for address resolution and to determine if a neighbor is +unreachable. + +Expression of retrans_time, which is deprecated, is in 1/100 seconds (for +IPv4) or in jiffies (for IPv6). +Expression of retrans_time_ms is in milliseconds. + +unres_qlen +---------- + +Maximum queue length for a pending arp request - the number of packets which +are accepted from other layers while the ARP address is still resolved. + +anycast_delay +------------- + +Maximum for random delay of answers to neighbor solicitation messages in +jiffies (1/100 sec). Not yet implemented (Linux does not have anycast support +yet). + +ucast_solicit +------------- + +Maximum number of retries for unicast solicitation. + +mcast_solicit +------------- + +Maximum number of retries for multicast solicitation. + +delay_first_probe_time +---------------------- + +Delay for the first time probe if the neighbor is reachable. (see +gc_stale_time) + +locktime +-------- + +An ARP/neighbor entry is only replaced with a new one if the old is at least +locktime old. This prevents ARP cache thrashing. + +proxy_delay +----------- + +Maximum time (real time is random [0..proxytime]) before answering to an ARP +request for which we have an proxy ARP entry. In some cases, this is used to +prevent network flooding. + +proxy_qlen +---------- + +Maximum queue length of the delayed proxy arp timer. (see proxy_delay). + +app_solcit +---------- + +Determines the number of requests to send to the user level ARP daemon. Use 0 +to turn off. + +gc_stale_time +------------- + +Determines how often to check for stale ARP entries. After an ARP entry is +stale it will be resolved again (which is useful when an IP address migrates +to another machine). When ucast_solicit is greater than 0 it first tries to +send an ARP packet directly to the known host When that fails and +mcast_solicit is greater than 0, an ARP request is broadcasted. + +2.9 Appletalk +------------- + +The /proc/sys/net/appletalk directory holds the Appletalk configuration data +when Appletalk is loaded. The configurable parameters are: + +aarp-expiry-time +---------------- + +The amount of time we keep an ARP entry before expiring it. Used to age out +old hosts. + +aarp-resolve-time +----------------- + +The amount of time we will spend trying to resolve an Appletalk address. + +aarp-retransmit-limit +--------------------- + +The number of times we will retransmit a query before giving up. + +aarp-tick-time +-------------- + +Controls the rate at which expires are checked. + +The directory /proc/net/appletalk holds the list of active Appletalk sockets +on a machine. + +The fields indicate the DDP type, the local address (in network:node format) +the remote address, the size of the transmit pending queue, the size of the +received queue (bytes waiting for applications to read) the state and the uid +owning the socket. + +/proc/net/atalk_iface lists all the interfaces configured for appletalk.It +shows the name of the interface, its Appletalk address, the network range on +that address (or network number for phase 1 networks), and the status of the +interface. + +/proc/net/atalk_route lists each known network route. It lists the target +(network) that the route leads to, the router (may be directly connected), the +route flags, and the device the route is using. + +2.10 IPX +-------- + +The IPX protocol has no tunable values in proc/sys/net. + +The IPX protocol does, however, provide proc/net/ipx. This lists each IPX +socket giving the local and remote addresses in Novell format (that is +network:node:port). In accordance with the strange Novell tradition, +everything but the port is in hex. Not_Connected is displayed for sockets that +are not tied to a specific remote address. The Tx and Rx queue sizes indicate +the number of bytes pending for transmission and reception. The state +indicates the state the socket is in and the uid is the owning uid of the +socket. + +The /proc/net/ipx_interface file lists all IPX interfaces. For each interface +it gives the network number, the node number, and indicates if the network is +the primary network. It also indicates which device it is bound to (or +Internal for internal networks) and the Frame Type if appropriate. Linux +supports 802.3, 802.2, 802.2 SNAP and DIX (Blue Book) ethernet framing for +IPX. + +The /proc/net/ipx_route table holds a list of IPX routes. For each route it +gives the destination network, the router node (or Directly) and the network +address of the router (or Connected) for internal networks. + +2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem +---------------------------------------------------------- + +The "mqueue" filesystem provides the necessary kernel features to enable the +creation of a user space library that implements the POSIX message queues +API (as noted by the MSG tag in the POSIX 1003.1-2001 version of the System +Interfaces specification.) + +The "mqueue" filesystem contains values for determining/setting the amount of +resources used by the file system. + +/proc/sys/fs/mqueue/queues_max is a read/write file for setting/getting the +maximum number of message queues allowed on the system. + +/proc/sys/fs/mqueue/msg_max is a read/write file for setting/getting the +maximum number of messages in a queue value. In fact it is the limiting value +for another (user) limit which is set in mq_open invocation. This attribute of +a queue must be less or equal then msg_max. + +/proc/sys/fs/mqueue/msgsize_max is a read/write file for setting/getting the +maximum message size value (it is every message queue's attribute set during +its creation). + + +------------------------------------------------------------------------------ +Summary +------------------------------------------------------------------------------ +Certain aspects of kernel behavior can be modified at runtime, without the +need to recompile the kernel, or even to reboot the system. The files in the +/proc/sys tree can not only be read, but also modified. You can use the echo +command to write value into these files, thereby changing the default settings +of the kernel. +------------------------------------------------------------------------------ diff --git a/Documentation/filesystems/romfs.txt b/Documentation/filesystems/romfs.txt new file mode 100644 index 000000000000..2d2a7b2a16b9 --- /dev/null +++ b/Documentation/filesystems/romfs.txt @@ -0,0 +1,187 @@ +ROMFS - ROM FILE SYSTEM + +This is a quite dumb, read only filesystem, mainly for initial RAM +disks of installation disks. It has grown up by the need of having +modules linked at boot time. Using this filesystem, you get a very +similar feature, and even the possibility of a small kernel, with a +file system which doesn't take up useful memory from the router +functions in the basement of your office. + +For comparison, both the older minix and xiafs (the latter is now +defunct) filesystems, compiled as module need more than 20000 bytes, +while romfs is less than a page, about 4000 bytes (assuming i586 +code). Under the same conditions, the msdos filesystem would need +about 30K (and does not support device nodes or symlinks), while the +nfs module with nfsroot is about 57K. Furthermore, as a bit unfair +comparison, an actual rescue disk used up 3202 blocks with ext2, while +with romfs, it needed 3079 blocks. + +To create such a file system, you'll need a user program named +genromfs. It is available via anonymous ftp on sunsite.unc.edu and +its mirrors, in the /pub/Linux/system/recovery/ directory. + +As the name suggests, romfs could be also used (space-efficiently) on +various read-only media, like (E)EPROM disks if someone will have the +motivation.. :) + +However, the main purpose of romfs is to have a very small kernel, +which has only this filesystem linked in, and then can load any module +later, with the current module utilities. It can also be used to run +some program to decide if you need SCSI devices, and even IDE or +floppy drives can be loaded later if you use the "initrd"--initial +RAM disk--feature of the kernel. This would not be really news +flash, but with romfs, you can even spare off your ext2 or minix or +maybe even affs filesystem until you really know that you need it. + +For example, a distribution boot disk can contain only the cd disk +drivers (and possibly the SCSI drivers), and the ISO 9660 filesystem +module. The kernel can be small enough, since it doesn't have other +filesystems, like the quite large ext2fs module, which can then be +loaded off the CD at a later stage of the installation. Another use +would be for a recovery disk, when you are reinstalling a workstation +from the network, and you will have all the tools/modules available +from a nearby server, so you don't want to carry two disks for this +purpose, just because it won't fit into ext2. + +romfs operates on block devices as you can expect, and the underlying +structure is very simple. Every accessible structure begins on 16 +byte boundaries for fast access. The minimum space a file will take +is 32 bytes (this is an empty file, with a less than 16 character +name). The maximum overhead for any non-empty file is the header, and +the 16 byte padding for the name and the contents, also 16+14+15 = 45 +bytes. This is quite rare however, since most file names are longer +than 3 bytes, and shorter than 15 bytes. + +The layout of the filesystem is the following: + +offset content + + +---+---+---+---+ + 0 | - | r | o | m | \ + +---+---+---+---+ The ASCII representation of those bytes + 4 | 1 | f | s | - | / (i.e. "-rom1fs-") + +---+---+---+---+ + 8 | full size | The number of accessible bytes in this fs. + +---+---+---+---+ + 12 | checksum | The checksum of the FIRST 512 BYTES. + +---+---+---+---+ + 16 | volume name | The zero terminated name of the volume, + : : padded to 16 byte boundary. + +---+---+---+---+ + xx | file | + : headers : + +Every multi byte value (32 bit words, I'll use the longwords term from +now on) must be in big endian order. + +The first eight bytes identify the filesystem, even for the casual +inspector. After that, in the 3rd longword, it contains the number of +bytes accessible from the start of this filesystem. The 4th longword +is the checksum of the first 512 bytes (or the number of bytes +accessible, whichever is smaller). The applied algorithm is the same +as in the AFFS filesystem, namely a simple sum of the longwords +(assuming bigendian quantities again). For details, please consult +the source. This algorithm was chosen because although it's not quite +reliable, it does not require any tables, and it is very simple. + +The following bytes are now part of the file system; each file header +must begin on a 16 byte boundary. + +offset content + + +---+---+---+---+ + 0 | next filehdr|X| The offset of the next file header + +---+---+---+---+ (zero if no more files) + 4 | spec.info | Info for directories/hard links/devices + +---+---+---+---+ + 8 | size | The size of this file in bytes + +---+---+---+---+ + 12 | checksum | Covering the meta data, including the file + +---+---+---+---+ name, and padding + 16 | file name | The zero terminated name of the file, + : : padded to 16 byte boundary + +---+---+---+---+ + xx | file data | + : : + +Since the file headers begin always at a 16 byte boundary, the lowest +4 bits would be always zero in the next filehdr pointer. These four +bits are used for the mode information. Bits 0..2 specify the type of +the file; while bit 4 shows if the file is executable or not. The +permissions are assumed to be world readable, if this bit is not set, +and world executable if it is; except the character and block devices, +they are never accessible for other than owner. The owner of every +file is user and group 0, this should never be a problem for the +intended use. The mapping of the 8 possible values to file types is +the following: + + mapping spec.info means + 0 hard link link destination [file header] + 1 directory first file's header + 2 regular file unused, must be zero [MBZ] + 3 symbolic link unused, MBZ (file data is the link content) + 4 block device 16/16 bits major/minor number + 5 char device - " - + 6 socket unused, MBZ + 7 fifo unused, MBZ + +Note that hard links are specifically marked in this filesystem, but +they will behave as you can expect (i.e. share the inode number). +Note also that it is your responsibility to not create hard link +loops, and creating all the . and .. links for directories. This is +normally done correctly by the genromfs program. Please refrain from +using the executable bits for special purposes on the socket and fifo +special files, they may have other uses in the future. Additionally, +please remember that only regular files, and symlinks are supposed to +have a nonzero size field; they contain the number of bytes available +directly after the (padded) file name. + +Another thing to note is that romfs works on file headers and data +aligned to 16 byte boundaries, but most hardware devices and the block +device drivers are unable to cope with smaller than block-sized data. +To overcome this limitation, the whole size of the file system must be +padded to an 1024 byte boundary. + +If you have any problems or suggestions concerning this file system, +please contact me. However, think twice before wanting me to add +features and code, because the primary and most important advantage of +this file system is the small code. On the other hand, don't be +alarmed, I'm not getting that much romfs related mail. Now I can +understand why Avery wrote poems in the ARCnet docs to get some more +feedback. :) + +romfs has also a mailing list, and to date, it hasn't received any +traffic, so you are welcome to join it to discuss your ideas. :) + +It's run by ezmlm, so you can subscribe to it by sending a message +to romfs-subscribe@shadow.banki.hu, the content is irrelevant. + +Pending issues: + +- Permissions and owner information are pretty essential features of a +Un*x like system, but romfs does not provide the full possibilities. +I have never found this limiting, but others might. + +- The file system is read only, so it can be very small, but in case +one would want to write _anything_ to a file system, he still needs +a writable file system, thus negating the size advantages. Possible +solutions: implement write access as a compile-time option, or a new, +similarly small writable filesystem for RAM disks. + +- Since the files are only required to have alignment on a 16 byte +boundary, it is currently possibly suboptimal to read or execute files +from the filesystem. It might be resolved by reordering file data to +have most of it (i.e. except the start and the end) laying at "natural" +boundaries, thus it would be possible to directly map a big portion of +the file contents to the mm subsystem. + +- Compression might be an useful feature, but memory is quite a +limiting factor in my eyes. + +- Where it is used? + +- Does it work on other architectures than intel and motorola? + + +Have fun, +Janos Farkas <chexum@shadow.banki.hu> diff --git a/Documentation/filesystems/smbfs.txt b/Documentation/filesystems/smbfs.txt new file mode 100644 index 000000000000..f673ef0de0f7 --- /dev/null +++ b/Documentation/filesystems/smbfs.txt @@ -0,0 +1,8 @@ +Smbfs is a filesystem that implements the SMB protocol, which is the +protocol used by Windows for Workgroups, Windows 95 and Windows NT. +Smbfs was inspired by Samba, the program written by Andrew Tridgell +that turns any Unix host into a file server for DOS or Windows clients. + +Smbfs is a SMB client, but uses parts of samba for it's operation. For +more info on samba, including documentation, please go to +http://www.samba.org/ and then on to your nearest mirror. diff --git a/Documentation/filesystems/sysfs-pci.txt b/Documentation/filesystems/sysfs-pci.txt new file mode 100644 index 000000000000..e97d024eae77 --- /dev/null +++ b/Documentation/filesystems/sysfs-pci.txt @@ -0,0 +1,88 @@ +Accessing PCI device resources through sysfs + +sysfs, usually mounted at /sys, provides access to PCI resources on platforms +that support it. For example, a given bus might look like this: + + /sys/devices/pci0000:17 + |-- 0000:17:00.0 + | |-- class + | |-- config + | |-- detach_state + | |-- device + | |-- irq + | |-- local_cpus + | |-- resource + | |-- resource0 + | |-- resource1 + | |-- resource2 + | |-- rom + | |-- subsystem_device + | |-- subsystem_vendor + | `-- vendor + `-- detach_state + +The topmost element describes the PCI domain and bus number. In this case, +the domain number is 0000 and the bus number is 17 (both values are in hex). +This bus contains a single function device in slot 0. The domain and bus +numbers are reproduced for convenience. Under the device directory are several +files, each with their own function. + + file function + ---- -------- + class PCI class (ascii, ro) + config PCI config space (binary, rw) + detach_state connection status (bool, rw) + device PCI device (ascii, ro) + irq IRQ number (ascii, ro) + local_cpus nearby CPU mask (cpumask, ro) + resource PCI resource host addresses (ascii, ro) + resource0..N PCI resource N, if present (binary, mmap) + rom PCI ROM resource, if present (binary, ro) + subsystem_device PCI subsystem device (ascii, ro) + subsystem_vendor PCI subsystem vendor (ascii, ro) + vendor PCI vendor (ascii, ro) + + ro - read only file + rw - file is readable and writable + mmap - file is mmapable + ascii - file contains ascii text + binary - file contains binary data + cpumask - file contains a cpumask type + +The read only files are informational, writes to them will be ignored. +Writable files can be used to perform actions on the device (e.g. changing +config space, detaching a device). mmapable files are available via an +mmap of the file at offset 0 and can be used to do actual device programming +from userspace. Note that some platforms don't support mmapping of certain +resources, so be sure to check the return value from any attempted mmap. + +Accessing legacy resources through sysfs + +Legacy I/O port and ISA memory resources are also provided in sysfs if the +underlying platform supports them. They're located in the PCI class heirarchy, +e.g. + + /sys/class/pci_bus/0000:17/ + |-- bridge -> ../../../devices/pci0000:17 + |-- cpuaffinity + |-- legacy_io + `-- legacy_mem + +The legacy_io file is a read/write file that can be used by applications to +do legacy port I/O. The application should open the file, seek to the desired +port (e.g. 0x3e8) and do a read or a write of 1, 2 or 4 bytes. The legacy_mem +file should be mmapped with an offset corresponding to the memory offset +desired, e.g. 0xa0000 for the VGA frame buffer. The application can then +simply dereference the returned pointer (after checking for errors of course) +to access legacy memory space. + +Supporting PCI access on new platforms + +In order to support PCI resource mapping as described above, Linux platform +code must define HAVE_PCI_MMAP and provide a pci_mmap_page_range function. +Platforms are free to only support subsets of the mmap functionality, but +useful return codes should be provided. + +Legacy resources are protected by the HAVE_PCI_LEGACY define. Platforms +wishing to support legacy functionality should define it and provide +pci_legacy_read, pci_legacy_write and pci_mmap_legacy_page_range functions.
\ No newline at end of file diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt new file mode 100644 index 000000000000..60f6c2c4d477 --- /dev/null +++ b/Documentation/filesystems/sysfs.txt @@ -0,0 +1,341 @@ + +sysfs - _The_ filesystem for exporting kernel objects. + +Patrick Mochel <mochel@osdl.org> + +10 January 2003 + + +What it is: +~~~~~~~~~~~ + +sysfs is a ram-based filesystem initially based on ramfs. It provides +a means to export kernel data structures, their attributes, and the +linkages between them to userspace. + +sysfs is tied inherently to the kobject infrastructure. Please read +Documentation/kobject.txt for more information concerning the kobject +interface. + + +Using sysfs +~~~~~~~~~~~ + +sysfs is always compiled in. You can access it by doing: + + mount -t sysfs sysfs /sys + + +Directory Creation +~~~~~~~~~~~~~~~~~~ + +For every kobject that is registered with the system, a directory is +created for it in sysfs. That directory is created as a subdirectory +of the kobject's parent, expressing internal object hierarchies to +userspace. Top-level directories in sysfs represent the common +ancestors of object hierarchies; i.e. the subsystems the objects +belong to. + +Sysfs internally stores the kobject that owns the directory in the +->d_fsdata pointer of the directory's dentry. This allows sysfs to do +reference counting directly on the kobject when the file is opened and +closed. + + +Attributes +~~~~~~~~~~ + +Attributes can be exported for kobjects in the form of regular files in +the filesystem. Sysfs forwards file I/O operations to methods defined +for the attributes, providing a means to read and write kernel +attributes. + +Attributes should be ASCII text files, preferably with only one value +per file. It is noted that it may not be efficient to contain only +value per file, so it is socially acceptable to express an array of +values of the same type. + +Mixing types, expressing multiple lines of data, and doing fancy +formatting of data is heavily frowned upon. Doing these things may get +you publically humiliated and your code rewritten without notice. + + +An attribute definition is simply: + +struct attribute { + char * name; + mode_t mode; +}; + + +int sysfs_create_file(struct kobject * kobj, struct attribute * attr); +void sysfs_remove_file(struct kobject * kobj, struct attribute * attr); + + +A bare attribute contains no means to read or write the value of the +attribute. Subsystems are encouraged to define their own attribute +structure and wrapper functions for adding and removing attributes for +a specific object type. + +For example, the driver model defines struct device_attribute like: + +struct device_attribute { + struct attribute attr; + ssize_t (*show)(struct device * dev, char * buf); + ssize_t (*store)(struct device * dev, const char * buf); +}; + +int device_create_file(struct device *, struct device_attribute *); +void device_remove_file(struct device *, struct device_attribute *); + +It also defines this helper for defining device attributes: + +#define DEVICE_ATTR(_name,_mode,_show,_store) \ +struct device_attribute dev_attr_##_name = { \ + .attr = {.name = __stringify(_name) , .mode = _mode }, \ + .show = _show, \ + .store = _store, \ +}; + +For example, declaring + +static DEVICE_ATTR(foo,0644,show_foo,store_foo); + +is equivalent to doing: + +static struct device_attribute dev_attr_foo = { + .attr = { + .name = "foo", + .mode = 0644, + }, + .show = show_foo, + .store = store_foo, +}; + + +Subsystem-Specific Callbacks +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When a subsystem defines a new attribute type, it must implement a +set of sysfs operations for forwarding read and write calls to the +show and store methods of the attribute owners. + +struct sysfs_ops { + ssize_t (*show)(struct kobject *, struct attribute *,char *); + ssize_t (*store)(struct kobject *,struct attribute *,const char *); +}; + +[ Subsystems should have already defined a struct kobj_type as a +descriptor for this type, which is where the sysfs_ops pointer is +stored. See the kobject documentation for more information. ] + +When a file is read or written, sysfs calls the appropriate method +for the type. The method then translates the generic struct kobject +and struct attribute pointers to the appropriate pointer types, and +calls the associated methods. + + +To illustrate: + +#define to_dev_attr(_attr) container_of(_attr,struct device_attribute,attr) +#define to_dev(d) container_of(d, struct device, kobj) + +static ssize_t +dev_attr_show(struct kobject * kobj, struct attribute * attr, char * buf) +{ + struct device_attribute * dev_attr = to_dev_attr(attr); + struct device * dev = to_dev(kobj); + ssize_t ret = 0; + + if (dev_attr->show) + ret = dev_attr->show(dev,buf); + return ret; +} + + + +Reading/Writing Attribute Data +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To read or write attributes, show() or store() methods must be +specified when declaring the attribute. The method types should be as +simple as those defined for device attributes: + + ssize_t (*show)(struct device * dev, char * buf); + ssize_t (*store)(struct device * dev, const char * buf); + +IOW, they should take only an object and a buffer as parameters. + + +sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the +method. Sysfs will call the method exactly once for each read or +write. This forces the following behavior on the method +implementations: + +- On read(2), the show() method should fill the entire buffer. + Recall that an attribute should only be exporting one value, or an + array of similar values, so this shouldn't be that expensive. + + This allows userspace to do partial reads and seeks arbitrarily over + the entire file at will. + +- On write(2), sysfs expects the entire buffer to be passed during the + first write. Sysfs then passes the entire buffer to the store() + method. + + When writing sysfs files, userspace processes should first read the + entire file, modify the values it wishes to change, then write the + entire buffer back. + + Attribute method implementations should operate on an identical + buffer when reading and writing values. + +Other notes: + +- The buffer will always be PAGE_SIZE bytes in length. On i386, this + is 4096. + +- show() methods should return the number of bytes printed into the + buffer. This is the return value of snprintf(). + +- show() should always use snprintf(). + +- store() should return the number of bytes used from the buffer. This + can be done using strlen(). + +- show() or store() can always return errors. If a bad value comes + through, be sure to return an error. + +- The object passed to the methods will be pinned in memory via sysfs + referencing counting its embedded object. However, the physical + entity (e.g. device) the object represents may not be present. Be + sure to have a way to check this, if necessary. + + +A very simple (and naive) implementation of a device attribute is: + +static ssize_t show_name(struct device * dev, char * buf) +{ + return sprintf(buf,"%s\n",dev->name); +} + +static ssize_t store_name(struct device * dev, const char * buf) +{ + sscanf(buf,"%20s",dev->name); + return strlen(buf); +} + +static DEVICE_ATTR(name,S_IRUGO,show_name,store_name); + + +(Note that the real implementation doesn't allow userspace to set the +name for a device.) + + +Top Level Directory Layout +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The sysfs directory arrangement exposes the relationship of kernel +data structures. + +The top level sysfs diretory looks like: + +block/ +bus/ +class/ +devices/ +firmware/ +net/ + +devices/ contains a filesystem representation of the device tree. It maps +directly to the internal kernel device tree, which is a hierarchy of +struct device. + +bus/ contains flat directory layout of the various bus types in the +kernel. Each bus's directory contains two subdirectories: + + devices/ + drivers/ + +devices/ contains symlinks for each device discovered in the system +that point to the device's directory under root/. + +drivers/ contains a directory for each device driver that is loaded +for devices on that particular bus (this assumes that drivers do not +span multiple bus types). + + +More information can driver-model specific features can be found in +Documentation/driver-model/. + + +TODO: Finish this section. + + +Current Interfaces +~~~~~~~~~~~~~~~~~~ + +The following interface layers currently exist in sysfs: + + +- devices (include/linux/device.h) +---------------------------------- +Structure: + +struct device_attribute { + struct attribute attr; + ssize_t (*show)(struct device * dev, char * buf); + ssize_t (*store)(struct device * dev, const char * buf); +}; + +Declaring: + +DEVICE_ATTR(_name,_str,_mode,_show,_store); + +Creation/Removal: + +int device_create_file(struct device *device, struct device_attribute * attr); +void device_remove_file(struct device * dev, struct device_attribute * attr); + + +- bus drivers (include/linux/device.h) +-------------------------------------- +Structure: + +struct bus_attribute { + struct attribute attr; + ssize_t (*show)(struct bus_type *, char * buf); + ssize_t (*store)(struct bus_type *, const char * buf); +}; + +Declaring: + +BUS_ATTR(_name,_mode,_show,_store) + +Creation/Removal: + +int bus_create_file(struct bus_type *, struct bus_attribute *); +void bus_remove_file(struct bus_type *, struct bus_attribute *); + + +- device drivers (include/linux/device.h) +----------------------------------------- + +Structure: + +struct driver_attribute { + struct attribute attr; + ssize_t (*show)(struct device_driver *, char * buf); + ssize_t (*store)(struct device_driver *, const char * buf); +}; + +Declaring: + +DRIVER_ATTR(_name,_mode,_show,_store) + +Creation/Removal: + +int driver_create_file(struct device_driver *, struct driver_attribute *); +void driver_remove_file(struct device_driver *, struct driver_attribute *); + + diff --git a/Documentation/filesystems/sysv-fs.txt b/Documentation/filesystems/sysv-fs.txt new file mode 100644 index 000000000000..d81722418010 --- /dev/null +++ b/Documentation/filesystems/sysv-fs.txt @@ -0,0 +1,38 @@ +This is the implementation of the SystemV/Coherent filesystem for Linux. +It implements all of + - Xenix FS, + - SystemV/386 FS, + - Coherent FS. + +This is version beta 4. + +To install: +* Answer the 'System V and Coherent filesystem support' question with 'y' + when configuring the kernel. +* To mount a disk or a partition, use + mount [-r] -t sysv device mountpoint + The file system type names + -t sysv + -t xenix + -t coherent + may be used interchangeably, but the last two will eventually disappear. + +Bugs in the present implementation: +- Coherent FS: + - The "free list interleave" n:m is currently ignored. + - Only file systems with no filesystem name and no pack name are recognized. + (See Coherent "man mkfs" for a description of these features.) +- SystemV Release 2 FS: + The superblock is only searched in the blocks 9, 15, 18, which + corresponds to the beginning of track 1 on floppy disks. No support + for this FS on hard disk yet. + + +Please report any bugs and suggestions to + Bruno Haible <haible@ma2s2.mathematik.uni-karlsruhe.de> + Pascal Haible <haible@izfm.uni-stuttgart.de> + Krzysztof G. Baranowski <kgb@manjak.knm.org.pl> + +Bruno Haible +<haible@ma2s2.mathematik.uni-karlsruhe.de> + diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt new file mode 100644 index 000000000000..417e3095fe39 --- /dev/null +++ b/Documentation/filesystems/tmpfs.txt @@ -0,0 +1,100 @@ +Tmpfs is a file system which keeps all files in virtual memory. + + +Everything in tmpfs is temporary in the sense that no files will be +created on your hard drive. If you unmount a tmpfs instance, +everything stored therein is lost. + +tmpfs puts everything into the kernel internal caches and grows and +shrinks to accommodate the files it contains and is able to swap +unneeded pages out to swap space. It has maximum size limits which can +be adjusted on the fly via 'mount -o remount ...' + +If you compare it to ramfs (which was the template to create tmpfs) +you gain swapping and limit checking. Another similar thing is the RAM +disk (/dev/ram*), which simulates a fixed size hard disk in physical +RAM, where you have to create an ordinary filesystem on top. Ramdisks +cannot swap and you do not have the possibility to resize them. + +Since tmpfs lives completely in the page cache and on swap, all tmpfs +pages currently in memory will show up as cached. It will not show up +as shared or something like that. Further on you can check the actual +RAM+swap use of a tmpfs instance with df(1) and du(1). + + +tmpfs has the following uses: + +1) There is always a kernel internal mount which you will not see at + all. This is used for shared anonymous mappings and SYSV shared + memory. + + This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not + set, the user visible part of tmpfs is not build. But the internal + mechanisms are always present. + +2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for + POSIX shared memory (shm_open, shm_unlink). Adding the following + line to /etc/fstab should take care of this: + + tmpfs /dev/shm tmpfs defaults 0 0 + + Remember to create the directory that you intend to mount tmpfs on + if necessary (/dev/shm is automagically created if you use devfs). + + This mount is _not_ needed for SYSV shared memory. The internal + mount is used for that. (In the 2.3 kernel versions it was + necessary to mount the predecessor of tmpfs (shm fs) to use SYSV + shared memory) + +3) Some people (including me) find it very convenient to mount it + e.g. on /tmp and /var/tmp and have a big swap partition. And now + loop mounts of tmpfs files do work, so mkinitrd shipped by most + distributions should succeed with a tmpfs /tmp. + +4) And probably a lot more I do not know about :-) + + +tmpfs has three mount options for sizing: + +size: The limit of allocated bytes for this tmpfs instance. The + default is half of your physical RAM without swap. If you + oversize your tmpfs instances the machine will deadlock + since the OOM handler will not be able to free that memory. +nr_blocks: The same as size, but in blocks of PAGE_CACHE_SIZE. +nr_inodes: The maximum number of inodes for this instance. The default + is half of the number of your physical RAM pages, or (on a + a machine with highmem) the number of lowmem RAM pages, + whichever is the lower. + +These parameters accept a suffix k, m or g for kilo, mega and giga and +can be changed on remount. The size parameter also accepts a suffix % +to limit this tmpfs instance to that percentage of your physical RAM: +the default, when neither size nor nr_blocks is specified, is size=50% + +If both nr_blocks (or size) and nr_inodes are set to 0, neither blocks +nor inodes will be limited in that instance. It is generally unwise to +mount with such options, since it allows any user with write access to +use up all the memory on the machine; but enhances the scalability of +that instance in a system with many cpus making intensive use of it. + + +To specify the initial root directory you can use the following mount +options: + +mode: The permissions as an octal number +uid: The user id +gid: The group id + +These options do not have any effect on remount. You can change these +parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem. + + +So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs' +will give you tmpfs instance on /mytmpfs which can allocate 10GB +RAM/SWAP in 10240 inodes and it is only accessible by root. + + +Author: + Christoph Rohland <cr@sap.com>, 1.12.01 +Updated: + Hugh Dickins <hugh@veritas.com>, 01 September 2004 diff --git a/Documentation/filesystems/udf.txt b/Documentation/filesystems/udf.txt new file mode 100644 index 000000000000..e5213bc301f7 --- /dev/null +++ b/Documentation/filesystems/udf.txt @@ -0,0 +1,57 @@ +* +* Documentation/filesystems/udf.txt +* +UDF Filesystem version 0.9.8.1 + +If you encounter problems with reading UDF discs using this driver, +please report them to linux_udf@hpesjro.fc.hp.com, which is the +developer's list. + +Write support requires a block driver which supports writing. The current +scsi and ide cdrom drivers do not support writing. + +------------------------------------------------------------------------------- +The following mount options are supported: + + gid= Set the default group. + umask= Set the default umask. + uid= Set the default user. + bs= Set the block size. + unhide Show otherwise hidden files. + undelete Show deleted files in lists. + adinicb Embed data in the inode (default) + noadinicb Don't embed data in the inode + shortad Use short ad's + longad Use long ad's (default) + nostrict Unset strict conformance + iocharset= Set the NLS character set + +The remaining are for debugging and disaster recovery: + + novrs Skip volume sequence recognition + +The following expect a offset from 0. + + session= Set the CDROM session (default= last session) + anchor= Override standard anchor location. (default= 256) + volume= Override the VolumeDesc location. (unused) + partition= Override the PartitionDesc location. (unused) + lastblock= Set the last block of the filesystem/ + +The following expect a offset from the partition root. + + fileset= Override the fileset block location. (unused) + rootdir= Override the root directory location. (unused) + WARNING: overriding the rootdir to a non-directory may + yield highly unpredictable results. +------------------------------------------------------------------------------- + + +For the latest version and toolset see: + http://linux-udf.sourceforge.net/ + +Documentation on UDF and ECMA 167 is available FREE from: + http://www.osta.org/ + http://www.ecma-international.org/ + +Ben Fennema <bfennema@falcon.csc.calpoly.edu> diff --git a/Documentation/filesystems/ufs.txt b/Documentation/filesystems/ufs.txt new file mode 100644 index 000000000000..2b5a56a6a558 --- /dev/null +++ b/Documentation/filesystems/ufs.txt @@ -0,0 +1,61 @@ +USING UFS +========= + +mount -t ufs -o ufstype=type_of_ufs device dir + + +UFS OPTIONS +=========== + +ufstype=type_of_ufs + UFS is a file system widely used in different operating systems. + The problem are differences among implementations. Features of + some implementations are undocumented, so its hard to recognize + type of ufs automatically. That's why user must specify type of + ufs manually by mount option ufstype. Possible values are: + + old old format of ufs + default value, supported as read-only + + 44bsd used in FreeBSD, NetBSD, OpenBSD + supported as read-write + + ufs2 used in FreeBSD 5.x + supported as read-only + + 5xbsd synonym for ufs2 + + sun used in SunOS (Solaris) + supported as read-write + + sunx86 used in SunOS for Intel (Solarisx86) + supported as read-write + + hp used in HP-UX + supported as read-only + + nextstep + used in NextStep + supported as read-only + + nextstep-cd + used for NextStep CDROMs (block_size == 2048) + supported as read-only + + openstep + used in OpenStep + supported as read-only + + +POSSIBLE PROBLEMS +================= + +There is still bug in reallocation of fragment, in file fs/ufs/balloc.c, +line 364. But it seems working on current buffer cache configuration. + + +BUG REPORTS +=========== + +Any ufs bug report you can send to daniel.pirkl@email.cz (do not send +partition tables bug reports.) diff --git a/Documentation/filesystems/vfat.txt b/Documentation/filesystems/vfat.txt new file mode 100644 index 000000000000..5ead20c6c744 --- /dev/null +++ b/Documentation/filesystems/vfat.txt @@ -0,0 +1,231 @@ +USING VFAT +---------------------------------------------------------------------- +To use the vfat filesystem, use the filesystem type 'vfat'. i.e. + mount -t vfat /dev/fd0 /mnt + +No special partition formatter is required. mkdosfs will work fine +if you want to format from within Linux. + +VFAT MOUNT OPTIONS +---------------------------------------------------------------------- +umask=### -- The permission mask (for files and directories, see umask(1)). + The default is the umask of current process. + +dmask=### -- The permission mask for the directory. + The default is the umask of current process. + +fmask=### -- The permission mask for files. + The default is the umask of current process. + +codepage=### -- Sets the codepage number for converting to shortname + characters on FAT filesystem. + By default, FAT_DEFAULT_CODEPAGE setting is used. + +iocharset=name -- Character set to use for converting between the + encoding is used for user visible filename and 16 bit + Unicode characters. Long filenames are stored on disk + in Unicode format, but Unix for the most part doesn't + know how to deal with Unicode. + By default, FAT_DEFAULT_IOCHARSET setting is used. + + There is also an option of doing UTF8 translations + with the utf8 option. + + NOTE: "iocharset=utf8" is not recommended. If unsure, + you should consider the following option instead. + +utf8=<bool> -- UTF8 is the filesystem safe version of Unicode that + is used by the console. It can be be enabled for the + filesystem with this option. If 'uni_xlate' gets set, + UTF8 gets disabled. + +uni_xlate=<bool> -- Translate unhandled Unicode characters to special + escaped sequences. This would let you backup and + restore filenames that are created with any Unicode + characters. Until Linux supports Unicode for real, + this gives you an alternative. Without this option, + a '?' is used when no translation is possible. The + escape character is ':' because it is otherwise + illegal on the vfat filesystem. The escape sequence + that gets used is ':' and the four digits of hexadecimal + unicode. + +nonumtail=<bool> -- When creating 8.3 aliases, normally the alias will + end in '~1' or tilde followed by some number. If this + option is set, then if the filename is + "longfilename.txt" and "longfile.txt" does not + currently exist in the directory, 'longfile.txt' will + be the short alias instead of 'longfi~1.txt'. + +quiet -- Stops printing certain warning messages. + +check=s|r|n -- Case sensitivity checking setting. + s: strict, case sensitive + r: relaxed, case insensitive + n: normal, default setting, currently case insensitive + +shortname=lower|win95|winnt|mixed + -- Shortname display/create setting. + lower: convert to lowercase for display, + emulate the Windows 95 rule for create. + win95: emulate the Windows 95 rule for display/create. + winnt: emulate the Windows NT rule for display/create. + mixed: emulate the Windows NT rule for display, + emulate the Windows 95 rule for create. + Default setting is `lower'. + +<bool>: 0,1,yes,no,true,false + +TODO +---------------------------------------------------------------------- +* Need to get rid of the raw scanning stuff. Instead, always use + a get next directory entry approach. The only thing left that uses + raw scanning is the directory renaming code. + + +POSSIBLE PROBLEMS +---------------------------------------------------------------------- +* vfat_valid_longname does not properly checked reserved names. +* When a volume name is the same as a directory name in the root + directory of the filesystem, the directory name sometimes shows + up as an empty file. +* autoconv option does not work correctly. + +BUG REPORTS +---------------------------------------------------------------------- +If you have trouble with the VFAT filesystem, mail bug reports to +chaffee@bmrc.cs.berkeley.edu. Please specify the filename +and the operation that gave you trouble. + +TEST SUITE +---------------------------------------------------------------------- +If you plan to make any modifications to the vfat filesystem, please +get the test suite that comes with the vfat distribution at + + http://bmrc.berkeley.edu/people/chaffee/vfat.html + +This tests quite a few parts of the vfat filesystem and additional +tests for new features or untested features would be appreciated. + +NOTES ON THE STRUCTURE OF THE VFAT FILESYSTEM +---------------------------------------------------------------------- +(This documentation was provided by Galen C. Hunt <gchunt@cs.rochester.edu> + and lightly annotated by Gordon Chaffee). + +This document presents a very rough, technical overview of my +knowledge of the extended FAT file system used in Windows NT 3.5 and +Windows 95. I don't guarantee that any of the following is correct, +but it appears to be so. + +The extended FAT file system is almost identical to the FAT +file system used in DOS versions up to and including 6.223410239847 +:-). The significant change has been the addition of long file names. +These names support up to 255 characters including spaces and lower +case characters as opposed to the traditional 8.3 short names. + +Here is the description of the traditional FAT entry in the current +Windows 95 filesystem: + + struct directory { // Short 8.3 names + unsigned char name[8]; // file name + unsigned char ext[3]; // file extension + unsigned char attr; // attribute byte + unsigned char lcase; // Case for base and extension + unsigned char ctime_ms; // Creation time, milliseconds + unsigned char ctime[2]; // Creation time + unsigned char cdate[2]; // Creation date + unsigned char adate[2]; // Last access date + unsigned char reserved[2]; // reserved values (ignored) + unsigned char time[2]; // time stamp + unsigned char date[2]; // date stamp + unsigned char start[2]; // starting cluster number + unsigned char size[4]; // size of the file + }; + +The lcase field specifies if the base and/or the extension of an 8.3 +name should be capitalized. This field does not seem to be used by +Windows 95 but it is used by Windows NT. The case of filenames is not +completely compatible from Windows NT to Windows 95. It is not completely +compatible in the reverse direction, however. Filenames that fit in +the 8.3 namespace and are written on Windows NT to be lowercase will +show up as uppercase on Windows 95. + +Note that the "start" and "size" values are actually little +endian integer values. The descriptions of the fields in this +structure are public knowledge and can be found elsewhere. + +With the extended FAT system, Microsoft has inserted extra +directory entries for any files with extended names. (Any name which +legally fits within the old 8.3 encoding scheme does not have extra +entries.) I call these extra entries slots. Basically, a slot is a +specially formatted directory entry which holds up to 13 characters of +a file's extended name. Think of slots as additional labeling for the +directory entry of the file to which they correspond. Microsoft +prefers to refer to the 8.3 entry for a file as its alias and the +extended slot directory entries as the file name. + +The C structure for a slot directory entry follows: + + struct slot { // Up to 13 characters of a long name + unsigned char id; // sequence number for slot + unsigned char name0_4[10]; // first 5 characters in name + unsigned char attr; // attribute byte + unsigned char reserved; // always 0 + unsigned char alias_checksum; // checksum for 8.3 alias + unsigned char name5_10[12]; // 6 more characters in name + unsigned char start[2]; // starting cluster number + unsigned char name11_12[4]; // last 2 characters in name + }; + +If the layout of the slots looks a little odd, it's only +because of Microsoft's efforts to maintain compatibility with old +software. The slots must be disguised to prevent old software from +panicking. To this end, a number of measures are taken: + + 1) The attribute byte for a slot directory entry is always set + to 0x0f. This corresponds to an old directory entry with + attributes of "hidden", "system", "read-only", and "volume + label". Most old software will ignore any directory + entries with the "volume label" bit set. Real volume label + entries don't have the other three bits set. + + 2) The starting cluster is always set to 0, an impossible + value for a DOS file. + +Because the extended FAT system is backward compatible, it is +possible for old software to modify directory entries. Measures must +be taken to ensure the validity of slots. An extended FAT system can +verify that a slot does in fact belong to an 8.3 directory entry by +the following: + + 1) Positioning. Slots for a file always immediately proceed + their corresponding 8.3 directory entry. In addition, each + slot has an id which marks its order in the extended file + name. Here is a very abbreviated view of an 8.3 directory + entry and its corresponding long name slots for the file + "My Big File.Extension which is long": + + <proceeding files...> + <slot #3, id = 0x43, characters = "h is long"> + <slot #2, id = 0x02, characters = "xtension whic"> + <slot #1, id = 0x01, characters = "My Big File.E"> + <directory entry, name = "MYBIGFIL.EXT"> + + Note that the slots are stored from last to first. Slots + are numbered from 1 to N. The Nth slot is or'ed with 0x40 + to mark it as the last one. + + 2) Checksum. Each slot has an "alias_checksum" value. The + checksum is calculated from the 8.3 name using the + following algorithm: + + for (sum = i = 0; i < 11; i++) { + sum = (((sum&1)<<7)|((sum&0xfe)>>1)) + name[i] + } + + 3) If there is free space in the final slot, a Unicode NULL (0x0000) + is stored after the final character. After that, all unused + characters in the final slot are set to Unicode 0xFFFF. + +Finally, note that the extended name is stored in Unicode. Each Unicode +character takes two bytes. diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt new file mode 100644 index 000000000000..3f318dd44c77 --- /dev/null +++ b/Documentation/filesystems/vfs.txt @@ -0,0 +1,671 @@ +/* -*- auto-fill -*- */ + + Overview of the Virtual File System + + Richard Gooch <rgooch@atnf.csiro.au> + + 5-JUL-1999 + + +Conventions used in this document <section> +================================= + +Each section in this document will have the string "<section>" at the +right-hand side of the section title. Each subsection will have +"<subsection>" at the right-hand side. These strings are meant to make +it easier to search through the document. + +NOTE that the master copy of this document is available online at: +http://www.atnf.csiro.au/~rgooch/linux/docs/vfs.txt + + +What is it? <section> +=========== + +The Virtual File System (otherwise known as the Virtual Filesystem +Switch) is the software layer in the kernel that provides the +filesystem interface to userspace programs. It also provides an +abstraction within the kernel which allows different filesystem +implementations to co-exist. + + +A Quick Look At How It Works <section> +============================ + +In this section I'll briefly describe how things work, before +launching into the details. I'll start with describing what happens +when user programs open and manipulate files, and then look from the +other view which is how a filesystem is supported and subsequently +mounted. + +Opening a File <subsection> +-------------- + +The VFS implements the open(2), stat(2), chmod(2) and similar system +calls. The pathname argument is used by the VFS to search through the +directory entry cache (dentry cache or "dcache"). This provides a very +fast look-up mechanism to translate a pathname (filename) into a +specific dentry. + +An individual dentry usually has a pointer to an inode. Inodes are the +things that live on disc drives, and can be regular files (you know: +those things that you write data into), directories, FIFOs and other +beasts. Dentries live in RAM and are never saved to disc: they exist +only for performance. Inodes live on disc and are copied into memory +when required. Later any changes are written back to disc. The inode +that lives in RAM is a VFS inode, and it is this which the dentry +points to. A single inode can be pointed to by multiple dentries +(think about hardlinks). + +The dcache is meant to be a view into your entire filespace. Unlike +Linus, most of us losers can't fit enough dentries into RAM to cover +all of our filespace, so the dcache has bits missing. In order to +resolve your pathname into a dentry, the VFS may have to resort to +creating dentries along the way, and then loading the inode. This is +done by looking up the inode. + +To look up an inode (usually read from disc) requires that the VFS +calls the lookup() method of the parent directory inode. This method +is installed by the specific filesystem implementation that the inode +lives in. There will be more on this later. + +Once the VFS has the required dentry (and hence the inode), we can do +all those boring things like open(2) the file, or stat(2) it to peek +at the inode data. The stat(2) operation is fairly simple: once the +VFS has the dentry, it peeks at the inode data and passes some of it +back to userspace. + +Opening a file requires another operation: allocation of a file +structure (this is the kernel-side implementation of file +descriptors). The freshly allocated file structure is initialised with +a pointer to the dentry and a set of file operation member functions. +These are taken from the inode data. The open() file method is then +called so the specific filesystem implementation can do it's work. You +can see that this is another switch performed by the VFS. + +The file structure is placed into the file descriptor table for the +process. + +Reading, writing and closing files (and other assorted VFS operations) +is done by using the userspace file descriptor to grab the appropriate +file structure, and then calling the required file structure method +function to do whatever is required. + +For as long as the file is open, it keeps the dentry "open" (in use), +which in turn means that the VFS inode is still in use. + +All VFS system calls (i.e. open(2), stat(2), read(2), write(2), +chmod(2) and so on) are called from a process context. You should +assume that these calls are made without any kernel locks being +held. This means that the processes may be executing the same piece of +filesystem or driver code at the same time, on different +processors. You should ensure that access to shared resources is +protected by appropriate locks. + +Registering and Mounting a Filesystem <subsection> +------------------------------------- + +If you want to support a new kind of filesystem in the kernel, all you +need to do is call register_filesystem(). You pass a structure +describing the filesystem implementation (struct file_system_type) +which is then added to an internal table of supported filesystems. You +can do: + +% cat /proc/filesystems + +to see what filesystems are currently available on your system. + +When a request is made to mount a block device onto a directory in +your filespace the VFS will call the appropriate method for the +specific filesystem. The dentry for the mount point will then be +updated to point to the root inode for the new filesystem. + +It's now time to look at things in more detail. + + +struct file_system_type <section> +======================= + +This describes the filesystem. As of kernel 2.1.99, the following +members are defined: + +struct file_system_type { + const char *name; + int fs_flags; + struct super_block *(*read_super) (struct super_block *, void *, int); + struct file_system_type * next; +}; + + name: the name of the filesystem type, such as "ext2", "iso9660", + "msdos" and so on + + fs_flags: various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.) + + read_super: the method to call when a new instance of this + filesystem should be mounted + + next: for internal VFS use: you should initialise this to NULL + +The read_super() method has the following arguments: + + struct super_block *sb: the superblock structure. This is partially + initialised by the VFS and the rest must be initialised by the + read_super() method + + void *data: arbitrary mount options, usually comes as an ASCII + string + + int silent: whether or not to be silent on error + +The read_super() method must determine if the block device specified +in the superblock contains a filesystem of the type the method +supports. On success the method returns the superblock pointer, on +failure it returns NULL. + +The most interesting member of the superblock structure that the +read_super() method fills in is the "s_op" field. This is a pointer to +a "struct super_operations" which describes the next level of the +filesystem implementation. + + +struct super_operations <section> +======================= + +This describes how the VFS can manipulate the superblock of your +filesystem. As of kernel 2.1.99, the following members are defined: + +struct super_operations { + void (*read_inode) (struct inode *); + int (*write_inode) (struct inode *, int); + void (*put_inode) (struct inode *); + void (*drop_inode) (struct inode *); + void (*delete_inode) (struct inode *); + int (*notify_change) (struct dentry *, struct iattr *); + void (*put_super) (struct super_block *); + void (*write_super) (struct super_block *); + int (*statfs) (struct super_block *, struct statfs *, int); + int (*remount_fs) (struct super_block *, int *, char *); + void (*clear_inode) (struct inode *); +}; + +All methods are called without any locks being held, unless otherwise +noted. This means that most methods can block safely. All methods are +only called from a process context (i.e. not from an interrupt handler +or bottom half). + + read_inode: this method is called to read a specific inode from the + mounted filesystem. The "i_ino" member in the "struct inode" + will be initialised by the VFS to indicate which inode to + read. Other members are filled in by this method + + write_inode: this method is called when the VFS needs to write an + inode to disc. The second parameter indicates whether the write + should be synchronous or not, not all filesystems check this flag. + + put_inode: called when the VFS inode is removed from the inode + cache. This method is optional + + drop_inode: called when the last access to the inode is dropped, + with the inode_lock spinlock held. + + This method should be either NULL (normal unix filesystem + semantics) or "generic_delete_inode" (for filesystems that do not + want to cache inodes - causing "delete_inode" to always be + called regardless of the value of i_nlink) + + The "generic_delete_inode()" behaviour is equivalent to the + old practice of using "force_delete" in the put_inode() case, + but does not have the races that the "force_delete()" approach + had. + + delete_inode: called when the VFS wants to delete an inode + + notify_change: called when VFS inode attributes are changed. If this + is NULL the VFS falls back to the write_inode() method. This + is called with the kernel lock held + + put_super: called when the VFS wishes to free the superblock + (i.e. unmount). This is called with the superblock lock held + + write_super: called when the VFS superblock needs to be written to + disc. This method is optional + + statfs: called when the VFS needs to get filesystem statistics. This + is called with the kernel lock held + + remount_fs: called when the filesystem is remounted. This is called + with the kernel lock held + + clear_inode: called then the VFS clears the inode. Optional + +The read_inode() method is responsible for filling in the "i_op" +field. This is a pointer to a "struct inode_operations" which +describes the methods that can be performed on individual inodes. + + +struct inode_operations <section> +======================= + +This describes how the VFS can manipulate an inode in your +filesystem. As of kernel 2.1.99, the following members are defined: + +struct inode_operations { + struct file_operations * default_file_ops; + int (*create) (struct inode *,struct dentry *,int); + int (*lookup) (struct inode *,struct dentry *); + int (*link) (struct dentry *,struct inode *,struct dentry *); + int (*unlink) (struct inode *,struct dentry *); + int (*symlink) (struct inode *,struct dentry *,const char *); + int (*mkdir) (struct inode *,struct dentry *,int); + int (*rmdir) (struct inode *,struct dentry *); + int (*mknod) (struct inode *,struct dentry *,int,dev_t); + int (*rename) (struct inode *, struct dentry *, + struct inode *, struct dentry *); + int (*readlink) (struct dentry *, char *,int); + struct dentry * (*follow_link) (struct dentry *, struct dentry *); + int (*readpage) (struct file *, struct page *); + int (*writepage) (struct page *page, struct writeback_control *wbc); + int (*bmap) (struct inode *,int); + void (*truncate) (struct inode *); + int (*permission) (struct inode *, int); + int (*smap) (struct inode *,int); + int (*updatepage) (struct file *, struct page *, const char *, + unsigned long, unsigned int, int); + int (*revalidate) (struct dentry *); +}; + +Again, all methods are called without any locks being held, unless +otherwise noted. + + default_file_ops: this is a pointer to a "struct file_operations" + which describes how to open and then manipulate open files + + create: called by the open(2) and creat(2) system calls. Only + required if you want to support regular files. The dentry you + get should not have an inode (i.e. it should be a negative + dentry). Here you will probably call d_instantiate() with the + dentry and the newly created inode + + lookup: called when the VFS needs to look up an inode in a parent + directory. The name to look for is found in the dentry. This + method must call d_add() to insert the found inode into the + dentry. The "i_count" field in the inode structure should be + incremented. If the named inode does not exist a NULL inode + should be inserted into the dentry (this is called a negative + dentry). Returning an error code from this routine must only + be done on a real error, otherwise creating inodes with system + calls like create(2), mknod(2), mkdir(2) and so on will fail. + If you wish to overload the dentry methods then you should + initialise the "d_dop" field in the dentry; this is a pointer + to a struct "dentry_operations". + This method is called with the directory inode semaphore held + + link: called by the link(2) system call. Only required if you want + to support hard links. You will probably need to call + d_instantiate() just as you would in the create() method + + unlink: called by the unlink(2) system call. Only required if you + want to support deleting inodes + + symlink: called by the symlink(2) system call. Only required if you + want to support symlinks. You will probably need to call + d_instantiate() just as you would in the create() method + + mkdir: called by the mkdir(2) system call. Only required if you want + to support creating subdirectories. You will probably need to + call d_instantiate() just as you would in the create() method + + rmdir: called by the rmdir(2) system call. Only required if you want + to support deleting subdirectories + + mknod: called by the mknod(2) system call to create a device (char, + block) inode or a named pipe (FIFO) or socket. Only required + if you want to support creating these types of inodes. You + will probably need to call d_instantiate() just as you would + in the create() method + + readlink: called by the readlink(2) system call. Only required if + you want to support reading symbolic links + + follow_link: called by the VFS to follow a symbolic link to the + inode it points to. Only required if you want to support + symbolic links + + +struct file_operations <section> +====================== + +This describes how the VFS can manipulate an open file. As of kernel +2.1.99, the following members are defined: + +struct file_operations { + loff_t (*llseek) (struct file *, loff_t, int); + ssize_t (*read) (struct file *, char *, size_t, loff_t *); + ssize_t (*write) (struct file *, const char *, size_t, loff_t *); + int (*readdir) (struct file *, void *, filldir_t); + unsigned int (*poll) (struct file *, struct poll_table_struct *); + int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); + int (*mmap) (struct file *, struct vm_area_struct *); + int (*open) (struct inode *, struct file *); + int (*release) (struct inode *, struct file *); + int (*fsync) (struct file *, struct dentry *); + int (*fasync) (struct file *, int); + int (*check_media_change) (kdev_t dev); + int (*revalidate) (kdev_t dev); + int (*lock) (struct file *, int, struct file_lock *); +}; + +Again, all methods are called without any locks being held, unless +otherwise noted. + + llseek: called when the VFS needs to move the file position index + + read: called by read(2) and related system calls + + write: called by write(2) and related system calls + + readdir: called when the VFS needs to read the directory contents + + poll: called by the VFS when a process wants to check if there is + activity on this file and (optionally) go to sleep until there + is activity. Called by the select(2) and poll(2) system calls + + ioctl: called by the ioctl(2) system call + + mmap: called by the mmap(2) system call + + open: called by the VFS when an inode should be opened. When the VFS + opens a file, it creates a new "struct file" and initialises + the "f_op" file operations member with the "default_file_ops" + field in the inode structure. It then calls the open method + for the newly allocated file structure. You might think that + the open method really belongs in "struct inode_operations", + and you may be right. I think it's done the way it is because + it makes filesystems simpler to implement. The open() method + is a good place to initialise the "private_data" member in the + file structure if you want to point to a device structure + + release: called when the last reference to an open file is closed + + fsync: called by the fsync(2) system call + + fasync: called by the fcntl(2) system call when asynchronous + (non-blocking) mode is enabled for a file + +Note that the file operations are implemented by the specific +filesystem in which the inode resides. When opening a device node +(character or block special) most filesystems will call special +support routines in the VFS which will locate the required device +driver information. These support routines replace the filesystem file +operations with those for the device driver, and then proceed to call +the new open() method for the file. This is how opening a device file +in the filesystem eventually ends up calling the device driver open() +method. Note the devfs (the Device FileSystem) has a more direct path +from device node to device driver (this is an unofficial kernel +patch). + + +Directory Entry Cache (dcache) <section> +------------------------------ + +struct dentry_operations +======================== + +This describes how a filesystem can overload the standard dentry +operations. Dentries and the dcache are the domain of the VFS and the +individual filesystem implementations. Device drivers have no business +here. These methods may be set to NULL, as they are either optional or +the VFS uses a default. As of kernel 2.1.99, the following members are +defined: + +struct dentry_operations { + int (*d_revalidate)(struct dentry *); + int (*d_hash) (struct dentry *, struct qstr *); + int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); + void (*d_delete)(struct dentry *); + void (*d_release)(struct dentry *); + void (*d_iput)(struct dentry *, struct inode *); +}; + + d_revalidate: called when the VFS needs to revalidate a dentry. This + is called whenever a name look-up finds a dentry in the + dcache. Most filesystems leave this as NULL, because all their + dentries in the dcache are valid + + d_hash: called when the VFS adds a dentry to the hash table + + d_compare: called when a dentry should be compared with another + + d_delete: called when the last reference to a dentry is + deleted. This means no-one is using the dentry, however it is + still valid and in the dcache + + d_release: called when a dentry is really deallocated + + d_iput: called when a dentry loses its inode (just prior to its + being deallocated). The default when this is NULL is that the + VFS calls iput(). If you define this method, you must call + iput() yourself + +Each dentry has a pointer to its parent dentry, as well as a hash list +of child dentries. Child dentries are basically like files in a +directory. + +Directory Entry Cache APIs +-------------------------- + +There are a number of functions defined which permit a filesystem to +manipulate dentries: + + dget: open a new handle for an existing dentry (this just increments + the usage count) + + dput: close a handle for a dentry (decrements the usage count). If + the usage count drops to 0, the "d_delete" method is called + and the dentry is placed on the unused list if the dentry is + still in its parents hash list. Putting the dentry on the + unused list just means that if the system needs some RAM, it + goes through the unused list of dentries and deallocates them. + If the dentry has already been unhashed and the usage count + drops to 0, in this case the dentry is deallocated after the + "d_delete" method is called + + d_drop: this unhashes a dentry from its parents hash list. A + subsequent call to dput() will dellocate the dentry if its + usage count drops to 0 + + d_delete: delete a dentry. If there are no other open references to + the dentry then the dentry is turned into a negative dentry + (the d_iput() method is called). If there are other + references, then d_drop() is called instead + + d_add: add a dentry to its parents hash list and then calls + d_instantiate() + + d_instantiate: add a dentry to the alias hash list for the inode and + updates the "d_inode" member. The "i_count" member in the + inode structure should be set/incremented. If the inode + pointer is NULL, the dentry is called a "negative + dentry". This function is commonly called when an inode is + created for an existing negative dentry + + d_lookup: look up a dentry given its parent and path name component + It looks up the child of that given name from the dcache + hash table. If it is found, the reference count is incremented + and the dentry is returned. The caller must use d_put() + to free the dentry when it finishes using it. + + +RCU-based dcache locking model +------------------------------ + +On many workloads, the most common operation on dcache is +to look up a dentry, given a parent dentry and the name +of the child. Typically, for every open(), stat() etc., +the dentry corresponding to the pathname will be looked +up by walking the tree starting with the first component +of the pathname and using that dentry along with the next +component to look up the next level and so on. Since it +is a frequent operation for workloads like multiuser +environments and webservers, it is important to optimize +this path. + +Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus +in every component during path look-up. Since 2.5.10 onwards, +fastwalk algorithm changed this by holding the dcache_lock +at the beginning and walking as many cached path component +dentries as possible. This signficantly decreases the number +of acquisition of dcache_lock. However it also increases the +lock hold time signficantly and affects performance in large +SMP machines. Since 2.5.62 kernel, dcache has been using +a new locking model that uses RCU to make dcache look-up +lock-free. + +The current dcache locking model is not very different from the existing +dcache locking model. Prior to 2.5.62 kernel, dcache_lock +protected the hash chain, d_child, d_alias, d_lru lists as well +as d_inode and several other things like mount look-up. RCU-based +changes affect only the way the hash chain is protected. For everything +else the dcache_lock must be taken for both traversing as well as +updating. The hash chain updations too take the dcache_lock. +The significant change is the way d_lookup traverses the hash chain, +it doesn't acquire the dcache_lock for this and rely on RCU to +ensure that the dentry has not been *freed*. + + +Dcache locking details +---------------------- +For many multi-user workloads, open() and stat() on files are +very frequently occurring operations. Both involve walking +of path names to find the dentry corresponding to the +concerned file. In 2.4 kernel, dcache_lock was held +during look-up of each path component. Contention and +cacheline bouncing of this global lock caused significant +scalability problems. With the introduction of RCU +in linux kernel, this was worked around by making +the look-up of path components during path walking lock-free. + + +Safe lock-free look-up of dcache hash table +=========================================== + +Dcache is a complex data structure with the hash table entries +also linked together in other lists. In 2.4 kernel, dcache_lock +protected all the lists. We applied RCU only on hash chain +walking. The rest of the lists are still protected by dcache_lock. +Some of the important changes are : + +1. The deletion from hash chain is done using hlist_del_rcu() macro which + doesn't initialize next pointer of the deleted dentry and this + allows us to walk safely lock-free while a deletion is happening. + +2. Insertion of a dentry into the hash table is done using + hlist_add_head_rcu() which take care of ordering the writes - + the writes to the dentry must be visible before the dentry + is inserted. This works in conjuction with hlist_for_each_rcu() + while walking the hash chain. The only requirement is that + all initialization to the dentry must be done before hlist_add_head_rcu() + since we don't have dcache_lock protection while traversing + the hash chain. This isn't different from the existing code. + +3. The dentry looked up without holding dcache_lock by cannot be + returned for walking if it is unhashed. It then may have a NULL + d_inode or other bogosity since RCU doesn't protect the other + fields in the dentry. We therefore use a flag DCACHE_UNHASHED to + indicate unhashed dentries and use this in conjunction with a + per-dentry lock (d_lock). Once looked up without the dcache_lock, + we acquire the per-dentry lock (d_lock) and check if the + dentry is unhashed. If so, the look-up is failed. If not, the + reference count of the dentry is increased and the dentry is returned. + +4. Once a dentry is looked up, it must be ensured during the path + walk for that component it doesn't go away. In pre-2.5.10 code, + this was done holding a reference to the dentry. dcache_rcu does + the same. In some sense, dcache_rcu path walking looks like + the pre-2.5.10 version. + +5. All dentry hash chain updations must take the dcache_lock as well as + the per-dentry lock in that order. dput() does this to ensure + that a dentry that has just been looked up in another CPU + doesn't get deleted before dget() can be done on it. + +6. There are several ways to do reference counting of RCU protected + objects. One such example is in ipv4 route cache where + deferred freeing (using call_rcu()) is done as soon as + the reference count goes to zero. This cannot be done in + the case of dentries because tearing down of dentries + require blocking (dentry_iput()) which isn't supported from + RCU callbacks. Instead, tearing down of dentries happen + synchronously in dput(), but actual freeing happens later + when RCU grace period is over. This allows safe lock-free + walking of the hash chains, but a matched dentry may have + been partially torn down. The checking of DCACHE_UNHASHED + flag with d_lock held detects such dentries and prevents + them from being returned from look-up. + + +Maintaining POSIX rename semantics +================================== + +Since look-up of dentries is lock-free, it can race against +a concurrent rename operation. For example, during rename +of file A to B, look-up of either A or B must succeed. +So, if look-up of B happens after A has been removed from the +hash chain but not added to the new hash chain, it may fail. +Also, a comparison while the name is being written concurrently +by a rename may result in false positive matches violating +rename semantics. Issues related to race with rename are +handled as described below : + +1. Look-up can be done in two ways - d_lookup() which is safe + from simultaneous renames and __d_lookup() which is not. + If __d_lookup() fails, it must be followed up by a d_lookup() + to correctly determine whether a dentry is in the hash table + or not. d_lookup() protects look-ups using a sequence + lock (rename_lock). + +2. The name associated with a dentry (d_name) may be changed if + a rename is allowed to happen simultaneously. To avoid memcmp() + in __d_lookup() go out of bounds due to a rename and false + positive comparison, the name comparison is done while holding the + per-dentry lock. This prevents concurrent renames during this + operation. + +3. Hash table walking during look-up may move to a different bucket as + the current dentry is moved to a different bucket due to rename. + But we use hlists in dcache hash table and they are null-terminated. + So, even if a dentry moves to a different bucket, hash chain + walk will terminate. [with a list_head list, it may not since + termination is when the list_head in the original bucket is reached]. + Since we redo the d_parent check and compare name while holding + d_lock, lock-free look-up will not race against d_move(). + +4. There can be a theoritical race when a dentry keeps coming back + to original bucket due to double moves. Due to this look-up may + consider that it has never moved and can end up in a infinite loop. + But this is not any worse that theoritical livelocks we already + have in the kernel. + + +Important guidelines for filesystem developers related to dcache_rcu +==================================================================== + +1. Existing dcache interfaces (pre-2.5.62) exported to filesystem + don't change. Only dcache internal implementation changes. However + filesystems *must not* delete from the dentry hash chains directly + using the list macros like allowed earlier. They must use dcache + APIs like d_drop() or __d_drop() depending on the situation. + +2. d_flags is now protected by a per-dentry lock (d_lock). All + access to d_flags must be protected by it. + +3. For a hashed dentry, checking of d_count needs to be protected + by d_lock. + + +Papers and other documentation on dcache locking +================================================ + +1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124). + +2. http://lse.sourceforge.net/locking/dcache/dcache.html diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt new file mode 100644 index 000000000000..c7d5d0c7067d --- /dev/null +++ b/Documentation/filesystems/xfs.txt @@ -0,0 +1,188 @@ + +The SGI XFS Filesystem +====================== + +XFS is a high performance journaling filesystem which originated +on the SGI IRIX platform. It is completely multi-threaded, can +support large files and large filesystems, extended attributes, +variable block sizes, is extent based, and makes extensive use of +Btrees (directories, extents, free space) to aid both performance +and scalability. + +Refer to the documentation at http://oss.sgi.com/projects/xfs/ +for further details. This implementation is on-disk compatible +with the IRIX version of XFS. + + +Mount Options +============= + +When mounting an XFS filesystem, the following options are accepted. + + biosize=size + Sets the preferred buffered I/O size (default size is 64K). + "size" must be expressed as the logarithm (base2) of the + desired I/O size. + Valid values for this option are 14 through 16, inclusive + (i.e. 16K, 32K, and 64K bytes). On machines with a 4K + pagesize, 13 (8K bytes) is also a valid size. + The preferred buffered I/O size can also be altered on an + individual file basis using the ioctl(2) system call. + + ikeep/noikeep + When inode clusters are emptied of inodes, keep them around + on the disk (ikeep) - this is the traditional XFS behaviour + and is still the default for now. Using the noikeep option, + inode clusters are returned to the free space pool. + + logbufs=value + Set the number of in-memory log buffers. Valid numbers range + from 2-8 inclusive. + The default value is 8 buffers for filesystems with a + blocksize of 64K, 4 buffers for filesystems with a blocksize + of 32K, 3 buffers for filesystems with a blocksize of 16K + and 2 buffers for all other configurations. Increasing the + number of buffers may increase performance on some workloads + at the cost of the memory used for the additional log buffers + and their associated control structures. + + logbsize=value + Set the size of each in-memory log buffer. + Size may be specified in bytes, or in kilobytes with a "k" suffix. + Valid sizes for version 1 and version 2 logs are 16384 (16k) and + 32768 (32k). Valid sizes for version 2 logs also include + 65536 (64k), 131072 (128k) and 262144 (256k). + The default value for machines with more than 32MB of memory + is 32768, machines with less memory use 16384 by default. + + logdev=device and rtdev=device + Use an external log (metadata journal) and/or real-time device. + An XFS filesystem has up to three parts: a data section, a log + section, and a real-time section. The real-time section is + optional, and the log section can be separate from the data + section or contained within it. + + noalign + Data allocations will not be aligned at stripe unit boundaries. + + noatime + Access timestamps are not updated when a file is read. + + norecovery + The filesystem will be mounted without running log recovery. + If the filesystem was not cleanly unmounted, it is likely to + be inconsistent when mounted in "norecovery" mode. + Some files or directories may not be accessible because of this. + Filesystems mounted "norecovery" must be mounted read-only or + the mount will fail. + + nouuid + Don't check for double mounted file systems using the file system uuid. + This is useful to mount LVM snapshot volumes. + + osyncisosync + Make O_SYNC writes implement true O_SYNC. WITHOUT this option, + Linux XFS behaves as if an "osyncisdsync" option is used, + which will make writes to files opened with the O_SYNC flag set + behave as if the O_DSYNC flag had been used instead. + This can result in better performance without compromising + data safety. + However if this option is not in effect, timestamp updates from + O_SYNC writes can be lost if the system crashes. + If timestamp updates are critical, use the osyncisosync option. + + quota/usrquota/uqnoenforce + User disk quota accounting enabled, and limits (optionally) + enforced. + + grpquota/gqnoenforce + Group disk quota accounting enabled and limits (optionally) + enforced. + + sunit=value and swidth=value + Used to specify the stripe unit and width for a RAID device or + a stripe volume. "value" must be specified in 512-byte block + units. + If this option is not specified and the filesystem was made on + a stripe volume or the stripe width or unit were specified for + the RAID device at mkfs time, then the mount system call will + restore the value from the superblock. For filesystems that + are made directly on RAID devices, these options can be used + to override the information in the superblock if the underlying + disk layout changes after the filesystem has been created. + The "swidth" option is required if the "sunit" option has been + specified, and must be a multiple of the "sunit" value. + +sysctls +======= + +The following sysctls are available for the XFS filesystem: + + fs.xfs.stats_clear (Min: 0 Default: 0 Max: 1) + Setting this to "1" clears accumulated XFS statistics + in /proc/fs/xfs/stat. It then immediately resets to "0". + + fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000) + The interval at which the xfssyncd thread flushes metadata + out to disk. This thread will flush log activity out, and + do some processing on unlinked inodes. + + fs.xfs.xfsbufd_centisecs (Min: 50 Default: 100 Max: 3000) + The interval at which xfsbufd scans the dirty metadata buffers list. + + fs.xfs.age_buffer_centisecs (Min: 100 Default: 1500 Max: 720000) + The age at which xfsbufd flushes dirty metadata buffers to disk. + + fs.xfs.error_level (Min: 0 Default: 3 Max: 11) + A volume knob for error reporting when internal errors occur. + This will generate detailed messages & backtraces for filesystem + shutdowns, for example. Current threshold values are: + + XFS_ERRLEVEL_OFF: 0 + XFS_ERRLEVEL_LOW: 1 + XFS_ERRLEVEL_HIGH: 5 + + fs.xfs.panic_mask (Min: 0 Default: 0 Max: 127) + Causes certain error conditions to call BUG(). Value is a bitmask; + AND together the tags which represent errors which should cause panics: + + XFS_NO_PTAG 0 + XFS_PTAG_IFLUSH 0x00000001 + XFS_PTAG_LOGRES 0x00000002 + XFS_PTAG_AILDELETE 0x00000004 + XFS_PTAG_ERROR_REPORT 0x00000008 + XFS_PTAG_SHUTDOWN_CORRUPT 0x00000010 + XFS_PTAG_SHUTDOWN_IOERROR 0x00000020 + XFS_PTAG_SHUTDOWN_LOGERROR 0x00000040 + + This option is intended for debugging only. + + fs.xfs.irix_symlink_mode (Min: 0 Default: 0 Max: 1) + Controls whether symlinks are created with mode 0777 (default) + or whether their mode is affected by the umask (irix mode). + + fs.xfs.irix_sgid_inherit (Min: 0 Default: 0 Max: 1) + Controls files created in SGID directories. + If the group ID of the new file does not match the effective group + ID or one of the supplementary group IDs of the parent dir, the + ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl + is set. + + fs.xfs.restrict_chown (Min: 0 Default: 1 Max: 1) + Controls whether unprivileged users can use chown to "give away" + a file to another user. + + fs.xfs.inherit_sync (Min: 0 Default: 1 Max 1) + Setting this to "1" will cause the "sync" flag set + by the chattr(1) command on a directory to be + inherited by files in that directory. + + fs.xfs.inherit_nodump (Min: 0 Default: 1 Max 1) + Setting this to "1" will cause the "nodump" flag set + by the chattr(1) command on a directory to be + inherited by files in that directory. + + fs.xfs.inherit_noatime (Min: 0 Default: 1 Max 1) + Setting this to "1" will cause the "noatime" flag set + by the chattr(1) command on a directory to be + inherited by files in that directory. |