File System Support Infrastructure
==================================

Nick Garnett
v0.2


This document describes the filesystem infrastructure provided in
eCos. This is implemented by the FILEIO package and provides POSIX
compliant file and IO operations together with the BSD socket
API. These APIs are described in the relevant standards and original
documentation and will not be described here. This document is,
instead, concerned with the interfaces presented to client
filesystems and network protocol stacks.

The FILEIO infrastructure consists mainly of a set of tables containing
pointers to the primary interface functions of a file system. This
approach avoids problems of namespace pollution (several filesystems
can have a function called read(), so long as they are static). The
system is also structured to eliminate the need for dynamic memory
allocation.

New filesystems can be written directly to the interfaces described
here. Existing filesystems can be ported very easily by the
introduction of a thin veneer porting layer that translates FILEIO
calls into native filesystem calls. 

The term filesystem should be read fairly loosely in this
document. Objects accessed through these interfaces could equally be
network protocol sockets, device drivers, fifos, message queues or any
other object that can present a file-like interface.

    
File System Table
-----------------

The filesystem table is an array of entries that describe each
filesystem implementation that is part of the system image. Each
resident filesystem should export an entry to this table using the
FSTAB_ENTRY() macro.

The table entries are described by the following structure:

struct cyg_fstab_entry
{
    const char          *name;          // filesystem name
    CYG_ADDRWORD        data;           // private data value
    cyg_uint32          syncmode;       // synchronization mode
    
    int     (*mount)    ( cyg_fstab_entry *fste, cyg_mtab_entry *mte );
    int     (*umount)   ( cyg_mtab_entry *mte );
    int     (*open)     ( cyg_mtab_entry *mte, cyg_dir dir, const char *name,
                          int mode,  cyg_file *fte );
    int     (*unlink)   ( cyg_mtab_entry *mte, cyg_dir dir, const char *name );
    int     (*mkdir)    ( cyg_mtab_entry *mte, cyg_dir dir, const char *name );
    int     (*rmdir)    ( cyg_mtab_entry *mte, cyg_dir dir, const char *name );
    int     (*rename)   ( cyg_mtab_entry *mte, cyg_dir dir1, const char *name1,
                          cyg_dir dir2, const char *name2 );
    int     (*link)     ( cyg_mtab_entry *mte, cyg_dir dir1, const char *name1,
                          cyg_dir dir2, const char *name2, int type );
    int     (*opendir)  ( cyg_mtab_entry *mte, cyg_dir dir, const char *name,
                          cyg_file *fte );
    int     (*chdir)    ( cyg_mtab_entry *mte, cyg_dir dir, const char *name,
                          cyg_dir *dir_out );
    int     (*stat)     ( cyg_mtab_entry *mte, cyg_dir dir, const char *name,
                          struct stat *buf);
    int     (*getinfo)  ( cyg_mtab_entry *mte, cyg_dir dir, const char *name,
                          int key, char *buf, int len );
    int     (*setinfo)  ( cyg_mtab_entry *mte, cyg_dir dir, const char *name,
                          int key, char *buf, int len );
};

The _name_ field points to a string that identifies this filesystem
implementation. Typical values might be "romfs", "msdos", "ext2" etc.

The _data_ field contains any private data that the filesystem needs,
perhaps the root of its data structures.

The _syncmode_ field contains a description of the locking protocol to
be used when accessing this filesystem. It will be described in more
detail in the "Synchronization" section.

The remaining fields are pointers to functions that implement
filesystem operations that apply to files and directories as whole
objects. The operation implemented by each function should be obvious
from the names, with a few exceptions.

The _opendir_ function opens a directory for reading. See the section
on Directories later for details.

The _getinfo_ and _setinfo_ functions provide support for various
minor control and information functions such as pathconf() and
access().

With the exception of the _mount_ and _umount_ functions, all of these
functions take three standard arguments: a pointer to a mount table
entry (see later), a directory pointer (also see later) and a file name
relative to the directory. These should be used by the filesystem to
locate the object of interest.
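
As an illustration of how the three standard arguments might be used,
the following is a minimal sketch of a name lookup in a toy in-memory
filesystem. The types toy_dir, toy_mtab_entry and toy_node and the
function toy_find() are hypothetical stand-ins invented for this
example; a real filesystem would use cyg_dir and cyg_mtab_entry from
the FILEIO headers and its own directory representation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins; the real cyg_dir and cyg_mtab_entry come from
   the FILEIO headers. */
typedef uintptr_t toy_dir;
struct toy_mtab_entry { toy_dir root; };

/* One node of a toy in-memory tree. */
struct toy_node {
    const char      *name;
    struct toy_node *child;     /* first entry, if this is a directory */
    struct toy_node *sibling;   /* next entry in the same directory */
};

/* Resolve "name" relative to "dir". Returns NULL if any component is
   missing. The mount table entry is unused by this toy filesystem. */
static struct toy_node *toy_find(struct toy_mtab_entry *mte, toy_dir dir,
                                 const char *name)
{
    struct toy_node *node = (struct toy_node *)dir;

    (void)mte;
    while (node != NULL && *name != '\0') {
        size_t len = strcspn(name, "/");
        struct toy_node *n;

        if (len == 0) {         /* skip a path separator */
            name++;
            continue;
        }
        /* Search the current directory for the next path component. */
        for (n = node->child; n != NULL; n = n->sibling)
            if (strlen(n->name) == len && strncmp(n->name, name, len) == 0)
                break;
        node = n;
        name += len;
    }
    return node;
}
```

Each of the whole-object operations (open, unlink, stat and so on)
could begin with a lookup of this kind before acting on the node that
it finds.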

Mount Table
-----------

The mount table records the filesystems that are actually active.
Its entries can be seen as being analogous to mount points in Unix
systems.

There are two sources of mount table entries. Filesystems (or other
components) may export static entries to the table using the
MTAB_ENTRY() macro. Alternatively, new entries may be installed at run
time using the mount() function. Both types of entry may be unmounted
with the umount() function.

A mount table entry has the following structure:

struct cyg_mtab_entry
{
    const char          *name;          // name of mount point
    const char          *fsname;        // name of implementing filesystem
    const char          *devname;       // name of hardware device
    CYG_ADDRWORD        data;           // private data value
    cyg_bool            valid;          // Valid entry?
    cyg_fstab_entry     *fs;            // pointer to fstab entry
    cyg_dir             root;           // root directory pointer
};

The _name_ field identifies the mount point. This is used to translate
rooted filenames (filenames that begin with "/") into the correct
filesystem. When a file name that begins with "/" is submitted, it is
matched against the _name_ fields of all valid mount table
entries. The entry that yields the longest match terminating before a
"/", or end of string, wins. The appropriate function from the
filesystem table entry is then passed the remainder of the file name,
together with a pointer to the mount table entry and the value of the
_root_ field as the directory pointer.

For example, consider a mount table that contains the following
entries:

	{ "/",    "msdos", "/dev/hd0", ... }
	{ "/fd",  "msdos", "/dev/fd0", ... }
	{ "/rom", "romfs", "", ... }
	{ "/tmp", "ramfs", "", ... }
	{ "/dev", "devfs", "", ... }

An attempt to open "/tmp/foo" would be directed to the RAM filesystem
while an open of "/bar/bundy" would be directed to the hard disc MSDOS
filesystem. Opening "/dev/tty0" would be directed to the device
management filesystem for lookup in the device table.
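
The matching rule can be sketched as follows. This is not the FILEIO
implementation; mini_mtab_entry is a cut-down stand-in for
cyg_mtab_entry and mini_mtab_match() is a hypothetical helper. How the
remainder of the name is normalised (here, by stripping the separator
that follows the mount point) is also an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for cyg_mtab_entry: only what the lookup needs. */
struct mini_mtab_entry {
    const char *name;    /* mount point, e.g. "/tmp" */
    const char *fsname;  /* implementing filesystem */
    int         valid;   /* non-zero once mounted */
};

/* Return the valid entry whose name is the longest prefix of "path" that
   ends at a '/' or at the end of the string. The part of the path that
   follows the mount point is stored in *rest. */
static const struct mini_mtab_entry *
mini_mtab_match(const struct mini_mtab_entry *tab, size_t count,
                const char *path, const char **rest)
{
    const struct mini_mtab_entry *best = NULL;
    size_t best_len = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        const char *name = tab[i].name;
        size_t len;

        if (!tab[i].valid || name == NULL)
            continue;           /* invalid entries are ignored */
        len = strlen(name);
        if (strncmp(path, name, len) != 0)
            continue;
        /* "/" matches any rooted path; longer names must end on a
           component boundary, so "/tmp" does not match "/tmpfile". */
        if (len > 1 && path[len] != '/' && path[len] != '\0')
            continue;
        if (best == NULL || len > best_len) {
            best = &tab[i];
            best_len = len;
        }
    }
    if (best != NULL) {
        const char *p = path + best_len;

        while (*p == '/')
            p++;                /* strip the separator after the mount point */
        *rest = p;
    }
    return best;
}
```

With the example table above, "/tmp/foo" selects the "/tmp" entry with
a remainder of "foo", while "/tmpfile" falls through to the "/" entry.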

Unrooted file names (those that do not begin with a '/') are passed
straight to the filesystem of the current directory. The current
directory is
represented by a pair consisting of a mount table entry and a
directory pointer.

The _fsname_ field points to a string that should match the _name_
field of the implementing filesystem. During initialization the mount
table is scanned and the _fsname_ entries looked up in the
filesystem table. For each match, the filesystem's _mount_ function
is called and if successful the mount table entry is marked as valid
and the _fs_ pointer installed.

The _devname_ field contains the name of the device that this
filesystem is to use. This may match an entry in the device table (see
later) or may be a string that is specific to the filesystem if it has
its own internal device drivers.

The _data_ field is a private data value. This may be installed either
statically when the table entry is defined, or may be installed during
the _mount_ operation.

The _valid_ field indicates whether this mount point has actually been
mounted successfully. Entries with a false _valid_ field are ignored
when searching for a name match.

The _fs_ field is installed after a successful mount operation to
point to the implementing filesystem.

The _root_ field contains a directory pointer value that the
filesystem can interpret as the root of its directory tree. This is
passed as the _dir_ argument of filesystem functions that operate on
rooted filenames. This field must be initialized by the filesystem's
_mount_ function.


File Table
----------

Once a file has been opened it is represented by an open file
object. These are allocated from an array of available file
objects. User code accesses these open file objects via a second array
of pointers which is indexed by small integer offsets. This gives the
usual Unix file descriptor functionality, complete with the various
duplication mechanisms.
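
The relationship between the two arrays can be sketched as below. The
table sizes, the mini_ names and the error returns are all assumptions
made for this example; only the idea of a pool of file objects
referenced through an array of pointers, with a use count per object,
is taken from the description above.

```c
#include <stddef.h>

#define MINI_NFILE 4    /* hypothetical number of open file objects */
#define MINI_NFD   8    /* hypothetical number of descriptors */

struct mini_file {
    unsigned short f_ucount;    /* use count; 0 means the object is free */
};

static struct mini_file  mini_files[MINI_NFILE];  /* pool of file objects */
static struct mini_file *mini_fds[MINI_NFD];      /* descriptor -> object */

/* Bind a file object to the lowest free descriptor, or return -1. */
static int mini_fd_alloc(struct mini_file *fp)
{
    int fd;

    for (fd = 0; fd < MINI_NFD; fd++) {
        if (mini_fds[fd] == NULL) {
            mini_fds[fd] = fp;
            fp->f_ucount++;     /* one reference per descriptor */
            return fd;
        }
    }
    return -1;
}

/* Claim a free file object and a descriptor for it. */
static int mini_open(void)
{
    size_t i;

    for (i = 0; i < MINI_NFILE; i++)
        if (mini_files[i].f_ucount == 0)
            return mini_fd_alloc(&mini_files[i]);
    return -1;
}

/* Duplicate a descriptor: a second slot pointing at the same object. */
static int mini_dup(int fd)
{
    if (fd < 0 || fd >= MINI_NFD || mini_fds[fd] == NULL)
        return -1;
    return mini_fd_alloc(mini_fds[fd]);
}

/* Release a descriptor. A real implementation would invoke the file's
   close operation once the use count reaches zero. */
static int mini_close(int fd)
{
    struct mini_file *fp;

    if (fd < 0 || fd >= MINI_NFD || mini_fds[fd] == NULL)
        return -1;
    fp = mini_fds[fd];
    mini_fds[fd] = NULL;
    fp->f_ucount--;
    return 0;
}
```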

A file table entry has the following structure:

struct CYG_FILE_TAG
{
    cyg_uint32	                f_flag;		/* file state                   */
    cyg_uint16                  f_ucount;       /* use count                    */
    cyg_uint16                  f_type;		/* descriptor type              */
    cyg_uint32                  f_syncmode;     /* synchronization protocol     */
    struct CYG_FILEOPS_TAG      *f_ops;         /* file operations              */
    off_t       	        f_offset;       /* current offset               */
    CYG_ADDRWORD	        f_data;		/* file or socket               */
    CYG_ADDRWORD                f_xops;         /* extra type specific ops      */
    cyg_mtab_entry              *f_mte;         /* mount table entry            */
};

The _f_flag_ field contains some FILEIO control bits and some of the
bits from the open call (defined by CYG_FILE_MODE_MASK).

The _f_ucount_ field contains a use count that controls when a file
will be closed. Each duplicate in the file descriptor array counts for
one reference here and it is also incremented around each I/O
operation.

The _f_type_ field indicates the type of the underlying file
object. Some of the possible values here are CYG_FILE_TYPE_FILE,
CYG_FILE_TYPE_SOCKET or CYG_FILE_TYPE_DEVICE.

The _f_syncmode_ field is copied from the _syncmode_ field of the
implementing filesystem. Its use is described in the "Synchronization"
section later.

The _f_offset_ field records the current file position. It is the
responsibility of the file operation functions to keep this field up
to date.

The _f_data_ field contains private data placed here by the underlying
filesystem. Normally this will be a pointer to or handle on the
filesystem object that implements this file.

The _f_xops_ field contains a pointer to any extra type specific
operation functions. For example, the socket I/O system installs a
pointer to a table of functions that implement the standard socket
operations.

The _f_mte_ field contains a pointer to the parent mount table entry
for this file. It is used mainly to implement the synchronization
protocol. This may contain a pointer to some other data structure in
file objects not derived from a filesystem.

The _f_ops_ field contains a pointer to a table of file I/O
operations. This has the following structure:

struct CYG_FILEOPS_TAG
{
        int	(*fo_read)      (struct CYG_FILE_TAG *fp, struct CYG_UIO_TAG *uio);
        int	(*fo_write)     (struct CYG_FILE_TAG *fp, struct CYG_UIO_TAG *uio);
        int     (*fo_lseek)     (struct CYG_FILE_TAG *fp, off_t *pos, int whence );
        int	(*fo_ioctl)     (struct CYG_FILE_TAG *fp, CYG_ADDRWORD com,
                                 CYG_ADDRWORD data);
        int	(*fo_select)    (struct CYG_FILE_TAG *fp, int which, CYG_ADDRWORD info);
        int     (*fo_fsync)     (struct CYG_FILE_TAG *fp, int mode );        
        int	(*fo_close)     (struct CYG_FILE_TAG *fp);
        int     (*fo_fstat)     (struct CYG_FILE_TAG *fp, struct stat *buf );
        int     (*fo_getinfo)   (struct CYG_FILE_TAG *fp, int key, char *buf, int len );
        int     (*fo_setinfo)   (struct CYG_FILE_TAG *fp, int key, char *buf, int len );
};

It should be obvious from the names of most of these functions what
their responsibilities are. The _fo_getinfo_ and _fo_setinfo_
functions, like their counterparts in the filesystem structure,
implement minor control and info functions such as fpathconf().

The second argument to _fo_read_ and _fo_write_ is a pointer to a UIO
structure:

struct CYG_UIO_TAG
{
    struct CYG_IOVEC_TAG *uio_iov;	/* pointer to array of iovecs */
    int	                 uio_iovcnt;	/* number of iovecs in array */
    off_t       	 uio_offset;	/* offset into file this uio corresponds to */
    ssize_t     	 uio_resid;	/* residual i/o count */
    enum cyg_uio_seg     uio_segflg;    /* see above */
    enum cyg_uio_rw      uio_rw;        /* see above */
};

struct CYG_IOVEC_TAG
{
    void           *iov_base;           /* Base address. */
    ssize_t        iov_len;             /* Length. */
};

This structure encapsulates the parameters of any data transfer
operation. It provides support for scatter/gather operations and
records the progress of any data transfer. It is also compatible with
the I/O operations of any BSD-derived network stacks and filesystems.
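
To show how a read operation might consume a UIO structure, here is a
minimal sketch that copies from a memory-resident file image into the
caller's buffers. The mini_iovec and mini_uio types are cut-down
stand-ins for the structures above (the segment and direction flags
are omitted), and mini_read() is a hypothetical function, not part of
FILEIO.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>          /* ssize_t, off_t */

/* Cut-down stand-ins for CYG_IOVEC_TAG and CYG_UIO_TAG. */
struct mini_iovec {
    void    *iov_base;
    ssize_t  iov_len;
};

struct mini_uio {
    struct mini_iovec *uio_iov;
    int                uio_iovcnt;
    off_t              uio_offset;
    ssize_t            uio_resid;
};

/* Scatter "size" bytes of "data", starting at uio_offset, across the
   buffers in the uio. Progress is recorded by advancing uio_offset and
   reducing uio_resid. Returns 0 to indicate no error. */
static int mini_read(const char *data, size_t size, struct mini_uio *uio)
{
    int i;

    for (i = 0; i < uio->uio_iovcnt && uio->uio_resid > 0; i++) {
        struct mini_iovec *iov = &uio->uio_iov[i];
        size_t pos = (size_t)uio->uio_offset;
        size_t n = (size_t)iov->iov_len;

        if (pos >= size)
            break;                      /* end of file reached */
        if (n > size - pos)
            n = size - pos;             /* clip to the data remaining */
        if (n > (size_t)uio->uio_resid)
            n = (size_t)uio->uio_resid; /* clip to the bytes requested */
        memcpy(iov->iov_base, data + pos, n);
        uio->uio_offset += (off_t)n;
        uio->uio_resid -= (ssize_t)n;
    }
    return 0;
}
```

The caller can work out how many bytes were transferred by comparing
_uio_resid_ before and after the call.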


When a file is opened (or a file object created by some other means,
such as socket() or accept()) it is the responsibility of the
filesystem open operation to initialize all the fields of the object
except the _f_ucount_, _f_syncmode_ and _f_mte_ fields. Since the
_f_flag_ field will already contain bits belonging to the FILEIO
infrastructure, any changes to it must be made with the appropriate
logical operations.
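
A minimal sketch of this division of responsibility follows. The
mini_file structure mirrors the fields of the file table entry with
standard C types, and the two flag constants and the type value are
placeholders chosen for the example rather than the real FILEIO
definitions.

```c
#include <stdint.h>

#define MINI_FILE_MODE_MASK 0x0000000Fu /* placeholder for CYG_FILE_MODE_MASK */
#define MINI_FLAG_INTERNAL  0x00010000u /* placeholder infrastructure bit */

struct mini_file {
    uint32_t    f_flag;
    uint16_t    f_ucount;
    uint16_t    f_type;
    uint32_t    f_syncmode;
    const void *f_ops;
    long        f_offset;
    uintptr_t   f_data;
    uintptr_t   f_xops;
    void       *f_mte;
};

/* Tail end of a hypothetical filesystem open operation, once the object
   identified by "node" has been located. */
static int mini_fs_open(struct mini_file *fp, unsigned mode,
                        const void *ops, uintptr_t node)
{
    /* Merge the mode bits in with OR; a plain assignment would destroy
       the bits that the infrastructure has already set. */
    fp->f_flag  |= mode & MINI_FILE_MODE_MASK;
    fp->f_type   = 1;           /* placeholder for CYG_FILE_TYPE_FILE */
    fp->f_ops    = ops;
    fp->f_offset = 0;
    fp->f_data   = node;
    fp->f_xops   = 0;
    /* f_ucount, f_syncmode and f_mte are left to the infrastructure. */
    return 0;
}
```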


Directories
-----------

Filesystem operations all take a directory pointer as one of their
arguments.  A directory pointer is an opaque handle managed by the
filesystem. It should encapsulate a reference to a specific directory
within the filesystem. For example, it may be a pointer to the data
structure that represents that directory, or a pointer to a pathname
for the directory.

The _chdir_ filesystem function has two modes of use. When passed a
pointer in the _dir_out_ argument, it should locate the named
directory and place a directory pointer there. If the _dir_out_
argument is NULL then the _dir_ argument is a previously generated
directory pointer that can now be disposed of. When the infrastructure
is implementing the chdir() function it makes two calls to filesystem
_chdir_ functions. The first is to get a directory pointer for the new
current directory. If this succeeds the second is to dispose of the
old current directory pointer.
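
The two-call sequence can be sketched as follows. The toy filesystem
here keeps a reference count per directory so that the effect of the
disposal call is visible; the mini_ names, the flat directory list and
the use of ENOENT as the failure code are assumptions for this example
only, and the mount table entry argument is omitted for brevity.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uintptr_t mini_dir;             /* stand-in for cyg_dir */

/* Toy filesystem state: a flat list of reference-counted directories. */
struct mini_dirnode {
    const char *name;
    int         refs;
};

static struct mini_dirnode mini_dirs[] = { { "a", 0 }, { "b", 0 } };

/* Filesystem side: locate mode when dir_out is non-NULL, dispose mode
   when it is NULL. */
static int mini_fs_chdir(mini_dir dir, const char *name, mini_dir *dir_out)
{
    size_t i;

    if (dir_out == NULL) {
        ((struct mini_dirnode *)dir)->refs--;   /* dispose of "dir" */
        return 0;
    }
    for (i = 0; i < sizeof mini_dirs / sizeof mini_dirs[0]; i++) {
        if (strcmp(mini_dirs[i].name, name) == 0) {
            mini_dirs[i].refs++;
            *dir_out = (mini_dir)&mini_dirs[i];
            return 0;
        }
    }
    return ENOENT;
}

static mini_dir mini_cwd;               /* 0 means no current directory */

/* Infrastructure side: obtain the new pointer first and dispose of the
   old one only if that succeeded. */
static int mini_chdir(const char *name)
{
    mini_dir newdir;
    int err = mini_fs_chdir(mini_cwd, name, &newdir);

    if (err != 0)
        return err;             /* the old current directory is kept */
    if (mini_cwd != 0)
        mini_fs_chdir(mini_cwd, NULL, NULL);
    mini_cwd = newdir;
    return 0;
}
```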

The _opendir_ function is used to open a directory for reading. This
results in an open file object that can be read to return a sequence
of _struct dirent_ objects. The only operations that are allowed on
this file are _read_, _lseek_ and _close_. Each read operation on this
file should return a single _struct dirent_ object. When the end of
the directory is reached, zero should be returned. The only seek
operation allowed is a rewind to the start of the directory, by
supplying an offset of zero and a _whence_ specifier of _SEEK_SET_.
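
The read and seek rules for a directory file can be sketched as
follows. For brevity the read function here fills in a _struct dirent_
directly rather than going through a UIO structure, and it reports the
end of the directory as a transfer of zero bytes, which is one reading
of the rule above. The mini_dirfile type, both functions and the
choice of EINVAL for a rejected seek are assumptions.

```c
#include <dirent.h>
#include <errno.h>
#include <stdio.h>              /* SEEK_SET */
#include <string.h>
#include <sys/types.h>

/* Hypothetical open-directory state. */
struct mini_dirfile {
    const char *const *names;   /* entries in the directory */
    size_t             count;   /* number of entries */
    off_t              f_offset;/* index of the next entry to return */
};

/* Return exactly one entry per call; *nread is 0 at end of directory. */
static int mini_dir_read(struct mini_dirfile *fp, struct dirent *ent,
                         size_t *nread)
{
    if ((size_t)fp->f_offset >= fp->count) {
        *nread = 0;
        return 0;
    }
    memset(ent, 0, sizeof *ent);
    /* The memset above guarantees termination if the name is truncated. */
    strncpy(ent->d_name, fp->names[fp->f_offset], sizeof ent->d_name - 1);
    fp->f_offset++;
    *nread = sizeof *ent;
    return 0;
}

/* Only a rewind to the start of the directory is accepted. */
static int mini_dir_lseek(struct mini_dirfile *fp, off_t *pos, int whence)
{
    if (whence != SEEK_SET || *pos != 0)
        return EINVAL;
    fp->f_offset = 0;
    return 0;
}
```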

Most of these considerations are invisible to clients of a filesystem
since they will access directories via the POSIX
opendir()/readdir()/closedir() functions.

Support for the _getcwd()_ function is provided by three mechanisms.
The first is to use the _FS_INFO_GETCWD_ getinfo key on the filesystem
to use any internal support that it has for this. If that fails it
falls back on one of the two other mechanisms. If
_CYGPKG_IO_FILEIO_TRACK_CWD_ is set then the current directory is
tracked textually in chdir() and the result of that is reported in
getcwd(). Otherwise an attempt is made to traverse the directory tree
to its root using ".." entries.

This last option is complicated and expensive, and relies on the
filesystem supporting "." and ".."  entries. This is not always the
case, particularly if the filesystem has been ported from a
non-UNIX-compatible source. Tracking the pathname textually will
usually work, but might not produce optimum results when symbolic
links are being used.


Synchronization
---------------

The FILEIO infrastructure provides a synchronization mechanism for
controlling concurrent access to filesystems. This allows existing
filesystems to be ported to eCos, even if they do not have their own
synchronization mechanisms. It also allows new filesystems to be
implemented easily without having to consider the synchronization
issues.

The infrastructure maintains a mutex for each entry in each of
the main tables: filesystem table, mount table and file table. For
each class of operation each of these mutexes may be locked before the
corresponding filesystem operation is invoked.

The synchronization protocol implemented by a filesystem is described
by the _syncmode_ field of the filesystem table entry. This is a
combination of the following flags:

CYG_SYNCMODE_FILE_FILESYSTEM Lock the filesystem table entry mutex
			     during all filesystem level operations.

CYG_SYNCMODE_FILE_MOUNTPOINT Lock the mount table entry mutex
			     during all filesystem level operations.

CYG_SYNCMODE_IO_FILE	     Lock the file table entry mutex during all
			     I/O operations.

CYG_SYNCMODE_IO_FILESYSTEM   Lock the filesystem table entry mutex
			     during all I/O operations.
			     
CYG_SYNCMODE_IO_MOUNTPOINT   Lock the mount table entry mutex during
			     all I/O operations.

CYG_SYNCMODE_SOCK_FILE       Lock the file table entry mutex during
			     all socket operations.

CYG_SYNCMODE_SOCK_NETSTACK   Lock the network stack table entry mutex
			     during all socket operations.

CYG_SYNCMODE_NONE	     Perform no locking at all during any
			     operations.


The value of the _syncmode_ in the filesystem table entry will be
copied by the infrastructure to the open file object after a
successful open() operation.
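
The way the flags select mutexes for each class of operation can be
sketched as below. The numeric flag values, the MINI_ names and the
mini_locks_needed() helper are placeholders for this example; the real
values are defined by the FILEIO package and the real infrastructure
locks the mutexes rather than returning a mask.

```c
/* Placeholder flag values; the real ones are defined by FILEIO. */
#define MINI_SYNCMODE_NONE            0x0000u
#define MINI_SYNCMODE_FILE_FILESYSTEM 0x0001u
#define MINI_SYNCMODE_FILE_MOUNTPOINT 0x0002u
#define MINI_SYNCMODE_IO_FILE         0x0004u
#define MINI_SYNCMODE_IO_FILESYSTEM   0x0008u
#define MINI_SYNCMODE_IO_MOUNTPOINT   0x0010u
#define MINI_SYNCMODE_SOCK_FILE       0x0020u
#define MINI_SYNCMODE_SOCK_NETSTACK   0x0040u

/* Which table entry mutexes to take. */
enum mini_lock {
    MINI_LOCK_FSTAB = 1,
    MINI_LOCK_MTAB  = 2,
    MINI_LOCK_FILE  = 4,
    MINI_LOCK_NSTAB = 8
};

/* The three classes of operation distinguished by the flags. */
enum mini_op_class { MINI_OP_FILESYSTEM, MINI_OP_IO, MINI_OP_SOCKET };

/* Map a syncmode value and an operation class to the set of mutexes
   that would be locked around the operation. */
static unsigned mini_locks_needed(unsigned syncmode, enum mini_op_class cls)
{
    unsigned locks = 0;

    switch (cls) {
    case MINI_OP_FILESYSTEM:
        if (syncmode & MINI_SYNCMODE_FILE_FILESYSTEM) locks |= MINI_LOCK_FSTAB;
        if (syncmode & MINI_SYNCMODE_FILE_MOUNTPOINT) locks |= MINI_LOCK_MTAB;
        break;
    case MINI_OP_IO:
        if (syncmode & MINI_SYNCMODE_IO_FILE)         locks |= MINI_LOCK_FILE;
        if (syncmode & MINI_SYNCMODE_IO_FILESYSTEM)   locks |= MINI_LOCK_FSTAB;
        if (syncmode & MINI_SYNCMODE_IO_MOUNTPOINT)   locks |= MINI_LOCK_MTAB;
        break;
    case MINI_OP_SOCKET:
        if (syncmode & MINI_SYNCMODE_SOCK_FILE)       locks |= MINI_LOCK_FILE;
        if (syncmode & MINI_SYNCMODE_SOCK_NETSTACK)   locks |= MINI_LOCK_NSTAB;
        break;
    }
    return locks;
}
```

For example, a filesystem that needs one thread at a time in each
mounted instance, but allows independent I/O on separate files, might
combine the FILE_MOUNTPOINT and IO_FILE flags.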


Initialization and Mounting
---------------------------

As mentioned previously, mount table entries can be sourced from two
places. Static entries may be defined by using the MTAB_ENTRY()
macro. Such entries will be automatically mounted on system startup.
For each entry in the mount table that has a non-null _name_ field the
filesystem table is searched for a match with the _fsname_ field. If a
match is found the filesystem's _mount_ entry is called and, if
successful, the mount table entry is marked valid and the _fs_ field
initialized. The _mount_ function is responsible for initializing the
_root_ field.

The size of the mount table is defined by the configuration value
CYGNUM_FILEIO_MTAB_MAX. Any entries that have not been statically
defined are available for use by dynamic mounts.

A filesystem may be mounted dynamically by calling mount(). This
function has the following prototype:

int mount( const char *devname,
           const char *dir,
	   const char *fsname);

The _devname_ argument identifies a device that will be used by this
filesystem and will be assigned to the _devname_ field of the mount
table entry.

The _dir_ argument is the mount point name; it will be assigned to the
_name_ field of the mount table entry.

The _fsname_ argument is the name of the implementing filesystem; it
will be assigned to the _fsname_ field of the mount table entry.

The process of mounting a filesystem dynamically is as follows. First
a search is made of the mount table for an entry with a NULL _name_
field to be used for the new mount point. The filesystem table is then
searched for an entry whose name matches _fsname_. If this is
successful then the mount table entry is initialized and the
filesystem's _mount_ operation called. If this is successful, the
mount table entry is marked valid and the _fs_ field initialized.

Unmounting a filesystem is done by the umount() function. This can
unmount filesystems whether they were mounted statically or
dynamically.

The umount() function has the following prototype:

int umount( const char *name );

The mount table is searched for a match between the _name_ argument
and the entry _name_ field. When a match is found the filesystem's
_umount_ operation is called and if successful, the mount table entry
is invalidated by setting its _valid_ field false and the _name_ field
to NULL.
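
The dynamic mount and unmount steps can be sketched as below. The
tables, the toy "ramfs" entry and all mini_ and toy_ names are
stand-ins invented for this example, and the error codes returned on
failure are assumptions; the sketch returns them directly, which may
differ from how the real mount() and umount() report errors.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct mini_mtab_entry;

struct mini_fstab_entry {
    const char *name;
    int (*mount)(struct mini_fstab_entry *fste, struct mini_mtab_entry *mte);
    int (*umount)(struct mini_mtab_entry *mte);
};

struct mini_mtab_entry {
    const char              *name;
    const char              *fsname;
    const char              *devname;
    int                      valid;
    struct mini_fstab_entry *fs;
    unsigned long            root;
};

static int toy_mount(struct mini_fstab_entry *fste, struct mini_mtab_entry *mte)
{
    (void)fste;
    mte->root = 1;              /* mount must initialize the root field */
    return 0;
}

static int toy_umount(struct mini_mtab_entry *mte)
{
    mte->root = 0;
    return 0;
}

static struct mini_fstab_entry fs_table[] = { { "ramfs", toy_mount, toy_umount } };
static struct mini_mtab_entry  mt_table[2];     /* all names start out NULL */

#define COUNT(a) (sizeof (a) / sizeof (a)[0])

static int mini_mount(const char *devname, const char *dir, const char *fsname)
{
    struct mini_mtab_entry  *mte = NULL;
    struct mini_fstab_entry *fste = NULL;
    size_t i;
    int err;

    /* Find a free mount table slot: one with a NULL name. */
    for (i = 0; i < COUNT(mt_table) && mte == NULL; i++)
        if (mt_table[i].name == NULL)
            mte = &mt_table[i];
    if (mte == NULL)
        return ENOMEM;          /* assumed error code */

    /* Find the implementing filesystem by name. */
    for (i = 0; i < COUNT(fs_table) && fste == NULL; i++)
        if (strcmp(fs_table[i].name, fsname) == 0)
            fste = &fs_table[i];
    if (fste == NULL)
        return ENODEV;          /* assumed error code */

    mte->name = dir;
    mte->fsname = fsname;
    mte->devname = devname;
    err = fste->mount(fste, mte);
    if (err != 0) {
        mte->name = NULL;       /* release the slot again */
        return err;
    }
    mte->valid = 1;
    mte->fs = fste;
    return 0;
}

static int mini_umount(const char *name)
{
    size_t i;

    for (i = 0; i < COUNT(mt_table); i++) {
        struct mini_mtab_entry *mte = &mt_table[i];
        int err;

        if (!mte->valid || mte->name == NULL || strcmp(mte->name, name) != 0)
            continue;
        err = mte->fs->umount(mte);
        if (err == 0) {         /* invalidate only on success */
            mte->valid = 0;
            mte->name = NULL;
        }
        return err;
    }
    return EINVAL;              /* assumed error code */
}
```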

Sockets
-------

If a network stack is present, then the FILEIO infrastructure also
provides access to the standard BSD socket calls.

The netstack table contains entries which describe the network
protocol stacks that are in the system image. Each resident stack
should export an entry to this table using the NSTAB_ENTRY() macro.

Each table entry has the following structure:

struct cyg_nstab_entry
{
    cyg_bool            valid;          // true if stack initialized
    cyg_uint32          syncmode;       // synchronization protocol
    char                *name;          // stack name
    char                *devname;       // hardware device name
    CYG_ADDRWORD        data;           // private data value

    int     (*init)( cyg_nstab_entry *nste );
    int     (*socket)( cyg_nstab_entry *nste, int domain, int type,
		       int protocol, cyg_file *file );
};

This table is analogous to a combination of the filesystem and mount
tables.

The _valid_ field is set true if the stack's _init_ function returned
successfully and the _syncmode_ field contains the CYG_SYNCMODE_SOCK_*
bits described above.

The _name_ field contains the name of the protocol stack.

The _devname_ field names the device that the stack is using. This may
reference a device under "/dev", or may be a name that is only
meaningful to the stack itself.

The _init_ function is called during system initialization to start
the protocol stack running. If it returns non-zero the _valid_ field
is set false and the stack will be ignored subsequently.

The _socket_ function is called to attempt to create a socket in the
stack. When the socket() API function is called the netstack table is
scanned and for each valid entry the _socket_ function is called. If
this returns non-zero then the scan continues to the next valid stack,
or terminates with an error if the end of the table is reached.
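
The scan can be sketched as follows. The mini_file and mini_nstab_entry
types are cut-down stand-ins, toy_stack_socket() is a hypothetical
stack that uses its _data_ field to hold the one domain it supports,
and EAFNOSUPPORT is an assumed choice of error code for the case where
no stack accepts the request.

```c
#include <errno.h>
#include <stddef.h>

struct mini_file {
    int   f_type;
    void *f_xops;
};

struct mini_nstab_entry {
    int         valid;          /* true if the stack initialized */
    const char *name;
    long        data;           /* private data value */
    int (*socket)(struct mini_nstab_entry *nste, int domain, int type,
                  int protocol, struct mini_file *file);
};

/* Toy stack: accepts only the domain stored in its data field and, for
   the sake of the example, records itself in f_xops. */
static int toy_stack_socket(struct mini_nstab_entry *nste, int domain,
                            int type, int protocol, struct mini_file *file)
{
    (void)type;
    (void)protocol;
    if ((long)domain != nste->data)
        return EAFNOSUPPORT;
    file->f_xops = nste;
    return 0;
}

/* Offer the request to each valid stack in turn; the first one that
   returns zero wins. */
static int mini_socket(struct mini_nstab_entry *tab, size_t count,
                       int domain, int type, int protocol,
                       struct mini_file *file)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (!tab[i].valid)
            continue;           /* stacks that failed init are ignored */
        if (tab[i].socket(&tab[i], domain, type, protocol, file) == 0)
            return 0;
    }
    return EAFNOSUPPORT;        /* end of table reached */
}
```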

The result of a successful socket call is an initialized file object
with the _f_xops_ field pointing to the following structure:

struct cyg_sock_ops
{
    int (*bind)      ( cyg_file *fp, const sockaddr *sa, socklen_t len );
    int (*connect)   ( cyg_file *fp, const sockaddr *sa, socklen_t len );
    int (*accept)    ( cyg_file *fp, cyg_file *new_fp,
                       struct sockaddr *name, socklen_t *anamelen );
    int (*listen)    ( cyg_file *fp, int len );
    int (*getname)   ( cyg_file *fp, sockaddr *sa, socklen_t *len, int peer );
    int (*shutdown)  ( cyg_file *fp, int flags );
    int (*getsockopt)( cyg_file *fp, int level, int optname,
                       void *optval, socklen_t *optlen);
    int (*setsockopt)( cyg_file *fp, int level, int optname,
                       const void *optval, socklen_t optlen);
    int (*sendmsg)   ( cyg_file *fp, const struct msghdr *m,
                       int flags, ssize_t *retsize );
    int (*recvmsg)   ( cyg_file *fp, struct msghdr *m,
                       socklen_t *namelen, ssize_t *retsize );
};

It should be obvious from the names of these functions which API calls
they provide support for. The _getname_ function provides support for
both getsockname() and getpeername() while the _sendmsg_ and _recvmsg_
functions provide support for send(), sendto(), sendmsg(), recv(),
recvfrom() and recvmsg() as appropriate.


Select
------

The infrastructure provides support for implementing a select
mechanism. This is modeled on the mechanism in the BSD kernel, but has
been modified to make it implementation independent.

The main part of the mechanism is the select() API call. This
processes its arguments and calls the _fo_select_ function on all file
objects referenced by the file descriptor sets passed to it. If the
same descriptor appears in more than one descriptor set, the
_fo_select_ function will be called separately for each appearance.

The _which_ argument of the _fo_select_ function will either be
CYG_FREAD to test for read conditions, CYG_FWRITE to test for write
conditions or zero to test for exceptions. For each of these options
the function should test whether the condition is satisfied and if so
return true. If it is not satisfied then it should call
cyg_selrecord() with the _info_ argument that was passed to the
function and a pointer to a cyg_selinfo structure.

The cyg_selinfo structure is used to record information about current
select operations. Any object that needs to support select must
contain an instance of this structure.  Separate cyg_selinfo
structures should be kept for each of the options that the object can
select on - read, write or exception.

If none of the file objects report that the select condition is
satisfied, then the select() API function puts the calling thread to
sleep waiting either for a condition to become satisfied, or for the
optional timeout to expire.

A selectable object must have some asynchronous activity that may
cause a select condition to become true - either via interrupts or the
activities of other threads. Whenever a selectable condition is
satisfied, the object should call cyg_selwakeup() with a pointer to
the appropriate cyg_selinfo structure. If the thread is still waiting,
this will cause it to wake up and repeat its poll of the file
descriptors. This time around, the object that caused the wakeup
should indicate that the select condition is satisfied, and the
_select()_ API call will return.
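
The object's side of this protocol can be sketched as follows. The
functions mini_selrecord() and mini_selwakeup() are mock stand-ins for
cyg_selrecord() and cyg_selwakeup() so that the example is
self-contained; they only record and count calls and do not block or
wake any thread. The mini_pipe object, the MINI_FREAD and MINI_FWRITE
values and the wakeup counter are likewise invented for illustration.

```c
#include <stdint.h>

#define MINI_FREAD  1           /* placeholder for CYG_FREAD */
#define MINI_FWRITE 2           /* placeholder for CYG_FWRITE */

/* Mock stand-in for cyg_selinfo: remembers who is waiting. */
struct mini_selinfo {
    uintptr_t waiter;           /* 0 if nobody is selecting */
};

static int mini_wakeups;        /* counts wakeups delivered by the mock */

static void mini_selrecord(uintptr_t info, struct mini_selinfo *sip)
{
    sip->waiter = info;
}

static void mini_selwakeup(struct mini_selinfo *sip)
{
    if (sip->waiter != 0) {
        mini_wakeups++;         /* the real call would wake the thread */
        sip->waiter = 0;
    }
}

/* A selectable object keeps one selinfo per condition it supports. */
struct mini_pipe {
    int                 bytes_ready;
    int                 space_free;
    struct mini_selinfo rd_sel;
    struct mini_selinfo wr_sel;
};

/* The object's select function: report the condition if it holds,
   otherwise record the selector so that it can be woken later. */
static int mini_pipe_select(struct mini_pipe *p, int which, uintptr_t info)
{
    switch (which) {
    case MINI_FREAD:
        if (p->bytes_ready > 0)
            return 1;
        mini_selrecord(info, &p->rd_sel);
        break;
    case MINI_FWRITE:
        if (p->space_free > 0)
            return 1;
        mini_selrecord(info, &p->wr_sel);
        break;
    default:                    /* zero: no exception conditions here */
        break;
    }
    return 0;
}

/* Asynchronous side: data arrives, so the read condition becomes true. */
static void mini_pipe_put(struct mini_pipe *p, int n)
{
    p->bytes_ready += n;
    p->space_free -= n;
    mini_selwakeup(&p->rd_sel);
}
```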

Note that _select()_ does not exhibit real time behaviour: the
iterative poll of the descriptors and the wakeup mechanism militate
against this. If real time response to device or socket I/O is
required then separate threads should be devoted to each device of
interest.


Devices
-------

Devices are accessed by means of a pseudo-filesystem, "devfs", that is
mounted on "/dev". Open operations are translated into calls to
cyg_io_lookup() and if successful result in a file object whose
_f_ops_ functions translate filesystem API functions into calls into
the device API.

// EOF fileio.txt