<!doctype linuxdoc system>

<article>

<title>The Linux Virtual File-system Layer
<author>Neil Brown <tt>neilb@cse.unsw.edu.au</tt> and others.

<date>29 December 1999 - v1.6
<abstract>
The Linux operating system supports multiple different file-systems,
including <tt/ext2/ (the Second Extended file-system), <tt/nfs/ (the
Network File-system), <tt/FAT/ (The MS-DOS File Allocation Table file
system), and others.

To enable the upper levels of the kernel to deal equally with all of
these and other file-systems, Linux defines an abstract layer, known as
the Virtual File-system, or <tt/vfs/.  Each lower level file-system
must present an interface which conforms to this Virtual file-system.

This document describes the <tt/vfs/ interface (as present in Linux
2.3.29).

NOTE: this document is incomplete.
</abstract>

<toc>

<sect>Introduction<p>

This document describes the internals of one of the fundamental Linux kernel 
subsystems - the Virtual File-system Layer also known as the VFS switch.
This subsystem corresponds to the "vnode/vfs layer" found in commercial
UNIX flavours, such as those based on SVR4/SVR5 code base, e.g. SCO UnixWare.

All references to the C source code files are given relative to the 
<tt>/usr/src/linux</tt> directory. All header files are relative to the
<tt>/usr/src/linux/include</tt> directory.

<sect>Objects and Methods<p>

The Virtual File-system interface is structured around a number of
generic object types, and a number of methods which can be called on
these objects.

The basic objects known to the VFS layer are files, file-systems,
inodes, and names for inodes.

<sect1>Files<p>

Files are things that can be read from or written to.  They can also
be mapped into memory and sometimes a list of file names can be read
from them.  They map very closely to the <tt/file descriptor/ concept
that Unix has.  Files are represented within Linux by a <tt/struct
file/ which has a number of methods stored in a <tt/struct
file_operations/.

<sect1>Inodes<p>

An inode represents a basic object within a file-system.  It can be a
regular file, a directory, a symbolic link, or a few other things.
The VFS does not make a strong distinction between different sorts of
objects, but leaves it to the actual file-system implementation to
provide appropriate behaviours, and to the higher levels of the kernel
to treat different objects differently.

Each inode is represented by a <tt/struct inode/ which has a number of
methods stored in a <tt/struct inode_operations/.

It may seem that Files and Inodes are very similar.  They are, but
there are some important differences.  One thing to note is that
some things have inodes but never have files.  A good
example of this is a symbolic link.  Conversely, there are files which
do not have inodes, particularly pipes (though not named pipes) and
sockets (though not UNIX domain sockets).

Also, a File has state information that an inode does not have,
particularly a <tt/pos/ition, which indicates where in the file the
next read or write will be performed.
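
As a minimal user-space sketch of this distinction, the following
uses two hypothetical types, <tt/toy_inode/ and <tt/toy_file/, which
are stand-ins invented for this example and not the kernel structures.
Two open files can share one inode while each keeps its own position.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for struct inode and struct file. */
struct toy_inode {
	const char *data;	/* contents of the object */
	size_t size;
};

struct toy_file {
	struct toy_inode *inode;	/* shared between opens */
	size_t pos;			/* private to this open file */
};

/* Read up to len bytes at the file's own position and advance it. */
static size_t toy_read(struct toy_file *f, char *buf, size_t len)
{
	size_t left;

	assert(f->pos <= f->inode->size);
	left = f->inode->size - f->pos;
	if (len > left)
		len = left;
	memcpy(buf, f->inode->data + f->pos, len);
	f->pos += len;
	return len;
}
```

Reading through one <tt/toy_file/ moves only that file's <tt/pos/; a
second <tt/toy_file/ on the same <tt/toy_inode/ still starts at the
beginning.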

<sect1>File-systems<p>

A file-system is a collection of <tt/inodes/ with one distinguished
inode known as the <tt/root/.  Other inodes are accessed by starting
at the root and looking up a file name to get to another inode.

A file-system has a number of characteristics which apply uniformly to
all inodes within the file-system. Some of these are flags such as the
<tt/READ-ONLY/ flag.  Another important one is the <tt/blocksize/.  I'm
not entirely sure why this is needed globally.

Each file-system is represented by a <tt/struct super_block/, and has a
number of methods stored in a <tt/struct super_operations/.

There is a strong correlation within Linux between super-blocks (and
hence file-systems) and device numbers.  Each file-system must
(appear to) have a unique device on which the file-system resides.
Some file-systems (such as <tt/nfs/ and <tt/proc/) are marked as not needing a
real device.  For these, an anonymous device, with a <tt/major/
number of 0, is automatically assigned.
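
A rough user-space illustration of handing out anonymous device
numbers follows.  The names <tt/toy_get_unnamed_dev/ and
<tt/toy_put_unnamed_dev/, and the packing of an 8-bit major with an
8-bit minor, are assumptions made for this sketch; it is not a copy of
the kernel code.

```c
#include <assert.h>

#define TOY_MINORBITS		8
#define TOY_MKDEV(ma, mi)	(((ma) << TOY_MINORBITS) | (mi))
#define TOY_MAJOR(dev)		((dev) >> TOY_MINORBITS)
#define TOY_MINOR(dev)		((dev) & ((1u << TOY_MINORBITS) - 1))
#define TOY_UNNAMED_MAJOR	0

/* One "in use" marker per minor number under major 0. */
static unsigned char toy_in_use[1 << TOY_MINORBITS];

/* Return an unused device number with major 0, or 0 if none is left.
 * Minor 0 is never handed out, so 0 can double as "no device". */
static unsigned int toy_get_unnamed_dev(void)
{
	unsigned int i;

	for (i = 1; i < (1u << TOY_MINORBITS); i++) {
		if (!toy_in_use[i]) {
			toy_in_use[i] = 1;
			return TOY_MKDEV(TOY_UNNAMED_MAJOR, i);
		}
	}
	return 0;
}

/* Give an anonymous device number back for later reuse. */
static void toy_put_unnamed_dev(unsigned int dev)
{
	assert(TOY_MAJOR(dev) == TOY_UNNAMED_MAJOR);
	toy_in_use[TOY_MINOR(dev)] = 0;
}
```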

As well as knowing about file-systems, Linux VFS knows about different
file-system types.  Each type of file-system is represented in Linux
by a <tt/struct file_system_type/.  This contains just one method,
<tt/read_super/ which instantiates a <tt/super_block/ to represent a
given file-system.

<sect1>Names<p>

All inodes within a file-system are accessed by name.  As the
name-to-inode lookup process may be expensive for some file-systems,
Linux's VFS layer maintains a cache of currently active and recently
used names.  This cache is referred to as the <tt/dcache/.

The dcache is structured in memory as a tree. Each node in the tree
corresponds to an inode in a given directory with a given name.  An
inode can be associated with more than one node in the tree.

While the dcache is not a complete copy of the file tree, it is a
proper prefix of that tree (if that is a correct usage of the term).
This means that if any node of the file tree is in the cache, then
every ancestor of that node is also in the cache.

Each node in the tree is represented by a <tt/struct dentry/ which has
a number of methods stored in a <tt/struct dentry_operations/.

The dentries act as an intermediary between Files and Inodes.  Each
file points to the dentry that it has open.  Each dentry points to the
inode that it references.  This implies that for every open file, the
dentry of that file, and of all the parents of that file are
cached in memory.  This allows a full path name of every open file to
be easily determined, as can be seen from doing:
<code>
# ls -l /proc/self/fd
total 0
lrwx------   1 root     root           64 Nov 23 07:51 0 -> /dev/pts/2
lrwx------   1 root     root           64 Nov 23 07:51 1 -> /dev/pts/2
lrwx------   1 root     root           64 Nov 23 07:51 2 -> /dev/pts/2
lr-x------   1 root     root           64 Nov 23 07:51 3 -> /proc/15588/fd/
</code>
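
Because every ancestor of a cached dentry is also cached, a path name
can be rebuilt by walking parent links.  The sketch below shows the
idea in user space; <tt/toy_dentry/ and <tt/toy_path/ are hypothetical
names for this example, and the convention that the root is its own
parent is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cut-down dentry: a name and a link to the parent. */
struct toy_dentry {
	const char *name;
	struct toy_dentry *parent;	/* the root points to itself */
};

/* Build the path of d at the end of buf, working backwards.
 * Returns a pointer into buf, or NULL if buf is too small. */
static char *toy_path(const struct toy_dentry *d, char *buf, size_t len)
{
	char *p = buf + len;

	assert(d != NULL && buf != NULL);
	if (len < 2)
		return NULL;
	*--p = '\0';
	if (d->parent == d) {		/* the root itself */
		*--p = '/';
		return p;
	}
	while (d->parent != d) {
		size_t n = strlen(d->name);

		if ((size_t)(p - buf) < n + 1)
			return NULL;	/* no room for "/name" */
		p -= n;
		memcpy(p, d->name, n);
		*--p = '/';
		d = d->parent;
	}
	return p;
}
```

The string is assembled from the last component towards the root, so
no second pass is needed to reverse it.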

<sect>Registering and Mounting a file-system<p>

It is probably worth starting by observing that there is possible
ambiguity in our use of the word file-system.  It can be used to mean a
particular type, or class, of file-system, such as <tt/ext2/ or
<tt/nfs/ or <tt/coda/, or it can be used to mean a particular
instance of a file-system, such as <tt>/usr</tt> or <tt>/home</tt> or
<em>The file-system on /dev/hda4</em>.

The first usage is implied when registering a file-system, the second
is implied while mounting a file-system.  I will continue to use this
ambiguous language as most people are familiar with it and nothing
better is obvious.

Linux finds out about new file-system types through calls to
<tt/register_filesystem/ (and forgets about them through calls to its
counterpart <tt/unregister_filesystem/).  The formal declarations are:
<tscreen>
<verb>
#include <linux/fs.h>

int register_filesystem(struct file_system_type * fs);
int unregister_filesystem(struct file_system_type * fs);
</verb>
</tscreen>
The function <tt/register_filesystem/ returns <tt/0/ on success and 
<tt/-EINVAL/ if <tt/fs==NULL/.  It returns <tt/-EBUSY/
if either <tt/fs->next != NULL/ or there is already a file-system 
registered under the same name.  It should be called (directly or
indirectly) from <tt/init_module/ for file-systems which are being
loaded as modules, or from <tt/filesystem_setup/ in 
<tt>fs/filesystems.c</tt>.  The function <tt/unregister_filesystem/
should only be called from the <tt/cleanup_module/ routine of a module.  
It returns <tt/0/ on success and <tt/-EINVAL/ if the argument is
not a pointer to a registered file-system.  (In particular,
<tt/unregister_filesystem(NULL)/ may Oops).
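
The return values described above can be modelled in user space with
a simple singly linked list.  This is only a sketch: the
<tt/toy_/-prefixed names are hypothetical, the structure is cut down
to the two fields the list needs, and the unregister routine here is
more forgiving of a <tt/NULL/ argument than the text above says the
kernel is.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cut-down file_system_type. */
struct toy_fs_type {
	const char *name;
	struct toy_fs_type *next;
};

static struct toy_fs_type *toy_file_systems;	/* head of the list */

static int toy_register_filesystem(struct toy_fs_type *fs)
{
	struct toy_fs_type **tmp;

	if (!fs)
		return -EINVAL;
	assert(fs->name != NULL);
	if (fs->next)			/* already chained somewhere */
		return -EBUSY;
	for (tmp = &toy_file_systems; *tmp; tmp = &(*tmp)->next)
		if (strcmp((*tmp)->name, fs->name) == 0)
			return -EBUSY;	/* name already taken */
	*tmp = fs;			/* append at the tail */
	return 0;
}

static int toy_unregister_filesystem(struct toy_fs_type *fs)
{
	struct toy_fs_type **tmp;

	for (tmp = &toy_file_systems; *tmp; tmp = &(*tmp)->next) {
		if (*tmp == fs) {
			*tmp = fs->next;	/* unlink */
			fs->next = NULL;
			return 0;
		}
	}
	return -EINVAL;			/* not on the list */
}
```

Registering the same structure twice fails with <tt/-EBUSY/ whenever
another type has been chained after it, because its <tt/next/ field is
no longer <tt/NULL/; otherwise the name check catches it.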

An example of file-system registration and unregistration can be
seen in <tt>fs/ext2/super.c</tt>:
<tscreen>
<verb>
static struct file_system_type ext2_fs_type = {
        "ext2",
        FS_REQUIRES_DEV /* | FS_IBASKET */,     /* ibaskets have unresolved bugs */
        ext2_read_super,
        NULL
};

int __init init_ext2_fs(void)
{
        return register_filesystem(&amp;ext2_fs_type);
}

#ifdef MODULE
EXPORT_NO_SYMBOLS;

int init_module(void)
{
        return init_ext2_fs();
}

void cleanup_module(void)
{
        unregister_filesystem(&amp;ext2_fs_type);
}

#endif
</verb>
</tscreen>

A <tt/struct file_system_type/ is defined in <tt>linux/fs.h</tt> and
has the following format:

<tscreen>
<code>
struct file_system_type {
	const char *name;
	int fs_flags;
	struct super_block *(*read_super) (struct super_block *, void *, int);
	struct file_system_type * next;
};
</code>
</tscreen>

<descrip>
<tag/name/

The name field simply gives the name of the file-system type, such as
<tt/ext2/ or <tt/iso9660/ or <tt/msdos/.  This field is used as a key,
and it is not possible to register a file-system with a name that is
already in use.  It is also used for the <tt>/proc/filesystems</tt> file
which lists all file-system types currently registered with the kernel.
When a file-system is implemented as a module, the name points into the
module's address space (mapped to a <tt/vmalloc/'d area), which means that
if you forget to call <tt/unregister_filesystem/ in <tt/cleanup_module/ and
then try to <tt>cat /proc/filesystems</tt> you will get an Oops when the
kernel dereferences <tt/name/; this is a common mistake made by file-system
writers in the first stages of development.

<tag/fs_flags/
A number of ad hoc flags which record features of the file-system.

 <descrip>
 <tag/FS_REQUIRES_DEV/
   As mentioned above, every mounted file-system is connected to some
   device, or at least some device number.  If a file-system type has
   <tt/FS_REQUIRES_DEV/, then a real device must be given when mounting
   the file-system, otherwise an anonymous device is allocated.

   <tt/nfs/ and <tt/procfs/ are examples of file-systems that don't
   require a device. <tt/ext2/ and <tt/msdos/ do.

 <tag/FS_NO_DCACHE/
   This flag is declared but not used at all. From the comment in
   <tt/fs.h/ the intent is that for file-systems marked this way, the
   dcache only keeps entries for files that are actually in use.

 <tag/FS_NO_PRELIM/
   Like <tt/FS_NO_DCACHE/, this flag is never used.  The intent appears
   to be that the dcache will have entries that are in use or have
   been used, but will not speculatively cache anything else.

 <tag/FS_IBASKET/
   Another vapour-flag.  See section on <tt/ibasket/s below, which may
   be a vapour-section.

 </descrip>

<tag/next/
<tt/next/ is simply a pointer for chaining all <tt/file_system_types/
together.  It should be initialised to <tt/NULL/ (<tt/register_filesystem/
does not set it for you and will return <tt/-EBUSY/ if you don't set 
<tt/next/ to <tt/NULL/).

<tag/read&lowbar;super/

The <tt/read_super/ method is called when a file-system (instance) is
being mounted.

The <tt/struct super_block/ is clean (all fields zero) except for the
<tt/s_dev/ and <tt/s_flags/ fields.  The <tt/void */ pointer points to
the data that has been passed down from the <tt/mount/ system
call. The trailing <tt/int/ argument tells whether <tt/read_super/ should
be silent about errors.  It is set only when mounting the root
file-system.  When mounting root, every possible file-system is tried in
turn until one succeeds. Printing errors in this case would be untidy.

<tt/read_super/ must determine whether the device given in <tt/s_dev/
together with the <tt/data/ from <tt/mount/ define a valid file-system
of this type.  If they do, then it should fill out the rest of the
<tt/struct super_block/ and return the pointer.  If not, it should
return NULL.

</descrip>
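
The contract of <tt/read_super/ can be sketched in user space as
follows.  Everything here is hypothetical: <tt/toy_super_block/ is a
cut-down structure, <tt/toy_disk/ is a fake table standing in for
reading the device named by <tt/s_dev/, and <tt/TOY_MAGIC/ is an
invented signature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define TOY_MAGIC	0x1badfaceUL	/* invented on-disk signature */
#define TOY_NDISKS	4

/* Hypothetical cut-down super_block. */
struct toy_super_block {
	unsigned int s_dev;		/* preset by the caller */
	unsigned long s_flags;		/* preset by the caller */
	unsigned long s_blocksize;
	unsigned long s_magic;
};

/* Fake devices: toy_disk[n] is the signature found on device n. */
static const unsigned long toy_disk[TOY_NDISKS] = { 0, TOY_MAGIC, 0xdead, 0 };

/* Return sb filled in if the device holds our file-system, else NULL. */
static struct toy_super_block *
toy_read_super(struct toy_super_block *sb, void *data, int silent)
{
	unsigned long magic;

	assert(sb != NULL);
	(void)data;			/* mount options are ignored here */
	magic = toy_disk[sb->s_dev % TOY_NDISKS];
	if (magic != TOY_MAGIC) {
		if (!silent)
			fprintf(stderr, "toyfs: bad magic on dev %#x\n",
				sb->s_dev);
		return NULL;		/* not one of ours */
	}
	sb->s_magic = magic;
	sb->s_blocksize = 1024;
	return sb;
}
```

With <tt/silent/ set, a failed probe leaves no trace, which is the
behaviour wanted while each file-system type is tried in turn on the
root device.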

<sect>The Super-Block and its operations<p>

Each mounted file-system is represented by the <tt/super_block/
structure.  The fact that it is mounted is stored in a 
<tt/struct vfsmount/, the declaration of which can be found in 
<tt>linux/mount.h</tt>:
<tscreen>
<code>
struct vfsmount
{
  kdev_t mnt_dev;                       /* Device this applies to */
  char *mnt_devname;                    /* Name of device e.g. /dev/dsk/hda1 */
  char *mnt_dirname;                    /* Name of directory mounted on */
  unsigned int mnt_flags;               /* Flags of this device */
  struct super_block *mnt_sb;           /* pointer to superblock */
  struct quota_mount_options mnt_dquot; /* Diskquota specific mount options */
  struct vfsmount *mnt_next;            /* pointer to next in linkedlist */
};
</code>
</tscreen>
These <tt/vfsmount/ structures are linked together in a simple
linked list starting from <tt/vfsmntlist/ in <tt>fs/super.c</tt>.
This list is mainly used for finding mounted file-system information
given a device, particularly by the disc quota code.
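
Finding a mount by device on such a list is a plain linear search.
The following user-space sketch uses hypothetical <tt/toy_/ names and
a cut-down structure; for brevity it adds new entries at the head of
the list, which is an assumption of the sketch and not a statement
about the order the kernel keeps.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cut-down vfsmount. */
struct toy_vfsmount {
	unsigned int mnt_dev;		/* device this applies to */
	const char *mnt_dirname;	/* directory mounted on */
	struct toy_vfsmount *mnt_next;	/* next in the list */
};

static struct toy_vfsmount *toy_vfsmntlist;

/* Link a new mount onto the list. */
static void toy_add_vfsmnt(struct toy_vfsmount *mnt)
{
	assert(mnt != NULL);
	mnt->mnt_next = toy_vfsmntlist;
	toy_vfsmntlist = mnt;
}

/* Return the mount record for dev, or NULL if dev is not mounted. */
static struct toy_vfsmount *toy_lookup_vfsmnt(unsigned int dev)
{
	struct toy_vfsmount *mnt;

	for (mnt = toy_vfsmntlist; mnt; mnt = mnt->mnt_next)
		if (mnt->mnt_dev == dev)
			return mnt;
	return NULL;
}
```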

<tt/vfsmount/ is kept separate from the list of super-blocks
(<tt/super_blocks/) because if the super-block already exists
then <tt>fs/super.c:read_super()</tt> is satisfied by 
<tt>fs/super.c:get_super()</tt> instead of going through the
file-system-specific <tt/read_super/ method, whereas the entry in 
<tt/vfsmntlist/ is unlinked as soon as the file-system is unmounted.

Each mount is also recorded in the <tt/dcache/ which will be described
later, and this is the source of mount information used when
traversing path names.

<sect1>The Super-block Structure<p>

A somewhat reduced description of the super-block structure is:
<tscreen>
<code>
struct super_block {
	struct list_head	s_list;		/* Keep this first */
	kdev_t			s_dev;
	unsigned long		s_blocksize;
	unsigned char		s_blocksize_bits;
	unsigned char		s_lock;
	unsigned char		s_dirt;
	struct file_system_type	*s_type;
	struct super_operations	*s_op;
	struct dquot_operations	*dq_op;
	unsigned long		s_flags;
	unsigned long		s_magic;
	struct dentry		*s_root;
	wait_queue_head_t	s_wait;

	struct inode		*s_ibasket;
	short int		s_ibasket_count;
	short int		s_ibasket_max;
	struct list_head	s_dirty;	/* dirty inodes */
	struct list_head	s_files;

	union {
		/* Configured-in filesystems get entries here */
		void			*generic_sbp;
	} u;
	/*
	 * The next field is for VFS *only*. No filesystems have any business
	 * even looking at it. You had been warned.
	 */
	struct semaphore s_vfs_rename_sem;	/* Kludge */
};
</code>
</tscreen>

See <tt>linux/fs.h</tt> for a complete declaration which includes
all file-system-specific components of the <tt/union u/ which were suppressed
above. The various fields in the super-block are:

<descrip>

<tag/s_list/
A doubly linked list of all mounted file-systems (see 
<tt>linux/list.h</tt>).

<tag/s_dev/
The device (possibly anonymous) that this file-system is mounted on.

<tag/s_blocksize/
The basic blocksize of the file-system.  I'm not sure exactly how this
is used yet. It must be a power of 2.

<tag/s_blocksize_bits/
The power of 2 that <tt/s_blocksize/ is (i.e. <tt/log2(s_blocksize)/).

<tag/s_lock/
This indicates whether the super-block is currently locked.  It is
managed by <tt/lock_super/ and <tt/unlock_super/.

<tag/s_wait/
This is a queue of processes that are waiting for the <tt/s_lock/ lock
on the super-block.

<tag/s_dirt/
This is a flag which gets set when a super-block is changed, and is
cleared whenever the super-block is written to the device.  This
happens when a filesystem is unmounted, or in response to a <tt/sync/
system call.

<tag/s_type/
This is simply a pointer to the <tt/struct file_system_type/ structure
discussed above.

<tag/s_op/
This is a pointer to the <tt/struct super_operations/ which will be
described next.

<tag/dq_op/
This is a pointer to Disc Quota operations which will be described
later.

<tag/s_flags/
This is a list of flags which are logically <tt/or/ed with the flags
in each inode to determine certain behaviours.  There is one flag
which applies only to the whole file-system, and so will be described
here. The others are described under the discussion on inodes.

<descrip>
<tag/MS_RDONLY/
A file-system with the flag set has been mounted read-only.  No writing
will be permitted, and no indirect modification, such as mount times
in the super-block or access times on files, will be made.
</descrip>

<tag/s_magic/
This records an identification number that has been read from the
device to confirm that the data on the device corresponds to the
file-system in question.  It seems to be used by the Minix file-system to
distinguish between various flavours of that file-system.
It is not clear why this is in the generic part of the structure, and
not confined to the file-system specific part for those file-systems
which need it.  Maybe this is historical.

The one <em/interesting/ usage of the field is in
<tt>fs/nfsd/vfs.c:nfsd_lookup()</tt> where it is used to make sure that
a <tt/proc/ or <tt/nfs/ type file-system is never accessed via NFS.

<tag/s_root/
This is a <tt/struct dentry/ which refers to the root of the
file-system.  It is normally created by loading the root inode from the
file-system, and passing it to <tt/d_alloc_root/.  This dentry will get
spliced into the dcache by the mount command (<tt/do_mount/ calls
<tt/d_mount/).

<tag/s_ibasket, s_ibasket_count, s_ibasket_max/
These three refer to a basket of inodes I guess, but there is no such
thing in current versions.

<tag/s_dirty/
A list of dirty inodes linked on the <tt/i_list/ field.

When an inode is marked as dirty with <tt/mark_inode_dirty/ it gets
put on this list.  When <tt/sync_inodes/ is called, any inode in this
list gets passed to the file-system's <tt/write_inode/ method.

<tag/s_files/
This is a list (linked on <tt/f_list/) of the open files on this
file-system.  It is used, for example, to check if there are any files
open for write before remounting the file-system as read-only.

<tag/u.generic_sbp/
The <tt/u/ union contains one file-system-specific super-block
information structure for each file-system known about at compile
time. Any file-system loaded as a module must allocate a separate
structure and place a pointer in <tt/u.generic_sbp/.

<tag/s_vfs_rename_sem/
This semaphore is used as a file-system wide lock while renaming a
directory.  This appears to be to guard against possible races which
may end up renaming a directory to be a child of itself.  This
semaphore is not needed or used when renaming things that are not
directories.

</descrip>
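
The relationship between <tt/s_blocksize/ and <tt/s_blocksize_bits/
described above lets block arithmetic be done with shifts and masks.
A small user-space sketch, with hypothetical helper names chosen for
this example:

```c
#include <assert.h>

/* Return log2(blocksize), or -1 if blocksize is not a power of two. */
static int toy_blocksize_bits(unsigned long blocksize)
{
	int bits = 0;

	if (blocksize == 0 || (blocksize & (blocksize - 1)) != 0)
		return -1;
	while ((1UL << bits) != blocksize)
		bits++;
	return bits;
}

/* Number of the block holding byte offset pos. */
static unsigned long toy_block_of(unsigned long pos, int bits)
{
	assert(bits >= 0);
	return pos >> bits;
}

/* Offset of pos within its block. */
static unsigned long toy_offset_in_block(unsigned long pos, int bits)
{
	assert(bits >= 0);
	return pos & ((1UL << bits) - 1);
}
```

For a 1024-byte block size the bit count is 10, so byte offset 5000
falls in block 4 at offset 904 within that block.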
<sect1>The Super-Block Methods (or Operations)<p>

The methods defined in the <tt/struct super_operations/ are:

<tscreen>
<code>
struct super_operations {
	void (*read_inode) (struct inode *);
	void (*write_inode) (struct inode *);
	void (*put_inode) (struct inode *);
	void (*delete_inode) (struct inode *);
	int (*notify_change) (struct dentry *, struct iattr *);
	void (*put_super) (struct super_block *);
	void (*write_super) (struct super_block *);
	int (*statfs) (struct super_block *, struct statfs *, int);
	int (*remount_fs) (struct super_block *, int *, char *);
	void (*clear_inode) (struct inode *);
	void (*umount_begin) (struct super_block *);
};
</code>
</tscreen>

All of these methods get called with only the kernel lock held.
This means that they can safely block, but are
responsible for guarding against concurrent access themselves.  All
are called from a process context, not from interrupt handlers or the
<em/bottom half/.

<descrip>
<tag/read_inode/
This method is called to read a specific inode from a mounted
file-system.  It is only called from <tt/get_new_inode/
out of <tt/iget/ in <tt>fs/inode.c</tt>.

In the <tt/struct inode */ argument passed to this method the
fields <tt/i_sb/, <tt/i_dev/ and particularly <tt/i_ino/ will be 
initialised to indicate which inode should be read from which file-system.
It must set (among other things) the <tt/i_op/ field of <tt/struct inode/
to point to the relevant <tt/struct inode_operations/ so that VFS can
call the methods on this inode as needed.

<tt/iget/ is mostly called from within particular file-systems to read
inodes for that file-system.  One notable exception is in
<tt>fs/nfsd/nfsfh.h</tt> where it is used to get an inode based
on information in the nfs file handle.

It is not clear that this method needs to be exported as (with the
exception of nfsd) it is only (indirectly) used by the file-system
which provides it.  Avoiding it would allow more flexibility than a
simple 32bit inode number to identify a particular inode.

The <tt/nfsd/ usage could better be replaced by an interface that
takes a file handle (or part there-of) and returns an inode.

<tag/write_inode/
This method gets called on inodes which have been marked dirty with
<tt/mark_inode_dirty/.  It is called when a sync request is made on
the file, or on the file-system.  It should make sure that any
information in the inode is safe on the device.

<tag/put_inode/
If defined, this method is called whenever the reference count on an
inode is decreased.  Note that this does not mean that the inode is
not in use any more, just that it has one fewer users.

<tt/put_inode/ is called <bf/before/ the <tt/i_count/ field is
decreased, so if <tt/put_inode/ wants to check if this is the last
reference, it should check if <tt/i_count/ is 1 or not.

Almost all file-systems that define this method use it to do some
special handling when the last reference to the inode is released,
i.e. when <tt/i_count/ is 1 and is about to become zero.

<!--- As no locks
are held at this time, it is not clear that this usage is SMP safe.  It
is fairly clear that if the test succeeds (i.e. if <tt/i_count/ is
one), then the count will very shortly become zero, as there would be
no way for some other thread to find and hence attach the inode.
However it isn't so clear that if the test fails, then the count isn't
about to become zero.  One can imaging two thread on two different
processors that are both calling <tt/iput/ for the last two references
to an inode.  They both call <tt/put_inode/ where both see
<tt/i_count/ as having the value 2.  Then they both proceed to acquire
the lock and decrement <tt/i_count/.  The second one to get the lock
will reduce <tt/i_count/ to 0 without the special processing ever
happening.

While this can be imagined, I am not sure if it is actually possible.
 -->

<tag/delete_inode/
If defined, <tt/delete_inode/ is called whenever the reference count
on an inode reaches 0, and it is found that the link count
(<tt/i_nlink/) is also zero.  It is presumed that the file-system will
deal with this situation by invalidating the inode in the file-system
and freeing up any resources used.

It could be argued that this and the previous methods should be
replaced by one method that is called whenever the <tt/i_count/ field
reaches 0, and then the file-system gets to decide if it should do
something special with <tt/i_nlink/ being 0.  The only difficulty that
this might cause with current file-systems is that <tt/ext2/ calls
<tt/ext2_discard_prealloc/ when <tt/put_inode/ is called,
independently of <tt/i_count/.  This would no longer be possible.  But
is this even desirable?  Would it not make more sense to do this only
in <tt/ext2_release_file/ (which does it as well)?

<tag/notify_change/
This is called when inode attributes are changed, the argument
<tt/struct iattr */ pointing to the new set of attributes.
If the file-system does not define
this method (i.e. it is <tt/NULL/) then VFS uses the routine
<tt>fs/attr.c:inode_change_ok</tt> which implements POSIX standard
attributes verification.  Then VFS marks the inode as dirty.
If the file-system implements its own <tt/notify_change/ then it should
call <tt/mark_inode_dirty(inode)/ after it has set the attributes.  An
example of how to implement this method can be seen in 
<tt>fs/ext2/inode.c:ext2_notify_change()</tt>.

<tag/put_super/
This is called at the last stages of the <tt/umount(2)/ system call, before
removing the entry from <tt/vfsmntlist/.
This method is called with the super-block lock held.
A typical implementation would free file-system-private resources specific
to this mount instance, such as inode bitmaps, block bitmaps and the buffer head
containing the super-block, and decrement the module use count if the file-system is
implemented as a dynamically loadable module. For example, 
<tt>fs/bfs/inode.c:bfs_put_super()</tt> looks very simple:
<tscreen>
<code>
static void bfs_put_super(struct super_block *s)
{
        brelse(s->su_sbh);
        kfree(s->su_imap);
        kfree(s->su_bmap);
        MOD_DEC_USE_COUNT;
}
</code>
</tscreen>

<tag/write_super/
Called when VFS decides that the super-block needs to be written to disk.
Called from <tt>fs/buffer.c:file_fsync</tt>, 
<tt>fs/super.c:sync_supers</tt> and <tt>fs/super.c:do_umount</tt>.
Obviously not needed for a read-only file-system.

<tag/statfs/
This method is needed to implement the <tt/statfs(2)/ system call and is
called from <tt>fs/open.c:sys_statfs</tt> if implemented, otherwise
<tt/statfs(2)/ will fail with <tt/errno/ set to <tt/ENODEV/.

<tag/remount_fs/
Called when the file-system is being remounted, i.e. if the <tt/MS_REMOUNT/
flag is specified with the <tt/mount(2)/ system call.
This can be used to change various mount options without unmounting
the file-system.  A common usage is to change a read-only file-system into a
writable file-system.


<tag/clear_inode/
Optional method, called when VFS clears the inode.
This is needed (at least) by any file-system which attaches
<tt/kmalloc/ed data to the inode structure, as particularly might be
the case for file-systems using the <tt/generic_ip/ field in <tt/struct
inode/.

It is currently used by <tt/ntfs/ which does attach <tt/kmalloc/ed data to
an inode, and by <tt/fat/ which does interesting things to present a
pretense of stable inode numbers on a file-system which does not
support inode numbers.

<tag/umount_begin/
This method is called early in the unmounting process if the MNT_FORCE
flag was given to umount.  The intention is that it should cause any
incomplete transaction on the file-system to fail quickly rather than
block waiting on some external event such as a remote server
responding.

Note that calling <tt/umount_begin/ will probably not make an active
file-system become unmountable, but it should allow any processes using
that file-system to be killable, rather than being in an
uninterruptible wait.

Currently, <tt/NFS/ is the only file-system which provides <tt/umount_begin/.

</descrip>
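
The ordering of <tt/put_inode/ and <tt/delete_inode/ described above
can be modelled in user space.  This is only a sketch under stated
assumptions: the <tt/toy_/ structures are cut down to the fields the
example needs, the <tt/last_put_seen/ and <tt/deleted/ fields exist
purely so the effect can be observed, and no locking is modelled.

```c
#include <assert.h>
#include <stddef.h>

struct toy_inode;

/* Hypothetical cut-down super_operations. */
struct toy_super_operations {
	void (*put_inode)(struct toy_inode *);
	void (*delete_inode)(struct toy_inode *);
};

struct toy_inode {
	int i_count;		/* reference count */
	int i_nlink;		/* link count */
	const struct toy_super_operations *s_op;
	int last_put_seen;	/* set by toy_put_inode below */
	int deleted;		/* set by toy_delete_inode below */
};

/* Drop one reference, calling the methods in the order described. */
static void toy_iput(struct toy_inode *inode)
{
	assert(inode->i_count > 0);
	if (inode->s_op && inode->s_op->put_inode)
		inode->s_op->put_inode(inode);	/* i_count not yet decreased */
	if (--inode->i_count == 0 && inode->i_nlink == 0 &&
	    inode->s_op && inode->s_op->delete_inode)
		inode->s_op->delete_inode(inode);
}

/* Example methods that a file-system might supply. */
static void toy_put_inode(struct toy_inode *inode)
{
	if (inode->i_count == 1)	/* this is the last reference */
		inode->last_put_seen = 1;
}

static void toy_delete_inode(struct toy_inode *inode)
{
	inode->deleted = 1;
}

static const struct toy_super_operations toy_sops = {
	toy_put_inode,
	toy_delete_inode,
};
```

An inode with two references and no links sees nothing special on the
first <tt/toy_iput/; on the second, <tt/toy_put_inode/ observes a
count of 1 and <tt/toy_delete_inode/ then runs because both counts are
zero.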

<sect>The File and its Operations<p>

A file object is used wherever there is a need to read from or write
to something.  This includes accessing objects within a file-system,
communicating through a pipe, or over a network.  Files are accessible
to processes through their <em/file descriptors/.

<sect1>File Structure<p>

The <tt/file/ structure is defined in
<tt>linux/fs.h</tt> to be:

<tscreen>
<code>
struct fown_struct {
	int pid;		/* pid or -pgrp where SIGIO should be sent */
	uid_t uid, euid;	/* uid/euid of process setting the owner */
	int signum;		/* posix.1b rt signal to be delivered on IO */
};

struct file {
	struct list_head	f_list;
	struct dentry		*f_dentry;
	struct file_operations	*f_op;
	atomic_t		f_count;
	unsigned int 		f_flags;
	mode_t			f_mode;
	loff_t			f_pos;
	unsigned long 		f_reada, f_ramax, f_raend, f_ralen, f_rawin;
	struct fown_struct	f_owner;
	unsigned int		f_uid, f_gid;
	int			f_error;

	unsigned long		f_version;

	/* needed for tty driver, and maybe others */
	void			*private_data;
};
</code>
</tscreen>

The fields have the following meaning:

<descrip>

<tag/f_list/
This field links files together into one of a number of lists.  There
is one list for each active file-system, starting at the <tt/s_files/
pointer in the super-block.  There is one for free file structures
(<tt/free_list/ in <tt>fs/file_table.c</tt>).  And there is one
for anonymous files (<tt/anon_list/ in <tt>fs/file_table.c</tt>)
such as pipes.

<tag/f_dentry/
This field records the dcache entry that points to the inode for this
file.  If the inode refers to an object, such as a pipe, which isn't
in a regular file-system, the dentry is a root dentry created with
<tt/d_alloc_root/.

<tag/f_op/
This field points to the methods to use on this file.

<tag/f_count/
The number of references to this file.  One for each different
user-process file descriptor, plus one for each internal usage.

<tag/f_flags/
This field stores the flags for this file such as access type
(read/write), non-blocking, append-only etc.  These are defined in the
per-architecture include file <tt>asm/fcntl.h</tt>.
Some of these flags are only relevant at the time of opening, and are
not stored in <tt/f_flags/.  These excluded flags are
O_CREAT, O_EXCL, O_NOCTTY, O_TRUNC.  This list is from <tt/filp_open/
in <tt>fs/open.c</tt>.

<tag/f_mode/
The bottom two bits of <tt/f_flags/ encode read and write access
in  a way that it is not easy to extract the individual read and write
access information.  <tt/f_mode/ stores the read and write access as
two separate bits.

<tag/f_pos/
This records the current file position which will be the address used
for the next <tt/read/ request, and for the next <tt/write/ request if
the file does NOT have the O_APPEND flag.

<tag/f_reada, f_ramax, f_raend, f_ralen, f_rawin/
These five fields are used to keep track of sequential access
patterns on the file, and to determine how much read-ahead to do.
There may be a separate section on read-ahead.

<tag/f_owner/
This structure stores a process id and a signal to send to the process
when certain events happen with the file, such as new data being
available.  Currently, keyboards, mice, serial ports and network
sockets seem to be the only files which use this feature (via
<tt/kill_fasync/).

<tag/f_uid, f_gid/
These fields get set to the owner and group of the process which
opened the file.  They don't seem to be used at all.

<tag/f_error/
This is used by the NFS client file-system code to return write
errors.  It is set in <tt>fs/nfs/write.c</tt> and checked in
<tt>fs/nfs/file.c</tt>, and used in
<tt>mm/filemap.c:generic_file_write</tt>.

<tag/f_version/
This field is available to be used by the underlying file-system to
help cache state, and check for the cache being invalid.
It is changed whenever the file has its <tt/f_pos/ value changed.

For example, the <tt/ext2/ file-system uses it in conjunction with the
<tt/i_version/ field in the inode to detect when a directory may have
changed.  If neither the directory nor the file position has changed,
then <tt/ext2/ can be sure that the current file position is the start
of a valid directory entry; otherwise it must re-check from the start
of the block.

<tag/private_data/
This is used by many device drivers, and even a few file-systems, to
store extra per-open-file information (such as credentials in <tt/coda/).

</descrip>
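
The split between <tt/f_flags/ and <tt/f_mode/ described above can be
illustrated in user space.  The sketch assumes the Linux values
<tt/O_RDONLY/ = 0, <tt/O_WRONLY/ = 1 and <tt/O_RDWR/ = 2; the
<tt/TOY_FMODE_/ constants and both helper names are hypothetical and
chosen for this example.

```c
#include <assert.h>
#include <fcntl.h>

#define TOY_FMODE_READ	1
#define TOY_FMODE_WRITE	2

/* Turn the access-mode encoding in the bottom two bits of the flags
 * into two independent bits, one for read and one for write. */
static unsigned int toy_flags_to_mode(int flags)
{
	/* The +1 trick only works with these particular values. */
	assert(O_RDONLY == 0 && O_WRONLY == 1 && O_RDWR == 2);
	return (unsigned int)((flags & O_ACCMODE) + 1) & O_ACCMODE;
}

/* Drop the flags that matter only at open time, leaving the ones
 * that would be kept for the lifetime of the open file. */
static int toy_stored_flags(int flags)
{
	return flags & ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
}
```

With separate bits, a test such as "may this file be written" becomes
a single mask against <tt/TOY_FMODE_WRITE/ instead of a comparison
against two different flag values.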

<sect1>File Methods<p>
The list of file methods is defined in
<tt>linux/fs.h</tt>
to be:

<tscreen><code>
typedef int (*filldir_t)(void *, const char *, int, off_t, ino_t);

struct file_operations {
	loff_t (*llseek) (struct file *, loff_t, int);
	ssize_t (*read) (struct file *, char *, size_t, loff_t *);
	ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
	int (*readdir) (struct file *, void *, filldir_t);
	unsigned int (*poll) (struct file *, struct poll_table_struct *);
	int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
	int (*mmap) (struct file *, struct vm_area_struct *);
	int (*open) (struct inode *, struct file *);
	int (*flush) (struct file *);
	int (*release) (struct inode *, struct file *);
	int (*fsync) (struct file *, struct dentry *);
	int (*fasync) (int, struct file *, int);
	int (*check_media_change) (kdev_t dev);
	int (*revalidate) (kdev_t dev);
	int (*lock) (struct file *, int, struct file_lock *);
};
</code></tscreen>
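
The <tt/filldir_t/ callback declared at the top of this listing is how
<tt/readdir/ hands directory entries back to its caller.  The
following user-space sketch follows the return convention that this
document gives for <tt/readdir/ (0 at end of directory, non-zero when
stopping early).  The in-memory directory, the <tt/toy_/ names and the
example callback are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

typedef int (*toy_filldir_t)(void *, const char *, int, off_t, ino_t);

/* Hypothetical in-memory directory and open-file state. */
struct toy_dirent {
	const char *name;
	ino_t ino;
};

struct toy_dir {
	const struct toy_dirent *ents;
	size_t count;
	size_t pos;		/* plays the role of f_pos */
};

/* Feed entries to filldir until it declines one or the directory ends.
 * Returns 0 at end of directory, 1 if it stopped early. */
static int toy_readdir(struct toy_dir *dir, void *handle,
		       toy_filldir_t filldir)
{
	assert(dir->pos <= dir->count);
	while (dir->pos < dir->count) {
		const struct toy_dirent *e = &dir->ents[dir->pos];

		if (filldir(handle, e->name, (int)strlen(e->name),
			    (off_t)dir->pos, e->ino) != 0)
			return 1;	/* caller has had enough */
		dir->pos++;
	}
	return 0;
}

/* Example callback: the handle points to a count of free slots. */
static int toy_fill_limited(void *handle, const char *name, int namelen,
			    off_t offset, ino_t ino)
{
	int *room = handle;

	(void)name; (void)namelen; (void)offset; (void)ino;
	if (*room == 0)
		return 1;	/* no space left, stop here */
	(*room)--;
	return 0;
}
```

Because the position is advanced only after an entry has been
accepted, a later call resumes with the entry that was declined.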

<descrip>

<tag/llseek/
This implements the <tt/lseek/ system call. If it is left undefined,
then <tt/default_llseek/ from <tt>fs/read_write.c</tt> is used
instead.  This updates the <tt/f_pos/ field as expected, and also may
change the <tt/f_reada/ field and <tt/f_version/ field.

<tag/read/

This is used to implement the <tt/read/ system call and to support
other occasions for reading files such as loading executables and
reading the quotas file.  It is expected to update the offset value
(last argument) which is usually a pointer to the <tt/f_pos/ field in
the <tt/file/ structure, except for the <tt/pread/ and <tt/pwrite/
system calls.

For file-systems on block devices, there is a routine
<tt/generic_file_read/ in <tt>mm/filemap.c</tt> which can be
used for this method providing that the inode has a <tt/readpage/
method defined.

<tag/write/
This method allows writing to a file such as when using the <tt/write/
system call.  This method does not necessarily make sure that the data
has reached the device, but may only queue it ready for writing when
convenient, depending on the semantics of the file type.

For file-systems on block devices, <tt/generic_file_write/ may be
used in conjunction with <tt/block_write_partial_page/ from
<tt>fs/buffer.c</tt> to implement this method.

<tag/readdir/

<tt/readdir/ should read directory entries from the file, which would
presumably be a directory, and return them using the <tt/filldir_t/
callback function.  This function takes the <tt/void */ handle that
was passed to <tt/readdir/, along with a pointer to a name, the length
of the name, the position in the file where this name was found, and
the inode number associated with the name.

If the <tt/filldir/ call-back returns non-zero, then <tt/readdir/
should assume that it has had enough, and should return as well.

When <tt/readdir/ reaches the end of the directory, it should return
with the value 0.  Otherwise it may return after just some of the
entries have been given to <tt/filldir/.  In this case it should return
a non-zero value.  It should return a negative number on error.

<tag/poll/

<tt/poll/ is used to implement the <tt/select/ and <tt/poll/ system
calls.
It should add a <tt/poll_table_entry/ to the <tt/poll_table_struct/
that it is passed, and do some other stuff.... I haven't looked into
this much yet.

<tag/ioctl/

This implements <em/ad hoc/ <tt/ioctl/ functionality.  If an
<tt/ioctl/ request is not one of a set of known requests (FIBMAP,
FIGETBSZ, FIONREAD), then the request is passed on to the underlying
file implementation.

<tag/mmap/
This routine implements memory mapping of files.  It can often be
implemented using <tt/generic_file_mmap/.  Its task seems to be to
validate that the mapping is allowed, and to set up the <tt/vm_ops/
field of the <tt/vm_area_struct/ to point to something appropriate.

<tag/open/
This method, if defined, is called when a new file has been opened in
an inode.  It can do any setup that may be needed on open.  This is
not used with many file-systems. One exception is <tt/coda/ which
tries to get the file cached locally at open.

<tag/flush/
<tt/flush/ is called when a file descriptor is closed.  There may be
other file descriptors open on this file, so it isn't necessarily a
final close of the file, just an interim one.  The only file-system
that currently defines this method is the <tt/NFS/ client, which
flushes out any write-behind requests that are pending.

Flush can return an error status back through the <tt/close/ system
call, and so needs to be used if errors need to be checked for.
Unfortunately, there is no way that <tt/flush/ can reliably determine
if it is the last call to flush.

<tag/release/
<tt/release/ is called when the last handle on a file is closed.  It
should do any special cleanup that is needed.

<tt/release/ cannot return any error status to anyone, and so should
really be of type <tt/void/ rather than <tt/int/.

<tag/fsync/
This method implements the <tt/fsync/  and <tt/fdatasync/ system
calls (they are currently identical).  It should not
return until all pending writes for the file have successfully reached
the device.

<tt/fsync/ may be partially implemented using
<tt/generic_buffer_fdatasync/ which will write out all dirty buffers
on all mapped pages of the inode.

<tag/fasync/
This method is called when the FIOASYNC flag of the file changes. The
<tt/int/ parameter contains the new value of this flag.  No
file-systems currently use this method.

<tag/check_media_change/
This method should check if the underlying media has changed, and
should return true if it has.  The only place outside of disc drivers
where it is called is in <tt/read_super/ when a file-system is about to
be mounted.  If it returns true at this point, all buffers associated
with the device are invalidated.

<tag/revalidate/
<tt/Revalidate/ is called after buffers have been invalidated after a
media change, as reported by <tt/check_media_change/.  So it is only
meaningful if <tt/check_media_change/ is defined.   This shouldn't be
confused with the <tt/inode:revalidate/ method which is quite
different.


<tag/lock/
This method allows a file service to provide  extra handling of POSIX
locks.  It is not used for FLOCK style locks.
This is useful particularly for network file-systems where other locks
might be held in ways only noticeable by the file-system.

When locks are being set or removed, a lock is obtained first with
this method, and then also with the standard posix lock code.  If this
method succeeds in getting a lock, but the local code fails, then the
lock will never be released...

When a process is trying to find what locks are present, information
returned by this method is used; the local locks are not checked.
</descrip>
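The <tt/readdir/ and <tt/filldir/ protocol described above can be
sketched as follows.  This is a user-space illustration that follows
the return conventions given in this document; the in-memory directory,
the simplified callback type and all <tt/toy_/ names are assumptions
made for the example and are not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified callback type; the kernel's filldir_t uses off_t and ino_t. */
typedef int (*toy_filldir_t)(void *handle, const char *name, int namelen,
                             long offset, unsigned long ino);

struct toy_dirent {
    const char *name;
    unsigned long ino;
};

/* A hypothetical in-memory directory and an open file on it. */
struct toy_dir {
    const struct toy_dirent *entries;
    int count;
};

struct toy_dirfile {
    const struct toy_dir *dir;
    long f_pos;                 /* index of the next entry to return */
};

/*
 * Hand entries to filldir, starting at f_pos.  Stop as soon as the
 * callback returns non-zero.  Return 0 at the end of the directory and
 * a non-zero value if entries remain.
 */
int toy_readdir(struct toy_dirfile *filp, void *handle, toy_filldir_t filldir)
{
    const struct toy_dir *dir = filp->dir;

    assert(filldir);
    while (filp->f_pos < dir->count) {
        const struct toy_dirent *de = &dir->entries[filp->f_pos];

        if (filldir(handle, de->name, (int)strlen(de->name),
                    filp->f_pos, de->ino) != 0)
            return 1;           /* the caller has had enough for now */
        filp->f_pos++;
    }
    return 0;                   /* end of directory */
}

/* Example callback: accept at most `room' entries. */
struct toy_buf {
    int room;
    int taken;
};

int toy_fill(void *handle, const char *name, int namelen,
             long offset, unsigned long ino)
{
    struct toy_buf *buf = handle;

    (void)name; (void)namelen; (void)offset; (void)ino;
    if (buf->taken >= buf->room)
        return -1;              /* no room left: ask readdir to stop */
    buf->taken++;
    return 0;
}
```

Because <tt/f_pos/ is only advanced after an entry has been accepted, a
later call resumes with the entry that was refused.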

<sect>Names, or dentrys<p>
The VFS layer does all management of path names of files, and converts
them into entries in the <tt/dcache/ before allowing the
underlying file-system to see them.  The one exception to this is the
target of a symbolic link, which is passed untouched to the
underlying file-system.  The underlying file-system is then expected to
interpret it.  This seems like a slightly blurred module boundary.

The <tt/dcache/ is made up of lots of <tt/struct dentry/s.  Each
<tt/dentry/ corresponds to one filename component in the file-system
and the object associated with that name (if there is one).  Each
<tt/dentry/ references its parent which must exist in the
<tt/dcache/.  <tt/dentry/s also record file-system mounting
relationships.

The <tt/dcache/ is a master of the inode cache.  Whenever a
<tt/dcache/ entry exists, the inode will also exist in the inode
cache.  Conversely whenever there is an inode in the inode cache, it
will reference a dentry in the dcache.

<sect1>Dentry structure<p>

The <tt/dentry/ structure is defined in
<tt>linux/dcache.h</tt>.

<tscreen><code>
struct qstr {
	const unsigned char * name;
	unsigned int len;
	unsigned int hash;
};

#define DNAME_INLINE_LEN 16

struct dentry {
	int d_count;
	unsigned int d_flags;
	struct inode  * d_inode;	/* Where the name belongs to - NULL is negative */
	struct dentry * d_parent;	/* parent directory */
	struct dentry * d_mounts;	/* mount information */
	struct dentry * d_covers;
	struct list_head d_hash;	/* lookup hash list */
	struct list_head d_lru;		/* d_count = 0 LRU list */
	struct list_head d_child;	/* child of parent list */
	struct list_head d_subdirs;	/* our children */
	struct list_head d_alias;	/* inode alias list */
	struct qstr d_name;
	unsigned long d_time;		/* used by d_revalidate */
	struct dentry_operations  *d_op;
	struct super_block * d_sb;	/* The root of the dentry tree */
	unsigned long d_reftime;	/* last time referenced */
	void * d_fsdata;		/* fs-specific data */
	unsigned char d_iname[DNAME_INLINE_LEN]; /* small names */
};
</code></tscreen>

<descrip>
<tag/d_count/
This is a simple reference count.

The count does NOT include the reference from the parent through the
<tt/d_subdirs/ list, but does include the <tt/d_parent/ references from
children.  This implies that only leaf nodes in the cache may have a
<tt/d_count/ of 0.  These entries are linked together by the
<tt/d_lru/ list as will be seen.

<tag/d_flags/
There are currently two possible flags, both for use by specific
file-system implementations (so why are they exposed?), and so will not
be documented here.  They are DCACHE_AUTOFS_PENDING and
DCACHE_NFSFS_RENAMED.

<tag/d_inode/
Simply a pointer to the inode related to this name.  This field may be
NULL, which indicates a negative entry, implying that the name is
known not to exist.

<tag/d_parent/
This will point to the parent <tt/dentry/.  For the root of a
file-system, or for an anonymous entry like that for a file, this
points back to the containing <tt/dentry/ itself.

<tag/d_mounts/
For a directory that has had a file-system mounted on it, this points
to the root dentry of that file-system.  For other dentries, this
points back to the dentry itself.

It is not possible to mount a file-system on a mountpoint, so there
will never be a chain of <tt/d_mounts/ entries longer than one.

<tag/d_covers/
This is the inverse of <tt/d_mounts/.  For the root of a mounted
file-system, this points to the <tt/dentry/ of the directory that it is
mounted on.  For other <tt/dentry/s, this points to the <tt/dentry/
itself.

<tag/d_hash/
This doubly linked list chains together the entries in one hash
bucket.

<tag/d_lru/
This provides a doubly linked list of unreferenced leaf nodes in the
cache. The head of the list is the <tt/dentry_unused/ global
variable. It is stored in Least Recently Used order.

When other parts of the kernel need to reclaim memory or inodes, which
may be locked up in unused entries in the dcache, they can call
<tt/select_dcache/ which finds removable entries in the <tt/d_lru/ and
prepares them to be removed by <tt/prune_dcache/.

<tag/d_child/
This <tt/list_head/ is used to link together all the children of the
<tt/d_parent/ of this <tt/dentry/.  One might think that
<tt/d_sibling/ might be a better name.

<tag/d_subdirs/
This is the head of the <tt/d_child/ list that links all the children
of this <tt/dentry/.  Of course, elements may refer to files and not
just sub-directories, so <tt/d_child/ may be a better name, but that
is already in use:-).

<tag/d_alias/
As files (and some other file-system objects) may have multiple names
in the file-system through multiple hard links, it is possible that
multiple <tt/dentry/s refer to the same inode.  When this happens, the
<tt/dentry/s are linked on the <tt/d_alias/ field.  The inode's
<tt/i_dentry/ field is the head of this list.

<tag/d_name/
The <tt/d_name/ field contains the name of this entry, together with
its hash value.  The <tt/name/ subfield may point to the <tt/d_iname/
field of the dentry or, if that isn't long enough, it will point to a
separately allocated string.

<tag/d_time/
This field is only used by underlying file-systems, which can
presumably do whatever they want.  The intention is to use it to
record something about when this entry was last known to be valid to
get some idea about when its validity might need to be checked again.

<tag/d_op/
This points to the <tt/struct dentry_operations/ with specifics for
how to handle this <tt/dentry/.

<tag/d_sb/
This points to the super-block of the file-system on which the
object referred to by the <tt/dentry/ resides.  It is not clear why
this is needed rather than using <tt/d_inode->i_sb/.

<tag/d_reftime/
This is set to the current time in <tt/jiffies/ whenever the
<tt/d_count/ reaches zero, but it is never used.

<tag/d_fsdata/
This is available for specific file-systems to use as they wish.  This
is currently only used by <tt/nfs/ to store a file handle. (Odd that,
I would have thought that the filehandle is per-inode, not per-name,
but I gather some nfs servers don't agree).

<tag/d_iname/
This stores the first 16 characters of the name of the file for easy
reference.   If the name fits completely, then <tt/d_name.name/ points
here, otherwise it points to separately allocated memory.

</descrip>
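The relationship between <tt/d_name.name/ and <tt/d_iname/ described
above can be sketched as follows.  This is a user-space illustration
only: the <tt/toy_/ names are hypothetical, <tt/malloc/ stands in for
the kernel allocator, and reserving one byte of the inline buffer for a
terminating NUL is an assumption of the sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DNAME_INLINE_LEN 16

struct toy_qstr {
    const unsigned char *name;
    unsigned int len;
};

struct toy_dentry {
    struct toy_qstr d_name;
    unsigned char d_iname[DNAME_INLINE_LEN];    /* small names */
};

/* Store `name' in the dentry; returns 0 on success, -1 if out of memory. */
int toy_set_name(struct toy_dentry *d, const char *name)
{
    size_t len;
    unsigned char *str = d->d_iname;

    assert(name);
    len = strlen(name);
    if (len + 1 > DNAME_INLINE_LEN) {   /* too long for the inline buffer */
        str = malloc(len + 1);
        if (str == NULL)
            return -1;
    }
    memcpy(str, name, len + 1);         /* copy including the NUL */
    d->d_name.name = str;
    d->d_name.len = (unsigned int)len;
    return 0;
}

/* Non-zero if the name lives in d_iname rather than in allocated memory. */
int toy_name_is_inline(const struct toy_dentry *d)
{
    return d->d_name.name == d->d_iname;
}

/* Release the name; only separately allocated names need freeing. */
void toy_free_name(struct toy_dentry *d)
{
    if (!toy_name_is_inline(d))
        free((void *)d->d_name.name);
    d->d_name.name = NULL;
}
```

The pointer comparison in <tt/toy_name_is_inline/ is what lets the
release path decide whether there is anything to free.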

<sect1>Dentry Methods<p>
Most handling of dentries is common across all file-systems, so most
operations that you would expect to do on dentries do not have methods
in the <tt/dentry_operations/ list.  Rather, it provides for a few
operations which may be handled in a non-obvious way by some
file-system implementations.  A file-system can choose to leave all of
the methods as NULL, in which case the default operation will apply.

The structure definition from <tt>include/linux/dcache.h</tt>
is:

<tscreen><code>
struct dentry_operations {
	int (*d_revalidate)(struct dentry *, int);
	int (*d_hash) (struct dentry *, struct qstr *);
	int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
	void (*d_delete)(struct dentry *);
	void (*d_release)(struct dentry *);
	void (*d_iput)(struct dentry *, struct inode *);
};
</code></tscreen>

<descrip>
<tag/d_revalidate/
This method is called whenever a path lookup uses an entry in the
dcache, in order to see if the entry is still valid.  It should return
1 if it can still be trusted, else 0.  The default is to assume a
return value of 1.

The <tt/int/ argument gives the flags relevant to this lookup, and can
include any of
LOOKUP_FOLLOW, LOOKUP_DIRECTORY, LOOKUP_SLASHOK, LOOKUP_CONTINUE.
These will be described (if at all) under the section on <tt/namei/.

This method is only needed if the file-system is likely to change
without the VFS layer doing anything, as may happen with shared file
systems.

If <tt/d_revalidate/ returns 0, the VFS layer will attempt to prune
the dentry from the dcache.  This is done by <tt/d_invalidate/ which
removes any children which are not in active use and, if that was
successful, unhashes the dentry.


<tag/d_hash/
If the file-system has non-standard rules about valid names or name
equivalence, then this routine should be provided to check for
validity and return a canonical hash.

If the name is valid, a hash should be calculated (which should be the
same for all equivalent names) and stored in the <tt/qstr/ argument.
If the name is not valid, an appropriate (negative) error code should
be returned.

The <tt/dentry/ argument is the dentry of the <bf/parent/ of the name
in question (which is found in the <tt/qstr/), as the dentry of the
name will not be complete yet. 

<tag/d_compare/
This should compare the two <tt/qstr/s (again in the context of the
<tt/dentry/ being their parent) to see if they are equivalent.  It
should return 0 only if they are the same.  Ordering is not important.

<tag/d_delete/
This is called when the reference count reaches zero, <bf/before/ the
dentry is placed on the <tt/dentry_unused/ list.

<tag/d_release/
This is called just before a <tt/dentry/ is finally freed up.  It
can be used to release the <tt/d_fsdata/ if any.

<tag/d_iput/
If defined, this is called instead of <tt/iput/ to release the inode
when the <tt/dentry/ is being discarded.  It should do the equivalent
of <tt/iput/ plus anything else that it wants.

</descrip>
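As an illustration of <tt/d_hash/ and <tt/d_compare/, here is a sketch
of the pair a case-insensitive file-system might supply.  It is
user-space code under stated assumptions: the parent <tt/dentry/
argument of the real methods is omitted, the hash function and the
forbidden character are invented for the example, and the <tt/toy_/
names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>

struct toy_qstr {
    const unsigned char *name;
    unsigned int len;
    unsigned int hash;
};

/*
 * d_hash-style method: reject invalid names, otherwise store a hash
 * which is the same for all equivalent (case-folded) names.
 * Returns 0 on success or a negative value for an invalid name.
 */
int toy_ci_hash(struct toy_qstr *q)
{
    unsigned int h = 0, i;

    assert(q->name);
    for (i = 0; i < q->len; i++) {
        if (q->name[i] == '\\')         /* hypothetical forbidden character */
            return -1;                  /* the kernel would use an errno value */
        h = h * 31 + (unsigned int)tolower(q->name[i]);
    }
    q->hash = h;
    return 0;
}

/* d_compare-style method: 0 means "same name", non-zero means different. */
int toy_ci_compare(const struct toy_qstr *a, const struct toy_qstr *b)
{
    unsigned int i;

    if (a->len != b->len)
        return 1;
    for (i = 0; i < a->len; i++)
        if (tolower(a->name[i]) != tolower(b->name[i]))
            return 1;
    return 0;
}
```

The two methods have to agree: any two names that compare as equal must
also hash to the same value, otherwise a lookup would search the wrong
hash bucket.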


<sect>Inodes and Operations<p>

Linux keeps a cache of active and recently used inodes.  There are two
paths by which these inodes can be accessed.

The first is through the dcache described above.  Each dentry in the
dcache refers to an inode, and thereby keeps that inode in the cache.

The second path is through the inode hash table.  Each inode is hashed
(to an 8 bit number) based on the address of the file-system's super-block
and the inode number.  Inodes with the same hash value are then
chained together in a doubly linked list.

Access through the hash table is achieved using the <tt/iget/ function.
<tt/iget/ is only called by individual file-system implementations when
looking up an inode (which wasn't found in the dcache), and by
<tt/nfsd/.

Basing the hash on the inode number is a bit restrictive as it assumes
that every file-system can uniquely identify a file in 32 bits.  This
is a problem at least for the NFS file-system, which would prefer to use
the 256 bit file handle as the unique identifier in the hash.

The <tt/nfsd/ usage might be better served by having the file-system
provide a filehandle-to-inode mapping function which can interpret the
filehandle however is most appropriate.
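
The hash table lookup described above can be sketched in user space as
follows.  The mixing function here is invented for the illustration and
is not claimed to be the one in the kernel; the buckets are singly
linked for brevity, where the kernel uses doubly linked lists; and all
<tt/toy_/ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_HASH_BITS 8
#define TOY_HASH_SIZE (1u << TOY_HASH_BITS)

struct toy_super_block {
    int unused;
};

struct toy_inode {
    struct toy_super_block *i_sb;
    unsigned long i_ino;
    struct toy_inode *i_next;   /* chain of inodes in the same bucket */
};

struct toy_inode *toy_hashtable[TOY_HASH_SIZE];

/* Mix the super-block address and the inode number down to 8 bits. */
unsigned int toy_hash(const struct toy_super_block *sb, unsigned long ino)
{
    unsigned long tmp = ino ^ (unsigned long)((uintptr_t)sb >> 4);

    tmp ^= tmp >> TOY_HASH_BITS;
    tmp ^= tmp >> (2 * TOY_HASH_BITS);
    return (unsigned int)(tmp & (TOY_HASH_SIZE - 1));
}

/* Add an inode to the front of its bucket. */
void toy_insert_inode_hash(struct toy_inode *inode)
{
    unsigned int h;

    assert(inode->i_sb);
    h = toy_hash(inode->i_sb, inode->i_ino);
    inode->i_next = toy_hashtable[h];
    toy_hashtable[h] = inode;
}

/* Walk one bucket; both the super-block and the inode number must match. */
struct toy_inode *toy_find_inode(const struct toy_super_block *sb,
                                 unsigned long ino)
{
    struct toy_inode *p = toy_hashtable[toy_hash(sb, ino)];

    while (p != NULL && (p->i_sb != sb || p->i_ino != ino))
        p = p->i_next;
    return p;
}
```

Since the key is only the pair (super-block, inode number), two
file-systems may use the same inode number without clashing, but one
file-system cannot distinguish objects by anything wider than
<tt/i_ino/.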

<sect1>Inode Structure<p>

<tscreen>
<code>
struct inode {
	struct list_head	i_hash;
	struct list_head	i_list;
	struct list_head	i_dentry;

	unsigned long		i_ino;
	unsigned int		i_count;
	kdev_t			i_dev;
	umode_t			i_mode;
	nlink_t			i_nlink;
	uid_t			i_uid;
	gid_t			i_gid;
	kdev_t			i_rdev;
	off_t			i_size;
	time_t			i_atime;
	time_t			i_mtime;
	time_t			i_ctime;
	unsigned long		i_blksize;
	unsigned long		i_blocks;
	unsigned long		i_version;
	unsigned long		i_nrpages;
	struct semaphore	i_sem;
	struct inode_operations	*i_op;
	struct super_block	*i_sb;
	wait_queue_head_t	i_wait;
	struct file_lock	*i_flock;
	struct vm_area_struct	*i_mmap;
	struct page		*i_pages;
	spinlock_t		i_shared_lock;
	struct dquot		*i_dquot[MAXQUOTAS];
	struct pipe_inode_info	*i_pipe;

	
	unsigned long		i_state;

	unsigned int		i_flags;
	unsigned char		i_sock;

	atomic_t		i_writecount;
	unsigned int		i_attr_flags;
	__u32			i_generation;
	union {
		....
		struct ext2_inode_info		ext2_i;
		....
		struct socket			socket_i;
		void				*generic_ip;
	} u;
};
</code>
</tscreen>

Many fields in the inode structure will have an obvious meaning to
anyone familiar with Unix file-systems, so they will be skipped.
Here I will only deal with those specific to Linux or which have
interesting usage.

<descrip>

<tag/i_hash/
The <tt/i_hash/ linked list links together  all inodes which hash to
the same hash bucket.  Hash values are based on the address of the
super-block structure, and the inode number of the inode.

<tag/i_list/
The <tt/i_list/ linked list links inodes in various states.
There is the <tt/inode_in_use/ list which lists unchanged inodes that
are in active use,
<tt/inode_unused/ which lists unused inodes, and
<tt/superblock->s_dirty/ which holds all the dirty inodes on the given file
system.

<tag/i_dentry/

The <tt/i_dentry/ list is a list of all <tt/struct dentry/s that refer
to this inode.  They are linked together with the <tt/d_alias/ field
of the <tt/dentry/.

<tag/i_version/

The <tt/i_version/ field is available for file-systems to use to record
that a change has been made since some previous time.  Typically the
<tt/i_version/ is set to the current value of the <tt/event/ global
variable which is then incremented.  The file-system code will
sometimes assign the current value of <tt/i_version/ to the
<tt/f_version/ field of an associated <tt/file/ structure. On a
subsequent use of the <tt/file/ structure, it is then possible to tell
if the inode has been changed, and if necessary, data cached in the
<tt/file/ structure can be refreshed.


<tag/i_nrpages/

This field records the number of pages, linked at <tt/i_pages/ which
are currently cached for this inode.  It is incremented by
<tt/add_page_to_inode_queue/ and decremented by
<tt/remove_page_from_inode_queue/.

<tag/i_sem/
This semaphore guards changes to the inode.  Any code that wants to
make non-atomic access to the inode (i.e. two related accesses with the possibility
of sleeping in between) must first claim this semaphore.
This includes such things as allocating and deallocating blocks and
searching through directories.

It appears that it is not possible to claim a shared lock for
read-only operations.

<tag/i_flock/

This points to the list of <tt/struct file_lock/ structures that
impose locks on this inode.

<tag/i_mmap/

All of the <tt/vm_area_struct/ structures that describe mapping of an
inode are linked together with the <tt/vm_next_share/ and
<tt/vm_pprev_share/ pointers.  This <tt/i_mmap/ pointer points into
that list.

<tag/i_pages/

This is the list of all pages in the page cache that refer to this
inode.  They are linked together on the <tt/next/ and <tt/prev/ links
in the <tt/page/ structure.

<tag/i_shared_lock/

This spin lock guards the <tt/vm_next_share/ and <tt/vm_pprev_share/
pointers in the <tt/i_mmap/ list.

<tag/i_state/
There are three possible inode state bits: I_DIRTY, I_LOCK, I_FREEING.
  <descrip>
  <tag/I_DIRTY/
  	Dirty inodes are on the per-super-block <tt/s_dirty/ list, and
  	will be written next time a sync is requested.
  <tag/I_LOCK/
        Inodes are locked while they are being created, read or written.
  <tag/I_FREEING/
        An inode has this state when the reference count and link count
  	have both reached zero.  This seems to be only used by
  	<tt/igrab/ called from the <tt/fat/ file-system.  <tt/fat/
  	does funny things with inodes.
  </descrip>
  
<tag/i_flags/
  The <tt/i_flags/ field corresponds to the <tt/s_flags/ field in the super
  block.  Many of the flags can be set system wide or per inode.  The
  per-inode flags are:

  <descrip>
  <tag/MS_NOSUID/
  	Setuid/setgid is not permitted in this file.
  <tag/MS_NODEV/
  	If this inode is a device special file, it cannot be opened.
  <tag/MS_NOEXEC/
  	This file cannot be executed.
  <tag/MS_SYNCHRONOUS/
  	All writes should be synchronous.
  <tag/MS_MANDLOCK/
  	Mandatory locking is honoured.
  <tag/S_QUOTA/
  	Quotas have been initialised.
  <tag/S_APPEND/
  	The file can only be appended to.
  <tag/S_IMMUTABLE/
        The file may not be changed, even by root.
  <tag/MS_NOATIME/
  	Do not update access time on the inode when the file is
  	accessed. 
  <tag/MS_NODIRATIME/
  	Do not update access time on directories (but still do so on
  	files unless MS_NOATIME).
  <tag/MS_ODD_RENAME/
  	Weird nfs thing.
  </descrip>

<tag/i_writecount/
If this is positive, it counts the number of clients (files or memory
maps) which have write access.  If negative, then the absolute value of this
number counts the number of VM_DENYWRITE mappings that are
current. Otherwise it is 0, and nobody is trying to write or trying to
stop others from writing.
<tag/i_attr_flags/
This is never used, and is only set by <tt/ext2_read_inode/ to be some
combination of ATTR_FLAG_SYNCRONOUS, ATTR_FLAG_APPEND,
ATTR_FLAG_IMMUTABLE and ATTR_FLAG_NOATIME.

<tag/i_generation/
The intent of <tt/i_generation/ is to be able to distinguish between
an inode before and after a delete/reuse cycle.  This is important for
NFS.  Currently, only <tt/ext2/ and <tt/nfsd/ maintain this field.

It is not clear that this should be exported to the VFS layer at all as
its use is so specific.  Rather each file-system should have the
opportunity to provide a unique file handle for a given inode, and
each can then do whatever seems best to guarantee uniqueness.

</descrip>
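The sign convention of <tt/i_writecount/ described above can be
sketched as follows.  This is a user-space illustration under stated
assumptions: a plain <tt/int/ stands in for the kernel's <tt/atomic_t/,
no locking is shown, and the <tt/toy_/ helper names are hypothetical.

```c
#include <assert.h>

struct toy_inode {
    int i_writecount;   /* >0: writers, <0: deny-write mappings, 0: neither */
};

/* Ask for write access; refused while any deny-write mapping exists. */
int toy_get_write_access(struct toy_inode *inode)
{
    if (inode->i_writecount < 0)
        return -1;      /* the kernel would return an errno such as -ETXTBSY */
    inode->i_writecount++;
    return 0;
}

/* Give up write access obtained with toy_get_write_access(). */
void toy_put_write_access(struct toy_inode *inode)
{
    assert(inode->i_writecount > 0);
    inode->i_writecount--;
}

/* Ask to keep writers out; refused while any writer exists. */
int toy_deny_write_access(struct toy_inode *inode)
{
    if (inode->i_writecount > 0)
        return -1;
    inode->i_writecount--;
    return 0;
}

/* Drop a deny-write claim obtained with toy_deny_write_access(). */
void toy_allow_write_access(struct toy_inode *inode)
{
    assert(inode->i_writecount < 0);
    inode->i_writecount++;
}
```

A single signed counter is enough because the two kinds of claim are
mutually exclusive: the count can never need to be positive and
negative at the same time.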

<sect1>Inode Methods<p>

<tscreen>
<code>
struct inode_operations {
	struct file_operations * default_file_ops;
	int (*create) (struct inode *,struct dentry *,int);
	struct dentry * (*lookup) (struct inode *,struct dentry *);
	int (*link) (struct dentry *,struct inode *,struct dentry *);
	int (*unlink) (struct inode *,struct dentry *);
	int (*symlink) (struct inode *,struct dentry *,const char *);
	int (*mkdir) (struct inode *,struct dentry *,int);
	int (*rmdir) (struct inode *,struct dentry *);
	int (*mknod) (struct inode *,struct dentry *,int,int);
	int (*rename) (struct inode *, struct dentry *,
			struct inode *, struct dentry *);
	int (*readlink) (struct dentry *, char *,int);
	struct dentry * (*follow_link) (struct dentry *, struct dentry *, unsigned int);

	int (*get_block) (struct inode *, long, struct buffer_head *, int);

	int (*readpage) (struct file *, struct page *);
	int (*writepage) (struct file *, struct page *);
	int (*flushpage) (struct inode *, struct page *, unsigned long);

	void (*truncate) (struct inode *);
	int (*permission) (struct inode *, int);
	int (*smap) (struct inode *,int);
	int (*revalidate) (struct dentry *);
};
</code>
</tscreen>

<descrip>

<tag/default_file_ops/
This points to the default table of file operations for files opened
on this inode.  When a file is opened, the <tt/f_op/ field in the file
structure is initialised from this, and then the <tt/open/ method in
the <tt/file_operations/ table is called.  That method may choose to
change the <tt/f_op/ to a different (non-default) method table.  This
is done, for example, when a device special file is opened.

<tag/create/
This, and the next 8 methods are only meaningful on directory inodes.

<tt/create/ is called when the VFS wants to create a file with the
given name (in the <tt/dentry/) in the given directory.  The VFS will
have already checked that the name doesn't exist, and the <tt/dentry/
passed will be a negative <tt/dentry/ meaning that the inode pointer
will be NULL.

<tt/Create/ should, if successful, get a new empty inode from the cache
with <tt/get_empty_inode/, fill in the fields and insert it into the
hash table with  <tt/insert_inode_hash/, mark it dirty with
<tt/mark_inode_dirty/, and instantiate it into the dcache with
<tt/d_instantiate/.

The <tt/int/ argument contains the <tt/mode/ of the file which should
indicate that it is <tt/S_IFREG/ and specify the required permission bits.

<tag/lookup/
<tt/lookup/ should check if that name (given by the <tt/dentry/)
exists in the directory (given by the <tt/inode/) and should update
the dentry using <tt/d_add/ if it does.  This involves finding and
loading the inode.

If the lookup failed to find anything, this is indicated by returning
a negative dentry, with an inode pointer of NULL.

As well as returning an error or NULL, indicating that the <tt/dentry/
was correctly updated, <tt/lookup/ can return an alternate
<tt/dentry/, in which case the passed <tt/dentry/ will be released.
I don't know if this possibility is actually used.

<tag/link/

The <tt/link/ method should make a <bf/hard/ link from the name
referred to by the first <tt/dentry/ to the name referred to by the
second <tt/dentry/, which is in the directory referred to by the
<tt/inode/.

If successful, it should call <tt/d_instantiate/ to link the inode of
the linked file to the new <tt/dentry/ (which was a negative dentry).

<tag/unlink/
This should remove the name referred to by the <tt/dentry/ from the
directory referred to by the <tt/inode/.   It should <tt/d_delete/ the
dentry on success.

<tag/symlink/
This should create a symbolic link in the given directory with the
given name having the given value.  It should <tt/d_instantiate/ the
new inode into the <tt/dentry/ on success.

<tag/mkdir/
Create a directory with the given parent, name, and mode.

<tag/rmdir/
Remove the named directory (if empty) and <tt/d_delete/ the dentry.

<tag/mknod/
Create a device special file with the given parent, name, mode, and
device number.  Then <tt/d_instantiate/ the new inode into the dentry.

<tag/rename/
The first inode and entry refer to a directory and name that exist.
<tt/rename/ should rename the object to have the parent and name given
by the second inode and dentry.  All generic checks, including that
the new parent isn't a child of the old name, have already been done.

<tag/readlink/
The symbolic link referred to by the dentry is read and the value is
copied into the user buffer (with <tt/copy_to_user/) with a maximum
length given by the <tt/int/.

<tag/follow_link/
If we have a directory (the first dentry) and a name within that
directory (the second dentry) then the <em/obvious/ result of
following the name from the directory would arrive at the second
dentry.  If an inode requires some other, non-obvious, result -- as do
symbolic links -- the inode should provide a <tt/follow_link/ method to
return the appropriate new <tt/dentry/.   The <tt/int/ argument
contains a number of <tt/LOOKUP/ flags which are described in the
section on <tt/namei/ lookups.

<tag/get_block/
This method is used to find the device block that holds a given block
of a file.  The <tt/inode/ and <tt/long/ indicate the file and block
number being sought (the block number is the file offset divided by
the file-system block size).  <tt/get_block/ should initialise the
<tt/b_dev/ and <tt/b_blocknr/ fields of the <tt/buffer_head/, and
should possibly modify the <tt/b_state/ flags.

If the <tt/int/ argument is non-zero then a new block should be
allocated if one does not already exist.

<tag/readpage/

<tt/Readpage/ is only called by <tt>mm/filemap.c</tt>.
It is called by:
<itemize>
<item> <tt/try_to_read_ahead/ from <tt/generic_file_readahead/
and <tt/filemap_nopage/

<item>
<tt/do_generic_file_read/

<item>
<tt/sys_sendfile/

<item>
<tt/filemap_nopage/

<item>
<tt/generic_file_mmap/ requires it to be non-null.
</itemize>

Thus it is needed for memory mapping of files (as you would expect),
for using the <tt/sendfile/ system call, or if
<tt/generic_file_read/ is to be used for the <tt/file/:<tt/read/
method.

<tt/readpage/ is not expected to actually read in the page. It must
arrange for the read to happen.  Clients wait for the page to be
unlocked before using the data.

<tt/readpage/ can be implemented using <tt/block_read_full_page/ which
is defined in <tt>fs/buffer.c</tt>.
This routine assumes that <tt/inode:get_block/ has been defined and
sets up buffer_heads to access the block in question.
These buffer_heads will be set to call 'end_buffer_io_async' on
completion, which will unlock the page when all buffers on the page
complete.

<tag/writepage/

<tt/Writepage/ is called from <tt>mm/filemap.c</tt> too.

It is called by <tt/do_write_page/ from <tt/filemap_write_page/,
from <tt/filemap_swapout/, <tt/filemap_sync_pte/, and from
<tt/generic_file_mmap/.

<tt/Writepage/ can be implemented using <tt/block_write_full_page/
from <tt>fs/buffer.c</tt>.  It is a close twin of
<tt/block_read_full_page/.  The important differences are:
<itemize>
<item>
<tt/block_read_full_page/ initiates a read with <tt/ll_rw_block/, while
<tt/block_write_full_page/ only sets up the buffers, but doesn't
initiate the write.
<item>
<tt/block_read_full_page/ calls <tt/inode:get_block/ with the create
flag set to zero, while <tt/block_write_full_page/ sets it to one, and
<item>
<tt/block_read_full_page/ calls <tt/init_buffer/ to get
<tt/end_buffer_io_async/ called on completion.
</itemize>

These two routines could be cleaned up a bit so that the similarity
and differences stand out more.


<tag/flushpage/

<tt/flushpage/ is called from <tt>mm/filemap.c</tt> and
<tt>mm/swap_state.c</tt>.

In <tt>mm/filemap.c</tt> it is called by <tt/truncate_inode_pages/ to
make sure no I/O is pending on a page before the page is released.
<tt>mm/swap_state.c</tt> similarly calls it when a page is being
removed from the swap cache -- all I/O must be finished.


HEREish

<tag/truncate/ TODO
<tag/permission/ TODO
<tag/smap/ TODO
<tag/revalidate/ TODO


</descrip>
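The <tt/get_block/ method described earlier in this section can be
sketched as follows.  This is a user-space illustration under stated
assumptions: the inode keeps a small in-memory block map, a single
<tt/b_mapped/ field stands in for the <tt/b_state/ flags, leaving the
buffer unmapped for a hole is a choice made for the sketch, and all
<tt/toy_/ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_BLOCKS 8

struct toy_buffer_head {
    int b_dev;
    long b_blocknr;
    int b_mapped;               /* stands in for a b_state flag */
};

struct toy_inode {
    int i_dev;
    long block_map[TOY_MAX_BLOCKS];     /* device block per file block, 0 = hole */
    long next_free;                     /* next unallocated device block */
};

/* The block number is the file offset divided by the block size. */
long toy_offset_to_block(long offset, long blocksize)
{
    return offset / blocksize;
}

/*
 * Map file block `iblock' to a device block.  If the block does not
 * exist and `create' is non-zero, allocate one.  Returns 0 on success
 * (possibly leaving bh unmapped for a hole) and negative on error.
 */
int toy_get_block(struct toy_inode *inode, long iblock,
                  struct toy_buffer_head *bh, int create)
{
    assert(bh != NULL);
    if (iblock < 0 || iblock >= TOY_MAX_BLOCKS)
        return -1;
    if (inode->block_map[iblock] == 0) {
        if (!create)
            return 0;           /* a hole, and we may not allocate */
        inode->block_map[iblock] = inode->next_free++;
    }
    bh->b_dev = inode->i_dev;
    bh->b_blocknr = inode->block_map[iblock];
    bh->b_mapped = 1;
    return 0;
}
```

A read path would call this with <tt/create/ set to zero, while a write
path would pass one so that missing blocks are allocated on demand.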


<sect>Locking<p>

All file-system operations are still protected by the big kernel lock.
The moves to make file-system code SMP safe seem to be progressing from
the bottom up, with the buffer cache and page cache essentially SMP
safe, the inode cache probably SMP safe (there is a spin lock called
<tt/inode_lock/ which must be held during inode operations) and the
dcache totally SMP-unsafe.

As file-system operations are mostly done at the dcache level, file
system operations are all under the kernel lock.

The main (only?) non-SMP locking issues that file-systems need to deal
with are consistency of the hierarchical structure in the dcache, and
consistency of any internal structure of individual files (or
file-system objects).

<sect1>Dcache consistency<p>
Changes to the dcache involve adding and deleting dentries as children
of pre-existing dentries.

Deleting entries is performed in a lazy fashion. Entries that are not
wanted any longer are unhashed so that they will not be found by future
lookups.  Once the last reference to the unwanted dentry is removed,
the dentry will be pruned by <tt/dput/.
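
The lazy deletion just described can be sketched in user space as
follows.  This is an illustration only: a flag stands in for membership
of the hash table, another flag stands in for actually freeing the
entry, and the <tt/toy_/ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dentry {
    int d_count;        /* reference count */
    int hashed;         /* non-zero while lookups can find the entry */
    int pruned;         /* set once the entry has been discarded */
};

/* Take a reference. */
struct toy_dentry *toy_dget(struct toy_dentry *d)
{
    d->d_count++;
    return d;
}

/* A lookup only finds entries that are still hashed. */
struct toy_dentry *toy_lookup(struct toy_dentry *d)
{
    return d->hashed ? toy_dget(d) : NULL;
}

/* Unhash the entry: existing users keep it, new lookups miss it. */
void toy_d_drop(struct toy_dentry *d)
{
    d->hashed = 0;
}

/*
 * Drop a reference.  An unhashed entry is pruned when the last
 * reference goes away; a hashed one is kept for later reuse.
 */
void toy_dput(struct toy_dentry *d)
{
    assert(d->d_count > 0);
    if (--d->d_count == 0 && !d->hashed)
        d->pruned = 1;
}
```

The point of the two-stage scheme is that unhashing never has to wait
for current users of the entry to finish with it.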

Adding entries is done by first adding a 'negative' entry which has a
NULL pointer for the <tt/d_inode/, and then instantiating that entry
by filling in the <tt/d_inode/ pointer appropriately.

Any operation which might change the dcache structure must hold a lock
while making the change. The protocol used in the VFS layer is that the
<tt/i_sem/ semaphore on the parent inode must be held when adding a
dentry as a child of that inode, or when changing the <tt/d_inode/
pointer in any child of the inode.  Note that unhashing or pruning
entries does not require the semaphore to be held as these can be done
atomically under the kernel lock.

The situations which require <tt/i_sem/ to be held include:
<itemize>
<item>
performing a <tt/lookup/ operation in the file-system which will
add a new child dentry - possibly a negative one.
<item>
creating a new file to instantiate a negative dentry.
<item>
unlinking a file, and hence changing a dentry into a negative dentry.
</itemize>

Many operations require a two-step process.  The first step does a
lookup of some name in a directory.  The second step performs some
operation on the name that was found, such as to instantiate it or in
some other way change the <tt/d_inode/ pointer.  This requires the
<tt/i_sem/ semaphore to be taken and released twice, once for the
lookup and once for the other step.  In order to ensure that no
incompatible operation has occurred between the two holds on the
semaphore, the VFS locking protocol requires that after the second
<tt/down(&amp;inode-&gt;i_sem)/, the operation must check that the
parent dentry really is still the parent of the child dentry.  This
can be done using code similar to the <tt/check_parent/ macro in
<tt>fs/namei.c</tt>.
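
The second step of this protocol might be sketched as follows.  This
is a user-space model under stated assumptions, not kernel code: the
<tt/toy_/ types and functions are hypothetical, the semaphore is
reduced to a simple flag, the return value of -1 is an arbitrary
error indicator, and the parent test is only in the spirit of
<tt/check_parent/ rather than a copy of it.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for illustration; NOT the kernel's structures. */
struct toy_inode {
    int i_sem_held;            /* models i_sem: 1 while "down" */
};

struct toy_dentry {
    struct toy_dentry *d_parent;
    struct toy_inode *d_inode; /* NULL means a negative entry */
};

/* Model of taking the semaphore on a directory inode. */
static void toy_down(struct toy_inode *dir)
{
    assert(!dir->i_sem_held);
    dir->i_sem_held = 1;
}

/* Model of releasing the semaphore on a directory inode. */
static void toy_up(struct toy_inode *dir)
{
    assert(dir->i_sem_held);
    dir->i_sem_held = 0;
}

/*
 * Step two of a two-step operation.  The child was found by an
 * earlier lookup, after which the parent's semaphore was released.
 * Retake the semaphore, confirm the child still belongs to this
 * parent, and only then change d_inode.
 * Returns 0 on success, -1 if the child no longer has this parent.
 */
static int toy_instantiate_checked(struct toy_dentry *parent,
                                   struct toy_dentry *child,
                                   struct toy_inode *inode)
{
    int err = -1;

    toy_down(parent->d_inode);
    if (child->d_parent == parent) {
        child->d_inode = inode;
        err = 0;
    }
    toy_up(parent->d_inode);
    return err;
}
```

The check has to happen after the second <tt/toy_down/, because the
child could have been moved at any time while the semaphore was not
held.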

<sect2>Rename<p>
A particularly interesting case for dcache locking involves the rename
operation, as this changes two entries in the one operation.

When renaming a file (or other non-directory object) it is sufficient
to lock both parent directories.  In order to avoid deadlocks, the
convention is to HERE

<sect1>File consistency<p>

<sect1>Mount table locking<p>


<sect>Credits<p>

<descrip>
<tag>Richard Gooch <tt>&lt;rgooch@atnf.csiro.au&gt;</tt></tag>
This document was inspired in part by <tt>Documentation/vfs.txt</tt>
which was written by Richard.
<tag>Tigran Aivazian <tt>&lt;tigran@sco.com&gt;</tt></tag>
Tigran has provided a number of additions and corrections, as well as
valuable encouragement.
</descrip>

<sect>Scribbled notes<p>

NOTES:

<tt/brw_page/ in <tt>fs/buffer.c</tt> is called by <tt>mm/page_io.c</tt>
and is also exported to modules.
It is for swap page I/O.

NOTES:

<tt/generic_file_read/ does readahead, and reads a full page.
<tt/generic_file_write/ calls the helper routine to write each page, or
part thereof.  The helper copies the user-space buffer into the page,
after reading in partial blocks, and sets up the buffer-headers to
allow the write.


When writing to a file, the data is copied into the page cache, and
<tt/buffer_head/s are set up and marked dirty.
Eventually, either by fsync or bdflush (possibly called by
<tt/balance_dirty/) <tt/ll_rw_block(WRITE)/ will be called.
<tt/ll_rw_block/ goes straight to the <tt/device_queue/ unless it is an
MD or LOOP device, which might intervene.
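
The separation between the write itself and the later device I/O can
be illustrated with a small user-space model.  Everything here is an
assumption made for the sketch: <tt/struct toy_page/,
<tt/struct toy_disk/, <tt/toy_file_write/ and <tt/toy_flush/ are
hypothetical names, a single dirty flag stands in for the
<tt/buffer_head/ state, and <tt/toy_flush/ stands in for whatever
eventually issues the write.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 16

/* Toy stand-ins for illustration; NOT the kernel's structures. */
struct toy_page {
    char data[TOY_PAGE_SIZE];
    int dirty;                 /* 1 when the page holds unwritten data */
};

struct toy_disk {
    char blocks[TOY_PAGE_SIZE];
    int writes;                /* number of device writes issued */
};

/*
 * Write step: copy the caller's buffer into the cached page and mark
 * it dirty.  No device I/O happens here.  Returns the bytes copied,
 * truncated at the end of the page.
 */
static size_t toy_file_write(struct toy_page *page, size_t off,
                             const char *buf, size_t len)
{
    assert(page != NULL && buf != NULL);
    if (off >= TOY_PAGE_SIZE)
        return 0;
    if (len > TOY_PAGE_SIZE - off)
        len = TOY_PAGE_SIZE - off;
    memcpy(page->data + off, buf, len);
    if (len > 0)
        page->dirty = 1;
    return len;
}

/*
 * Flush step: models the later write-out, whether triggered by an
 * explicit sync or by a background flusher.  Clean pages are skipped.
 */
static void toy_flush(struct toy_page *page, struct toy_disk *disk)
{
    assert(page != NULL && disk != NULL);
    if (!page->dirty)
        return;
    memcpy(disk->blocks, page->data, TOY_PAGE_SIZE);
    disk->writes++;
    page->dirty = 0;
}
```

In this model any number of calls to <tt/toy_file_write/ can be
absorbed by the cached page before a single device write is issued.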

SUGGESTIONS:

allow <tt/inode:follow_link/ to be NULL, implying readlink followed by
lookup_dentry.

discard i_generation in favour of getfilehandle
</article>
