<!doctype linuxdoc system>

<article>

<title>The Linux Kernel NFSD Implementation
<author>Neil Brown <tt/neilb@cse.unsw.edu.au/
<date>13th October 1999
<abstract>
One of the modules in the Linux operating system is an NFS server,
sometimes referred to as <tt/knfsd/ (the <tt/k/ is for Kernel, to
distinguish it from the user-level NFS server that is also available).

This document describes the details of the implementation current at
version 1.4.7 which is a patch against the 2.2.7 kernel, and will
possibly be included in a late 2.2.* kernel release.
</abstract>

<toc>

<sect>Introduction<p>

NFS, the Network File System from SUN Microsystems, is a system that
allows files to be shared over a network.  It is implemented using four
protocols, each built on top of ONC-RPC, the Open Network Computing
Remote Procedure Call protocol, also from SUN.

The four protocols are:
<descrip>
<tag/NFS/
The basic file access protocol which allows files to be created,
found, read and written.  It is designed to be stateless, meaning that
the only state that the server needs to store is the exposed contents
of the filesystem.  The server is not expected to retain any information
about the state of the clients.

As well as being stateless, the NFS protocol is <bf/idempotent/. This
means that if a particular request arrives twice, the second will
effectively be a NO-OP.  This allows it to be used over an
"unreliable" protocol such as UDP/IP.

<tag/MOUNTD/
The mountd protocol is used to gain initial access to a filesystem
which can subsequently be accessed by NFS.  The mountd protocol does
expect the server to retain some state about clients, as it contains
an unmount request to tell a server that the client is no longer using
a filesystem.

<tag/SM/
The Status Monitor protocol is used for monitoring the state of different nodes in
a network.  The particular intent is that any node can register an
interest in another node and will be told in a timely fashion when
that node restarts.

<tag/NLM/
The Network Lock Manager protocol provides file and record locking.
It relies on SM to determine when clients have restarted, and so
released all of their locks, and when servers have restarted, and so
need to be reminded of the current locks.
</descrip>

There is another protocol, the Kernel Lock Manager, or KLM, that some
NFS implementations use to communicate between the kernel and a
user-level locking daemon.  This protocol is not used by the Linux NFS
implementation.

The <tt/knfsd/ implementation in Linux supports the NFS and NLM
protocols completely within the kernel, the NFS protocol by code in
<tt>linux/fs/nfsd</tt> and the NLM protocol by code in
<tt>linux/fs/lockd</tt>.  The <tt/lockd/ code contains a client for
the SM protocol, but the server is provided by a user-level process
called <tt/statd/.  The MOUNTD protocol is served by a user-level
process called <tt/mountd/.  There is a system call interface to allow
<tt/mountd/ to communicate information about exported filesystems to
the kernel level NFS server.

There are two versions of the NFS protocol that are commonly in use
today, version 2 and version 3.  The NFS server implementation in
Linux currently only supports version 2.  Unless otherwise stated, all
statements about the protocols should be taken as referring to version
2.

<sect>Understanding File Handles<p>

The file handle is a central part of all NFS and related operations.
A file handle is an opaque string of bits which is used to uniquely
identify a file or other filesystem object.  In version 2, the file
handle is 256 bits long (32 bytes).  In version 3, it is variable in
length, up to 512 bits long (64 bytes).

The fact that the file handle is opaque means that the client should
not attempt to understand anything about the file from inspecting the
contents of the file handle.  The only operations that the client
should perform on the file handle are to copy it, and to compare it
for equality with another file handle.

This leaves the server free to encode information about the file into
the file handle in whatever way it sees fit.  As Unix allows files to
be moved around the filesystem without changing their intrinsic
identity, it is important that the NFS server only encodes
information about the file, and not about its location in the
filesystem hierarchy, otherwise confusion can result.

The traditional contents of a file handle are:
<itemize>
<item>
some identifier for the file system such as the device number that the
file system is mounted from,
<item>
some identifier for the inode within the file system, such as the inode
number, and
<item>
a field to indicate when an inode has been reused, typically
called a generation number for the inode.
</itemize>

The server is free to use this sort of information, or anything else
that will serve the same purpose.  The important thing for the server
is that it must be able to generate a unique file handle for each file
that preferably does not change across restarts, that it must be able
to reliably find a file given the file handle, and that it must be
able to determine if a given file handle is still valid (not old and
not fake).
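
As a rough illustration of these three traditional components, a handle
could be laid out and checked along the following lines.  This is only
a sketch: the structure <tt/trad_fh/ and the three helper functions are
hypothetical names invented for this example and are not taken from any
real NFS server.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FH_SIZE 32	/* an NFSv2 file handle is 32 bytes */

/* Hypothetical layout: the three traditional components plus padding. */
struct trad_fh {
	uint32_t fsid;		/* identifies the file system (e.g. device) */
	uint32_t ino;		/* inode number within that file system */
	uint32_t generation;	/* changes when the inode is reused */
	uint8_t  pad[FH_SIZE - 3 * sizeof(uint32_t)];
};

/* Fill in a handle; zeroing first keeps the padding comparable. */
static void trad_fh_init(struct trad_fh *fh, uint32_t fsid,
			 uint32_t ino, uint32_t generation)
{
	assert(fh != NULL);
	memset(fh, 0, sizeof(*fh));
	fh->fsid = fsid;
	fh->ino = ino;
	fh->generation = generation;
}

/* The only comparison a client may make: byte-for-byte equality. */
static int trad_fh_equal(const struct trad_fh *a, const struct trad_fh *b)
{
	return memcmp(a, b, FH_SIZE) == 0;
}

/* Server-side check: the handle is stale if the inode has been reused. */
static int trad_fh_is_stale(const struct trad_fh *fh,
			    uint32_t current_generation)
{
	return fh->generation != current_generation;
}
```

Note that nothing in this layout records where the file lives in the
directory hierarchy, so a rename does not invalidate the handle.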

With this context, we can now look at the file handles used by
<tt/knfsd/ in Linux.

From <tt>linux/include/linux/nfsd/nfsfh.h</tt> we find that the file
handle is built from a structure containing:

<tscreen>
<code>
struct nfs_fhbase {
	struct dentry *	fb_dentry;	/* dentry cookie */
	__u32		fb_ino;		/* our inode number */
	__u32		fb_dirino;	/* dir inode number */
	__u32		fb_dev;		/* our device */
	__u32		fb_xdev;
	__u32		fb_xino;
	__u32		fb_generation;
};
</code>
</tscreen>
and some NUL padding.
(Note to code readers: each of these is actually referenced with a
<tt/fh/ prefix rather than the <tt/fb/ shown here.  See <tt/nfsfh.h/
for the reason. <tt/fb_dentry/ doesn't only have a different prefix, it is
in fact spelt <tt/fh_dcookie/.)

The fields have the following usage:
<descrip>
<tag/fb_dentry/
A previous version stored the address of a kernel structure related to
the file in this entry.  This is not stable over reboots and turns out
not to be needed.  The current version always sets this to the value
<tt/0xfeebbaca/.  This value is not checked, only set.

<tag/fb_ino/
This field stores the inode number of the file - cast to a 32 bit
number if needed.   This is stored in native byte order (as are all
values, as the client never looks at them).

<tag/fb_dirino/
This field stores the inode of the directory that contains the file.
Given that files can be in multiple directories and can move between
directories, this is neither unique nor stable.  It is used to help
locate the file within the <tt/dcache/ as will be explained later.

<tag/fb_dev/
This field contains the device that the file system was mounted from,
which within Linux is a reasonably unique identifier of the
filesystem.  A number of file systems, such as <tt/procfs/ and
<tt/smbfs/ do not have associated devices, so a unique anonymous
device is allocated at mount time.  For such file systems, this field
is not guaranteed to be stable across restarts.  However these file
systems are not normally exported (and knfsd actually refuses to
provide access to some of them. Possibly it should reject all of
them). 

<tag/fb_xdev/
This field is the device number of the directory that was exported.
<tag/fb_xino/
This stores the inode number of the directory that was exported (which
may not be the directory that was mounted).  This field together with
<tt/fb_xdev/ and the IP address of the client are used to determine
whether the file handle is actually valid for that client.

<tag/fb_generation/
The generation number is used to differentiate between an inode before
and after it has been deleted and reused.  The generation number
changes in a non-predictable way whenever the inode is reused.
</descrip>


The most complicated part of dealing with a file handle, for Linux, is
in finding the file given the file handle.  This is in part due to the
presence of the <bf/dcache/.

In order to read or write to a file within Linux, a <tt/struct file/
structure is needed.  These structures do not refer to the inode
directly, but do so through an entry in the dcache, a <tt/struct
dentry/.  Also, directory operations like lookup and create require a
<tt/dentry/.

As the dcache always contains a prefix of the filesystem directory
structure, finding a dentry requires making sure that the parent
directory (or a parent directory, as some files have multiple parents)
and all of its ancestors are also in the dcache. It is for this reason
that the file handle contains the inode number of a containing
directory.


<sect1>Interpreting a file handle - OLD<p>

The filehandle code, in <tt>linux/fs/nfsd/nfsfh.c</tt>, goes through
some hoops to get hold of an appropriate dentry, as will now be
described.


When a new handle arrives in a request, a <tt/struct svc_fh/ is
created and passed to <tt/fh_verify/ to check that it is valid, and to
find the dentry.  <tt/fh_verify/ calls <tt/find_fh_dentry/ which does
the real work.  This goes through several stages to try to find the
dentry.

Firstly it looks in a cache of recent file handles using
<tt/find_dentry_in_fcache/.  This is currently a no-op as it depended
on the <tt/fb_dentry/ field of the file handle, which is now
deprecated.

The next stage involves checking to see if the inode is currently in
the inode cache, using <tt/iget_in_use/.  If it is, then a valid
dentry should also be in the dcache, and can be found because each
inode points to a list of dentries that refer to it.  The code loops
through all dentries until it finds one with a parent that has an inode
number the same as <tt/fb_dirino/.  Why it cares as long as it has a
dentry, I'm not sure.
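
The alias scan described above can be pictured with the following
sketch.  The <tt/model_inode/ and <tt/model_dentry/ structures are
simplified stand-ins invented for this example; the real kernel
structures and list handling are more involved.

```c
#include <assert.h>
#include <stddef.h>

struct model_inode;

/* Simplified stand-in for the kernel's struct dentry. */
struct model_dentry {
	struct model_dentry *d_parent;	   /* containing directory */
	struct model_inode  *d_inode;	   /* inode this name refers to */
	struct model_dentry *d_next_alias; /* next dentry for same inode */
};

/* Simplified stand-in for the kernel's struct inode. */
struct model_inode {
	unsigned long	     i_ino;
	struct model_dentry *i_aliases;	   /* all dentries for this inode */
};

/*
 * Return the first dentry of 'inode' whose parent directory has inode
 * number 'dirino' (the fb_dirino from the handle), or NULL if none.
 */
static struct model_dentry *
model_find_alias(struct model_inode *inode, unsigned long dirino)
{
	struct model_dentry *d;

	assert(inode != NULL);
	for (d = inode->i_aliases; d != NULL; d = d->d_next_alias) {
		if (d->d_parent != NULL && d->d_parent->d_inode != NULL &&
		    d->d_parent->d_inode->i_ino == dirino)
			return d;
	}
	return NULL;
}
```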

If no appropriate dentry is found, it proceeds to look in the rename
cache.  Every time a file is renamed into a different directory, a
record is kept in a cache of the inode number and the old and new
directory inode numbers.  Of course, this cache does not survive a
restart.   If the inode/dir is found in the rename cache, then the
list of dentries is rechecked looking for the new directory inode
number.

If it still hasn't found an appropriate dentry, it continues to stage
3, which tries to find the full path name of the file, and then create
the dcache entry from that path name.
The path name is found in much the same way that the Unix <tt/pwd/
command works to find a path name.
<enum>
<item>
It fakes up a temporary dentry for the directory and reads through
looking for an entry for the given file inode. At the same time it
records the inode number for the parent of the directory from the
"<tt/../" entry.
<item>
Then it fakes up a dentry for the parent directory which it now has
an inode number for and repeats the process.
</enum>
One reason that these faked up temporary entries cannot be used more
generally is that the <bf/vfs/ level will not look up "<tt/../" through
them properly as they don't have a valid <tt/d_parent/ entry.
(They could be used for files though).

If this fails, then the file must have been renamed to a different
directory a long time ago.

Stage 4 uses a path name cache to try to find the path name of the
parent inode.  I'm not really sure of the point of this, and won't go
into the detail yet.  The name cache is maintained by
<tt/add_to_path_cache/ and <tt/get_path_entry/.

Stage 5 is simply to fail.

<sect1>Interpreting a file handle - NEW<p>

<it>Note: This section assumes my patches to 1.4.7</it>
<p>

As knfsd looks up a file handle for every request, and will often
receive a number of requests on the same file handle (e.g. several
read requests on the one file), it is well worthwhile doing some
caching to improve lookup speed.  Fortunately the dcache and icache
provide adequate caching, and knfsd does not need to do any of its
own.

REMOVE THIS?
When a file handle arrives, it is passed to <tt/fh_verify/ to check
that it is valid, and that the user has appropriate access.  After
some simple consistency checks (<tt/fb_xdev/ must equal <tt/fb_dev/,
and the <tt/fb_xdev/,<tt/fb_xino/ pair must be exported to this host),
<tt/find_fh_dentry/ is called to find an entry in the dcache for the
inode referred to in the handle.

Provided that the underlying filesystem supports the <tt/read_inode/
operation, the inode with the given inode number is found using
<tt/iget_in_use/.  (If <tt/read_inode/ is not supported, then the
filesystem simply cannot be NFS-exported.)

If this inode was already in the icache, then it will have a pointer
to a valid dcache entry, and the lookup is complete.

If this inode was not in the cache (either it has been flushed to make
room or the server has been restarted since the inode was last used),
then a dcache entry needs to be created.

Here we need to worry about the <tt/NFSMNT_SUBTREECHECK/ export
option.
If this option is set, then we want to make sure that every file
accessed is a descendant of the export point.  When exporting whole
filesystems, this checking is un-necessary and can be avoided by
clearing this option.

If this option is not set, and the inode being searched for is not a
directory, then the dcache entry that is returned does not need to be
located in the dcache tree, as its parent and child pointers will never
be checked.  So for non-directories with no <tt/SUBTREECHECK/
requirement, we simply create a dcache entry with <tt/d_alloc_root/
and return it.

For other objects we need to find a valid location in the dcache
tree.  This requires being able to find the (or a) parent for this
object, and then a parent for that parent and so on until an object in
the tree is found.  For directories we can always find a parent by
reading the directory and looking for the '<tt/../' entry.  For
non-directories we need the <tt/fb_dirino/ entry  in the file handle.

Given this parent inode number, <tt/find_fh_dentry/ walks up the
directory tree, building a dcache path as it goes, until it finds an
ancestor that is already in the dcache.  It then splices the path into
that ancestor, and returns the base of the path which is the dcache
entry of the file that is wanted.
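
The walk towards an ancestor that is already cached can be modelled
roughly as below.  This is a user-space sketch, not the kernel code:
<tt/model_node/, <tt/model_find/ and <tt/model_splice/ are hypothetical
names, and a flat table of inode numbers stands in for reading
"<tt/../" entries and <tt/fb_dirino/.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an object as seen by the lookup code. */
struct model_node {
	unsigned long ino;		/* inode number of this object */
	unsigned long parent_ino;	/* from ".." or from fb_dirino */
	int	      in_dcache;	/* non-zero if a dentry exists */
};

static struct model_node *model_find(struct model_node *tab, size_t n,
				     unsigned long ino)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tab[i].ino == ino)
			return &tab[i];
	return NULL;
}

/*
 * Walk from start_ino towards the root until an ancestor that is already
 * in the dcache is found, then mark every object on the way as present.
 * Returns the number of entries that had to be built, or -1 if no
 * connected ancestor could be reached.
 */
static int model_splice(struct model_node *tab, size_t n,
			unsigned long start_ino)
{
	struct model_node *node;
	unsigned long ino = start_ino;
	size_t depth = 0, i;

	assert(tab != NULL);

	/* Pass 1: find how far away the nearest cached ancestor is. */
	for (;;) {
		node = model_find(tab, n, ino);
		if (node == NULL || depth > n)
			return -1;	/* unknown inode, or a loop */
		if (node->in_dcache)
			break;
		ino = node->parent_ino;
		depth++;
	}

	/* Pass 2: build the path and splice it onto that ancestor. */
	ino = start_ino;
	for (i = 0; i < depth; i++) {
		node = model_find(tab, n, ino);
		node->in_dcache = 1;
		ino = node->parent_ino;
	}
	return (int)depth;
}
```

A second lookup of the same object then finds it cached and builds
nothing, which is the reason knfsd can rely on the dcache rather than
keeping a cache of its own.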

If <tt/SUBTREECHECK/ing is required, and a file is renamed to a
different directory, then accessing it with the old file handle will
only work as long as the entry for the file is still in the dcache.
Once it expires, access from that filehandle will no longer work.  It
would be possible to encourage entries for renamed files to stay in
the dcache longer, but we would need some data on how long entries
tend to stay anyway, and how often moved files are accessed by their
old filehandle, to see if there was any value in this.

<sect1>Using the File Handle<p>

knfsd stores active file handles in a <tt/struct svc_fh/ which looks
like:

<tscreen>
<code>
typedef struct svc_fh {
	struct knfs_fh		fh_handle;	/* FH data */
	struct dentry *		fh_dentry;	/* validated dentry */
	struct svc_export *	fh_export;	/* export pointer */
	size_t			fh_pre_size;	/* size before operation */
	time_t			fh_pre_mtime;	/* mtime before oper */
	time_t			fh_pre_ctime;	/* ctime before oper */
	unsigned long		fh_post_version;/* inode version after oper */
	unsigned char		fh_locked;	/* inode locked by us */
	unsigned char		fh_dverified;	/* dentry has been checked */
} svc_fh;
</code>
</tscreen>

The first three entries are (hopefully) fairly obvious.  The
<tt/fh_handle/ is the raw handle that came over the wire and
<tt/fh_dentry/ and <tt/fh_export/ are the dcache entry and export
entry that have been derived from it.

<tt/fh_pre_size/, <tt/fh_pre_mtime/ and <tt/fh_pre_ctime/ are intended
for encoding <bf>Weak Cache Consistency</bf> data for NFS
version 3.  The values do not seem to be set significantly at present.
<tt/fh_post_version/ is used to determine if this <bf/wcc/ data
needs to be returned or not.

<tt/fh_locked/ is set when the <tt/i_sem/ semaphore on the inode is
taken <tt/down/, to make sure that it gets put back <tt/up/ when the
file handle is released.  <tt/fh_dverified/ records that the
<tt/fh_dentry/ is valid, to make sure that it is released when the
filehandle is released.
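
The role of these two flags at release time can be sketched as follows.
This is a model only: the <tt/model_/ structures and functions are
invented for this example, with a plain integer standing in for the
<tt/i_sem/ semaphore and another for the dentry reference count.

```c
#include <assert.h>
#include <stddef.h>

struct model_lock { int held; };	/* stands in for i_sem */
struct model_ref  { int count; };	/* stands in for a dentry refcount */

struct model_svc_fh {
	struct model_ref  *fh_dentry;	 /* validated dentry */
	struct model_lock *fh_lock;	 /* lock on the underlying inode */
	unsigned char	   fh_locked;	 /* inode locked by us */
	unsigned char	   fh_dverified; /* dentry has been checked */
};

/* Take the inode lock once and remember that we did. */
static void model_fh_lock(struct model_svc_fh *fhp)
{
	assert(fhp != NULL);
	if (!fhp->fh_locked) {
		fhp->fh_lock->held = 1;
		fhp->fh_locked = 1;
	}
}

/* Release whatever the flags say we hold; safe to call twice. */
static void model_fh_put(struct model_svc_fh *fhp)
{
	assert(fhp != NULL);
	if (fhp->fh_locked) {
		fhp->fh_lock->held = 0;
		fhp->fh_locked = 0;
	}
	if (fhp->fh_dverified) {
		fhp->fh_dentry->count--;
		fhp->fh_dverified = 0;
	}
}
```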

<sect>Exporting file trees<p>

<it>Note: This section assumes my patches to 1.4.7</it>
<p>

The kernel nfs server maintains a list of file systems that are
currently mounted by some client.  As clients mount filesystems using
the MOUNTD protocol, the MOUNTD server tells the kernel about them.

As the kernel needs to know about all currently mounted filesystems, it
is necessary for this information to be safely stored across
restarts.  This is the responsibility of <tt/mountd/ which records all
current mounts in <tt>/var/lib/nfs/rmtab</tt> and of <tt/exportfs/
which reminds the kernel of all exports deduced from <tt/rmtab/.

Each client is known to the kernel by a <tt/struct svc_client/ which
is defined in <tt>linux/include/linux/nfsd/export.h</tt> to be

<tscreen>
<code>
struct svc_client {
	struct svc_client *	cl_next;
	char			cl_ident[NFSCLNT_IDMAX];
	int			cl_idlen;
	int			cl_naddr;
	struct in_addr		cl_addr[NFSCLNT_ADDRMAX];
	struct svc_uidmap *	cl_umap;
	struct svc_export *	cl_export[NFSCLNT_EXPMAX];
};
</code>
</tscreen>

The <tt/cl_ident/ field stores a string (at most 1023 chars) which is
used as a key when accessing this client information via the
<tt/nfs_ctl/ system call.  This is typically the hostname of the
client.

The <tt/cl_addr/ array contains up to 16 internet addresses for the
client, which should all be considered to be equivalent.  This is used
as a key to the structure when an NFS request arrives over the
network.  The kernel does not check, when a client is created or
changed, that the addresses are not already in use by another client,
so there is a possibility for confusion if the user-level processes
allow it.   Access checks are always done against the most recently
registered client to have a particular address.
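
One way to picture this behaviour is a client list where new
registrations are pushed onto the head, so that a scan from the head
meets the most recent registration first.  The sketch below assumes
exactly that; the <tt/model_/ names are hypothetical and addresses are
plain 32 bit numbers rather than <tt/struct in_addr/.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_ADDRMAX 16	/* addresses per client, as in cl_addr */

struct model_client {
	struct model_client *cl_next;
	const char	    *cl_ident;
	int		     cl_naddr;
	uint32_t	     cl_addr[MODEL_ADDRMAX];
};

/* Assumed behaviour: the newest client goes to the head of the list. */
static void model_client_register(struct model_client **head,
				  struct model_client *clp)
{
	assert(head != NULL && clp != NULL);
	clp->cl_next = *head;
	*head = clp;
}

/* Find the client owning 'addr'; the newest registration wins. */
static struct model_client *model_getclient(struct model_client *head,
					    uint32_t addr)
{
	struct model_client *clp;
	int i;

	for (clp = head; clp != NULL; clp = clp->cl_next)
		for (i = 0; i < clp->cl_naddr; i++)
			if (clp->cl_addr[i] == addr)
				return clp;
	return NULL;
}
```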

The <tt/cl_export/ array is a hash table (size 16) which stores the
information about the different filesystems which are exported.
The <tt/struct svc_export/ structure is also defined in
<tt>linux/include/linux/nfsd/export.h</tt> to be

<tscreen>
<code>
struct svc_export {
	struct svc_export *	ex_next;
	char			ex_path[NFS_MAXPATHLEN+1];
	struct svc_export *	ex_parent;
	struct svc_client *	ex_client;
	int			ex_flags;
	struct dentry *		ex_dentry;
	kdev_t			ex_dev;
	ino_t			ex_ino;
	uid_t			ex_anon_uid;
	gid_t			ex_anon_gid;
};
</code>
</tscreen>

The <tt/ex_next/ entry is used to chain together entries in the same
hash bucket.  <tt/ex_client/ is simply a pointer back to the client
which owns the export entry, and <tt/ex_path/, <tt/ex_dentry/,
<tt/ex_dev/, and <tt/ex_ino/ simply store different information about
the exported directory.
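
A lookup in such a per-client hash table might look like the sketch
below.  The hash function (the low bits of the device number) is an
assumption made for this example, and the <tt/model_/ names are
hypothetical; only the bucket-and-chain shape is taken from the
description above.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_EXPMAX 16		/* size of the per-client hash table */

struct model_export {
	struct model_export *ex_next;	/* next entry in the same bucket */
	unsigned int	     ex_dev;
	unsigned long	     ex_ino;
};

struct model_client {
	struct model_export *cl_export[MODEL_EXPMAX];
};

/* Assumed hash: the low bits of the device number pick the bucket. */
static unsigned int model_export_hash(unsigned int dev)
{
	return dev & (MODEL_EXPMAX - 1);
}

/* Chain a new export entry onto the front of its bucket. */
static void model_export_add(struct model_client *clp,
			     struct model_export *exp)
{
	unsigned int h = model_export_hash(exp->ex_dev);

	exp->ex_next = clp->cl_export[h];
	clp->cl_export[h] = exp;
}

/* Find the export entry for a (device, inode) pair, or NULL. */
static struct model_export *model_export_get(struct model_client *clp,
					     unsigned int dev,
					     unsigned long ino)
{
	struct model_export *exp;

	assert(clp != NULL);
	for (exp = clp->cl_export[model_export_hash(dev)];
	     exp != NULL; exp = exp->ex_next)
		if (exp->ex_dev == dev && exp->ex_ino == ino)
			return exp;
	return NULL;
}
```

With a hash like this, two different devices can share a bucket, so the
lookup has to compare the full device number as well as the inode.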

The <tt/ex_parent/, if non-NULL, points to another <tt/svc_export/ for
this client that is an ancestor of this directory in the filesystem.
It will always point to the closest such ancestor.  Many export
entries will not have a parent.

As far as I can tell, the <tt/ex_parent/ is maintained but never used.
It seems to be related to two checks  in <tt/exp_export/ (in
<tt>linux/fs/nfsd/export.c</tt>) which check two rules about exporting
related directories. They appear to be:

<descrip>
<tag/Rule 2/
If an ancestor directory of a given directory is exported to a given
client, then the given directory can also be exported <bf/only/ if it
is on a different filesystem/device.
<tag/Rule 3/
If any descendant directory of a given directory, which is on the same
file system (or a file system where the device has the same hash
value!) is already exported to a client, then the given directory
cannot also be exported to that client.
</descrip>

These rules are simple inverses of each other and are presumably
intended to remove any ambiguity concerning which export attributes
(flags and anon ids) should be applied to a given file.  If either of
these rules were violated then there would be two export points in the
one file system with one being a child (descendant) of the other.  This would
create ambiguity as to which export rules should apply to children of the
junior export point.
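
Taken together, the two rules amount to refusing any new export that is
an ancestor or a descendant of an existing export on the same device.
The sketch below expresses that using path names, which is a
simplification chosen for this example (the kernel works with dentries
and device hash values); all names in it are hypothetical.

```c
#include <stddef.h>
#include <string.h>

struct model_export {
	const char  *path;	/* exported directory, no trailing slash */
	unsigned int dev;	/* device the directory lives on */
};

/* Non-zero if 'anc' is a proper ancestor directory of 'path'. */
static int model_is_ancestor(const char *anc, const char *path)
{
	size_t n = strlen(anc);

	if (strncmp(anc, path, n) != 0)
		return 0;
	if (n == 1 && anc[0] == '/')
		return path[1] != '\0';	/* "/" is above everything else */
	return path[n] == '/';
}

/* Non-zero if 'cand' may be exported alongside the 'cur' exports. */
static int model_may_export(const struct model_export *cur, size_t n,
			    const struct model_export *cand)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (cur[i].dev != cand->dev)
			continue;	/* different device: no conflict */
		if (model_is_ancestor(cur[i].path, cand->path))
			return 0;	/* rule 2: ancestor already exported */
		if (model_is_ancestor(cand->path, cur[i].path))
			return 0;	/* rule 3: descendant already exported */
	}
	return 1;
}
```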

<sect1>Other export details<p>

The <tt/cl_umap/ field in each client is currently not used.  The
intent seems to be to allow the client and server to use different
uids and gids for the same entity, and the nfs server would do the
appropriate mapping. Presumably this would be a cache which would be
updated by a call-back to a user-space daemon on a cache-miss.

The <tt/ex_flags/ field can have the following bit-flags set:
<descrip>
<tag/READONLY/
All write requests are denied with NFSERR_ROFS.

<tag/INSECURE_PORT/
Requests from insecure ports (1024 or above) are permitted.

<tag/ROOTSQUASH/
All accesses by uid 0 are mapped to appear to be by the uid given in
<tt/ex_anon_uid/.  Similarly, accesses by gid 0 are treated as accesses
by gid <tt/ex_anon_gid/.

<tag/ALLSQUASH/
All accesses are treated as though they came from uid <tt/ex_anon_uid/
and gid <tt/ex_anon_gid/.

<tag/GATHERED_WRITES/
This enables a hack which attempts to allow the underlying filesystem
to gather writes together for more efficient use of the low-level
device.  If a write is requested while it appears that another write
is pending, the first write sleeps for 10 msecs before flushing the
write, to give the filesystem a chance to schedule the two (or
more) writes together.

<tag/UIDMAP/
This flag indicates that the currently unimplemented <tt/cl_umap/
map should be used.

<tag/KERBEROS/
This unimplemented option indicates that Kerberos authentication
should be used for each request.

<tag/SUNSECURE/
This is another unimplemented option.  It presumably indicates that
SUN's Secure RPC is being used to authenticate requests.

<tag/CROSSMNT/
Another unimplemented option.  Presumably it is intended that
file trees exported with this option allow the client to cross
mount points.
</descrip>
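
The effect of the two squash flags on the credentials of a request can
be sketched as below.  The flag values and the <tt/model_/ names are
made up for this example and do not match the kernel's definitions;
only the mapping rules come from the descriptions above.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_ROOTSQUASH 0x1	/* hypothetical bit values */
#define MODEL_ALLSQUASH	 0x2

struct model_cred { unsigned int uid, gid; };

struct model_export {
	int	     ex_flags;
	unsigned int ex_anon_uid;
	unsigned int ex_anon_gid;
};

/* Return the credentials the request should be treated as having. */
static struct model_cred model_squash(const struct model_export *exp,
				      struct model_cred cred)
{
	assert(exp != NULL);
	if (exp->ex_flags & MODEL_ALLSQUASH) {
		/* Everyone becomes the anonymous user and group. */
		cred.uid = exp->ex_anon_uid;
		cred.gid = exp->ex_anon_gid;
	} else if (exp->ex_flags & MODEL_ROOTSQUASH) {
		/* Only uid 0 and gid 0 are remapped. */
		if (cred.uid == 0)
			cred.uid = exp->ex_anon_uid;
		if (cred.gid == 0)
			cred.gid = exp->ex_anon_gid;
	}
	return cred;
}
```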

There is another option that affects exported filesystem behaviour
that is implemented as a compile time option, <tt/CONFIG_NFSD_SUN/,
rather than a runtime option.
Normally, if a directory within an exported filesystem is mounted on,
then that directory and hence everything beneath it is not accessible
via NFS.  If <tt/CONFIG_NFSD_SUN/ has been selected at compile time,
then the server acts like SUN Microsystems servers and allows the
entire file system to be viewed, independent of mounts.

<sect1>User level assistants<p>

There are two user level programs which assist with maintaining the
client and export lists in the kernel.  They are <tt/mountd/ and
<tt/exportfs/; the first is a daemon, the second is an admin tool.

These two programs read the <tt>/etc/exports</tt> file and maintain
the files <tt/xtab/, <tt/etab/ and <tt/rmtab/ in
<tt>/var/lib/nfs</tt> as well as the in-kernel client and export
lists.

<sect2>The <tt>/etc/exports</tt> file<p>

The <tt>/etc/exports</tt> file lists file trees, clients that
they can be mounted by, and export flags (such as read-only) that are
imposed on that client for that file tree.

Client names in <tt>/etc/exports</tt> can be:
<descrip>
<tag/ANONYMOUS/
An empty client name will match any client.
<tag/NETGROUP/
A client name starting with an '@' will match any client host in the
netgroup given by the rest of the client name.
<tag/WILDCARD/
If a client name contains '*', '?', or '[', then it is assumed to be a
wild carded host name.  Any host with a fully qualified domain name
which matches the client name using <bf/glob/ matching will match the
client.

<tag/SUBNETWORK/
If a client is a dotted-quad IP address followed by a slash and a
number of bits (e.g. <tt>129.94.0.0/16</tt>) then any client with an IP
address in that subnet will match.
<tag/FQDN/
Otherwise clients must be host names and are looked up with standard
host name resolution procedures.
</descrip>
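
The classification above, applied in the order listed, could be written
along these lines.  This is a sketch and not the code used by
<tt/exportfs/ or <tt/mountd/: the enum, the function names and the use
of <tt/sscanf/ to recognise a subnet are all choices made for this
example.  Addresses are held in host byte order.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum client_kind {
	CK_ANONYMOUS, CK_NETGROUP, CK_WILDCARD, CK_SUBNETWORK, CK_FQDN
};

/* Parse "a.b.c.d/bits"; returns 1 and fills *net and *bits on success. */
static int parse_subnet(const char *name, uint32_t *net, unsigned int *bits)
{
	unsigned int a, b, c, d, n;
	char extra;

	if (sscanf(name, "%u.%u.%u.%u/%u%c", &a, &b, &c, &d, &n, &extra) != 5)
		return 0;
	if (a > 255 || b > 255 || c > 255 || d > 255 || n > 32)
		return 0;
	*net = ((uint32_t)a << 24) | ((uint32_t)b << 16) |
	       ((uint32_t)c << 8) | (uint32_t)d;
	*bits = n;
	return 1;
}

/* Decide which kind of client name this is, testing in the order above. */
static enum client_kind classify_client(const char *name)
{
	uint32_t net;
	unsigned int bits;

	if (name[0] == '\0')
		return CK_ANONYMOUS;
	if (name[0] == '@')
		return CK_NETGROUP;
	if (strpbrk(name, "*?[") != NULL)
		return CK_WILDCARD;
	if (parse_subnet(name, &net, &bits))
		return CK_SUBNETWORK;
	return CK_FQDN;
}

/* Non-zero if 'addr' lies inside the subnet net/bits. */
static int addr_in_subnet(uint32_t addr, uint32_t net, unsigned int bits)
{
	uint32_t mask = bits ? ~(uint32_t)0 << (32 - bits) : 0;

	return (addr & mask) == (net & mask);
}
```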

The <tt>/etc/exports</tt> file records the default intention of what
should be exported.  It is read by <tt/exportfs/, usually at startup
time, and the information is presented to <tt/mountd/ through the
<tt>/var/lib/nfs/etab</tt> file.  Thus this file is indicative, but
not authoritative.

<sect2> The <tt/rmtab/ file<p>

<tt>/var/lib/nfs/rmtab</tt> contains a list of client hosts and the file
trees that have been mounted by them. The file has
one line for each entry, each line being the name of the client, a
colon, and the path to the file tree that was mounted.

The names of the clients will be fully qualified domain names as
returned by the resolver, or IP addresses.
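
Splitting one such line into its two fields is straightforward.  The
helper below is a hypothetical example that follows the format just
described (client name, colon, path) and nothing more; it is not taken
from <tt/mountd/ or <tt/exportfs/.

```c
#include <stddef.h>
#include <string.h>

/*
 * Split "client:path" into two caller-supplied buffers.
 * Returns 0 on success, -1 if the line is malformed or does not fit.
 */
static int parse_rmtab_line(const char *line,
			    char *client, size_t clen,
			    char *path, size_t plen)
{
	const char *colon = strchr(line, ':');
	size_t n, m;

	if (colon == NULL)
		return -1;
	n = (size_t)(colon - line);	/* length of the client name */
	m = strcspn(colon + 1, "\n");	/* path runs to the end of line */
	if (n == 0 || m == 0 || n >= clen || m >= plen)
		return -1;
	memcpy(client, line, n);
	client[n] = '\0';
	memcpy(path, colon + 1, m);
	path[m] = '\0';
	return 0;
}
```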

Entries are added to this file when <tt/mountd/ replies to a
successful mount request, and are removed when <tt/mountd/ receives an
<tt/unmount/ or an <tt/unmountall/ request.  It is used to reply to
the <tt/dump/ mountd request.

<tt/rmtab/ is used by <tt/exportfs/ when exporting filesystems.  It
uses the host names to instantiate wild card exports to create specific
host exports to give to the kernel.

<sect2>The <tt/etab/ file<p>

<tt>/var/lib/nfs/etab</tt> contains a list of currently exported file
trees and their options.  It is in a somewhat different format than
the <tt>/etc/exports</tt> file in that each line lists just one path
and one client, and the client has all export options explicitly
listed.

The <tt/etab/ file will normally contain the same information as
<tt>/etc/exports</tt>, but it can be given extra information by giving
explicit export arguments to the <tt/exportfs/ program.

The <tt/etab/ file is written by the <tt/exportfs/ program and read by
<tt/mountd/.

The <tt/etab/ file is the authoritative list of what should be exported
to where.


<sect2>The <tt>/proc</tt> <tt/exports/ file<p>

The file <tt>/proc/fs/nfs/exports</tt> provides a window into the
kernel's table of exported file trees.  It lists paths, clients and
options in exactly the same format as <tt/etab/.  While <tt/etab/ may
well have wild card, netgroup, and other non-specific client names, the
<tt/exports/ file in <tt>/proc</tt> only has explicit client host
names.

<sect2>The <tt/xtab/ file<p>

The file <tt>/var/lib/nfs/xtab</tt> serves much the same purpose as
<tt>/proc/fs/nfs/exports</tt> in that it records host-specific mounts
that have been given to the kernel.

More precisely, whenever an export request is given to the kernel that
is based on a group export, rather than a host export, in <tt/etab/,
the host export line is written to <tt/xtab/.

The <tt/xtab/ file is not necessary if the <tt>/proc</tt> filesystem
is available.


<sect2>The <tt/exportfs/ tool<p>

<tt/exportfs/ is used to communicate the intention of the system
administrator to the NFSD system.  It does that by maintaining the
<tt/etab/ file and playing some part in managing the in-kernel export
table.

<tt/exportfs/ will read file trees to be exported from
<tt>/etc/exports</tt> or from the command line, and will make
appropriate changes to <tt/etab/.  It can add and remove export
requests from <tt/etab/.

<tt/exportfs/ will also make sure that all host exports that are known
to be required (and no others) are known to the kernel.  It does this by
telling the kernel about any FQDN export requests that it puts in
<tt/etab/, and about any mount requests listed in <tt/rmtab/ that can
be validated against group export requests that are in <tt/etab/.

<sect2>The <tt/mountd/ daemon<p>

When a client wants to mount a filetree, it asks <tt/mountd/.
<tt/mountd/ checks the request against information in <tt/etab/, and
responds with file handle information if the mount is permitted.  It
also makes sure that the kernel knows that the filetree may be exported
to that client.  If the file tree is explicitly exported to the client
in <tt/etab/, then <tt/exportfs/ will have already told the kernel.

If the file tree is exported due to some group export request in
<tt/etab/, then <tt/mountd/ specialises that export request to the
given client, tells the kernel about this request, and records the
fact in <tt/xtab/.

<sect>The Path of a request<p>

This section traces the path that a single request takes as it is
processed by <tt/knfsd/.  This essentially shows the flow of control.

When the nfsd service is started by the <tt/nfsd/ user level program,
<tt/nfssvc.c::nfsd_svc/ is called. This calls <tt/svc_create_thread/
from the <tt/sunrpc/ module to create a number of threads for serving
requests that arrive for the <tt/nfs/ service.  Each thread runs
<tt/nfssvc.c::nfsd/ which handles requests in a loop.  This routine,
and most of what is called by it, runs protected by the big kernel
lock, so SMP issues are non-issues.

<tt/nfsd/ repeatedly calls <tt/svc_recv/ to receive a request. When it
receives a valid request it finds out which client the IP address
corresponds to (<tt/exp_getclient/) and passes the request back to the
<tt/sunrpc/ module using <tt/svc_process/.  These two calls, and hence
everything that <tt/nfsd/ does except for waiting for new
requests, are performed with a readlock on the export table.

The client identity is used to a minor extent by the <tt/sunrpc/
module in that if no valid client was found, and the procedure
requested was not the NULL procedure, then the request is rejected
with a bad credentials error status.  This checking is enabled by
setting <tt/rqstp->rq_auth/ to 1 in <tt/nfsd/.

The <tt/sunrpc/ module decodes the rpc request and passes it back to
<tt/nfsd_dispatch/.  This is selected by the <tt/nfsd_program/
structure at the end of <tt/nfssvc.c/.

<tt/nfsd_dispatch/ checks to see if the request has already been seen
and the reply been cached (see separate section on the request cache).
If it has, then the remembered reply is returned.
Otherwise the arguments are decoded, and the appropriate procedure is
called.  When this procedure returns the result is appropriately encoded
into the response buffer, and cached if caching is appropriate for
that procedure.

The different procedures are declared in the <tt/nfsd_procedures2/ and
<tt/nfsd_procedures3/ structures which are defined in <tt/nfsproc.c/
and <tt/nfs3proc.c/.  For each procedure there is defined:
<itemize>
<item>The C function to call to implement the procedure
(e.g. <tt/nfsd_proc_lookup/).
<item>The function to call to decode the arguments
(e.g. <tt/nfssvc_decode_diropargs/).
<item>The function to call to encode the results
(e.g. <tt/nfssvc_encode_diropres/).
<item>The function to call to release any data structures that might
have been stored in the result.

The only data structures that need releasing for the nfs service are
file handles.  <tt/nfsproc.c/ uses <tt/nfssvc_release_fhandle/ to
release both <tt/nfsd_diropres/ and <tt/nfsd_attrstat/ structures.
This works only because these two structures are identical to
<tt/nfsd_fhandle/.

<item>An indication of whether and how the reply should be cached.
The options are
<descrip>
<tag/RC_NOCACHE/ says that the response should not be cached.  This is
used for read-only operations like <tt/read/ and <tt/lookup/.
<tag/RC_REPLSTAT/ says to cache only the success status of the
call. This is used for all requests that only return success or
failure, and do not return data (they
are write only?) such as <tt/remove/, <tt/rename/ and <tt/symlink/.
<tag/RC_REPLBUF/ says to cache the reply which includes some real
data.  This includes operations which return the status of the file in
question, such as <tt/create/ and <tt/write/ and <tt/mkdir/.  It also
includes <tt/readdir/, presumably because rereading from a directory
is more expensive than re-reading from a file.
</descrip>
</itemize>
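
The dispatch flow and the per-procedure caching policy can be modelled
in a few lines.  The sketch below is deliberately tiny and is not the
kernel's code: it has a one-entry reply cache keyed only on the
transaction id, two made-up procedures, and hypothetical <tt/model_/
names throughout.  It does show why a retransmitted <tt/remove/ should
be answered from the cache rather than executed again.

```c
#include <assert.h>
#include <stddef.h>

enum model_cachetype {
	MODEL_RC_NOCACHE, MODEL_RC_REPLSTAT, MODEL_RC_REPLBUF
};

/* One table entry per procedure: the handler and its caching policy. */
struct model_procedure {
	int		     (*pc_func)(int arg);
	enum model_cachetype   pc_cachetype;
};

struct model_request { unsigned int xid; unsigned int proc; int arg; };

/* A one-entry reply cache keyed on the transaction id only. */
struct model_cache { int valid; unsigned int xid; int reply; };

static int model_file_exists = 1;

/* Read-only: safe to repeat, so never cached. */
static int model_proc_getattr(int arg)
{
	(void)arg;
	return model_file_exists ? 0 : 2;  /* 2 stands in for "no such file" */
}

/* Not safe to repeat: a second remove of the same file fails. */
static int model_proc_remove(int arg)
{
	(void)arg;
	if (!model_file_exists)
		return 2;
	model_file_exists = 0;
	return 0;
}

static const struct model_procedure model_procs[] = {
	{ model_proc_getattr, MODEL_RC_NOCACHE },
	{ model_proc_remove,  MODEL_RC_REPLSTAT },
};

static int model_dispatch(struct model_cache *rc,
			  const struct model_request *rq)
{
	const struct model_procedure *proc;
	int reply;

	assert(rc != NULL && rq != NULL);
	if (rq->proc >= sizeof(model_procs) / sizeof(model_procs[0]))
		return -1;		/* no such procedure */
	proc = &model_procs[rq->proc];

	/* A retransmission of a cached request replays the old reply. */
	if (proc->pc_cachetype != MODEL_RC_NOCACHE &&
	    rc->valid && rc->xid == rq->xid)
		return rc->reply;

	reply = proc->pc_func(rq->arg);

	if (proc->pc_cachetype != MODEL_RC_NOCACHE) {
		rc->valid = 1;
		rc->xid = rq->xid;
		rc->reply = reply;
	}
	return reply;
}
```

Sending the same <tt/remove/ request twice returns success both times,
while a fresh <tt/remove/ with a new transaction id reports that the
file is gone.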

The functions handling the different procedures naturally proceed
quite differently.  There are however still some similarities that can
be commented on.

Most <tt/nfsd_proc_*/ procedures simply pass the arguments on to
<tt/nfsd_*/ in <tt/vfs.c/ which contains common code for versions 2
and 3.  Other processing that is done at this level is preparing
buffers for return data (e.g. <tt/nfsd_proc_readlink/) and calling
<tt/fh_put/ on the file handle if nothing is needed from it.

A distinct exception to this pattern is <tt/nfsd_proc_create/, I
think because NFSv2 create is very different to NFSv3 create (FIXME
expand on this, after understanding it).

Within each <tt/nfsd_*/ call in <tt/vfs.c/, the common first step is
to call <tt/fh_verify/ on a file handle to make sure that it is valid
and that the relevant user has the required access.

<tt/fh_verify/ is defined in <tt/nfsfh.c/.  It uses
<tt/find_fh_dentry/ (which is described elsewhere) to find the dentry
for the file and, if it is valid, does a number of other validity checks,
and finally calls <tt/nfsd_permission/ to see if the user has the
required access.  The sequence of checks is described in the section on
<em/Validity checks/.

If <tt/fh_verify/ reports success, then the <tt/nfsd_*/ function goes
about its specific task and eventually returns.  This will cause the
results to be encoded, possibly cached, and sent back to the RPC
client.

<sect>Details of some handlers<p>

This section gives a bit of detail about interesting things that are
done by some of the specific NFS request handlers in <tt/vfs.c/.


<sect1>Read and read ahead<p>

The <tt/VFS/ layer in the kernel monitors whether a file is being used
for sequential access or random access, and as a pattern of sequential
access is noticed, it does more and more read-ahead to improve
performance.  For NFS accesses to benefit from this read-ahead, the
VFS layer must be able to detect sequential reads.

However, because NFS has no "open" request, and effectively performs
open/read/close for each read request, VFS needs a bit of help to
notice continuity of accesses.

The VFS layer stores the access pattern information in the <tt/file/
structure.  knfsd helps by recording the five relevant numbers after a
read request, and restoring them before the next read request on the
same file (dev/ino pair).  To do this it keeps a cache of read-ahead
values for recently accessed files.  Currently this cache is
implemented as a simple linked list with recently accessed entries
moved to the top.  The size of the list is limited to twice the
number of knfsd threads.  It would be interesting to be able to
measure the normal number of files which are concurrently being read
on a given fileserver.  This would allow the cache size to be tuned.
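
A fixed-size, move-to-front list of this kind might look roughly like
the sketch below.  It is an illustration only, not the knfsd code: the
names <tt/ra_entry/, <tt/ra_init/, <tt/ra_get/ and <tt/RA_MAX/ are
hypothetical, a single <tt/state/ field stands in for the saved
read-ahead numbers, and no locking is shown.

```c
#include <assert.h>
#include <stddef.h>

#define RA_MAX 4            /* stand-in for twice the number of threads */

/* One cached entry; `state` stands in for the saved read-ahead numbers. */
struct ra_entry {
    unsigned long dev, ino;
    long state;
    int used;
    struct ra_entry *next;
};

static struct ra_entry ra_pool[RA_MAX];
static struct ra_entry *ra_head;   /* most recently used entry first */

/* Link the whole pool into the list; call once before use. */
static void ra_init(void)
{
    int i;

    for (i = 0; i < RA_MAX; i++) {
        ra_pool[i].used = 0;
        ra_pool[i].next = (i + 1 < RA_MAX) ? &ra_pool[i + 1] : NULL;
    }
    ra_head = &ra_pool[0];
}

/* Find the entry for (dev, ino), or recycle the least recently used one,
 * and move it to the front of the list. */
static struct ra_entry *ra_get(unsigned long dev, unsigned long ino)
{
    struct ra_entry **pp = &ra_head, *e;

    assert(ra_head != NULL);
    /* Stop at a match or at the last entry, whichever comes first. */
    while ((*pp)->next &&
           !((*pp)->used && (*pp)->dev == dev && (*pp)->ino == ino))
        pp = &(*pp)->next;
    e = *pp;
    if (!(e->used && e->dev == dev && e->ino == ino)) {
        /* Miss: reuse the tail entry with fresh state. */
        e->dev = dev;
        e->ino = ino;
        e->state = 0;
        e->used = 1;
    }
    /* Unlink the entry and move it to the front. */
    *pp = e->next;
    e->next = ra_head;
    ra_head = e;
    return e;
}
```

A read handler would call <tt/ra_get/ before the read to restore the
saved state, and store the updated state in the same entry afterwards.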

<sect1>lookup and mountpoints<p>

<sect2>Unpatched<p>
When <tt/nfsd_lookup/ calls <tt/lookup_dentry/ to perform a filesystem
lookup, it is possible that the lookup will cross a mountpoint and the
returned dentry will be on a different filesystem.

The current implementation of <tt/nfsd_lookup/ checks for this case
and steps back to the underlying (covered) dentry, so that lookups
always stay on the same filesystem.  Note that unless
<tt/CONFIG_NFSD_SUN/ was used in compiling the code, the file handle
so obtained will be rejected on all future accesses with a permission
error.

<sect2>Patched<p>
When <tt/nfsd_lookup/ calls <tt/lookup_dentry/ to perform a filesystem
lookup, it is possible that the lookup will cross a mountpoint and the
returned dentry will be on a different filesystem.

If this happens, then <tt/knfsd/ will check to see if that filesystem
has been exported to the client and, in particular, whether it has
been exported with the <tt/NFSEXP_CROSSMNT/ option.
If it has been exported with this option, then a file handle for the
mounted directory is returned.  If it is not exported, or does not
have the option, then <tt/knfsd/ returns a file handle for the
underlying (covered) directory.
Note that unless <tt/CONFIG_NFSD_SUN/ was used in compiling the code,
the file handle for the underlying directory will be rejected on all
future accesses with a permission error.

This crossing of mount points with LOOKUP is not well supported by all
clients, for (at least) two reasons:
<itemize>
<item>
The files in the underlying file system may present <tt/fileid/s (also
known as inode numbers) which are the same as <tt/fileid/s in the parent
filesystem.  If the client depends on the uniqueness of these fileids
(without also taking the <tt/fsid/ into account) then it could get
confused.
<item>
STATFS will return different values for different filehandles in what
appears to the client to be the one filesystem.  If this does not
confuse the client, it may well confuse the users on the client
system.
</itemize>

Despite these problems, some clients do cope well with mount point
crossing, and some system administrators find it useful, so the
functionality is provided for those who want it.


<sect>Validity checks<p>

When a Remote Procedure Call arrives with a file handle (or possibly
two file handles) in it, the file handle needs to be converted to a
<tt/dentry/ (the Linux internal representation of a filesystem
object), and this dentry must be checked to see if the required access
is permitted.  This checking is performed by <tt/nfsfh.c::fh_verify/.

When the file handle arrives, <tt/nfsxdr.c::decode_fh/ copies it into
a <tt/struct svc_fh/ structure which has been zeroed by
<tt/nfsfh.h::fh_init/.

The <tt/svc_fh/ structure was described earlier in the section on
<em/Using the File Handle/.

The process of verification proceeds as follows:

<enum>
<item>
The file handle is checked to make sure that <tt/fh_dev/ is the same
as <tt/fh_xdev/. If it isn't, a warning is printed and an <tt/ESTALE/ error
is returned.  This is simply a consistency check.  The code could
equally well simply ignore the value in <tt/fh_xdev/ (as it ignores
many other bytes in the file handle) and copy <tt/fh_dev/ into
<tt/fh_xdev/ for other sections of code to use.

<item>
The export point from the file handle (<tt/fh_xdev/, <tt/fh_xino/) is
looked up in the export table (with <tt/export.c::exp_get/) to find
out how, and whether, this file tree is currently exported.  If there
is no export entry, then the file handle is rejected with
<tt/ESTALE/.

There are enhancements being worked on (September 1999) to allow knfsd
to call back to a user-level process (such as <tt/mountd/) to ask that
an appropriate entry be inserted into the table, possibly an
entry denying access.

<item>
If the export entry requires requests to come from a <bf/secure/
port (below 1024), and the request is from an insecure port, then the
file handle is rejected with <tt/EPERM/ and a warning is printed.

<item>
Next the file handle is converted to a <tt/dentry/ by
<tt/find_fh_dentry/.  If the export point has the <tt/SUBTREECHECK/
flag set, then <tt/find_fh_dentry/ must find a dentry which is
properly located in the file hierarchy.  If not, and the file handle
does not refer to a directory, then it is allowed to return a "root"
dentry that simply refers to the appropriate inode.

If an appropriate dentry cannot be found, then the file handle is
rejected, possibly with ESTALE, or ENOENT if a location could not be
found in the tree.  (Maybe it should always return ESTALE?)

<item>
Next the generation number from the inode (referred to by the dentry)
is compared with the generation number in the file handle.  If they
don't match then the file handle is rejected as stale.

Arguably the generation number should be checked in
<tt/find_fh_dentry/ as if the generation number doesn't match then it
isn't the right dentry.  This is more of aesthetic than practical
significance.

<item>
When <tt/fh_verify/ is called, the caller may indicate that a
particular type of object is required: a directory, a file,
or a symbolic link.

If a type was specified then the next check is to make sure that the
inode that was found has the right type.  If the inode has the wrong
type, then either ENOTDIR or EISDIR is returned depending on whether a
directory was asked for or not.

<item>
The next check is the <bf/sub-tree check/, which is skipped if
the export point does not have the <tt/SUBTREECHECK/ flag set.

The sub-tree check involves walking up the dcache tree from the
dentry that was found until we find the dentry for the export point.
If the root of the filesystem is found before finding the export
point, then the dentry found is clearly not in the exported tree, and
so the filehandle is rejected with ESTALE.

While the tree is being walked another check is made.  If the
filesystem is exported ROOTSQUASH then every directory in the path
must give execute access to someone other than root/wheel ???

<item>
The last step of <tt/fh_verify/ is to call
<tt/vfs.c::nfsd_permission/.
This checks the access type that was requested in various ways as the
following points outline.

However it first checks if the dentry was mounted on. In this case (if
it is compiled with CONFIG_NFSD_SUN) the filehandle is rejected with
EPERM.

<item>
The first access type check in <tt/nfsd_permission/ is to guard against
writing inappropriately.  If a write access (including setattr and
truncate) is requested then: if the export or the filesystem is read-only,
then EROFS is returned, and if the file is immutable, then EPERM is
returned.  There is another test involving <tt/nfsd_iscovered/,
however <tt/nfsd_iscovered/ is equivalent to <tt/false/. See below.

<item>
If access requires truncation, but the file is append-only, then EPERM
is returned.  It would seem that this test should be done in the VFS
layer.  However VFS enforces correct handling of IS_APPEND at file
open time, and there is no equivalent of open in NFS.

Interestingly, <tt/vfs.c::nfsd_open/ rejects all read/write access to
IS_APPEND files.

<item>
If all preceding tests succeed, then the owner of the file will always
get access.  This may seem a bit odd, but it is related to the fact
that Unix does permission checking at open time, while NFS has to do
it at access time.

<item>
Finally, the vfs <tt/permission/ routine is called to do normal
access checking.  As a special case, read-only requests on a regular
file are allowed if read OR exec access is available.  This allows
executables to be loaded (NFS does not distinguish between loading a
file to read it and loading a file to execute it).

</enum>

<tt/knfsd/ has a number of other bits of permission checking code
distributed in various places which are worth mentioning.
<descrip>
<tag/nfsd_iscovered/
This function is called from a number of places in <tt/vfs.c/,
including once in <tt/nfsd_permission/ as mentioned above. In apparent
contradiction to its name, this routine seems to check if a given
<tt/dentry/ "covers" (i.e. is mounted on) some other dentry.  However
it allows through the export point.  As Linux only allows the root of
a filesystem to cover anything, this function could only return true
for the root of a filesystem, but when given a dentry which is the
root of a filesystem, the export point will be that same root, and so
<tt/nfsd_iscovered/ will still return false.

I am not sure what the intent of this routine is.

<tag/fs_off_limits/
The <tt/vfs.c::fs_off_limits/ function rejects any filesystem which is
either an NFS filesystem or a PROC filesystem, as exporting these is a
bad idea for different reasons.

It is called in <tt/nfsd_lookup/ to make sure that the parent dentry
is not on an off-limits file system.

It would seem to make more sense to perform this test in
<tt/fh_verify/ so that those file systems were equally rejected for
all accesses.  Further, a more general test would be to reject any
filesystem without the <tt/FS_REQUIRES_DEV/ flag, as this covers the
two in question, and any such filesystem does not have a reliably
stable device number, and so (current) filehandles wouldn't be
guaranteed to remain meaningful across reboots.

<tag/set-time/
Some NFS clients (apparently) try to use the <tt/setattr/ request to
update the access and modify times on a file to the current time.
This should be allowed for any client which has write access to the
file (whereas normally setting these times is restricted to the owner).
<tt/nfsd_setattr/ makes a special case of allowing such a request
through, provided that the requested time is "close enough" to the
current time on the server. "Close enough" is a configurable value set
via the <tt>/proc/fs/nfs/time-diff-margin</tt> file.

This configuration should probably go somewhere in <tt>/proc/sys</tt>
to meet current (apparent) standards, though where isn't clear.

</descrip>
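
The "close enough" test used for the set-time case amounts to comparing
the requested time with the server's clock.  The sketch below assumes
that the margin is a number of seconds and that a difference exactly
equal to the margin is accepted; neither detail is confirmed here, and
the function name <tt/time_close_enough/ is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* True if `requested` lies within `margin` seconds of `now`,
 * in either direction. */
static bool time_close_enough(time_t requested, time_t now, long margin)
{
    double delta = difftime(requested, now);

    assert(margin >= 0);
    if (delta < 0)
        delta = -delta;
    return delta <= (double)margin;
}
```

A request from a non-owner with write access would be let through only
when this returns true for both the access and modify times.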

<sect>Tracing/sanity<p>

The knfsd code has a number of hooks for tracing and sanity checking.
Some of them are described here.

<sect1>nfsd_nr_put/nfsd_nr_verified<p>

File handle structures are used extensively in the nfsd code.
Presumably as a check that they were being allocated and freed
properly, a count of the number that have been properly verified, and
the number that have been properly released, is kept.  The difference
between these two should be the number that are currently in use, which
should never be more than 3 times the number of threads.  However this
is never checked, and the numbers are not accessible at all.  Maybe they
can be discarded.
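
If the check were ever made, it might look something like the
following.  This is purely illustrative: the variable and function
names are hypothetical stand-ins for the two counters, and no such
check exists in the code described here.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the two counters described above. */
static unsigned long nr_verified;   /* handles successfully verified */
static unsigned long nr_put;        /* handles released */

static void note_verified(void) { nr_verified++; }
static void note_put(void)      { nr_put++; }

/* Handles currently in use: verified but not yet released. */
static unsigned long handles_in_use(void)
{
    assert(nr_verified >= nr_put);
    return nr_verified - nr_put;
}

/* The bound suggested in the text: at most three handles per thread. */
static bool in_use_is_sane(unsigned int nthreads)
{
    return handles_in_use() <= 3UL * nthreads;
}
```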

<sect1>dprintk<p>

The <tt/sunrpc/ module provides a very nice facility for turning on
and off printk tracing of various modules.  Each <tt/sunrpc/ related
module (nfs, nfsd, nlm, rpc) has a "debug" variable which can be
read or written through <tt>/proc/sys/sunrpc/module_debug</tt>
(with <em/module/ replaced by the module name).  The
value is a bitmask of different parts of the module that can be
traced.  Each file defines which part it is.

For example, <tt/nfsfh.c/ contains:

<code>
#define NFSDDBG_FACILITY NFSDDBG_FH
</code>
<tt>include/linux/nfsd/debug.h</tt> defines
<code>
#define NFSDDBG_FH		0x0002
</code>
so that the command
<code>
 echo 2 > /proc/sys/sunrpc/nfsd_debug
</code>
will enable all the <tt/dprintk/ statements in <tt/nfsfh.c/.
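
The mechanism can be imitated in user space as shown below.  This is a
simplified analogue, not the <tt/sunrpc/ implementation: the variable
<tt/nfsd_debug/ stands in for the value written through <tt>/proc</tt>,
and <tt/debug_print/ and <tt/messages_printed/ are hypothetical helpers
added for the illustration.  Only the <tt/NFSDDBG_FH/ value and the
<tt/NFSDDBG_FACILITY/ definition are taken from the text.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

#define NFSDDBG_FH  0x0002            /* value quoted above */
#define NFSDDBG_FACILITY NFSDDBG_FH   /* what this "file" traces */

static unsigned int nfsd_debug;       /* stand-in for the sysctl value */
static int messages_printed;          /* for demonstration only */

/* Print only when the caller's facility bit is set in the mask. */
static void debug_print(unsigned int facility, const char *fmt, ...)
{
    va_list ap;

    assert(fmt != NULL);
    if (!(nfsd_debug & facility))
        return;
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    messages_printed++;
}

/* Simplified analogue of dprintk: the facility is fixed per file. */
#define dprintk(...) debug_print(NFSDDBG_FACILITY, __VA_ARGS__)
```

With <tt/nfsd_debug/ set to 2 every <tt/dprintk/ in this file prints;
with any value that lacks that bit, such as 0 or 4, they are silent.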

<sect1>nfsd_stats<p>
knfsd keeps a few counters to measure various events, as does the
<tt/sunrpc/ module.  These are made available through the file
<tt>/proc/net/rpc/nfsd</tt>.
This file contains one line for each sort of statistic. The first
line is specific to knfsd, the remainder are provided by the rpc
layer.

Though 9 counters are currently defined, only 4 are still used. They
are the hits, misses, and refusals(?) for the request cache, and the
count of stale file handles that have been seen.  These are the first,
second, third, and 8th numbers on the line.
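
A program reading the knfsd line could pick out the four live counters
as sketched below.  The exact layout of the line is an assumption here:
the sketch expects an optional non-numeric label followed by nine
unsigned numbers, and the names <tt/rc_stats/ and <tt/parse_nfsd_line/
are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

#define NFSD_NCOUNTERS 9

struct rc_stats {
    unsigned long hits, misses, refusals, stale;
};

/* Parse the knfsd line: an optional label followed by nine numbers.
 * Returns 0 on success, -1 if fewer than nine numbers are found. */
static int parse_nfsd_line(const char *line, struct rc_stats *out)
{
    unsigned long v[NFSD_NCOUNTERS];
    const char *p = line;
    char *end;
    int n;

    assert(line != NULL && out != NULL);
    /* Skip any leading label, that is, everything before the first digit. */
    while (*p && !isdigit((unsigned char)*p))
        p++;
    for (n = 0; n < NFSD_NCOUNTERS; n++) {
        v[n] = strtoul(p, &end, 10);
        if (end == p)
            return -1;      /* ran out of numbers */
        p = end;
    }
    out->hits     = v[0];   /* 1st number */
    out->misses   = v[1];   /* 2nd number */
    out->refusals = v[2];   /* 3rd number */
    out->stale    = v[7];   /* 8th number */
    return 0;
}
```

The caller would read the first line of the file into a buffer and pass
it to <tt/parse_nfsd_line/; a label that itself contains digits would
defeat the simple skipping done here.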
</article>
