The Linux Kernel NFSD Implementation
Neil Brown neilb@cse.unsw.edu.au
13th October 1999

One of the modules in the Linux operating system is an NFS server,
sometimes referred to as knfsd (the k is for Kernel, to distinguish it
from the user-level NFS server that is also available). This document
describes the details of the implementation current at version 1.4.7,
which is a patch against the 2.2.7 kernel and will possibly be included
in a late 2.2.* kernel release.
______________________________________________________________________

Table of Contents

1. Introduction
2. Understanding File Handles
   2.1 Interpreting a file handle - OLD
   2.2 Interpreting a file handle - NEW
   2.3 Using the File Handle
3. Exporting file trees
   3.1 Other export details
   3.2 User level assistants
      3.2.1 The /etc/exports file
      3.2.2 The rmtab file
      3.2.3 The etab file
      3.2.4 The /proc exports file
      3.2.5 The xtab file
      3.2.6 The exportfs tool
      3.2.7 The mountd daemon
4. The Path of a request
5. Details of some handlers
   5.1 Read and read ahead
   5.2 lookup and mountpoints
      5.2.1 Unpatched
      5.2.2 Patched
6. Validity checks
7. Tracing/sanity
   7.1 nfsd_nr_put/nfsd_nr_verified
   7.2 dprintk
   7.3 nfsd_stats
______________________________________________________________________

1.  Introduction

NFS, the Network File System from SUN Microsystems, is a system that
allows files to be shared over a network. It is implemented using four
protocols, each built on top of ONC-RPC, the Open Network Computing
Remote Procedure Call protocol, also from SUN. The four protocols are:

NFS
     The basic file access protocol which allows files to be created,
     found, read and written. It is designed to be stateless, meaning
     that the only state that the server needs to store is the exposed
     contents of the filesystem. The server is not expected to retain
     any information about the state of the clients.

     As well as being stateless, the NFS protocol is idempotent. This
     means that if a particular request arrives twice, the second will
     effectively be a NO-OP.
     This allows it to be used over an "unreliable" protocol such as
     UDP/IP.

MOUNTD
     The mountd protocol is used to gain initial access to a filesystem
     which can subsequently be accessed by NFS. The mountd protocol
     does expect the server to retain some state about clients, as it
     contains an unmount request to tell a server that the client is no
     longer using a filesystem.

SM
     The Status Monitor protocol is used for monitoring the state of
     different nodes in a network. The particular intent is that any
     node can register an interest in another node and will be told in
     a timely fashion when that node restarts.

NLM
     The Network Lock Manager protocol provides file and record
     locking. It relies on SM to determine when clients have restarted,
     and so released all of their locks, and when servers have
     restarted, and so need to be reminded of the current locks.

There is another protocol, the Kernel Lock Manager, or KLM, that some
NFS implementations use to communicate between the kernel and a
user-level locking daemon. This protocol is not used by the Linux NFS
implementation.

The knfsd implementation in Linux supports the NFS and NLM protocols
completely within the kernel: the NFS protocol by code in linux/fs/nfsd
and the NLM protocol by code in linux/fs/lockd. The lockd code contains
a client for the SM protocol, but the server is provided by a
user-level process called statd.

The MOUNTD protocol is served by a user-level process called mountd.
There is a system call interface to allow mountd to communicate
information about exported filesystems to the kernel-level NFS server.

There are two versions of the NFS protocol that are commonly in use
today, version 2 and version 3. The NFS server implementation in Linux
currently only supports version 2. Unless otherwise stated, all
statements about the protocols should be taken as referring to
version 2.

2.  Understanding File Handles

The file handle is a central part of all NFS and related operations.
A file handle is an opaque string of bits which is used to uniquely
identify a file or other filesystem object. In version 2, the file
handle is 256 bits long (32 bytes). In version 3, it is variable in
length, up to 512 bits long.

The fact that the file handle is opaque means that the client should
not attempt to understand anything about the file from inspecting the
contents of the file handle. The only operations that the client should
perform on the file handle are to copy it, and to compare it for
equality with another file handle.

This leaves the server free to encode information about the file into
the file handle in whatever way it sees fit. As Unix allows files to be
moved around the filesystem without changing their intrinsic identity,
it is important that the NFS server only encodes information about the
file, and not about its location in the filesystem hierarchy, otherwise
confusion can result.

The traditional contents of a file handle are:

o  some identifier for the file system, such as the device number that
   the file system is mounted from,

o  some identifier for the inode within the file system, such as the
   inode number, and

o  a field to indicate when an inode has been reused, typically called
   a generation number for the inode.

The server is free to use this sort of information, or anything else
that will serve the same purpose. The important thing for the server is
that it must be able to generate a unique file handle for each file
that preferably does not change across restarts, that it must be able
to reliably find a file given the file handle, and that it must be able
to determine if a given file handle is still valid (not old and not
fake).

With this context, we can now look at the file handles used by knfsd in
Linux.
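Before turning to the knfsd layout, the traditional contents listed
above can be illustrated with a small user-space sketch. It is not
taken from any NFS server: struct trad_fh, fh_compose, fh_equal and
fh_is_stale are hypothetical names chosen for this example, and the
only thing borrowed from the protocol is the 32-byte version 2 handle
size. The point is the division of labour: the server packs and
interprets the fields, while the client only copies and compares.

```c
#include <stdint.h>
#include <string.h>

#define FH_SIZE 32                      /* NFS version 2 handle size in bytes */

/* Hypothetical "traditional" handle contents; NOT the knfsd layout. */
struct trad_fh {
        uint32_t fsid;                  /* identifies the file system */
        uint32_t ino;                   /* inode number within that file system */
        uint32_t generation;            /* changes whenever the inode is reused */
};

/* Server side: pack the fields into an opaque, NUL-padded handle. */
void fh_compose(unsigned char out[FH_SIZE], const struct trad_fh *fh)
{
        memset(out, 0, FH_SIZE);
        memcpy(out, fh, sizeof(*fh));
}

/* Client side: handles may only be copied and compared for equality. */
int fh_equal(const unsigned char a[FH_SIZE], const unsigned char b[FH_SIZE])
{
        return memcmp(a, b, FH_SIZE) == 0;
}

/* Server side: the handle is stale if the inode has since been reused. */
int fh_is_stale(const unsigned char in[FH_SIZE], uint32_t current_generation)
{
        struct trad_fh fh;

        memcpy(&fh, in, sizeof(fh));
        return fh.generation != current_generation;
}
```

Note that nothing in the handle records a path name, so renaming the
file leaves the handle unchanged, which is the property asked for
above.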
From linux/include/linux/nfsd/nfsfh.h we find that the file handle is
built from a structure containing:

______________________________________________________________________
struct nfs_fhbase {
        struct dentry * fb_dentry;      /* dentry cookie */
        __u32           fb_ino;         /* our inode number */
        __u32           fb_dirino;      /* dir inode number */
        __u32           fb_dev;         /* our device */
        __u32           fb_xdev;
        __u32           fb_xino;
        __u32           fb_generation;
};
______________________________________________________________________

and some NUL padding. (Note to code readers: each of these is actually
referenced with an fh prefix rather than the fb shown here. See nfsfh.h
for the reason. fb_dentry doesn't only have a different prefix, it is
in fact spelt fh_dcookie.)

The fields have the following usage:

fb_dentry
     A previous version stored the address of a kernel structure
     related to the file in this entry. This is not stable over reboots
     and turns out not to be needed. The current version always sets
     this to the value 0xfeebbaca. This value is not checked, only set.

fb_ino
     This field stores the inode number of the file, cast to a 32 bit
     number if needed. This is stored in native byte order (as are all
     values, as the client never looks at them).

fb_dirino
     This field stores the inode number of the directory that contains
     the file. Given that files can be in multiple directories and can
     move between directories, this is neither unique nor stable. It is
     used to help locate the file within the dcache as will be
     explained later.

fb_dev
     This field contains the device that the file system was mounted
     from, which within Linux is a reasonably unique identifier of the
     filesystem. A number of file systems, such as procfs and smbfs, do
     not have associated devices, so a unique anonymous device is
     allocated at mount time. For such file systems, this field is not
     guaranteed to be stable across restarts.
     However, these file systems are not normally exported (and knfsd
     actually refuses to provide access to some of them; possibly it
     should reject all of them).

fb_xdev
     This field is the device number of the directory that was
     exported.

fb_xino
     This stores the inode number of the directory that was exported
     (which may not be the directory that was mounted). This field,
     together with fb_xdev and the IP address of the client, is used to
     determine whether the file handle is actually valid for that
     client.

fb_generation
     The generation number is used to differentiate between an inode
     before and after it has been deleted and reused. The generation
     number changes in a non-predictable way whenever the inode is
     reused.

The most complicated part of dealing with a file handle, for Linux, is
in finding the file given the file handle. This is in part due to the
presence of the dcache.

In order to read or write to a file within Linux, a struct file
structure is needed. These structures do not refer to the inode
directly, but do so through an entry in the dcache, a struct dentry.
Also, directory operations like lookup and create require a dentry.

As the dcache always contains a prefix of the filesystem directory
structure, finding a dentry requires making sure that the parent
directory (or a parent directory, as some files have multiple parents)
and all of its ancestors are also in the dcache. It is for this reason
that the file handle contains the inode number of a containing
directory.

2.1.  Interpreting a file handle - OLD

The filehandle code, in linux/fs/nfsd/nfsfh.c, jumps through some hoops
to get hold of an appropriate dentry, as will now be described.

When a new handle arrives in a request, a struct svc_fh is created and
passed to fh_verify to check that it is valid, and to find the dentry.
fh_verify calls find_fh_dentry which does the real work. This goes
through several stages to try to find the dentry.
Firstly it looks in a cache of recent file handles using
find_dentry_in_fcache. This is currently a no-op as it depended on the
fb_dentry field of the file handle, which is now deprecated.

The next stage involves checking to see if the inode is currently in
the inode cache, using iget_in_use. If it is, then a valid dentry
should also be in the dcache, and can be found because each inode
points to a list of dentries that refer to it. The code loops through
all of these dentries until it finds one whose parent has an inode
number that is the same as fb_dirino. Why it cares, as long as it has a
dentry, I'm not sure.

If no appropriate dentry is found, it proceeds to look in the rename
cache. Every time a file is renamed into a different directory, a
record is kept in a cache of the file's inode number and the old and
new directory inode numbers. Of course, this cache does not survive a
restart. If the inode/dir pair is found in the rename cache, then the
list of dentries is rechecked looking for the new directory inode
number.

If it still hasn't found an appropriate dentry, it continues to stage
3, which tries to find the full path name of the file, and then create
the dcache entry from that path name. The path name is found in much
the same way that the unix pwd command works to find a path name.

1. It fakes up a temporary dentry for the directory and reads through
   it looking for an entry for the given file inode. At the same time
   it records the inode number for the parent of the directory from the
   ".." entry.

2. Then it fakes up a dentry for the parent directory, which it now has
   an inode number for, and repeats the process.

One reason that these faked up temporary entries cannot be used more
generally is that the vfs level will not look up ".." through them
properly as they don't have a valid d_parent entry. (They could be used
for files though.)

If this fails, then the file must have been renamed to a different
directory a long time ago.
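The pwd-like walk of stage 3 can be sketched in user space as follows.
This is only an illustration of the idea, not the kernel code: the toy
directory table, ROOT_INO, scan_dir and toy_build_path are all invented
for this example, and a real implementation reads directories through
the filesystem rather than from an array.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ROOT_INO 2                      /* assumed inode number of the root */

/* One directory entry: directory `dir` contains `name` referring to `ino`. */
struct toy_dirent { uint32_t dir; const char *name; uint32_t ino; };

/* Hypothetical file system: /home/neilb/notes.txt */
static const struct toy_dirent toy_fs[] = {
        { 2,  "..",        2  },        /* the root is its own parent */
        { 2,  "home",      10 },
        { 10, "..",        2  },
        { 10, "neilb",     11 },
        { 11, "..",        10 },
        { 11, "notes.txt", 12 },
};
#define TOY_FS_LEN (sizeof(toy_fs) / sizeof(toy_fs[0]))

/* Read directory `dir`: return the name that refers to `ino` (or NULL),
 * and record the inode number of ".." in *parent along the way. */
static const char *scan_dir(uint32_t dir, uint32_t ino, uint32_t *parent)
{
        const char *found = NULL;
        size_t i;

        for (i = 0; i < TOY_FS_LEN; i++) {
                if (toy_fs[i].dir != dir)
                        continue;
                if (strcmp(toy_fs[i].name, "..") == 0)
                        *parent = toy_fs[i].ino;
                else if (toy_fs[i].ino == ino)
                        found = toy_fs[i].name;
        }
        return found;
}

/* Build the path of inode `ino`, known to live in directory `dirino`.
 * Returns 0 on success, -1 if a component is missing or buf is too small. */
int toy_build_path(uint32_t ino, uint32_t dirino, char *buf, size_t len)
{
        char tmp[256];

        if (len == 0)
                return -1;
        buf[0] = '\0';
        for (;;) {
                uint32_t parent = dirino;
                const char *name = scan_dir(dirino, ino, &parent);

                if (name == NULL)
                        return -1;      /* renamed away: the walk fails */
                if (snprintf(tmp, sizeof(tmp), "/%s%s", name, buf)
                    >= (int)sizeof(tmp))
                        return -1;
                if (strlen(tmp) >= len)
                        return -1;
                strcpy(buf, tmp);       /* prepend this component */
                if (dirino == ROOT_INO)
                        return 0;
                ino = dirino;           /* now look for the directory itself */
                dirino = parent;        /* ... in its own parent */
        }
}
```

With this table, starting from inode 12 in directory 11 yields
"/home/neilb/notes.txt". Starting from an inode that is no longer in
the directory named by the handle fails at the first step, which is the
"renamed a long time ago" case.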
Stage 4 uses a path name cache to try to find the path name of the
parent inode. I'm not really sure of the point of this, and won't go
into the detail yet. The name cache is maintained by add_to_path_cache
and get_path_entry.

Stage five is simply to fail.

2.2.  Interpreting a file handle - NEW

Note: This section assumes my patches to 1.4.7

As knfsd looks up a file handle for every request, and will often
receive a number of requests on the same file handle (e.g. several read
requests on the one file), it is well worthwhile doing some caching to
improve lookup speed. Fortunately the dcache and icache provide
adequate caching, and knfsd does not need to do any of its own. REMOVE
THIS?

When a file handle arrives, it is passed to fh_verify to check that it
is valid, and that the user has appropriate access. After some simple
consistency checks (fb_xdev must equal fb_dev, and the fb_xdev,fb_xino
pair must be exported to this host), find_fh_dentry is called to find
an entry in the dcache for the inode referred to in the handle.

Providing that the underlying filesystem supports the read_inode
operation, the inode with the given inode number is found using
iget_in_use. (If read_inode is not supported, then the filesystem
simply cannot be NFS-exported.)

If this inode was already in the icache, then it will have a pointer to
a valid dcache entry, and the lookup is complete.

If this inode was not in the cache (either it has been flushed to make
room or the server has been restarted since the inode was last used),
then a dcache entry needs to be created.

Here we need to worry about the NFSMNT_SUBTREECHECK export option. If
this option is set, then we want to make sure that every file accessed
is a descendant of the export point. When exporting whole filesystems,
this checking is unnecessary and can be avoided by clearing this
option.
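What "a descendant of the export point" means can be made concrete with
a small user-space sketch. It is only a model of the check, not the
knfsd code: the parent table, TOY_ROOT, toy_parent_of and
toy_subtree_ok are hypothetical, and the kernel works on dentries
rather than bare inode numbers.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_ROOT 2                      /* assumed inode number of the root */

/* Hypothetical tree: each inode paired with its parent directory. */
struct toy_link { uint32_t ino; uint32_t parent; };

static const struct toy_link toy_tree[] = {
        { 2, 2 },                       /* root is its own parent */
        { 10, 2 }, { 11, 10 }, { 12, 11 },      /* /a, /a/b, /a/b/file */
        { 20, 2 }, { 21, 20 },                  /* /c, /c/file */
};
#define TOY_TREE_LEN (sizeof(toy_tree) / sizeof(toy_tree[0]))

/* Return the parent of `ino`, or 0 if the inode is unknown. */
static uint32_t toy_parent_of(uint32_t ino)
{
        size_t i;

        for (i = 0; i < TOY_TREE_LEN; i++)
                if (toy_tree[i].ino == ino)
                        return toy_tree[i].parent;
        return 0;
}

/* Return 1 if `ino` is the export point or lies below it, else 0. */
int toy_subtree_ok(uint32_t ino, uint32_t export_ino)
{
        for (;;) {
                if (ino == export_ino)
                        return 1;       /* reached the export point */
                if (ino == TOY_ROOT || ino == 0)
                        return 0;       /* ran out of ancestors */
                ino = toy_parent_of(ino);
        }
}
```

The walk needs a parent for every object on the way up, which is why
the option matters for how much of the dcache tree has to be rebuilt.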
If this option is not set, and the inode being searched for is not a
directory, then the dcache entry that is returned does not need to be
located in the dcache tree; its parent and child pointers will never be
checked.

So for non-directories with no SUBTREECHECK requirement, we simply
create a dcache entry with d_alloc_root and return it.

For other objects we need to find a valid location in the dcache tree.
This requires being able to find the (or a) parent for this object, and
then a parent for that parent and so on until an object in the tree is
found. For directories we can always find a parent by reading the
directory and looking for the '..' entry. For non-directories we need
the fb_dirino entry in the file handle.

Given this parent inode number, find_fh_dentry walks up the directory
tree, building a dcache path as it goes, until it finds an ancestor
that is already in the dcache. It then splices the path into that
ancestor, and returns the base of the path, which is the dcache entry
of the file that is wanted.

If SUBTREECHECKing is required, and a file is renamed to a different
directory, then accessing it with the old file handle will only work as
long as the entry for the file is still in the dcache. Once it expires,
access from that filehandle will no longer work. It would be possible
to encourage entries for renamed files to stay in the dcache longer,
but we would need some data on how long entries tend to stay anyway,
and how often moved files are accessed by their old filehandle, to see
if there was any value in this.

2.3.  Using the File Handle

knfsd stores active file handles in a struct svc_fh which looks like:

______________________________________________________________________
typedef struct svc_fh {
        struct knfs_fh          fh_handle;      /* FH data */
        struct dentry *         fh_dentry;      /* validated dentry */
        struct svc_export *     fh_export;      /* export pointer */
        size_t                  fh_pre_size;    /* size before operation */
        time_t                  fh_pre_mtime;   /* mtime before oper */
        time_t                  fh_pre_ctime;   /* ctime before oper */
        unsigned long           fh_post_version;/* inode version after oper */
        unsigned char           fh_locked;      /* inode locked by us */
        unsigned char           fh_dverified;   /* dentry has been checked */
} svc_fh;
______________________________________________________________________

The first three entries are (hopefully) fairly obvious. The fh_handle
is the raw handle that came over the wire, and fh_dentry and fh_export
are the dcache entry and export entry that have been derived from it.

fh_pre_size, fh_pre_mtime and fh_pre_ctime are intended for encoding
Weak Cache Consistency data for NFS version 3. The values do not seem
to be set significantly at present. fh_post_version is used to
determine if this wcc data needs to be returned or not.

fh_locked is set when the i_sem semaphore on the inode is taken down,
to make sure that it gets put back up when the file handle is released.

fh_dverified records that the fh_dentry is valid, to make sure that it
is released when the filehandle is released.

3.  Exporting file trees

Note: This section assumes my patches to 1.4.7

The kernel nfs server maintains a list of file systems that are
currently mounted by some client. As clients mount filesystems using
the MOUNTD protocol, the MOUNTD server tells the kernel about them.

As the kernel needs to know about all currently mounted filesystems, it
is necessary for this information to be safely stored across restarts.
This is the responsibility of mountd, which records all current mounts
in /var/lib/nfs/rmtab, and of exportfs, which reminds the kernel of all
exports deduced from rmtab.

Each client is known to the kernel by a struct svc_client which is
defined in linux/include/linux/nfsd/export.h to be

______________________________________________________________________
struct svc_client {
        struct svc_client *     cl_next;
        char                    cl_ident[NFSCLNT_IDMAX];
        int                     cl_idlen;
        int                     cl_naddr;
        struct in_addr          cl_addr[NFSCLNT_ADDRMAX];
        struct svc_uidmap *     cl_umap;
        struct svc_export *     cl_export[NFSCLNT_EXPMAX];
};
______________________________________________________________________

The cl_ident field stores a string (at most 1023 chars) which is used
as a key when accessing this client information via the nfs_ctl system
call. This is typically the hostname of the client.

The cl_addr array contains up to 16 internet addresses for the client,
which should all be considered to be equivalent. This is used as a key
to the structure when an NFS request arrives over the network. The
kernel does not check, when a client is created or changed, that the
addresses are not already in use by another client, so there is a
possibility for confusion if the user-level processes allow it. Access
checks are always done against the most recently registered client to
have a particular address.

The cl_export array is a hash table (size 16) which stores the
information about the different filesystems which are exported.
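The effect of duplicate addresses can be shown with a user-space
sketch. This is a simplified model and not the kernel's data structure:
struct toy_client, toy_client_add and toy_client_find are hypothetical,
addresses are plain 32 bit numbers, and the sketch simply assumes that
new clients are put at the front of a list so that the first match is
the most recently registered one.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_ADDRMAX 16                  /* mirrors the 16 address limit */

/* Hypothetical, cut-down stand-in for struct svc_client. */
struct toy_client {
        struct toy_client *next;
        const char *ident;              /* typically the client host name */
        int naddr;                      /* number of valid entries in addr[] */
        uint32_t addr[TOY_ADDRMAX];     /* equivalent addresses of the client */
};

/* Register a client; no check is made for addresses already in use. */
void toy_client_add(struct toy_client **list, struct toy_client *clp)
{
        clp->next = *list;
        *list = clp;
}

/* Find the client for a request's source address. The first match wins,
 * which here means the most recently registered client. */
struct toy_client *toy_client_find(struct toy_client *list, uint32_t addr)
{
        struct toy_client *clp;
        int i;

        for (clp = list; clp != NULL; clp = clp->next)
                for (i = 0; i < clp->naddr; i++)
                        if (clp->addr[i] == addr)
                                return clp;
        return NULL;
}
```

If two clients both claim an address, requests from that address are
checked against the later registration only, which is the confusion the
user-level tools are expected to prevent.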
The struct svc_export structure is also defined in
linux/include/linux/nfsd/export.h to be

______________________________________________________________________
struct svc_export {
        struct svc_export *     ex_next;
        char                    ex_path[NFS_MAXPATHLEN+1];
        struct svc_export *     ex_parent;
        struct svc_client *     ex_client;
        int                     ex_flags;
        struct dentry *         ex_dentry;
        kdev_t                  ex_dev;
        ino_t                   ex_ino;
        uid_t                   ex_anon_uid;
        gid_t                   ex_anon_gid;
};
______________________________________________________________________

The ex_next entry is used to chain together entries in the same hash
bucket. ex_client is simply a pointer back to the client which owns the
export entry, and ex_path, ex_dentry, ex_dev, and ex_ino simply store
different information about the exported directory.

The ex_parent, if non-NULL, points to another svc_export for this
client that is an ancestor of this directory in the filesystem. It will
always point to the closest such ancestor. Many export entries will not
have a parent.

As far as I can tell, the ex_parent is maintained but never used. It
seems to be related to two checks in exp_export (in
linux/fs/nfsd/export.c) which check two rules about exporting related
directories. They appear to be:

Rule 2
     If an ancestor directory of a given directory is exported to a
     given client, then the given directory can also be exported only
     if it is on a different filesystem/device.

Rule 3
     If any descendant directory of a given directory, which is on the
     same file system (or a file system where the device has the same
     hash value!) is already exported to a client, then the given
     directory cannot also be exported to that client.

These rules are simple inverses of each other and are presumably
intended to remove any ambiguity concerning which export attributes
(flags and anon ids) should be applied to a given file. If either of
these rules were violated then there would be two export points in the
one file system with one being a child (descendant) of the other.
This would create ambiguity as to which export rules should apply to
children of the junior export point.

3.1.  Other export details

The cl_umap field in each client is currently not used. The intent
seems to be to allow the client and server to use different uids and
gids for the same entity, and the nfs server would do the appropriate
mapping. Presumably this would be a cache which would be updated by a
call-back to a user-space daemon on a cache miss.

The ex_flags field can have the following bit-flags set:

READONLY
     All write requests are denied with NFSERR_ROFS.

INSECURE_PORT
     Requests from insecure ports (1024 or above) are permitted.

ROOTSQUASH
     All accesses by uid 0 are mapped to appear to be by the uid given
     in ex_anon_uid. Similarly, accesses by gid 0 are treated as
     accesses by gid ex_anon_gid.

ALLSQUASH
     All accesses are treated as though they came from uid ex_anon_uid
     and gid ex_anon_gid.

GATHERED_WRITES
     This enables a hack which attempts to allow the underlying
     filesystem to gather writes together for more efficient use of the
     low-level device. If a write is requested while it appears that
     another write is pending, the first write sleeps for 10msecs
     before flushing the write, to give the filesystem a chance to have
     scheduled the two (or more) writes together.

UIDMAP
     This flag indicates that the currently unimplemented cl_umap map
     should be used.

KERBEROS
     This unimplemented option indicates that Kerberos authentication
     should be used for each request.

SUNSECURE
     This is another unimplemented option. It presumably indicates that
     SUN's Secure RPC is being used to authenticate requests.

CROSSMNT
     Another unimplemented option. Presumably it is intended that file
     trees exported with this option allow the client to cross mount
     points.
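The two squashing flags described above can be summarised in a few
lines of user-space C. This is a sketch of the described behaviour
only: the TOY_ flag values, struct toy_export, struct toy_cred and
toy_squash are made up for the example and are not the kernel's names
or bit assignments.

```c
#include <stdint.h>

/* Hypothetical flag bits, for illustration only. */
#define TOY_ROOTSQUASH  0x1
#define TOY_ALLSQUASH   0x2

/* Cut-down stand-in for the relevant parts of struct svc_export. */
struct toy_export {
        int      flags;
        uint32_t anon_uid;
        uint32_t anon_gid;
};

/* The identity presented by a request. */
struct toy_cred {
        uint32_t uid;
        uint32_t gid;
};

/* Return the identity that access checks should actually use. */
struct toy_cred toy_squash(const struct toy_export *exp, struct toy_cred in)
{
        if (exp->flags & TOY_ALLSQUASH) {
                /* everybody becomes the anonymous user and group */
                in.uid = exp->anon_uid;
                in.gid = exp->anon_gid;
        } else if (exp->flags & TOY_ROOTSQUASH) {
                /* only uid 0 and gid 0 are remapped, independently */
                if (in.uid == 0)
                        in.uid = exp->anon_uid;
                if (in.gid == 0)
                        in.gid = exp->anon_gid;
        }
        return in;
}
```

For example, with only the root-squash bit set and an anonymous id of
65534, a request from uid 0, gid 100 is treated as uid 65534, gid 100,
while a request from uid 500 passes through unchanged.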
There is another option that affects exported filesystem behaviour that
is implemented as a compile time option, CONFIG_NFSD_SUN, rather than a
runtime option. Normally, if a directory within an exported filesystem
is mounted on, then that directory and hence everything beneath it is
not accessible via NFS. If CONFIG_NFSD_SUN has been selected at compile
time, then the server acts like SUN Microsystems servers and allows the
entire file system to be viewed, independent of mounts.

3.2.  User level assistants

There are two user level programs which assist with maintaining the
client and export lists in the kernel. They are mountd and exportfs;
the first is a daemon, the second is an admin tool.

These two programs read the /etc/exports file and maintain the files
xtab, etab and rmtab in /var/lib/nfs as well as the in-kernel client
and export lists.

3.2.1.  The /etc/exports file

The /etc/exports file lists file trees, clients that they can be
mounted by, and export flags (such as read-only) that are imposed on
that client for that file tree.

Client names in /etc/exports can be:

ANONYMOUS
     An empty client name will match any client.

NETGROUP
     A client name starting with an '@' will match any client host in
     the netgroup given by the rest of the client name.

WILDCARD
     If a client name contains '*', '?', or '[', then it is assumed to
     be a wild carded host name. Any host with a fully qualified domain
     name which matches the client name using glob matching will match
     the client.

SUBNETWORK
     If a client is a dotted-quad IP address followed by a slash and a
     number of bits (e.g. 129.94.0.0/16) then any client with an IP
     address in that subnet will match.

FQDN
     Otherwise clients must be host names and are looked up with
     standard host name resolution procedures.

The /etc/exports file records the default intention of what should be
exported.
It is read by exportfs, usually at startup time, and the information is
presented to mountd through the /var/lib/nfs/etab file. Thus this file
is indicative, but not authoritative.

3.2.2.  The rmtab file

/var/lib/nfs/rmtab contains a list of client hosts and the file trees
that have been mounted by them.

The file has one line for each entry, each line being the name of the
client, a colon, and the path to the file tree that was mounted. The
names of the clients will be fully qualified domain names as returned
by the resolver, or IP addresses.

Entries are added to this file when mountd replies to a successful
mount request, and are removed when mountd receives an unmount or an
unmountall request. It is used to reply to the dump mountd request.

rmtab is used by exportfs when exporting filesystems. It uses the host
names to instantiate wild card exports to create specific host exports
to give to the kernel.

3.2.3.  The etab file

/var/lib/nfs/etab contains a list of currently exported file trees and
their options. It is in a somewhat different format from the
/etc/exports file in that each line lists just one path and one client,
and the client has all export options explicitly listed.

The etab file will normally contain the same information as
/etc/exports, but it can be given extra information by giving explicit
export arguments to the exportfs program.

The etab file is written by the exportfs program and read by mountd.

The etab file is the authoritative list of what should be exported to
where.

3.2.4.  The /proc exports file

The file /proc/fs/nfs/exports provides a window into the kernel's table
of exported file trees. It lists paths, clients and options in exactly
the same format as etab. While etab may well have wild card, netgroup,
and other non-specific client names, the exports file in /proc only has
explicit client host names.

3.2.5.  The xtab file

The file /var/lib/nfs/xtab serves much the same purpose as
/proc/fs/nfs/exports in that it records host-specific mounts that have
been given to the kernel. More precisely, whenever an export request is
given to the kernel that is based on a group export, rather than a host
export, in etab, the host export line is written to xtab.

The xtab file is not necessary if the /proc filesystem is available.

3.2.6.  The exportfs tool

exportfs is used to communicate the intention of the system
administrator to the NFSD system. It does that by maintaining the etab
file and playing some part in managing the in-kernel export table.

exportfs will read file trees to be exported from /etc/exports or from
the command line, and will make appropriate changes to etab. It can add
and remove export requests from etab.

exportfs will also make sure that all host exports that are known to be
required (and no others) are known to the kernel. It does this by
telling the kernel about any FQDN export requests that it puts in etab,
and about any mount requests listed in rmtab that can be validated
against group export requests that are in etab.

3.2.7.  The mountd daemon

When a client wants to mount a filetree, it asks mountd. mountd checks
the request against information in etab, and responds with file handle
information if the mount is permitted. It also makes sure that the
kernel knows that the filetree may be exported to that client.

If the file tree is explicitly exported to the client in etab, then
exportfs will have already told the kernel. If the file tree is
exported due to some group export request in etab, then mountd
specialises that export request to the given client, tells the kernel
about this request, and records the fact in xtab.

4.  The Path of a request

This section traces the path that a single request takes as it is
processed by knfsd. This essentially shows the flow of control.
When the nfsd service is started by the nfsd user level program,
nfssvc.c::nfsd_svc is called. This calls svc_create_thread from the
sunrpc module to create a number of threads for serving requests that
arrive for the nfs service.

Each thread runs nfssvc.c::nfsd which handles requests in a loop. This
routine, and most of what is called by it, runs protected by the big
kernel lock, so SMP issues are non-issues.

nfsd repeatedly calls svc_recv to receive a request. When it receives a
valid request it finds out which client the IP address corresponds to
(exp_getclient) and passes the request back to the sunrpc module using
svc_process. These two calls, and hence everything that nfsd does
except for waiting for new requests, are performed with a readlock on
the export table.

The client identity is used to a minor extent by the sunrpc module in
that if no valid client was found, and the procedure requested was not
the NULL procedure, then the request is rejected with a bad credentials
error status. This checking is enabled by setting rqstp->rq_auth to 1
in nfsd.

The sunrpc module decodes the rpc request and passes it back to
nfsd_dispatch. This is selected by the nfsd_program structure at the
end of nfssvc.c.

nfsd_dispatch checks to see if the request has already been seen and
the reply been cached (see separate section on the request cache). If
it has, then the remembered reply is returned. Otherwise the arguments
are decoded, and the appropriate procedure is called. When this
procedure returns, the result is appropriately encoded into the
response buffer, and cached if caching is appropriate for that
procedure.

The different procedures are declared in the nfsd_procedures2 and
nfsd_procedures3 structures which are defined in nfsproc.c and
nfs3proc.c. For each procedure there is defined:

o  The C function to call to implement the procedure (e.g.
   nfsd_proc_lookup).

o  The function to call to decode the arguments (e.g.
   nfssvc_decode_diropargs).
o  The function to call to encode the results (e.g.
   nfssvc_encode_diropres).

o  The function to call to release any data structures that might have
   been stored in the result. The only data structures that need
   releasing for the nfs service are file handles. nfsproc.c uses
   nfssvc_release_fhandle to release both nfsd_diropres and
   nfsd_attrstat structures. This works only because these two
   structures are identical to nfsd_fhandle.

o  An indication of whether and how the reply should be cached. The
   options are

   RC_NOCACHE
        says that the response should not be cached. This is used for
        read-only operations like read and lookup.

   RC_REPLSTAT
        says to cache only the success status of the call. This is used
        for all requests that only return success or failure, and do
        not return data (they are write only?) such as remove, rename
        and symlink.

   RC_REPLBUFF
        says to cache the reply which includes some real data. This
        includes operations which return the status of the file in
        question, such as create and write and mkdir. It also includes
        readdir, presumably because rereading from a directory is more
        expensive than re-reading from a file.

The functions handling the different procedures naturally proceed quite
differently. There are however still some similarities that can be
commented on.

Most nfsd_proc_* procedures simply pass the arguments on to nfsd_* in
vfs.c which contains common code for versions 2 and 3. Other processing
that is done at this level is preparing buffers for return data (e.g.
nfsd_proc_readlink) and calling fh_put on the file handle if nothing is
needed from it. A distinct exception to this pattern is
nfsd_proc_create, I think because NFSv2 create is very different to
NFSv3 create (FIXME expand on this, after understanding it).

Within each nfsd_* call in vfs.c, the common first step is to call
fh_verify on a file handle to make sure that it is valid and that the
relevant user has the required access.
fh_verify is defined in nfsfh.c. It uses find_fh_dentry (which is described elsewhere) to find the dentry for the file and, if it is valid, does a number of other validity checks and finally calls nfsd_permission to see if the user has the required access. The sequence of checks is described elsewhere (FIXME one day...).

If fh_verify reports success, then the nfsd_* function goes about its specific task and eventually returns. This will cause the results to be encoded, possibly cached, and sent back to the RPC client.

5. Details of some handlers

This section gives a bit of detail about interesting things that are done by some of the specific NFS request handlers in vfs.c.

5.1. Read and read ahead

The VFS layer in the kernel monitors whether a file is being used for sequential access or random access, and as a pattern of sequential access is noticed, it does more and more read-ahead to improve performance.

For NFS accesses to benefit from this read-ahead, the VFS layer must be able to detect sequential reads. However, because NFS has no "open" request, and effectively performs open/read/close for each read request, VFS needs a bit of help to notice continuity of accesses.

The VFS layer stores the access pattern information in the file structure. knfsd helps by recording the (5) various numbers after a read request, and restoring them before the next read request on the same file (dev/ino pair). To do this it keeps a cache of read-ahead values for recently accessed files. Currently this cache is implemented as a simple linked list with recently accessed entries moved to the top. The size of the list is limited to twice the number of knfsd threads.

It would be interesting to be able to measure the normal number of files which are concurrently being read on a given fileserver. This would allow the cache size to be tuned.

5.2. lookup and mountpoints

5.2.1.
Unpatched

When nfsd_lookup calls lookup_dentry to perform a filesystem lookup, it is possible that the lookup will cross a mountpoint and the returned dentry will be on a different filesystem. The current implementation of nfsd_lookup checks for this case and steps back to the underlying (covered) dentry, so that lookups always stay on the same filesystem. Note that unless CONFIG_NFSD_SUN was used in compiling the code, the file handle so obtained will be rejected on all future accesses with a permission error.

5.2.2. Patched

When nfsd_lookup calls lookup_dentry to perform a filesystem lookup, it is possible that the lookup will cross a mountpoint and the returned dentry will be on a different filesystem. If this happens, then knfsd will check to see if that filesystem has been exported to the client and, in particular, whether it has been exported with the NFSEXP_CROSSMNT option. If it has been exported with this option, then a file handle for the mounted directory is returned. If it is not exported, or does not have the option, then knfsd returns a file handle for the underlying (covered) directory. Note that unless CONFIG_NFSD_SUN was used in compiling the code, the file handle for the underlying directory will be rejected on all future accesses with a permission error.

This crossing of mount points with LOOKUP is not well supported by all clients, for (at least) two reasons:

+o The files in the underlying file system may present fileids (also known as inode numbers) which are the same as fileids in the parent filesystem. If the client depends on the uniqueness of these fileids (without also taking the fsid into account) then it could get confused.
+o STATFS will return different values for different filehandles in what appears to the client to be the one filesystem. If this does not confuse the client, it may well confuse the users on the client system.
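
The server-side decision made by the patched lookup can be sketched as follows. This is a simplified user-space model, not the kernel code: struct demo_export, DEMO_EXP_CROSSMNT (standing in for NFSEXP_CROSSMNT, with an arbitrary value) and choose_dentry are hypothetical names, and a NULL pointer is assumed to mean "not exported to this client".

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for NFSEXP_CROSSMNT; the value is illustrative only. */
#define DEMO_EXP_CROSSMNT 0x1u

/* Hypothetical export entry for the filesystem mounted on the directory. */
struct demo_export { unsigned int flags; };

enum lookup_choice { RETURN_MOUNTED, RETURN_COVERED };

/*
 * Decide which directory the returned file handle should refer to when a
 * lookup has crossed a mountpoint. Pass NULL if the mounted filesystem is
 * not exported to the client.
 */
enum lookup_choice choose_dentry(const struct demo_export *mounted_fs_export)
{
    if (mounted_fs_export != NULL &&
        (mounted_fs_export->flags & DEMO_EXP_CROSSMNT))
        return RETURN_MOUNTED;
    /* Not exported, or exported without the option: step back. */
    return RETURN_COVERED;
}

/* Usage example covering the three cases described in the text. */
void demo_check_crossing(void)
{
    struct demo_export with_opt = { DEMO_EXP_CROSSMNT };
    struct demo_export without_opt = { 0 };

    assert(choose_dentry(&with_opt) == RETURN_MOUNTED);
    assert(choose_dentry(&without_opt) == RETURN_COVERED);
    assert(choose_dentry(NULL) == RETURN_COVERED);
}
```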
Despite these problems, some clients do cope well with mount point crossing, and some system administrators find it useful, so the functionality is provided for those who want it.

6. Validity checks

When a Remote Procedure Call arrives with a file handle (or possibly two file handles) in it, the file handle needs to be converted to a dentry (the Linux internal representation of a filesystem object), and this dentry must be checked to see if the required access is permitted. This checking is performed by nfsfh.c::fh_verify.

When the file handle arrives, nfsxdr.c::decode_fh copies it into a struct svc_fh structure which has been zeroed by nfsfh.h::fh_init. The svc_fh structure was described earlier in the section on Using the File Handle.

The process of verification proceeds as follows:

1. The file handle is checked to make sure that fh_dev is the same as fh_xdev. If it isn't, a warning is printed and an ESTALE error is returned. This is simply a consistency check. The code could equally well simply ignore the value in fh_xdev (as it ignores many other bytes in the file handle) and copy fh_dev into fh_xdev for other sections of code to use.
2. The export point from the file handle (fh_xdev, fh_xino) is looked up in the export table (with export.c::exp_get) to find out how, and whether, this file tree is currently exported. If there is no export entry, then the file handle is rejected with ESTALE. There are enhancements being worked on (September 1999) to allow knfsd to call back to a user-level process (such as mountd) to ask that an appropriate entry be inserted into the table, possibly an entry denying access.
3. If the export entry requires requests to come from a secure port (a port number below 1024), and the request is from an insecure port, then the file handle is rejected with EPERM and a warning is printed.
4. Next the file handle is converted to a dentry by find_fh_dentry.
If the export point has the SUBTREECHECK flag set, then find_fh_dentry must find a dentry which is properly located in the file hierarchy. If not, and the file handle does not refer to a directory, then it is allowed to return a "root" dentry that simply refers to the appropriate inode. If an appropriate dentry cannot be found, then the file handle is rejected, possibly with ESTALE, or with ENOENT if a location could not be found in the tree. (Maybe it should always return ESTALE?)
5. Next the generation number from the inode (referred to by the dentry) is compared with the generation number in the file handle. If they don't match then the file handle is rejected as stale. Arguably the generation number should be checked in find_fh_dentry, as if the generation number doesn't match then it isn't the right dentry. This is more of aesthetic than practical significance.
6. When fh_verify is called, the caller may indicate that a particular type of object is required, possibly a directory, a file or a symbolic link. If a type was specified then the next check is to make sure that the inode that was found has the right type. If the inode has the wrong type, then either ENOTDIR or EISDIR is returned depending on whether a directory was asked for or not.
7. The next check is the sub-tree check, which is skipped if the export point does not have the SUBTREECHECK flag set. The sub-tree check involves walking up the dcache tree from the dentry that was found until we find the dentry for the export point. If the root of the filesystem is found before finding the export point, then the dentry found is clearly not in the exported tree, and so the filehandle is rejected with ESTALE. While the tree is being walked another check is made. If the filesystem is exported ROOTSQUASH then every directory in the path must give execute access to someone other than root/wheel ???
8. The last step of fh_verify is to call vfs.c::nfsd_permission.
This checks the access type that was requested in various ways, as the following points outline. However it first checks if the dentry was mounted on. In this case (unless the code was compiled with CONFIG_NFSD_SUN, as noted in the section on lookup and mountpoints) the filehandle is rejected with EPERM.
9. The first access type check in nfsd_permission is to guard against writing inappropriately. If a write access (including setattr and truncate) is requested then: if the export or the filesystem is read-only, then EROFS is returned, and if the file is immutable, then EPERM is returned. There is another test involving nfsd_iscovered, however nfsd_iscovered is equivalent to false. See below.
10. If access requires truncation, but the file is append-only, then EPERM is returned. It would seem that this test should be done in the VFS layer. However VFS enforces correct handling of IS_APPEND at file open time, and there is no equivalent of open with NFS. Interestingly, vfs.c::nfsd_open rejects all read/write access to IS_APPEND files.
11. If all preceding tests succeed, then the owner of the file will always get access. This may seem a bit odd, but it is related to the fact that Unix does permission checking at open time, while NFS has to do it at access time.
12. Finally, the vfs permission routine is called to do normal access checking. As a special case, read-only requests on a regular file are allowed if read OR exec access is available. This allows executables to be loaded (NFS does not distinguish between loading a file to read it and loading a file to execute it).

knfsd has a number of other bits of permission checking code distributed in various places which are worth mentioning.

nfsd_iscovered
This function is called from a number of places in vfs.c, including once in nfsd_permission as mentioned above. In apparent contradiction to its name, this routine seems to check if a given dentry "covers" (i.e. is mounted on) some other dentry. However it allows through the export point.
As Linux only allows the root of a filesystem to cover anything, this function could only return true for the root of a filesystem, but when given a dentry which is the root of a filesystem, the export point will be that same root, and so nfsd_iscovered will still return false. I am not sure what the intent of this routine is.

fs_off_limits
The vfs.c::fs_off_limits function rejects any filesystem which is either an NFS filesystem or a PROC filesystem, as exporting these is a bad idea for different reasons. It is called in nfsd_lookup to make sure that the parent dentry is not on an off-limits file system. It would seem to make more sense to perform this test in fh_verify so that those file systems were equally rejected for all accesses. Further, a more general test would be to reject any filesystem without the FS_REQUIRES_DEV flag, as this covers the two in question, and any such filesystem does not have a reliably stable device number, and so (current) filehandles wouldn't be guaranteed to remain meaningful across reboots.

set-time
Some NFS clients (apparently) try to use the setattr request to update the access and modify times on a file to the current time. This should be allowed for any client which has write access to the file (whereas normally setting these times is restricted to the owner). nfsd_setattr makes a special case of allowing such a request through, provided that the requested time is "close enough" to the current time on the server. "Close enough" is a configurable value set via the /proc/fs/nfs/time-diff-margin file. This configuration should probably go somewhere in /proc/sys to meet current (apparent) standards, though where isn't clear.

7. Tracing/sanity

The knfsd code has a number of hooks for tracing and sanity checking. Some of them are described here.

7.1. nfsd_nr_put/nfsd_nr_verified

File handle structures are used extensively in the nfsd code.
Presumably as a check that they were being allocated and freed properly, a count of the number that have been properly verified, and the number that have been properly released, is kept. The difference between these two should be the number that are currently in use, which should never be more than 3 times the number of threads. However this is never checked, and the numbers are not accessible at all. Maybe they can be discarded.

7.2. dprintk

The sunrpc module provides a very nice facility for turning on and off printk tracing of various modules. Each sunrpc related module (nfs, nfsd, nlm, rpc) has a "debug" variable which can be read or written through /proc/sys/sunrpc/module_debug (with "module" replaced by the module's name). The value is a bitmask of different parts of the module that can be traced. Each file defines which part it is. For example, nfsfh.c contains:

______________________________________________________________________
#define NFSDDBG_FACILITY NFSDDBG_FH
______________________________________________________________________

include/linux/nfsd/debug.h defines

______________________________________________________________________
#define NFSDDBG_FH 0x0002
______________________________________________________________________

so that the command

______________________________________________________________________
echo 2 > /proc/sys/sunrpc/nfsd_debug
______________________________________________________________________

will enable all the dprintk statements in nfsfh.c.

7.3. nfsd_stats

knfsd keeps a few counters to measure various events, as does the sunrpc module. These are made available through the file /proc/net/rpc/nfsd. This file contains one line for each sort of statistic. The first line is specific to knfsd, the remainder are provided by the rpc layer. Though 9 counters are currently defined, only 4 are still used. They are the hits, misses, and refusals(?) for the request cache, and the count of stale file handles that have been seen.
These are the first, second, third and eighth numbers on the line.
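
A small user-space sketch of picking the four live counters out of the knfsd line might look like the following. The exact layout of the line is an assumption: the code accepts an optional non-numeric label followed by whitespace-separated decimal numbers, and the struct and function names are hypothetical. The field called rc_refusals reflects the uncertain "refusals(?)" reading above.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* The four counters the text says are still in use (names are made up). */
struct nfsd_counts {
    unsigned long rc_hits;     /* 1st number: request cache hits */
    unsigned long rc_misses;   /* 2nd number: request cache misses */
    unsigned long rc_refusals; /* 3rd number: request cache refusals(?) */
    unsigned long fh_stale;    /* 8th number: stale file handles seen */
};

/*
 * Parse one line of counters. Returns 0 on success, or -1 if fewer than
 * eight numbers are present. A leading label is skipped on the assumption
 * that it contains no digits.
 */
int parse_nfsd_line(const char *line, struct nfsd_counts *out)
{
    unsigned long v[9];
    int n = 0;
    const char *p = line;

    assert(line != NULL && out != NULL);

    /* Skip everything up to the first digit (label and spaces). */
    while (*p != '\0' && !isdigit((unsigned char)*p))
        p++;

    /* Read at most nine numbers; strtoul skips leading whitespace. */
    while (n < 9) {
        char *end;
        unsigned long x = strtoul(p, &end, 10);

        if (end == p)
            break; /* no more numbers on the line */
        v[n++] = x;
        p = end;
    }
    if (n < 8)
        return -1;

    out->rc_hits = v[0];
    out->rc_misses = v[1];
    out->rc_refusals = v[2];
    out->fh_stale = v[7];
    return 0;
}
```

A caller would read the first line of /proc/net/rpc/nfsd with fgets and pass it to parse_nfsd_line, checking the return value before using the counters.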