7. Inodes and Operations

Linux keeps a cache of active and recently used inodes. There are two paths by which these inodes can be accessed.

The first is through the dcache described above. Each dentry in the dcache refers to an inode, and thereby keeps that inode in the cache.

The second path is through the inode hash table. Each inode is hashed (to an 8 bit number) based on the address of the file-system's super-block and the inode number. Inodes with the same hash value are then chained together in a doubly linked list.

Access through the hash table is achieved using the iget function. iget is only called by individual file-system implementations when looking up an inode (which wasn't found in the dcache), and by nfsd.
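As a rough illustration of the hashing scheme, here is a small user-space model. Everything prefixed with toy_ is hypothetical, and the mixing steps in toy_hash are an assumption made for the example; the kernel's own hash function and list handling differ in detail. The sketch only shows the shape of the idea: fold the super-block address and inode number into 8 bits, chain colliding inodes in a doubly linked list, and walk the chain on lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_HASH_BITS 8
#define TOY_HASH_SIZE (1u << TOY_HASH_BITS)    /* 256 buckets */

struct toy_super_block { int dummy; };

struct toy_inode {
    struct toy_inode *next, *prev;    /* chain of inodes in the same bucket */
    struct toy_super_block *sb;
    unsigned long ino;
};

static struct toy_inode *toy_hashtable[TOY_HASH_SIZE];

/* Fold the super-block address and the inode number into 8 bits.
 * The mixing steps are illustrative, not the kernel's. */
unsigned int toy_hash(const struct toy_super_block *sb, unsigned long ino)
{
    uintptr_t v = (uintptr_t)sb / sizeof(struct toy_super_block) + ino;

    v ^= v >> TOY_HASH_BITS;
    v ^= v >> (2 * TOY_HASH_BITS);
    return (unsigned int)(v & (TOY_HASH_SIZE - 1));
}

/* Put an inode at the head of its bucket's chain. */
void toy_insert_inode_hash(struct toy_inode *inode)
{
    unsigned int h;

    assert(inode != NULL && inode->sb != NULL);
    h = toy_hash(inode->sb, inode->ino);
    inode->prev = NULL;
    inode->next = toy_hashtable[h];
    if (inode->next)
        inode->next->prev = inode;
    toy_hashtable[h] = inode;
}

/* Walk one bucket; both the super-block and the inode number must match. */
struct toy_inode *toy_find_inode(const struct toy_super_block *sb,
                                 unsigned long ino)
{
    struct toy_inode *p;

    for (p = toy_hashtable[toy_hash(sb, ino)]; p != NULL; p = p->next)
        if (p->sb == sb && p->ino == ino)
            return p;
    return NULL;
}
```

Note that two file-systems may both have an inode number 42; the super-block comparison in toy_find_inode is what keeps them apart.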

Basing the hash on the inode number is a bit restrictive as it assumes that every file-system can uniquely identify a file in 32 bits. This is a problem at least for the NFS file-system, which would prefer to use the 256 bit file handle as the unique identifier in the hash.

The nfsd usage might be better served by having the file-system provide a filehandle-to-inode mapping function which can interpret the filehandle however is most appropriate.

7.1 Inode Structure


struct inode {
        struct list_head        i_hash;
        struct list_head        i_list;
        struct list_head        i_dentry;

        unsigned long           i_ino;
        unsigned int            i_count;
        kdev_t                  i_dev;
        umode_t                 i_mode;
        nlink_t                 i_nlink;
        uid_t                   i_uid;
        gid_t                   i_gid;
        kdev_t                  i_rdev;
        off_t                   i_size;
        time_t                  i_atime;
        time_t                  i_mtime;
        time_t                  i_ctime;
        unsigned long           i_blksize;
        unsigned long           i_blocks;
        unsigned long           i_version;
        unsigned long           i_nrpages;
        struct semaphore        i_sem;
        struct inode_operations *i_op;
        struct super_block      *i_sb;
        wait_queue_head_t       i_wait;
        struct file_lock        *i_flock;
        struct vm_area_struct   *i_mmap;
        struct page             *i_pages;
        spinlock_t              i_shared_lock;
        struct dquot            *i_dquot[MAXQUOTAS];
        struct pipe_inode_info  *i_pipe;

        
        unsigned long           i_state;

        unsigned int            i_flags;
        unsigned char           i_sock;

        atomic_t                i_writecount;
        unsigned int            i_attr_flags;
        __u32                   i_generation;
        union {
                ....
                struct ext2_inode_info          ext2_i;
                ....
                struct socket                   socket_i;
                void                            *generic_ip;
        } u;
};

Many fields in the inode structure will have an obvious meaning to anyone familiar with Unix file-systems, so they will be skipped. Here I will only deal with those specific to Linux or which have interesting usage.

i_hash

The i_hash linked list links together all inodes which hash to the same hash bucket. Hash values are based on the address of the super-block structure, and the inode number of the inode.

i_list

The i_list linked list links inodes in various states. There is the inode_in_use list which lists unchanged inodes that are in active use, inode_unused which lists unused inodes, and superblock->s_dirty which holds all the dirty inodes on the given file system.

i_dentry

The i_dentry list is a list of all struct dentrys that refer to this inode. They are linked together with the d_alias field of the dentry.

i_version

The i_version field is available for file-systems to use to record that a change has been made since some previous time. Typically the i_version is set to the current value of the event global variable which is then incremented. The file-system code will sometimes assign the current value of i_version to the f_version field of an associated file structure. On a subsequent use of the file structure, it is then possible to tell if the inode has been changed, and if necessary, data cached in the file structure can be refreshed.
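The version-stamp pattern can be sketched in user space as follows. The toy_ names are hypothetical stand-ins and the structures are cut down to the two fields that matter; this is only meant to show the comparison between i_version and f_version, not any particular file-system's code.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the global event counter. */
unsigned long toy_event = 1;

struct toy_inode {
    unsigned long i_version;
};

struct toy_file {
    struct toy_inode *f_inode;
    unsigned long f_version;    /* i_version seen when data was cached */
};

/* Record a change: take the current event value, then increment it. */
void toy_inode_changed(struct toy_inode *inode)
{
    assert(inode != NULL);
    inode->i_version = toy_event++;
}

/* Remember which inode version the file's cached data belongs to. */
void toy_file_snapshot(struct toy_file *filp)
{
    filp->f_version = filp->f_inode->i_version;
}

/* Non-zero if the inode has changed since the last snapshot. */
int toy_file_cache_is_stale(const struct toy_file *filp)
{
    return filp->f_version != filp->f_inode->i_version;
}
```

A file-system would call something like toy_file_cache_is_stale at the start of an operation that relies on data cached in the file structure, and refresh that data if it returns non-zero.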

i_nrpages

This field records the number of pages, linked at i_pages, which are currently cached for this inode. It is incremented by add_page_to_inode_queue and decremented by remove_page_from_inode_queue.

i_sem

This semaphore guards changes to the inode. Any code that wants to make non-atomic access to the inode (i.e. two related accesses with the possibility of sleeping in between) must first claim this semaphore. This includes such things as allocating and deallocating blocks and searching through directories.

It appears that it is not possible to claim a shared lock for read-only operations.

i_flock

This points to the list of struct file_lock structures that impose locks on this inode.

i_mmap

All of the vm_area_struct structures that describe mapping of an inode are linked together with the vm_next_share and vm_pprev_share pointers. This i_mmap pointer points into that list.

i_pages

This is the list of all pages in the page cache that refer to this inode. They are linked together on the next and prev links in the page structure.

i_shared_lock

This spin lock guards the vm_next_share and vm_pprev_share pointers in the i_mmap list.

i_state

There are three possible inode state bits: I_DIRTY, I_LOCK, I_FREEING.

I_DIRTY

Dirty inodes are on the per-super-block s_dirty list, and will be written next time a sync is requested.

I_LOCK

Inodes are locked while they are being created, read or written.

I_FREEING

An inode has this state when the reference count and link count have both reached zero. This seems to be used only by igrab, called from the fat file-system. fat does funny things with inodes.
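Since these are bits in a single word, an inode can be in more than one state at once (dirty and locked, say). A minimal sketch of the bit handling, with hypothetical toy_ names and bit values chosen purely for illustration (check the kernel headers for the real ones):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative values only; not copied from the kernel headers. */
#define TOY_I_DIRTY   1UL
#define TOY_I_LOCK    2UL
#define TOY_I_FREEING 4UL

struct toy_inode {
    unsigned long i_state;
};

/* Set one or more state bits without disturbing the others. */
void toy_set_state(struct toy_inode *inode, unsigned long bits)
{
    assert(inode != NULL);
    inode->i_state |= bits;
}

/* Clear one or more state bits without disturbing the others. */
void toy_clear_state(struct toy_inode *inode, unsigned long bits)
{
    assert(inode != NULL);
    inode->i_state &= ~bits;
}

/* Non-zero if any of the given bits is set. */
int toy_test_state(const struct toy_inode *inode, unsigned long bits)
{
    return (inode->i_state & bits) != 0;
}
```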

i_flags

The i_flags field corresponds to the s_flags field in the super block. Many of the flags can be set system wide or per inode. The per-inode flags are:

MS_NOSUID

Setuid/setgid is not permitted in this file.

MS_NODEV

If this inode is a device special file, it cannot be opened.

MS_NOEXEC

This file cannot be executed.

MS_SYNCHRONOUS

All writes should be synchronous.

MS_MANDLOCK

Mandatory locking is honoured.

S_QUOTA

Quotas have been initialised.

S_APPEND

The file can only be appended to.

S_IMMUTABLE

The file may not be changed, even by root.

MS_NOATIME

Do not update access time on the inode when the file is accessed.

MS_NODIRATIME

Do not update access time on directories (but still do so on files unless MS_NOATIME).

MS_ODD_RENAME

Weird NFS thing.

i_writecount

If this is positive, it counts the number of clients (files or memory maps) which have write access. If negative, then the absolute value of this number counts the number of VM_DENYWRITE mappings that are current. Otherwise it is 0, and nobody is trying to write or trying to stop others from writing.
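The sign convention is easier to see in code. The following user-space model uses hypothetical toy_ helpers, and the choice of -ETXTBSY as the refusal code is my assumption for the sketch rather than something stated above. The point is simply that writers push the count up from zero, deny-write mappings push it down from zero, and each side is refused while the other holds the count.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_inode {
    int i_writecount;   /* >0: writers, <0: deny-write mappings, 0: neither */
};

/* A writer arrives: refused while any deny-write mapping is current. */
int toy_get_write_access(struct toy_inode *inode)
{
    if (inode->i_writecount < 0)
        return -ETXTBSY;
    inode->i_writecount++;
    return 0;
}

/* A writer leaves. */
void toy_put_write_access(struct toy_inode *inode)
{
    assert(inode->i_writecount > 0);
    inode->i_writecount--;
}

/* A deny-write mapping arrives: refused while anyone has write access. */
int toy_deny_write_access(struct toy_inode *inode)
{
    if (inode->i_writecount > 0)
        return -ETXTBSY;
    inode->i_writecount--;
    return 0;
}

/* A deny-write mapping goes away. */
void toy_allow_write_access(struct toy_inode *inode)
{
    assert(inode->i_writecount < 0);
    inode->i_writecount++;
}
```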

i_attr_flags

This is never used, and is only set by ext2_read_inode to be some combination of ATTR_FLAG_SYNCRONOUS, ATTR_FLAG_APPEND, ATTR_FLAG_IMMUTABLE and ATTR_FLAG_NOATIME.

i_generation

The intent of i_generation is to be able to distinguish between an inode before and after a delete/reuse cycle. This is important for NFS. Currently, only ext2 and nfsd maintain this field.
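To make the delete/reuse problem concrete, here is a small model of how a file handle might carry the generation number. The toy_fhandle layout and the helper names are hypothetical; real NFS file handles contain more than this, and bumping the generation on reuse is just the simplest policy to show.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_inode {
    unsigned long i_ino;
    uint32_t      i_generation;
};

/* Hypothetical handle contents: just enough to detect reuse. */
struct toy_fhandle {
    unsigned long ino;
    uint32_t      generation;
};

/* Build a handle for a client from the inode's current identity. */
struct toy_fhandle toy_make_handle(const struct toy_inode *inode)
{
    struct toy_fhandle fh;

    assert(inode != NULL);
    fh.ino = inode->i_ino;
    fh.generation = inode->i_generation;
    return fh;
}

/* The inode number is recycled for a new file: bump the generation. */
void toy_reuse_inode(struct toy_inode *inode)
{
    inode->i_generation++;
}

/* A handle is stale if it names a different incarnation of the inode. */
int toy_handle_is_stale(const struct toy_fhandle *fh,
                        const struct toy_inode *inode)
{
    return fh->ino != inode->i_ino ||
           fh->generation != inode->i_generation;
}
```

Without the generation check, a client holding an old handle would silently get access to whatever new file happened to be given the same inode number.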

It is not clear that this should be exported to the VFS layer at all as its use is so specific. Rather each file-system should have the opportunity to provide a unique file handle for a given inode, and each can then do whatever seems best to guarantee uniqueness.

7.2 Inode Methods


struct inode_operations {
        struct file_operations * default_file_ops;
        int (*create) (struct inode *,struct dentry *,int);
        struct dentry * (*lookup) (struct inode *,struct dentry *);
        int (*link) (struct dentry *,struct inode *,struct dentry *);
        int (*unlink) (struct inode *,struct dentry *);
        int (*symlink) (struct inode *,struct dentry *,const char *);
        int (*mkdir) (struct inode *,struct dentry *,int);
        int (*rmdir) (struct inode *,struct dentry *);
        int (*mknod) (struct inode *,struct dentry *,int,int);
        int (*rename) (struct inode *, struct dentry *,
                        struct inode *, struct dentry *);
        int (*readlink) (struct dentry *, char *,int);
        struct dentry * (*follow_link) (struct dentry *, struct dentry *, unsigned int);

        int (*get_block) (struct inode *, long, struct buffer_head *, int);

        int (*readpage) (struct file *, struct page *);
        int (*writepage) (struct file *, struct page *);
        int (*flushpage) (struct inode *, struct page *, unsigned long);

        void (*truncate) (struct inode *);
        int (*permission) (struct inode *, int);
        int (*smap) (struct inode *,int);
        int (*revalidate) (struct dentry *);
};

default_file_ops

This points to the default table of file operations for files opened on this inode. When a file is opened, the f_op field in the file structure is initialised from this, and then the open method in the file_operations table is called. That method may choose to change the f_op to a different (non-default) method table. This is done, for example, when a device special file is opened.
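The hand-over from the default table to a more specific one can be modelled in a few lines of user-space C. All the toy_ names, including the toy_tty_fops table standing in for a driver's operations, are hypothetical; the real structures have many more methods, and the real device open path does more work than this.

```c
#include <assert.h>
#include <stddef.h>

struct toy_inode;
struct toy_file;

struct toy_file_operations {
    const char *name;
    int (*open)(struct toy_inode *, struct toy_file *);
};

struct toy_inode {
    const struct toy_file_operations *default_file_ops;
};

struct toy_file {
    const struct toy_file_operations *f_op;
};

/* Hypothetical driver-specific table that a device open switches to. */
const struct toy_file_operations toy_tty_fops = { "tty", NULL };

/* open method of the default table for device special files:
 * replace f_op with the table belonging to the actual driver. */
int toy_chrdev_open(struct toy_inode *inode, struct toy_file *filp)
{
    (void)inode;
    filp->f_op = &toy_tty_fops;
    return 0;
}

const struct toy_file_operations toy_def_chr_fops = { "def_chr", toy_chrdev_open };
const struct toy_file_operations toy_regular_fops = { "regular", NULL };

/* What happens at open time: start from the inode's default table,
 * then let its open method (if any) have the final say. */
int toy_open(struct toy_inode *inode, struct toy_file *filp)
{
    assert(inode->default_file_ops != NULL);
    filp->f_op = inode->default_file_ops;
    if (filp->f_op->open)
        return filp->f_op->open(inode, filp);
    return 0;
}
```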

create

This and the next 8 methods are only meaningful on directory inodes.

create is called when the VFS wants to create a file with the given name (in the dentry) in the given directory. The VFS will have already checked that the name doesn't exist, and the dentry passed will be a negative dentry meaning that the inode pointer will be NULL.

Create should, if successful, get a new empty inode from the cache with get_empty_inode, fill in the fields and insert it into the hash table with insert_inode_hash, mark it dirty with mark_inode_dirty, and instantiate it into the dcache with d_instantiate.

The int argument contains the mode of the file which should indicate that it is S_IFREG and specify the required permission bits.

lookup

lookup should check if that name (given by the dentry) exists in the directory (given by the inode) and should update the dentry using d_add if it does. This involves finding and loading the inode.

If the lookup failed to find anything, this is indicated by returning a negative dentry, with an inode pointer of NULL.

As well as returning an error or NULL, indicating that the dentry was correctly updated, lookup can return an alternate dentry, in which case the passed dentry will be released. I don't know if this possibility is actually used.

link

The link method should make a hard link from the name referred to by the first dentry to the name referred to by the second dentry, which is in the directory referred to by the inode.

If successful, it should call d_instantiate to link the inode of the linked file to the new dentry (which was a negative dentry).

unlink

This should remove the name referred to by the dentry from the directory referred to by the inode. It should d_delete the dentry on success.

symlink

This should create a symbolic link in the given directory with the given name having the given value. It should d_instantiate the new inode into the dentry on success.

mkdir

Create a directory with the given parent, name, and mode.

rmdir

Remove the named directory (if empty) and d_delete the dentry.

mknod

Create a device special file with the given parent, name, mode, and device number. Then d_instantiate the new inode into the dentry.

rename

The first inode and dentry refer to a directory and name that exist. rename should rename the object to have the parent and name given by the second inode and dentry. All generic checks, including that the new parent isn't a child of the old name, have already been done.

readlink

The symbolic link referred to by the dentry is read and the value is copied into the user buffer (with copy_to_user) with a maximum length given by the int.

follow_link

If we have a directory (the first dentry) and a name within that directory (the second dentry) then the obvious result of following the name from the directory would arrive at the second dentry. If an inode requires some other, non-obvious, result -- as do symbolic links -- the inode should provide a follow_link method to return the appropriate new dentry. The int argument contains a number of LOOKUP flags which are described in the section on namei lookups.

get_block

This method is used to find the device block that holds a given block of a file. The inode and long indicate the file and block number being sought (the block number is the file offset divided by the file-system block size). get_block should initialise the b_dev and b_blocknr fields of the buffer_head, and should possibly modify the b_state flags.

If the int argument is non-zero then a new block should be allocated if one does not already exist.
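A toy version of this contract, for a made-up file-system that keeps a short table of direct block numbers in the inode, might look like the following. The toy_ structures, the 1024-byte block size, the "mapped" field standing in for a b_state flag, and the -EIO return for an out-of-range block are all assumptions made for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_BLOCK_SIZE 1024L
#define TOY_NBLOCKS    8

struct toy_buffer_head {
    int  b_dev;
    long b_blocknr;
    int  mapped;        /* stand-in for a "mapped" bit in b_state */
};

struct toy_inode {
    int  i_dev;
    long blocks[TOY_NBLOCKS];   /* device block per file block, 0 = hole */
    long next_free;             /* trivial allocator: next unused block */
};

/* The block number is the file offset divided by the block size. */
long toy_offset_to_block(long offset)
{
    assert(offset >= 0);
    return offset / TOY_BLOCK_SIZE;
}

/* Map file block iblock to a device block, allocating only if asked. */
int toy_get_block(struct toy_inode *inode, long iblock,
                  struct toy_buffer_head *bh, int create)
{
    if (iblock < 0 || iblock >= TOY_NBLOCKS)
        return -EIO;
    if (inode->blocks[iblock] == 0) {
        if (!create)
            return 0;           /* a hole: leave bh unmapped */
        inode->blocks[iblock] = inode->next_free++;
    }
    bh->b_dev = inode->i_dev;
    bh->b_blocknr = inode->blocks[iblock];
    bh->mapped = 1;
    return 0;
}
```

With this layout a read at byte offset 2500 falls in file block 2, and a lookup of that block with create set to zero on a sparse file succeeds but leaves the buffer unmapped.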

readpage

readpage is only called from mm/filemap.c. It is needed for memory mapping of files (as you would expect), for using the sendfile system call, and when generic_file_read is used for the file:read method.

readpage is not expected to actually read in the page. It must arrange for the read to happen. Clients wait for the page to be unlocked before using the data.

readpage can be implemented using block_read_full_page which is defined in fs/buffer.c. This routine assumes that inode:get_block has been defined and sets up buffer_heads to access the block in question. These buffer_heads will be set to call 'end_buffer_io_async' on completion, which will unlock the page when all buffers on the page complete.
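The "unlock when the last buffer completes" idea can be shown with a simple counter. This is a single-threaded user-space model with hypothetical toy_ names; the kernel tracks completion through the buffer_heads themselves rather than a counter in the page, so treat this only as a picture of the behaviour clients see.

```c
#include <assert.h>
#include <stddef.h>

struct toy_page {
    int locked;     /* clients must wait while this is set */
    int pending;    /* buffers on this page with I/O still in flight */
};

/* readpage-like step: lock the page and queue one read per buffer. */
void toy_start_read(struct toy_page *page, int nr_buffers)
{
    assert(nr_buffers > 0);
    page->locked = 1;
    page->pending = nr_buffers;
}

/* Completion callback, called once per buffer.
 * The last completion unlocks the page. */
void toy_end_buffer_io(struct toy_page *page)
{
    assert(page->pending > 0);
    if (--page->pending == 0)
        page->locked = 0;
}

/* Clients may use the data only once the page is unlocked. */
int toy_page_ready(const struct toy_page *page)
{
    return !page->locked;
}
```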

writepage

Writepage is also called from mm/filemap.c.

It is called by do_write_page from filemap_write_page, from filemap_swapout, filemap_sync_pte, and from generic_file_mmap.

Writepage can be implemented using block_write_full_page from fs/buffer.c. It is a close twin of block_read_full_page, with some important differences.

These two routines could be cleaned up a bit so that the similarity and differences stand out more.

flushpage

flushpage is called from mm/filemap.c and mm/swap_state.c.

In mm/filemap.c it is called by truncate_inode_pages to make sure no I/O is pending on a page before the page is released. mm/swap_state.c similarly calls it when a page is being removed from the swap cache -- all I/O must be finished.

truncate

TODO

permission

TODO

smap

TODO

revalidate

TODO
