2.11BSD/man/cat5/fs.0
FS(5) UNIX Programmer's Manual FS(5)
NAME
fs, inode - format of file system volume (2BSD)
SYNOPSIS
#include <sys/types.h>
#include <sys/fs.h>
#include <sys/inode.h>
DESCRIPTION
Every file system storage volume (e.g. disk) has a common
format for certain vital information. Every such volume is
divided into a certain number of blocks. The block size is
DEV_BSIZE bytes; specified in <_s_y_s/_p_a_r_a_m._h> - currently
1024.
Each disk drive contains some number of file systems each
laid out on a contiguous partition of the disk. A file sys-
tem consists of a _b_o_o_t _b_l_o_c_k, followed by a _s_u_p_e_r _b_l_o_c_k,
followed by an _i_n_o_d_e _a_r_e_a, followed by a _d_a_t_a _b_l_o_c_k _a_r_e_a
which takes up the remainder of the disk partition. The
layout of the super block as defined in <_s_y_s/_f_s._h> is:
#define MAXMNTLEN 12
/*
* Structure of the super-block
*/
struct fs
{
u_short fs_isize; /* first block after i-list */
daddr_t fs_fsize; /* size in blocks of entire volume */
short fs_nfree; /* number of addresses in fs_free */
daddr_t fs_free[NICFREE]; /* free block list */
short fs_ninode; /* number of inodes in fs_inode */
ino_t fs_inode[NICINOD]; /* free inode list */
char fs_flock; /* lock during free list manipulation */
char fs_ilock; /* lock during i-list manipulation */
char fs_fmod; /* super block modified flag */
char fs_ronly; /* mounted read-only flag */
time_t fs_time; /* last super block update */
daddr_t fs_tfree; /* total free blocks */
ino_t fs_tinode; /* total free inodes */
short fs_step; /* optimal step in free list pattern */
short fs_cyl; /* number of blocks per pattern */
char fs_fsmnt[MAXMNTLEN]; /* ordinary file mounted on */
ino_t fs_lasti; /* start place for circular search */
ino_t fs_nbehind; /* est # free inodes before s_lasti */
u_short fs_flags; /* mount time flags */
};
_F_i_l_e _s_y_s_t_e_m: A file system is described by its super-block.
Block 0 of each file system partition is unused and is
Printed 11/26/99 January 27, 1996 1
FS(5) UNIX Programmer's Manual FS(5)
available to contain a bootstrap program, pack label, or
other information. Block 1 (SUPERB) is the super block.
The inode area starts immediately after the super-block, in
block 2. _F_s__i_s_i_z_e is the address of the first block after
the inode area. Thus the inode area is _f_s__i_s_i_z_e-2 blocks
long. _F_s__f_s_i_z_e is the address of the first block not poten-
tially available for allocation to a file. Thus the data
block area is _f_s__f_s_i_z_e - _f_s__i_s_i_z_e blocks long.
_S_u_p_e_r _b_l_o_c_k: The path name on which the file system is
mounted is maintained in _f_s__f_s_m_n_t. _F_s__f_l_o_c_k, _f_s__i_l_o_c_k,
_f_s__f_m_o_d, _f_s__r_o_n_l_y and _f_s__f_l_a_g_s are flags maintained in the
in core copy of the super block while its file system is
mounted and their values on disk are immaterial. _F_s__f_m_o_d is
used as a flag to indicate that the super-block has changed
and should be copied to the disk during the next periodic
update of file system information. _F_s__r_o_n_l_y is a write-
protection indicator. It is a copy of the mount flags
_f_s__f_l_a_g_s anded with MNT_RDONLY(see/_s_y_s/_h/_m_o_u_n_t._h).
_F_s__t_i_m_e is the last time the super-block of the file system
was changed. During a reboot, the _f_s__t_i_m_e of the super-
block for the root file system is used to set the system's
idea of the time.
_I_n_o_d_e: The inode is the focus of all file activity in the
UNIX file system. There is a unique inode allocated for
each active file, each current directory, each mounted-on
file, text file, and the root. An inode is `named' by its
device/i-number pair.
Inodes are 64 bytes long, so 16 of them fit into a block if
DEV_BSIZE is 1024. The root inode is the root of the file
system. Inode 0 can't be used for normal purposes and his-
torically bad blocks were linked to inode 1, thus the root
inode is 2 (inode 1 is no longer used for this purpose, how-
ever numerous dump tapes make this assumption, so we are
stuck with it). No other i-number has a built-in meaning.
The format of an inode as given in <_s_y_s/_i_n_o_d_e._h> is:
/*
* Inode structure as it appears on
* a disk block.
*/
struct dinode {
u_short di_mode; /* mode and type of file */
short di_nlink; /* number of links to file */
uid_t di_uid; /* owner's user id */
gid_t di_gid; /* owner's group id */
off_t di_size; /* number of bytes in file */
daddr_t di_addr[7]; /* 7 block addresses 4 bytes each */
Printed 11/26/99 January 27, 1996 2
FS(5) UNIX Programmer's Manual FS(5)
u_short di_reserved[5];/* pad of 10 to make total size 64 */
u_short di_flags;
time_t di_atime; /* time last accessed */
time_t di_mtime; /* time last modified */
time_t di_ctime; /* time created */
};
/*
* 28 of the di_addr address bytes are used; 7 addresses of 4
* bytes each: 4 direct (4Kb directly accessible) and 3 indirect.
*/
#define NADDR 7
/* modes */
#define IFMT 0170000 /* type of file */
#define IFCHR 0020000 /* character special */
#define IFDIR 0040000 /* directory */
#define IFBLK 0060000 /* block special */
#define IFREG 0100000 /* regular */
#define IFLNK 0120000 /* symbolic link */
#define IFSOCK 0140000 /* socket */
#define ISUID 04000 /* set user id on execution */
#define ISGID 02000 /* set group id on execution */
#define ISVTX 01000 /* save swapped text even after use */
#define IREAD 0400 /* read, write, execute permissions */
#define IWRITE 0200
#define IEXEC 0100
_D_i__m_o_d_e identifies the type of file the inode represents; it
is encoded identically to the _s_t__m_o_d_e field of _s_t_a_t(2).
_D_i__n_l_i_n_k is the number of directory entries (links) that
refer to this inode. _D_i__u_i_d and _d_i__g_i_d are the owner's user
and group IDs. _D_i__s_i_z_e is the number of bytes in the file.
_D_i__a_t_i_m_e and _d_i__m_t_i_m_e are the times of last access and
modification of the file contents (read, write or create);
_D_i__c_t_i_m_e records the time of last modification to the inode
or to the file, and is used to determine whether it should
be dumped by _d_u_m_p(8).
Special files are recognized by their modes. A block-type
special file is one which can potentially be mounted as a
file system; a character-type special file cannot, though it
is not necessarily character-oriented. For special files,
the first two bytes of the _d_i__a_d_d_r field are occupied by the
device code (see _t_y_p_e_s(5)). The device codes of block and
character special files overlap.
Disk addresses of plain files and directories are kept in
the array _d_i__a_d_d_r. For a DEV_BSIZE of 1K bytes, 7 addresses
are kept in _d_i__a_d_d_r using 28 of the 40 bytes. The first 4
addresses specify device blocks directly. The last 3
Printed 11/26/99 January 27, 1996 3
FS(5) UNIX Programmer's Manual FS(5)
addresses are singly, doubly and triply indirect and point
to blocks containing 256 further block pointers. There are
3 block addresses reserved as a pad to bring the total size
of an inode to 64 bytes. All block addresses are of type
_d_a_d_d_r__t (see _t_y_p_e_s(5)).
For block _b in a file to exist, it is not necessary that all
blocks less than _b exist. A zero block number indicates
that the corresponding block has never been allocated. Such
a missing block reads as if it contained all zero bytes.
_F_r_e_e _b_l_o_c_k _l_i_s_t: The free data block list for each volume is
maintained as follows. _F_s__f_r_e_e[_1], ... ,
_f_s__f_r_e_e[_f_s__n_f_r_e_e-_1], contain up to NICFREE free block
numbers (NICFREE is a configuration constant defined in
<_s_y_s/_p_a_r_a_m._h>). _F_s__f_r_e_e[_0] is the block address of the head
of a chain of blocks constituting the free list. The layout
of each block of the free chain as defined in <_s_y_s/_f_s._h> is:
struct fblk
{
short df_nfree; /* number of addresses in df_free */
daddr_t df_free[NICFREE]; /* free block list */
};
The fields _d_f__n_f_r_e_e and _d_f__f_r_e_e in a free block are used
exactly like _f_s__n_f_r_e_e and _f_s__f_r_e_e in the super block.
The algorithm used to allocate a block is: decrement
_f_s__n_f_r_e_e, and the new block number is _f_s__f_r_e_e[_f_s__n_f_r_e_e]. If
the new block address is 0, there are no blocks left, so
give an error. If _f_s__n_f_r_e_e became 0, read the new block
into _f_s__n_f_r_e_e and _f_s__f_r_e_e.
To free a block: check if _f_s__n_f_r_e_e is NICFREE; if so, copy
_f_s__n_f_r_e_e and the _f_s__f_r_e_e array into the newly freed block,
write it out, and set _f_s__n_f_r_e_e to 0. In any event set
_f_s__f_r_e_e[_f_s__n_f_r_e_e] to the freed block's address and increment
_f_s__n_f_r_e_e.
_F_s__i_s_i_z_e and _f_s__f_s_i_z_e are used by the system to check for
bad block addresses; if an `impossible' block address is
allocated from or returned to the free list, a diagnostic is
written on the console. Moreover, the free array is
cleared, to prevent further allocation from a presumably
corrupted free list.
_F_s__s_t_e_p and _f_s__c_y_l determine the block interleaving of files
for fastest access; traditionally these were referred to as
_s__m and _s__n respectively. _F_s__s_t_e_p is the distance between
successive blocks and _f_s__c_y_l is the number of blocks before
the pattern repeats. A file system's interleaving factors
Printed 11/26/99 January 27, 1996 4
FS(5) UNIX Programmer's Manual FS(5)
are determined when it is first created by _m_k_f_s(8). _M_k_f_s
lays out the initial free list with these parameters and
_f_s_c_k(8) can be used to rebuild the free list optimally (and
assign new interleaving factors if necessary).
_F_r_e_e _i_n_o_d_e _l_i_s_t: _F_s__n_i_n_o_d_e is the number of free inode
numbers in the _f_s__i_n_o_d_e array.
To allocate an inode: if _f_s__n_i_n_o_d_e is greater than 0, decre-
ment it and return _f_s__i_n_o_d_e[_f_s__n_i_n_o_d_e]. If it was 0, read
through the inode area and place the numbers of all free
inodes (up to NICINOD) into the _f_s__i_n_o_d_e array, then try
again. If a search for free inodes is necessary, the search
will start at the beginning of the inode area if _f_s__n_b_e_h_i_n_d
>= 4 x NICINOD, otherwise starting at _f_s__l_a_s_t_i and continu-
ing at the beginning of the inode area if NICINOD free
inodes aren't found when the end of the inode area is
reached. When a search completes the i-number of the first
inode of the last block scanned in the search is left in
_f_s__l_a_s_t_i.
To free an inode, provided _f_s__n_i_n_o_d_e is less than NICINODE,
place its number into _f_s__i_n_o_d_e[_f_s__n_i_n_o_d_e] and increment
_f_s__n_i_n_o_d_e. If _f_s__n_i_n_o_d_e is already NICINODE, don't bother to
enter the freed inode into any table (_f_s__i_n_o_d_e is only to
speed up the allocation process; the information as to
whether the inode is really free or not is maintained in the
inode itself). If the i-number of the freed inode is less
than _f_s__l_a_s_t_i increment _f_s__n_b_e_h_i_n_d.
SEE ALSO
stat(2), dir(5), types(5), dcheck(8), fsck(8), icheck(8),
mkfs(8), mount(8)
BUGS
It isn't the _4_B_S_D _f_a_s_t _f_i_l_e _s_y_s_t_e_m. The 2BSD file system is
a direct descendent of the V7 file system and exists little
changed from that ancestor. There are many performance
holes in the file system.
Some changes from the original V7 file system have resulted
in better performance: The larger block size (1Kb as opposed
to the 512 byte block size of V7) cuts the average number of
system calls necessary to access a file by a factor of two;
the smaller (in core) inodes allowed by the smaller number
of direct links kept in inodes saves valuable kernel data
space allowing the kernel buffer cache to be made larger
while sacrificing only 1Kb of direct file accessing; and
starting free inode searches at the position the last search
ended cuts the time to gather free inodes significantly.
Printed 11/26/99 January 27, 1996 5
FS(5) UNIX Programmer's Manual FS(5)
However, the separation of inodes and data blocks into com-
pletely different areas of the disk, the handling of the
free list, the lack of any file allocation layout policy
encouraging locality such as that found in the 4BSD file
system and the still too small block size often leads to
extremely poor performance.
The separation of inodes and data blocks in the file system
means that to access a file a seek will have to be made to
the beginning of the disk partition containing the file sys-
tem followed another to the the actual data blocks of the
file (often quite distant from the inode area).
The free list which is laid out at file system creation for
optimal file block allocation, becomes scrambled over time
on an active file system. This process is slowed down by
the kernel which always frees blocks from unlink'ed or trun-
cated files in reverse order thereby maintaining strings of
optimally laid out free blocks in the free list. Eventu-
ally, however, since both freed and allocated blocks use the
head of the free list, it's possible (and quite probable) to
have most of the free list laid out optimally with the first
portion totally scrambled. As a trade off, a file system's
free list may be rebuilt fairly frequently via _i_c_h_e_c_k -_s or
_f_s_c_k -_s and most blocks allocated will be localized as close
to the the inode area as possible. Because of this problem,
files are sometimes scattered across a file system generat-
ing an unpleasant amount of disk arm movement. A nasty
oscillation also occurs in the free block list when _f_s__n_f_r_e_e
hovers around NICFREE and 0 causing the free array to be
constantly written out and read back in as blocks are freed
and allocated.
For a more in depth analysis of the 2BSD file system, its
shortcomings, and a description of the changes made for the
4BSD file system see "A Fast File System for UNIX" by _M.
_M_c_K_u_s_i_c_k; _W. _J_o_y; _S. _L_e_f_f_l_e_r; and _R. _F_a_b_r_y.
Printed 11/26/99 January 27, 1996 6