.fp 1 PA .fp 2 PI .fp 3 PB .fp 4 PX .EQ delim $$ .EN .ND January 2, 1981 .TM 81-11272-1 11173 11173-11 .TL .UX B-trees .AU "MH 2C522" 7214 P.J. Weinberger .AI .MH .AB B-trees are a good data structure for storing, retrieving, and updating records in files containing many records. Ordinary .UX tools such as $awk$ and $grep$ become too slow when used on files with more than a few hundred lines, and $ed$ can no longer be used for updating after a thousand or so lines. This memo contains descriptions of subroutines and commands for creating and maintaining B-trees, and a complete example of their use in a data base with two files. .AE .CS 5 2 7 0 0 1 .NH Introduction .PP This memo contains a description of subroutines and commands which implement B-trees. (See [Comer] for a bibliography on B-trees.) The subroutines make the B-trees look to the user like a list of .I "key\(emvalue" pairs, sorted by .I key . The subroutines provide for positioning within the list, reading the next pair, deleting pairs, and writing pairs. The way the subroutines implement the list is by using a prefix-compressed B-tree, in which prefixes common to consecutive .I keys are factored out. The details of the data structures are in sections 2 and 5. .PP There are several advantages to using these B-trees as the file organization for data bases. A given key can be searched for with a small number of file system accesses, typically two or three. Since the file looks to the user as if it is sorted on the keys, it is easy to retrieve the lexically next key, and so to scan the file in lexical order. Because the keys are sorted, adjacent keys tend to have a common prefix, which can be compressed out. In the case of the dictionary .I /usr/dict/words , where the .I values are all null, the original file consists of 24473 words and 201032 bytes, while the B-tree contains only 134144 bytes. The B-tree is significantly shorter even though it contains headers and other overhead. .PP Sections 3 and 4 contain descriptions of subroutines and commands. .PP There are 9 subroutines. 4 of these correspond to .UX system calls $read$, $write$, $open$, $close$, with the difference that the data stored in the B-trees is not just a vector of bytes, but is a sequence of $key\(emvalue$ pairs. There is one routine for deleting, which has no corresponding system call. The remaining 4 correspond to various aspects of $lseek$: one positions, one reports on the position, one positions to the beginning, and the last gives the length of the $value$ at the current position. .PP The commands are utilities for building B-trees, dealing with them at the command level, and gathering statistics about them. .PP Section 6 contains a complete example constructing and using a small but interesting data base. .NH B-trees .PP The B-trees used by the subroutines are slightly different from standard B-trees. The resemblance would be closer if the routines did not support the .I value half of the pair. Then the .UX B-trees would be B-trees in which all the keys are in leaves, and in which the keys can be of variable length, and are stored with some compression. Adding the .I value half of a pair allows very large records to be stored. Further each record can be stored contiguously, rather than broken across tree nodes. .PP The standard definition of B-tree requires some bounds on the number of keys in a node, as in 2-3 trees, where each node has 2 or 3 keys. While this restriction is convenient for proving theorems it is silly in practice. Intuitively, trees which branch a lot at each node have more leaves for a given height, so that searching from the root to a leaf requires fewer file system accesses the bushier the tree. Therefore one wants as many keys as possible in each node, all other things being equal. The node size has been chosen to be the disk block size. Given that, various algorithms for data compression might be used to squeeze as many keys as possible into a node. .PP The method I chose maintains the keys in lexical order in the node. If a key begins with the same $n$ bytes as its immediate predecessor, it is stored with the $n$ bytes replaced by the count $n$. A trivial analysis shows how much space might be saved: Suppose that the keys are chosen at random from strings of length $m$ in which each of $a$ characters appears with equal probability. Since there are no more than $a sup k$ prefixes of length $k$, one expects that the length of the common prefix of two of $N$ keys is about $log ( N) / log (a)$. Since one byte is required to store the length of the common prefix, the space saving should be around $N( log N / log a - 1) $ bytes. .PP Using this compression technique does not make handling the keys much more complicated. The reader will have no trouble seeing that searching to find a key is easy, that the size of a list of keys decreases when a key is deleted, and that the size of a list of keys increases by no more than the uncompressed size of the new key when a key is inserted. .PP The neighborhood of an interior node in a B-tree looks like .KS .nf .so picture.out .fi .KE Each node which is not a leaf contains one more pointer than key: .EQ p sub 0 , ~ k sub 0 , ~ p sub 1 , ~ k sub 1 , ~ ... , ~ k sub n , ~ p sub n+1 . .EN Of the keys in the subtree whose root is the given node, $p sub j$ points to the subtree containing keys strictly greater than $k sub j-1$ and no greater than $k sub j$. $p sub 0$ points to the subtree containing keys no greater than $k sub 0$, and $p sub n+1$ points to the subtree containing keys strictly greater than $k sub n$. If $n$ is 0, then there is only one subtree, and $p sub 0$ points to it. This last case can occur when the tree is being built by the $build$ command, which may leave a single node nearly empty at each level of the tree due to its zeal in making the rest of the nodes at that level as full as possible. More detail on file structure is in section 5. .NH Subroutines .PP There are 9 subroutines that C programs can use to access B-trees. The normal order of events would be for a program to open one or more B-trees using $bopen$. To retrieve information the program would use $bseek$ to position itself in the B-tree. It could use $breclen$ to see how much it was about to read, and then $bread$ to actually retrieve the data. If it wanted to delete information it would use $bdelete$, and to write new or changed information it would use $bwrite$. .SH /usr/include/cbt.h .LP contains definitions of some manifest constants and types used by the subroutines. Manifest constants are represented as CONSTANT. A pointer of type .I *bfile is returned by the open routine and passed to each of the other routines. Data is passed in an .I mbuf : .DS .ft 8 typedef struct { char *mdata; unsigned short mlen; } mbuf; .ft P .DE .I mdata points to .I mlen bytes of data. .PP An open B-tree is either positioned at its end or at some pair. When the file is opened, it is positioned at the first pair. Reading the B-tree gets the pair at the current position. Each subroutine has a deterministic effect on the current position. .PP B-trees can have any one of the 4 types INDEX, READONLY, both, or neither. The type is determined when the B-tree is created. INDEX says that the .I value part of a pair will always be null; i.e., .I mlen will be zero. A B-tree should be READONLY if updates are infrequent and are never done while someone else is reading in the B-tree. .SH bfile *bopen(b-tree-name, flag) char *b-tree-name; .LP returns a pointer to an internal structure if the open is successful, and NULL otherwise. The file is positioned to its first pair. .I flag should be 0 if the file is only to be read, and 2 otherwise. It is passed directly to the system .I open routine. .SH bclose(bf) bfile *bf; .LP closes the B-tree, flushing buffers and cleaning up in general. If a process updates a B-tree which is not READONLY, then it must call this routine before any other user can perceive its updates. .SH bfirst(bf) bfile *bf; .LP resets the current position in the file to the beginning. It returns EOF if the B-tree is empty. .SH bseek(bf, key) bfile *bf; mbuf key; .LP sets the current position in the file to the first key in the file which is lexically greater than or equal to .I key . The routine returns FOUND if .I key was in the file, EOF if .I key is greater than any key in the file, and NOTFOUND otherwise. Lexical ordering is that induced by the C type .I char . .SH breclen(bf) bfile *bf; .LP returns the length of the .I value at the current position in the file, or EOF if you are at the end of the file. (On a machine where $short$ and .I int are 16 bits long, EOF will be a legal value for $mlen$, name 65535. If the user is on such a machine and writes objects of this length, he must not rely on this subroutine to detect the end of the file.) .SH mbuf bkey(bf) bfile *bf; .LP returns the .I key at the current position of the file. A key cannot have .I mlen of zero, so if the returned value does, you are at the end of the file. Do not change the data pointed to by .I mdata in the returned value, as it is used by the subroutines. .SH bread(bf, key, rec) bfile *bf; mbuf *key, *rec; .LP returns 0 if successful and EOF otherwise. If .I key is not NULL, $ key->mlen $ is set to the length of the key at the current position of the file, and the buffer pointed to by $ key->mdata $ is filled with the key. Likewise, if .I rec is not NULL, $ rec->mlen $ is set to the length of the value at the current position of the file, and the buffer pointed to by $rec->mdata $ is filled with the value. Then the current position in the file is moved to the next pair. It is the user's responsibility to make the buffers large enough to hold what is put in them. $breclen$ gives the necessary size for values, and no key can be as large as the defined constant NDSZ from .I /usr/include/cbt.h . .SH bdelete(bf, key) bfile *bf; mbuf key; .LP deletes the pair with the given .I key from the file if it is present. The routine returns EOF on error, FOUND if a pair was deleted, and NOTFOUND otherwise. Afterwards, the current position in the file is as if .I "bseek(bf, key)" had been called. .SH bwrite(bf, key, value) bfile *bf; mbuf key, value; .LP writes the pair .I "key value" into the file. It returns EOF on error, FOUND if a pair was overwritten, and NOTFOUND if a new pair was written. When the routine returns, the position of the file is as if .I "bseek(bf, key)" had been called. Keys should not be more than about 120 bytes long. .SH Error Returns .PP If an error causes a routine to return EOF, the external integer .I errno is set to one of the codes defined in the header file. BNOWRITE means you tried to write on a B-tree not opened for writing, BIOWRT means that the system wrote fewer bytes than requested, BNOMEM means that a subroutine needed more memory and couldn't get it, and BTALL means that the tree has grown too tall. .I errno can also be set because one of the system calls failed. Then it will have one of the values from .I /usr/include/errno.h . .NH Commands .PP The commands are all of the form .I "cbt command args" . The file .I cbt is a shell script which invokes the actual commands. The commands are utilites. $cat$ just uses the subroutines of the last section, but $creat$, $build$, and $report$ require inside knowledge of data structures not available through the B-tree subroutines. This is all in the spirit of data abstraction. The user, through the subroutines, gets certain services, which are best provided through private data structures not available to the user. $build$ has another and more compelling justification; it creates trees that require less space than if they were constructed by a sequence of $bwrite$s. .SH cbt creat [flags] file .LP creates an empty B-tree. In .I flags , .I -i makes the file have type INDEX, and .I -r makes the file have type READONLY. The .UX file created is .I file.T . If the B-tree is not of type INDEX, .I file.F is created to hold the values. .SH cbt build [-R] file .LP builds the B-tree .I file from the standard input. The input data must be sorted by key. The new file has each node as full as possible, given that the program decides what to do with each key before looking at the next. .PP If the optional argument is not present each line of input contains a key and value separated by a tab. All the bytes to the first tab on a line make up the key (which must not be empty), the tab is removed, and the remaining bytes, not including the trailing newline, make up the value. Thus there is no way, without the optional argument to the command, to get a tab or newline in a key, or a newline in a value. .PP If the optional argument is present, the command expects a sequence of data of the form .DS .ft 8 struct { struct unsigned short klen; char kdata[klen]; } key; struct { unsigned short vlen; char vdata[vlen]; } value; }; .ft P .DE The dictionary mentioned in the introduction was constructed by .DS .ft 8 cbt creat -i foo; sort /usr/dict/words | cbt build foo .ft P .DE .SH cbt cat [flags] files .LP writes the contents of the files on its standard output. If the .I R flag is present the output is suitable for .I R input to .I build, otherwise the output is a sequence of `lines' of the form key, tab, value, newline. Thus the command .DS .ft 8 cbt cat -R file | cbt build -R newfile .ft P .DE will compress the old file into the new file. .SH cbt report file .LP scans the file and reports on the number of pairs, the total size, the amount of the tree in use, and the amount of free space. .NH File Structure .PP The .I .F file contains the values concatenated together in the order they were written. Each time a pair is written, the new value is put on the end of the .I .F file. Deleting a pair has no effect on the .I .F file. Therefore as a B-tree is updated, the .I .F file grows. .PP The .I .T file contains the B-tree proper, and pointers to the .I .F file if the B-tree is not an INDEX. The .I .T contains only tree nodes, each one NDSZ bytes long. (NDSZ is defined in the header file.) Each node has the form .DS header, keys, space, addresses, trailer. .DE The trailer is a $short$ containing the number of unused bytes in the node, which is the size of $space$. .PP The form of an $address$ depends on the type of the file and the height of the node in the tree. Leaves have height 0. Each other node has height one greater than its children. B-trees have the property that each node has a well-defined height. If the node is not a leaf, each $address$ is a $short$ which is the location of a child of the node. The node pointed to by $address$ is the node starting at $address * NDSZ$ in the file. Leaves in files of type INDEX don't have addresses. Otherwise an $address$ is a structure containing the length of the value and its starting position in the .I .F file. .PP Each $key$ structure contains its total length in a one-byte field, then the number of bytes of prefix in common with the preceding key in the node, and then the rest of the key. .PP The $header$ contains an identifier so that a program can tell if it rewrote the node, the number of keys in the node, the type of the node, and the height of the node. The first node in a file is always the root. The identifier is made from the process id and the time. If a file is not READONLY the routines will not overwrite a node unless the node was originally written by the process, or unless the file is being closed and the node is the root. In this way a file which is not READONLY supports one writer concurrently with programs reading. .NH An example .PP The example is a data base with two files. With it the user can translate from old department numbers to new department numbers, and get the title of a department from either department number. The example illustrates how to construct the data base and suggests one way of contstructing an index. There is also a complete program for retrieving from the data base whcih illustrates using all the subroutines which don't change B-trees. .PP A file named $deptnos$ contains old department numbers, new department numbers, and department titles. It is 1124 lines long and contains 56816 bytes. Here are the first 4 and last 4 lines: .DS 0001 00001 Chairman of the Board 0002 20000 Executive Vice President - Corporate Studies 0003 30000 President/Project Planning 0004 40000 Executive Vice President - Customer Systems 9543 59543 Demand Economics Department 9544 59544 Computer Systems Department 9545 59545 Advanced Software Department 9800 59800 Transfer of Non-Incremental Cost Between BIS and AT&T .DE For the example the file will go into 2 B-trees. One named $dept$ will have the new department numbers as keys and the department titles as values. The other, $olddept$, will be an index which will give new department numbers as a function of old. By combining the two, a user will be able to determine the department title from the old department number. In both, leading zeros will be stripped off department numbers. The commands .I "cbt creat dept" and .I "cbt creat -i olddept" create empty B-trees. The bold (but correct) command .EQ delim off .EN .DS .ft 8 awk < deptnos '{$1 = ""; print}' | sed 's/^ // s/^0*// s/ /\et/' | sort | cbt build dept .ft P .DE .EQ delim $$ .EN builds $dept$. The $awk$ command deletes the old department number and replaces multiple blanks by a single blank. The $sed$ command removes the leading blank (which is the separator $awk$ put after the null first field) and replaces the single blank separating the new department number from the department title with a tab for .I "cbt build" . $sort$ sees 48089 bytes. $dept.T$ contains 13284 bytes, and $dept.F$ contains 39955 bytes. Thus the additional space required for the B-tree is(39955+13284)/48089, which is 1.118, about 12%. The command .EQ delim off .EN .DS .ft 8 awk < deptnos '{print $1, $2}' | sed 's/^0*// s/ 0*/ /' | sort | cbt build olddept .ft P .DE .EQ delim $$ .EN builds $olddept$. The $awk $ command prints the old and new department numbers, separated by a space. Each of the strings is the key of a pair in olddept. The output of $sort$ is 12875 bytes long, and $olddept.T$ is 11776 bytes long. .PP The program in the appendix provides a simple facility for using $dept$ and $olddept$. If the user types a number, all the new department numbers which have that number as a prefix are printed, together with their department names. If the user types a number preceded with the letter O, then the program computes the set of old department numbers which begin with the typed number, and prints a set of lines consisting of the old and new department numbers and the department title. .NH Conclusion .PP The subroutines and commands described in this memo can be used to build and maintain data bases of moderate size. The subroutines give a clean interface to C programs, and the files themselves are efficient to use and reasonably economical of space. .SG .SH Bibliography .LP D. Comer, .I The Ubiquitous B-Tree, .R Computing Surveys .B (11) June 1979, pp 121-137.