.EQ
delim $$
.EN
.TL
An Algorithm for Differential File Comparison
.AU
J. W. Hunt
.AI
Department of Electrical Engineering, Stanford University, Stanford, California
.AU
M. D. McIlroy
.AI
Bell Laboratories, Murray Hill, New Jersey 07974
.AB
The program
.I diff
reports differences between two files, expressed as a minimal list of line changes to bring either file into agreement with the other.
$Diff$ has been engineered to make efficient use of time and space on typical inputs that arise in vetting version-to-version changes in computer-maintained or computer-generated documents. Time and space usage are observed to vary about as the sum of the file lengths on real data, although they are known to vary as the product of the file lengths in the worst case.
.PP
The central algorithm of
.I diff
solves the `longest common subsequence problem' to find the lines that do not change between files. Practical efficiency is gained by attending only to certain critical `candidate' matches between the files, the breaking of which would shorten the longest subsequence common to some pair of initial segments of the two files. Various techniques of hashing, presorting into equivalence classes, merging by binary search, and dynamic storage allocation are used to obtain good performance.
.PP
[This document was scanned from Bell Laboratories Computing Science Technical Report #41, dated
July 1976. Text was converted by OCR and hand-corrected (last changed June, 2012). Figures were reconstructed.
Some OCR errors may remain, especially in tables and equations.
Please report them to
doug@cs.dartmouth.edu.]
.AE
.PP
The program
.I diff
creates a list of what lines of one file have to be changed to bring it into agreement with a second file or vice versa. It is based on ideas from several sources[1,2,7,8]. As an example of its work, consider the two files, listed horizontally for brevity:
.TS
center tab(|);
ccccccc.
a|b|c|d|e|f|g
w|a|b|x|y|z|e
.TE
It is easy to see that the first file can be made into the second by the following prescription, in which an imaginary line 0 is understood at the beginning of each:
.TS
center tab(|);
lccc.
append after line 0:|w,
change lines 3 through 4, which were:|c|d
into:|x|y|z,
delete lines 6 through 7, which were:|f|g.
.TE
Going the other way, the first file can be made from the second this way:
.TS
center tab(|);
lccc.
delete line 1, which was:|w,
change lines 4 through 6, which were:|x|y|z
into:|c|d,
append after line 7:|f|g.
.TE
Delete, change and append are the only operations available to $diff$. It indicates them by
1-letter abbreviations reminiscent of the qed text editor[3] in a form from which both directions of change can be read off. By exchanging `a' for `d' and the line numbers of the first file with
those of the second, we get a recipe for going the other way. In these recipes lines of the original file are flagged with `<', lines of the derived file are flagged with `>':
.TS
center tab(|);
ll.
0 a 1,1|1,1 d 0
> w|< w
3,4 c 4,6|4,6 c 3,4
< c|< x
< d|< y
---|< z
> x|---
> y|> c
> z|> d
6,7 d 7|7 a 6,7
< f|> f
< g|> g
.TE
In mathematical terms, the goal of
.I diff
is to report the minimum number of line changes necessary to convert one file into the other. Equivalently, the goal is to maximize the number of lines left unchanged, or to find the longest common subsequence of lines that occurs in both files.
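The recipes above can be generated mechanically once the matched lines are known. The following sketch, in Python for concreteness, is illustrative only (the function and the exact command format are modeled on the tables above, not on $diff$'s actual output routine):

```python
def edit_script(m, n, pairs):
    """Given matched line pairs (i, j) between a file of m lines and one of
    n lines, emit paper-style a/c/d commands for the unmatched gaps."""
    cmds = []
    # sentinels: an imaginary line 0 at the start, an imaginary match past the end
    for (i1, j1), (i2, j2) in zip([(0, 0)] + pairs, pairs + [(m + 1, n + 1)]):
        gap1 = i2 - i1 > 1          # unmatched lines in file 1
        gap2 = j2 - j1 > 1          # unmatched lines in file 2
        if gap1 and gap2:
            cmds.append(f"{i1 + 1},{i2 - 1} c {j1 + 1},{j2 - 1}")
        elif gap1:
            cmds.append(f"{i1 + 1},{i2 - 1} d {j1}")
        elif gap2:
            cmds.append(f"{i1} a {j1 + 1},{j2 - 1}")
    return cmds
```

For the example files the common lines are a, b, e, i.e. the pairs (1,2), (2,3), (5,7), and edit_script(7, 7, [(1, 2), (2, 3), (5, 7)]) reproduces the first prescription: `0 a 1,1', `3,4 c 4,6', `6,7 d 7'.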
.NH
Solving the longest common subsequence problem
.PP
No uniformly good way of solving the longest common subsequence problem is known.
The simplest idea\(emgo through both files line by line until they disagree, then search forward somehow in both until a matching pair of lines is encountered, and continue similarly\(emreduces the problem to implementing the `somehow', which doesn't help much. However, in practical terms, the first step of stripping matching lines from the beginning (and end) is helpful, for when changes are not too pervasive stripping can make inroads into the (nonlinear) running time of the hard part of the problem.
.PP
An extremely simple heuristic for the `somehow', which works well when there are relatively few differences between files and relatively few duplications of lines within one file, has been used by Johnson and others[1, 11]: Upon encountering a difference, compare the $k$th line ahead in each file with the $k$ lines following the mismatch in the other for $k$ = 1,2,... until a match is found. On more difficult problems, the method can missynchronize badly. To keep a lid on time and space, $k$ is customarily limited, with the result that longer changed passages defeat resynchronization.
.PP
There is a simple dynamic programming scheme for the longest common subsequence problem[4,5]. Call the lines of the first file $A sub i,~i~=~1,...,m$ and the lines of the second $B sub j,~j~=~1,...,n$. Let $P sub ij$ be the length of the longest subsequence common to the first $i$ lines of the first file and the first $j$ lines of the second. Evidently $P sub ij$ satisfies
.EQ
P sub i0 = 0~~~i=0,...,m,
.EN
.EQ
P sub 0j = 0~~~j=0,...,n,
.EN
.EQ
P sub ij = left "{" matrix {
lcol{1~+~P sub {i-1,j-1} above roman max (P sub {i-1,j} , P sub {i,j-1})}
lcol{ if~A sub i~=~B sub j above if~A sub i != B sub j}}~~~1 <=i<=m,~1<=j<=n.
.EN
Then $P sub mn$ is the length of the desired longest common subsequence.
From the whole $P sub ij$ array that was generated in calculating $P sub mn$ it is easy to recover the indices or the elements of a longest common subsequence.
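A sketch of this dynamic program in Python (illustrative only; $diff$ itself avoids building the full array):

```python
def lcs_table(A, B):
    """P[i][j] = length of the LCS of the first i lines of A and first j of B."""
    m, n = len(A), len(B)
    P = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if A[i - 1] == B[j - 1]:
                P[i][j] = 1 + P[i - 1][j - 1]
            else:
                P[i][j] = max(P[i - 1][j], P[i][j - 1])
    return P

def recover(A, B, P):
    """Walk back from P[m][n] to pick out one longest common subsequence."""
    i, j, out = len(A), len(B), []
    while i > 0 and j > 0:
        if A[i - 1] == B[j - 1]:
            out.append(A[i - 1]); i -= 1; j -= 1
        elif P[i - 1][j] >= P[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]
```

For the files of Figure 1, lcs_table("abcabba", "cbabac") gives $P sub mn = 4$, and the walk-back recovers one length-4 common subsequence.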
.PP
Unfortunately the dynamic program is $O(mn)$ in time, and\(emeven worse\(em$O(mn)$ in space. Noting that each row $P sub i$ of the difference equation is simply determined from
$P sub {i-1}$,
D. S. Hirschberg invented a clever scheme that first calculates $P sub m$ in $O(n)$ space and then recovers the sequence using no more space and about as much time again as is needed to find $P sub m$[6].
.PP
The
.I diff
algorithm improves on the simple dynamic program by attending only to essential matches, the breaking of which would change $P$. The essential matches, dubbed `$k$-candidates' by Hirschberg[7], occur where $A sub i~=~B sub j$ and $P sub ij~>~roman max (P sub {i-1,j} , P sub {i,j-1})$. A $k$-candidate is a pair of indices $(i,j)$ such that (1)
$A sub i~=~B sub j$, (2) a longest common subsequence of length $k$ exists between the first $i$ elements of the first file and the first $j$ elements of the second, and (3) no common subsequence of length $k$ exists when either $i$ or $j$ is reduced. A candidate is a pair of indices that is a $k$-candidate for some $k$. Evidently a longest common subsequence can be found among a complete list of candidates.
.PP
If $(i sub 1 , j sub 1 )$ and $(i sub 2 , j sub 2 )$ with $i sub 1 < i sub 2$ are both $k$-candidates, then $j sub 1 > j sub 2$. For if $j sub 1 = j sub 2$, $(i sub 2 , j sub 2 )$ would violate condition (3) of the definition; and if $j sub 1 < j sub 2$ then the common subsequence of length $k$ ending with $(i sub 1 , j sub 1 )$ could be extended to a common subsequence of length $k+ 1$ ending with $(i sub 2 , j sub 2 )$.
.PP
The candidate methods have a simple graphical interpretation. In Figure 1 dots mark grid points $(i,j)$ for which $A sub i~=~B sub j$. Because the dots portray an equivalence relation, any two horizontal lines or any two vertical lines on the figure have either no dots in common or carry exactly the same dots. A common subsequence is a set of dots that can be threaded by a strictly monotone increasing curve. Four such curves have been drawn in the figure. These particular curves have been chosen to thread only (and all) dots that are candidates. The values of $k$ for these candidates are indicated by transecting curves of constant $k$. These latter curves, shown dashed, must all decrease monotonically. The number of candidates is obviously less than $mn$, except in trivial cases, and in practical file comparison turns out to be very much less, so the list of candidates usually can be stored quite comfortably.
.KS
.PS 4i
line from 1,1 to 7,1
line from 1,2 to 7,2
line from 1,3 to 7,3
line from 1,4 to 7,4
line from 1,5 to 7,5
line from 1,6 to 7,6
line from 1,1 to 1,6
line from 2,1 to 2,6
line from 3,1 to 3,6
line from 4,1 to 4,6
line from 5,1 to 5,6
line from 6,1 to 6,6
line from 7,1 to 7,6
.ft H
.ps 24
line from 1,3 to 2,4 to 3,6
line from 2,4 to 4,5
line from 2,2 to 4,3 to 5,4 to 7,5
line from 3,1 to 5,2 to 7,3
line dashed from 1,3 to 3,1
line dashed from 2,4 to 4,3 to 5,2
line dashed from 3,6 to 5,4 to 7,3
line dashed from 6.2,5.8 to 7.2,4.8
"1" at 0.5,1
"2" at 0.5,2
"3" at 0.5,3
"4" at 0.5,4
"5" at 0.5,5
"6" at 0.5,6
"1" at 1,0.5
"2" at 2,0.5
"3" at 3,0.5
"4" at 4,0.5
"5" at 5,0.5
"6" at 6,0.5
"7" at 7,0.5
"a" at 1, 6.5
"b" at 2, 6.5
"c" at 3, 6.5
"a" at 4, 6.5
"b" at 5, 6.5
"b" at 6, 6.5
"a" at 7, 6.5
"c" at 7.5,1
"b" at 7.5,2
"a" at 7.5,3
"b" at 7.5,4
"a" at 7.5,5
"c" at 7.5,6
"k=1" at 1.5,2.5
"k=2" at 3,3.5
"k=3" at 4.5,4.5
"k=4" at 6.5,5.5
circle radius .1 fill 1 at 1,3
circle radius .1 fill 1 at 1,5
circle radius .1 fill 1 at 2,2
circle radius .1 fill 1 at 2,4
circle radius .1 fill 1 at 3,1
circle radius .1 fill 1 at 3,6
circle radius .1 fill 1 at 4,3
circle radius .1 fill 1 at 4,5
circle radius .1 fill 1 at 5,2
circle radius .1 fill 1 at 5,4
circle radius .1 fill 1 at 6,2
circle radius .1 fill 1 at 6,4
circle radius .1 fill 1 at 7,3
circle radius .1 fill 1 at 7,5
.ps
.PE
.ce 3
Figure 1. Common subsequences and candidates in comparing
.CW
abcabba
cbabac
.KE
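The candidate definition can be checked against Figure 1 by brute force over the full $P sub ij$ array (an illustrative sketch, not how $diff$ finds candidates):

```python
def candidates(A, B):
    """Enumerate pairs (i, j) with A[i-1] == B[j-1] and
    P[i][j] > max(P[i-1][j], P[i][j-1]), using the full DP table."""
    m, n = len(A), len(B)
    P = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            P[i][j] = (1 + P[i - 1][j - 1] if A[i - 1] == B[j - 1]
                       else max(P[i - 1][j], P[i][j - 1]))
    return [(i, j) for i in range(1, m + 1) for j in range(1, n + 1)
            if A[i - 1] == B[j - 1]
            and P[i][j] > max(P[i - 1][j], P[i][j - 1])]
```

For A = abcabba and B = cbabac this yields exactly the eleven dots threaded by the solid curves of Figure 1, among them the candidates (5,2) and (5,4) on the 5th vertical discussed in the next section.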
.NH
The method of diff
.PP
The dots of Figure 1 are stored in linear space as follows:
.IP (1)
Construct lists of the equivalence classes of elements in the second file. These lists occupy $O(n)$ space. They can be made by sorting the lines of the second file.
.IP (2)
Associate the appropriate equivalence class with each element of the first file. This association can be stored in $O(m)$ space. In effect now we have a list of the dots for each vertical.
.PP
Having this setup, we proceed to generate the candidates left-to-right. Let $K$ be a vector designating the rightmost $k$-candidate yet seen for each $k$. To simplify what follows, pad the vector out to include a dummy 0-candidate $(0,0)$ and, for all $k$ that do not yet have a candidate, a dummy `fence' candidate $(m+ 1,n+ 1)$, whose components will compare high against the components of any other candidate. $K$ begins empty, except for padding, and gets updated as we move right. Thus after processing the 4th vertical, marked `\f(CWa\fP' in Figure 1, the list of rightmost candidates is
.DS
(0,0) (3,1) (4,3) (4,5) (8,7) (8,7) ...
.DE
Now a new $k$-candidate on the next vertical is the lowest dot that falls properly between the ordinates of the previous $(k -1)$- and $k$-candidates. Two such dots are on the 5th vertical in Figure 1. They displace the 2-candidate and 3-candidate entries to give the new vector $K$:
.DS
(0,0) (3,1) (5,2) (5,4) (8,7) (8,7) ...
.DE
The two dots on the 6th vertical fall on, rather than between, ordinates in this list and so are not candidates. Each new $k$-candidate is chained to the previous $(k -1 )$-candidate to facilitate later recovery of the longest common subsequence. For more detail see the Appendix.
.PP
The determination of candidates on a given vertical is thus a specialized merge of the list of dots on that vertical into the current list of rightmost candidates. When the number of dots is $O(1 )$, binary search in the list of at most $roman min (m,n)$ candidates will do the merge in time $O( log m )$. Since this case of very few dots per vertical is typical in practice, we are led to merge each dot separately by binary search, even though the worst case time to process a vertical becomes $O(n log m )$, as against $O(m+n)$ for ordinary merging.
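This sweep can be sketched as follows, in Python for concreteness. For simplicity the sketch handles each vertical's dots in descending order, the reverse-order variant attributed to Szymanski[10] in the Appendix, rather than the delayed-update merge that $diff$ itself performs; it also rebuilds the ordinate list before each binary search, where the real program searches $K$ in place:

```python
import bisect
from collections import defaultdict

def lcs_pairs(A, B):
    # equivalence classes: for each line of B, its positions in ascending order
    classes = defaultdict(list)
    for j, line in enumerate(B, 1):
        classes[line].append(j)
    # K[k] is the rightmost k-candidate seen so far, a chained (i, j, prev) triple
    K = [(0, 0, None)]                      # dummy 0-candidate
    for i, line in enumerate(A, 1):
        for j in reversed(classes[line]):   # descending j, so this vertical's
                                            # updates cannot chain to one another
            ords = [c[1] for c in K]        # ordinates, ascending in k
            k = bisect.bisect_left(ords, j)
            if k == len(K):                 # beyond the last filled element
                K.append((i, j, K[k - 1]))
            elif K[k][1] > j:               # strictly between two ordinates:
                K[k] = (i, j, K[k - 1])     # new rightmost k-candidate
            # j equal to an existing ordinate: not a candidate
    pairs, c = [], K[-1]
    while c[1] != 0:                        # unchain back to the dummy
        pairs.append((c[0], c[1]))
        c = c[2]
    return pairs[::-1]
```

On the files of Figure 1 the vector $K$ passes through exactly the states displayed above, and the final chain yields the matched pairs (3,1), (4,3), (5,4), (7,5).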
.NH
Hashing
.PP
To make comparison of reasonably large files (thousands of lines) possible in random access memory,
.I diff
hashes each line into one computer word. This may cause some unequal lines to compare equal. Assuming the hash function is truly random, the probability of a spurious equality on a given comparison that should have turned out unequal is $1/ M$, where the hash values range from 1 to $M$. A longest common subsequence of length $k$ determined from hash values can thus be expected to contain about $k/ M$ spurious matches when $k << M$, so a sequence of length $k$ will be a spurious `jackpot' sequence with probability about $k/ M$. On our 16-bit machine jackpots on 5000-line files should happen less than 10% of the time and on 500-line files less than 1% of the time.
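The arithmetic behind the quoted odds is immediate, taking $M = 2 sup 16$ for 16-bit hash values:

```python
# jackpot probability is about k/M for k-line files
M = 2 ** 16
for k in (5000, 500):
    print(f"{k}-line files: jackpot probability about {k / M:.2%}")
```

This gives about 7.6% and 0.8% respectively, within the stated 10% and 1% bounds.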
.PP
.I Diff
guards against jackpots by checking the purported longest common subsequence in the original files. What remains after spurious equalities are edited out is accepted as an answer even though there is a small possibility that it is not actually a longest common subsequence. $Diff$ announces jackpots, so these cases tend to get scrutinized fairly hard. In two years we have had brought to our attention only one jackpot where an edited longest subsequence was actually short\(emin that instance short by one.
.SH
Complexity
.PP
In the worst case, the
.I diff
algorithm doesn't perform substantially better than the trivial dynamic program. From Section 2 it follows that the worst case time complexity is dominated by the merging and is in fact $O(mn log m)$ (although $O(m (m+ n) )$ could be achieved). Worst case space complexity is dominated by the space required for the candidate list, which is $O(mn)$ as can be seen by counting the candidates that arise in comparing the two files
.DS
.CW
a b c a b c a b c ...
a c b a c b a c b ...
.DE
This problem is illustrated in Figure 2. When $m =n$ the kite-shaped area in which the candidates lie is 1/2 the total area of the diagram, and (asymptotically) 1/3 of the grid points in the kite are candidates, so the number of candidates approaches $n sup 2 /6$ asymptotically.*
.FS
* Direct counting shows that there are $"\(lf" (4mn-m sup 2 -n sup 2 +2m+2n+6)/12 "\(rf"$ candidates when $m-1$ and $n-1$ differ by at most a factor of 2. The floor is exact whenever $n -1$ and $m -1$ are multiples of 6.
.FE
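The footnote's count can be cross-checked by brute force on small instances of these periodic files (illustrative only; the candidate test is the one of Section 1):

```python
def count_candidates(A, B):
    """Count pairs (i, j) with A[i-1] == B[j-1] and
    P[i][j] > max(P[i-1][j], P[i][j-1])."""
    m, n = len(A), len(B)
    P = [[0] * (n + 1) for _ in range(m + 1)]
    count = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if A[i - 1] == B[j - 1]:
                P[i][j] = 1 + P[i - 1][j - 1]
                if P[i][j] > max(P[i - 1][j], P[i][j - 1]):
                    count += 1
            else:
                P[i][j] = max(P[i - 1][j], P[i][j - 1])
    return count

def formula(m, n):
    """The footnote's floor expression."""
    return (4 * m * n - m * m - n * n + 2 * m + 2 * n + 6) // 12

def periodic(pattern, length):
    return (pattern * length)[:length]

# m = n = 7: m - 1 and n - 1 are multiples of 6, so the floor is exact
A, B = periodic("abc", 7), periodic("acb", 7)
```

For these files both counts come to 11, and formula(n, n)/$n sup 2$ tends to 1/6 as the text asserts.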
.PP
In practice,
.I diff
works much better than the worst case bounds would indicate. Only rarely are more than $roman min (m,n)$ candidates found. In fact an early version with a naive storage allocation algorithm that provided space for just $n$ candidates first overflowed only after two months of use, during which time it was probably run more than a hundred times. Thus we have good evidence that in a very large percentage of practical cases
.I diff
requires only linear space.
.ne 2i
.KS
.PS 4i
line from 0,0 to 12,0
line from 0,1 to 12,1
line from 0,2 to 12,2
line from 0,3 to 12,3
line from 0,4 to 12,4
line from 0,5 to 12,5
line from 0,6 to 12,6
line from 0,7 to 12,7
line from 0,8 to 12,8
line from 0,9 to 12,9
line from 0,10 to 12,10
line from 0,11 to 12,11
line from 0,12 to 12,12
line from 0,0 to 0,12
line from 1,0 to 1,12
line from 2,0 to 2,12
line from 3,0 to 3,12
line from 4,0 to 4,12
line from 5,0 to 5,12
line from 6,0 to 6,12
line from 7,0 to 7,12
line from 8,0 to 8,12
line from 9,0 to 9,12
line from 10,0 to 10,12
line from 11,0 to 11,12
line from 12,0 to 12,12
.ps 24
line from 0,0 to 6,12
line from 0,0 to 12,6
line from 2,1 to 7,11
line from 4,2 to 9,12
line from 6,3 to 10,11
line from 8,4 to 12,12
line from 10,5 to 12,9
circle radius .1 fill 1 at 0,0
circle radius .1 fill 1 at 0,3
circle radius .1 fill 1 at 0,6
circle radius .1 fill 1 at 0,9
circle radius .1 fill 1 at 0,12
circle radius .1 fill 1 at 3,0
circle radius .1 fill 1 at 3,3
circle radius .1 fill 1 at 3,6
circle radius .1 fill 1 at 3,9
circle radius .1 fill 1 at 3,12
circle radius .1 fill 1 at 6,0
circle radius .1 fill 1 at 6,3
circle radius .1 fill 1 at 6,6
circle radius .1 fill 1 at 6,9
circle radius .1 fill 1 at 6,12
circle radius .1 fill 1 at 9,0
circle radius .1 fill 1 at 9,3
circle radius .1 fill 1 at 9,6
circle radius .1 fill 1 at 9,9
circle radius .1 fill 1 at 9,12
circle radius .1 fill 1 at 12,0
circle radius .1 fill 1 at 12,3
circle radius .1 fill 1 at 12,6
circle radius .1 fill 1 at 12,9
circle radius .1 fill 1 at 12,12
circle radius .1 fill 1 at 1,2
circle radius .1 fill 1 at 1,5
circle radius .1 fill 1 at 1,8
circle radius .1 fill 1 at 1,11
circle radius .1 fill 1 at 4,2
circle radius .1 fill 1 at 4,5
circle radius .1 fill 1 at 4,8
circle radius .1 fill 1 at 4,11
circle radius .1 fill 1 at 7,2
circle radius .1 fill 1 at 7,5
circle radius .1 fill 1 at 7,8
circle radius .1 fill 1 at 7,11
circle radius .1 fill 1 at 10,2
circle radius .1 fill 1 at 10,5
circle radius .1 fill 1 at 10,8
circle radius .1 fill 1 at 10,11
circle radius .1 fill 1 at 2,1
circle radius .1 fill 1 at 2,4
circle radius .1 fill 1 at 2,7
circle radius .1 fill 1 at 2,10
circle radius .1 fill 1 at 5,1
circle radius .1 fill 1 at 5,4
circle radius .1 fill 1 at 5,7
circle radius .1 fill 1 at 5,10
circle radius .1 fill 1 at 8,1
circle radius .1 fill 1 at 8,4
circle radius .1 fill 1 at 8,7
circle radius .1 fill 1 at 8,10
circle radius .1 fill 1 at 11,1
circle radius .1 fill 1 at 11,4
circle radius .1 fill 1 at 11,7
circle radius .1 fill 1 at 11,10
.ps
.PE
.ce 3
Figure 2. Common subsequences and candidates in comparing
.CW
a b c a b c a b c ...
a c b a c b a c b ...
.KE
.PP
As for practical time complexity, the central algorithm of
.I diff
is so fast that even in the biggest cases our implementation can handle (about 3500 lines) almost half the run time is still absorbed by simple character handling for hashing, jackpot checking, etc., that is linear in the total number of characters in the two files. Typical times for comparing 3500-line files range from 1/4 to 3/4 cpu minutes on a PDP11/45. By contrast, a speeded-up variant of Hirschberg's
dynamic programming algorithm[6] took about 5 cpu minutes on 3500-line files. The heuristic algorithm sketched at the beginning of Section 1 typically runs about 2 or 3 times as fast as
.I diff
on long but trivially different files, but loses much of that advantage on more difficult cases
that are within the competence of both methods. Since the failure modes of the two programs
are quite different, it is useful to have both on hand.
.bp
.SH
References
.LP
[1] S. C. Johnson, `ALTER \- A Comdeck Comparing Program,' Bell Laboratories internal memorandum 1971.
.LP
[2] Generalizing from a special case solved by T. G. Szymanski[8], H. S. Stone proposed and J.
W. Hunt refined and implemented the first version of the candidate-listing algorithm used by $diff$
and embedded it in an older framework due to M. D. McIlroy. A variant of this algorithm was also elaborated by Szymanski[10]. We have had many useful discussions with A. V. Aho and J. D. Ullman. M. E. Lesk moved the program from UNIX to OS/360.
.LP
[3] `Tutorial Introduction to QED Text Editor,' Murray Hill Computing Center MHCC-002.
.LP
[4] S. B. Needleman and C. D. Wunsch, `A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins,'
.I "J Mol Biol"
.B 48
(1970) 443-53.
.LP
[5] D. Sankoff, `Matching Sequences Under Deletion/Insertion Constraints',
.I "Proc Nat Acad Sci USA"
.B 69
(1972) 4-6.
.LP
[6] D. S. Hirschberg, `A Linear Space Algorithm for Computing Maximal Common Subsequences,'
.I CACM
.B 18
(1975) 341-3.
.LP
[7] D. S. Hirschberg, `The Longest Common Subsequence Problem,' Doctoral Thesis, Princeton 1975.
.LP
[8] T. G. Szymanski, `A Special Case of the Maximal Common Subsequence Problem,' Computer Science Lab TR-170, Princeton University 1975.
.LP
[9] Michael L. Fredman, `On Computing the Length of Longest Increasing Subsequences,'
.I "Discrete Math"
.B 11
(1975) 29-35.
.LP
[10] T. G. Szymanski, `A Note on the Maximal Common Subsequence Problem,' submitted for publication. [The paper finally appeared as J. W. Hunt and T. G. Szymanski,
`A fast algorithm for computing longest common subsequences',
.I CACM
.B 20
(1977) 350-353.]
.LP
[11] The programs called $proof$, written by E. N. Pinson and M. E. Lesk for UNIX and GECOS, use the heuristic algorithm for differential file comparison.
.bp
.SH
.ce
Appendix
.SH
A.1 Summary of the diff algorithm
.LP
Algorithm to find the longest subsequence of lines common to file 1, whose length is $m$ lines, and file 2, $n$ lines.
.LP
Steps 1 through 4 determine equivalence classes in file 2 and associate them with lines in file 1 in preparation for the central algorithm. (The
.I diff
program that is in actual use does the work of these steps somewhat differently.)
.IP 1.
Let $V$ be a vector of elements structured $(serial, hash)$, where $serial$ is a line number and $hash$ is an integer. Set
.EQ
V[j]~<-~(j,H(j))~~~j=1,...,n,
.EN
where $H(j)$ is the hash value of line $j$ in file 2.
.IP 2.
Sort $V$ into ascending order on $hash$ as primary key and $serial$ as secondary key.
.IP 3.
Let $E$ be a vector of elements structured $(serial,last )$. Then set
.EQ
E[j]~<-~( V[j].serial,f(j) )~~~j=1,...,n,
.EN
.EQ
E[0] <- (0, bold true ),
.EN
where
.EQ
f(j) = left "{" matrix {
lcol{bold true above bold false}
lcol{ roman if~j=n~roman or~V[j].hash != V[j+1].hash above roman otherwise}}
.EN
$E$ lists all the equivalence classes of lines in file 2, with $last = bold true$ on the last element of each class. The elements are ordered by $serial$ within classes.
.IP 4.
Let $P$ be a vector of integers. For $i = 1 ,...,m$ set
.EQ
P[i] <- left "{" matrix { lcol{j~roman "such that"~E[j-1].last= bold true~roman and~H(i)=V[j].hash above 0~roman "if no such"~j~ roman exists}}
.EN
where $H(i)$ is the hash value of line $i$ of file 1. The $j$ values can be found by binary search in $V$.
.br
$P[i]$, if nonzero, now points in $E$ to the beginning of the class of lines in file 2 equivalent to line $i$ in file 1.
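Steps 1 through 4 can be sketched as follows in Python (crc32 stands in for the unspecified hash function $H$, and, as noted above, the production program does this work somewhat differently):

```python
import bisect
import zlib

def build_E_P(file1, file2, H=lambda s: zlib.crc32(s.encode())):
    """Steps 1-4: build E (classes of file 2) and P (class pointers for
    file 1). Serials are 1-based, as in the summary."""
    n = len(file2)
    # steps 1-2: V[j] = (serial, hash), sorted on hash, then serial
    V = sorted(((j, H(file2[j - 1])) for j in range(1, n + 1)),
               key=lambda v: (v[1], v[0]))
    # step 3: E[j] = (serial, last); last marks the final member of each
    # class, and E[0] is the (0, true) sentinel
    E = [(0, True)] + [(V[t][0], t == n - 1 or V[t][1] != V[t + 1][1])
                       for t in range(n)]
    # step 4: binary search in V; P[i] indexes the start of line i's class in E
    hashes = [v[1] for v in V]
    P = [0] * (len(file1) + 1)
    for i, line in enumerate(file1, 1):
        t = bisect.bisect_left(hashes, H(line))
        if t < n and hashes[t] == H(line):
            P[i] = t + 1        # E[t].last is true, so the class starts at E[t+1]
    return E, P

def class_serials(E, p):
    """Walk one class in E from index p until last is true (step 3's layout)."""
    out = []
    while True:
        serial, last = E[p]
        out.append(serial)
        if last:
            return out
        p += 1
```

For the files of Figure 1 this associates line 1 of file 1 (an `a') with lines 3 and 5 of file 2, and so on, with elements ordered by serial within each class.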
.LP
Steps 5 and 6 are the longest common subsequence algorithm proper.
.IP 5.
Let $candidate(a,b,previous )$ be a reference-valued constructor, where $a$ and $b$ are line numbers in file 1 and file 2 respectively and $previous$ is $nil$ or a reference to a candidate.
.br
Let $K [0: roman min (m,n ) + 1]$ be a vector of references to candidates. Let $k$ be the index of the last usefully filled element of $K$. Initialize
.EQ
K[0] <- candidate(0,0,nil),
.EN
.EQ
K[1] <- candidate(m+ 1,n+ 1,nil ),
.EN
.EQ
k <- 0.
.EN
$K [1]$ is a fence beyond the last usefully filled element.
.IP 6.
For $i = 1,...,m$, if $P[i] != 0$ do $merge(K,k,i,E,P[i] )$ to update $K$ and $k$ (see below).
.LP
Steps 7 and 8 get a more convenient representation for the longest common subsequence.
.IP 7.
Let $J$ be a vector of integers. Initialize
.EQ
J[i] <- 0~~~i =0,...,m.
.EN
.IP 8.
For each element $c$ of the chain of candidates referred to by $K[k]$ and linked by previous references set
.EQ
J[c.a] <- c.b.
.EN
The nonzero elements of $J$ now pick out a longest common subsequence, possibly including spurious `jackpot' coincidences. The pairings between the two files are given by
.EQ
"{" (i,J[i] )~|~J[i] != 0 "}".
.EN
The next step weeds out jackpots.
.IP 9.
For $i = 1,...,m$, if $J[i] != 0$ and line $i$ in file 1 is not equal to line $J[i]$ in file 2, set
.EQ
J[i] <- 0.
.EN
This step requires one synchronized pass through both files.
.SH
A.2 Storage management
.LP
To maximize capacity, storage is managed in $diff$ according to the following notes, which are keyed to the steps in the preceding summary. After each step appears the number of words then in use, except for a small additive constant, assuming that an integer or a pointer occupies one word.
.IP 1.
Storage for $V$ can be grown as file 2 is read and hashed. The value of $n$ need not be known in advance. [$2n$ words]
.IP 3.
Though $E$ contains information already in $V$, it is more compact because the $last$ field takes only one bit and can be packed into the same word as the $serial$ field. $E$ can be overlaid on $V.serial$. [$2n$ words]
.IP 4.
$P$ can be grown as was $V$ in step 1. [$2n+m$ words]
.br
$V$ is dead after this step. Storage can be compacted to contain only the live information, $E$ and $P$. [$n + m$ words]
.IP 5.
Candidates can be allocated as needed from the free storage obtained in the previous compaction, and from space grown beyond that if necessary. Because they are chained, candidates do not have to be contiguous.
.IP 6.
During the $i$th invocation of merge, the first $i$ elements of $P$ are dead, and at most the first $i+ 2$ elements of $K$ are in use, so with suitable equivalencing $K$ can be overlaid on $P$.
[$n+ m+ 3 \(mu$(number of candidates)]
.IP 7.
$P$ and $K$ are dead, so $J$ can be overlaid on them. $E$ is dead also. [$m+ 3 \(mu$(number of candidates) ]
.SH
A.3 Summary of merge step
.LP
.B procedure
$merge(K,k,i,E,p)$
.br
$K$ is as defined in step 5 above, by reference
.br
$k$ is index of last filled element of $K$, by reference
.br
$i$ is current index in file 1, by value
.br
$E$ is as defined in Step 3 above, by reference
.br
$p$ is index in $E$ of first element of class of lines in file 2 equivalent to line
$i$ of file 1, by value
.IP 1.
Let $r$ be an integer and $c$ be a reference to a candidate. $c$ will always refer to the last candidate found, which will always be an $r$-candidate. $K [r ]$ will be updated with this reference once the previous value of $K[r]$ is no longer needed. Initialize
.EQ
r <- 0.
.EN
.EQ
c <- K[0].
.EN
(By handling the equivalence class in reverse order, Szymanski[10] circumvents the need to delay updating $K [r ]$, but generates extra `candidates' that waste space.)
.IP 2.
Do steps 3 through 6 repeatedly.
.RS
.IP 3.
Let $j = E[p ].serial$.
.br
Search $K[r:k]$ for an element $K[s]$ such that $K[s] -> b < j <= K[s+1] -> b$. (Note that $K$ is ordered on $K [. ] -> b$, so binary search will work.)
.br
If such an element is found do steps 4 and 5.
.RS
.IP 4.
If $K [s + 1 ] -> b >j$, simultaneously set
.br
.ll -1i
.EQ
K[r] <- c.
.EN
.EQ
r <- s+ 1.
.EN
.EQ
c <- candidate (i,j,K[s] ).
.EN
.ll +1i
.IP 5.
If $s = k$ do:
Simultaneously set
.br
.ll -1i
.EQ
K[k+2] <- K[k+1]~~~roman "(move fence)" ,
.EN
.EQ
k <- k+1.
.EN
.ll +1i
Break out of step 2's loop.
.RE
.IP 6.
If $E[p].last= bold true$, break out of step 2's loop.
.br
Otherwise set $p <- p+ 1$.
.RE
.IP 7.
Set $K[r] <- c$.
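A direct transcription of this procedure, with a driver for steps 5, 6 and 8 of A.1 (a sketch in Python: candidates are $(a,b,prev)$ tuples, the class is passed as an ascending list of serials instead of $E$ and $p$, and $k$ is returned rather than passed by reference):

```python
def merge(K, k, i, jclass):
    """Steps 1-7 above. jclass lists, ascending, the serials of file-2 lines
    equal to line i of file 1. K's element k+1 is the fence. Returns new k."""
    r, c = 0, K[0]                         # step 1
    for j in jclass:                       # steps 2, 3, 6
        # step 3: binary search K[r:k] for s with K[s].b < j <= K[s+1].b
        lo, hi = r, k + 1
        while lo < hi:                     # find first t with K[t].b >= j
            mid = (lo + hi) // 2
            if K[mid][1] < j:
                lo = mid + 1
            else:
                hi = mid
        s = lo - 1
        if s < r:                          # no such element: skip steps 4 and 5
            continue
        if K[s + 1][1] > j:                # step 4: "simultaneous" update,
            prev = K[s]                    # reading K[s] before K[r] changes
            K[r] = c
            r, c = s + 1, (i, j, prev)
        if s == k:                         # step 5: move the fence and stop
            K.append(K[k + 1])
            k += 1
            break
    K[r] = c                               # step 7
    return k

def hunt_mcilroy(A, B):
    m, n = len(A), len(B)
    classes = {}
    for j, line in enumerate(B, 1):
        classes.setdefault(line, []).append(j)
    K = [(0, 0, None), (m + 1, n + 1, None)]   # step 5 of A.1: K[0] and fence
    k = 0
    for i, line in enumerate(A, 1):            # step 6 of A.1
        if line in classes:
            k = merge(K, k, i, classes[line])
    pairs, c = [], K[k]
    while c[1] != 0:                           # step 8: unchain the candidates
        pairs.append((c[0], c[1]))
        c = c[2]
    return pairs[::-1]
```

On the files of Figure 1 the intermediate states of $K$ reproduce the vectors displayed in Section 2, and the final chain yields the pairs (3,1), (4,3), (5,4), (7,5).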