V10/man/man1/ocr.1
.TH OCR 1 cetus,hydra,coma
.CT 1 graphics
.SH NAME
ocr \- optical character recognition
.SH SYNOPSIS
.B ocr
[
.I option ...
]
[
.I file
]
.SH DESCRIPTION
.I Ocr
reads a black-and-white image of a page from
.IR file ,
and writes ASCII to the standard output.
If no
.I file
is specified, it reads from the standard input.
.PP
The input is a
.IR picfile (5)
image of one column of machine-printed text, normally
scanned in by
.IR cscan (1).
Fonts, sizes, and line-spacings may vary within the column,
but each line should have a constant text size and baseline.
Lines should be parallel and roughly horizontal.
.PP
In the output, white space approximates the original page layout.
Words that
.IR spell (1)
are preferred, and hyphenations across lines are recombined.
.PP
The options are:
.nr xx \w'\fL-pn,m\ \ '
.TP \n(xxu
.BI -a s
The alphabet is the union of symbol sets selected by characters in string
.IR s ,
from among:
.RS
.PD
.nr yy \w'\fLA\ \ '
.TP \n(yyu
.B A
ABCDEFGHIJKLMNOPQRSTUVWXYZ
.PD0
.TP
.B a
abcdefghijklmnopqrstuvwxyz
.PD0
.TP
.B 0
0123456789
.PD0
.TP
.B .
.ie t \&.\^,\|-\^:\^;\|*\^'\|\^"\|?\^!\|/\|&\|$\^(\^)\^[\|\^]\|#\|@\|% \0\0\0\0\0\0\0\0\0\0\0 \kz(basic punctuation)
.el \&.\^,\|-\^:\^;\|*\^'\|\^"\|?\^!\|/\|&\|$\^(\^)\^[\|\^]\|#\|@\|% \0\0\0\0\0\0\0 \kz(basic punctuation)
.ig
should include ` /(em + ???
shouldn't include []#@% ???
..
.PD0
.TP
.B ^
^\|\f(CW~\fR\^`\|\^\\\||\|\^{\|}\|_ \h'|\nzu'(extended punct'n)
.ig
should include []#@% ???
shouldn't include ` ???
..
.PD0
.TP
.B +
+\^\-\^*\|/\|<\^>\^=\^.\^E\|e\|[\|] \h'|\nzu'(numerical punct'n)
.PD0
.TP
.B s
.ie t \(sc\^\(dg\^\(dd\^\(ct\|\(bu\|\(co\|\(rg\|\(de\^\(fm\^\(en\|\^\(mi\|\(em \h'|\nzu'(selected non-ASCII)
.el \\(sc\\(dg\\(dd\\(ct\\(bu\\(co ... \h'|\nzu'(selected non-ASCII)
.PD0
.TP
.B l
.ie t \(fi\|\(fl\|f\h'-.1m'f\|f\h'-.1m'\(fi\|f\h'-.1m'\(fl\|\N'114'\|\N'115'\|\N'105'\|\N'106' \h'|\nzu'(ligatures and digraphs)
.el fi fl ff ffi ffl ae oe ... \h'|\nzu'(ligatures, digraphs)
.PD0
.TP
.B g
.ie t \(*a\(*b\(*g\(*d\(*e\(*z\(*y\(*h\(*i\(*k\(*l\(*m\(*n\(*c\(*o\(*p\(*r\(*s\(*t\(*u\(*f\(*x\(*q\(*w \h'|\nzu'(Greek lower case)
.el \\(*a\\(*b\\(*g\\(*d\\(*e\\(*z ... \h'|\nzu'(Greek lower case)
.PD0
.TP
.B G
.ie t AB\(*G\(*DEZH\(*HIK\(*LMN\(*CO\(*PP\(*STY\(*FX\(*Q\(*W \h'|\nzu'(Greek upper case)
.el AB\\(*G\\(*DEZ ... \h'|\nzu'(Greek upper case)
.PD
.PP
The default is
.BR -aAa0.+^ ,
the full printable-ASCII set, which may be abbreviated as
.BR -ap .
Thus,
.B -apslgG
selects all of the above.
.RE
.PD
.TP \n(xxu
.B -c
Find columns in complex nested layouts using greedy white covers algorithm.
.TP
.BI -m l[,r]
Trim the left and right margins of the image by
.I l
and
.I r
inches, respectively, before looking for columns.
If
.I r
is omitted, it is assumed to equal
.IR l.
.TP
.BI -n n
Find the
.I n
largest columns by analysis of a single vertical projection.
Each column should be compactly-printed
and separated from the others by at least 2 ems of horizontal white space.
.TP
.BI -p n,m
Point sizes lie in the range [
.I n, m
]; other sizes are discarded.
The default is
.BR -p6,24 .
.TP
.B -s
Defeat spelling check (but continue to favor numeric strings and good punctuation).
.TP
.B -t
Write
.IR troff (1)
format.
Each column is shown on a separate page, lines at their original height,
words at their original horizontal location, and
characters roughly original size in Times roman.
Hyphenated words are not recombined.
.TP
.B -u
Unspellable words are prefixed with `?' or, if
.B -t
is specified, printed boldface.
.TP
.BI -w w
Find the largest column of width
.I w
inches, within a single vertical projection.
.SS Fonts
Trained on over 100 Latin-alphabet book fonts in various italic, bold, etc styles.
Only one font of Greek, without diacriticals.
Also Swedish and Tibetan, on request.
.SH SEE ALSO
.IR bcp (1),
.IR cscan (1),
.IR font (6),
.IR picfile (5),
.IR spell (1),
.IR troff (1)
.SH BUGS
For best results, use images of high-contrast, cleanly-printed original
documents digitized at a resolution of 400 pixels/inch or higher.
It may help to restrict the alphabet and sizes to what's there.
.ig
8.7 CPU minutes on pipe to read this page, September 1989.
..