[COFF] converting lousy scans of pdfs into something more, useable

Will Senn will.senn at gmail.com
Sat Feb 4 02:21:31 AEST 2023

> From: Dennis Boone <drb at msu.edu>
> * Don't use JBIG2 and similar compression algorithms that try to
>    re-use blocks of pixels from elsewhere in the document -- too many
>    errors, and they're errors of the sort that can be critical.  Even if
>    the replacements use the correct code point, they're distracting as
>    hell in a different font, size, etc.
I wondered why certain images were the way they were; this probably
explains a lot.

> * OCR-under is good.  I use `ocrmypdf`, which uses the Tesseract engine.
Thanks for the tips.
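
As a rough illustration of the `ocrmypdf` route, here is a minimal sketch that only builds the command line and does not run it. The file names and the helper name `ocrmypdf_command` are made up for the example; the flags used (`-l`, `--skip-text`, `--deskew`, `--title`) are standard `ocrmypdf` options, but check them against the installed version.

```python
import shlex


def ocrmypdf_command(src, dst, language="eng", deskew=False, title=None):
    """Build an argv list for ocrmypdf; nothing is executed here."""
    # --skip-text leaves pages that already carry a text layer alone.
    cmd = ["ocrmypdf", "-l", language, "--skip-text"]
    if deskew:
        cmd.append("--deskew")          # straighten crooked pages before OCR
    if title is not None:
        cmd += ["--title", title]       # written into the PDF metadata
    cmd += [src, dst]
    return cmd


# Hypothetical file names, purely for illustration.
cmd = ocrmypdf_command("scan.pdf", "scan-ocr.pdf", deskew=True, title="Example Manual")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would actually run it, assuming `ocrmypdf` and Tesseract are installed.
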
> * Bookmarks for pages / table of contents entries / etc are mandatory.
>    Very few things make a scanned-doc PDF less useful than not being able
>    to skip directly to a document indicated page.
I wish. This is a tough one. I generally sacrifice the bookmarks to make
a better PDF. I need to look into extracting bookmarks and whether they
can be re-added without getting all wonky.
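
One possible way to carry bookmarks across, sketched under the assumption that `pdftk` is the tool in use: its `dump_data` operation writes bookmarks as plain-text `BookmarkBegin` records, and `update_info` reads the same format back in. The helpers below (`parse_bookmarks`, `format_bookmarks`) and the sample text are hypothetical. They parse that text and regenerate it, optionally shifting page numbers in case the rebuilt PDF gained or lost leading pages.

```python
def parse_bookmarks(dump_text):
    """Pull (level, page, title) tuples out of `pdftk dump_data` style text."""
    records, current = [], None
    for line in dump_text.splitlines():
        if line == "BookmarkBegin":
            current = {}
            records.append(current)
        elif current is not None and line.startswith("Bookmark"):
            key, _, value = line.partition(": ")
            current[key] = value
    return [(int(r["BookmarkLevel"]), int(r["BookmarkPageNumber"]), r["BookmarkTitle"])
            for r in records]


def format_bookmarks(bookmarks, page_offset=0):
    """Write the tuples back out in the same record format."""
    lines = []
    for level, page, title in bookmarks:
        lines += ["BookmarkBegin",
                  f"BookmarkTitle: {title}",
                  f"BookmarkLevel: {level}",
                  f"BookmarkPageNumber: {page + page_offset}"]
    return "\n".join(lines) + "\n"


# Made-up sample of what a dump might contain.
sample = """NumberOfPages: 120
BookmarkBegin
BookmarkTitle: Preface
BookmarkLevel: 1
BookmarkPageNumber: 3
BookmarkBegin
BookmarkTitle: 1.1 Overview
BookmarkLevel: 2
BookmarkPageNumber: 7
"""

bookmarks = parse_bookmarks(sample)
print(format_bookmarks(bookmarks, page_offset=2))
```

The surrounding steps would be roughly `pdftk old.pdf dump_data output info.txt` beforehand and `pdftk new.pdf update_info info.txt output out.pdf` afterwards, with the file names here being placeholders.
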

> * I like to see at least 300 dpi.
Yes, me too, but I've found that this often results in files that are too
big when I'm fixing existing PDFs. When I'm creating them, they're fine.

> * Don't scan in color mode if the source material isn't color.  Grey
>    scale or even "line art" works fine in most cases.  Using one bit per
>    pixel means you can use G4 compression for colorless pages.

Amen :).
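
The arithmetic behind the dpi and color-mode points can be sketched as follows. This assumes a US Letter page (8.5 x 11 inches) and counts uncompressed pixel data only; the function name `raw_page_bytes` is invented for the example, and real files will be much smaller once G4 or another codec is applied.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate uncompressed size of one scanned page, in bytes."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# Assumed page size: US Letter. Bit depths: 1 (line art), 8 (grey), 24 (color).
for dpi in (150, 300, 600):
    for bpp, label in ((1, "line art"), (8, "grey"), (24, "color")):
        size = raw_page_bytes(8.5, 11, dpi, bpp)
        print(f"{dpi:>3} dpi  {label:<8}  {size / 1e6:6.1f} MB")
```

Doubling the resolution quadruples the pixel count, while going from 24-bit color to one bit per pixel cuts the raw data by a factor of 24, which is why the scan mode tends to matter more than the dpi setting.
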
> * Do reduce the color depth of pages that do contain color if you can.
>    The resulting PDF can contain a mix of image types.  I've worked with
>    documents that did use color where four or eight colors were enough,
>    and the whole document could be mapped to them.  With care, you _can_
>    force the scans down to two or three bits per pixel.
> * Do insert sensible metadata.
> * Do try to square up the inevitably crooked scans, clean up major
>    floobydust and whatever crud around the edges isn't part of the paper,
>    etc.  Besides making the result more readable, it'll help the OCR.  I
>    never have any luck with automated page orientation tooling for some
>    reason, so end up just doing this with Gimp.
Great points. Thanks.
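
To make the "two or three bits per pixel" idea concrete, here is a minimal sketch of mapping pixels onto a four-color palette and packing the result at two bits per pixel. The palette, the sample pixels, and the helper names are all hypothetical; real tools do this with smarter palette selection and dithering, so this only illustrates the mapping itself.

```python
def nearest_index(pixel, palette):
    """Index of the palette color closest to pixel (squared RGB distance)."""
    return min(range(len(palette)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(pixel, palette[i])))


def pack_2bpp(indices):
    """Pack palette indices (0..3) four to a byte, first pixel in the high bits."""
    out = bytearray()
    for start in range(0, len(indices), 4):
        chunk = indices[start:start + 4]
        chunk = chunk + [0] * (4 - len(chunk))  # pad the final byte with index 0
        byte = 0
        for idx in chunk:
            byte = (byte << 2) | idx
        out.append(byte)
    return bytes(out)


# Hypothetical palette: paper white, black ink, a red, a blue.
PALETTE = [(255, 255, 255), (0, 0, 0), (200, 0, 0), (0, 0, 200)]

# Made-up scanned pixels, slightly off from the ideal colors.
pixels = [(250, 248, 240), (12, 10, 15), (190, 30, 25), (20, 20, 180), (255, 255, 255)]
indices = [nearest_index(p, PALETTE) for p in pixels]
packed = pack_2bpp(indices)
print(indices, packed.hex())
```

Five 24-bit pixels (15 bytes) end up in 2 bytes here, before any further compression is applied.
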

