[COFF] converting lousy scans of pdfs into something more, useable
Will Senn
will.senn at gmail.com
Sat Feb 4 02:21:31 AEST 2023
> From: Dennis Boone <drb at msu.edu>
>
> * Don't use JPEG 2000 and similar compression algorithms that try to
> re-use blocks of pixels from elsewhere in the document -- too many
> errors, and they're errors of the sort that can be critical. Even if
> the replacements use the correct code point, they're distracting as
> hell in a different font, size, etc.
I wondered about why certain images were the way they were, this
probably explains a lot.
> * OCR-under is good. I use `ocrmypdf`, which uses the Tesseract engine.
Thanks for the tips.
> * Bookmarks for pages / table of contents entries / etc are mandatory.
> Very few things make a scanned-doc PDF less useful than not being able
> to skip directly to a document indicated page.
I wish. This is a tough one. I generally sacrifice ditching the
bookmarks to make a better pdf. I need to look into extracting bookmarks
and if they can be re-added without getting all wonky.
> * I like to see at least 300 dpi.
Yes, me too, but I've found that this often results in too big (when
fixing existing), if I'm creating, they're fine.
> * Don't scan in color mode if the source material isn't color. Grey
> scale or even "line art" works fine in most cases. Using one pixel
> means you can use G4 compression for colorless pages.
Amen :).
>
> * Do reduce the color depth of pages that do contain color if you can.
> The resulting PDF can contain a mix of image types. I've worked with
> documents that did use color where four or eight colors were enough,
> and the whole document could be mapped to them. With care, you _can_
> force the scans down to two or three bits per pixel.
> * Do insert sensible metadata.
>
> * Do try to square up the inevitably crooked scans, clean up major
> floobydust and whatever crud around the edges isn't part of the paper,
> etc. Besides making the result more readable, it'll help the OCR. I
> never have any luck with automated page orientation tooling for some
> reason, so end up just doing this with Gimp.
Great points. Thanks.
-will
More information about the COFF
mailing list