HP3000-L Archives

June 2000, Week 3

HP3000-L@RAVEN.UTC.EDU

Subject:
From:
Wirt Atmar <[log in to unmask]>
Reply To:
Date:
Tue, 20 Jun 2000 12:47:19 EDT
Fred,

> You are right, the print quality is much better. I am just reluctant to
>  print the 778 page COBOL manual and not have the keyword search.
>  What is strange to me is that you can actually cut and paste the text from
>  the manual, it just looks "fuzzy" in the PDF file.
>  Could that just be a feature of Acrobat Reader 4.0 to select text from an
>  image?

We just bought Adobe Acrobat Capture ourselves and I've only had a few
minutes to play with it, so I'm no expert on its use, but among the several
settings in Capture are the scan resolution (dpi) and whether you would like
to keep a TIFF file along with the converted text. I strongly suspect that
someone answered yes to the "keep TIFF file?" question, but then didn't scan
the document in at a sufficient dpi to make it particularly legible on the
screen.

To do it right (with "right" taken with more than a grain of salt at this
juncture), the pages should have been scanned in at a relatively high dpi,
and converted to true PDF text, including being tagged with its appropriate
font characteristics, with only the figures being converted to TIFF files,
not the entire page.

This takes a great deal more time. After each page is scanned, you have to go
back and manually convert each of the words that the OCR routine didn't
recognize into its proper text word (and possibly correct its font
representation). Instead, I believe the person who scanned this material
let it be done completely automatically, just sheet-feeding
the manual into the scanner and letting Capture recognize any words that it
could. This is certainly the fastest way to convert a document to PDF,
although it has its obvious drawbacks. It also produces documents of maximum
byte count size.
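
To put rough numbers on the dpi and file-size tradeoff, here is a minimal
back-of-the-envelope sketch in Python. The 8.5 x 11 inch page, the 1-bit
(black and white) depth, and the function name are all assumptions chosen
for illustration; they are not settings taken from Capture. Real page images
are normally stored compressed, so the actual files would be smaller than
these uncompressed figures.

```python
def raw_page_bytes(dpi, width_in=8.5, height_in=11.0, bits_per_pixel=1):
    """Uncompressed size, in bytes, of one full-page scanned image.

    Page dimensions and bit depth are illustrative assumptions.
    Row padding and file headers are ignored.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


if __name__ == "__main__":
    pages = 778  # page count of the COBOL manual mentioned in the thread
    for dpi in (100, 200, 300):
        per_page = raw_page_bytes(dpi)
        print(f"{dpi} dpi: {per_page} bytes/page, {per_page * pages} bytes total")
```

The point is only that the pixel count grows with the square of the dpi, so
tripling the scan resolution makes every retained page image about nine
times larger before compression.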

If everything that I've said is true, only about 85 to 90% of the text has
actually been recognized and is sitting behind the TIFF files you're seeing,
and is thus truly capable of being searched.
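
As a rough illustration of what that recognition rate could mean for a
keyword search, here is a small sketch. It assumes, purely for the sake of
arithmetic, that each occurrence of a word is recognized independently with
the same probability; real OCR errors tend to cluster on particular fonts
and pages, so treat the results as a guide and nothing more. The function
names are made up for this example.

```python
def expected_found(occurrences, recognition_rate):
    """Expected number of occurrences of a keyword that a search can find."""
    return occurrences * recognition_rate


def chance_all_missed(occurrences, recognition_rate):
    """Probability that every occurrence of a keyword was left unrecognized.

    Assumes each occurrence is recognized independently (an assumption,
    not something measured from the scanned manual).
    """
    return (1.0 - recognition_rate) ** occurrences


if __name__ == "__main__":
    for rate in (0.85, 0.90):
        for n in (1, 3, 10):
            print(
                f"rate={rate:.2f} occurrences={n}: "
                f"expected found={expected_found(n, rate):.2f}, "
                f"chance all missed={chance_all_missed(n, rate):.6f}"
            )
```

Under that assumption, a keyword that appears many times will almost
certainly turn up somewhere, but a term that appears only once or twice has
a real chance of being invisible to the search.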

Wirt Atmar
