Folks:
I've been having trouble with poor (read: glacial) indexing performance
when working with some PDF files. I noticed a couple of inconclusive
posts on the subject here, and thought I'd share the little bit that
I've learned. Turns out that pdftotext -- and by implication,
SWISH::Filter, which uses it -- can't cope with some Type 3 fonts. Lack
of coping takes the form of very, very, very, very, very, very, very
slow conversion resulting in gibberish.
For me, the best solution was simply to use pdffonts to detect the
presence of Type 3 fonts and skip indexing likely troublesome files
altogether (they account for 10% of the output of one of the smaller
of 13 collections I'm indexing). While (according to wiser PDF-heads
than mine) the presence of a Type 3 font does not inevitably bollix
conversion, spot-checking the corpus I am working with indicates that
failure is near-enough certain to justify triage.
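For anyone who wants to try the same triage, here is a minimal sketch in
Python, not the exact script I run. It assumes pdffonts is on your PATH
and that its output is a header line, a dashed rule, and then one row per
font with the font type spelled "Type 3" in the row. The function names
are made up for illustration, and a font whose name happens to contain
"Type 3" would trigger a false positive.

```python
import re
import subprocess
import sys

# Assumption: pdffonts reports the font type as the literal text "Type 3".
TYPE3 = re.compile(r"\bType 3\b")


def output_lists_type3(pdffonts_output: str) -> bool:
    """Return True if any font row in pdffonts output mentions 'Type 3'."""
    lines = pdffonts_output.splitlines()
    # Assumption: the first two lines are the column header and a dashed rule.
    rows = lines[2:]
    return any(TYPE3.search(row) for row in rows)


def pdf_has_type3(path: str) -> bool:
    """Run pdffonts on one file and check its font list for Type 3 fonts."""
    result = subprocess.run(
        ["pdffonts", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return output_lists_type3(result.stdout)


if __name__ == "__main__":
    # Print SKIP or INDEX for each PDF named on the command line.
    for path in sys.argv[1:]:
        try:
            skip = pdf_has_type3(path)
        except (OSError, subprocess.CalledProcessError):
            # If pdffonts is missing or cannot read the file, skip it too.
            skip = True
        print(("SKIP " if skip else "INDEX ") + path)
```

Run it over a list of PDFs and feed only the INDEX lines to the indexer.
Treating unreadable files as skips is a conservative choice on my part;
adjust to taste.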
Hope this helps somebody. For that matter, if anyone has a pdf-to-ascii
or -html converter that'll cope with Type 3 fonts, I'd love to know
about it.
Best,
Tb.
--
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=
Thomas R. Bruce (trb2@cornell.edu)
Director, Legal Information Institute
Cornell Law School
http://www.law.cornell.edu/
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=
Received on Thu Feb 3 17:48:04 2005