PDF indexing: one minor mystery revealed

From: Thomas R. Bruce <trb2(at)not-real.cornell.edu>
Date: Fri Feb 04 2005 - 01:47:57 GMT
Folks:

I've been having trouble with poor (read: glacial) indexing performance  
when working with some PDF files.  I noticed a couple of inconclusive 
posts on the subject here, and thought I'd share the little bit that 
I've learned.    Turns out that pdftotext -- and by implication, 
SWISH::Filter, which uses it -- can't cope with some Type 3 fonts.  Lack 
of coping takes the form of very, very, very, very, very, very, very 
slow conversion resulting in gibberish.

For me, the best solution was simply to use pdffonts to detect the 
presence of Type 3 fonts and skip indexing likely troublesome files 
altogether (they account for 10% of the output of one of the smaller 
of the 13 collections I'm indexing).  According to wiser PDF-heads than 
mine, the presence of a Type 3 font does not inevitably bollix 
conversion, but spot-checking the corpus I am working with indicates 
that it's near-enough certain to justify triage.
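
The triage described above could be sketched roughly as follows in Python. This is an illustration under stated assumptions, not the script actually used: it assumes pdffonts is on the PATH and prints its usual table (a header row, a rule of dashes, then one row per font, with a column headed "type"). The function names are hypothetical.

```python
import re
import subprocess


def type_column_values(pdffonts_output):
    """Return the entries of the 'type' column from pdffonts' table output.

    Assumes line 1 is the header, line 2 is a rule of dashes marking the
    column widths, and the remaining lines are one font per row.
    """
    lines = pdffonts_output.splitlines()
    if len(lines) < 2:
        return []
    header, rule = lines[0], lines[1]
    # Each run of dashes in the rule marks the extent of one column.
    for match in re.finditer(r"-+", rule):
        start, end = match.span()
        if header[start:end].strip() == "type":
            return [row[start:end].strip() for row in lines[2:] if row.strip()]
    return []


def has_type3_font(pdffonts_output):
    """True if any font row reports 'Type 3' in the type column."""
    return any(value == "Type 3" for value in type_column_values(pdffonts_output))


def should_skip(pdf_path):
    """Run pdffonts on one file and report whether it uses a Type 3 font.

    If pdffonts fails or prints nothing, this returns False, so the file
    would still be indexed; adjust that policy to taste.
    """
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=False
    )
    return has_type3_font(result.stdout)
```

Reading the type column by position, rather than searching each whole line for "Type 3", avoids a false hit on a font whose name happens to contain that text. An indexing script would call should_skip() on each PDF before handing it to the filter.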

Hope this helps somebody.  For that matter, if anyone has a pdf-to-ascii 
or -html converter that'll cope with Type 3 fonts, I'd love to know 
about it.

Best,
Tb.

-- 
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=
Thomas R. Bruce (trb2@cornell.edu)
Director, Legal Information Institute
Cornell Law School
http://www.law.cornell.edu/
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+= 
Received on Thu Feb 3 17:48:04 2005