Problem Indexing PDFs with spider.pl

From: <Jeffrey.Grunstein(at)not-real.ny.frb.org>
Date: Wed Dec 04 2002 - 15:09:16 GMT
I'm running Swish-E 2.2.1 on a Solaris 9 box.  I got a filesystem index
working flawlessly, with PDFs being parsed as TXT using pdftotext.

Now I'm trying to get it working using the prog method and spider.pl.  The
crawl seems to work fine, and HTML files get indexed using the HTML2
parser.  However, I cannot get PDF files to index correctly.  When I tried
the pdf function internal to spider.pl, the PDF files were parsed as HTML2
and only between 5 and 8 words per file were indexed.  I know this is wrong
because the same PDF files yield many more indexed words with the
filesystem index.

I also tried using pdftotext as a filter, but then no words get indexed at
all.  Here's the relevant snippet from my swish-e config file:

IndexContents HTML2 .html .htm
StoreDescription HTML2 100000

FilterDir /opt/sfw/bin
FileFilter .pdf pdftotext "'%p' -"
IndexContents TXT .pdf
StoreDescription TXT 250000

Note that the same directives work perfectly when we do a filesystem index.
For some reason, they don't work with a prog / spider.pl crawl.
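
To put a number on "many more words", one option is to run pdftotext by
hand on a PDF and count what comes out. The following is only a minimal
sketch in Python, not part of swish-e or spider.pl: the function names are
made up for illustration, the pdftotext path is assumed from the FilterDir
line above, and the word rule (runs of letters and digits) is only a rough
stand-in for swish-e's own word definition.

```python
import re
import subprocess

# Assumed location of pdftotext, taken from the FilterDir line in the config.
PDFTOTEXT = "/opt/sfw/bin/pdftotext"


def count_words(text):
    """Count runs of letters and digits (a rough approximation of indexed words)."""
    return len(re.findall(r"[A-Za-z0-9]+", text))


def pdf_word_count(pdf_path):
    """Run `pdftotext <file> -` (text goes to stdout) and count the words."""
    result = subprocess.run(
        [PDFTOTEXT, pdf_path, "-"],
        capture_output=True,
        check=True,  # raise if pdftotext exits with an error
    )
    # Decode leniently; the exact output encoding depends on the pdftotext build.
    return count_words(result.stdout.decode("latin-1", errors="replace"))
```

Calling pdf_word_count() on one of the affected PDFs gives a rough count of
the words pdftotext actually produces, which can be set against the 5 to 8
words per file seen in the spider.pl crawl.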
Received on Wed Dec 4 15:09:28 2002