Re: Problem Indexing PDFs with spider.pl

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Dec 04 2002 - 15:41:10 GMT

On Wed, 4 Dec 2002 Jeffrey.Grunstein@ny.frb.org wrote:

> I'm running Swish-E 2.2.1 on a Solaris 9 box.  I got a filesystem index
> working flawlessly, with PDFs being parsed as TXT using pdftotext.
> 
> Now, I'm trying to get it working using the prog method and spider.pl.  The
> crawl seems to work fine and HTML files get indexed using the HTML2
> parser.  I cannot get PDF files to index correctly.  When I tried the pdf
> function internal to spider.pl, the PDF files were parsed as HTML2s and
> only between 5 and 8 words per file were indexed.  I know this is wrong
> because the same PDF files with the filesystem index yield many more
> indexed words.
> 
> FilterDir /opt/sfw/bin
> FileFilter .pdf pdftotext "'%p' -"

http://www.swish-e.org/current/docs/CHANGES.html#Version_2_2_2_November_14_2002

Or you can filter in the spider.pl program.
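
As a rough illustration of what filtering on the program side amounts to, here is a minimal Python sketch. It is not the spider.pl code. It assumes pdftotext is on the PATH and that the prog input method accepts the Path-Name, Content-Length and Document-Type headers, with "TXT*" selecting the text parser. The function names pdf_to_text and prog_record are made up for this example.

```python
import subprocess


def pdf_to_text(pdf_path, pdftotext="pdftotext"):
    """Run pdftotext on pdf_path and return the extracted text as bytes.

    The trailing "-" asks pdftotext to write to stdout, the same way the
    FileFilter line quoted above does.
    """
    result = subprocess.run(
        [pdftotext, pdf_path, "-"],
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout


def prog_record(path_name, content, doc_type="TXT*"):
    """Wrap already-filtered content in headers for the prog method.

    Assumption: "TXT*" makes the document go through the text parser
    rather than HTML2. Content-Length is the byte length of the content.
    """
    header = (
        f"Path-Name: {path_name}\n"
        f"Content-Length: {len(content)}\n"
        f"Document-Type: {doc_type}\n"
        "\n"
    ).encode("utf-8")
    return header + content
```

A program feeding swish-e would write prog_record(url, pdf_to_text(local_copy)) to stdout for each PDF it fetched. The point is that the converted text has to be labelled as text, otherwise it ends up in the HTML parser and only a handful of words get indexed.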



-- 
Bill Moseley moseley@hank.org
Received on Wed Dec 4 15:41:23 2002