On Wed, 4 Dec 2002 Jeffrey.Grunstein@ny.frb.org wrote:
> I'm running Swish-E 2.2.1 on a Solaris 9 box. I got a filesystem index
> working flawlessly, with PDFs being parsed as TXT using pdftotext.
> Now, I'm trying to get it working using the prog method and spider.pl. The
> crawl seems to work fine and HTML files get indexed using the HTML2
> parser. I cannot get PDF files to index correctly. When I tried the pdf
> function internal to spider.pl, the PDF files were parsed as HTML2 and
> between 5 and 8 words per file were indexed. I know this is wrong because
> the same PDF files with the filesystem index yield many more indexed
> words.
> FilterDir /opt/sfw/bin
> FileFilter .pdf pdftotext "'%p' -"
Or you can filter in the spider.pl program.
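
As a rough illustration of what a spider-side filter has to do, here is a minimal Python sketch of the conversion step only. It is not spider.pl code and does not use spider.pl's callback interface; the function names are hypothetical. It assumes pdftotext is on the PATH (or that you pass its full path, e.g. under /opt/sfw/bin) and a Unix-like system where another process can open the temporary file. The one thing taken from the setup above is the command form "pdftotext <file> -", where "-" sends the extracted text to stdout.

```python
import shutil
import subprocess
import tempfile


def pdftotext_command(pdf_path, pdftotext="pdftotext"):
    """Build the argument list for: pdftotext <pdf_path> -
    The trailing '-' tells pdftotext to write the text to stdout."""
    return [pdftotext, pdf_path, "-"]


def pdf_bytes_to_text(pdf_bytes, pdftotext="pdftotext"):
    """Hypothetical helper: convert the bytes of a fetched PDF to text.

    The bytes are written to a temporary file because pdftotext is
    given a file name here, the same way FileFilter passes '%p'.
    """
    if shutil.which(pdftotext) is None:
        raise FileNotFoundError("pdftotext not found: " + pdftotext)

    with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
        tmp.write(pdf_bytes)
        tmp.flush()  # make sure pdftotext sees the whole document
        result = subprocess.run(
            pdftotext_command(tmp.name, pdftotext),
            capture_output=True,
            check=True,  # raise if pdftotext exits with an error
        )

    # The output is plain text, not HTML.
    return result.stdout.decode("utf-8", errors="replace")
```

Whatever does the filtering, the converted output is plain text, so it should be handed to the indexer as text (the way your filesystem index treats it as TXT) rather than run through the HTML2 parser.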
Bill Moseley email@example.com
Received on Wed Dec 4 15:41:23 2002