Re: indexing only PDF files using swish-e-2.1 dev

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Apr 26 2001 - 19:51:34 GMT
At 12:40 PM 04/26/01 -0700, Chris Blackstone wrote:
>I'm trying to set up my config file so it will spider and only index PDF
>files, not any other files.
>In my config file I have the following
>
>IndexOnly .pdf
>NoContents .html .htm
>
>which only indexes the pdf files and the title of each html page.
>However, if I take out the NoContents line, all html files get indexed.

You are still using the spider.pl program, correct?

IndexOnly has no effect with the "prog" document source method, because swish
assumes you want every document the program feeds it to be indexed.

So what you want to do is spider your entire web site but only index pdf
files, correct?

Then use this in your spider.pl config:

  test_response   => sub { $_[2]->content_type eq 'application/pdf' }, 

When that returns false the document is skipped; when it returns true (the
response is a PDF file) the document is passed on to swish for indexing.
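
For context, here is a rough sketch of how that test_response line might sit
in a complete spider.pl config file. The URL, email address and agent string
are placeholders I made up, and the exact set of required keys depends on your
spider.pl version, so treat this as an illustration rather than a drop-in file:

```perl
# Sketch of a spider.pl config file. All values below are placeholders.
# spider.pl reads its settings from the @servers array defined here.
@servers = (
    {
        base_url => 'http://www.example.com/',   # hypothetical start URL
        email    => 'admin@example.com',         # hypothetical contact address
        agent    => 'swish-e spider',            # hypothetical agent string

        # The third argument ($_[2] in the one-line form) is the
        # HTTP::Response object. Returning false skips the document.
        test_response => sub {
            my ( $uri, $server, $response ) = @_;
            # In scalar context content_type gives just the media type,
            # without parameters such as a charset.
            return $response->content_type eq 'application/pdf';
        },
    },
);

1;   # a file loaded by Perl should end with a true value
```

The named arguments are only there for readability; the callback does the
same thing as the one-liner. Check the spider.pl documentation that ships with
your swish-e version for the required keys and for how skipped HTML pages are
treated when following links.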

I think it's better to look at content types than at file extensions when
spidering.



Bill Moseley
mailto:moseley@hank.org
Received on Thu Apr 26 19:53:24 2001