Re: error indexing pdf files

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Apr 15 2003 - 13:27:30 GMT
On Tue, 15 Apr 2003, Jody Cleveland wrote:

> My question is, how do I get the spider to only look at a specific folder,
> and nothing else? I looked through the swish-e message archive, and came
> across this, which I added to my SwishSpiderConfig.pl:
> 
> 
> But, that still indexes all of www.oshkoshpubliclibrary.org. All I want is
> the citydirs directory.

You can try setting:

     debug => DEBUG_SKIPPED|DEBUG_INFO,

And if that's not enough, simply add some print statements to your test_url
function:

    test_url => sub {
        my ($uri, $server) = @_;
        print STDERR "checking path: ", $uri->path, "\n"
            if $server->{debug} & DEBUG_INFO;
        return if $uri->path =~ /\.(gif|jpeg)$/;
        return $uri->path =~ m[^/citydirs/];
    },
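To check the filtering rules without running the spider at all, something like the following standalone script might help. It is only a sketch: keep_path is a hypothetical helper (not part of spider.pl) that applies the same two regex tests as the callback above to a plain path string, and the sample paths are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: the same two rules as the test_url callback,
# applied to a plain path string instead of a URI object.
sub keep_path {
    my ($path) = @_;
    return 0 if $path =~ /\.(gif|jpeg)$/;      # skip image files
    return $path =~ m[^/citydirs/] ? 1 : 0;    # keep only paths under /citydirs/
}

# Made-up sample paths, for illustration only.
for my $path (qw(
    /citydirs/index.html
    /citydirs/1900/page1.pdf
    /citydirs/cover.jpeg
    /index.html
    /about/citydirs/old.html
)) {
    printf "%-30s %s\n", $path, keep_path($path) ? 'index' : 'skip';
}
```

If a path you expect to be skipped shows up as "index" here, the problem is in the regex; if the script behaves but the spider still indexes everything, the callback is probably not being picked up from the config.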

Another way to do all this is to index the entire site in one go and use
Swish-e's ExtractPath to set a metaname.  Then when searching you can
limit results to areas of the index.  See the "select_by_meta" example in
the swish.cgi file.
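The idea behind ExtractPath is that a regular expression pulls a value out of each document's path and stores it under a metaname. The snippet below is not Swish-e configuration; it is a plain Perl sketch of the kind of path-to-value mapping such a rule would perform. The function name section_for, the choice of "first path segment", and the fallback value 'none' are all assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical mapping: use the first path segment as the section value,
# falling back to 'none' for documents at the top level.
sub section_for {
    my ($path) = @_;
    return $path =~ m{^/([^/]+)/} ? $1 : 'none';
}

# Made-up sample paths, for illustration only.
for my $path (qw(
    /citydirs/1900/page1.pdf
    /events/calendar.html
    /index.html
)) {
    printf "%-30s section=%s\n", $path, section_for($path);
}
```

With a value like that stored under a metaname, a search could then be limited to one area of the site by querying that metaname, which is what the select_by_meta example demonstrates.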

BTW -- are you using keep_alive => 1 when spidering?

-- 
Bill Moseley moseley@hank.org
Received on Tue Apr 15 13:31:15 2003