Re: SWISH-E index limits

From: Gerald Klaas <gklaas(at)not-real.arb.ca.gov>
Date: Mon Apr 22 2002 - 17:50:51 GMT
Bill Moseley wrote:
> 
> At 10:03 AM 04/22/02 -0700, Linda DeBoer wrote:
> >       Whenever I run swish-e against a site which has a url pointing back
> >to the home page, it loops.
> 
> You don't mean "loop" in that it indexes the same URL more than once, right?
> 

It might, if there is an equivalent URL that is not configured with the
EquivalentServer directive.  For example, http://www.sacto.com/ and http://sacto.com/
are two URLs for the same page. So wouldn't you need this in your config file?
EquivalentServer http://sacto.com http://www.sacto.com

Or, if the links back to the home page are not consistent, you might
also wind up with variants like these being indexed separately:
http://sacto/
http://sacto.com/index.htm
http://www.sacto.com/index.htm
And then there is the possibility of case variants, since paths are
case-insensitive if the host is MS-based:
http://www.sacto.com/Index.htm
http://www.sacto.com/INDEX.htm
http://www.sacto.com/INDEX.HTM


> But, if you are using 2.1-dev, and the -S prog method with spider.pl then
> it's rather easy to do this.
> 
> In the config you can say:
> 
>   test_url => sub {
>       my $uri = shift;
>       return $uri->path =~ m!^/some/path!;
>   }
> 

I do this. Just like Bill says, it works like a charm.   :-)
If you want to see how I use this, you can check the 
"spider configuration template" link from
http://www.arb.ca.gov/db/search/swishe/swishe.htm
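
As a rough illustration only (this is not the template linked above), the
same test_url hook could also be used to keep the index.htm variants listed
earlier out of the index. The fragment below is a sketch of keys inside one
server entry of a spider.pl config. The same_hosts key is taken from the
spider.pl documentation as I understand it and may not be present in every
2.1-dev snapshot, and the index.htm pattern is an assumption about how the
site names its home page.

```perl
# Sketch of keys inside one server hash in a spider.pl config file.
# Host names come from the examples above; the patterns are assumptions.

base_url   => 'http://www.sacto.com/',

# Treat the bare host as the same site as base_url
# (assumed to play the role of EquivalentServer for the spider).
same_hosts => [ 'sacto.com' ],

test_url => sub {
    my $uri = shift;    # a URI object, as in Bill's example

    # Skip explicit index pages in any letter case
    # (index.htm, Index.htm, INDEX.HTM, index.html, ...).
    return 0 if $uri->path =~ m!/index\.html?$!i;

    return 1;           # spider everything else
},
```

Note that a URL rejected by test_url is not fetched at all, so this only
makes sense if the home page is also reachable as "/" somewhere on the site.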

> Another option, which would be fast, would be to run another web
> server/virtual host on a different port, and change the document root.
> 
Interesting.  Then you'd use the ReplaceRules directive to
rewrite the URL as it goes into the index? 
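
If so, the config line might look something like the sketch below. The
localhost port is hypothetical, standing in for whatever address the second
server would listen on, and the exact quoting should be checked against the
ReplaceRules documentation for your version.

```
# Hypothetical: spider the copy on port 8080, store the public URL in the index
ReplaceRules replace "http://localhost:8080/" "http://www.sacto.com/"
```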

Gerald
Received on Mon Apr 22 17:50:56 2002