Re: Double Slashes When Spidering

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sun Jan 26 2003 - 18:26:10 GMT
On Thu, 23 Jan 2003, Michael Tsai wrote:

> On Wednesday, January 22, 2003, at 02:32  PM, Michael Tsai wrote:
> 
> > The problem is that the spider goes into an infinite loop. After going
> > through all the pages on the site, it starts printing out entries like:
> >
> >     Processing http://www.atpm.com//2.07/index.shtml...
> >     Processing http://www.atpm.com//2.06/index.shtml...
> >
> > where it adds a second forward slash after the domain name. If I leave
> > it running long enough, it makes another pass over the pages with three
> > slashes.

You might look at the thread starting at:

http://www.rosat.mpe-garching.mpg.de/mailing-lists/libwww-perl/2002-01/msg00005.html

> I was able to stop this from happening by putting:
> 
> 	return if $uri->as_string =~ m[atpm\.com//];

Seems like a reasonably easy solution.  It might be interesting to see
which link is causing the spider to generate the double slash.  Turn on
URL or link debugging in the spider and you should be able to find where
it's happening.
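
If you'd rather not hard-code the host name in that regex, a more general
check might look something like the sketch below.  It is only a sketch:
has_empty_path_segment() is a made-up helper name, not something that
ships with the spider, and it assumes you only care about repeated
slashes in the path, not in the query string or fragment.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the spider): returns 1 when the part
# of the URL after "scheme://" and before any query or fragment contains
# two or more consecutive slashes, otherwise 0.
sub has_empty_path_segment {
    my ($url) = @_;
    (my $rest = $url) =~ s{^[A-Za-z][A-Za-z0-9+.-]*://}{};  # drop "scheme://"
    $rest =~ s{[?#].*}{}s;                                   # drop query and fragment
    return $rest =~ m{//} ? 1 : 0;
}

# Quick demonstration using the URLs from the report.
for my $url (
    'http://www.atpm.com/2.07/index.shtml',
    'http://www.atpm.com//2.07/index.shtml',
    'http://www.atpm.com///2.06/index.shtml',
) {
    printf "%-42s %s\n", $url, has_empty_path_segment($url) ? 'skip' : 'keep';
}
```

Assuming $uri is the same object used in the "return if" line quoted
above, that line could then become something like:

	return if has_empty_path_segment( $uri->as_string );

That only hides the symptom, though, so it is still worth tracking down
the link that produces the extra slash.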

-- 
Bill Moseley moseley@hank.org
Received on Sun Jan 26 18:26:55 2003