duplicate documents

From: Jon Sorensen <jon(at)not-real.starkmedia.com>
Date: Fri Oct 01 2004 - 15:50:59 GMT
I'm trying to spider a number of sites, but spider.pl keeps getting stuck in
a loop at:

https://secure.meriter.com/classreg/desc.cfm?CatID=35&ClassID=2310&RegID= - Using HTML2 parser -  (582 words)
https://secure.meriter.com/classreg/desc.cfm?CatID=35&ClassID=2311&RegID= - Using HTML2 parser -  (591 words)
https://secure.meriter.com/classreg/desc.cfm?CatID=35&ClassID=2312&RegID= - Using HTML2 parser -  (582 words)
https://secure.meriter.com/classreg/desc.cfm?CatID=35&ClassID=2313&RegID= - Using HTML2 parser -  (582 words)

It was getting stuck on ReviewClasses.cfm, so I'm using test_url to skip
that page, but I still want to index the desc.cfm pages. I'm planning to try
use_md5, but I'm not sure whether that will make any difference.

my %serverD = (
    base_url   => 'https://secure.meriter.com/classreg/',
    email      => 'jon@starkmedia.com',
    keep_alive => 1,
    # Reject any URL whose path matches ReviewClasses.cfm; accept the rest.
    test_url   => sub {
        my $uri = shift;
        return 0 if $uri->path =~ /ReviewClasses\.cfm/;
        return 1;
    },
    #use_md5   => 1,
);
@servers = ( \%serverD, );
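
For comparison, here is a minimal sketch of the same configuration with use_md5 switched on and a max_files cap added as a safety net. The value 500 is an arbitrary assumption, not a recommendation, and whether use_md5 helps here is uncertain: the word counts in the output above differ between pages (582 vs. 591), so the desc.cfm pages may not all be byte-identical, and the checksum would only catch exact duplicates.

```perl
use strict;
use warnings;

my %serverD = (
    base_url   => 'https://secure.meriter.com/classreg/',
    email      => 'jon@starkmedia.com',
    keep_alive => 1,

    # Fingerprint each page's content so identical content reached through
    # different URLs is indexed only once.
    use_md5    => 1,

    # Assumed value: stop after this many files so a runaway crawl
    # cannot continue forever.
    max_files  => 500,

    # Same callback as before: reject ReviewClasses.cfm, accept the rest.
    test_url   => sub {
        my $uri = shift;
        return 0 if $uri->path =~ /ReviewClasses\.cfm/;
        return 1;
    },
);

our @servers = ( \%serverD );

1;
```

The max_files cap does not explain why the spider loops; it only bounds the run so the output can be inspected afterwards.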

I'm not sure why this is getting stuck or how to debug it. I checked the
-T trace flag options for indexing, but nothing there seems to pertain to
this.

Any suggestions?

Thanks for all your help.

Jon Sorensen
Received on Fri Oct 1 08:51:10 2004