Zhou Xiang wrote on 3/27/09 3:12 PM:
> Thank you for your help!
> Now it is still strange that when I tried to index the following page:
> http://digital.lib.lehigh.edu/beyondsteel_test/admin/templist.htm
> Although I set max_depth to be 1, it still cannot dig deeper into each link.
> That means it can only index the text appears on the above page, but none of
> the contents in each link.
> Can you figure it out?
>
> My spider.config file:
> @servers = (
>     {
>         base_url => 'http://digital.lib.lehigh.edu/beyondsteel_test/admin/templist.htm',
>         email => 'abc@gmail.com',
>
>         # other spider settings described below
>         max_depth => 1,
>     },
> );

Did you read what I said before?

>>
>> Read the docs:
>>
>> http://swish-e.org/docs/spider.html#configuration_options
>>
>> the default behaviour is to remain only on the same host.

All the links on the url you supply point at:

  rust.cc.lehigh.edu

which is not the same as

  digital.lib.lehigh.edu

so the spider stops because it will not leave the host you point it at.
That's a feature.

Why not pass in a list of all the urls you want spidered directly?

  base_url => [qw(
    http://rust.cc.lehigh.edu/beyondsteel/display.php?args=start-x_id-Sholes143line2
    http://rust.cc.lehigh.edu/beyondsteel/display.php?args=start-x_id-Sholes143line3
    http://rust.cc.lehigh.edu/beyondsteel/display.php?args=start-x_id-Sholes143line4
  )]

etc.

-- 
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users

Received on Sat Mar 28 10:22:26 2009
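
For reference, the suggestion in the reply can be combined with the settings from the quoted config into one spider.config. This is only a sketch: the email address and max_depth value are carried over unchanged from the quoted file, only the three URLs given in the reply are listed, and the remaining URLs (the "etc.") would have to be added by hand.

```perl
# spider.config (sketch): quoted settings plus the base_url list from the reply.
@servers = (
    {
        # Start on rust.cc.lehigh.edu itself, the host that serves the
        # pages, so the spider's same-host rule no longer stops it.
        base_url => [qw(
            http://rust.cc.lehigh.edu/beyondsteel/display.php?args=start-x_id-Sholes143line2
            http://rust.cc.lehigh.edu/beyondsteel/display.php?args=start-x_id-Sholes143line3
            http://rust.cc.lehigh.edu/beyondsteel/display.php?args=start-x_id-Sholes143line4
        )],
        email => 'abc@gmail.com',

        # carried over from the quoted config
        max_depth => 1,
    },
);
```

Listing the pages directly means the spider never has to follow a link from digital.lib.lehigh.edu to a different host, which is the step it refuses by default.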