Re: [swish-e] Regarding scalability and multithreading in Swish-e

From: <kumar.nitin(at)not-real.wipro.com>
Date: Tue Feb 19 2008 - 13:28:57 GMT
Hi,

Thanks for your valuable inputs.

I have a couple of issues:

1>>>>>>>>>

I have the following test.config file.

Command used on terminal:

./spider.pl test.config > out.txt

my %main_site = (
    base_url   => 'http://learning.wipro.com',
    same_hosts => 'http://www.wipro.com',
    email      => 'kumar.nitin@wipro.com',
    # test_url => sub { $_[0]->path !~ /\.(gif|jpeg|png)$/ },
    max_depth  => 1,
    delay_sec  => 0,
    link_tags  => [qw/ a frame option area /],
    debug      => DEBUG_URL | DEBUG_LINKS,
);

my %news_site = (
    base_url   => 'http://tedweb.wipro.com',
    same_hosts => 'http://www.wipro.com',
    email      => 'kumar.nitin@wipro.com',
    # test_url => sub { $_[0]->path !~ /\.(gif|jpeg|png)$/ },
    max_depth  => 1,
    delay_sec  => 0,
    link_tags  => [qw/ a frame option area /],
    debug      => DEBUG_URL | DEBUG_LINKS,
);

@servers = ( \%main_site, \%news_site );

1;

 

In the above scenario, while crawling 'http://learning.wipro.com', the
spider gives me all the links on the page when max_depth=0 is used.

But with max_depth=1, when it tries to connect to a different host such
as http://channelw.wipro.com, it fails.

Please let us know how we can resolve this problem so that crawling
works at depth 1.
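
For reference, here is a minimal sketch of how a third host might be added. It assumes the behaviour described in the spider.pl documentation: the spider only follows links on the host of base_url, plus any alias host names listed in same_hosts (given as an array reference of bare host names, without the scheme). Under that assumption, a different host such as channelw.wipro.com would need its own entry in @servers. The name %channel_site is hypothetical, and the sketch has not been tested against these servers.

```perl
# Hypothetical test.config sketch: one hash per host that should be crawled.
# Key names (base_url, email, max_depth, delay_sec) are the ones already
# used in the config above; %channel_site is an illustrative name only.

my %main_site = (
    base_url  => 'http://learning.wipro.com',
    email     => 'kumar.nitin@wipro.com',
    max_depth => 1,
    delay_sec => 0,
);

my %news_site = (
    base_url  => 'http://tedweb.wipro.com',
    email     => 'kumar.nitin@wipro.com',
    max_depth => 1,
    delay_sec => 0,
);

# Assumption: links to this host are skipped unless it is listed as its
# own server, because it is neither base_url's host nor a same_hosts alias.
my %channel_site = (
    base_url  => 'http://channelw.wipro.com',
    email     => 'kumar.nitin@wipro.com',
    max_depth => 1,
    delay_sec => 0,
);

# spider.pl reads the list of servers to crawl from @servers.
@servers = ( \%main_site, \%news_site, \%channel_site );

1;
```

If www.wipro.com really is another name for one of these sites, the documented form would be same_hosts => [ 'www.wipro.com' ] rather than a URL string.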

  

2>>>>>>>>>>>>> 

 

If I am crawling multiple URLs at a time, how can the load be balanced?
Is there something like multithreading?

 

  

 

With Regards,

Nitin Kumar

+91-9999499757

 

-----Original Message-----
From: users-bounces@lists.swish-e.org
[mailto:users-bounces@lists.swish-e.org] On Behalf Of Peter Karman
Sent: Tuesday, February 19, 2008 5:59 PM
To: Swish-e Users Discussion List
Subject: Re: [swish-e] Regarding scalability and multithreading in
Swish-e

 

 

 

kumar.nitin@wipro.com wrote on 2/19/08 4:03 AM:

> Hi,
>
> Swish-e has a helper program, spider.pl, which spiders a single host.
> Can we give it multiple hosts to spider at a time?
>

 

yes. read the docs:

 

http://swish-e.org/docs/spider.html#configuration_file

 

-- 

Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com

_______________________________________________

Users mailing list

Users@lists.swish-e.org

http://lists.swish-e.org/listinfo/users


Received on Tue Feb 19 08:29:10 2008