
Re: HTTP method and swishspider

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Sep 25 2000 - 17:40:08 GMT
At 08:21 AM 09/25/00 -0700, jmruiz@boe.es wrote:
>I have noticed that this option is slow. I am wondering why. As you 
>know, an external perl program is called for getting each page from 
>the server.

Yes, that's a silly, silly way to spider.  I do hope nobody is using swish
to spider sites under their own control instead of indexing them through
the file system.

My guess is that the fork is not the problem.  Most modern operating
systems can fork fast enough -- certainly fast compared to fetching the
remote http document.

It would certainly be smarter if the GET were done within http.c instead
of by calling out to a perl program.

It would be smarter if http.c did a pipe open of swishspider and let
swishspider.pl really spider, instead of fetching just one resource per
call.  Much of the http.c code could then move into a perl script, making
it easier to maintain and change.
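
Roughly something like this -- untested, and assuming LWP and
HTML::LinkExtor are installed; the point is the single process and the
URL queue, not the details:

    #!/usr/local/bin/perl -w
    use strict;
    use LWP::UserAgent;
    use HTTP::Request;
    use HTML::LinkExtor;

    # One process, one UserAgent, a queue of URLs -- instead of
    # fork/exec'ing a fresh perl for every single document.
    my $ua = LWP::UserAgent->new;
    $ua->timeout( 30 );

    my %seen;
    my @queue = ( shift @ARGV );    # starting URL on the command line

    while ( my $url = shift @queue ) {
        next if $seen{ $url }++;
        my $response = $ua->request( HTTP::Request->new( GET => $url ) );
        next unless $response->is_success;

        print $response->content;   # hand the document back to swish

        # Queue any links found in HTML documents.  Passing $url as the
        # base makes HTML::LinkExtor return absolute URI objects.
        next unless $response->content_type eq 'text/html';
        my $extor = HTML::LinkExtor->new( undef, $url );
        $extor->parse( $response->content );
        for my $link ( $extor->links ) {
            my ( $tag, %attr ) = @$link;
            push @queue, $attr{href}->as_string
                if $tag eq 'a' && $attr{href};
        }
    }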

It would be smarter still if swishspider used LWP::Parallel::UserAgent when
spidering, or just forked off a bunch of spiders to run in parallel (which
is what I do) and fed their output back to swish.
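
The parallel version isn't much code either.  Untested, but going by the
LWP::Parallel::UserAgent synopsis it would look something like:

    use strict;
    use LWP::Parallel::UserAgent;
    use HTTP::Request;

    # Register a batch of requests, then let them all run in parallel.
    my $pua = LWP::Parallel::UserAgent->new;
    $pua->max_hosts( 5 );   # talk to at most 5 hosts at once
    $pua->max_req( 5 );     # at most 5 requests per host

    for my $url ( @ARGV ) {
        # register() queues the request; it only returns a response
        # object if something went wrong right away.
        my $error = $pua->register( HTTP::Request->new( GET => $url ) );
        warn $error->message, "\n" if $error;
    }

    # wait() blocks until everything finishes (or the timeout hits)
    # and returns one entry per registered request.
    my $entries = $pua->wait( 30 );
    for my $key ( keys %$entries ) {
        my $response = $entries->{ $key }->response;
        print $response->content if $response->is_success;
    }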

My suggestion would be to get rid of the http method entirely and just
provide a hook where swish does a piped open of any program, and that
program feeds documents back in some standard format simply by writing to
STDOUT.  That way you could index local files that need filtering, spider
remote web sites, or index a bunch of records stored in a database.
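
For example -- and the header names here are made up, not any existing
swish format -- a feeder program could be as simple as:

    #!/usr/local/bin/perl -w
    use strict;

    # A made-up example of the "standard format": one header block per
    # document, then the content.  Swish would do a piped open of this
    # program and index whatever shows up on STDOUT -- it never needs
    # to know whether the documents came from disk, a spider, or a
    # database.
    for my $file ( @ARGV ) {
        open FILE, $file or next;
        my $content = do { local $/; <FILE> };   # slurp the whole file
        close FILE;

        # Run a filter on $content here, or replace this whole loop
        # with DBI queries or an LWP spider.
        print "Path-Name: $file\n";
        print "Content-Length: ", length( $content ), "\n\n";
        print $content;
    }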



Bill Moseley
mailto:moseley@hank.org
Received on Mon Sep 25 17:40:39 2000