OK, here's a question -- I've got to run this huge index twice: once for the
regular index, and another time for the Fuzzy index. Is there any way to
only run the spider.pl part of it once, and then somehow feed the results to
both the non-fuzzy and the fuzzy SWISH indexing? The real work is in
grabbing all the HTML files, not in the indexing part.
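One way that should work (a sketch, not something from the docs I've tested -- the config file names and the spider.out file are made up): spider.pl writes its documents to stdout in the -S prog stream format, so you can capture that stream once and index the saved file twice, telling swish-e to read the prog stream from standard input with "stdin" as the input name:

```sh
# Run the spider once, saving the -S prog output stream.
./spider.pl SwishSpiderConfig.pl > spider.out

# Index the saved stream twice, once per config.
# (regular.conf and fuzzy.conf are hypothetical config names.)
swish-e -S prog -i stdin -c regular.conf -f index.swish-e < spider.out
swish-e -S prog -i stdin -c fuzzy.conf   -f fuzzy.swish-e < spider.out
```

That way the network fetching happens only once, and the two indexing runs are pure local CPU work.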
From: [mailto:firstname.lastname@example.org] On Behalf Of Bill Moseley
Sent: Tuesday, October 08, 2002 10:21 AM
To: Multiple recipients of list
Subject: [SWISH-E] Re: Spider taking too long to index?
On Tue, 8 Oct 2002, David VanHook wrote:
> Last night, during a time when the site was not very busy at all, it took
> spider.pl 3 hours and 16 minutes to index 19,277 files (a rate of 1.6 per
> second, according to the SWISH report). The total amount of CPU
> time was 22 minutes, 20 seconds.
> The way I'm doing it is, I feed Spider.pl a single page which contains a
> list of links to all the pages I want it to index. That page is huge, of
> course. Then I tell spider.pl to only go one level deep. So it grabs the
> first item on the page, indexes it, returns to the list, grabs the next
> item, indexes it, returns to the list, etc. Is that not the way this
> should work? Should I modify some setting on SwishSpiderConfig.pl to
> account for this system?
Yes, but it's only fetching one doc at a time, which might be slow. It
took you 11,760 seconds to fetch 19,277 files.
That can be faster if keep-alives are working. This requires both that the
server is configured to do keep-alives, of course, and that the spidering
machine has a current version of LWP installed. I think spider.pl will
complain if you set keep_alive and do not have the supporting LWP code.
You should be able to check whether keep-alive is working on the server:
> cat t.pl
@servers = (
    {
        base_url   => 'http://apache.org',
        email      => 'email@example.com',
        max_files  => 1,
        keep_alive => 1,   # enable keep-alive requests
    },
);
> SPIDER_DEBUG=headers ./spider.pl t.pl >/dev/null
----HEADERS for http://apache.org ---
Connection: Keep-Alive <<<<<
Date: Tue, 08 Oct 2002 14:11:58 GMT
Server: Apache/2.0.43 (Unix)
Content-Type: text/html; charset=iso-8859-1
Expires: Wed, 09 Oct 2002 14:11:58 GMT
Client-Date: Tue, 08 Oct 2002 14:12:01 GMT
Keep-Alive: timeout=5, max=100 <<<<
Title: Welcome! - The Apache Software Foundation
If it says:
Connection: close
then keep-alives are not working.
If you are spidering more than one site, then set the keep_alive value to a
larger number -- that setting is the number of connection cache entries.
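In the @servers config that might look like this (the hostnames are made up, and the value 4 is arbitrary -- the idea is one cache entry per server being spidered):

```perl
# keep_alive doubles as the size of LWP's connection cache,
# so give it one entry per server you are spidering.
@servers = (
    {
        base_url   => 'http://site1.example.com/',
        email      => 'email@example.com',
        keep_alive => 4,
    },
    {
        base_url   => 'http://site2.example.com/',
        email      => 'email@example.com',
        keep_alive => 4,
    },
);
```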
If your web server has any monitoring features maybe that will show if
keep alive requests are working. In Apache you can use mod_status to
monitor the server.
If keep-alives alone are not fast enough and you don't mind hitting the
web server harder, there are ways to do parallel fetching, but that
would require a rewrite of the spider.pl program.
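For what it's worth, here's what a standalone parallel fetcher could look like -- this is just a sketch, nothing to do with spider.pl's internals, using the CPAN modules Parallel::ForkManager and LWP::UserAgent; the URL list and the worker count of 5 are made up:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use Parallel::ForkManager;

# Hypothetical list of URLs to fetch.
my @urls = map { "http://www.example.com/page$_.html" } 1 .. 20;

my $pm = Parallel::ForkManager->new(5);   # up to 5 concurrent children

for my $url (@urls) {
    $pm->start and next;                  # parent: fork a child, move on
    my $ua  = LWP::UserAgent->new;        # each child gets its own UA
    my $res = $ua->get($url);
    warn "$url: " . $res->status_line . "\n" unless $res->is_success;
    $pm->finish;                          # child exits
}
$pm->wait_all_children;
```

The tradeoff is exactly what's noted above: five children hit the web server five times as hard as one polite spider.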
> Because I'm generating this list of items to index myself, I turned off
> test_url function. But that didn't seem to help performance all that much.
No, it wouldn't -- all the time is probably in either the connection
process or in the transfer of data. You can see from the huge difference
between CPU time and running time that the program is mostly waiting for I/O.
Bill Moseley firstname.lastname@example.org
Received on Tue Oct 8 21:50:58 2002