
Re: [swish-e] index a list of files

From: Brad Bauer <bbauer(at)not-real.telstate.com>
Date: Wed Jul 09 2008 - 22:36:24 GMT
Bill,

The wait I was seeing was related to the default 5-second delay; setting
delay_min resolved the issue.  It went from 23 minutes down to 55 seconds
for my test case.
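
For the archives, the change was essentially this one line in the server
hash of my SwishSpiderConfig.pl (from memory; depending on the spider.pl
version the option may instead be delay_sec, in seconds):

    delay_min => 0,   # no pause between requests (the default added ~5 seconds per URL)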


I appreciate the help, guys.  Thanks!



B Bauer


-----Original Message-----
From: users-bounces@lists.swish-e.org
[mailto:users-bounces@lists.swish-e.org] On Behalf Of Bill Moseley
Sent: Wednesday, July 09, 2008 1:43 PM
To: Swish-e Users Discussion List
Subject: Re: [swish-e] index a list of files

On Wed, Jul 09, 2008 at 09:47:13AM -0400, Brad Bauer wrote:
> 
> Perhaps there is something else at play slowing it down.  While trying 
> to get the spider working I reduced the SwishSpiderConfig.pl settings 
> to a bare minimum, so any timings are at their default.  What are the 
> default timings the spider uses?  Can you recommend good options for 
> the timing related settings?

You would have to check, but I think there's a default delay between
requests for the spider (in a questionable attempt at making the spider be
nice to the web server).  So, make sure "delay_sec" is set to zero, I think.

For local spidering, I'd set delay_sec to zero and make sure keep-alives are
enabled on the web server and the spider.
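
Something like this in the server hash of SwishSpiderConfig.pl (untested
sketch; option names as I remember them from the stock spider.pl, and the
base_url is just a placeholder):

    @servers = (
        {
            base_url   => 'http://localhost/',   # whatever you are spidering
            delay_sec  => 0,                     # no pause between requests
            keep_alive => 1,                     # reuse the HTTP connection
            # ...plus whatever other settings you already have...
        },
    );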

Again, I can't imagine that fetching the content over HTTP on a local
machine is so significant that it's the problem.

> I'll look into modifying spider.pl, but I am no perl guru so I might 
> take an easier route: I am thinking I can just adjust 
> SwishSpiderConfig.pl#test_url to append each .pdf URL it encounters to 
> a log file and return false for that file.  Then I will probably 
> modify file.pl (since it is such a simple
> file) to index the pdfs saved in the log file.  Do you see any 
> potential issues with that?

Whatever works for you.  Path of least resistance is always good.  I would
first just make sure there are no delays and that you are comparing apples
to apples.  I'd "spider" just a single pdf file (so it only indexes one
file) and compare that to indexing the same file with the file system
method.  Make sure the resulting indexes have the same content.  You have
to expect some additional overhead with spidering (especially with a
single file, where keep-alive doesn't do any good).
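
FWIW, the test_url hook you describe might look roughly like this (off the
top of my head and untested; the log path is only an example):

    # Wired up in the server hash as:  test_url => \&test_url,
    sub test_url {
        my ( $uri, $server ) = @_;
        if ( $uri->path =~ /\.pdf$/i ) {
            # Log the PDF's URL for a later indexing pass, then skip it here.
            open my $log, '>>', '/tmp/pdf_urls.log'
                or die "can't append to pdf log: $!";
            print $log $uri->canonical, "\n";
            close $log;
            return 0;    # false: don't fetch or index this URL
        }
        return 1;        # true: spider and index everything else
    }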

But if that comparison shows a big difference, then I'd start wondering
about where the time is going.  Maybe back off and see how long wget or
"GET" (which uses Perl's LWP, like the spider does) takes to fetch the pdf.
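
Since GET and the spider both go through LWP, you could also time a single
fetch from Perl itself.  A quick untested sketch (the URL is a placeholder):

    use LWP::UserAgent;
    use Time::HiRes qw(time);

    my $ua    = LWP::UserAgent->new;
    my $start = time;
    my $res   = $ua->get('http://localhost/docs/sample.pdf');
    printf "%s: %.2f seconds, %d bytes\n",
        $res->status_line, time - $start, length( $res->content );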

--
Bill Moseley
moseley@hank.org

Unsubscribe from or help with the swish-e list: 
   http://swish-e.org/Discussion/

Help with Swish-e:
   http://swish-e.org/current/docs

_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users

Received on Wed Jul 9 18:36:22 2008