RE: Merging vs. Spider

From: David VanHook <dvanhook(at)not-real.mshanken.com>
Date: Thu Oct 03 2002 - 19:36:33 GMT
Thank you both (Bill and Gerald) very much for the responses.  Here are some
answers to your questions:

>25 minutes seems a little long for indexing 20,000 static files, but
>perhaps they are large -- takes about 15 minutes on Apache.org to index
>something around 50,000 files, IIRC.

I guess I wasn't clear on that -- we've got two separate indexes (one
metaphone, one not), and it takes about 25 minutes total for both of them to
run, roughly 12 minutes each.

>Are you using a caching proxy?  Do you always need to flush the cache when
>you make an update to the site?  Or could you let the normal cache control
>headers control when the proxy updates its cache?

The caching is all handled by our content management system, Vignette.  The
first time any page on the site is called, if caching is turned on for that
template, Vignette creates the cached file and serves that cached version
from then on.  But as soon as any template-level change is made, the cache is
automatically cleared, and new dynamic versions are generated the first time
each page is called again.

I agree with you both that the spider seems to be the way to go --
unfortunately, I'm no Perl expert, so it'll take me a while to make it work.
I'll keep you all posted.

Thanks --

Dave VanHook

-----Original Message-----
From: Bill Moseley [mailto:moseley@hank.org]
Sent: Thursday, October 03, 2002 12:48 PM
To: dvanhook@mshanken.com; Multiple recipients of list
Subject: Re: [SWISH-E] Merging vs. Spider


At 09:04 AM 10/03/02 -0700, David VanHook wrote:
>So, I'm confronted with a couple of possible solutions, both of which have
>potential problems:
>
>1) Generate a FULL index of the entire site, all 20,000 pages.  Store that
>as the MAIN index.  Then run daily incremental indexes, using a timestamp,
>and merge each of these incremental updates with the main index, always
>keeping a copy of the main index as backup.  As long as we don't flush
>cache, the incremental updates will only contain files created since the
>main index, so there will be no overlap.
>
>Potential problem:  If we do a flush cache on the whole site, the daily
>incremental update is going to contain thousands of files -- both new files
>AND new versions of files already in the main index.  When we do a merge,
>will all of these items show up twice on search results, since they are in
>both indexes?  They'd have the same filenames, and otherwise be pretty much
>identical.

Merge compares the path names, and when duplicates are found only the one
with the newest date is kept.  I'm not a huge fan of merge -- it works much
better now (thank you Jose!), but I'd rather keep a local cache of updated
files and index those in one shot.
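
For reference, the incremental-plus-merge approach is just two command-line
runs.  The flags below are from memory, so double check them against the
docs for your version, and the index file names are only placeholders:

  # index only files newer than the date on index.main (filesystem method)
  swish-e -c swish.conf -S fs -N index.main -f index.daily

  # merge; for duplicate paths only the newest copy is kept
  swish-e -M index.main index.daily index.merged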

>2) Run the spider.pl on the entire site on a nightly basis, which would
>force our server to re-generate any pages deleted via a flush cache.

This would probably be my suggestion -- spider the caching server.  That
will let the caching server do its work of deciding which pages to return
from the cache and which to fetch from the server.
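
If you go that route, the usual way to hook spider.pl in is through the
"prog" document source.  Something like the following -- the file names are
just placeholders:

  # in swish.conf
  IndexDir            ./spider.pl
  SwishProgParameters ./SpiderConfig.pl

  # then run
  swish-e -c swish.conf -S prog -f index.main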

>Potential problem:  How long would it take spider.pl to index 20,000 files?
>Right now, using the filesystem, it takes about 25 minutes to run both our
>full indexes -- we've got two separate ones, one for Fuzzy searching, and
>one regular.

25 minutes seems a little long for indexing 20,000 static files, but
perhaps they are large -- takes about 15 minutes on Apache.org to index
something around 50,000 files, IIRC.


>Would the spider take 2 or 3 times that, or would it take 20
>or 30 times that?

Make sure you have a current installation of the LWP Perl libraries and use
the keep_alive feature (if your server supports it) -- it will save on
connection time and on the number of processes needed to handle the requests
(say, if you are running a forking server).
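
For what it's worth, keep_alive is just a flag in the per-server hash that
spider.pl reads from its config file (it gets passed along to
LWP::UserAgent).  A minimal SpiderConfig.pl might look like this -- the
option names follow the stock SwishSpiderConfig.pl, so double check them
against your version, and the host/email are placeholders:

  # SpiderConfig.pl -- minimal sketch
  @servers = (
      {
          base_url   => 'http://www.example.com/',
          email      => 'you@example.com',
          keep_alive => 1,   # reuse HTTP connections (needs a recent LWP)
          debug      => 0,
      },
  );
  1;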

Are you using a caching proxy?  Do you always need to flush the cache when
you make an update to the site?  Or could you let the normal cache control
headers control when the proxy updates its cache?

The basic problem is that the docs are dynamically generated, so you have to
fetch them from a web server at some point.  But I suppose there could be
some optimizations:

For example, if your web space (i.e. the cache) mirrors the file system,
then use the file system for "spidering", and only ask the web server for a
link when the file doesn't exist on disk.  Spider.pl could do this with
minor changes: in test_url() you first rewrite the URL into a path and check
for the file on disk; if it's there, use that, and if not, let the spider
fetch it from the server.
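
To make that concrete, a rough sketch of such a test_url callback (inside
the server hash above) might look like this.  It is not what the stock
spider.pl does -- hence the "minor changes" -- the cache root is
hypothetical, and here cached files are simply skipped so that a separate
-S fs pass can index them:

  test_url => sub {
      my $uri  = shift;                     # spider.pl passes a URI object
      my $path = '/path/to/cache' . $uri->path;   # hypothetical cache root
      $path   .= 'index.html' if $path =~ m{/$};  # guess for directory URLs

      return 0 if -f $path;   # already on disk -- skip, index with -S fs
      return 1;               # not cached -- let the spider fetch it
  },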

Sorry, I'm not offering much help.   Please report back with whatever you
come up with, ok?


--
Bill Moseley
mailto:moseley@hank.org
Received on Thu Oct 3 19:40:38 2002