
Merging vs. Spider

From: David VanHook <dvanhook(at)mshanken.com>
Date: Thu Oct 03 2002 - 16:04:48 GMT
Hello -- we're about to implement SWISH on our website, and we're running
into some tricky questions about indexing via the filesystem versus
spidering.  The problem isn't with SWISH (which is a fantastic program);
it's figuring out how to let SWISH do its thing properly.

The tricky part comes from the fact that our site is dynamically generated.
Each page is built on request and then cached on our server.  So I was
originally thinking that I'd just have SWISH index those cached files,
through the filesystem, every night at midnight.  There are about 20,000
files total on our site.
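
For reference, here's roughly the kind of nightly setup I had in mind --
all of the paths and file names below are just placeholders for our own
layout:

    # swish.conf -- filesystem (-S fs) indexing of the cached pages
    IndexDir  /www/cache              # root of the cached HTML files
    IndexOnly .html .htm              # only pick up the cached pages
    IndexFile /www/index/site.index   # where the index gets written

    # crontab entry: rebuild the index every night at midnight
    0 0 * * *  swish-e -c /www/conf/swish.conf -S fs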

The problem comes in when we make any sitewide changes, which we do with
some frequency.  At that point, all those cached pages get deleted, and are
only recreated when a user requests them.  So if SWISH ran at midnight on a
night after we'd made a sitewide change, it would probably only index about
4,000 pages -- because the other pages wouldn't have been requested by users
during that day.

So, I'm confronted with a couple of possible solutions, both of which have
potential problems:

1) Generate a FULL index of the entire site, all 20,000 pages.  Store that
as the MAIN index.  Then run daily incremental indexes, using a timestamp,
and merge each of these incremental updates with the main index, always
keeping a copy of the main index as backup.  As long as we don't flush
cache, the incremental updates will only contain files created since the
main index, so there will be no overlap.

Potential problem:  If we do a flush cache on the whole site, the daily
incremental update is going to contain thousands of files -- both new files
AND new versions of files already in the main index.  When we do a merge,
will all of these items show up twice on search results, since they are in
both indexes?  They'd have the same filenames, and otherwise be pretty much
identical.
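
In case it makes the question clearer, this is the sort of nightly job I'm
picturing for option 1 -- the timestamp file and index names are made up
for the example, and the merge behavior is exactly the part I'm unsure
about:

    # One-time (or post-flush) full index of all 20,000 cached pages
    swish-e -c swish.conf -f main.index

    # Nightly incremental: index only files newer than the last run,
    # then merge the result back into the main index
    swish-e -c swish.conf -N /www/index/last-run.stamp -f daily.index
    touch /www/index/last-run.stamp
    cp main.index main.index.bak
    swish-e -M main.index daily.index merged.index
    mv merged.index main.index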

2) Run spider.pl over the entire site on a nightly basis, which would
force our server to re-generate any pages deleted by a cache flush.

Potential problem:  How long would it take spider.pl to index 20,000 files?
Right now, using the filesystem, it takes about 25 minutes to run both our
full indexes -- we've got two separate ones, one for Fuzzy searching, and
one regular.  Would the spider take 2 or 3 times that, or would it take 20
or 30 times that?
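
And for option 2, here's roughly the spider setup I've been sketching
out -- the URL and limits are placeholders, and I'd double-check the
option names against the spider.pl documentation for the version we're
running:

    # SwishSpiderConfig.pl -- minimal spider configuration
    @servers = (
        {
            base_url   => 'http://www.example.com/', # placeholder for our site
            email      => 'dvanhook@mshanken.com',
            agent      => 'swish-e spider',
            keep_alive => 1,      # reuse connections to speed up the crawl
            max_files  => 25000,  # safety cap a bit above our 20,000 pages
        },
    );
    1;

    # swish.conf additions for the prog method
    IndexDir            ./spider.pl
    SwishProgParameters SwishSpiderConfig.pl

    # Nightly run
    swish-e -c swish.conf -S prog -f main.index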

Any help or advice from anybody else trying to do this on a dynamic system
would be greatly appreciated.  I love SWISH, and we're almost ready to go
live with this, if we could just figure this little bit out.

Thanks very much.

David VanHook
dvanhook@mshanken.com