Definitely use SWISH::API; you'll notice a big speed increase. Beware of
caching the SWISH::API object while refreshing indices, though. It seems
that swish caches the old index, so I get old results even though the old
index file is gone and has been replaced by a new one. What I should be
doing is a stat() on the index file before every query to check whether
it's more recent than when I created the swish object, and reloading it
if it is. Instead I'm lazy and just create a new swish object every time
I do a query, and it's still very fast.
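
The stat()-before-query check described above could look roughly like the following. This is only a sketch in Python, not SWISH::API code: the `ReloadingIndex` class and the `opener` callback are hypothetical names standing in for whatever actually opens your index handle.

```python
import os
from typing import Any, Callable, Optional, Tuple


class ReloadingIndex:
    """Cache an index handle and reopen it when the index file changes."""

    def __init__(self, path: str, opener: Callable[[str], Any]) -> None:
        self.path = path
        self.opener = opener  # hypothetical: whatever opens the index
        self._handle: Any = None
        self._stamp: Optional[Tuple[int, int, int]] = None

    def _current_stamp(self) -> Tuple[int, int, int]:
        # mtime alone can miss a fast replace, so also compare inode and size.
        st = os.stat(self.path)
        return (st.st_mtime_ns, st.st_ino, st.st_size)

    def handle(self) -> Any:
        # Stat first, then open: if the file is swapped in between, the
        # recorded stamp is stale and the next call reopens again.
        stamp = self._current_stamp()
        if self._handle is None or stamp != self._stamp:
            self._handle = self.opener(self.path)
            self._stamp = stamp
        return self._handle
```

Each query would call `handle()` instead of holding on to the object directly, so the cost when nothing has changed is a single stat().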

And instead of incremental indexing, I use daily indices and just keep
refreshing today's index. Then at the end of the day I 'roll over' the
indices: today becomes yesterday, yesterday becomes the day before
that, and so on. Today then gets a fresh index, which is rebuilt every
few hours until the next rollover.
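
A minimal sketch of such a rollover, assuming the daily indices live in one directory as `day.0.index` (today), `day.1.index` (yesterday), and so on. The naming scheme, the `rollover` function and the `keep_days` cutoff are all made up for illustration. Deleting the oldest slot is also an assumption; you may prefer to merge it into an archive first.

```python
import os


def rollover(index_dir: str, keep_days: int = 30, stem: str = "day") -> None:
    """Shift day.N to day.N+1 for every existing daily index."""

    def path(n: int) -> str:
        return os.path.join(index_dir, f"{stem}.{n}.index")

    # Free the oldest slot so the shift below never overwrites anything.
    oldest = path(keep_days - 1)
    if os.path.exists(oldest):
        os.remove(oldest)

    # Walk from oldest to newest so each rename lands in an empty slot.
    for n in range(keep_days - 2, -1, -1):
        if os.path.exists(path(n)):
            os.rename(path(n), path(n + 1))
    # day.0 no longer exists here; the next indexing run recreates it.
```

If your swish-e version writes a companion property file next to each index, it would need to be renamed in the same way.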

Then when you search, you pass swish the list of indices for the days
you want to search. If you're going beyond 30 days, it'll get a little
slow doing merges. So you probably want to do a weekly merge of
everything past, say, 15 days and use that for searches going back to
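
To illustrate picking the index list for a search window, here is a small Python sketch. The file names (`day.N.index`, `archive.index`), the `indices_for` function and the 15-day cutoff are assumptions for the example, not anything swish-e prescribes.

```python
from typing import List


def indices_for(days_back: int, daily_limit: int = 15,
                archive: str = "archive.index") -> List[str]:
    """Return the index files needed to cover the last `days_back` days."""
    if days_back < 1:
        raise ValueError("days_back must be at least 1")
    # Recent days are searched through their individual daily indices.
    n_daily = min(days_back, daily_limit)
    names = [f"day.{n}.index" for n in range(n_daily)]
    # Anything older is covered by the single merged archive index.
    if days_back > daily_limit:
        names.append(archive)
    return names


print(" ".join(indices_for(3)))  # day.0.index day.1.index day.2.index
```

Joined with spaces, the result is in the form that the SWISH::API constructor accepts for multiple index files.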

The nice thing about this system is that when you have clustered servers
that need copies of the swish indices, you're only copying today's data out
to the servers every few hours. Then at the end of the day you run a
remote command on the servers and tell them to roll over their indices.
So it's quite bandwidth efficient.

On Thu, 2005-02-17 at 02:37 -0800, Markus Peter wrote:
> I've been using swish-e for over a year now with good results so far, but
> recently, two questions arose:
> I currently use several swish-e based search tools from a mod_perl
> application, with several index files with up to 100MB index size.
> The speed requirements for our application are very high - searches need
> to complete in < 1 second, but recently the times got as high as 3.5
> As we're still using the old SWISH.pm with an external swish-e binary, I
> suppose I could speed up the searches by using SWISH::API.
> Now, my question is:
> I guess the major speedup SWISH::API allows is to keep the index file
> open between searches, so it needn't be reopened and reparsed for every
> request. How would I best use it, especially in the context of
> Apache 1 and mod_perl, where Apache forks new children? Can I already
> open the index files in my Apache mod_perl startup script (=before the
> fork of the children) and it will automatically do the right thing, or
> should I write Apache child startup handlers, so that they are opened
> immediately after an Apache child has been forked? How much memory does
> an index of that size, which is kept open, consume, by the way, and how
> much of that memory is shared between Apache processes?
> The other question I have is regarding incremental mode. So far I've
> been using the traditional mode with cron jobs to update once or twice a
> day, but I'd really like to convert the search to be "real time". How
> stable is incremental mode? And "how incremental" is it? Can I use it
> to add/modify/remove documents from the search index on the fly, as they
> are added/modified, or is it rather targeted at batch processing a larger
> number of updates (=merely a better merge)?
Received on Thu Feb 17 12:22:18 2005