Re: [swish-e] concurrent access to index files

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Fri May 01 2009 - 03:32:09 GMT

Judith Retief wrote on 4/30/09 1:00 AM:
> I have a question on multiple swish-e processes accessing the same index
> files.
> 
> We're running swish-e against a fileset of about 2 million XML articles. An
> indexing process checks for newly arrived articles about once a minute,
> indexing them into a separate file and then merging that into the master.
> (The indexer first merges to a temporary master, and then moves the temp file
> over the existing master). A fixed set of Reader processes are started up, a
> search Scheduler forwards user search requests to these Readers and they
> search through the master files. Requests build up in the Scheduler if the
> Reader processes are all busy.
> 
> I've assumed that I should be able to scale up concurrent searches by
> starting up more Readers. A search takes 1 to 2 seconds, so 20 concurrent
> user requests should be served in 30 seconds with one Reader, 15 seconds with
> two Readers and 1 to 2 seconds with 20 Readers.
> 
> Indeed everything scales nicely but only up to a point; after four Readers
> the swish searches start to take longer. The problem worsens the more Readers
> I start up.
> 
> The Indexer and Readers are written in TCL, invoking swish-e on the
> commandline. It seems as if the swish processes are locking each other out of
> the master files, but as they're all only doing read access that shouldn't be
> the case, should it? It could also be the Indexer process, but then it only
> locks the master index for the very short and intermittent time of the file
> move; I wouldn't expect our very consistent deterioration profile to be the
> result of that.
> 
> Does anyone have an idea of what could be causing the deteriorating
> performance?

You don't mention the version you are using, so I'll assume 2.4.x with the
native index format.

As I read the src in db_native.c, swish-e just does a fopen() on the index file
with no locking. So any locking is happening at your OS level.

You would get much better performance if you held the index open in a persistent
connection, instead of spawning shell processes with swish-e from your TCL app.
That's what the libswish-e C lib and SWISH::API Perl module allow for. See the
docs for those. See also the swished mod_perl app.

OTOH, if you are consistently swapping in a new index every few minutes, you'll
need to close those persistent connections periodically. SWISH::API::Stat (Perl,
on cpan.org) is one way of doing that.
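
A minimal sketch of the stat-and-reopen idea, in Python rather than Perl and
not taken from SWISH::API::Stat itself. The class name ReopeningHandle and its
opener argument are hypothetical; a plain file handle stands in for a real
index connection, and the sketch assumes the new index is moved over the old
one as described above.

```python
import os


class ReopeningHandle:
    """Keep one handle open and reopen it when the file on disk is replaced.

    Hypothetical helper for illustration only; it is not part of swish-e.
    """

    def __init__(self, path, opener=None):
        self.path = path
        # opener is any callable taking a path and returning an object with
        # a close() method. By default it opens the file for binary reading.
        self.opener = opener or (lambda p: open(p, "rb"))
        self.handle = None
        self.signature = None
        self.reopen_count = 0

    def _stat_signature(self):
        st = os.stat(self.path)
        # Moving a new file over the old one yields a new inode; mtime and
        # size also catch a file that was rewritten in place.
        return (st.st_ino, st.st_mtime_ns, st.st_size)

    def get(self):
        """Return a handle that refers to the file currently at self.path."""
        current = self._stat_signature()
        if self.handle is None or current != self.signature:
            if self.handle is not None:
                self.handle.close()
            # If the file is swapped between the stat above and this open,
            # the stored signature is stale and the next get() reopens again.
            self.handle = self.opener(self.path)
            self.signature = current
            self.reopen_count += 1
        return self.handle

    def close(self):
        if self.handle is not None:
            self.handle.close()
            self.handle = None
```

Each Reader would call get() before every search, so the cost of opening the
index is paid once per index swap instead of once per request.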

But those observations are about your architecture, not the particular behaviour
you are seeing. I would be looking at OS and hardware limitations first, just to
eliminate those. For example, are you hitting swap for some reason? Then I'd try your
architecture with something besides swish-e, like a simple file i/o script or
similar, just to see if the issue is particular to swish-e.
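
As a rough sketch of such a file i/o test, the Python below reads one file
from several workers at once and reports how long each read took. The function
names are made up for illustration, and it uses threads for simplicity;
separate processes would mirror the Reader setup more closely.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor


def read_whole_file(path, chunk_size=1 << 20):
    """Read the file sequentially; return (bytes_read, seconds_taken)."""
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


def time_concurrent_reads(path, readers):
    """Run `readers` simultaneous reads; return per-reader times in seconds."""
    expected = os.path.getsize(path)
    with ThreadPoolExecutor(max_workers=readers) as pool:
        results = list(pool.map(read_whole_file, [path] * readers))
    for nbytes, _ in results:
        if nbytes != expected:
            # The file was probably replaced or rewritten during the run.
            raise RuntimeError("short read: file changed during the test?")
    return [seconds for _, seconds in results]
```

Calling time_concurrent_reads() on the master index with 1, 2, 4, 8 and 16
readers and comparing the slowest time at each step would show whether plain
reads degrade in the same way. Note that after the first pass the file is
likely to be served from the OS page cache, so the numbers say more about
memory and CPU contention than about the disk.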

Really, though, I would be thinking about persistent index connections.

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Thu Apr 30 23:32:05 2009