
Re: Large Merge Consuming too many resources

From: Peter Karman <karman(at)not-real.cray.com>
Date: Mon Aug 30 2004 - 13:01:08 GMT
These are interesting admin questions.


Tac wrote on 8/30/04 6:16 AM:

> We've been running the indexing process on our production server (because
> it's fast, same box as the database, etc.), but at times it consumes so many
> resources it can bring our site temporarily down.  I'm trying to figure out
> a way to let other processes have some time slices.  Two questions:
>  
> (1) Does swish-e index a record coming in via -S prog as it gets the record,
> or does it wait until all records have been retrieved?  If it indexes it as
> it gets it, I can add a sleep() command every few thousand records.
>  

My understanding is that swish writes *properties* to disk for each doc 
as it parses. The main word index is not written per doc; instead, the 
RAM-held data is flushed periodically and written to disk. The bulk of 
the word-index writing happens at the end of indexing. You can vary this 
behavior with the -e option.

You can watch this happen (under *nix OSes anyway) with the 'watch' 
command while swish-e is indexing. Just do a:

watch -n 1 'ls -l yourindex*'

in the directory where the indexes are being written. You can watch the 
sizes change in 'real' time. If running with -e, try:

watch -n 1 'ls -l yourindex* swtmp*'

though NOTE that the swtmp files can number in the hundreds, and your 
terminal window probably won't show that many.

whether your sleep() idea will work, I can't say. Try it and let us know. :)
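For what it's worth, a -S prog feeder with a periodic sleep could be 
sketched like this. This is a minimal Python sketch; the record source 
and the every-2000-records threshold are made up for illustration, but 
the Path-Name / Content-Length header framing is what -S prog expects:

```python
import sys
import time

def emit_doc(path, content, out=sys.stdout):
    # One document in -S prog format: headers, a blank
    # line, then the document body. Content-Length is the
    # byte length of the body.
    body = content.encode("utf-8")
    out.write("Path-Name: %s\n" % path)
    out.write("Content-Length: %d\n" % len(body))
    out.write("\n")
    out.write(content)

def feed(records, sleep_every=2000, sleep_secs=1.0):
    # Emit records one at a time, pausing every so often so
    # other processes on the box get some time slices.
    for n, (path, content) in enumerate(records, 1):
        emit_doc(path, content)
        if n % sleep_every == 0:
            time.sleep(sleep_secs)

if __name__ == "__main__":
    # Placeholder record source; in practice this would pull
    # rows from the database.
    fake = [("/db/row%d.html" % i, "<html>row %d</html>" % i)
            for i in range(5)]
    feed(fake)
```

Whether the sleeps actually help depends on whether swish does real work 
per record as it reads (see above); if not, you're only slowing the feed.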


> (2) The major problem is not during the individual collection indexing, but
> during the 150+ index merge.  (Merging them all into 1 makes searches
> against all the collections MUCH faster).  I don't think the bottleneck is
> CPU, I think it's disk access, or maybe something else (file handles?). But
> it's so intensive that even the MySQL box on that same quad-processor
> machine locks up threads, unable to process any queries.   Any ideas for how
> to get around this?  We're looking at moving the indexing to another  box,
> but that requires getting the data from across the network, indexing and
> merging, then ftp'ing the indices back, which seems like a lot of work.  If
> there's a way to not have swish-e consume all the resources it'd be much
> easier to keep it where it is.
>  

Here we get to the heart of the matter, IMHO. My guess with the merge 
problem is that all the word data is being read into RAM at once. Jose 
or Bill could answer that more intelligently. Have you tried -e with 
merging?
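If -e does help, the merge invocation just adds the flag; something 
along these lines (the index names here are placeholders):

swish-e -e -M coll1.index coll2.index coll3.index merged.index

(-M merges the listed indexes, with the last filename being the output.)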

If the main problem is merging all those indexes and there's not a clean 
technical way (either improving the code or documentation) of making 
swish work under your current configuration, you have some workaround 
options:

1. Try merging into several smaller sub-indexes rather than one huge 
one. In my experience you can search fairly effectively across half a 
dozen or so indexes. This can also help with maintenance if some of the 
sub-indexes change less often than others.
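Searching across a handful of sub-indexes is just a matter of listing 
them all with -f (the index names here are placeholders):

swish-e -w 'your query' -f news.index docs.index forums.index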

2. Since the indexes are OS-independent, consider using NFS (or some 
other network filesystem) to create them on a different box and then 
make them available on your production machine. I do that here.

3. Consider being a bleeding-edge tester of Josh's new SWISHED remote 
server project. I'm sure he'd love the bug testing your environment 
could provide. Search the email archive for the URL. It runs under 
Apache and mod_perl. I haven't had time to test it myself yet, but I 
have installed it and it looks very promising.

Let us know what you do. Someone ought to be compiling a "SWISH-E 
CookBook" for cases like these.
-- 
Peter Karman  651-605-9009  karman@cray.com
Received on Mon Aug 30 06:01:25 2004