Re: Large Merge Consuming too many resources

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Aug 30 2004 - 12:51:36 GMT
On Mon, Aug 30, 2004 at 04:15:38AM -0700, Tac wrote:

What does the machine show?  Are you running out of memory?  Have you
watched, for example, vmstat's output?  Swish eats both memory and
CPU.  When running with -e (which I think it does automatically when
merging -- need to check that.  Jose?) it frees up memory, but then
there's quite a bit of disk I/O.
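
For example, something like this while the indexer runs should show
where the pressure is (just a sketch; adjust the interval to taste):

    # Print memory, swap, and I/O stats every 5 seconds.  A climbing
    # si/so (swap in/out) column means you're out of RAM; heavy bi/bo
    # with mostly idle CPU means the bottleneck is disk.
    vmstat 5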

Unfortunately, I think this is a common problem when indexing large
collections.

> (1) Does swish-e index a record coming in via -S prog as it gets the record,
> or does it wait until all records have been retrieved?  If it indexes it as
> it gets it, I can add a sleep() command every few thousand records.

I think nice(1) would be a better way of controlling that.  But, yes,
in general swish gets files a record at a time.  You don't have to run
the -S prog program and swish at the same time.  For example, I do this
when spidering:

    # First spider the sites to a local file
    SPIDER_QUIET=1 ./spider.pl spiderconfig.pl | gzip > all.gz || exit

    # Now index with no output (-v0) and send errors to stderr (-E)
    # using -e will trade RAM for disk space (and inodes) while indexing.

    gzip -dc all.gz | ./swish-e -v0 -E -c swish.conf -S prog -i stdin

gzip may or may not be good in your case -- it can reduce some disk
access at the cost of CPU.
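
If you do want to throttle swish itself, running it at a low priority
is easy enough -- something like this (the same command as above, just
wrapped in nice; it won't help if the real bottleneck is disk I/O):

    # Give the indexer the lowest CPU priority so other processes
    # on the box (MySQL, etc.) get scheduled first.
    gzip -dc all.gz | nice -n 19 ./swish-e -v0 -E -c swish.conf -S prog -i stdin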


> (2) The major problem is not during the individual collection indexing, but
> during the 150+ index merge.  (Merging them all into 1 makes searches
> against all the collections MUCH faster).  I don't think the bottleneck is
> CPU, I think it's disk access, or maybe something else (file handles?). But
> it's so intensive that even the MySQL box on that same quad-processor
> machine locks up threads, unable to process any queries.   Any ideas for how
> to get around this?  We're looking at moving the indexing to another  box,
> but that requires getting the data from across the network, indexing and
> merging, then ftp'ing the indices back, which seems like a lot of work.  If
> there's a way to not have swish-e consume all the resources it'd be much
> easier to keep it where it is.

You are merging all 150 indexes at once?  I doubt it will help, but
have you tried doing it in smaller batches?  Or merge in smaller
batches down to two or four indexes that you can search together?
I'm sure searching 150 indexes would be slow, but just a few might
be ok.
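
If I'm remembering the merge syntax right (-M takes the input indexes
followed by the output index), batching might look something like this
-- the index names here are made up:

    # Merge a handful of indexes at a time instead of all 150 at once.
    # With -M the last filename given is the output index.
    ./swish-e -M index.001 index.002 index.003 index.004 batch-a.index
    ./swish-e -M index.005 index.006 index.007 index.008 batch-b.index

    # ...and so on.  Then either search the batch indexes together with
    # -f, or merge the batches into one final index:
    ./swish-e -M batch-a.index batch-b.index index.all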

Or maybe you could rsync or somehow replicate your data onto another
machine over time so you don't need to index the data over the network
when it comes time to reindex.
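
Something along these lines (host and path names are made up) would
keep a local copy fresh without a big hit at reindex time:

    # Pull only the changed files from the source box each night, then
    # index the local copy instead of going over the network.
    rsync -az --delete dbhost:/export/collections/ /var/spool/collections/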

-- 
Bill Moseley
moseley@hank.org

Unsubscribe from or help with the swish-e list: 
   http://swish-e.org/Discussion/

Help with Swish-e:
   http://swish-e.org/current/docs
   swish-e@sunsite.berkeley.edu
Received on Mon Aug 30 05:51:52 2004