Re: merging indexes with stop words

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Apr 29 2002 - 19:08:35 GMT
At 04:16 AM 04/29/02 -0700, Eric Lease Morgan wrote:
>Again, thank you for the prompt reply. All the indexes I am trying to 
>merge use the same stop word list.

Can you send me a few small files and your config setup that demonstrate
the problem?  I can't duplicate it here.


>In the system I am creating, there will be potentially hundreds of 
>indexes to search if I don't merge them together. I find it hard to 
>believe that opening and closing this many files would be preferable to 
>merging them together into a single file.

The suggestion was to use fewer indexes.  The point being that swish is
much faster at indexing now, so indexing thousands of files isn't much of
an issue, even if you need to do it frequently.

In other words, the time and memory required to merge hundreds of indexes
is probably a lot more than just indexing the files all at the same time.

  23839 files indexed.  177636357 total bytes.  19739042 total words.
  Elapsed time: 00:01:10 CPU time: 00:00:59

OK, so that's almost 24K files in a little over a minute.
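
To put that summary in per-second terms, here is a small Python sketch. The figures are copied from the output above; the variable names are only for illustration.

```python
# Figures copied from the indexing summary above.
files = 23839
total_bytes = 177636357
total_words = 19739042
elapsed_s = 1 * 60 + 10  # "Elapsed time: 00:01:10"

files_per_s = files / elapsed_s
mb_per_s = total_bytes / elapsed_s / 1_000_000  # decimal megabytes
words_per_s = total_words / elapsed_s

print(f"{files_per_s:.0f} files/s, {mb_per_s:.2f} MB/s, {words_per_s:.0f} words/s")
```

That works out to roughly 340 files and 2.5 MB of input per second on this machine.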

You are saying you need hundreds of indexes.  I assume that means you
either have a huge number of files or a quickly changing set of files and
you are trying to use swish for incremental indexing.  In either case,
swish may be the wrong application at this time.


Unfortunately, merge just doesn't work that well.  It's not just a matter
of joining two indexes together.  Merge avoids reading and parsing the
files again, but it does not have the memory optimizations of regular
indexing.  It's not a net gain.

The above index of 24K files was created in about a minute and took about
70MB of RAM.

Here's merging that 24K file index with a small 13 file index.

  ./swish-e -M index.swish-e index.1 index.out
  Input index 'index.swish-e' has 23839 files and 272200 words
  Input index 'index.1' has 13 files and 2549 words
  ...
  23852 files indexed.  0 total bytes.  19777250 total words.
  Elapsed time: 00:02:46 CPU time: 00:02:45
  Indexing done!

Not only did it take a lot longer, it took almost 270MB of RAM instead of
70MB.  Hard to imagine that those extra 13 files needed 200MB and 1 1/2
minutes.  So merge is nowhere near as optimized as normal indexing.

Here's merging two 24K file indexes:

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
 3794 moseley   16   0  673M 462M 84628 D     1.9 92.1  16:39 swish-e

 47678 files indexed.  0 total bytes.  39478084 total words.
 Elapsed time: 00:18:59 CPU time: 00:16:47

So, compared with indexing one set of 24K files, that's almost 10 times the
RAM and about 16 times the elapsed time for twice the files.
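
The ratios can be checked with a few lines of Python. This is only a sketch over the numbers reported above: the memory figures are the rounded values from the text (and the SIZE column from top for the last run), and the helper name `to_seconds` is made up for the example.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' elapsed-time string to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

# (elapsed time, approximate memory in MB) as reported above.
runs = {
    "index 24K files": ("00:01:10", 70),
    "merge 24K + 13 files": ("00:02:46", 270),
    "merge 24K + 24K files": ("00:18:59", 673),  # SIZE column from top
}

# Use the plain indexing run as the baseline for both ratios.
base_time = to_seconds(runs["index 24K files"][0])
base_mem = runs["index 24K files"][1]

ratios = {}
for name, (elapsed, mem_mb) in runs.items():
    ratios[name] = (to_seconds(elapsed) / base_time, mem_mb / base_mem)
    print(f"{name}: {ratios[name][0]:.1f}x time, {ratios[name][1]:.1f}x memory")
```

Both merges come out well above the 1.0x baseline of the plain indexing run, in time and in memory.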


But searching multiple indexes is not a very good answer either:

Here's searching 100 indexes:

 ./swish-e -w '"major rewrite"' -f `ls index* | grep -v prop` -m 1
# SWISH format: 2.1-dev-25
# Search words: "major rewrite"
# Number of hits: 100
# Search time: 0.009 seconds
# Run time: 1.846 seconds
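
If the shell is the limiting factor, the same command line can be assembled from a script. A minimal Python sketch, assuming the index files sit in one directory and follow the `index*` naming used above; the function name `search_command` is hypothetical, and only the `-w`, `-f` and `-m` switches from the command above are used.

```python
import glob
import os

def search_command(index_dir, query, max_results=1):
    """Build the argv list for one swish-e search over every index in index_dir."""
    # Same selection as `ls index* | grep -v prop`: names starting with
    # "index", minus anything containing "prop" (the property files).
    paths = sorted(glob.glob(os.path.join(index_dir, "index*")))
    index_files = [p for p in paths if "prop" not in os.path.basename(p)]
    # -w, -f and -m are the switches used in the command above.
    return ["swish-e", "-w", query, "-f", *index_files, "-m", str(max_results)]

cmd = search_command(".", '"major rewrite"')
print(len(cmd) - 6, "index files on the command line")
```

The list can be handed to `subprocess.run(cmd)` as is, which avoids shell quoting; the operating system's own limit on argument length still applies.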

And here's searching 500 indexes.  I doubt command.com could take 500 files
on the command line ;)


 ./swish-e -w '"major rewrite"' -f `ls index* | grep -v prop` -m 1
# SWISH format: 2.1-dev-25
# Search words: "major rewrite"
# Number of hits: 501
# Search time: 2.683 seconds
# Run time: 13.012 seconds


That's way too slow.  Swish works best on a medium-sized collection of
files that doesn't change very fast.
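
For a rough per-index cost, the two search runs above can be compared in a few lines of Python. This is a sketch over the reported numbers only, taking the stated 100 and 500 indexes; reading "run time minus search time" as time spent outside the search itself is an assumption about what those two figures cover.

```python
# (number of indexes, search time in s, run time in s) from the two runs above.
runs = [(100, 0.009, 1.846), (500, 2.683, 13.012)]

per_index_ms = []
for n_indexes, search_s, run_s in runs:
    overhead_s = run_s - search_s  # assumed: time not spent in the search itself
    per_index_ms.append(run_s / n_indexes * 1000)
    print(f"{n_indexes} indexes: {per_index_ms[-1]:.1f} ms per index, "
          f"{overhead_s:.3f} s outside the search")
```

The cost per index goes up rather than down as indexes are added, so the total run time grows faster than the number of indexes.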

It would be nice to optimize merge, but it's a higher priority to get
incremental indexing to work.


-- 
Bill Moseley
mailto:moseley@hank.org
Received on Mon Apr 29 19:08:47 2002