Re: Indexing performances, multi millions words

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Dec 27 2001 - 00:56:59 GMT
At 04:05 PM 12/26/01 -0800, Jean-François PIÉRONNE wrote:
>> But, 4.5 million unique "words"?  That's a lot of words.  Are you really
>> going to search those words?
>> 
>
>The files are source listings (OpenVMS source listings) which contain a lot
>of numbers in decimal, hexadecimal and C hex format (0x format), but I
>haven't found how to not index the two hex formats (for the first format I
>have defined IGNOREALLN to 1 in config.h)

There's a new config setting that's in CVS -- I'm not sure how long ago I
checked it in.  I'm not even sure if it will stay in swish.

   IgnoreNumberChars

For example, if you set it like:

   IgnoreNumberChars 0123456789abcdef

It won't index hex numbers.  Of course it won't index the word "bad"
either.  So you can see it has limited use for hex.  It's really more
designed for

   IgnoreNumberChars 0123456789,.

But, I'm still not clear if it's worth keeping in the source.
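
To make the "bad" problem concrete, here is a rough Python model of the rule as I understand it: a word is skipped when every one of its characters is in the IgnoreNumberChars set. The function name and the lowercasing step are assumptions made for illustration, not swish's actual code.

```python
def is_ignored(word, number_chars):
    """Toy model: True if `word` is made up only of characters in `number_chars`.

    Lowercasing here is an assumption made for the illustration.
    """
    return bool(word) and all(ch in number_chars for ch in word.lower())


HEX_CHARS = "0123456789abcdef"   # the hex setting from above
DECIMAL_CHARS = "0123456789,."   # the setting it was really designed for

if __name__ == "__main__":
    for word in ["deadbeef", "1234", "bad", "index"]:
        print(word, is_ignored(word, HEX_CHARS))
```

With the hex set, "bad" is dropped along with the real hex numbers; with the decimal set it is kept.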

Another approach would be to use -S prog and simply use regex matching to
remove all the hex numbers from the source before sending to swish.
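
Something along these lines might do as a starting point for such a filter. It is only a sketch in Python: the pattern (C-style 0x numbers, plus bare hex runs of four or more characters containing at least one digit) is a guess at what the listings look like, taking file names from the command line is an assumption, and the Path-Name / Content-Length / Last-Mtime headers are the -S prog input format as I understand it, so check them against the docs for your version.

```python
import os
import re
import sys

# Assumed pattern: C-style hex (0x1A2B), or a bare run of 4+ hex characters
# containing at least one decimal digit, so plain words such as "bad" or
# "decade" are left alone. Plain decimal runs of 4+ digits are removed too.
HEX_RE = re.compile(
    r"\b(?:0[xX][0-9a-fA-F]+"
    r"|(?=[a-fA-F]*[0-9])[0-9a-fA-F]{4,})\b"
)


def strip_hex(text):
    """Replace every match of HEX_RE with a single space."""
    return HEX_RE.sub(" ", text)


def format_document(path_name, text, mtime=None):
    """Build one document (headers, blank line, body) as bytes."""
    body = strip_hex(text).encode("latin-1", errors="replace")
    headers = [f"Path-Name: {path_name}", f"Content-Length: {len(body)}"]
    if mtime is not None:
        headers.append(f"Last-Mtime: {int(mtime)}")
    return ("\n".join(headers) + "\n\n").encode("latin-1") + body


if __name__ == "__main__":
    # Hypothetical usage: the files to index are passed as arguments.
    for path in sys.argv[1:]:
        with open(path, encoding="latin-1") as fh:
            text = fh.read()
        doc = format_document(path, text, os.path.getmtime(path))
        sys.stdout.buffer.write(doc)
```

Content-Length is computed on the filtered bytes, not on the original file, since that is what swish actually reads.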

Regarding -e and not -e:  If your OS is caching the temporary file in RAM
anyway, you might as well run without -e.  You can imagine that -e with your
index size would really work the disk drive.  During the "Writing Word
Data" step, -e has to seek all over the place to collect each word's data
(word position data) from the different documents together.  Without -e, the
words are just linked together in memory while indexing.
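
A toy model of the difference, in Python. This is NOT swish's actual data layout (the record format, the back-link scheme and all the names are made up for illustration); it only shows why collecting one word's positions from a sequentially written file costs a seek per occurrence, while the in-memory version just appends to a list.

```python
import io
import struct
from collections import defaultdict

# Hypothetical record: doc id, position, and (offset + 1) of the previous
# record for the same word, with 0 meaning "no previous record".
RECORD = struct.Struct("<III")


def index_in_memory(docs):
    """Without -e (toy model): each word's positions are linked in RAM."""
    postings = defaultdict(list)
    for doc_id, words in enumerate(docs):
        for pos, word in enumerate(words):
            postings[word].append((doc_id, pos))
    return dict(postings)


def index_economy(docs, tmp):
    """With -e (toy model): write records sequentially to a temp file and
    remember only the offset of the last record seen for each word."""
    last = {}
    for doc_id, words in enumerate(docs):
        for pos, word in enumerate(words):
            offset = tmp.tell()
            tmp.write(RECORD.pack(doc_id, pos, last.get(word, -1) + 1))
            last[word] = offset
    return last


def collect(word, last, tmp):
    """Follow the back-links for one word: one seek per occurrence."""
    found, seeks = [], 0
    link = last.get(word, -1) + 1
    while link:
        tmp.seek(link - 1)
        seeks += 1
        doc_id, pos, link = RECORD.unpack(tmp.read(RECORD.size))
        found.append((doc_id, pos))
    found.reverse()
    return found, seeks


if __name__ == "__main__":
    docs = [["foo", "bar"], ["bar", "foo", "bar"]]
    tmp = io.BytesIO()  # stands in for the temporary file on disk
    last = index_economy(docs, tmp)
    print(index_in_memory(docs)["bar"])
    print(collect("bar", last, tmp))
```

With millions of unique words scattered over many documents, those seeks are what work the disk.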

I'm also not clear whether there's any optimization (or guessing) done to
avoid reallocating memory too often when there is a very large number of
unique words.

>With '-e' switch:
>The process, until it reached the "Writing word data:" point, took 30' CPU
>and used 350 MB of memory, with no paging or swapping.


>Without '-e' switch:
>The process, until it reached the "Writing word data:" point, took 38' CPU
>and used 350 MB of memory, generating more than 10 M page faults.

Page faults meaning that it accessed memory that had been swapped out to disk?

Does that mean you really don't have enough RAM for indexing?  Or do you
have some memory limits that force swapping to disk?


>So the '-e' switch seems to make the "Writing word data:" step very costly.

I think -e is very costly.  While indexing, the word position data is
written sequentially out to disk, so there's a lot of disk I/O reading it
back in.  I think the point of -e is for people who don't have enough
memory, where indexing would otherwise make the machine swap to death.

Keep in mind that indexing is MUCH faster and more memory efficient than in
1.3 (or 2.0.x).  I mention this often, but my /usr/doc of 25,000 files
indexes in about 4 minutes (on my 128M system).  Not too long ago that was
15 minutes.  And before that it was about three hours (with swapping).  An
index that took nine hours on sunsite now takes about 15 minutes, if I
remember correctly.

Swish doesn't get pushed into the millions of files, or millions of words
too often, so these speed issues don't show up much.  But, it's a good time
to try to optimize even more.  So, if you can find anything that improves
the indexing, then that's great.

BTW -- Jose has tested using a btree-type of indexing scheme.  I'm not sure
of its current state, but you might (for fun) try indexing with:

#define USE_BTREE

in config.h.  I don't think you can use -e with btree.




Bill Moseley
mailto:moseley@hank.org
Received on Thu Dec 27 00:58:25 2001