Re: Indexing performances, multi millions words

From: Jean-François PIÉRONNE <jfp(at)not-real.altavista.net>
Date: Thu Dec 27 2001 - 00:05:40 GMT
> 
> At 02:53 AM 12/26/01 -0800, Jean-François PIÉRONNE wrote:
> >indexing large documents (more than 11000 files, 1.2 GB and nearly 4.5 M
> >words), I have noticed that the indexing time can be heavily reduced when
> >I increase the three "#define" values HASHSIZE, BIGHASHSIZE and
> >SEARCHHASHSIZE.
> >
> >I don't know which of the three is the most significant, but the indexing
> >time dropped from 6 hours to less than 2 hours, and these 2 hours are
> >mostly CPU bound.
> 
> Jose will need to comment on those settings.  I've played with them a
> little but didn't see much change.  What specific settings did you use?
> 
> But if you really have 4.5 million words, then maybe increasing the hash
> size would help.  A larger hash index would mean less stepping through
> words one-by-one with the same hash value.  There may be some other reasons
> to use higher values - going from six hours to two hours makes me think you
> went from swapping to not swapping.  Was the machine load (and memory
> demand) the same for each run?  That is, were there other programs
> demanding memory on the six hour run?
> 

The only difference between the two runs is the CPU time consumed. With such a
large number of words there are a lot of collisions in the hash table. What
demonstrates the problem is that the first files are scanned very quickly, and
then the process gets slower and slower each time it parses a new file.
I have set the three hash sizes (HASHSIZE, BIGHASHSIZE and SEARCHHASHSIZE) to
10007, 100003 and 1000003 respectively.
> But, 4.5 million unique "words"?  That's a lot of words.  Are you really
> going to search those words?
> 

The files are source listings (OpenVMS source listings) which contain a lot of
numbers in decimal, hexadecimal and C hex format (0x format), but I haven't
found how to avoid indexing the two hex formats (for the decimal format I have
defined IGNOREALLN to 1 in config.h).

I use the '-e' switch to save memory, but I have 2 GB of memory in my
workstation.
With the '-e' switch:

4484259 unique words indexed.
8 properties sorted.
11150 files indexed.  1233261154 total bytes.
Elapsed time: 01:21:43 CPU time: 01:17:14
Indexing done!

Until it reached the "Writing word data:" point, the process took 30 minutes of
CPU and used 350 MB of memory, with no paging or swapping.
The "Writing word data:" step consumed 1 hour of CPU and 13 M I/Os. All of these
reads are satisfied from the cache: the temporary file is smaller than 350 MB,
so it fits into memory. The CPU used for these I/Os (more than 9000 reads/s) is
a little less than 90% of the total CPU consumed by this last step.

Without the '-e' switch:

4484258 unique words indexed.
8 properties sorted.
11150 files indexed.  1233261156 total bytes.
Elapsed time: 00:42:40 CPU time: 00:40:25
Indexing done!

Until it reached the "Writing word data:" point, the process took 38 minutes of
CPU and used 350 MB of memory, generating more than 10 M page faults.
The "Writing word data:" step consumed the few remaining minutes of CPU.



So the '-e' switch seems to make the "Writing word data:" step very costly.



> I did this the other day:
> 
> 679890 unique words indexed.
> 2 properties sorted.
> 38740 files indexed.  455105121 total bytes.  19705343 total words.
> Elapsed time: 00:11:12 CPU time: 00:09:40
> Indexing done!
> 
> That was on a BSD machine with a load average of about *ten*.
> 
> 680K unique words is a lot, I thought, and I discovered that was due to
> indexing a mail archive with MIME attachments.  Without the mail archive it's:
> 
> 75495 unique words indexed.
> 2 properties sorted.
> 29505 files indexed.  384191825 total bytes.  12758054 total words.
> Elapsed time: 00:07:09 CPU time: 00:05:45
> Indexing done!
> 



Jean-François
Received on Thu Dec 27 00:05:47 2001