Indexing performance, multi-million words

From: Jean-François PIÉRONNE <jfp(at)not-real.altavista.net>
Date: Wed Dec 26 2001 - 10:54:29 GMT
Hi all,

While indexing a large document set (more than 11,000 files, 1.2 GB and nearly
4.5 million words), I noticed that indexing time can be greatly reduced by
increasing the three "#define" values HASHSIZE, BIGHASHSIZE and SEARCHHASHSIZE.

I don't know which of the three is the most significant, but indexing time
dropped from 6 hours to less than 2 hours, and those 2 hours are mostly CPU bound.
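
To illustrate why the table size can matter this much, here is a minimal sketch,
assuming the words are kept in a chained hash table whose bucket count is a
compile-time constant. All names below (struct table, table_add, count_compares)
and the hash function are hypothetical and are not taken from the indexer's
source; the sketch only counts string comparisons for a given bucket count.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One entry in a bucket's linked list (separate chaining). */
struct entry {
    char *word;
    struct entry *next;
};

/* Hypothetical table: nbuckets plays the role of a HASHSIZE-style constant. */
struct table {
    struct entry **buckets;
    unsigned long nbuckets;
    unsigned long compares;   /* string comparisons performed so far */
};

/* Simple multiplicative string hash (an assumption, not the real one). */
static unsigned long hash_word(const char *s, unsigned long nbuckets)
{
    unsigned long h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % nbuckets;
}

/* Returns 0 on success, -1 if the bucket array cannot be allocated. */
static int table_init(struct table *t, unsigned long nbuckets)
{
    assert(nbuckets > 0);
    t->buckets = calloc(nbuckets, sizeof *t->buckets);
    t->nbuckets = nbuckets;
    t->compares = 0;
    return t->buckets ? 0 : -1;
}

/* Add a word if absent. Returns 1 if added, 0 if present, -1 on error. */
static int table_add(struct table *t, const char *word)
{
    unsigned long b = hash_word(word, t->nbuckets);
    struct entry *e;

    for (e = t->buckets[b]; e; e = e->next) {
        t->compares++;                  /* cost grows with chain length */
        if (strcmp(e->word, word) == 0)
            return 0;
    }
    e = malloc(sizeof *e);
    if (!e)
        return -1;
    e->word = malloc(strlen(word) + 1);
    if (!e->word) {
        free(e);
        return -1;
    }
    strcpy(e->word, word);
    e->next = t->buckets[b];
    t->buckets[b] = e;
    return 1;
}

static void table_free(struct table *t)
{
    unsigned long b;
    for (b = 0; b < t->nbuckets; b++) {
        struct entry *e = t->buckets[b];
        while (e) {
            struct entry *next = e->next;
            free(e->word);
            free(e);
            e = next;
        }
    }
    free(t->buckets);
}

/* Insert nwords distinct synthetic words; return the comparisons needed. */
static unsigned long count_compares(unsigned long nbuckets, unsigned long nwords)
{
    struct table t;
    char word[32];
    unsigned long i, result;

    if (table_init(&t, nbuckets) != 0)
        return 0;
    for (i = 0; i < nwords; i++) {
        sprintf(word, "w%lu", i);
        table_add(&t, word);
    }
    result = t.compares;
    table_free(&t);
    return result;
}
```

With n distinct words and B buckets, the average chain is about n/B entries
long, so count_compares(101, 20000) should come out far larger than
count_compares(10007, 20000). The same effect would apply, on a larger scale,
to several million words hashed into a table sized for a much smaller corpus.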

Maybe these parameters could be made dynamic (configuration parameters) or be
given larger defaults.
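
As a rough sketch of what "dynamic" could look like, assuming the size is read
as a string from a configuration value at startup: the function name
parse_hash_size and the fallback mechanism are hypothetical, not part of the
existing code.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a hash table size from a configuration string.
 * Returns the parsed value, or `fallback` (the compiled-in default)
 * if the string is missing, empty, not a plain decimal number, or zero.
 */
static unsigned long parse_hash_size(const char *value, unsigned long fallback)
{
    char *end;
    unsigned long n;

    assert(fallback > 0);

    /* Reject NULL, empty strings, signs and leading whitespace. */
    if (value == NULL || !isdigit((unsigned char)*value))
        return fallback;

    errno = 0;
    n = strtoul(value, &end, 10);

    /* Reject overflow, trailing garbage and a size of zero. */
    if (errno != 0 || *end != '\0' || n == 0)
        return fallback;

    return n;
}
```

A caller could then allocate its bucket array with
parse_hash_size(configured_value, DEFAULT_SIZE) instead of a fixed "#define",
where configured_value and DEFAULT_SIZE are placeholders for whatever the
configuration parser and the current default provide.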

After this, most of the time (80-90%) is spent in the "writing word data"
phase, which uses a lot of CPU and performs millions of reads on the temporary
file built during the parsing-collecting pass.
I haven't isolated which routines are costly.

Jean-François
Received on Wed Dec 26 10:54:39 2001