More on indexing and memory requirements in swish-e 2.x

From: <jmruiz(at)not-real.boe.es>
Date: Thu Aug 31 2000 - 13:40:18 GMT
Hi all,

The old news...
As you have read in previous posts to this list, swish-e 2.x
consumes a very large amount of memory during the indexing
process. Much of this memory is used to store the
per-word info: 
- file number (index to the file info)
- metaname (1 if there is no metaname; 2, 3, ... for the rest)
- structure (stores if the word is in head, body, title ...)
- frequency (the number of occurrences of the word in the file)
- positions (the positions of the word in the file). There is one 
value per occurrence, so this field can repeat.
Each of these values needs 4 bytes.
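
To make the cost of that layout concrete, here is a minimal sketch in Python (not the swish-e C code). The function name and the field order are hypothetical, and it assumes the frequency equals the number of stored positions; the only thing taken from the post is that every value occupies 4 bytes.

```python
import struct

def pack_word_entry(filenum, metaname, structure, positions):
    """Pack one word-in-file entry as fixed-width 4-byte integers.

    Hypothetical layout for illustration: filenum, metaname, structure,
    frequency, then one value per position.
    """
    frequency = len(positions)  # assumption: one stored position per occurrence
    fields = [filenum, metaname, structure, frequency, *positions]
    # "<I" is a little-endian unsigned 32-bit integer (always 4 bytes).
    return struct.pack(f"<{len(fields)}I", *fields)

entry = pack_word_entry(filenum=7, metaname=1, structure=2, positions=[3, 15, 42])
print(len(entry))  # 28 bytes: 4 header fields + 3 positions, 4 bytes each
```

Even a word that occurs only three times in a small file costs 28 bytes here, although every value would fit in a single byte.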

Now, the new and good news...
Much of that info can be compressed to save memory, so I 
decided to give it a try and modify the code to handle it. Here are 
the results:
The test case contains 10000 files and 35000 different words.
Each file contains about 70 words with 7 fields (metaNames) and 5 
properties.
The test box is a Sun machine running Solaris 2.6 (400 MHz) with 
512 MB of RAM.
(Note: All the files are in the memory cache to minimize the effect 
of filesystem I/O.)

swish-e-2.0.1 needed 33 MB of RAM and the index time was 33 
seconds.

"Modified" swish-e 2.x (including the new index engine and the beta 
compression option) needed 20 MB of RAM and the index time was 
35 seconds.

Both output index files are identical (except for the date/time in
the header info).

As you can see, memory usage drops by about 40% (from 33 MB to 
20 MB), at the cost of 2 extra seconds of index time.
I do not know if this is enough. Of course, it depends on how many 
docs are being indexed and how powerful your machine is.
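
The post does not say which compression scheme the modified code uses. As an illustration only, the sketch below swaps in variable-byte encoding, a common way to shrink small integers: each byte carries 7 data bits, and the high bit marks that more bytes follow. All function names are hypothetical, and the entry layout (filenum, metaname, structure, frequency, positions) is assumed.

```python
def encode_varint(n):
    """Encode a non-negative integer with 7 data bits per byte."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)

def decode_varint(data, offset=0):
    """Decode one integer; return (value, offset of the next byte)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7

def compress_entry(filenum, metaname, structure, positions):
    """Assumed entry layout, with every field stored as a varint."""
    fields = [filenum, metaname, structure, len(positions), *positions]
    return b"".join(encode_varint(v) for v in fields)

packed = compress_entry(7, 1, 2, [3, 15, 42])
print(len(packed))  # 7 bytes, versus 28 with fixed 4-byte fields
```

The saving on a single small entry is larger than the overall figure reported above, which is expected: the measured memory also covers everything else the indexer keeps in RAM, and larger values need more than one byte.
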
I will release these modifications after completing them (I still 
need to add them to the merge option).

Now, it is time for my vacation.
cu on Sept 17
Jose
Received on Thu Aug 31 13:44:38 2000