Memory and swish-e-2.X

From: <jmruiz(at)not-real.boe.es>
Date: Mon Sep 18 2000 - 12:22:43 GMT
Hi all,

Finally, I am here once again.

Old and new issues.

Just a reminder...

As you have read in previous posts to this list, swish-e 2.x
consumes a very large amount of memory during the indexing
process. Much of this memory is used for storing the
word info:
- file number (index to the file info)
- metaname (1 if there is no metaname; 2, 3, ... for the rest)
- structure (stores if the word is in head, body, title ...)
- frequency (the number of occurrences of the word in the file)
- positions (the positions of the word in the file). This can be a
repeated value.
Each of these values needs 4 bytes.
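
To make the cost of the fixed 4-byte layout concrete, here is a minimal sketch that compares it with variable-byte encoding, a common way to shrink small integers. The post does not say which compression scheme the modified swish-e uses, so the scheme, the function names and the sample values below are all assumptions for illustration only.

```python
import struct


def encode_varint(n):
    """Encode a non-negative integer using 7 payload bits per byte."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte of this value
            return bytes(out)


def decode_varints(data):
    """Decode a byte string made of concatenated varints."""
    values = []
    current = 0
    shift = 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current = 0
            shift = 0
    return values


# Hypothetical entry: file number, metaname, structure, frequency,
# followed by one position per occurrence (two here).
entry = [1234, 1, 2, 2, 17, 42]

fixed = struct.pack("<6I", *entry)                   # 4 bytes per value
packed = b"".join(encode_varint(v) for v in entry)   # 1 byte per small value

print(len(fixed), len(packed))
```

For this sample entry the fixed layout takes 24 bytes and the variable-byte form takes 7, because only the file number needs a second byte. Real savings depend on how large the values actually are.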

Much of that info can be compressed to save memory, so I
decided to give it a try and modified the code to handle it. Here are
the results:
The test case contains 10000 files and 35000 different words.
Each file contains about 70 words with 7 fields (metaNames) and 5
properties.
The test box is a Sun running Solaris 2.6 (400 MHz) with 512 MB.
(Note: All the files are in the memory cache to minimize the effect of
filesystem I/O.)

swish-e-2.0.1 needed 33 MB of RAM and the index time was 33
seconds.

"Modified" swish-e 2.x (including the new index engine and the beta
compression option) needed 20 MB of RAM and the index time was
35 seconds.

All output index files are identical (except for the date/time in
the header info).

As you can see, memory usage drops by about 40%.
I do not know if this is enough. Of course, it depends on how many
docs are being indexed and how powerful your machine
is.

Now the news. Hope you like it...
On Sunday, back from my vacation, I tried to push the memory
reduction a little further, so I modified the code to store all word
and file (properties included) info in two temporary files.
It was really hard: after several core dumps, I finally got a working version.
The previous test case needed only 8.5 MB of RAM, but it took
56 seconds.
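
The general idea of trading RAM for temporary files can be sketched as follows. This is not the swish-e code: the class name, the record layouts and the choice to keep only an (offset, length) pair per record in memory are assumptions made for illustration.

```python
import struct
import tempfile


class SpillStore:
    """Append records to a temporary file; keep only (offset, length) in RAM."""

    def __init__(self):
        self._file = tempfile.TemporaryFile()  # binary read/write, removed on close
        self._index = []                       # one (offset, length) pair per record

    def append(self, payload):
        """Write one record at the end of the file and return its id."""
        self._file.seek(0, 2)                  # 2 = seek relative to end of file
        offset = self._file.tell()
        self._file.write(payload)
        self._index.append((offset, len(payload)))
        return len(self._index) - 1

    def read(self, record_id):
        """Read one record back from disk."""
        offset, length = self._index[record_id]
        self._file.seek(offset)
        return self._file.read(length)

    def close(self):
        self._file.close()


# Two stores, mirroring the two temporary files mentioned above:
# one for word info and one for file info (hypothetical record contents).
words = SpillStore()
files = SpillStore()

word_id = words.append(struct.pack("<4I", 1234, 1, 2, 2))
file_id = files.append(b"/docs/example.html")

word_record = struct.unpack("<4I", words.read(word_id))
file_record = files.read(file_id)

words.close()
files.close()
```

Every lookup now costs a seek and a read instead of a memory access, which is consistent with the longer run time reported for this mode.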

To summarize:

swish-e-2.0.1: 33 MB, 33 seconds
With memory compression: 20 MB, 25 seconds
With memory compression and temp files: 8.5 MB, 56 seconds
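
As a quick check of the memory figures, the savings relative to the 2.0.1 baseline can be computed directly. The snippet uses only the RAM numbers reported in this message; the variable names are mine.

```python
baseline_mb = 33.0  # swish-e-2.0.1

runs_mb = {
    "memory compression": 20.0,
    "memory compression and temp files": 8.5,
}

# Percentage of RAM saved compared with the baseline, rounded to one decimal.
savings_pct = {
    name: round(100 * (baseline_mb - mb) / baseline_mb, 1)
    for name, mb in runs_mb.items()
}

for name, pct in savings_pct.items():
    print(f"{name}: {pct}% less RAM")
```

This gives roughly 39% for compression alone and roughly 74% with temp files added.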

I am not sure how accurate the time is because during the test
the machine was caching the file system.

Since the option with temp files is slower, I have added a new
parameter (-e) to the index command line to activate it. It stands for
"economic mode".

I have also added temp files to the merge option. Since merge is a faster
process, it always uses temp files.

I will release these modifications after completing them.

cu
Jose
Received on Mon Sep 18 12:22:59 2000