Finally, I am here once again.
Old and new issues.
Just to remember...
As you have read in previous posts to this list, swish-e 2.x
consumes a very large amount of memory during the indexing
process. Much of this memory is used for storing the
- file number (index to the file info)
- metaname (it is 1 if no metaname, 2,3 for the rest)
- structure (stores if the word is in head, body, title ...)
- frequency (the number of occurrences of the word in the file)
- positions (the positions of the word in the file) This can be a
Each of these values needs 4 bytes.
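To illustrate why these 4-byte values are good candidates for compression, here is a minimal sketch of a variable-length integer encoding (7 data bits per byte, with the high bit marking "more bytes follow"). This is only an assumption about one possible technique, not necessarily what the modified swish-e code does; the function names varint_encode and varint_decode are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode v into out using 7 data bits per byte.
 * The high bit of each byte is set when more bytes follow.
 * out must have room for 5 bytes. Returns the number of bytes written. */
size_t varint_encode(uint32_t v, unsigned char *out)
{
    size_t n = 0;

    assert(out != NULL);
    while (v >= 0x80) {
        out[n++] = (unsigned char)((v & 0x7F) | 0x80);
        v >>= 7;
    }
    out[n++] = (unsigned char)v;
    return n;
}

/* Hypothetical helper: decode one value from in into *v.
 * Reads at most 5 bytes. Returns the number of bytes consumed. */
size_t varint_decode(const unsigned char *in, uint32_t *v)
{
    size_t n = 0;
    unsigned shift = 0;
    uint32_t result = 0;

    assert(in != NULL && v != NULL);
    while (n < 5) {
        unsigned char b = in[n++];
        result |= (uint32_t)(b & 0x7F) << shift;
        if ((b & 0x80) == 0)
            break;          /* last byte of this value */
        shift += 7;
    }
    *v = result;
    return n;
}
```

With an encoding like this, small values such as a metaname of 1, 2 or 3, or a low frequency, take one byte instead of four, while large file numbers still fit in at most five bytes.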
Much of that info can be compressed to save memory. So I
decided to give it a try and modify the code to handle it. Here are
The test case contains 10000 files and 35000 different words.
Each file contains about 70 words with 7 fields (metaNames) and 5
The test box is a Sun machine running Solaris 2.6 (400 MHz) with 512 MB.
(Note: All the files are in memory cache to minimize the effect of
the filesystem I/O).
swish-e-2.0.1 needed 33 MB of RAM and the index time was 33 seconds.
"Modified" swish-e 2.x (including the new index engine and beta
compression option) needed 20 MB of RAM and the index time was 25 seconds.
All output index files are identical (except for the date/time in
the header info).
As you can see, there is a reduction in memory usage of about 40%.
I do not know if this is enough. Of course, it depends on how many
docs are being indexed and how powerful your machine is.
Now the news. Hope you like it.
On Sunday, back from my vacation, I tried to push the memory
reduction a little bit further. So I modified the code to store all word
and file (properties included) info in two temporary files.
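The general idea of moving index data out of memory and into a temporary file can be sketched as follows. This is a simplified illustration under my own assumptions, not the actual swish-e code: the struct word_entry layout and the functions spill_entry and load_entry are hypothetical, and only one temp file is shown here. Only the standard C calls fseek, ftell, fwrite and fread are used; the caller would keep just the returned offset in memory.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-word record; the real swish-e layout may differ. */
struct word_entry {
    uint32_t filenum;
    uint32_t metaname;
    uint32_t structure;
    uint32_t frequency;
};

/* Append one record to the temp file.
 * Returns the file offset where it was stored, or -1 on error. */
long spill_entry(FILE *tmp, const struct word_entry *e)
{
    long off;

    assert(tmp != NULL && e != NULL);
    if (fseek(tmp, 0L, SEEK_END) != 0)
        return -1;
    off = ftell(tmp);
    if (off < 0)
        return -1;
    if (fwrite(e, sizeof *e, 1, tmp) != 1)
        return -1;
    return off;
}

/* Read back the record stored at offset off.
 * Returns 0 on success, -1 on error. */
int load_entry(FILE *tmp, long off, struct word_entry *e)
{
    assert(tmp != NULL && e != NULL);
    if (fseek(tmp, off, SEEK_SET) != 0)
        return -1;
    return fread(e, sizeof *e, 1, tmp) == 1 ? 0 : -1;
}
```

A file opened with tmpfile() works well for this, since it is removed automatically when closed. The trade-off is extra I/O for every record that has to be read back.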
It was really hard: after several core dumps, I finally got a working version:
The previous test case only needed 8.5 MB of RAM but it took 56 seconds.
swish-e-2.0.1:                           33 MB    33 seconds
With memory compression:                 20 MB    25 seconds
With memory compression and temp files:  8.5 MB   56 seconds
I am not sure how accurate the time is because during the test
the machine was caching the file system.
Since the option with temp files is slower, I have added a new
parameter to the index command line to activate it (-e). This means "economic
I have also added temp files to the merge option. Since merge is a faster
process, it always uses temp files.
I will release these modifications after completing them.
Received on Mon Sep 18 12:22:59 2000