Re: swish-e-2.X and memory

From: Frank Heasley <DrHeasley(at)>
Date: Wed Aug 30 2000 - 21:29:09 GMT
Here's another idea I've been thinking about that might be fairly simple to 

Index the information incrementally and then merge the indices.

The program could be set up to default to a certain optimal memory value, 
or number of words or files, then it would crank thru all of them, 
producing temp indices and merge into a final index when finished.
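
A rough sketch of that idea in Python, not swish-e code: the names (write_partial_index, read_partial_index, build_index), the max_postings limit and the one-line-per-posting temp file format are all made up for illustration. Each batch is sorted and written to a temp index, then the temp indices are merged in one streaming pass.

```python
import heapq
import os
import tempfile

def write_partial_index(postings, directory):
    """Write one batch to a sorted temp file, one 'word<TAB>file_no' per line."""
    fd, path = tempfile.mkstemp(suffix=".idx", dir=directory)
    with os.fdopen(fd, "w") as out:
        for word, file_no in sorted(postings):
            out.write(f"{word}\t{file_no}\n")
    return path

def read_partial_index(path):
    """Stream (word, file_no) pairs back from a temp index in sorted order."""
    with open(path) as src:
        for line in src:
            word, file_no = line.rstrip("\n").split("\t")
            yield word, int(file_no)

def build_index(documents, max_postings, directory):
    """documents: iterable of (file_no, text). Returns [(word, [file_no, ...])]."""
    partial_paths, batch = [], []
    for file_no, text in documents:
        for word in set(text.lower().split()):
            batch.append((word, file_no))
        # Flush to disk once the in-memory batch reaches the limit.
        if len(batch) >= max_postings:
            partial_paths.append(write_partial_index(batch, directory))
            batch = []
    if batch:
        partial_paths.append(write_partial_index(batch, directory))

    # Each temp index is sorted, so heapq.merge can combine them lazily.
    merged = []
    streams = [read_partial_index(p) for p in partial_paths]
    for word, file_no in heapq.merge(*streams):
        if merged and merged[-1][0] == word:
            merged[-1][1].append(file_no)
        else:
            merged.append((word, [file_no]))
    for p in partial_paths:
        os.remove(p)
    return merged
```

The merged result is kept in a list here only to keep the sketch short. A real indexer would presumably stream it straight into the final index file so that memory use stays bounded by max_postings.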


At 12:07 PM 8/30/00, you wrote:

>Hi all,
>As you have read in previous posts in this list there seems to
>be a problem with the memory requirements of swish-e 2.X
>when it is indexing.
>As I posted, the main problem is storing word info in memory
>prior to writing it to disk.
>For example, let us say that we have 10000 files with 100 words in
>each file. If each word appears just once in each file (the worst case) we
>will need 6 integers per word:
>- One for the file number (4 bytes)
>- One for the metaName (4 bytes)
>- One for the structure (4 bytes)
>- One for the rank (4 bytes)
>- One for the frequency (4 bytes)
>- One for the position (4 bytes)
>We also need 4 bytes for a pointer to the next occurrence.
>So, total bytes: 28. This does not include the word itself and other
>Thus, we need:
>10000 * 100 * 28 = 28000000  (28 Meg).
>All this info is compressed when writing to the index file.
>This does not include the info for the files, properties, etc..
>that can also be big.
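
The estimate quoted above can be checked with a few lines of Python. This only restates the arithmetic from the message (six 4-byte integers plus a 4-byte pointer per occurrence); the function and constant names are made up for illustration.

```python
# Per-occurrence fields listed in the message, each stored as a 4-byte integer.
FIELDS = ("file number", "metaName", "structure", "rank", "frequency", "position")
INT_BYTES = 4
POINTER_BYTES = 4  # pointer to the next occurrence

def bytes_per_occurrence():
    """Six integers plus one pointer."""
    return len(FIELDS) * INT_BYTES + POINTER_BYTES

def index_memory_bytes(n_files, words_per_file):
    """Worst case: every word appears exactly once in each file."""
    return n_files * words_per_file * bytes_per_occurrence()

print(bytes_per_occurrence())          # 28
print(index_memory_bytes(10000, 100))  # 28000000
```

As the message says, this leaves out the words themselves and the file and property info.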
>So, if your index process takes a long time, check if your box is
>It is possible to find solutions to this issue. Eg:
>- Compress the data in memory
>     - Normally, metaName only needs one byte
>     - file number normally requires two bytes
>     - structure fits in just one byte
>     - frequency and position normally need one byte each
>- Write all this info to a temporary file. Once it has been extracted
>from the files, it is no longer needed until it is written to disk.
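
To get a feel for the first option, here is a sketch using Python's struct module with the field widths suggested above. It is not how swish-e stores anything; the layout, the names, and keeping rank at 4 bytes are assumptions (the message does not give a width for rank).

```python
import struct

# Layout from the estimate: six 4-byte unsigned integers, no padding.
FULL = struct.Struct("<6I")

# Compact layout: file number 2 bytes, metaName 1, structure 1,
# rank 4 (assumed, no width given in the message), frequency 1, position 1.
COMPACT = struct.Struct("<HBBIBB")

def pack_occurrence(file_no, meta_name, structure, rank, frequency, position):
    """Pack one occurrence; raises struct.error if a value does not fit its field."""
    return COMPACT.pack(file_no, meta_name, structure, rank, frequency, position)

def unpack_occurrence(data):
    """Return (file_no, meta_name, structure, rank, frequency, position)."""
    return COMPACT.unpack(data)

print(FULL.size, COMPACT.size)  # 24 10
```

The packed records could just as well be appended to a temp file, which covers the second option. Note the obvious catch: a 2-byte file number stops at 65535 files and a 1-byte position at 255, so a real implementation would need an escape for larger values.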
>(Any other idea is welcome)
>Of course, both options can be a penalty in performance.
>(perhaps a flag to activate this option can be included)
>Is this really required?
>Memory gets cheaper every day...
>I would like to hear from all of you.
>BTW, I will be on vacation from Sept 1 to Sept 15 (no mail, no
Received on Wed Aug 30 21:33:22 2000