As you have read in previous posts on this list, there seems to
be a problem with the memory requirements of swish-e 2.X
when it is indexing.
As I posted, the main problem is storing word info in memory
prior to writing it to disk.
For example, let us say that we have 10000 files with 100 words in
each file. If each word appears just once in each file (the worst case) we
will need 6 integers per word:
- One for the file number (4 bytes)
- One for the metaName (4 bytes)
- One for the structure (4 bytes)
- One for the rank (4 bytes)
- One for the frequency (4 bytes)
- One for the position (4 bytes)
We also need 4 bytes for a pointer to the next occurrence.
So, total bytes: 28. This does not include the word itself and other
Thus, we need:
10000 * 100 * 28 = 28000000 (28 Meg).
All this info is compressed when writing to the index file.
This does not include the info for the files, properties, etc.,
which can also be big.
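
The estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic, not swish-e code: the struct format string and the function name index_memory_bytes are my own, and the "next" pointer is assumed to be 32 bits wide, as in the figures above (on a 64-bit box it would be 8 bytes).

```python
import struct

# Layout from the post: six 4-byte integers (file number, metaName,
# structure, rank, frequency, position) plus a 4-byte "next" pointer.
# "<" disables padding; "I" stands in for the 32-bit pointer (assumption).
OCCURRENCE_FORMAT = "<6iI"


def index_memory_bytes(num_files, words_per_file, fmt=OCCURRENCE_FORMAT):
    """Bytes needed if every word occurs exactly once in every file."""
    return num_files * words_per_file * struct.calcsize(fmt)


per_occurrence = struct.calcsize(OCCURRENCE_FORMAT)
total = index_memory_bytes(10000, 100)
print(per_occurrence, total)  # 28 28000000
```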
So, if your index process takes a long time, check if your box is
It is possible to find solutions to this issue. E.g.:
- Compress the data in memory
- Normally, metaName only needs one byte
- file number normally requires two bytes
- structure fits in just one byte
- frequency and position normally need one byte each
- Write all this info to a temporary file. Once it has been extracted
from the files, it is no longer needed until it is written to disk.
(Any other ideas are welcome)
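
To make the first idea concrete, here is one possible way to compress the data in memory: a variable-length integer encoding that stores 7 bits per byte, so small values take one byte and larger ones grow as needed. This is a sketch under my own assumptions, not what swish-e does: the function names and the field order are hypothetical, and rank is left out because no compact size is given for it above.

```python
def encode_varint(value):
    """Encode a non-negative int, 7 bits per byte, low-order group first."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, offset=0):
    """Return (value, next_offset) for the varint starting at offset."""
    value = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, offset
        shift += 7


# Hypothetical field order for one word occurrence (rank omitted).
FIELDS = ("filenum", "metaname", "structure", "frequency", "position")


def pack_occurrence(occ):
    """Concatenate the varint encoding of each field."""
    return b"".join(encode_varint(occ[name]) for name in FIELDS)


def unpack_occurrence(data, offset=0):
    """Inverse of pack_occurrence; returns (occurrence, next_offset)."""
    occ = {}
    for name in FIELDS:
        occ[name], offset = decode_varint(data, offset)
    return occ, offset


sample = {"filenum": 9999, "metaname": 1, "structure": 9,
          "frequency": 1, "position": 42}
packed = pack_occurrence(sample)
print(len(packed))  # 6 bytes instead of 5 * 4 = 20
```

The file number takes two bytes here because 9999 is below 2^14; every other field is below 128 and takes one byte. Values that do not fit still work, they just use more bytes.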
Of course, both options can carry a performance penalty.
(perhaps a flag to activate this option could be included)
Is this really required?
Memory gets cheaper every day...
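
The temporary-file idea with an on/off flag could look roughly like the sketch below. Everything in it is hypothetical (the class OccurrenceStore, the use_temp_file flag, the fixed 24-byte record); it only shows that records can be pushed out to disk as they are extracted and read back later. A real indexer would still have to group the records by word before writing the index.

```python
import os
import struct
import tempfile

# Six 4-byte fields from the post; no "next" pointer is needed on disk.
RECORD = struct.Struct("<6i")


class OccurrenceStore:
    """Keep occurrence records in memory, or in a temp file if flagged."""

    def __init__(self, use_temp_file=False):
        # TemporaryFile opens in binary read/write mode by default.
        self._file = tempfile.TemporaryFile() if use_temp_file else None
        self._rows = []
        self._count = 0

    def add(self, filenum, metaname, structure, rank, frequency, position):
        row = (filenum, metaname, structure, rank, frequency, position)
        if self._file is not None:
            self._file.seek(0, os.SEEK_END)  # always append
            self._file.write(RECORD.pack(*row))
        else:
            self._rows.append(row)
        self._count += 1

    def __len__(self):
        return self._count

    def __iter__(self):
        if self._file is None:
            yield from self._rows
            return
        self._file.seek(0)
        while True:
            chunk = self._file.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break
            yield RECORD.unpack(chunk)

    def close(self):
        if self._file is not None:
            self._file.close()  # the temp file is removed on close


store = OccurrenceStore(use_temp_file=True)
store.add(1, 1, 9, 1000, 1, 42)
store.add(2, 1, 9, 1000, 1, 7)
rows = list(store)
store.close()
```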
I would like to hear from all of you.
BTW, I will be on vacation from Sept 1 to Sept 15 (no mail, no
Received on Wed Aug 30 19:12:05 2000