Here's another idea I've been thinking about that might be fairly simple to implement:
Index the information incrementally and then merge the indices.
The program could be set up to default to a certain optimal memory value,
or number of words or files, then it would crank thru all of them,
producing temp indices and merge into a final index when finished.
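
A rough sketch of the idea, in Python for brevity. This is not swish-e code; the function names, the `max_postings` threshold, and the tab-separated temp file format are all made up for illustration. Each batch is flushed to a sorted temp index once it reaches the threshold, and the temp indices are then merged in one streaming pass:

```python
import heapq
import itertools
import os
import tempfile


def write_partial_index(postings, tmp_dir):
    """Flush one in-memory batch to disk as sorted "word<TAB>file_no" lines."""
    fd, path = tempfile.mkstemp(suffix=".idx", dir=tmp_dir, text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        for word in sorted(postings):
            for file_no in sorted(postings[word]):
                out.write(f"{word}\t{file_no}\n")
    return path


def read_partial_index(path):
    """Stream (word, file_no) tuples back from a temp index, in sorted order."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, file_no = line.rstrip("\n").split("\t")
            yield word, int(file_no)


def merge_partial_indices(paths, out_path):
    """K-way merge of sorted temp indices into one "word<TAB>n1,n2,..." file."""
    merged = heapq.merge(*(read_partial_index(p) for p in paths))
    with open(out_path, "w", encoding="utf-8") as out:
        for word, group in itertools.groupby(merged, key=lambda t: t[0]):
            file_nos = ",".join(str(n) for _, n in group)
            out.write(f"{word}\t{file_nos}\n")


def build_index(documents, out_path, max_postings=100_000):
    """documents: iterable of (file_no, list_of_words).

    max_postings is a hypothetical stand-in for the "optimal memory value":
    once a batch holds that many postings it is flushed to a temp index.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        paths, postings, count = [], {}, 0
        for file_no, words in documents:
            for word in set(words):
                postings.setdefault(word, []).append(file_no)
                count += 1
            if count >= max_postings:
                paths.append(write_partial_index(postings, tmp_dir))
                postings, count = {}, 0
        if postings:
            paths.append(write_partial_index(postings, tmp_dir))
        merge_partial_indices(paths, out_path)
```

Only one batch is ever held in memory, and the merge reads each temp index sequentially, so peak memory is bounded by the batch size rather than by the size of the whole collection.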
At 12:07 PM 8/30/00, you wrote:
>As you have read in previous posts in this list there seems to
>be a problem with the memory requirements of swish-e 2.X
>when it is indexing.
>As I posted, the main problem is storing word info in memory
>prior to writing it to disk.
>For example, let us say that we have 10000 files with 100 words in
>each file. If each word is just once in each file (the worst case) we
>will need 6 integers per word:
>- One for the file number (4 bytes)
>- One for the metaName (4 bytes)
>- One for the structure (4 bytes)
>- One for the rank (4 bytes)
>- One for the frequency (4 bytes)
>- One for the position (4 bytes)
>We also need 4 bytes for a pointer to the next occurrence.
>So, total bytes: 28. This does not include the word itself and other
>Thus, we need:
>10000 * 100 * 28 = 28000000 (28 Meg).
>All this info is compressed when writing to the index file
>This does not include the info for the files, properties, etc..
>that can also be big.
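
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate as described in the message (six 4-byte integers plus a 4-byte pointer per word occurrence); the names are invented here and it ignores the word strings, file info and properties:

```python
INT_BYTES = 4       # size of each integer field, as stated in the message
POINTER_BYTES = 4   # pointer to the next occurrence
FIELDS = ("file number", "metaName", "structure", "rank", "frequency", "position")


def bytes_per_occurrence():
    """Memory for one word occurrence: six integers plus one pointer."""
    return len(FIELDS) * INT_BYTES + POINTER_BYTES


def estimate_bytes(num_files, words_per_file):
    """Worst case: every word appears exactly once in each file."""
    return num_files * words_per_file * bytes_per_occurrence()


print(bytes_per_occurrence())        # 28
print(estimate_bytes(10_000, 100))   # 28000000
```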
>So, if your index process takes a long time, check if your box is
>It is possible to find solutions to this issue. Eg:
>- Compress the data in memory
> - Normally, metaName only needs one byte
> - file number normally requires two bytes
> - structure fits just one byte
> - frequency and position normally need one byte each
>- Write all this info to a temporary file. Once it has been extracted
>from the files, it is no longer needed until it is written to disk.
>(Any other idea is welcome)
>Of course, both options can carry a performance penalty.
>(perhaps a flag to activate this option can be included)
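
To put a number on the first option, here is a small sketch using Python's `struct` module with the field widths listed above. It is not how swish-e stores entries; the layout and the `pack_entry`/`unpack_entry` names are hypothetical, and since the message gives no width for rank, rank is assumed to stay at 4 bytes:

```python
import struct

# Six unpadded 4-byte unsigned integers, as in the current in-memory layout.
FULL = struct.Struct("<6I")

# Packed layout: file number (2 bytes), metaName, structure, frequency and
# position (1 byte each), rank (4 bytes, an assumption; not stated above).
PACKED = struct.Struct("<HBBBBI")


def pack_entry(file_no, meta_name, structure, rank, frequency, position):
    """Pack one occurrence; raises struct.error if a value does not fit."""
    return PACKED.pack(file_no, meta_name, structure, frequency, position, rank)


def unpack_entry(data):
    """Return (file_no, meta_name, structure, rank, frequency, position)."""
    file_no, meta_name, structure, frequency, position, rank = PACKED.unpack(data)
    return file_no, meta_name, structure, rank, frequency, position


print(FULL.size, PACKED.size)   # 24 10
```

Under these assumptions an entry shrinks from 24 to 10 bytes before counting the pointer. Values that exceed the "normal" widths (for example a file number above 65535) raise `struct.error` in this sketch, so a real implementation would need a fallback such as a variable-length encoding.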
>Is this really required?
>Memory gets cheaper every day...
>I would like to hear from all of you.
>BTW, I will be on vacation from Sept 1 to Sept 15 (no mail, no
Received on Wed Aug 30 21:33:22 2000