On 27 Sep 2002, at 5:34, Lauren wrote:
> I could use some simplifying translation here. What is puzzling me is
> not that the size dropped _when_ we moved to swish-e 2.2, but that it
> dropped at a subsequent time, when we hadn't made any updates to our
> version. We put 2.2 in place and used it for several months. The
> index file grew gradually to 35.9 meg; then the next week it was much
> smaller with no evident change on _our_ end. I am convinced that no
> .html files are being skipped.
> So my question is this: Jose: Is there something in your compression
> routines that could result in a decrease that large just by my
> _adding_ some files to be indexed? I'm hoping for some illumination
> in the form of ideas about what could trigger such a fortuitous and
> dramatic result.
> (For example: One simplistic theory is that you've got a compression
So, all your files are HTML files, right?
Well, that is correct. I quite recently added a feature that compresses
the structure information (IN_BODY, IN_FILE, etc.) more efficiently. It
uses a bit flag to indicate that the word occurs only in the body (this
seems to be a very common case). When the flag is set, 1 byte per
position is saved. So, if a word occurs 7 times in a file and only in
the body, then we have saved 7 bytes.
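
To make the arithmetic concrete, here is a rough sketch of the idea in
Python. It is not the actual swish-e code or index format: the constant
values, the one-byte positions, the header layout and the function names
are all assumptions made up for illustration. It only shows how one flag
bit in a byte that is already present can stand in for a structure byte
per position.

```python
# Illustrative sketch only; NOT the real swish-e on-disk format.
# Assumed for the example: positions fit in one byte, the per-word header
# is a single byte holding the count, and its top bit is free for a flag.
IN_BODY = 0x08          # hypothetical structure value
IN_TITLE = 0x04         # hypothetical structure value
BODY_ONLY_FLAG = 0x80   # hypothetical flag bit in the header byte


def encode_plain(entries):
    """Header byte (count), then a position byte and a structure byte per hit."""
    out = bytearray([len(entries)])
    for position, structure in entries:
        out.append(position)
        out.append(structure)
    return bytes(out)


def encode_flagged(entries):
    """Same layout, but drop the structure bytes when every hit is body-only."""
    count = len(entries)
    if count > 0x7F:
        raise ValueError("count must fit in 7 bits in this sketch")
    body_only = all(structure == IN_BODY for _, structure in entries)
    out = bytearray([count | (BODY_ONLY_FLAG if body_only else 0)])
    for position, structure in entries:
        out.append(position)
        if not body_only:
            out.append(structure)
    return bytes(out)


def decode_flagged(data):
    """Inverse of encode_flagged: restore (position, structure) pairs."""
    count = data[0] & 0x7F
    body_only = bool(data[0] & BODY_ONLY_FLAG)
    entries, i = [], 1
    for _ in range(count):
        position = data[i]
        i += 1
        if body_only:
            structure = IN_BODY   # implied by the flag, not stored
        else:
            structure = data[i]
            i += 1
        entries.append((position, structure))
    return entries


# A word that occurs 7 times, always in the body.
hits = [(p, IN_BODY) for p in (3, 10, 22, 41, 57, 80, 99)]
saved = len(encode_plain(hits)) - len(encode_flagged(hits))
```

With these assumptions, `saved` is one byte per position for the
body-only word, and a word that also occurs somewhere else (say, in the
title) is stored at the same size as before.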
Received on Fri Sep 27 15:30:42 2002