Hi all,
For a long time I have been considering adding some sort of
compression capability for filenames, titles and properties.
The problem shows up when I index very large file systems with
documents containing several fields (properties and/or metanames):
the resulting index file easily reaches about 100 MB. So I checked
how much space the filenames, titles and properties take, and it was
roughly 60 per cent of the total size of the index file
(about 60 MB).
That is the bad news. The good news is that this data is highly
repetitive.
So I tried compressing it with gzip, and the file shrank
dramatically. But gzip, bzip and similar tools do not allow the
direct access to the data that swish needs to fetch a document's
data as fast as possible. So, if applied, compression must
guarantee direct and fast access to the info: it does not matter if
the indexing process is a little slower, but searching must be fast.
We cannot use zlib because it is not designed for direct access, but
the deflate algorithm can be applied. Just a little coding effort is
needed. Deflate is a well-known algorithm and it is also free of
patent concerns.
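
A minimal sketch of the direct-access idea, assuming each document's
filename, title and properties are stored together as one record:
compress every record on its own with deflate and keep a table of
offsets, so a search only inflates the record it needs. It is written
in Python and uses the standard zlib module in raw-deflate mode purely
as a stand-in for a hand-coded deflate routine. pack_records and
read_record are hypothetical names, not swish-e functions.

```python
import zlib

def pack_records(records):
    """Compress each record independently; return (blob, offsets)."""
    blob = bytearray()
    offsets = [0]  # offsets[i] is where record i starts inside blob
    for rec in records:
        # wbits=-15 asks for a raw deflate stream (no zlib/gzip header).
        comp = zlib.compressobj(6, zlib.DEFLATED, -15)
        blob += comp.compress(rec) + comp.flush()
        offsets.append(len(blob))
    return bytes(blob), offsets

def read_record(blob, offsets, i):
    """Inflate only record i; no other record is read or decompressed."""
    chunk = blob[offsets[i]:offsets[i + 1]]
    return zlib.decompress(chunk, -15)

# Hypothetical sample data, made up for illustration only.
records = [
    b"/usr/local/docs/manual/intro.html\tIntroduction",
    b"/usr/local/docs/manual/install.html\tInstallation",
    b"/usr/local/docs/manual/search.html\tSearching",
]
blob, offsets = pack_records(records)
second = read_record(blob, offsets, 1)
```

The price of this layout is that each record is compressed without
knowledge of the others, so it will not shrink as much as one big
gzip'ed file would.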
Yesterday I wrote some very simple code to test deflate in a real
swish-e scenario. The results are quite good: the size of the
filename, title and properties data shrinks to 40% of the original,
and searching shows no penalty (in fact, it needs less I/O).
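
Because the repetition is mostly between records (common path
prefixes, common property names), one possible refinement is to prime
deflate with a shared dictionary, so that even a short record can
refer back to strings it does not itself contain. This is only a
sketch of that option, not the test code mentioned above. It assumes
Python's zlib module with its zdict parameter, and the dictionary
contents and function names are made up for illustration.

```python
import zlib

def compress_with_dict(record, shared):
    """Raw-deflate one record, primed with a shared dictionary."""
    comp = zlib.compressobj(9, zlib.DEFLATED, -15, zdict=shared)
    return comp.compress(record) + comp.flush()

def decompress_with_dict(data, shared):
    """Inflate one record; the same dictionary must be supplied."""
    decomp = zlib.decompressobj(-15, zdict=shared)
    return decomp.decompress(data) + decomp.flush()

# Hypothetical dictionary built from strings expected to recur.
shared = b"/usr/local/docs/manual/ .html title"
rec = b"/usr/local/docs/manual/chapter1.html"
packed = compress_with_dict(rec, shared)
restored = decompress_with_dict(packed, shared)
```

The dictionary would have to be stored once in the index file and
loaded at search time, since a record cannot be inflated without it.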
Of course, gzip and its zlib library can achieve better results: they
use more elaborate techniques, but swish-e just needs
performance and direct access.
This is just an idea; any comments are welcome.
See you at ApacheCon next week
(will be back oct. 26)
Jose
Received on Fri Oct 20 12:57:19 2000