swish-e-2.1.X and compression

From: <jmruiz(at)not-real.boe.es>
Date: Fri Oct 20 2000 - 12:52:09 GMT
Hi all,

I have been considering for a long time the possibility of including
some sort of compression capability for filenames, titles and 
properties.

The problem arises when I index very large file systems with
documents containing several fields (properties and/or metanames):
I can easily end up with an index file of about 100 MB. So I checked
how much space is used by filenames, titles and properties, and it
was roughly 60 percent of the total size of the index file
(about 60 MB).
That is the bad news. The good news is that this data is highly
repetitive.
So I tried gzip on it and saw the file shrink dramatically. But
gzip, bzip and other GPL tools do not allow the direct access to
data that swish needs to fetch document data as fast as possible.
So, if applied, compression must guarantee direct and fast access
to the info: it does not matter if the indexing process is a little
slower, but search must be fast.
We cannot use zlib because it is not ready for direct access, but the
deflate algorithm can be applied. Only a little coding effort is
needed. Deflate is a well-known algorithm and it is also free of
patent concerns.
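
To make the direct-access idea concrete, here is a minimal sketch. It
is not swish-e code: it assumes each record (filename plus title) is
deflated on its own and that a table of (start, length) pairs is kept
next to the compressed data. Python's zlib module is used only as a
convenient source of the deflate algorithm; the names build_store and
fetch and the sample records are made up for illustration.

```python
import zlib


def build_store(records):
    """Compress each record on its own and remember where it starts."""
    blob = bytearray()
    offsets = []  # offsets[i] = (start, length) of compressed record i
    for text in records:
        # wbits=-15 selects a raw deflate stream (no zlib header or checksum).
        comp = zlib.compressobj(level=9, wbits=-15)
        packed = comp.compress(text.encode("utf-8")) + comp.flush()
        offsets.append((len(blob), len(packed)))
        blob += packed
    return bytes(blob), offsets


def fetch(blob, offsets, i):
    """Decompress only record i; no other record is read or inflated."""
    start, length = offsets[i]
    return zlib.decompress(blob[start:start + length], -15).decode("utf-8")


# Hypothetical sample data, one "path<TAB>title" string per document.
records = [
    "/docs/2000/10/report-a.html\tMonthly report A",
    "/docs/2000/10/report-b.html\tMonthly report B",
    "/docs/2000/11/summary.html\tYearly summary",
]
blob, offsets = build_store(records)
print(fetch(blob, offsets, 1))
```

Because every record is an independent deflate stream, a search only
has to seek to one offset and inflate a few bytes. The price is that
repetition between records is not exploited unless something like a
shared dictionary is added.
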
Yesterday I wrote some very simple code to test deflate in the real
swish-e scenario. The results are quite good: the filename, title
and properties data shrinks to 40% of its original size and searching
shows no penalty (in fact, it needs less I/O).
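
For anyone who wants to try a similar measurement, the sketch below
compares sizes on made-up data. It is not the test code mentioned
above, and it will not reproduce these numbers. It assumes records are
deflated one by one, and that a preset dictionary (the zdict parameter
of Python's zlib module) is one possible way to exploit repetition
between records; the paths, titles and dictionary are hypothetical.

```python
import zlib

# Hypothetical sample data: made-up paths and titles sharing long prefixes.
records = [
    ("/home/httpd/htdocs/archive/2000/10/doc%05d.html\t"
     "Archive document number %05d" % (n, n)).encode("ascii")
    for n in range(200)
]

# Shared dictionary: text that is expected to recur in most records.
zdict = b"/home/httpd/htdocs/archive/2000/10/doc.html\tArchive document number "


def pack(data, dictionary=None):
    """Deflate one record, optionally primed with a preset dictionary."""
    if dictionary is None:
        comp = zlib.compressobj(level=9)
    else:
        comp = zlib.compressobj(level=9, zdict=dictionary)
    return comp.compress(data) + comp.flush()


def unpack(packed, dictionary):
    """Inflate one record that was packed with the same dictionary."""
    decomp = zlib.decompressobj(zdict=dictionary)
    return decomp.decompress(packed) + decomp.flush()


raw_size = sum(len(r) for r in records)
plain_size = sum(len(pack(r)) for r in records)
dict_size = sum(len(pack(r, zdict)) for r in records)

print("raw bytes:             ", raw_size)
print("per-record deflate:    ", plain_size)
print("with preset dictionary:", dict_size)
```

Each record can still be fetched and inflated on its own, as long as
the reader has the same dictionary that was used at index time.
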
Of course, gzip and its zlib library can achieve better results: they
use a more elaborate technique, but swish-e just needs
performance and direct access.

This is just an idea; any comments are welcome.

cu in apachecon next week
(will be back  oct. 26)

Jose
Received on Fri Oct 20 12:57:19 2000