IgnoreLimit stopwords etc

From: Frances Coakley <f.coakley(at)not-real.eim.surrey.ac.uk>
Date: Mon Feb 19 2001 - 16:25:10 GMT
I did a little playing around with alternative data structures, using
swish 1.3.2 as the base (the later version, swish 2.0.3?, could not handle
my files on my available Windows machine).  I have some 4500 HTML files -
about 140,000 words, of which some 60k are unique to 1 file (and of these
some 40k are probably scanning errors!) - and the overall index was some
17.5 Mbytes.
Taking the -D output and run-length encoding the file refs (e.g. 4 bits
for a difference of up to 14 files, using 0000 and 0001 to signify that the
next 12 bits = file no) brought the index down to about 3 Mbytes. On this
scheme (admittedly crude) a word that occurs in all files occupies no more
than NoOfFiles/2 bytes. This begins to be attractive for Java-based
searches for CD-ROMs.
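
A minimal Python sketch of the 4-bit scheme described above, under stated
assumptions: codes 2..15 stand for a gap of 1..14 files from the previous
file number, and the two escape codes 0000 and 0001 supply the 13th bit of
the file number that follows in the next 12 bits (one guess at why two
escape codes are needed for some 4500 files). The function names are
hypothetical, and nibbles are kept in a list rather than packed into bytes.

```python
def encode_postings(file_numbers):
    """Encode ascending, distinct file numbers (0..8191) as 4-bit codes."""
    nibbles = []
    prev = None
    for n in file_numbers:
        if not 0 <= n < 8192:
            raise ValueError("file number out of range for this sketch")
        gap = None if prev is None else n - prev
        if gap is not None and 1 <= gap <= 14:
            # Assumed mapping: codes 2..15 mean "gap of 1..14 files".
            nibbles.append(gap + 1)
        else:
            # Escape: code 0 or 1 carries the top bit (assumption),
            # then 12 bits of file number as three nibbles.
            low = n & 0xFFF
            nibbles.append(n >> 12)
            nibbles.extend([(low >> 8) & 0xF, (low >> 4) & 0xF, low & 0xF])
        prev = n
    return nibbles


def decode_postings(nibbles):
    """Inverse of encode_postings."""
    out = []
    i = 0
    prev = None
    while i < len(nibbles):
        code = nibbles[i]
        if code <= 1:
            n = (code << 12) | (nibbles[i + 1] << 8) \
                | (nibbles[i + 2] << 4) | nibbles[i + 3]
            i += 4
        else:
            n = prev + (code - 1)
            i += 1
        out.append(n)
        prev = n
    return out
```

With this sketch a word present in all 4500 files costs one 4-nibble
escape plus 4499 single nibbles, i.e. 4503 nibbles or roughly
NoOfFiles/2 bytes.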
By grouping words alphabetically - using modulo 40 coding (that gives my
age away!) and splitting words into run lengths of 6, 9, 12 or 15+ chars
(not worth going above this in my case - maybe Finnish would be different)
- the worst-case binary search is only about 16/17 tests deep. In addition,
truncation/stemming becomes very cheap, as does the ability to provide
soundex and to handle the a* + b* type of query.
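
As a rough illustration of why truncation becomes cheap, here is a Python
sketch of modulo 40 packing with a prefix lookup by binary search. The
40-symbol alphabet, the fixed width of 6 chars, and all function names are
assumptions made for the example; the post does not say which symbols were
used. Three chars fit in 16 bits (40^3 = 64000), so a 6-char word fits in
32 bits.

```python
from bisect import bisect_left, bisect_right

# Hypothetical alphabet: pad (space) is 0 so that shorter words sort first.
ALPHABET = " abcdefghijklmnopqrstuvwxyz0123456789-'."
assert len(ALPHABET) == 40


def pack(word, width=6):
    """Pack a word into one base-40 integer, padded/truncated to width.

    Raises ValueError for characters outside the assumed alphabet.
    """
    value = 0
    for ch in word.lower()[:width].ljust(width):
        value = value * 40 + ALPHABET.index(ch)
    return value


def unpack(value, width=6):
    """Recover the (padded-stripped) word from a packed key."""
    chars = []
    for _ in range(width):
        value, digit = divmod(value, 40)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars)).rstrip()


def search_prefix(sorted_keys, prefix, width=6):
    """Return all packed keys whose word starts with prefix (the a* query)."""
    prefix = prefix[:width]
    low = pack(prefix, width)
    # Every completion of the prefix lies in one contiguous key range.
    high = low + 40 ** (width - len(prefix)) - 1
    return sorted_keys[bisect_left(sorted_keys, low):
                       bisect_right(sorted_keys, high)]


keys = sorted(pack(w) for w in ["swish", "swim", "sword", "stem", "index"])
matches = [unpack(k) for k in search_prefix(keys, "sw")]
```

A truncated query such as sw* is then two binary searches over the sorted
keys, and an a* + b* query combines the file lists of two such ranges.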
The role of the importance count is somewhat debatable - for single-word
searches the file sequence can instead be stored in importance order (this
adds a little to the run length).

Frances Coakley
Senior Lecturer, Rm 18BB02, Dept Elec Eng; University of Surrey, GUILDFORD,
GU2 7XH, UK
Tel +(0)1483 879129 email f.coakley@eim.surrey.ac.uk
Received on Mon Feb 19 16:29:45 2001