On Wed, 2003-05-07 at 19:19, John Movius wrote:
> I have a 100 Meg genealogy website (currently using an older version of
> SWISH) and I am interested in using the "fuzzy" indexing mode of SWISH-e
> on it in the near future. I understand that the SWISH-e fuzzy indexing
> feature provides a search similar to the "Soundex" search
SWISH-E implements the Soundex algorithm (as described by Don Knuth) as
well as Metaphone, DoubleMetaphone, and Stemming algorithms.
> However a fuzzy index is most likely to also be of
> substantially larger size than a normal SWISH-e index.
In the case of Soundex the index should be much smaller. Soundex, as
you probably know, reduces each word to a four-character code: the first
letter followed by three digits. Since each digit stands for several
letters, many different words are reduced to a single Soundex code.
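
To make the reduction concrete, here is a minimal Python sketch of the
commonly documented American Soundex rules. It is only an illustration:
the function name `soundex` is hypothetical, and the code inside SWISH-E
may differ in its details (for example in how it treats H and W).

```python
# Letter groups for the commonly documented Soundex digits 1..6.
# Vowels, Y, H and W have no digit.
GROUPS = ("bfpv", "cgjkqsxz", "dt", "l", "mn", "r")
CODES = {ch: str(digit)
         for digit, group in enumerate(GROUPS, start=1)
         for ch in group}


def soundex(word):
    """Return a 4-character Soundex code (letter + 3 digits) for word."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = [letters[0].upper()]
    prev = CODES.get(letters[0], "")   # the first letter's digit still counts
    for ch in letters[1:]:
        if ch in "hw":
            continue                   # H and W do not separate equal digits
        code = CODES.get(ch, "")       # "" for vowels and Y
        if code and code != prev:
            result.append(code)
            if len(result) == 4:
                break
        prev = code                    # a vowel resets prev to ""
    return "".join(result).ljust(4, "0")


for name in ("Robert", "Rupert", "Smith", "Smyth"):
    print(name, soundex(name))
```

Under these rules Robert and Rupert share one code, as do Smith and
Smyth, so each pair would occupy a single entry in the index.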
> Thus I am wondering if two SWISH-e indexes are needed to accomplish my
> goals ...
Yes. You might want to provide an index with soundex and an index
> My questions include: Has anyone on this SWISH-e list had any actual
> experience in using the fuzzy indexing feature of SWISH-e?
I use it a bit. I incorporated the Soundex algorithm into SWISH-E. So,
I'm probably the one responsible if it doesn't work as you expect. ;-)
> Is it possible to have two SWISH-e search engines installed and
> operating on the same web server?
You would simply need multiple index files...
> Has this been done with success ... i.e. are there any examples using it
> to look at on the WWW (only fuzzy? Fuzzy plus normal)?
"David" appears at the bottom of all pages. Davis and David are the
same Soundex code. So, with Soundex, all pages are returned when
searching for "Davis" whereas only the D and index pages are returned
when searching without Soundex.
Ideally, you would use some sort of metadata to restrict the Soundex
index to the genealogical data itself.
> Does anyone have any stats on the relative size of a regular SWISH-e
> index vs. a fuzzy SWISH-e index? I realize this could vary
$ ls -l genes*
-rw-r--r-- 1 augur users 51136 May 8 02:11 genes-no.idx
-rw-r--r-- 1 augur users 48277 May 8 02:08 genes.idx
genes-no.idx is without Soundex. genes.idx is with Soundex, and is about
5.6% smaller (48277 vs. 51136 bytes). This is an
extremely small dataset (since I have neglected my genealogy for several
In short, as your dataset grows, the Soundex index should become
increasingly smaller relative to the normal index.
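
A toy model of why this happens, sketched in Python. Everything here is
an assumption made for illustration: the page names, the sample words,
the `build_index` helper, and the compact `soundex` helper (common
American Soundex rules, not necessarily what SWISH-E does internally).
It says nothing about SWISH-E's on-disk format; it only counts distinct
index keys.

```python
from collections import defaultdict

GROUPS = ("bfpv", "cgjkqsxz", "dt", "l", "mn", "r")
CODES = {ch: str(d) for d, group in enumerate(GROUPS, start=1) for ch in group}


def soundex(word):
    """Compact Soundex helper: first letter plus up to three digits."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    out, prev = letters[0].upper(), CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue                    # H and W are skipped entirely
        code = CODES.get(ch, "")
        if code and code != prev:
            out += code
        prev = code
    return out[:4].ljust(4, "0")


def build_index(docs, key):
    """Map each key (plain word or Soundex code) to the pages containing it."""
    index = defaultdict(set)
    for page, text in docs.items():
        for word in text.split():
            index[key(word)].add(page)
    return index


# Hypothetical pages with spelling variants of the same surnames.
DOCS = {
    "page1": "Smith Robert Johnson",
    "page2": "Smyth Rupert Jonson",
    "page3": "Smythe Meyer Meier",
}

plain = build_index(DOCS, key=str.lower)
fuzzy = build_index(DOCS, key=soundex)

print("distinct keys:", len(plain), "plain vs.", len(fuzzy), "Soundex")
print("plain 'smith':", sorted(plain["smith"]))
print("Soundex 'smith':", sorted(fuzzy[soundex("smith")]))
```

The Soundex index holds fewer keys because spelling variants collapse
into one code, and for the same reason a single query matches more
pages. The more variant spellings a dataset contains, the larger the
gap should be.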
ICQ - 412039
Received on Thu May 8 06:30:30 2003