WordCharacters

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Aug 27 1999 - 23:57:28 GMT
I'm using version 1.3.1.

I'm curious what people are using for the WordCharacters settings for
indexing.

I'm a bit confused by the documentation.  The documentation at:
http://sunsite.berkeley.edu/SWISH-E/Manual/config.user.html
says these are the available characters:

abcdefghijklmnopqrstuvwxyz0123456789_\|/-+=?!@$%^'"`~,.[]{}()

And gives an example of:
WordCharacters abcdefghijklmnopqrstuvwxyz\?0123456789.@|,-'"[](~!@$%^{}_+?\\

I'm currently using these settings:

WordCharacters  abcdefghijklmnopqrstuvwxyz0123456789_-
BeginCharacters abcdefghijklmnopqrstuvwxyz0123456789_-
EndCharacters   abcdefghijklmnopqrstuvwxyz0123456789_-
IgnoreLastChar  )]}'.,;?!_-"
IgnoreFirstChar ([{'_-"


It doesn't seem likely that you would want '(' or ')' included in indexed
words, since parentheses are used to specify the order of evaluation in the
search string.

The string in the documentation would also seem to reduce the number of words
placed in the index (since it would split fewer strings of characters
into individual words), and make searching a bit tougher since punctuation
would be included in the index.
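
To make the difference concrete, here is a rough Python sketch of the
splitting. It is not Swish's actual tokenizer. It assumes only that any
character outside WordCharacters acts as a word separator, and it ignores
BeginCharacters, EndCharacters, IgnoreFirstChar and IgnoreLastChar. The
split_words helper, the sample text, and the LONG string (an approximation
of the documentation example) are made up for illustration.

```python
def split_words(text, word_chars):
    """Split text into words; any character not in word_chars is a separator."""
    words = []
    current = []
    for ch in text.lower():
        if ch in word_chars:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


# The short setting from this message.
SHORT = "abcdefghijklmnopqrstuvwxyz0123456789_-"
# Assumption: roughly the punctuation from the documentation example.
LONG = SHORT + ".@|,'\"[](~!$%^{}+?\\"

sample = "See config.user (version 1.3.1) or e-mail me!"

short_words = split_words(sample, SHORT)
long_words = split_words(sample, LONG)

print(short_words)
print(long_words)
```

With the short setting the sample splits into ten plain words. With the
longer one it yields seven, and some of them carry punctuation, such as
"(version" and "me!".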

Is there a reason to use the longer WordCharacters string?

There are a couple of things that Swish does (or doesn't do) with stemming
enabled.  First, wildcard search words (words with '*' at the end) do not
seem to get stemmed, so you don't get the results you would expect.

Second, any query that includes punctuation characters seems to just search
for that exact term, even if the characters are not part of the
WordCharacters settings.  Seems like the same rules should apply to the
search string fed into Swish as to the words that are indexed.
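
A small Python sketch of the mismatch, under the same assumption that
characters outside WordCharacters separate words. The tokenize helper, the
toy index and the sample document are hypothetical, and this is not how
Swish is implemented. It only shows why an exact-term lookup misses while a
query split by the indexing rules hits.

```python
WORD_CHARS = "abcdefghijklmnopqrstuvwxyz0123456789_-"


def tokenize(text, word_chars=WORD_CHARS):
    """Hypothetical helper: runs of word_chars are words, anything else separates."""
    words = []
    current = ""
    for ch in text.lower():
        if ch in word_chars:
            current += ch
        else:
            if current:
                words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


# Toy index: the set of words produced from one document.
index = set(tokenize("Read config.user before upgrading to 1.3.1"))

query = "config.user"

# Query used exactly as typed: "config.user" was never stored as a single word.
exact_hit = query in index

# Query split by the same rules as the documents: "config" and "user" are both present.
split_hit = all(term in index for term in tokenize(query))

print(exact_hit, split_hit)
```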




Bill Moseley
mailto:moseley@hank.org
Received on Fri Aug 27 16:59:16 1999