RE: WordCharacters

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sat Aug 28 1999 - 03:14:34 GMT
At 09:16 PM 8/27/99 -0500, "David Norris" <dave@webaugur.com> wrote:

>> WordCharacters  abcdefghijklmnopqrstuvwxyz0123456789_-
>
>For an example, these bits of text wouldn't be indexed with the above
>config:
>Jimbob's
>exposé or expos&eacute; or expos&#233;
>Scary!
>Really?
>Sadly.

Exactly.  With the WordCharacters setting above, these index as:

config
jimbob
scary
really
sadly

(exposé probably should be included, though.)

Those seem more likely to be search terms.  I guess I'd rather have too
many hits than miss some.  I apply the same rules to the input query, so if
someone searches for "config:" I query with "config" and find things that
were in the source docs as "config:" and also simply "config".  The other
way around, someone who searches for "config" misses "config:" (unless
they search for "config*", of course).
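
As a rough illustration of applying the same rules to both the documents
and the query, here is a minimal Python sketch.  It is not Swish's own
tokenizer; the function name split_words is made up for this example, and
it simply treats every character outside the WordCharacters list as a
separator.

```python
import re

# The WordCharacters value quoted at the top of this message.
WORD_CHARACTERS = "abcdefghijklmnopqrstuvwxyz0123456789_-"

def split_words(text, word_chars=WORD_CHARACTERS):
    """Lowercase the text and split it on any run of characters
    that are not listed in word_chars (hypothetical helper)."""
    pattern = "[^" + re.escape(word_chars) + "]+"
    return [w for w in re.split(pattern, text.lower()) if w]

# The same function is used for indexing and for the query, so
# "config:" in a document and "config:" in a query both become "config".
print(split_words("config:"))
print(split_words("Scary! Really? Sadly."))
```

Because documents and queries go through the same function, trailing
punctuation can no longer make the two disagree.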

>
>You need to have &, #, ;, a-z, and 0-9 in your WordCharacters to catch the
>SGML entities and various other character codes.  @ would probably be good
>idea if you plan to index email addresses and such.

Yes, perhaps.  But say someone@domain.com gets split into three words,
"someone", "domain", and "com", in the index.  A search for
someone@domain.com gets split the same way, is passed to swish as three
separate words, and would still find the email address.  Granted, it would
also find other documents that included those three words.  But without
splitting it up, searching for just "someone" or "domain", if that's all
you knew, would not get you the document.
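
The trade-off can be sketched with a toy inverted index in Python.  This
is only an illustration under assumptions: the file names, the sample
texts, and the helpers build_index and search are invented here, and the
search simply ANDs the split query words together.

```python
import re

WORD_CHARACTERS = "abcdefghijklmnopqrstuvwxyz0123456789_-"
_SEPARATORS = re.compile("[^" + re.escape(WORD_CHARACTERS) + "]+")

def split_words(text):
    """Split on anything outside WORD_CHARACTERS (so '@' and '.' split)."""
    return [w for w in _SEPARATORS.split(text.lower()) if w]

def build_index(docs):
    """Map each word to the set of document names that contain it."""
    index = {}
    for name, text in docs.items():
        for word in split_words(text):
            index.setdefault(word, set()).add(name)
    return index

def search(index, query):
    """Return the documents that contain every word of the split query."""
    words = split_words(query)
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits

# Made-up sample documents.
docs = {
    "a.html": "Write to someone@domain.com for details.",
    "b.html": "Someone registered a new domain, example.com.",
    "c.html": "Nothing relevant here.",
}
index = build_index(docs)

print(search(index, "someone@domain.com"))  # finds a.html, plus b.html as an extra hit
print(search(index, "domain"))              # a partial address still finds a.html
```

The extra hit on b.html is the "too many hits" cost; the second query is
the benefit, since a partial address still finds the document.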

>>...wild card search words (words with '*' at the end) do not
>> seem to get stemmed, so you don't get the results you would expect.
>
>What behavior would you expect with wild cards?  The stemming algorithm
>depends on a word with a letter ending.

True, and a wild card search term may not stem.  Or it may.

In the index all words are stemmed, so searching with a wild card word
that has not been stemmed first really fails to work.  I'd expect the word
to be stemmed, and then any word in the (stemmed) index that starts with
those letters to match.  I do this now in pre-processing.

>And, stemming wouldn't be needed
>with a wildcard in place.  The algorithm can't stem a word if that word ends
>in an infinite length of random characters.  Besides, that infinite length
>of random characters would likely contain the stem depending on the wildcard
>placement and word.

True.  I can't think of a very good example.  Well, how about searching for
"configuration*"?

E:\swish>stem configuration
configur
E:\swish>stem configurations
configur

But since Swish doesn't stem "configuration*" to "configur*" it won't find
what you would expect, namely "configuration" and "configurations".
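
A minimal Python sketch of that pre-processing step follows.  The toy_stem
function is a deliberately crude stand-in that only strips a few suffixes
so that the example above reproduces; it is not the stemmer Swish uses,
and the helper names and sample word list are assumptions for
illustration.

```python
def toy_stem(word):
    """Crude stand-in stemmer: strips a few suffixes only.
    NOT the real stemming algorithm, just enough for this example."""
    for suffix in ("ations", "ation", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def wildcard_matches(query, stemmed_index):
    """Stem the query first, then match it against the stemmed index.
    A trailing '*' turns the stemmed query into a prefix match."""
    if query.endswith("*"):
        prefix = toy_stem(query[:-1])
        return sorted(w for w in stemmed_index if w.startswith(prefix))
    stem = toy_stem(query)
    return [stem] if stem in stemmed_index else []

# The index stores stems only, as described above.
source_words = ["configuration", "configurations", "config", "index"]
stemmed_index = {toy_stem(w) for w in source_words}

# Without stemming, the prefix "configuration" matches no stored stem.
unstemmed = [w for w in stemmed_index if w.startswith("configuration")]
print(unstemmed)

# With stemming, "configuration*" becomes "configur*" and matches.
print(wildcard_matches("configuration*", stemmed_index))
```

Stemming the part before the '*' means the wild card query is compared
against the index in the same form the index was built in.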


>Otherwise it will index words incorrectly.  According to your configuration:
>the string "this is an example." contains four words.  "this", "is", "an",
>"example."  Notice, that the last word has punctuation mark at the end.  It
>would be indexed and searched as such.  Any search for "example" would not
>return "example." and vice-versa.

I guess if I was searching for the word "example" I would want to see any
documents that contain that word, regardless of where the word was in a
sentence.  It would be nice, indeed, if indexing made multiple index
entries, "example." and "example", from the same word in the source
document.  But then we get a big and slow index.

I'd guess how one indexes is dependent on what type of queries are expected.

Bill Moseley
mailto:moseley@hank.org
Received on Fri Aug 27 20:15:12 1999