On Fri, Jun 03, 2005 at 12:39:54PM -0700, Brad Miele wrote:
> So it seems like it is stemming the word and then comparing it against the
> stopwords. Does this seem like a correct assessment?
Yes, the logic is wrong. When searching, stopwords are removed after
stemming is applied; when indexing, they are removed before.
When searching the code goes something like this:
    parse_swish_query()
        tokenize_query_string()
            tokenize by white space and operator characters
            lower case words
            check for buzzwords
        parse_swish_words() -- convert into swish word:
            apply TranslateChars
            tokenize again based on wordcharacters/begin/endchars.
                (where stopwords were removed before)
            limit by max word size
            apply fuzzy translation
            remove stopwords
So, yes, the IgnoreWords list is applied after stemming when
searching. I think it can be debated which should be done first:
fuzzy translation or stopword removal. With stemming you probably
want stopword removal to come after ("IgnoreWords run" should then
remove all forms, such as "runs" and "running"), but with things like
soundex you would want it to come before (you don't want to enter
soundex codes into your stopword list).
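
To make the ordering trade-off concrete, here is a minimal sketch in
Python. It is not Swish-e code: toy_stem, toy_soundex, stop_then_fuzzy
and fuzzy_then_stop are hypothetical helpers invented for illustration,
and the stemmer and soundex are deliberately simplified.

```python
# Toy illustration only: none of these helpers are Swish-e code.

def toy_stem(word):
    """Hypothetical stemmer: strip a few common English suffixes."""
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_soundex(word):
    """Simplified Soundex-like code: first letter plus up to three digits."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    digits = []
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            digits.append(digit)
        prev = digit
    return (word[0].upper() + "".join(digits) + "000")[:4]


def stop_then_fuzzy(words, stopwords, fuzzy):
    """Remove stopwords first, then apply the fuzzy translation."""
    return [fuzzy(w) for w in words if w not in stopwords]


def fuzzy_then_stop(words, stopwords, fuzzy):
    """Apply the fuzzy translation first, then remove stopwords."""
    return [f for f in map(fuzzy, words) if f not in stopwords]


# Stemming: checking stopwords after stemming removes every form of "run".
stem_query = ["run", "runs", "running", "fast"]
stem_before = stop_then_fuzzy(stem_query, {"run"}, toy_stem)
stem_after = fuzzy_then_stop(stem_query, {"run"}, toy_stem)

# Soundex: checking stopwords after translation misses "the", because the
# stopword list holds plain words while the query now holds codes.
sdx_query = ["the", "cat"]
sdx_before = stop_then_fuzzy(sdx_query, {"the"}, toy_soundex)
sdx_after = fuzzy_then_stop(sdx_query, {"the"}, toy_soundex)

print(stem_before, stem_after)
print(sdx_before, sdx_after)
```

With the stemmer, only the stem-then-stop order drops "runs" and
"running"; with the soundex-like codes, only the stop-then-translate
order drops "the".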
This comes back to the issue of the query parser needing a rewrite.
It might be possible to just move the stopword check back to where
stopwords were removed before, but there are some notes in the source
about why that's not done, so I'd need to check up on that first.
Indexing goes something like this:
    indexstring()
        next_word()
            tokenize by white space
            lower case word
            check for buzzwords
        next_swish_word()
            tokenize into "swish words" based on Wordcharacters, etc.
            make sure word starts with begin/endchars.
            limit by
                stopwords
                word length
                consecutive digits
                consecutive vowels
                consecutive consonants
        apply fuzzy translation
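
The mismatch between the two orderings can be sketched as follows. This
is a minimal Python illustration, not Swish-e code: toy_stem,
index_words and search_words are hypothetical names, and the stemmer is
a crude stand-in for the real fuzzy translation.

```python
# Toy illustration only: none of these helpers are Swish-e code.

def toy_stem(word):
    """Hypothetical stemmer: strip a few common English suffixes."""
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_words(text, stopwords):
    """Indexing order from the outline: stopwords first, then stemming."""
    return {toy_stem(w) for w in text.lower().split() if w not in stopwords}


def search_words(query, stopwords):
    """Search order from the outline: stemming first, then stopwords."""
    stems = [toy_stem(w) for w in query.lower().split()]
    return [s for s in stems if s not in stopwords]


stopwords = {"run"}

# "running" is not itself a stopword, so it is indexed under its stem "run".
indexed = index_words("Running shoes", stopwords)

# The same word in a query is stemmed to "run" and then dropped as a
# stopword, so the query never reaches the indexed term.
query_terms = search_words("running", stopwords)

print(sorted(indexed), query_terms)
```

In this sketch the index contains "run", yet a search for "running"
ends up with no terms at all, which is the kind of inconsistency
described above.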
--
Bill Moseley
moseley@hank.org
Received on Sun Jun 5 09:06:24 2005