At 08:41 AM 04/05/00 -0700, Jose Manuel Ruiz wrote:
>This is how I have implemented it: word position is always
>incremented (if there is a stopword, it is incremented too).
>In fact, word position is also incremented when any other
>character that is neither a blank nor a new-line is found.
This is really hard. I think you have to define words as swish does using
WordCharacters and IgnoreFirst and IgnoreLast. You can't use new-line
because it's often html that's indexed, of course.
I would, though, think it would be good to be able to define a few
word-ending characters (such as a period or a comma) that would bump up the
word position.
This would allow people, for example, to define whether a phrase could match
across sentences or not without having to explicitly type the period in
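
To make the idea concrete, here is a rough Python sketch (not swish code) of
assigning word positions with an extra bump at sentence-ending characters.
WORD_CHARS, BUMP_CHARS and BUMP_GAP are made-up names for illustration;
WORD_CHARS only loosely stands in for what WordCharacters does in swish.

```python
# Hypothetical settings, not actual swish configuration.
WORD_CHARS = "abcdefghijklmnopqrstuvwxyz0123456789"
BUMP_CHARS = ".!?"   # assumed "word ending" characters
BUMP_GAP = 1         # extra positions skipped at each bump character


def word_positions(text, start=1):
    """Return (word, position) pairs, leaving a gap after bump characters."""
    positions = []
    pos = start
    word = []
    # A trailing space flushes the last word.
    for ch in text.lower() + " ":
        if ch in WORD_CHARS:
            word.append(ch)
            continue
        if word:
            positions.append(("".join(word), pos))
            pos += 1
            word = []
        if ch in BUMP_CHARS:
            # Skip a position so words on either side are not adjacent.
            pos += BUMP_GAP
    return positions


print(word_positions("Joe left. Mary stayed"))
```

With the gap in place, "left" and "mary" are two positions apart, so a phrase
search that requires consecutive positions would not match across the period.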
>I made a minor change here in the code, so you can
>define the rules in swish.h (using a simple #define clause).
>So, if you define "<and>" as the rule instead of "and", and if
>"and" is not a stopword, you can find "Joe and Mary".
>Anyway, if you have stopwords in the index file, you cannot search
>for phrases that contain stopwords. This is, for example, how
>Verity's Search Information Server (a commercial searcher) works.
>You do need to store stopwords with their position in the index file
>for phrase search.
I'm confused ;) Isn't a stop word by definition a word that's not in the
index?
Why do you need the stop words in the index? Say you have the phrase
text:  "...Swish  is  a  search  engine...."
word:      4      -   -  5       6
where "is" and "a" are stop words.
Searching for: swish is a search engine
will throw out "is" and "a" in the search. Without quotes around the
search phrase, it will find any documents that have "swish", "search", and
"engine". But searching with quotes will require that the words found are
sequential and in the correct order.
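
Here is a small Python sketch of what I mean (again, not swish code; STOPWORDS,
build_index and phrase_search are names invented for this example). Stop words
are dropped at index time and do not take up a position, and the same words
are dropped from the query before checking for consecutive positions.

```python
STOPWORDS = {"is", "a"}  # assumed stop word list for the example


def build_index(docs):
    """Map word -> {doc_id: [positions]}, skipping stop words entirely."""
    index = {}
    for doc_id, text in docs.items():
        pos = 0
        for word in text.lower().split():
            if word in STOPWORDS:
                continue  # stop words are not stored and get no position
            pos += 1
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, phrase):
    """Return doc ids where the non-stop words appear at consecutive positions."""
    words = [w for w in phrase.lower().split() if w not in STOPWORDS]
    if not words:
        return set()
    hits = set()
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            if all(start + i in index.get(w, {}).get(doc_id, [])
                   for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits


docs = {1: "swish is a search engine", 2: "search swish engine"}
index = build_index(docs)
print(phrase_search(index, "swish is a search engine"))
```

One consequence of this scheme is that the quoted phrase "swish search engine"
would also match document 1, since the index cannot tell whether stop words
sat between the stored words.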
I'm really curious to see how big the index becomes with all the word
Received on Wed Apr 5 12:28:34 2000