At 07:10 AM 02/25/02 -0800, Gaye Karagulle wrote:
>yes, in fact I don't want to use the indexing feature
>of swish, because I should do that on my own. I just
>want some features, if they exist, that will be
>helpful in creating document vectors to compare to the
>query vectors. For example, stemming seems like a very
>complex topic to me, and I think swish has a feature
>like this. that's why I want help from you.
Ok. Swish uses a parser to extract words (libxml2 is the better parser
option), then applies stemming to those words using Porter's stemming
algorithm (whose usefulness many will debate), and then basically
tallies up the words, keeping track of position (for phrase matching),
"structure", which describes the HTML context of the word (<title> words
should rank higher), and also a "metaID", which is a way to place words in
fields so searching can be limited to just those fields.
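
As a rough illustration of that tallying step, here is a minimal Python
sketch. It is not swish's code or data format: the function names, the
(structure, meta, text) input tuples and the field names are all made up
for the example, and placeholder_stem is a crude suffix stripper standing
in for a real Porter stemmer.

```python
import re
from collections import defaultdict


def placeholder_stem(word):
    # Assumption: crude stand-in for illustration only, NOT Porter's algorithm.
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tally_words(chunks, stem=placeholder_stem):
    """Map each stemmed word to a list of (position, structure, meta).

    chunks is a hypothetical input format: (structure, meta, text) tuples,
    where structure names the HTML context and meta names the field.
    """
    postings = defaultdict(list)
    position = 0  # running word position across the whole document
    for structure, meta, text in chunks:
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            position += 1
            postings[stem(word)].append((position, structure, meta))
    return dict(postings)


chunks = [
    ("title", "subject", "Indexing documents"),
    ("body", "content", "Swish indexes words in documents"),
]
index = tally_words(chunks)
```

With the positions recorded, a phrase match amounts to finding words
whose positions are consecutive, and the structure and meta values let
you weight or filter occurrences.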
You can do all that yourself (for example, there are Perl modules included
with swish that do search term highlighting, and some of them try to
emulate the way swish indexes to find the words to highlight). For
stemming, I have a Perl module available on CPAN that uses the same code
swish uses to stem words, so you can stem words as needed (which you will
need to do if you want to query your index).
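
To show why the query has to be stemmed the same way as the documents,
here is a small Python sketch that builds term-count vectors and compares
them with cosine similarity. The names are hypothetical, the stemmer is
again a crude placeholder rather than Porter's algorithm, and cosine
similarity is just one common way to compare vectors, not something swish
does for you.

```python
import math
import re
from collections import Counter


def placeholder_stem(word):
    # Assumption: crude stand-in for illustration only, NOT Porter's algorithm.
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def to_vector(text, stem=placeholder_stem):
    # Term-count vector: stemmed word -> number of occurrences.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(stem(w) for w in words)


def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


doc_text = "Swish indexes words in documents"
query_text = "indexing document"

# Same stemmer on both sides: "indexes"/"indexing" and
# "documents"/"document" collapse to shared terms.
stemmed_score = cosine(to_vector(doc_text), to_vector(query_text))

# No stemming at all: the surface forms differ, so nothing matches.
identity = lambda w: w
raw_score = cosine(to_vector(doc_text, identity), to_vector(query_text, identity))
```

Here stemmed_score is greater than zero while raw_score is zero, which
is the whole reason to run documents and queries through the same
stemmer.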
Using swish to index and then parsing the output from -T index_words_full
will make things easier, I suppose.
Received on Mon Feb 25 15:46:55 2002