RE: Word Locations

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Mar 01 2001 - 13:26:07 GMT
At 04:54 AM 03/01/01 -0800, Rainer.Scherg@rexroth.de wrote:
>This is not an easy task...

A number of new headers are available in the CVS version of swish to help
with this, too.  WordCharacters (and BeginChars, EndChars, IgnoreFirst...)
are displayed for each searched index by the new -X (extended (or is it
eXtreme?)) header switch.  -X also prints the "Parsed" query string, which
shows the query as swish sees it (split by wordcharacters, stopwords
removed, words stemmed).

>Today, IMO the best way to do so is the following:
>  - implement the following into the search cgi.
>  - when clicking on the document (e.g. html)
>    -> process searchwords by replacing
>       each found searchword in the doc with
>       the text you want to have...
>
>    like:
>       s#(searchword)#<FONT...>\$1</FONT>#g

But that doesn't work so well for highlighting phrases.  I currently parse
the source document into an array of tokens, then map that array to a list
of "swish words", meaning the text split by wordcharacters with stopwords
and HTML tags removed.  I do that for each meta tag in the document so I
can highlight the right part of the document.  I also parse the document
into "layers", so that I can search for part of an href URL and then
highlight the entire <A> tag link text.  With this information you can do
phrase highlighting, too, since you now know the position of each word.
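
To make the token/position idea concrete, here is a rough Perl sketch.  It
is not the swish parser and not my actual code: the word characters
(A-Z, a-z, 0-9), the stopword list, the <b> markup and the sub names are
all assumptions picked for illustration, and it skips stemming, meta tags
and the "layers" handling entirely.

```perl
use strict;
use warnings;

# Assumed settings; a real index has its own WordCharacters and stopwords.
my $TOKEN_RE = qr/(<[^>]*>|[A-Za-z0-9]+|[^<A-Za-z0-9]+|<)/;
my %STOP     = map { $_ => 1 } qw(the a of and);

# Split text into tags, word runs, and everything else.  Joining the
# tokens back together reproduces the original text.
sub tokenize {
    my ($text) = @_;
    my @tokens = $text =~ /$TOKEN_RE/g;
    return @tokens;
}

# Map tokens to "swish words": lowercased, tags and stopwords dropped.
# Each entry is [ word, index of the token it came from ].
sub swish_words {
    my ($tokens) = @_;
    my @words;
    for my $i (0 .. $#$tokens) {
        my $t = $tokens->[$i];
        next unless $t =~ /^[A-Za-z0-9]+$/;   # skips tags, spaces, punctuation
        my $w = lc $t;
        next if $STOP{$w};
        push @words, [ $w, $i ];
    }
    return @words;
}

# Wrap every word token that falls inside a phrase match in <b>...</b>.
sub highlight_phrase {
    my ($text, $phrase) = @_;
    my @tokens = tokenize($text);
    my @words  = swish_words(\@tokens);
    my @query  = map { $_->[0] } swish_words([ tokenize($phrase) ]);
    return $text unless @query;

    my %mark;
    for my $start (0 .. @words - @query) {
        my $hit = 1;
        for my $k (0 .. $#query) {
            if ($words[$start + $k][0] ne $query[$k]) { $hit = 0; last; }
        }
        next unless $hit;
        # Map the match back to token positions in the source document.
        my ($first, $last) = ($words[$start][1], $words[$start + $#query][1]);
        for my $i ($first .. $last) {
            $mark{$i} = 1 if $tokens[$i] =~ /^[A-Za-z0-9]+$/;
        }
    }
    return join '',
        map { $mark{$_} ? "<b>$tokens[$_]</b>" : $tokens[$_] } 0 .. $#tokens;
}

print highlight_phrase('See the <i>Word</i> Locations of the index.',
                       'word locations'), "\n";
```

The phrase still matches even though an HTML tag sits between the two
words in the source, which is exactly where a plain s###g substitution
falls over.  Stopwords inside a matched span get highlighted along with
the rest, because the match is mapped back to token positions rather than
to the query words themselves.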

That can be a lot of processing for each document returned in a search
result, so then you have to think about pre-parsing the docs.

And even then it's a guess, at best, since complicated queries can cause a
word to be highlighted in a document even though that word is not what
selected the document.

Since I'm trying to imitate the swish indexing parser, it might be nice to
have the parser available as part of the swish C library.  But Perl makes
this stuff easy, too.



Bill Moseley
mailto:moseley@hank.org
Received on Thu Mar 1 13:31:23 2001