Re: Getting position of word in file

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed May 07 2003 - 19:18:40 GMT
On Wed, May 07, 2003 at 08:12:20PM +0200, Philipp Roessler wrote:

> Hmm, I wish it was easier, but I see the problem with a huge number of 
> word positions.
> Anyway, I need some sort of context showing with search results. I was 
> really impressed when I saw that a search in swish-e.org/Discussion 
> does that in a very nice manner;
> am I right that this is done in Perl -  so to say a generic 'grep'?

There's a Perl script included in the distribution called swish.cgi that loads a module to
do the search term highlighting.  There are three different modules for different types of
highlighting.  They work in different ways, but all operate on the text of the document that
is stored in a property (normally by using the StoreDescription directive).

They range from one that does a simple s/word/${highlight_on}word${highlight_off}/g type of
thing (the fastest of the three) to one that parses the document, using the WordCharacters,
IgnoreFirstChar, and IgnoreLastChar settings returned in the header of the search results to
split it up.  That one skips stopwords (IgnoreWords), works with a stemmed index (so
searching for "run" finds and highlights "runs" and "running"), and highlights phrases.
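The simple substitution approach can be sketched roughly as follows.  This is only an
illustration, not the code from swish.cgi: the simple_highlight name and the marker values
are assumptions, and it matches whole words case-insensitively without any stemming.

```perl
use strict;
use warnings;

# Hypothetical markers; the real highlighting modules take theirs from config.
my $highlight_on  = '<b>';
my $highlight_off = '</b>';

# Wrap every whole-word, case-insensitive match of any query word.
sub simple_highlight {
    my ( $text, @words ) = @_;
    return $text unless @words;

    # quotemeta keeps user input from being read as regex syntax.
    my $alternation = join '|', map { quotemeta } @words;
    $text =~ s/\b($alternation)\b/$highlight_on$1$highlight_off/gi;
    return $text;
}

print simple_highlight( 'Run the search, then run it again.', 'run' ), "\n";
# prints: <b>Run</b> the search, then <b>run</b> it again.
```

Note that "running" would not be highlighted here, which is exactly the accuracy given up
for speed.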

I also have Perl code not in the distribution that will highlight terms and phrases in HTML 
source, which means parsing the HTML and trying to be smart about matching up tags (think 
about highlighting a phrase which includes a word where each letter is marked-up as a 
different color).

One has to pick speed vs. accuracy.  Here are some notes from swish.cgi on selecting the
different highlighting modes, and on using mod_perl (via the SWISH::API module) instead of
mod_cgi calling the swish-e binary.  Tests were run using Apache Benchmark.


       Here are some general requests/second figures on an Athlon XP 1800+
       with 1/2GB RAM, Linux 2.4.20.

                                      Highlighting Mode

                                 None     Phrase    Default     Simple
          ---------------------------------------------------------------
          Using SWISH::API        45       1.5        2          12
          Using swish-e binary    12       1.3       1.8         7.5

None means no highlighting -- just display the first X words or characters of the property.

Phrase is the most aggressive: it does context highlighting of phrases and takes stemming, 
wildcards, and stopwords into account.

Default does context highlighting and takes stemming into account, but not phrases or 
stopwords.

Simple just shows the first X words, then uses a regular expression to highlight words 
IF they happen to be in those first X words.  That may not be the case, so there may 
not be any words highlighted.
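To make the difference between context highlighting and the Simple mode concrete, here is a
minimal sketch.  It assumes a single search word, splits on whitespace only, and ignores
stemming and phrases; context_snippet and its $window parameter are hypothetical names, not
part of swish.cgi.

```perl
use strict;
use warnings;

# Return up to $window words on each side of the first match,
# or the first words of the text when nothing matches.
sub context_snippet {
    my ( $text, $word, $window ) = @_;
    my @tokens = split ' ', $text;

    # Index of the first token equal to $word, ignoring surrounding punctuation.
    my ($hit) = grep { $tokens[$_] =~ /^\W*\Q$word\E\W*$/i } 0 .. $#tokens;

    unless ( defined $hit ) {
        # No match: fall back to the start of the text, as with no highlighting.
        my $last = $#tokens < 2 * $window ? $#tokens : 2 * $window;
        return join ' ', @tokens[ 0 .. $last ];
    }

    my $from = $hit - $window < 0        ? 0        : $hit - $window;
    my $to   = $hit + $window > $#tokens ? $#tokens : $hit + $window;
    $tokens[$hit] =~ s/(\Q$word\E)/<b>$1<\/b>/i;
    return join ' ', @tokens[ $from .. $to ];
}

print context_snippet( 'one two three four five six seven', 'five', 1 ), "\n";
# prints: four <b>five</b> six
```

Because the snippet is built around the match, the highlighted word is always visible, but
the whole text has to be split and scanned first.  That is where the time goes.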

45 requests (searches) per second isn't too bad, but 1.3 is slow.

> >But, the way I do term highlighting and context results by parsing and
> >searching the source documents is very slow.  So it would be great to come
> >up with a better system.
> 
> What kind of system could that be?

Exactly.

One way to speed up the current system would be to pre-parse the documents (that is, to do 
the work of splitting a document up into tokens ahead of time, before it is searched for 
words or phrases to highlight).  That's not much of a solution, though, just an improvement.
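As a rough sketch of what pre-parsing could look like: tokenize once, record each word's
offset, and at search time only look up the stored tokens.  Everything here is an assumption
for illustration: the $word_chars value stands in for the WordCharacters setting, and the
tokenize and highlight_tokens names are made up.

```perl
use strict;
use warnings;

# Hypothetical WordCharacters value; the real one comes from the index header.
my $word_chars = 'abcdefghijklmnopqrstuvwxyz0123456789';

# Done once per document: record each word with its character offset.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    while ( $text =~ /([\Q$word_chars\E]+)/gi ) {
        push @tokens, { word => lc $1, start => $-[1], len => length $1 };
    }
    return \@tokens;
}

# Done per search: wrap matching tokens using the stored offsets.
sub highlight_tokens {
    my ( $text, $tokens, %wanted ) = @_;

    # Work backwards so earlier offsets stay valid after each insert.
    for my $t ( reverse @$tokens ) {
        next unless $wanted{ $t->{word} };
        my $orig = substr( $text, $t->{start}, $t->{len} );
        substr( $text, $t->{start}, $t->{len}, "<b>$orig</b>" );
    }
    return $text;
}

my $text   = 'Run, then run again.';
my $tokens = tokenize($text);    # this result is what would be stored
print highlight_tokens( $text, $tokens, run => 1 ), "\n";
# prints: <b>Run</b>, then <b>run</b> again.
```

The token list would have to be stored somewhere alongside the document text, which costs
space, and the per-search matching work is still there.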

-- 
Bill Moseley
moseley@hank.org
Received on Wed May 7 19:22:42 2003