Re: contexts of hits

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu May 09 2002 - 15:05:52 GMT
At 04:09 AM 05/09/02 -0700, Eneko Agirre wrote:
>I am using swish-e 2.0. I want to retrieve the context of occurrences of the
>hits, i.e. if I am searching "agency" I want to retrieve the sentence (or at
>least N surrounding words) where "agency" or "agencies" occur. I had two
>possible solutions in mind:
>
>a) Get the offsets of each of the hits of the search terms, and make an
>external routine to get the contexts.

Swish knows word positions, but they are positions used for phrase
matching, not really physical word positions.  It would be difficult to map
those word positions to words in your source documents.  You have to take
into consideration how swish parses source documents, and all the various
config settings like stopwords and word characters, and the settings that
can modify when the word position is bumped.
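To illustrate the mismatch, here is a minimal Python sketch.  It is not
swish's parser: the tokenizing rule, the stopword list, and the choice to let
stopwords consume no position are all assumptions made up for the example.
It only shows how the position an indexer records can drift away from the
position you would count in the source text.

```python
import re

def physical_words(text):
    """Whitespace-separated words, as a reader would count them."""
    return text.split()

def indexed_words(text, stopwords, word_chars="a-z0-9"):
    """Hypothetical indexer (an assumption, not swish's real rules):
    lowercase the text, split on anything outside word_chars, and drop
    stopwords without giving them a position."""
    tokens = re.findall(f"[{word_chars}]+", text.lower())
    return [t for t in tokens if t not in stopwords]

text = "The travel agency's e-mail is on the web"
phys = physical_words(text)
idx = indexed_words(text, stopwords={"the", "is", "on"})

print(phys.index("agency's"))  # 2
print(idx.index("agency"))     # 1
print(phys.index("web"), idx.index("web"))  # 7 5
```

The two counts disagree after the first stopword, and the gap changes again
wherever the word-character setting splits a token such as "e-mail".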

Swish could potentially return a huge amount of data in search results, too.

>b) Introduce metatags in the source texts and make swish-e return the whole
>sentence.

You mean wrap every sentence in a metatag?  That wouldn't work currently as
swish doesn't report back what metatag a word was found in.  It works the
other way around -- you tell swish what metatag to search in.


>Is either of these solutions easily implementable?

No.  But if you upgrade to 2.1-dev and use the included CGI script it will
do context output with term highlighting.  Go to the example directory and
type "perldoc swish.cgi".

There are three or four different highlighting modules to pick from depending
on your needs (speed vs. accuracy).

2.1-dev can store the text of documents it parses in a property (and
there's a special purpose property for doing this: see the StoreDescription
setting).  If you build swish with zlib support then swish will store the
text compressed, saving a bit of disk space.

The modules vary in how they work.  One, I think, does simple regular
expression replacement, and another will highlight phrases and
basically parse the document in a way very similar to how swish parses.
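
As a rough idea of the simple regular-expression approach, here is a Python
sketch that pulls N surrounding words around each match from stored document
text and marks the hit.  It is not the code of any of the bundled modules;
the function name contexts, the <b> markers, and the sample text are
assumptions for illustration only.

```python
import re

def contexts(text, pattern, n_words=3):
    """Return one snippet per matching word, with n_words of context on
    each side.  Matching is per whitespace-separated token, so trailing
    punctuation is highlighted along with the word."""
    words = text.split()
    rx = re.compile(pattern, re.IGNORECASE)
    snippets = []
    for i, word in enumerate(words):
        if rx.search(word):
            start = max(0, i - n_words)
            window = words[start:i + n_words + 1]
            window[i - start] = "<b>" + word + "</b>"
            snippets.append(" ".join(window))
    return snippets

text = ("The agency opened in May. "
        "Two other agencies followed later that year.")
for snippet in contexts(text, r"\bagenc(y|ies)\b", n_words=2):
    print(snippet)
# The <b>agency</b> opened in
# Two other <b>agencies</b> followed later
```

This kind of matching is fast but knows nothing about phrases, stemming, or
the indexer's word-character settings, which is the accuracy trade-off.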

I've also indexed large documents that were formatted into small sections
and I wrapped the sections in <div> and <a name> tags and then used swish
(and an external HTML parser) to index the sections as individual
"documents".  The advantage is that search results are not only focused,
but clicking on the links in the search results goes right to that section
of the document.
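
A sketch of the splitting step, in Python rather than the parser I actually
used: it assumes each section starts with an <a name="..."> anchor and
collects the text that follows until the next anchor.  The class name, the
split_sections helper, and the sample HTML are made up for the example, and
how each section is then handed to swish is left out.

```python
from html.parser import HTMLParser

class SectionSplitter(HTMLParser):
    """Collect text under each <a name="..."> anchor (assumed layout)."""

    def __init__(self):
        super().__init__()
        self.sections = {}    # anchor name -> list of text chunks
        self.current = None   # anchor currently being filled

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            name = dict(attrs).get("name")
            if name:
                self.current = name
                self.sections[name] = []

    def handle_data(self, data):
        # Ignore text before the first anchor and whitespace-only chunks.
        if self.current is not None and data.strip():
            self.sections[self.current].append(data.strip())

def split_sections(html, base_url):
    """Map 'base_url#anchor' to the text of that section."""
    parser = SectionSplitter()
    parser.feed(html)
    parser.close()
    return {f"{base_url}#{name}": " ".join(chunks)
            for name, chunks in parser.sections.items()}

html = """
<div><a name="intro"></a><h2>Intro</h2><p>Swish indexes documents.</p></div>
<div><a name="setup"></a><h2>Setup</h2><p>Build with zlib support.</p></div>
"""
docs = split_sections(html, "manual.html")
for url, body in docs.items():
    print(url, "->", body)
# manual.html#intro -> Intro Swish indexes documents.
# manual.html#setup -> Setup Build with zlib support.
```

Because each "document" carries its own fragment URL, a hit links straight to
the section rather than to the top of the page.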



-- 
Bill Moseley
mailto:moseley@hank.org
Received on Thu May 9 15:07:25 2002