Re: [SWISH-E:280] Re: Swish comments

From: Brendan Jones <brendan(at)not-real.mpce.mq.edu.au>
Date: Fri May 08 1998 - 03:39:55 GMT

Paul Lucas wrote:

> > Context becomes a difficult thing when you are doing a search on more
> > than one term.  Do you provide context around all search terms?
> 
> 	And why not?

Because I think the search output becomes messy and no longer flows.

If people search on four terms, you're providing four separate context
outputs for each document: disconnected or potentially overlapping
slabs of text which swamp the reader and may provide no more
illumination than my preferred method of document title and first 100
words.  All IMO, of course.

> > Grep is unsatisfactory and leads to disconnected output.  The title of the
> > document and the opening lines generally provide enough context information.
> 
> 	This also gets "messy" in that, if you want to do a good job,
> 	you not only have to strip out the HTML, but also anything
> 	between <SCRIPT>...</SCRIPT> since HTML files that contain
> 	scripts usually have them at the beginning.

Stripping out HTML, which my interface does, is easy.  It even makes sure
that it chops any trailing, unclosed, HTML tags that span the final line
of the document read in to compose the first 100 content words.
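As an illustration of the approach described above, here is a minimal
sketch in Python (the interface discussed here is written in Perl, and
this is not its code).  The function name `summary`, the regular
expressions, and the entity decoding step are all assumptions made for
the example.  It removes complete tags, drops a tag left unclosed at the
end of the chunk that was read, and keeps the first 100 words.

```python
import html
import re

# A complete tag such as <p> or <a href="...">.
TAG = re.compile(r"<[^>]*>")
# A tag that was opened but cut off where the read chunk ends.
UNCLOSED_TAIL = re.compile(r"<[^>]*\Z")


def summary(html_chunk, max_words=100):
    """Return the first max_words content words of an HTML chunk (hypothetical helper)."""
    text = UNCLOSED_TAIL.sub("", html_chunk)  # chop a trailing, unclosed tag
    text = TAG.sub(" ", text)                 # replace complete tags with a space
    text = html.unescape(text)                # decode entities such as &amp;
    words = text.split()
    return " ".join(words[:max_words])
```

A literal "<" in the text near the end of the chunk would also be treated
as an unclosed tag by this sketch, so it errs on the side of chopping too
much rather than too little.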

My interface won't chop out content between <SCRIPT>...</SCRIPT>, but then
again, no site I maintain has any JavaScript or applets.

Chopping out content between <SCRIPT>...</SCRIPT> would be a relatively
trivial Perl modification if I ever needed to implement it.  Not messy at
all.
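
To give a sense of how small such a change could be, here is a hedged
sketch in Python rather than Perl.  The name `strip_scripts` and the
pattern are assumptions for the example, not part of any existing
interface.  It would run before tags are stripped, and it also drops a
script block that is still open when the chunk ends, since only the
start of each document is read.

```python
import re

# Match <script ...> through </script>, case-insensitively and across lines.
# If the closing tag is missing (chunk cut short), match to the end of input.
SCRIPT_BLOCK = re.compile(
    r"<script\b[^>]*>.*?(?:</script\s*>|\Z)",
    re.IGNORECASE | re.DOTALL,
)


def strip_scripts(html_chunk):
    """Remove script elements and their contents (hypothetical helper)."""
    return SCRIPT_BLOCK.sub(" ", html_chunk)
```

A regular expression is not a full HTML parser, so unusual markup could
still slip through; for opening-line summaries that is probably
acceptable.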

> > For an AND search, this would require that the "too common" word is still
> > examined for its presence in the target documents.  Any views from the
> > developers as to the feasibility of this?
> 
> 	I personally throw out stopwords altogether so there is no
> 	easy way to do that.  I don't see why words that are too common
> 	should be treated any differently from predefined stopwords
> 	since, for the document set, they *are* stopwords.

In many cases doing this is fine, but in some situations, the stopword
adds nonzero value.  For example, say someone does an AND search for
"fee fie foe foo" and "foe" is a stopword that has been thrown out because
it is too common.

In my view, the correct way to implement this search is to first find all
documents which contain fee, fie and foo.  So far, this is what you've
implemented.  But that output could be further refined (and correctly
refined) by then throwing out any documents in this list which do NOT also
contain foe, even though foe is a stopword.

This way, the impact of stopwords is always graceful, and if a stopword
can add value at the last step, it does.

If you don't do this, documents containing fee, fie and foo but not foe
are returned even though they should not be.
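
The two-step search described above can be sketched as follows.  This is
a Python illustration under stated assumptions, not how swish-e or
SWISH++ work: `index` is assumed to map each indexed word to the set of
documents containing it, `stopwords` holds the words left out of the
index, and `read_document` is a hypothetical callback that returns a
document's text so the stopword can be checked directly.

```python
def and_search(query_terms, index, stopwords, read_document):
    """AND search that still honours stopwords as a final filter (sketch)."""
    indexed = [t for t in query_terms if t not in stopwords]
    dropped = [t for t in query_terms if t in stopwords]
    if not indexed:
        return set()  # nothing to look up in the index

    # Step 1: documents containing every indexed term.
    candidates = set.intersection(*(set(index.get(t, ())) for t in indexed))

    # Step 2: discard candidates that lack any stopword from the query.
    # The index has no entries for stopwords, so scan the text itself.
    result = set()
    for doc in candidates:
        words = set(read_document(doc).lower().split())
        if all(t in words for t in dropped):
            result.add(doc)
    return result
```

Only the documents that survive step 1 are scanned, so the extra cost is
bounded by the size of the candidate list rather than the whole
collection.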

Now, maybe this is too difficult to implement, but if not, I think it is
worthy of consideration.

> 	SWISH++ discards comments.

Well, you've sold me on swish++ rather than swish-e on this alone!

-- 
Dr Brendan Jones        |
Honorary Associate      |
Electronics Department  |
Macquarie University    | Email: brendan@mpce.mq.edu.au
NSW 2109  AUSTRALIA     | WWW  : http://www.mpce.mq.edu.au/~brendan/
Received on Thu May 7 20:49:11 1998