On Wed, 6 May 1998, Brendan Jones wrote:
> I think providing context around search terms is unnecessary - quoting
> the first 50 or 100 words from the document is, I think, quite sufficient.
> Context becomes a difficult thing when you are doing a search on more
> than one term. Do you provide context around all search terms?
And why not?
> Grep is unsatisfactory and leads to disconnected output. The title of the
> document and the opening lines generally provide enough context information.
This also gets "messy" in that, if you want to do a good job,
you have to strip out not only the HTML, but also anything
between <SCRIPT>...</SCRIPT>, since HTML files that contain
scripts usually have them at the beginning.
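
To make the "messy" part concrete, here is a rough sketch of pulling
the first N words out of an HTML file while skipping script bodies.
It is in Python and uses the standard html.parser module; the names
ExcerptExtractor and excerpt are made up for illustration, and this
is not how swish, swish-e, or SWISH++ actually does it.

```python
from html.parser import HTMLParser


class ExcerptExtractor(HTMLParser):
    """Collects visible text, skipping anything inside <script> or <style>."""

    # Hypothetical choice: <style> is skipped too, for the same reason.
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.words = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names, so <SCRIPT> arrives as "script".
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only text outside skipped elements counts toward the excerpt.
        if self.skip_depth == 0:
            self.words.extend(data.split())


def excerpt(html, max_words=50):
    """Return the first max_words words of visible text in html."""
    parser = ExcerptExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.words[:max_words])
```

Calling excerpt(page, 100) would give the "first 100 words" style of
summary discussed above, minus markup and script source.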
> One of the few criticisms I had with swish seems to be addressed with
> swish-e - that is, stopwords shouldn't invalidate an otherwise possible
> search (i.e. there are other search terms in the query which are indexed).
SWISH++ ignores stopwords in search strings.
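
As a minimal sketch of what "ignores stopwords in search strings"
can mean (Python, with a made-up stopword list and a hypothetical
function name, not SWISH++'s actual code): the stopwords are dropped
from the query before lookup, so the remaining terms still match.

```python
# Illustrative stopword list only; a real one would be much longer.
STOPWORDS = frozenset({"a", "an", "of", "the", "to"})


def indexable_terms(query, stopwords=STOPWORDS):
    """Return the query's words, lowercased, with stopwords removed."""
    return [word for word in query.lower().split() if word not in stopwords]
```

With this, a query such as "the history of NASA" is searched as if it
were just "history nasa" instead of failing because of "the" and "of".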
> For an AND search, this would require that the "too common" word is still
> examined for its presence in the target documents. Any views from the
> developers as to the feasibility of this?
I personally throw out stopwords altogether, so there is no
easy way to do that. I don't see why words that are too common
should be treated any differently from predefined stopwords
since, for the document set, they *are* stopwords.
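
A rough sketch of that idea, in Python: a word that occurs in more
than some fraction of the documents is added to the stopword set and
never reaches the index. The function names, the 0.5 threshold, and
the dict-of-sets index are all assumptions made for illustration,
not SWISH++'s actual data structures.

```python
from collections import Counter


def too_common_words(docs, max_fraction=0.5):
    """docs maps a file name to its list of words.

    Returns the words found in more than max_fraction of the documents.
    """
    doc_freq = Counter()
    for words in docs.values():
        doc_freq.update(set(words))  # count each word once per document
    limit = max_fraction * len(docs)
    return {word for word, count in doc_freq.items() if count > limit}


def build_index(docs, predefined_stopwords=frozenset(), max_fraction=0.5):
    """Index every word except predefined and too-common stopwords."""
    stop = set(predefined_stopwords) | too_common_words(docs, max_fraction)
    index = {}
    for name, words in docs.items():
        for word in set(words) - stop:
            index.setdefault(word, set()).add(name)
    return index, stop
```

Once the index is built this way, the two kinds of stopword are
indistinguishable, and neither can be checked for presence in a
document at search time, which is why the AND case is not easy.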
> Finally, I don't like the fact that swish indexes comments in HTML documents.
SWISH++ discards comments.
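
For illustration only, one simple way to discard comments before
indexing is a regular expression, sketched here in Python. The
function name is made up, and a regex is only an approximation of
real HTML comment parsing; this is not taken from SWISH++.

```python
import re

# Non-greedy match from "<!--" to the nearest "-->", across line breaks.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(html):
    """Return html with <!-- ... --> comments removed."""
    return COMMENT_RE.sub("", html)
```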
- Paul J. Lucas
NASA Ames Research Center        Caelum Research Corporation
Moffett Field, California        San Jose, California
<pjl AT ptolemy DOT arc DOT nasa DOT gov>
Received on Thu May 7 00:39:08 1998