I've just joined this list and look forward to swish discussions!
When I first looked for a web indexing system, I was immediately taken
with swish, its small indexes and relatively powerful search capabilities.
Just the compact engine I was looking for!
I have Kevin Hughes' original swish running on two public sites:
http://www.ausflag.com.au and http://www.republic.org.au. At both of these
sites I've written a Perl CGI script that processes the swish output to
provide AltaVista-like results (except that they are not paged).
Try searching for "Norfolk Island" on Ausflag and "Peter Collins" on the
Republic site. Or anything else for that matter.
The first site contains about 350 HTML documents, the second about 250.
Each site totals about 10 MB. Indexing takes about a minute (the
servers are BSD Unix). Now that there is a swish-e I will be upgrading soon!
I scanned the mailing list archive and would like to add my two cents' worth
to some previous discussions, plus suggest some improvements to swish-e.
I think providing context around search terms is unnecessary - quoting
the first 50 or 100 words from the document is quite sufficient.
Context becomes a difficult thing when you are doing a search on more
than one term. Do you provide context around all search terms? It
gets messy quickly. Grep is unsatisfactory and leads to disconnected
output. The title of the document and the opening lines generally provide
enough context information.
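
To make the idea concrete, here is a rough sketch (in Python, not the Perl
script running on the sites above) of pulling a title and the opening words
out of an HTML page. The names SummaryParser and summarize, and the 50-word
default, are made up for illustration; this is one possible approach, not
how swish or swish-e produce their output.

```python
from html.parser import HTMLParser


class SummaryParser(HTMLParser):
    """Collects the <title> text and the visible words of a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.skip_depth = 0      # > 0 while inside <script> or <style>
        self.title_parts = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # HTML comments never reach this method, so they are left out.
        if self.in_title:
            self.title_parts.append(data)
        elif not self.skip_depth:
            self.words.extend(data.split())


def summarize(html, max_words=50):
    """Return (title, excerpt), where excerpt is the first max_words words."""
    parser = SummaryParser()
    parser.feed(html)
    parser.close()
    title = " ".join(" ".join(parser.title_parts).split())
    excerpt = " ".join(parser.words[:max_words])
    if len(parser.words) > max_words:
        excerpt += " ..."
    return title, excerpt
```

The excerpt is the same regardless of how many search terms were given,
which is what keeps the output tidy for multi-term queries.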
One of the few criticisms I had of swish seems to be addressed with
swish-e - that is, stopwords shouldn't invalidate an otherwise possible
search (i.e. when the query contains other terms that are indexed).
Is this feature in swish-e 1.2 only? When is the 1.2 release anticipated?
This feature should also be extended to words that are "too common", i.e.
if there is more than one search term, but one of them causes the "too
common" error, searches should continue on remaining terms. For an AND
search, this would require that the "too common" word is still examined
for its presence in the target documents. Any views from the developers
as to the feasibility of this?
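
To illustrate what I mean for the AND case, here is a minimal sketch in
Python. It assumes a toy index (a dict mapping each indexed word to a set
of document ids) and a read_text function that returns a document's text;
both are hypothetical, as is the name and_search. It does not reflect how
swish-e stores or searches its index.

```python
import re


def and_search(terms, index, stopwords, too_common, read_text):
    """AND search that tolerates stopwords and "too common" words.

    Stopwords are dropped. "Too common" words are not in the index, so
    they are checked afterwards against the text of the candidate
    documents found via the remaining, indexed terms.
    """
    terms = [t.lower() for t in terms]
    usable = [t for t in terms if t not in stopwords and t not in too_common]
    if not usable:
        return None  # no indexed term left to search on

    # Intersect the posting sets of the indexed terms.
    hits = set(index.get(usable[0], ()))
    for t in usable[1:]:
        hits &= set(index.get(t, ()))

    # Examine the "too common" words in the surviving documents only.
    deferred = [t for t in terms if t in too_common and t not in stopwords]
    for t in deferred:
        pattern = re.compile(r"\b%s\b" % re.escape(t), re.IGNORECASE)
        hits = {d for d in hits if pattern.search(read_text(d))}
    return hits
```

Because the text scan only touches documents that already matched the
indexed terms, the extra cost should stay small when at least one term in
the query is selective.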
Finally, I don't like the fact that swish indexes comments in HTML
documents. This should most definitely be an option in the config file.
Either choose to index HTML comments, or choose to ignore them.
For example, you can't search for my name in either of the two sites above
because my name is in the HTML author comment at the top of every document at
the site - generating a "word too common" error.
So, my plea to the developers for v1.2, before it is released, would be
to add an ignore HTML comment feature!
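
As a rough illustration of the behaviour I have in mind (a Python sketch,
not a patch against swish-e): comments are stripped from the page before
its words are handed to the indexer, unless an option says otherwise. The
function name text_to_index and the index_comments option are invented
here; the real config directive would be up to the developers.

```python
import re

# Matches an HTML comment, including ones that span several lines.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def text_to_index(html, index_comments=False):
    """Return the page text to index, dropping comments by default."""
    if index_comments:
        return html
    # Replace each comment with a space so adjacent words stay separate.
    return COMMENT_RE.sub(" ", html)
```

With comments ignored, an author line repeated at the top of every page
would no longer push that name over the "too common" threshold.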
Dr Brendan Jones |
Honorary Associate |
Electronics Department |
Macquarie University | Email: email@example.com
NSW 2109 AUSTRALIA | WWW : http://www.mpce.mq.edu.au/~brendan/
Received on Wed May 6 19:19:22 1998