For the last ten months or so I have been working with a system which
(among other things) indexes documents for searching. The system can use
several search engines at once, including swish-e. All this has given me
quite some experience of the related tasks, and I would like to share
that experience and hopefully help improve swish-e.
Please comment on this.
First of all, there are typically two major tasks involved: indexing a
bunch of documents and searching through the documents. These tasks can
be modularized into sub-tasks like this:
Indexing:
+--------+ +----------+ +--------+ +-------+
| Gather |-->| Retrieve |-->| Filter |-->| Index |
+--------+ +----------+ +--------+ +-------+
Gather:
Decide which documents should be indexed and generate a list of
them. In most cases each document is identified by a URL. But in
some cases other types of unique identifiers are used (for example
scrollkeeper, see below). This task is typically performed by a
spider, but other solutions are possible. I think swish-e handles
this in a good way.
Retrieve:
Retrieve the contents of a document by its identifier. In most cases
this means opening and reading a file or getting a file by http. This is
a very common task and is not specific to search engines at all. There
exist several good general purpose solutions to this already (see
below) and I think swish-e should be able to take advantage of them.
Filter:
Transform the contents of a document from one mime-type to another
and perhaps change the encoding so the indexer can understand it.
Most indexers want text/plain, some also accept text/html, text/xml
or some more specific mime-types. This is also quite a common task;
it could be solved once and for all, for everybody to use.
Index:
And finally index the documents. This is the only thing that needs to
differ between different search engines IMHO. I would like to see
this part separated out.
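To make the split concrete, here is a minimal Python sketch of the four
indexing stages as independent, replaceable callables. Everything in it
(Document, run_indexing, to_plain_text, WordIndex) is a hypothetical name
made up for illustration; none of it is swish-e code or a swish-e API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Document:
    identifier: str  # usually a URL, but any unique key works
    mime_type: str
    data: bytes


# Each stage is a plain callable so it can be replaced independently.
Gather = Callable[[], Iterable[str]]
Retrieve = Callable[[str], Document]
Filter = Callable[[Document], Document]
Index = Callable[[Document], None]


def run_indexing(gather: Gather, retrieve: Retrieve,
                 filter_: Filter, index: Index) -> int:
    """Push every gathered identifier through retrieve -> filter -> index."""
    count = 0
    for identifier in gather():
        index(filter_(retrieve(identifier)))
        count += 1
    return count


def to_plain_text(doc: Document) -> Document:
    """Toy filter: re-encode Latin-1 text as UTF-8 text/plain."""
    if doc.mime_type == "text/plain; charset=iso-8859-1":
        text = doc.data.decode("iso-8859-1")
        return Document(doc.identifier, "text/plain", text.encode("utf-8"))
    return doc


class WordIndex:
    """Toy indexer: maps each lowercase word to the identifiers containing it."""

    def __init__(self) -> None:
        self.postings: Dict[str, List[str]] = {}

    def add(self, doc: Document) -> None:
        for word in set(doc.data.decode("utf-8").lower().split()):
            self.postings.setdefault(word, []).append(doc.identifier)
```

With an in-memory dict standing in for the gatherer and retriever, a run
could look like run_indexing(lambda: store.keys(), store.__getitem__,
to_plain_text, index.add). Only WordIndex would have to change to target
a different search engine.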
Searching:
+-------+ +----------+ +--------+ +------+
| Query |-->| Retrieve |-->| Filter |-->| View |
+-------+ +----------+ +--------+ +------+
Query:
Search among the indexed documents and return the identifier of one
(or more) of them. The exact functionality of this is of course
closely related to the functionality of the indexer; they cannot be
separated.
Retrieve:
Retrieve the contents of a document by its identifier. This is the
exact same thing as in the indexing task above. It should be handled
by the exact same routines. As far as I can see, swish-e does
nothing beyond returning the document identifier. I think it should
also support some way to create a data stream from the identifier.
Filter:
Transform the contents of a document from one mime-type to another
and perhaps change the encoding so it can be viewed. This usually
means transforming it into a format that browsers can understand (i.e.
text/html). But it is essentially the same task as above.
View:
Display the document on the screen. This is usually done by sending
the document to the browser (with perl or php perhaps). But it might
also be useful to feed it to a local viewer. This doesn't need to
be a part of the search engine at all.
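As a rough illustration of the search side, the sketch below turns each
hit into a ready-to-view data stream by running it through retrieve and
filter steps that are separate from the query step. The in-memory
DOCUMENTS store, the linear-scan query and all function names are
assumptions made for the example, not anything swish-e provides.

```python
import html
import io
from typing import BinaryIO, Callable, Dict, Iterable, Iterator, Tuple

# Hypothetical in-memory document store standing in for files or URLs.
DOCUMENTS: Dict[str, bytes] = {
    "doc:1": b"swish-e & friends",
    "doc:2": b"nothing relevant",
}


def query(term: str) -> Iterable[str]:
    """Toy query: linear scan instead of a real index lookup."""
    needle = term.encode("utf-8")
    return [key for key, data in DOCUMENTS.items() if needle in data]


def retrieve(identifier: str) -> BinaryIO:
    """Create a data stream from an identifier."""
    return io.BytesIO(DOCUMENTS[identifier])


def text_to_html(stream: BinaryIO) -> BinaryIO:
    """Toy filter: wrap plain text in <pre> so a browser can show it."""
    text = stream.read().decode("utf-8")
    page = "<pre>" + html.escape(text) + "</pre>"
    return io.BytesIO(page.encode("utf-8"))


def open_results(term: str,
                 query: Callable[[str], Iterable[str]],
                 retrieve: Callable[[str], BinaryIO],
                 filter_: Callable[[BinaryIO], BinaryIO],
                 ) -> Iterator[Tuple[str, BinaryIO]]:
    """Yield (identifier, viewable stream) for every hit."""
    for identifier in query(term):
        yield identifier, filter_(retrieve(identifier))
```

The view step then only has to copy the stream to a browser or a local
viewer; it never needs to know how the document was found or fetched.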
I haven't yet studied the source of swish-e, so I don't know the details
of the architecture. I'm willing to implement the changes I suggested
myself, if they are considered useful. Some other products I use today
and would want to add support for in swish-e (or use swish-e in)
include:
Gnome Virtual File System
This is a library which can retrieve documents using a wide range of
access methods (including http and ftp), read documents inside archives
(including tar and zip) and decompress them on the fly. It can
also detect the mime-type of a document either from the URL or from
the contents.
GStreamer (http://www.gstreamer.org/)
This is a library for high performance media streaming. What it does
is convert one mime-type to another by using plug-ins, and possibly
combining them. It is mainly intended for audio and video, but can be
used for text as well. It can also read many types of URLs,
optionally by using Gnome VFS.
Scrollkeeper
This is a system for keeping track of documentation. It builds a
database of installed documents. Each document has a unique
identifier, but it is independent of where the document is
stored and of its filename.
Yelp (http://www.gnome.org/softwaremap/projects/yelp/)
This is a help file viewer for the GNOME desktop. It can display man
pages, info pages, docbook and html documents.
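The idea of converting one mime-type to another by combining plug-ins
(the Filter step above) can be sketched in Python as a small registry
plus a breadth-first search for the shortest chain of converters. This
is only an illustration under my own assumptions: the registry, the
function names and the two toy converters are hypothetical, and this is
not how GStreamer or swish-e is actually implemented.

```python
import gzip
import re
from collections import deque
from typing import Callable, Dict, List, Optional, Tuple

Converter = Callable[[bytes], bytes]

# Hypothetical plug-in registry: (source type, target type) -> converter.
CONVERTERS: Dict[Tuple[str, str], Converter] = {}


def register(source: str, target: str, converter: Converter) -> None:
    CONVERTERS[(source, target)] = converter


def find_chain(source: str, target: str) -> Optional[List[Converter]]:
    """Breadth-first search for the shortest converter chain, or None."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        mime, chain = queue.popleft()
        if mime == target:
            return chain
        for (src, dst), converter in CONVERTERS.items():
            if src == mime and dst not in seen:
                seen.add(dst)
                queue.append((dst, chain + [converter]))
    return None


def convert(data: bytes, source: str, target: str) -> bytes:
    """Run data through every converter on the chain from source to target."""
    chain = find_chain(source, target)
    if chain is None:
        raise ValueError(f"no filter chain from {source} to {target}")
    for converter in chain:
        data = converter(data)
    return data


# Toy plug-ins: treat gzip input as compressed HTML, and strip tags naively.
register("application/gzip", "text/html", gzip.decompress)
register("text/html", "text/plain", lambda d: re.sub(rb"<[^>]+>", b"", d))
```

Asking for application/gzip to text/plain then combines both plug-ins
automatically, so neither the indexer nor the viewer needs to know
which converters exist.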
Received on Wed Oct 8 17:02:17 2003