A modularized view of a search engine

From: Magnus Bergman <magnus.bergman(at)not-real.observer.net>
Date: Wed Oct 08 2003 - 16:54:59 GMT
For the last ten months or so I have been working with a system which
(among other things) indexes documents for searching. The system can use
several search engines at once, including swish-e. All this has given me
quite some experience of the related tasks, and I would like to share
that experience and hopefully help improve swish-e.

Please comment on this.

First of all, there are typically two major tasks involved: indexing a
bunch of documents and searching through them. These tasks can
be modularized into sub-tasks like this:


Indexing:
  +--------+   +----------+   +--------+   +-------+
  | Gather |-->| Retrieve |-->| Filter |-->| Index |
  +--------+   +----------+   +--------+   +-------+
  Gather:
    Decide which documents should be indexed and generate a list of
    them. In most cases each document is identified by a URL. But in
    some cases other types of unique identifiers are used (for example
    scrollkeeper, see below). This task is typically performed by a
    spider, but other solutions are possible. I think swish-e handles
    this in a good way.

  Retrieve:
    Retrieve the contents of a document by its identifier. In most cases
    this means opening and reading a file or fetching a file over HTTP.
    This is a very common task and is not specific to search engines at
    all. There exist several good general-purpose solutions to this
    already (see below) and I think swish-e should be able to take
    advantage of them.

  Filter:
    Transform the contents of a document from one mime-type to another
    and perhaps change the encoding so the indexer can understand it.
    Most indexers want text/plain; some also accept text/html, text/xml
    or some more specific mime-types. This is also quite a common task,
    so it could be solved once and for all, for everybody to use.

  Index:
    And finally, index the documents. This is the only thing that needs
    to differ between different search engines IMHO. I would like to see
    this part separated out.
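
As a rough illustration of the indexing chain above, here is a minimal
Python sketch in which each stage is a separate, replaceable function.
Everything in it is made up for illustration (the STORE dictionary, the
mem:// identifiers, the function names and the toy inverted index); it
is not swish-e code and says nothing about how swish-e is structured
internally.

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical in-memory document store standing in for files or HTTP.
STORE: Dict[str, Tuple[str, bytes]] = {
    "mem://a.txt": ("text/plain", b"swish-e indexes documents"),
    "mem://b.html": ("text/html", b"<p>modular <b>search</b> engine</p>"),
}


def gather() -> Iterator[str]:
    """Gather: decide which documents to index and yield their identifiers."""
    yield from sorted(STORE)


def retrieve(doc_id: str) -> Tuple[str, bytes]:
    """Retrieve: map an identifier to (mime_type, raw_bytes)."""
    return STORE[doc_id]


def filter_to_text(mime: str, data: bytes) -> str:
    """Filter: convert the content to text/plain for the indexer."""
    text = data.decode("utf-8")
    if mime == "text/html":
        # Crude tag stripping, good enough for a sketch only.
        text = re.sub(r"<[^>]+>", " ", text)
    return text


def index(docs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Index: build a toy inverted index (word -> list of identifiers)."""
    inverted: Dict[str, List[str]] = {}
    for doc_id, text in docs:
        for word in set(text.lower().split()):
            inverted.setdefault(word, []).append(doc_id)
    return inverted


def run_pipeline() -> Dict[str, List[str]]:
    """Chain the four stages: Gather -> Retrieve -> Filter -> Index."""
    return index(
        (doc_id, filter_to_text(*retrieve(doc_id))) for doc_id in gather()
    )


inverted = run_pipeline()
```

Only index() would need to differ between search engines in this
arrangement; gather(), retrieve() and filter_to_text() could be shared.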


Searching:
  +-------+   +----------+   +--------+   +------+
  | Query |-->| Retrieve |-->| Filter |-->| View |
  +-------+   +----------+   +--------+   +------+
  Query:
    Search among the indexed documents and return the identifiers of one
    or more of them. The exact functionality of this is of course
    closely related to the functionality of the indexer; the two cannot
    be separated.

  Retrieve:
    Retrieve the contents of a document by its identifier. This is the
    exact same thing as in the indexing task above. It should be handled
    by the exact same routines. As far as I can see, swish-e does
    nothing beyond returning the document identifier. I think it should
    also support some way to create a data stream from the identifier.

  Filter:
    Transform the contents of a document from one mime-type to another
    and perhaps change the encoding so it can be viewed. This usually
    means transforming it into a format that browsers can understand
    (i.e. text/html). But it is essentially the same task as above.

  View:
    Display the document on the screen. This is usually done by sending
    the document to the browser (with Perl or PHP perhaps). But it might
    also be useful to feed it to a local viewer. This doesn't need to
    be a part of the search engine at all.
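
The search side can be sketched the same way. In this hedged Python
example the retrieve step takes an identifier and returns the content,
and the filter step targets text/html instead of text/plain. The STORE
and INDEX dictionaries, the mem:// identifiers and all function names
are assumptions made for the example, not anything swish-e provides.

```python
import html
from typing import Dict, List, Tuple

# Hypothetical stand-ins: a document store and a prebuilt inverted index.
STORE: Dict[str, Tuple[str, bytes]] = {
    "mem://a.txt": ("text/plain", b"swish-e & friends"),
    "mem://b.html": ("text/html", b"<p>modular search</p>"),
}
INDEX: Dict[str, List[str]] = {
    "swish-e": ["mem://a.txt"],
    "modular": ["mem://b.html"],
    "search": ["mem://b.html"],
}


def query(term: str) -> List[str]:
    """Query: return the identifiers of the matching documents."""
    return INDEX.get(term.lower(), [])


def retrieve(doc_id: str) -> Tuple[str, bytes]:
    """Retrieve: turn an identifier into (mime_type, raw_bytes)."""
    return STORE[doc_id]


def filter_to_html(mime: str, data: bytes) -> str:
    """Filter: convert the content to something a browser understands."""
    text = data.decode("utf-8")
    if mime == "text/html":
        return text
    return "<pre>" + html.escape(text) + "</pre>"


def view(doc_id: str, page: str) -> str:
    """View: here just a string; a real viewer would send it to a browser."""
    return f"<!-- {doc_id} -->\n{page}"


def search(term: str) -> List[str]:
    """Chain the four stages: Query -> Retrieve -> Filter -> View."""
    return [view(d, filter_to_html(*retrieve(d))) for d in query(term)]
```

Calling search("swish-e") walks one matching identifier through all
four stages; only query() depends on the index format.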


I haven't yet studied the source of swish-e, so I don't know the details
of the architecture. I'm willing to implement the changes I suggested
myself, if they are considered useful. Some other products I use today
and would want to add support for in swish-e (or use swish-e in)
include:

Gnome Virtual File System
  This is a library which can retrieve documents using a wide range of
  access methods (including http and ftp), read documents inside
  archives (including tar and zip) and decompress them on the fly. It
  can also detect the mime-type of a document either from the URL or
  from the contents.

GStreamer (http://www.gstreamer.org/)
  This is a library for high performance media streaming. What it does
  is convert one mime-type to another by using plug-ins, possibly
  combining several of them. It is mainly intended for audio and video,
  but can be used for text as well. It can also read many types of URLs,
  optionally by using Gnome VFS.

Scrollkeeper
  This is a system for keeping track of documentation. It builds a
  database of installed documents. Each document has a unique
  identifier, but it is independent of where the document is
  stored and of its filename.

Yelp (http://www.gnome.org/softwaremap/projects/yelp/)
  This is a help file viewer for the GNOME desktop. It can display man
  pages, info pages, DocBook and HTML documents.
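
The idea of converting between mime-types by combining plug-ins, as
described for GStreamer above, can be illustrated with a small Python
sketch. This is not the GStreamer API. The PLUGINS registry, the
register/find_chain/convert functions and the text/x-upper mime-type
are all invented for the example. It simply searches breadth-first for
the shortest chain of registered converters.

```python
from collections import deque
from typing import Callable, Dict, List, Optional, Tuple

Converter = Callable[[str], str]

# Hypothetical plug-in registry: (source mime, target mime) -> converter.
PLUGINS: Dict[Tuple[str, str], Converter] = {}


def register(src: str, dst: str, fn: Converter) -> None:
    """Register a plug-in that converts src content into dst content."""
    PLUGINS[(src, dst)] = fn


def find_chain(src: str, dst: str) -> Optional[List[Converter]]:
    """Breadth-first search for the shortest plug-in chain from src to dst."""
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        mime, chain = queue.popleft()
        if mime == dst:
            return chain
        for (a, b), fn in PLUGINS.items():
            if a == mime and b not in seen:
                seen.add(b)
                queue.append((b, chain + [fn]))
    return None  # no combination of plug-ins reaches dst


def convert(data: str, src: str, dst: str) -> str:
    """Run data through the chain of plug-ins found by find_chain."""
    chain = find_chain(src, dst)
    if chain is None:
        raise ValueError(f"no plug-in chain from {src} to {dst}")
    for fn in chain:
        data = fn(data)
    return data


# Two toy plug-ins; "text/x-upper" is a made-up type for the example.
register("text/x-upper", "text/plain", str.lower)
register("text/plain", "text/html", lambda s: "<pre>" + s + "</pre>")
```

With these two plug-ins registered, converting text/x-upper to
text/html works even though no single plug-in does it directly.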
Received on Wed Oct 8 17:02:17 2003