
Re: A modularized view of a search engine

From: Magnus Bergman <magnus.bergman(at)not-real.observer.net>
Date: Thu Oct 09 2003 - 14:46:47 GMT
On Wed, 8 Oct 2003 11:05:30 -0700
moseley@hank.org wrote:

> On Wed, Oct 08, 2003 at 09:53:33AM -0700, Magnus Bergman wrote:
> 
> 
> > Indexing:
> >   +--------+   +----------+   +--------+   +-------+
> >   | Gather |-->| Retrieve |-->| Filter |-->| Index |
> >   +--------+   +----------+   +--------+   +-------+
> >   Gather:
> >     Decide which documents should be indexed and generate a list of
> >     them. In most cases each document is identified by a URL. But in
> >     some cases other types of unique identifiers are used (for
> >     example scrollkeeper, see below). This task is typically
> >     performed by a spider, but other solutions are possible. I think
> >     swish-e handles this in a good way.
> > 
> >   Retrieve:
> >     Retrieve the contents of a document by its identifier. In most
> >     cases this means opening and reading a file or fetching a file
> >     over HTTP. This is a very common task and is not specific to
> >     search engines at all. Several good general-purpose solutions to
> >     this already exist (see below) and I think swish-e should be
> >     able to take advantage of them.
> 
> For HTML, those are better done together since you have to Retrieve to
> know what to Gather.

That is true in this very common case. But I would still want them
separated. The gather module could just use the retrieve module (and
perhaps the filter module). In the system I use there are several
external crawlers, so swish-e doesn't need to do that at all. And in the
case of scrollkeeper (developed and used by Sun) there doesn't need to
be any crawler, since the documents kind of register themselves.

The main point: the job of retrieving a (fixed-size, linear) document
only needs to be implemented once for the whole system. Each and every
program that needs this functionality can use the same code.
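The split could be sketched like this (a minimal illustration only; all
the names here are invented and this is not swish-e's actual API):

```python
# Sketch of the Gather -> Retrieve -> Index split.  All names are
# invented for illustration; this is not swish-e's actual API.
from urllib.request import urlopen


def retrieve(identifier):
    """Retrieve module: fetch a document's contents by its identifier.
    Implemented once, so every program can reuse the same code."""
    if identifier.startswith("file://"):
        with open(identifier[len("file://"):], "rb") as f:
            return f.read()
    return urlopen(identifier).read()  # http://, ftp://, ...


def gather(start_identifiers):
    """Gather module: decide which documents should be indexed.  A real
    spider would call retrieve() itself to follow links; an external
    crawler could replace this function entirely."""
    yield from start_identifiers


def index(documents):
    """Index module: stand-in for the real indexer; here it just
    records each document's size."""
    return {ident: len(content) for ident, content in documents}


def run(start):
    # The pipeline: Gather -> Retrieve -> Index (Filter omitted here).
    return index((i, retrieve(i)) for i in gather(start))
```

Because gather() is just one pluggable stage, swapping in an external
crawler (or scrollkeeper's registration database) only replaces that one
function.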

> >   Filter:
> >     Transform the contents of a document from one mime-type to
> >     another and perhaps change the encoding so the indexer can
> >     understand it. Most indexers want text/plain; some also accept
> >     text/html, text/xml or some more specific mime-types. This is
> >     also a quite common task, so it could be solved once and for
> >     all, for everybody to use.
> 
> SWISH::Filter?
> 
> So what you describe above is basically how swish-e works.

Yes, that is great. But what I really want to do is to use my own
filters with swish-e, and perhaps use the filters of swish-e with other
indexers and document streamers.

I must admit that I haven't looked much at SWISH::Filter, since I don't
know Perl. Can it easily be used on the command line to convert
documents? And can other command-line filters easily be used with
swish-e? (By "easily" I mean without writing any Perl code.)
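What I have in mind for external filters is something like this sketch:
the filter is any command that reads the original document on stdin and
writes text/plain on stdout. The filter table below is hypothetical
(pdftotext is a real tool, but its appearance here is just an example):

```python
# Sketch: plugging in external command-line filters by mime-type.
# The FILTERS table is hypothetical, not any real configuration format.
import subprocess

# Map a source mime-type to a command that converts it to text/plain,
# reading the document on stdin and writing text on stdout.
FILTERS = {
    "application/pdf": ["pdftotext", "-", "-"],  # assumes poppler's pdftotext
    "text/plain": ["cat"],                       # identity filter
}


def filter_document(content: bytes, mime_type: str) -> bytes:
    """Run the configured command-line filter for this mime-type and
    return the converted text."""
    command = FILTERS[mime_type]
    result = subprocess.run(command, input=content,
                            stdout=subprocess.PIPE, check=True)
    return result.stdout
```

No Perl (or C) would need to be written to add a new document type; one
would only add a table entry naming the converter command.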

> > Scrollkeeper
> >   This is a system for keeping track of documentation. It builds a
> >   database of installed documents. Each document has a unique
> >   identifier, but it is independent of where the document is
> >   stored and of its filename.
> 
> Sounds like a job for swish-e! ;)

Yes, it could be. I personally don't like scrollkeeper very much, but
Sun seems to. And it's a solution that is in use in many places, and I
would like to integrate swish-e in some of those places.
Received on Thu Oct 9 14:47:18 2003