
Re: A modularized view of a search engine

From: Bernhard Weisshuhn <bkw(at)>
Date: Wed Oct 08 2003 - 17:39:47 GMT
On Wed, Oct 08, 2003 at 09:53:29AM -0700, Magnus Bergman <> wrote:

> Searching:
>   Retrieve:
>     Retrieve the contents of a document by its identifier. This is the
>     exact same thing as in the indexing task above. It should be handled
>     by the exact same routines. As far as I can see, swish-e does
>     nothing beyond returning the document identifier. I think it should
>     also support some way to create a data stream from the identifier.

You might want to check out what swish-e properties are for, as opposed to
MetaNames. You can return quite a lot of information from the indexed
content, as long as you told swish-e to save those properties along with
the index during its creation.
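As a sketch of that distinction (MetaNames and PropertyNames are real
swish-e config directives; the field names here are made up for
illustration):

```
# Make 'title' and 'author' searchable fields
MetaNames title author

# Also store their values in the index, so search results
# can return them without re-reading the source documents
PropertyNames title author
```

Fields listed only under MetaNames can be searched but not returned;
listing them under PropertyNames as well is what makes them come back
with the results.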

I don't agree that swish-e should create the data stream as you say. The
job of a search engine, imho, should be to find stuff, not to do anything
useful with it. That should be the job of other parts of the framework,
which *really* know how to handle that. Take indexing PDFs for example.
If swish-e had 'native' support for it, we would have to include huge
libraries (adding dependencies and bloat) to do something that xpdf most
probably can do much better. If a new PDF revision needs to be supported,
you update xpdf and that's that.
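That division of labour is exactly what swish-e's FileFilter directive
expresses; a minimal sketch, assuming pdftotext (from xpdf) is on your
path:

```
# Hand .pdf files to xpdf's pdftotext before indexing;
# swish-e replaces '%p' with the path of the file being indexed,
# and '-' tells pdftotext to write the text to stdout
FileFilter .pdf pdftotext "'%p' -"
```

When a new PDF revision comes along, you upgrade xpdf and the config
line stays the same.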

> [...] Some other products I use today and would want to add support for
> in swish-e (or use swish-e in) includes:

> Gnome Virtual File System
> GStreamer (
> Scrollkeeper
> Yelp (

Please don't be offended if I don't seem to get your point.

What do you mean by support? I fail to see how these are "unsupported"
as long as one can write a wrapper that retrieves the contents and
converts them to XML for swish-e to index.

I fear you're about to add a lot of dependencies to our compact little
swish-e without adding much benefit. I think using perl scripts with
swish-e's -S prog option gives so much power (think CPAN) it should be
pretty easy to index all these contents (and many, many more).
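For the record, such a -S prog wrapper just prints each document to
stdout as a small header block, a blank line, then the body. The list
usually does this in perl, but the protocol works from any language;
here is a Python sketch (the document content and file names are made
up, only the header format is swish-e's):

```python
import sys
import time


def prog_record(path, content):
    """Format one document in the header format that swish-e's
    -S prog interface reads: headers, a blank line, then the body."""
    body = content.encode("utf-8")
    headers = (
        f"Path-Name: {path}\n"
        f"Content-Length: {len(body)}\n"
        f"Last-Mtime: {int(time.time())}\n"
        "\n"
    )
    return headers.encode("ascii") + body


if __name__ == "__main__":
    # Emit one trivial XML document for swish-e to index; a real
    # wrapper would walk GnomeVFS, scrollkeeper's OMF files, etc.
    doc = "<doc><title>hello</title><body>hello world</body></doc>"
    sys.stdout.buffer.write(prog_record("hello.xml", doc))
```

You would then point swish-e at the wrapper with something like
`swish-e -c swish.conf -S prog -i ./wrapper.py` (file names are
examples, not part of the distribution).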

Filtering and viewing, imho, should not be the responsibility of the
indexing engine. What we would get from that is the notion that
swish-e 'supports', say, scrollkeeper, but doesn't 'support', say,
postgres, to pick something at random. This makes no sense of course. The
proper way to index *any* content is to define exactly how to retrieve
it, what parts to index (filtering), and what to do with
search results, just like you said in your mail.
This is exactly what the several wrappers for the engine do.
You can include as many clues as the frontend needs for interpreting
the data in swish-e's properties.

Or did I completely misread your mail, and you actually want to supply
those wrappers for the filters and prog-bin directories of the
distribution? In that case, of course: Excellent idea, go ahead! ;)

just my 2 cents,
Received on Wed Oct 8 17:43:42 2003