Incremental indexing?

From: Keith Thompson <kjt(at)>
Date: Fri Feb 22 2002 - 20:20:44 GMT

Hello all,

I've been wondering if any recent thought has gone into
incremental indexing with swish-e.  I see at the web site that
it is on the maybe-in-3.0 list.  I was hoping
that possibly someone already had their sights set on this one.
Maybe I just volunteered.  :(

I am involved in a project right now that is planning to use the
merge functionality in order to support incremental indexing.
This is unfortunately a bit clumsy, and probably more memory-
and time-intensive than I'd like.  I don't know yet--I'm
just guessing.

Why not just reindex everything as is usually suggested?
Without getting into the painful details (that aren't
important anyway), here's the gist of the situation:  I get
sent a document to add to the index.  It might be one every
few seconds or one a week.  The important thing to know is
that I don't keep the files and have no way of asking the
sender to replay them to me.  I'm just given a file, a name
(pathname or URL), and some metadata (timestamp, owner, etc.).
The plan is to index each into a new swish-e index.  After a few
come in, I'll merge them together.  At any point, I am also sent
queries.  These queries need to return results that may involve
a file I was sent to index just moments before.  So, as you
can guess, keeping the index up to date is crucial.  It will
probably take quite a bit of tuning to figure out at what
point it pays to merge all of my little one-document indexes,
rather than doing the queries against these multiple indexes.
In any case, as you can see, this is a bit more overhead
than I'd ideally like to maintain.  Also, I've heard
occasional negative remarks about merging on this list and
don't know if I can feel good about relying so heavily on
merging.  If I could do incremental indexing, this would be
a complete no-brainer.
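
The plan above can be sketched roughly as follows.  This is only a
bookkeeping sketch in Python, not a tested setup: it builds swish-e
command lines without running them, and it assumes the -i, -f, -w
and -M switches behave as I understand them from the swish-e docs
(-M taking the output index as its last argument, -f accepting
several indexes when searching).  The IndexPool class, the file names
and the merge threshold are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IndexPool:
    """Tracks one merged index plus the small indexes not yet merged."""
    merge_threshold: int = 8          # hypothetical tuning knob
    merged: Optional[str] = None      # current merged index, if any
    pending: List[str] = field(default_factory=list)
    doc_count: int = 0
    merge_count: int = 0

    def index_command(self, doc_path: str) -> List[str]:
        """Command that indexes one incoming document into its own index."""
        self.doc_count += 1
        index_file = f"doc-{self.doc_count}.index"   # hypothetical naming
        self.pending.append(index_file)
        return ["swish-e", "-i", doc_path, "-f", index_file]

    def search_indexes(self) -> List[str]:
        """Every index a query must cover to see the newest documents."""
        return ([self.merged] if self.merged else []) + self.pending

    def search_command(self, query: str) -> List[str]:
        """Command that searches the merged index and all pending ones."""
        return ["swish-e", "-w", query, "-f", *self.search_indexes()]

    def merge_command(self) -> Optional[List[str]]:
        """Command that folds pending indexes into a new merged index,
        or None while there are fewer than merge_threshold of them.

        The bookkeeping is updated optimistically here; a real caller
        would switch over only after the merge has succeeded.
        """
        if len(self.pending) < self.merge_threshold:
            return None
        inputs = self.search_indexes()
        self.merge_count += 1
        output = f"merged-{self.merge_count}.index"  # hypothetical naming
        self.merged, self.pending = output, []
        return ["swish-e", "-M", *inputs, output]
```

The tuning question is then mostly the value of merge_threshold: a
low value means frequent merges, a high value means every query has
to touch many small indexes.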

If anyone is currently working on incremental indexing, I'd
like to hear more about it.  Or, if you have any suggestions on how
I can better manage this, that would be welcome as well.

Unfortunately, I also have one other thing to handle
that I don't have a really convenient answer for.  This is the
occasional request to remove a document from the index (either
through aging, because a new document [potentially with a
different name than the original] is to replace it, or because
of obsolescence).  Is there currently a way to remove a document from
the index, or can that only be done by reindexing everything else
(which I obviously can't do in this case) or by leaving them
in the index and having my front end filter them from the
results (very ugly)?
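
For what it's worth, the front-end filtering fallback would look
something like this.  It is a minimal sketch in Python, assuming the
front end has already parsed the search output into (rank, name)
pairs; the Tombstones class and the hit format are made up for
illustration and are not anything swish-e provides.

```python
from typing import Iterable, List, Set, Tuple

Hit = Tuple[int, str]  # (rank, document name); hypothetical parsed form


class Tombstones:
    """Names of documents that are still indexed but must not be shown."""

    def __init__(self) -> None:
        self._removed: Set[str] = set()

    def remove(self, name: str) -> None:
        """Mark a document as removed; the index itself is untouched."""
        self._removed.add(name)

    def filter(self, hits: Iterable[Hit]) -> List[Hit]:
        """Drop hits for removed documents, keeping the original order."""
        return [hit for hit in hits if hit[1] not in self._removed]
```

The removed documents still take up space in the index and still
count against any result limit, which is a large part of why this
feels ugly.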

Thanks much -keith
Received on Fri Feb 22 20:21:18 2002