I've been wondering if any recent thought has gone into
incremental indexing with swish-e. I see at the web site that
it is listed amongst the maybe-in-3.0 list. I was hoping
that someone already had their sights set on this one.
Maybe I just volunteered. :(
I am involved in a project right now that is planning to use the
merge functionality in order to support incremental indexing.
This is unfortunately a bit clumsy as well as more memory and
time intensive than I'll probably like. I don't know yet--I'm
Why not just reindex everything as is usually suggested?
Without getting into the painful details (that aren't
important anyway), here's the gist of the situation: I get
sent a document to add to the index. It might be one every
few seconds or one a week. The important thing to know is
that I don't keep the files and have no way of asking the
sender to replay them to me. I'm just given a file, a name
(pathname or URL), and some metadata (timestamp, owner, etc.).
The plan is to index each into a new swish-e index. After a few
come in, I'll merge them together. At any point, I am also sent
queries. These queries need to return results that may involve
a file I was sent to index just moments before. So, as you
can guess, keeping the index up to date is crucial. It will
probably take quite a bit of tuning to figure out at what
point it pays to merge all of my little one-document indexes,
rather than doing the queries against these multiple indexes.
In any case, as you can see, this is a bit more overhead
than I'd ideally like to be maintaining. Also, I've heard
occasional negative remarks about merging on this list and
don't know if I can feel good about relying so heavily on
merging. If I could do incremental indexing, this would be
a complete no-brainer.
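
To make the bookkeeping in the plan above concrete, here is a rough
Python sketch of the "index one document, merge after a few" policy.
It is only an illustration under assumptions: IndexPool, merge_fn and
merge_threshold are names made up for this sketch, and merge_fn stands
in for whatever actually runs the swish-e merge. Nothing here calls
swish-e itself.

```python
from typing import Callable, List, Optional


class IndexPool:
    """Tracks small per-document indexes and merges them past a threshold.

    Hypothetical helper: merge_fn is supplied by the caller and is assumed
    to take a list of index paths and return the path of the merged index.
    """

    def __init__(self, merge_threshold: int,
                 merge_fn: Callable[[List[str]], str]) -> None:
        self.merge_threshold = merge_threshold
        self.merge_fn = merge_fn
        self.main_index: Optional[str] = None   # result of the last merge
        self.pending: List[str] = []            # one-document indexes

    def add(self, index_path: str) -> None:
        """Register a freshly built one-document index; merge if due."""
        self.pending.append(index_path)
        if len(self.pending) >= self.merge_threshold:
            inputs = self.searchable()
            self.main_index = self.merge_fn(inputs)
            self.pending = []

    def searchable(self) -> List[str]:
        """Every index a query must consult to see all documents so far."""
        base = [self.main_index] if self.main_index else []
        return base + self.pending
```

A query would be run against everything returned by searchable(), so
a document indexed moments ago is visible even before the next merge.
The tuning question is then just the value of merge_threshold.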
If anyone is currently working on incremental indexing, I'd
like to hear more about it. Or, if you have any suggestions on how
I can better manage this, that would be welcome as well.
Unfortunately, I also have one other thing to maintain
that I don't have a really convenient answer for. This is the
occasional request to remove a document from the index (either
through aging, because a new document [potentially with a
different name than the original] is to replace it, or because
of obsolescence). Is there a means currently to remove from
the index, or does one do that only by reindexing everything else
(which I obviously can't do in this case) or by leaving them
in the index and having my front end filter them from the
results (very ugly)?
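
For reference, the front-end filtering fallback mentioned above could
look roughly like the Python sketch below. It assumes search hits come
back as a list of document names (pathnames or URLs); the Tombstones
class and its method names are invented for this illustration and are
not part of swish-e.

```python
from typing import Iterable, List, Set


class Tombstones:
    """Remembers documents that were 'removed' but are still in the index."""

    def __init__(self) -> None:
        self._removed: Set[str] = set()

    def remove(self, name: str) -> None:
        """Mark a document as removed (aged out, replaced, or obsolete)."""
        self._removed.add(name)

    def restore(self, name: str) -> None:
        """Undo a removal, e.g. when the same name is indexed again."""
        self._removed.discard(name)

    def filter(self, hits: Iterable[str]) -> List[str]:
        """Drop removed documents from a result list, keeping hit order."""
        return [hit for hit in hits if hit not in self._removed]
```

The removed set only ever grows unless the index itself can drop
documents, which is part of why this feels ugly.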
Thanks much, -keith
Received on Fri Feb 22 20:21:18 2002