Skip to main content.
home | support | download

Back to List Archive

Re: Incremental indexing?

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Feb 22 2002 - 23:04:44 GMT
At 12:20 PM 02/22/02 -0800, Keith Thompson wrote:
>I've been wondering if any recent thought has gone into
>incremental indexing with swish-e.  I see at the web site that
>it is listed amongst the maybe-in-3.0 list.  I was hoping
>that possibly someone already had their sites set on this one.
>Maybe I just volunteered.  :(

It's on the top of the todo list and work is proceeding.

>I am involved in a project right now that is planning to use the
>merge functionality in order to support incremental indexing.
>This is unfortunately a bit clumsy as well as more memory and
>time intensive than I'll probably like.  I don't know yet--I'm
>just guessing.

Yes, I'd stay away from merge.

>Why not just reindex everything as is usually suggested?
>Without getting into the painful details (that aren't
>important anyway), here's the gist of the situation.:  I get
>sent a document to add to the index.  It might be one every
>few seconds or one a week.

Have you looked at other search engines that are based on something like
Berkeley DB?  That might solve your problem of new documents coming in fast.

That said, for incremental indexing I've always maintained two indexes.
Create a full index, and as new files are added run indexing on the new
files.  When indexing time becomes too long (whatever that means) for the
incremental indexing rerun the full index.

When searching you just specify both indexes.

If you were to store the documents compressed in something like MySQL as
they come in then it makes managing a system like that a bit easier since
you can timestamp things and not have to worry about duplicate files.  And
then you always have copies of the original documents.

Where that doesn't work very well is when you have files that are modified
and those need to be reindexed.

>Unfortunately, I also have one other thing to maintain as well,
>that I don't have a real convenient answer for.  This is the
>occasional request to remove a document from the index (either
>through aging, because a new document [potentially with a
>different name than the original] is to replace it, or because
>of obsolesence).  Is there a means currently to remove from
>the index, or does one do that only by reindexing everything else
>(which I obviously can't do in this case) or by leaving them
>in the index and having my front end filter them from the
>results (very ugly)?

Yes, you have to go the very ugly route.

The problem, of course, is swish creates a reverse index: words point to
files not the other way around.  So currently there's no way to say what
words belong to a given file.

The easiest way for swish may be to somehow keep a table of deleted file
numbers, and when searching ignore word hits that point to those files.
Otherwise, we need a table that points to all the word entries for a given
file, and somehow go in and flag those words as deleted.  Either way we end
up with a fragmented index.

Swish uses it's own database.  But swish can be compiled to use Berkeley
DB[1].  It's a lot slower indexing, but that might give a path to
incremental indexing for both adding records and for deleting records (with
an additional table, like I said above, to track words for each file). 

Jose has also been working on a internal Btree index structure for the
purpose of incremental indexing.  This looks most promising, but I'm not
clear where he is in that process.

[1] I have not used the Berkeley DB method for a while, so it may need
tweaking before it will work.


-- 
Bill Moseley
mailto:moseley@hank.org
Received on Fri Feb 22 23:05:13 2002