Re: [swish-e] Merging indexes, removing items from index

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Fri Jan 09 2009 - 03:08:50 GMT

Greg Saylor wrote on 12/18/08 8:03 PM:
> Hello,
> 
> I have a couple of questions about merging indexes.  Here is the output of
> the merge on which these questions are based (Note: index.swish and
> index_new.swish are exact copies of each other):

[snip]

> 1. Does it consider the "PreSortedIndex" setting in swish-e.conf?   I'm
> kind of stumped based on some of the output that I'm seeing while doing a
> merge "40 properties sorted", I only saw "One property sorted" when doing
> the initial index. (Both indexes are using the same config file and I
> tried adding the -c option to the merge, but the result was the same).
> 

I'm stumped on this one too. Bill, any ideas?


> 2. Why is index_merge.swish-e larger - is that normal or does it indicate
> that something inefficient is occurring during the merge?

I can only assume that the merged index contains more documents than either of
the two originals, maybe because of a 'deleted' doc (as you mention below)?

> 
> 3. To delete items out of the index, I am creating a new index with
> changed items and setting the XML of the deleted ones to an empty value. 
> This seems to work okay, but over time I can imagine how this could create
> some inefficiencies in the index.  My question is: based on how the merge
> process works, would it be a reasonable enhancement to have something like
> a DBM file of active IDs that is looked up during the merge -- and if the
> ID is not present it does not end up in the final merge file?   C/C++ is
> not my strong point, but thought I'd get some thoughts on this before
> digging into it too far.

The merge feature really exists because Swish-e 2.x doesn't have incremental
indexing by default (you can build it with incremental support, but I wouldn't
recommend that at this point in history).

So your approach could work, but a proper incremental index would be a better
solution, in my opinion. I'll be posting on that topic shortly.
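
For illustration only, here is a minimal sketch of the "DBM file of active IDs"
idea from question 3, written in Python rather than C. It does not touch
Swish-e's index format or its merge code: the two indexes are modeled as plain
dicts keyed by document ID, and build_active_ids and merge_active are
hypothetical helper names. It only shows the lookup-during-merge logic.

```python
import dbm


def build_active_ids(path, ids):
    """Write the IDs that should survive the merge into a DBM file."""
    with dbm.open(path, "n") as db:
        for doc_id in ids:
            db[doc_id.encode("utf-8")] = b"1"


def merge_active(index_a, index_b, active_path):
    """Merge two {doc_id: record} dicts (stand-ins for real indexes).

    A record is kept only if its ID is present in the DBM file.
    When both inputs contain the same ID, the record from index_b wins.
    """
    merged = {}
    with dbm.open(active_path, "r") as db:
        for source in (index_a, index_b):
            for doc_id, record in source.items():
                if doc_id.encode("utf-8") in db:
                    merged[doc_id] = record
    return merged
```

With this shape, deleting a document means leaving its ID out of the DBM file,
so nothing for it reaches the merged output and no empty placeholder records
accumulate.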

> 
> 4. (This is a bit off the subject of this email): If I am wanting to
> squeeze the most performance possible out of the indexing process, is XML
> the optimal format, or should I consider another option (data is
> originating from a Postgres database)?

XML is fine. The best, actually, given the data source.
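
As a rough sketch of what that can look like, the Python below turns one
database row into a small XML document. The row is modeled as a plain dict, and
the function name row_to_xml, the root tag "doc", and the column names are all
assumptions for illustration. Fetching rows from Postgres and handing the
result to the indexer are left out.

```python
import xml.etree.ElementTree as ET


def row_to_xml(row, root_tag="doc"):
    """Serialize one row (a dict of column name -> value) as an XML string.

    Each column becomes a child element named after the column, so column
    names must be valid XML tag names. Text is escaped by ElementTree.
    """
    root = ET.Element(root_tag)
    for column, value in row.items():
        child = ET.SubElement(root, column)
        child.text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")
```

Using one element per column keeps the mapping from database fields to XML
tags direct, which makes it straightforward to line the tags up with whatever
fields are configured for indexing.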

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Thu Jan 8 22:04:29 2009