Greg Saylor wrote on 12/18/08 8:03 PM:
> I have a couple of questions about merging indexes. Here is the output of
> the merge on which these questions are based (Note: index.swish and
> index_new.swish are exact copies of each other):
> 1. Does it consider the "PreSortedIndex" setting in swish-e.conf? I'm
> kind of stumped by some of the output I'm seeing: the merge reports
> "40 properties sorted", but I only saw "One property sorted" when building
> the initial index. (Both indexes use the same config file, and I tried
> adding the -c option to the merge, but the result was the same.)
I'm stumped on this one too. Bill, any ideas?
> 2. Why is index_merge.swish-e larger? Is that normal, or does it mean
> something inefficient is occurring during the merge?
I can only assume that the merged index contains more documents than either of
the original two, maybe because of a 'deleted' doc (as you mention below)?
> 3. To delete items out of the index, I am creating a new index with
> changed items and setting the XML of the deleted ones to an empty value.
> This seems to work okay, but over time I can imagine how this could create
> some inefficiencies in the index. My question is: based on how the merge
> process works, would it be a reasonable enhancement to have something like
> a DBM file of active IDs that is looked up during the merge -- and if the
> ID is not present it does not end up in the final merge file? C/C++ is
> not my strong point, but thought I'd get some thoughts on this before
> digging into it too far.
The merge feature is really there because Swish-e 2.x doesn't have incremental
indexing by default (you can build it with incremental support, but I wouldn't
recommend it at this point in history).
So your approach could work, but a proper incremental index would be a better
solution imo. I'll be posting on that topic shortly.
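
To make the idea in question 3 concrete, here is a rough sketch in Python (not Swish-e's C code) of a merge that consults a DBM file of active IDs and drops anything not listed there. Everything in it is an assumption for illustration: the function names, the DBM layout, and the modelling of an index as a list of (doc_id, payload) tuples. It says nothing about how the real merge reads index files.

```python
import dbm
import os
import tempfile


def build_active_ids(path, ids):
    """Write the live document IDs to a DBM file (hypothetical layout)."""
    with dbm.open(path, "n") as db:
        for doc_id in ids:
            db[doc_id.encode("utf-8")] = b"1"


def merge_active(index_a, index_b, active_path):
    """Merge two record lists, keeping only IDs present in the DBM file.

    Each index is modelled as a list of (doc_id, payload) tuples, which is
    a stand-in for real index entries. Entries from index_b replace entries
    with the same ID from index_a.
    """
    merged = {}
    with dbm.open(active_path, "r") as active:
        for doc_id, payload in list(index_a) + list(index_b):
            if doc_id.encode("utf-8") in active:
                merged[doc_id] = payload
    return merged


def demo():
    """Run the sketch against a throwaway DBM file."""
    workdir = tempfile.mkdtemp()
    active_path = os.path.join(workdir, "active_ids")
    build_active_ids(active_path, ["1", "3"])
    old = [("1", "old text"), ("2", "deleted doc")]
    new = [("1", "new text"), ("3", "added doc")]
    return merge_active(old, new, active_path)
```

In this toy version the deleted document (ID "2") never reaches the merged result, so nothing empty is left behind to accumulate.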
> 4. (This is a bit off the subject of this email): If I am wanting to
> squeeze the most performance possible out of the indexing process, is XML
> the optimal format, or should I consider another option (data is
> originating from a Postgres database)?
XML is fine. The best, actually, given the data source.
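
For what it's worth, a minimal sketch of feeding database rows to the indexer as XML might look like the following Python. It assumes the "-S prog" input method, where each document is preceded by Path-Name and Content-Length headers and a blank line; check that framing against the SWISH-RUN docs for your version. The rows are hard-coded here as a stand-in for a Postgres query, and the <doc> wrapper, the column names and the "doc/ID" path scheme are all made up for illustration.

```python
import sys
from xml.sax.saxutils import escape


def row_to_xml(row):
    """Serialize one row (a dict of column -> value) as a small XML document."""
    fields = "".join(
        "<{0}>{1}</{0}>".format(col, escape(str(val))) for col, val in row.items()
    )
    return "<doc>{}</doc>".format(fields)


def prog_record(path_name, xml_text):
    """Frame one document: headers, a blank line, then the body as bytes."""
    body = xml_text.encode("utf-8")
    # Content-Length is counted in bytes, so measure the encoded body.
    header = "Path-Name: {}\nContent-Length: {}\n\n".format(path_name, len(body))
    return header.encode("utf-8") + body


if __name__ == "__main__":
    # Hypothetical rows; in practice these would come from a database cursor.
    rows = [{"id": 1, "title": "Fish & chips"}]
    for row in rows:
        record = prog_record("doc/{}".format(row["id"]), row_to_xml(row))
        sys.stdout.buffer.write(record)
```

Escaping the values matters more than the choice of format: a stray "&" or "<" in a text column would otherwise produce XML the parser rejects.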
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Received on Thu Jan 8 22:04:29 2009