[swish-e] Merging indexes, removing items from index

From: Greg Saylor <gregs(at)not-real.net-virtual.com>
Date: Fri Dec 19 2008 - 02:03:23 GMT
Hello,

I have a few questions about merging indexes.  Here is the output of
the merge on which these questions are based (Note: index.swish-e and
index_new.swish-e are exact copies of each other):

bash$ swish-e -M index.swish-e index_new.swish-e index_merge.swish-e
Input index 'index.swish-e' has 20 files and 1203 words
Input index 'index_new.swish-e' has 20 files and 1203 words
Replaced file 'index_new.swish-e:2135710 2008-12-09 03:04:19 PST' with
'index.swish-e:2135710 2008-12-09 03:04:19 PST'
Replaced file 'index_new.swish-e:2135810 2008-12-09 03:04:21 PST' with
'index.swish-e:2135810 2008-12-09 03:04:21 PST'
Replaced file 'index_new.swish-e:652983 2008-12-07 18:35:33 PST' with
'index.swish-e:652983 2008-12-07 18:35:33 PST'
Replaced file 'index_new.swish-e:653082 2008-12-07 18:35:34 PST' with
'index.swish-e:653082 2008-12-07 18:35:34 PST'
Replaced file 'index_new.swish-e:653225 2008-12-07 18:35:36 PST' with
'index.swish-e:653225 2008-12-07 18:35:36 PST'
Replaced file 'index_new.swish-e:653260 2008-12-18 00:12:39 PST' with
'index.swish-e:653260 2008-12-18 00:12:39 PST'
Replaced file 'index_new.swish-e:653399 2008-12-18 00:12:39 PST' with
'index.swish-e:653399 2008-12-18 00:12:39 PST'
Replaced file 'index_new.swish-e:653872 2008-12-07 18:35:49 PST' with
'index.swish-e:653872 2008-12-07 18:35:49 PST'
Replaced file 'index_new.swish-e:653880 2008-12-18 00:12:39 PST' with
'index.swish-e:653880 2008-12-18 00:12:39 PST'
Replaced file 'index_new.swish-e:654244 2008-12-07 18:35:54 PST' with
'index.swish-e:654244 2008-12-07 18:35:54 PST'
Replaced file 'index_new.swish-e:654405 2008-12-18 00:12:39 PST' with
'index.swish-e:654405 2008-12-18 00:12:39 PST'
Replaced file 'index_new.swish-e:654499 2008-12-07 18:35:57 PST' with
'index.swish-e:654499 2008-12-07 18:35:57 PST'
Replaced file 'index_new.swish-e:654520 2008-12-18 00:12:39 PST' with
'index.swish-e:654520 2008-12-18 00:12:39 PST'
Replaced file 'index_new.swish-e:654543 2008-12-07 18:35:58 PST' with
'index.swish-e:654543 2008-12-07 18:35:58 PST'
Replaced file 'index_new.swish-e:654651 2008-12-07 18:35:59 PST' with
'index.swish-e:654651 2008-12-07 18:35:59 PST'
Replaced file 'index_new.swish-e:654679 2008-12-07 18:36:00 PST' with
'index.swish-e:654679 2008-12-07 18:36:00 PST'
Replaced file 'index_new.swish-e:654842 2008-12-07 18:36:02 PST' with
'index.swish-e:654842 2008-12-07 18:36:02 PST'
Replaced file 'index_new.swish-e:654908 2008-12-07 18:36:03 PST' with
'index.swish-e:654908 2008-12-07 18:36:03 PST'
Replaced file 'index_new.swish-e:654970 2008-12-18 00:12:39 PST' with
'index.swish-e:654970 2008-12-18 00:12:39 PST'
Replaced file 'index_new.swish-e:655138 2008-12-18 00:12:39 PST' with
'index.swish-e:655138 2008-12-18 00:12:39 PST'
Getting words in index 'index.swish-e':   1203 words
Getting words in index 'index_new.swish-e':   1203 words
Processing words in index 'index_merge.swish-e':   1203 words
Removed      0 words no longer present in docs for index
'index_merge.swish-e'
Writing main index...
Sorting words ...
Sorting 1,203 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
1,203 unique words indexed.
40 properties sorted.
20 files indexed.  0 total bytes.  5,199 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!

bash$ ls -lrt
-rw-r--r--  1 gsaylor  gsaylor   16438 Dec 18 03:51 index.swish-e.prop
-rw-r--r--  1 gsaylor  gsaylor  442499 Dec 18 03:51 index.swish-e
-rw-r--r--  1 gsaylor  gsaylor  442499 Dec 18 17:51 index_new.swish-e
-rw-r--r--  1 gsaylor  gsaylor   16438 Dec 18 17:51 index_new.swish-e.prop
-rw-r--r--  1 gsaylor  gsaylor   16438 Dec 18 17:53 index_merge.swish-e.prop
-rw-r--r--  1 gsaylor  gsaylor  443519 Dec 18 17:53 index_merge.swish-e

1. Does the merge honor the "PreSortedIndex" setting in swish-e.conf?   I'm
stumped by some of the output I see while doing a merge: it reports
"40 properties sorted", whereas I only saw "One property sorted" when
building the initial index. (Both indexes use the same config file. I also
tried adding the -c option to the merge, but the result was the same.)

2. Why is index_merge.swish-e larger than the input indexes (443519 bytes
versus 442499)? Is that normal, or does it indicate that something
inefficient is happening during the merge?

3. To delete items from the index, I create a new index containing the
changed items, with the XML of the deleted ones set to an empty value, and
then merge it in. This seems to work okay, but over time I can imagine it
creating some inefficiencies in the index.  My question is: based on how
the merge process works, would it be a reasonable enhancement to have
something like a DBM file of active IDs that is looked up during the merge,
so that any ID not present in it does not end up in the final merged
index?   C/C++ is not my strong point, but I thought I'd get some feedback
on this before digging into it too far.
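
To make the idea in question 3 concrete, here is a minimal Python sketch of
the lookup step only. It is an illustration under stated assumptions, not
swish-e code: the function names (build_active_ids, filter_merged) are
hypothetical, the "entries" are plain (id, payload) tuples standing in for
whatever the merge iterates over, and the IDs are borrowed from the merge
output above.

```python
import dbm
import os
import tempfile


def build_active_ids(path, ids):
    """Write the IDs that should survive the merge into a DBM file."""
    with dbm.open(path, "c") as db:
        for doc_id in ids:
            db[str(doc_id)] = b"1"  # the value is unused; only the key matters


def filter_merged(entries, active_path):
    """Yield only the (doc_id, payload) pairs whose ID is in the DBM file.

    'entries' is a hypothetical stand-in for the documents seen during a
    merge; this does not read or write swish-e's index format.
    """
    with dbm.open(active_path, "r") as db:
        for doc_id, payload in entries:
            if str(doc_id) in db:
                yield doc_id, payload


# Small demonstration with IDs taken from the merge output above.
workdir = tempfile.mkdtemp()
active_path = os.path.join(workdir, "active_ids")
build_active_ids(active_path, [2135710, 653260])

merge_input = [(2135710, "doc A"), (652983, "doc B"), (653260, "doc C")]
kept = list(filter_merged(merge_input, active_path))
print(kept)
```

In this sketch the document with ID 652983 is dropped because it is not
listed in the DBM file, which is the behavior proposed for the merge.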

4. (This is a bit off the subject of this email): If I want to squeeze
the most performance possible out of the indexing process, is XML the
optimal input format, or should I consider another option? (The data
originates from a Postgres database.)

Thanks!

- Greg





Received on Thu Dec 18 20:57:49 2008