Hello,
I have a few questions about merging indexes. Here is the output of
the merge on which these questions are based (note: index.swish-e and
index_new.swish-e are exact copies of each other):
bash$ swish-e -M index.swish-e index_new.swish-e index_merge.swish-e
Input index 'index.swish-e' has 20 files and 1203 words
Input index 'index_new.swish-e' has 20 files and 1203 words
Replaced file 'index_new.swish-e:2135710 2008-12-09 03:04:19 PST' with
'index.swish-e:2135710 2008-12-09 03:04:19 PST'
Replaced file 'index_new.swish-e:2135810 2008-12-09 03:04:21 PST' with
'index.swish-e:2135810 2008-12-09 03:04:21 PST'
Replaced file 'index_new.swish-e:652983 2008-12-07 18:35:33 PST' with
'index.swish-e:652983 2008-12-07 18:35:33 PST'
Replaced file 'index_new.swish-e:653082 2008-12-07 18:35:34 PST' with
'index.swish-e:653082 2008-12-07 18:35:34 PST'
Replaced file 'index_new.swish-e:653225 2008-12-07 18:35:36 PST' with
'index.swish-e:653225 2008-12-07 18:35:36 PST'
Replaced file 'index_new.swish-e:653260 2008-12-18 00:12:39 PST' with
'index.swish-e:653260 2008-12-18 00:12:39 PST'
Replaced file 'index_new.swish-e:653399 2008-12-18 00:12:39 PST' with
'index.swish-e:653399 2008-12-18 00:12:39 PST'
Replaced file 'index_new.swish-e:653872 2008-12-07 18:35:49 PST' with
'index.swish-e:653872 2008-12-07 18:35:49 PST'
Replaced file 'index_new.swish-e:653880 2008-12-18 00:12:39 PST' with
'index.swish-e:653880 2008-12-18 00:12:39 PST'
Replaced file 'index_new.swish-e:654244 2008-12-07 18:35:54 PST' with
'index.swish-e:654244 2008-12-07 18:35:54 PST'
Replaced file 'index_new.swish-e:654405 2008-12-18 00:12:39 PST' with
'index.swish-e:654405 2008-12-18 00:12:39 PST'
Replaced file 'index_new.swish-e:654499 2008-12-07 18:35:57 PST' with
'index.swish-e:654499 2008-12-07 18:35:57 PST'
Replaced file 'index_new.swish-e:654520 2008-12-18 00:12:39 PST' with
'index.swish-e:654520 2008-12-18 00:12:39 PST'
Replaced file 'index_new.swish-e:654543 2008-12-07 18:35:58 PST' with
'index.swish-e:654543 2008-12-07 18:35:58 PST'
Replaced file 'index_new.swish-e:654651 2008-12-07 18:35:59 PST' with
'index.swish-e:654651 2008-12-07 18:35:59 PST'
Replaced file 'index_new.swish-e:654679 2008-12-07 18:36:00 PST' with
'index.swish-e:654679 2008-12-07 18:36:00 PST'
Replaced file 'index_new.swish-e:654842 2008-12-07 18:36:02 PST' with
'index.swish-e:654842 2008-12-07 18:36:02 PST'
Replaced file 'index_new.swish-e:654908 2008-12-07 18:36:03 PST' with
'index.swish-e:654908 2008-12-07 18:36:03 PST'
Replaced file 'index_new.swish-e:654970 2008-12-18 00:12:39 PST' with
'index.swish-e:654970 2008-12-18 00:12:39 PST'
Replaced file 'index_new.swish-e:655138 2008-12-18 00:12:39 PST' with
'index.swish-e:655138 2008-12-18 00:12:39 PST'
Getting words in index 'index.swish-e': 1203 words
Getting words in index 'index_new.swish-e': 1203 words
Processing words in index 'index_merge.swish-e': 1203 words
Removed 0 words no longer present in docs for index
'index_merge.swish-e'
Writing main index...
Sorting words ...
Sorting 1,203 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
1,203 unique words indexed.
40 properties sorted.
20 files indexed. 0 total bytes. 5,199 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!
bash$ ls -lrt
-rw-r--r-- 1 gsaylor gsaylor 16438 Dec 18 03:51 index.swish-e.prop
-rw-r--r-- 1 gsaylor gsaylor 442499 Dec 18 03:51 index.swish-e
-rw-r--r-- 1 gsaylor gsaylor 442499 Dec 18 17:51 index_new.swish-e
-rw-r--r-- 1 gsaylor gsaylor 16438 Dec 18 17:51 index_new.swish-e.prop
-rw-r--r-- 1 gsaylor gsaylor 16438 Dec 18 17:53 index_merge.swish-e.prop
-rw-r--r-- 1 gsaylor gsaylor 443519 Dec 18 17:53 index_merge.swish-e
1. Does the merge honor the "PreSortedIndex" setting in swish-e.conf? I'm
stumped by some of the merge output: it reports "40 properties sorted",
whereas I only saw "One property sorted" when building the initial index.
(Both indexes use the same config file. I tried adding the -c option to
the merge, but the result was the same.)
2. Why is index_merge.swish-e larger than its inputs (443519 bytes versus
442499)? Is that normal, or does it indicate that something inefficient is
happening during the merge?
3. To delete items from the index, I create a new index containing the
changed items and set the XML of the deleted ones to an empty value.
This seems to work okay, but over time I can imagine it creating
inefficiencies in the index. My question is: based on how the merge
process works, would it be a reasonable enhancement to have something like
a DBM file of active IDs that is consulted during the merge, so that any
ID not present in it is left out of the final merged index? C/C++ is
not my strong point, but I thought I'd ask for opinions before
digging into it too far.
4. (This is a bit off the subject of this email.) If I want to
squeeze the most performance possible out of the indexing process, is XML
the optimal input format, or should I consider another option? (The data
originates from a Postgres database.)
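To make the idea in question 3 concrete, here is a rough Python sketch of
the lookup I have in mind. It is not swish-e code and does not touch the
index format; it only illustrates "keep an entry if its ID is in a DBM
file of active IDs". The function names build_active_ids and
filter_active, and the (id, payload) entry layout, are made up for
illustration.

```python
import dbm


def _key(record_id):
    # DBM keys are stored as bytes; normalise every ID the same way.
    return str(record_id).encode("utf-8")


def build_active_ids(path, ids):
    """Create a fresh DBM file whose keys are the IDs still considered live."""
    with dbm.open(path, "n") as db:
        for record_id in ids:
            db[_key(record_id)] = b"1"


def filter_active(entries, path):
    """Yield only the (record_id, payload) pairs whose ID is in the DBM file.

    'entries' stands in for whatever the merge iterates over; here it is
    just an iterable of tuples (an assumption for this sketch).
    """
    with dbm.open(path, "r") as db:
        for record_id, payload in entries:
            if _key(record_id) in db:
                yield record_id, payload
```

With this shape, a deleted item would simply be absent from the DBM file
rather than being re-indexed with empty XML.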
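One option relevant to question 4 would be to stream records straight
from the database to swish-e through its "-S prog" input method instead
of writing XML files to disk first. Below is a minimal Python sketch of
that, assuming the prog format is a block of headers (Path-Name and
Content-Length, optionally Last-Mtime), a blank line, and then exactly
Content-Length bytes of content, which is how I read the documentation.
The <doc>/<title>/<body> element names, the row layout, and the function
names are hypothetical, and the database query itself is left out.

```python
from xml.sax.saxutils import escape


def to_prog_record(record_id, title, body, mtime=None):
    """Build one document record as bytes: headers, blank line, XML content."""
    xml = "<doc><title>{}</title><body>{}</body></doc>".format(
        escape(title), escape(body)
    ).encode("utf-8")
    headers = ["Path-Name: {}".format(record_id)]
    if mtime is not None:
        # Last-Mtime is given as seconds since the epoch.
        headers.append("Last-Mtime: {}".format(int(mtime)))
    # Content-Length counts bytes, so measure the encoded XML.
    headers.append("Content-Length: {}".format(len(xml)))
    return ("\n".join(headers) + "\n\n").encode("utf-8") + xml


def write_records(rows, out):
    """Write every (id, title, body, mtime) row to a binary stream."""
    for record_id, title, body, mtime in rows:
        out.write(to_prog_record(record_id, title, body, mtime))
```

A script built around this would call write_records(rows, sys.stdout.buffer)
and be named as the input program when indexing with -S prog. Whether
this is actually faster than indexing XML files is exactly what I am
asking about.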
Thanks!
- Greg
_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Received on Thu Dec 18 20:57:49 2008