Re: New beta version

From: Jose Manuel Ruiz <jmruiz(at)not-real.boe.es>
Date: Mon Jun 26 2000 - 15:05:51 GMT
Stephane,

> I have seen that you fixed the merge in your comments. One previous aspect
> of that merge that was rather strange was that depending on the file
> position in the list (first or second), the result was not the same.
> e.g.
> swish-e -M index_file1 index_file2 merged_index12
> swish-e -M index_file2 index_file1 merged_index21
> => merged_index12 <> merged_index21
> 

This is normal, because the merge function reads one index file first and
then the other. Each file has a number in its index file. After merging,
new numbers are reassigned, and these can differ depending on which index
file was processed first (it uses a hash table). This was the old logic.
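
A minimal sketch of why the order matters, assuming file numbers are simply
handed out in the order files are first seen. This is only an illustration in
Python, not the actual swish-e code; merge_file_numbers and the sample file
names are hypothetical.

```python
def merge_file_numbers(first_index, second_index):
    """Assign new file numbers in the order the files are encountered.

    Hypothetical model: each index is just a list of file paths, and a
    dict plays the role of the hash table used during the merge.
    """
    new_numbers = {}
    for index in (first_index, second_index):
        for path in index:
            if path not in new_numbers:
                new_numbers[path] = len(new_numbers) + 1
    return new_numbers


index_file1 = ["a.html", "b.html"]
index_file2 = ["c.html", "a.html"]

merged_index12 = merge_file_numbers(index_file1, index_file2)
merged_index21 = merge_file_numbers(index_file2, index_file1)

print(merged_index12)  # {'a.html': 1, 'b.html': 2, 'c.html': 3}
print(merged_index21)  # {'c.html': 1, 'a.html': 2, 'b.html': 3}
```

Both results contain the same files, but the numbers assigned to them differ,
so the two merged indexes are not identical.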

> Swish-e was not too good at updating an index, as when it was seeing an
> already present file, it was basically replacing the meta tags, but
> merging the keywords, which meant the index became a little inconsistent. 
> When doing a merge with two distinct indexes (no common file), the result
> was as expected.
> 
> I was wondering if when you did your merge fix, you allowed for a
> merge-replace (replace existing files with new meta tags and keywords but
> always insert new files) which is better to do incremental updates or if
> you kept the old logic. Here, the left or right position become very
> important as one index is loaded, and the other used to perform changes.
> As the date of the file is not kept, swish cannot automatically decide
> which one is better. It would mean to have a clear definition of how merge
> works:
> e.g. swish-e -M updates index_file1 updated_index1
> 

I kept the old logic. I have just added some minor info to the merge
function (wordcharacters, begincharacters, ...).
Really, I do not like merging files. To date, some information is still
lost in the header. Duplicate file entries are not well implemented: the
first entry found is used, and this can change depending on the order of
the index files on the command line.
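
To make the duplicate problem concrete, here is a small Python sketch of a
"first entry found wins" merge. It is an assumption-laden toy, not the real
merge function: merge_entries, the dict layout and the titles are all
hypothetical.

```python
def merge_entries(first_index, second_index):
    """Merge two toy indexes, keeping the first entry seen for each file.

    Hypothetical model: an index maps a file path to its entry data.
    """
    merged = {}
    for index in (first_index, second_index):
        for path, entry in index.items():
            # setdefault keeps an existing entry, so later duplicates are ignored
            merged.setdefault(path, entry)
    return merged


index_file1 = {"a.html": {"title": "Old title"}}
index_file2 = {"a.html": {"title": "New title"}, "b.html": {"title": "B"}}

merged_index12 = merge_entries(index_file1, index_file2)
merged_index21 = merge_entries(index_file2, index_file1)

print(merged_index12["a.html"]["title"])  # Old title
print(merged_index21["a.html"]["title"])  # New title
```

With this kind of logic, neither position on the command line reliably means
"the newer data", which is why it is not a good base for incremental updates.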

> To even improve more swish, a delete function could allow to delete any
> reference to a file in the index, which is the last step for incremental
> updates.
> 

Delete is very difficult to code with the internal format of index
files. Perhaps an easy approach is to mark a file entry as deleted. The
file space will not be recovered and word entries will not be deleted.
This means that a rebuild must be issued periodically to recover space.
The same problems apply to updating files.
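
The "mark as deleted" idea could look roughly like the Python sketch below.
It is not based on the real index format: ToyIndex and all of its methods are
hypothetical, and the structures are kept in memory only to show the
trade-off between a cheap delete and a periodic rebuild.

```python
class ToyIndex:
    """Toy inverted index with delete-by-marking (hypothetical model)."""

    def __init__(self):
        self.files = []       # file entries; list position is the file number
        self.deleted = set()  # file numbers marked as deleted
        self.postings = {}    # word -> list of file numbers

    def add(self, path, words):
        file_no = len(self.files)
        self.files.append(path)
        for word in set(words):
            self.postings.setdefault(word, []).append(file_no)
        return file_no

    def delete(self, path):
        # Only mark the entry: file space and word entries stay in place.
        for file_no, name in enumerate(self.files):
            if name == path:
                self.deleted.add(file_no)

    def search(self, word):
        # Marked files are filtered out at query time.
        return [self.files[n] for n in self.postings.get(word, [])
                if n not in self.deleted]

    def rebuild(self):
        # Periodic rebuild: copy live entries only, which recovers the space.
        fresh = ToyIndex()
        renumber = {}
        for old_no, path in enumerate(self.files):
            if old_no not in self.deleted:
                renumber[old_no] = len(fresh.files)
                fresh.files.append(path)
        for word, file_nos in self.postings.items():
            live = [renumber[n] for n in file_nos if n in renumber]
            if live:
                fresh.postings[word] = live
        return fresh


index = ToyIndex()
index.add("a.html", ["swish", "merge"])
index.add("b.html", ["swish"])
index.delete("a.html")

print(index.search("swish"))      # ['b.html']
print(len(index.files))           # 2, the deleted entry still takes space
compact = index.rebuild()
print(compact.files)              # ['b.html']
```

An update could be modelled the same way, as a delete followed by an add, so
it leaves the same dead entries behind until the next rebuild.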

> It may be possible that with all the performance improvement you did,
> incremental updates would save very little and become useless.
> 

I think so. That is one of the reasons I made them. Now, I have 3
databases with more than 50000 documents. I rebuild them every day in
less than one hour on a Sun SPARC 400 MHz. BTW, it is even faster on an
IBM RS6K PowerPC 333 MHz.
I would like to find time to do a simple benchmark.

Anyway, it seems that updating and deleting documents are important to
people. Since this would be a major change, they can be included in a
future release.

cu

Jose
Received on Mon Jun 26 11:23:06 2000