Re: question about merging indexes when there is file overlap

From: Thomas M. Parris <parris(at)not-real.isciences.com>
Date: Tue Apr 13 2004 - 19:17:59 GMT
Wow, that was fast!  Many thanks, not just for the answer (which seems to be
that there would not be a problem), but also for the walkthrough of how to
answer such questions for myself in the future.

-- Tom

-----Original Message-----
From: swish-e@sunsite.berkeley.edu
[mailto:swish-e@sunsite.berkeley.edu]On Behalf Of Bill Moseley
Sent: Tuesday, April 13, 2004 1:52 PM
To: Multiple recipients of list
Subject: [SWISH-E] Re: question about merging indexes when there is file overlap


On Tue, Apr 13, 2004 at 11:25:14AM -0700, Thomas M. Parris wrote:
> Dear SWISHers:
>
> I have a quick question about the merge feature of swish.
>
> 1. Assume I have a directory with five files in it (1.html .... 5.html),
> and that I index these files with swish (index1.index).
>
> 2. Now assume I modify 5.html and add 6.html and index these files
> (index2.index). (In my case, 5.html is modified to point to 6.html).
>
> 3. Will I run into problems if I merge index1.index and index2.index?
> (this process will repeat hundreds of times).

Not exactly sure.  I think if you merge two indexes that contain the same
files, it will end up replacing all of them.  When merging, if a file comes
along with the same or a newer date, it replaces the existing file.  Maybe
it should only replace if it's newer.
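
The difference between the two rules can be sketched in shell. This is only
an illustration of the behavior described above, not swish-e source code;
the function names are hypothetical, and the timestamps are the two
modification times from this thread expressed as epoch seconds.

```shell
# Illustrative only: not taken from swish-e.
# Each function takes two modification times in epoch seconds:
#   $1 = time of the file already in the index, $2 = time of the incoming file.

# Rule as described above: replace when the incoming file is the same age or newer.
replaces_same_or_newer() { [ "$2" -ge "$1" ]; }

# Stricter alternative: replace only when the incoming file is strictly newer.
replaces_only_newer() { [ "$2" -gt "$1" ]; }

old=1081881610   # 2004-04-13 11:40:10 PDT
new=1081881638   # 2004-04-13 11:40:38 PDT

if replaces_same_or_newer "$old" "$old"; then echo "same date: replaced"; fi
if ! replaces_only_newer "$old" "$old"; then echo "same date: kept (strict rule)"; fi
if replaces_only_newer "$old" "$new"; then echo "newer date: replaced"; fi
```

Under the first rule, merging an index with itself replaces every file;
under the stricter rule it would replace none.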

moseley@bumby:~/merge$ swish-e -M index1 index1 same
Input index 'index1' has 5 files and 1 words
Input index 'index1' has 5 files and 1 words
Replaced file '1.html 2004-04-13 11:40:10 PDT' with '1.html 2004-04-13 11:40:10 PDT'
Replaced file '2.html 2004-04-13 11:40:10 PDT' with '2.html 2004-04-13 11:40:10 PDT'
Replaced file '3.html 2004-04-13 11:40:10 PDT' with '3.html 2004-04-13 11:40:10 PDT'
Replaced file '4.html 2004-04-13 11:40:10 PDT' with '4.html 2004-04-13 11:40:10 PDT'
Replaced file '5.html 2004-04-13 11:40:10 PDT' with '5.html 2004-04-13 11:40:10 PDT'

But if the index only has the modified files in it:

moseley@bumby:~/merge$ for num in 1 2 3 4 5; do echo "hi" > $num.html; done
moseley@bumby:~/merge$ swish-e -f index1 -i *.html -v0

moseley@bumby:~/merge$ for num in 4 5; do echo "foo" > $num.html; done
moseley@bumby:~/merge$ swish-e -f index2 -i 4.html 5.html -v0

moseley@bumby:~/merge$ swish-e -M index1 index2 indexout
Input index 'index1' has 5 files and 1 words
Input index 'index2' has 2 files and 1 words
Replaced file '4.html 2004-04-13 11:40:10 PDT' with '4.html 2004-04-13 11:40:38 PDT'
Replaced file '5.html 2004-04-13 11:40:10 PDT' with '5.html 2004-04-13 11:40:38 PDT'
Getting words in index 'index1':      1 words
Getting words in index 'index2':      1 words
Processing words in index 'indexout':      2 words
Removed      0 words no longer present in docs for index 'indexout'
Writing main index...
Sorting words ...
Sorting 2 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
2 unique words indexed.
4 properties sorted.
5 files indexed.  0 total bytes.  5 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!

moseley@bumby:~/merge$ swish-e -f indexout -T index_words_full

-----> WORD INFO in index indexout <-----

foo
 Meta:1 4.html Freq:1 Pos/Struct:2/9
 Meta:1 5.html Freq:1 Pos/Struct:2/9

hi
 Meta:1 1.html Freq:1 Pos/Struct:2/9
 Meta:1 2.html Freq:1 Pos/Struct:2/9
 Meta:1 3.html Freq:1 Pos/Struct:2/9


>
> Many thanks in advance for your help.  I love swish.  I've successfully
> used it for a number of varied applications.
>
> Cheers,
> Tom
> -------------------------------------------------------
> Thomas M. Parris
> Research Scientist and Executive Director Boston Office
> ISciences, LLC
> 685 Centre Street, Suite 207
> Jamaica Plain, MA  02130  USA
>
> Tel:   +617-524-8041        http://www.isciences.com/
> Fax:   +617-344-2580        http://www.terraviva.net/
> Email: parris@isciences.com
> ------------------------------------------------------
>
>

--
Bill Moseley
moseley@hank.org
Received on Tue Apr 13 12:17:59 2004