Re: [swish-e] incremental index format produces *much* larger index files?

From: Judith Retief <JudithR(at)>
Date: Thu Nov 29 2007 - 08:15:46 GMT
> Judith Retief wrote on 11/20/07 3:54 AM:

>> However, my index files differ hugely in size: the merged index files add
>> to about 80M, the incremental index files almost 600M! What's going on?
>> Is there anything that I could be doing wrong to be generating these huge
>> files? 

> Without seeing examples of your configs and the merge commands you are 
> running, it's hard to speculate.
> One guess is that at merge time duplicates are being tossed out. But 
> that size difference seems too significant.
> IME, the index size varies a lot based on the number/size/compression 
> of the properties I am storing.

> -- 
> Peter Karman  .  .  peter(at)

It's not my intent to have other people debug my code, but if anyone is
willing to have a look at this to see if I'm doing anything ridiculously
wrong I'd appreciate it. 

I use exactly the same config file for the two indexes. I index the same
content set for the two runs, and there are no duplicates.

This is the swish.config file:
IndexContents XML*  .xml
ParserWarnLevel 3

MetaNames content title body abstract attachment_data content.u_effective_date content.u_expiration_date content.u_updated_date
PropertyNamesDate content.u_effective_date content.u_expiration_date

PropCompressionLevel 9
MinWordLimit 3

UndefinedMetaTags index 
UndefinedXMLAttributes index

IgnoreWords file: /home/cms/cass/indexer/stopwords/english.txt
IgnoreNumberChars 0123456789$.,

Our app selects content items in batches of 50 from a database queue, and
for each batch it opens the swish pipe, indexes the 50 items, and closes the
pipe again.

The merging version looks like this (it's Tcl code):

set swish_index [open "|swish-e -v3 -S prog -i stdin \
                                -c ./swish.config \
                                -f /tmp/temp.index" w]

Then, for each of the content items:
    set id (get the data id from the database)
    set data (read the data from the database)
    set date [now]
    set content_length [string length $data]

    puts $swish_index "Path-Name: \'$id\'"
    puts $swish_index "Content-Length: $content_length"
    puts $swish_index "Last-Mtime: [clock format [clock scan $date] -format
    puts $swish_index "Document-Type: XML*\n"
    puts $swish_index $data
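
The header lines above could be wrapped in a small helper, sketched here under a few assumptions: the proc name swish_record and its argument order are made up for illustration, the data is assumed to be sent as UTF-8, and Content-Length is assumed to be a byte count (string length on the raw Tcl string counts characters, which differs for non-ASCII content).

```tcl
# Build one document record for swish-e's "prog" input method.
# Hypothetical helper: the name and argument order are illustrative only.
proc swish_record {id data mtime} {
    # Assumption: the body is sent as UTF-8. Encode first so that
    # string length counts bytes rather than characters.
    set bytes [encoding convertto utf-8 $data]
    set header ""
    append header "Path-Name: $id\n"
    append header "Content-Length: [string length $bytes]\n"
    append header "Last-Mtime: $mtime\n"
    append header "Document-Type: XML*\n"
    # A blank line separates the headers from the document body.
    return "$header\n$bytes"
}
```

To use it, the pipe would probably need "fconfigure $swish_index -translation binary" so the bytes go out unchanged, followed by "puts -nonewline $swish_index [swish_record $id $data $mtime]" for each item.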

And after indexing the set, we merge the master and temp to a temp merged 
file, which is then the new master:

exec swish-e -M "/data/merge_index/index.swish-e" /tmp/temp.index /tmp/merged.index

exec mv /tmp/merged.index		"/data/merge_index/index.swish-e"
exec mv /tmp/merged.index.prop	"/data/merge_index/index.swish-e.prop"
exec mv /tmp/merged.index.btree	"/data/merge_index/index.swish-e.btree"
exec mv /tmp/merged.index.array	"/data/merge_index/index.swish-e.array"
exec mv /tmp/merged.index.file	"/data/merge_index/index.swish-e.file"
exec mv /tmp/merged.index.psort	"/data/merge_index/index.swish-e.psort"
exec mv /tmp/merged.index.wdata	"/data/merge_index/index.swish-e.wdata"

exec rm /tmp/temp.index
(and remove the rest of the temp index files likewise)
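
The sequence of mv calls above can also be written as a loop. This is only a sketch: the proc name promote_index is hypothetical, and the extension list simply mirrors the files named above.

```tcl
# Replace the master index with the freshly merged one.
# Hypothetical helper: the extension list mirrors the files listed above.
proc promote_index {merged master} {
    foreach ext {"" .prop .btree .array .file .psort .wdata} {
        set src $merged$ext
        # Skip companion files that were not created for this index.
        if {[file exists $src]} {
            file rename -force -- $src $master$ext
        }
    }
}
```

It would be called as "promote_index /tmp/merged.index /data/merge_index/index.swish-e" after each merge.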

The incremental index version looks like this
Firstly one has to create an initial index file set by indexing one item
without specifying Update-Mode (Update-Mode: Index assumes there's an
existing file):

set swish_index [open "|swish-e -v3 \
                                -S prog -i stdin \
                                -c ./swish.config \
                                -f /data/incremental_index/index.swish-e" w]

and then you index one item using:
    puts $swish_index "Path-Name: \'$id\'"
    puts $swish_index "Content-Length: $content_length"
    puts $swish_index "Last-Mtime: [clock format [clock scan $date] -format
    puts $swish_index "Document-Type: XML*\n"
    puts $swish_index $data

After bootstrapping like this, we kick off the true incremental indexing:

set swish_index [open "|swish-e -v3 -u \
                                -S prog -i stdin \
                                -c ./swish.config \
                                -f /data/incremental_index/index.swish-e" w]

And for each of the 50 items:
    puts $swish_index "Path-Name: \'$id\'"
    puts $swish_index "Update-Mode: Index"
    puts $swish_index "Content-Length: $content_length"
    puts $swish_index "Last-Mtime: [clock format [clock scan $date] -format
    puts $swish_index "Document-Type: XML*\n"
    puts $swish_index $data

The search results for the two types of index seem to be identical, so
why would the incremental index files be so much larger?
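
To narrow down which of the files accounts for the difference, the sizes could be listed per file for both indexes. A rough sketch, assuming every companion file shares the index path as a prefix (the proc name index_sizes is hypothetical):

```tcl
# Report the size in bytes of each file that makes up an index.
# Assumption: all companion files share the index path as a prefix.
proc index_sizes {index} {
    set sizes [dict create]
    foreach f [lsort [glob -nocomplain -- "${index}*"]] {
        dict set sizes [file tail $f] [file size $f]
    }
    return $sizes
}
```

Running "index_sizes /data/merge_index/index.swish-e" and "index_sizes /data/incremental_index/index.swish-e" and comparing the two dictionaries would show whether the growth sits in, say, the .prop file or in the word data.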

Received on Thu Nov 29 03:15:54 2007