Re: [swish-e] incremental index format produces *much* larger index files?

From: Judith Retief <JudithR(at)not-real.inet.co.za>
Date: Thu Nov 29 2007 - 08:15:46 GMT
> Judith Retief wrote on 11/20/07 3:54 AM:

>> However, my index files differ hugely in size: the merged index files add up
>> to about 80M, the incremental index files almost 600M! What's going on?
> 
>> Is there anything that I could be doing wrong to be generating these huge
>> files? 


> Without seeing examples of your configs and the merge commands you are 
> running, it's hard to speculate.
>
> One guess is that at merge time duplicates are being tossed out. But 
> that size difference seems too significant.
>
> IME, the index size varies a lot based on the number/size/compression 
> of the properties I am storing.

> -- 
> Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com


It's not my intent to have other people debug my code, but if anyone is
willing to have a look at this to see if I'm doing anything ridiculously
wrong I'd appreciate it. 

I use exactly the same config file for the two indexes. I index the same
content set for the two runs, and there are no duplicates.


This is the swish.config file:
===============================================
IndexContents XML*  .xml
ParserWarnLevel 3

MetaNames content title body abstract attachment_data
content.u_effective_date content.u_expiration_date content.u_updated_date 
PropertyNamesDate content.u_effective_date content.u_expiration_date
content.u_updated_date 

PropCompressionLevel 9
MinWordLimit 3

UndefinedMetaTags index 
UndefinedXMLAttributes index

IgnoreWords file: /home/cms/cass/indexer/stopwords/english.txt
IgnoreNumberChars 0123456789$.,
=================================================

Our app selects content items in batches of 50 from a database queue, and
for each batch it opens the swish pipe, indexes the 50 items, and closes
the pipe again.
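
The per-item step of that loop can be sketched as a small Tcl helper. This is only a sketch under stated assumptions: format_swish_doc is a hypothetical name, and it assumes the document is sent as UTF-8. It counts bytes rather than characters, because the -S prog documentation describes Content-Length as a length in bytes, and the two differ as soon as the XML contains non-ASCII text.

```tcl
# Hypothetical helper: build the header block and body for one document
# in the format read by "swish-e -S prog -i stdin".
# Assumption: the content is sent as UTF-8.
proc format_swish_doc {id data mtime {update_mode ""}} {
    # Encode first so that the length below is a byte count,
    # not a character count.
    set bytes [encoding convertto utf-8 $data]
    set headers [list \
        "Path-Name: $id" \
        "Content-Length: [string length $bytes]" \
        "Last-Mtime: $mtime" \
        "Document-Type: XML*"]
    if {$update_mode ne ""} {
        lappend headers "Update-Mode: $update_mode"
    }
    # A blank line ends the headers; the body follows immediately.
    return "[join $headers \n]\n\n$bytes"
}
```

To write exactly the counted bytes, the pipe would be put into binary mode with fconfigure $swish_index -translation binary, and the result sent with puts -nonewline $swish_index [format_swish_doc $id $data [clock seconds]].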



The merging version looks like this (it's TCL code)
--------------------------------------------------

set swish_index [open "|swish-e -v3 -S prog -i stdin \
                                -c ./swish.config \
                                -f /tmp/temp.index" w]

Then, for each of the content items:
    set id (get the data id from the database)
    set data (read the data from the database)
    set date [now]
    set content_length [string length $data]

    puts $swish_index "Path-Name: \'$id\'"
    puts $swish_index "Content-Length: $content_length"
    puts $swish_index "Last-Mtime: [clock format [clock scan $date] -format %s]"
    puts $swish_index "Document-Type: XML*\n"
    puts $swish_index $data

And after indexing the set, we merge the master and temp to a temp merged 
file, which is then the new master:

exec swish-e -M "/data/merge_index/index.swish-e" /tmp/temp.index /tmp/merged.index

exec mv /tmp/merged.index		"/data/merge_index/index.swish-e"
exec mv /tmp/merged.index.prop	"/data/merge_index/index.swish-e.prop"
exec mv /tmp/merged.index.btree	"/data/merge_index/index.swish-e.btree"
exec mv /tmp/merged.index.array	"/data/merge_index/index.swish-e.array"
exec mv /tmp/merged.index.file	"/data/merge_index/index.swish-e.file"
exec mv /tmp/merged.index.psort	"/data/merge_index/index.swish-e.psort"
exec mv /tmp/merged.index.wdata	"/data/merge_index/index.swish-e.wdata"

exec rm /tmp/temp.index
(and remove the rest of the temp index files likewise)
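
The per-file moves can also be written as one loop. A minimal sketch, assuming the index set consists of exactly the suffixes listed in the mv commands above; move_index_set is a hypothetical helper name, and it uses Tcl's built-in file rename instead of exec mv.

```tcl
# Hypothetical helper: move every file of one index set to a new base name.
# The suffix list is copied from the mv commands above; "" is the main file.
proc move_index_set {from to} {
    foreach suffix {"" .prop .btree .array .file .psort .wdata} {
        # Skip suffixes that were not produced for this index.
        if {[file exists $from$suffix]} {
            file rename -force -- $from$suffix $to$suffix
        }
    }
}
```

For the merge step this would be called as move_index_set /tmp/merged.index /data/merge_index/index.swish-e.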


The incremental index version looks like this
---------------------------------------------
Firstly one has to create an initial index file set by indexing one item
without specifying Update-Mode (Update-Mode: Index assumes there's an
existing file):

set swish_index [open "|swish-e -v3 \
                                -S prog -i stdin \
                                -c ./swish.config \
                                -f /data/incremental_index/index.swish-e" w]

and then you index one item using:
    puts $swish_index "Path-Name: \'$id\'"
    puts $swish_index "Content-Length: $content_length"
    puts $swish_index "Last-Mtime: [clock format [clock scan $date] -format %s]"
    puts $swish_index "Document-Type: XML*\n"
    puts $swish_index $data


After bootstrapping like this, we kick off the true incremental indexing:

set swish_index [open "|swish-e -v3 -u \
                                -S prog -i stdin \
                                -c ./swish.config \
                                -f /data/incremental_index/index.swish-e" w]

And for each of the 50 items:
    puts $swish_index "Path-Name: \'$id\'"
    puts $swish_index "Update-Mode: Index"
    puts $swish_index "Content-Length: $content_length"
    puts $swish_index "Last-Mtime: [clock format [clock scan $date] -format %s]"
    puts $swish_index "Document-Type: XML*\n"
    puts $swish_index $data
	

The search results for the two types of indexes seem to be identical - so
why would the incremental index files be so much larger?
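
To narrow down which part of the index set accounts for the difference, the per-file sizes of the two sets can be listed side by side. A minimal sketch; index_sizes is a hypothetical helper, and it simply reports every file whose name starts with the given index base name.

```tcl
# Hypothetical diagnostic: return a flat list of {file name, size in bytes}
# pairs for every file belonging to one index set.
proc index_sizes {base} {
    set sizes {}
    # Match the main index file plus all of its suffixed companions.
    foreach path [lsort [glob -nocomplain -- $base $base.*]] {
        lappend sizes [file tail $path] [file size $path]
    }
    return $sizes
}

# Print the sizes of one index set, one file per line.
proc print_index_sizes {base} {
    foreach {name bytes} [index_sizes $base] {
        puts [format "%-28s %12d" $name $bytes]
    }
}
```

Calling print_index_sizes once for /data/merge_index/index.swish-e and once for /data/incremental_index/index.swish-e shows whether the extra space sits in the property file, the word data, or elsewhere.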

Received on Thu Nov 29 03:15:54 2007