Re: more out of memory fun - woohoo!

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sat Oct 21 2006 - 21:48:15 GMT
On Sat, Oct 21, 2006 at 04:38:20PM -0400, Brad Miele wrote:
> We do need to sort results by the custom weight properties, they are sort 
> of customized versions of the swishranking that are specific to the 
> "quality" of the records as they pertain to various user groups who search 
> our index.
> 
> Can i get away without presorting even though i want to sort by them 
> eventually?

IIRC, the pre-sorted tables are created like this:

While indexing, for each property (that is set to be pre-sorted):

    1) A table is created in RAM with one entry per file (total
    "filenum" in length)
    2) Each property is read from disk (not sure of the order of this)
    3) qsort is used to place that table in order by the property's
    value
    4) A new table is created in the same order as the sorted table
    with sequence IDs -- this is the property's sort value.
    5) That table is sorted again to put it back in filenum order
    6) It's written to the index.
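
As a rough illustration of the indexing steps above, here is a minimal
Python sketch. It is not Swish-e's code: the function name and the
0-based file numbers are made up for the example, and giving equal
property values the same sequence ID is an assumption (it lets a second
sort key break ties). Indexing the output list by filenum stands in for
the second sort in step 5.

```python
def build_presorted_table(props):
    """props[i] is the property value of file number i (0-based here)."""
    # Steps 1-3: one entry per file, ordered by the property's value.
    by_value = sorted(range(len(props)), key=lambda filenum: props[filenum])

    # Step 4: hand out sequence IDs in sorted order.
    # Assumption: equal values share an ID so a secondary key can break ties.
    sort_value = [0] * len(props)
    seq = 0
    prev = None
    for pos, filenum in enumerate(by_value):
        if pos == 0 or props[filenum] != prev:
            seq += 1
        prev = props[filenum]
        # Step 5: storing by filenum leaves the table in filenum order.
        sort_value[filenum] = seq
    return sort_value


table = build_presorted_table(["pear", "apple", "zebra", "apple"])
print(table)
```

The result is a plain list of small integers, one per file, which is
what would be written to the index in step 6.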


When searching and you need to sort results by a given property:

    1) The table is loaded into RAM.
    2) A sort table is created in RAM the length of the search
    results.
    3) Each result's "sort value" is looked up in the pre-sorted
    table and placed in the sort table.
    4) qsort() sorts the table. There could be more than one level of
    sorting, of course.
    5) The new sorted results are returned.
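
The search-time steps above might look something like this in Python.
Again, this is only a sketch under assumptions: sort_results and the two
sample tables are hypothetical, each table is a list indexed by 0-based
filenum, and Python's sorted() stands in for qsort(). Multi-level
sorting is modeled by building a tuple of sort values, primary key
first.

```python
def sort_results(result_filenums, presorted_tables):
    """Order search results using one or more pre-sorted tables.

    presorted_tables holds one table per sort level, primary key first;
    each table maps filenum -> integer sort value.
    """
    def sort_key(filenum):
        # Steps 2-3: look up each result's sort value(s).
        return tuple(table[filenum] for table in presorted_tables)

    # Steps 4-5: only integers are compared, never the property strings.
    return sorted(result_filenums, key=sort_key)


weight = [2, 1, 3, 1]   # hypothetical pre-sorted table for a weight property
title = [4, 2, 3, 1]    # hypothetical pre-sorted table for a title property
print(sort_results([0, 2, 3, 1], [weight, title]))
```

Files 1 and 3 tie on the first table, so the second table decides their
order.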

(Using -L is even more convoluted, but that's for another time.)



The advantage is that when sorting results there's only one big table
to load into memory to get a property's sort value -- instead of
multiple accesses to disk to get the actual property value.  It's also
much faster to sort integers than long character strings.  And it uses
less memory to sort integer tables than string tables.

Of course, you have a lot of files so you end up with a large
pre-sorted table that has to be loaded into memory.  If you typically
get just a few search results you might find the overhead of reading the
index for the actual property value might be less than reading in that
larger integer table.
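
For comparison, sorting without a pre-sorted table could be sketched
like this. The read_property callback is hypothetical; it stands in for
whatever reads the actual property value from the index, and the dict
below is just a stand-in for the disk. The point is that the number of
lookups depends on the number of results, not on the number of files
indexed.

```python
def sort_results_without_presort(result_filenums, read_property):
    """Sort results by fetching each result's actual property value."""
    # One lookup per search result; no table covering every file is loaded.
    decorated = [(read_property(filenum), filenum) for filenum in result_filenums]
    decorated.sort()  # compares the real values; ties fall back to filenum
    return [filenum for _, filenum in decorated]


on_disk = {0: "pear", 1: "apple", 2: "zebra", 3: "apple"}  # stand-in for the index
lookups = []

def read_property(filenum):
    lookups.append(filenum)  # record each simulated disk access
    return on_disk[filenum]

print(sort_results_without_presort([2, 0, 3], read_property))
print(len(lookups), "lookups")
```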

To most, searching time is more important than indexing time, so it's
smarter to use the pre-sorted tables.  But, not everyone is indexing
hundreds of thousands of documents.


Make sense?


There may only be so much blood you can squeeze out of this turnip.

> any other thoughts on how to reduce the extended cpu time. it seems to 
> sit on one custom weight in particular, in this case for probably 11 out 
> of the 13.5 hours.

Since your cpu and real time are the same it doesn't seem you are
swapping.  Maybe it just takes a long time to sort.

You could dump filenum and that property out of the index and see how
long it takes to sort -- using sort or by writing a little C program
that uses qsort().
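
A quick way to run that experiment, sketched in Python rather than C for
brevity. The dump format is an assumption (one "filenum<TAB>value" line
per file), and time_sort is a made-up name. Values are compared as
strings here; if the custom weight is numeric, convert it before
sorting.

```python
import time


def time_sort(lines):
    """Parse 'filenum<TAB>value' lines and time sorting them by value."""
    pairs = []
    for line in lines:
        filenum, value = line.rstrip("\n").split("\t", 1)
        pairs.append((int(filenum), value))

    start = time.perf_counter()
    pairs.sort(key=lambda pair: pair[1])  # sort by the property value only
    elapsed = time.perf_counter() - start
    return pairs, elapsed


# Tiny in-memory sample; a real run would pass an open file of dumped lines.
sample = ["1\tbanana\n", "2\tapple\n", "3\tcherry\n"]
ordered, seconds = time_sort(sample)
print(ordered, f"{seconds:.6f}s")
```

If the sort itself finishes in seconds on the real data, the 11 hours
are going somewhere other than the comparison sort.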

-- 
Bill Moseley
moseley@hank.org

Received on Sat Oct 21 14:48:19 2006