removal oopsie! function - was: something about memory

From: Brad Miele <bmiele(at)not-real.ipnstock.com>
Date: Sun Oct 22 2006 - 00:45:38 GMT
hi,

regardless of whether or not i get the index to happen faster (and it 
looks like i can lose some of the sorts, so it will be faster), I am 
totally sold on incremental indexing, primarily the remove function, which 
we will be using often.

one feature that I would be interested in seeing would be an undelete 
function, sort of an "oops, i didn't mean to delete that record". It is my 
understanding that delete basically flags the file as unavailable. 
if it could be marked available again without having to re-sort and stuff, 
it would make things perfect for me.
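
A rough sketch of the idea, assuming delete really is just a per-record flag. Everything here (the `FlaggedIndex` class, `remove`, `undelete`, `search`) is a hypothetical toy for illustration and not Swish-e's actual code or data layout:

```python
class FlaggedIndex:
    """Toy index where removing a record only flips a flag (hypothetical)."""

    def __init__(self, records):
        self.records = list(records)                # record text, indexed by file number
        self.deleted = [False] * len(self.records)  # one "unavailable" flag per record

    def remove(self, filenum):
        # Mark the record unavailable; nothing is re-sorted or rewritten.
        self.deleted[filenum] = True

    def undelete(self, filenum):
        # Clear the flag again; the record is searchable without re-indexing.
        self.deleted[filenum] = False

    def search(self, term):
        # Return file numbers of available records containing the term.
        return [i for i, text in enumerate(self.records)
                if not self.deleted[i] and term in text]
```

If the real index works this way, undelete would be as cheap as delete, since neither touches the sorted tables.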

the reasoning here is that mostly this will be used by editors, who will 
accept or reject image records. often within one sitting, they will reject 
and then later reaccept the image records.

anyway, i would be happy to look at trying out my rusty (hell, 
non-existent) c knowledge on trying to do this if no one else wants to. 
just point me at the right functions to look at.

thanks as always,

Brad
---------------------
Brad Miele
VP Technology
IPNStock.com
866 476 7862 x902
bmiele@ipnstock.com

On Sat, 21 Oct 2006, Bill Moseley wrote:

> On Sat, Oct 21, 2006 at 04:38:20PM -0400, Brad Miele wrote:
>> We do need to sort results by the custom weight properties, they are sort
>> of customized versions of the swishranking that are specific to the
>> "quality" of the records as they pertain to various user groups who search
>> our index.
>>
>> Can i get away without presorting even though i want to sort by them
>> eventually?
>
> IIRC, the pre-sorted tables are created like this:
>
> While indexing, for each property (that is set to be pre-sorted):
>
>    1) A table is created in RAM, total "filenum" entries in length
>    2) Each property is read from disk (not sure of the order of this)
>    3) qsort is used to place that table in order by the property's
>    value
>    4) A new table is created in the same order as the sorted table
>    with sequence IDs -- this is the property's sort value.
>    5) That table is sorted again to put it back in filenum order
>    6) It's written to the index.
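
The indexing steps above can be sketched roughly as follows. This is a minimal Python illustration and not the Swish-e C code: `build_presorted_table` is a hypothetical name, the property values are assumed to already be in memory, and giving equal values the same sequence ID is an assumption. Writing each ID directly into its file-number slot stands in for step 5 (sorting back into filenum order).

```python
def build_presorted_table(prop_values):
    """prop_values[i] is the property value of file number i.

    Returns a table, in file-number order, of integer sort values.
    """
    # Step 3: order the file numbers by their property value.
    order = sorted(range(len(prop_values)), key=lambda f: prop_values[f])

    # Steps 4-5: assign sequence IDs in sorted order, stored by file number.
    sort_value = [0] * len(prop_values)
    seq = 0
    for pos, filenum in enumerate(order):
        # Assumption: files with equal property values share one ID.
        if pos > 0 and prop_values[filenum] != prop_values[order[pos - 1]]:
            seq += 1
        sort_value[filenum] = seq
    return sort_value
```

The result is one integer per file, which is what makes the search-time sort cheap compared with comparing the original strings.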
>
>
> When searching and you need to sort results by a given property:
>
>    1) The table is loaded into RAM.
>    2) A sort table is created in RAM the length of the search
>    results.
>    3) Each result's "sort value" is looked up from the pre-sorted
>    table and placed in the sort table.
>    4) qsort() sorts the table. There could be more than one level of
>    sorting, of course.
>    5) The new sorted results are returned.
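
The search-time steps above might look like this in outline. Again a hedged Python sketch: `sort_results` is a hypothetical helper, and each pre-sorted table is assumed to be a plain list of integers indexed by file number. Passing more than one table gives the multi-level sort mentioned in step 4.

```python
def sort_results(result_filenums, *presorted_tables):
    """Sort search results using one or more pre-sorted integer tables.

    Each table maps file number -> sort value; earlier tables take priority.
    """
    # Steps 2-3: each result's key is its sort value from every table.
    # Step 4: sorting compares small integer tuples, never the raw strings.
    return sorted(result_filenums,
                  key=lambda f: tuple(table[f] for table in presorted_tables))
```

Only the integer tables are read here; the actual property values never need to be fetched from disk to order the results.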
>
> (Using -L is even more convoluted, but that's for another time.)
>
>
>
> The advantage is that when sorting results there's only one big table
> to load into memory to get a property's sort value -- instead of
> multiple access to disk to get the actual property value.  It's also
> much faster to sort integers than long character strings.  And it uses
> less memory to sort integer tables than string tables.
>
> Of course, you have a lot of files so you end up with a large
> pre-sorted table that has to be loaded into memory.  If you typically
> get just a few search results, you might find the overhead of reading the
> index for the actual property value might be less than reading in that
> larger integer table.
>
> To most, searching time is more important than indexing time, so it's
> smarter to use the pre-sorted tables.  But, not everyone is indexing
> hundreds of thousands of documents.
>
>
> Make sense?
>
>
> There may only be so much blood you can squeeze out of this turnip.
>
>> any other thoughts on how to reduce the extended cpu time. it seems to
>> sit on one custom weight in particular, in this case for probably 11 out
>> of the 13.5 hours.
>
> Since your cpu and real time are the same it doesn't seem you are
> swapping.  Maybe it just takes a long time to sort.
>
> You could dump filenum and that property out of the index and see how
> long it takes to sort -- using sort or by writing a little C program
> that uses qsort().
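
A quick way to try the timing experiment suggested above, sketched in Python instead of C. The dump format (one "filenum<TAB>value" pair per line) and the helper names `parse_dump` and `time_sort` are assumptions for illustration. Note that Python's `sorted` is not `qsort()`, so the numbers only give a rough feel for whether sorting alone could account for hours.

```python
import time


def parse_dump(lines):
    """Parse lines of the assumed form 'filenum<TAB>value' into (int, str) pairs."""
    pairs = []
    for line in lines:
        filenum, value = line.rstrip("\n").split("\t", 1)
        pairs.append((int(filenum), value))
    return pairs


def time_sort(pairs):
    """Sort the pairs by property value and report the elapsed seconds."""
    start = time.perf_counter()
    ordered = sorted(pairs, key=lambda p: p[1])
    return ordered, time.perf_counter() - start
```

Feeding it the dumped lines, for example `time_sort(parse_dump(open(path)))` with `path` pointing at the dump file, returns the sorted pairs and the time taken.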
>
> -- 
> Bill Moseley
> moseley@hank.org
Received on Sat Oct 21 17:45:43 2006