
Re: Memory issues even with -e

From: José Manuel Ruiz <jmruiz(at)>
Date: Thu Oct 14 2004 - 08:43:52 GMT
Bill Moseley wrote:

>On Wed, Oct 13, 2004 at 08:57:44AM -0700, Tuc wrote:
>>	I'm trying to index a few large sites, which I copy locally using
>>"webcopy". Once I finish the copy, I run it with "-e" . It ran for 12 or so
>12 hours is a long time to wait.
>Are you indexing large files, or just a lot of files?
>>	I saw that I could do it by individual directory, then use the "-M"
>>to merge, or allow the searches to use "-f". I think that if I do the "-M"
>>that even with "-e" it will cause the memory allocation issue. And I'm 
>>afraid with the "-f" that the search will take too long to join them all.
>I'm not sure if -M will help that much.  You may find the memory
>requirements are similar.
>Using -f with multiple indexes shouldn't be that much slower.  There's
>the overhead of opening the extra indexes.  If you keep the indexes
>open between requests then you can avoid that.  Searching will be
>somewhat slower than searching a single index, but I wouldn't expect
>much.  If you try and sort by a property other than rank and output a
>large result set, then that can be slower -- swish has to read the
>property file for each result in that case.
>Maybe Jose will have more to offer.  You could also try using a
>development snapshot (swish-daily build) -- I think Jose has done some
>work on the indexing code.
>Jose, would you think the btree backend would deal with the large data
>sets any better?

12 hours is a long time. I usually index 200,000 docs every day in about
1 hour (Prog method, Pentium IV 1 GHz, 2 GB RAM).
So I suppose that Tuc's machine is paging, but he is probably already
aware of this, so it is hard to find a solution. I have tried several
things, as Bill knows. Word data is compressed in memory while it is
being read; -e just swaps this data to several files on disk.

The btree backend uses the same scheme. The only thing that changes is
how data is stored in the index. In the code this means a few changes in
db_native.c, but not in the main core index code. But btree allows
incremental indexing... so, perhaps, Tuc can index chunk1 and then
update the index with chunk2, and so on. I have not run large tests with
this situation, but I know of some people using it. There is one drawback,
of course: the index files can become fragmented after dozens (?) of
update operations.

swish-e -c my.config -i site1
swish-e -c my.config -i site2 -u
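[The chunk-by-chunk approach above might be scripted as follows. The directory names are hypothetical; -u asks swish-e to update the existing index rather than rebuild it.]

```shell
#!/bin/sh
# Build the index from the first chunk, then fold in the remaining
# chunks incrementally with -u (btree backend required for updates).
# Directory names here are hypothetical placeholders.
swish-e -c my.config -i site1
for dir in site2 site3 site4; do
    swish-e -c my.config -i "$dir" -u
done
```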

Anyway, I recommend using 2.5. The index code has some updates not
present in 2.4, mainly because of the largefile support.
Oops!!! I suppose that Tuc's index file is less than 2 GB, right?

Right now I am finishing a very important project for my company. I hope
that in two or three weeks I can go back to the btree code to optimize
it (and to some of Peter's problems with merge and the new rank scheme).

Received on Thu Oct 14 01:44:03 2004