Bill Moseley wrote:
>On Wed, Oct 13, 2004 at 08:57:44AM -0700, Tuc wrote:
>> I'm trying to index a few large sites, which I copy locally using
>>"webcopy". Once I finish the copy, I run it with "-e" . It ran for 12 or so
>12 hours is a long time to wait.
>Are you indexing large files, or just a lot of files?
>> I saw that I could do it by individual directory, then use the "-M"
>>to merge, or allow the searches to use "-f". I think that if I do the "-M"
>>that even with "-e" it will cause the memory allocation issue. And I'm
>>afraid with the "-f" that the search will take too long to join them all.
>I'm not sure if -M will help that much. You may find the memory
>requirements are similar.
>Using -f with multiple indexes shouldn't be that much slower. There's
>the overhead of opening the extra indexes. If you keep the indexes
>open between requests then you can avoid that. Searching will be
>somewhat slower than searching a single index, but I wouldn't expect
>much. If you try to sort by a property other than rank and output a
>large result set, then that can be slower -- swish has to read the
>property file for each result in that case.
>Maybe Jose will have more to offer. You could also try using a
>development snapshot (swish-daily build) -- I think Jose has done some
>work on the indexing code.
>Jose, would you think the btree backend would deal with the large data
>sets any better?
12 hours is a long time. I usually index 200000 docs every day in 1 hour
(Prog method, Pentium IV 1 GHz, 2 GB RAM).
So I suppose that Tuc's machine is paging, but he is probably already
aware of this, so it is hard to find a solution. I have tried several
things, as Bill knows. Word data is compressed in memory while it is being
read; -e just swaps this data out to several files on disk.
The btree backend uses the same scheme. The only thing that changes is
how data is stored in the index. In the code this means
a few changes in db_native.c but not in the core indexing code. But
btree allows incremental indexing, so perhaps Tuc can
index chunk1 and then update the index with chunk2 and so on. I have not
run large tests of this setup, but I know some
people are using it. There is a drawback, of course: the index files can
become fragmented after dozens (?) of update operations.
swish-e -c my.config -i site1
swish-e -c my.config -i site2 -u
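
For more than two chunks, the same idea could be scripted. This is only a rough sketch, assuming a swish-e build with incremental (btree) support so that -u is available; the function name index_chunks, the SWISH and CONFIG variables, the DRY_RUN switch and the chunk names are made up for illustration. With DRY_RUN=1 it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: build one index from several chunks using incremental updates.
# Assumes swish-e was built with incremental (btree) support; -u is the
# update switch used in the commands above. SWISH, CONFIG, DRY_RUN and
# the chunk names are illustrative placeholders, not swish-e settings.

SWISH="${SWISH:-swish-e}"
CONFIG="${CONFIG:-my.config}"

# Print (DRY_RUN=1) or run one indexing command per chunk, in order.
# Chunk names must not contain spaces, since the command is word-split.
index_chunks() {
    first=1
    for chunk in "$@"; do
        if [ "$first" -eq 1 ]; then
            # The first chunk creates the index.
            cmd="$SWISH -c $CONFIG -i $chunk"
            first=0
        else
            # Later chunks are added to the existing index with -u.
            cmd="$SWISH -c $CONFIG -i $chunk -u"
        fi
        if [ "${DRY_RUN:-0}" -eq 1 ]; then
            echo "$cmd"
        else
            $cmd || return 1
        fi
    done
}

# Example: DRY_RUN=1; index_chunks site1 site2 site3
```

Stopping at the first failed chunk keeps a broken run from piling further updates onto a bad index.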
Anyway, I recommend using 2.5. Its indexing code has some updates not
present in 2.4, mainly for large-file support.
Oops!!! I suppose that Tuc's index file is less than 2 GB, right?
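
A quick way to check this, as a sketch only: compare the size of the index file against 2^31 - 1 bytes, the usual ceiling for builds without large-file support. The function name exceeds_limit and the LIMIT_BYTES variable are invented for this example, and the index file name in the usage comment is a placeholder.

```shell
#!/bin/sh
# Sketch: test whether a file has reached the classic 2 GB ceiling
# (2^31 - 1 bytes) that applies without large-file support.
# exceeds_limit and LIMIT_BYTES are illustrative names.

LIMIT_BYTES="${LIMIT_BYTES:-2147483647}"

# Usage: exceeds_limit FILE [LIMIT]
# Returns 0 if FILE is at or above the limit, 1 if below, 2 if missing.
exceeds_limit() {
    [ -f "$1" ] || return 2
    # wc -c may pad its output with spaces on some systems; strip them.
    size=$(wc -c < "$1" | tr -d ' ')
    [ "$size" -ge "${2:-$LIMIT_BYTES}" ]
}

# Example: exceeds_limit index.swish-e && echo "index has hit the 2 GB limit"
```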
Right now I am finishing a very important project for my company. I hope
that in two or three weeks I can go back to the btree
code to optimize it (and look at some of Peter's problems with merge and
the new ranking scheme).
Received on Thu Oct 14 01:44:03 2004