These are interesting admin questions.
Tac wrote on 8/30/04 6:16 AM:
> We've been running the indexing process on our production server (because
> it's fast, same box as the database, etc.), but at times it consumes so many
> resources it can bring our site temporarily down. I'm trying to figure out
> a way to let other processes have some time slices. Two questions:
> (1) Does swish-e index a record coming in via -S prog as it gets the record,
> or does it wait until all records have been retrieved? If it indexes as
> it gets it, I can add a sleep() command every few thousand records.
My understanding is that swish writes *properties* to disk for each doc
as it parses. The main word index is not written for each doc, but
instead the RAM is flushed periodically and written to disk. The main
writing of the word index happens at the end of the indexing. You can
vary this with the -e option.
You can watch this happen (under *nix OSes anyway) with the 'watch'
command while swish-e is indexing. Just do a:
    watch -n 1 'ls -l yourindex*'
in the same directory as the indexes are being written. You can watch
the sizes change in 'real' time. If running under -e, try:
    watch -n 1 'ls -l yourindex* swtmp*'
though NOTE that swtmp files number in the 100s and your terminal window
probably won't show that many.
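Since a plain listing of hundreds of swtmp files scrolls off the screen, one option is to print just a count and a total size. The following is a minimal sketch, not anything shipped with swish-e: the function name summarize_files is made up, and it assumes the temp files match swtmp* in the current directory as described above.

```shell
#!/bin/sh
# Hypothetical helper: summarize many files as "<count> files, <bytes> bytes".
# Arguments are file names (usually expanded from globs by the shell).
summarize_files() {
    count=0
    total=0
    for f in "$@"; do
        # Skip globs that matched nothing and anything that is not a regular file.
        [ -f "$f" ] || continue
        size=$(wc -c < "$f" | tr -d ' ')
        count=$((count + 1))
        total=$((total + size))
    done
    printf '%s files, %s bytes\n' "$count" "$total"
}
```

Because 'watch' cannot call a shell function directly, a simple loop does the same job: while :; do summarize_files yourindex* swtmp*; sleep 1; done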
Whether your sleep() idea will work, I can't say. Try it and let us know. :)
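For anyone who wants to try the sleep() idea, here is a minimal sketch of a throttled feeder for -S prog. It is untested against a real swish-e run and makes several assumptions that are not from this thread: records are plain files whose paths arrive one per line on stdin, and BATCH and PAUSE are made-up tuning knobs. The only swish-e specifics it relies on are the Path-Name and Content-Length headers followed by a blank line and the document body.

```shell
#!/bin/sh
# Hypothetical throttled feeder for "swish-e -S prog" (a sketch, not a tested tool).
BATCH=${BATCH:-2000}   # assumed knob: records to emit between pauses
PAUSE=${PAUSE:-5}      # assumed knob: seconds to sleep after each batch

# Emit one document: headers, a blank line, then exactly Content-Length bytes.
emit_doc() {
    path=$1
    size=$(wc -c < "$path" | tr -d ' ')
    printf 'Path-Name: %s\nContent-Length: %s\n\n' "$path" "$size"
    cat "$path"
}

# Read file paths from stdin, emit each one, and pause after every BATCH records.
feed_throttled() {
    count=0
    while IFS= read -r path; do
        emit_doc "$path"
        count=$((count + 1))
        if [ $((count % BATCH)) -eq 0 ]; then
            sleep "$PAUSE"
        fi
    done
}
```

To use it as the prog script, the last line of the script would feed it a list of paths, for example: find /some/data/dir -type f | feed_throttled (the directory is a placeholder). Whether the pauses actually give other processes room depends on when swish-e does its heavy writing, which is the open question above.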
> (2) The major problem is not during the individual collection indexing, but
> during the 150+ index merge. (Merging them all into 1 makes searches
> against all the collections MUCH faster). I don't think the bottleneck is
> CPU, I think it's disk access, or maybe something else (file handles?). But
> it's so intensive that even the MySQL box on that same quad-processor
> machine locks up threads, unable to process any queries. Any ideas for how
> to get around this? We're looking at moving the indexing to another box,
> but that requires getting the data from across the network, indexing and
> merging, then ftp'ing the indices back, which seems like a lot of work. If
> there's a way to not have swish-e consume all the resources it'd be much
> easier to keep it where it is.
Here we get to the heart of the matter, IMHO. My guess with the merge
problem is that all the word data is being read into RAM simultaneously.
Jose or Bill could answer that more intelligently. Have you tried -e?
If the main problem is merging all those indexes and there's not a clean
technical way (either improving the code or documentation) of making
swish work under your current configuration, you have some workarounds:
1. Try merging into several smaller sub-indexes rather than one huge
one. You can search fairly effectively across half a dozen or so
indexes, in my experience. This can also help with maintenance, if there
are some of these sub-indexes that don't change as often as others.
2. Since the indexes are OS-independent, consider NFS (or some other
network filesystem) to create them on a different box, and make them
available on your production machine. I do that here.
3. Consider being a bleeding-edge tester of Josh's new swished remote
server project. I'm sure he'd love the bugtesting your environment could
provide. Search the email archive for the URL. It runs under Apache and
mod_perl. I haven't had time yet to test it myself, but have installed
it and it looks very promising.
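Workaround 1 can be sketched as a small planning script. This is only an illustration: plan_merges and the sub1.index, sub2.index output names are made up, and the function prints the "swish-e -M" commands instead of running them, so you can review them first. It assumes index file names contain no spaces.

```shell
#!/bin/sh
# Hypothetical planner: print one "swish-e -M <inputs...> <output>" command
# per group of indexes. Nothing is executed; the output is just text.
# Usage: plan_merges GROUP_SIZE index1 index2 ...
plan_merges() {
    group_size=$1
    shift
    n=0
    group=0
    batch=""
    for idx in "$@"; do
        batch="$batch $idx"
        n=$((n + 1))
        if [ "$n" -eq "$group_size" ]; then
            group=$((group + 1))
            printf 'swish-e -M%s sub%s.index\n' "$batch" "$group"
            n=0
            batch=""
        fi
    done
    # Emit whatever is left over as a final, smaller group.
    if [ "$n" -gt 0 ]; then
        group=$((group + 1))
        printf 'swish-e -M%s sub%s.index\n' "$batch" "$group"
    fi
}
```

For example, plan_merges 25 *.index would print about six commands for 150 indexes. The resulting sub-indexes can then be searched together by listing them all after -f. A leftover group holding a single index does not need merging and can be searched as it is.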
Let us know what you do. Someone ought to be compiling a "SWISH-E
CookBook" for cases like these.
Peter Karman 651-605-9009 firstname.lastname@example.org
Received on Mon Aug 30 06:01:25 2004