Re: [SWISH-E:78] problems with temporary files (etc.) using swish-e -M

From: WWW server manager <webadm(at)not-real.info.cam.ac.uk>
Date: Tue Nov 25 1997 - 22:36:11 GMT
This is a followup to my earlier message about problems encountered when
using swish-e's ability to merge multiple index files; in the case I was
trying, just over 90 indexes. Most of them are relatively small (few are as
large as 3-6MB), but ideally they would be searchable individually and also
as part of the overall server content.

One problem I mentioned was that the merge option creates temporary files
using tmpnam(), which (on Solaris 2, at least) places them unconditionally in 
/var/tmp - which happens to have relatively little free space on the system 
I was using. Changing those calls to tempnam(), which allows an explicit
default directory (I specified one with 1GB+ free ...) and also overriding
the default using TMPDIR, was easy, and I tried running the original 
90+ file merge again. 
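
The change itself was to swish-e's C source, which is not reproduced here.
As a rough illustration of the lookup order I was relying on (TMPDIR if it
is set and usable, otherwise a directory supplied by the caller), here is a
minimal sketch in Python. The names pick_temp_dir and open_merge_temp and
the "swtmp" prefix are hypothetical and do not appear in swish-e.

```python
import os
import tempfile


def pick_temp_dir(default_dir, environ=None):
    """Return TMPDIR if it names a writable directory, else default_dir."""
    environ = os.environ if environ is None else environ
    candidate = environ.get("TMPDIR")
    if candidate and os.path.isdir(candidate) and os.access(candidate, os.W_OK):
        return candidate
    return default_dir


def open_merge_temp(default_dir, environ=None):
    """Create a uniquely named temporary file in the chosen directory.

    Returns (file_descriptor, path); the caller closes and removes it.
    """
    target = pick_temp_dir(default_dir, environ)
    # "swtmp" is an illustrative prefix only.
    return tempfile.mkstemp(prefix="swtmp", dir=target)
```

Unlike tmpnam(), which only returns a name, mkstemp() also creates the file,
so no other process can claim the same name in between.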

It reached 177MB virtual memory and a 38MB temporary (intermediate) file 
before I decided it was getting nowhere and potentially impacting the web 
server (since it was competing for physical memory with the web server and 
other processes over an extended period). It had already used over 4 hours of CPU
time.

*However*, I realised subsequently that an additional swish-e bug (or at 
least, poor implementation) had confused the comparison against indexing
everything in a single run. When I built most of the indexes that I was
trying to merge, the shell script accidentally had "-c  -i" in the middle
of the command-line arguments. It should have been "-c test.conf -i", but 
swish-e did not notice (or did not care) that the config filename was 
missing, and did the indexing entirely with default options - I specified
the directories to index on the command line, so that was viable. It also
meant that the *content* of all the image files was indexed, which helped
to make the total size of the individual indexes over 50MB (compared to 
around 8MB for the index built by indexing everything at once). The 
partial merged index (left in a temporary file when the process was killed) 
as well as most of the indexes being merged included a vast quantity of 
"junk" words comprising fragments of GIF image content...

So, I had another try today. Rebuilding the 90+ indexes that had been built 
incorrectly (a few others had been built separately and were OK) took
just over 8 minutes real-time and just over 6 minutes (of SPARCserver 10/51) 
CPU time. 

For comparison, building a single index from the same directories in one
run took 7.5 minutes real-time and 5.5 minutes CPU time; an overall index 
for the whole server took 45 minutes realtime (but is much faster overnight) 
and 16 minutes CPU time (and was using 11MB memory when I checked it 
part-way through).

Attempting to merge all the 90+ indexes succeeded this time, but used over 
158MB virtual memory (for around 20MB of input indexes) by 2/3 of the way 
through [I was in a meeting when it finished, so don't know the peak usage]
and took 1 hour 33 minutes real-time, 1 hour 15 minutes CPU time. 2/3 of the 
way through the merge, the larger temporary file had reached 38MB, though 
the final index was around 8MB, which was about the same as an overall index 
of the whole server. A bit too resource-intensive to do on a daily basis.

Finally, for comparison I tried building essentially the same merged
index from just 4 input files - 3 that were the same as before, and a 
single overall index for the bulk of the files (University society pages, 
where it would be preferable to be able to search each Society's pages 
individually). That run took only 8 minutes realtime and 7.5 minutes CPU 
time for the merge; I didn't see the peak memory usage but around half-way 
through it reached only 5-6MB.

Summarising the above, plus some other points that I noticed while
investigating:

 * swish-e doesn't validate its command line arguments thoroughly and 
   mistakes may go unnoticed. "... -c -i ..." is accepted as valid (config
   file name omitted) when indexing, and "-v -M ..." is accepted even 
   though -v is not documented as valid with -M and is apparently ignored.
   [For -M, since it summarises its actions as it merges successive
   files even without -v, it would be helpful if it could report the
   name of each file as it starts merging it, to make it easier to
   see how it is progressing through the list of files to merge.]

 * swish-e uses a lot of virtual memory and temporary file space when
   merging a lot of files - far more than the total size of the input
   indexes, even though the only hint in the documentation is that 
   peak memory use should be around half the total size of the input files!
   For me, with around 20MB of input files it reached at least 158MB
   of memory, and created unexpectedly large intermediate files.

 * swish-e uses two temporary files when merging; it appears to merge into
   one, copy back to the other, then merge that plus the next supplied 
   input file into the second, and so on. When the temporary indexes 
   easily reach 30-40MB, that's rather inefficient - it should be able
   to merge alternately into the first and second temporary files, if 
   it cannot do the merging efficiently in memory (which with the inflated
   memory use for a lot of input files could be a problem, even if the
   total size of input indexes was modest compared to available memory). [As
   I write this, I'm starting to wonder if maybe it was alternating between
   the two files as output for merging, but it certainly looked like it 
   was always merging into the second, then copying back to the first.]

 * in spite of the size of the intermediate files, the final result (when
   it doesn't run out of space and get totally confused, as reported  
   previously :-) does seem to be about the right size (very similar for
   different ways of indexing/merging the same data, and the differences
   may be due to data files being updated while I was building the 
   indexes). 
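
On the first point above, the kind of check that would have caught my
"-c  -i" mistake is simple: an option that takes a value must not be
followed immediately by another option or by the end of the argument list.
A minimal sketch in Python (not swish-e's actual C code) follows; the
function name check_args is hypothetical, and treating only -c as an option
that takes a value is an assumption made for the example.

```python
def check_args(argv, value_options=("-c",)):
    """Return a list of error messages for options missing their value.

    value_options lists the options that must be followed by a value;
    only "-c" is included by default, as an assumption for this example.
    """
    errors = []
    for pos, arg in enumerate(argv):
        if arg in value_options:
            following = argv[pos + 1] if pos + 1 < len(argv) else None
            # A missing value shows up as end-of-list or another option.
            if following is None or following.startswith("-"):
                errors.append(f"option {arg} requires a value")
    return errors
```

With this check, ["-c", "-i", "dir1"] is reported as an error, while
["-c", "test.conf", "-i", "dir1"] passes.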

Extrapolating somewhat from the above, it looks like it may be faster
and less resource-intensive to merge groups of e.g. 5-6 indexes, then 
repeat as required for the resulting group indexes, until a single 
overall index is reached; letting swish-e merge 90+ indexes sequentially
appears to be a very bad idea...
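
The grouping suggested above can be planned mechanically. The sketch below,
in Python, only works out which indexes would be merged together in each
round; it does not run swish-e, and I have not measured whether the grouped
approach is actually faster. The function name plan_grouped_merge and the
intermediate file names it generates are hypothetical.

```python
def plan_grouped_merge(indexes, group_size=5):
    """Return merge rounds; each round is a list of (inputs, output) steps."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    rounds = []
    current = list(indexes)
    round_no = 0
    while len(current) > 1:
        round_no += 1
        steps = []
        next_level = []
        for start in range(0, len(current), group_size):
            group = current[start:start + group_size]
            if len(group) == 1:
                # A leftover single index needs no merge; carry it forward.
                next_level.append(group[0])
                continue
            # Hypothetical name for the intermediate (or final) index.
            output = f"merge-r{round_no}-g{len(steps)}.idx"
            steps.append((group, output))
            next_level.append(output)
        rounds.append(steps)
        current = next_level
    return rounds
```

For 90 indexes in groups of 5, this gives three rounds of 18, 4 and 1
merges, with no single merge taking more than 5 input files.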

                                John Line
-- 
University of Cambridge WWW manager account (usually John Line)
Send general WWW-related enquiries to webmaster@ucs.cam.ac.uk
Received on Tue Nov 25 14:44:08 1997