Roy Tennant wrote on 4/9/09 7:28 PM:
> I wanted to report back on my problem and what I did to fix it, and to
> thank those of you who chimed in with strategies. I did a few things:
> 1) I changed the script I was using to break up a massive spreadsheet
> into XML files so that it would parcel the files out into a set of
> directories with no more than 250,000 files per directory (I picked
> the figure pretty much at random), so that it spread the files across
> 12 directories at present (something near 2.8 million small XML files).
That's a smart long-term strategy, since it leaves you the option of moving
files onto different disk partitions, matching by directory name, etc.
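
A minimal sketch of that bucketing scheme, in Python rather than whatever
language the original script used (the thread doesn't say). The function names,
the batch_NNN directory pattern, and the record file naming are all
hypothetical; only the 250,000 cap and the file count come from the post.

```python
import math
from pathlib import Path

MAX_PER_DIR = 250_000  # cap quoted in the post, picked "pretty much at random"


def bucket_for(file_number: int, max_per_dir: int = MAX_PER_DIR) -> str:
    """Return the (hypothetical) directory name for the Nth file, counting from 0."""
    return f"batch_{file_number // max_per_dir:03d}"


def dirs_needed(total_files: int, max_per_dir: int = MAX_PER_DIR) -> int:
    """Number of directories required to hold total_files under the cap."""
    return math.ceil(total_files / max_per_dir)


def write_record(root: Path, file_number: int, xml_text: str) -> Path:
    """Write one small XML file into its bucket, creating the bucket on demand."""
    target_dir = root / bucket_for(file_number)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"record_{file_number:07d}.xml"
    target.write_text(xml_text, encoding="utf-8")
    return target
```

With the figures from the post, dirs_needed(2_779_517) comes to 12, which
matches the 12 directories reported.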
> 2) I indexed the files on my laptop, which has 3GB of RAM (which did
> fine) and then uploaded the index to the server. Unfortunately, this
> did not work despite having the same version of Swish-e in both
> places. Since the error was a "wrong index format" error, I think the
> problem may have stemmed from either moving from a Mac to Unix (the
> classic line-ending problem) or an error introduced in tarring,
> gzipping, ftping, and untarring and ungzipping.
I expect it's a 32-bit vs. 64-bit issue. I move index files between Linux,
OS X, and FreeBSD all the time, and as long as both machines have the same
word size (32-bit or 64-bit) and the same Swish-e version, all is well.
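
One way to test both theories is to run a small check on each machine. This is
only a sketch: it reports the pointer width of the Python interpreter, which
usually (but not necessarily) matches the architecture the local Swish-e binary
was built for, and it computes a SHA-256 digest so that corruption in transit
can be ruled out. The function names are my own.

```python
import hashlib
import struct


def pointer_bits() -> int:
    """Pointer width of this interpreter in bits (32 or 64 on common platforms)."""
    return struct.calcsize("P") * 8


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so a large index need not fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


print(f"{pointer_bits()}-bit interpreter")
```

If sha256_of() gives the same digest for the index on the laptop and on the
server, the tar/gzip/ftp round trip did not damage it, and a difference in word
size becomes the more likely explanation.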
> 3) Lastly, I used the "-e" switch and indexed on the server. This
> worked, probably since Swish-e could do the disk caching more
> intelligently to reduce thrash and create efficiencies. But it still
> took nearly seven hours and an hour and 13 minutes of CPU time.
> Some statistics:
> - 2,779,517 small XML files in 12 directories
> - 8,733,343 unique words
> - 1,138,100,470 total bytes
> - 72,898,219 total words.
> - 1GB RAM on unknown hardware, running Ubuntu Dapper Drake
That's a useful benchmark, especially for prospective users, and it actually
exceeds what I would have expected Swish-e to be capable of in terms of the
number of documents in a collection.
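
For anyone comparing against their own collection, a few back-of-the-envelope
ratios can be derived from those figures. The sketch below treats "nearly seven
hours" as exactly seven, so the wall-clock rate is a rough lower bound; the
variable names are mine.

```python
FILES = 2_779_517
TOTAL_BYTES = 1_138_100_470
TOTAL_WORDS = 72_898_219
WALL_SECONDS = 7 * 3600            # "nearly seven hours", rounded up
CPU_SECONDS = 1 * 3600 + 13 * 60   # "an hour and 13 minutes"

bytes_per_file = TOTAL_BYTES / FILES
words_per_file = TOTAL_WORDS / FILES
files_per_wall_second = FILES / WALL_SECONDS
cpu_share = CPU_SECONDS / WALL_SECONDS

print(f"{bytes_per_file:.0f} bytes and {words_per_file:.0f} words per file")
print(f"{files_per_wall_second:.0f} files per second of wall-clock time")
print(f"CPU busy for {cpu_share:.0%} of the run")
```

The files average a little over 400 bytes each, and CPU time is under a fifth
of the elapsed time, which is consistent with the run being dominated by disk
I/O rather than computation.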
> So, to recap, my savior was to use the "-e" switch to make Swish-e do
> the disk caching instead of my OS. Thanks to all, but particularly to
> Peter Karman for pointing out that switch to me. I didn't particularly
> want to break up the indexes, but as my project grows I may get to
> that point. Oh, and the result is at
> <http://roytennant.com/proto/hathi/> if anyone is curious. Thanks,
Glad you got it working.
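
For anyone finding this thread later, the fix amounts to adding -e to the usual
indexing command. Sketched here as a small Python helper that only builds and
prints the command; the config file name, index file name, and directory names
are placeholders, not Roy's actual setup.

```python
import shlex


def swish_index_command(config, index_file, input_dirs, economy=True):
    """Build a swish-e argument list; -e is the economy mode discussed above."""
    command = ["swish-e", "-c", config, "-f", index_file]
    if economy:
        command.append("-e")
    command.append("-i")
    command.extend(input_dirs)
    return command


# Hypothetical directory names, one per bucket of XML files.
dirs = [f"batch_{n:03d}" for n in range(12)]
print(shlex.join(swish_index_command("swish.conf", "index.swish-e", dirs)))
```

To actually run it, the returned list can be passed to subprocess.run(...,
check=True), assuming swish-e is on the PATH.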
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Users mailing list
Received on Thu Apr 9 20:46:36 2009