Re: [swish-e] ANNOUNCE: 2.4.7 released

From: Peter Karman <peter(at)>
Date: Fri Apr 10 2009 - 00:46:24 GMT
Roy Tennant wrote on 4/9/09 7:28 PM:
> I wanted to report back on my problem and what I did to fix it, and to
> thank those of you who chimed in with strategies. I did a few things:
> 1) I changed the script I was using to break up a massive spreadsheet
> into XML files so that it would parcel the files out into a set of
> directories with no more than 250,000 files per directory (I picked
> the figure pretty much at random), so that it spread the files across
> 12 directories at present (something near 2.8 million small XML
> files).

that's a smart strategy long-term since you'll have the option of moving files
onto different disk partitions, matching by dir name, etc.
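For anyone wanting to do the same, here is a rough sketch of the bucketing step. The 250,000 cap comes from Roy's description; the `partNN` directory names, the `recordNNNNNNNN.xml` file names, and both function names are made up for illustration and are not taken from his script.

```python
from pathlib import Path

# Cap from the post; Roy says he picked it "pretty much at random".
MAX_PER_DIR = 250_000


def shard_dir(base: Path, file_index: int, max_per_dir: int = MAX_PER_DIR) -> Path:
    """Return the directory for the file with this 0-based index.

    The "partNN" naming scheme is hypothetical.
    """
    return base / f"part{file_index // max_per_dir + 1:02d}"


def write_record(base: Path, file_index: int, xml_text: str) -> Path:
    """Write one XML record into its shard directory and return the path."""
    target_dir = shard_dir(base, file_index)
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / f"record{file_index:08d}.xml"  # hypothetical file name
    path.write_text(xml_text, encoding="utf-8")
    return path
```

With roughly 2.8 million records this arithmetic yields 12 directories, which matches the count Roy reports.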

> 2) I indexed the files on my laptop, which has 3GB of RAM (which did
> fine) and then uploaded the index to the server. Unfortunately, this
> did not work despite having the same version of Swish-e in both
> places. Since the error was a "wrong index format" error, I think the
> problem may have stemmed from either moving from a Mac to Unix (the
> classic line-ending problem) or an error introduced in tarring,
> gzipping, ftping, and untarring and ungzipping.

I expect it's a 32-bit vs 64-bit issue. I move index files between Linux, OS X,
and FreeBSD all the time, and as long as the machines have the same word size
(32-bit or 64-bit) and the same Swish-e version, all is well.
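One quick way to compare the two machines is to run something like the sketch below on each and compare the output. Note the assumption: it reports the word size of the Python interpreter, which usually but not always matches the other binaries on the system. Running `file` on the swish-e binary itself would be a more direct check.

```python
import platform
import struct


def pointer_bits() -> int:
    """Pointer width, in bits, of the running Python interpreter."""
    return struct.calcsize("P") * 8


def describe_host() -> str:
    """One-line summary to compare between the indexing and serving hosts."""
    return f"{platform.system()} {platform.machine()} {pointer_bits()}-bit"


if __name__ == "__main__":
    print(describe_host())
```

If the bit widths differ between the laptop and the server, that would be consistent with the "wrong index format" error.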

> 3) Lastly, I used the "-e" switch and indexed on the server. This
> worked, probably since Swish-e could do the disk caching more
> intelligently to reduce thrash and create efficiencies. But it still
> took nearly seven hours, with an hour and 13 minutes of CPU time.
> Some statistics:
>  - 2,779,517 small XML files in 12 directories
>  - 8,733,343 unique words
>  - 1,138,100,470 total bytes
>  - 72,898,219 total words.
>  - 1GB RAM on unknown hardware, running Ubuntu Dapper Drake

that's a useful benchmark, especially for prospective users, and it actually
exceeds what I would have expected Swish-e to be capable of in terms of the
number of documents in a collection.
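To put those numbers in per-document terms, a small calculation like the following helps. The figures are copied from Roy's statistics; the variable and function names are only illustrative.

```python
# Figures copied from the statistics quoted above.
FILES = 2_779_517
TOTAL_BYTES = 1_138_100_470
TOTAL_WORDS = 72_898_219


def per_file(total: int, files: int = FILES) -> float:
    """Average of a collection-wide total over the number of files."""
    return total / files


if __name__ == "__main__":
    print(f"average bytes per file: {per_file(TOTAL_BYTES):.1f}")
    print(f"average words per file: {per_file(TOTAL_WORDS):.1f}")
```

That works out to roughly 409 bytes and 26 words per file, so these are very small documents, and the file count rather than the data volume is what makes this collection large.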

> So, to recap, my savior was to use the "-e" switch to make Swish-e do
> the disk caching instead of my OS. Thanks to all, but particularly to
> Peter Karman for pointing out that switch to me. I didn't particularly
> want to break up the indexes, but as my project grows I may get to
> that point. Oh, and the result is at
> <> if anyone is curious. Thanks,
> Roy

Glad you got it working.
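For the archive, a run with the switch Roy mentions could be scripted along these lines. This is only a sketch: the config file name is hypothetical, Roy's actual command line is not shown in this thread, and `-c` (config file) is assumed here alongside the `-e` switch he names.

```python
import subprocess


def build_index_command(config_path: str, economy: bool = True) -> list[str]:
    """Build a swish-e argv list; config_path is a hypothetical config file."""
    cmd = ["swish-e", "-c", config_path]
    if economy:
        # -e is the switch discussed in this thread.
        cmd.append("-e")
    return cmd


def run_index(config_path: str) -> int:
    """Run the indexer and return its exit status (requires swish-e on PATH)."""
    return subprocess.run(build_index_command(config_path)).returncode
```

Keeping the command construction separate from the call makes it easy to print and check the command line before starting a multi-hour run.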

Peter Karman  .  .  peter(at)
Received on Thu Apr 9 20:46:36 2009