I wanted to report back on my problem and what I did to fix it, and to
thank those of you who chimed in with strategies. I did a few things:
1) I changed the script I was using to break up a massive spreadsheet
into XML files so that it parcels the files out into a set of
directories with no more than 250,000 files per directory (I picked
the figure pretty much at random). At present the files (something
near 2.8 million small XML documents) are spread across 12
directories.
2) I indexed the files on my laptop, which has 3GB of RAM (and handled
the job fine), and then uploaded the index to the server.
Unfortunately, this did not work, despite both machines running the
same version of Swish-e. Since the error was a "wrong index format"
error, I think the problem stemmed either from moving from a Mac to
Unix (the classic line-ending problem) or from an error introduced
while tarring, gzipping, FTPing, untarring, and ungzipping.
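One way to rule out corruption in the tar/gzip/FTP round trip is to compare checksums on both ends; a minimal sketch:

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Run this against the tarball on the laptop before upload and on the
# server after the FTP transfer; mismatched digests mean the bytes were
# mangled in transit (and if they match, a line-ending or index-format
# incompatibility becomes the more likely suspect).
```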
3) Lastly, I used the "-e" switch and indexed on the server. This
worked, probably because Swish-e could manage the disk caching more
intelligently, reducing thrash. But it still took nearly seven hours
of wall-clock time, of which an hour and 13 minutes was CPU time.
- 2,779,517 small XML files in 12 directories
- 8,733,343 words
- 1,138,100,470 total bytes
- 72,898,219 total words
- 1GB RAM on unknown hardware, running Ubuntu Dapper Drake
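For anyone retracing this, the successful run boils down to a single economy-mode invocation. In this sketch only the "-e" switch comes from the thread; the config file, input root, and index name are placeholders:

```python
import subprocess

# -e tells Swish-e to use temporary files on disk ("economy mode")
# instead of holding the whole in-progress index in RAM, which is what
# made indexing feasible on the 1GB server.
cmd = [
    "swish-e",
    "-e",                 # economy mode: disk-based temp storage
    "-c", "swish.conf",   # hypothetical config file
    "-i", "/data/xml",    # hypothetical root of the 12 XML directories
    "-f", "hathi.index",  # hypothetical output index name
]
# subprocess.run(cmd, check=True)  # uncomment on a machine with Swish-e
```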
So, to recap, my savior was to use the "-e" switch to make Swish-e do
the disk caching instead of my OS. Thanks to all, but particularly to
Peter Karman for pointing out that switch to me. I didn't particularly
want to break up the indexes, but as my project grows I may get to
that point. Oh, and the result is at
<http://roytennant.com/proto/hathi/> if anyone is curious. Thanks,
On Sun, Apr 5, 2009 at 8:47 AM, Peter Karman <email@example.com> wrote:
> Jordan Hayes wrote on 4/5/09 12:26 AM:
>>> I have something like 2-3 million small little XML files
>>> in one directory that I'm indexing ...
>> I'm going to guess that SWISH-E "slows down" due to your OS's directory handling.
> That would surprise me.
> I expect rather that you're simply running out of memory and you're swapping, as
> you suspected.
> I would just split my single dir into 3 and create 3 indexes. Then you could
> either merge them (though you might hit the same mem limit) or just search them
> all at once. You might also look at the -e option to swish-e while indexing.
> Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
> Users mailing list
Received on Thu Apr 9 20:28:03 2009