Re: swish-e on a large scale

From: Peter Karman <karman(at)not-real.cray.com>
Date: Thu Sep 30 2004 - 16:11:20 GMT
Hi Aaron,

Glad to see Apple is joining the swish-e ranks. :)

Aaron Levitt wrote on 09/30/2004 10:53 AM:
> I ran the indexer with the following command:
> ./bin/swish-e -S prog -c swish.conf.
> 

Can you send along the contents of swish.conf? Those might be helpful.


> So, I have the following questions:
> 
> 1. I expect to have over 1,000,000 documents in our archives as things
> progress.  Is this pushing the limits of swish-e?
> 

I think there are folks on this list doing in excess of a million docs, 
but perhaps in smaller groups, depending on how often they need to be 
reindexed. One thing I like about swish-e is the ability to search 
multiple indexes simultaneously.

So yes, I think you can do a million, but for admin purposes, you might 
want to identify subsets and split them into smaller indexes.
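
As a rough sketch of the multiple-index search mentioned above: the -f switch accepts more than one index file at search time. The index file names and the helper function below are made up for illustration, and the helper only prints the command line so you can check it before running anything.

```shell
#!/bin/sh
# Hypothetical helper: print a swish-e search command covering several indexes.
# Usage: swish_search_cmd QUERY INDEX [INDEX ...]
# Assumes a single-word query; quote or escape anything more complex yourself.
swish_search_cmd() {
    query=$1
    shift
    # -w gives the search words, -f lists one or more index files.
    printf 'swish-e -w %s -f' "$query"
    for index in "$@"; do
        printf ' %s' "$index"
    done
    printf '\n'
}

# Example (index names are placeholders):
# swish_search_cmd apple archive-2003.index archive-2004.index
```

Leaving an index off the list is then all it takes to narrow a search to one subset.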


> 2. I have seen the indexer hit my robots.txt multiple times, is there a
> way to check on the progress to see if/when it will finish indexing?
> 

Bill will likely have a better idea than me.


> 3. What should I do regarding the current index process?  I'm afraid to
> stop it, because I don't want to have to start the indexing all over
> again.

Hmm. I'd let it go just for curiosity's sake. But I understand your 
concern. Is there a way you could benchmark the index size via the -S fs 
method, so you know what you're aiming for? I'm just wondering if you 
could identify whether the bottleneck is the spider or really is the 
indexer.
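
A minimal sketch of that -S fs benchmark, assuming you have (or can make) a local copy of a sample of the documents. The directory and index names are placeholders, and the helper is hypothetical; it just prints the command so you can check it first.

```shell
#!/bin/sh
# Hypothetical helper: print the command for a filesystem (-S fs) benchmark run.
# Usage: swish_fs_bench_cmd DOC_DIR INDEX_FILE
swish_fs_bench_cmd() {
    doc_dir=$1
    index_file=$2
    # -S fs reads files from disk, -i names the directory to index,
    # -f names the index file to write.
    printf 'swish-e -S fs -i %s -f %s\n' "$doc_dir" "$index_file"
}

# Example (paths are placeholders). Timing the run shows how long the
# indexer takes with no spider involved; ls shows the resulting index size.
# time $(swish_fs_bench_cmd ./mirror bench.index)
# ls -l bench.index*
```

If the fs run over the same documents finishes quickly, the spider is the more likely bottleneck.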

> 
> 4. Do you have any recommendations on what I can do to improve this
> process?

Like I said above, splitting up the docs into subsets depending on how 
often they need to be indexed can be helpful. It's also a nice way to 
limit the scope of a search, just by selecting which indexes are 
searched. That way you needn't futz with special metanames, etc.

-- 
Peter Karman - 651-605-9009 - karman@cray.com
Received on Thu Sep 30 09:11:48 2004