Re: Swish-e max db size vs. Google App

From: Peter Karman <peter(at)>
Date: Fri Apr 22 2005 - 20:54:04 GMT
Thomas Dowling scribbled on 4/22/05 12:42 PM:
> Greetings--
> I'm working with some staff members here who are interested in what we
> could do with a Google Appliance.  My gut reaction is, "Not much that we
> couldn't do with a beefy Linux box and Swish-e", natch.  But I find that
> I don't really have a sense of Swish-e's upper limits in terms of the
> number of documents or size/number of indexes it can work with.
> The home page says, "Swish-e is ideally suited for collections of a
> million documents or smaller."  I've seen posts on the list about 2GB+
> indexes of ~6 million documents under 2.5.x, along with a comment from
> Bill that that was pushing the envelope.  Does that reflect reasonable
> upper limits for current and forthcoming versions respectively?  Am I
> overlooking something obvious in the documentation?

Google Appliance vs. Swish-e isn't really a fair comparison. Even if you could 
index several million docs with Swish-e and not see a performance hit (which you 
will, as Bill noted), the two tools have different strengths and weaknesses.

Google is really good at indexing LOTS of docs quickly, and at searching them 
even more quickly. With the appliance you're getting hardware that's tuned just 
for Google. And you're paying through the nose for it, though that price does 
include support from Google. You get what you pay for.

Swish-e is really good at indexing what I think of as medium-size collections. 
Sure, it can probably do several million docs, but not nearly at the rate that 
Google can.

But the real defining issue for me is the ranking/accuracy of the search. 
Swish-e gives you exactly what you asked for, and because you can customize the 
metanames/props in practically endless ways, you can write really specific 
queries. Google, on the other hand, has their vaunted PageRank system, highly 
secretive and proprietary, and pretty good. Not perfect, but pretty good. The 
secret lies in the relative importance of any given doc as rated by the rest of 
the docs (the exact algorithm is the company secret). Swish-e ranking doesn't 
even pretend to do that. It's very simplistic, though fairly useful, depending 
on what you're trying to find.
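
To make "importance as rated by the rest of the docs" a bit more concrete, here 
is a toy sketch of link-based ranking in the spirit of the published PageRank 
idea. It is an assumption-laden illustration only: the function name, the 
damping value and the example file names are all made up here, and it is not 
what the appliance actually runs.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Toy link-based ranking by power iteration.

    links maps each doc name to the list of docs it links to.
    The damping factor and iteration count are illustrative assumptions.
    """
    # Collect every doc that appears as a source or a target.
    docs = set(links)
    for targets in links.values():
        docs.update(targets)
    docs = sorted(docs)
    n = len(docs)

    # Start with every doc equally important.
    rank = {d: 1.0 / n for d in docs}

    for _ in range(iterations):
        # Every doc gets a small baseline score...
        new_rank = {d: (1.0 - damping) / n for d in docs}
        for d in docs:
            targets = links.get(d, [])
            if targets:
                # ...plus an equal share of the score of each doc linking to it.
                share = damping * rank[d] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A doc with no outgoing links spreads its score over all docs.
                share = damping * rank[d] / n
                for t in docs:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical three-page site: a and b both point at c, c points back at a.
example_links = {
    "a.html": ["c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
}
scores = link_rank(example_links)
```

In this made-up site, c.html ends up ranked highest because two docs point at 
it, and b.html lowest because nothing points at it. None of that depends on 
which words the docs contain, which is exactly the part Swish-e doesn't try to 
do.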

Bottom line: if you have the money, and you want to index/search several million 
docs of varying format, size and complexity, and you want something that you can 
just plug in, turn on and point at your servers, go with Google. You get the 
support and the pedigree. If you don't have the money, or you want finer control 
over what you index and what kind of info you keep in the index, consider 
Swish-e -- though be prepared to get creative in terms of multiple indexes, 
etc., to minimize the performance hit.
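
As a rough sketch of the "multiple indexes" idea: split the collection into 
several smaller indexes, then name all of them on one search. The Python below 
only builds the command lines and does not run anything. The config name, index 
names and round-robin split are assumptions for illustration, and you should 
check the -c, -i, -f and -w switches against the SWISH-RUN docs for your 
version.

```python
def shard_paths(paths, shard_count):
    """Split document paths round-robin into shard_count groups."""
    shards = [[] for _ in range(shard_count)]
    for i, path in enumerate(sorted(paths)):
        shards[i % shard_count].append(path)
    return shards


def index_commands(shards, config="swish.conf", prefix="part"):
    """Build one swish-e indexing command per non-empty shard.

    config and prefix are hypothetical names; use whatever your setup has.
    """
    commands = []
    for n, shard in enumerate(shards):
        if shard:
            index_file = f"{prefix}{n}.index"
            commands.append(
                ["swish-e", "-c", config, "-i", *shard, "-f", index_file]
            )
    return commands


def search_command(query, index_files):
    """Build a single search command that names several index files."""
    return ["swish-e", "-w", query, "-f", *index_files]


# Hypothetical collection of four docs split across two indexes.
shards = shard_paths(["d.html", "a.html", "c.html", "b.html"], 2)
build = index_commands(shards)
search = search_command("appliance", ["part0.index", "part1.index"])
```

Each list could be handed to subprocess.run() as-is. Smaller indexes are also 
cheaper to rebuild when only part of the collection changes.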

you get what you pay for.
Peter Karman  .  .  peter(at)
