Re: Indexing a Large and Increasing DB

From: <moseley(at)not-real.hank.org>
Date: Wed Oct 15 2003 - 19:56:01 GMT
On Wed, Oct 15, 2003 at 11:35:22AM -0700, Sean Downey wrote:

> The number of documents grows by about 5000 per week. Stories can be
> modified during the day - but would not usually be modified after 7 days.

Will old stories ever be removed?  Swish-e is fast but was not designed
for an ever-increasing number of documents.  Scalability is an issue.

> 
> My current line of thinking is that there should be three index DBs.
> 
> M - the Main Index
> S1 - a small index which would store stories back to the last Sunday.
> S2 - a small index which would store stories from the last Sunday to the
> Sunday before.

So the point of S2 is to allow for merging, correct?

> The search would use M, S1 & S2.
> 
> does this sound reasonable?
> or is there a better way of doing this?
> I have read a few topics about staying away from the merge - is merging
> still a problem?

I'd suggest testing, of course (and reporting back your findings).  
Merge should work mostly like normal indexing.  It avoids the re-parsing 
of documents, but it has to do additional sorting of all the word data 
to accomplish the merge.  So there are some trade-offs, and I think testing 
is the only way to see what happens with your data.
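
For testing, the three-index split quoted above can be sketched as a
small date rule. This is only one reading of the proposal, not anything
built into Swish-e: it assumes S1 holds stories dated on or after the
most recent Sunday, S2 holds the seven days before that, and M holds
everything older. The names last_sunday and index_for are made up for
illustration.

```python
from datetime import date, timedelta


def last_sunday(today: date) -> date:
    """Return the most recent Sunday on or before `today`."""
    # date.weekday() numbers Monday as 0 and Sunday as 6.
    days_since_sunday = (today.weekday() + 1) % 7
    return today - timedelta(days=days_since_sunday)


def index_for(story_date: date, today: date) -> str:
    """Pick the index ("S1", "S2" or "M") a story would live in.

    Assumed boundaries (not confirmed by the original proposal):
      S1: most recent Sunday up to today
      S2: the Sunday before that, up to (not including) the most recent Sunday
      M:  anything older
    """
    s1_start = last_sunday(today)
    s2_start = s1_start - timedelta(days=7)
    if story_date >= s1_start:
        return "S1"
    if story_date >= s2_start:
        return "S2"
    return "M"


if __name__ == "__main__":
    today = date(2003, 10, 15)  # a Wednesday
    for d in (date(2003, 10, 13), date(2003, 10, 8), date(2003, 9, 30)):
        print(d.isoformat(), index_for(d, today))
```

Under this rule, each Sunday the old S2 would be merged into M, the old
S1 would become S2, and a new empty S1 would start. Stories that can
still be modified (up to about 7 days old) then always sit in one of the
two small indexes, which are cheap to rebuild.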

-- 
Bill Moseley
moseley@hank.org
Received on Wed Oct 15 19:56:04 2003