Re: Perl API and mod_perl/Incremental

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Feb 17 2005 - 15:24:00 GMT
On Thu, Feb 17, 2005 at 02:37:08AM -0800, Markus Peter wrote:
> Can I already open the index files in my Apache mod_perl startup script 
> (=before the fork of the children) and it will automatically do the right thing

I'm not sure.  Searching writes into the index structures, so each
child is going to end up with its own copy of that memory anyway
(copy-on-write).  Using a second mod_perl server with SWISHED (as
Peter mentioned) might be a bit more efficient memory-wise, since
there would be fewer child processes running swish.  Whether that's
worth the trade-off of running a second mod_perl server is something
you would have to determine.

The act of opening the index doesn't use that much RAM.  Running
searches can, though.

Try opening the indexes in startup.pl and, separately, in each child
after the fork, then report back the differences and how you measured
them.
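
One possible way to take that measurement, as a rough sketch only: on a
Linux system that provides /proc/<pid>/smaps, the shared and private
page counters for a process can be summed and compared between the two
setups.  This is Python rather than Perl, and the function names
(summarize_smaps, measure) are made up for illustration; nothing here
is part of Swish-e or mod_perl.

```python
def summarize_smaps(text):
    """Sum shared and private resident memory (kB) from smaps-style text."""
    shared_fields = {"Shared_Clean:", "Shared_Dirty:"}
    private_fields = {"Private_Clean:", "Private_Dirty:"}
    totals = {"shared_kb": 0, "private_kb": 0}
    for line in text.splitlines():
        parts = line.split()
        # Counter lines look like "Private_Dirty:   6 kB"; skip everything else.
        if len(parts) != 3 or parts[2] != "kB":
            continue
        if parts[0] in shared_fields:
            totals["shared_kb"] += int(parts[1])
        elif parts[0] in private_fields:
            totals["private_kb"] += int(parts[1])
    return totals


def measure(pid="self"):
    """Read /proc/<pid>/smaps (Linux only; assumed to exist) and summarize it."""
    with open(f"/proc/{pid}/smaps") as fh:
        return summarize_smaps(fh.read())
```

Calling measure() with the pid of an Apache child before and after it
has run a few searches would show how much memory moved from shared to
private, which is the copy-on-write cost in question.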

> The other question I have is regarding incremental mode. So far I've
> been using the traditional mode with cron jobs to update once or twice a
> day, but I'd really like to convert the search to be "real time". How
> stable is incremental mode? And "how incremental" is it? Can I use it,
> to add/modify/remove documents from the search index on the fly, as they
> are added/modified or is it rather targeted at batch processing a larger
> number of updates (=merely a better merge)?

I only know a little about incremental internals.

It's not really on-the-fly.  It uses a different index format -- a
btree structure that allows updates.  A deletion is made by marking
the file as having zero words in total; the word data itself is not
removed, so the index continues to grow.  It also means that the
search engine still finds words from deleted files; each matching
file is then checked to see whether it has been deleted, and if so
it is not added to the result set.
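
The deletion scheme described above can be pictured with a toy
in-memory index.  This is only a sketch of the general idea (marking a
file as having zero words and filtering at search time); it is Python,
it does not reflect Swish-e's actual btree format or code, and the
class and method names are invented for the example.

```python
from collections import defaultdict


class TombstoneIndex:
    """Toy inverted index where deletes only mark a file, never remove postings."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of file ids
        self.word_count = {}              # file id -> total words; 0 means deleted

    def add(self, file_id, text):
        words = text.lower().split()
        for word in words:
            self.postings[word].add(file_id)
        self.word_count[file_id] = len(words)

    def delete(self, file_id):
        # Mark the file as having zero words; its postings stay in place.
        self.word_count[file_id] = 0

    def search(self, word):
        # Deleted files are still found here, then filtered out of the results.
        hits = self.postings.get(word.lower(), set())
        return sorted(f for f in hits if self.word_count.get(f, 0) > 0)

    def posting_entries(self):
        # Total stored postings; this never shrinks when files are deleted.
        return sum(len(ids) for ids in self.postings.values())
```

After a delete, search() stops returning the file but
posting_entries() is unchanged, which is why such an index keeps
growing until it is rebuilt from scratch.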

It's not really on-the-fly because, although you can add files to an
existing index, the final stages of indexing are still done every
time a file is added -- namely the presorted indexes have to be
rebuilt.  I'm not 100% sure, but I suspect there's a time when the
index is in an unstable state while adding files to the index.

I tried a commercial search engine once -- I can't remember what it
was (they kept emailing me for months after the "free trial" so you
would think I would remember) -- but it truly allowed searches while
it was indexing.  The downside was it took  f o r e v e r   to run
indexing, and searches were not that speedy.  Yes, I suspect that was
a trade-off for scalability.


-- 
Bill Moseley
moseley@hank.org

Received on Thu Feb 17 07:24:14 2005