Indexing API

From: Tac <tac(at)>
Date: Wed Jul 07 2004 - 19:16:13 GMT
Bill asked why I thought indexing should be callable (like searching),
rather than through a command line program.  Here are my reasons:
(1) I like having lots of control over the process.  We're indexing millions
of xml documents, and I like to have a better sense of where things are at,
rather than just firing up the program and waiting.  
(2) Right now, we're using a hackish system to create a config file which is
passed to the swish-e program (since not everything can be passed from the
command line).  This feels very error-prone.
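As a rough illustration of the config-file approach described in (2), here is
a minimal sketch in Python (the post itself mentions perl; Python is used
here only for illustration).  The helper name render_config, the paths, and
the choice of directives are assumptions made for this example, not the
poster's actual setup; check directive names against the swish-e
configuration documentation before relying on them.

```python
def render_config(settings):
    """Render directive/value pairs as 'Directive value ...' lines.

    Hypothetical helper: it only formats text and knows nothing about
    which directives swish-e actually accepts.
    """
    lines = []
    for directive, values in settings.items():
        if isinstance(values, str):
            values = [values]
        if not values:
            # Catch one easy mistake early: a directive with no value.
            raise ValueError("no value given for " + directive)
        lines.append(directive + " " + " ".join(values))
    return "\n".join(lines) + "\n"


# Directive names and paths below are illustrative assumptions.
config_text = render_config({
    "IndexDir": "/data/xml",
    "IndexFile": "/data/index/docs.index",
    "IndexOnly": [".xml"],
})
print(config_text)
```

Generating the file from one data structure at least keeps the settings in a
single place, but it is still an extra moving part between the script and the
indexer, which is the fragility the post complains about.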
Both of those are more philosophical, though.  The real reason I want to be
able to index from within a perl (or other) script is so that I can index on
the fly.  We have about 6 million documents, and each document has between 1
and 500 pages.  swish-e indexes the documents, but when displaying them I
only want to display the appropriate pages: if you search for a word that
shows up on page 26, we display a fragment of page 26 and a link to the
image.  (I should mention that all our documents are OCR of images.)
So what I'd like to do is pass the page data (the OCR) and index it, then
just search the individual pages.  Since we'd be doing this for every
document on the fly (and we often display 10 or more documents per page), it
would involve a lot of resources.  Fortunately, swish-e is incredibly fast.
But I don't want to pay for the overhead of calling system() each time.
I'd also like to capture information about word counts and such, without
having to parse the results of the index command line call.
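To make the per-page idea concrete, here is a minimal in-memory sketch in
Python.  It does not use swish-e at all: it swaps in a plain dictionary-based
inverted index purely to show the shape of "index one document's pages, then
search the pages and read back word counts".  The function names and the
sample page text are hypothetical.

```python
import re
from collections import Counter, defaultdict

WORD = re.compile(r"[a-z0-9]+")


def index_pages(pages):
    """Map each word to a Counter of {page_number: occurrences}."""
    postings = defaultdict(Counter)
    for number, text in pages.items():
        for word in WORD.findall(text.lower()):
            postings[word][number] += 1
    return postings


def search_pages(postings, word):
    """Return (page_number, count) pairs, pages with most hits first."""
    hits = postings.get(word.lower(), {})
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


# Made-up OCR text for two pages of one document.
pages = {
    25: "annual report of the board",
    26: "the harbor survey and the harbor maps",
}
postings = index_pages(pages)
print(search_pages(postings, "harbor"))
```

A callable indexing API would allow the same pattern with the real indexer:
hand it the page text directly, then query the result in-process, with no
system() call and no output parsing.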
Anyway, if indexing were callable, I'm sure other people would use it in
other ways as well.  It would just be a more powerful tool.
PS I've been indexing our documents for the past several days, and the
preliminary testing is very, very exciting.   Our current website is
terribly slow, but I think swish-e is going to be much, much faster.  I'm
like an impatient child, waiting for these multi-million page runs to finish
so I can play with searches and compare.

Received on Wed Jul 7 12:16:22 2004