Re: Indexing API

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Jul 07 2004 - 20:21:53 GMT
On Wed, Jul 07, 2004 at 12:15:44PM -0700, Tac wrote:
> Bill asked why I thought indexing should be callable (like searching),
> rather than through a command line program.  Here are my reasons:
>  
> (1) I like having lots of control over the process.  We're indexing millions
> of xml documents, and I like to have a better sense of where things are at,
> rather than just firing up the program and waiting.  

Well, I guess I'd need to see an API first, as I can't picture how
that would be used.  It would require a big rewrite of the indexing
part of swish -- as much of the code (like the config stuff) is very
much the same as it was in 1997.

> Now, both of those are more philosophical, the real reason I want to be able
> to index from within a perl (or other) script is so that I can index on the
> fly.

I'm not sure I understand.  You mean index files individually?

> We have about 6 million documents, each document has between 1 and 500
> pages.

Not your typical web site of a few hundred pages. ;)

> swish-e indexes the documents, but when displaying them I only want
> to display the appropriate pages (so if you search for a word that shows up
> on page 26, we display a fragment of page 26 and a link to the image.  I
> should mention that all our documents are OCR of images).

Do you mean like at:

   http://swish-e.org/current/docs/searchdoc.html

When indexing, my -S prog script splits the documentation up into chunks
and indexes them separately.  That way searches are more specific.
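
As a rough illustration of that chunking idea applied to per-page OCR text,
here is a minimal sketch of a program that could feed swish-e through
-S prog.  It is not the script used for the swish-e docs.  It assumes pages
are separated by form feeds and names each chunk "<path>#page=N"; both are
made-up conventions for this example, as are the function names.  The only
swish-e specifics relied on are the Path-Name and Content-Length headers
followed by a blank line and the content.

```python
import sys


def split_pages(text, delimiter="\f"):
    """Split a document into pages. The form-feed delimiter is an assumption."""
    return text.split(delimiter)


def prog_record(path, content):
    """Build one record: headers, a blank line, then exactly Content-Length bytes."""
    body = content.encode("utf-8")
    header = f"Path-Name: {path}\nContent-Length: {len(body)}\n\n"
    return header.encode("utf-8") + body


def page_records(doc_path, text):
    """Yield one record per non-empty page, keeping the original page numbers."""
    for number, page in enumerate(split_pages(text), start=1):
        if page.strip():
            # "#page=N" is a hypothetical naming scheme, not a swish-e feature.
            yield prog_record(f"{doc_path}#page={number}", page)


if __name__ == "__main__":
    sample = "first page text\fsecond page text"
    for record in page_records("doc0001.txt", sample):
        sys.stdout.buffer.write(record)
```

Each page then comes back as its own search result, so a hit on page 26
points straight at page 26 rather than at the whole document.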


> So what I'd like to do is pass the page data (the OCR) and index it, then
> just search the individual pages.  Since we'd be doing this for every
> document on the fly (and we often display 10 or more documents per page), it
> would involve a lot of resources.  Fortunately, swish-e is incredibly fast.
> But I don't want to pay for the overhead of calling system() each time.

Calling system() for what?


> I'd also like to capture information about word counts and such, without
> having to parse the results of the index command line call.

I think most of that data is available in the C/Perl API.

You are making some big indexes.  Report back on your findings.


-- 
Bill Moseley
moseley@hank.org
Received on Wed Jul 7 13:22:06 2004