I believe I know what Tac is referring to. I, too, have a system set up
that requires indexing small document collections (in my case, a few XML
files from a database via -S prog, 100s of times a day) via CGI,
on-the-fly, and have thought a few times that it would be nice to have
an indexing perl API similar to the search API to rely on, instead of
making an external system() call to the swish-e binary and creating tmp
ad hoc config files (since the properties change depending on the db).
Boy, that was a long sentence.
In the end, though, I think that the two processes (indexing and
searching) are different enough that it doesn't make sense to have both
APIs. Namely, the big advantage to the search API is that you can make a
single connection to an index and search it multiple times. Searching it
just once via the API and once via the swish-e binary are roughly the
same speed, in my experience. The big performance boost comes when doing
multiple queries on the same index(es).
That advantage does not carry over to indexing, since you only
index once. So the system() call is really not much more overhead (I
would guess...I have no benchmarks to prove this, only anecdotal
experience) than any kind of internal API would be, because you're only
connecting (creating) the index once, not multiple times.
What I ended up doing instead was writing a SWISH::Index perl module to
let me quickly create those on-the-fly indexes with a single OO call.
The module takes care of naming the index, creating/taking the XML
input, creating the tmp ad hoc config file (along with using a master
common config), and cleaning up. It also logs output from the swish-e
I do something like:
my $db = Database->new(name => 'foo');
no fuss, no muss.
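For illustration only, here is a minimal sketch of the kind of wrapper described above: write a throwaway config that includes a master config, then hand everything to the swish-e binary with one system() call. This is not the actual SWISH::Index interface. The function names (prepare_index_job, run_index_job), the "name.index" naming scheme, and the paths in the demo are all hypothetical. It assumes the usual swish-e switches (-S prog, -c, -i, -f) and the IncludeConfigFile and PropertyNames config directives.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (NOT the real SWISH::Index API).
# Writes a temporary config and returns the config path, the index
# name, and the swish-e argument list. It does not run anything.
sub prepare_index_job {
    my (%args) = @_;
    my $name  = $args{name} or die "name is required\n";
    my $prog  = $args{prog} or die "prog is required\n";
    my $props = $args{properties} || [];
    my $index = "$name.index";    # assumed naming scheme

    my ($fh, $config) = tempfile(
        "swish-${name}-XXXXXX",
        SUFFIX => '.conf',
        TMPDIR => 1,
    );

    # Shared settings live in the master config; only the
    # per-database properties are written here.
    print {$fh} "IncludeConfigFile $args{master_config}\n"
        if $args{master_config};
    print {$fh} "PropertyNames @$props\n" if @$props;
    close $fh or die "cannot close $config: $!\n";

    my @cmd = ('swish-e', '-S', 'prog', '-c', $config,
               '-i', $prog, '-f', $index);
    return ($config, $index, @cmd);
}

# Runs swish-e once, removes the temporary config, returns the index name.
sub run_index_job {
    my ($config, $index, @cmd) = prepare_index_job(@_);
    my $status = system(@cmd);    # list form, so no shell is involved
    unlink $config;
    die "swish-e failed (status $status)\n" if $status != 0;
    return $index;
}

# Demo: build the job and show the command without running swish-e.
{
    my ($config, $index, @cmd) = prepare_index_job(
        name       => 'foo',
        prog       => './dump_foo.pl',    # hypothetical -S prog script
        properties => [ 'title', 'author' ],
    );
    print "@cmd\n";
    unlink $config;
}
```

Calling run_index_job(name => 'foo', prog => './dump_foo.pl') would then be the single call that builds the index. Keeping the config and command construction separate from the system() call makes it possible to inspect or log both before anything is executed.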
Bill Moseley wrote on 7/7/04 3:21 PM:
> On Wed, Jul 07, 2004 at 12:15:44PM -0700, Tac wrote:
>>Bill asked why I thought indexing should be callable (like searching),
>>rather than through a command line program. Here are my reasons:
>>(1) I like having lots of control over the process. We're indexing millions
>>of xml documents, and I like to have a better sense of where things are at,
>>rather than just firing up the program and waiting.
> Well, I guess I'd need to see an API first, as I can't picture how
> that would be used. It would require a big rewrite of the indexing
> part of swish -- as much of the code (like the config stuff) is very
> much the same as it was in 1997.
>>Now, both of those are more philosophical, the real reason I want to be able
>>to index from within a perl (or other) script is so that I can index on the
> I'm not sure I understand. You mean index files individually?
>>We have about 6 million documents, each document has between 1 and 500
> Not your typical web site of a few hundred pages. ;)
>>swish-e indexes the documents, but when displaying them I only want
>>to display the appropriate pages (so if you search for a word that shows up
>>on page 26, we display a fragment of page 26 and a link to the image. I
>>should mention that all our documents are OCR of images).
> Do you mean like at:
> Upon indexing, my -S prog script splits the documentation up into chunks
> and indexes them separately. That way searches are more specific.
>>So what I'd like to do is pass the page data (the OCR) and index it, then
>>just search the individual pages. Since we'd be doing this for every
>>document on the fly (and we often display 10 or more documents per page), it
>>would involve a lot of resources. Fortunately, swish-e is incredibly fast.
>>But I don't want to pay for the overhead of calling system() each time.
> Calling system() for what?
>>I'd also like to capture information about word counts and such, without
>>having to parse the results of the index command line call.
> I think most of that data is available in the C/Perl API.
> You are making some big indexes. Report back on your findings.
Peter Karman - Software Publications Engineer - Cray Inc
phone: 651-605-9009 - mailto:firstname.lastname@example.org
Received on Wed Jul 7 20:11:58 2004