
Re: Indexing API

From: Peter Karman <karman(at)not-real.cray.com>
Date: Thu Jul 08 2004 - 03:11:38 GMT
I believe I know what Tac is referring to. I, too, have a system that 
indexes small document collections on the fly via CGI (in my case, a few 
XML files from a database via -S prog, 100s of times a day), and I've 
thought more than once that it would be nice to have an indexing Perl 
API, similar to the search API, to rely on, instead of making an 
external system() call to the swish-e binary and creating tmp ad hoc 
config files (since the properties change depending on the db).

In the end, though, I think that the two processes (indexing and 
searching) are different enough that a parallel indexing API doesn't 
make sense. The big advantage of the search API is that you can make a 
single connection to an index and search it multiple times. In my 
experience, searching just once via the API and once via the swish-e 
binary run at roughly the same speed; the big performance boost comes 
when doing multiple queries on the same index(es).
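For anyone who hasn't seen the persistent-handle pattern, it looks 
roughly like this with the SWISH::API Perl module (the index path and 
queries here are made up for illustration):

```perl
use strict;
use warnings;
use SWISH::API;

# Open the index once; this handle is reused for every query below.
my $swish = SWISH::API->new('index.swish-e');

# Run several searches against the same open handle -- this is where
# the API beats invoking the swish-e binary once per query.
for my $query ( 'foo', 'bar', 'baz' ) {
    my $results = $swish->Query($query);
    printf "%s: %d hits\n", $query, $results->Hits;
    while ( my $r = $results->NextResult ) {
        print "  ", $r->Property('swishdocpath'), "\n";
    }
}
```

Open the index once, query it many times; that amortization is the 
whole point of the search API.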

That advantage disappears when indexing, since you only index once. So 
the system() call is really not much more overhead (I would guess... I 
have no benchmarks to prove this, only anecdotal experience) than any 
kind of internal API would be, because you're creating the index once, 
not connecting to it multiple times.
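To make the comparison concrete, the one-shot system() approach I'm 
describing amounts to something like this. The directive names 
(IndexFile, MetaNames, PropertyNames) are standard swish-e config 
directives, but the values and the dump_db.pl prog script are 
illustrative, not from a real setup:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build an ad hoc config string for this particular database's
# properties, since they change from db to db.
sub build_config {
    my (%args) = @_;
    return join "\n",
        "IndexFile $args{index}",
        'MetaNames ' . join( ' ', @{ $args{metanames} } ),
        'PropertyNames ' . join( ' ', @{ $args{properties} } ),
        '';
}

my $config = build_config(
    index      => '/tmp/foo.index',
    metanames  => [qw(title author)],
    properties => [qw(title author date)],
);

# Write the config to a temp file for this run only.
my ( $fh, $tmpconf ) = tempfile( SUFFIX => '.conf', UNLINK => 1 );
print {$fh} $config;
close $fh;

# One external call per indexing run -- paid only once per index,
# which is why an in-process API would buy little here.
system( 'swish-e', '-c', $tmpconf, '-S', 'prog', '-i', './dump_db.pl' ) == 0
    or warn "swish-e exited nonzero: $?";
```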

What I ended up doing instead was writing a SWISH::Index perl module to 
let me quickly create those on-the-fly indexes with a single OO call. 
The module takes care of naming the index, accepting the XML input, 
creating the tmp ad hoc config file (merged with a master common 
config), and cleaning up. It also logs output from the swish-e indexer.

I do something like:

	my $db = new Database( name => 'foo' );
	$db->index_data;

no fuss, no muss.
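My SWISH::Index module isn't on CPAN, but a stripped-down sketch of the 
kind of wrapper I mean might look like this. The class name, index 
paths, and master config location are stand-ins, not the actual module:

```perl
package My::Indexer;    # hypothetical stand-in for my SWISH::Index module
use strict;
use warnings;
use File::Temp qw(tempfile);

sub new {
    my ( $class, %args ) = @_;
    # Derive the index file name from the database name.
    my $self = {
        name  => $args{name},
        index => "/tmp/$args{name}.index",
    };
    return bless $self, $class;
}

# Merge the master common config with this db's settings, write the
# tmp config and XML input, run swish-e, and let tempfile clean up.
sub index_data {
    my ( $self, $xml ) = @_;

    my ( $cfh, $conf ) = tempfile( SUFFIX => '.conf', UNLINK => 1 );
    print {$cfh} "IncludeConfigFile /etc/swish/master.conf\n",
                 "IndexFile $self->{index}\n";
    close $cfh;

    my ( $xfh, $xmlfile ) = tempfile( SUFFIX => '.xml', UNLINK => 1 );
    print {$xfh} $xml;
    close $xfh;

    system( 'swish-e', '-c', $conf, '-i', $xmlfile ) == 0
        or warn "indexing $self->{name} failed: $?";
    return $self->{index};
}

1;

package main;
my $db = My::Indexer->new( name => 'foo' );
# $db->index_data($xml_from_database);   # one call, no fuss
```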

Bill Moseley wrote on 7/7/04 3:21 PM:

> On Wed, Jul 07, 2004 at 12:15:44PM -0700, Tac wrote:
> 
>>Bill asked why I thought indexing should be callable (like searching),
>>rather than through a command line program.  Here are my reasons:
>> 
>>(1) I like having lots of control over the process.  We're indexing millions
>>of xml documents, and I like to have a better sense of where things are at,
>>rather than just firing up the program and waiting.  
> 
> 
> Well, I guess I'd need to see an API first, as I can't picture how
> that would be used.  It would require a big rewrite of the indexing
> part of swish -- as much of the code (like the config stuff) is very
> much the same as it was in 1997.
> 
> 
>>Now, both of those are more philosophical, the real reason I want to be able
>>to index from within a perl (or other) script is so that I can index on the
>>fly.
> 
> 
> I'm not sure I understand.  You mean index files individually?
> 
> 
>>We have about 6 million documents, each document has between 1 and 500
>>pages.
> 
> 
> Not your typical web site of a few 100 pages. ;)
> 
> 
>>swish-e indexes the documents, but when displaying them I only want
>>to display the appropriate pages (so if you search for a word that shows up
>>on page 26, we display a fragment of page 26 and a link to the image.  I
>>should mention that all our documents are OCR of images).
> 
> 
> Do you mean like at:
> 
>    http://swish-e.org/current/docs/searchdoc.html
> 
> upon indexing my -S prog script splits the documentation up into chunks
> and indexes them separately.  That way searches are more specific.
> 
> 
> 
>>So what I'd like to do is pass the page data (the OCR) and index it, then
>>just search the individual pages.  Since we'd be doing this for every
>>document on the fly (and we often display 10 or more documents per page), it
>>would involve a lot of resources.  Fortunately, swish-e is incredibly fast.
>>But I don't want to pay for the overhead of calling system() each time.
> 
> 
> Calling system() for what?
> 
> 
> 
>>I'd also like to capture information about word counts and such, without
>>having to parse the results of the index command line call.
> 
> 
> I think most of that data is available in the C/Perl API.
> 
> You are making some big indexes.  Report back on your findings.
> 
> 

-- 
Peter Karman - Software Publications Engineer - Cray Inc
phone: 651-605-9009 - mailto:karman@cray.com
Received on Wed Jul 7 20:11:58 2004