Re: Incremental indexing (was Index headers)

From: Bill Moseley <moseley(at)>
Date: Tue Sep 19 2000 - 16:55:44 GMT
At 08:48 AM 09/19/00 -0700, wrote:
>I can add some functions to the library to get these info:
>- StopWords
>- metaNames?
>- Files?

Great.  Again, I currently read the configuration file for the stopwords,
but having access to them through the index would be much better.  Would
there also be a switch for the swish-e binary?

>The problem arises when you use several index files (more than 
>one -f directive): You will get one line per file.

That's a tough one.  I'd think that when searching two indexes, most of the
indexing parameters should match up -- so there shouldn't be a need for
many duplicate headers.  If searching in a meta tag, then that tag should
be defined in both indexes.  Stopwords and WordCharacters should match up
too, otherwise the search results won't make much sense.

Let me restate my problem and see if we can come up with some ideas.  I'm
really trying to provide an incremental update to the main index file that
allows my searches to stay current:

I have a large index that is indexed once a week.  Files are added during
the week that I'd like to be able to search right away.  Currently, when
new files are added an "incremental" index is updated (recreated).  That
happens very quickly since there aren't that many new files.

Then I can search both indexes by using -f and specifying both index files.

The problems are these:

1) I don't know total hits until I've read all the results back from
swish-e because it currently displays "# Number of hits:" for the first
index, then the results from the first index, then the "# Number of hits:"
for the second index and then the results.

2) If a previously indexed file is modified and thus ends up in the
incremental index, searches may return that file twice -- once for the main
index and once for the incremental index.

3) I'd like to let swish sort the results and give me the sorted results a
page at a time using -b and -m.  With multiple indexes, the results are
sorted separately within each index.  I'd like all the results sorted together.
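
A minimal sketch of how the three problems above could be worked around on the
caller's side, after the results from both indexes have been read back.  It
assumes each index's results are already parsed into (rank, path) tuples, that
ranks from the two indexes are comparable (which may not hold), and that the
begin position is 1-based.  The function name and its parameters are
hypothetical and not part of swish-e.

```python
def merge_results(main_hits, incremental_hits, begin=1, max_results=10):
    """Merge two per-index result lists into one ranked, de-duplicated page.

    Each input is a list of (rank, path) tuples (an assumed, already-parsed
    form).  begin is 1-based and max_results is the page size; both only
    mirror the intent of -b and -m.
    """
    # A path found in the incremental index is the newer copy, so its rank
    # replaces the one from the main index (problem 2).
    best = {path: rank for rank, path in main_hits}
    best.update({path: rank for rank, path in incremental_hits})

    # Sort everything together: highest rank first, path as a tie-breaker
    # (problem 3).
    ordered = sorted(best.items(), key=lambda item: (-item[1], item[0]))

    # The total is known before any page is returned (problem 1).
    total = len(ordered)
    page = ordered[begin - 1 : begin - 1 + max_results]
    return total, [(rank, path) for path, rank in page]
```

The cost of this approach is that every result from both indexes has to be
read before the first page can be shown, which is the overhead that doing the
merge inside swish itself would avoid.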

It would be a neat feature if swish could do incremental indexing.  Perhaps
the main index could be appended with the incremental indexing data.  Swish
could be smart enough when searching the main index to ignore results from
any files listed in the appended incremental index.  This would allow files
previously indexed in the main index to be modified and included in the
incremental index.

A feature that would help facilitate incremental indexing is to index only
files newer than some date (or newer than a given file).  In other words, tell
swish to index only files newer than the main index file's date.
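
Until something like that exists, the selection could be done outside swish.
The sketch below only illustrates the idea: it walks a directory tree and
collects the files whose modification time is later than that of the main
index file.  The function name and both arguments are hypothetical, and how
the resulting list is handed to the indexer is left open.

```python
import os


def files_newer_than(index_path, root):
    """Return paths under root modified after index_path was last written."""
    cutoff = os.path.getmtime(index_path)
    newer = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Strictly newer: files with the same timestamp as the index
            # are treated as already indexed.
            if os.path.getmtime(path) > cutoff:
                newer.append(path)
    return sorted(newer)
```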

Speaking of dates:  Would it be useful to store the date of the file in the
index and provide a way to limit search results by date?

>> Does SwishOpen() really open the index file?  Or is the file opened and
>> closed on each search?
>The file is opened and closed in SwishOpen, just to read the 
>header information.
>It also opens and closes the file on each search.

Oh, so is there any speed benefit to doing one SwishOpen() and then calling
SwishSearch() multiple times?  I was thinking I'd call SwishOpen() once per
Apache mod_perl child process, and then just call SwishSearch() for each
search.  Something like Apache::DBI that keeps (slow to make) database
connections open from request to request.  I'm looking for speed.
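
The pattern I have in mind is roughly the following, sketched here in Python
rather than Perl.  The opener argument is a hypothetical stand-in for whatever
binding wraps SwishOpen(); nothing here is an actual swish-e call.  Given the
answer above, the only saving would be not re-reading the header information,
since the index file is still opened and closed on each search.

```python
# One cache per process: index path -> handle returned by the opener.
_handles = {}


def get_handle(index_path, opener):
    """Return a cached handle for index_path, opening it at most once."""
    if index_path not in _handles:
        # First request in this process: pay the open cost once.
        _handles[index_path] = opener(index_path)
    return _handles[index_path]
```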


Bill Moseley
Received on Tue Sep 19 16:56:05 2000