At 05:04 AM 10/30/00 -0800, firstname.lastname@example.org wrote:
>Now, the TODO list:
1 - Integrated swishspider with the same functionality as the perl
I wonder about the spidering. I don't use it and I'm not very familiar
with the way it works. I'd rather see some type of "plug-in" method to
access remote documents and to filter non-html documents. I wonder how
much of a performance hit it is to fork swish and exec perl for each
request. This can be saved by integrating the spidering into Swish, but
what kind of spider?
Spidering is complicated and although it should be part of the Swish
package, I'm not sure if it should be integrated. There are issues like how
fast to request documents, robots.txt checking, how many requests to queue
at the same time, how to handle redirects, whether documents that return
errors should be retried after a few minutes*, and whether documents should
be cached locally so you can use If-Modified-Since: headers in requests to
avoid transferring documents that haven't changed.
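To make a few of those issues concrete, here is a rough Python sketch of three of them: the robots.txt check, the conditional request, and the delayed retry. This is not how swishspider works. The agent name, the function names, and the one-hour default are all made up for illustration, and nothing here touches the network.

```python
import urllib.request
import urllib.robotparser
from email.utils import formatdate

USER_AGENT = "example-spider"  # hypothetical agent name, not a real crawler


def allowed_by_robots(robots_txt, url):
    """Check a URL against the text of an already-fetched robots.txt."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(USER_AGENT, url)


def conditional_request(url, cached_mtime=None):
    """Build a request; add If-Modified-Since when a cached copy exists.

    cached_mtime is the Unix timestamp of the local copy (an assumption
    about how a local cache would record it).
    """
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    if cached_mtime is not None:
        # HTTP dates are always expressed in GMT.
        req.add_header("If-Modified-Since",
                       formatdate(cached_mtime, usegmt=True))
    return req


def retry_time(failed_at, delay=3600.0):
    """Earliest time a failed URL should be tried again (default: one hour)."""
    return failed_at + delay
```

A server that honors the header answers 304 Not Modified with no body when the document is unchanged, which is where the savings come from.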
I think it would be nice to have a plug-in system where swish knows about
file indexing, but you can replace that with a plug-in that feeds swish the
documents. Another "hook" into swish would be for filtering the documents.
It would be smart if these "plug-ins" were only executed once and then
some type of IPC (e.g. STDIN <-> STDOUT) protocol was used to exchange
info. All this to avoid the fork.
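As a sketch of what such a run-once plug-in might look like, here is a small filter loop in Python. The framing is invented for this example and is not an existing Swish protocol: each document is sent as a header line "name length" followed by exactly that many bytes, and the reply uses the same framing. The names serve and upper_filter are hypothetical.

```python
import io


def serve(inp, out, convert):
    """Filter documents from a binary stream until the sender closes it.

    Frame format (assumed for this sketch): b"<name> <length>\n" + body.
    The process is started once and handles any number of documents,
    so there is no fork/exec per document.
    """
    while True:
        header = inp.readline()
        if not header:
            break  # end of input: the indexer closed the pipe
        # Split on the last space so names may contain spaces.
        name, _, size = header.decode("utf-8").rstrip("\n").rpartition(" ")
        body = inp.read(int(size))
        result = convert(name, body)
        out.write(f"{name} {len(result)}\n".encode("utf-8"))
        out.write(result)
        out.flush()


def upper_filter(name, body):
    """Stand-in for a real filter such as a PDF-to-text converter."""
    return body.upper()


# In-memory demonstration; a real plug-in would instead call
# serve(sys.stdin.buffer, sys.stdout.buffer, upper_filter).
demo_in = io.BytesIO(b"a.txt 5\nhello")
demo_out = io.BytesIO()
serve(demo_in, demo_out, upper_filter)
```

The length prefix is what lets both sides keep the pipe open between documents without guessing where one document ends.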
* I have a spider that rechecks errors after an hour and it's interesting
to see how many of those errors are cleared up after waiting.
And here are some others to toss out:
7 - Since word position is now in the document, how hard would it be to
implement a NEAR or ADJ tag to find words that are near or adjacent to
other words? Could one say things like NEAR:5 to be within five words?
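To show what I mean, here is a sketch in Python of how NEAR:n could be evaluated once each word has a sorted list of positions. It says nothing about how swish stores positions internally. The function names are made up, and treating ADJ as "immediately followed by" is my assumption about what that tag should mean.

```python
def word_positions(text):
    """Map each word to the sorted list of positions where it occurs."""
    index = {}
    for pos, word in enumerate(text.lower().split()):
        index.setdefault(word, []).append(pos)
    return index


def near(positions_a, positions_b, distance):
    """True if some occurrence of A is within `distance` words of some B.

    Both lists must be sorted; the two-pointer walk advances whichever
    list currently holds the smaller position.
    """
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= distance:
            return True
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False


def adjacent(positions_a, positions_b):
    """True if some occurrence of A is immediately followed by B."""
    following = set(positions_b)
    return any(pos + 1 in following for pos in positions_a)
```

For example, in "the quick brown fox jumps over the lazy dog", "quick" and "over" are four words apart, so NEAR:5 would match them but NEAR:3 would not.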
8 - Dates: It would be nice to be able to have the last-modified date of
the document stored in the index (and maybe also the indexed date if a
merged index). It would be somewhat handy to be able to say: "Return the
documents that contain this phrase, and are newer than a month." I know
this can be done with properties, but it would be nice if swish would do
9 - Incremental indexing: I'd like ways to add new files to an index, and
to also mark existing files in an index as "deleted" so they are not
returned by swish.
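A toy illustration of the idea in Python, using an in-memory index rather than the swish index format. The class and method names are invented for this sketch. New files are appended without rebuilding anything, and a "deleted" mark hides a file at search time while its postings stay in place.

```python
class ToyIndex:
    """Minimal inverted index with append-only adds and deletion marks."""

    def __init__(self):
        self.postings = {}    # word -> set of file ids
        self.paths = []       # file id -> path
        self.deleted = set()  # file ids hidden from search results

    def add(self, path, text):
        """Add a file to the existing index and return its new id."""
        file_id = len(self.paths)
        self.paths.append(path)
        for word in set(text.lower().split()):
            self.postings.setdefault(word, set()).add(file_id)
        return file_id

    def mark_deleted(self, path):
        """Hide every indexed copy of `path`; postings are left untouched."""
        for file_id, indexed_path in enumerate(self.paths):
            if indexed_path == path:
                self.deleted.add(file_id)

    def search(self, word):
        """Return paths containing `word`, skipping files marked deleted."""
        ids = self.postings.get(word.lower(), set()) - self.deleted
        return sorted(self.paths[i] for i in ids)
```

Updating a changed file then becomes mark_deleted followed by add. The stale postings only waste space until the next full reindex.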
Received on Mon Oct 30 17:58:00 2000