On 30 Oct 2000, at 9:54, Bill Moseley wrote:
> At 05:04 AM 10/30/00 -0800, email@example.com wrote:
> >Now, the TODO list:
> 1 - Integrated swishspider with the same functionality as the perl
> I wonder about the spidering. I don't use it and I'm not very
> familiar with the way it works. I'd rather see some type of "plug-in"
> method to access remote documents and to filter non-html documents. I
> wonder how much of a performance hit it is to fork swish and exec perl
> for each request. This can be saved by integrating the spidering into
> Swish, but what kind of spider?
Well, fork is not the problem; in fact, there is a delay parameter to
avoid sending requests to the server in rapid succession. The main problem,
especially for Windows users, is the Perl interpreter and the modules
it requires (4 or 5 additional modules).
The integrated spider would work the same way the Perl script does...
> And here are some others to toss out:
> 7 - Since word position is now in the document how hard would it be to
> implement a NEAR or ADJ tag to find words that are near or adjacent to
> other words? Could one say things like NEAR:5 to be within five
It may not be difficult.
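
To illustrate why word positions make this feasible, here is a minimal sketch in Python (not swish-e code). The names build_positions and near are made up for the example, and it assumes each word's positions are kept as a sorted list per document:

```python
from collections import defaultdict

def build_positions(doc_words):
    """Map each word to the sorted list of positions where it occurs."""
    positions = defaultdict(list)
    for pos, word in enumerate(doc_words):
        positions[word.lower()].append(pos)
    return positions

def near(positions, a, b, distance):
    """Return True if some occurrence of a lies within `distance` words of b."""
    pa = positions.get(a, [])
    pb = positions.get(b, [])
    i = j = 0
    # Walk both sorted position lists in step, like a merge.
    while i < len(pa) and j < len(pb):
        if abs(pa[i] - pb[j]) <= distance:
            return True
        # Advance whichever pointer sits at the smaller position.
        if pa[i] < pb[j]:
            i += 1
        else:
            j += 1
    return False
```

Under these assumptions, NEAR:5 would be near(..., 5) and ADJ would be the same call with a distance of 1.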
> 8 - Dates: It would be nice to be able to have the last modified
> date of the document stored in the index (and maybe also the indexed
> date if a merged index). It would be somewhat handy to be able to
> say: "Return the documents that contain this phrase, and are newer
> than a month." I know this can be done with properties, but it would
> be nice if swish would do the comparison.
The dates can be added to the file info. But if you really want to
search between dates, this is a different problem: I think that these
dates should be "internal fields" with reserved names (e.g.
indexdate and filedate).
But swish-e does not have date fields; all the fields are "varchars".
The parser would also have to allow operators like >, >=, <, <=, = and !=.
Anyway, it would be nice, but it would be necessary to change the way
swish-e works now. This is a major addition.
As far as I can remember, another free indexer, freewais-sf, has
some sort of date support, but the implementation has a severe
performance penalty because it scans all dates sequentially
in order to get the results between two dates. For a fast date
or numeric search, a btree or similar sorted structure must be
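
As a rough sketch of the sorted-structure idea (Python, not swish-e code): a sorted array with binary search stands in here for the btree, and the DateIndex class and its between method are hypothetical names used only for the example.

```python
import bisect
from datetime import date

class DateIndex:
    """Keeps (date, file_id) pairs sorted so range queries avoid a full scan."""

    def __init__(self, entries):
        # entries: iterable of (date, file_id) pairs
        self._entries = sorted(entries)
        self._keys = [d for d, _ in self._entries]

    def between(self, start, end):
        """Return file ids whose date falls in [start, end], in date order."""
        lo = bisect.bisect_left(self._keys, start)   # first date >= start
        hi = bisect.bisect_right(self._keys, end)    # first date > end
        return [file_id for _, file_id in self._entries[lo:hi]]
```

Both ends of the range are found in O(log n) comparisons instead of testing every date. A real index would need an on-disk structure that also supports inserts, which is where a btree comes in.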
> 9 - Incremental indexing: I'd like ways to add new files to an index,
> and to also mark existing files in an index as "deleted" so they are
> not returned by swish.
* Marking files as deleted is the easy part. E.g.:
./swish-e -d file -f fileindex
* Updating or inserting new files is a little more complicated:
- All the words affected by the insert/update have to be expanded
and rewritten to the index file (probably at the end of the file, after
recomputing the offsets).
- File entries and properties have the same problem.
* For these reasons, the index file will grow and the data in it will
become fragmented, so searching will get slower.
Thus, a reorganization function is also needed.
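
The points above can be modelled with a toy in-memory sketch (Python, not the swish-e index format). AppendOnlyIndex and all of its members are hypothetical, and the linear scan in search is only there to keep the example short:

```python
class AppendOnlyIndex:
    """Toy model: updates append new postings, deletes only mark records dead."""

    def __init__(self):
        self.postings = []     # append-only log of (word, record_no)
        self.records = {}      # record_no -> file name
        self.current = {}      # file name -> live record_no
        self.dead = set()      # record numbers marked as deleted
        self.next_record = 0

    def delete(self, name):
        """Mark a file as deleted without touching its postings."""
        rec = self.current.pop(name, None)
        if rec is not None:
            self.dead.add(rec)

    def add(self, name, words):
        """Insert or update a file; an update supersedes the old record."""
        self.delete(name)
        rec = self.next_record
        self.next_record += 1
        self.records[rec] = name
        self.current[name] = rec
        for word in set(words):
            self.postings.append((word, rec))   # always written at the end

    def search(self, word):
        """Return live file names containing word, skipping dead records."""
        return sorted(self.records[r] for w, r in self.postings
                      if w == word and r not in self.dead)

    def compact(self):
        """Reorganize: drop postings and records that are marked dead."""
        self.postings = [(w, r) for w, r in self.postings if r not in self.dead]
        for rec in self.dead:
            del self.records[rec]
        self.dead.clear()
```

Every update leaves stale postings behind until compact runs, which is the growth and fragmentation described above.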
Received on Mon Oct 30 19:30:07 2000