
Re: swish-e-2.1.x status and todo list

From: <jmruiz(at)not-real.boe.es>
Date: Mon Oct 30 2000 - 19:27:45 GMT
Hi Bill,

On 30 Oct 2000, at 9:54, Bill Moseley wrote:

> At 05:04 AM 10/30/00 -0800, jmruiz@boe.es wrote:
> >Now, the TODO list:
> 
> 1 - Integrated swishspider with the same functionality as the perl
> 
> I wonder about the spidering.  I don't use it and I'm not very
> familiar with the way it works.  I'd rather see some type of "plug-in"
> method to access remote documents and to filter non-html documents.  I
> wonder how much of a performance hit it is to fork swish and exec perl
> for each request.  This can be saved by integrating the spidering into
> Swish, but what kind of spider?  
> 
Well, fork is not the problem; in fact, there is a delay parameter to 
avoid sending requests to the server in rapid succession. The main 
problem, especially for Windows users, is the perl interpreter and the 
modules required (4 or 5 additional modules are required).

The spider would work in the same way the perl script does...
(What about javascript links?)

> And here are some others to toss out:
> 
> 7 - Since word position is now in the document how hard would it be to
> implement a NEAR or ADJ tag to find words that are near or adjacent to
> other words?  Could one say things like NEAR:5 to be within five
> words?
> 

It may not be difficult.
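
To make that concrete, here is a minimal sketch in Python (not swish-e 
code) of how a NEAR:n test could work once word positions are stored 
per file. The index layout, the function names and the sample data 
are all assumptions made up for the illustration.

```python
from bisect import bisect_left

def near(positions_a, positions_b, max_distance):
    """True if some position in a lies within max_distance words of one in b.

    Both lists must be sorted in ascending order.
    """
    for pos in positions_a:
        # First position of b that is not more than max_distance before pos.
        i = bisect_left(positions_b, pos - max_distance)
        if i < len(positions_b) and positions_b[i] <= pos + max_distance:
            return True
    return False

# Hypothetical positional index: word -> {file number: sorted word positions}
index = {
    "free": {1: [3, 40], 2: [7]},
    "indexer": {1: [5, 90], 2: [30]},
}

def search_near(index, word_a, word_b, max_distance):
    """Return the file numbers where word_a and word_b occur near each other."""
    files_a = index.get(word_a, {})
    files_b = index.get(word_b, {})
    return sorted(
        f for f in files_a.keys() & files_b.keys()
        if near(files_a[f], files_b[f], max_distance)
    )
```

With this layout, ADJ is just the special case of a distance of 1, 
and a query like NEAR:5 becomes search_near(index, "free", "indexer", 5).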

> 8 - Dates:  It would be nice to be able to have the last modified
> date of the document stored in the index (and maybe also the indexed
> date if a merged index).  It would be somewhat handy to be able to
> say: "Return the documents that contain this phrase, and are newer
> than a month."  I know this can be done with properties, but it would
> be nice if swish would do the comparison.
> 
The dates can be added to the file info. But if you really want to 
search between dates, this is a different problem: I think that these 
dates should be "internal fields" with a reserved name (eg: 
indexdate and filedate).
But swish-e does not have date fields. All the fields are "varchars".
The parser should also allow things like >, >=, <, <=, =, !=.
Anyway, it would be nice, but it would require changing the way
swish-e works now. This is a major addition.
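
As a rough illustration of the parser side, here is a small Python 
sketch that splits a term such as "filedate>=2000-10-01" into field, 
operator and value. The term syntax and the helper names are 
assumptions, not anything swish-e supports. It also assumes the dates 
are stored as YYYY-MM-DD strings, in which case plain string 
comparison gives the chronological order even though the fields 
remain "varchars".

```python
import operator
import re

# Two-character operators are listed first so ">=" is not read as ">".
OPERATORS = {
    ">=": operator.ge, "<=": operator.le, "!=": operator.ne,
    ">": operator.gt, "<": operator.lt, "=": operator.eq,
}
TERM = re.compile(r"^\s*(\w+)\s*(>=|<=|!=|>|<|=)\s*(\S+)\s*$")

def parse_term(text):
    """Split 'field<op>value' into (field, comparison function, value)."""
    m = TERM.match(text)
    if m is None:
        raise ValueError(f"not a comparison term: {text!r}")
    field, op, value = m.groups()
    return field, OPERATORS[op], value

def matches(properties, term):
    """True if the file's properties (a dict of strings) satisfy the term."""
    field, compare, value = parse_term(term)
    return field in properties and compare(properties[field], value)
```

For example, matches({"filedate": "2000-10-30"}, "filedate>=2000-10-01") 
is true, while the same file fails "filedate<2000-10-01".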

As far as I can remember, another free indexer, freewais-sf, has 
some sort of date support, but the implementation carries a severe 
performance penalty because it scans all the dates sequentially 
in order to get the results between two dates. For a fast date
or numeric search, a btree or similar sorted structure must be 
implemented.
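
The difference between the two approaches can be sketched in Python 
as follows. A sorted list with binary search stands in for the btree 
here, which is a simplification, and the class, the function names 
and the sample dates are hypothetical.

```python
from bisect import bisect_left, bisect_right
from datetime import date

# Hypothetical sample data: (file date, file number)
file_dates = [
    (date(2000, 10, 30), 1),
    (date(2000, 9, 1), 2),
    (date(2000, 10, 15), 3),
    (date(1999, 12, 31), 4),
]

def scan_between(entries, start, end):
    """Sequential approach: every date is compared on every search."""
    return sorted(f for d, f in entries if start <= d <= end)

class DateIndex:
    """Sorted structure: two binary searches locate the whole range."""

    def __init__(self, entries):
        pairs = sorted(entries)
        self._dates = [d for d, _ in pairs]
        self._files = [f for _, f in pairs]

    def between(self, start, end):
        lo = bisect_left(self._dates, start)   # first date >= start
        hi = bisect_right(self._dates, end)    # one past the last date <= end
        return sorted(self._files[lo:hi])
```

Both functions return the same file numbers; the sorted version only 
touches the entries inside the requested range, at the cost of 
keeping the structure ordered when files are added.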

> 9 - Incremental indexing:  I'd like ways to add new files to an index,
> and to also mark existing files in an index as "deleted" so they are
> not returned by swish.
> 

* Marking files as deleted is the easy part. Eg:
./swish-e -d file -f fileindex
* Updating or inserting new files is a little bit complicated:
    - All the words affected by the insert/update should be expanded 
and rewritten to the file index (probably at the end of the file after
recomputing the offsets).
    - File and properties have the same problem.
* For these reasons, the index file will grow and the data in it 
will become fragmented, so searching would get slower.
Thus, a reorganization function is also needed.
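
The points above can be sketched with a toy in-memory model in 
Python. This is only an illustration under simplifying assumptions: 
the class and method names are made up, and it ignores word positions, 
properties and the on-disk offsets that make the real problem harder.

```python
class IncrementalIndex:
    """Toy append-only index with deletion marks and a reorganize step."""

    def __init__(self):
        self.files = []       # file number -> path (entries are only appended)
        self.postings = {}    # word -> list of file numbers
        self.deleted = set()  # file numbers marked as deleted

    def add(self, path, words):
        file_no = len(self.files)
        self.files.append(path)
        for word in set(words):
            self.postings.setdefault(word, []).append(file_no)
        return file_no

    def delete(self, path):
        # The easy part: nothing is rewritten, the entry is only marked.
        for file_no, existing in enumerate(self.files):
            if existing == path:
                self.deleted.add(file_no)

    def update(self, path, words):
        # An update is a delete plus an append, so the index keeps growing.
        self.delete(path)
        return self.add(path, words)

    def search(self, word):
        return [self.files[n] for n in self.postings.get(word, [])
                if n not in self.deleted]

    def reorganize(self):
        # Drop deleted entries and renumber the surviving files.
        renumber, files = {}, []
        for old_no, path in enumerate(self.files):
            if old_no not in self.deleted:
                renumber[old_no] = len(files)
                files.append(path)
        postings = {}
        for word, file_nos in self.postings.items():
            kept = [renumber[n] for n in file_nos if n in renumber]
            if kept:
                postings[word] = kept
        self.files, self.postings, self.deleted = files, postings, set()
```

Every update leaves a dead entry behind that searches have to skip, 
which is the growth and fragmentation mentioned above; reorganize() 
is the step that reclaims that space.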

cu
Jose
Received on Mon Oct 30 19:30:07 2000