Re: swish-e stopword exclusion

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Apr 04 2005 - 14:30:30 GMT
On Mon, Apr 04, 2005 at 12:24:06PM +0200, Stefan Klett wrote:
>  He asked you if you think it is possible to 
> make some alterations in the treatment of stopwords , that is to exclude a 
> configurable list of stopwords to be found in some metanames. Which ones 
> should be configurable in a directive in swish-e.conf 

So instead of a global list of stopwords have a list of stopwords
assigned to each metaname?

> I hope you don't mind if i bypass the mailing-list and send you this 
> posting directly - if you do think that it would be better to have the 
> list involved, i would resent it to there. 

Please post to the list.  I've cc'ed the list on this reply.

> Browsing the code i found that the treatment of stopwords, metanames  and 
> directives seems to be scattered widely in the files.

Yes, their implementation is scattered around.

> So it seems hard to me to figure out where it is feasible to make
> the requested changes. I would like to ask you, if you may hint me
> to some functions which be of special interest in this respect.
> First of all it would be very helpful to know an easy way to
> implement a new directive - that means reading the conf-file
> conforming to the way used in  the existing code. 

Are you sure you need stopwords?


You would start by looking at how the stopword lists are currently
managed and generated.

parse_conffile.c has the really ugly code to parse the configuration
file.  That's where you would add the data for parsing your config
options.  Feel free to completely rewrite that file if you can't
stand working with it as it is.
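
As a rough idea of what parsing such a directive involves, here is a
minimal standalone sketch.  The directive name "MetaStopWords", the
struct, and the function are all made up for illustration; none of it
is taken from parse_conffile.c, which has its own conventions.

```c
#include <string.h>

#define MAX_WORDS 32

/* Hypothetical parsed form of one line such as:
 *     MetaStopWords title the a an
 * The directive name is an assumption, not an existing swish-e option. */
struct meta_stop_directive {
    char *metaname;
    char *words[MAX_WORDS];
    int   nwords;
};

/* Splits `line` in place on whitespace.  Returns 1 if the line is a
 * MetaStopWords directive that names a metaname, 0 otherwise.
 * Words beyond MAX_WORDS are silently ignored in this sketch. */
int parse_meta_stopwords(char *line, struct meta_stop_directive *out)
{
    const char *sep = " \t\r\n";
    char *tok = strtok(line, sep);

    out->metaname = NULL;
    out->nwords = 0;

    if (tok == NULL || strcmp(tok, "MetaStopWords") != 0)
        return 0;

    out->metaname = strtok(NULL, sep);
    if (out->metaname == NULL)
        return 0;

    while (out->nwords < MAX_WORDS && (tok = strtok(NULL, sep)) != NULL)
        out->words[out->nwords++] = tok;

    return 1;
}
```

The pointers in the result point into the caller's line buffer, so the
words would have to be copied before the buffer is reused.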

You will need to relate the stopword lists to metanames.  So, (without
much consideration) I'd expect you would want to attach the stopword
lists to the "metaEntry" structure.  That's defined in swish.h.

Metanames are managed in "metanames.h".  There you might provide
methods to test for a stopword, and you would need a way to free
memory used by the stopword lists when the metaEntry is destroyed.
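
Something along these lines, perhaps.  This is only a sketch: the
struct below is a made-up stand-in, not the real metaEntry from
swish.h, and the function names are invented.  A plain array with a
linear scan is used for brevity; a hash would likely be a better fit
for long lists.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a metaname entry.  The real metaEntry in
 * swish.h has different fields; only the idea of hanging a stopword
 * list off each metaname is shown here. */
struct meta_sketch {
    const char *name;
    char      **stopwords;   /* per-metaname stopword list */
    size_t      nstop;
};

/* Appends a copy of `word`.  Returns 0 on success, -1 if out of memory. */
int meta_add_stopword(struct meta_sketch *m, const char *word)
{
    size_t len = strlen(word) + 1;
    char *copy = malloc(len);
    char **grown;

    if (copy == NULL)
        return -1;
    memcpy(copy, word, len);

    grown = realloc(m->stopwords, (m->nstop + 1) * sizeof *grown);
    if (grown == NULL) {
        free(copy);
        return -1;
    }
    m->stopwords = grown;
    m->stopwords[m->nstop++] = copy;
    return 0;
}

/* Returns 1 if `word` is a stopword for this metaname, 0 otherwise. */
int meta_is_stopword(const struct meta_sketch *m, const char *word)
{
    for (size_t i = 0; i < m->nstop; i++)
        if (strcmp(m->stopwords[i], word) == 0)
            return 1;
    return 0;
}

/* Releases the list; to be called when the entry itself is destroyed. */
void meta_free_stopwords(struct meta_sketch *m)
{
    for (size_t i = 0; i < m->nstop; i++)
        free(m->stopwords[i]);
    free(m->stopwords);
    m->stopwords = NULL;
    m->nstop = 0;
}
```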

index.c is probably where you would test for stopwords as each word
is indexed by metaname and skip when you find a stopword.
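
The test at index time could look roughly like the following.  Again
a sketch with invented names, not the code in index.c: it just shows
the same words being kept or skipped depending on which metaname's
list is in effect.

```c
#include <stddef.h>
#include <string.h>

/* `list` is a NULL-terminated array of stopwords, or NULL for none. */
static int in_list(const char *const *list, const char *word)
{
    for (; list != NULL && *list != NULL; list++)
        if (strcmp(*list, word) == 0)
            return 1;
    return 0;
}

/* Copies into `kept` the words that should be indexed under a
 * metaname whose stopword list is `stopwords`.  `kept` must have room
 * for `nwords` pointers.  Returns the number of words kept. */
size_t words_to_index(const char *const *stopwords,
                      const char *const *words, size_t nwords,
                      const char **kept)
{
    size_t n = 0;

    for (size_t i = 0; i < nwords; i++)
        if (!in_list(stopwords, words[i]))
            kept[n++] = words[i];
    return n;
}
```

With "the" as a stopword for one metaname only, the words "the matrix"
would index as just "matrix" under that metaname and as both words
under any other.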

Now, you need the stopword lists when searching, too.

db_write.c is where you would write the stopword lists to disk.

db_read.c is where they are read at search time.
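
To give a feel for the write and read halves, here is a sketch that
stores one list as a count followed by length-prefixed words.  The
layout is invented for this example and is not the swish-e index
format; it also writes integers in host byte order and trusts the
count it reads, which real index code should not do.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Writes `n` words as: count, then (length, bytes) per word.
 * Returns 0 on success, -1 on a write error. */
int write_stoplist(FILE *fp, const char *const *words, uint32_t n)
{
    if (fwrite(&n, sizeof n, 1, fp) != 1)
        return -1;
    for (uint32_t i = 0; i < n; i++) {
        uint32_t len = (uint32_t)strlen(words[i]);
        if (fwrite(&len, sizeof len, 1, fp) != 1)
            return -1;
        if (len > 0 && fwrite(words[i], 1, len, fp) != len)
            return -1;
    }
    return 0;
}

/* Reads a list written by write_stoplist().  Returns a malloc'ed
 * array of malloc'ed strings and stores the count in *n_out, or
 * returns NULL on error.  The caller frees each string and the array. */
char **read_stoplist(FILE *fp, uint32_t *n_out)
{
    uint32_t n;
    char **words;

    if (fread(&n, sizeof n, 1, fp) != 1)
        return NULL;
    words = calloc(n ? n : 1, sizeof *words);
    if (words == NULL)
        return NULL;

    for (uint32_t i = 0; i < n; i++) {
        uint32_t len;
        if (fread(&len, sizeof len, 1, fp) != 1)
            goto fail;
        words[i] = malloc((size_t)len + 1);
        if (words[i] == NULL)
            goto fail;
        if (len > 0 && fread(words[i], 1, len, fp) != len)
            goto fail;
        words[i][len] = '\0';
    }
    *n_out = n;
    return words;

fail:
    for (uint32_t i = 0; i < n; i++)
        free(words[i]);          /* unread slots are NULL from calloc */
    free(words);
    return NULL;
}
```

One such list would be written per metaname, presumably keyed by
whatever identifies the metaname in the index.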

headers.c is where they are fetched for display at search time.

The stopword lists would need to be available through the C library.
You would need to add the interface to swish-e.h, and to libtest.c for
testing.  Also perl/SWISH::API would need methods added to access the
lists by metaname.  You would have to look to see how best to do that.
There's currently a way to access the metanames, so I suspect that is
where to add the methods to access the lists by metaname.

swish_words.c is where stopwords are removed from a query.  This will
be a little bit harder since you will need to know what metaname is
currently being applied at that point in the query.  Then you have to
adjust the query when one of its terms is only a stopword:

           swish-e -w title=foo AND other=<stopword>

ends up:

           swish-e -w title=foo
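
A very simplified sketch of that adjustment follows.  It assumes the
query has already been broken into terms, each carrying its metaname
and the operator that joined it to the previous term; the struct, the
callback type, and demo_is_stop() are all invented here.  It ignores
parentheses and NOT, and simply dropping a term is not obviously the
right thing to do for OR, so take it as the idea only.

```c
#include <stdio.h>
#include <string.h>

/* One query term; `op` is the operator joining it to the previous
 * term and is unused for the first term. */
struct term {
    const char *op;
    const char *meta;
    const char *word;
};

/* Returns nonzero if `word` is a stopword for metaname `meta`. */
typedef int (*stop_test)(const char *meta, const char *word);

/* Example test: "the" is a stopword only for the metaname "other". */
int demo_is_stop(const char *meta, const char *word)
{
    return strcmp(meta, "other") == 0 && strcmp(word, "the") == 0;
}

/* Rebuilds the query into `out` without the stopword terms and
 * without the operator that joined each dropped term.  Returns the
 * number of terms kept; stops early if `out` is too small. */
size_t rewrite_query(const struct term *terms, size_t n,
                     stop_test is_stop, char *out, size_t outsz)
{
    size_t kept = 0, used = 0;

    if (outsz == 0)
        return 0;
    out[0] = '\0';

    for (size_t i = 0; i < n; i++) {
        int w;

        if (is_stop(terms[i].meta, terms[i].word))
            continue;
        /* The first kept term gets no leading operator. */
        w = snprintf(out + used, outsz - used, "%s%s%s%s=%s",
                     kept ? " " : "", kept ? terms[i].op : "",
                     kept ? " " : "", terms[i].meta, terms[i].word);
        if (w < 0 || (size_t)w >= outsz - used) {
            out[used] = '\0';    /* discard the partial term */
            break;
        }
        used += (size_t)w;
        kept++;
    }
    return kept;
}
```

When no terms are kept, the caller would need to report that the
query consisted only of stopwords rather than run an empty search.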

IgnoreLimit is a way to generate the stopword lists automatically.
That's mostly in index.c, IIRC.  You would need to spend quite a bit
of time looking at how that already works.  I'd probably not bother
trying to get that to work.

I'm sure there are things I left out, but that should give you
something to start with.




-- 
Bill Moseley
moseley@hank.org
Received on Mon Apr 4 07:30:31 2005