I just noticed that the version of SWISH::Stemmer on CPAN is
different from the one in the swish-e distribution: the CPAN version
stems words differently.

SWISH::Stemmer contains the original stemming code extracted from
swish-e. It was created before there was an API for swish to stem
using the swish-e C library.

There are a few problems with SWISH::Stemmer. First, if it gets out of
sync with the stemming code inside swish-e then it might not stem the
same way that swish-e stems while indexing. That's the case now with
the CPAN version. Second, it only does one type of stemming, whereas
swish-e has a number of stemmers available.
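
To make the out-of-sync problem concrete, here is a rough sketch of how
one might check two stemmers against each other over a word list. It is
Python purely for illustration, not Perl and not anything from
SWISH::Stemmer or swish-e. The names find_stem_mismatches, old_stem and
new_stem are made up, and the two toy stemmers only stand in for "the
CPAN version" and "the version in the distribution".

```python
def find_stem_mismatches(words, stem_a, stem_b):
    """Return (word, stem_a(word), stem_b(word)) for each word the stemmers disagree on."""
    mismatches = []
    for word in words:
        a, b = stem_a(word), stem_b(word)
        if a != b:
            mismatches.append((word, a, b))
    return mismatches


# Hypothetical stand-ins, NOT the real stemming rules.
def old_stem(word):
    # Only strips a trailing "s".
    return word[:-1] if word.endswith("s") else word


def new_stem(word):
    # Same rule, plus a later change that maps "...ies" to "...y".
    if word.endswith("ies"):
        return word[:-3] + "y"
    return word[:-1] if word.endswith("s") else word


if __name__ == "__main__":
    for word, a, b in find_stem_mismatches(["cats", "ponies", "run"], old_stem, new_stem):
        print(f"{word}: {a!r} vs {b!r}")
```

Any word that shows up in that list would be indexed under one stem and
looked up (or highlighted) under another.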

The best solution is to use the SWISH::API module for both searching
and stemming (as Jonas Wolf posted with his patch to the highlighting
code the other day), but that won't work if you are using the swish-e
binary.

So, if SWISH::Stemmer needs to stay around, then either it needs to be
updated whenever the swish-e stemmer.c code changes (which is hard to
track), or it should become a thin wrapper around SWISH::API, with some
way in SWISH::API to provide a Stem() function that doesn't need
a swish handle. (I'm thinking out loud a bit here.)
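
As a rough picture of the thin-wrapper idea, here is a sketch in which
the wrapper holds no stemming code of its own and only loads the larger
backend on the first call. Again this is Python for illustration only;
it is not the SWISH::API interface, and ThinStemmer, load_backend and
make_counting_loader are hypothetical names. The toy backend just strips
a trailing "s".

```python
class ThinStemmer:
    """Wrapper with no stemming logic; delegates to a backend loaded on first use."""

    def __init__(self, load_backend):
        # load_backend is a zero-argument callable that returns a stem(word) function.
        self._load_backend = load_backend
        self._backend = None  # not loaded until stem() is first called

    def stem(self, word):
        if self._backend is None:
            self._backend = self._load_backend()
        return self._backend(word)


def make_counting_loader():
    """Hypothetical loader that records how many times the backend was loaded."""
    calls = {"loads": 0}

    def load_backend():
        calls["loads"] += 1
        # Toy stand-in for the real stemmer behind the wrapper.
        return lambda word: word[:-1] if word.endswith("s") else word

    return load_backend, calls


if __name__ == "__main__":
    loader, calls = make_counting_loader()
    stemmer = ThinStemmer(loader)
    print(stemmer.stem("cats"), stemmer.stem("dogs"), calls["loads"])
```

With this shape the wrapper can never drift from the backend, and code
that never stems anything never pays for loading it.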

Here's why I'm posting now: I like the idea of making SWISH::Stemmer
a wrapper around SWISH::API, but I wonder whether loading the large
SWISH::API instead of the small SWISH::Stemmer module would be a
performance issue. Does anyone know if that's an issue on modern
operating systems? That is, is the OS smart enough to only load
what's needed from the shared library?

Received on Tue Jul 20 11:26:54 2004