On Thu, Oct 05, 2006 at 07:19:47AM -0700, Peter Karman wrote:
> not strictly Swish-related, but wondering how others of you implement the 'did
> you mean...' feature in their web search apps. Do you use a custom thesaurus?
> Dictionary? etc.
I've done a dictionary lookup before using Text::Aspell. I created a
dictionary for each meta name so only words in the index would be
returned in spelling suggestions.
Hum, I've got this module floating around -- maybe it's old, as I
thought I had a version that used SWISH::API to determine "swish
I've got other code for doing spelling and re-displaying, but it's
very ugly and would take me a while to read it off the punch card
    # caches open dictionary handle
    my $speller = LII::SpellCheck->new(
        dictionary     => $dict_path,
        stopwords      => \@word_list,
        wordcharacters => $valid_word_characters,
        max_words      => $max_words_to_return,
    );

    my $words = $speller->check( $query );
    # $words is an array of hashes
This module takes a string of text and looks up words in the Aspell
dictionary pointed to by $dict_path. The words are split into "swish"
words based upon the stopwords and wordcharacters passed in.
Wordcharacters are the valid characters that can be in a word indexed
by swish.
Keep in mind that a dictionary is flat, where a swish index is really
many indexes. This has to be considered when creating the GNU Aspell
new( \%config )
The new() method returns a new object that caches an open dictionary.
The method will die on errors. This should be trapped by the caller.
Parameters are passed as a hash (or ref to a hash). All are required
except where noted.
This lists the full path to the GNU Aspell dictionary file.
dictionary => '/path/to/dictionary',
This is an array reference of stopwords -- words to ignore while
stopwords => [ $swish->HeaderValue( $index, 'stopwords' ) ],
This is a list of valid characters in the 8859-1 encoding used for
wordcharacters => $swish->HeaderValue( $index, 'wordcharacters' ),
See CAVEATS below about limitations in how words are split.
Sets the maximum number of word suggestions to return for each
incorrect word. The default is four.
This option is not required.
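
Since new() dies on errors, the caller has to trap them. A minimal
sketch of one way to do that with eval follows. Demo::SpellCheck is a
hypothetical stand-in written only so the example runs without the
real module or an Aspell dictionary; its internals are assumptions,
not the module's actual code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for LII::SpellCheck (an assumption for this sketch).
package Demo::SpellCheck;

sub new {
    my $class = shift;

    # Accept either a hash or a reference to a hash.
    my %config = @_ == 1 && ref $_[0] eq 'HASH' ? %{ $_[0] } : @_;

    for my $required (qw( dictionary stopwords wordcharacters )) {
        die "Missing required parameter '$required'\n"
            unless defined $config{$required};
    }

    # max_words is optional; the documented default is four.
    $config{max_words} = 4 unless defined $config{max_words};

    return bless {%config}, $class;
}

package main;

# Trap the die so a bad configuration does not take down the search page.
sub build_speller {
    my %config  = @_;
    my $speller = eval { Demo::SpellCheck->new(%config) };
    return ( $speller, $@ );
}

my ( $ok, $ok_error ) = build_speller(
    dictionary     => '/path/to/dictionary',
    stopwords      => [qw( the of )],
    wordcharacters => 'abcdefghijklmnopqrstuvwxyz',
);

my ( $bad, $bad_error ) = build_speller( stopwords => [] );

print "max_words defaults to $ok->{max_words}\n";
print "construction failed: $bad_error";
```

If construction fails, the search page can simply skip the spelling
suggestions and still show the search results.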
This method checks a string for words not found in the dictionary. The
string is split into words and non-words. Words that are neither
stopwords nor swish operators (and, or, not) will be checked.
Returns an array of hashes. The array is the string passed in,
tokenized into "swish_words" and non-swish_words. Each element has one
or more of the following keys:
This key is the original text from the string passed into check.
It may contain text or blank.
This is true if the word is considered a "swish_word" (i.e. is made
up of wordcharacter characters). This will include stopwords and
This is true for words that could be in the swish-e index, but
could not be spell checked because they contain non-alpha charac-
This is an array reference of word suggestions. If the array is empty,
the word was still not found in the dictionary, but the dictionary
offered no suggestions.
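
A rough sketch of how the returned structure could drive a "did you
mean" line follows. The key names used here ('word', 'swish_word' and
'suggestions') are assumptions made for illustration, as is the sample
data; substitute the module's real key names.

```perl
use strict;
use warnings;

# Hypothetical result from check(); key names and data are assumptions.
my $words = [
    { word => 'serch',   swish_word => 1, suggestions => [ 'search', 'perch' ] },
    { word => ' ' },
    { word => 'and',     swish_word => 1 },
    { word => ' ' },
    { word => 'indxing', swish_word => 1, suggestions => [] },
];

# Rebuild the query, replacing each misspelled word with its first
# suggestion. Words without suggestions and all non-word text pass
# through unchanged. Returns undef when nothing was replaced.
sub did_you_mean {
    my ($tokens) = @_;
    my ( $query, $changed ) = ( '', 0 );

    for my $token (@$tokens) {
        my $suggestions = $token->{suggestions} || [];
        if (@$suggestions) {
            $query .= $suggestions->[0];
            $changed++;
        }
        else {
            $query .= $token->{word};
        }
    }

    return $changed ? $query : undef;
}

my $suggestion = did_you_mean($words);
print "Did you mean: $suggestion\n" if defined $suggestion;
```

Keeping the non-word tokens in the array is what lets the original
spacing and punctuation survive when the query is re-displayed.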
The string of words (i.e. query) passed to the check() method has to be
converted into "swish words" before the dictionary is searched. This
means throwing out stopwords and splitting words based on how Swish-e
splits words while indexing.
Swish-e does provide a "Parsed Words" header that has the input query
converted into "swish words", but it cannot be used when searching a
stemmed index (since the parsed query is stemmed). It also means that
some words would not show up when re-displaying the query with
corrected spelling to the user.
So this module must try to emulate how swish would parse words, which
is why stopwords and wordcharacters are passed in. Unfortunately, swish
uses more than just those two items to generate "swish words", meaning
the conversion will not always match how swish parses.
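
The emulation described above can be sketched roughly as follows.
This is not the module's code: split_query() and its hash keys
('text', 'is_word', 'check') are hypothetical names, and it assumes a
non-empty wordcharacters string and a query that is already lowercase.

```perl
use strict;
use warnings;

# Split a query into runs of word characters and runs of everything else,
# then flag which word tokens would be sent to the dictionary.
sub split_query {
    my ( $query, $wordcharacters, $stopwords ) = @_;

    my $class = quotemeta $wordcharacters;    # safe inside [...]

    # Stopwords and the swish operators are never spell checked.
    my %skip = map { ( lc($_) => 1 ) } @$stopwords, qw( and or not );

    my @tokens;
    while ( $query =~ /\G([$class]+|[^$class]+)/gc ) {
        my $text    = $1;
        my $is_word = $text =~ /^[$class]/ ? 1 : 0;
        push @tokens, {
            text    => $text,
            is_word => $is_word,
            check   => ( $is_word && !$skip{ lc $text } ) ? 1 : 0,
        };
    }
    return \@tokens;
}

my $tokens = split_query(
    'the serch and index-ing',
    'abcdefghijklmnopqrstuvwxyz',
    ['the'],
);

my @to_check = map { $_->{text} } grep { $_->{check} } @$tokens;
print join( ', ', @to_check ), "\n";
```

Note that "index-ing" is split into two words here because the hyphen
is not in the wordcharacters list. Swish applies further rules when it
splits words, so a sketch like this will not always agree with the
index.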
There are two options, though. One would be to use the Parsed Words
output from Swish, but that means running a second query on a
non-stemmed index.
The other would be to expose in the swish API a method to access "swish
Bill Moseley <email@example.com>
This module is Copyright (c) 2005 Bill Moseley.
You may distribute under the terms of either the GNU General Public
License or the Artistic License, as specified in the Perl README file.
Received on Thu Oct 5 08:39:05 2006