Re: [OT] thesaurus

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Oct 05 2006 - 15:39:02 GMT
On Thu, Oct 05, 2006 at 07:19:47AM -0700, Peter Karman wrote:
> not strictly Swish-related, but wondering how others of you implement the 'did 
> you mean...' feature in their web search apps. Do you use a custom thesaurus? 
> Dictionary? etc.

I've done a dictionary lookup before using Text::Aspell.  I created a
dictionary for each meta name so only words in the index would be
returned in spelling suggestions.

Hum, I've got this module floating around -- maybe it's old, as I
thought I had a version that used SWISH::API to determine "swish
words".

I've got other code for doing spelling and re-displaying, but it's
very ugly and would take me a while to read it off the punch card
backups.



SYNOPSIS
           use LII::SpellCheck;

           # caches open dictionary handle

           my $speller = LII::SpellCheck->new(
               dictionary      => $dict_path,
               stopwords       => \@word_list,
               wordcharacters  => $valid_word_characters,
               max_words       => $max_words_to_return,
           );

           # later
           my $words = $speller->check( $query );

           # $words refers to an array of hashes

DESCRIPTION
       This module takes a string of text and looks up words in the Aspell
       dictionary pointed to by $dict_path.  The words are split into "swish"
       words based upon the stopwords and wordcharacters passed in.
       Wordcharacters are the valid characters that can be in a word indexed
       by swish.

       Keep in mind that a dictionary is flat, where a swish index is really
       many indexes.  This has to be considered when creating the GNU Aspell
       dictionary.
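
       The splitting described above could be sketched roughly as follows.
       This is not the module's actual code: split_swish_words() and the
       "skip" key are hypothetical names made up for illustration, and the
       sketch assumes a non-empty wordcharacters string.  It only shows the
       idea of cutting a query into runs of word characters and runs of
       everything else, then flagging stopwords and the swish operators.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of LII::SpellCheck): split a query into
# runs of word characters and runs of anything else.
sub split_swish_words {
    my ( $query, $wordchars, $stopwords ) = @_;

    # Escape the characters so they are safe inside a character class.
    my $class = quotemeta $wordchars;

    # Stopwords and the swish operators are never spell checked.
    my %stop = map { lc($_) => 1 } @$stopwords, qw( and or not );

    my @tokens;
    while ( $query =~ /\G(?:([$class]+)|([^$class]+))/gc ) {
        if ( defined $1 ) {
            my $word = $1;
            push @tokens, {
                word   => $word,
                isword => 1,
                skip   => $stop{ lc $word } ? 1 : 0,   # "skip" is an assumed key
            };
        }
        else {
            push @tokens, { word => $2, isword => 0, skip => 1 };
        }
    }
    return \@tokens;
}
```

       Joining the "word" values of the returned tokens gives back the
       original query, which is what makes re-displaying a corrected query
       possible.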

METHODS
       new( \%config )

       The new() method returns a new object that caches an open dictionary.
       The method will die on errors.  This should be trapped by the caller.

       Parameters are passed as a hash (or ref to a hash).  All are required
       except where noted.

       Parameters are:

       dictionary
           This is the full path to the GNU Aspell dictionary file.

               dictionary      => '/path/to/dictionary',

       stopwords
           This is an array reference of stopwords -- words to ignore while
           spelling.

               stopwords       => [ $swish->HeaderValue( $index, 'stopwords' ) ],

       wordcharacters
           This is the list of characters, in the ISO-8859-1 encoding, that
           are valid in words.

               wordcharacters  => $swish->HeaderValue( $index, 'wordcharacters' ),

           See CAVEATS below about limitations in how words are split.

       max_words
           Sets the maximum number of word suggestions to return for each
           incorrect word.  The default is four.

           This option is not required.
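
       Since new() dies on errors, a caller would typically wrap it in an
       eval block.  The sketch below shows that pattern only.  So that it
       runs on its own it uses a stand-in class, My::FakeSpellCheck, which
       is made up here and imitates nothing but the "die on error"
       behaviour; make_speller() is likewise a hypothetical name.

```perl
use strict;
use warnings;

{
    # Stand-in for LII::SpellCheck (an assumption for this sketch).
    package My::FakeSpellCheck;

    sub new {
        my ( $class, %config ) = @_;
        die "dictionary is required\n" unless defined $config{dictionary};
        return bless {%config}, $class;
    }
}

# Returns a speller object, or nothing if construction failed.
sub make_speller {
    my (%config) = @_;

    my $speller = eval { My::FakeSpellCheck->new(%config) };
    if ( !$speller ) {
        my $err = $@ || 'unknown error';
        chomp $err;
        warn "spell checking disabled: $err\n";
        return;
    }
    return $speller;
}
```

       With this shape a search page can carry on without spelling
       suggestions when the dictionary cannot be opened.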

       check

       This method checks a string for words not found in the dictionary.  The
       string is split into words and non-words.  Words that are neither
       stopwords nor swish operators (and, or, not) will be checked.

       Returns an array of hashes.  The array is the string passed in,
       tokenized into "swish_words" and non-swish_words.  Each element has
       one or more of the following keys:

       word
           This key is the original text from the string passed into check.
           It may contain text or blank.

       isword
           This is true if the word is considered a "swish_word" (i.e. is made
           up of wordcharacter characters).  This will include stopwords and
           swish-e operators.

       unknown
           This is true for words that could be in the swish-e index, but
           could not be spell checked because they contain non-alpha
           characters.

       suggestions
           This is an array reference of word suggestions.  If the array is
           empty, the word was not found in the dictionary and the dictionary
           offered no suggestions.
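
       As an illustration of how the keys above might be used, here is a
       small sketch that rebuilds a "did you mean" query from the returned
       structure.  The did_you_mean() helper and the sample data are made up
       for this example; the data is hand-written to match the keys
       described above and assumes that correctly spelled words carry no
       "suggestions" key.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each misspelled word with its first
# suggestion.  Returns undef when nothing would change.
sub did_you_mean {
    my ($tokens) = @_;
    my $changed = 0;
    my @parts;

    for my $t (@$tokens) {
        my $s = $t->{suggestions};
        if ( $t->{isword} && $s && @$s ) {
            push @parts, $s->[0];    # take the top suggestion
            $changed = 1;
        }
        else {
            # Non-words, known words, and words without suggestions
            # are passed through unchanged.
            push @parts, defined $t->{word} ? $t->{word} : '';
        }
    }
    return $changed ? join( '', @parts ) : undef;
}

# Hand-written sample shaped like the documented return value.
my $tokens = [
    { word => 'teh',   isword => 1, suggestions => [ 'the', 'ten' ] },
    { word => ' ' },
    { word => 'quick', isword => 1 },
    { word => ' ' },
    { word => 'x86',   isword => 1, unknown => 1 },
    { word => ' ' },
    { word => 'zzxq',  isword => 1, suggestions => [] },
];

my $suggestion = did_you_mean($tokens);
print "Did you mean: $suggestion\n" if defined $suggestion;
```

       Only the first suggestion is used here; a real page might offer
       several, up to max_words per word.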

CAVEATS
       The string of words (i.e. query) passed to the check() method has to be
       converted into "swish words" before the dictionary is searched.  This
       means throwing out stopwords and splitting words based on how Swish-e
       splits words while indexing.

       Swish-e does provide a "Parsed Words" header that has the input query
       converted into "swish words", but it cannot be used when searching a
       stemmed index (since the parsed query is stemmed).  Using it would
       also mean that some words would not show up when re-displaying the
       query with corrected spelling to the user.

       So, this module must try to emulate how swish would parse words, which
       is why stopwords and wordcharacters are passed in.  Unfortunately,
       swish uses more than just those two items to generate "swish words",
       meaning the conversion will not always match how swish parses.

       There are two options, though.  One would be to use the Parsed Words
       output from Swish, but that means running a second query on a
       non-stemmed index.  The other would be to expose a method in the swish
       API to access "swish words".

AUTHOR
       Bill Moseley <moseley@hank.org>

COPYRIGHT
       This module is Copyright (c) 2005 Bill Moseley.

       You may distribute under the terms of either the GNU General Public
       License or the Artistic License, as specified in the Perl README file.



-- 
Bill Moseley
moseley@hank.org

Received on Thu Oct 5 08:39:05 2006