Re: Mis-spelled words More refined way of doing

From: Eric Lease Morgan <emorgan(at)not-real.nd.edu>
Date: Thu Mar 18 2004 - 23:15:58 GMT
On Mar 18, 2004, at 5:06 PM, swish-e@sunsite.berkeley.edu wrote:

>> http://www.swish-e.org/Discussion/archive/2003-08/6028.html
>>
>> Some one talked about mis spelled word on this forum and the answer
>> was to use the perl module, using which look up for one of the matches
>> from systems dictionary. Here a question arises in mind that what if
>> after suggesting that word and being requested for that word in search,
>> no results found from the index file.
>
> Eh, reread that post again.  That was the point of that message --
> using a dictionary built from words that *are* in the index.

Ah, ha! I can contribute something here.

Taking the lead from Bill a number of months ago, I hacked together a 
Did You Mean function in a number of my swish-based searches. The 
technique first involves creating a dictionary of terms from the 
content of a swish index. Next, when examining the number of hits from 
a swish search, a threshold is set, and if the number of hits is less 
than the threshold I grab a number of possible other words from 
the dictionary and rebuild the initial query.

Here is some sample code. First, the process to create a dictionary:

#!/usr/bin/perl

# make-dictionary.pl - create an Aspell dictionary from a swish-e index
# Eric Lease Morgan <eric_morgan@infomotions.com>
# Thanks to Bill Moseley who inspired this hack.

# 2003/12/08 - got it working after reading Perl Cookbook
# 2003/11/27 - first investigations; Thanksgiving

# define a few constants
my $SWISH  = '/usr/local/bin/swish-e -T INDEX_WORDS_ONLY -f /usr/local/apache/htdocs/books/etc/books.idx';
my $ASPELL = '/usr/local/bin/aspell --lang=en create master /usr/local/apache/htdocs/books/etc/books.dict';


######################################################
# no configuration should be necessary below this line

# practice good programming
use strict;

# initialize the list of words
my $words = '';

# get the list of words from the index
open INPUT, "$SWISH |" or die "Can't run swish-e: $!";
while (<INPUT>) {

	chomp;                     # get rid of trailing newline
	next if (! /^[A-Za-z]+$/); # discard words containing anything but letters
	$words .= $_ . ' ';        # build list of valid words

}
close INPUT;

# create a dictionary
open OUTPUT, "| $ASPELL" or die "Can't run aspell: $!";
print OUTPUT $words;
close OUTPUT or die "aspell did not finish cleanly: $! $?";

# done; too simple!
exit;


Second, a code snippet from a search routine, specifically a suggestion:

# load the modules used below
use SWISH::API;
use Text::Aspell;

# define constants
my $INDEX      = './etc/books.idx';
my $DICTIONARY = './etc/books.dict';
my $query      = 'foo and bar';

# create swish object
my $swish = SWISH::API->new($INDEX);

# create a search object
my $search = $swish->New_Search_Object;

# search
my $results = $search->Execute($query);

# get the number of titles found
my $number_of_hits = $results->Hits;

# check for number of hits
if ( ! $number_of_hits ) {

   # initialize dictionary
   my $dictionary = Text::Aspell->new;
   $dictionary->set_option('master', $DICTIONARY);

   # parse the query
   my @query = split / /, $query;

   # initialize the new query
   my $new_query = '';

   # process each query word
   foreach my $q (@query) {

      # get suggestions
      my @suggestions = $dictionary->suggest($q);

      # build new query, keeping the original word when nothing is suggested
      $new_query .= (@suggestions ? $suggestions[0] : $q) . ' ';

   }

   # drop the trailing space and add a suggestion to the output
   $new_query =~ s/\s+$//;
   print "Did you mean: $new_query?";

}
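
The snippet above only fires when there are no hits at all. A minimal sketch of the threshold variant described earlier might look like the following. The threshold value, the function names, and the stub dictionary are all hypothetical, made up here so the sketch runs without swish-e or Aspell installed. It also assumes that "and", "or", and "not" are the only operators in the query and passes them through untouched; parentheses and metaname searches are not handled.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical threshold: suggest when a search returns fewer hits than this.
my $THRESHOLD = 5;

# Assumed boolean operators; these are passed through as-is.
my %OPERATOR = map { $_ => 1 } qw(and or not);

# Rebuild a query, replacing each term with its first suggestion, if any.
# $suggest is a code reference that takes a word and returns a list of
# candidates, e.g. sub { $dictionary->suggest(shift) } with Text::Aspell.
sub rebuild_query {
    my ($query, $suggest) = @_;
    my @terms;
    foreach my $word (split ' ', $query) {
        if ($OPERATOR{lc $word}) { push @terms, $word; next }
        my @candidates = $suggest->($word);
        push @terms, @candidates ? $candidates[0] : $word;
    }
    return join ' ', @terms;
}

# Return a "Did you mean" string, or nothing when no suggestion is warranted.
sub did_you_mean {
    my ($query, $hits, $suggest) = @_;
    return if $hits >= $THRESHOLD;
    my $new_query = rebuild_query($query, $suggest);
    return if $new_query eq $query;
    return "Did you mean: $new_query?";
}

# Stand-in for the Aspell dictionary so the sketch runs on its own.
my %FAKE = (shakespear => ['shakespeare', 'shakespeares'], hamlit => ['hamlet']);
my $stub = sub { my $w = shift; return @{ $FAKE{lc $w} || [] } };

print did_you_mean('shakespear and hamlit', 0, $stub), "\n";
```

In a real search routine the stub would be replaced with a closure around the Text::Aspell object, and the hit count would come from $results->Hits.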


I use these techniques in a number of half-baked interfaces. Try 
entering misspelled words:

   http://infomotions.com/books/
   http://infomotions.com/alex2/
   http://dewey.library.nd.edu/morgan/microforms/
   http://dewey.library.nd.edu/morgan/serials/
   http://dewey.library.nd.edu/morgan/microforms/eighteenth/

Fun!

-- 
Eric Lease Morgan
Head, Digital Access and Information Architecture Department
University Libraries of Notre Dame

(574) 631-8604
Received on Thu Mar 18 15:15:58 2004