Re: INDEX_WORDS interpretation

From: Peter Karman <karman(at)not-real.cray.com>
Date: Mon Mar 29 2004 - 18:29:08 GMT

Thanks for the help, Bill. For others who might be interested, here's 
my quick-and-dirty script for reporting the most common words in a 
swish-e index. It requires the Text::FormatTable module from CPAN, which 
I like for rendering ASCII tables.

#!/usr/bin/perl -w
#
# count instances of words in a swish-e index
# and report the NUM most frequent words
#
# usage: countwords [NUM [INDEX]]

use strict;
use Text::FormatTable;

my ($num,$index) = @ARGV;

#defaults
$index ||= 'index.swish-e';
$num ||= 50;

my $count;
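my @cmd = ('swish-e', '-f', $index, '-T', 'INDEX_WORDS');

# list-form pipe open bypasses the shell, so spaces or
# metacharacters in the index name can't break the command
open(SWISH, '-|', @cmd)
         or die "can't exec '@cmd': $!\n";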

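while(<SWISH>) {
         chomp;
         # splitting on "[N " leaves the word first,
         # then one chunk per bracketed instance
         my ($word,@insts) = split /\[\d+ /, $_ ;
         INST: for my $i (@insts) {
                 next INST if ! $i;
                 my ($doc,$cnt) = split(/\s+/,$i);
                 # skip chunks with no count field (avoids -w warnings)
                 next INST if ! defined $cnt;
                 $count->{$word}->[0] += $cnt;   # total occurrences
                 $count->{$word}->[1]++;         # instances (one per doc)
         }
}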

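# closing a pipe also waits for the child, so report a failed swish-e run
close(SWISH)
         or warn $! ? "error closing pipe: $!\n"
                    : "swish-e exited with status $?\n";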

# print results, stopping at $num
# use FormatTable for pretty ASCII

my $tbl = Text::FormatTable->new('r  l  l');
$tbl->head('word','count','unique docs');
$tbl->rule('=');
my $seen = 0;

for my $word (sort {
         $count->{$b}->[0] <=> $count->{$a}->[0]
         } keys %$count) {
         my ($cnt,$docs) = @{ $count->{$word} };
         $tbl->row($word, $cnt, $docs);
         last if ++$seen == $num;
}

print $tbl->render(60);

exit;
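
To try the parsing step without an index handy, here is a minimal 
sketch of the same tally logic run against a couple of made-up lines. 
The sample layout (word, then "[N doc count (positions)]" chunks) is an 
assumption inferred from how the script above splits each line, not 
verified swish-e output, and tally_line is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tally one line of (assumed) INDEX_WORDS-style output into $count.
# $count->{word} = [ total occurrences, number of instances ]
sub tally_line {
    my ($count, $line) = @_;
    chomp $line;
    my ($word, @insts) = split /\[\d+ /, $line;
    return unless defined $word && @insts;
    $word =~ s/^\s+|\s+$//g;          # trim padding around the word
    for my $inst (@insts) {
        # first two numbers of the chunk: document number, occurrence count
        my ($doc, $cnt) = $inst =~ /^(\d+)\s+(\d+)/ or next;
        $count->{$word}[0] += $cnt;
        $count->{$word}[1]++;
    }
    return;
}

# Hypothetical sample lines; the layout is an assumption,
# not captured from a real swish-e run.
my @sample = (
    'apple [1 7 3 (12 40 77)] [1 9 2 (5 6)]',
    'pear [1 7 1 (3)]',
);

my %count;
tally_line(\%count, $_) for @sample;

# most frequent word first
for my $word (sort { $count{$b}[0] <=> $count{$a}[0] } keys %count) {
    printf "%-8s %3d %3d\n", $word, @{ $count{$word} };
}
```

With these sample lines, "apple" is tallied as 5 occurrences across 2 
instances and "pear" as 1 occurrence in 1 instance. Matching the two 
leading numbers with a regex, rather than splitting on whitespace, keeps 
stray punctuation out of the count.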


---------------

Bill Moseley supposedly wrote on 3/29/04 10:26 AM:

> Interesting, as it seems index_words_full doesn't include the file
> number (just the file name).  Not as "full" as it should be.
> 

-- 
Peter Karman - Software Publications Engineer - Cray Inc
phone: 651-605-9009 - mailto:karman@cray.com
Received on Mon Mar 29 10:29:08 2004