Re: Problem swish-e not finding words present in index

From: John P. Rouillard <rouilj(at)>
Date: Wed Sep 03 2003 - 15:11:55 GMT
Bill Moseley writes:

>On Mon, Sep 01, 2003 at 01:35:23PM -0700, John P. Rouillard wrote:
>> Hi all:
>> This occurs with both a 2.1dev25 and a 2.4pr1 release.
>> Running:
>> /tools/swish_e-2.1dev25/bin/swish-e -w guest -TINDEX_WORDS_FULL -f hypermail.idx | less
>> I find:
>> guest
>>  Meta:10 http://XXXXX/mailing-lists/ZZZZZ/0016.html Freq:4 Pos/Struct:138/
>>  Meta:10 http://XXXXX/mailing-lists/YYYYYY/0082.html Freq:1 Pos/Struc::122
>> Running:
>> /tools/swish_e-2.1dev25/bin/swish-e -w guest -f hypermail.idx 
>> # SWISH format: 2.1-dev-25
>> # Search words: guest
>> err: no results
>> .
>> Huh?? Any idea what's happening here? The same thing happens if I use
>> /tools/swish_e-2.4.0_pr1/bin/swish-e
>Looks to me like "guest" is indexed under metaname number 10.
>  swish-e -T index_metanames
>should show you the name of meta ID 10.

Hmm, 10 is swishtitle. Weird. I wonder why it's not showing up under
swishdefault, since swishtitle should be in swishdefault (the two
should mirror each other), right?

-----> METANAMES for hypermail.idx <-----
        swishdefault : id= 1 type= 1  META_INDEX  Rank Bias=  0
       swishreccount : id= 2 type=42  META_INTERNAL META_PROP:NUMBER
           swishrank : id= 3 type=42  META_INTERNAL META_PROP:NUMBER
        swishfilenum : id= 4 type=42  META_INTERNAL META_PROP:NUMBER
         swishdbfile : id= 5 type=38  META_INTERNAL META_PROP:STRING(case:compare)
        swishdocpath : id= 6 type= 6  META_PROP:STRING(case:compare) *presorted*
          swishtitle : id= 7 type=70  META_PROP:STRING(case:ignore) *presorted*
        swishdocsize : id= 8 type=10  META_PROP:NUMBER *presorted*
   swishlastmodified : id= 9 type=18  META_PROP:DATE *presorted*
          swishtitle : id=10 type= 1  META_INDEX  Rank Bias=  0
                name : id=11 type= 1  META_INDEX  Rank Bias=  0
               email : id=12 type= 1  META_INDEX  Rank Bias=  0
                name : id=13 type=70  META_PROP:STRING(case:ignore) *presorted*
               email : id=14 type=70  META_PROP:STRING(case:ignore) *presorted*
                sent : id=15 type=18  META_PROP:DATE *presorted*
    swishdescription : id=16 type= 6  META_PROP:STRING(case:compare) *presorted*
        swishdocpath : id=17 type= 1  META_INDEX  Rank Bias=  0
               title : id=18 type= 1  META_INDEX  Rank Bias=  0 [Alias for swishtitle (10)]
                path : id=19 type= 1  META_INDEX  Rank Bias=  0 [Alias for swishdocpath (17)]
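
For longer listings, the ID-to-name lookup can be scripted. This is only a minimal Python sketch, assuming the `-T index_metanames` output keeps the `name : id=N type=T flags` layout shown above. The `parse_metanames` helper is hypothetical and is not part of swish-e.

```python
import re

# Matches lines such as:
#   "          swishtitle : id=10 type= 1  META_INDEX  Rank Bias=  0"
LINE_RE = re.compile(r"^\s*(\S+)\s*:\s*id=\s*(\d+)\s+type=\s*(\d+)\s+(.*)$")


def parse_metanames(text):
    """Return {meta_id: (name, flags)} for each metaname line in the dump."""
    table = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            name, meta_id, _type, flags = m.groups()
            table[int(meta_id)] = (name, flags.strip())
    return table


# A few lines copied from the listing above, used as sample input.
SAMPLE = """\
        swishdefault : id= 1 type= 1  META_INDEX  Rank Bias=  0
          swishtitle : id= 7 type=70  META_PROP:STRING(case:ignore) *presorted*
          swishtitle : id=10 type= 1  META_INDEX  Rank Bias=  0
               title : id=18 type= 1  META_INDEX  Rank Bias=  0 [Alias for swishtitle (10)]
"""

if __name__ == "__main__":
    for meta_id, (name, flags) in sorted(parse_metanames(SAMPLE).items()):
        # Distinguish searchable metanames from stored properties.
        kind = "index" if flags.startswith("META_INDEX") else "property"
        print(meta_id, name, kind)
```

Keying on the ID rather than the name matters here because the same name (swishtitle, name, email) appears twice, once as a searchable META_INDEX entry and once as a META_PROP property.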

What is weird is that I am seeing this on two other indexes as
well. In one case it's indexed under metaname id 11, which is also
swishtitle. I am spidering for the other two indexes, and the
hypermail program is producing valid HTML, but it's not being
indexed under swishdefault.

What should I be looking at to see why swishdefault is not being used?

I have tried:

% /tools/swish_e-2.4.0_pr1/share/doc/swish-e/examples/prog-bin/\  /data/www/mailing-lists/admin/0016.html > test.html

% /tools/swish_e-2.4.0_pr1/bin/swish-e -i test.html -T indexed_words

  Indexing Data Source: "File-System"
  Indexing "test.html"
    Adding:[1:swishdefault(1)]   'guest'   Pos:172  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'guest'   Pos:206  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'guest'   Pos:235  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'guest'   Pos:245  Stuct:0x9 ( BODY FILE )
  Removing very common words...
  no words removed.
  Writing main index...
  Sorting words ...
  Sorting 141 words alphabetically
  Writing header ...
  Writing index entries ...
    Writing word text: Complete
    Writing word hash: Complete
    Writing word data: Complete
  141 unique words indexed.
  4 properties sorted.                                              
  1 file indexed.  3085 total bytes.  241 total words.
  Elapsed time: 00:00:00 CPU time: 00:00:00
  Indexing done!

Which shows that guest is indexed under swishdefault.
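
The same check could be repeated on a trace of the full tree by tallying which metaname each occurrence of a word is added under. This is a rough Python sketch, assuming every relevant trace line has the `Adding:[file:metaname(id)] 'word'` form shown above. The `tally_metanames` helper is hypothetical and is not a swish-e tool.

```python
import re
from collections import Counter

# Matches e.g.:
#   "    Adding:[1:swishdefault(1)]   'guest'   Pos:172  Stuct:0x9 ( BODY FILE )"
ADDING_RE = re.compile(r"Adding:\[(\d+):([^(\]]+)\((\d+)\)\]\s+'([^']*)'")


def tally_metanames(trace, word):
    """Count how often `word` was added under each (metaname, id) pair."""
    counts = Counter()
    for m in ADDING_RE.finditer(trace):
        _filenum, metaname, meta_id, found = m.groups()
        if found == word:
            counts[(metaname, int(meta_id))] += 1
    return counts


# The four 'guest' lines from the trace above, used as sample input.
SAMPLE = """\
    Adding:[1:swishdefault(1)]   'guest'   Pos:172  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'guest'   Pos:206  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'guest'   Pos:235  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'guest'   Pos:245  Stuct:0x9 ( BODY FILE )
"""

if __name__ == "__main__":
    print(tally_metanames(SAMPLE, "guest"))
```

If a tally over the full run showed guest only under swishtitle(10) and never under swishdefault(1), that would narrow the problem to indexing time rather than search time.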

  % /tools/swish_e-2.4.0_pr1/bin/swish-e -w guest
  # SWISH format: 2.4.0-pr1
  # Search words: guest
  # Removed stopwords: 
  # Number of hits: 1
  # Search time: 0.001 seconds
  # Run time: 0.023 seconds
  1000 test.html "TWiki security setup." 3085

So the simple test case works, but a search for guest against the index
of the entire directory tree returns no hits. My config file is:

  IndexName "Majordomo Mailing list archives"
  IndexDescription "Index of Majordomo mailing list archives"
  IndexPointer "http://XXXXXX/mailing-lists"
  IndexAdmin "

  IndexDir /tools/swish_e-2.1dev25/lib/swish-e/progs/
  IndexFile /data/www/swish-e/hypermail.idx

  SwishProgParameters /data/www/mailing-lists/*
  ReplaceRules replace "/data/www/" "http://XXXXX/"
  MetaNames swishtitle name email
  PropertyNames name email
  PropertyNamesDate sent
  IndexContents HTML2 .html
  StoreDescription HTML2 <body> 100000
  UndefinedMetaTags  ignore

  IncludeConfigFile /home/jrouilla/develop/search/

  MetaNames swishdocpath
  MetaNameAlias swishtitle title
  MetaNameAlias swishdocpath path

The included config file is:

  # include all the available filters and mappings for
  # files that we index
  FileFilter .pdf /tools/swish_e-2.1dev25/lib/swish-e/filters/
  FileFilter .PDF /tools/swish_e-2.1dev25/lib/swish-e/filters/
  FileFilter .ppt /tools/xlhtml-0.5.1/bin/ppthtml "'%p'"
  FileFilter .doc /tools/catdoc-0.91.5/bin/catdoc "-a -s8859-1 -d8859-1 '%p'"
  FileFilter .xls /tools/xlhtml-0.5.1/bin/xlhtml "-nc '%p'"
  FileFilter .exe /usr/bin/strings "'%p'"
  FileFilter .rpm /bin/rpm "-qil -p '%p'"
  FileFilter .zip /usr/bin/unzip "-v \"%p\""
  FileFilter .ZIP /usr/bin/unzip "-v \"%p\""
  FileFilter .tar.Z  /bin/tar "-tzvf '%p'"
  FileFilter .tar.gz  /bin/tar "-tzvf '%p'"
  FileFilter .tgz  /bin/tar "-tZvf '%p'"
  FileFilter .tar  /bin/tar "-tvf '%p'"
  FileFilter .gz /bin/gunzip "-c '%p'"
  FileFilter .z  /bin/gunzip "-c '%p'"
  FileFilter .Z  /bin/gunzip "-c '%p'"
  FileFilter .ps /usr/bin/ps2ascii "'%p'"
  FileFilter .rtf /usr/bin/strings "'%p'"

  IndexContents HTML2 .pdf .ppt .PDF .xls
  IndexContents TXT2  .doc .xls .exe .zip .ZIP .tar.Z .tar.gz .tgz .tar
  IndexContents TXT2  .gz .z .Z .ps .rtf

