indexing problem

From: Dave Skinner <dave(at)>
Date: Tue Mar 08 2005 - 07:24:47 GMT
#uname -a
Linux 2.6.9-1.667 #1 Tue Nov 2 14:41:25 EST 2004 i686 athlon i386 GNU/Linux
(this is fedora core3 with some rpm updates)

swish version 2.4.3

I let ./configure pick all the defaults

Everything seems to work (i.e. swish.cgi finds the search terms and creates 
links to the files correctly), except that StoreDescription is not storing the 
description, so swish.cgi cannot display the body text.

this is my configuration file

IgnoreWords file: /home/swish/stuff/stopwords.txt
MetaNames swishtitle
MetaNames swishdocpath
StoreDescription HTML* <body> 20000
PropCompressionLevel 9

A different issue? (Or a non-issue, because everything is too short to 
compress?) Changing PropCompressionLevel from 0 to 9 does not change the 
size of the created files. zlib is on the machine and configure found it.

The metanames dump suggests swish-e knows it is supposed to save the descriptions:

#/home/swish/swish-e-2.4.3/src/swish-e -T index_metanames

-----> METANAMES for index.swish-e <-----
         swishdefault : id= 1 type= 1  META_INDEX  Rank Bias=  0
        swishreccount : id= 2 type=42  META_INTERNAL META_PROP:NUMBER
            swishrank : id= 3 type=42  META_INTERNAL META_PROP:NUMBER
         swishfilenum : id= 4 type=42  META_INTERNAL META_PROP:NUMBER
          swishdbfile : id= 5 type=38  META_INTERNAL META_PROP:STRING(case:compare) SortKeyLen: 100
         swishdocpath : id= 6 type= 6  META_PROP:STRING(case:compare) SortKeyLen: 100  *presorted*
           swishtitle : id= 7 type=70  META_PROP:STRING(case:ignore) SortKeyLen: 100  *presorted*
         swishdocsize : id= 8 type=10  META_PROP:NUMBER *presorted*
    swishlastmodified : id= 9 type=18  META_PROP:DATE *presorted*
           swishtitle : id=10 type= 1  META_INDEX  Rank Bias=  0
         swishdocpath : id=11 type= 1  META_INDEX  Rank Bias=  0
     swishdescription : id=12 type= 6  META_PROP:STRING(case:compare) SortKeyLen: 100  *presorted*

I do the index with the following

/home/swish/swish-e-2.4.3/prog-bin/ \
         /mirror \
         | /home/swish/swish-e-2.4.3/src/swish-e \
         -c swish.conf \
         -v9 -S prog -i stdin

The prog-bin program produces the following for each file (it is always type HTML; 
it filters out everything except HTML before swish-e sees it):

Path-Name: file_1.html
Content-Length: 7948
Last-Mtime: 1108515537
Document-Type: HTML
<blank line>
<file body>
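
For reference, here is a minimal Python sketch of a feeder that emits one document in the layout shown above. It is not the program I actually run; the function name `format_prog_document` and the sample HTML are made up for illustration, and only the four headers from my output are used. It assumes Content-Length should be the byte length of the body.

```python
import sys


def format_prog_document(path_name: str, body: bytes, mtime: int,
                         doc_type: str = "HTML") -> bytes:
    """Return one document: headers, a blank line, then the raw body."""
    headers = [
        f"Path-Name: {path_name}",
        f"Content-Length: {len(body)}",  # byte count of the body, not characters
        f"Last-Mtime: {mtime}",
        f"Document-Type: {doc_type}",
    ]
    # A blank line separates the headers from the body.
    return ("\n".join(headers) + "\n\n").encode("utf-8") + body


if __name__ == "__main__":
    # Hypothetical sample document, only to show the call.
    html = b"<html><head><title>Title 1</title></head><body>A frog.</body></html>"
    sys.stdout.buffer.write(format_prog_document("file_1.html", html, 1108515537))
```

Writing to `sys.stdout.buffer` keeps the body as raw bytes, so the length in the header matches what is actually sent down the pipe.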

swish-e produces lots of lines like

file_1.html - Using HTML parser -  (100 words)
file_2.html   - Using HTML parser -  (98 words)

and ends with

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 56,664 words alphabetically
Writing header ...
Writing index entries ...
   Writing word text: Complete
   Writing word hash: Complete
   Writing word data: Complete
56,664 unique words indexed.
5 properties sorted.
1,302 files indexed.  19,636,528 total bytes.  1,200,429 total words.
Elapsed time: 00:00:38 CPU time: 00:00:06
Indexing done!

If I do a command line search such as

/home/swish/swish-e-2.4.3/src/swish-e -x '<swishrank>:<swishdescription>:<swishtitle>\n' -w "frog"
# SWISH format: 2.4.3
# Search words: frog
# Removed stopwords:
# Number of hits: 3
# Search time: 0.003 seconds
# Run time: 0.021 seconds
1000::Title 1
526::Title 2
526::Title 3
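
In each result line the field between the two colons, where the description should be, is empty. A rough Python sketch that checks this on output in the `rank:description:title` format used above; `parse_hits` is a hypothetical helper, and it assumes the description itself contains no colon, which would not hold for real body text in general.

```python
def parse_hits(output: str):
    """Split 'rank:description:title' result lines into tuples."""
    hits = []
    for line in output.splitlines():
        if line.startswith("#"):
            continue  # header lines such as "# Number of hits: 3"
        parts = line.split(":", 2)
        if len(parts) != 3 or not parts[0].isdigit():
            continue  # anything that is not a result line
        rank, description, title = parts
        hits.append((int(rank), description, title))
    return hits


sample = """# SWISH format: 2.4.3
# Search words: frog
# Number of hits: 3
1000::Title 1
526::Title 2
526::Title 3
"""

hits = parse_hits(sample)
# Titles of hits whose description came back empty.
missing = [title for _, description, title in hits if not description]
print(missing)
```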

Now the real question: where is the <body> text?

Received on Mon Mar 7 23:24:50 2005