Re: problem storing swishdescription

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Mar 11 2004 - 06:22:37 GMT
On Wed, Mar 10, 2004 at 02:38:31PM -0800, Kevin Lewandowski wrote:
> Hello, I'm testing Swish-e on three html files saved on my local disk. 
> Using the following config:
> 
> DefaultContents HTML2
> IndexDir /somedir/
> IndexFile /somedir/
> StoreDescription HTML2 <body> 20000
> 
> Indexing and searching works okay but with the config above the 
> swishdescription field is not stored. But if I change the document type 
> to "HTML" (in lines 1 and 4), it now stores the swishdescription. But 
> now I'm able to search against text which I've tried to prevent using the 
> <!-- index --> and <!-- noindex --> tags (previously search would not 
> find this when using the HTML2 type). Any ideas on what I'm doing wrong?

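The <!-- noindex --> / <!-- index --> comments are only honored by the
libxml2 parser (the HTML2 type), which is why that text becomes
searchable when you switch to HTML.  I can't explain why the description
isn't being stored with HTML2, though.  Post a complete example and I
can try it.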

> Also, is it possible to store the swishdescription with the <!-- 
> noindex --> sections removed? Right now it stores the entire document 
> text.

Like this?

moseley@bumby:~$ cat c
DefaultContents HTML2
StoreDescription HTML2 <body> 20000

moseley@bumby:~$ cat 1.html
<html>
<head><title>titleword</title></head>
<body>
top
<!-- noindex -->
dontindexthis
<!-- index -->
bottom
</body>
</html>

moseley@bumby:~$ swish-e -c c -i 1.html -T indexed_words properties -v0
    Adding:[1:swishdefault(1)]   'titleword'   Pos:2  Stuct:0x7 ( HEAD TITLE FILE )
    Adding:[1:swishdefault(1)]   'top'   Pos:5  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'bottom'   Pos:6  Stuct:0x9 ( BODY FILE )
          swishdocpath: 6 (  6) S: "1.html"
            swishtitle: 7 (  9) S: "titleword"
          swishdocsize: 8 (  4) N: "126"
     swishlastmodified: 9 (  4) D: "2004-03-10 20:41:23 PST"
      swishdescription:10 ( 10) S: "top bottom"
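
For anyone who wants to see the effect outside of swish-e, here is a rough
Python sketch of what the output above amounts to: text between the
noindex and index comments is dropped before the description is built.
This is only an illustration, not how swish-e or libxml2 does it
internally. The function name describe() and the regex approach are my own
assumptions; the 20000 default just mirrors the StoreDescription limit in
the config.

```python
import re

# Hypothetical helper, not part of swish-e: it mimics the visible result of
# "StoreDescription HTML2 <body> 20000" with noindex sections removed.
NOINDEX_RE = re.compile(
    r"<!--\s*noindex\s*-->.*?(?:<!--\s*index\s*-->|\Z)",
    re.IGNORECASE | re.DOTALL,
)
BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")

SAMPLE = """<html>
<head><title>titleword</title></head>
<body>
top
<!-- noindex -->
dontindexthis
<!-- index -->
bottom
</body>
</html>
"""


def describe(html, limit=20000):
    """Return the body text with noindex sections removed, cut to limit chars."""
    match = BODY_RE.search(html)
    body = match.group(1) if match else html
    # Drop everything from a noindex comment up to the next index comment
    # (or to the end of the body if indexing is never switched back on).
    body = NOINDEX_RE.sub(" ", body)
    # Strip any remaining markup and collapse runs of whitespace.
    text = TAG_RE.sub(" ", body)
    return " ".join(text.split())[:limit]


if __name__ == "__main__":
    print(describe(SAMPLE))  # prints: top bottom
```

Run against the same 1.html as above, it prints "top bottom", which matches
the swishdescription property in the swish-e output.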

-- 
Bill Moseley
moseley@hank.org
Received on Wed Mar 10 22:22:37 2004