Re: How to filter the swishdescription ??

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Jan 13 2003 - 15:45:01 GMT
On Mon, 13 Jan 2003, Michael wrote:

> so, i will try to better explain my mind :
> 
> so :
> 
> i think what is excluded from indexation isn't exclude from the
> storedescription of the document.
> am i right ?

For HTML, swish parses out the text (using libxml2 if available).  That
text is used for indexing and for storing as a property (e.g.
StoreDescription).

Normally, all text is indexed under a default metaname "swishdefault", but
you can tell swish to index some text as different metanames.  The reason
it's called "swishdefault" is because it is the default metaname to use if
the text is not indexed under another metaname.

StoreDescription is a shortcut for storing the contents of a tag under the
PropertyName "swishdescription". Typically it's used to store the contents
of the <body> tag.  

In your case <script> was within the <body> tag so its content was also
indexed and stored.

In simple terms, the parser passes swish-e some text marked as being
within a given tag.  Swish uses the settings of MetaNames, IgnoreMetaTags,
UndefinedMetaTags, PropertyNames, StoreDescription and a few others to
decide whether the text should be indexed, and under what metaname, and
whether it should be stored as a property, and under what property name.
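
To make that decision concrete, here is a rough sketch in Python.  It is
only a simplified model of the idea, not how swish-e is implemented: the
Settings class and the route() function are hypothetical names, and real
swish-e handles more cases (nested metanames, UndefinedMetaTags and so on).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Settings:
    # Hypothetical stand-ins for the config directives; not swish-e's real structures.
    meta_names: set = field(default_factory=set)        # MetaNames
    property_names: set = field(default_factory=set)    # PropertyNames
    ignore_meta_tags: set = field(default_factory=set)  # IgnoreMetaTags
    store_description_tag: Optional[str] = None         # tag named in StoreDescription


def route(tag_path: List[str], settings: Settings) -> Tuple[Optional[str], Optional[str]]:
    """Return (metaname to index under, property name to store under).

    tag_path lists the enclosing tags, outermost first.  None means
    "do not index" or "do not store" respectively.
    """
    tags = [t.lower() for t in tag_path]

    # Text inside an ignored tag is neither indexed nor stored.
    if any(t in settings.ignore_meta_tags for t in tags):
        return None, None

    # Index under the innermost declared metaname, else the default one.
    metaname = "swishdefault"
    for t in reversed(tags):
        if t in settings.meta_names:
            metaname = t
            break

    # Store under the innermost declared property, else as the description
    # if the text sits inside the StoreDescription tag.
    prop = None
    for t in reversed(tags):
        if t in settings.property_names:
            prop = t
            break
    if prop is None and settings.store_description_tag in tags:
        prop = "swishdescription"

    return metaname, prop
```

With store_description_tag set to "body" and nothing ignored, text at
html > body > script comes back as ("swishdefault", "swishdescription"),
which is the situation you are seeing.  Put "script" in ignore_meta_tags
and the same text comes back as (None, None).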

> so i have the following result :
> i exclude from indexation the text between <JAVASCRIPT> tags.  (and PHP tags
> !)
> so if i search the word 'script language' (or '<?PHP'), swish-e returns me a
> null score, so it is good.
> 
> BUT
> 
>  if this excluded text was behind <BODY> tags, it will be store (and display
> !) when swish will find other word (which was not excluded) in the same
> document. and it is my problem !

Right.  If you want to exclude text you should use the IgnoreMetaTags
directive.  For example, given this 1.html:

<HTML>
<BODY>
hello world
<SCRIPT LANGUAGE="javascript">
alert('hello again !');
</SCRIPT>
</BODY>
</HTML>

 > cat c
DefaultContents HTML2
StoreDescription HTML2 <body> 1000

 > ./swish-e -v0 -c c -i 1.html -T indexed_words properties
    Adding:[1:swishdefault(1)]   'hello'   Pos:2  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'world'   Pos:3  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'alert'   Pos:4  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'hello'   Pos:5  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'again'   Pos:6  Stuct:0x9 ( BODY FILE )
          swishdocpath: 6 (  6) S: "1.html"
          swishdocsize: 8 (  4) N: "108"
     swishlastmodified: 9 (  4) D: "2003-01-13 07:36:51"
      swishdescription:10 ( 35) S: "hello world alert('hello again !');"

So it's indexing and storing the javascript contents.

> cat c
DefaultContents HTML2
StoreDescription HTML2 <body> 1000
IgnoreMetaTags script


    Adding:[1:swishdefault(1)]   'hello'   Pos:2  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'world'   Pos:3  Stuct:0x9 ( BODY FILE )
          swishdocpath: 6 (  6) S: "1.html"
          swishdocsize: 8 (  4) N: "108"
     swishlastmodified: 9 (  4) D: "2003-01-13 07:36:51"
      swishdescription:10 ( 11) S: "hello world"

Now it's gone.
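
If you want to see the same effect outside of swish-e, the following
Python sketch mimics it with the standard library's html.parser.  It is an
illustration only, under the assumption that "description" simply means the
whitespace-collapsed text inside <body>; BodyText and description() are
made-up names and this is not swish-e's parser.

```python
from html.parser import HTMLParser

SAMPLE = """<HTML>
<BODY>
hello world
<SCRIPT LANGUAGE="javascript">
alert('hello again !');
</SCRIPT>
</BODY>
</HTML>
"""


class BodyText(HTMLParser):
    """Collect text inside <body>, skipping any tags listed in `ignore`."""

    def __init__(self, ignore=()):
        super().__init__()
        self.ignore = {t.lower() for t in ignore}
        self.in_body = False
        self.skip_depth = 0   # > 0 while inside an ignored tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag names already lowercased.
        if tag == "body":
            self.in_body = True
        elif tag in self.ignore:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.ignore and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            self.chunks.append(data)


def description(html, ignore=()):
    parser = BodyText(ignore)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace the way the stored description appears.
    return " ".join(" ".join(parser.chunks).split())


print(description(SAMPLE))                     # hello world alert('hello again !');
print(description(SAMPLE, ignore=["script"]))  # hello world
```

The two printed lines correspond to the two swishdescription values shown
above, without and with the script tag ignored.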


PHP, on the other hand, is server-side code rather than HTML, so the HTML
parser won't be able to parse it correctly, of course.


-- 
Bill Moseley moseley@hank.org
Received on Mon Jan 13 15:58:09 2003