Re: [swish-e] "-S prog" mashing up words in HTML files

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Mar 20 2007 - 21:31:39 GMT
On Tue, Mar 20, 2007 at 04:15:32PM -0500, Matthew Stanislawski wrote:
> ...loading dock)<br/></td></tr><tr><td>&nbsp;H5</td><td>DCL Hallway...
> 
> 
> White-space found word 'dock)H5DCL'
>      Adding:[120:swishdefault(1)]   'dock'   Pos:289  Stuct:0x1 ( FILE )
>      Adding:[120:details(13)]   'dock'   Pos:289  Stuct:0x1 ( FILE )
>      Adding:[120:swishdefault(1)]   'h5dcl'   Pos:290  Stuct:0x1 ( FILE )
>      Adding:[120:details(13)]   'h5dcl'   Pos:290  Stuct:0x1 ( FILE )

Hmm, I can't seem to duplicate it.  Can you try these -- and/or put the
output from your script someplace?

You might turn up ParserWarnLevel to see if the parser is getting confused.

How are you specifying "details" as a metaname?

Maybe a problem with your version of libxml2?

moseley@bumby:~$ cat test.html
<html>
<head>
<title>hello</title>
</head>
<body>
<table>
    <tr>
        <tr><td>&nbsp;H5</td><td>DCL Hallway</td>
    </tr>
</table>
</body>
</html>

moseley@bumby:~$ swish-e -v0 -T indexed_words -i test.html
    Adding:[1:swishdefault(1)]   'hello'   Pos:5  Stuct:0x7 ( HEAD TITLE FILE )
    Adding:[1:swishdefault(1)]   'h5'   Pos:16  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'dcl'   Pos:19  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'hallway'   Pos:20  Stuct:0x9 ( BODY FILE )


Maybe it has something to do with -S prog??  Nope:

moseley@bumby:~$ /usr/local/lib/swish-e/spider.pl default file:///home/moseley/test.html 2>/dev/null | swish-e -S prog -i stdin -T indexed_words -v 0
    Adding:[1:swishdefault(1)]   'hello'   Pos:5  Stuct:0x7 ( HEAD TITLE FILE )
    Adding:[1:swishdefault(1)]   'h5'   Pos:16  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'dcl'   Pos:19  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'hallway'   Pos:20  Stuct:0x9 ( BODY FILE )
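
For reference, the mashed word in the quoted debug output ('dock)H5DCL') is what you would get if the markup between the two table cells were dropped with nothing put in its place. The following is only a rough sketch of that effect using sed and tr on the quoted fragment. It is not a description of what swish-e or libxml2 do internally, and the variable names are made up for illustration.

```shell
# Fragment copied from the quoted message (leading/trailing "..." left out).
fragment='loading dock)<br/></td></tr><tr><td>&nbsp;H5</td><td>DCL Hallway'

# Case 1: delete tags and &nbsp; outright, so text on either side
# of the markup runs together.
joined=$(printf '%s\n' "$fragment" | sed -e 's/<[^>]*>//g' -e 's/&nbsp;//g')

# Case 2: replace each tag and &nbsp; with a space, so text on either
# side of the markup stays separated.
spaced=$(printf '%s\n' "$fragment" | sed -e 's/<[^>]*>/ /g' -e 's/&nbsp;/ /g')

# Split each result into whitespace-delimited words, one per line.
joined_words=$(printf '%s\n' "$joined" | tr -s ' ' '\n')
spaced_words=$(printf '%s\n' "$spaced" | tr -s ' ' '\n')

echo "tags deleted:"
printf '%s\n' "$joined_words"
echo "tags replaced by spaces:"
printf '%s\n' "$spaced_words"
```

In the first case the word list contains 'dock)H5DCL', matching the "White-space found word" line in the report. In the second case 'dock)', 'H5' and 'DCL' come out as separate words, which is closer to what the test runs above show.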

-- 
Bill Moseley
moseley@hank.org

Received on Tue Mar 20 17:31:39 2007