Re: Merged words from XML tables

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Oct 27 2004 - 14:37:20 GMT
On Tue, Oct 26, 2004 at 11:42:09PM -0700, Stein-Egil Museus wrote:
> <row><entry><para>559999</para></entry><entry><para>Some text</para></entry><row>

Anyone know what the XML spec says about this?  How do you know which
tags should split text?

With HTML some tags are block level and some are inline:

moseley@laptop:~$ cat 1.html
<html>
<head>
<body>
<div>first</div>second<b>third</b>forth<div>sixth</div>last
</body>
</html>

moseley@laptop:~$ swish-e -i 1.html -T indexed_words -v0
    Adding:[1:swishdefault(1)]   'first'   Pos:7  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'secondthirdforth'   Pos:10  Stuct:0x49 ( EM BODY FILE )
    Adding:[1:swishdefault(1)]   'sixth'   Pos:13  Stuct:0x49 ( EM BODY FILE )
    Adding:[1:swishdefault(1)]   'last'   Pos:16  Stuct:0x49 ( EM BODY FILE )

Libxml2 provides a way to tell block-level and inline HTML tags apart.
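
To make the block/inline behaviour above concrete, here is a minimal
sketch in Python. It is not swish-e's or libxml2's code: the BLOCK_TAGS
set, the WordSplitter class, and split_words() are hypothetical names
made up for illustration, and the tag list is deliberately incomplete.
Block-level tags insert a word break; inline tags such as <b> do not.

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete set of block-level tags for illustration.
BLOCK_TAGS = {"html", "head", "body", "div", "p", "li", "tr", "td", "h1", "h2"}


class WordSplitter(HTMLParser):
    """Collect text, inserting a break only at block-level tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def _boundary(self, tag):
        # Inline tags (e.g. <b>) fall through and leave adjacent text joined.
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_starttag(self, tag, attrs):
        self._boundary(tag)

    def handle_endtag(self, tag):
        self._boundary(tag)

    def handle_data(self, data):
        self.parts.append(data)


def split_words(html):
    """Return the words found in html, split on whitespace and block tags."""
    parser = WordSplitter()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).split()


if __name__ == "__main__":
    line = "<div>first</div>second<b>third</b>forth<div>sixth</div>last"
    print(split_words(line))
```

Run on the <div> line from 1.html, this should give the same four
words that swish-e indexed, with 'secondthirdforth' kept as one word.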


From a quick look at src/parser.c, it looks like you might be able to
uncomment the "append_buffer()" call at about line 1068 if you want all
tags to be treated as block level.
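
As a rough illustration of what "all tags are block level" would mean
for the quoted table row, here is a sketch using Python's standard
xml.etree.ElementTree. It does not reproduce what src/parser.c does;
both function names are hypothetical, and the sample row assumes the
final tag was meant to be a closing </row>.

```python
import xml.etree.ElementTree as ET


def words_tags_ignored(xml_text):
    """Join text runs directly, so text from adjacent elements merges."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext()).split()


def words_all_tags_block(xml_text):
    """Put a break between text runs, treating every tag as a boundary."""
    root = ET.fromstring(xml_text)
    return " ".join(root.itertext()).split()


if __name__ == "__main__":
    # Assumption: the quoted row with its last tag closed as </row>.
    row = ("<row><entry><para>559999</para></entry>"
           "<entry><para>Some text</para></entry></row>")
    print(words_tags_ignored(row))    # merged: '559999Some' is one word
    print(words_all_tags_block(row))  # split: '559999' and 'Some' are separate
```

The trade-off is that inline markup inside a word would then also
split it, which is why HTML distinguishes the two kinds of tag.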



-- 
Bill Moseley
moseley@hank.org

Received on Wed Oct 27 07:37:21 2004