Re: Merged words from XML tables

From: Bill Moseley <moseley(at)>
Date: Wed Oct 27 2004 - 14:37:20 GMT
On Tue, Oct 26, 2004 at 11:42:09PM -0700, Stein-Egil Museus wrote:
> <row><entry><para>559999</para></entry><entry><para>Some text</para></entry><row>

Anyone know what the XML spec says about this?  How do you know which
tags should split text?

With HTML some tags are block level and some are inline:

moseley@laptop:~$ cat 1.html

moseley@laptop:~$ swish-e -i 1.html -T indexed_words -v0
    Adding:[1:swishdefault(1)]   'first'   Pos:7  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'secondthirdforth'   Pos:10  Stuct:0x49 ( EM BODY FILE )
    Adding:[1:swishdefault(1)]   'sixth'   Pos:13  Stuct:0x49 ( EM BODY FILE )
    Adding:[1:swishdefault(1)]   'last'   Pos:16  Stuct:0x49 ( EM BODY FILE )

For HTML, libxml2 provides a way to tell block-level tags from inline ones.
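
To make the block/inline distinction concrete, here is a minimal sketch in Python rather than swish-e's C. It does not use libxml2; the BLOCK_TAGS set, the WordSplitter class, and index_words() are all hypothetical names made up for illustration, and the tag list is deliberately incomplete. The idea is only that a block-level tag ends the current word while an inline tag does not.

```python
from html.parser import HTMLParser

# Assumed, incomplete list of block-level tags (illustration only;
# this is not the table that libxml2 or swish-e actually uses).
BLOCK_TAGS = {"p", "div", "br", "li", "td", "tr", "h1", "h2", "h3"}


class WordSplitter(HTMLParser):
    """Collect text, inserting a word break only at block-level tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def _maybe_break(self, tag):
        # Inline tags (em, b, ...) add nothing, so the text on either
        # side of them runs together into one word.
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_starttag(self, tag, attrs):
        self._maybe_break(tag)

    def handle_endtag(self, tag):
        self._maybe_break(tag)

    def handle_data(self, data):
        self.parts.append(data)


def index_words(html):
    parser = WordSplitter()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).split()


sample = "<p>first</p><em>second<b>third</b>forth</em>"
print(index_words(sample))  # ['first', 'secondthirdforth']
```

The sample string is invented for this sketch; it is not the content of 1.html.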

A quick look at src/parser.c suggests you might be able to uncomment
the "append_buffer()" call at about line 1068 if you want all tags
to be treated as block level.
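
As a rough illustration of what "all tags are block level" would mean for the quoted row, here is a small Python sketch using the standard library's ElementTree. It is not swish-e code, the two function names are hypothetical, and it assumes the row is closed with </row> so that it parses as well-formed XML.

```python
import xml.etree.ElementTree as ET


def words_all_tags_break(xml_text):
    """Treat every element boundary as a word break."""
    root = ET.fromstring(xml_text)
    words = []
    for chunk in root.itertext():
        # Each text node is split on its own, so text from
        # neighbouring elements is never joined.
        words.extend(chunk.split())
    return words


def words_no_tags_break(xml_text):
    """Treat no element boundary as a word break."""
    root = ET.fromstring(xml_text)
    # Text nodes are concatenated first, so adjacent elements merge.
    return "".join(root.itertext()).split()


row = ("<row><entry><para>559999</para></entry>"
       "<entry><para>Some text</para></entry></row>")

print(words_all_tags_break(row))  # ['559999', 'Some', 'text']
print(words_no_tags_break(row))   # ['559999Some', 'text']
```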

Bill Moseley

Received on Wed Oct 27 07:37:21 2004