I just tried this with swish-e 2.5.2, and it appears to split words correctly at the tags:
karpet@cartermac 271% swish-e -i test.xml -v 3 -c c
Parsing config file 'c'
Indexing Data Source: "File-System"
Indexing "test.xml"
Checking file "test.xml"...
test.xml - Using XML2 parser - (3 words)
...
karpet@cartermac 272% swish-e -T index_all
-----> WORD INFO in index index.swish-e <-----
559999
Meta:1 test.xml Freq:1 Pos/Struct:5/1
some
Meta:1 test.xml Freq:1 Pos/Struct:10/1
text
Meta:1 test.xml Freq:1 Pos/Struct:11/1
karpet@cartermac 273% cat test.xml
<row><entry><para>559999</para></entry><entry><para>Some
text</para></entry></row>
karpet@cartermac 274% cat c
WordCharacters 0123456789abcdefghijklmnopqrstuvwxyz._-/$
BeginCharacters 0123456789abcdefghijklmnopqrstuvwxyz._-/
EndCharacters 0123456789abcdefghijklmnopqrstuvwxyz_-/
MinWordLimit 1
IndexContents HTML* .html
IndexContents XML* .xml
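
To illustrate the difference between the two behaviours, here is a rough Python sketch. It is not swish-e's actual parser code; the helper names (words, strip_tags_joined, strip_tags_split) are made up for this example, and it only models WordCharacters, ignoring the BeginCharacters/EndCharacters rules. It just shows that dropping tags without inserting a word break reproduces the reported '559999Some' result (lowercased here), while treating each tag as a break gives the three words seen with 2.5.2.

```python
import re

# Same set as the WordCharacters line in the config above.
WORD_CHARS = "0123456789abcdefghijklmnopqrstuvwxyz._-/$"

def words(text):
    """Return lowercased runs of WordCharacters (a simplification)."""
    pattern = "[" + re.escape(WORD_CHARS) + "]+"
    return re.findall(pattern, text.lower())

def strip_tags_joined(xml):
    """Hypothetical buggy behaviour: tags vanish, neighbours run together."""
    return re.sub(r"<[^>]+>", "", xml)

def strip_tags_split(xml):
    """Hypothetical fixed behaviour: each tag acts as a word break."""
    return re.sub(r"<[^>]+>", " ", xml)

XML = ("<row><entry><para>559999</para></entry><entry><para>Some\n"
       "text</para></entry></row>")

print(words(strip_tags_joined(XML)))  # ['559999some', 'text']
print(words(strip_tags_split(XML)))   # ['559999', 'some', 'text']
```
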
Peter Karman wrote on 10/27/04 9:16 AM:
> looks like your word position isn't getting incremented?
>
> have you tried a newer release? 2.4.0pr1 is old and that may be fixed in
> a newer version (it was "pre release" after all).
>
> Stein-Egil Museus wrote on 10/27/04 1:43 AM:
>
>
>>Hi
>>
>>I am trying to index some XML files containing tables with swish-e 2.4.0.pr1, and I get the following erroneous output.
>>
>>Here is a fragment of an XML file:
>>
>><row><entry><para>559999</para></entry><entry><para>Some text</para></entry></row>
>>
>>This produces the words '559999Some' and 'text' in the index.
>>
>>My config file looks like this:
>>
>>IndexContents HTML* .htm .html .shtml
>>
>>IndexContents XML* .xml
>>
>>IndexDir ./
>>
>>IndexOnly .html .htm .xml
>>
>>IndexFile ./text.index
>>
>>What is wrong?
>>
>>/Stein-Egil
>
>
--
Peter Karman . http://www.cray.com/craydoc/ . karman(at)not-real.cray.com
Received on Wed Oct 27 08:11:04 2004