Re: Merged words from XML tables

From: Peter Karman <karman(at)>
Date: Wed Oct 27 2004 - 15:11:03 GMT
I just tried this with 2.5.2 and it appears to split on the tags:

karpet@cartermac 271% swish-e -i test.xml -v 3 -c c
Parsing config file 'c'
Indexing Data Source: "File-System"
Indexing "test.xml"

Checking file "test.xml"...
   test.xml - Using XML2 parser -  (3 words)


karpet@cartermac 272% swish-e -T index_all

-----> WORD INFO in index index.swish-e <-----

  Meta:1 test.xml Freq:1 Pos/Struct:5/1

  Meta:1 test.xml Freq:1 Pos/Struct:10/1

  Meta:1 test.xml Freq:1 Pos/Struct:11/1

karpet@cartermac 273% cat test.xml

karpet@cartermac 274% cat c
WordCharacters 0123456789abcdefghijklmnopqrstuvwxyz._-/$
BeginCharacters 0123456789abcdefghijklmnopqrstuvwxyz._-/
EndCharacters 0123456789abcdefghijklmnopqrstuvwxyz_-/
MinWordLimit 1
IndexContents HTML* .html
IndexContents XML* .xml
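
For illustration only, here is a rough Python sketch of the difference between the two behaviours: treating tags as word boundaries (what 2.5.2 appears to do above) versus gluing adjacent text nodes together (what the quoted report below describes). This is not swish-e code. The function names are made up, the regex only loosely mirrors the WordCharacters line in config 'c', and the fragment assumes the final tag in the quoted report was meant to be a closing </row>.

```python
import re
import xml.etree.ElementTree as ET

# Fragment from the quoted report; the closing </row> is an assumption,
# since the fragment as posted ends with an opening <row>.
FRAGMENT = (
    "<row><entry><para>559999</para></entry>"
    "<entry><para>Some text</para></entry></row>"
)

# Hypothetical tokenizer: loosely mirrors the WordCharacters set in config 'c'.
WORD_RE = re.compile(r"[0-9A-Za-z._/$-]+")


def words_merging_across_tags(xml_text):
    """Concatenate text nodes with no separator, so tags are not boundaries."""
    root = ET.fromstring(xml_text)
    text = "".join(root.itertext())
    return WORD_RE.findall(text)


def words_splitting_on_tags(xml_text):
    """Join text nodes with a space, so every tag acts as a word boundary."""
    root = ET.fromstring(xml_text)
    text = " ".join(root.itertext())
    return WORD_RE.findall(text)


if __name__ == "__main__":
    print(words_merging_across_tags(FRAGMENT))  # ['559999Some', 'text']
    print(words_splitting_on_tags(FRAGMENT))    # ['559999', 'Some', 'text']
```

The second function yields three words, which matches the "(3 words)" count in the transcript above.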

Peter Karman wrote on 10/27/04 9:16 AM:

> Looks like your word position isn't getting incremented.
> Have you tried a newer release? 2.4.0pr1 is old, and that may be fixed in
> a newer version (it was a "pre-release" after all).
>
> Stein-Egil Museus wrote on 10/27/04 1:43 AM:
>>I am trying to index some XML files containing tables with swish-e 2.4.0.pr1, and I get the following erroneous output.
>>Here is a fragment of an XML file:
>><row><entry><para>559999</para></entry><entry><para>Some text</para></entry></row>
>>This puts the words '559999Some' and 'text' in the index.
>>My config file looks like this:
>>IndexContents HTML* .htm .html .shtml
>>IndexContents XML* .xml
>>IndexDir ./
>>IndexOnly .html .htm .xml
>>IndexFile ./text.index
>>What is wrong?

Peter Karman  . .  karman(at)
Received on Wed Oct 27 08:11:04 2004