Re: non-English characters in XML files

From: Bill Moseley <moseley(at)>
Date: Sat Nov 06 2004 - 15:00:45 GMT
On Sat, Nov 06, 2004 at 06:39:39AM -0800, wrote:
> No Bill, it doesn't work.  

Ok, look carefully:

> WordCharacters 0123456789abcdefghijklmnñopqrstuvwxyzáéíóúàèòÇ  

Why are you setting WordCharacters?

> Indexing "test.xml"  
>  Adding:[2:descripcion(18)]   'blah'   Pos:20  Stuct:0x1 ( FILE )  
>     Adding:[2:idioma(10)]   'dise'   Pos:25  Stuct:0x1 ( FILE )  
>     Adding:[2:curso(12)]   'dise'   Pos:25  Stuct:0x1 ( FILE )  
>     Adding:[2:asignatura(14)]   'dise'   Pos:25  Stuct:0x1 ( FILE )  
>     Adding:[2:asignatura.nombre(15)]   'dise'   Pos:25  Stuct:0x1 ( FILE )  
>     Adding:[2:idioma(10)]   'o'   Pos:26  Stuct:0x1 ( FILE )  
>     Adding:[2:curso(12)]   'o'   Pos:26  Stuct:0x1 ( FILE )  
>     Adding:[2:asignatura(14)]   'o'   Pos:26  Stuct:0x1 ( FILE )  
>     Adding:[2:asignatura.nombre(15)]   'o'   Pos:26  Stuct:0x1 ( FILE )  
>     Adding:[2:idioma(10)]   'bases'   Pos:27  Stuct:0x1 ( FILE )  

Do you see diseño anywhere?

You can see that diseño is being split into dise and o.  That
indicates that ñ is not in WordCharacters as swish reads it,
regardless of what your posted config says.

Comment out WordCharacters.  My config below does not set it, and
diseño is indexed as one word.
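
One way this can happen, offered as a guess and not a diagnosis of your
setup: if the config file itself was saved as UTF-8, the ñ in
WordCharacters is two bytes (0xC3 0xB1), while the ñ in the converted
8859-1 text is the single byte 0xF1.  A minimal Python sketch of that
byte-level mismatch (it only compares bytes and does not reproduce
swish-e's own parsing):

```python
# Illustration only: compares raw bytes, not swish-e's actual code path.
word_chars = "abcdefghijklmnñopqrstuvwxyz"

# Assumption: the config file was saved as UTF-8, so these are the
# bytes a byte-oriented reader would see for WordCharacters.
config_bytes = set(word_chars.encode("utf-8"))

# The word as it exists after conversion to ISO-8859-1 (one byte per char).
word_bytes = "diseño".encode("iso-8859-1")

# Bytes of the word that are not covered by the configured characters.
missing = [hex(b) for b in word_bytes if b not in config_bytes]
print(missing)
```

If the list is not empty, the word gets split at those bytes even
though the config looks right in an editor.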


> dsorian@linux:~/swish-e-2.4.2> head -1 test.xml
> <?xml version="1.0" encoding="UTF-8"?>

Is your input file really encoded in utf-8?  I used iso-8859-1 in my example.

Again, libxml2 parses the input XML file, and it must know the
correct encoding of your file.  libxml2 then passes the text (as
UTF-8) to swish, which converts it to ISO-8859-1.
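
A quick way to test whether a file's bytes agree with its XML
declaration is to try decoding them in the declared encoding.  The
sketch below is plain Python, not part of swish-e or libxml2, and the
helper names are made up for illustration.  Note that it can only catch
a wrong "UTF-8" label: every byte sequence decodes as ISO-8859-1, so a
wrong 8859-1 label passes silently.

```python
import re

def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or the default."""
    match = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', data)
    return match.group(1).decode("ascii") if match else default

def decodes_as_declared(data: bytes) -> bool:
    """True if the bytes are valid in the encoding the declaration claims."""
    try:
        data.decode(declared_encoding(data))
        return True
    except (UnicodeDecodeError, LookupError):
        return False

# Same text saved as ISO-8859-1, once labeled UTF-8 and once labeled correctly.
mislabeled = '<?xml version="1.0" encoding="UTF-8"?><a>Diseño</a>'.encode("iso-8859-1")
honest = '<?xml version="1.0" encoding="iso-8859-1"?><a>Diseño</a>'.encode("iso-8859-1")

print(decodes_as_declared(mislabeled))
print(decodes_as_declared(honest))
```

For a real file you would read it with open(path, "rb").read() and pass
the bytes in.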

-T indexed_words shows the words swish indexes after splitting on
WordCharacters and applying IgnoreFirstChar/IgnoreLastChar and
BeginCharacters/EndCharacters.

-T parsed_words will show you the "white-space" delimited words before
WordCharacters and friends are applied.
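
The difference between the two stages can be modeled roughly like this.
It is a simplified Python sketch, not swish-e's implementation: the
lowercasing in the first stage is an assumption, the function names only
mirror the -T options, and IgnoreFirst/Last and Begin/EndChars are left
out.

```python
WORD_CHARS = set("0123456789abcdefghijklmnopqrstuvwxyz")  # note: no ñ

def parsed_words(text):
    """Stage 1: whitespace-delimited words (lowercased here for simplicity)."""
    return text.lower().split()

def indexed_words(text, word_chars=WORD_CHARS):
    """Stage 2: split each parsed word on characters outside word_chars."""
    words = []
    for token in parsed_words(text):
        current = ""
        for ch in token:
            if ch in word_chars:
                current += ch
            elif current:
                # A non-word character ends the current word.
                words.append(current)
                current = ""
        if current:
            words.append(current)
    return words

text = "Diseño de bases de datos"
print(parsed_words(text))
print(indexed_words(text))                      # ñ missing: word is split
print(indexed_words(text, WORD_CHARS | {"ñ"}))  # ñ present: word stays whole
```

So if parsed_words shows diseño but indexed_words shows dise and o, the
split is happening at the WordCharacters stage.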

moseley@laptop:~$ cat c
ParserWarnLevel 9
UndefinedXMLAttributes auto
UndefinedMetaTags auto

IndexOnly .xml .html .htm

IndexContents XML* .xml
IndexContents HTML2 .html .htm

moseley@laptop:~$ cat 1.xml 
<?xml version="1.0" encoding="iso-8859-1" ?>
<!DOCTYPE order SYSTEM "pedido.dtd">
<Idioma tipo="Castellano">
   <curso numero="quinto">
        <asignatura nombre="IPI" codigo="1">
            <tipo> Troncal</tipo>
            <descripcion> Blah blah</descripcion>
        </asignatura>
        <asignatura nombre="Diseño de bases de datos" codigo="4">
            <tipo> Optativa</tipo>
            <descripcion> Diseñar.</descripcion>
        </asignatura>
   </curso>
   <curso numero="segundo">
        <asignatura nombre="Base de datos" codigo="2">
            <tipo> Troncal </tipo>
            <descripcion> </descripcion>
        </asignatura>
   </curso>
</Idioma>

moseley@laptop:~$ swish-e -c c -i 1.xml -T indexed_words -v0 | grep diseño
    Adding:[1:idioma(10)]   'diseño'   Pos:26  Stuct:0x1 ( FILE )
    Adding:[1:curso(12)]   'diseño'   Pos:26  Stuct:0x1 ( FILE )
    Adding:[1:asignatura(14)]   'diseño'   Pos:26  Stuct:0x1 ( FILE )
    Adding:[1:asignatura.nombre(15)]   'diseño'   Pos:26  Stuct:0x1 ( FILE )

moseley@laptop:~$ swish-e -w asignatura.nombre=diseño 
# SWISH format: 2.5.2
# Search words: asignatura.nombre=diseño
# Removed stopwords: 
# Number of hits: 1
# Search time: 0.005 seconds
# Run time: 0.037 seconds
1000 1.xml "1.xml" 677

Bill Moseley

Received on Sat Nov 6 07:00:45 2004