Re: non-English characters in XML files

From: <brad(at)not-real.auroraquanta.com>
Date: Mon Nov 08 2004 - 03:24:30 GMT
Hi,

In order to get nyas to work in our multilanguage search, I had to use

<?xml version="1.0" encoding="ISO-8859-1"?>

I notice that you are using

<?xml version="1.0" encoding="UTF-8"?>

Have you tried ISO-8859-1?

Good luck! I fought with this for a long time.
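
If the XML files are generated as UTF-8, switching the declaration alone is not enough: the bytes have to be re-encoded to match. A minimal sketch of that conversion in Python, assuming the input really is UTF-8 (the function name `reencode_xml` and the sample document are made up for illustration, not part of swish-e):

```python
import re


def reencode_xml(data: bytes, src: str = "utf-8-sig", dst: str = "iso-8859-1") -> bytes:
    """Decode XML bytes from `src`, rewrite the declared encoding, encode as `dst`."""
    # "utf-8-sig" accepts UTF-8 with or without a leading byte order mark.
    text = data.decode(src)
    # Rewrite only the encoding pseudo-attribute of the XML declaration.
    text = re.sub(
        r'^(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2',
        lambda m: f"{m.group(1)}{m.group(2)}{dst.upper()}{m.group(2)}",
        text,
        count=1,
    )
    # Characters that do not exist in the target charset become numeric
    # character references (&#NNN;) instead of raising an error.
    return text.encode(dst, errors="xmlcharrefreplace")


# Hypothetical sample document, stored as UTF-8.
sample = '<?xml version="1.0" encoding="UTF-8"?>\n<nombre>Diseño</nombre>\n'.encode("utf-8")
converted = reencode_xml(sample)
print(converted)
```

After conversion the ñ is the single byte 0xF1 rather than the two-byte UTF-8 sequence 0xC3 0xB1, and the declaration agrees with the bytes that follow it.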

Brad
------------------------------------------------------------
 Brad Miele
 Technology Director
 IPNStock
 (866) 476-7862 x902
 bmiele@ipnstock.com

 You can make it illegal, but you can't make it unpopular.


On Sun, 7 Nov 2004 dasoso@alumni.uv.es wrote:

>
>  Hi all.
>
>  Ok Bill, I commented out WordCharacters.
>
>  dsorian@linux:~/swish-e-2.4.2> cat swish-e.conf
>
> IndexDir /usr/local/jakarta-tomcat-4.1.18-LE-jdk14/webapps/cocoon/webs/borrame/kk
> (test.html and test.xml are the only files in the dir)
>
> UndefinedXMLAttributes auto
> UndefinedMetaTags auto
>
> IndexOnly .xml .html .htm
>
> IndexReport 3
> ParserWarnLevel 9
>
> IndexContents XML* .xml
> IndexContents HTML2 .html .htm
>
> TranslateCharacters :ascii7:
> #WordCharacters 0123456789abcdefghijklmnñopqrstuvwxyzáéíóúàèòÇ
>
>
>
> dsorian@linux:~/swish-e-2.4.2> swish-e -c swish-e.conf -T indexed_words
>
>
>
>     Adding:[1:descripcion(18)]   'blah'   Pos:20  Stuct:0x1 ( FILE )
>     Adding:[1:idioma(10)]   'diseno'   Pos:25  Stuct:0x1 ( FILE )
>     Adding:[1:curso(12)]   'diseno'   Pos:25  Stuct:0x1 ( FILE )
>     Adding:[1:asignatura(14)]   'diseno'   Pos:25  Stuct:0x1 ( FILE )
>     Adding:[1:asignatura.nombre(15)]   'diseno'   Pos:25  Stuct:0x1 ( FILE )
>     Adding:[1:idioma(10)]   'bases'   Pos:26  Stuct:0x1 ( FILE )
>
>
>
>
> test.html - Using HTML2 parser -
>     Adding:[2:swishdefault(1)]   'disea'   Pos:2  Stuct:0x9 ( BODY FILE )
>     Adding:[2:swishdefault(1)]   'o'   Pos:3  Stuct:0x9 ( BODY FILE )
>     Adding:[2:swishdefault(1)]   'disea'   Pos:4  Stuct:0x9 ( BODY FILE )
>     Adding:[2:swishdefault(1)]   'ar'   Pos:5  Stuct:0x9 ( BODY FILE )
>     Adding:[2:swishdefault(1)]   'sea'   Pos:6  Stuct:0x9 ( BODY FILE )
>     Adding:[2:swishdefault(1)]   'ales'   Pos:7  Stuct:0x9 ( BODY FILE )
>     Adding:[2:swishdefault(1)]   'escoa'   Pos:8  Stuct:0x9 ( BODY FILE )
>     Adding:[2:swishdefault(1)]   'ado'   Pos:9  Stuct:0x9 ( BODY FILE )
>     Adding:[2:swishdefault(1)]   'matraz'   Pos:10  Stuct:0x9 ( BODY FILE )
>     Adding:[2:swishdefault(1)]   'nia'   Pos:11  Stuct:0x9 ( BODY FILE )
>     Adding:[2:swishdefault(1)]   'o'   Pos:12  Stuct:0x9 ( BODY FILE )
>  (11 words)
>
>
> dsorian@linux:~/swish-e-2.4.2> swish-e -c swish-e.conf -T parsed_words
>
>
>
> test.xml - Using XML2 parser
>
>
>
> White-space found word 'Blah.'
> White-space found word 'Dise?'  <-- the white blanks appear like a square char
> White-space found word 'de'
> White-space found word 'bases'
> White-space found word 'de'
> White-space found word 'datos'
> White-space found word '4'
> White-space found word 'Optativa'
> White-space found word 'Dise?r.'   <--- here too
> White-space found word 'segundo'
> White-space found word 'Base'
> White-space found word 'de'
> White-space found word 'datos'
> White-space found word '2'
> White-space found word 'Troncal'
>  (17 words)
>
> test.html - Using HTML2 parser -
> White-space found word 'diseño'
> White-space found word 'diseñar'
> White-space found word 'señales'
> White-space found word 'Escoñado'
> White-space found word 'matraz'
> White-space found word 'niño'
>  (11 words)
>
>
>
>  So the search for diseño in test.html works perfectly thanks to
> HTML2.
>
>  dsorian@linux:~/swish-e-2.4.2> swish-e -w diseño
> # SWISH format: 2.4.2
> # Search words: diseño
> # Removed stopwords:
> # Number of hits: 1
> # Search time: 0.001 seconds
> # Run time: 0.024 seconds
> 1000 /usr/local/jakarta-tomcat-4.1.18-LE-jdk14/webapps/cocoon/webs/borrame/kk/test.html "test.html" 78
>
>
>
> dsorian@linux:~/swish-e-2.4.2> swish-e -w 'asignatura.nombre=diseño'
> # SWISH format: 2.4.2
> # Search words: asignatura.nombre=diseño
> # Removed stopwords:
> err: no results
>
>
> dsorian@linux:~/swish-e-2.4.2> swish-e -w 'asignatura.nombre=diseno'
> # SWISH format: 2.4.2
> # Search words: asignatura.nombre=diseno
> # Removed stopwords:
> # Number of hits: 1
> # Search time: 0.001 seconds
> # Run time: 0.023 seconds
> 1000 /usr/local/jakarta-tomcat-4.1.18-LE-jdk14/webapps/cocoon/webs/borrame/kk/test.xml "test.xml" 671
>
>
>
> It seems I will not have problems searching the .html files.
>
> linux:/usr/... # head -1 test.xml
> <?xml version="1.0" encoding="UTF-8"?>
>
> You said that the searches for diseño and diseno should both match, but
> they don't. Why?
>
>
>
> Thank you.
>
> David Soriano.
>
>
>
>
>
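
The split words in the HTML output quoted above ('disea' + 'o', 'sea' + 'ales') are what one would expect if UTF-8 bytes are being tokenized one byte per character, as Latin-1. The Python sketch below reproduces that pattern. It is only an illustration under that assumption: the accent folding is a rough stand-in for `:ascii7:`, the word-character set is assumed to be a-z and 0-9, and none of it is swish-e's actual code.

```python
import re
import unicodedata


def fold_ascii7(ch: str) -> str:
    """Rough stand-in for ':ascii7:': replace an accented letter with its base letter."""
    base = unicodedata.normalize("NFKD", ch)[0]
    return base if base.isascii() else ch


def index_words(text: str, misread_as_latin1: bool) -> list[str]:
    """Fold accents and split into words, optionally misreading UTF-8 as Latin-1."""
    if misread_as_latin1:
        # Each UTF-8 byte is treated as one Latin-1 character: ñ becomes 'Ã' + '±'.
        text = text.encode("utf-8").decode("latin-1")
    folded = "".join(fold_ascii7(c) for c in text).lower()
    # Assumed word characters: ASCII letters and digits only.
    return re.findall(r"[a-z0-9]+", folded)


# Bytes misread as Latin-1: 'Ã' folds to 'a', '±' is not a word character.
print(index_words("diseño diseñar señales", misread_as_latin1=True))
# Text decoded correctly before folding: ñ folds to n.
print(index_words("diseño diseñar señales", misread_as_latin1=False))
```

Under the misreading, the two bytes of ñ behave as an 'a' followed by a word break, which matches the HTML2 entries. With correct decoding the fold yields 'diseno', which matches what the XML2 parser indexed. A query typed as diseño in a UTF-8 terminal would go through the same byte-level split, which may be why it matches the HTML entries but not the 'diseno' stored for the XML file.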
>
Received on Sun Nov 7 19:24:34 2004