Re: non-English characters in XML files

From: <dasoso(at)not-real.alumni.uv.es>
Date: Sun Nov 07 2004 - 21:03:24 GMT
 
 Hi all. 
  
 OK Bill, I commented out WordCharacters. 
 
 dsorian@linux:~/swish-e-2.4.2> cat swish-e.conf 
 
IndexDir /usr/local/jakarta-tomcat-4.1.18-LE-jdk14/webapps/cocoon/webs/borrame/kk  
(test.html and test.xml are the only files in the dir) 
 
UndefinedXMLAttributes auto 
UndefinedMetaTags auto 
 
IndexOnly .xml .html .htm 
 
IndexReport 3 
ParserWarnLevel 9 
 
IndexContents XML* .xml 
IndexContents HTML2 .html .htm 
 
TranslateCharacters :ascii7: 
#WordCharacters 0123456789abcdefghijklmnñopqrstuvwxyzáéíóúàèòÇ 
 
   
 
dsorian@linux:~/swish-e-2.4.2> swish-e -c swish-e.conf -T indexed_words 
 
 
 
    Adding:[1:descripcion(18)]   'blah'   Pos:20  Stuct:0x1 ( FILE ) 
    Adding:[1:idioma(10)]   'diseno'   Pos:25  Stuct:0x1 ( FILE ) 
    Adding:[1:curso(12)]   'diseno'   Pos:25  Stuct:0x1 ( FILE ) 
    Adding:[1:asignatura(14)]   'diseno'   Pos:25  Stuct:0x1 ( FILE ) 
    Adding:[1:asignatura.nombre(15)]   'diseno'   Pos:25  Stuct:0x1 ( FILE ) 
    Adding:[1:idioma(10)]   'bases'   Pos:26  Stuct:0x1 ( FILE ) 
 
 
 
 
test.html - Using HTML2 parser - 
    Adding:[2:swishdefault(1)]   'disea'   Pos:2  Stuct:0x9 ( BODY FILE ) 
    Adding:[2:swishdefault(1)]   'o'   Pos:3  Stuct:0x9 ( BODY FILE ) 
    Adding:[2:swishdefault(1)]   'disea'   Pos:4  Stuct:0x9 ( BODY FILE ) 
    Adding:[2:swishdefault(1)]   'ar'   Pos:5  Stuct:0x9 ( BODY FILE ) 
    Adding:[2:swishdefault(1)]   'sea'   Pos:6  Stuct:0x9 ( BODY FILE ) 
    Adding:[2:swishdefault(1)]   'ales'   Pos:7  Stuct:0x9 ( BODY FILE ) 
    Adding:[2:swishdefault(1)]   'escoa'   Pos:8  Stuct:0x9 ( BODY FILE ) 
    Adding:[2:swishdefault(1)]   'ado'   Pos:9  Stuct:0x9 ( BODY FILE ) 
    Adding:[2:swishdefault(1)]   'matraz'   Pos:10  Stuct:0x9 ( BODY FILE ) 
    Adding:[2:swishdefault(1)]   'nia'   Pos:11  Stuct:0x9 ( BODY FILE ) 
    Adding:[2:swishdefault(1)]   'o'   Pos:12  Stuct:0x9 ( BODY FILE ) 
 (11 words) 
 
 
dsorian@linux:~/swish-e-2.4.2> swish-e -c swish-e.conf -T parsed_words 
 
 
  
test.xml - Using XML2 parser 
 
 
 
White-space found word 'Blah.' 
White-space found word 'Dise?'  <--- the unprintable character shows up as a square in my terminal 
White-space found word 'de' 
White-space found word 'bases' 
White-space found word 'de' 
White-space found word 'datos' 
White-space found word '4' 
White-space found word 'Optativa' 
White-space found word 'Dise?r.'   <--- here too 
White-space found word 'segundo' 
White-space found word 'Base' 
White-space found word 'de' 
White-space found word 'datos' 
White-space found word '2' 
White-space found word 'Troncal' 
 (17 words) 
  
test.html - Using HTML2 parser - 
White-space found word 'diseño' 
White-space found word 'diseñar' 
White-space found word 'señales' 
White-space found word 'Escoñado' 
White-space found word 'matraz' 
White-space found word 'niño' 
 (11 words) 
  
  
   
 So the search for diseño in test.html works perfectly thanks to 
HTML2. 
  
 dsorian@linux:~/swish-e-2.4.2> swish-e -w diseño 
# SWISH format: 2.4.2 
# Search words: diseño 
# Removed stopwords: 
# Number of hits: 1 
# Search time: 0.001 seconds 
# Run time: 0.024 seconds 
1000 /usr/local/jakarta-tomcat-4.1.18-LE-jdk14/webapps/cocoon/webs/borrame/kk/test.html "test.html" 78 
 
 
 
dsorian@linux:~/swish-e-2.4.2> swish-e -w 'asignatura.nombre=diseño' 
# SWISH format: 2.4.2 
# Search words: asignatura.nombre=diseño 
# Removed stopwords: 
err: no results 
 
 
dsorian@linux:~/swish-e-2.4.2> swish-e -w 'asignatura.nombre=diseno' 
# SWISH format: 2.4.2 
# Search words: asignatura.nombre=diseno 
# Removed stopwords: 
# Number of hits: 1 
# Search time: 0.001 seconds 
# Run time: 0.023 seconds 
1000 /usr/local/jakarta-tomcat-4.1.18-LE-jdk14/webapps/cocoon/webs/borrame/kk/test.xml "test.xml" 671 
 
 
 
It seems I will not have problems searching the .html files. 
 
linux:/usr/... # head -1 test.xml 
<?xml version="1.0" encoding="UTF-8"?> 
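
The declaration says UTF-8, but the square character in the parsed_words output makes it worth confirming that the bytes in the file really are UTF-8. A minimal sketch of such a check in Python follows; `decodes_as` is a hypothetical helper written for this example and is not part of swish-e.

```python
import sys


def decodes_as(data: bytes, encoding: str = "utf-8") -> bool:
    """Return True if the raw bytes are valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (assumed invocation): python check_encoding.py test.xml
    with open(sys.argv[1], "rb") as fh:
        raw = fh.read()
    print("valid UTF-8" if decodes_as(raw) else "NOT valid UTF-8")
```

If the file turns out not to be valid UTF-8 (for example, it was saved as ISO-8859-1), the declared encoding and the real one disagree, and the parser may mangle the accented characters.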
 
You said that the searches for diseño and diseno should both match, but 
they don't. Why? 
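
For both spellings to match, the indexed word and the query word would have to be folded to the same ASCII form. The sketch below only illustrates that idea in Python; it is not how swish-e implements `TranslateCharacters :ascii7:`, and `fold_to_ascii` is a hypothetical helper.

```python
import unicodedata


def fold_to_ascii(word: str) -> str:
    """Lowercase a word and strip its diacritics (ñ -> n, á -> a)."""
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", word.lower())
    # Drop the combining marks and keep the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    # Both the indexed word and the query fold to the same string.
    print(fold_to_ascii("diseño"))  # prints "diseno"
    print(fold_to_ascii("diseno"))  # prints "diseno"
```

If folding were applied only at index time, the index would hold 'diseno' while the query stayed 'diseño', which would give the mismatch shown in the results above.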
 
 
 
Thank you. 
 
David Soriano. 
 
 
  
Received on Sun Nov 7 13:03:32 2004