Re: non-English characters in XML files

From: <dasoso(at)not-real.alumni.uv.es>
Date: Thu Nov 04 2004 - 15:15:54 GMT
 
 
> On Mon, Nov 01, 2004 at 03:40:06AM -0800, dasoso@alumni.uv.es wrote:
> > My .conf file looks like this:  
> >   
> > UndefinedXMLAttributes auto  
> > UndefinedMetaTags auto  
>  
> Sure you want to do that?  Seems like you will be creating a lot
> of metanames.
 
For the moment I want to index all the metanames. 
 
>  
> To find out why you are getting no results first use:
>
>   swish-e -c config -i test.html test.xml -T indexed_words
>
> and you will notice something odd.  Indexing stops in the middle of
> the XML file.
>
> Then to find out why the parser stopped processing the file turn on:
 
 
I tried it, but I don't see anything odd. Here is what I get. It seems
that every word is indexed. The only problem is that diseño is
indexed as diseno.
I have ParserWarnLevel 9 set in the config file.
 
This is the test.html file: 
<html> 
<body> 
diseño
señales
niño
perro
leña
piña
</body> 
</html> 
 
and the test.xml: 
 
<?xml version="1.0" standalone="no" ?> 
<!DOCTYPE order SYSTEM "pedido.dtd"> 
<Idioma tipo="Castellano"> 
   <curso numero="quinto"> 
        <asignatura nombre="IPI" codigo="1"> 
            <tipo> Troncal</tipo> 
            <descripcion> Blah.</descripcion> 
        </asignatura> 
 
        <asignatura nombre="Diseño de bases de datos" codigo="4">
            <tipo> Optativa</tipo>
            <descripcion> Diseñar.</descripcion>
        </asignatura> 
   </curso> 
 
   <curso numero="segundo"> 
        <asignatura nombre="Base de datos" codigo="2"> 
            <tipo> Obligatoria </tipo> 
            <descripcion> </descripcion> 
        </asignatura> 
   </curso> 
 
</Idioma> 
 
 
 
 
dsorian@linux:~/swish-e-2.4.2> swish-e -c swish-e.conf -i test.html test.xml -T indexed_words
 
Indexing Data Source: "File-System" 
Indexing "test.html" 
 
Checking file "test.html"... 
  test.html - Using HTML2 parser -     Adding:[1:swishdefault(1)]   'disea'   Pos:2  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'o'   Pos:3  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'sea'   Pos:4  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'ales'   Pos:5  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'nia'   Pos:6  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'o'   Pos:7  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'perro'   Pos:8  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'lea'   Pos:9  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'pia'   Pos:10  Stuct:0x9 ( BODY FILE )
 (9 words) 
Indexing "test.xml" 
 
Checking file "test.xml"... 
  test.xml - Using XML2 parser - **Adding automatic MetaName 'idioma' found in file 'test.xml'
**Adding automatic MetaName 'idioma.tipo' found in file 'test.xml'
    Adding:[2:idioma(10)]   'castellano'   Pos:3  Stuct:0x1 ( FILE )
    Adding:[2:idioma.tipo(11)]   'castellano'   Pos:3  Stuct:0x1 ( FILE )
**Adding automatic MetaName 'curso' found in file 'test.xml'
**Adding automatic MetaName 'curso.numero' found in file 'test.xml'
    Adding:[2:idioma(10)]   'quinto'   Pos:7  Stuct:0x1 ( FILE )
    Adding:[2:curso(12)]   'quinto'   Pos:7  Stuct:0x1 ( FILE )
    Adding:[2:curso.numero(13)]   'quinto'   Pos:7  Stuct:0x1 ( FILE )
**Adding automatic MetaName 'asignatura' found in file 'test.xml'
**Adding automatic MetaName 'asignatura.nombre' found in file 'test.xml'
    Adding:[2:idioma(10)]   'ipi'   Pos:11  Stuct:0x1 ( FILE )
    Adding:[2:curso(12)]   'ipi'   Pos:11  Stuct:0x1 ( FILE )
    Adding:[2:asignatura(14)]   'ipi'   Pos:11  Stuct:0x1 ( FILE )
    Adding:[2:asignatura.nombre(15)]   'ipi'   Pos:11  Stuct:0x1 ( FILE )
**Adding automatic MetaName 'asignatura.codigo' found in file 'test.xml'
    Adding:[2:idioma(10)]   '1'   Pos:14  Stuct:0x1 ( FILE )
    Adding:[2:curso(12)]   '1'   Pos:14  Stuct:0x1 ( FILE )
    Adding:[2:asignatura(14)]   '1'   Pos:14  Stuct:0x1 ( FILE )
    Adding:[2:asignatura.codigo(16)]   '1'   Pos:14  Stuct:0x1 ( FILE )
**Adding automatic MetaName 'tipo' found in file 'test.xml'
    Adding:[2:idioma(10)]   'troncal'   Pos:17  Stuct:0x1 ( FILE )
    Adding:[2:curso(12)]   'troncal'   Pos:17  Stuct:0x1 ( FILE )
    Adding:[2:asignatura(14)]   'troncal'   Pos:17  Stuct:0x1 ( FILE )
    Adding:[2:tipo(17)]   'troncal'   Pos:17  Stuct:0x1 ( FILE )
**Adding automatic MetaName 'descripcion' found in file 'test.xml'
    Adding:[2:idioma(10)]   'blah'   Pos:20  Stuct:0x1 ( FILE )
    Adding:[2:curso(12)]   'blah'   Pos:20  Stuct:0x1 ( FILE )
    Adding:[2:asignatura(14)]   'blah'   Pos:20  Stuct:0x1 ( FILE )
    Adding:[2:descripcion(18)]   'blah'   Pos:20  Stuct:0x1 ( FILE )
    Adding:[2:idioma(10)]   'diseno'   Pos:25  Stuct:0x1 ( FILE )
    Adding:[2:curso(12)]   'diseno'   Pos:25  Stuct:0x1 ( FILE )
    Adding:[2:asignatura(14)]   'diseno'   Pos:25  Stuct:0x1 ( FILE )
    Adding:[2:asignatura.nombre(15)]   'diseno'   Pos:25  Stuct:0x1 ( FILE )
    Adding:[2:idioma(10)]   'bases'   Pos:26  Stuct:0x1 ( FILE )
    Adding:[2:curso(12)]   'bases'   Pos:26  Stuct:0x1 ( FILE )
    Adding:[2:asignatura(14)]   'bases'   Pos:26  Stuct:0x1 ( FILE )
    Adding:[2:asignatura.nombre(15)]   'bases'   Pos:26  Stuct:0x1 ( FILE )
    Adding:[2:idioma(10)]   'datos'   Pos:27  Stuct:0x1 ( FILE )
    Adding:[2:curso(12)]   'datos'   Pos:27  Stuct:0x1 ( FILE )
    Adding:[2:asignatura(14)]   'datos'   Pos:27  Stuct:0x1 ( FILE )
    Adding:[2:asignatura.nombre(15)]   'datos'   Pos:27  Stuct:0x1 ( FILE )
    Adding:[2:idioma(10)]   '4'   Pos:30  Stuct:0x1 ( FILE )
    Adding:[2:curso(12)]   '4'   Pos:30  Stuct:0x1 ( FILE )
    Adding:[2:asignatura(14)]   '4'   Pos:30  Stuct:0x1 ( FILE )
    Adding:[2:asignatura.codigo(16)]   '4'   Pos:30  Stuct:0x1 ( FILE )
    Adding:[2:idioma(10)]   'optativa'   Pos:33  Stuct:0x1 ( FILE )
    Adding:[2:curso(12)]   'optativa'   Pos:33  Stuct:0x1 ( FILE )
    Adding:[2:asignatura(14)]   'optativa'   Pos:33  Stuct:0x1 ( FILE )
    Adding:[2:tipo(17)]   'optativa'   Pos:33  Stuct:0x1 ( FILE )
    Adding:[2:idioma(10)]   'disenar'   Pos:36  Stuct:0x1 ( FILE )
    Adding:[2:curso(12)]   'disenar'   Pos:36  Stuct:0x1 ( FILE )
    Adding:[2:asignatura(14)]   'disenar'   Pos:36  Stuct:0x1 ( FILE )
    Adding:[2:descripcion(18)]   'disenar'   Pos:36  Stuct:0x1 ( FILE )
    Adding:[2:idioma(10)]   'segundo'   Pos:42  Stuct:0x1 ( FILE )
    Adding:[2:curso(12)]   'segundo'   Pos:42  Stuct:0x1 ( FILE )
    Adding:[2:curso.numero(13)]   'segundo'   Pos:42  Stuct:0x1 ( FILE )
    Adding:[2:idioma(10)]   'base'   Pos:46  Stuct:0x1 ( FILE )
    Adding:[2:curso(12)]   'base'   Pos:46  Stuct:0x1 ( FILE )
    Adding:[2:asignatura(14)]   'base'   Pos:46  Stuct:0x1 ( FILE )
    Adding:[2:asignatura.nombre(15)]   'base'   Pos:46  Stuct:0x1 ( FILE )
    Adding:[2:idioma(10)]   'datos'   Pos:47  Stuct:0x1 ( FILE )
    Adding:[2:curso(12)]   'datos'   Pos:47  Stuct:0x1 ( FILE )
    Adding:[2:asignatura(14)]   'datos'   Pos:47  Stuct:0x1 ( FILE )
    Adding:[2:asignatura.nombre(15)]   'datos'   Pos:47  Stuct:0x1 ( FILE )
    Adding:[2:idioma(10)]   '2'   Pos:50  Stuct:0x1 ( FILE )
    Adding:[2:curso(12)]   '2'   Pos:50  Stuct:0x1 ( FILE )
    Adding:[2:asignatura(14)]   '2'   Pos:50  Stuct:0x1 ( FILE )
    Adding:[2:asignatura.codigo(16)]   '2'   Pos:50  Stuct:0x1 ( FILE )
    Adding:[2:idioma(10)]   'obligatoria'   Pos:53  Stuct:0x1 ( FILE )
    Adding:[2:curso(12)]   'obligatoria'   Pos:53  Stuct:0x1 ( FILE )
    Adding:[2:asignatura(14)]   'obligatoria'   Pos:53  Stuct:0x1 ( FILE )
    Adding:[2:tipo(17)]   'obligatoria'   Pos:53  Stuct:0x1 ( FILE )
 (17 words) 
 
Removing very common words... 
no words removed. 
Writing main index... 
Sorting words ... 
Sorting 24 words alphabetically 
Writing header ... 
Writing index entries ... 
  Writing word text: Complete 
  Writing word hash: Complete 
  Writing word data: Complete 
24 unique words indexed. 
4 properties sorted. 
2 files indexed.  745 total bytes.  73 total words. 
Elapsed time: 00:00:00 CPU time: 00:00:00 
Indexing done! 
dsorian@linux:~/swish-e-2.4.2> 
 
 
I would like to know whether non-English characters can be indexed
correctly in XML files.
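
For what it is worth, here is a small Python sketch of one way the two
outputs above could arise. It is only a guess, not a confirmed diagnosis
and not swish-e's code. It assumes that both test files are saved as
UTF-8, that the XML file is decoded as UTF-8 while the HTML file is
decoded as ISO-8859-1, and that accented letters are folded to plain
ASCII before the words are split. The helper names fold_to_ascii and
index_words are made up for this illustration.

```python
import re
import unicodedata
from typing import List


def fold_to_ascii(text: str) -> str:
    """Strip diacritics: decompose (NFKD) and drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_words(raw: bytes, assumed_encoding: str) -> List[str]:
    """Decode the bytes, fold accents, lowercase, split on non-alphanumerics.

    Hypothetical stand-in for an indexer's word splitter, not swish-e itself.
    """
    text = raw.decode(assumed_encoding)
    folded = fold_to_ascii(text).lower()
    return re.findall(r"[a-z0-9]+", folded)


# Assumption: the file on disk is UTF-8, so n-with-tilde is two bytes (C3 B1).
raw = "dise\u00f1o".encode("utf-8")

# Decoded with the right encoding, the word stays whole and the tilde folds away.
print(index_words(raw, "utf-8"))    # ['diseno']

# Decoded as ISO-8859-1, the two bytes become two characters; the first
# folds to 'a' and the second is not a word character, so the word splits.
print(index_words(raw, "latin-1"))  # ['disea', 'o']
```

If these assumptions hold, the XML result ('diseno') would be the
expected folded form, and the HTML result ('disea', 'o') would point to
an encoding mismatch rather than to a problem with the characters
themselves.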
 
 
Thank you. 
Received on Thu Nov 4 07:15:57 2004