On Sat, Nov 06, 2004 at 06:39:39AM -0800, dasoso@alumni.uv.es wrote:
> No Bill, it doesn't work.
Ok, look carefully:
> WordCharacters 0123456789abcdefghijklmnñopqrstuvwxyzáéíóúàèòÇ
Why are you setting WordCharacters?
> Indexing "test.xml"
>
>
>
> Adding:[2:descripcion(18)] 'blah' Pos:20 Stuct:0x1 ( FILE )
> Adding:[2:idioma(10)] 'dise' Pos:25 Stuct:0x1 ( FILE )
> Adding:[2:curso(12)] 'dise' Pos:25 Stuct:0x1 ( FILE )
> Adding:[2:asignatura(14)] 'dise' Pos:25 Stuct:0x1 ( FILE )
> Adding:[2:asignatura.nombre(15)] 'dise' Pos:25 Stuct:0x1 ( FILE )
> Adding:[2:idioma(10)] 'o' Pos:26 Stuct:0x1 ( FILE )
> Adding:[2:curso(12)] 'o' Pos:26 Stuct:0x1 ( FILE )
> Adding:[2:asignatura(14)] 'o' Pos:26 Stuct:0x1 ( FILE )
> Adding:[2:asignatura.nombre(15)] 'o' Pos:26 Stuct:0x1 ( FILE )
> Adding:[2:idioma(10)] 'bases' Pos:27 Stuct:0x1 ( FILE )
Do you see diseño anywhere?
No: diseño is being split into 'dise' and 'o'. That indicates that ñ
is not in WordCharacters as swish sees it -- regardless of what your
posted config says.
Comment out WordCharacters so the default is used.
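
To illustrate the split, here is a rough Python sketch. It is not swish-e's
actual code. It assumes the indexer compares single bytes of ISO-8859-1 text
against the WordCharacters set, and it guesses at one way ñ could go missing:
a config file saved as UTF-8, where ñ is the two bytes 0xC3 0xB1 rather than
the single byte 0xF1. Nothing in this thread confirms that is the cause here,
and the name split_words is made up for the example.

```python
from typing import List


def split_words(text: bytes, word_chars: bytes) -> List[bytes]:
    """Split text into maximal runs of bytes that appear in word_chars."""
    allowed = set(word_chars)
    words, current = [], bytearray()
    for byte in text:
        if byte in allowed:
            current.append(byte)
        elif current:
            words.append(bytes(current))
            current.clear()
    if current:
        words.append(bytes(current))
    return words


# The WordCharacters value from the quoted config.
WORD_CHARS = "0123456789abcdefghijklmnñopqrstuvwxyzáéíóúàèòÇ"

# Text as the indexer would see it after conversion to ISO-8859-1.
text = "diseño de bases".encode("iso-8859-1")

# Assumption: the config bytes are taken as-is, one byte per character.
good = split_words(text, WORD_CHARS.encode("iso-8859-1"))
bad = split_words(text, WORD_CHARS.encode("utf-8"))

print(good)  # [b'dise\xf1o', b'de', b'bases']
print(bad)   # [b'dise', b'o', b'de', b'bases']
```

With the UTF-8 encoded set, byte 0xF1 is never a word character, so the word
breaks exactly where the -T indexed_words output above shows it breaking.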
Also:
> dsorian@linux:~/swish-e-2.4.2> head -1 test.xml
> <?xml version="1.0" encoding="UTF-8"?>
Is your input file really encoded in UTF-8? I used ISO-8859-1 in my example.
Again, libxml2 parses the input XML file -- and it must know the
correct encoding of your file, which it takes from the XML declaration.
libxml2 then passes the text to swish, which converts it to ISO-8859-1.
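
A quick way to check whether the declaration matches the bytes on disk is
sketched below in Python. This is only an illustration: declared_encoding and
decodes_as are made-up helper names, and the sample documents are built in
memory rather than read from your test.xml. Note the check is only meaningful
for UTF-8, since every byte sequence decodes as ISO-8859-1.

```python
import re


def declared_encoding(raw: bytes) -> str:
    """Return the encoding named in the XML declaration, else the XML default."""
    match = re.match(rb'<\?xml[^>]*encoding=["\']([^"\']+)["\']', raw)
    return match.group(1).decode("ascii").lower() if match else "utf-8"


def decodes_as(raw: bytes, encoding: str) -> bool:
    """True if raw is a valid byte sequence in the given encoding."""
    try:
        raw.decode(encoding)
        return True
    except (UnicodeDecodeError, LookupError):
        return False


doc = '<?xml version="1.0" encoding="UTF-8"?>\n<asignatura nombre="Diseño"/>'

# Same document text, saved two different ways.
latin1_bytes = doc.encode("iso-8859-1")  # declaration says UTF-8, bytes are not
utf8_bytes = doc.encode("utf-8")         # declaration and bytes agree

for raw in (latin1_bytes, utf8_bytes):
    enc = declared_encoding(raw)
    print(enc, decodes_as(raw, enc))
# utf-8 False
# utf-8 True
```

To test a real file, read it with open(path, "rb").read() and pass the bytes
to the same two functions.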
-T indexed_words shows the words swish indexes after splitting on
WordCharacters and applying IgnoreFirstChar/IgnoreLastChar and
BeginCharacters/EndCharacters.
-T parsed_words shows the whitespace-delimited words before
WordCharacters and friends are applied.
moseley@laptop:~$ cat c
ParserWarnLevel 9
UndefinedXMLAttributes auto
UndefinedMetaTags auto
IndexOnly .xml .html .htm
IndexContents XML* .xml
IndexContents HTML2 .html .htm
moseley@laptop:~$ cat 1.xml
<?xml version="1.0" encoding="iso-8859-1" ?>
<!DOCTYPE order SYSTEM "pedido.dtd">
<Idioma tipo="Castellano">
<curso numero="quinto">
<asignatura nombre="IPI" codigo="1">
<tipo> Troncal</tipo>
<descripcion> Blah blah</descripcion>
</asignatura>
<asignatura nombre="Diseño de bases de datos" codigo="4">
<tipo> Optativa</tipo>
<descripcion> Diseñar.</descripcion>
</asignatura>
</curso>
<curso numero="segundo">
<asignatura nombre="Base de datos" codigo="2">
<tipo> Troncal </tipo>
<descripcion> </descripcion>
</asignatura>
</curso>
</Idioma>
moseley@laptop:~$ swish-e -c c -i 1.xml -T indexed_words -v0 | grep diseño
Adding:[1:idioma(10)] 'diseño' Pos:26 Stuct:0x1 ( FILE )
Adding:[1:curso(12)] 'diseño' Pos:26 Stuct:0x1 ( FILE )
Adding:[1:asignatura(14)] 'diseño' Pos:26 Stuct:0x1 ( FILE )
Adding:[1:asignatura.nombre(15)] 'diseño' Pos:26 Stuct:0x1 ( FILE )
moseley@laptop:~$ swish-e -w asignatura.nombre=diseño
# SWISH format: 2.5.2
# Search words: asignatura.nombre=diseño
# Removed stopwords:
# Number of hits: 1
# Search time: 0.005 seconds
# Run time: 0.037 seconds
1000 1.xml "1.xml" 677
.
--
Bill Moseley
moseley@hank.org
Received on Sat Nov 6 07:00:45 2004