Hi all.
I must have something misconfigured on my system regarding the
character set, because your examples show that swish-e has no
problem with ñ's etc.
swish-e-2.4.3> cat test.xml
españa PESTAÑA
niño NIÑO
émbolo ÉMBOLO
swish-e-2.4.3> LANG=es_ES swish-e -i test.xml -T indexed_words -v0
Adding:[1:swishdefault(1)] 'espa? Pos:5 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'a' Pos:6 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'pesta? Pos:7 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'a' Pos:8 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'ni? Pos:9 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'o' Pos:10 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'ni? Pos:11 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'o' Pos:12 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] '? Pos:13 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'mbolo' Pos:14 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] '? Pos:15 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'mbolo' Pos:16 Stuct:0x9 ( BODY FILE )
swish-e-2.4.3> swish-e -k '*'
# SWISH format: 2.4.3
index.swish-e: a espa?mbolo ni?o pesta??
David.
> On Wed, Feb 16, 2005 at 06:46:41AM -0800, dasoso@alumni.uv.es wrote:
> > 1.-Here are my locale settings; could they be the reason swish-e
> > indexes ÁRBOL as Árbol?
>
> Yes, that's what I was suggesting.
>
> Swish-e is converting your text to 8859-1 encoding, but you are
> telling it to sort using UTF-8.
>
> Run swish like this:
>
> LANG=es_ES swish-e -c config
>
> Maybe a demonstration will make it clear:
>
> moseley@bumby:~$ LANG=es_ES.UTF-8 swish-e -i t.txt -T indexed_words -v0
> Adding:[1:swishdefault(1)] 'pestaÑa' Pos:5 Stuct:0x9 ( BODY FILE )
> Adding:[1:swishdefault(1)] 'Águila' Pos:6 Stuct:0x9 ( BODY FILE )
> Adding:[1:swishdefault(1)] 'águila' Pos:7 Stuct:0x9 ( BODY FILE )
> moseley@bumby:~$ LANG=es_ES swish-e -i t.txt -T indexed_words -v0
> Adding:[1:swishdefault(1)] 'pestaña' Pos:5 Stuct:0x9 ( BODY FILE )
> Adding:[1:swishdefault(1)] 'águila' Pos:6 Stuct:0x9 ( BODY FILE )
> Adding:[1:swishdefault(1)] 'águila' Pos:7 Stuct:0x9 ( BODY FILE )
>
> And you will need to search that way, too. At the very least, keep
> your locale setting the same when indexing and when searching, so
> that tolower() operates the same way in both cases. But the bottom
> line is that you don't want to tell tolower() that it's working
> with UTF-8 encoding when it's really working with 8859-1 encoding.
>
>
> moseley@bumby:~$ LANG=es_ES.UTF-8 swish-e -w PESTAÑA -H9 | grep Parsed
> # Parsed Words: pestaÑa
> moseley@bumby:~$ LANG=es_ES swish-e -w PESTAÑA -H9 | grep Parsed
> # Parsed Words: pestaña
>
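The difference between the two runs above can be imitated outside of swish-e. The following Python sketch is only an illustration and not swish-e's code: it assumes the word is stored as 8859-1 bytes, models tolower() under a UTF-8 locale as ASCII-only lowercasing, and models tolower() under an 8859-1 locale as Latin-1-aware lowercasing. The function names are hypothetical.

```python
# Illustration only: imitates how a byte-wise tolower() could treat
# ISO-8859-1 data under two different locale assumptions.

def lower_ascii_only(raw: bytes) -> bytes:
    # Assumed model of a UTF-8 locale: single bytes above 0x7F are not
    # complete characters, so only A-Z are changed.
    return raw.lower()

def lower_latin1(raw: bytes) -> bytes:
    # Assumed model of an 8859-1 locale: every byte is one character,
    # so 0xD1 (Ñ) is lowercased to 0xF1 (ñ).
    return raw.decode("latin-1").lower().encode("latin-1")

word = "PESTA\u00d1A".encode("latin-1")  # b'PESTA\xd1A'

print(lower_ascii_only(word))  # the Ñ byte is left untouched
print(lower_latin1(word))      # the Ñ byte becomes the ñ byte
```

If the index was built with one of these behaviours and the query is lowercased with the other, the two byte strings differ and the word will not match, which is why the locale should be the same for indexing and searching.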
>
> We could force LANG at program startup, but there's more than one
> valid setting (e.g. en_US, de_DE, es_ES), so we want people to be
> able to set that themselves.
>
>
>
>
> --
> Bill Moseley
> moseley@hank.org
Received on Wed Feb 23 11:41:06 2005