Re: problem with tolower continues

From: <dasoso(at)not-real.alumni.uv.es>
Date: Sat Feb 19 2005 - 21:09:03 GMT
 
  Thanks for the example, Bill, but it didn't work.
 
> cat test.xml 
<?xml version="1.0" encoding="ISO-8859-1"?> 
<!DOCTYPE order SYSTEM "pedido.dtd"> 
<Idioma tipo="Castellano"> 
  árbol    ÁRBOL   empeño.  EMPEÑO. 
</Idioma> 
 
 
swish-e-2.4.3> LANG=es_ES swish-e -c swish-e.conf -T indexed_words 
 
 
Adding automatic MetaName 'descripcion' found in file test.xml 
 
    Adding:[2:idioma(10)]   'empe'   Pos:6  Stuct:0x1 ( FILE ) 
    Adding:[2:descripcion(12)]   'empe'   Pos:6  Stuct:0x1 ( FILE ) 
    Adding:[2:idioma(10)]   'empe'   Pos:7  Stuct:0x1 ( FILE ) 
    Adding:[2:descripcion(12)]   'empe'   Pos:7  Stuct:0x1 ( FILE 
 
. 
 
 
swish-e-2.4.3> LANG=es_ES swish-e -w PESTAÑA -H9 | grep Parsed 
# Parsed Words: pesta 
 
 
dsorian@linux:~/swish-e-2.4.3> LANG=es_ES swish-e -w pestaña 
locale es es_ES@@@@@@@@@@@@ 
# SWISH format: 2.4.3 
# Search words: pestaña 
# Removed stopwords: 
err: no results 
 
 
 
   Did I do something wrong? I used LANG=es_ES, but the result is worse:
swish-e now splits the words at the ñ. So the problem isn't in the
locale settings (I think). Any suggestions?
 
 
Thank you. 
 
 
 
> On Wed, Feb 16, 2005 at 06:46:41AM -0800, dasoso@alumni.uv.es wrote:
> >  1.- Here are my locale settings. Could they be the reason why
> > swish-e indexes ÁRBOL as Árbol?
>  
> Yes, that's what I was suggesting.
>
> Swish-e is converting your text to 8859-1 encoding, but you are
> telling it to sort using UTF-8.
>  
> Run swish like this: 
>  
>     LANG=es_ES swish-e -c config 
>  
> Maybe a demonstration will make it clear: 
>  
> moseley@bumby:~$ LANG=es_ES.UTF-8 swish-e -i t.txt -T indexed_words -v0
>     Adding:[1:swishdefault(1)]   'pestaÑa'   Pos:5  Stuct:0x9 ( BODY FILE )
>     Adding:[1:swishdefault(1)]   'Águila'   Pos:6  Stuct:0x9 ( BODY FILE )
>     Adding:[1:swishdefault(1)]   'águila'   Pos:7  Stuct:0x9 ( BODY FILE )
> moseley@bumby:~$ LANG=es_ES swish-e -i t.txt -T indexed_words -v0
>     Adding:[1:swishdefault(1)]   'pestaña'   Pos:5  Stuct:0x9 ( BODY FILE )
>     Adding:[1:swishdefault(1)]   'águila'   Pos:6  Stuct:0x9 ( BODY FILE )
>     Adding:[1:swishdefault(1)]   'águila'   Pos:7  Stuct:0x9 ( BODY FILE )
>  
> And you will need to search that way, too -- or at least be
> consistent that your locale setting is the same when indexing and
> when searching, so that tolower() operates the same when
> searching as it does when indexing.  But the bottom line is you
> don't want to tell tolower() that it's working with UTF-8 encoding
> when it's really working with 8859-1 encoding.
>  
>  
> moseley@bumby:~$ LANG=es_ES.UTF-8 swish-e -w PESTAÑA -H9 | grep Parsed
> # Parsed Words: pestaÑa  
> moseley@bumby:~$ LANG=es_ES swish-e -w PESTAÑA -H9 | grep Parsed 
> # Parsed Words: pestaña 
>  
>  
> We could force LANG at program startup, but there's more than one
> valid setting (e.g. en_US, de_DE, es_ES), so we want people to be
> able to set that.
>  
>  
>  
>  
> --  
> Bill Moseley 
> moseley@hank.org 
>  
>  
>  
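The quoted point about tolower() and mismatched encodings can be illustrated with a small Python sketch. This is not swish-e code and does not call the C library. The two helper functions are hypothetical names that only imitate how a byte-wise tolower() would presumably behave: under a locale where bytes above 0x7F have no case mapping (which is what the es_ES.UTF-8 output above suggests), and under an ISO-8859-1 locale where each byte is one character.

```python
# Hypothetical helpers: they imitate byte-wise tolower() behaviour,
# they are not part of swish-e or of the C library.

def tolower_ascii_only(data: bytes) -> bytes:
    """Imitate tolower() when bytes above 0x7F have no case mapping."""
    # bytes.lower() only changes ASCII A-Z and leaves other bytes alone.
    return data.lower()


def tolower_latin1(data: bytes) -> bytes:
    """Imitate tolower() under an ISO-8859-1 locale (one byte per character)."""
    return data.decode("latin-1").lower().encode("latin-1")


# The word as swish-e would hold it internally, in 8859-1: Ñ is byte 0xD1.
word = "PESTAÑA".encode("latin-1")

mismatched = tolower_ascii_only(word)  # Ñ (0xD1) is left untouched
matched = tolower_latin1(word)         # Ñ (0xD1) becomes ñ (0xF1)

print(mismatched)             # b'pesta\xd1a'
print(matched)                # b'pesta\xf1a'
print(mismatched == matched)  # False
```

If one of these mappings is used at index time and the other at search time, the stored term and the parsed query term are different byte strings, which is why the locale should be the same for both steps.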
 
 
Received on Sat Feb 19 13:09:03 2005