Re: TranslateCharacters - clarification required

From: David L Norris <dave(at)not-real.webaugur.com>
Date: Wed Feb 26 2003 - 03:49:02 GMT

On Tue, 2003-02-25 at 22:11, Tref Gare wrote:
> An example of the xml we're indexing which contains the accented chars can be found here:
> http://www2.areeba.com.au/swishe/test.xml
> html example (same problems)
> http://www2.areeba.com.au/swishe/test.htm

Thanks, much better.  Berkeley's list processor is stripping
Content-Type headers from the swish-e list, so my mail reader keeps
guessing character encodings at random.  Note: look at the message I
sent you directly rather than the copy that came through the mailing
list, in case the list mangled the characters.

> I've had a look at the locale settings on the Solaris boxes (test, dev and live are all displaying the same behaviour) and they're set to ISO-8859-1.
> Can anyone tell me if the locale settings are global to all logins/users or are they login specific?

They should be adjustable per user and per environment, meaning you
should be able to change the locale temporarily just for your current
terminal.  On Linux it's as simple as setting the LANG environment
variable to a different locale; the LC_* categories inherit from LANG
unless they are set explicitly or overridden by LC_ALL.

Solaris may be completely different.  But probably not.  Here's an
example on my Red Hat 8 workstation:

$ locale
LANG=en_US.utf8
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=

# Let's switch to ISO-8859-1 (the bare en_US locale is Latin-1 on this box)
$ export LANG=en_US

$ locale
LANG=en_US
LC_CTYPE="en_US"
LC_NUMERIC="en_US"
LC_TIME="en_US"
LC_COLLATE="en_US"
LC_MONETARY="en_US"
LC_MESSAGES="en_US"
LC_PAPER="en_US"
LC_NAME="en_US"
LC_ADDRESS="en_US"
LC_TELEPHONE="en_US"
LC_MEASUREMENT="en_US"
LC_IDENTIFICATION="en_US"
LC_ALL=
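
If you'd rather not change LANG for the whole terminal session, a
variable assignment placed in front of a command applies to that command
only.  A minimal sketch, assuming a POSIX shell and the standard locale
utility; the C locale stands in for whatever Latin-1 locale name your
system uses (locale -a lists them), since C is the only name guaranteed
to exist everywhere:

```shell
# Snapshot the session's locale settings before the experiment.
before=$(locale)

# LC_ALL overrides LANG and every LC_* category, but only for this one command.
# "C" is an assumption here; substitute a locale name reported by `locale -a`.
during=$(LC_ALL=C locale)

# Snapshot again to confirm the session itself was not changed.
after=$(locale)

printf '%s\n' "$during"
```

The before and after snapshots should be identical, which is a quick way
to convince yourself that the override did not leak into the session.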


> I've set ParserWarnLevel to 1 in the various config files I'm using and so far no errors are being reported.  Do I also need to use the -v switch?

Looks like it is parsing fine.

> As far as I can see, swish-e is happily indexing the files and storing them in UTF-8.
..
> When that index then gets searched, it's failing to translate the unicode back into ISO-8859-1.
> Instead it returns "?".  That "?" is then translated by our java htmlencode function as &#65533; which cannot be displayed.

$ cat c
DefaultContents XML2

$ swish-e -i test.xml -c c

$ swish-e -k c | tee test.txt
# SWISH format: 2.3-dev-04
index.swish-e: c0d20b3637e541b98de1273d279c7f84
c894a4a6470a4044804b13156b6d5352 calendar celebrating centre cin cinema
cinemas cinematheque cinematic cin?math?que classic collection come
consists contact content copyright corporatename cult

OK, my terminal is showing me ? for the accented characters.  My locale
is en_US.utf8 (on Red Hat 8), so the terminal expects UTF-8 and those
bytes evidently aren't valid UTF-8.

So, let's convert SWISH-E's output from Latin1 to UTF8 and see what we
get.

$ recode latin1..utf8 test.txt

$ cat test.txt
# SWISH format: 2.3-dev-04
index.swish-e: c0d20b3637e541b98de1273d279c7f84
c894a4a6470a4044804b13156b6d5352 calendar celebrating centre cin cinema
cinemas cinematheque cinematic cinémathèque classic collection come
consists contact content copyright corporatename cult

OK, that's better.  Let's go back to Latin1:

$ recode utf8..latin1 test.txt

$ cat test.txt
# SWISH format: 2.3-dev-04
index.swish-e: c0d20b3637e541b98de1273d279c7f84
c894a4a6470a4044804b13156b6d5352 calendar celebrating centre cin cinema
cinemas cinematheque cinematic cin?math?que classic collection come
consists contact content copyright corporatename cult

Alrighty, now it's showing those blasted question marks again, because
the file is back in Latin1 while my terminal still expects UTF8.
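
To see what that conversion does at the byte level, here is a rough
sketch using iconv instead (a different tool from the recode used above,
and assumed to be installed).  The helper name latin1_word is made up
for illustration; the octal escapes are the Latin-1 codes for the two
accented letters:

```shell
# The word as Latin-1 bytes: \351 is 0xE9 (e-acute), \350 is 0xE8 (e-grave).
latin1_word() { printf 'cin\351math\350que'; }

# Decode those bytes as ISO-8859-1 and re-encode them as UTF-8.
utf8_word=$(latin1_word | iconv -f ISO-8859-1 -t UTF-8)

# Count bytes in each form; tr strips the padding some wc versions add.
latin1_bytes=$(latin1_word | wc -c | tr -d ' ')
utf8_bytes=$(printf '%s' "$utf8_word" | wc -c | tr -d ' ')

# 12 bytes as Latin-1, 14 as UTF-8: each accented letter grows to two bytes.
echo "$utf8_word: $latin1_bytes bytes as Latin-1, $utf8_bytes bytes as UTF-8"
```

A UTF-8 terminal can only draw the 14-byte form; given the 12-byte form
it has nothing sensible to show for the two lone high bytes.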


I don't know if any of my rambling is helpful.  But maybe it will give
you some debugging ideas.  Maybe you need to explicitly tell your Java
app or whatnot to read the text as ISO-8859-1?  Java is entirely
Unicode internally, isn't it?
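
One more debugging idea along those lines: check whether a chunk of
output is well-formed UTF-8 before deciding how to decode it.  A sketch,
assuming iconv is available and exits non-zero on invalid input (GNU
iconv does); is_utf8 is a hypothetical helper name, not part of any
tool mentioned here:

```shell
# Hypothetical helper: succeeds only if stdin is well-formed UTF-8.
# iconv fails when it meets a byte sequence that is invalid in the
# source encoding, so a UTF-8 to UTF-8 pass works as a validity check.
is_utf8() { iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; }

# Sample input: the word as Latin-1 bytes, as in the swish-e output above.
if printf 'cin\351math\350que\n' | is_utf8; then
    verdict="valid UTF-8"
else
    verdict="not valid UTF-8 (likely a single-byte encoding such as ISO-8859-1)"
fi
echo "$verdict"
```

Piping the real output of swish-e -k through the same helper would tell
you which of the two encodings you are actually dealing with.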

-- 
 David Norris
  http://www.webaugur.com/dave/
  ICQ - 412039
Received on Wed Feb 26 03:49:29 2003