Thanks Bill, Rick and David,
An example of the XML we're indexing, which contains the accented characters, can be found here:
html example (same problems)
I've had a look at the locale settings on the Solaris boxes (test, dev and live all display the same behaviour) and they're set to ISO-8859-1. Can anyone tell me whether the locale settings are global to all logins/users, or are they login-specific?
The XML we're indexing has the following header/declaration line:
<?xml version="1.0" encoding="iso-8859-1"?>
I also tried a test index with the encoding name in caps, i.e. ISO-8859-1. No change in behaviour.
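One thing worth ruling out is a file whose bytes don't match its declaration. Here is a minimal Python sketch of that check (not part of swish-e; the helper name looks_like_utf8 and the sample documents are made up for illustration). It relies on the fact that ISO-8859-1 text with accented letters is almost never valid UTF-8, so a file that declares iso-8859-1 but decodes cleanly as UTF-8 is probably mislabelled.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """True if raw contains non-ASCII bytes AND decodes cleanly as UTF-8."""
    if all(b < 0x80 for b in raw):
        return False  # pure ASCII: the two encodings cannot be told apart
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False  # invalid as UTF-8, so likely a single-byte encoding
    return True


# Hypothetical sample documents: same declaration, different actual bytes.
header = b'<?xml version="1.0" encoding="iso-8859-1"?>\n'
body = "<title>cin\u00e9math\u00e8que</title>"

latin1_doc = header + body.encode("iso-8859-1")  # bytes match the declaration
utf8_doc = header + body.encode("utf-8")         # bytes contradict it

print(looks_like_utf8(latin1_doc))
print(looks_like_utf8(utf8_doc))
```

On a real file, the same function could be fed the result of open(path, "rb").read().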
> I suppose a useful -T option would be to dump as bytes the UTF-8 strings
> that libxml2 is passing to swish. That would be helpful in debugging.
Which -T option is this? Or are you just thinking it would be a nice-to-have feature?
> Swish-e then blindly tries to convert the UTF-8 to 8859-1. A warning
> is displayed when a UTF-8 character cannot be mapped to 8859-1.
I've set ParserWarnLevel to 1 in the various config files I'm using and so far no errors are being reported. Do I also need to use the -v switch?
As far as I can see, swish-e is happily indexing the files and storing them in UTF-8. When I look at the INDEXED_WORDS I see cinematheque in there as "cinÚmathÞque", which I'm assuming is just the terminal's best attempt to display the Unicode. When that index then gets searched, it fails to translate the Unicode back into ISO-8859-1 and returns "?" instead. That "?" is then translated by our Java htmlencode function as � which cannot be displayed.
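To sanity-check the assumption about the terminal, here is a small Python sketch (nothing to do with swish-e internals; the CP850 terminal code page is purely a guess on my part, and the helper name shown_as is made up). It shows how the same word looks when its ISO-8859-1 or UTF-8 bytes are displayed under a different code page, and how a "?" or a replacement character can appear during conversion.

```python
word = "cin\u00e9math\u00e8que"  # "cinémathèque"

latin1_bytes = word.encode("iso-8859-1")  # one byte per accented letter
utf8_bytes = word.encode("utf-8")         # two bytes per accented letter


def shown_as(raw: bytes, terminal_encoding: str) -> str:
    """Decode raw the way a terminal using terminal_encoding would show it."""
    return raw.decode(terminal_encoding, errors="replace")


# ISO-8859-1 bytes on an (assumed) CP850 terminal: one odd char per accent.
latin1_on_cp850 = shown_as(latin1_bytes, "cp850")
# UTF-8 bytes on an ISO-8859-1 terminal: two odd chars per accent.
utf8_on_latin1 = shown_as(utf8_bytes, "iso-8859-1")

# A character with no ISO-8859-1 equivalent becomes "?" when replaced.
unmappable = "\u20ac".encode("iso-8859-1", errors="replace")
# ISO-8859-1 bytes misread as UTF-8 yield U+FFFD, the replacement character.
misread = shown_as(latin1_bytes, "utf-8")

print(latin1_on_cp850)
print(utf8_on_latin1)
print(unmappable)
print(misread)
```

If the CP850 guess is right, the first line printed matches what I see in INDEXED_WORDS, which might mean the index actually holds single-byte ISO-8859-1 rather than UTF-8. I haven't confirmed that.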
I'm going slightly loopy on this one. Any further guidance will get you guaranteed positions on the Christmas card list (honest).
Received on Wed Feb 26 03:12:13 2003