Problems with ISO 8859-1 to UTF-8 Conversion?

From: Thomas Seifert <tseifert(at)not-real.mediaregister.de>
Date: Fri Sep 13 2002 - 10:20:21 GMT

Hi,

I have been playing around with Swish-E and French texts for the last few 
days and have run into a problem that I can't solve.

I'm indexing XML files (via the -S prog parameter) like this one:
--------------- snip -----------------------------
Path-Name: /tvtitel/287358
Content-Length: 208
Last-Mtime: 1031911762
Document-Type: XML

<?xml version="1.0" encoding="ISO-8859-1"?>
<xml>
<titel>L'instit : Le choix de Théo</titel><desc>francetélévision (France2)|1 
Boulevard Victor, Immeuble Le Barjac|F75015|Paris|11-09-02 
21:10||||</desc></xml>
--------------- snip -----------------------------
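For illustration, a record like the one above could be produced by a small helper such as the following. This is only a sketch: the function name `format_prog_record` is hypothetical, and it assumes that Content-Length has to be the number of bytes of the encoded body (not the number of characters), which matters as soon as the text contains accented letters.

```python
def format_prog_record(path: str, xml_text: str, mtime: int,
                       encoding: str = "iso-8859-1") -> bytes:
    """Build one document record with the four headers shown above.

    Assumption: Content-Length counts bytes of the encoded body.
    """
    body = xml_text.encode(encoding)
    header = (
        f"Path-Name: {path}\n"
        f"Content-Length: {len(body)}\n"
        f"Last-Mtime: {mtime}\n"
        "Document-Type: XML\n"
        "\n"  # blank line separates the headers from the document body
    ).encode("ascii")
    return header + body


if __name__ == "__main__":
    import sys

    record = format_prog_record(
        "/tvtitel/287358",
        "<xml><titel>Th\u00e9o</titel></xml>",
        1031911762,
    )
    sys.stdout.buffer.write(record)
```

With ISO-8859-1 the "é" takes one byte; encoded as UTF-8 it would take two, so the declared length and the declared XML encoding have to agree with the bytes actually written.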

In the config file I use the "TranslateCharacters :ascii7:" directive, which 
should index "Théo" as "Theo" (as I understand it, this feature converts only 
the index, not the stored text), so that I can search for "theo" and find the 
document above.
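To make the expectation concrete, here is a rough sketch of the kind of folding I expect to happen at index time. It is not Swish-E's actual implementation; `fold_to_ascii7` is a hypothetical name, and the Unicode decomposition approach is just one way to get the same effect.

```python
import unicodedata


def fold_to_ascii7(word: str) -> str:
    """Strip accents and lowercase, e.g. for building index keywords."""
    # NFKD splits "é" into "e" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", word)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


if __name__ == "__main__":
    for word in ["Th\u00e9o", "Tha\u00eflande", "caf\u00e9"]:
        print(word, "->", fold_to_ascii7(word))
```

Under this assumption the keyword list should contain "theo", "thailande" and "cafe".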

When printing the keywords (with -k '*') I can't find the word "theo":
--------------- snip -----------------------------
... thac thaco the ti ticket ...
--------------- snip -----------------------------

When I search for "theo" I get no results; when I search for "th*" I get 
this result:
--------------- snip -----------------------------
# SWISH format: 2.1-dev-26
# Search words: titel=(th*)
# Number of hits: 4
# Search time: 0.001 seconds
# Run time: 0.038 seconds
L'instit : Le choix de Théo
The Brian Benben Show
Thaïlande
Thé ou café
--------------- snip -----------------------------

To me it looks as if the conversion from UTF-8, which libxml uses 
internally, back to ISO-8859-1 for the indexer doesn't work. But no error is 
reported during indexing.
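A small sketch of what I suspect is happening, under the assumption that the UTF-8 bytes reach the indexer unconverted and are then treated as ISO-8859-1. The function names are hypothetical and this only illustrates the suspected effect, not the indexer's code.

```python
def misread_utf8_as_latin1(text: str) -> str:
    """Simulate UTF-8 bytes being interpreted as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def convert_utf8_to_latin1(utf8_bytes: bytes) -> bytes:
    """The conversion that should happen before indexing."""
    return utf8_bytes.decode("utf-8").encode("iso-8859-1")


if __name__ == "__main__":
    word = "Th\u00e9o"
    # The single "é" turns into two Latin-1 characters, so the word
    # can no longer be folded to "theo".
    print(repr(misread_utf8_as_latin1(word)))
    # A correct conversion yields one byte (0xE9) for "é".
    print(convert_utf8_to_latin1(word.encode("utf-8")))
```

If the bytes are misread like this, "é" becomes the two characters "Ã" and "©", which would explain why no keyword "theo" shows up while "th*" still matches the title.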

Any ideas?

Thanks,
Thomas
Received on Fri Sep 13 10:23:56 2002