Re: non-English characters in XML files

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Nov 09 2004 - 18:44:44 GMT
On Tue, Nov 09, 2004 at 10:19:05AM -0800, dasoso@alumni.uv.es wrote:
> 
> 
> > On Tue, Nov 09, 2004 at 06:52:24AM -0800, dasoso@alumni.uv.es wrote:
> > > Swish-e splits the words in ISO-8859. I like the way that works
> > > with the UTF-8.
> > 
> > So I guess that means your source xml is encoded in UTF-8.
> 
> 
>   Yes, but I noticed that my server has some files encoded in UTF-8 and
> others in ISO-8859, so I'll have files with 's indexed as n and
> others with the words split. Does anyone else have this problem with
> XML files? How do you resolve it and index your XML files? I don't
> know what to do.

You might review http://xmlsoft.org/encoding.html ("How is it
implemented?" section).  This part seems to be related to this
discussion.

    If there is no encoding declaration, then the input has to be in
    either UTF-8 or UTF-16, if it is not then at some point when
    processing the input, the converter/checker of UTF-8 form will
    raise an encoding error. You may end-up with a garbled document,
    or no document at all !

You may need to make sure your XML is well-formed and has the encoding
specified.  You might be able to automate that process (maybe the
file(1) command can help figure out the encoding).
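
As a rough illustration of automating that, here is a minimal Python
sketch (Python is my choice here, nothing Swish-e requires).  The
function name normalize_to_utf8 and the ISO-8859-1 fallback are
assumptions for the example: a file that already declares an encoding,
or starts with a UTF-16 byte-order mark, is left alone; anything else
is tried as UTF-8 first and, if that fails, treated as ISO-8859-1, then
written back as UTF-8 with an explicit declaration.  Note that any byte
sequence decodes as ISO-8859-1, so the fallback is a guess, not a
detection.

```python
import re

# An XML declaration at the very start that already names an encoding.
DECL_WITH_ENCODING = re.compile(rb"^<\?xml\b[^>]*\bencoding\s*=")
# Any XML declaration at the start of the decoded text.
XML_DECL = re.compile(r"^<\?xml\b[^>]*\?>\s*")

UTF8_BOM = b"\xef\xbb\xbf"
UTF16_BOMS = (b"\xff\xfe", b"\xfe\xff")


def normalize_to_utf8(raw, fallback="iso-8859-1"):
    """Return XML bytes that carry an explicit encoding declaration.

    Input that already declares an encoding, or that starts with a
    UTF-16 byte-order mark, is returned unchanged.
    """
    if raw.startswith(UTF16_BOMS):
        return raw
    body = raw[len(UTF8_BOM):] if raw.startswith(UTF8_BOM) else raw
    if DECL_WITH_ENCODING.match(body):
        return raw
    try:
        text = body.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is ISO-8859-1.
        text = body.decode(fallback)
    # Drop a declaration that lacks an encoding, then add a full one.
    text = XML_DECL.sub("", text, count=1)
    return ('<?xml version="1.0" encoding="UTF-8"?>\n' + text).encode("utf-8")
```

You would read each file in binary mode, pass the bytes through the
function, and write the result to the copy you hand to the indexer.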

-- 
Bill Moseley
moseley@hank.org

Received on Tue Nov 9 10:44:45 2004