Re: 8-bit chars

From: John Angel <angel_john(at)not-real.hotmail.com>
Date: Wed Dec 10 2003 - 21:36:34 GMT

Attached is a Windows-1250 codepage example that does not work.

Non-English characters are not indexed at all.

WordCharacters doesn't help.



----- Original Message ----- 
From: "Bill Moseley" <moseley@hank.org>
To: "John Angel" <angel_john@hotmail.com>
Cc: "Multiple recipients of list" <swish-e@sunsite.berkeley.edu>
Sent: Wednesday, December 03, 2003 20:21
Subject: Re: [SWISH-E] 8-bit chars


> On Wed, Dec 03, 2003 at 04:22:05AM -0800, John Angel wrote:
> > I have added chars above ASCII 127 to WordCharacters but it still
> > displays blanks instead of them. Where's the catch?
>
> You need to give an example of what's not working.
>
> > BTW, I have noticed that in WordCharacters there are only small caps
> > chars.
>
> Yes, words are lowercased with "tolower()" as you noticed.  So only
> lowercase characters need to be specified.
>
> > UTF-8 support would be great, but I understand it requires major
> > rewrite. Is it possible to have at least full 8-bit chars support
> > instead?
>
> It is full 8-bit, but there's a conversion to Latin1 when using libxml2
> so it may not be 100% 8-bit "clean".  I have not tested that with
> libxml2.
>
> BTW - First thing swish-e does when starting is:
>
>       setlocale(LC_CTYPE, "");
>
> but that's only in the binary.  (So that might result in problems when
> people use the Swish-e API on systems with different locales -- that is,
> tolower() might not change umlauts on indexing but would on searching.)
>
> > Searching through previous posts shows that the problem could be in
> > UTF8Toisolat1() and tolower() functions, but I am not sure how to
> > change and fix that.
>
> Can you provide a specific example of the problem?
>
>
>
> -- 
> Bill Moseley
> moseley@hank.org
>
>
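
The point above about setlocale() and tolower() can be checked with a small test. What follows is only a sketch, not Swish-e code: the helper names lower_byte() and count_lowered_high_bytes() are made up for illustration. It counts how many bytes in the range 128-255 the current LC_CTYPE locale changes when passed through tolower(). In the default "C" locale that count is zero, so 8-bit uppercase letters are left as they are unless setlocale() has selected a locale that knows about them.

```c
#include <ctype.h>
#include <locale.h>

/* Hypothetical helper: lowercase a single byte under the current
 * LC_CTYPE locale.  Taking an unsigned char matters, because passing
 * a negative plain char to tolower() is undefined behaviour. */
static unsigned char lower_byte(unsigned char c)
{
    return (unsigned char)tolower(c);
}

/* Hypothetical helper: count the bytes 128..255 that tolower()
 * maps to a different byte under the current LC_CTYPE locale. */
static int count_lowered_high_bytes(void)
{
    int count = 0;
    int c;

    for (c = 128; c < 256; c++) {
        if (lower_byte((unsigned char)c) != c)
            count++;
    }
    return count;
}
```

Calling setlocale(LC_CTYPE, "") first and then count_lowered_high_bytes() shows what the environment's locale does with 8-bit characters. A program that never calls setlocale() stays in the "C" locale, which is the situation described above for code using the API rather than the binary. If the count differs between the indexing and the searching process, the same word can be lowercased differently on each side.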



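On the Latin1 conversion mentioned in the quoted reply: ISO-8859-1 can only hold code points up to U+00FF, so any conversion from UTF-8 to Latin1 has to fail or lose data for characters beyond that, which includes several letters found in Windows-1250 text (for example U+0161, "s with caron"). The sketch below illustrates that limit. It is not libxml2's UTF8Toisolat1(), and the function name utf8_to_latin1_sketch() is made up for this example.

```c
#include <stddef.h>

/* Hypothetical helper: convert UTF-8 input to ISO-8859-1.
 * Returns the number of bytes written, or -1 if the input is
 * malformed, does not fit in the output buffer, or contains a
 * code point above U+00FF (which Latin1 cannot represent). */
static int utf8_to_latin1_sketch(const unsigned char *in, size_t inlen,
                                 unsigned char *out, size_t outcap)
{
    size_t i = 0;
    size_t o = 0;

    while (i < inlen) {
        unsigned int cp;
        unsigned char b = in[i];

        if (b < 0x80) {
            /* Plain ASCII: one byte, copied as is. */
            cp = b;
            i += 1;
        } else if ((b & 0xE0) == 0xC0 && i + 1 < inlen &&
                   (in[i + 1] & 0xC0) == 0x80) {
            /* Two-byte sequence: covers U+0080..U+07FF. */
            cp = ((unsigned int)(b & 0x1F) << 6) | (in[i + 1] & 0x3F);
            if (cp < 0x80)
                return -1;      /* overlong encoding */
            i += 2;
        } else {
            /* Three- and four-byte sequences are always above
             * U+00FF; anything else here is malformed input. */
            return -1;
        }

        if (cp > 0xFF || o >= outcap)
            return -1;
        out[o++] = (unsigned char)cp;
    }
    return (int)o;
}
```

With this sketch, the UTF-8 bytes C3 A4 (U+00E4, "a with diaeresis") convert to the single Latin1 byte E4, while C5 A1 (U+0161) is rejected because it has no Latin1 equivalent.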
*********************************************************************
Due to deletion of content types excluded from this list by policy,
this multipart message was reduced to a single part, and from there
to a plain text message.
*********************************************************************
Received on Wed Dec 10 21:36:42 2003