Re: Problems with sorting German Umlaut

From: Andreas Seltenreich <andreas.seltenreich(at)not-real.ubka.uni-karlsruhe.de>
Date: Thu Feb 03 2005 - 18:40:55 GMT
Bill Moseley writes:

> On Thu, Feb 03, 2005 at 06:31:03PM +0100, Andreas Seltenreich wrote:
>> Sadly, ISO C doesn't know strNcoll. Naively, I'd just copy and
>> zero-terminate the strings and feed them to strcoll, using static
> >> memory to make the penalty bearable. But will this have to be
> >> implemented in a thread-safe way? Is it okay to introduce a new string
>> properties flag "case:locale" or similar to make it runtime
>> configurable?

> The point of making it configurable is so that you can fall back to
> the old strncasecmp() if you don't need it?

Even with the C locale selected there is still a penalty for using
strcoll, so people who don't need more than US-ASCII should IMHO not be
forced to use the locale-aware functions. The people over on
postgresql.org did some comparisons a while ago:

<http://groups.google.de/groups?selm=Pine.LNX.4.30.0111261852030.612-100000%40peter.localdomain>
(Sorry for the switch of medium.)

> Might be better to figure out where those strings are allocated and
> allocate another byte and make them null-terminated to start with.

Ok, I'm going to spend some time getting myself more familiar with the
code.
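
For illustration, the copy-and-terminate idea quoted above could look
roughly like the following sketch. The name coll_counted and its length
parameters are hypothetical, not an existing API. It uses heap copies
instead of static buffers, which keeps it thread safe at the cost of
two allocations per comparison, so treat it as a starting point rather
than a tuned implementation.

```c
#include <stdlib.h>
#include <string.h>

/* Copy len bytes of s into a fresh buffer and zero-terminate it.
 * Returns NULL if the allocation fails. */
static char *dup_terminated(const char *s, size_t len)
{
    char *copy = malloc(len + 1);
    if (copy != NULL) {
        memcpy(copy, s, len);
        copy[len] = '\0';
    }
    return copy;
}

/* Hypothetical helper: collate two counted (not necessarily
 * zero-terminated) strings under the current LC_COLLATE locale.
 * Note that strcoll stops at the first '\0', so embedded zero bytes
 * inside the counted range cut the comparison short. */
int coll_counted(const char *a, size_t alen, const char *b, size_t blen)
{
    char *ca = dup_terminated(a, alen);
    char *cb = dup_terminated(b, blen);
    int result;

    if (ca != NULL && cb != NULL) {
        result = strcoll(ca, cb);
    } else {
        /* Out of memory: fall back to a plain byte-wise comparison. */
        size_t n = alen < blen ? alen : blen;
        result = memcmp(a, b, n);
        if (result == 0)
            result = (alen > blen) - (alen < blen);
    }

    free(ca);   /* free(NULL) is a no-op */
    free(cb);
    return result;
}
```

A static-buffer variant would avoid the malloc calls but would need a
lock or per-thread storage to stay safe under threads.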

> Just one more thing that won't work when we move to utf-8.  (how does
> utf-8 sort??  Do some languages sort to the top?)

strcoll works flawlessly with UTF-8 locales. Here's an example I ran
in a UTF-8 xterm (I used "file" to make sure I was actually typing
UTF-8):

$ echo  > /tmp/test
$ file !$
/tmp/test: UTF-8 Unicode text
$ LC_CTYPE=de_DE.utf-8 LC_COLLATE=de_DE.utf-8 ./a.out  
strcasecmp: 32
    strcmp: 32
   strcoll: -6
$ LC_CTYPE=de_DE.utf-8 LC_COLLATE=de_DE.utf-8 ./a.out  b
strcasecmp: 97
    strcmp: 97
   strcoll: -1
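
The source of the test program is not included in this message. A
minimal sketch that would produce output in the same shape might look
like this, assuming the program takes the two strings as arguments and
picks up the locale from the environment. The names cmp_result,
compare_all and the DEMO_MAIN macro are made up for the example, and
strcasecmp is POSIX rather than ISO C.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Results of the three comparison functions for one pair of strings. */
struct cmp_result {
    int by_strcasecmp;
    int by_strcmp;
    int by_strcoll;
};

/* Compare a and b with each function. Only strcoll consults the
 * current LC_COLLATE setting. */
struct cmp_result compare_all(const char *a, const char *b)
{
    struct cmp_result r;
    r.by_strcasecmp = strcasecmp(a, b);
    r.by_strcmp = strcmp(a, b);
    r.by_strcoll = strcoll(a, b);
    return r;
}

/* Print the results in the layout used in the transcript. */
void print_result(struct cmp_result r)
{
    printf("strcasecmp: %d\n", r.by_strcasecmp);
    printf("    strcmp: %d\n", r.by_strcmp);
    printf("   strcoll: %d\n", r.by_strcoll);
}

#ifdef DEMO_MAIN
int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s string1 string2\n", argv[0]);
        return 1;
    }
    /* Adopt LC_CTYPE, LC_COLLATE etc. from the environment. Without
     * this call the program stays in the "C" locale. */
    setlocale(LC_ALL, "");
    print_result(compare_all(argv[1], argv[2]));
    return 0;
}
#endif
```

Built with something like "cc -DDEMO_MAIN test.c", it can be run with
LC_COLLATE set as in the transcript. Only the sign of each value is
meaningful; the magnitudes depend on the C library.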

Thanks
Andreas
Received on Thu Feb 3 10:41:01 2005