Re: Problems with sorting German Umlaut

From: Andreas Seltenreich <andreas.seltenreich(at)not-real.ubka.uni-karlsruhe.de>
Date: Thu Feb 03 2005 - 17:34:03 GMT
Bill Moseley writes:

> On Wed, Feb 02, 2005 at 06:46:08AM -0800, Uwe Dierolf wrote:
>> We are using the default "case:ignore" for properties.
>> We checked the implementation of strncasecmp (see below).
>> This function does not take into consideration the value
>> of LC_COLLATE (under SuSE Linux 9.x).
>
> Oh, my mistake.  I used google too much.  I googled for 
>
>     strncasecmp locale and got to:
>
> http://www.delorie.com/gnu/docs/glibc/libc_75.html
>
> which says it is locale dependent.  But my man page says nothing of that.

Interesting, we're using the very same version of glibc, so I did some
more testing with our test case program. The documentation seems to be
a bit misleading here, as it just says strncasecmp is "locale
dependent".

strcasecmp and strncasecmp do use the LC_CTYPE locale information to
match upper- and lowercase characters, but they don't use the locale's
collating information (LC_COLLATE):

--8<---------------cut here---------------start------------->8---
$ cat sortlocale.c 
#include <string.h>
#include <strings.h>   /* strcasecmp is declared here on POSIX systems */
#include <stdio.h>
#include <locale.h>
#include <stdlib.h>

int main(int argc, char **argv) {

  setlocale(LC_CTYPE, getenv("LC_CTYPE"));
  setlocale(LC_COLLATE, getenv("LC_COLLATE"));

  if (argc != 3) {
    puts("need 2 arguments");
    return -1;
  } else {
    printf("strcasecmp: %d\n"
           "    strcmp: %d\n"
           "   strcoll: %d\n",
           strcasecmp(argv[1], argv[2]),
           strcmp(argv[1], argv[2]),
           strcoll(argv[1], argv[2]));
  }
  return 0;

}
$ gcc sortlocale.c
$ LC_CTYPE=C LC_COLLATE=de_DE ./a.out ā Ā
strcasecmp: 32
    strcmp: 32
   strcoll: -2
$ LC_CTYPE=de_DE LC_COLLATE=de_DE ./a.out ā Ā
strcasecmp: 0
    strcmp: 32
   strcoll: -2
$ LC_CTYPE=de_DE LC_COLLATE=de_DE ./a.out ā a
strcasecmp: 127
    strcmp: 127
   strcoll: 4
$ 
--8<---------------cut here---------------end--------------->8---

So I guess there's no way around strcoll.
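
To illustrate what sorting with strcoll could look like, here is a
minimal sketch of a qsort comparator built on it. The names cmp_strcoll
and sort_by_locale are made up for this example and are not taken from
the swish-e source; the caller is assumed to have selected a collation
locale beforehand, e.g. with setlocale(LC_COLLATE, "").

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: each element is a pointer to a C string, and the
 * strings are compared with the current LC_COLLATE rules. */
static int cmp_strcoll(const void *pa, const void *pb)
{
    const char *a = *(const char *const *)pa;
    const char *b = *(const char *const *)pb;
    return strcoll(a, b);
}

/* Hypothetical helper: sort an array of n strings in place according
 * to the collation order of the current locale. */
void sort_by_locale(const char **v, size_t n)
{
    assert(v != NULL || n == 0);
    if (n > 1)
        qsort(v, n, sizeof *v, cmp_strcoll);
}
```

Without a prior setlocale call the program runs in the "C" locale, where
strcoll orders strings the same way strcmp does, so the locale has to be
set once at startup for this to make a difference.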

>> Would it be possible for you or other swish-e developers to 
>> change the swish-e source so that it will use strcoll?
>> We need correctly sorted results. 
>
> Can you send me a patch?  If there was a strncoll() then it would be a
> drop-in replacement...

Sadly, ISO C has no strncoll. Naively, I'd just copy and
zero-terminate the strings and feed them to strcoll, using static
memory to make the penalty bearable. But would this have to be
implemented in a thread-safe way? And is it okay to introduce a new
string properties flag such as "case:locale" to make it runtime
configurable?
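
The copy-and-terminate idea above might be sketched roughly as follows.
This is only an illustration under some assumptions: strncoll_sketch is
a made-up name (no such function exists in ISO C or in swish-e), and it
uses heap copies instead of static memory so that it stays thread safe,
at the cost of two allocations per call.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: compare at most n bytes of a and b using the
 * LC_COLLATE rules of the current locale. */
int strncoll_sketch(const char *a, const char *b, size_t n)
{
    assert(a != NULL && b != NULL);

    /* Length of each prefix: stop at n bytes or at the first NUL. */
    size_t la = 0, lb = 0;
    while (la < n && a[la] != '\0') la++;
    while (lb < n && b[lb] != '\0') lb++;

    /* Heap copies keep the function reentrant; a static buffer would
     * be cheaper but not thread safe. */
    char *ca = malloc(la + 1);
    char *cb = malloc(lb + 1);
    int result;

    if (ca == NULL || cb == NULL) {
        /* Out of memory: fall back to a plain byte comparison. */
        result = strncmp(a, b, n);
    } else {
        memcpy(ca, a, la);
        ca[la] = '\0';
        memcpy(cb, b, lb);
        cb[lb] = '\0';
        result = strcoll(ca, cb);
    }

    free(ca);
    free(cb);
    return result;
}
```

One caveat with this approach: n counts bytes, so in a multibyte
encoding the cut can land in the middle of a character, and what
strcoll does with such a truncated string is not something to rely on.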

Thanks
Andreas
Received on Thu Feb 3 09:34:10 2005