Skip to main content.
home | support | download

Back to List Archive

Re: Problems with sorting German Umlaut

From: Andreas Seltenreich <andreas.seltenreich(at)not-real.ubka.uni-karlsruhe.de>
Date: Sat Feb 05 2005 - 01:07:26 GMT
--=-=-=


Bill Moseley writes:

> On Thu, Feb 03, 2005 at 07:34:46PM +0100, Andreas Seltenreich wrote:
>> > Might be better to figure out where those strings are allocated and
>> > allocate another byte and make them null-terminated to start with.
>> 
>> Ok, I'm going to spend some time getting myself more familiar with the
>> code.
>
> Let me know if you need help.  I'd need some time to refresh myself
> with the code, but I rewrote much of the property code at one time.

Ok, this is as far as I got until now:

I guess there are two ways to make sure there are only \0-terminated
propstrings around: One is to terminate them after they come out of
the index, and the other is to terminate them before they get in.

I hope I read the source correctly, in that all propstrings have to
pass through addDocProperty() on their way into the index. Grepping
for it, and looking at what is being fed to it revealed, that the
propstrings are all \0-Terminated. Except one special case: when the
strings are longer than meta_entry->max_len, only prop->propLen will
be decreased, the terminating \0 won't be adjusted. I fixed those
places and put strcoll() in. This is the result:

CVS HEAD without modifications, classic strcasecmp() with the wrong
collation (Please ignore the wallclock values, as the system was
rather busy during the test. The user/sys times were reproducible
within 0.5s):

--8<---------------cut here---------------start------------->8---
Indexing Data Source: "File-System"
Indexing "daten/"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 196,636 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
196,636 unique words indexed.
12 properties sorted.
18,937 files indexed.  30,442,681 total bytes.  4,397,736 total words.
Elapsed time: 00:01:22 CPU time: 00:01:15
Indexing done!

real    1m22.137s
user    1m5.657s
sys     0m8.890s
--8<---------------cut here---------------end--------------->8---

Now the same with terminating propstrings _before_ they're put into
the index and replacing strncasecmp() with strcoll(),
LC_COLLATE=de_DE:

--8<---------------cut here---------------start------------->8---
real    1m24.777s
user    1m8.490s
sys     0m8.411s
--8<---------------cut here---------------end--------------->8---

The result looks very promising. Setting the locale to "C" (the
default) or "POSIX" will yield the collation of the former strcmp()
comparison, while en_US will look almost like the sorting with
strncasecmp() (the difference being that A and a have a distance of 1,
no longer 0).

Out of curiosity I also tried the less invasive approach of
terminating the non-terminated propstrings inside Compare_Properties()
using bin2string(), which terminates the propstrings after copying
them to dynamically allocated memory. The result did surprise me, as
it was less than 3% slower than the former version.

strcoll() with dynamic allocation and copying of the non-terminated
propstrings via bin2string() inside Compare_Properties():

--8<---------------cut here---------------start------------->8---
real    1m27.589s
user    1m10.142s
sys     0m9.110s
--8<---------------cut here---------------end--------------->8---

same with copying via alloca() (I dunno if this is available on all
platforms) instead of bin2string()+free():

--8<---------------cut here---------------start------------->8---
real    1m27.461s
user    1m9.204s
sys     0m9.302s
--8<---------------cut here---------------end--------------->8---

same with LC_COLLATE=C:

--8<---------------cut here---------------start------------->8---
real    1m22.516s
user    1m6.376s
sys     0m8.573s
--8<---------------cut here---------------end--------------->8---

Seeing that the penalty of dynamic allocation inside
Compare_Properties() using alloca() is smaller penalty than the one of
the switch from strncasecmp()->strcoll(), I am tempted to suggest
using the latter version, as it is more robust, less invasive and
easily left in parallel with the old strncasecmp()/strcmp() code.

Attached is a patch against CVS HEAD of the latter approach.

regards,
Andreas


--=-=-=
Content-Disposition: inline; filename=strcoll.diff

Index: configure.in
===================================================================
RCS file: /cvsroot/swishe/swish-e/configure.in,v
retrieving revision 1.74
diff -c -r1.74 configure.in
*** configure.in	20 Dec 2004 02:12:19 -0000	1.74
--- configure.in	5 Feb 2005 00:55:04 -0000
***************
*** 185,193 ****
      AC_SUBST(LIBXML2_CFLAGS)
  fi
  
! 
! 
! 
  
  dnl Provide an option for enabling btree/incremental indexing for development
  AC_ARG_ENABLE(incremental,
--- 185,198 ----
      AC_SUBST(LIBXML2_CFLAGS)
  fi
  
! dnl Use strcoll() instead of strncmp()/strncasecmp() to enable locale dependent collating
! AC_ARG_ENABLE(strcoll,
!              AC_HELP_STRING([--enable-strcoll], [** developer use only **]),strcoll=yes,)
! 
! if test x$strcoll = xyes; then
!     AC_MSG_WARN([** Buidling with developer only collation code **])
!     AC_DEFINE(USE_STRCOLL,[],[Experimental strcoll support])
! fi
  
  dnl Provide an option for enabling btree/incremental indexing for development
  AC_ARG_ENABLE(incremental,
Index: src/docprop.c
===================================================================
RCS file: /cvsroot/swishe/swish-e/src/docprop.c,v
retrieving revision 1.119
diff -c -r1.119 docprop.c
*** src/docprop.c	28 Apr 2004 20:17:27 -0000	1.119
--- src/docprop.c	5 Feb 2005 00:55:05 -0000
***************
*** 996,1004 ****
--- 996,1030 ----
          int rc;
          int len = Min( p1->propLen, p2->propLen );
  
+ #ifndef USE_STRCOLL
+ 
          rc = is_meta_ignore_case( meta_entry)
               ? strncasecmp( (char *)p1->propValue, (char *)p2->propValue, len )
               : strncmp( (char *)p1->propValue, (char *)p2->propValue, len );
+ 
+ #else
+ 	/* strcoll() takes locale dependent collation into account and
+ 	   works with unicode. Sadly, there's no strNcoll(). */
+         char *str1 = (char *)alloca(len + 1);
+         char *str2 = (char *)alloca(len + 1);
+ 
+         memcpy(str1, p1->propValue, len);
+         str1[len] = '\0';
+         memcpy(str2, p2->propValue, len);
+         str2[len] = '\0';
+ 
+  	rc = strcoll(str1, str2);
+ 
+ /*   	the bin2string() version is a little bit slower, I dunno */
+ /* 	if alloca() is available on all platforms though */
+ 
+ /* 	char *str1 = bin2string(p1->propValue, len); */
+ /* 	char *str2 = bin2string(p2->propValue, len); */
+ /* 	rc = strcoll(str1, str2); */
+ /* 	efree(str1), efree(str2); */
+ /* 	rc = strcoll((char *)p1->propValue, (char *)p2->propValue); */
+ 
+ #endif
               
          if ( rc != 0 )
              return rc;
Index: src/swish.c
===================================================================
RCS file: /cvsroot/swishe/swish-e/src/swish.c,v
retrieving revision 1.126
diff -c -r1.126 swish.c
*** src/swish.c	6 Dec 2004 22:08:43 -0000	1.126
--- src/swish.c	5 Feb 2005 00:55:06 -0000
***************
*** 169,174 ****
--- 169,177 ----
  
      setlocale(LC_CTYPE, "");
  
+ #ifdef USE_STRCOLL
+     setlocale(LC_COLLATE, "");
+ #endif
  
      /* Start a session */
      sw = swish_new();            /* Get swish handle */

--=-=-=--
Received on Fri Feb 4 17:07:35 2005