I18n with strcoll() (was: segmentation fault with version 2.5.4)

From: Andreas Seltenreich <andreas.seltenreich(at)not-real.ubka.uni-karlsruhe.de>
Date: Wed Jun 01 2005 - 03:01:10 GMT

Bill Moseley writes:

> I do remember wondering if there should only be one config option
> instead of three.  That is instead of:
>
>     PropertyNamesCompareCase
>     PropertyNamesIgnoreCase
>     PropertyNamesUseStrcoll
>
> If it might be better to use:
>
>     PropertyNamesCase (compare|ignore|strcoll)  <metaname> [, <metaname> ]

Since CompareCase and IgnoreCase are mutually exclusive, and setting
UseStrcoll invalidates the other choices in the current
implementation, the latter makes more sense to me. But I guess we
should have the old directives around for compatibility's sake?
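To make the compatibility idea concrete, here is a minimal sketch of how one combined directive and the three old ones could map onto a single setting. The enum and function names are hypothetical and not taken from the swish-e source, and the case-insensitive keyword matching is an assumption.

```c
#include <assert.h>
#include <string.h>
#include <strings.h>   /* strcasecmp() is POSIX, not ISO C */

/* Hypothetical type, not swish-e's own. */
typedef enum {
    CMP_INVALID = -1,
    CMP_COMPARE,   /* strcmp(): case-sensitive byte order */
    CMP_IGNORE,    /* strcasecmp(): case-insensitive */
    CMP_STRCOLL    /* strcoll(): order defined by LC_COLLATE */
} compare_mode;

/* Map the first argument of the proposed PropertyNamesCase directive. */
compare_mode mode_from_keyword(const char *kw)
{
    assert(kw != NULL);
    if (strcasecmp(kw, "compare") == 0) return CMP_COMPARE;
    if (strcasecmp(kw, "ignore") == 0)  return CMP_IGNORE;
    if (strcasecmp(kw, "strcoll") == 0) return CMP_STRCOLL;
    return CMP_INVALID;
}

/* Map the three existing directive names onto the same modes,
   so that old config files keep working. */
compare_mode mode_from_legacy_directive(const char *name)
{
    assert(name != NULL);
    if (strcasecmp(name, "PropertyNamesCompareCase") == 0) return CMP_COMPARE;
    if (strcasecmp(name, "PropertyNamesIgnoreCase") == 0)  return CMP_IGNORE;
    if (strcasecmp(name, "PropertyNamesUseStrcoll") == 0)  return CMP_STRCOLL;
    return CMP_INVALID;
}
```

With a mapping like this, both spellings end up as one value per property, which also makes the mutual exclusion explicit.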

> I had some concerns here:
>
>    http://swish-e.org/archive/2005-02/9007.html
>
> I suspect why I didn't update the documentation is due to those
> questions I had.

Sorry for not following up back then. I'll try to answer some of them
here:

> On Tue, Feb 08, 2005 at 11:50:08AM +0100, Andreas Seltenreich wrote:

> > > I know you mentioned this before, but what does strcoll do with case?
> > > I wonder what to do with the is_meta_ignore_case test there.
> > 
> > I hope I understood you correctly here. strcoll() is a bit orthogonal
> > to strcasecmp(). Basically, it is a strcmp(), but with a
> > locale-dependent order of the characters.

> Sorry, I wasn't writing clearly.  What does someone do if they want
> strcoll() but ignoring case?  Is it a matter of getting a locale that
> sorts in that way?

I'm afraid there isn't such a locale. The ones applicable to
iso-8859-1 use "AaBbCc" (i.e. the distance between "A" and "a" is 1
instead of strcasecmp's 0). But I don't think users care about the
slightly different (less random) order of the result. IMHO sorting
with strcoll() under a non-"C"-locale is "as good as" strcasecmp for
the user.

The user could also create a custom collating sequence using
localedef.
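A small helper makes it easy to try this out. The sketch below compares two strings with strcoll() under a named LC_COLLATE locale and restores the previous setting afterwards. The helper names are made up for illustration, and which locale names are installed depends on the system.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>   /* strcasecmp() is POSIX, not ISO C */

/* Reduce a comparison result to -1, 0 or 1. */
int sign(int v) { return (v > 0) - (v < 0); }

/* Compare a and b with strcoll() under the given LC_COLLATE locale.
   Returns -1, 0 or 1, or 2 if setlocale() rejects the locale name.
   The previous LC_COLLATE setting is restored before returning. */
int collate_under(const char *locale_name, const char *a, const char *b)
{
    char saved[256];
    const char *cur = setlocale(LC_COLLATE, NULL);
    int result;

    assert(locale_name != NULL && a != NULL && b != NULL);
    /* Copy the name: a later setlocale() call may overwrite it. */
    snprintf(saved, sizeof saved, "%s", cur != NULL ? cur : "C");

    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return 2;

    result = sign(strcoll(a, b));
    setlocale(LC_COLLATE, saved);
    return result;
}
```

Under the "C" locale, collate_under("C", "B", "a") agrees with strcmp() and puts "B" first, while strcasecmp("A", "a") reports a tie. Under a locale with the "AaBbCc" sequence (for example a de_DE one, if installed), one would expect "a" to sort before "B".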

> If we have to use tolower() ( or toupper() ) to do a "ignorecase"
> compare then have to go back to making a copy in memory.  The question
> is where to make that copy?

> One option would be to add a new entry into the property so there's
> both the normal prop string and also a tolower() version of the
> string.  Then in Compare_Properties use p1->propValue_lower when
> "ignorecase" is set.  That would be fastest, but most memory
> intensive.

> The other option is to make the copy and tolower() in
> Compare_Properties using alloca if available, otherwise bin2string (or
> strdup since the props should be null terminated now) to make a copy
> and lowercase it.  Your tests were reasonably fast, but it just bugs
> me to do too much work inside a function called by qsort.

The library's production OPAC is still using 2.4.3 patched with my
first approach of using alloca() and zero-terminated copies before
feeding strcoll() within Compare_Properties(), and the time for
sorting is negligible within the several hours needed for indexing. So
I'm not so scared of the copying.
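For illustration, the copy-then-compare idea looks roughly like the sketch below. It is not the actual patch: the prop_value struct is a hypothetical stand-in for a counted property buffer, and malloc() is used instead of alloca() to stay within ISO C.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a property value: bytes plus a length,
   not necessarily NUL-terminated. */
typedef struct {
    const char *data;
    size_t len;
} prop_value;

/* Copy a counted buffer into a fresh NUL-terminated string. */
static char *terminated_copy(const prop_value *p)
{
    char *s = malloc(p->len + 1);
    if (s == NULL)
        abort();               /* keep the sketch short: no recovery */
    memcpy(s, p->data, p->len);
    s[p->len] = '\0';
    return s;
}

/* qsort() callback: order two prop_value entries with strcoll(),
   which needs NUL-terminated input, hence the temporary copies. */
int compare_props_strcoll(const void *a, const void *b)
{
    char *s1, *s2;
    int r;

    assert(a != NULL && b != NULL);
    s1 = terminated_copy(a);
    s2 = terminated_copy(b);
    r = strcoll(s1, s2);
    free(s1);
    free(s2);
    return r;
}
```

The callback plugs into qsort(array, n, sizeof(prop_value), compare_props_strcoll). Two allocations per comparison is the cost in question; alloca() or a small stack buffer would avoid the heap traffic where available.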

But I still don't see the need for a "really" case insensitive
strcoll(), since IMHO this just adds some randomness to the order of
the result.

> Here's the way I implemented it.  configure checks for strcoll() and
> if available swish is compiled with support for it.  Then you can set
> the type of compare function:

>     PropertyNamesCompareCase compare
>     PropertyNamesIgnoreCase ignore
>     PropertyNamesUseStrcoll strcoll

I'd be happy with your three current options of strcmp, strcasecmp and
strcoll, with strcoll serving as the i18n version of strcasecmp.

> It's a bit confusing, if you did:

>    PropertyNamesUseStrcoll prop1 prop2
>    PreSortedIndex prop1

> then prop1 is sorted based on the LC_COLLATE setting at indexing time,
> but prop2 is sorted based on the LC_COLLATE setting at run time.

> Now that can break, because there's cases where swish has to go back
> and look at the properties during run time, even when there's a
> pre-sorted index available for that property.

> So if LC_COLLATE changes between indexing and run time then there will
> be odd sorting.

Can't we just put that under "Doctor, it hurts when I do this..."? :-)
No, seriously, I think this is comparable to issuing a

   rm [A-Z]*

in a shell with the intention to delete only files starting with
uppercase letters while LC_COLLATE is set to something other than "C"
or "POSIX". IMHO a warning in the manual would be sufficient.

> One solution would be to store the locale used at indexing time and
> use that at run time.  That limits the ability to modify the sort
> order at run time, though.  There's also the risk that a locale may be
> available where indexing is done, but not where searching is done.

> I'm not sure about if that should be per-property or per-index,
> though.  And what if sorting by multiple properties?  Would need to
> change locale inside a qsort function.  I assume there's overhead in
> doing that.

I can't imagine a use for sorting with multiple different locales
within the same index, as long as there is only support for
iso-8859-1.
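If the locale were stored once per index, as the quoted suggestion considers, the run-time side could be reduced to a mismatch check that triggers a warning. This is only a sketch: the index_header struct and both function names are hypothetical and do not exist in swish-e.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-index header field holding the LC_COLLATE name
   that was active at indexing time. */
typedef struct {
    char collate_locale[64];
} index_header;

/* Indexing time: remember the current LC_COLLATE locale name. */
void record_collate_locale(index_header *h)
{
    const char *cur = setlocale(LC_COLLATE, NULL);  /* query only */

    assert(h != NULL);
    snprintf(h->collate_locale, sizeof h->collate_locale, "%s",
             cur != NULL ? cur : "C");
}

/* Search time: returns 1 if the current LC_COLLATE locale has the
   same name as the recorded one, 0 otherwise. */
int collate_locale_matches(const index_header *h)
{
    const char *cur = setlocale(LC_COLLATE, NULL);

    assert(h != NULL);
    return cur != NULL && strcmp(cur, h->collate_locale) == 0;
}
```

A caller could print a warning when collate_locale_matches() returns 0 and still sort with the run-time locale, which keeps the sort order adjustable at search time.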

> Also need to think about how merge works.  What if there's conflicting
> locales?

Well, if Compare_Properties() were used on the keys to maintain the
b-tree itself, the tree could get corrupted when data is inserted
under different locales. But I don't see Compare_Properties() being
used for b-tree maintenance. So the worst case would be odd sorting
too, wouldn't it?

regards,
Andreas
Received on Tue May 31 20:01:11 2005