Bill Moseley writes:
> I do remember wondering if there should only be one config option
> instead of three. That is instead of:
>
> PropertyNamesCompareCase
> PropertyNamesIgnoreCase
> PropertyNamesUseStrcoll
>
> If it might be better to use:
>
> PropertyNamesCase (compare|ignore|strcoll) <metaname> [, <metaname> ]
Since CompareCase and IgnoreCase are mutually exclusive, and setting
UseStrcoll invalidates the other choices in the current
implementation, the latter makes more sense to me. But I guess we
should have the old directives around for compatibility's sake?
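
To make the compatibility idea concrete, here is a rough sketch of how a single mode keyword and the three old directive names could map onto one internal setting. The enum, the table type, and both function names are made up for illustration and are not taken from the swish-e source; case-insensitive matching of the names is also just an assumption.

```c
#include <stddef.h>
#include <strings.h>   /* strcasecmp() (POSIX) */

/* Hypothetical internal representation of the compare mode. */
enum prop_compare {
    PROP_CMP_INVALID = -1,
    PROP_CMP_CASE,         /* strcmp()     */
    PROP_CMP_IGNORE_CASE,  /* strcasecmp() */
    PROP_CMP_STRCOLL       /* strcoll()    */
};

struct mode_name {
    const char *name;
    enum prop_compare mode;
};

/* Linear lookup; the tables are tiny, so nothing fancier is needed. */
static enum prop_compare lookup(const struct mode_name *table, size_t n,
                                const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcasecmp(table[i].name, name) == 0)
            return table[i].mode;
    return PROP_CMP_INVALID;
}

/* New style: PropertyNamesCase (compare|ignore|strcoll) ... */
static enum prop_compare mode_from_keyword(const char *keyword)
{
    static const struct mode_name keywords[] = {
        { "compare", PROP_CMP_CASE },
        { "ignore",  PROP_CMP_IGNORE_CASE },
        { "strcoll", PROP_CMP_STRCOLL },
    };
    return lookup(keywords, sizeof keywords / sizeof keywords[0], keyword);
}

/* Old style: the three existing directives stay valid as aliases. */
static enum prop_compare mode_from_legacy_directive(const char *directive)
{
    static const struct mode_name directives[] = {
        { "PropertyNamesCompareCase", PROP_CMP_CASE },
        { "PropertyNamesIgnoreCase",  PROP_CMP_IGNORE_CASE },
        { "PropertyNamesUseStrcoll",  PROP_CMP_STRCOLL },
    };
    return lookup(directives, sizeof directives / sizeof directives[0],
                  directive);
}
```

Since both paths end in the same enum, a property can only ever have one mode, which removes the "UseStrcoll invalidates the other choices" ambiguity.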
> I had some concerns here:
>
> http://swish-e.org/archive/2005-02/9007.html
>
> I suspect why I didn't update the documentation is due to those
> questions I had.
Sorry for not following up back then. I'll try to answer some of them
here:
> On Tue, Feb 08, 2005 at 11:50:08AM +0100, Andreas Seltenreich wrote:
> > > I know you mentioned this before, but what does strcoll do with case?
> > > I wonder what to do with the is_meta_ignore_case test there.
> >
> > I hope I understood you correctly here. strcoll() is a bit orthogonal
> > to strcasecmp(). Basically, it is a strcmp(), but with a
> > locale-dependent order of the characters.
> Sorry, I wasn't writing clearly. What does someone do if they want
> strcoll() but ignoring case? Is it a matter of getting a locale that
> sorts in that way?
I'm afraid there isn't such a locale. The ones applicable to
iso-8859-1 use "AaBbCc" (i.e. the distance between "A" and "a" is 1
instead of strcasecmp's 0). But I don't think users care about the
slightly different (less random) order of the result. IMHO sorting
with strcoll() under a non-"C"-locale is "as good as" strcasecmp for
the user.
Users can also create a custom collating sequence with localedef.
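
For anyone who wants to see the difference between the three functions, here is a small sketch with one qsort() comparator for each. The helper sort_words() and the comparator names are invented for this example. What by_strcoll produces depends on the LC_COLLATE locale in effect, so I make no claim about its output outside the "C" locale, where strcoll() orders strings the same way strcmp() does.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>   /* strcasecmp() (POSIX) */

/* qsort() passes pointers to the array elements, i.e. pointers to
 * "const char *" here, so each comparator dereferences once. */
static int by_strcmp(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* "A" and "a" compare equal, so their relative order after qsort()
 * is unspecified. */
static int by_strcasecmp(const void *a, const void *b)
{
    return strcasecmp(*(const char *const *)a, *(const char *const *)b);
}

/* Order follows the current LC_COLLATE setting. */
static int by_strcoll(const void *a, const void *b)
{
    return strcoll(*(const char *const *)a, *(const char *const *)b);
}

/* Hypothetical helper: sort n strings in place with the given comparator. */
static void sort_words(const char **words, size_t n,
                       int (*cmp)(const void *, const void *))
{
    qsort(words, n, sizeof *words, cmp);
}
```

Calling setlocale(LC_COLLATE, "") before sort_words(words, n, by_strcoll) picks up the collation from the environment; with by_strcmp the uppercase letters always sort before the lowercase ones.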
> If we have to use tolower() ( or toupper() ) to do a "ignorecase"
> compare then have to go back to making a copy in memory. The question
> is where to make that copy?
> One option would be to add a new entry into the property so there's
> both the normal prop string and also a tolower() version of the
> string. Then in Compare_Properties use p1->propValue_lower when
> "ignorecase" is set. That would be fastest, but most memory
> intensive.
> The other option is to make the copy and tolower() in
> Compare_Properties using alloca if available, otherwise bin2string (or
> strdup since the props should be null terminated now) to make a copy
> and lowercase it. Your tests were reasonably fast, but it just bugs
> me to do too much work inside a function called by qsort.
The library's production OPAC is still using 2.4.3 patched with my
first approach of using alloca() and zero-terminated copies before
feeding strcoll() within Compare_Properties(), and the time for
sorting is negligible within the several hours needed for indexing. So
I'm not so scared of the copying.
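
Roughly, the copying approach looks like the sketch below. This is not the actual patch: struct prop is a simplified stand-in for the real property structure, the function name is invented, and <alloca.h> is where glibc declares alloca(), which is non-standard and may live elsewhere on other systems.

```c
#include <alloca.h>    /* alloca(); non-standard, location is platform-specific */
#include <stddef.h>
#include <string.h>

/* Simplified, hypothetical stand-in for a property value:
 * raw bytes with an explicit length, not necessarily NUL-terminated. */
struct prop {
    size_t len;
    const unsigned char *value;
};

/* Compare two property values with strcoll().
 * strcoll() needs NUL-terminated strings, so each value is copied
 * into a stack buffer that is released automatically on return. */
static int compare_props_strcoll(const struct prop *p1, const struct prop *p2)
{
    char *s1 = alloca(p1->len + 1);
    char *s2 = alloca(p2->len + 1);

    memcpy(s1, p1->value, p1->len);
    s1[p1->len] = '\0';
    memcpy(s2, p2->value, p2->len);
    s2[p2->len] = '\0';

    return strcoll(s1, s2);
}
```

The obvious caveat is that alloca() has no way to report failure, so very long property values could overflow the stack; a real implementation would want a length limit or a heap fallback.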
But I still don't see the need for a "really" case insensitive
strcoll(), since IMHO this just adds some randomness to the order of
the result.
> Here's the way I implemented it. configure checks for strcoll() and
> if available swish is compiled with support for it. Then you can set
> the type of compare function:
> PropertyNamesCompareCase compare
> PropertyNamesIgnoreCase ignore
> PropertyNamesUseStrcoll strcoll
I'd be happy with your three current options of strcmp, strcasecmp and
strcoll as the i18n-version of strcasecmp.
> It's a bit confusing, if you did:
> PropertyNamesUseStrcoll prop1 prop2
> presortedIndex prop1
> then prop1 is sorted based on the LC_COLLATE setting at indexing time,
> but prop2 is sorted based on the LC_COLLATE setting at run time.
> Now that can break, because there's cases where swish has to go back
> and look at the properties during run time, even when there's a
> pre-sorted index available for that property.
> So if LC_COLLATE changes between indexing and run time then there will
> be odd sorting.
Can't we just put that under "Doctor, it hurts when I do this..."? :-)
No, seriously, I think this is comparable to issuing
    rm [A-Z]*
in a shell with the intention of deleting only files that start with
uppercase letters while LC_COLLATE is set to something other than "C"
or "POSIX". IMHO a warning in the manual would be sufficient.
> One solution would be to store the locale used at indexing time and
> use that at run time. That limits the ability to modify the sort
> order at run time, though. There's also the risk that a locale may be
> available where indexing is done, but not where searching is done.
> I'm not sure about if that should be per-property or per-index,
> though. And what if sorting by multiple properties? Would need to
> change locale inside a qsort function. I assume there's overhead in
> doing that.
I can't think of a use for sorting with multiple different locales
within the same index, as long as there is only support for
iso-8859-1.
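
If a single per-index locale were stored, applying it at search time could be as simple as the following sketch. Both function names are hypothetical, and where the locale name would be kept in the index is left open. It relies only on setlocale() returning NULL, and leaving the locale untouched, when the requested locale is not available.

```c
#include <locale.h>
#include <stddef.h>

/* Try to switch LC_COLLATE to the locale recorded at indexing time.
 * Returns 1 on success, 0 if no locale was recorded or it is not
 * installed on this machine (LC_COLLATE is then left unchanged). */
static int use_index_collation(const char *stored_locale)
{
    if (stored_locale == NULL)
        return 0;
    return setlocale(LC_COLLATE, stored_locale) != NULL;
}

/* Name of the collation locale currently in effect. */
static const char *current_collation(void)
{
    return setlocale(LC_COLLATE, NULL);
}
```

A return value of 0 is the "locale available where indexing is done, but not where searching is done" case, and would be the place to print a warning.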
> Also need to think about how merge works. What if there's conflicting
> locales?
Well, if Compare_Properties() were used on the keys to maintain the
b-tree itself, the tree could get corrupted when data is inserted
under different locales. But I don't see Compare_Properties() being
used for b-tree maintenance. So the worst case would be odd sorting
too, wouldn't it?
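
To illustrate the hypothetical failure mode (not something I claim swish-e does): any ordered structure breaks when it is built with one comparison function and searched with another. The sketch below uses a sorted array and bsearch() as a stand-in for a b-tree, and strcmp() versus strcasecmp() as a stand-in for two different locales; all names are invented for the example.

```c
#include <stdlib.h>
#include <string.h>
#include <strings.h>   /* strcasecmp() (POSIX) */

static int by_strcmp(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

static int by_strcasecmp(const void *a, const void *b)
{
    return strcasecmp(*(const char *const *)a, *(const char *const *)b);
}

/* Hypothetical helper: returns 1 if key is found by a binary search
 * that uses cmp, 0 otherwise. The array must be sorted by the SAME
 * order that cmp defines, or the search may miss existing keys. */
static int found(const char *key, const char **sorted, size_t n,
                 int (*cmp)(const void *, const void *))
{
    return bsearch(&key, sorted, n, sizeof *sorted, cmp) != NULL;
}
```

With the array { "B", "a", "c" }, which is sorted for strcmp(), a search for "B" using by_strcmp finds it, while the same search using by_strcasecmp looks at "a" first, goes right, and reports the key as missing even though it is there.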
regards,
Andreas
Received on Tue May 31 20:01:11 2005