Bill Moseley writes:
> On Sat, Feb 05, 2005 at 02:03:31AM +0100, Andreas Seltenreich wrote:
> The strings passed in are \0 terminated. The properties
> (propEntry->propValue array) are not null-terminated. If you are
> seeing a \0 at the end of the propValue it's just by chance. Only the
> length of the string is copied in memory.
Indeed :-/
> Also, properties can be appended, so you couldn't put the null on the
> end until parsing of a given document is complete.
Right.
> If you look at CreateProperty() you can see where the docProp is
> allocated:
>
> docProp=(propEntry *) emalloc(sizeof(propEntry) + propLen);
> memcpy(docProp->propValue, propValue, propLen);
> docProp->propLen = propLen;
>
> sizeof(propEntry) is returning more bytes than needed, so there's
> actually room at the end. So this should work:
>
> docProp=(propEntry *) emalloc(sizeof(propEntry) + propLen);
> memcpy(docProp->propValue, propValue, propLen);
> docProp->propValue[propLen] = '\0';
> docProp->propLen = propLen;
So the same trick in append_property() and we're done? That sounds
almost too easy :-)
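
To make sure we mean the same thing, here is a minimal, self-contained sketch of the idea for both the create and the append case. It is not the swish-e code: demoProp, demo_create() and demo_append() are made-up names, plain malloc()/realloc() stand in for emalloc(), and I reserve the extra byte explicitly (+ 1) instead of relying on the slack inside sizeof(propEntry).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for swish-e's propEntry; the real struct differs. */
typedef struct {
    size_t propLen;             /* length of the value, not counting the NUL */
    unsigned char propValue[];  /* value bytes, followed by one '\0' */
} demoProp;

/* Allocate a property and NUL-terminate its value. */
static demoProp *demo_create(const unsigned char *value, size_t len)
{
    demoProp *p = malloc(sizeof(demoProp) + len + 1);
    if (p == NULL)
        return NULL;
    memcpy(p->propValue, value, len);
    p->propValue[len] = '\0';
    p->propLen = len;
    return p;
}

/* Append bytes and re-terminate. Returns the (possibly moved) property,
 * or NULL on allocation failure, in which case p is left untouched. */
static demoProp *demo_append(demoProp *p, const unsigned char *value, size_t len)
{
    demoProp *q = realloc(p, sizeof(demoProp) + p->propLen + len + 1);
    if (q == NULL)
        return NULL;
    memcpy(q->propValue + q->propLen, value, len);
    q->propLen += len;
    q->propValue[q->propLen] = '\0';
    return q;
}
```

Since propLen keeps excluding the terminator, code that only looks at the length should not notice the change.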
>> Seeing that the penalty of dynamic allocation inside
>> Compare_Properties() using alloca() is smaller than that of
>> the switch from strncasecmp() to strcoll(), I am tempted to suggest
>> using the latter version, as it is more robust, less invasive and
>> easily left in parallel with the old strncasecmp()/strcmp() code.
>
> I'd agree, but how portable is alloca()? I suppose we could just test
> for it in configure when using --enable-strcoll.
alloca() doesn't seem to be standardised at all (it is in neither ISO C
nor POSIX). So I guess an autoconf-selected fallback that copies the
value with bin2string() would be mandatory.
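
Roughly what I have in mind, as a sketch only: demo_coll_cmp() is a made-up helper, not existing swish-e code, and since I don't want to guess at bin2string()'s signature the fallback branch simply uses malloc(). HAVE_ALLOCA and HAVE_ALLOCA_H are the macros autoconf's AC_FUNC_ALLOCA defines.

```c
#include <stdlib.h>
#include <string.h>
#ifdef HAVE_ALLOCA_H
# include <alloca.h>
#endif

/* Compare two length-counted, unterminated buffers with strcoll() by
 * making NUL-terminated copies first. Hypothetical helper. */
static int demo_coll_cmp(const char *a, size_t alen, const char *b, size_t blen)
{
    int result;
#ifdef HAVE_ALLOCA
    char *sa = alloca(alen + 1);   /* stack copies, released on return */
    char *sb = alloca(blen + 1);
#else
    char *sa = malloc(alen + 1);   /* portable fallback: heap copies */
    char *sb = malloc(blen + 1);
    if (sa == NULL || sb == NULL) {
        /* Out of memory: fall back to a plain bytewise comparison. */
        size_t n = alen < blen ? alen : blen;
        free(sa);
        free(sb);
        result = memcmp(a, b, n);
        return result != 0 ? result : (alen > blen) - (alen < blen);
    }
#endif
    memcpy(sa, a, alen);
    sa[alen] = '\0';
    memcpy(sb, b, blen);
    sb[blen] = '\0';
    result = strcoll(sa, sb);
#ifndef HAVE_ALLOCA
    free(sa);
    free(sb);
#endif
    return result;
}
```

Note that strcoll() stops at the first \0, so a property value with an embedded NUL would only be compared up to that point.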
> I know you mentioned this before, but what does strcoll do with case?
> I wonder what to do with the is_meta_ignore_case test there.
I hope I understood you correctly here. strcoll() is a bit orthogonal
to strcasecmp(). Basically, it is a strcmp(), but with a
locale-dependent order of the characters.
Using LC_COLLATE=C, the characters are collated the same as in ASCII:
ABC...Zabc...z, so strcoll() behaves exactly like strcmp(). By
switching the locale, the sequence differs from ASCII. With en_US
it'll look like this: "AaBbCc..Zz", so the order of the sorted
properties turns out similar to the one produced by strcasecmp(). Think
of strcasecmp(s1, s2) as strcmp(tolower(s1), tolower(s2)), with
tolower() applied to every character of the string.
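
The LC_COLLATE=C case is easy to check with a few lines of C. This is only an illustration; demo_sign() and demo_same_order() are names I made up for it.

```c
#include <locale.h>
#include <string.h>
#include <strings.h>   /* strcasecmp() (POSIX) */

/* Reduce a comparison result to -1, 0 or 1 so results can be compared. */
static int demo_sign(int r)
{
    return (r > 0) - (r < 0);
}

/* Returns 1 if strcoll() and strcmp() order a and b the same way under
 * the LC_COLLATE locale that is currently in effect. */
static int demo_same_order(const char *a, const char *b)
{
    return demo_sign(strcoll(a, b)) == demo_sign(strcmp(a, b));
}
```

After setlocale(LC_COLLATE, "C"), demo_same_order("Zebra", "apple") returns 1 and both functions put "Zebra" first, whereas strcasecmp("Zebra", "apple") puts "apple" first. What strcoll() does under en_US depends on the locale data installed on the system, so I left that case out of the sketch.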
I don't know what the right way would be to deal with the difference.
Maybe instead of a case:ignore flag one should introduce a
collate:<insert locale here> flag, and adjust the locale appropriately
on each document property. So the user would still be able to choose a
per-property collating sequence.
>> setlocale(LC_CTYPE, "");
>>
>> + #ifdef USE_STRCOLL
>> + setlocale(LC_COLLATE, "");
>> + #endif
>
> Can (should?) that just be:
>
> setlocale(LC_ALL, "");
Theoretically yes, but I don't know the code well enough to decide
that. Setting LC_ALL will switch on locale-awareness for a lot of
other functions. For example, it'll also set LC_NUMERIC, which makes
the number format read and written by the printf()/scanf() functions
(e.g. the decimal separator) depend on the selected locale. I could
imagine this breaking some parts of the code, or users' code that
depends on machine-readable output from swish-e.
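
A small sketch of why I would rather set the categories one by one (demo_numeric_untouched() and demo_format() are made-up names): as long as only LC_CTYPE and LC_COLLATE are taken from the environment, LC_NUMERIC stays at its startup value "C" and printf() keeps printing a decimal point.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Enable locale-aware character classification and collation only.
 * Returns 1 if LC_NUMERIC is still the "C" locale afterwards. */
static int demo_numeric_untouched(void)
{
    setlocale(LC_CTYPE, "");
    setlocale(LC_COLLATE, "");
    /* A NULL locale argument only queries the current setting. */
    const char *numeric = setlocale(LC_NUMERIC, NULL);
    return numeric != NULL && strcmp(numeric, "C") == 0;
}

/* Format a double; the radix character used depends on LC_NUMERIC. */
static void demo_format(char *buf, size_t size, double value)
{
    snprintf(buf, size, "%.1f", value);
}
```

With setlocale(LC_ALL, "") and, say, a German locale in the environment, the same demo_format() call would produce "1,5" instead of "1.5", which is the kind of surprise I am worried about.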
regards,
Andreas
Received on Tue Feb 8 02:53:14 2005