On Thu, Dec 11, 2003 at 04:02:04AM -0800, John Angel wrote:
> I understand that libxml2 converts everything to utf-8.
>
> I don't see why we could not convert everything back to original 8-bit using
> other function instead of UTF8Toisolat1()? It seems that we even do not need
> to know what was the original charset.
Correct. That's why I was suggesting you could use iconv() in parser.c.
It might not be that much of a hack to replace the latin1 conversion
with your windows-1250 conversion.
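As a rough illustration of that idea, here is a minimal sketch of a UTF-8 to
windows-1250 conversion built on iconv(). The helper name utf8_to_cp1250()
and its signature are hypothetical (they are not part of swish-e or libxml2),
and the sketch assumes the platform's iconv knows the "WINDOWS-1250" charset
name.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper, not part of swish-e or libxml2.
 * Converts inlen bytes of UTF-8 from `in` into windows-1250 in `out`.
 * Returns the number of bytes written, or -1 on any failure
 * (unknown charset, invalid or unconvertible input, output too small). */
long utf8_to_cp1250(const char *in, size_t inlen, char *out, size_t outcap)
{
    assert(in != NULL && out != NULL);

    /* iconv_open() takes (tocode, fromcode). */
    iconv_t cd = iconv_open("WINDOWS-1250", "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    char *src = (char *)in;   /* iconv() wants char **, it does not modify the input */
    char *dst = out;
    size_t srcleft = inlen;
    size_t dstleft = outcap;

    size_t rc = iconv(cd, &src, &srcleft, &dst, &dstleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -1;

    return (long)(outcap - dstleft);
}
```

The target charset is hard-coded here only to keep the sketch short; passing
it in as a parameter would be the obvious next step.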
I have thought about using iconv() in the past, but supporting it in a
general way would require other changes to the code (updates to the
config process, the index header format, and the parser), and I haven't
had the time. I have also thought of it as a work-around rather than the
real fix that using utf-8 internally would be. It might be easier to do
a complete rewrite than to convert to utf-8, though.
> Regarding tolower(), it should behave the same way - first we convert
> everything to utf-8, then do the tolower_utf8() and then convert everything
> back to 8-bit.
Where's tolower_utf8() defined? Doing the tolower on the utf-8 is
possible -- but it's not trivial because of the way the input buffer is
managed. Currently the input text buffer is shared between properties
and text for indexing -- so those buffers would need to be split (we
don't want to tolower() the properties).
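To make the buffer-splitting point concrete, here is a small sketch under
stated assumptions. Both function names are hypothetical and nothing here
reflects how swish-e actually manages its buffers. The lowercasing covers
only ASCII and the Latin-1 Supplement range (U+00C0 to U+00DE, skipping the
multiplication sign U+00D7). A real tolower_utf8() would need full Unicode
case tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: lowercase a NUL-terminated UTF-8 string in place.
 * Handles ASCII A-Z and U+00C0..U+00DE (except U+00D7) only. In UTF-8 those
 * Latin-1 letters are the two bytes 0xC3 0x80..0x9E, and their lowercase
 * forms are 0xC3 0xA0..0xBE, so the byte length never changes. */
void tolower_utf8_sketch(unsigned char *s)
{
    assert(s != NULL);
    while (*s) {
        if (*s >= 'A' && *s <= 'Z') {
            *s += 0x20;
            s++;
        } else if (*s == 0xC3 && s[1] >= 0x80 && s[1] <= 0x9E && s[1] != 0x97) {
            s[1] += 0x20;
            s += 2;
        } else {
            s++;   /* any other byte is left unchanged */
        }
    }
}

/* Hypothetical helper: return a lowercased private copy for indexing, so
 * the original text (used as the property value) keeps its case.
 * The caller frees the result. Returns NULL if allocation fails. */
char *lowercase_copy_for_index(const char *text)
{
    assert(text != NULL);
    size_t n = strlen(text) + 1;
    char *copy = malloc(n);
    if (copy == NULL)
        return NULL;
    memcpy(copy, text, n);
    tolower_utf8_sketch((unsigned char *)copy);
    return copy;
}
```

The cost of this approach is one extra copy of the text per document, which
is the trade-off for keeping the property text untouched.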
> Of course, search script has to know what is the input charset so it can
> properly translate the input to utf8. Checkout the parameters when searching
> using Google - it does the same. This way we can even introduce full utf-8
> support at least for the search script.
What action should swish-e take when converting utf-8 on input and
there's a conversion failure?
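To show where such a failure surfaces, here is a sketch of one possible
policy: substitute '?' for each input sequence iconv() rejects and count the
substitutions. It is an illustration only, not a recommendation; skipping the
document or aborting are equally plausible answers. The helper name is
hypothetical, and the sketch assumes iconv() leaves the input pointer at the
start of the offending sequence when it fails with EILSEQ or EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper, not part of swish-e or libxml2.
 * Converts UTF-8 to windows-1250, writing '?' for every sequence that
 * fails to convert. *nbad receives the number of substitutions.
 * Returns bytes written, or -1 if the charset is unknown or `out` is full. */
long utf8_to_cp1250_lossy(const char *in, size_t inlen,
                          char *out, size_t outcap, size_t *nbad)
{
    assert(in != NULL && out != NULL && nbad != NULL);
    *nbad = 0;

    iconv_t cd = iconv_open("WINDOWS-1250", "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    char *src = (char *)in;
    char *dst = out;
    size_t srcleft = inlen;
    size_t dstleft = outcap;

    while (srcleft > 0) {
        if (iconv(cd, &src, &srcleft, &dst, &dstleft) != (size_t)-1)
            break;                      /* everything converted */

        if (errno == E2BIG || dstleft == 0) {
            iconv_close(cd);            /* no room left in the output */
            return -1;
        }

        /* EILSEQ or EINVAL: skip one lead byte plus any continuation
         * bytes (10xxxxxx) that follow it, then emit a placeholder. */
        src++;
        srcleft--;
        while (srcleft > 0 && ((unsigned char)*src & 0xC0) == 0x80) {
            src++;
            srcleft--;
        }
        *dst++ = '?';
        dstleft--;
        (*nbad)++;
    }

    iconv_close(cd);
    return (long)(outcap - dstleft);
}
```

Returning the substitution count lets the caller decide afterwards whether
the document is still worth indexing.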
--
Bill Moseley
moseley@hank.org
Received on Thu Dec 11 14:23:14 2003