What are the chances of the following features being implemented officially?
I suggest introducing a new attribute, e.g. TargetCharset, defining the
charset into which all documents are converted and indexed. The default
value should be "iso-8859-1" for backward compatibility.
E.g. if TargetCharset is "Windows-1250", it should work like this:
1) indexer: iconv(Windows-1250, utf-8) instead of UTF8Toisolat1()
2) indexer: setlocale(Windows-1250) on-the-fly
3) search script: setlocale(Windows-1250) on-the-fly
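
To make step 1 concrete, here is a rough sketch of what the conversion
could look like. It only assumes the standard iconv(3) interface; the
function name utf8_to_target() is made up for illustration and is not
existing swish-e code. Note that iconv_open() takes the target charset
first and the source charset second.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of swish-e): convert a NUL-terminated
 * UTF-8 string into target_charset, e.g. "WINDOWS-1250".
 * Returns the number of bytes written (excluding the NUL), or -1 if the
 * charset is unknown, the buffer is too small, or a character cannot be
 * represented in the target charset. */
long utf8_to_target(const char *target_charset, const char *in,
                    char *out, size_t out_size)
{
    iconv_t cd;
    char *src = (char *)in;   /* iconv() wants a non-const pointer */
    char *dst = out;
    size_t in_left, out_left;

    if (out_size == 0)
        return -1;

    cd = iconv_open(target_charset, "UTF-8");   /* (tocode, fromcode) */
    if (cd == (iconv_t)-1)
        return -1;

    in_left = strlen(in);
    out_left = out_size - 1;                    /* keep room for the NUL */

    if (iconv(cd, &src, &in_left, &dst, &out_left) == (size_t)-1) {
        iconv_close(cd);
        return -1;
    }

    iconv_close(cd);
    *dst = '\0';
    return (long)(dst - out);
}
```

With TargetCharset "Windows-1250" the indexer would call something like
utf8_to_target("WINDOWS-1250", text, buf, sizeof buf) where it now calls
UTF8Toisolat1(). What to do with characters that have no equivalent in
the target charset (skip, replace, abort) is left open here.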
Besides all the 8-bit charsets supported that way, there should be one more
possible value (e.g. TargetCharset "as-is"), meaning that documents
should be indexed in exactly the same encoding they were originally in.
It would look like this:
1) indexer: iconv(charset_of_the_document_being_indexed, utf-8) instead of
2) indexer: setlocale(charset_of_the_document_being_indexed) on-the-fly
3) search script:
----- Original Message -----
From: "Bill Moseley" <email@example.com>
To: "John Angel" <firstname.lastname@example.org>
Cc: "Multiple recipients of list" <email@example.com>
Sent: Thursday, December 11, 2003 21:20
Subject: Re: [SWISH-E] Re: Fw: Re: 8-bit chars
> On Thu, Dec 11, 2003 at 11:23:00AM -0800, John Angel wrote:
> > > You are free to modify parser.c to use iconv and convert back to
> > > Windows-1250, as I suggested. But that won't work for everyone else.
> > Is it possible to use iconv(charset_of_the_document_being_indexed,
> > instead of UTF8Toisolat1()?
> You mean convert from libxml2's internal utf-8 back to the encoding of
> the original document? Probably -- I assume there's some way to have
> libxml2 tell you what it was encoding from.
> But that would not work if you have documents of different encodings.
> The index itself has to be one encoding. That's why I was saying that
> iconv could be used with a configuration setting to say what 8-bit
> encoding to use.
> > > What tolower does depends on the tolower
> > > function swish-e was linked with.
> > setlocale(charset_of_the_document_being_indexed) on-the-fly?
> Well, you want tolower to work for the encoding that the index is
> encoded in.
> Bill Moseley
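
On the tolower point, a small sketch of the idea as I understand it:
tolower() works on single bytes according to the current LC_CTYPE locale,
so the locale has to match the encoding of the index. Both function names
below are made up for illustration. Also note that setlocale() takes a
locale name, not a charset name; a name such as "cs_CZ.CP1250" is only an
assumed example and depends on which locales are installed on the system.

```c
#include <ctype.h>
#include <locale.h>

/* Hypothetical helper: select the character-classification locale that
 * matches the encoding of the index. Returns 1 on success, 0 if the
 * locale is not available on this system. */
int use_index_locale(const char *locale_name)
{
    return setlocale(LC_CTYPE, locale_name) != NULL;
}

/* Hypothetical helper: lowercase an 8-bit string in place using
 * whatever LC_CTYPE locale is currently active. */
void lower_8bit(char *s)
{
    for (; *s != '\0'; s++)
        *s = (char)tolower((unsigned char)*s);
}
```

In the plain "C" locale only A-Z are folded, so a byte such as 0xC8
passes through unchanged. It is only lowercased to 0xE8 once a locale
for a matching 8-bit charset is active, and that is true for both the
indexer and the search script.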
Received on Sat Dec 13 14:49:04 2003