Re: Fw: Re: 8-bit chars

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sun Dec 14 2003 - 15:20:02 GMT
On Sun, Dec 14, 2003 at 12:03:05AM -0800, John Angel wrote:
> > Yes, that's been on my todo list for a long time.  Just adding iconv
> > support to parser.c would not be too hard.  It's all the other stuff
> > that goes along with that that's the issue.
> 
> What other stuff should be modified also?

That stuff in my brain.

> > > Beside all 8-bit charsets supported that way, there should be one more
> > > possible value (e.g. TargetCharset "as-is"), suggesting that documents
> > > should be indexed exactly in the same encoding as they were originally.
> >
> > As I said yesterday, that doesn't make sense.  I tried to explain why I
> > don't think it can work.  Maybe you can explain in detail how it can
> > work.
> 
> It is the same implementation as for target charset, I don't see why it
> shouldn't be done? It makes a lot of sense when you try to index documents
> in different languages and encodings.

I guess I was looking more for a technical explanation instead of using
desire and good sense to make it work.


Remember those old video cards that could display only 256 colors
on the screen?  You could use a color palette to map those 256 colors
to a 3-byte value, allowing selection of any of the 16 million colors.
You could use any color, but only 256 of them on the same screen.  The
claim on the video card box, "Displays 16 million colors!!", wasn't
exactly true.

Swish-e works the same way.  You can only have 256 characters
represented in the index at a time.  Instead of a color palette,
there are character sets or encodings, each of which maps the entire
set of 256 byte values to a set of characters (glyphs, I think, is the
right term).

As it turns out, there are more than 256 glyphs in the world's
languages.  Just like that video card, you can display any of the
world's characters (assuming you have the font files), but if your data
is only 8 bit you can only display one set of 256 at a time.

So the encoding in the index defines what glyphs can be stored in the
index, and thus which 256 characters can be searched.  In case you
missed this point: you cannot store more than 256 distinct characters
in the index at any given time.  That is what using an 8-bit encoding
means.
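
To make the palette analogy concrete, here is a small Python sketch.
It is not Swish-e code; the byte value and the three encodings are
picked only for illustration.  It shows that one stored byte stands for
a different character under each 8-bit encoding:

```python
# One byte value looked up in three different 8-bit "palettes".
raw = bytes([0xE9])

as_latin1 = raw.decode("iso-8859-1")    # Western European
as_cyrillic = raw.decode("iso-8859-5")  # Cyrillic
as_greek = raw.decode("iso-8859-7")     # Greek

# Same byte, three different characters.
for name, ch in [("iso-8859-1", as_latin1),
                 ("iso-8859-5", as_cyrillic),
                 ("iso-8859-7", as_greek)]:
    print(f"{name}: U+{ord(ch):04X} {ch}")
```

The byte itself carries no hint about which of the three was meant;
that information has to come from somewhere else.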

There is NO WAY to store more than one encoding in the index as it is
currently designed.

And that's exactly what you are asking to do.  You want to have libxml2
convert the document back to its original encoding when storing the
words in the index -- "as-is" -- and that's trying to store more than
one encoding in the index at the same time.
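
A rough sketch of what goes wrong, assuming a toy index that simply
keys on the raw 8-bit bytes of each word (the `index` dict and
`add_word` helper are made up for this example and are not how Swish-e
is implemented; the one-letter "words" are contrived):

```python
# Hypothetical "as-is" index: keys are the raw 8-bit bytes of each word,
# with no record of which encoding produced them.
index = {}

def add_word(doc_id, word, encoding):
    key = word.encode(encoding)
    index.setdefault(key, set()).add(doc_id)

# A Latin-1 document containing e-acute and a Cyrillic document
# containing the letter shcha.
add_word("french.html", "\u00e9", "iso-8859-1")
add_word("russian.html", "\u0449", "iso-8859-5")

# Both words were stored under the same key, b'\xe9'.
print(index)
```

A search for either character would return both documents, and the
index has no way to tell which character the stored byte was supposed
to be.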

> All other open source engines have similar full 8-bit support.
> 
> ht://dig has "translate_latin1" attribute for conversion to latin1. If set
> to false, it will act as I described - "as-is" conversion.

Huh? No, that option has to do with translating entities.  Note the
comment about "to avoid these entities being mapped to inappropriate
8-bit characters".

 translate_latin1

    type:
        boolean 
    used by:
        htdig 
    default:
        true 
    description:
        If set to false, the SGML entities for ISO-8859-1 (or Latin 1) 
        characters above &nbsp; (or &#160;) will not be translated into 
        their 8-bit equivalents. This attribute should be set to false 
        when using a locale that doesn't use the ISO-8859-1 character set, 
        to avoid these entities being mapped to inappropriate 8-bit 
        characters shown in a different character set in search results.
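
The problem that description warns about can be reproduced in a few
lines of Python.  This is only an illustration of the effect, not
ht://Dig's code, and the sample string and ISO-8859-5 locale are my own
choices:

```python
import html

# Translate the entity to its Latin-1 character, then to one 8-bit byte.
text = html.unescape("caf&eacute;")
stored = text.encode("iso-8859-1")

# Display those same bytes in a locale that uses a different character set.
shown = stored.decode("iso-8859-5")

print(text, stored, shown)
```

The e-acute entity ends up displayed as a Cyrillic letter, which is the
"inappropriate 8-bit character" the option is there to avoid.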



-- 
Bill Moseley
moseley@hank.org
Received on Sun Dec 14 15:20:12 2003