Re: Fw: Re: 8-bit chars

From: John Angel <angel_john(at)not-real.hotmail.com>
Date: Thu Dec 11 2003 - 12:02:29 GMT
I understand that libxml2 converts everything to utf-8.

I don't see why we could not convert everything back to the original 8-bit
encoding using another function instead of UTF8Toisolat1(). It seems that we
do not even need to know what the original charset was.
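
A minimal sketch of the "convert back with another function" idea, assuming the system iconv library is available and recognises the charset names passed in. convert_charset() is a hypothetical helper for illustration, not something that exists in swish-e or libxml2. Note that iconv_open() has to be given the target charset by name, so that name would still have to come from somewhere (the document's meta tag or a config setting, for example).

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of swish-e or libxml2): convert the
 * NUL-terminated string `in` from charset `from` to charset `to`.
 * Returns the number of bytes written to `out` (not counting the NUL
 * that is appended), or -1 on failure with errno set by iconv. */
long convert_charset(const char *from, const char *to,
                     const char *in, char *out, size_t outsize)
{
    assert(from && to && in && out && outsize > 0);

    iconv_t cd = iconv_open(to, from);   /* target charset comes first */
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = (char *)in;              /* iconv() only reads the input */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outsize - 1;        /* keep room for the NUL */

    /* One call is enough for short strings in stateless 8-bit charsets;
     * stateful encodings would also need a final flush call. */
    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    int saved = errno;
    iconv_close(cd);

    if (rc == (size_t)-1) {
        errno = saved;                   /* e.g. EILSEQ or E2BIG */
        return -1;
    }
    *outp = '\0';
    return (long)(outp - out);
}
```

For example, convert_charset("UTF-8", "CP1250", "\xC4\x91", buf, sizeof buf) should produce the single byte f0, the small "d" with a line through it discussed in the quoted message below. Asking for an 8-bit charset that lacks the character is expected to fail or substitute, depending on the iconv implementation.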

Regarding tolower(), it should behave the same way: first we convert
everything to UTF-8, then apply tolower_utf8(), and then convert everything
back to 8-bit.
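
For comparison, here is a small sketch of what the plain byte-wise tolower() does to high-bit bytes, assuming the default "C" locale. lower_bytes() and high_bytes_unchanged_in_c_locale() are hypothetical names used only for this illustration. In the "C" locale only A-Z are treated as uppercase, so bytes such as d0 and f0 (capital and small "d" with a stroke in windows-1250) pass through untouched, which is why a charset-aware lowercasing step would be needed for them.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: lowercase a buffer one byte at a time, the way
 * a plain tolower() loop would. */
void lower_bytes(unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        /* tolower() expects an unsigned char value (or EOF) */
        buf[i] = (unsigned char)tolower(buf[i]);
    }
}

/* Returns 1 if, under the "C" locale, ASCII is lowercased while the
 * high-bit bytes 0xD0 and 0xF0 are left exactly as they were.
 * Note: this changes the process-wide LC_CTYPE setting. */
int high_bytes_unchanged_in_c_locale(void)
{
    unsigned char sample[] = { 'A', 0xD0, 0xF0 };

    setlocale(LC_CTYPE, "C");
    lower_bytes(sample, sizeof sample);

    return sample[0] == 'a' && sample[1] == 0xD0 && sample[2] == 0xF0;
}
```

Under a different LC_CTYPE locale the result for the high-bit bytes may differ, so the locale in effect at indexing and at search time matters.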

Of course, the search script has to know what the input charset is so it can
properly translate the input to UTF-8. Check out the parameters when searching
with Google; it does the same. This way we can even introduce full UTF-8
support, at least for the search script.


----- Original Message ----- 
From: "Bill Moseley" <moseley@hank.org>
To: "John Angel" <angel_john@hotmail.com>
Cc: "Multiple recipients of list" <swish-e@sunsite.berkeley.edu>
Sent: Thursday, December 11, 2003 00:09
Subject: Re: [SWISH-E] Re: Fw: Re: 8-bit chars


> On Wed, Dec 10, 2003 at 03:00:22PM -0800, John Angel wrote:
> > Bill, I want to leave everything exactly as it was in the original.
> > Nothing else. Is that possible?
>
> Not when using libxml2, because it does character conversions.
> You can use the HTML parser, but it leaves a lot to be desired compared
> to the libxml2 parser (it's helpful to try both parsers and compare what
> gets indexed).
>
> Still, there is character conversion via the tolower() function.  I'm
> not sure if that would cause problems or not.  I assume not in this case
> (i.e. tolower would ignore those high bit chars).  It would be
> interesting to try a simple C program and see what tolower does.
>
> If you do leave your encoding at windows-1250 then I assume you would
> need to be sure that's also the case on searching.
>
>
> >
> >
> > ----- Original Message ----- 
> > From: "Bill Moseley" <moseley@hank.org>
> > To: "John Angel" <angel_john@hotmail.com>
> > Cc: "Multiple recipients of list" <swish-e@sunsite.berkeley.edu>
> > Sent: Wednesday, December 10, 2003 23:40
> > Subject: Re: [SWISH-E] Fw: Re: 8-bit chars
> >
> >
> > > On Wed, Dec 10, 2003 at 01:42:17PM -0800, John Angel wrote:
> > > > Here it is:
> > >
> > > Hi John,
> > >
> > > I'm not sure what you are asking.  If I index with the HTML parser the
> > > chars are indexed.  If I index with the libxml2 parser they are not
> > > indexed (of course I had to add the characters to *Characters
> > > settings).
> > >
> > > Note what happens if I use the iconv utility:
> > >
> > > moseley@bumby:~$ iconv -f WINDOWS-1250 -t LATIN1 test.htm
> > > <HTML>
> > > <META HTTP-EQUIV="Content-Type" CONTENT="text/html;
> > > charset=Windows-1250">
> > >
> > > <P>Non-english chars: iconv: illegal input sequence at position 108
> > >
> > > 108 is 6c hex:
> > >
> > > 00000060  6c 69 73 68 20 63 68 61  72 73 3a 20 f0 2c 20 9e  |lish chars: ., .|
> > >
> > > Which is f0.  That's a valid windows-1250 char (a small "d" with a
> > > line through it).  If there's no 8859-1 character like that then it
> > > makes sense it won't convert.
> > >
> > > I'm not sure what you want.  Do you want to convert to the
> > > Windows-1250 character set instead of 8859-1 when parsing?  If so,
> > > you would need to edit parser.c and use the iconv library to do your
> > > conversion.  I suppose you would have to carefully edit your
> > > WordCharacters (and other) settings so you are adding the right
> > > characters (based on your editor's character set).  And as I
> > > mentioned, swish-e uses the tolower() function and the LC_CTYPE
> > > locale is set to the default type.  So case conversion may end up
> > > with odd results for some characters.
> > >
> > > I'm not sure why swish-e sets the LC_CTYPE locale.
> > >
> > > Interesting that when I read the test.htm file with mozilla and a web
> > > server it ignores the meta tag and says the file is 8859-1, but if I
> > > read it without the web server it says it's Windows-1250.
> > >
> > >
> > > -- 
> > > Bill Moseley
> > > moseley@hank.org
> > >
> > >
> >
>
> -- 
> Bill Moseley
> moseley@hank.org
>
>
Received on Thu Dec 11 12:02:38 2003