Re: Fw: Re: 8-bit chars

From: Bill Moseley <moseley(at)>
Date: Wed Dec 10 2003 - 22:40:56 GMT
On Wed, Dec 10, 2003 at 01:42:17PM -0800, John Angel wrote:
> Here it is:

Hi John,

I'm not sure what you are asking.  If I index with the HTML parser the 
chars are indexed.  If I index with the libxml2 parser they are not 
indexed (of course I had to add the characters to *Characters settings).

Note what happens if I use the iconv utility:

moseley@bumby:~$ iconv -f WINDOWS-1250 -t LATIN1 test.htm
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; 

<P>Non-english chars: iconv: illegal input sequence at position 108

Position 108 is 6c hex:

00000060  6c 69 73 68 20 63 68 61  72 73 3a 20 f0 2c 20 9e  |lish chars: , .|

The byte at that offset is f0.  That's a valid Windows-1250 character (a 
small "d" with a stroke through it).  ISO-8859-1 has no such character 
(its f0 is the Icelandic eth), so it makes sense that it won't convert.
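
For anyone who wants to reproduce the offset arithmetic and the failed
conversion without iconv, here is a minimal Python sketch.  It only uses
the sixteen bytes from the hexdump row above, not the full test.htm, and
the variable names are mine.  It assumes Python's "cp1250" and "latin-1"
codecs behave like iconv's WINDOWS-1250 and LATIN1 for these bytes.

```python
# The hexdump row shown above, starting at file offset 0x60.
row_offset = 0x60
row = bytes.fromhex("6c 69 73 68 20 63 68 61 72 73 3a 20 f0 2c 20 9e")

# Offset of the first byte iconv complained about (0xf0).
bad_offset = row_offset + row.index(0xF0)

# Decoding as Windows-1250 works: every byte in the row is a valid character.
text = row.decode("cp1250")

# Collect the characters that have no ISO-8859-1 (Latin-1) equivalent.
unconvertible = []
for ch in text:
    try:
        ch.encode("latin-1")
    except UnicodeEncodeError:
        unconvertible.append(ch)

print(bad_offset, hex(bad_offset))
print([f"U+{ord(ch):04X}" for ch in unconvertible])
```

Both 8-bit characters in the row (f0 and 9e) fall outside ISO-8859-1, so
iconv would stop at the first one.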

I'm not sure what you want.  Do you want to convert to the Windows-1250 
character set instead of 8859-1 when parsing?  If so, you would need to 
edit parser.c and use the iconv library to do your conversion.  I 
suppose you would also have to carefully edit your WordCharacters (and 
other *Characters) settings so you are adding the right characters 
(based on your editor's character set).  And as I mentioned, swish-e 
uses the tolower() function and the LC_CTYPE locale is set to the 
default type.  So case conversion may end up with odd results for some 
characters.
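
To illustrate the case-conversion concern, here is a rough Python
analogy.  It is not swish-e code.  Python's bytes.lower() only touches
ASCII letters, which is roughly what tolower() does in a plain "C"
locale, while decoding first makes the conversion aware of the
character set.  The byte d0 (capital "D" with a stroke in Windows-1250)
is just an example I picked.

```python
# 0xd0 is capital D-with-stroke in Windows-1250; 0xf0 is the lowercase form.
upper_byte = b"\xd0"

# ASCII-only lowercasing leaves the 8-bit byte untouched.
ascii_only = upper_byte.lower()

# Charset-aware lowercasing: decode, lowercase, encode back.
charset_aware = upper_byte.decode("cp1250").lower().encode("cp1250")

print(ascii_only.hex(), charset_aware.hex())
```

If the index is built with one behaviour and the query is folded with
the other, the two forms of the word would not match.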

I'm not sure why swish-e sets the LC_CTYPE locale.

Interestingly, when I load the test.htm file in Mozilla through a web
server, it ignores the meta tag and says the file is 8859-1, but if I
open it without the web server it says it's Windows-1250.

Bill Moseley
Received on Wed Dec 10 22:41:03 2003