On Wed, 26 Feb 2003, Tref Gare wrote:
> > I suppose a useful -T option would be to dump as bytes the UTF-8 strings
> > that libxml2 is passing to swish. That would be helpful in debugging.
>
> Which -t option is this? Or are you just thinking it would be a nice
> to have feature?
Just thinking it would be nice to have so I can confirm what is happening.
> I've set ParserWarnLevel to 1 in the various config files I'm using
> and so far no errors are being reported. Do I also need to use the -v
> switch?
No, you don't need the -v option. Here's the code that reports an
encoding error when the parser warning level is 1 or higher:
    if ( ret == -2 ) // encoding failed
    {
        if ( parse_data->sw->parser_warn_level >= 1 )
            xmlParserWarning(parse_data->ctxt, "Failed to convert internal UTF-8 to Latin-1.\nR$
>
> As far as I can see, swish-e is happily indexing the files and storing
> them in UTF-8. When I look at the INDEXED_WORDS I see cinematheque in
> there as "cinÚmathÞque", which I'm assuming is just the terminal's best
> attempt to display the Unicode.
No. I think it's more likely that the terminal is assuming a different
encoding and thus displaying it incorrectly.
Swish does not store the text in UTF-8, but rather in 8859-1. It's only
UTF-8 while it's inside the libxml2 parser.
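To make that conversion step concrete, here is a minimal sketch of a UTF-8 to 8859-1 converter. It is not the swish-e or libxml2 code; the function name utf8_to_latin1 and its return codes are made up for illustration (the -2 for "cannot be represented" only imitates the check quoted above).

```c
#include <stddef.h>

/* Hypothetical helper, not swish-e's or libxml2's code.
 * Converts UTF-8 in `in` to ISO-8859-1 in `out`.
 * Returns the number of bytes written, -1 if the input is malformed or
 * `out` is too small, and -2 if a character does not exist in 8859-1. */
int utf8_to_latin1(const unsigned char *in, size_t inlen,
                   unsigned char *out, size_t outcap)
{
    size_t i = 0, o = 0;

    while (i < inlen) {
        unsigned char b = in[i];
        unsigned int cp;

        if (b < 0x80) {                      /* ASCII: one byte, unchanged */
            cp = b;
            i += 1;
        } else if ((b & 0xE0) == 0xC0) {     /* lead byte of a 2-byte sequence */
            if (i + 1 >= inlen || (in[i + 1] & 0xC0) != 0x80)
                return -1;                   /* truncated or bad continuation */
            cp = ((unsigned int)(b & 0x1F) << 6) | (in[i + 1] & 0x3F);
            if (cp < 0x80)
                return -1;                   /* overlong encoding */
            i += 2;
        } else if ((b & 0xF0) == 0xE0 || (b & 0xF8) == 0xF0) {
            return -2;                       /* U+0800 and up never fits in one byte */
        } else {
            return -1;                       /* stray continuation or invalid byte */
        }

        if (cp > 0xFF)
            return -2;                       /* outside the 8859-1 range */
        if (o >= outcap)
            return -1;                       /* output buffer full */
        out[o++] = (unsigned char)cp;
    }
    return (int)o;
}
```

For the bytes c3 a9 this computes ((0xc3 & 0x1f) << 6) | (0xa9 & 0x3f) = 0xe9, which is the 8859-1 code for the accented e.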
> When that index then gets searched,
> it's failing to translate the unicode back into ISO-8859-1. Instead
> it returns "?". That "?" is then translated by our java htmlencode
> function as � which cannot be displayed.
I think it's more likely that the Java code doesn't realize that the
text is 8859-1.
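One plausible way that could produce a "?" or a replacement character, assuming (and I'm only guessing) that the Java side decodes the bytes as UTF-8: a lone 8859-1 byte such as e9 is not a valid UTF-8 sequence. The sketch below is a simplified structural check with a made-up name, looks_like_utf8; it is not taken from swish-e or from any Java library.

```c
#include <stddef.h>

/* Hypothetical check, for illustration only: returns 1 if the buffer has
 * the lead-byte/continuation-byte structure of UTF-8, 0 otherwise.
 * It does not enforce every rule of the standard (e.g. overlong forms). */
int looks_like_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = s[i];
        size_t need, k;

        if (b < 0x80)                need = 0;   /* ASCII */
        else if ((b & 0xE0) == 0xC0) need = 1;   /* 2-byte sequence */
        else if ((b & 0xF0) == 0xE0) need = 2;   /* 3-byte sequence */
        else if ((b & 0xF8) == 0xF0) need = 3;   /* 4-byte sequence */
        else return 0;                           /* stray continuation or invalid lead */

        if (need > len - i - 1)
            return 0;                            /* sequence runs off the end */
        for (k = 1; k <= need; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;                        /* not a continuation byte */
        i += need + 1;
    }
    return 1;
}
```

With 8859-1 input, the byte e9 looks like the start of a 3-byte sequence, but the "m" that follows is not a continuation byte, so the check fails. A decoder in that position typically substitutes a placeholder character.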
I just edited parser.c (see diff below) to print out the string as libxml2
has it stored, and then print it out after it's been converted to 8859-1.
Here's the input file:
$ cat doc
cinémathèque
Now, libxml2 is set up as a SAX parser. The parser reads in chunks, and
when it has something of interest (like a bit of text) it calls callback
functions in the swish-e code (in parser.c). So in this case you can see
it process the above "document" in pieces because of the entities.
How this goes through mail is another issue:
moseley@bumby:~/swish-e/src$ ./swish-e -i doc -T indexed_words -v0
So it's passing the first three chars "cin" to swish-e's code to convert
to 8859-1. The # +digit is just a count, followed by the char display and
its hex and decimal value. Then you can see the three chars *after* the
conversion to 8859-1 (which is the same for the first three chars).
# 0: char [c] byte 63 (99)
# 1: char [i] byte 69 (105)
# 2: char [n] byte 6e (110)
8859-1 # 0: char [c] byte 63 (99)
8859-1 # 1: char [i] byte 69 (105)
8859-1 # 2: char [n] byte 6e (110)
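Now the entity is processed. First you see the UTF-8 encoding of the
accented e, the two bytes 195 169 (hex c3 a9). Then you can see the
8859-1 representation, which is the single byte 233 (hex e9). The (195)
printed on that line is wrong: my debug printf passes txt[i] rather than
buf->buffer[i] for the decimal field, so it shows the first UTF-8 byte
again.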
# 0: char [Ã] byte c3 (195)
# 1: char [©] byte a9 (169)
8859-1 # 0: char [é] byte e9 (195)
Then more of the same:
# 0: char [m] byte 6d (109)
# 1: char [a] byte 61 (97)
# 2: char [t] byte 74 (116)
# 3: char [h] byte 68 (104)
8859-1 # 0: char [m] byte 6d (109)
8859-1 # 1: char [a] byte 61 (97)
8859-1 # 2: char [t] byte 74 (116)
8859-1 # 3: char [h] byte 68 (104)
# 0: char [Ã] byte c3 (195)
# 1: char [¨] byte a8 (168)
8859-1 # 0: char [è] byte e8 (195)
# 0: char [q] byte 71 (113)
# 1: char [u] byte 75 (117)
# 2: char [e] byte 65 (101)
# 3: char [
] byte a (10)
8859-1 # 0: char [q] byte 71 (113)
8859-1 # 1: char [u] byte 75 (117)
8859-1 # 2: char [e] byte 65 (101)
8859-1 # 3: char [
] byte a (10)
Adding:[1:swishdefault(1)] 'cinémathèque' Pos:2 Stuct:0x9 ( BODY FILE )
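The chunked delivery seen in the trace above can be sketched as a callback that is handed one piece of text at a time. The callback below is only shaped like libxml2's SAX "characters" handler (context pointer, byte pointer, length); the struct text_buf and the name on_characters are invented for this sketch and are not the swish-e code, which also converts each chunk to 8859-1.

```c
#include <string.h>

/* Hypothetical accumulator for text chunks delivered by a SAX-style parser. */
struct text_buf {
    unsigned char data[256];  /* collected bytes */
    int len;                  /* bytes used so far */
    int chunks;               /* number of callback invocations stored */
};

/* Called once per chunk; appends the chunk to the buffer. */
void on_characters(void *ctx, const unsigned char *ch, int len)
{
    struct text_buf *tb = ctx;

    if (len <= 0 || len > (int)sizeof(tb->data) - tb->len)
        return;               /* sketch only: drop what does not fit */
    memcpy(tb->data + tb->len, ch, (size_t)len);
    tb->len += len;
    tb->chunks++;
}
```

Feeding it "cin", then the two bytes c3 a9, then "math" and so on mirrors the trace: plain text and each entity arrive as separate calls.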
What I'm not clear on is how libxml2 knows what the source encoding is.
I assume from my locale setting, but it may also just look at the source
text.
As a side note, if I cut-n-paste your word from above it ends up looking
like this:
'cinmathque'
I had to "Read" in the above from saving the output to a file.
But if I paste into xemacs it pastes with the accents. I also cannot
paste into my xterm shell window. There's a Linux Unicode FAQ somewhere
that might explain how to fix that.
Seems odd that the xterm window can show the accents and that I can paste
from the xterm shell into xemacs, but not back into the shell or into
Pine. But I guess that's not really a swish-e issue.
Just makes it hard to search for that word by pasting with my mouse.
$ cvs diff -u parser.c
Index: parser.c
===================================================================
RCS file: /cvsroot/swishe/swish-e/src/parser.c,v
retrieving revision 1.46
diff -u -r1.46 parser.c
--- parser.c 25 Nov 2002 21:22:53 -0000 1.46
+++ parser.c 26 Feb 2003 05:53:44 -0000
@@ -846,6 +846,13 @@
int used;
+{
+ int i;
+ for ( i = 0; i < txtlen; i++ )
+ printf("# %2d: char [%c] byte %2hhx (%2hhu)\n", i, txt[i], txt[i], txt[i] );
+}
+
+
/* (re)allocate buf if needed */
if ( txtlen >= buf->max )
@@ -870,6 +877,13 @@
if ( used > 0 ) // tally up total bytes consumed
buf->cur += used;
+if ( ret == 0 )
+{
+ int i;
+ for ( i = 0; i < buf->cur; i++ )
+ printf("8859-1 # %2d: char [%c] byte %2hhx (%2hhu)\n", i, buf->buffer[i], buf->buffer[i], txt[i] );
+}
+
if ( ret == 0 ) // all done
return;
@@ -900,6 +914,8 @@
return;
}
}
+
+
}
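If you want the same kind of dump outside of parser.c, a standalone version might look like the sketch below. The names format_dump_line and dump_bytes are made up, and the output format only approximates the one in the patch; the point is that the character, hex and decimal fields are all taken from the same byte.

```c
#include <stdio.h>

/* Hypothetical helper: formats one dump line for byte `b` at index `i`.
 * Returns what snprintf returns (the length the full line would need). */
int format_dump_line(char *out, size_t cap, const char *label,
                     int i, unsigned char b)
{
    return snprintf(out, cap, "%s# %2d: char [%c] byte %2x (%3u)",
                    label, i, b, (unsigned)b, (unsigned)b);
}

/* Prints one line per byte of `buf`, each prefixed with `label`. */
void dump_bytes(const char *label, const unsigned char *buf, int len)
{
    char line[64];
    int i;

    for (i = 0; i < len; i++) {
        format_dump_line(line, sizeof line, label, i, buf[i]);
        puts(line);
    }
}
```

Calling dump_bytes("", txt, txtlen) before the conversion and dump_bytes("8859-1 ", converted, n) after it gives the two views side by side.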
>
> I'm going slightly loopy on this one. Any further guidance will get you guaranteed positions on the Christmas card list (honest).
>
> cheers
>
> ------------------------------------------------------
> Tref Gare
> Development Consultant
> Areeba
>
--
Bill Moseley moseley@hank.org
Received on Wed Feb 26 06:27:00 2003