Re: TranslateCharacters - clarification required

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Feb 26 2003 - 06:26:07 GMT

On Wed, 26 Feb 2003, Tref Gare wrote:

> > I suppose a useful -T option would be to dump as bytes the UTF-8 strings
> > that libxml2 is passing to swish.  That would be helpful in debugging.
> 
> Which -t option is this? Or are you just thinking it would be a nice
> to have feature?

Just thinking it would be nice to have, so you can confirm what is happening.

> I've set ParserWarnLevel to 1 in the various config files I'm using
> and so far no errors are being reported.  Do I also need to use the -v
> switch?

No, you don't need the -v option.  Here's the code where you can see that
an encoding error is reported when ParserWarnLevel is 1 or higher.

        if ( ret == -2 )        // encoding failed
        {
            if ( parse_data->sw->parser_warn_level >= 1 )
                xmlParserWarning(parse_data->ctxt, "Failed to convert internal UTF-8 to Latin-1.\nR$

> 
> As far as I can see, swish-e is happily indexing the files and storing
> them in UTF-8.  When I look at the INDEXED_WORDS I see cinematheque in
> there as "cinÚmathÞque", which I'm assuming is just the terminals best
> attempt to display the Unicode.

No.  I think it's more likely that the terminal is assuming a different
encoding and thus displaying it incorrectly.

Swish does not store the text in UTF-8, but rather in 8859-1.  It's only
UTF-8 while it's inside the libxml2 parser.

> When that index then gets searched,
> it's failing to translate the unicode back into ISO-8859-1.  Instead
> it returns "?".  That "?" is then translated by our java htmlencode
> function as &#65533; which cannot be displayed.

I think it's more likely that the Java code doesn't realize that the text
is 8859-1.
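
To illustrate why a UTF-8 decoder would give up on this text: in 8859-1 the
é is the single byte e9, which looks like the lead byte of a three-byte
UTF-8 sequence, and the "m" that follows is not a valid continuation byte.
Decoders typically substitute U+FFFD (decimal 65533) in that case.  Here is
a minimal sketch in C.  The function name is my own and is not part of
swish-e or libxml2, and it only checks the byte structure (it does not
reject overlong forms or surrogates):

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if the buffer has well-formed UTF-8 byte structure, 0 otherwise.
 * Hypothetical helper for illustration only. */
int looks_like_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL);
    while (i < len) {
        unsigned char c = s[i];
        size_t follow;          /* number of continuation bytes expected */
        size_t k;

        if (c < 0x80)                follow = 0;   /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) follow = 1;   /* 110xxxxx */
        else if ((c & 0xF0) == 0xE0) follow = 2;   /* 1110xxxx */
        else if ((c & 0xF8) == 0xF0) follow = 3;   /* 11110xxx */
        else return 0;          /* stray continuation or invalid lead byte */

        if (follow > len - i - 1)
            return 0;           /* sequence runs past the end of the buffer */

        for (k = 1; k <= follow; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;       /* not a 10xxxxxx continuation byte */

        i += follow + 1;
    }
    return 1;
}
```

Given the 8859-1 bytes for the word (63 69 6e e9 6d ...), this returns 0 at
the e9, which is the point where a decoder expecting UTF-8 would emit its
replacement character.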

I just edited parser.c (see diff below) to print out the string as libxml2
has it stored, and then print it out after it's been converted to 8859-1.

Here's the input file:

  $ cat doc
  cin&#233;math&#232;que

Now, libxml2 is set up as a SAX parser.  The parser reads in chunks and
when it has something of interest (like a bit of text) it calls
"callback" functions in the swish-e code (in parser.c).  So in this case
you can see it process the above "document" in pieces because of the
entities.
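
As a rough picture of what that chunked delivery means for the receiving
code, here is a toy sketch in C.  It does not use libxml2 at all: the
callback type, the buffer struct and the fake driver are all made up for
illustration, and the chunk boundaries are simply copied from the trace in
this message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a SAX-style "characters" callback. */
typedef void (*chars_cb)(void *user, const char *txt, int len);

struct word_buf {
    char   data[64];
    size_t cur;             /* bytes used so far */
};

/* Callback: append each chunk so the text is reassembled across calls. */
void append_chunk(void *user, const char *txt, int len)
{
    struct word_buf *b = user;

    assert(len >= 0 && b->cur + (size_t)len < sizeof b->data);
    memcpy(b->data + b->cur, txt, (size_t)len);
    b->cur += (size_t)len;
    b->data[b->cur] = '\0';
}

/* Toy driver: hands over the document in five pieces, split at the
 * two entities, with the accented letters as UTF-8 byte pairs. */
void fake_parse(chars_cb cb, void *user)
{
    static const char *chunks[] = {
        "cin", "\xc3\xa9", "math", "\xc3\xa8", "que\n"
    };
    size_t i;

    for (i = 0; i < sizeof chunks / sizeof chunks[0]; i++)
        cb(user, chunks[i], (int)strlen(chunks[i]));
}
```

Calling fake_parse(append_chunk, &buf) with a zeroed struct word_buf leaves
the whole word in buf.data, which is why a callback cannot assume that one
call corresponds to one word.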

How this goes through mail is another issue:

moseley@bumby:~/swish-e/src$ ./swish-e -i doc -T indexed_words -v0

So it's passing the first three chars "cin" to swish-e's code to convert
to 8859-1.  On each line the "#" and number are just an index, followed by
the char as displayed and its hex and decimal values.  Then you can see
the three chars *after* the conversion to 8859-1 (which are unchanged,
since they are plain ASCII).

#  0: char [c] byte 63 (99)
#  1: char [i] byte 69 (105)
#  2: char [n] byte 6e (110)
8859-1 #  0: char [c] byte 63 (99)
8859-1 #  1: char [i] byte 69 (105)
8859-1 #  2: char [n] byte 6e (110)

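Now the entity is processed.  First you see the UTF-8 encoding (bytes
195 169, hex c3 a9) of the accented e.  Then you can see the 8859-1
representation, which is the single byte 233 (hex e9).  The "(195)"
printed on that line is a slip in my debug code: it prints the decimal
value from the UTF-8 input (txt[i]) instead of from the converted buffer.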

#  0: char [Ã] byte c3 (195)
#  1: char [©] byte a9 (169)
8859-1 #  0: char [é] byte e9 (195)
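
The arithmetic behind that step can be sketched in a few lines of C.  This
is my own simplified illustration, not the routine swish-e or libxml2
actually uses; the function name and the buffer-full return value of -1
are assumptions, and the -2 return only mirrors the "ret == -2" check
shown earlier.

```c
#include <assert.h>
#include <stddef.h>

/* Convert UTF-8 to ISO-8859-1.  Returns the number of bytes written,
 * -1 if the output buffer is too small, or -2 if the input is malformed
 * or contains a character above U+00FF.  Hypothetical sketch. */
int utf8_to_latin1_sketch(const unsigned char *in, size_t inlen,
                          unsigned char *out, size_t outmax)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);
    while (i < inlen) {
        unsigned char c = in[i];
        unsigned int cp;

        if (c < 0x80) {
            cp = c;                     /* ASCII passes through unchanged */
            i += 1;
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < inlen
                   && (in[i + 1] & 0xC0) == 0x80) {
            /* Two-byte form 110xxxxx 10yyyyyy covers U+0080..U+00FF
             * when the lead byte is c2 or c3. */
            cp = ((c & 0x1Fu) << 6) | (in[i + 1] & 0x3Fu);
            i += 2;
        } else {
            return -2;                  /* not representable in 8859-1 */
        }

        if (o >= outmax)
            return -1;
        out[o++] = (unsigned char)cp;
    }
    return (int)o;
}
```

For the pair c3 a9 this gives ((0xc3 & 0x1f) << 6) | (0xa9 & 0x3f) = 0xe9,
which is decimal 233.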

Then more of the same:

#  0: char [m] byte 6d (109)
#  1: char [a] byte 61 (97)
#  2: char [t] byte 74 (116)
#  3: char [h] byte 68 (104)
8859-1 #  0: char [m] byte 6d (109)
8859-1 #  1: char [a] byte 61 (97)
8859-1 #  2: char [t] byte 74 (116)
8859-1 #  3: char [h] byte 68 (104)
#  0: char [Ã] byte c3 (195)
#  1: char [¨] byte a8 (168)
8859-1 #  0: char [è] byte e8 (195)
#  0: char [q] byte 71 (113)
#  1: char [u] byte 75 (117)
#  2: char [e] byte 65 (101)
#  3: char [
] byte  a (10)
8859-1 #  0: char [q] byte 71 (113)
8859-1 #  1: char [u] byte 75 (117)
8859-1 #  2: char [e] byte 65 (101)
8859-1 #  3: char [
] byte  a (10)
    Adding:[1:swishdefault(1)]   'cinémathèque'   Pos:2  Stuct:0x9 ( BODY FILE )


What I'm not clear on is how libxml2 knows what the source encoding is.  I
assume from my locale setting, but it may also just look at the source
text.


As a side note, if I cut-n-paste your word from above it ends up looking
like this:

  'cinmathque'

To include the output above I had to save it to a file and "Read" it in.

But if I paste into xemacs it pastes with the accents.  I also cannot
paste into my xterm shell window.  There's a Linux Unicode FAQ somewhere
that might explain how to fix that.

Seems odd that the xterm window can show the accents and that I can paste
from the xterm shell into xemacs, but not back into the shell or into
Pine.  But I guess that's not really a swish-e issue.

Just makes it hard to search for that word by pasting with my mouse.


$ cvs diff -u parser.c
Index: parser.c
===================================================================
RCS file: /cvsroot/swishe/swish-e/src/parser.c,v
retrieving revision 1.46
diff -u -r1.46 parser.c
--- parser.c    25 Nov 2002 21:22:53 -0000      1.46
+++ parser.c    26 Feb 2003 05:53:44 -0000
@@ -846,6 +846,13 @@
     int             used;
     
 
+{
+    int i;
+    for ( i = 0; i < txtlen; i++ )
+        printf("# %2d: char [%c] byte %2hhx (%2hhu)\n", i, txt[i], txt[i], txt[i] );
+}
+     
+    
     /* (re)allocate buf if needed */
     
     if ( txtlen >= buf->max )
@@ -870,6 +877,13 @@
         if ( used > 0 )         // tally up total bytes consumed
             buf->cur += used;
 
+if ( ret == 0 )
+{
+    int i;
+    for ( i = 0; i < buf->cur; i++ )
+        printf("8859-1 # %2d: char [%c] byte %2hhx (%2hhu)\n", i, buf->buffer[i],  buf->buffer[i], txt[i] );
+}
+
         if ( ret == 0 )         // all done
             return;
 
@@ -900,6 +914,8 @@
             return;
         }
     }
+
+
 }


> 
> I'm going slightly loopy on this one. Any further guidance will get you guaranteed positions on the Christmas card list (honest).
> 
> cheers
> 
> ------------------------------------------------------
> Tref Gare
> Development Consultant
> Areeba
> 

-- 
Bill Moseley moseley@hank.org
Received on Wed Feb 26 06:27:00 2003