Re: TranslateCharacters - clarification required

From: Tref Gare <TrefG(at)not-real.areeba.com.au>
Date: Mon Mar 03 2003 - 02:05:34 GMT
To close the archival loop on this issue - here was the end result.

After much playing around, the culprit was discovered to be within the java wrapper.  We needed to explicitly tell the Servlet that it was to expect ISO-8859-1 encoded results.  

Specifically (for the javically inclined), our wrapper was calling swish-e with the following code:

  try {
      // Run the search command through the platform's shell.
      String[] command = new String[3];
      String osName = System.getProperty("os.name");
      if (osName.indexOf("Windows") != -1) {
          command[0] = "cmd.exe";
          command[1] = "/c";
      } else {
          command[0] = "bash";
          command[1] = "-c";
      }
      command[2] = searchString;
      Process process = Runtime.getRuntime().exec(command);
      BufferedReader input = new BufferedReader(new InputStreamReader(process.getInputStream()));

  } catch (IOException ioe) {
      System.out.println("Damn");
  }

The fix we implemented was to alter the following line

BufferedReader input = new BufferedReader(new InputStreamReader(process.getInputStream()));

To explicitly state the encoding it was expecting:

BufferedReader input = new BufferedReader(new InputStreamReader(process.getInputStream(), "ISO-8859-1"));
            

This didn't explain why display of special characters was originally working (for several months) and then stopped, but it appears to have locked it down.
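
For the archive, here is a minimal sketch of the same idea in isolation. An InputStreamReader created without a charset decodes with the JVM's platform default, so a change in the default (locale, JVM upgrade, startup flags) is one possible explanation for the earlier behaviour, though nothing in this thread confirms it. The class and method names are made up for illustration, and a ByteArrayInputStream stands in for process.getInputStream().

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ExplicitCharsetReader {

    // Reads every line from the stream, decoding bytes with the given
    // charset instead of whatever the platform default happens to be.
    static List<String> readLines(InputStream in, Charset charset) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumption: the child process writes ISO-8859-1 bytes, as swish-e did here.
        byte[] latin1 = "cin\u00E9math\u00E8que\n".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("platform default charset: " + Charset.defaultCharset());
        System.out.println(readLines(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1));
    }
}
```

Passing the charset explicitly makes the result independent of the machine the servlet runs on.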

Many thanks to Bill, David and the list for all your assistance and advice.  

My hair can begin regrowing.

Tref

------------------------------------------------------
Tref Gare
Development Consultant
Areeba
Level 19/114 William St, Melbourne VIC 3000
email: trefg@areeba.com.au
phone: +61 3 9642 5553
fax: +61 3 9642 1335
website: http://www.areeba.com.au
------------------------------------------------------
"This email is intended only for the use of the individual or entity named above and contains information that is confidential. No confidentiality is waived or lost by any mis-transmission. If you received this correspondence in error, please notify the sender and immediately delete it from your system. You must not disclose, copy or rely on any part of this correspondence if you are not the intended recipient. Any communication directed to clients via this message is subject to our Agreement and relevant Project Schedule. Any information that is transmitted via email which may offend may have been sent without knowledge or the consent of Areeba."
------------------------------------------------------

-----Original Message-----
From: Bill Moseley [mailto:moseley@hank.org] 
Sent: Wednesday, 26 February 2003 5:25 PM
To: Tref Gare
Cc: swish-e@sunsite.berkeley.edu
Subject: RE: [SWISH-E] Re: TranslateCharacters - clarification required

On Wed, 26 Feb 2003, Tref Gare wrote:

> > I suppose a useful -T option would be to dump as bytes the UTF-8 strings
> > that libxml2 is passing to swish.  That would be helpful in debugging.
> 
> Which -t option is this? Or are you just thinking it would be a nice
> to have feature?

Just thinking it would be nice to have so I can confirm what is happening.

> I've set ParserWarnLevel to 1 in the various config files I'm using
> and so far no errors are being reported.  Do I also need to use the -v
> switch?

No you don't need the -v option.  Here's the code where you can see that
an encoding error is displayed when parser warning level is one.

        if ( ret == -2 )        // encoding failed
        {
            if ( parse_data->sw->parser_warn_level >= 1 )
                xmlParserWarning(parse_data->ctxt, "Failed to convert internal UTF-8 to Latin-1.\nR$

> 
> As far as I can see, swish-e is happily indexing the files and storing
> them in UTF-8.  When I look at the INDEXED_WORDS I see cinematheque in
> there as "cinÚmathÞque", which I'm assuming is just the terminal's best
> attempt to display the Unicode.

No.  I think it's more likely that the terminal is assuming a different
encoding and thus displaying it incorrectly.

Swish does not store the text in UTF-8, but rather in 8859-1.  It's only
UTF-8 while it's inside the libxml2 parser.
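
A small sketch of how a terminal can produce exactly that display. It assumes the terminal was using code page 850, which is a guess (the thread never says which code page was in use); in that code page the ISO-8859-1 bytes e9 and e8 are shown as Ú and Þ. The class and method names are illustrative only, and IBM850 is an optional charset that may be missing from some Java runtimes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TerminalMojibake {

    // Encodes the text as ISO-8859-1 (how swish-e stores it) and then
    // decodes those bytes the way a terminal using another charset would.
    static String viewAs(String text, String terminalCharset) {
        byte[] stored = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(stored, Charset.forName(terminalCharset));
    }

    public static void main(String[] args) {
        String word = "cin\u00E9math\u00E8que";
        if (Charset.isSupported("IBM850")) {
            System.out.println(viewAs(word, "IBM850"));
        } else {
            System.out.println("IBM850 is not available in this runtime");
        }
    }
}
```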

> When that index then gets searched,
> it's failing to translate the unicode back into ISO-8859-1.  Instead
> it returns "?".  That "?" is then translated by our java htmlencode
> function as &#65533; which cannot be displayed.

I think it's more likely that the java doesn't realize that the text is
8859-1.
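
To illustrate where &#65533; can come from: when ISO-8859-1 bytes are decoded as UTF-8, Java substitutes U+FFFD (decimal 65533) for each byte sequence it cannot decode. The sketch below assumes the servlet was effectively decoding as UTF-8, which the thread does not confirm, and its htmlEncode is a simplified, hypothetical stand-in for the htmlencode function mentioned above (it only handles non-ASCII characters).

```java
import java.nio.charset.StandardCharsets;

class ReplacementCharDemo {

    // Hypothetical, simplified encoder: ASCII passes through unchanged,
    // everything else becomes a decimal numeric character reference.
    static String htmlEncode(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 128) {
                sb.append((char) cp);
            } else {
                sb.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] latin1 = "cin\u00E9math\u00E8que".getBytes(StandardCharsets.ISO_8859_1);

        // Wrong assumption about the bytes: invalid sequences become U+FFFD.
        String wrong = new String(latin1, StandardCharsets.UTF_8);
        // Correct assumption: every byte maps to exactly one character.
        String right = new String(latin1, StandardCharsets.ISO_8859_1);

        System.out.println(htmlEncode(wrong));
        System.out.println(htmlEncode(right));
    }
}
```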

I just edited parser.c (see diff below) to print out the string as libxml2
has it stored, and then print it out after it's been converted to 8859-1.

Here's the input file:

  $ cat doc
  cin&#233;math&#232;que

Now, libxml2 is set up as a SAX parser.  The parser reads in chunks and
when it has something of interest (like a bit of text) it calls "callback"
functions in the swish-e code (in parser.c).  So in this case you can see
it process the above "document" in bits due to the entities.
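
For readers more at home in Java, the same callback style can be sketched with the JDK's built-in SAX parser. This is only an analogy: it is not libxml2, the input is wrapped in a made-up <doc> element so that it is well-formed XML, and how the text is split across characters() calls is up to the parser, so the sketch just collects whatever chunks arrive.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxChunks {

    // Parses the XML and returns each piece of text exactly as the parser
    // delivered it to the characters() callback.
    static List<String> chunks(String xml) {
        List<String> out = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.add(new String(ch, start, length));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> parts = chunks("<doc>cin&#233;math&#232;que</doc>");
        System.out.println(parts.size() + " callback(s): " + parts);
        System.out.println("joined: " + String.join("", parts));
    }
}
```

A consumer of SAX events should therefore buffer text across callbacks rather than assume a word arrives in one piece.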

How this goes through mail is another issue:

moseley@bumby:~/swish-e/src$ ./swish-e -i doc -T indexed_words -v0

So it's passing the first three chars "cin" to swish-e's code to convert
to 8859-1.  The "#" followed by a number is just a count, followed by the
char display and its hex and decimal value.  Then you can see the three
chars *after* the conversion to 8859-1 (which is the same for the first
three chars).

#  0: char [c] byte 63 (99)
#  1: char [i] byte 69 (105)
#  2: char [n] byte 6e (110)
8859-1 #  0: char [c] byte 63 (99)
8859-1 #  1: char [i] byte 69 (105)
8859-1 #  2: char [n] byte 6e (110)

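Now the entity is processed.  First you see the UTF-8 encoding of the
accented e (bytes 195 169, hex c3 a9).  Then you can see the 8859-1
representation, which is the single byte hex e9 (decimal 233).  The "(195)"
printed on the 8859-1 line is an artifact of the debug printf in the diff
below, which passes txt[i] rather than buf->buffer[i] for the decimal value.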

#  0: char [Ã] byte c3 (195)
#  1: char [©] byte a9 (169)
8859-1 #  0: char [é] byte e9 (195)
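
The two encodings of the accented e can be checked independently with a short Java sketch: UTF-8 uses the two bytes c3 a9 (195, 169), while ISO-8859-1 uses the single byte e9, which is decimal 233. The class name and output format are illustrative and only loosely follow the trace above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteDump {

    // Lists each byte of the encoded text with its index, hex and decimal value.
    static String dump(String label, String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        byte[] bytes = text.getBytes(charset);
        for (int i = 0; i < bytes.length; i++) {
            int b = bytes[i] & 0xFF; // Java bytes are signed; mask to get 0..255
            sb.append(String.format("%s# %2d: byte %2x (%3d)%n", label, i, b, b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9";
        System.out.print(dump("UTF-8  ", eAcute, StandardCharsets.UTF_8));
        System.out.print(dump("8859-1 ", eAcute, StandardCharsets.ISO_8859_1));
    }
}
```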

Then more of the same:

#  0: char [m] byte 6d (109)
#  1: char [a] byte 61 (97)
#  2: char [t] byte 74 (116)
#  3: char [h] byte 68 (104)
8859-1 #  0: char [m] byte 6d (109)
8859-1 #  1: char [a] byte 61 (97)
8859-1 #  2: char [t] byte 74 (116)
8859-1 #  3: char [h] byte 68 (104)
#  0: char [Ã] byte c3 (195)
#  1: char [¨] byte a8 (168)
8859-1 #  0: char [è] byte e8 (195)
#  0: char [q] byte 71 (113)
#  1: char [u] byte 75 (117)
#  2: char [e] byte 65 (101)
#  3: char [
] byte  a (10)
8859-1 #  0: char [q] byte 71 (113)
8859-1 #  1: char [u] byte 75 (117)
8859-1 #  2: char [e] byte 65 (101)
8859-1 #  3: char [
] byte  a (10)
    Adding:[1:swishdefault(1)]   'cinémathèque'   Pos:2  Stuct:0x9 ( BODY FILE )


What I'm not clear on is how libxml2 knows what the source encoding is.  I
assume from my locale setting, but it may also just look at the source
text.


As a side note, if I cut-n-paste your word from above it ends up looking
like this:

  'cinmathque'

I had to "Read" in the above from saving the output to a file.

But if I paste into xemacs it pastes with the accents.  I also cannot
paste into my xterm shell window.  There's a Linux Unicode FAQ somewhere
that might explain how to fix that.

Seems odd that the xterm window can show the accents and that I can paste
from the xterm shell into xemacs, but not back into the shell or into
Pine.  But I guess that's not really a swish-e issue.

Just makes it hard to search for that word by pasting with my mouse.


$ cvs diff -u parser.c
Index: parser.c
===================================================================
RCS file: /cvsroot/swishe/swish-e/src/parser.c,v
retrieving revision 1.46
diff -u -r1.46 parser.c
--- parser.c    25 Nov 2002 21:22:53 -0000      1.46
+++ parser.c    26 Feb 2003 05:53:44 -0000
@@ -846,6 +846,13 @@
     int             used;
     
 
+{
+    int i;
+    for ( i = 0; i < txtlen; i++ )
+        printf("# %2d: char [%c] byte %2hhx (%2hhu)\n", i, txt[i], txt[i], txt[i] );
+}
+     
+    
     /* (re)allocate buf if needed */
     
     if ( txtlen >= buf->max )
@@ -870,6 +877,13 @@
         if ( used > 0 )         // tally up total bytes consumed
             buf->cur += used;
 
+if ( ret == 0 )
+{
+    int i;
+    for ( i = 0; i < buf->cur; i++ )
+        printf("8859-1 # %2d: char [%c] byte %2hhx (%2hhu)\n", i, buf->buffer[i],  buf->buffer[i], txt[i] );
+}
+
         if ( ret == 0 )         // all done
             return;
 
@@ -900,6 +914,8 @@
             return;
         }
     }
+
+
 }


> 
> I'm going slightly loopy on this one. Any further guidance will get you guaranteed positions on the Christmas card list (honest).
> 
> cheers
> 
> ------------------------------------------------------
> Tref Gare
> Development Consultant
> Areeba
> 

-- 
Bill Moseley moseley@hank.org
Received on Mon Mar 3 02:05:59 2003