Re: [swish-e] [Patch] recode on the fly to UTF-8

From: Bill Moseley <moseley(at)>
Date: Thu Apr 10 2008 - 06:43:45 GMT
On Mon, Apr 07, 2008 at 12:26:28PM +0200, Cedric Jeanneret wrote:
> Hello!
> I had some charset problems with swish-e when I had to display context
> with results.
> In fact, all my systems are in utf-8, and some docs are still in iso
> format (coming from Windows...).
> As I want to output everything in utf-8, here's what I do:
> File :
> [snip]
> use CharsetDetector;
> [snip]
> use Locale::Recode;
> [snip]
>     if ( $description_prop ) {
>         $description = $this_result->{ $description_prop } || '';
>         $char = uc (CharsetDetector::detect($description) );
>         my $cd = Locale::Recode->new (from => uc($char), to => 'UTF-8');
>         $cd->recode($description);
>     }
> And now all my output is in utf-8.

But limited to latin-1 characters.

If you are using the libxml2 parser, all text is converted to Latin-1
before swish processes and stores it.

$ fgrep lat1 swish-e/src/parser.c
        ret = UTF8Toisolat1( (unsigned char *)start_buf, &used, (const unsigned char *)txt, &inlen );

Since it's all Latin-1 there's no need to try to detect the charset
(I'm not sure how well that module tells different 8-bit encodings
apart, if at all).

Technically speaking, the script (swish.cgi) should be decoding from
latin-1 on input when reading results from swish-e, but it doesn't.
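
A minimal sketch of that decoding step, assuming a property value arrives
from swish-e as raw Latin-1 bytes.  The helper name decode_swish_value and
the sample string are made up for illustration; this is not what swish.cgi
does today:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn a Latin-1 byte string read from swish-e
# into a string of Perl characters.
sub decode_swish_value {
    my ($bytes) = @_;
    return decode('ISO-8859-1', $bytes);
}

# "caf\xE9" stands in for a property value read from the index.
my $raw_description = "caf\xE9";
my $description     = decode_swish_value($raw_description);

# Prints: 4 characters, last code point U+00E9
printf "%d characters, last code point U+%04X\n",
    length($description), ord(substr($description, -1));
```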

If you want the output encoded as UTF-8 then, instead of encoding just
parts of the page (i.e. the description), I'd recommend buffering your
page into a single scalar and calling Encode::encode_utf8() right
before sending it to the browser, with the correct HTTP headers added.

Again, you will only end up with the set of chars in latin-1 encoded
as utf8.
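
Something along these lines, as a rough sketch rather than a patch for
swish.cgi.  The @raw_results values and the page markup are invented for
the example, the Latin-1 decode on input is my assumption about how the
values arrive, and HTML escaping is left out to keep it short:

```perl
use strict;
use warnings;
use Encode qw(decode encode_utf8);

# Hypothetical stand-ins for result properties read from swish-e
# (Latin-1 bytes).
my @raw_results = ("caf\xE9", "na\xEFve");

# Build the whole page as one scalar of decoded characters.
my $page = "<html><body>\n";
for my $raw (@raw_results) {
    my $text = decode('ISO-8859-1', $raw);
    $page .= "<p>$text</p>\n";
}
$page .= "</body></html>\n";

# Encode once, right before output, and declare the charset in the header.
my $body = encode_utf8($page);
binmode STDOUT;    # $body is already bytes
print "Content-Type: text/html; charset=utf-8\r\n";
print "Content-Length: ", length($body), "\r\n\r\n";
print $body;
```

Encoding in one place means every part of the page goes out in the same
encoding, and Content-Length is computed on the encoded bytes.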

Now, if you don't use the libxml2 parser but the old broken one, it's
just dealing with bytes, so you might actually pass the utf8 chars
through in the description unaltered.  But the index will likely be
all wrong, since it assumes a byte is a character.
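
To illustrate the byte-versus-character mismatch (plain Perl, nothing
swish-specific; the sample word is arbitrary):

```perl
use strict;
use warnings;
use Encode qw(decode);

my $utf8_bytes = "caf\xC3\xA9";                  # "cafe" with e-acute, UTF-8 encoded
my $chars      = decode('UTF-8', $utf8_bytes);   # decoded to characters

my $byte_count = length($utf8_bytes);   # counts bytes
my $char_count = length($chars);        # counts characters

# Prints: bytes: 5, characters: 4
print "bytes: $byte_count, characters: $char_count\n";
```

Code that treats each byte as a character sees five "characters" here,
two of which are not valid Latin-1 letters for that word.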

Bill Moseley

Received on Thu Apr 10 02:43:46 2008