Re: UTF-8 highlighting

From: <moseley(at)not-real.hank.org>
Date: Thu Jul 03 2003 - 13:07:47 GMT
On Thu, Jul 03, 2003 at 02:31:21AM -0700, Tim Freedom wrote:
> 
> 1. I have various mailing-lists which are archived as UTF-8 files.  I didn't
>    do anything special about indexing them (no conversion is needed) and
>    I added the following line to the search.tt file (after <head>)

First, my knowledge of character issues is limited.

> 
>      <meta http-equiv="content-type" content="text/html; charset=UTF-8">

I'm not sure that would make any difference.  Does it?  I ask because,
with the libxml2 parser, text is converted from UTF-8 to ISO-8859-1
before swish-e processes it.
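
As a rough illustration of why a UTF-8 to ISO-8859-1 conversion can lose
information, here is a minimal sketch using Perl's Encode module.  It is
not swish-e's or libxml2's code, and the sample string is made up; how
swish-e itself treats characters that do not fit in ISO-8859-1 is not
stated here.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "café €5" as UTF-8 octets (made-up sample text).
my $utf8_bytes = "caf\xC3\xA9 \xE2\x82\xAC5";

# Octets -> Perl characters -> ISO-8859-1 octets.
my $text   = decode('UTF-8', $utf8_bytes);
my $latin1 = encode('iso-8859-1', $text);

# "é" exists in ISO-8859-1 and becomes the single byte 0xE9.
# "€" (U+20AC) does not, so Encode's default behaviour substitutes "?".
print join(' ', map { sprintf '%02X', ord } split //, $latin1), "\n";
# prints: 63 61 66 E9 20 3F 35
```

Once the text has been reduced to ISO-8859-1 like this, declaring the
page as UTF-8 in a meta tag cannot bring the lost characters back.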

>    I see results - which is great, but the highlighting seems to mess things
>    up.  Instead of seeing my words highlighted properly, I see question
>    marks with sprinkled yellow highlights (every other question mark in a
>    row gets the yellow highlight).  Is there anyway to fix that.  I use the
>    default hightlight method (ie. I don't over-ride anything in that area,
>    so I believe its using 'PhraseHighlight' which is fine).  Not sure if
>    you need to include 'use utf8;' in your code or what so that the various
>    multibyte characters get grouped appropriately.  Any ideas/solutions ?

Not really.  I only recently upgraded to Perl 5.8.0, where character
encoding works better.  It's been a while since I looked, but with older
Perl versions I had weird problems.  For example, to tokenize the text
for highlighting I split it using the WordCharacters setting passed
back in the swish-e results header.  I once had some odd problems, and
it turned out that the text was flagged as UTF-8 in Perl but the split
function was, in some cases, splitting in the middle of a multi-byte
character, which left me with invalid UTF-8 strings.  I suspect that's
fixed in 5.8.0.
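
The kind of failure described above can be sketched as follows.  This is
an illustration only, not the highlighting code itself: the $wordchars
value is a hypothetical stand-in for a WordCharacters setting that lists
ISO-8859-1 letters, and is_valid_utf8 is a helper invented for the
example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "cafés du monde" as raw UTF-8 octets; "é" is the two bytes C3 A9.
my $bytes = "caf\xC3\xA9s du monde";

# Hypothetical word-character set: ASCII letters plus the ISO-8859-1
# letter range.  As a byte, 0xC3 is in this set and 0xA9 is not.
my $wordchars = 'a-zA-Z\xC0-\xFF';

# Helper for this example: true if the octets are well-formed UTF-8.
sub is_valid_utf8 {
    my ($octets) = @_;
    return eval { decode('UTF-8', $octets, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# Splitting the octets cuts "é" in half: "caf\xC3" | "\xA9" | "s" ...
my @byte_tokens = split /([^$wordchars]+)/, $bytes;

# Decoding to characters first keeps "cafés" in one piece.
my $chars       = decode('UTF-8', $bytes);
my @char_tokens = split /([^$wordchars]+)/, $chars;

printf "byte split: %d tokens, first token valid UTF-8: %s\n",
    scalar @byte_tokens, is_valid_utf8($byte_tokens[0]) ? 'yes' : 'no';
printf "char split: %d tokens\n", scalar @char_tokens;
```

Wrapping each of the byte-level tokens in highlight markup would put
tags between the two halves of one character, which a browser cannot
display as the original letter.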

Can you work up an example to demonstrate the problem?  You might also 
try the SimpleHighlight module instead -- that only splits on 
whitespace.  If that changes things then maybe that's a clue where the 
problem is.
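
Why a whitespace-only split makes a useful comparison can be sketched
like this.  It is an assumption-laden illustration, not SimpleHighlight's
actual code: in UTF-8 every byte of a multi-byte character is 0x80 or
higher, so splitting on ASCII whitespace cannot land inside a character.
The helper is_valid_utf8 is invented for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "cafés du monde" as raw UTF-8 octets, deliberately left undecoded.
my $bytes = "caf\xC3\xA9s du monde";

# Helper for this example: true if the octets are well-formed UTF-8.
sub is_valid_utf8 {
    my ($octets) = @_;
    return eval { decode('UTF-8', $octets, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# Split on ASCII whitespace only, keeping the separators.
my @tokens = split /([ \t\r\n\f]+)/, $bytes;

# Collect any token that is not valid UTF-8 on its own.
my @bad = grep { !is_valid_utf8($_) } @tokens;

printf "%d tokens, %d invalid\n", scalar @tokens, scalar @bad;
```

If the output is clean with a whitespace split but broken with a
word-character split, that points at the word-character tokenizing
rather than at the indexed data.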

> 2. Using the 'TemplateToolkit' method - if I don't find a result for my
>    search I get a red "no result" _above_ the search form.  Is there anyway
>    to control the location of where that string gets inserted (and all other
>    error strings.  I saw that it gets spewed out as STDERR and didn't follow
>    it after that.

Sure, that's the entire point of using the template.  In search.tt:

[% WRAPPER page %]

    [% PROCESS swish_header %]


    [% title = PROCESS title %]

    [% IF ! search.results %]
        [% PROCESS show_message %]   [%# <-- move this line to wherever the message should appear %]
        [% PROCESS search_form %]

    [% ELSE %]
        [% PROCESS search_form %]
        [% PROCESS nav_bar %]
        [% PROCESS results_list %]
    [% END %]

    [% PROCESS swish_footer %]

[% END %]

-- 
Bill Moseley
moseley@hank.org
Received on Thu Jul 3 13:07:50 2003