RE: XML Character encoding issue

From: Tref Gare <trefg(at)not-real.areeba.com.au>
Date: Tue Feb 25 2003 - 01:52:39 GMT
Sorry folks, I'm posting this again because I'm not getting anywhere
with the debugging, except that I've narrowed the problem down to the
swish-e indexing/encoding rather than any display code.

To reiterate the issue:

I'm indexing a page which contains the text "cin&#233;math&#232;que",
where &#233; and &#232; are the numeric character references for the
Latin-1 characters é and è (cinémathèque).
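
For illustration, here is a minimal Python sketch of how those numeric
character references resolve and how the result looks as bytes. Python
and the variable names are assumptions made for demonstration only; they
are not part of the swish-e setup described in this post.

```python
import html

# The text as it appears in the page source, using numeric character references.
source_text = "cin&#233;math&#232;que"

# A numeric reference names a code point: 233 is é and 232 is è.
decoded = html.unescape(source_text)

# In Latin-1 each of these characters is a single byte with the same value
# as its code point.
latin1_bytes = decoded.encode("latin-1")

# In UTF-8 the same two characters take two bytes each.
utf8_bytes = decoded.encode("utf-8")

assert decoded == "cinémathèque"
assert latin1_bytes == b"cin\xe9math\xe8que"
assert utf8_bytes == b"cin\xc3\xa9math\xc3\xa8que"
```

The point of the comparison is that the same word has two different byte
forms, so every stage (indexer, index, display code) has to agree on
which one it is reading.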

When indexing this word in a file on its own, I get the following
results on the PC system:

C:\WWW\ACMI>swish-e -i test.html -v0 -T indexed_words
    Adding:[1:swishdefault(1)]   'cinÚmathÞque'   Pos:1  Stuct:0x1 ( FILE )

And these on the Solaris system:

\WWW\ACMI>swish-e -i test.html -v0 -T indexed_words
    Adding:[1:swishdefault(1)]   'cinémathèque'   Pos:1  Stuct:0x1 ( FILE )


That led me to believe that the PC system is the one with the problem.
However, when the resulting index is queried, it decodes and displays
correctly on the PC, whereas the Solaris system returns the accented
chars as '?'.  That '?' is then encoded as &#65533; by my display
filtering code (Java/JSP), which I think means "I've no idea what this
char is about" (65533 is U+FFFD, the Unicode replacement character).
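
One way to see where &#65533; can come from is sketched below. It
assumes, purely for illustration, that Latin-1 bytes are being decoded
as UTF-8 somewhere between the index and the display code; that is a
guess about the cause, not something confirmed here. The helper name
to_numeric_refs is hypothetical, and Python stands in for the Java/JSP
filter.

```python
def to_numeric_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


# The word as single-byte Latin-1 data.
latin1_bytes = b"cin\xe9math\xe8que"

# Decoding those bytes as UTF-8 fails on 0xE9 and 0xE8, so each one is
# replaced by U+FFFD (decimal 65533), the Unicode replacement character.
misdecoded = latin1_bytes.decode("utf-8", errors="replace")

# Decoding with the matching charset keeps the accented letters.
well_decoded = latin1_bytes.decode("latin-1")

assert to_numeric_refs(misdecoded) == "cin&#65533;math&#65533;que"
assert to_numeric_refs(well_decoded) == "cin&#233;math&#232;que"

# A literal question mark is plain ASCII and passes through unchanged.
assert to_numeric_refs("cin?math?que") == "cin?math?que"
```

If this sketch resembles what is happening, the '?' seen on Solaris
would be U+FFFD rendered by a terminal that cannot draw it, rather than
a real question mark, since a real '?' would not turn into &#65533;.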

Similar tests with xml files return the same results.
The configuration files used on both systems are identical.

I'm guessing this has nothing to do with the WordCharacters lists, as
the characters being used here are Latin-1 encoded and the PC seems to
be handling them fine.

I've read that libxml converts Latin-1 characters to UTF-8 internally,
and I'm wondering whether that explains the strange indexing of
characters on the PC system (even though they display correctly).
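
A possible alternative reading of the PC output is sketched below: the
indexed word may be plain Latin-1 bytes that the Windows command prompt
is rendering in its OEM code page. The sketch assumes code page 850,
which is a guess about this machine's console, and Python is used only
to illustrate the byte-level effect.

```python
word = "cinémathèque"

# Latin-1 bytes interpreted by a console that uses code page 850:
# byte 0xE9 is drawn as Ú and byte 0xE8 as Þ.
latin1_in_console = word.encode("latin-1").decode("cp850")

# UTF-8 bytes in the same console: each accented letter becomes two
# console characters, so the word gets longer.
utf8_in_console = word.encode("utf-8").decode("cp850")

assert latin1_in_console == "cinÚmathÞque"
assert utf8_in_console != latin1_in_console
assert len(utf8_in_console) == len(word) + 2
```

If the code page assumption holds, the 'cinÚmathÞque' output matches
single-byte Latin-1 data rather than UTF-8, which would explain why the
index still displays correctly on the PC.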

Can anyone shed any light on this?

Thanks

Tref


------------------------------------------------------
Tref Gare
Development Consultant
Areeba
Level 19/114 William St, Melbourne VIC 3000
email: trefg@areeba.com.au
phone: +61 3 9642 5553
fax: +61 3 9642 1335
website: http://www.areeba.com.au
------------------------------------------------------
"This email is intended only for the use of the individual or entity
named above and contains information that is confidential. No
confidentiality is waived or lost by any mis-transmission. If you
received this correspondence in error, please notify the sender and
immediately delete it from your system. You must not disclose, copy or
rely on any part of this correspondence if you are not the intended
recipient. Any communication directed to clients via this message is
subject to our Agreement and relevant Project Schedule. Any information
that is transmitted via email which may offend may have been sent
without knowledge or the consent of Areeba."
------------------------------------------------------

-----Original Message-----
From: Tref Gare 
Sent: Friday, 21 February 2003 12:13 PM
To: Multiple recipients of list
Subject: [SWISH-E] XML Character encoding issue 

Hi All,

Any help much appreciated.

I've got a problem with characters getting encoded strangely when I
index XML files containing special characters (in this example, e's with
acute and grave accents).  When I run a command-line search against the
file's index, the eventTitle value cin&#233;math&#232;que (in the XML)
is returned as cinÚmathÞque.

I'm working on two separate systems (for my sins), namely a PC-based
Windows XP system (test) and a Solaris 2.8 box (live).  On the Solaris
box the accented characters display as either question marks or boxes,
while on the Windows box they display accurately as accented characters.

Any thoughts anyone?


Thanks in advance 
Tref

My config file looks like this:
==================================================================

# Test index config file for indexing xml docs

IndexFile "/WWW/ACMI/WEB-INF/search/catalog/test.index"
IndexDir .
IndexContents XML2 .xml
NoContents .gif .jpg .mdb
StoreDescription XML2 <oneLiner> 320
IndexOnly .xml

FileRules pathname contains data
FileRules dirname contains data
FileRules dirname contains WEB-INF

#added to stop nav title fields getting indexed
IgnoreMetaTags title

FollowSymLinks no
MetaNames oneLiner eventTitle htmlLocation interval endDate startDate
keyword paragraph subTitle eventType times dates location audiences
admissionPrice free partof flag compoundEventId eventthumbnail rating
compoundType eventid ticketingeventid

# adding the property names line
PropertyNames oneliner htmllocation startdate enddate eventtitle
interval paragraph eventtype subtitle times dates location free partof
audiences admissionprice flag compoundeventid eventthumbnail rating
compoundtype eventid ticketingeventid
ReplaceRules prepend "filesys"
ReplaceRules replace "filesys\." "http://wwd.acmi.net.au:88"
# ReplaceRules regex "/\x5c/\x2f/gi"
ReplaceRules replace "\\\\" "/"


The following is a test version of the XML I'm indexing:
========================================================================

<?xml version="1.0" encoding="iso-8859-1" ?>
<page id="590EC12043754DDAB68479A614179A7E" section="experience"
htmlLocation="/590EC12043754DDAB68479A614179A7E.jsp">
 <htmlLocation>/590EC12043754DDAB68479A614179A7E.jsp</htmlLocation>
<content>
 <title><![CDATA[melbourne cin&#233;math&#232;que]]></title>
    <event version="1.0" 
        id="590EC12043754DDAB68479A614179A7E" eventID="vbvbv"
ticketingEventID=""
        htmlLocation="" compoundType="activity stream">
            <eventID>vbvbv</eventID>
            <ticketingEventID></ticketingEventID>
            <eventTitle>melbourne cin&#233;math&#232;que</eventTitle>
            <eventThumbnail><![CDATA[<IMG
src="/experience/images/event_thumbs/thumb_cinematheque.gif"
alt="melbourne cinémathèque" border="0" width="70"
height="70">]]></eventThumbnail>
            <title>melbourne cin&#233;math&#232;que</title>
            <subtitle></subtitle>
            <oneLiner>Cinémathèque offers a diverse program of classic,
cult, animation, experimental, documentary, silent and short films
throughout the year.</oneLiner>
            <paragraph>Cinémathèque offers a diverse program of classic,
cult, animation, experimental, documentary, silent and short films.
&lt;SPAN&gt;Screenings are themed around filmmakers, genres and styles,
historical/literary movements, moments or figures, and national
cinemas.&lt;/SPAN&gt;</paragraph>
            <eventTypes><eventType>Cinema
Screening/Event</eventType></eventTypes>
            <dates>&lt;P&gt;Every Wednesday night from
7pm&lt;/P&gt;</dates>
            <times></times>
            <fullText>
                <subpage title="melbourne
cin&#233;math&#232;que">&lt;P&gt;Cinémathèque offers a diverse program
of classic, cult, animation, experimental, documentary, silent and short
films. &lt;/P&gt; &lt;P&gt;The year-long program consists of weekly
screenings of classic, cult, experimental, animation, documentary,
silent and short films.&amp;nbsp; Screenings are themed around
filmmakers, genres and styles, historical/literary movements, moments or
figures, and national cinemas.&lt;/P&gt;</subpage>
            </fullText>
            <soldOutText></soldOutText>
            <sponsorsPartners>Presented by the Australian Centre for the
Moving Image &amp;amp; the Melbourne Cinémathèque. Curated by the
Melbourne Cinémathèque. Supported by the Australian Film
Commission.</sponsorsPartners>
            <datesList>
                <interval>
                    <startDate>2003-02-19</startDate>
                    <endDate>2004-01-29</endDate>   
                </interval>
            </datesList>
    </event>
 </content>
</page>

========================================================================
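
As a rough cross-check of how an XML parser reads a document like this,
here is a small Python sketch. It uses the standard library's
ElementTree parser rather than libxml2 or swish-e, so it only
illustrates general XML behaviour, and the trimmed-down sample document
is an assumption built from a few elements of the file above.

```python
import xml.etree.ElementTree as ET

# A cut-down version of the document, stored as ISO-8859-1 bytes to match
# the encoding named in its XML declaration.
sample = (
    '<?xml version="1.0" encoding="iso-8859-1" ?>'
    "<content>"
    "<title><![CDATA[melbourne cin&#233;math&#232;que]]></title>"
    "<eventTitle>melbourne cin&#233;math&#232;que</eventTitle>"
    "<oneLiner>Cinémathèque offers a diverse program</oneLiner>"
    "</content>"
).encode("latin-1")

root = ET.fromstring(sample)

# Inside CDATA, character references are not expanded; the text stays literal.
cdata_title = root.findtext("title")

# In ordinary element content the references become the real characters.
event_title = root.findtext("eventTitle")

# Raw Latin-1 bytes are decoded according to the declared encoding.
one_liner = root.findtext("oneLiner")

assert cdata_title == "melbourne cin&#233;math&#232;que"
assert event_title == "melbourne cinémathèque"
assert one_liner == "Cinémathèque offers a diverse program"
```

Two things stand out in this sketch: the parser hands back Unicode text
whatever the file encoding was, and the CDATA-wrapped title keeps the
literal &#233; and &#232; sequences instead of accented characters.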


Received on Tue Feb 25 01:56:42 2003