XML Character encoding issue

From: Tref Gare <trefg(at)not-real.areeba.com.au>
Date: Fri Feb 21 2003 - 01:13:00 GMT
Hi All,

Any help much appreciated.

I've got a problem with characters being rendered incorrectly when I
index XML files containing special characters (in this example, e's with
acute and grave accents).  When I run a command-line search against the
file's index, the eventtitle property, cin&#233;math&#232;que in the
XML, is returned as cinÚmathÞque.
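
One possible reading of this symptom, offered as a sketch rather than a diagnosis: the garbled string is what ISO-8859-1 bytes look like when a console decodes them as DOS code page 850. The choice of cp850 is an assumption about the terminal, not something stated in this message. A minimal Python illustration:

```python
# Assumption (not confirmed by the post): the search output is ISO-8859-1
# bytes, and the console decodes them as code page 850.
original = "cin\u00e9math\u00e8que"  # "cinémathèque"

# In ISO-8859-1, e-acute is byte 0xE9 and e-grave is byte 0xE8.
latin1_bytes = original.encode("iso-8859-1")

# The same bytes map to different characters in cp850.
as_seen_in_console = latin1_bytes.decode("cp850")

print(as_seen_in_console)  # prints "cinÚmathÞque"
```

If this matches what you see, the stored bytes would be correct and only the display encoding would differ between the two machines.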

I'm working on two separate systems (for my sins), namely a PC-based
Windows XP system (test) and a Solaris 2.8 box (live).  On the Solaris
box the accented characters display as either question marks or boxes,
while on the Windows box they display accurately as accented characters.

Any thoughts anyone?


Thanks in advance 
Tref

My config file looks like this:
==================================================================

# Test index config file for indexing xml docs

IndexFile "/WWW/ACMI/WEB-INF/search/catalog/test.index"
IndexDir .
IndexContents XML2 .xml
NoContents .gif .jpg .mdb
StoreDescription XML2 <oneLiner> 320
IndexOnly .xml

FileRules pathname contains data
FileRules dirname contains data
FileRules dirname contains WEB-INF

#added to stop nav title fields getting indexed
IgnoreMetaTags title

FollowSymLinks no
MetaNames oneLiner eventTitle htmlLocation interval endDate startDate keyword paragraph subTitle eventType times dates location audiences admissionPrice free partof flag compoundEventId eventthumbnail rating compoundType eventid ticketingeventid

# adding the property names line
PropertyNames oneliner htmllocation startdate enddate eventtitle interval paragraph eventtype subtitle times dates location free partof audiences admissionprice flag compoundeventid eventthumbnail rating compoundtype eventid ticketingeventid
ReplaceRules prepend "filesys"
ReplaceRules replace "filesys\." "http://wwd.acmi.net.au:88"
# ReplaceRules regex "/\x5c/\x2f/gi"
ReplaceRules replace "\\\\" "/"


The following is a test version of the XML I'm indexing:
========================================================================

<?xml version="1.0" encoding="iso-8859-1" ?>
<page id="590EC12043754DDAB68479A614179A7E" section="experience"
htmlLocation="/590EC12043754DDAB68479A614179A7E.jsp">
 <htmlLocation>/590EC12043754DDAB68479A614179A7E.jsp</htmlLocation>
<content>
 <title><![CDATA[melbourne cin&#233;math&#232;que]]></title>
    <event version="1.0" 
        id="590EC12043754DDAB68479A614179A7E" eventID="vbvbv"
ticketingEventID=""
        htmlLocation="" compoundType="activity stream">
            <eventID>vbvbv</eventID>
            <ticketingEventID></ticketingEventID>
            <eventTitle>melbourne cin&#233;math&#232;que</eventTitle>
            <eventThumbnail><![CDATA[<IMG
src="/experience/images/event_thumbs/thumb_cinematheque.gif"
alt="melbourne cinémathèque" border="0" width="70"
height="70">]]></eventThumbnail>
            <title>melbourne cin&#233;math&#232;que</title>
            <subtitle></subtitle>
            <oneLiner>Cinémathèque offers a diverse program of classic,
cult, animation, experimental, documentary, silent and short films
throughout the year.</oneLiner>
            <paragraph>Cinémathèque offers a diverse program of classic,
cult, animation, experimental, documentary, silent and short films.
&lt;SPAN&gt;Screenings are themed around filmmakers, genres and styles,
historical/literary movements, moments or figures, and national
cinemas.&lt;/SPAN&gt;</paragraph>
            <eventTypes><eventType>Cinema
Screening/Event</eventType></eventTypes>
            <dates>&lt;P&gt;Every Wednesday night from
7pm&lt;/P&gt;</dates>
            <times></times>
            <fullText>
                <subpage title="melbourne
cin&#233;math&#232;que">&lt;P&gt;Cinémathèque offers a diverse program
of classic, cult, animation, experimental, documentary, silent and short
films. &lt;/P&gt;
&lt;P&gt;The year-long program consists of weekly screenings of classic,
cult, experimental, animation, documentary, silent and short
films.&amp;nbsp; Screenings are themed around filmmakers, genres and
styles, historical/literary movements, moments or figures, and national
cinemas.&lt;/P&gt;</subpage>
            </fullText>
            <soldOutText></soldOutText>
            <sponsorsPartners>Presented by the Australian Centre for the
Moving Image &amp;amp; the Melbourne Cinémathèque. Curated by the
Melbourne Cinémathèque. Supported by the Australian Film
Commission.</sponsorsPartners>
            <datesList>
                <interval>
                    <startDate>2003-02-19</startDate>
                    <endDate>2004-01-29</endDate>   
                </interval>
            </datesList>
    </event>
 </content>
</page>

========================================================================
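
One detail of the sample worth checking: numeric character references such as &#233; are not expanded inside a CDATA section, so the CDATA-wrapped <title> and the plain <eventTitle> hold different text even though they look identical in the source. The sketch below uses Python's standard-library parser on a cut-down copy of the sample. It only illustrates XML parsing rules and makes no claim about how the indexer's own XML2 parser treats these fields.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the sample document (element names taken from it).
doc = (
    '<?xml version="1.0" encoding="iso-8859-1" ?>'
    '<content>'
    '<title><![CDATA[melbourne cin&#233;math&#232;que]]></title>'
    '<eventTitle>melbourne cin&#233;math&#232;que</eventTitle>'
    '</content>'
)

# Parse from bytes so the encoding declaration is honoured by the parser.
root = ET.fromstring(doc.encode("iso-8859-1"))

cdata_title = root.find("title").text       # references kept as literal text
event_title = root.find("eventTitle").text  # references expanded to characters

print(repr(cdata_title))
print(repr(event_title))
```

Here cdata_title keeps the literal characters "&#233;" and "&#232;", while event_title contains the accented letters themselves.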


------------------------------------------------------
Tref Gare
Development Consultant
Areeba
Level 19/114 William St, Melbourne VIC 3000
email: trefg@areeba.com.au
phone: +61 3 9642 5553
fax: +61 3 9642 1335
website: http://www.areeba.com.au
------------------------------------------------------
Received on Fri Feb 21 01:13:43 2003