Re: [swish-e] XML parsing not returning Title

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Thu Nov 22 2007 - 17:16:16 GMT

Robinson Craig wrote on 11/20/07 9:40 PM:
> Hi Folks,
> 
> As background - what I am trying to do is to return the contents of a particular tag (div id="leftcolumn") as the swishdescription. In a separate thread, Peter Karman suggested that I parse the HTML files as XML, which I have successfully (but partially) implemented (big Thanks to Peter!).
> 
> For those who are interested, here are the relevant parts of the config that I used:
> 
> IndexContents XML2 .xml .html
> IndexContents HTML* .htm .pdf
> IndexOnly .xml .htm .html .pdf
> XMLClassAttributes id
> DefaultContents XML2
> MetaNames swishdocpath swishtitle keywords function expires contributor datecreated datemodified
> MetaNameAlias keywords    "DC.Subject"
> MetaNameAlias function    "AGLS.Function"
> MetaNameAlias expires     "DC.Date.valid"
> MetaNameAlias contributor "NRM.ContentContributor"
> MetaNameAlias swishdocpath path
> MetaNameAlias datecreated  "DC.Date.created"
> MetaNameAlias datemodified "DC.Date.modified"
> HTMLLinksMetaName link
> ImageLinksMetaName image
> AbsoluteLinks yes
> IndexAltTagMetaName alt
> #UndefinedMetaTags ignore - commented out as otherwise XML won't return anything
> ExtractPath section regex |^/web/nrm/htdocs/([^/]+)/.*$|$1|
> PropertyNames swishdescription function expires contributor datemodified datecreated
> PropertyNameAlias swishdescription div.leftcolumn description
> PropertyNameAlias function    "AGLS.Function"
> PropertyNameAlias expires     "DC.Date.valid"
> PropertyNameAlias contributor "NRM.ContentContributor"
> PropertyNameAlias datemodified "DC.Date.modified"
> PropertyNameAlias datecreated "DC.Date.created"
> PropertyNamesMaxLength 256 swishdescription
> #StoreDescription XML2 <body> 256 - commented out so as to return contents of div only
> StoreDescription HTML* <body> 256
> 
> It works a treat. The only problem I have is when I search using the following command:
> 
> swish-e -m 10 -d ~##~ -p swishdescription -p swishlastmodified -p description -f swish.index -w "water"
> 
> I get something like the following:
> 
> 1000~##~http://nrmdev.dnr.qld.gov.au/water/management/application_forms.html~##~application_forms.html~##~39292~##~Application forms are provided below for most application types under the Water Act 2000. Before applying, applicants should carefully read the relevant guidelines where provided. If the activity includes the construction or installation of new works or an~##~2007-09-24 08:14:08~##~Application forms are provided below for most application types under the Water Act 2000. Before applying, applicants should carefully read the relevant guidelines where provided. If the activity includes the construction or installation of new works or an
> 
> Notice that the 3rd field (which I presume is swishtitle??) contains the filename: application_forms.html
> 
> Whereas, if I index HTML using HTML2 (i.e. IndexContents HTML2 .html), I get results like:
> 
> 1000~##~http://nrmdev.dnr.qld.gov.au/water/management/application_forms.html~##~Water management application forms~##~39292~##~  Water management application forms Application forms are provided below for most application types under the Water Act 2000. Before applying, applicants should carefully read the relevant guidelines where provided. If the activity includes the constructi~##~2007-09-24 08:14:08~##~water licence application forms water licence application forms
> 
> Where the 3rd field contains the TITLE of the document: Water management application forms
> 
> Looks like I can't have my cake and eat it too :-) If I want to get the contents of the <div>, then I must use XML2, but if I want to return the Title, then I must use HTML2. If anyone has any ideas how I might approach this problem, it would be appreciated.
>  

Without an example document it's hard to say for certain.

Both XML2 and HTML2 use the libxml2 parser. Why one grabs the 'title' the way
you expect and the other doesn't may or may not be something you can change
through configuration.

Post a small, reproducible example and we'll try to help.
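
When putting that example together, it may help to label the fields in the
search output so it is clear which one holds the title. Below is a minimal
sketch in Python. It assumes that with -d the output order is rank, path,
title, size, followed by the -p properties in the order they were given on the
command line. The function name parse_result and the shortened sample line are
illustrative only and not part of swish-e.

```python
# Label the fields of one swish-e result line produced with "-d ~##~".
# Assumption: the first four fields are rank, path, title and size, and the
# -p properties follow in the order they were requested.

DELIM = "~##~"
CORE_FIELDS = ["rank", "path", "title", "size"]


def parse_result(line, props, delim=DELIM):
    """Split one result line and return a dict of field name -> value."""
    parts = line.rstrip("\n").split(delim)
    names = CORE_FIELDS + list(props)
    if len(parts) != len(names):
        raise ValueError(
            "expected %d fields, got %d" % (len(names), len(parts))
        )
    return dict(zip(names, parts))


# Abbreviated version of the XML2 result quoted above (descriptions shortened).
sample = (
    "1000~##~http://nrmdev.dnr.qld.gov.au/water/management/application_forms.html"
    "~##~application_forms.html~##~39292"
    "~##~Application forms are provided below"
    "~##~2007-09-24 08:14:08"
    "~##~Application forms are provided below"
)

record = parse_result(
    sample, ["swishdescription", "swishlastmodified", "description"]
)
for name, value in record.items():
    print("%-18s %s" % (name, value))
```

Run against the XML2 result, the title field shows the filename rather than
the document title, which matches what Craig describes.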

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Thu Nov 22 12:16:16 2007