[swish-e] XML parsing not returning Title

From: Robinson Craig <Craig.Robinson(at)not-real.nrw.qld.gov.au>
Date: Wed Nov 21 2007 - 03:40:40 GMT
Hi Folks,

As background: I am trying to return the contents of a particular tag (div id="leftcolumn") as the swishdescription. In a separate thread, Peter Karman suggested that I parse the HTML files as XML, which I have successfully (but only partially) implemented (big thanks to Peter!).

For those who are interested, here are the relevant parts of the config that I used:

IndexContents XML2 .xml .html
IndexContents HTML* .htm .pdf
IndexOnly .xml .htm .html .pdf
XMLClassAttributes id
DefaultContents XML2
MetaNames swishdocpath swishtitle keywords function expires contributor datecreated datemodified
MetaNameAlias keywords    "DC.Subject"
MetaNameAlias function    "AGLS.Function"
MetaNameAlias expires     "DC.Date.valid"
MetaNameAlias contributor "NRM.ContentContributor"
MetaNameAlias swishdocpath path
MetaNameAlias datecreated  "DC.Date.created"
MetaNameAlias datemodified "DC.Date.modified"
HTMLLinksMetaName link
ImageLinksMetaName image
AbsoluteLinks yes
IndexAltTagMetaName alt
#UndefinedMetaTags ignore - commented out as otherwise XML won't return anything
ExtractPath section regex |^/web/nrm/htdocs/([^/]+)/.*$|$1|
PropertyNames swishdescription function expires contributor datemodified datecreated
PropertyNameAlias swishdescription div.leftcolumn description
PropertyNameAlias function    "AGLS.Function"
PropertyNameAlias expires     "DC.Date.valid"
PropertyNameAlias contributor "NRM.ContentContributor"
PropertyNameAlias datemodified "DC.Date.modified"
PropertyNameAlias datecreated "DC.Date.created"
PropertyNamesMaxLength 256 swishdescription
#StoreDescription XML2 <body> 256 - commented out so as to return contents of div only
StoreDescription HTML* <body> 256

It works a treat. The only problem I have is when I search using the following command:

swish-e -m 10 -d ~##~ -p swishdescription -p swishlastmodified -p description -f swish.index -w "water"

I get something like the following:

1000~##~http://nrmdev.dnr.qld.gov.au/water/management/application_forms.html~##~application_forms.html~##~39292~##~Application forms are provided below for most application types under the Water Act 2000. Before applying, applicants should carefully read the relevant guidelines where provided. If the activity includes the construction or installation of new works or an~##~2007-09-24 08:14:08~##~Application forms are provided below for most application types under the Water Act 2000. Before applying, applicants should carefully read the relevant guidelines where provided. If the activity includes the construction or installation of new works or an

Notice that the 3rd field (which I presume is swishtitle) contains the filename: application_forms.html
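
To make the field positions easier to check, here is a minimal Python sketch that splits one result line on the ~##~ delimiter and labels each field. It assumes that swish-e's default output is rank, path, title and size, followed by one field per -p option in command-line order; that ordering is an assumption on my part, and label_fields is a hypothetical helper, not part of swish-e. The sample line is an abbreviated version of the result above.

```python
# Hypothetical helper for labelling the fields of one swish-e result line.
# Assumption: the default fields are rank, path, title, size, and each
# property requested with -p is appended in command-line order.
DELIMITER = "~##~"
DEFAULT_FIELDS = ["rank", "path", "title", "size"]


def label_fields(line, properties, delimiter=DELIMITER):
    """Return a dict mapping field names to the values in a result line."""
    names = DEFAULT_FIELDS + list(properties)
    values = line.rstrip("\n").split(delimiter)
    if len(values) != len(names):
        raise ValueError(
            "expected %d fields, got %d" % (len(names), len(values))
        )
    return dict(zip(names, values))


if __name__ == "__main__":
    # Abbreviated form of the XML2 result shown above.
    sample = DELIMITER.join([
        "1000",
        "http://nrmdev.dnr.qld.gov.au/water/management/application_forms.html",
        "application_forms.html",
        "39292",
        "Application forms are provided below",
        "2007-09-24 08:14:08",
        "Application forms are provided below",
    ])
    props = ["swishdescription", "swishlastmodified", "description"]
    for name, value in label_fields(sample, props).items():
        print("%-18s %s" % (name, value))
```

If the assumed ordering is right, the "title" entry is the one that shows the filename under XML2.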

Whereas if I index the HTML using HTML2 (i.e. IndexContents HTML2 .html), I get results like:

1000~##~http://nrmdev.dnr.qld.gov.au/water/management/application_forms.html~##~Water management application forms~##~39292~##~  Water management application forms Application forms are provided below for most application types under the Water Act 2000. Before applying, applicants should carefully read the relevant guidelines where provided. If the activity includes the constructi~##~2007-09-24 08:14:08~##~water licence application forms water licence application forms

Here the 3rd field contains the TITLE of the document: Water management application forms

Looks like I can't have my cake and eat it too :-) If I want to get the contents of the <div>, then I must use XML2, but if I want to return the Title, then I must use HTML2. If anyone has any ideas how I might approach this problem, it would be appreciated.
 
Cheers, Craig


Received on Tue Nov 20 22:41:54 2007