Re: [swish-e] XML parsing not returning Title

From: Robinson Craig <Craig.Robinson(at)not-real.nrw.qld.gov.au>
Date: Mon Nov 26 2007 - 01:12:40 GMT

Karman Peter wrote on 11/22/07 11:16 AM:

<..... preceding conversation snipped out .....>

>Without an example document it's hard to say for certain.
>
>Both XML2 and HTML2 use the libxml2 parser, and why one grabs the
>'title' the way you expect and the other doesn't might be a configurable
>behaviour and might not.
>
>Post a small, reproducible example and we'll try and help.

Thanks Peter. I have attached three files. These reproduce the effect
when using a configuration similar to what I originally posted.

index.html -> a simple html file that is indexed via XML2, and returns
filename for title
showerheads.pdf -> an example pdf that is parsed via HTML2, and returns
title

and

showerheads.html -> the output of the "pdftotext -htmlmeta
showerheads.pdf" command, as per the config
[I included this file so you can see what output my pdftotext gives
me].

After indexing, searching via:

/web/nrm/search-bin > swish-e -m 10 -d ~##~ -p swishdescription -p
swishlastmodified -p description -f swish.index -w "showerheads"

Gives me:

# SWISH format: 2.2.3
# Search words: showerheads
# Number of hits: 2
# Search time: 0.001 seconds
# Run time: 0.067 seconds
1000~##~/tmp/web/showerheads.pdf~##~Showerheads - Home WaterWise Rebate
Scheme~##~67021~##~Showerheads The bathroom is one of the biggest water
using areas in the home, accounting for about 20 per cent of household
water. A standard showerhead uses about 20 litres of water a minute,
while a 3-star rated showerhead limits water flow to just
under~##~2007-11-26 10:46:55~##~Showerheads The bathroom is one of the
biggest water using areas in the home, accounting for about 20 per cent
of household water. A standard showerhead uses about 20 litres of water
a minute, while a 3-star rated showerhead limits water flow to just
under
161~##~/tmp/web/index.html~##~index.html~##~1263~##~The bathroom is one
of the biggest water using areas in the home, accounting for about 20
per cent of household water. A standard showerhead uses about 20 litres
of water a minute, while a 3-star rated showerhead limits water flow to
just under half that a~##~2007-11-26 10:55:36~##~The bathroom is one of
the biggest water using areas in the home, accounting for about 20 per
cent of household water. A standard showerhead uses about 20 litres of
water a minute, while a 3-star rated showerhead limits water flow to
just under half that a
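
For anyone reproducing this, the result lines above can be split on the "~##~" delimiter to check the title field mechanically. The following is only a minimal sketch in Python, not part of the attached files. It assumes each hit arrives on a single unwrapped line (the mail client wrapped the output above), and that the field order is rank, path, title, size followed by the three -p properties, as the output above suggests. The function names and the FIELDS list are made up for illustration.

```python
import os

# Delimiter passed to swish-e via -d in the command above.
DELIM = "~##~"

# Field order inferred from the output above (an assumption):
# rank, path, title, size, then the three -p properties in order.
FIELDS = ["rank", "path", "title", "size",
          "swishdescription", "swishlastmodified", "description"]


def parse_results(output):
    """Return one dict per hit line, skipping header and other lines."""
    hits = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blanks, '#' header lines, and the '.' end-of-results marker.
        if not line or line.startswith("#") or line == ".":
            continue
        parts = line.split(DELIM)
        if len(parts) != len(FIELDS):
            continue  # not a hit line in the expected shape
        hits.append(dict(zip(FIELDS, parts)))
    return hits


def title_is_filename(hit):
    """True when the title is just the file name, as seen for index.html."""
    return hit["title"] == os.path.basename(hit["path"])
```

Running parse_results() over the captured output and printing the path of every hit where title_is_filename() is true should list the documents whose title was not picked up (here, index.html).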


For convenience, the config is:

> IndexContents XML2 .xml .html
> IndexContents HTML* .htm .pdf
> IndexOnly .xml .htm .html .pdf
> XMLClassAttributes id
> DefaultContents XML2
> MetaNames swishdocpath swishtitle keywords function expires contributor datecreated datemodified
> MetaNameAlias keywords    "DC.Subject"
> MetaNameAlias function    "AGLS.Function"
> MetaNameAlias expires     "DC.Date.valid"
> MetaNameAlias contributor "NRM.ContentContributor"
> MetaNameAlias swishdocpath path
> MetaNameAlias datecreated  "DC.Date.created"
> MetaNameAlias datemodified "DC.Date.modified"
> HTMLLinksMetaName link
> ImageLinksMetaName image
> AbsoluteLinks yes
> IndexAltTagMetaName alt
> #UndefinedMetaTags ignore - commented out as otherwise XML won't return anything
> ExtractPath section regex |^/web/nrm/htdocs/([^/]+)/.*$|$1|
> PropertyNames swishdescription function expires contributor datemodified datecreated
> PropertyNameAlias swishdescription div.leftcolumn description
> PropertyNameAlias function    "AGLS.Function"
> PropertyNameAlias expires     "DC.Date.valid"
> PropertyNameAlias contributor "NRM.ContentContributor"
> PropertyNameAlias datemodified "DC.Date.modified"
> PropertyNameAlias datecreated "DC.Date.created"
> PropertyNamesMaxLength 256 swishdescription
> #StoreDescription XML2 <body> 256 - commented out so as to return contents of div only
> StoreDescription HTML* <body> 256

Thanks for your help.
 
Cheers, Craig



Received on Sun Nov 25 20:12:48 2007