Re: Some Questions about 2.2RC1 and XML

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Sep 06 2002 - 01:19:41 GMT
At 05:46 PM 09/05/02 -0700, Bill Humphries wrote:
>I've built 2.2RC1 with LIBXML 2.23.4 on Mac OS 10.1.5, and have been 
>experimenting with it the past two days.

Thanks very much!

>1) I plan to return search results as XML, however, in the Configuration
>File Directives
>(http://swish-e.org/2.2/docs/SWISH-CONFIG.html#Document_Contents_Directives)
>it appears that entities in XML documents are evaluated regardless of the
>value of ConvertHTMLEntities:
>
>	"NOTE: Entities within XML files and files parsed with libxml2 are 
>converted regardless of this setting."

Right, that option only works for HTML docs (docs parsed by html.c).  XML,
XML2 both convert entities.

>My current workaround for this is to build an XML result string, then pass 
>it through Tidy (http://tidy.sourceforge.net/) to re-escape entities.

What needs to be escaped besides < and >?  Seems like it would be slow to
spawn an external program to do this.  If you are using Perl, things like
CGI.pm and HTML::Entities can do this work.
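
For what it's worth, in XML character data the ampersand needs escaping as
well as < and >.  Here is a minimal sketch of doing that in-process rather
than through an external program.  It uses Python's standard library instead
of the Perl modules mentioned above, and the function name escape_xml_text
is only illustrative:

```python
from xml.sax.saxutils import escape


def escape_xml_text(value):
    """Escape &, < and > so the value is safe as XML character data.

    escape_xml_text is a hypothetical helper name; the real work is done
    by xml.sax.saxutils.escape from the standard library.
    """
    return escape(value)


if __name__ == "__main__":
    # A property value as it might come back from a search, with the
    # entities already converted to literal characters.
    raw = "Tom & Jerry <cartoon>"
    print("<title>%s</title>" % escape_xml_text(raw))
```

Values that end up inside attributes would also need their quotes escaped
(xml.sax.saxutils.quoteattr handles that case).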

>2) I'm indexing XML source documents in the file system. I can use the 
>configuration to use the first 100 characters of the document's root 
>element, 'page', as the description:
>
>PropertyNamesMaxLength 100 swishdescription
>PropertyNameAlias swishdescription page
>
>However, when swish-e constructs the index, it's taking the attribute 
>values, as well as the text nodes of 'page'.

Can you put together a small example?


>I'd also like to specify a location in the document to use as the 
>description, ie /page/section[1]/para[1].
>
>The workaround here would be to use the prog method to load pages and use 
>some xpath tool to extract that location and use as the page description.

I'm not 100% clear what you want, but using -S prog with the available
tools will probably give you the most control.
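
As a rough sketch of the extraction step such a program could do, assuming
the documents look like the 'page' example described above: the sample
document, the function name extract_description, and the 100-character
limit are assumptions taken from the question, not anything swish-e
provides.  It is written in Python with the standard library's limited
XPath support, and it only pulls out the text; feeding the result to
swish-e is left to the -S prog program itself.

```python
import xml.etree.ElementTree as ET


def extract_description(xml_text, path="./section[1]/para[1]", max_len=100):
    """Return the text at `path` (relative to the root element), truncated.

    Only text nodes are collected, so attribute values never leak into
    the description.  Returns an empty string if the path does not match.
    """
    root = ET.fromstring(xml_text)
    node = root.find(path)
    if node is None:
        return ""
    # Join all nested text nodes and collapse runs of whitespace.
    text = " ".join("".join(node.itertext()).split())
    return text[:max_len]


# Hypothetical document; the root element is 'page' as in the question.
sample = """<page id="p1" title="Ignored attribute">
  <section><para>First <b>paragraph</b> of the page.</para><para>Second.</para></section>
  <section><para>Other section.</para></section>
</page>"""

if __name__ == "__main__":
    print(extract_description(sample))
```

Because the path is evaluated relative to the root, ./section[1]/para[1]
here corresponds to /page/section[1]/para[1] in the original question.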

I added a bit to the configuration of the spider.pl program on the
perl.apache.org site and now search results take you to the specific
section of the page instead of just to the document.  It was a lot easier
to do with perl than with C in swish.



-- 
Bill Moseley
mailto:moseley@hank.org
Received on Fri Sep 6 01:23:12 2002