On Thursday, September 5, 2002, at 06:19 PM, Bill Moseley wrote:
>> "NOTE: Entities within XML files and files parsed with libxml2 are
>> converted regardless of this setting."
> Right, that option only works for HTML docs (docs parsed by html.c).
> XML2 both convert entities.
>> My current workaround for this is to build an XML result string, then
>> run it through Tidy (http://tidy.sourceforge.net/) to re-escape entities.
> What needs to be escaped besides < and >? Seems like it would be slow to
> spawn an external program to do this. If you are using Perl, things like
> CGI.pm and HTML::Entities can do this work.
We plan to process the results with XSLT (which is used for the rest of
the site's presentation layer), so we need all the entities converted
back to their Unicode equivalents. This is all being done in PHP, so there
may be a string function handy; I need to glance back at the man pages.
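
As a rough illustration of the conversion described above, here is a
minimal sketch in Python rather than PHP. It assumes the goal is to turn
named HTML entities into their Unicode characters while leaving the five
XML-predefined entities escaped, so the string stays safe to hand to an
XSLT processor. The function name entities_to_unicode is made up for this
example.

```python
import re
from html.entities import name2codepoint

# Entities that must stay escaped for the result to remain well-formed XML.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches named entity references such as &eacute; or &nbsp;
_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def entities_to_unicode(text):
    """Replace named HTML entities with Unicode characters.

    XML-predefined entities and unknown names are left untouched.
    Numeric references (&#233;) are not matched, since they are
    already valid in XML.
    """
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return chr(name2codepoint[name])

    return _NAMED_ENTITY.sub(replace, text)
```

Leaving &amp;, &lt; and &gt; alone is a deliberate choice here: decoding
them would produce markup characters in the middle of text content.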
>> 2) I'm indexing XML source documents in the file system. I can use the
>> configuration to use the first 100 characters of the document's root
>> element, 'page', as the description:
>> PropertyNamesMaxLength 100 swishdescription
>> PropertyNameAlias swishdescription page
>> However, when swish-e constructs the index, it's taking the attribute
>> values, as well as the text nodes of 'page'.
> Can you put together a small example?
Here's the relevant portion of the config file:
MetaNames page.title container.title container.access
XMLClassAttributes page.title container.access
PropertyNameAlias swishtitle page.title
PropertyNamesMaxLength 100 swishdescription
PropertyNameAlias swishdescription page
# But only index the .xml files
IndexContents XML2 .xml
Then an example file to index:
<page changed="false" name="dsAuthTest" new="false" title="dsAuth Test">
<container access="employee" changed="false" new="false"
shorttitle="Employee" title="Section One">
<para changed="false" new="false">Content viewable by all
>> dsAuth Test</breadcrumb>
Then when I search
% /usr/local/bin/swish-e -f index.swish-e -w dsAuth -x '%p\n%d'
# SWISH format: 2.2rc1
# Search words: dsAuth
# Number of hits: 1
# Search time: 0.001 seconds
# Run time: 0.104 seconds
false dsAuthTest false dsAuth Test employee false false Employee
Section One false false Content vie.
As you can see, the description is pulling in the attribute values
ahead of the text nodes, so they use up most of the 100 characters.
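
For comparison, this is the result I would expect, sketched in Python
with the standard library's ElementTree. The sample document is a
shortened, made-up stand-in for the real file (the paragraph text is
invented), and text_description is a hypothetical helper, not anything
swish-e provides. It collects only text nodes, so attribute values never
reach the description.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the indexed file; the para text is invented.
SAMPLE = """<page changed="false" name="dsAuthTest" new="false" title="dsAuth Test">
  <container access="employee" changed="false" new="false" title="Section One">
    <para changed="false" new="false">Example paragraph text.</para>
  </container>
</page>"""


def text_description(xml_source, max_len=100):
    """Return the first max_len characters of the document's text nodes."""
    root = ET.fromstring(xml_source)
    # itertext() yields element text and tails only, never attribute values.
    words = " ".join(root.itertext()).split()
    return " ".join(words)[:max_len]


print(text_description(SAMPLE))
```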
>> I'd also like to specify a location in the document to use as the
>> description, ie /page/section/para.
>> The workaround here would be to use the prog method to load pages and
>> some XPath tool to extract that location and use it as the page description.
> I'm not 100% clear what you want, but using -S prog with the available
> tools will probably give you the most control.
That's probably the way to go, just some expense on the indexing side,
but then I'm indexing on the order of 3,000 pages, so that's no great
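
A minimal sketch of what such a -S prog helper might look like, in
Python. Two assumptions to check against the swish-e documentation before
relying on it: that each document is sent as "Path-Name:" and
"Content-Length:" headers, a blank line, then the content, and that the
length is counted in bytes. The function names, the sample document, and
the container/para path are made up for the example; ElementTree only
supports a limited XPath subset, with paths relative to the root element.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the real documents have more structure.
SAMPLE = """<page name="dsAuthTest" title="dsAuth Test">
  <container access="employee" title="Section One">
    <para>Example paragraph text.</para>
  </container>
</page>"""


def extract_text(xml_source, path):
    """Return the normalised text of the first element matching path, or ''."""
    root = ET.fromstring(xml_source)
    node = root.find(path)  # path is relative to the root element
    if node is None:
        return ""
    return " ".join("".join(node.itertext()).split())


def prog_record(path_name, content):
    """Frame one document as headers, a blank line, then the content bytes.

    Header names are an assumption about the prog input format.
    """
    body = content.encode("utf-8")
    header = "Path-Name: {}\nContent-Length: {}\n\n".format(path_name, len(body))
    return header.encode("utf-8") + body
```

A driver would loop over the .xml files, call extract_text on each, and
write the bytes from prog_record to standard output for swish-e to read.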
Received on Sat Sep 7 03:05:23 2002