Indexing XML documents

From: Richard Lewis <richardlewis(at)>
Date: Wed Sep 06 2006 - 13:06:58 GMT

Hi there,

I've used Swish-e in the past for indexing XML content. I used the -S prog 
along with a Python script to extract dynamic HTML documents from a large XML 
collection and it all worked well.

I'm now trying to index a collection of XML documents where the hit 
granularity is the same as the file dispersion, i.e. one XML document per
hit. Therefore, I am attempting to use -S fs rather than -S prog. However,
I'm having problems with describing the content of my XML documents to
Swish-e.

My XML documents are quite simple and follow this format:

<section id="facilities" name="Facilities">
<!-- content -->
</section>

<subsection id="studios" name="Studios" section-id="facilities" 
<!-- content -->

So far I have the following Swish-e configuration file:

IndexFile site.index
IndexDir .
IndexOnly .xml
IndexContents XML* .xml

# exclude the "index.xml" file
FileRules filename is index\.xml

# attempting to index the attribute values
UndefinedXMLAttributes index

# alter the path names to remove the leading "." and remove
# the trailing ".xml"
ReplaceRules remove \\.xml
ReplaceRules remove \\.
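To make the intent of the two ReplaceRules lines concrete, here is a small Python sketch of the path rewrite I am after. It only illustrates the desired result described in the comment above; the function name display_path is made up for this example, and it says nothing about how Swish-e itself evaluates the patterns.

```python
def display_path(path: str) -> str:
    """Return the path as I would like it stored in the index.

    Illustration only: drops a trailing ".xml" and a leading ".".
    """
    if path.endswith(".xml"):
        path = path[: -len(".xml")]
    if path.startswith("."):
        path = path[1:]
    return path
```

So a file found as ./facilities.xml should end up with the path /facilities.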

What I want to be able to do is use the @name attribute as the "swishtitle" 
property, but I can't work out how to do this.

(I know I could do it using the -S prog method and transforming the XML 
documents into HTML on-the-fly.)
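For reference, that fallback could look roughly like the sketch below. It assumes the usual -S prog output format (Path-Name and Content-Length headers, a blank line, then the document) and uses a Document-Type header to request the HTML parser. The function names, the choice to flatten the body to plain text, and the skipping of index.xml are my own assumptions for illustration, not anything Swish-e prescribes.

```python
import os
import sys
import xml.etree.ElementTree as ET
from html import escape


def to_html(xml_bytes: bytes) -> bytes:
    """Wrap one XML document in HTML, using the root @name as <title>."""
    root = ET.fromstring(xml_bytes)
    title = root.get("name", "")
    # Flatten all text nodes and collapse runs of whitespace.
    body = " ".join("".join(root.itertext()).split())
    html = (
        "<html><head><title>{}</title></head>"
        "<body>{}</body></html>"
    ).format(escape(title), escape(body))
    return html.encode("utf-8")


def emit(path: str, out) -> None:
    """Write one document to the binary stream `out` in -S prog format."""
    with open(path, "rb") as fh:
        content = to_html(fh.read())
    header = (
        "Path-Name: {}\n"
        "Content-Length: {}\n"  # length in bytes, not characters
        "Last-Mtime: {}\n"
        "Document-Type: HTML*\n"
        "\n"
    ).format(path, len(content), int(os.path.getmtime(path)))
    out.write(header.encode("utf-8"))
    out.write(content)


def main(top: str = ".") -> None:
    """Emit every .xml file below `top`, except index.xml."""
    out = sys.stdout.buffer
    for dirpath, _dirs, files in os.walk(top):
        for name in sorted(files):
            if name.endswith(".xml") and name != "index.xml":
                emit(os.path.join(dirpath, name), out)
```

With a call to main() added at the bottom, the script would be named in IndexDir and run with -S prog. The HTML parser should then pick up the <title> element as the document title, which is what I would like to achieve without the extra transformation step.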

There are also some other things I can't work out (maybe they could be added 
to the FAQ?)

*) How do I query an index to find its available properties?
*) What are the names of Swish-e's default properties? (I know these are in 
the documentation somewhere but they're difficult to find.)
*) How do I assign an XML attribute to a property? And what if I want it to 
have a different name?

Any help with this would be great!

Richard Lewis
Sonic Arts Research Archive
Received on Wed Sep 6 06:06:58 2006