Re: Problem indexing OpenOffice files

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue May 20 2003 - 20:37:02 GMT
On Tue, May 20, 2003 at 01:23:16PM -0700, Douglas Smith wrote:
> 
> Yes, I would like to know more about this.  I got this to work, but not
> in a nice way.  I used the same filter line, with the "unzip content.xml"
> and there was lots of xml to parse.  But XML2 would return nothing for
> some reason, and no content would get indexed.  I switched to HTML2,
> which indexed a bunch of junk along with the content, but got all the
> content that users wanted, so I left it as a kludge until I could
> fix the XML2.

Perhaps two different problems:

I had Ivo Mans send me the OO file, and I uncompressed it and then indexed with:

  swish-e -i content.xml -T indexed_words parsed_tags

That showed me that the text was in the <text:p> tag, not <text>.
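
As a rough cross-check outside swish-e, the same question (which tags actually hold the text?) can be answered with a short Python sketch. Nothing here comes from swish-e: the helper names read_content_xml and text_tag_counts are made up for illustration, and the only assumption carried over from the thread is that the document is a zip archive with a content.xml member.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def read_content_xml(path):
    """Return the raw bytes of content.xml from a zipped OpenOffice file."""
    with zipfile.ZipFile(path) as archive:
        return archive.read("content.xml")


def text_tag_counts(xml_bytes):
    """Count characters of text found directly inside each tag.

    Keys use the prefix declared in the document (for example "text:p").
    """
    # First pass: record namespace declarations so tags can be shown
    # with their prefix instead of ElementTree's {uri}name form.
    prefixes = {}
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns",))
    for _, (prefix, uri) in events:
        prefixes[uri] = prefix
    root = events.root  # available once the source has been fully read

    def display_name(tag):
        if not tag.startswith("{"):
            return tag
        uri, local = tag[1:].split("}", 1)
        prefix = prefixes.get(uri, "")
        return f"{prefix}:{local}" if prefix else local

    counts = Counter()

    def walk(elem):
        # Text before the first child belongs to this element...
        chars = len((elem.text or "").strip())
        for child in elem:
            walk(child)
            # ...and so does the text that follows each child (its "tail").
            chars += len((child.tail or "").strip())
        if chars:
            counts[display_name(elem.tag)] += chars

    walk(root)
    return counts
```

With a hypothetical file name, text_tag_counts(read_content_xml("document.sxw")).most_common(5) would list the tags carrying the most text, which are the candidates for StoreDescription.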

I then used a config file of:

DefaultContents XML2
StoreDescription XML* <text:p> 20000

and it then stored the description.

I did not spend much time looking at the XML, so other tags might need to be set up as 
aliases (perhaps <text:s>).  -T parsed_tags isn't as helpful as I expected -- I 
thought it used to indent and show ending tags.  Oh well.

The other problem is that Ivo was seeing an error from the parser -- that might be due to 
all the XML being on a single line.  I did not have that problem, but I'm using a newer 
version of libxml2.

$ xml2-config --version
2.5.6
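
One way to test the single-line theory, sketched here in Python rather than anything swish-e provides, is to re-indent content.xml onto multiple lines and index that copy instead. The function name reindent is made up for illustration. Note that pretty-printing adds whitespace around inline elements, so the result is for diagnosis only and is not a faithful copy of the document.

```python
from xml.dom import minidom


def reindent(xml_bytes, indent="  "):
    """Return the XML as a string with one element per line, indented.

    This also shows start and end tags in a nested layout, which makes
    it easier to see which elements wrap the text.
    """
    return minidom.parseString(xml_bytes).toprettyxml(indent=indent)
```

If the parser error goes away on the re-indented copy, the long single line is the likely cause; if it persists, the problem is probably elsewhere.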



-- 
Bill Moseley
moseley@hank.org
Received on Tue May 20 20:37:07 2003