Yes, I would like to know more about this. I got this to work, but not
in a nice way. I used the same filter line, with the "unzip content.xml",
and there was lots of XML to parse. But XML2 would return nothing for
some reason, and no content would get indexed. I switched to HTML2,
which indexed a bunch of junk along with the content, but got all the
content that users wanted, so I left it as a kludge until I could
fix the XML2.

The XML which gets produced is fairly complicated, so it would probably be
better to try an OpenOffice doc as an example. I could put an example
doc and its content.xml on the web if people want to see one for testing.

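For anyone who wants to check a file by hand outside swish-e, here is a
minimal sketch in Python (chosen only for illustration, swish-e does not
use it). It does the same two steps as the filter: pull content.xml out
of the zip, then run it through an XML parser. The function names are
made up for this example. It only shows whether the file is well-formed
and what text comes out; it does not explain why XML2 returned nothing.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def read_content_xml(path_or_file):
    """Return the raw bytes of content.xml from an OpenOffice document.

    OpenOffice files (.sxw, .sxc, .sxg) are zip archives, so this is the
    same step as `unzip -p file content.xml`.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        return archive.read("content.xml")


def extract_text(xml_bytes):
    """Parse the XML and return its character data, whitespace-normalised.

    Raises xml.etree.ElementTree.ParseError if the input is not
    well-formed, for example when there is no root element.
    """
    root = ET.fromstring(xml_bytes)
    # itertext() walks every element and yields the text between tags.
    return " ".join(" ".join(root.itertext()).split())
```

Calling extract_text(read_content_xml("example.sxw")) on a document
(the file name is hypothetical) either returns the text or raises a
parse error with a line number, which can be compared against the
warning swish-e prints.
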
On Tue, 20 May 2003, Bill Moseley wrote:
> On Tue, May 20, 2003 at 07:50:15AM -0700, Ivo Mans wrote:
> > I'm trying to index OpenOffice files (on an otherwise perfectly working swish-e installation).
> > I've added the following lines to my config:
> > FileFilterMatch "/usr/bin/unzip" "-p \"%p\" content.xml" /\.(sxw|sxc|sxg)$/i
> > IndexContents XML* .sxw .sxc .sxg
> > StoreDescription XML <text> 20000
> That's confusing.
> XML is one parser based on expat
> XML2 is another parser based on libxml2
> XML* says use the libxml2 parser if available, but fall back to expat otherwise.
> So IndexContents XML* is really XML2 if you have libxml2 installed, but you are
> using StoreDescription XML. Try StoreDescription XML* so it matches up.
> It's confusing, yes.
> > Resulting in error message:
> > Warning: XML parse error in file './QU030423im01.sxw' line 2. Error: not well-formed
> > (93 words)
> > This goes for many or all of the OO-files on our network, created with recent OO-versions
> > (mostly the latest v.184.108.40.206). Inspected manually, the unzipped result looks like a fine
> > XML file to me, although it is too complex to be 100% sure.
> > The unzipped content:
> > line 1: <?xml version="1.0" encoding="UTF-8"?>
> > line 2: All other data, including style definitions: can be an extremely long line
> Where's the opening tag?
> <?xml version="1.0" encoding="UTF-8"?>
> Bill Moseley
Received on Tue May 20 20:29:54 2003