On Thu, Nov 10, 2005 at 05:18:55AM -0800, Lars D. Noodén wrote:
> On Thu, 10 Nov 2005, Bill Moseley wrote:
> > You can't have two XML declarations in the file. Libxml2 will just
> > stop parsing.
>
> My mistake. Couldn't files have multiple declarations in SGML?
Even if you could just cat them together, the result probably would
not be that useful in the general case.
I would think you would be happier with the results if you created a
new, empty XML document, walked meta.xml fetching the meta and dc tags
and rewrote them using simple tags (i.e. <printed> instead of
<meta:print-date>), and then grabbed all the content nodes from the
content.xml file and placed them in <content>.
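
A minimal sketch of that approach, in Python for brevity (a real
SWISH::Filter filter would be Perl). The <doc> root element, the
RENAME table, and the function names are illustrative assumptions,
not anything Swish-e defines. For simplicity it flattens the content
to plain text rather than copying the content nodes themselves, and
joining text runs with spaces can split a word that spans inline
markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical rename table: namespaced local name -> simple tag.
RENAME = {"print-date": "printed", "creation-date": "created"}


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def merge(meta_xml, content_xml):
    """Build one new XML document from the text of meta.xml and content.xml."""
    root = ET.Element("doc")  # assumed root name

    # Walk the metadata and re-emit every element that carries text
    # under a simple, namespace-free tag.
    for el in ET.fromstring(meta_xml).iter():
        text = (el.text or "").strip()
        if text:
            name = local_name(el.tag)
            ET.SubElement(root, RENAME.get(name, name)).text = text

    # Flatten the document body into a single <content> element,
    # collapsing runs of whitespace.
    content = ET.SubElement(root, "content")
    body = ET.fromstring(content_xml)
    content.text = " ".join(" ".join(body.itertext()).split())

    return ET.tostring(root, encoding="unicode")
```

The result has a single root and no second XML declaration, so the
parser reads it as one document with separately named fields.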
You could also get fancy and generate HTML -- the advantage there is
that you can get some tags (<title>, <h1>, <em>) to rank a bit higher
in search results. I suppose there's a way to use OO itself to open
the document and generate HTML, but that would be slow.
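
A rough sketch of the HTML variant, again in Python rather than Perl.
It assumes headings and paragraphs appear in content.xml as elements
whose local names are "h" and "p", and that the title is an element
named "title" in meta.xml; the function names are made up for the
example. Every heading becomes <h1>, and paragraphs nested inside
other paragraphs (text boxes, for instance) would be emitted twice,
so treat it as a starting point only.

```python
from html import escape
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def to_html(meta_xml, content_xml):
    """Return a small HTML page built from the two XML documents."""
    # Use the first non-empty <title> found in the metadata.
    title = ""
    for el in ET.fromstring(meta_xml).iter():
        if local_name(el.tag) == "title" and el.text:
            title = el.text.strip()
            break

    # Map headings to <h1> and paragraphs to <p>, escaping the text.
    parts = []
    for el in ET.fromstring(content_xml).iter():
        name = local_name(el.tag)
        if name not in ("h", "p"):
            continue
        text = " ".join("".join(el.itertext()).split())
        if text:
            tag = "h1" if name == "h" else "p"
            parts.append("<%s>%s</%s>" % (tag, escape(text), tag))

    return "<html><head><title>%s</title></head><body>\n%s\n</body></html>" % (
        escape(title),
        "\n".join(parts),
    )
```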
> I've been looking at the other filters, particularly Pdf2HTML.pm and
> XLtoHTML.pm, but if XML can't handle more than one declaration per file,
> then my intended approach won't work.
I just don't think blindly cat'ing the files together is the way to
go.
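
To see the parsing problem in isolation, here is a small check using
Python's built-in expat-based parser. That is an assumption made for
convenience: Swish-e uses libxml2, whose error message will differ,
but a second XML declaration partway through the input is not
well-formed XML for either parser. The sample documents are made up.

```python
import xml.etree.ElementTree as ET

# Two tiny stand-ins for meta.xml and content.xml (invented samples).
META = '<?xml version="1.0"?>\n<meta><title>Report</title></meta>\n'
CONTENT = '<?xml version="1.0"?>\n<content><p>Hello</p></content>\n'


def parses(xml_text):
    """Return True if xml_text is a single well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


each_ok = parses(META) and parses(CONTENT)  # each file alone is fine
cat_ok = parses(META + CONTENT)             # the concatenation is not
```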
> Instead, could SWISH::Filter pass the file to multiple filters, with each
> one getting passed to 'prog'[1] separately? One pass
> could get the content, the second the metadata, etc.
>
> [1] http://swish-e.org/docs/filter.html#writing_filters
Then you end up with duplicate files in search results.
--
Bill Moseley
moseley@hank.org
Received on Thu Nov 10 05:58:57 2005