So there are 2 different subjects at hand here.
The first is: what is a document? Swish indexes documents and returns
references to documents as a search result. If your documents contain sets
of bibliographical descriptions, then swish will return references to these
sets(!) of bibliographical descriptions, and that is not what you expect
from a search in bibliographical data.
There are several solutions. You could preprocess your files and split them
up: make directories called A, B, C and generate an XML-file for each
record. Then index these directories. Other solutions require the -S prog
way of indexing. See Bill's remarks below.
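The splitting step can be sketched in Python roughly as follows. This is only an illustration: it assumes each bibliographical description is wrapped in an element called record (as in standard MARC XML), and the output file names are made up.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def local_name(tag):
    """Strip any '{namespace}' prefix that ElementTree puts on a tag."""
    return tag.rsplit("}", 1)[-1]


def split_records(xml_text, record_tag="record"):
    """Return every record element serialized as its own XML string.

    Assumption: one <record> element per bibliographical description.
    """
    root = ET.fromstring(xml_text)
    return [
        ET.tostring(elem, encoding="unicode")
        for elem in root.iter()
        if local_name(elem.tag) == record_tag
    ]


def write_records(records, out_dir):
    """Write one file per record; the naming scheme is hypothetical."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for number, record in enumerate(records, start=1):
        (out / f"record{number:05d}.xml").write_text(record, encoding="utf-8")
```

Each generated file is then a single "document" as far as swish is concerned, so a hit points at one record rather than at a whole letter of the alphabet.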
The other subject is the Marc format. There is almost no difference between
the Marc format and Marc's XML. Marc's XML is Marc with "strange brackets",
and "strange = signs and double quotes within these brackets". So I would
advise you to add a transformation of Marc's XML to your preprocessing:
<datafield tag="245" ind1="" ind2="4">
<subfield code="a">The American chiropractor</subfield>
<datafield tag="210" ind1="" ind2="">
<subfield code="a">AMERICAN CHIROPRACTOR</subfield>
<title>The American chiropractor</title>
or, a little more sophisticated, into:
<title>The American chiropractor</title>
In this way you move the essentials of the format out of the attribute
values of XML elements and into the element names themselves. After that you
can use all the XML features of Swish, such as searching on specific element
names.
You could use XSLT to do this.
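If XSLT is not at hand, the same renaming can be sketched in Python with the standard library instead. Only the 245 -> title mapping comes from the example above; the issn entry and the marc<tag><code> fallback names are assumptions for illustration, and the input is assumed to carry no XML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from Marc tags to readable element names.
# Only "245" -> "title" appears in the example above.
FIELD_NAMES = {"245": "title", "022": "issn"}


def rename_fields(record_xml):
    """Turn <datafield tag="..."><subfield code="..."> into named elements."""
    record = ET.fromstring(record_xml)
    out = ET.Element("record")
    for field in record.iter("datafield"):
        tag = field.get("tag", "")
        for sub in field.iter("subfield"):
            # Unmapped fields get a made-up fallback name, e.g. marc866x.
            name = FIELD_NAMES.get(tag, f"marc{tag}{sub.get('code', '')}")
            child = ET.SubElement(out, name)
            child.text = (sub.text or "").strip()
    return ET.tostring(out, encoding="unicode")
```

Note that in this sketch every subfield of a mapped datafield gets the same element name, which may or may not be what you want for fields with several subfields.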
[mailto:firstname.lastname@example.org] On behalf of Bill Moseley
Sent: Fri 13 February 2004 1:05
To: Multiple recipients of list
Subject: [SWISH-E] Re: Probs with xml-marc format
On Thu, Feb 12, 2004 at 02:58:12PM -0800, Thoreau Lovell wrote:
> We get a list of Journals for which we have online access to fulltext
> articles from a vendor in either html or xml. We're talking, say 20 -
> 40,000 journals. The list is exported as separate docs for each letter of
> the alphabet, where A--.html has all the journals that start with the
> letter "A".
Ok, so is there ONE journal entry per *.html file, or does a given html
file contain more than one entry?
> The problem is how the found set is returned. Searching for American
> Chiropractor, for instance, tells me that the journal is found in
> But I can't get Swish-e to return any of the more useful data elements:
> Journal title, ISSN, Coverage, Source, which are all present in the
I hope I'm understanding your problem.
Swish-e indexes single documents. It sounds like you are trying to feed
it some xml that contains more than one "document". Swish-e (currently)
does not have the feature to split up a multi-record document into the
What you likely want to do is use swish-e's "-S prog" feature where an
external program feeds "documents" to swish. So your external program
would parse the xml using either a SAX or DOM parser, format each
record into a document, and feed it to swish.
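As a rough sketch of the feeding side in Python: with "-S prog", swish reads each document from the program's standard output as a few header lines (Path-Name and Content-Length), a blank line, and then the content. The function name, the path values, and the Document-Type value "XML*" below are assumptions for illustration; check the parser name against your swish-e build and configuration.

```python
import sys


def swish_document(path_name, body, doc_type=None):
    """Format one record the way a "-S prog" program hands it to swish.

    Content-Length must count bytes, not characters, so the body is
    encoded before its length is taken.
    """
    data = body.encode("utf-8")
    headers = [
        f"Path-Name: {path_name}",
        f"Content-Length: {len(data)}",
    ]
    if doc_type is not None:
        # Optional header selecting the parser, e.g. "XML*" (assumed value).
        headers.append(f"Document-Type: {doc_type}")
    return ("\n".join(headers) + "\n\n").encode("utf-8") + data


def feed(records, out=None):
    """Write (path_name, body) pairs to stdout in the format above."""
    out = out or sys.stdout.buffer
    for path_name, body in records:
        out.write(swish_document(path_name, body, doc_type="XML*"))
```

The script that calls feed() would then be named as the input when indexing, along the lines of "swish-e -S prog -i ./yourscript". The Path-Name is what swish reports back in results, so it should be something you can use to find the record again.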
Swish-e doesn't do that now -- it could, I suppose, but since there are so
many good tools to do the parsing externally that it makes more sense to
An example of this setup is with the swish-e docs:
In this case it's breaking up the source HTML docs into sections and
indexing them separately. Search for something like "installation" and
you can see that you might get more than one result for a given page.
> files. This seems like a situation where the structured nature of XML
> should be useful, so I've focused on working with XML Docs.
Seems reasonable. You just need to parse it into chunks. Do you have a
> One problem may be that the format the vendor uses is xml-marc, which
> seems to give Swish-e some trouble. Here's a snippet of what the data
> looks like:
What does "trouble" mean?
> -<datafield tag="022" ind1="" ind2="">
> <subfield code="a">0194-6536</subfield>
> -<datafield tag="245" ind1="" ind2="4">
> <subfield code="a">The American chiropractor</subfield>
> -<datafield tag="210" ind1="" ind2="">
> <subfield code="a">AMERICAN CHIROPRACTOR</subfield>
> -<datafield tag="090" ind1="" ind2="">
> <subfield code="a">110978978735405</subfield>
> -<datafield tag="866" ind1="" ind2="">
> <subfield code="x">Alt-HealthWatch:Full Text</subfield>
> <subfield code="a"> Availability: from 1998</subfield>
I don't see any problem with that. If you format as HTML you can affect
the ranking a bit (i.e. words inside <title> would get ranked higher
than words in <body>).
> I've experimented with XMLClassAttributes and UndefinedXMLAttributes,
> without much luck.
No, those are more for pulling text out of attributes (and what to do
> What I'd like to see is a search result like this:
> AMERICAN CHIROPRACTOR (0194-6536)
> Alt-HealthWatch:Full Text
> Availability: from 1998
There's a few ways to do this, but you could format as:
<meta name="022" content="0194-6536">
<meta name="866.a" content="Availability: from 1998">
<meta name="866.x" content="Alt-HealthWatch:Full Text">
Then use MetaNames to define what fields to search for. Use
UndefinedMetaNames to define what to do with meta content that is not
listed in MetaNames.
And use
PropertyNames 022 866.a 866.x
to store the text for display on search results.
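For illustration, generating that HTML form from an already parsed record might look like this in Python. The function name and the dictionary layout are made up; the meta names are the ones from the example above.

```python
from html import escape


def record_to_html(title, fields):
    """Render one record as a small HTML document with <meta> tags.

    `fields` maps meta names such as "022" or "866.a" to their text.
    Values are escaped so quotes or ampersands in the data cannot
    break the markup.
    """
    metas = "\n".join(
        f'<meta name="{escape(name, quote=True)}" '
        f'content="{escape(value, quote=True)}">'
        for name, value in fields.items()
    )
    return (
        "<html><head>\n"
        f"<title>{escape(title)}</title>\n"
        f"{metas}\n"
        "</head><body></body></html>\n"
    )
```

The names used here have to match what is listed under MetaNames and PropertyNames in the config, otherwise the content falls under whatever UndefinedMetaNames says.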
Sure hope I'm answering the right question. ;)
Received on Fri Feb 13 04:00:14 2004