Re: Probs with xml-marc format

From: Frits van Latum <fritsvanlatum(at)>
Date: Fri Feb 13 2004 - 12:00:10 GMT
Hi Thoreau,

So there are 2 different subjects at hand here.

The first is: what is a document? Swish indexes documents and returns
references to documents as a search result. If your documents contain sets
of bibliographical descriptions, then swish will return references to these
sets(!) of bibliographical descriptions, and that is not what you expect
from a search in bibliographical data.

There are several solutions. You could preprocess your files and split them
up: make directories called A, B, C and so on, and generate one XML file for
each record. Then index these directories. Other solutions require the
-S prog way of indexing. See Bill's remarks below.
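
As a rough illustration of the splitting step, here is a minimal Python sketch. It assumes each description sits in a <record> element without an XML namespace, as in the snippet quoted further down; the function names and the output file naming are made up for the example.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def iter_record_xml(xml_text):
    """Return each <record> element in xml_text as its own XML string.

    Assumes un-namespaced <record> elements (an assumption based on
    the snippet quoted in this thread).
    """
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so the nesting depth of <record> does not matter.
    return [ET.tostring(rec, encoding="unicode").strip() for rec in root.iter("record")]


def write_records(xml_text, out_dir, prefix="record"):
    """Write one file per record into out_dir (hypothetical naming scheme)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, rec_xml in enumerate(iter_record_xml(xml_text), start=1):
        path = out / f"{prefix}-{i:05d}.xml"
        path.write_text(rec_xml, encoding="utf-8")
        paths.append(path)
    return paths
```

Calling write_records() once per vendor file, with out_dir set to "A", "B" and so on, would give the directory layout described above.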

The other subject is the Marc format. There is almost no difference between
the Marc format and Marc's XML. Marc's XML is Marc with "strange brackets",
and "strange = signs and double quotes within these brackets". So I would
advise you to add a transformation of Marc's XML to your preprocessing, turning:

<datafield tag="245" ind1="" ind2="4">
          <subfield code="a">The American chiropractor</subfield>
</datafield>
<datafield tag="210" ind1="" ind2="">
         <subfield code="a">AMERICAN CHIROPRACTOR</subfield>
</datafield>

into:

<title>The American chiropractor</title>
<subject>AMERICAN CHIROPRACTOR</subject>

or, a little more sophisticated, into:
<title>The American chiropractor</title>
<sorttitle>American chiropractor</sorttitle>
<subject>AMERICAN CHIROPRACTOR</subject>

In this way you move the essentials of the format from the values of
attributes in XML elements to the element names themselves. After that you
can use all the XML features of Swish, such as searching on specific
element names. You could use XSLT to do this.
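
For readers who would rather not write XSLT, the same transformation can be sketched in Python. This is only an illustration, not an XSLT stylesheet: the FIELD_NAMES mapping and the function name are assumptions taken from the example above, and the sort title relies on the second indicator of field 245 holding the number of leading characters to skip (4 for "The ").

```python
import xml.etree.ElementTree as ET

# Illustrative mapping from Marc tags to element names, following the example above.
FIELD_NAMES = {"245": "title", "210": "subject"}


def simplify_record(record_xml):
    """Turn <datafield tag="..."> elements into elements named after their meaning."""
    rec = ET.fromstring(record_xml)
    out = ET.Element("record")
    for field in rec.iter("datafield"):
        tag = field.get("tag")
        name = FIELD_NAMES.get(tag)
        if name is None:
            continue  # fields without a mapping are dropped in this sketch
        text = " ".join((sf.text or "").strip() for sf in field.iter("subfield"))
        ET.SubElement(out, name).text = text
        if tag == "245":
            # ind2 of field 245 gives the count of non-filing characters.
            ind2 = (field.get("ind2") or "").strip()
            skip = int(ind2) if ind2.isdigit() else 0
            ET.SubElement(out, "sorttitle").text = text[skip:]
    return ET.tostring(out, encoding="unicode")
```

Fed the two datafields from the example, wrapped in a <record> element, this should produce title, sorttitle and subject elements like the ones shown above.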


-----Original message-----
On behalf of Bill Moseley
Sent: Fri 13 February 2004 1:05
To: Multiple recipients of list
Subject: [SWISH-E] Re: Probs with xml-marc format

On Thu, Feb 12, 2004 at 02:58:12PM -0800, Thoreau Lovell wrote:

> We get a list of Journals for which we have online access to fulltext
> articles from a vendor in either html or xml. We're talking, say 20 -
> 40,000 journals. The list is exported as separate docs for each letter of
> the alphabet, where A--.html has all the journals that start with the
> letter "A".

Ok, so is there ONE journal entry per *.html file, or does a given html
file contain more than one entry?

> The problem is how the found set is returned. Searching for American
> Chiropractor, for instance, tells me that the journal is found in
> But I can't get Swish-e to return any of the more useful data elements:
> Journal title, ISSN, Coverage, Source, which are all present in the

I hope I'm understanding your problem.

Swish-e indexes single documents.  It sounds like you are trying to feed
it some xml that contains more than one "document".  Swish-e (currently)
does not have the feature to split up a multi-record document into the
individual parts.

What you likely want to do is use swish-e's "-S prog" feature, where an
external program feeds "documents" to swish.  So your external program
would parse the xml using either a SAX or DOM parser, format each record
into a document, and feed it to swish.

Swish-e doesn't do that now.  It could, I suppose, but since there are so
many good tools to do the parsing externally, it makes more sense to
use those.
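
A minimal sketch of such an external program, in Python. It assumes the header layout described in the swish-e documentation for -S prog (Path-Name, Content-Length as a byte count, an optional Document-Type, then a blank line and the content); please check the docs for your version. The function name, the "journal-N" path names and the sample records are made up.

```python
import sys


def swish_document(path_name, content, doc_type="XML*"):
    """Frame one document for swish-e's -S prog input (header layout assumed from the docs)."""
    body = content.encode("utf-8")
    header = (
        f"Path-Name: {path_name}\n"
        f"Content-Length: {len(body)}\n"  # length in bytes, not characters
        f"Document-Type: {doc_type}\n"
        "\n"
    ).encode("utf-8")
    return header + body


def main():
    # Placeholder records; a real program would parse the vendor's files here.
    records = ["<record><title>The American chiropractor</title></record>"]
    for i, rec_xml in enumerate(records, start=1):
        sys.stdout.buffer.write(swish_document(f"journal-{i}", rec_xml))


if __name__ == "__main__":
    main()
```

Saved as, say, feed.py and made executable, it would be run with something like "swish-e -S prog -i ./feed.py".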

An example of this setup is with the swish-e docs:

In this case it's breaking up the source HTML docs into sections and
indexing them separately.  Search for something like "installation" and
you can see that you might get more than one result for a given page.

> files. This seems like a situation where the structured nature of XML
> should be useful, so I've focused on working with XML Docs.

Seems reasonable.  You just need to parse it into chunks.  Do you have a
favorite language?

> One problem may be that the format the vendor uses is xml-marc, which
> seems to give Swish-e some trouble. Here's a snippet of what the data
> looks like:

What does "trouble" mean?

> <record>
>   <leader>-----nas-a22-----z--4500</leader>
>   <datafield tag="022" ind1="" ind2="">
>     <subfield code="a">0194-6536</subfield>
>   </datafield>
>   <datafield tag="245" ind1="" ind2="4">
>     <subfield code="a">The American chiropractor</subfield>
>   </datafield>
>   <datafield tag="210" ind1="" ind2="">
>     <subfield code="a">AMERICAN CHIROPRACTOR</subfield>
>   </datafield>
>   <datafield tag="090" ind1="" ind2="">
>     <subfield code="a">110978978735405</subfield>
>   </datafield>
>   <datafield tag="866" ind1="" ind2="">
>     <subfield code="x">Alt-HealthWatch:Full Text</subfield>
>     <subfield code="a"> Availability: from 1998</subfield>
>   </datafield>
> </record>

I don't see any problem with that.  If you format as HTML you can affect
the ranking a bit (e.g. words inside <title> would get ranked higher
than words in <body>).

> I've experimented with XMLClassAttributes and UndefinedXMLAttributes,
> without much luck.

No, those are more for pulling text out of attributes (and what to do
with them).

> What I'd like to see is a search result like this:
>          Alt-HealthWatch:Full Text
>          Availability: from 1998

There are a few ways to do this, but you could format as:

<meta name="022" content="0194-6536">
<meta name="866.a" content="Availability: from 1998">
<meta name="866.x" content="Alt-HealthWatch:Full Text">

Then use MetaNames to define what fields to search for.  Use
UndefinedMetaNames to define what to do with meta content that is not
listed in MetaNames.

And use

   PropertyNames 022 866.a 866.x

to store the text for display on search results.
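
To show how the meta tags above might be generated, here is a small Python sketch. The naming rule (the bare tag when a field has one subfield, tag.code otherwise) is only an assumption chosen to match the three example lines; the function name is made up.

```python
import xml.etree.ElementTree as ET
from html import escape


def record_to_meta_html(record_xml):
    """Render one <record> as an HTML page whose <meta> tags carry the field values."""
    rec = ET.fromstring(record_xml)
    lines = ["<html><head>"]
    for field in rec.iter("datafield"):
        tag = field.get("tag")
        subfields = list(field.iter("subfield"))
        for sf in subfields:
            # Assumed rule: "022" for a lone subfield, "866.a" / "866.x" otherwise.
            name = tag if len(subfields) == 1 else f"{tag}.{sf.get('code')}"
            value = escape((sf.text or "").strip(), quote=True)
            lines.append(f'<meta name="{name}" content="{value}">')
    lines.append("</head><body></body></html>")
    return "\n".join(lines)
```

The names produced this way are the ones that would then go into MetaNames and PropertyNames in the config file.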

Sure hope I'm answering the right question. ;)

Bill Moseley

Received on Fri Feb 13 04:00:14 2004