Re: Some Questions about 2.2RC1 and XML

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sat Sep 07 2002 - 21:02:09 GMT
At 08:00 PM 09/06/02 -0700, Bill Humphries wrote:
>We plan to process the results with XSLT (which is used for the rest of 
>the site's presentation layer), so we need all the entities converted 
>back to unicode equivalents. This is all being done in PHP. So there 
>may be a string function handy, I need to glance back at the man pages.

Yes, I agree that escaping should be done in the presentation layer.

>>> 2) I'm indexing XML source documents in the file system. I can use the
>>> configuration to use the first 100 characters of the document's root
>>> element, 'page', as the description:
>>>
>>> PropertyNamesMaxLength 100 swishdescription
>>> PropertyNameAlias swishdescription page
>>>
>>> However, when swish-e constructs the index, it's taking the attribute
>>> values, as well as the text nodes of 'page'.

That's because of:

   MetaNames page.title container.title container.access
   UndefinedXMLAttributes ignore

From the docs:

  ignore

    The contents of the meta tag are ignored and not indexed unless 
    a metaname has been defined with the MetaNames directive

And you are defining a metaname.

"UndefinedXMLAttributes ignore" is converting

   <page title="foo">
      ...
   </page>

into

   <page>
       <page.title>
            foo
       </page.title>
      ...
   </page>

And then since you have a metaname "page.title" it's indexed.

Now, you then have:

  PropertyNamesMaxLength 100 swishdescription
  PropertyNameAlias swishdescription page

which says to store a property named "swishdescription", with "page" as an
alias for it.  So everything inside <page> becomes part of that description,
and that includes the "foo" from the attribute.

Here you can see what is happening by using the trace option -T:

~/swish-e/src: $cat b
DefaultContents XML2
MetaNames page.title container.title container.access
UndefinedXMLAttributes ignore
PropertyNamesMaxLength 100 swishdescription
PropertyNameAlias swishdescription page

~/swish-e/src: $cat b.xml
<?xml version="1.0"?>
<page title="titlevalue">
      insidepagetag
</page>

~/swish-e/src: $./swish-e -c b -i b.xml -v0  \
      -T parsed_tags indexed_words properties

<page> (undefined meta name - no action)
<page> (property [swishdescription])
<page.title> (meta [page.title])
    Adding:[1:page.title(10)]   'titlevalue'   Pos:3  Stuct:0x1 ( FILE )
</page.title> (meta)
    Adding:[1:swishdefault(1)]   'insidepagetag'   Pos:5  Stuct:0x1 ( FILE )
</page> (property)
          swishdocpath: 6 (  5) S: "b.xml"
          swishdocsize: 8 (  4) N: "0000000000077"
     swishlastmodified: 9 (  4) D: "2002-09-07 11:51:25"
      swishdescription:13 ( 24) S: "titlevalue insidepagetag"

You can see from above that:

   <?xml version="1.0"?>
   <page title="titlevalue">
         insidepagetag
   </page>

was turned into:

   <?xml version="1.0"?>
   <page title="titlevalue">
      <page.title>
            titlevalue
      </page.title>
         insidepagetag
   </page>

And you are saving as a property the text contents of <page> and so you see

      swishdescription:13 ( 24) S: "titlevalue insidepagetag"
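
To make the effect concrete, here is a rough Python sketch that mimics the
behaviour described above.  It is not swish-e's code: the function name
walk() is made up for illustration, and cutting the joined text at 100
characters is only an assumption about what PropertyNamesMaxLength 100 does.

```python
import xml.etree.ElementTree as ET

def walk(elem, out):
    """Collect (tag, text) pairs, treating each attribute as a leading
    <tag.attr> child element (a model of the behaviour, not swish-e code)."""
    for name, value in elem.attrib.items():
        out.append((f"{elem.tag}.{name}", value.strip()))
    if elem.text and elem.text.strip():
        out.append((elem.tag, elem.text.strip()))
    for child in elem:
        walk(child, out)
        # Text that follows a child still belongs to the parent element.
        if child.tail and child.tail.strip():
            out.append((elem.tag, child.tail.strip()))
    return out

doc = '<page title="titlevalue">\n      insidepagetag\n</page>'
pairs = walk(ET.fromstring(doc), [])

# Everything under <page> ends up in the description (assumed 100-char cut).
description = " ".join(text for _, text in pairs)[:100]
print(pairs)
print(description)
```

The attribute value comes first because it is treated as the first child of
<page>, which matches the order seen in the trace.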

>Here's the relevant portion of the config file:
>
>MetaNames page.title container.title container.access
>XMLClassAttributes page.title container.access

That's actually wrong in this case.  XMLClassAttributes says *which*
attribute(s) should be used to create new metatags, built from the <tag>
plus the *value* of the listed attribute.

XMLClassAttributes class

    <person class="first">
        John
    </person>
    <person class="last">
        Doe
    </person>

becomes:
 
    <person>
        <person.first>
        John
        </person.first>
    </person>
    <person>
        <person.last>
        Doe
        </person.last>
    </person>
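
The same rewrite can be modelled in a few lines of Python.  This is only a
sketch of the transformation shown above, not how swish-e implements it;
apply_class_attribute() is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def apply_class_attribute(elem, attr="class"):
    """Wrap elem's content in a <tag.value> child, where value is the
    value of the listed attribute (hypothetical helper, for illustration)."""
    value = elem.attrib.pop(attr, None)
    if value is None:
        return elem  # attribute not present: leave the element alone
    inner = ET.Element(f"{elem.tag}.{value}")
    inner.text = elem.text
    children = list(elem)
    inner.extend(children)
    for child in children:
        elem.remove(child)
    elem.text = None
    elem.append(inner)
    return elem

person = apply_class_attribute(
    ET.fromstring('<person class="first">John</person>'))
result = ET.tostring(person, encoding="unicode")
print(result)
```

Note that the new tag name comes from the attribute's *value* ("first"),
not from its name ("class").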


>PropertyNameAlias swishtitle page.title
>PropertyNamesMaxLength 100 swishdescription
>PropertyNameAlias swishdescription page

Now, you don't really need to make the Aliases there if you are only
indexing one type of file.  All you really need is

   PropertyNamesMaxLength 100 page

That will create the property named "page" if it doesn't already exist, and
limit its length.

Then when you want it in your results either use -p page or -x '<page>...\n'

The reason to use 

  PropertyNameAlias swishdescription page
or
  PropertyNameAlias swishtitle page.title

would be if you were indexing HTML files and XML files at the same time and
you want to be able to map <page> or <page.title> from an XML file to the
same property name used for the HTML files.  Or if you are indexing various
XML files and in some files the title is <title> and in others the title is
<doctitle> then you might use

  PropertyNameAlias title doctitle

Then when you print out -p title or -x '<title>\n' you will get both.

There's nothing really special about "swishtitle" other than when indexing
HTML files the <title> contents are automatically placed in "swishtitle",
and when you run swish without using -x it displays "swishtitle".

Tip 1: I'd recommend writing any program that parses the output from
swish to use -x, and to not treat the built-in swish* properties
differently from any other properties.  That way you can write generic code
where you pass in a query and a list of properties and get back that list of
properties.
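
A minimal Python sketch of that generic approach follows.  The helper names
build_format() and parse_results() are made up, and it assumes that -x
accepts \t as a separator, that header lines start with "#", and that a
line containing only "." ends the results.  The sample text is hand-written
for illustration, not captured swish output.

```python
def build_format(props):
    """Build a -x format string: one <property> per column, separated by
    a literal backslash-t, ending in a literal backslash-n."""
    return "\\t".join(f"<{p}>" for p in props) + "\\n"

def parse_results(output, props):
    """Turn output produced with build_format(props) into a list of dicts."""
    rows = []
    for line in output.splitlines():
        # Assumed: '#' lines are headers and '.' marks the end of results.
        if not line or line.startswith("#") or line == ".":
            continue
        values = line.split("\t")
        if len(values) == len(props):
            rows.append(dict(zip(props, values)))
    return rows

props = ["swishdocpath", "swishdescription"]
fmt = build_format(props)

# Hand-written sample input, not real swish output.
sample = "# header line\nb.xml\ttitlevalue insidepagetag\n.\n"
rows = parse_results(sample, props)
print(fmt)
print(rows)
```

Nothing in the code knows which properties are built in, so the same two
functions work for "page", "page.title" or any swish* property.  A property
value that itself contains a tab would break the split, so a real script
may need a safer separator.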

>>> I'd also like to specify a location in the document to use as the
>>> description, ie /page/section[1]/para[1].
>>>
>>> The workaround here would be to use the prog method to load pages and 
>>> use
>>> some xpath tool to extract that location and use as the page 
>>> description.
>>
>> I'm not 100% clear what you want, but using -S prog with the available
>> tools will probably give you the most control.
>
>That's probably the way to go, just some expense on the indexing side, 
>but then I'm indexing on the order of 3,000 pages, so that's no great 
>burden.

Expense speed wise?  Very little.  Some say Perl is slow.  C would be slow
too if you had to compile a C program every time you ran it.  That's what
happens with Perl in CGI scripts, or when using a Perl-based filter in swish
(or -S http, which runs swishspider for every request): the script must
first be compiled each time.  But with -S prog the script is only compiled
once.  All those fast indexing jobs I post are processed using Perl.

   ~/swish-e/src$ perl prog.pl count=100000 | ./swish-e -S prog -i stdin
  Indexing Data Source: "External-Program"
  Indexing "stdin"
  100000 files indexed.  196473862 total bytes.  20312182 total words.
  Elapsed time: 00:03:27 CPU time: 00:02:22

But writing a -S prog program gives you a lot of power (and you don't have
to wonder how all those swish-e XML config options work.)  For example, if
you parsed the XML with something like Perl's XML::LibXML you could pass to
swish exactly the data you want to search, and arrange it how you want it
searched.  You can even send it to swish as HTML and make use of <h1> tags
to force some words to have more weight in ranking the docs than others.
E.g., start with:

  <page>
      <keywords>
            very important words
      </keywords>
      <other_important>
            kindof important words
      </other_important>
      regular words
  </page>

Then format it as:
  <html>
    <head>
       <title>very important words</title>
    </head>
    <body>
       <h1>kindof important words</h1>
       regular words
    </body>
  </html>

Then searching for "very" will rank the above doc higher than a doc where
"very" is just in the <body>.  That make sense?  It might seem like a lot
of bother, but it's really not that hard with all the tools available.
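
As a rough illustration of that reformatting step, here is a Python sketch
(the message itself has Perl and XML::LibXML in mind).  The helper names
and the path "page1.xml" are made up.  The Path-Name, Content-Length and
Document-Type headers are the -S prog headers as recalled from the swish-e
docs, so check them against your version.

```python
import sys
import xml.etree.ElementTree as ET
from html import escape

def text_of(root, tag):
    """Whitespace-normalised text of the first <tag> child, or ''."""
    node = root.find(tag)
    return " ".join(node.text.split()) if node is not None and node.text else ""

def to_html(xml_source):
    """Map <keywords> to <title>, <other_important> to <h1>, rest to body."""
    root = ET.fromstring(xml_source)
    # Text directly inside <page>: root.text plus the tail after each child.
    loose = [root.text or ""] + [child.tail or "" for child in root]
    body = " ".join(" ".join(loose).split())
    return ("<html><head><title>%s</title></head>"
            "<body><h1>%s</h1>%s</body></html>"
            % (escape(text_of(root, "keywords")),
               escape(text_of(root, "other_important")),
               escape(body)))

def prog_record(path, html):
    """One document in -S prog form: headers, blank line, then content.
    Content-Length is counted in bytes, not characters."""
    data = html.encode("utf-8")
    header = ("Path-Name: %s\nContent-Length: %d\nDocument-Type: HTML\n\n"
              % (path, len(data)))
    return header.encode("utf-8") + data

page = """<page>
  <keywords>very important words</keywords>
  <other_important>kindof important words</other_important>
  regular words
</page>"""
html = to_html(page)
record = prog_record("page1.xml", html)

if __name__ == "__main__":
    sys.stdout.buffer.write(record)
```

A real script would loop over the source files and write one such record
per document to stdout, which swish-e reads when run with -S prog -i stdin.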

Sorry for being so brief in my answer...


-- 
Bill Moseley
mailto:moseley@hank.org
Received on Sat Sep 7 21:05:44 2002