Re: [swish-e] swish.conf questions

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Aug 09 2007 - 18:22:43 GMT

On Thu, Aug 09, 2007 at 10:52:14AM -0700, Kerry Kobashi wrote:
> Maybe I should back up and explain what I'm trying to accomplish.
> 
> I have a file hierarchy that is deeply nested. Here's a snippet:
> 
> /foobar
> ---index.xml
> ---1.xml
> ---2.xml
> ---otherfile.htm
> ---otherfile2.htm
> ---/foobarsubcategory1
> ------index.xml
> ------1.xml
> ------2.xml
> ------otherfile.htm
> ---/foobarsubcategory2
> ------index.xml
> ------otherfile.htm
> 
> I want swish-e to index only index.xml files as it contains 
> metainformation for me to search on those XML documents.
> Inside each index.xml is the following:
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <index>
> <metaheader>
>     <title>The title</title>
>     <description>The description</description>
>     <keywords>
>        <keyword>kw1</keyword>
>        <keyword>kw2</keyword>
>     </keywords>
> </metaheader>
> <section>
>     <title>The title</title>
>     <description>The description</description>
>     <body>
>        Lorem ipsum blah blah <keyword>the keyword</keyword> more stuff follows.
>     </body>
> </section>
> .
> .
> </index>
> 
> I want swish-e to index only the metaheader tag elements - the title, 
> description, and the keyword. I do not want it to index the title, 
> description, keyword tags, and anything else including the other title, 
> description, and keyword tags located in other elements like section.

I would hack on DirTree.pl.  Finding only the index.xml files would be
simple.  Then use one of the XML parsers on CPAN to extract the data
you want, format it for swish-e's -S prog input method, and write it
to stdout.

If you don't want to write in Perl then pick your favorite scripting
language.
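
As a rough illustration of the first half of that idea, here is a
minimal Python sketch that walks a tree, picks out only the files named
index.xml, and writes each one to stdout as a -S prog record (headers,
a blank line, then the content).  The function names are hypothetical,
and the choice of "Document-Type: XML*" is an assumption; check the
-S prog section of the swish-e docs for the parser names your build
supports.

```python
#!/usr/bin/env python3
"""Sketch: feed only index.xml files to swish-e via its -S prog input method."""
import os
import sys


def find_index_files(root):
    """Yield the path of every file named index.xml under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if "index.xml" in filenames:
            yield os.path.join(dirpath, "index.xml")


def prog_record(path, content):
    """Return one document as bytes: headers, a blank line, then the content."""
    headers = (
        f"Path-Name: {path}\n"
        f"Content-Length: {len(content)}\n"  # content is bytes, so this is a byte count
        "Document-Type: XML*\n"              # assumption: let swish-e choose its XML parser
        "\n"
    )
    return headers.encode("utf-8") + content


def main(root):
    """Write one record per index.xml found under root to stdout."""
    out = sys.stdout.buffer
    for path in sorted(find_index_files(root)):
        with open(path, "rb") as fh:
            out.write(prog_record(path, fh.read()))


# Only runs when a root directory is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Saved under a hypothetical name such as find_index.py, it could
probably be run as

    ./find_index.py /foobar | swish-e -S prog -i stdin -c swish.conf

This passes each index.xml through unchanged; trimming the content down
to the metaheader fields would be a separate step.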

> # Store and index only the metaheader information
> MetaNames title, description, keyword

I don't think you want the commas.  MetaNames takes a
whitespace-separated list of names.


> 1) I am developing this with PHP 5, XSL, DOM. Can swish-e accomplish the 
> job? Or is a RDBMS + PHP solution more suitable?
> 2) If swish-e can do the job
>     a) Why is it indexing not only index.xml, but other XML files as well?

FileMatch is an odd one.  Read the docs on it again, but basically it
lets through files that would otherwise be excluded.  You are not
excluding the other files, so they are indexed as well.

    # exclude every file
    FileRules filename regex /./
    # except this one
    FileMatch filename is index\.xml


>     b) How do I avoid having swish-e index the section's title, 
> description, and keyword tags in the section, if not everywhere else?

I think the fastest way is to write a script that filters the files
and pipe its output into swish-e.
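
A minimal sketch of that filtering step in Python, assuming the layout
shown in the quoted index.xml: it keeps the title, description, and
keyword elements found under metaheader and drops everything else,
including the same tags inside section.  The function name
metaheader_only is made up for illustration, and the flat output layout
is just one possible choice.

```python
import xml.etree.ElementTree as ET


def metaheader_only(xml_bytes):
    """Return a reduced XML document holding only the metaheader fields."""
    root = ET.fromstring(xml_bytes)
    header = root.find("metaheader")  # direct child of <index>, never <section>
    doc = ET.Element("index")
    if header is not None:
        # Single-valued fields: copy the first title and description.
        for tag in ("title", "description"):
            node = header.find(tag)
            if node is not None and node.text:
                ET.SubElement(doc, tag).text = node.text.strip()
        # Multi-valued field: one <keyword> element per keyword.
        for kw in header.findall("keywords/keyword"):
            if kw.text:
                ET.SubElement(doc, "keyword").text = kw.text.strip()
    return ET.tostring(doc, encoding="utf-8")
```

The reduced document contains only title, description, and keyword
elements, so metanames with those names would see the metaheader
values and nothing from the sections.  The output would still need the
-S prog headers in front of it before being piped into swish-e.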


-- 
Bill Moseley
moseley@hank.org

Received on Thu Aug 9 14:22:43 2007