Re: 'Missing Title' with swishtitle property

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu May 26 2005 - 14:52:58 GMT
On Thu, May 26, 2005 at 04:28:32AM -0700, Philip Young wrote:
> > Probably because you don't have a <title>.
> Well, the documents (word, OOo) I indexed did have titles, and I have
> originally tried displaying the results via the 'swish.cgi' template
> and it did display the 'swishtitle' property on the results so I think
> the index does have this value.

Maybe yes, maybe no.  The swish-e binary displays the path name when
there isn't a title.

To test, run swish-e with -T indexed_words when indexing to see
whether a "swishtitle" gets added to the index, and then search the
index from the command line to see whether the title is reported.

> > You might have better luck using SWISH::Filter instead of FileFilter
> > -- gives you more control of the input to swish from your various file
> > formats.  More likely to get a title (for example, pdf conversion uses
> > pdfinfo to extract out a title, if possible).
> Interesting..  Using the SWISH::Filter would there be a need to
> declare this filter
> within the swish.conf, search.cgi files?  I am assuming SWISH::Filter
> is built into the swish-e installation?

Kind of.

SWISH::Filter is a front end to a bunch of modules that do filtering.
The individual filter modules do use things like catdoc and xpdf, so
those also need to be installed.

SWISH::Filter just provides a way to pass in a file (or content), and
a content-type (or let SWISH::Filter determine it based on file name)
and SWISH::Filter returns the filtered document.

There are a few ways to hook that into swish-e.  One is to use the
FileFilter command to run the swish_filter.pl script for each
document.  That loads the SWISH::Filter module and passes the
document off for processing.  (swish_filter.pl is in
/usr/local/share/doc/swish-e/example on my machine).

    FileFilter .pdf /path/to/swish_filter.pl
    FileFilter .doc /path/to/swish_filter.pl
    FileFilter .mp3 /path/to/swish_filter.pl


The problem with doing it that way is it's slow -- SWISH::Filter is a
collection of modules that would need to be loaded and compiled for
each document.  I suppose you could run swish_filter.pl under
SpeedyCGI which keeps the program running in memory.

The more common way to use SWISH::Filter is from a program that
fetches the documents and then passes them to swish-e on stdin.
DirTree.pl and spider.pl are two examples included in the
distribution that do this.

   /path/to/DirTree.pl <dir> | swish-e -S prog -i stdin
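
For a sense of what such a program has to write on stdout, here is a
minimal sketch in Python rather than Perl.  It assumes the -S prog
input format: a Path-Name header, a Content-Length header (in bytes),
an optional Last-Mtime header, a blank line, and then the document
body.  The names format_record and walk are made up for illustration,
and no filtering is done; a real feeder would convert each file (for
example with SWISH::Filter) at the point where this one simply reads
it.

```python
import os
import sys


def format_record(path, content, mtime=None):
    """Return one document as headers, a blank line, then the body (bytes)."""
    headers = [
        "Path-Name: %s" % path,
        "Content-Length: %d" % len(content),  # length in bytes, not characters
    ]
    if mtime is not None:
        headers.append("Last-Mtime: %d" % int(mtime))
    # A blank line ends the headers; the body follows immediately.
    return ("\n".join(headers) + "\n\n").encode("utf-8") + content


def walk(top):
    """Yield one record per file under `top` (hypothetical helper, no filtering)."""
    for dirpath, _dirs, files in os.walk(top):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                data = fh.read()
            yield format_record(path, data, os.path.getmtime(path))


if __name__ == "__main__" and len(sys.argv) > 1:
    out = sys.stdout.buffer
    for record in walk(sys.argv[1]):
        out.write(record)
```

Saved under a hypothetical name such as feed.py, it would take the
place of DirTree.pl in the pipeline above:
python3 feed.py <dir> | swish-e -S prog -i stdin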


> > Interesting.  That doesn't really work because then you have two xml
> > files, and I don't think the parser is going to like that.
> As I need to extract metadata into the index and also the content is
> there any specific way I could grab the 'metadata' and the 'content'
> to be placed into the same index?  And reading from other discussions I
> found that the filters (catdoc, pdf2text) do not actually extract the
> metadata from the documents.

Right, that's why we use pdfinfo to grab the title, and there's a
filter that uses wvWare to process Word docs.

For the OO files I'd think you would need to process both xml files
into a new xml file with the content you want.  There are quite a few
OpenOffice:: modules on CPAN, so I'll bet there's a reasonably easy
way to add this ability.
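
As a rough illustration of that merging step (in Python, not using
any of the CPAN modules), the sketch below assumes the OO file is a
zip archive holding meta.xml and content.xml, and that the title sits
in a Dublin Core dc:title element.  The function name merge_oo and
the output element names doc, title and body are invented for the
example, and the text extraction is deliberately crude.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Dublin Core title element, written in ElementTree's {namespace}tag form.
DC_TITLE = "{http://purl.org/dc/elements/1.1/}title"


def merge_oo(data):
    """Build one small XML document from an OO file's meta.xml and content.xml.

    `data` is the raw bytes of the file (assumed to be a zip archive).
    """
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        meta = ET.fromstring(zf.read("meta.xml"))
        content = ET.fromstring(zf.read("content.xml"))

    title_el = meta.find(".//" + DC_TITLE)
    title = title_el.text if title_el is not None and title_el.text else ""

    # Crude extraction: join every non-empty text node, ignoring structure.
    body = " ".join(t.strip() for t in content.itertext() if t.strip())

    doc = ET.Element("doc")  # hypothetical output element names
    ET.SubElement(doc, "title").text = title
    ET.SubElement(doc, "body").text = body
    return ET.tostring(doc, encoding="unicode")
```

The single document it returns could then be handed to swish-e as one
XML file, which avoids giving the parser two xml files for one
source document.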

> Yes, exactly: front end = templates, sorry about that :)  Anyone point
> me in the right direction to get a working template with search.cgi.

perldoc Template or http://tt2.org are the docs.  Do you have a
specific thing you are trying to do or just want general
understanding of how the templates work?

-- 
Bill Moseley
moseley@hank.org

Received on Thu May 26 07:53:06 2005