On Thu, May 26, 2005 at 04:28:32AM -0700, Philip Young wrote:
> > Probably because you don't have a <title>.
> Well, the documents (word, OOo) I indexed did have titles, and I have
> originally tried displaying the results via the 'swish.cgi' template
> and it did display the 'swishtitle' property on the results so I think
> the index does have this value.
Maybe yes, maybe no. The swish-e binary displays the path name when
there isn't a title.
To test, run swish-e with -T indexed_words when indexing to see
whether a "swishtitle" is added to the index, and then search the
index from the command line to see whether the title is reported.
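As a rough sketch of that check, you can print the title and the path
side by side and flag the results where they are the same. The -T, -f,
-w and -x options and the swishtitle/swishdocpath properties are
standard swish-e. The helper name untitled_paths, the config and index
file names, and the query are made up for illustration. It also assumes
the fallback title is exactly the value of swishdocpath.

```shell
#!/bin/sh
# Hypothetical helper: reads "title<TAB>path" lines on stdin and prints
# the paths whose reported title is just the path name, i.e. documents
# that probably have no real title in the index.
untitled_paths() {
    # NF == 2 skips swish-e header lines and the closing "." line,
    # which contain no tab.
    awk -F '\t' 'NF == 2 && $1 == $2 { print $2 }'
}

# Assumed usage (file names and query are placeholders):
#   swish-e -c swish.conf -T indexed_words
#   swish-e -f index.swish-e -w yourword \
#       -x '<swishtitle>\t<swishdocpath>\n' | untitled_paths
```

Anything that helper prints is a document worth looking at more
closely when you go through the -T indexed_words output.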
> > You might have better luck using SWISH::Filter instead of FileFilter
> > -- gives you more control of the input to swish from your various file
> > formats. More likely to get a title (for example, pdf conversion uses
> > pdfinfo to extract out a title, if possible).
> Interesting.. Using the SWISH::Filter would there be a need to
> declare this filter within the swish.conf, search.cgi files? I am
> assuming SWISH::Filter is built into the swish-e installation?
SWISH::Filter is a front end to a bunch of modules that do filtering.
The individual filter modules do use things like catdoc and xpdf, so
those also need to be installed.
SWISH::Filter just provides a way to pass in a file (or content), and
a content-type (or let SWISH::Filter determine it based on file name)
and SWISH::Filter returns the filtered document.
There are a few ways to hook that into swish-e. One is to use the
FileFilter command to run the swish_filter.pl script for each
document. That loads the SWISH::Filter module and passes the
document off for processing. (swish_filter.pl is in
/usr/local/share/doc/swish-e/example on my machine).
FileFilter .pdf /path/to/swish_filter.pl
FileFilter .doc /path/to/swish_filter.pl
FileFilter .mp3 /path/to/swish_filter.pl
The problem with doing it that way is it's slow -- SWISH::Filter is a
collection of modules that would need to be loaded and compiled for
each document. I suppose you could run swish_filter.pl under
SpeedyCGI which keeps the program running in memory.
The more common way to use SWISH::Filter is from a program that
fetches the documents and then passes them to swish-e on stdin.
DirTree.pl and spider.pl are two examples included in the
distribution that do this:
/path/to/DirTree.pl <dir> | swish-e -S prog -i stdin
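To give an idea of what such a program writes to stdout, here is a
minimal sketch of the -S prog input format: a Path-Name header, a
Content-Length header in bytes, a blank line, then the document. The
function name emit_doc and the directory in the usage comment are
made up, and unlike DirTree.pl this does no filtering at all, so it is
only good for files swish-e can already parse.

```shell
#!/bin/sh
# emit_doc FILE: write one document in the header format that
# "swish-e -S prog" reads from its input program.
emit_doc() {
    path=$1
    # Content-Length must be the exact byte count of the body.
    size=$(wc -c < "$path" | tr -d ' ')
    printf 'Path-Name: %s\nContent-Length: %s\n\n' "$path" "$size"
    cat "$path"
}

# Assumed usage (the directory is a placeholder):
#   find /some/dir -type f -name '*.txt' |
#       while read -r f; do emit_doc "$f"; done |
#       swish-e -S prog -i stdin
```

A real program would run each document through SWISH::Filter first and
send the filtered text, with Content-Length set to the filtered size.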
> > Interesting. That doesn't really work because then you have two xml
> > files, and I don't think the parser is going to like that.
> As I need to extract metadata into the index and also the content is
> there any specific way I could grab the 'metadata' and the 'content'
> to be placed into the same index? And reading from other discussions I
> found that the filters (catdoc, pdf2text) do not actually extract the
> metadata from the documents.
Right, that's why we use pdfinfo to grab the title and there's a
filter that uses wvWare to process Word docs.
For the OO files I'd think you would need to process both xml files
into a new xml file with the content you want. There's quite a few
OpenOffice:: modules on CPAN, so I'll bet there's a reasonably easy
way to add this ability.
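Without any of those modules, a crude sketch of the idea is to pull
meta.xml and content.xml out of the document and wrap them in a single
root element so the parser sees one xml file. The function name
merge_oo_xml and the <oodoc> wrapper element are invented here. It
assumes both parts are UTF-8 and that any DOCTYPE sits on one line.

```shell
#!/bin/sh
# merge_oo_xml META CONTENT: print one XML document that contains both
# input files under a hypothetical <oodoc> root element.
merge_oo_xml() {
    printf '<?xml version="1.0" encoding="UTF-8"?>\n<oodoc>\n'
    for part in "$1" "$2"; do
        # Drop each part's XML declaration and single-line DOCTYPE,
        # since neither may appear inside another element.
        sed -e 's/<?xml[^>]*?>//' -e 's/<!DOCTYPE[^>]*>//' "$part"
    done
    printf '</oodoc>\n'
}

# Assumed usage (file names are placeholders):
#   unzip -p doc.sxw meta.xml    > meta.xml
#   unzip -p doc.sxw content.xml > content.xml
#   merge_oo_xml meta.xml content.xml > merged.xml
```

You would still have to tell swish-e which elements in the merged file
to treat as metanames or properties.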
> Yes, exactly: front end = templates, sorry about that :) Anyone point
> me in the right direction to get a working template with search.cgi.
perldoc Template or http://tt2.org are the docs. Do you have a
specific thing you are trying to do or just want general
understanding of how the templates work?
Received on Thu May 26 07:53:06 2005