Re: swish-e 2.4.3 windows 2003 iis success!

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Jun 22 2005 - 21:46:37 GMT

On Wed, Jun 22, 2005 at 05:05:53PM -0400, Revillini, James wrote:
> RTF's are killing it now.  As soon as it runs into one, the output file
> from dirtree.pl goes like this:

By the way, this is all in the docs, but here's a quick executive
summary:

DirTree.pl finds files and then passes each file name to the
SWISH::Filter module.

SWISH::Filter uses MIME::Types to look up the MIME type of the file.
Then all the available SWISH::Filter modules are scanned for a regular
expression that matches the file's MIME type.  When a match is found,
that filter is used, and the filter changes the content type to
something else (like text/plain or text/html).
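
To make the lookup step concrete, here is a rough sketch of that kind
of regex dispatch.  It is not the real SWISH::Filter code: the filter
table, its entries, and find_filter() are made up for illustration,
and the MIME types are hard-coded rather than looked up with
MIME::Types.

```perl
use strict;
use warnings;

# Hypothetical filter table: each entry pairs a regex on the MIME type
# with the content type the filter would produce.  These entries are
# illustrative only, not the real SWISH::Filter internals.
my @filters = (
    { name => 'Doc2txt',  match => qr{^application/msword$}, output => 'text/plain' },
    { name => 'Pdf2HTML', match => qr{^application/pdf$},    output => 'text/html'  },
);

# Return the first filter whose pattern matches the MIME type,
# or undef if no filter claims it.
sub find_filter {
    my ($mime_type) = @_;
    for my $filter (@filters) {
        return $filter if $mime_type =~ $filter->{match};
    }
    return;
}

for my $type ( 'application/msword', 'application/rtf' ) {
    my $filter = find_filter($type);
    if ($filter) {
        print "$type -> $filter->{name} -> $filter->{output}\n";
    }
    else {
        print "$type -> no filter, content type unchanged\n";
    }
}
```

In this made-up table nothing matches application/rtf, so such a file
would keep its original content type and move on to the next check
unfiltered.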

The individual filters normally need helper programs, like catdoc, to
be installed before they will work.  The swish distribution on Windows
includes catdoc, IIRC.

When SWISH::Filter is done, DirTree.pl skips any files that are
"binary", which only means they are not of some text/* type.
Really, it should skip everything except text/xml, text/plain, and
text/html, as that's all swish can index.  After all, there are a lot
of other text types:

    $ fgrep 'text/' /etc/mime.types | wc -l
    62

You might want to add that test into DirTree.pl -- check for only
those three mime types:

    unless ( $doc->content_type =~ m!^text/(?:plain|xml|html)$! ) {
        warn "Can't index $path because it's " . $doc->content_type .  "\n";
        return;
    }
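
If you want to try that pattern outside of DirTree.pl, something like
the following should do.  The is_indexable() helper and the list of
sample types are made up for the example, and the pattern uses m{}
delimiters only to keep the trailing $ easy to read.

```perl
use strict;
use warnings;

# Hypothetical helper: true only for the three content types
# that swish can index.
sub is_indexable {
    my ($content_type) = @_;
    return $content_type =~ m{^text/(?:plain|xml|html)$} ? 1 : 0;
}

# Sample content types, chosen for illustration.
for my $type (qw( text/plain text/html text/xml text/rtf text/css application/pdf )) {
    printf "%-16s %s\n", $type, is_indexable($type) ? 'index' : 'skip';
}
```

Note that types such as text/rtf and text/css would pass a plain
text/* check but are rejected here.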

Anyway, that's how it all works.

-- 
Bill Moseley
moseley@hank.org

Received on Wed Jun 22 14:46:38 2005