Re: New to Swish-e

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Mar 12 2004 - 00:08:17 GMT
On Thu, Mar 11, 2004 at 02:15:49PM -0800, Sean McGilloway wrote:

>  - We have one share (M:\) that contains all our files
>  - Our files are PDF, Word, Excel, and Powerpoint files
>  - Does anyone have a standard config file that will index based on the
> above criteria?

I'm repeating what you likely already know, but swish-e is a C program
that creates an inverted index.  Most people use it for small
collections of files -- maybe < 100,000.  Swish-e doesn't have a fancy
interface -- the idea is you would integrate it into your site in some
way.  There are example programs, but they are also Perl based.  I know
some are using IIS so maybe they can offer more help.
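
For anyone unfamiliar with the term, here is a minimal sketch of what an
inverted index is.  It is written in Python purely as an illustration and is
not how swish-e is implemented; the function names and the sample documents
are made up for the example.

```python
from collections import defaultdict
import re

def build_inverted_index(docs):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index

def search(index, query):
    """Return the documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical sample documents, keyed by file name.
docs = {
    "report.txt": "Quarterly sales report",
    "notes.txt": "Sales meeting notes",
}
index = build_inverted_index(docs)
```

Searching then means looking words up in the index rather than scanning
every file, e.g. search(index, "sales report").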

Swish-e knows how to read files from the file system, or it can use an
external helper to do the work of fetching and filtering (converting)
documents into a format that swish can parse.

When using the file system (or any method, really) swish-e can pass each
document to an external program for filtering, but this tends to be
slow.

When reading documents via an external program the external program can
do the filtering.  For things like Perl this tends to be much faster
because there's not the overhead of loading the external modules.
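
To give a rough idea of what such an external program does, here is a sketch
in Python.  It assumes the program writes each document as a few header lines
(Path-Name, Content-Length and optionally Last-Mtime, as I recall them from
the swish-e documentation for the "prog" input method), then a blank line,
then the content.  Check the header names against the docs for your version;
the function names here are made up.

```python
import os

def format_record(path_name, content, mtime=None):
    """Build one document record: headers, a blank line, then the content bytes."""
    # Content-Length is the number of bytes that follow the blank line.
    headers = [f"Path-Name: {path_name}", f"Content-Length: {len(content)}"]
    if mtime is not None:
        headers.append(f"Last-Mtime: {int(mtime)}")
    return ("\n".join(headers) + "\n\n").encode("utf-8") + content

def emit_tree(root, out):
    """Walk a directory tree and write one record per file to a binary stream."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            with open(full, "rb") as handle:
                data = handle.read()
            # Report the path with forward slashes regardless of platform.
            path_name = full.replace(os.sep, "/")
            out.write(format_record(path_name, data, os.path.getmtime(full)))
```

A real program would convert (filter) the content before writing it, and
would call something like emit_tree("M:/", sys.stdout.buffer) so that swish-e
can read the records from its standard output.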

The swish-e distribution comes with a few Perl programs.  One is
spider.pl that fetches documents from a web server, and another is
called DirTree.pl.  DirTree.pl is similar to swish-e reading directly
from the file system.  Both of those are able to use a Perl module
called SWISH::Filter for converting things like Word into text.
Actually, the SWISH::Filter module doesn't really do the conversion,
that's left to other external programs.  E.g. "catdoc" is used to
convert the Word document into text.  The SWISH::Filter module just
passes the file to catdoc and formats catdoc's output for swish-e to
use.
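
The same division of labor can be sketched in Python (this is only an
illustration of the idea, not the SWISH::Filter code): pick an external
converter by file extension, run it, and hand back whatever text it prints.
The CONVERTERS table and the function names are hypothetical, and it assumes
catdoc is on the PATH and prints the text to standard output.

```python
import os
import subprocess
import sys

def convert_with(command, path):
    """Run an external converter on `path` and return the text it prints."""
    result = subprocess.run(command + [path], capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")

def to_text(path, converters):
    """Pick a converter by file extension; return None if there is none."""
    ext = os.path.splitext(path)[1].lower()
    command = converters.get(ext)
    if command is None:
        return None
    return convert_with(command, path)

# Assumed mapping: catdoc prints a Word document's text to standard output.
CONVERTERS = {".doc": ["catdoc"]}

# Stand-in for trying the sketch without catdoc installed: it just echoes
# the file name it was given instead of converting anything.
ECHO = [sys.executable, "-c", "import sys; print(sys.argv[1])"]
```

With catdoc installed, to_text("memo.doc", CONVERTERS) would return the
document's text; files with no matching converter return None and can be
skipped.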

On unix it's reasonably easy to use these filters because most machines
come with Perl.  OS X does also.  IIRC, for some odd reason Windows doesn't
offer Perl, so you would have to install that from Active State's web
site.  I believe that swish-e will detect Perl when installing and then
also install the Perl modules that come as part of the Swish-e package.

You may also need to install extra Perl modules to parse the Excel.  I'm
not sure about Powerpoint -- I think someone on this list once posted
about parsing Powerpoint.  Active State has the Perl Package Manager --
so to install module "foo" you type something like C:\> ppm install foo.
But not all Perl modules are available from Active State.

You should consider all that work when considering the cost of other
canned solutions.  

> The documentation goes into a lot of detail and the examples try very
> hard to explain in simple terms how the system works, but it's just not
> sinking in (probably because I'm not a perl programmer). I had to tweak
> around for an hour to figure out what files needed to be in which
> directory and whether or not to use / or \ when referencing the
> directories. So, ideally, I'd love to get my hands on a config file that
> looks something like this:

I think you want forward slashes.
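
If you have a list of paths with backslashes and want to rewrite them, a
small Python sketch like this would do it (the function name is made up, and
this only changes the separators, nothing else):

```python
from pathlib import PureWindowsPath

def to_forward_slashes(path):
    """Rewrite a Windows path so that it uses forward slashes."""
    return PureWindowsPath(path).as_posix()
```

So a path such as M:\docs\report.pdf would be written as M:/docs/report.pdf.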

> Can one config file index the entire site (for PDF, Word, Excel,
> Powerpoint, HTML, text, etc)?

Well, yes.  If you use an external program like DirTree.pl to fetch
files and filter them, then you will have a config file for swish-e
(likely very small) and then may also need to tweak DirTree.pl -- namely
listing what files to skip.

I hope someone that uses Windows can offer you some additional advice.

-- 
Bill Moseley
moseley@hank.org
Received on Thu Mar 11 16:08:17 2004