Re: Indexing of word documents, stored on a UNIX

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Aug 20 2001 - 20:21:16 GMT
On Mon, 20 Aug 2001, FISHER,JOSEPH (Non-HP-Roseville,ex1) wrote:

> Now, to address some of my concerns and issues while installing SWISH-E...
> 
> You mentioned filter files, configuration files, etc...
> 
> I feel that swish-e should have standard configuration files in place, and
> each of those configuration files should be specifically named... (If each
> different installer chooses their own naming convention, that causes half of
> the confusion... Especially when a different installer, like myself, has to
> come in and make heads or tails of what someone else has done previously...)

Swish has one config file, config.h, for compile-time settings that
most people will never change.  The configuration file specified with -c
is a user-defined configuration file, and there is no standard config file
for swish.

> PLEASE... Standardize the naming convention of configuration files... Don't
> let the user / installer create their own naming conventions... It's too
> difficult to maintain...

Since we have no idea *what* people will index, there's no way to
define a "standard" config file name.  I tend to call it swish.conf or
swish.cfg, but that's not a very good description.  I imagine people
give their config files names that make sense to them -- something that
describes the data that's being indexed.

> In our case, the person who originally installed swish 1.3 created a
> configuration file called user.config.fs...
> 
> Until I had already spent several hours digging, I did NOT know that this
> was the configuration file I needed to place the catdoc lines in...

You look at the command used to initiate indexing and see what is used
for the config file.  I don't see any other way to do it.  Sounds more
like a problem in your organization than with swish, if I understand you
correctly.  You could make some naming policies at your organization, but
still, I'd want to look at the cron job or whatever that does the actual
indexing to find out exactly what's happening.
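
As a rough illustration of that, here is a small shell sketch that pulls
the -c argument out of a command line such as a crontab entry.  The helper
name config_from_command and the paths in the sample entry are made up for
the example; only the file name user.config.fs comes from this thread.

```shell
# Hypothetical helper: print the word that follows -c in a command line.
config_from_command() {
    printf '%s\n' "$1" |
        sed -n 's/.*[[:space:]]-c[[:space:]]\{1,\}\([^[:space:]]\{1,\}\).*/\1/p'
}

# Sample crontab entry (schedule and paths are invented for illustration).
entry='15 2 * * * /usr/local/bin/swish-e -c /home/search/user.config.fs -f /home/search/index.swish-e'

config_from_command "$entry"   # prints /home/search/user.config.fs
```

On a real system you would feed it the output of "crontab -l" or the
contents of whatever script starts the indexing job.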

> Example:
> 
> 	You have the following "possible" configuration files: configure,
> config.h, swish.h, filter.h

I think you are confused about "configuration" files.  Those are C
header files.  Other than config.h, they are not "configuration" files
and are not intended to be modified by "users".  The swish-e configuration
file that is often mentioned is the one passed with the -c switch during
indexing.

> I found the .../filter-bin/_doc2text.sh script by doing a "find . -exec grep
> -l catdoc {} \;" from the command line...

Sounds like the hard way.  You could have read the README file:

What's included in the SWISH-E distribution?
    Here's an overview of the directories included in the swish-e
    distribution:

    ...
    filter-bin/
       Sample programs to use with swish-e's "filters". Examples include
       PDF, MS Word, and binary strings filters.


> I found various other configuration related entries, using similar find or
> grep commands...
> 
> When I saw the catdoc entry in this script, I was confused as to where the
> entries should go...

There's one document that describes the swish-e configuration settings
that can be placed in the config file specified with the -c switch.  It's
called SWISH-CONFIG and it has a section on using filters with example
directives.

I don't think that section mentions the filter-bin directory.  It should.
Frankly, I haven't looked much at that section, as I either use -S prog
or call a filter program directly (such as catdoc), and I just cut-n-paste
the FileFilter directive directly from the docs.

> Then, when I attempted to put the FileFilter entry in the configuration
> file, I wasn't sure whether I needed to change anything in the syntax of the
> entry or not... You could say something like: "Place the following line in
> your configuration file: FileFilter .doc /usr/local/bin/catdoc "-s8859-1
> -d8859-1 '%p'""

There's a section titled "Examples for filters:" (probably should be
"Examples of filters:") that shows specific examples.  I agree that the
*concept* of the filter can be tricky at first, and that the docs could be
written better, but they are not lacking the information you say is missing.
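
For what it's worth, a minimal sketch of a config file for the Word
document case might look like the one written out below.  The FileFilter
line is the one quoted above; the file name word-docs.conf, the index name,
and the document path are placeholders, so adjust them (and the catdoc
path) for your system and check the directives against SWISH-CONFIG.

```shell
# Write a small config file.  Everything except the FileFilter line is a
# placeholder chosen for this example.
cat > word-docs.conf <<'EOF'
# Index only Word documents under this directory (placeholder path).
IndexDir /path/to/documents
IndexOnly .doc
IndexFile word-docs.index

# Convert .doc files to text with catdoc before indexing.
FileFilter .doc /usr/local/bin/catdoc "-s8859-1 -d8859-1 '%p'"
EOF

# The config file is then passed to swish-e with the -c switch.
echo "To index: swish-e -c word-docs.conf"
```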

I also agree that if you just end up in the filter-bin directory instead
of getting there from the documentation (as examples of filters), it
may be hard to understand.

Also, for performance reasons, the best use of filters is calling the
filter program (e.g. catdoc, pdftotext) directly from swish instead of
calling a shell or perl program that then calls the filter program.  If you
must use a shell or perl script you will be better off using -S prog, but
-S prog didn't exist when the filter script examples were written.

Filter scripts are probably easier to write than -S prog programs, so
which one to use depends on your individual situation.  This complicates
things, but that is what flexibility brings.
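
To give a feel for the -S prog side, here is a rough shell sketch of such a
program: for each file it prints a short header block, a blank line, and
then the converted text.  The header names (Path-Name, Content-Length) are
my reading of the -S prog description in SWISH-RUN, so check that document
for the exact format.  The function name emit_doc and the CONVERTER
variable are made up for this example.

```shell
#!/bin/sh
# Hypothetical variable: the program that turns a document into text.
CONVERTER=${CONVERTER:-catdoc}

# Print one document: headers, a blank line, then the converted content.
emit_doc() {
    tmp=$(mktemp) || return 1
    # Convert first, so the byte count of the output is known.
    "$CONVERTER" "$1" > "$tmp" || { rm -f "$tmp"; return 1; }
    size=$(wc -c < "$tmp" | tr -d ' ')
    printf 'Path-Name: %s\n' "$1"
    printf 'Content-Length: %s\n' "$size"
    printf '\n'
    cat "$tmp"
    rm -f "$tmp"
}

# Process every file named on the command line.
for file in "$@"; do
    emit_doc "$file"
done
```

How swish-e is told to run the program is covered in SWISH-RUN.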


> Or... Better yet... You should probably place actual entries for the filter
> files, inside of your "future", "standardized" configuration file...
> Commented out, of course, with a note, saying that the actual executables
> need to be installed before uncommenting out the lines...

The swish-e README file also says:

    conf/
       Example swish-e configuration setups to help you get started. In the
       `stopwords' sub-directory are a number of stopword files for
       different languages.

And in the conf/README file it says:

    example8.conf   - using "filters" to convert PDF files.

And that config example shows how to use filters (ok, the README file
calls them .conf and they are .config in the directory).

Perhaps we need another example that shows a simpler use of a filter, such
as your need to filter Word docs with catdoc.

Anyway, the specific config directive you needed was in the documentation,
and the conf/ directory has a specific example of how to use a filter.
The README in the filter-bin directory should be more descriptive, too.

If there are any technical writers (or just plain good editors) who want
to contribute, that would be a big help.  I don't have much time to spend
editing the documentation (or even my email messages).


The Swish distribution used to include a single config file with
every config option listed.  Perhaps it's just my opinion, but what a mess.
People would use that config file and not know what all the directives
did.  Then when someone posted for help they would post this huge config
file, so it was hard to know what exactly the problem was.  So I replaced
that huge config file with the examples in the conf/ directory, with the
idea that they start out simple and get more complicated or show different
features a few at a time -- kind of a swish-e tutorial.  I also hoped
that people would only use the directives needed for their own situation,
making their config files easier to read when posted on the list.  Most
config files only need to be about five lines long, and sometimes a config
file is not even needed.
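
As a sketch of what "about five lines" can mean, the snippet below writes a
three-directive config for a directory of HTML files.  The file name
minimal.conf, the index name, and the path are placeholders invented for
this example; see SWISH-CONFIG for what each directive does.

```shell
# Write a minimal config file (all names and paths are placeholders).
cat > minimal.conf <<'EOF'
IndexDir /path/to/htdocs
IndexOnly .html .htm
IndexFile site.index
EOF

# Index with the config file:
#   swish-e -c minimal.conf
# Or, for a simple case, skip the config file and name the directory
# on the command line:
#   swish-e -i /path/to/htdocs
grep -c '^Index' minimal.conf   # prints 3
```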

The idea for the documentation is that you start with the README file,
which describes swish, the included documentation, what's in the various
directories, and where to go next (the INSTALL doc).

The INSTALL doc is supposed to get swish installed, and shows a simple
indexing setup and searching.  It also describes where to learn more about
running (SWISH-RUN) and configuration (SWISH-CONFIG), and where to get
help (the swish-e list).

Finally, there's also the SWISH-FAQ.  It's supposed to answer common
questions, and more specific questions like:

   Can I index my PDF, Word, and compressed documents?

Someone first starting out should probably read README, INSTALL, and the
SWISH-FAQ to get a good overview of swish.

I think that's a reasonable approach.


-- 
Bill Moseley moseley@hank.org
Received on Mon Aug 20 20:21:58 2001