On Sat, Jun 10, 2006 at 08:07:12PM -0700, Linda W. (that's swishey, not squishey!) wrote:
> I thought NoContents meant, don't look at the contents of files with
> these extensions, but do index the filenames. Guess not.
From the docs:
If the file's type is HTML or HTML2 (as set by IndexContents or
DefaultContents) then the file will be parsed for a HTML title and
that title will be indexed.
Swish has to look at the contents if it's going to find the title.
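To illustrate why the title cannot be found without reading the file, here is a rough Python sketch of pulling a title out of an HTML document. It is not how swish-e does it internally; the names TitleParser and extract_title are made up for this example, and only the standard library is assumed.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # True while we are between <title> and </title>
        self.done = False       # True once the first title has been closed
        self.parts = []         # text chunks seen inside the title

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        # The parser may deliver text in several chunks, so collect them all.
        if self.in_title:
            self.parts.append(data)


def extract_title(html_text):
    """Return the whitespace-normalized title, or None if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    title = " ".join("".join(parser.parts).split())
    return title or None
```

The point is only that the document body has to be fed to a parser before any title is available, so a "do not look at the contents" setting and title indexing cannot both apply to the same file.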
Swish-e's original start in life about ten years ago was to index a
few web pages on unix machines. So by default it would index all
files in a directory assuming it was all html. Ten years ago that
was probably a reasonable method.
Now, most people spider their sites, and the spider can look at the
Content-Type header to determine what to index. The spider that comes
with swish uses the set of perl modules called SWISH::Filter that can
take a file and try to figure out its MIME type (if not already
known) and then determine if it can be filtered to text for indexing.
The individual filters for SWISH::Filter are separate perl modules
(e.g. SWISH::Filter::Pdf2HTML.pm) that sometimes use external
programs (e.g. pdftotext and pdfinfo) to convert the file into
What gets filtered depends on what you might have installed. IIRC,
xpdf and catdoc are included in the windows build, where building from
source you have to install those separately. So, if you use the
spider you will likely not have all these problems.
The DirTree.pl program that's included with the distribution makes
use of SWISH::Filter. It simply scans the file system (like the
default mode of swish), but it will filter based on mime type just
like spidering. So, that may be much easier if you want to scan the
file system instead of spidering a web site.
perldoc DirTree.pl for some details, but it's not a very complex
If you want the details of SWISH::Filter see:
The INSTALL doc has examples of indexing, and one is spidering.
Might save yourself a lot of time if you follow those instructions.
My only comment is *I* probably would not use the swish.cgi script.
It's a bit bloated with features. I think it's easier to just write
a simple search script -- maybe use the search.cgi script for ideas.
Received on Sun Jun 11 07:31:35 2006