Re: Indexing takes forever

From: Nick <newsgroups(at)not-real.2thebatcave.com>
Date: Fri May 06 2005 - 20:50:00 GMT
swish-e -c /etc/swish.conf -S prog -i DirTree.pl
I tried that but I got this:

Indexing Data Source: "External-Program"
Indexing "DirTree.pl"
External Program found: /usr/lib/swish-e/DirTree.pl
Must supply at least one directory
Usage:
    DirTree.pl [options] directory <directory...> | swish-e -S prog -i stdin

      Options:
        -verbose        Display processing info
        -debug          Enable debugging (including SWISH::Filter debugging)
        -man            Display documentation
        -path           Display location lib path set at installation
        -no_skip        Process documents even if filtering fails
        -symlinks       Follow symbolic links.  Default is to NOT follow symlinks

Removing very common words...
no words removed.
Writing main index...
err: No unique words indexed!
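
Going by the usage message above, DirTree.pl seems to want the directory on its own command line, with its output piped into swish-e reading from stdin. A minimal sketch, assuming the DirTree.pl path reported in the output above and a hypothetical /path/to/docs directory to index:

```shell
# Assumed locations; adjust for your install.
DIRTREE=/usr/lib/swish-e/DirTree.pl   # path reported by swish-e in the output above
DOCS=/path/to/docs                    # hypothetical directory to index

# DirTree.pl walks the directory and writes documents to stdout;
# swish-e reads them from stdin (-S prog -i stdin, per the usage message).
if [ -x "$DIRTREE" ] && command -v swish-e >/dev/null 2>&1; then
    "$DIRTREE" "$DOCS" | swish-e -c /etc/swish.conf -S prog -i stdin
else
    echo "DirTree.pl or swish-e not found; check the paths above" >&2
fi
```

Alternatively, swish-e has a SwishProgParameters config directive for passing arguments to the -S prog program, so a line such as "SwishProgParameters /path/to/docs" in /etc/swish.conf should let the original "-i DirTree.pl" form pick up the directory.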


Is there any performance reason to use SWISH::Filter, or is it just
supposed to be easier?  To me, doing something like this in the config file
makes more sense, since I can see what it is doing when I tell it about
each type of file:

IndexContents TXT* .txt
IndexContents HTML* .htm
IndexContents HTML* .html

FileFilter .pdf pdftotext "'%p' -"
IndexContents TXT* .pdf

FileFilter .doc catdoc
IndexContents TXT* .doc

FileFilter .ppt ppthtml
IndexContents TXT* .ppt


But of course I have something wrong in there, since I am getting lots of
errors from catdoc.  I also don't know how to put the Excel filter in
there, since I think it is a Perl script.
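
One thing that might be worth trying for the catdoc errors is to give catdoc an explicit option string, the same way the pdftotext line does, so that it receives only the file path. Whether the missing option string is actually the cause is an assumption and should be checked against the FileFilter section of the swish-e config docs. For Excel, the xls2csv tool that ships with the catdoc package is one possible non-Perl filter; that choice is also an assumption, not something from this thread:

```
# Sketch only: explicit option strings so each filter gets just the file path.
# '%p' is the path swish-e hands to the filter (as in the pdftotext line above).
FileFilter .doc catdoc "'%p'"
IndexContents TXT* .doc

# xls2csv comes with the catdoc package; using it here is an assumption.
FileFilter .xls xls2csv "'%p'"
IndexContents TXT* .xls
```

Running each filter by hand on a sample file first (for example "catdoc sample.doc") is a quick way to see its errors outside of swish-e.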

>
>
> Nick scribbled on 5/6/05 3:28 PM:
>> As far as docs go, I would like to see a few different sample swish.conf
>> files (and possibly related command line options like you showed below)
>> for different applications.  Generally when I am setting up something I
>> like to see example setups/configs and play around with it before trying
>> to fine-tune it.  If there were more example configs then a user could
>> just pick one that is close to what they are looking for to get it
>> going, then work from that.
>>
>
>
> there should be example config docs installed by default in
> swish_prefix/share/doc/swish-e/examples/conf/
>
> check /usr/local/share/doc/swish-e/examples/conf/ if you installed in
> default location.
>
>
>
>> On the same note what should I put in the config file if I use the:
>>
>> swish-e -c /etc/swish.conf -S prog -i DirTree.pl
>>
>
>
> that command should work with your existing config file (I think).
> DirTree.pl will try and load SWISH::Filter for file formats it recognizes.
>
>>
>> I am guessing that it was just using the default html filter to find
>> text in the doc and ppt files that I searched then?  I know that it could
>> find text in these binary files using my existing config, that is why I
>> thought it was somehow finding the extra progs I had installed to filter
>> the file types.
>
> yes, I have been misled that way too. swish-e does its best to get
> whatever text it finds, and since word .doc (especially) files have real
> text mixed in with all the proprietary formatting instructions, swish-e
> probably got lots of chunks of text. but a proper filter will ensure you
> get all of it, as the author intended it.
>
> --
> Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
>
Received on Fri May 6 13:50:01 2005