Re: FW: Re: More Trouble with Filters

From: <moseley(at)not-real.hank.org>
Date: Tue Jul 29 2003 - 14:48:15 GMT

On Tue, Jul 29, 2003 at 06:18:34AM -0700, Klingensmith, Rick wrote:

> I'm probably beginning to sound like a flake, but I've got myself very
> confused at this point. I've used the following config file and added a bare
> use lib line to the swishspider file:

> SpiderDirectory C:/Swish-E
> 
> # Use the file filter to index pdf files
> #FileFilter .pdf c:/SWISH-E/filter-bin/_pdf2html.pl '"%p" -'
> #FileFilter .pdf c:/SWISH-E/filter-bin/pdftotext.exe '"%p" -'
> 
> # Filter Directory
> FilterDir C:/SWISH-E/filter-bin

Hi Rick,

I haven't looked at the filter code in a while.  FilterDir is prepended
to the program specified with FileFilter when the program doesn't start
with a "/".  That doesn't work very well on Windows (or anywhere
really).  I'd suspect in your case it's trying to run a program called:

   C:/SWISH-E/filter-bin/c:/SWISH-E/filter-bin/pdftotext.exe

(although you have those filters commented out)
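
Here is a rough Perl sketch of the rule described above, just to show why the doubled path comes out. This is not how swish-e itself is written, and the sub name filter_command is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper mimicking the rule described above: FilterDir is
# prepended unless the program name starts with "/".
sub filter_command {
    my ($filter_dir, $program) = @_;
    return $program if $program =~ m{^/};   # treated as absolute, left alone
    return "$filter_dir/$program";          # everything else gets the prefix
}

# A Windows path with a drive letter does not start with "/", so it is
# prefixed even though it is already a full path.
print filter_command('C:/SWISH-E/filter-bin',
                     'c:/SWISH-E/filter-bin/pdftotext.exe'), "\n";
```

If that is what is happening, then with FilterDir set a bare program name (just pdftotext.exe) in FileFilter would avoid the doubled path under this rule.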

> Swishspider is in my SWISH-e directory. With this configuration the pdf
> files indexed correctly, but I'm still getting the same output on the meta
> tags as below in my previous post.

You mean malformed meta tags like you posted?  I guess I need to see if
I can get my Windows machine to boot and try a few things.  Can you make
a test PDF file available online?  You can send it directly to me if you
don't want it public -- although it would be helpful for me to fetch it
from the same URL you are using.

So, provide me with

 - a URL to fetch a test PDF file
 - the version of swish-e you are using (perhaps a link to the specific
   version you installed)
 - answer if your swishspider has the lines shown below

And I'll try it on my Windows machine (if it will boot).


> I thought I was using the SWISH::Filter by default, but now I'm not sure.
> When I use the FileFilter directive in the config file I get the errors that
> pdf is invalid. Once I commented both lines out at least it indexed the pdf
> without error. The FilterDir directive doesn't seem to matter I get the same
> output with or without it. I did confirm that the document is being indexed
> with a search for words that only appear in the pdf with the correct
> results. 
> 
> My perl/site/lib/swish subdirectory contains filter.pm and
> perl/site/lib/swish/filters contain the other filter modules. I'm convinced
> this is a simple configuration issue, but my perl knowledge is limited so
> debugging has been a problem. 

I'm not exactly clear what version you are running.  Does your
swishspider have this at the top?:

# Should SWISH::Filter be use for filtering?  This can be left 1 all the time, but
# will add a little time to processing since.

use constant USE_FILTERS  => 1;  # 1 = yes use SWISH::Filter for filtering, 0 = no. (faster processing if not set)
use constant FILTER_TEXT  => 0;  # set to one to filter text/* content, 0 will save processing time
use constant DEBUG_FILTER => 0;  # set to one to report errors on loading SWISH::Filter module.

Many things regarding installation have changed for the 2.4.0 version -- 
namely, most things get installed in places where you don't need to 
specify paths or set Perl library locations.  That will make things 
much easier in the future.

If your swishspider has those lines above then it's designed to work 
with the SWISH::Filter modules.  By default (and this is something to 
possibly change), swishspider doesn't know where SWISH::Filter is 
installed.  That's on purpose because I didn't want swishspider using 
those filters by default.  Why?  Because the way swish-e works with 
-S http is that it calls swishspider for every URL fetched and that's 
slow (due to the compiling of the swishspider Perl script).  Making it 
load all the SWISH::Filter modules would be a lot more work for every 
request.  Using -S prog (and spider.pl) avoids all that.

So, to have swishspider use SWISH::Filter you either have to set a 
PERL5LIB environment variable or add a "use lib" line to the top of 
swishspider.  Both do the same by adding paths to Perl's @INC array so 
Perl can find the modules.
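
For example, the "use lib" approach might look like this near the top of swishspider, before SWISH::Filter gets loaded. The path is only a placeholder, not taken from any real install; it needs to be the directory that contains the SWISH directory (the one holding Filter.pm):

```perl
# Placeholder path: replace with the directory that holds SWISH/Filter.pm
use lib 'C:/Perl/site/lib';
```

Setting PERL5LIB to that same directory before running swish-e should have the same effect without editing the script.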

So if you have the above lines in swishspider then you can set 
DEBUG_FILTER => 1 and it will tell you if swishspider was able to load 
the SWISH::Filter module (and SWISH::Filters::* filter modules).

Then, what you want to do is run swishspider without running swish-e:

   perl swishspider prefix http://localhost/test.pdf

Then you should have in the current directory files called 
prefix.contents and prefix.response (contains the HTTP response code), 
and maybe a prefix.links (if the file is HTML and has links to follow).

That will tell you whether the SWISH::Filter module is being used (well, 
really, with DEBUG_FILTER set it will tell you when it's not being used).

Then you can look at prefix.contents and see the output that is being 
created.  If you then see the messed up meta tags then it's a problem 
with the way the SWISH::Filter is working under Windows.

prefix.response should have text/html in it if the file was filtered, 
otherwise it will have application/pdf.
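
If you'd rather not eyeball the file, something like this small Perl script could do the check. It is only a sketch: the sub name response_status is made up, and it assumes nothing about the layout of prefix.response beyond the content type appearing somewhere in it:

```perl
use strict;
use warnings;

# Hypothetical helper: classify the text of prefix.response by the
# content type it mentions. The exact file layout is not assumed.
sub response_status {
    my ($text) = @_;
    return 'filtered'     if $text =~ m{text/html}i;
    return 'not filtered' if $text =~ m{application/pdf}i;
    return 'unknown';
}

my $file = 'prefix.response';
if (open my $fh, '<', $file) {
    local $/;                        # read the whole file at once
    my $text = <$fh>;
    close $fh;
    $text = '' unless defined $text;
    print "$file: ", response_status($text), "\n";
}
else {
    print "Cannot open $file: $!\n";
}
```

Run it from the same directory where you ran swishspider, so it can find prefix.response.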


-- 
Bill Moseley
moseley@hank.org
Received on Tue Jul 29 14:48:36 2003