FW: Re: FW: Re: More Trouble with Filters

From: Klingensmith, Rick <klingensmith(at)not-real.hr.msu.edu>
Date: Tue Jul 29 2003 - 16:22:42 GMT
See below for answers to your questions. I'm beginning to wonder at this
point whether I should abandon this approach and try -S prog instead. The
only problem is that I was having fun trying to work out a configuration
file to crawl my site. I also thought -S prog used filters too and would
have the same problem.
 
Rick
klingensmith@hr.msu.edu


> -----Original Message-----
> From: moseley@hank.org [mailto:moseley@hank.org]
> Sent: Tuesday, July 29, 2003 10:48 AM
> To: Multiple recipients of list
> Subject: [SWISH-E] Re: FW: Re: More Trouble with Filters
> 
> On Tue, Jul 29, 2003 at 06:18:34AM -0700, Klingensmith, Rick wrote:
> 
> > I'm probably beginning to sound like a flake, but I've got myself very
> > confused at this point. I've used the following config file and added a
> > bare use lib line to the swishspider file:
> 
> > SpiderDirectory C:/Swish-E
> >
> > # Use the file filter to index pdf files
> > #FileFilter .pdf c:/SWISH-E/filter-bin/_pdf2html.pl '"%p" -'
> > #FileFilter .pdf c:/SWISH-E/filter-bin/pdftotext.exe '"%p" -'
> >
> > # Filter Directory
> > FilterDir C:/SWISH-E/filter-bin
> 
> Hi Rick,
> 
> I haven't looked at the filter code in a while.  FilterDir is prepended
> to the program specified with FileFilter when the program doesn't start
> with a "/".  That doesn't work very well on Windows (or anywhere
> really).  I'd suspect in your case it's trying to run a program called:
> 
>    C:/SWISH-E/filter-bin/c:/SWISH-E/filter-bin/pdftotext.exe
> 
> (although you have those filters commented out)
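
To make the prepending rule described above concrete, here is a minimal Perl
sketch. The function name resolve_filter_program is hypothetical and this is
not the swish-e source; it only assumes the rule as stated (FilterDir is
prepended unless the program starts with "/").

```perl
use strict;
use warnings;

# Hypothetical helper mimicking the rule described in the message:
# prepend FilterDir unless the program name starts with "/".
# This is an illustration, not the actual swish-e code.
sub resolve_filter_program {
    my ($filter_dir, $program) = @_;
    return $program if $program =~ m{^/};
    return "$filter_dir/$program";
}

# A Windows absolute path starts with a drive letter, not "/",
# so it is treated as relative and gets the prefix anyway.
print resolve_filter_program('C:/SWISH-E/filter-bin',
                             'c:/SWISH-E/filter-bin/pdftotext.exe'), "\n";

# A bare program name resolves as intended.
print resolve_filter_program('C:/SWISH-E/filter-bin', 'pdftotext.exe'), "\n";
```

Under that assumption, giving only the bare program name in FileFilter
whenever FilterDir is set would avoid the doubled path.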
> 
> > Swishspider is in my SWISH-e directory. With this configuration the pdf
> > files indexed correctly, but I'm still getting the same output on the
> > meta tags as below in my previous post.
> 
> You mean malformed meta tags like you posted?  I guess I need to see if
> I can get my windows machine to boot and try a few things.  Can you make
> available online a test PDF file?  You can send it directly to me if you
> don't want it public -- although it would be helpful for me to fetch it
> from the same URL you are using.
> 
> So, provide me with
> 
>  - a URL to fetch a test PDF file

35.8.31.67/affidavit.pdf should get you to the pdf in question. There is
another pdf named terminationchecklist.pdf in the directory too which causes
the same problems. I do have a firewall running on my pc so if you have
problems let me know your ip and I'll allow it in.

>  - the version of swish-e you are using (perhaps a link to the specific
>    version you installed)

When I run swish-e -h I'm getting version 2.2.3 and have been running with
the DEBUG_FILTER set to 1. I just downloaded the latest version about 3
weeks ago and assumed I had the latest. I used the swish-e-2.2.3-win32exe
link to download the version I have.

>  - answer if your swishspider has the lines shown below
> 

My swishspider does contain the lines below. Here are the top lines from my
swishspider:

print STDERR "spider $$ [@ARGV]\n";

#
# SWISH-E http method Spider
# $Id: swishspider,v 1.9 2002/09/09 07:15:19 whmoseley Exp $ 
#

# Should SWISH::Filter be use for filtering?  This can be left 1 all the time, but
# will add a little time to processing since.

use constant USE_FILTERS  => 1;  # 1 = yes use SWISH::Filter for filtering, 0 = no. (faster processing if not set)
use constant FILTER_TEXT  => 0;  # set to one to filter text/* content, 0 will save processing time
use constant DEBUG_FILTER => 1;  # set to one to report errors on loading SWISH::Filter module.

use LWP::UserAgent;
use HTTP::Status;
use HTML::Parser 3.00;
use HTML::LinkExtor;


> And I'll try it on my Windows machine (if it will boot).
> 
> 
> > I thought I was using the SWISH::Filter by default, but now I'm not
> > sure. When I use the FileFilter directive in the config file I get the
> > errors that pdf is invalid. Once I commented both lines out at least it
> > indexed the pdf without error. The FilterDir directive doesn't seem to
> > matter; I get the same output with or without it. I did confirm that
> > the document is being indexed with a search for words that only appear
> > in the pdf with the correct results.
> >
> > My perl/site/lib/swish subdirectory contains filter.pm and
> > perl/site/lib/swish/filters contains the other filter modules. I'm
> > convinced this is a simple configuration issue, but my perl knowledge
> > is limited so debugging has been a problem.
> 
> I'm not exactly clear what version you are running.  Does your
> swishspider have this at the top?:
> 
> # Should SWISH::Filter be use for filtering?  This can be left 1 all the time, but
> # will add a little time to processing since.
> 
> use constant USE_FILTERS  => 1;  # 1 = yes use SWISH::Filter for filtering, 0 = no. (faster processing if not set)
> use constant FILTER_TEXT  => 0;  # set to one to filter text/* content, 0 will save processing time
> use constant DEBUG_FILTER => 0;  # set to one to report errors on loading SWISH::Filter module.
> 
> Many things regarding installation have changed for the 2.4.0 version --
> namely most things get installed in places so that you don't need to
> specify paths and set Perl libraries locations.  That will make things
> much easier in the future.
> 
> If your swishspider has those lines above then it's designed to work
> with the SWISH::Filter modules.  By default (and this is something to
> possibly change), swishspider doesn't know where SWISH::Filter is
> installed.  That's on purpose because I didn't want swishspider using
> those filters by default.  Why?  Because the way swish-e works with
> -S http is that it calls swishspider for every URL fetched and that's
> slow (due to the compiling of the swishspider Perl script).  Making it
> load all the SWISH::Filter modules would be a lot more work for every
> request.  Using -S prog (and spider.pl) avoids all that.
> 
> So, to have swishspider use SWISH::Filter you either have to set a
> PERL5LIB environment variable or add a "use lib" line to the top of
> swishspider.  Both do the same by adding paths to Perl's @INC array so
> Perl can find the modules.
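
As a rough illustration of the "use lib" route described above, the sketch
below adds a directory to @INC and then reports whether SWISH::Filter can be
loaded. The path C:/Perl/site/lib and the helper name can_load are
assumptions for illustration only; substitute whichever directory actually
contains SWISH/Filter.pm.

```perl
use strict;
use warnings;

# Hypothetical location: replace with the directory that holds SWISH/Filter.pm.
use lib 'C:/Perl/site/lib';

# Returns 1 if the named module can be found through @INC and loaded, else 0.
sub can_load {
    my ($module) = @_;
    # Accept only plain module names before handing the string to eval.
    return 0 unless $module =~ /\A\w+(?:::\w+)*\z/;
    my $ok = eval "require $module; 1";
    return $ok ? 1 : 0;
}

print "SWISH::Filter ",
      (can_load('SWISH::Filter') ? "loaded" : "not found via \@INC"), "\n";

# Show the search path so a missing directory is easy to spot.
print "Searching: $_\n" for @INC;
```

Setting PERL5LIB to the same directory before running the script should have
the same effect on @INC without editing swishspider.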
> 
> So if you have the above lines in swishspider then you can set
> DEBUG_FILTER => 1 and it will tell you if swishspider was able to load
> the SWISH::Filter module (and SWISH::Filters::* filter modules).
> 
> Then, what you want to do is run swishspider without running swish-e:
> 
>    perl swishspider prefix http://localhost/test.pdf
> 
> Then you should have in the current directory files called
> prefix.contents and prefix.response (contains the HTTP response code),
> and maybe a prefix.links (if the file is HTML and has links to follow).
> 
> That will tell you if the SWISH::Filter module is being used (well,
> really it will tell you if it's not being used if DEBUG_FILTER is set).
> 
> Then you can look at prefix.contents and see the output that is being
> created.  If you then see the messed up meta tags then it's a problem
> with the way the SWISH::Filter is working under Windows.
> 
> prefix.response should have text/html in it if the file was filtered,
> otherwise it will have application/pdf.
> 

I ran swishspider this way and the contents file contained the page with the
malformed meta tags and the response file contained the text/html line. The
only thing I could see wrong with the files is the meta tags.
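
The check on the response file described above could be scripted roughly as
follows. This is only a sketch: the helper name classify_response is
hypothetical, and it assumes nothing about the layout of prefix.response
beyond the content type string appearing somewhere in it.

```perl
use strict;
use warnings;

# Hypothetical helper: classify the text of a prefix.response file using
# only the two content types named in the message.
sub classify_response {
    my ($text) = @_;
    return 'filtered'   if $text =~ m{text/html}i;
    return 'unfiltered' if $text =~ m{application/pdf}i;
    return 'unknown';
}

# Default to prefix.response in the current directory.
my $file = @ARGV ? $ARGV[0] : 'prefix.response';

if (open my $fh, '<', $file) {
    local $/;                        # slurp the whole file at once
    my $text = <$fh>;
    $text = '' unless defined $text;
    close $fh;
    print "$file: ", classify_response($text), "\n";
}
else {
    warn "Cannot open $file: $!\n";
}
```

Run it in the directory where swishspider wrote its output, optionally
passing the response file name as the first argument.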

> 
> --
> Bill Moseley
> moseley@hank.org
Received on Tue Jul 29 16:22:54 2003