On Tue, Jul 29, 2003 at 06:18:34AM -0700, Klingensmith, Rick wrote:
> I'm probably beginning to sound like a flake, but I've got myself very
> confused at this point. I've used the following config file and added a bare
> use lib line to the swishspider file:
> SpiderDirectory C:/Swish-E
> # Use the file filter to index pdf files
> #FileFilter .pdf c:/SWISH-E/filter-bin/_pdf2html.pl '"%p" -'
> #FileFilter .pdf c:/SWISH-E/filter-bin/pdftotext.exe '"%p" -'
> # Filter Directory
> FilterDir C:/SWISH-E/filter-bin
I haven't looked at the filter code in a while. FilterDir is prepended
to the program specified with FileFilter when the program doesn't start
with a "/". That doesn't work very well on Windows (or anywhere
really). I'd suspect in your case it's trying to run a program called:
(although you have those filters commented out)
> Swishspider is in my SWISH-e directory. With this configuration the pdf
> files indexed correctly, but I'm still getting the same output on the meta
> tags as below in my previous post.
You mean malformed meta tags like you posted? I guess I need to see if
I can get my Windows machine to boot and try a few things. Can you make
a test PDF file available online? You can send it directly to me if you
don't want it public -- although it would be helpful for me to fetch it
from the same URL you are using.
So, provide me with
- a URL to fetch a test PDF file
- the version of swish-e you are using (perhaps a link to the specific
version you installed)
- answer if your swishspider has the lines shown below
And I'll try it on my Windows machine (if it will boot).
> I thought I was using the SWISH::Filter by default, but now I'm not sure.
> When I use the FileFilter directive in the config file I get the errors that
> pdf is invalid. Once I commented both lines out at least it indexed the pdf
> without error. The FilterDir directive doesn't seem to matter I get the same
> output with or without it. I did confirm that the document is being indexed
> with a search for words that only appear in the pdf with the correct
> My perl/site/lib/swish subdirectory contains filter.pm and
> perl/site/lib/swish/filters contain the other filter modules. I'm convinced
> this is a simple configuration issue, but my perl knowledge is limited so
> debugging has been a problem.
I'm not exactly clear what version you are running. Does your
swishspider have this at the top?:
# Should SWISH::Filter be use for filtering? This can be left 1 all the time, but
# will add a little time to processing since.
use constant USE_FILTERS => 1; # 1 = yes use SWISH::Filter for filtering, 0 = no. (faster processing if not set)
use constant FILTER_TEXT => 0; # set to one to filter text/* content, 0 will save processing time
use constant DEBUG_FILTER => 0; # set to one to report errors on loading SWISH::Filter module.
Many things regarding installation have changed for the 2.4.0 version --
namely most things get installed in places so that you don't need to
specify paths or set Perl library locations. That will make things
much easier in the future.
If your swishspider has those lines above then it's designed to work
with the SWISH::Filter modules. By default (and this is something to
possibly change), swishspider doesn't know where SWISH::Filter is
installed. That's on purpose because I didn't want swishspider using
those filters by default. Why? Because the way swish-e works with
-S http is that it calls swishspider for every URL fetched and that's
slow (due to the compiling of the swishspider Perl script). Making it
load all the SWISH::Filter modules would be a lot more work for every
request. Using -S prog (and spider.pl) avoids all that.
So, to have swishspider use SWISH::Filter you either have to set a
PERL5LIB environment variable or add a "use lib" line to the top of
swishspider. Both do the same thing: they add paths to Perl's @INC
array so Perl can find the modules.
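
As a rough sketch of the PERL5LIB route (the directory below is only a
placeholder, not a real install location; it stands for whichever
directory holds the SWISH/Filter.pm file on your machine), something
like this in a POSIX shell should work:

```shell
# Placeholder path (an assumption): the directory that CONTAINS the
# SWISH/ subdirectory holding Filter.pm. Adjust to your install.
SWISH_LIB="${SWISH_LIB:-/path/to/perl/lib}"

# Prepend it to PERL5LIB; perl adds every PERL5LIB entry to @INC.
PERL5LIB="$SWISH_LIB${PERL5LIB:+:$PERL5LIB}"
export PERL5LIB

# Report whether perl can now load the module.
check_swish_filter() {
    if perl -MSWISH::Filter -e 1 2>/dev/null; then
        echo "SWISH::Filter found"
    else
        echo "SWISH::Filter NOT found"
    fi
}
```

Calling check_swish_filter afterwards tells you whether the path was
right. In cmd.exe on Windows the equivalent is a "set PERL5LIB=..."
line, and multiple entries are separated with ";" instead of ":".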
So if you have the above lines in swishspider then you can set
DEBUG_FILTER => 1 and it will tell you if swishspider was able to load
the SWISH::Filter module (and SWISH::Filters::* filter modules).
Then, what you want to do is run swishspider without running swish-e:
    perl swishspider prefix http://localhost/test.pdf
Then you should have in the current directory files called
prefix.contents and prefix.response (contains the HTTP response code),
and maybe a prefix.links (if the file is HTML and has links to follow).
That will tell you if the SWISH::Filter module is being used (well,
really it will tell you if it's not being used if DEBUG_FILTER is set).
Then you can look at prefix.contents and see the output that is being
created. If you then see the messed up meta tags then it's a problem
with the way the SWISH::Filter is working under Windows.
prefix.response should have text/html in it if the file was filtered,
otherwise it will have application/pdf.
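
If you want to script that check, here is a minimal sketch in POSIX
shell. The function name check_filtered is made up for illustration,
and it assumes only what is described above: that the content type
appears somewhere in the .response file.

```shell
# Hypothetical helper: report whether swishspider filtered the document,
# based on the content type recorded in <prefix>.response.
check_filtered() {
    response_file="$1.response"
    if [ ! -f "$response_file" ]; then
        echo "missing"          # swishspider did not write the file
    elif grep -q 'text/html' "$response_file"; then
        echo "filtered"         # SWISH::Filter converted the PDF
    elif grep -q 'application/pdf' "$response_file"; then
        echo "not filtered"     # the raw PDF was passed through
    else
        echo "unknown"
    fi
}
```

Run it as "check_filtered prefix" in the directory where you ran
swishspider.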
Received on Tue Jul 29 14:48:36 2003