Re: FW: Re: Filtering problems

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sat Sep 20 2003 - 01:01:59 GMT
On Fri, Sep 19, 2003 at 12:05:00PM -0700, Klingensmith, Rick wrote:
> OK, I've seen the light and am switching over to use spider.pl. So far I've
> gotten it to use SwishSpiderConfig.pl and point to my local host to find 4
> URLs (which is correct). However, it is not indexing the output and it's
> probably issues with the filter object? Here is the output from swish-e
> using the following command line:

Hi Rick,

Sure you don't want to install Linux?  Considering I just received over
300 "Internet Upgrade" messages "from" Microsoft today, right on the
heels of Sobig, it seems like everyone would be ready to switch...


> Summary for: http://localhost/
> Connection: Keep-Alive: 3  (1.5/sec)
>                Skipped: 4  (2.0/sec)
>            Unique URLs: 4  (2.0/sec)

All docs were skipped, probably because SWISH::Filter was not loaded.

>         debug       => DEBUG_INFO,  # print some debugging info to STDERR

          debug       => DEBUG_SKIPPED|DEBUG_INFO

will say why.  I'm sure it was because they were not filtered correctly.


May I suggest you start out simple without a spider config file?


Dave sent me a Windows PR3 build the other day and we still had problems 
with it working right because of PATH issues.  I was able to get it 
working (and indexing PDF files) with a few tweaks.

See, under unix all the paths get set correctly when running the 
configure script, so swish-e and spider.pl know where to find things.

Under Windows we have to figure out where things are installed at run 
time -- and try and keep the Windows version in sync with the way the 
Unix version works.

Now, to summarize: when using -S prog, swish-e uses the popen() library 
call to run the external program.  Under unix you can create a swish-e 
config file "swish.conf" that contains just:

  SwishProgParameters default http://localhost/index.html

and then run swish-e like:

  swish-e -S prog -c swish.conf -i spider.pl

Swish-e will then look in the PATH and in $libexecdir (where spider.pl
was installed) for the program specified with -i (same thing as IndexDir
in a config file).  Once that program is found swish-e calls popen with
the command:

  /path/to/spider.pl default http://localhost/index.html

and reads input from spider.pl's stdout.

That "default" says to use spider.pl's default settings, which is a good 
way to start.  It will filter by default -- *if* it can find all the 
filter parts that are needed.
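
To check the -S prog plumbing on its own, before spider.pl and filtering are involved, a tiny stand-in program can be run in place of spider.pl. The sketch below assumes the prog input format as described in the swish-e documentation (a Path-Name: header, a Content-Length: header, a blank line, then the content); verify that against the docs for your version. The helper name format_doc, the URL and the sample HTML are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Format one document for swish-e's "prog" input method (assumed format:
# header lines, a blank line, then exactly Content-Length bytes of content).
sub format_doc {
    my ( $path, $content ) = @_;

    # length() counts characters; for this ASCII-only sample that is
    # the same as the byte count.
    return "Path-Name: $path\n"
         . "Content-Length: " . length($content) . "\n"
         . "\n"
         . $content;
}

# Assumption: text-mode output on Windows would turn \n into \r\n and make
# the content longer than the Content-Length header claims, so write raw bytes.
binmode STDOUT;

print format_doc(
    'http://localhost/test.html',
    "<html><head><title>Test</title></head><body>hello swish</body></html>\n",
);
```

If swish-e indexes this one document, the popen side is working and any remaining problem is likely in spider.pl or the filter setup.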

Now, I only have Windows 98, so I can't run a program "spider.pl" 
directly.  I guess you can with Win2K and WinNT, though.  So what I did 
was create a spider.bat batch file to run spider.pl for me.

   spider.bat:
   perl /path/to/spider.pl %1 %2 %3 %4 %5 %6 %7

I put that in the same location as spider.pl was installed 
(lib/swish-e/spider.pl).

Note:
The other way to do that is to run perl as the program:


  SwishProgParameters /path/to/spider.pl default http://localhost/index.html

and then run swish-e like:

  swish-e -S prog -c swish.conf -i perl.exe

But I think I like the batch file method better.



Once spider.pl is running it has to find the SWISH::Filter module, and 
it does that by using the @INC array and that can be set using a "use 
lib" line (see the top of spider.pl) or by setting the PERL5LIB 
environment variable.  When swish-e is installed in Windows that path is 
supposed to be set correctly at the top of spider.pl.  But if it isn't 
you can set PERL5LIB.  I think under Windows you can do:

   set PERL5LIB=c:\<where you installed swish>\lib\swish-e

BTW -- That will change soon to match up with unix install to be 

           <install dir>\lib\swish-e\perl

so you will have to look and see where the SWISH directory is located 
and then point to the directory above that (because Perl appends the 
path \SWISH\Filter.pm when looking for the module).
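
A quick way to see whether Perl can find the module is to walk @INC and look for SWISH/Filter.pm in each directory. This is only a diagnostic sketch; the helper name find_module_dirs is made up, and it is not part of swish-e. Directories listed in PERL5LIB are added to @INC when perl starts, so run it from the same command prompt where PERL5LIB was set.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# Return every directory in @dirs that contains the file for $module
# (e.g. "SWISH::Filter" is looked up as SWISH/Filter.pm).
sub find_module_dirs {
    my ( $module, @dirs ) = @_;

    my @parts = split /::/, $module;
    $parts[-1] .= '.pm';

    return grep {
        my $path = File::Spec->catfile( $_, @parts );
        -f $path;
    } grep { !ref($_) } @dirs;    # @INC may also hold code refs; skip those
}

my @found = find_module_dirs( 'SWISH::Filter', @INC );

if (@found) {
    print "SWISH::Filter found under: $_\n" for @found;
}
else {
    print "SWISH::Filter not found.  Directories searched:\n";
    print "  $_\n" for grep { !ref($_) } @INC;
}
```

If the module is not found, the printed list shows exactly which directories Perl searched, which makes a wrong PERL5LIB or "use lib" path easy to spot.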


Then to complicate things more, SWISH::Filter then has to locate the 
program pdftotext to convert PDF files to text (HTML, really).

Those helper programs are installed in the $PATH under unix so that's
not a problem, but on Windows they are installed in $libexecdir.  In
your version SWISH::Filter may not know to look in $libexecdir.  That
has been fixed in cvs, but until there's a new windows version my
suggestion is to see where the Windows installer put those programs
(pdftotext, catdoc) and add that location to your PATH environment.
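
To confirm the helper programs are actually reachable through PATH, something like the following can be run from the same command prompt. It is a rough sketch, not the search SWISH::Filter itself performs; the helper name find_program and the list of Windows extensions tried are assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# Return the first file named $name (plus one of the extensions in @$exts)
# found in the directories @$dirs, or undef if there is none.
sub find_program {
    my ( $name, $dirs, $exts ) = @_;

    for my $dir (@$dirs) {
        next unless defined $dir && length $dir;
        for my $ext (@$exts) {
            my $candidate = File::Spec->catfile( $dir, $name . $ext );
            return $candidate if -f $candidate;
        }
    }
    return;
}

# File::Spec->path returns the directories from the PATH environment variable.
my @dirs = File::Spec->path;

# Assumed extension list for Windows; on unix only the bare name is tried.
my @exts = $^O eq 'MSWin32' ? ( '.exe', '.bat', '.cmd', '' ) : ('');

for my $prog (qw( pdftotext catdoc )) {
    my $where = find_program( $prog, \@dirs, \@exts );
    print "$prog: ", ( defined $where ? $where : 'NOT FOUND in PATH' ), "\n";
}
```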

So, in summary, to get filtering to work (with SWISH::Filter) you need 
to:

  1) make sure windows can run spider.pl
       use a spider.bat file if needed

  2) make sure spider.pl can locate SWISH::Filter
       set PERL5LIB or edit spider.pl's "use lib" line

  3) make sure SWISH::Filter can locate the conversion program
       add the location of the programs to your PATH


Now, how to debug things:

  Set the environment variable for the spider debugging:

     set SPIDER_DEBUG=url,skipped

  (see spider.pl docs for details)


  Then for debugging SWISH::Filter use:

     set FILTER_DEBUG=1

I have added additional debugging lately that will show how 
SWISH::Filter is searching for filter programs ("pdftotext", for 
example).


So, once you get that working then add extra swish-e config settings as 
needed.  If you want more control over spidering then switch to using a 
config file for spider.pl (following the examples in 
SwishSpiderConfig.pl).  But wait until you get the above working 
correctly.  No point in making things too complicated from the start.

You can look in spider.pl and search for "sub default_urls" to see the 
config spider.pl uses when you specify "default".

Now:

> I have applied the following lines of code to the windows_fork
> subroutine:
> 
> sub windows_fork {
>     my ( $self, @args ) = @_;
> 
> 
>     require IPC::Open2;
>     my ( $rdrfh, $wtrfh );
> 
>     # Added these three lines per instructions from Bill Moseley 7/29/2003
>     my $path = join " ", @args;
>     open FH, "$path|" or die $!;
>     return \*FH;

I can't remember (or see) why that was needed.  Maybe it was the binmode
issue.  The current code in SWISH::Filter looks like this:

sub windows_fork {
    my ( $self, @args ) = @_;


    require IPC::Open2;
    my ( $rdrfh, $wtrfh );

    my @command = map { s/"/\\"/g; qq["$_"] }  @args;


    my $pid = IPC::Open2::open2($rdrfh, $wtrfh, @command );

    # IPC::Open3 uses binmode for some reason (5.6.1)
    # Assume that the output from the program will be in text
    # Maybe an invalid assumption if running through a binary filter

    binmode $rdrfh, ':crlf';  # perhaps: unless delete $self->{binary_output};

    $self->{pid} = $pid;

    return $rdrfh;
}

That, AFAICT, runs the program without going through the shell (even on 
Windows).  open2() calls system() with a 1 as the first parameter to 
accomplish this.  The painful part is that Windows seems to process the 
double-quotes even when not going through the shell, so that's the 
reason for the line:

    my @command = map { s/"/\\"/g; qq["$_"] }  @args;

which just escapes the double quotes so that phrase searches still work.
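
To see what that line does to an argument list, here is a small standalone sketch. It applies the same substitution and quoting, wrapped in a made-up helper (quote_args) that works on copies of its arguments; the sample arguments are only an illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Backslash-escape embedded double quotes, then wrap each argument in
# double quotes.  Work on copies so the caller's values are not changed
# through map's aliasing of $_.
sub quote_args {
    my @args = @_;
    return map { my $arg = $_; $arg =~ s/"/\\"/g; qq["$arg"] } @args;
}

print join( ' ', quote_args( 'swish-e', '-w', '"foo bar"' ) ), "\n";
# prints: "swish-e" "-w" "\"foo bar\""
```

The inner quotes around the phrase survive as \" inside the outer pair, so the program on the receiving end still sees a quoted phrase.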




-- 
Bill Moseley
moseley@hank.org
Received on Sat Sep 20 01:02:10 2003