Re: spider and cgi problems

From: David L Norris <dave(at)not-real.webaugur.com>
Date: Sat Feb 22 2003 - 08:49:31 GMT
On Sat, 2003-02-22 at 02:27, Dave-CBC wrote:
> but can't get it to index more than just links found
> on the pages it spiders

A web spider can only index URLs it knows about.  It can't guess what
other URLs might exist.  Well, I suppose it could, but it wouldn't get
far.

Do you have filesystem access to the web server?  If so, use the
filesystem method rather than a spider.  If not, you'll have to give
the spider a list of starting URLs, one @servers entry per URL.

spider.config:
           @servers = (
               {
                   base_url    => 'http://localhost/this/',
                   email       => 'me@myself.com',
               },
               {
                   base_url    => 'http://localhost/that/',
                   email       => 'me@myself.com',
               },
           );

swish.config:
  IndexDir ./spider.pl
  SwishProgParameters spider.config
  DefaultContents HTML2
  StoreDescription HTML2 <BODY> 100000

Create the SWISH-E index like this:
  swish-e -c swish.config -S prog -E ./swishError.log

There is an example spider config SwishSpiderConfig.pl in the prog-bin
directory.  It has perldoc documentation and many comments.
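
For comparison, the filesystem method might look roughly like the sketch
below.  This is only an illustration: /var/www/html is a made-up document
root and fs.config is a made-up file name, so substitute your own.
IndexOnly limits indexing to files with the listed extensions.

fs.config (hypothetical name):
  IndexDir /var/www/html
  IndexOnly .html .htm
  DefaultContents HTML2
  StoreDescription HTML2 <BODY> 100000

Then index with the fs access method instead of prog:
  swish-e -c fs.config -S fs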

>         my $file = "$swish_binary -w $query -d :: -v 3 -H 9 -f D:/PROGRA~1/SWISH-E/index.swish";
>     if ( $pid = open( SWISH, "$file|" ) ) {
>     if ( $pid = open( SWISH, '-|' ) ) {

What version of SWISH-E?  SWISH-E's current search.cgi doesn't use
open() on Windows.  I don't recall when Bill fixed that but I think it's
been quite a while.

Latest SWISH-E builds are here:
  http://www.swish-e.org/Download/
and here:
  http://www.webaugur.com/wares/files/swish-e/

Odd minor numbers are development builds.  2.2.x is a release, 2.3.x is
development.  Current release is 2.2.3 (2002-12-11).

-- 
 David Norris
  http://www.webaugur.com/dave/
  ICQ - 412039
Received on Sat Feb 22 08:50:01 2003