
Re: spider and cgi problems

From: Dave Morton <dave(at)not-real.circuitboardcomputing.com>
Date: Sat Feb 22 2003 - 09:24:38 GMT
Thanks for the quick response. Let's see if I can answer some of this, and make sense of more. 8^}

> Do you have filesystem access to the web server?  If so then use the
> filesystem method rather than a spider.  If no filesystem access then
> you'll have to give the spider a list of URLs.

I do have filesystem access on this box, but I need to gear this toward multiple servers that I
will not have that sort of access to. Thus the spidering.
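
For reference, here's my understanding of the two approaches -- I may well have some details
wrong, so treat the command lines below as a rough sketch rather than anything authoritative:

    # Filesystem method -- index documents straight off the disk:
    swish-e -S fs -c swish.conf -i /path/to/htdocs

    # Spider method -- have spider.pl fetch the pages over HTTP,
    # with spider.pl and its config named in swish.conf:
    #   IndexDir            ./prog-bin/spider.pl
    #   SwishProgParameters spider.config
    swish-e -S prog -c swish.conf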



> spider.config:
>            @servers = (
>                {
>                    base_url    => 'http://localhost/this/',
>                    email       => 'me@myself.com',
>                },
>                {
>                    base_url    => 'http://localhost/that/',
>                    email       => 'me@myself.com',
>                },
>            );
> 

I wasn't aware that Perl code can go right into the config file. I'll have to remember this.
That is, if I'm reading the above lines correctly.
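
If that's the case, I suppose the @servers list could even be built up programmatically rather
than written out by hand. This is purely a guess on my part (the section names and the file name
below are made up), but something like:

    # guess-spider.config -- assuming spider.pl just evaluates this file as ordinary Perl
    my @sections = qw( this that );

    @servers = ();
    for my $section (@sections) {
        push @servers, {
            base_url => "http://localhost/$section/",   # one entry per site area
            email    => 'me@myself.com',                # contact address the spider reports
        };
    }

    1;   # a config file pulled in with do() should return a true value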

> There is an example spider config SwishSpiderConfig.pl in the prog-bin
> directory.  It has perldoc documentation and many comments.

As a matter of fact, I'm already using a modified version of the 'full URL list' example from
SwishSpiderConfig.pl when I spider the three servers that the company I work for has sites on. I'll
outline the problem I'm having in a bit more detail below.

> What version of SWISH-E?  SWISH-E's current search.cgi doesn't use
> open() on Windows.  I don't recall when Bill fixed that but I think it's
> been quite a while.

I'm using version 2.2.3 at present. Like I said, I'm relatively new; however, I
did check to make sure I have the latest version. 8^}

>  David Norris
>   http://www.webaugur.com/dave/
>   ICQ - 412039

Ok, let's see if I can be a bit more descriptive about my dilemma:

1.) The spider seems to work well as configured, as far as indexing the pages it's told to
  look at. I can find all the expected references to whatever keywords I enter, without any trouble.

2.) The search page (search.cgi) returns links to all of the pages I expect it to, but it only displays
  the links, rather than, say, the content surrounding the keyword(s) searched. And since no content
  other than page links is displayed, the searched keywords aren't highlighted either. Showing that
  surrounding content with the keywords highlighted is the behavior I'm hoping to get the search
  page to exhibit. For example:

  Say I'm looking for "kumquat", and there's only one page that contains that word, with a page title
of "Western Fruit Growers Almanac". This page is at the URL http://foo.bar/egAlmanac.html. The text
surrounding the word kumquat is "The primary export for Wrinkle County is the fall kumquat crop."
What I'm looking for the search engine to return is a link to the page, with the page's title as the
link's text, plus the text surrounding the first match on the page, with the keyword(s) highlighted.
As far as parsing the HTML is concerned, I don't think I'll have a problem, as long as I can get
the above items returned. The links I get already work as described, but I have no content to parse.
That's where my problem lies.
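
  My best guess so far -- and this is untested, so I may have the directive and property names
wrong -- is that the index would first need to store the page content (something like
"StoreDescription HTML* <body> 100000" in the swish-e config), and then the CGI script could pull
a snippet out of that stored description and mark up the match itself. Roughly along these lines:

    # Hypothetical helper for search.cgi -- $description is assumed to hold the stored
    # page content (the swishdescription property), and $query a single search term.
    sub format_hit {
        my ( $url, $title, $description, $query ) = @_;

        # Grab a little context around the first match and bold the keyword.
        # Very rough: single term only, no word-boundary handling.
        my $snippet = '';
        if ( $description =~ /(.{0,60}?)(\Q$query\E)(.{0,60})/is ) {
            $snippet = "...$1<b>$2</b>$3...";
        }

        return qq{<a href="$url">$title</a><br>\n$snippet};
    }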

  Any notions?


Dave Morton          dave@circuitboardcomputing.com
Received on Sat Feb 22 09:25:17 2003