Re: Something wrong with the example in the website?

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Jun 07 2005 - 03:17:01 GMT
On Mon, Jun 06, 2005 at 08:00:42PM -0700, Lionel lau wrote:
> This is what I have in the .conf file which i copied
> from the site too:
> 
> ***********************************************
> IndexDir spider.pl
> 
> SwishProgParameters default http://localhost/index.html
> 
> Metanames swishtitle swishdocpath
> 
> StoreDescription TXT* 10000
> StoreDescription HTML* <body> 10000
> ************************************************

Here are some debugging tips:

OK, so when you run with "-S prog", IndexDir names the program to run
and SwishProgParameters are passed to that program as its arguments.
That means you can run spider.pl on its own, without swish.
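
To make that mapping concrete, here is a rough sketch that reads a config
file and prints the command line "-S prog" would effectively run. The
function name prog_command_from_conf is made up for illustration, it only
handles a single IndexDir entry, and matching directive names without
regard to case is an assumption on my part.

```shell
# Hypothetical helper: print "<IndexDir program> <SwishProgParameters...>"
# for the config file given as $1. Only the last IndexDir and
# SwishProgParameters lines are used; other directives are ignored.
prog_command_from_conf() {
    awk '
        tolower($1) == "indexdir"            { prog = $2 }
        tolower($1) == "swishprogparameters" {
            $1 = ""               # drop the directive name
            sub(/^ +/, "")        # trim the space it leaves behind
            params = $0
        }
        END {
            if (prog != "") print prog (params != "" ? " " params : "")
        }
    ' "$1"
}
```

With the config quoted above saved as, say, spider.conf, calling
prog_command_from_conf spider.conf should print
"spider.pl default http://localhost/index.html", which is the command
to try by hand.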

Swish knows where to look for spider.pl because that directory is
compiled in when you build swish from source.  So you can find it like
this:

    $ swish-e -h | grep Scripts
     Scripts and Modules at: (libexecdir) = /usr/local/lib/swish-e

    $ file /usr/local/lib/swish-e/spider.pl 
    /usr/local/lib/swish-e/spider.pl: perl script text
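
If you want to script that lookup, something like the following might
do. It is only a sketch: libexecdir_from_help is a made-up name, and it
assumes the help text contains a "(libexecdir) = <path>" line in the
same form as shown above.

```shell
# Hypothetical helper: read `swish-e -h` output on stdin and print the
# directory that follows "(libexecdir) = ".
libexecdir_from_help() {
    sed -n 's/^.*(libexecdir) = //p' | head -n 1
}

# Only try the real binary if it is installed.
if command -v swish-e >/dev/null 2>&1; then
    spider="$(swish-e -h | libexecdir_from_help)/spider.pl"
    echo "Expecting spider.pl at: $spider"
fi
```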

Now you can run the spider directly -- notice that I write the output
from the spider (the fetched documents) to /dev/null.

    $ /usr/local/lib/swish-e/spider.pl default http://localhost/index.html >/dev/null


    /usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'

    Summary for: http://localhost/index.html
         Connection: Close:     1  (1.0/sec)
    Connection: Keep-Alive:     1  (1.0/sec)
            Off-site links:    10  (10.0/sec)
               Total Bytes: 4,111  (4111.0/sec)
                Total Docs:     1  (1.0/sec)
               Unique URLs:     2  (2.0/sec)
                 text/html:     1  (1.0/sec)
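
For a quick scripted check of that summary, a sketch like this pulls
out the "Total Docs" count. The name total_docs is made up, and the
usage line below assumes the summary goes to stderr, which the output
above suggests since it still shows up with stdout sent to /dev/null.

```shell
# Hypothetical helper: read a spider summary on stdin and print the
# number that follows "Total Docs:" (thousands separators removed).
total_docs() {
    awk -F: '/Total Docs:/ {
        split($2, f, " ")     # f[1] is the count, f[2] the rate
        gsub(/,/, "", f[1])
        print f[1]
        exit
    }'
}
```

Usage would be along the lines of:
/usr/local/lib/swish-e/spider.pl default http://localhost/index.html 2>&1 >/dev/null | total_docs
A count of 1 on a site with several linked pages is a hint that links
are not being followed.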

Now you can enable debugging:

    $ SPIDER_DEBUG=url,skipped,links  /usr/local/lib/swish-e/spider.pl default http://localhost/apache/index.html >/dev/null
    /usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'

     -- Starting to spider: http://localhost/apache/index.html --
    >> +Fetched 0 Cnt: 1 GET  http://localhost/apache/index.html  200 OK text/html 109 parent: depth:0

    Extracting links from http://localhost/apache/index.html:

    Looking at extracted tag '<a href="doc2.html">'
       href="http://localhost/apache/doc2.html" Added to list of links to follow
    >> +Fetched 1 Cnt: 2 GET  http://localhost/apache/doc2.html  200 OK text/html 109 parent:http://localhost/apache/index.html depth:1

    Extracting links from http://localhost/apache/doc2.html:

    Looking at extracted tag '<a href="doc3.html">'
       href="http://localhost/apache/doc3.html" Added to list of links to follow
    >> +Fetched 2 Cnt: 3 GET  http://localhost/apache/doc3.html  200 OK text/html 109 parent:http://localhost/apache/doc2.html depth:2

    Extracting links from http://localhost/apache/doc3.html:

    Looking at extracted tag '<a href="doc4.html">'
       href="http://localhost/apache/doc4.html" Added to list of links to follow
    >> +Fetched 3 Cnt: 4 GET  http://localhost/apache/doc4.html  200 OK text/html 111 parent:http://localhost/apache/doc3.html depth:3

    Extracting links from http://localhost/apache/doc4.html:

    Looking at extracted tag '<a href="index.html">'
      tag did not include any links to follow or is a duplicate

    Summary for: http://localhost/apache/index.html
         Connection: Close:   1  (1.0/sec)
    Connection: Keep-Alive:   3  (3.0/sec)
                Duplicates:   1  (1.0/sec)
               Total Bytes: 438  (438.0/sec)
                Total Docs:   4  (4.0/sec)
               Unique URLs:   4  (4.0/sec)
                 text/html:   4  (4.0/sec)
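
The debug output gets long on a real site, so here is a sketch for
boiling it down to one line per fetched URL. The name fetched_urls is
made up, and it assumes the ">> +Fetched" lines keep the field layout
shown above (URL in the 7th whitespace-separated field, HTTP status in
the 8th).

```shell
# Hypothetical helper: read spider debug output on stdin and print
# "<status> <url>" for every ">> +Fetched" line.
fetched_urls() {
    awk '$1 == ">>" && $2 == "+Fetched" { print $8, $7 }'
}
```

Assuming the debug lines are written to stderr, usage would look like:
SPIDER_DEBUG=url /usr/local/lib/swish-e/spider.pl default http://localhost/apache/index.html 2>&1 >/dev/null | fetched_urls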

-- 
Bill Moseley
moseley@hank.org

Received on Mon Jun 6 20:17:02 2005