On Mon, Jun 06, 2005 at 08:00:42PM -0700, Lionel lau wrote:
> This is what I have in the .conf file which i copied
> from the site too:
>
> ***********************************************
> IndexDir spider.pl
>
> SwishProgParameters default http://localhost/index.html
>
> Metanames swishtitle swishdocpath
>
> StoreDescription TXT* 10000
> StoreDescription HTML* <body> 10000
> ************************************************

Here are some debugging tips:

OK, so when you run with "-S prog", IndexDir is the program to run and
SwishProgParameters are passed to that program.

You can run spider.pl without using swish.  Swish knows where to look
for spider.pl because that directory is compiled in when you build
swish from source.  So you can find it like this:

    $ swish-e -h | grep Scripts
     Scripts and Modules at: (libexecdir) = /usr/local/lib/swish-e

    $ file /usr/local/lib/swish-e/spider.pl
    /usr/local/lib/swish-e/spider.pl: perl script text

Now you can run the spider directly -- notice that I write the output
from the spider (the fetched documents) to /dev/null.

    $ /usr/local/lib/swish-e/spider.pl default http://localhost/index.html >/dev/null
    /usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'

    Summary for: http://localhost/index.html
    Connection: Close:          1  (1.0/sec)
    Connection: Keep-Alive:     1  (1.0/sec)
    Off-site links:            10  (10.0/sec)
    Total Bytes:            4,111  (4111.0/sec)
    Total Docs:                 1  (1.0/sec)
    Unique URLs:                2  (2.0/sec)
    text/html:                  1  (1.0/sec)

Now you can enable debugging:

    $ SPIDER_DEBUG=url,skipped,links /usr/local/lib/swish-e/spider.pl default http://localhost/apache/index.html >/dev/null
    /usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'

     -- Starting to spider: http://localhost/apache/index.html --

    >> +Fetched 0 Cnt: 1 GET http://localhost/apache/index.html 200 OK text/html 109 parent: depth:0

    Extracting links from http://localhost/apache/index.html:
     Looking at extracted tag '<a href="doc2.html">'
       href="http://localhost/apache/doc2.html" Added to list of links to follow

    >> +Fetched 1 Cnt: 2 GET http://localhost/apache/doc2.html 200 OK text/html 109 parent:http://localhost/apache/index.html depth:1

    Extracting links from http://localhost/apache/doc2.html:
     Looking at extracted tag '<a href="doc3.html">'
       href="http://localhost/apache/doc3.html" Added to list of links to follow

    >> +Fetched 2 Cnt: 3 GET http://localhost/apache/doc3.html 200 OK text/html 109 parent:http://localhost/apache/doc2.html depth:2

    Extracting links from http://localhost/apache/doc3.html:
     Looking at extracted tag '<a href="doc4.html">'
       href="http://localhost/apache/doc4.html" Added to list of links to follow

    >> +Fetched 3 Cnt: 4 GET http://localhost/apache/doc4.html 200 OK text/html 111 parent:http://localhost/apache/doc3.html depth:3

    Extracting links from http://localhost/apache/doc4.html:
     Looking at extracted tag '<a href="index.html">'
       tag did not include any links to follow or is a duplicate

    Summary for: http://localhost/apache/index.html
    Connection: Close:          1  (1.0/sec)
    Connection: Keep-Alive:     3  (3.0/sec)
    Duplicates:                 1  (1.0/sec)
    Total Bytes:              438  (438.0/sec)
    Total Docs:                 4  (4.0/sec)
    Unique URLs:                4  (4.0/sec)
    text/html:                  4  (4.0/sec)

--
Bill Moseley
moseley@hank.org

Received on Mon Jun 6 20:17:02 2005
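
If you want to script the lookup of the spider's directory rather than reading it off by eye, here is a minimal sketch. It assumes your build's `swish-e -h` output contains a "(libexecdir) = ..." line shaped like the one shown in the transcript; `libexecdir_from_help` is a hypothetical helper name for illustration, not something that ships with swish-e.

```shell
#!/bin/sh
# Sketch only: extract the libexecdir path from `swish-e -h` style output.
# Assumes a line shaped like:
#   Scripts and Modules at: (libexecdir) = /usr/local/lib/swish-e

# libexecdir_from_help (hypothetical helper): reads help text on stdin and
# prints whatever follows "(libexecdir) = " on the first matching line.
# Prints nothing if no such line is present.
libexecdir_from_help() {
    sed -n 's/^.*(libexecdir) = *//p' | head -n 1
}

# Possible usage, assuming swish-e is on your PATH:
#   dir=$(swish-e -h | libexecdir_from_help)
#   SPIDER_DEBUG=url,skipped,links "$dir/spider.pl" default http://localhost/index.html >/dev/null
```

The usage lines are left commented out so the snippet can be sourced on a machine without swish-e installed; check that the directory is non-empty before running the spider from it.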