At 03:09 PM 05/24/02 +0200, BenBen wrote:
>----------------------------
>IndexFile swishnew.index
>#IndexDir http://192.168.111.15/intranet/service/moteur/DOC/spider.html
>IndexDir http://gedeon/swish/DOC/actuel/spider.html
>#IndexDir DOC/actuel/
>IndexOnly .doc

IndexOnly has no effect for spidering with -S http.

>#FollowSymLinks no
>IgnoreWords file: french.txt
>TranslateCharacters :ascii7:
>MaxDepth 5
>Delay 0
>TmpDir /usr/local/apache/htdocs/html/swish/
>SpiderDirectory /usr/local/apache/htdocs/html/swish/
>-----------------------------------------

Since you are indexing an intranet I can't test from here.

In the "conf" directory there is a set of config files that should help.

You are using the -S http method. I prefer to use -S prog with spider.pl, as there's a lot more debugging info available. It's just more work to learn. I'll give an example below. The http method drives me crazy since once it starts it's hard to kill. ;)

But here are some suggestions for the -S http method.

1) Make sure the spider can actually read something. Run the spider without using swish.

moseley@bumby:~/swish-e/src$ ./swishspider ./ http://swish-e.org/index.html

(Note that I used "./" for the current directory. That's just a prefix used on the spider's output files.)

moseley@bumby:~/swish-e/src$ ls -la | head
total 496120
-rw-r--r-- 1 moseley moseley 5321 May 24 06:47 .contents
-rw-r--r-- 1 moseley moseley 638 May 24 06:47 .links
-rw-r--r-- 1 moseley moseley 14 May 24 06:47 .response

moseley@bumby:~/swish-e/src$ cat .response
200
text/html

moseley@bumby:~/swish-e/src$ head -5 .contents
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title>SWISH-Enhanced</title>

moseley@bumby:~/swish-e/src$ head -5 .links
http://swish-e.org/download.html
http://swish-e.org/Discussion
http://swish-e.org/2.2/docs/CHANGES.html
http://swish-e.org/2.2/docs/index.html
http://swish-e.org/2.2/swish-daily

Now we see that it's really fetching something.
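If you'd rather not eyeball those three files each time, the check in step 1 could be wrapped in a small shell function. This is only a sketch: check_spider_output is a made-up helper name, not part of swish-e, and it assumes swishspider has already been run with the same prefix and that the first line of .response holds the HTTP status, as in the listing above.

```shell
# check_spider_output PREFIX
# Hypothetical helper (not shipped with swish-e). Assumes swishspider
# was already run with the same PREFIX, for example:
#   ./swishspider ./ http://swish-e.org/index.html
check_spider_output() {
    prefix=$1

    # The spider should have left at least a response and a contents file.
    for f in response contents; do
        if [ ! -f "${prefix}.${f}" ]; then
            echo "missing ${prefix}.${f}" >&2
            return 1
        fi
    done

    # Assumption: the first line of .response is the HTTP status code.
    status=$(head -n 1 "${prefix}.response")
    if [ "$status" != "200" ]; then
        echo "spider got status ${status}" >&2
        return 1
    fi

    # Count extracted links (treated as zero if no .links file exists).
    links=0
    if [ -f "${prefix}.links" ]; then
        links=$(wc -l < "${prefix}.links" | tr -d ' ')
    fi
    bytes=$(wc -c < "${prefix}.contents" | tr -d ' ')

    echo "ok: status ${status}, ${links} links, ${bytes} bytes"
}
```

After running swishspider with "./" as the prefix, calling `check_spider_output ./` should print a one-line summary, or complain on stderr and return non-zero if the fetch didn't work.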

2) Try indexing:

moseley@bumby:~/swish-e/src$ cat c
Delay 0

moseley@bumby:~/swish-e/src$ ./swish-e -c c -S http -i http://swish-e.org/index.html -v3
Indexing Data Source: "HTTP-Crawler"
Indexing "http://swish-e.org/index.html"
retrieving http://swish-e.org/index.html (0)...
 - Using DEFAULT (HTML) parser - (324 words)
Skipping http://sourceforge.net/cvs/?group_id=15097: Wrong method or server.
Skipping http://swish-e.org/download.html: Already indexed.
Skipping http://www.fsf.org/copyleft/gpl.html: Wrong method or server.
Skipping http://swish-e.org/Discussion/: Already indexed.
retrieving http://swish-e.org/download.html (1)...
 - Using DEFAULT (HTML) parser - (77 words)
..

I keep another terminal window open so I can kill swish-e. Pressing ^C doesn't always work.

3) If that's not enough debugging info you can tell swish to print the words it's actually indexing. Again, this will generate a lot of output:

moseley@bumby:~/swish-e/src$ ./swish-e -c c -S http -i http://swish-e.org/index.html -v0 -T indexed_words
Adding:[1:swishdefault(1)] 'swish' Pos:1 Stuct:0x7 ( HEAD TITLE FILE )
Adding:[1:swishdefault(1)] 'enhanced' Pos:2 Stuct:0x7 ( HEAD TITLE FILE )
Adding:[1:swishdefault(1)] 'swish' Pos:3 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'e' Pos:4 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'is' Pos:5 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'a' Pos:6 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'fast' Pos:7 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'powerful' Pos:8 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'flexible' Pos:9 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'free' Pos:10 Stuct:0x9 ( BODY FILE )

Here's an example using the -S prog method. It uses the program prog-bin/spider.pl. That's a reasonably complicated program with questionable documentation.
You read the docs with:

perldoc spider.pl

My examples have been run from the ~/swish-e/src directory, so there you would type:

perldoc ../prog-bin/spider.pl

1) Testing just the spider program

Here's running just the spider.pl program. spider.pl outputs the files directly to stdout, so /dev/null is helpful. Setting the environment variable SPIDER_DEBUG before running controls debugging output. There are a number of different options. See the docs for more info.

moseley@bumby:~/swish-e/src$ SPIDER_DEBUG=url ../prog-bin/spider.pl default http://swish-e.org/index.html >/dev/null
./prog-bin/spider.pl: Reading parameters from 'default'

 -- Starting to spider: http://swish-e.org/index.html --

>> +Fetched 0 Cnt: 1 http://swish-e.org/index.html 200 OK text/html ??? parent:
>> +Fetched 1 Cnt: 2 http://swish-e.org/download.html 200 OK text/html ??? parent:http://swish-e.org/index.html
>> -Failed 1 Cnt: 3 http://swish-e.org/Discussion 301 Moved Permanently text/html ??? parent:http://swish-e.org/index.html
>> +Fetched 1 Cnt: 4 http://swish-e.org/2.2/docs/CHANGES.html 200 OK text/html ??? parent:http://swish-e.org/index.html
>> +Fetched 1 Cnt: 5 http://swish-e.org/2.2/docs/index.html 200 OK text/html ??? parent:http://swish-e.org/index.html
>> -Failed 1 Cnt: 6 http://swish-e.org/2.2/swish-daily 301 Moved Permanently text/html ??? parent:http://swish-e.org/index.html
>> +Fetched 1 Cnt: 7 http://swish-e.org/features.html 200 OK text/html ??? parent:http://swish-e.org/index.html
>> +Fetched 1 Cnt: 8 http://swish-e.org/demonstrations.html 200 OK text/html ??? parent:http://swish-e.org/index.html
>> +Fetched 1 Cnt: 9 http://swish-e.org/documentation.html 200 OK text/html ??? parent:http://swish-e.org/index.html
>> +Fetched 1 Cnt: 10 http://swish-e.org/bugs.html 200 OK text/html ???
parent:http://swish-e.org/index.html

2) Now running with swish:

moseley@bumby:~/swish-e/src$ cat c
SwishProgParameters default http://swish-e.org/index.html

(maybe I'll add a command line option for SwishProgParameters)

moseley@bumby:~/swish-e/src$ ./swish-e -c c -S prog -i../prog-bin/spider.pl -v2
Indexing Data Source: "External-Program"
Indexing "../prog-bin/spider.pl"
./prog-bin/spider.pl: Reading parameters from 'default'
Processing http://swish-e.org/index.html...
Processing http://swish-e.org/download.html...
Processing http://swish-e.org/2.2/docs/CHANGES.html...
Processing http://swish-e.org/2.2/docs/index.html...
Processing http://swish-e.org/features.html...
Processing http://swish-e.org/demonstrations.html...
Processing http://swish-e.org/documentation.html...
Processing http://swish-e.org/bugs.html...
Processing http://swish-e.org/Discussion/...
Processing http://swish-e.org/graphics.html...
Processing http://swish-e.org/team.html...
Processing http://swish-e.org/Ports/...

I didn't want to spider the entire site, so here's a little trick that's helpful for debugging. I hit ^Z to suspend the process:

[1]+ Stopped ./swish-e -c c -S prog -i../prog-bin/spider.pl -v2

moseley@bumby:~/swish-e/src$ ps
  PID TTY          TIME CMD
 2349 pts/1    00:00:00 bash
 8714 pts/1    00:00:00 swish-e
 8715 pts/1    00:00:00 spider.pl
 8716 pts/1    00:00:00 ps

moseley@bumby:~/swish-e/src$ kill -HUP 8715

moseley@bumby:~/swish-e/src$ fg
./swish-e -c c -S prog -i../prog-bin/spider.pl -v2
Removing very common words... no words removed.
Writing main index...
Sorting words ...
Sorting 1268 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
1268 unique words indexed.
4 properties sorted.
12 files indexed. 69436 total bytes. 6281 total words.
Elapsed time: 00:02:40 CPU time: 00:00:00
Indexing done!
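
The ps and kill step above could also be scripted if you do this often. A rough sketch, assuming the default ps column layout shown above (PID TTY TIME CMD) and that the spider shows up with the command name spider.pl; spider_pids and stop_spider are hypothetical names, not anything shipped with swish-e:

```shell
# Print the PIDs of spider.pl processes found in `ps` output on stdin.
# Assumes the four-column layout shown above: PID TTY TIME CMD.
spider_pids() {
    awk 'NR > 1 && $4 == "spider.pl" { print $1 }'
}

# Send HUP to every spider.pl started from this terminal.
# Plain `ps` only lists processes on the current terminal, so run this
# from the same shell where swish-e was suspended with ^Z.
stop_spider() {
    pids=$(ps | spider_pids)
    if [ -z "$pids" ]; then
        echo "no spider.pl found on this terminal" >&2
        return 1
    fi
    # $pids is intentionally unquoted so each PID becomes its own argument.
    kill -HUP $pids
}
```

With swish-e suspended by ^Z, you would run stop_spider and then fg, same as the manual steps.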

moseley@bumby:~/swish-e/src$ ./swish-e -w '"swish-daily"'
# SWISH format: 2.1-dev-25
# Search words: "swish-daily"
# Number of hits: 2
# Search time: 0.002 seconds
# Run time: 0.003 seconds
1000 http://swish-e.org/index.html "SWISH-Enhanced" 5321
908 http://swish-e.org/documentation.html "SWISH-E Documentation" 3235
.

-- 
Bill Moseley
mailto:moseley@hank.org

Received on Fri May 24 14:24:26 2002