Re: pb with http method and perl

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri May 24 2002 - 14:24:21 GMT
At 03:09 PM 05/24/02 +0200, BenBen wrote:
>----------------------------
>IndexFile swishnew.index
>#IndexDir http://192.168.111.15/intranet/service/moteur/DOC/spider.html
>IndexDir http://gedeon/swish/DOC/actuel/spider.html
>#IndexDir DOC/actuel/
>IndexOnly .doc

IndexOnly has no effect when spidering with -S http.


>#FollowSymLinks no
>IgnoreWords file: french.txt
>TranslateCharacters :ascii7:
>MaxDepth 5
>Delay 0
>TmpDir /usr/local/apache/htdocs/html/swish/
>SpiderDirectory /usr/local/apache/htdocs/html/swish/
>-----------------------------------------

Since you are indexing an intranet I can't test from here.

In the "conf" directory there is a set of config files that should help.

You are using the -S http method.  I prefer the -S prog method 
with spider.pl, as there's a lot more debugging info available.
It just takes more work to learn.  I'll give an example below.

The http method drives me crazy since once it starts it's hard 
to kill it. ;)

But here are some suggestions for the -S http method.

1) Make sure the spider can actually read something.  Run the 
spider without using swish.

moseley@bumby:~/swish-e/src$ ./swishspider ./ http://swish-e.org/index.html

(Note that I used "./" for the current directory.  That's just a prefix used on the spider's output files.)

moseley@bumby:~/swish-e/src$ ls -la | head
total 496120
-rw-r--r--    1 moseley  moseley      5321 May 24 06:47 .contents
-rw-r--r--    1 moseley  moseley       638 May 24 06:47 .links
-rw-r--r--    1 moseley  moseley        14 May 24 06:47 .response

moseley@bumby:~/swish-e/src$ cat .response
200
text/html

moseley@bumby:~/swish-e/src$ head -5 .contents    
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>

<head>
<title>SWISH-Enhanced</title>

moseley@bumby:~/swish-e/src$ head -5 .links    
http://swish-e.org/download.html
http://swish-e.org/Discussion
http://swish-e.org/2.2/docs/CHANGES.html
http://swish-e.org/2.2/docs/index.html
http://swish-e.org/2.2/swish-daily

Now we see that it's really fetching something.  
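
The manual check above can be wrapped in a small shell function.  This is 
only a sketch: the name check_spider_output is made up here, and it assumes 
the layout seen in the listing, namely that swishspider writes 
PREFIX.response (status on the first line, content type on the second), 
PREFIX.contents and PREFIX.links.

```shell
#!/bin/sh
# check_spider_output PREFIX  (hypothetical helper, not part of swish-e)
# PREFIX is the same prefix given to swishspider ("./" in the run above).
check_spider_output() {
    prefix=$1
    # The response and contents files must exist and be non-empty.
    for f in response contents; do
        if [ ! -s "${prefix}.${f}" ]; then
            echo "missing or empty: ${prefix}.${f}"
            return 1
        fi
    done
    # Line 1 of .response is the HTTP status, line 2 the content type.
    code=$(sed -n 1p "${prefix}.response")
    ctype=$(sed -n 2p "${prefix}.response")
    if [ "$code" != "200" ]; then
        echo "fetch failed: status $code"
        return 1
    fi
    # A page may have no links, so a missing .links file counts as zero.
    links=0
    if [ -f "${prefix}.links" ]; then
        links=$(wc -l < "${prefix}.links" | tr -d ' ')
    fi
    echo "ok: $code $ctype, $links links"
}
```

After running swishspider as shown, "check_spider_output ./" would report 
whether the fetch looks usable before swish-e is involved at all.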

2) Try indexing:

moseley@bumby:~/swish-e/src$ cat c    
Delay 0

moseley@bumby:~/swish-e/src$ ./swish-e -c c -S http -i http://swish-e.org/index.html -v3   

Indexing Data Source: "HTTP-Crawler"
Indexing "http://swish-e.org/index.html"
retrieving http://swish-e.org/index.html (0)...
 - Using DEFAULT (HTML) parser -  (324 words)
Skipping http://sourceforge.net/cvs/?group_id=15097:  Wrong method or server.
Skipping http://swish-e.org/download.html:  Already indexed.
Skipping http://www.fsf.org/copyleft/gpl.html:  Wrong method or server.
Skipping http://swish-e.org/Discussion/:  Already indexed.
retrieving http://swish-e.org/download.html (1)...
 - Using DEFAULT (HTML) parser -  (77 words)


..

I keep another terminal window open so I can kill swish-e.  
Pressing ^C doesn't always work.

3) If that's not enough debugging info, you can tell swish to print the words 
it's actually indexing.  Be warned that this generates a lot of output:

moseley@bumby:~/swish-e/src$ ./swish-e -c c -S http -i http://swish-e.org/index.html -v0 -T indexed_words
    Adding:[1:swishdefault(1)]   'swish'   Pos:1  Stuct:0x7 ( HEAD TITLE FILE )
    Adding:[1:swishdefault(1)]   'enhanced'   Pos:2  Stuct:0x7 ( HEAD TITLE FILE )
    Adding:[1:swishdefault(1)]   'swish'   Pos:3  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'e'   Pos:4  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'is'   Pos:5  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'a'   Pos:6  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'fast'   Pos:7  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'powerful'   Pos:8  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'flexible'   Pos:9  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'free'   Pos:10  Stuct:0x9 ( BODY FILE )
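
Since that output is long, it can help to boil it down to a word count.  
A rough sketch, assuming every line has the "Adding:... 'word' ..." shape 
shown above; count_indexed_words is a name invented for this example.

```shell
#!/bin/sh
# count_indexed_words  (hypothetical helper)
# Reads "-T indexed_words" output on stdin and prints "COUNT WORD" lines,
# most frequent first.  The word is taken as the text between the first
# pair of single quotes, so a word containing a quote would be cut short.
count_indexed_words() {
    awk -F"'" '/Adding:/ { n[$2]++ } END { for (w in n) print n[w], w }' |
        sort -k1,1nr -k2,2
}
```

It would be used on the end of the command above, for example 
"./swish-e -c c -S http -i http://swish-e.org/index.html -v0 -T indexed_words | count_indexed_words | head".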



Here's an example using the -S prog method.  It uses the program 
prog-bin/spider.pl.  That's a reasonably complicated program 
with questionable documentation.  You read the docs with:

      perldoc spider.pl

My examples are run from the ~/swish-e/src directory, so there 
you would type:

      perldoc ../prog-bin/spider.pl

1) Testing just the spider program

Here's running just the spider.pl program.  spider.pl writes the 
fetched files directly to stdout, so redirecting to /dev/null is helpful.

Setting the environment variable SPIDER_DEBUG before running 
controls debugging output.  There are a number of different options.  
See the docs for more info.

moseley@bumby:~/swish-e/src$ SPIDER_DEBUG=url ../prog-bin/spider.pl default http://swish-e.org/index.html >/dev/null

./prog-bin/spider.pl: Reading parameters from 'default'

 -- Starting to spider: http://swish-e.org/index.html --
>> +Fetched 0 Cnt: 1 http://swish-e.org/index.html 200 OK text/html ??? parent:
>> +Fetched 1 Cnt: 2 http://swish-e.org/download.html 200 OK text/html ??? parent:http://swish-e.org/index.html
>> -Failed 1 Cnt: 3 http://swish-e.org/Discussion 301 Moved Permanently text/html ??? parent:http://swish-e.org/index.html
>> +Fetched 1 Cnt: 4 http://swish-e.org/2.2/docs/CHANGES.html 200 OK text/html ??? parent:http://swish-e.org/index.html
>> +Fetched 1 Cnt: 5 http://swish-e.org/2.2/docs/index.html 200 OK text/html ??? parent:http://swish-e.org/index.html
>> -Failed 1 Cnt: 6 http://swish-e.org/2.2/swish-daily 301 Moved Permanently text/html ??? parent:http://swish-e.org/index.html
>> +Fetched 1 Cnt: 7 http://swish-e.org/features.html 200 OK text/html ??? parent:http://swish-e.org/index.html
>> +Fetched 1 Cnt: 8 http://swish-e.org/demonstrations.html 200 OK text/html ??? parent:http://swish-e.org/index.html
>> +Fetched 1 Cnt: 9 http://swish-e.org/documentation.html 200 OK text/html ??? parent:http://swish-e.org/index.html
>> +Fetched 1 Cnt: 10 http://swish-e.org/bugs.html 200 OK text/html ??? parent:http://swish-e.org/index.html
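
To pick out only the failures from that debug output, a filter along these 
lines may help.  It is a sketch that assumes the field order shown above 
(">>", "-Failed", depth, "Cnt:", count, URL, status); the name failed_urls 
is made up for the example.

```shell
#!/bin/sh
# failed_urls  (hypothetical helper)
# Reads SPIDER_DEBUG=url output on stdin and prints "STATUS URL"
# for every line marked "-Failed".
failed_urls() {
    awk '$1 == ">>" && $2 == "-Failed" { print $7, $6 }'
}
```

The debug lines still appear when stdout goes to /dev/null, which suggests 
they are written to stderr.  Under that assumption the filter would be fed 
with "SPIDER_DEBUG=url ../prog-bin/spider.pl default http://swish-e.org/index.html 2>&1 >/dev/null | failed_urls".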

2) Now running with swish:

moseley@bumby:~/swish-e/src$ cat c
SwishProgParameters default http://swish-e.org/index.html

(maybe I'll add a command line option for SwishProgParameters)


moseley@bumby:~/swish-e/src$ ./swish-e -c c -S prog -i../prog-bin/spider.pl -v2
Indexing Data Source: "External-Program"
Indexing "../prog-bin/spider.pl"
./prog-bin/spider.pl: Reading parameters from 'default'
Processing http://swish-e.org/index.html...
Processing http://swish-e.org/download.html...
Processing http://swish-e.org/2.2/docs/CHANGES.html...
Processing http://swish-e.org/2.2/docs/index.html...
Processing http://swish-e.org/features.html...
Processing http://swish-e.org/demonstrations.html...
Processing http://swish-e.org/documentation.html...
Processing http://swish-e.org/bugs.html...
Processing http://swish-e.org/Discussion/...
Processing http://swish-e.org/graphics.html...
Processing http://swish-e.org/team.html...
Processing http://swish-e.org/Ports/...

I didn't want to spider the entire site so here's a little trick helpful for debugging:

I hit ^Z to halt the process:

[1]+  Stopped                 ./swish-e -c c -S prog -i../prog-bin/spider.pl -v2

moseley@bumby:~/swish-e/src$ ps  
  PID TTY          TIME CMD
 2349 pts/1    00:00:00 bash
 8714 pts/1    00:00:00 swish-e
 8715 pts/1    00:00:00 spider.pl
 8716 pts/1    00:00:00 ps
moseley@bumby:~/swish-e/src$ kill -HUP 8715

moseley@bumby:~/swish-e/src$ fg
./swish-e -c c -S prog -i../prog-bin/spider.pl -v2

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 1268 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
1268 unique words indexed.
4 properties sorted.                                              
12 files indexed.  69436 total bytes.  6281 total words.
Elapsed time: 00:02:40 CPU time: 00:00:00
Indexing done!
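
The lookup of the spider's PID in the ps listing can be scripted too.  
A small sketch, assuming the default ps layout shown above (PID in the 
first column, command name in the last); pid_of is a name invented here.

```shell
#!/bin/sh
# pid_of NAME  (hypothetical helper)
# Reads ps output on stdin and prints the PID of the first process
# whose last column (CMD) is exactly NAME.
pid_of() {
    awk -v name="$1" '$NF == name { print $1; exit }'
}
```

After stopping swish-e with ^Z, this would replace reading the listing by 
eye: kill -HUP "$(ps | pid_of spider.pl)", followed by fg as before.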

moseley@bumby:~/swish-e/src$ ./swish-e -w '"swish-daily"'
# SWISH format: 2.1-dev-25
# Search words: "swish-daily"
# Number of hits: 2
# Search time: 0.002 seconds
# Run time: 0.003 seconds
1000 http://swish-e.org/index.html "SWISH-Enhanced" 5321
908 http://swish-e.org/documentation.html "SWISH-E Documentation" 3235
.


-- 
Bill Moseley
mailto:moseley@hank.org
Received on Fri May 24 14:24:26 2002