At 10:35 PM 02/12/02 -0800, Adam Edelman wrote:
>I'm having trouble getting swish to index files once they have been through
>spider.pl. Indexing has worked using a practically identical config file
>and swishspider.pl. I've also tried the spider.pl from the swish version
>I'm working with and with the newest version from 2/12/02. I have perl
>5.6.1. Any assistance would be appreciated. The relevant info follows.
Thanks very much for such a helpful post. It makes helping much easier.
So easy, in fact, that it works as-is on my machine. I just now downloaded
the Feb 7, 2002 binary version onto Win98.
E:\Program Files\SWISH-E>type SwishSpiderConfig.pl
@servers = (
    {
        base_url       => 'http://arena.internet2.edu/sample.htm',
        email          => 'swish@tulane.edu',
        delay_min      => .001,
        #max_time      => 10, # Max time to spider in minutes
        max_files      => 2, # Max Unique URLs to spider
        max_indexed    => 1,
        test_url       => \&test_url,
        test_response  => \&test_response,
        filter_content => \&filter_content,
        debug          => DEBUG_URL | DEBUG_SKIPPED | DEBUG_FAILED | DEBUG_INFO
    },);
sub test_url {
    my ( $uri, $server ) = @_;
    return $uri->path =~ /\.html?$/;
}
sub test_response {
    my ( $uri, $server, $response ) = @_;
    return 1; # ok to index and spider
}
sub filter_content {
    my ( $uri, $server, $response, $content_ref ) = @_;
    return 1;
}
1;
E:\Program Files\SWISH-E>type test.txt
BumpPositionCounterCharacters |.
MaxWordLimit 80
WordCharacters abcdefghijklmnopqrstuvwxyz0123456789.-_
IndexReport 3
IgnoreTotalWordCountWhenRanking yes
#IgnoreWords file: c:\\swish-e\\conf\\stopwords\\english.txt
IndexDir e:\\perl\\bin\\perl.exe
SwishProgParameters prog-bin/spider.pl
E:\Program Files\SWISH-E>swish-e -S prog -c test.txt
Indexing Data Source: "External-Program"
Indexing "e:\perl\bin\perl.exe"
prog-bin/spider.pl: Reading parameters from 'SwishSpiderConfig.pl'
-- Starting to spider: http://arena.internet2.edu/sample.htm --
?Testing 'test_url' user supplied function #1 'http://arena.internet2.edu:80/sample.htm'
+Passed all 1 tests for 'test_url' user supplied function
?Testing 'test_response' user supplied function #1 'http://arena.internet2.edu:80/sample.htm'
+Passed all 1 tests for 'test_response' user supplied function
>> +Fetched 0 Cnt: 1 http://arena.internet2.edu:80/sample.htm 200 OK text/html 29 parent:
! Found 0 links in http://arena.internet2.edu:80/sample.htm
prog-bin/spider.pl: Max indexed files Reached
Summary for: http://arena.internet2.edu/sample.htm
Total Bytes: 29 (14.5/sec)
Total Docs: 1 (0.5/sec)
Unique URLs: 1 (0.5/sec)
http://arena.internet2.edu:80/sample.htm - Using DEFAULT (HTML) parser - (2 words)
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 2 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
2 unique words indexed.
4 properties sorted.
1 file indexed. 29 total bytes. 2 total words.
Elapsed time: 00:00:03 CPU time: 00:00:03
Indexing done!
Could there be some issue with Windows SE?
Try running just the spider:
perl prog-bin/spider.pl > out
then type out:
E:\Program Files\SWISH-E>type out
Path-Name: http://arena.internet2.edu:80/sample.htm
Content-Length: 29
Last-Mtime: 1013569857
<HTML>Sample document</HTML>
That way you can see if spider.pl is working correctly.
You might try adding a blank line in your sample document, just in case
that's causing problems.
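If eyeballing the "out" file isn't enough, the check can be automated. What follows is only a rough sketch, not part of swish-e: it assumes the -S prog input format is a block of "Name: value" headers (Path-Name and Content-Length required), then a blank line, then exactly Content-Length bytes of content. The function name read_records and the SAMPLE constant are made up for illustration. SAMPLE is the record from the listing above with that separating blank line assumed.

```python
import io

# Headers assumed to be mandatory for every document in the stream.
REQUIRED = ("Path-Name", "Content-Length")

# The record shown in "type out", with the blank separator line assumed.
SAMPLE = (
    b"Path-Name: http://arena.internet2.edu:80/sample.htm\n"
    b"Content-Length: 29\n"
    b"Last-Mtime: 1013569857\n"
    b"\n"
    b"<HTML>Sample document</HTML>\n"
)


def read_records(stream):
    """Yield (headers, body) pairs from a binary stream of spider output.

    Raises ValueError as soon as a record does not match the assumed format.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # clean end of input between records
        headers = {}
        # Header lines run until the first blank line.
        while line.strip():
            name, sep, value = line.decode("latin-1").partition(":")
            if not sep:
                raise ValueError("malformed header line: %r" % line)
            headers[name.strip()] = value.strip()
            line = stream.readline()
            if not line:
                raise ValueError("input ended before the blank line after the headers")
        for name in REQUIRED:
            if name not in headers:
                raise ValueError("missing %s header" % name)
        length = int(headers["Content-Length"])
        body = stream.read(length)
        if len(body) != length:
            raise ValueError(
                "Content-Length says %d bytes but only %d were available"
                % (length, len(body))
            )
        yield headers, body


if __name__ == "__main__":
    for headers, body in read_records(io.BytesIO(SAMPLE)):
        print(headers["Path-Name"], len(body), "bytes")
```

To run it against real spider output, open the file in binary mode, e.g. read_records(open("out", "rb")), so the byte count is compared against what is actually on disk.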
--
Bill Moseley
mailto:moseley@hank.org
Received on Wed Feb 13 14:09:09 2002