Re: I'm getting there!

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Jan 25 2002 - 21:28:15 GMT

At 12:37 PM 01/25/02 -0800, Rich Thomas wrote:
>Only....  Now I can't get spidering to work again..grrrrr

Yes, grrrr.

Have I mentioned how much I don't like -S http?  ^C doesn't work!  It just
keeps on spidering and spidering and spidering.

Still, back to your personal directory, Rich:

First, with the -S http method, if I have to:

~/rich > cp ~/swish-e/src/swishspider .

~/rich > cat swish.conf
IndexDir http://ublin.lib.buffalo.edu/webcat/bibcat/A/A/E/9/001.html
StoreDescription HTML2 <body> 100000
DefaultContents HTML2
Delay 0
MaxDepth 2


~/rich > ./swish-e -c swish.conf -S http -v9 
Indexing Data Source: "HTTP-Crawler"
Indexing "http://ublin.lib.buffalo.edu/webcat/bibcat/A/A/E/9/001.html"
retrieving http://ublin.lib.buffalo.edu/webcat/bibcat/A/A/E/9/001.html (0)...
 - Using HTML2 parser -  (117 words)
retrieving http://ublin.lib.buffalo.edu/webcat/about/about1.html (1)...
 - Using HTML2 parser -  (100 words)
retrieving http://ublin.lib.buffalo.edu/holcat/A/A/E/9/001.html (1)...
retrieving http://ublin.lib.buffalo.edu/marcat/A/A/E/9/001.mrc (1)...
 - Using HTML2 parser -  (107 words)

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 153 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
153 unique words indexed.
5 properties sorted.                                              
3 files indexed.  3668 total bytes.  324 total words.
Elapsed time: 00:00:03 CPU time: 00:00:00
Indexing done!

~/rich > ./swish-e -w your -p swishdescription
# SWISH format: 2.1-dev-25
# Search words: your
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.038 seconds
1000 http://ublin.lib.buffalo.edu/webcat/about/about1.html "University at
Buffalo Libraries WebCatalog" 1368 "Project 008 Web Based Bibliographics
This is an experimental website that deals with the general conversion of
marc records into html. You have found a bibliographic record for an item
in the collection of the University Libraries at the University at Buffalo.
If you are a student or faculty member of UB, this item is available for
your use, click the catalog button and enter an appropriate search to get
exact location information about the item. University at Buffalo
bibliographic records Binghamton University bibliographic records The
Libraries University at Buffalo State University of New York"
.

OK, I don't like -S http, so let's move on to -S prog.
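
For anyone wondering what -S prog expects from the program it runs: as I understand the swish-e docs, the program writes each document to stdout as a few header lines (Path-Name and Content-Length, plus an optional Last-Mtime), then a blank line, then exactly Content-Length bytes of content. Here is a minimal sketch of that framing in Python. The function name, the example URL, and the feed.py file name are made up for illustration, and you should check the header names against the documentation for your swish-e version.

```python
import sys
import time


def format_document(path, body, mtime=None):
    """Frame one document as header lines, a blank line, then the body bytes.

    Header names follow my reading of the swish-e -S prog docs (an assumption
    to verify against your version); this function itself is hypothetical.
    """
    data = body.encode("utf-8")
    headers = [
        f"Path-Name: {path}",
        f"Content-Length: {len(data)}",  # length in bytes, not characters
    ]
    if mtime is not None:
        headers.append(f"Last-Mtime: {int(mtime)}")  # seconds since the epoch
    return ("\n".join(headers) + "\n\n").encode("utf-8") + data


if __name__ == "__main__":
    # A made-up page; a real program would fetch or read its documents.
    page = "<html><body>Hello from a hypothetical page</body></html>"
    doc = format_document("http://example.invalid/hello.html", page, time.time())
    sys.stdout.buffer.write(doc)
```

Saved as an executable feed.py, something like this would be run the same way as spider.pl below: ./swish-e -S prog -i ./feed.py -c swish.conf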

~/rich > cat swish.conf
SwishProgParameters default http://ublin.lib.buffalo.edu/webcat/bibcat/A/A/E/9/001.html
StoreDescription HTML2 <body> 100000
DefaultContents HTML2

~/rich > cp ~/swish-e/prog-bin/spider.pl .

~/rich > ./swish-e -S prog -i ./spider.pl -c swish.conf -v 9
Indexing Data Source: "External-Program"
Indexing "./spider.pl"
./spider.pl: Reading parameters from 'default'
http://ublin.lib.buffalo.edu/webcat/bibcat/A/A/E/9/001.html - Using HTML2
parser -  (117 words)
http://ublin.lib.buffalo.edu/webcat/about/about1.html - Using HTML2 parser
-  (100 words)
http://ublin.lib.buffalo.edu/marcat/A/A/E/9/001.mrc - Using HTML2 parser -
(107 words)
http://ublin.lib.buffalo.edu/webcat/bibcat/ - Using HTML2 parser -  (75 words)
http://ublin.lib.buffalo.edu/webcat/bingbcat/ - Using HTML2 parser -  (64
words)
http://ublin.lib.buffalo.edu/webcat/bibcat/?N=D - Using HTML2 parser -  (75
words)
http://ublin.lib.buffalo.edu/webcat/bibcat/?M=A - Using HTML2 parser -  (75
words)
http://ublin.lib.buffalo.edu/webcat/bibcat/?S=A - Using HTML2 parser -  (75
words)
http://ublin.lib.buffalo.edu/webcat/bibcat/?D=A - Using HTML2 parser -  (75
words)
http://ublin.lib.buffalo.edu/webcat/ - Using HTML2 parser -  (23 words)
http://ublin.lib.buffalo.edu/webcat/bibcat/A/ - Using HTML2 parser -  (190
words)
http://ublin.lib.buffalo.edu/webcat/bibcat/B/ - Using HTML2 parser -  (190
words)

[2]+  Stopped                 ./swish-e -S prog -i ./spider.pl -c
swish.conf -v 9

(Send a SIGHUP to spider.pl and that tells it to quit spidering.  Cool!)
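
The general pattern behind a "quit on SIGHUP" spider is simple enough to sketch. This is not spider.pl's actual code (that is Perl, and I am not reproducing it here). It is a small Python illustration of the same idea: the signal handler only sets a flag, and the crawl loop checks the flag between pages so it can stop cleanly and still report what it fetched. The fetch callable and all names are hypothetical.

```python
import os
import signal
from collections import deque

stop_requested = False  # set by the signal handler, polled by the crawl loop


def request_stop(signum, frame):
    """Signal handler: only record the request; the loop does the rest."""
    global stop_requested
    stop_requested = True


def crawl(start_urls, fetch):
    """Visit URLs breadth-first until the queue empties or SIGHUP arrives.

    `fetch` is a hypothetical callable that returns the links found on a page.
    Returns the list of URLs that were fully processed.
    """
    global stop_requested
    stop_requested = False
    signal.signal(signal.SIGHUP, request_stop)  # POSIX only

    queue = deque(start_urls)
    seen = set(start_urls)
    done = []
    while queue and not stop_requested:
        url = queue.popleft()
        for link in fetch(url):
            if link not in seen:  # skip duplicates
                seen.add(link)
                queue.append(link)
        done.append(url)  # the page in progress is finished before stopping
    return done
```

Because the flag is only checked at the top of the loop, the page being processed when the signal arrives is still completed, and everything already fetched is kept for indexing.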

~/rich > kill -HUP 14862
lii@mardy:~/rich > fg
./swish-e -S prog -i ./spider.pl -c swish.conf -v 9
Can't connect to ublin.lib.buffalo.edu:80 (Timeout)     ...propagated at
./spider.pl line 238.

Summary for: http://ublin.lib.buffalo.edu/webcat/bibcat/A/A/E/9/001.html
    Duplicates:     49  (0.6/sec)
Off-site links:      1  (0.0/sec)
   Total Bytes: 22,567  (275.2/sec)
    Total Docs:     13  (0.2/sec)
   Unique URLs:     17  (0.2/sec)
http://ublin.lib.buffalo.edu/webcat/bibcat/C/ - Using HTML2 parser -  (190
words)

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 219 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
219 unique words indexed.
5 properties sorted.                                              
13 files indexed.  22567 total bytes.  1356 total words.
Elapsed time: 00:01:23 CPU time: 00:00:00
Indexing done!

~/rich > ./swish-e -w your -p swishdescription
# SWISH format: 2.1-dev-25
# Search words: your
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.038 seconds
1000 http://ublin.lib.buffalo.edu/webcat/about/about1.html "University at
Buffalo Libraries WebCatalog" 1368 "Project 008 Web Based Bibliographics
This is an experimental website that deals with the general conversion of
marc records into html. You have found a bibliographic record for an item
in the collection of the University Libraries at the University at Buffalo.
If you are a student or faculty member of UB, this item is available for
your use, click the catalog button and enter an appropriate search to get
exact location information about the item. University at Buffalo
bibliographic records Binghamton University bibliographic records The
Libraries University at Buffalo State University of New York"
.

Whew.  Now I need to get some work done.


-- 
Bill Moseley
mailto:moseley@hank.org
Received on Fri Jan 25 21:29:04 2002