More questions about swishspider...
This may be related to the thread "HTTP spidering - zero results" that
bounced around on the list in June, but I wasn't sure if a resolution was
ever reached.
When attempting to index via HTTP, I seem to get only as many unique words
as there are files that I attempt to index. I have pasted a sample of the
results below. In this case, I'm only trying to index the first page of the
site, but if I set the MaxDepth variable higher than 1, I still end up with
only as many unique words as the number of files swishspider attempts to
index.
--------------------------------------------------------------------------------------------------------
Indexing Data Source: "HTTP-Crawler"
Indexing http://www.access-board.gov..
retrieving http://www.access-board.gov (0)...
(1 words)
Skipping http://www.access-board.gov/: Too deep.
Skipping http://www.access-board.gov/indexes/Newsfile-contents.htm: Too deep.
..
Removing very common words...
no words removed.
Writing main index...
Computing hash table ...
Writing header ...
Writing index entries ...
Writing stopwords ...
1 unique word indexed.
Writing file index...
Writing file list ...
Writing file offsets ...
Writing MetaNames ...
Writing offsets (2)...
1 file indexed.
Running time: 6 seconds.
Indexing done!
--------------------------------------------------------------------------------------------------------
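To check whether the one-word-per-file pattern holds on larger runs, the log can be summarized per URL. The following is only a rough Python sketch: the regular expressions are assumptions based on the log lines pasted above (other swish-e versions may print different text), and the function name summarize is made up for illustration.

```python
import re

# Patterns modeled on the log lines pasted above. They are assumptions,
# not a documented swish-e output format.
RETRIEVE = re.compile(r"^retrieving (\S+) \((\d+)\)\.\.\.")
WORDS = re.compile(r"^\s*\((\d+) words?\)")
SKIP = re.compile(r"^Skipping (\S+): (.+?)\.?$")


def summarize(log_text):
    """Return (fetched, skipped) parsed from an indexing log.

    fetched is a list of (url, depth, word_count) tuples;
    skipped is a list of (url, reason) tuples.
    """
    fetched, skipped = [], []
    current = None  # (url, depth) still waiting for its word count
    for line in log_text.splitlines():
        m = RETRIEVE.match(line)
        if m:
            current = (m.group(1), int(m.group(2)))
            continue
        m = WORDS.match(line)
        if m and current:
            fetched.append((current[0], current[1], int(m.group(1))))
            current = None
            continue
        m = SKIP.match(line)
        if m:
            skipped.append((m.group(1), m.group(2)))
    return fetched, skipped
```

Passing the saved indexer output to summarize() gives one word count per retrieved URL. If every count is 1 regardless of page size, that would suggest the fetched content is not reaching the indexer intact, rather than a depth setting problem.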
Any ideas?
Thanks,
-Ben
--
Ben Caldwell - Web/Information Specialist
Trace Research & Development Center
email: caldwell(at)not-real.trace.wisc.edu | http://www.trace.wisc.edu
Tel: 608.265.2064 | Fax: 608.262.8848 | TTY: 608.263.5408
Received on Thu Aug 10 16:50:34 2000