swishspider only indexing file names

From: Ben Caldwell <caldwell(at)not-real.trace.wisc.edu>
Date: Thu Aug 10 2000 - 20:46:49 GMT
More questions about swishspider...

This may be related to the thread "HTTP spidering - zero results" that
went around the list in June, but I'm not sure whether a resolution was
ever reached.

When indexing via HTTP, I get only as many unique words as there are
files that I attempt to index. I have pasted a sample of the output
below. In this case I'm only trying to index the first page of the site,
but if I set the MaxDepth variable higher than 1, I still end up with
only as many unique words as files that swishspider attempts to index.

--------------------------------------------------------------------------------------------------------
Indexing Data Source: "HTTP-Crawler"
Indexing http://www.access-board.gov..
retrieving http://www.access-board.gov (0)...
  (1 words)
Skipping http://www.access-board.gov/:  Too deep.
Skipping http://www.access-board.gov/indexes/Newsfile-contents.htm:  Too deep.
..

Removing very common words...
no words removed.
Writing main index...
Computing hash table ...
Writing header ...
Writing index entries ...
Writing stopwords ...
1 unique word indexed.
Writing file index...
Writing file list ...
Writing file offsets ...
Writing MetaNames ...
Writing offsets (2)...
1 file indexed.
Running time: 6 seconds.
Indexing done!
--------------------------------------------------------------------------------------------------------
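
For anyone comparing several runs, the symptom in the output above (unique words no greater than files indexed) can be checked mechanically. The following is only a minimal sketch in Python, assuming the output format matches the sample pasted above. The function name, the regular expressions, and the shortened SAMPLE text are illustrative assumptions, not part of swish-e.

```python
import re

# Shortened copy of the indexer output quoted above (assumed format).
SAMPLE = """\
Indexing Data Source: "HTTP-Crawler"
retrieving http://www.access-board.gov (0)...
  (1 words)
Skipping http://www.access-board.gov/:  Too deep.
1 unique word indexed.
Writing offsets (2)...
1 file indexed.
"""


def summarize_index_log(log_text):
    """Extract per-file word counts and the final totals from indexer output.

    Hypothetical helper: it only reads the text of a log, it does not
    run or configure the indexer.
    """
    # Lines such as "  (1 words)" report the words found in each file.
    per_file = [int(n) for n in re.findall(r"\((\d+) words?\)", log_text)]

    # Final totals, e.g. "1 unique word indexed." and "1 file indexed."
    unique = re.search(r"(\d+) unique words? indexed", log_text)
    files = re.search(r"(\d+) files? indexed", log_text)
    unique_words = int(unique.group(1)) if unique else None
    files_indexed = int(files.group(1)) if files else None

    # Flag the symptom: no more unique words than files indexed.
    suspicious = (
        unique_words is not None
        and files_indexed is not None
        and files_indexed > 0
        and unique_words <= files_indexed
    )
    return {
        "per_file_words": per_file,
        "unique_words": unique_words,
        "files_indexed": files_indexed,
        "suspicious": suspicious,
    }


if __name__ == "__main__":
    print(summarize_index_log(SAMPLE))
```

A run flagged as suspicious only says that the totals look like the ones above; it does not say why the page content was not indexed.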

Any ideas?

Thanks,

-Ben
--
Ben Caldwell - Web/Information Specialist
Trace Research & Development Center
email: caldwell(at)not-real.trace.wisc.edu | http://www.trace.wisc.edu
Tel: 608.265.2064 | Fax: 608.262.8848 | TTY: 608.263.5408
Received on Thu Aug 10 16:50:34 2000