
Narrowing http spidering

From: Alexandre Gefen <listes(at)not-real.fabula.org>
Date: Mon Nov 16 1998 - 11:57:13 GMT
With the http option, the config file only allows limiting the number of
links the spider will follow.
Does anyone have an idea of how to narrow spidering to a subdirectory
(http://myserver/mydirectory/...), indexing all of its subdirectories
(http://myserver/mydirectory/sub/sub/sub/example.html) but never going out
to the main server (http://myserver)? If there is any link in this
subdirectory to a file on http://myserver/, that link will be followed,
which is very annoying. Is there a way to hack the C source or the Perl
libraries used for spidering for this purpose?
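
Something like the following Perl check is what I have in mind (a rough
sketch only; ok_to_follow and $base are my own illustrative names, not
part of swish-e itself):

    #!/usr/bin/perl
    use strict;

    # Only follow URLs under the allowed prefix; links back to the
    # main server outside of it would be skipped.
    my $base = 'http://myserver/mydirectory/';

    sub ok_to_follow {
        my ($url) = @_;
        return index($url, $base) == 0;   # true only below $base
    }

    # e.g. the first URL is followed, the second is skipped:
    for my $url ('http://myserver/mydirectory/sub/sub/sub/example.html',
                 'http://myserver/index.html') {
        print "$url => ", ok_to_follow($url) ? "follow" : "skip", "\n";
    }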
Also: is there a way to customize which files will be scanned (by
extension, like with the file option)?
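
For this second point, I imagine a similar filter on the URL's extension
(again just a sketch, with a made-up name and extension list):

    # Hypothetical filter: scan only files whose extension is in the
    # list, similar to what the file method already allows.
    sub ok_extension {
        my ($url) = @_;
        return $url =~ /\.(html?|txt)$/i;   # .htm, .html, .txt only
    }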

Best regards,

Alexandre Gefen


PS: With the help of David Norris, swish-e is working very, very well on
NT 4.0. I'm indexing more than 1000 web pages per hour to create a web
search engine devoted to French literature!
Received on Mon Nov 16 03:58:32 1998