My first real attempt to use the HTTP indexing method was taking too long
so I hacked together something to replace the perl "swishspider.pl" method
of fetching pages and finding links. The program is written in C (MSVC). I
don't expect it to compile without modification under any form of unix but I
figure someone might be motivated to port it. Let me know if you want to
give it a shot and I'll email the code.
It is definitely faster... I indexed 500 files in 1.5 minutes, which was a
big improvement over the perl version.
The code is a quick hack... it looks for <A ... href="<url>" ...> and
<FRAME ... src="<url>" ...>
tags and expects to see double quotes around the url. It does try to skip
comments but I would not bet on a clear parse of incorrect or
unconventional HTML. All I can say for sure is that it worked on my pages.
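For the curious, a rough sketch in C of the kind of scan described above. This is not the actual GetPage source (the function name and details are made up); it just illustrates the approach: look for <A ... href="..."> and <FRAME ... src="..."> with double-quoted urls, case-insensitively, and skip <!-- --> comments.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Case-insensitive "does s start with prefix". */
static int starts_with_ci(const char *s, const char *prefix)
{
    while (*prefix) {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/* Scan html for href="..." inside <A ...> tags and src="..." inside
   <FRAME ...> tags, skipping <!-- comments -->.  URLs must be in double
   quotes, as described above.  Stores up to max malloc'd strings in out
   and returns the count.  Quick hack: no promises on broken HTML. */
int extract_urls(const char *html, char **out, int max)
{
    int n = 0;
    const char *p = html;

    while ((p = strchr(p, '<')) != NULL && n < max) {
        if (starts_with_ci(p, "<!--")) {        /* skip HTML comments */
            const char *end = strstr(p + 4, "-->");
            if (!end)
                break;
            p = end + 3;
            continue;
        }
        const char *attr = NULL;
        if (starts_with_ci(p, "<a") && isspace((unsigned char)p[2]))
            attr = "href=\"";
        else if (starts_with_ci(p, "<frame") && isspace((unsigned char)p[6]))
            attr = "src=\"";
        if (attr) {
            const char *gt = strchr(p, '>');    /* end of this tag */
            const char *q = p;
            while (*q && (!gt || q < gt)) {
                if (starts_with_ci(q, attr)) {
                    const char *start = q + strlen(attr);
                    const char *close = strchr(start, '"');
                    if (close) {
                        size_t len = (size_t)(close - start);
                        char *url = malloc(len + 1);
                        if (url) {
                            memcpy(url, start, len);
                            url[len] = '\0';
                            out[n++] = url;
                        }
                    }
                    break;
                }
                q++;
            }
        }
        p++;
    }
    return n;
}
```

A tag whose quoted url contains a ">" will confuse the tag-boundary check, which is about the level of robustness you should expect from the real thing too.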
If you are running NT (or maybe even Win98 or Win95) you can just use the
executable at ftp://ftp.designinfo.com/GetPage.exe.
To use it, change your swish.config file to include the following line:
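With the stock http method, the directive that points SWISH-E at the spider program is SpiderDirectory, so the line would look something like this (the path is just an example):

```
SpiderDirectory C:\swish\spider
```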
GetPage is designed to notice when it is being called like swishspider.pl
and does the same thing (it also creates a couple of extra temp files that
ought to be cleaned up).
Received on Fri Apr 23 11:29:37 1999