At 04:42 PM 11/02/01 -0800, Bruce Pettyjohn wrote:
>I have noticed that there are duplicate entries for the URLs which are
>replicated on many pages. There does not seem to be any way to ensure
>that this does not happen.
>Is it a bug or is there a configuration error on my part?
What do you mean by duplicate?
Do you mean the spider is indexing pages more than once?
># Duplicates: 80,345 (6.2/sec)
That "Duplicates:" line counts the links the spider found that point to a
URL it had already spidered, i.e. links it skipped because it had already
seen that URL.
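To illustrate what a counter like that measures, here is a minimal Python sketch. It is not the spider's actual code: the names `crawl` and `links_on` and the little `site` table are made up for the example, and it only assumes the spider keeps a set of URLs it has already seen.

```python
from collections import deque

# Hypothetical three-page site: every page links to the other two.
site = {
    "/": ["/a", "/b"],
    "/a": ["/", "/b"],
    "/b": ["/", "/a"],
}


def crawl(start, links_on):
    """Walk a link graph breadth-first, counting links to already-seen URLs."""
    seen = {start}          # URLs queued or fetched so far
    queue = deque([start])
    fetched = 0             # pages actually fetched (each URL only once)
    duplicates = 0          # links skipped because the URL was already seen
    while queue:
        url = queue.popleft()
        fetched += 1
        for link in links_on(url):
            if link in seen:
                duplicates += 1   # counted, but not fetched again
                continue
            seen.add(link)
            queue.append(link)
    return fetched, duplicates


fetched, duplicates = crawl("/", lambda url: site.get(url, []))
print(f"{fetched} files fetched, {duplicates} duplicate links skipped")
```

In this toy case three pages are fetched once each while four links are counted as duplicates, so a heavily cross-linked site can show a large "Duplicates:" number without any page being indexed twice.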
># 31856 unique words indexed.
># 5 properties sorted.
># 13339 files indexed. 198504356 total bytes.
># Elapsed time: 03:36:03 CPU time: 00:16:20
Are you indexing a remote web server? Or do you have a delay set? I'm
wondering why it's taking 3 1/2 hours to index.
If you have a current LWP setup you can run with "keep alive", which reuses
one connection for many requests and will help both you (faster indexing)
and the server (fewer connections).
BTW -- there's also a way to index links, so you can ask "what pages link
to this URL".
Sorry, I guess I'm not clear on the problem.
Bill Moseley
mailto:moseley@hank.org
Received on Sat Nov 3 01:05:34 2001