On Wed, Jun 02, 2004 at 10:32:24AM -0700, Justin Tang wrote:
> Duplicates: 796 (1.5/sec)
Count of extracted links that had already been seen.
> MD5 Duplicates: 1 (0.0/sec)
Count of pages that were skipped because their MD5 signature matched
another page.
> Off-site links: 164 (0.3/sec)
Count of off-site links, which were skipped.
> Skipped: 114 (0.2/sec)
Those are links that were skipped for various reasons (may include some
of the ones listed above).
> Unique URLs: 108 (0.2/sec)
Those are unique URLs that were processed.
> robots.txt: 3 (0.0/sec)
And those were skipped because robots.txt told the spider to skip them.
Take a look at spider.pl if other counters pop up.
moseley@bumby:~/swish-e/prog-bin$ fgrep '$server->{counts}{' spider.pl.in | perl -pe 's/^\s+/ /'
my $val = commify( $server->{counts}{$_} );
commify( $server->{counts}{$_} ),
$server->{counts}{$_}/$start;
$server->{counts}{'Connection: Keep-Alive'}++;
$server->{counts}{'Connection: Close'}++;
$server->{counts}{'Unique URLs'}++;
if $server->{max_files} && $server->{counts}{'Unique URLs'} > $server->{max_files};
"Cnt: $server->{counts}{'Unique URLs'}",
$server->{counts}{Skipped}++;
$server->{counts}{'robots.txt'}++;
$server->{counts}{Skipped}++;
$server->{counts}{'MD5 Duplicates'}++;
$server->{counts}{Skipped}++;
$server->{counts}{Skipped}++;
$server->{counts}{Skipped}++;
$server->{counts}{'Off-site links'}++;
#$server->{counts}{Skipped}++;
$server->{counts}{Duplicates}++;
$server->{counts}{'Total Bytes'} += length $$content;
$server->{counts}{'Total Docs'}++;
if $server->{max_indexed} && $server->{counts}{'Total Docs'} >= $server->{max_indexed};
$server->{counts}{'PDF transformed'}++;
$server->{counts}{'Private Files'}++;
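If you only want the counter names rather than every line that touches them, something like the following rough sketch might help. The function name list_counters is made up for illustration and is not part of swish-e. It assumes the counters always appear as literal keys under $server->{counts}{...}, and it will also pick up keys from commented-out lines.

```shell
# Hypothetical helper (not part of swish-e): print each distinct counter
# name used as a literal key of $server->{counts}{...} in the given file.
list_counters() {
    # -o prints only the matching part; [^}$]* skips variable keys like {$_}
    grep -o '\$server->{counts}{[^}$]*}' "$1" |
        # strip the prefix, the closing brace, and any surrounding single quotes
        sed -e 's/^.*{counts}{//' -e 's/}$//' -e "s/^'//" -e "s/'\$//" |
        # fixed C locale so the ordering is the same everywhere
        LC_ALL=C sort -u
}
```

Run it as "list_counters spider.pl.in" from the prog-bin directory. Note that it only lists the names; to see why a given counter is bumped you still have to read the surrounding code in spider.pl.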
--
Bill Moseley
moseley@hank.org
Received on Wed Jun 2 11:49:03 2004