Re: Summary of indexing report

From: <moseley(at)not-real.hank.org>
Date: Thu May 29 2003 - 13:04:12 GMT
On Thu, May 29, 2003 at 04:31:45AM -0700, Patrick Tinley wrote:
> Hi all,
> 
> I'm just wondering if its possible to have the Summary Report
> (see below) which appears at the end of the indexing process
> detached & e-mailed to myself.
> Indexing is on a weekly cronjob.
> I'd like to keep a record of how the site grows.

Others may be better at answering this.

Cron normally emails any output to the user (see man 5 crontab to check 
whether you can set the email address).

You have output from two things below:

> Summary for: http://www.mysite.com/index.shtml
>      Duplicates:     535,248  (214.4/sec)
>      Off-site links:      12,996  (5.2/sec)
>      Total Bytes: 246,865,706  (98904.5/sec)
>      Total Docs:      16,824  (6.7/sec)
>      Unique URLs:      17,329  (6.9/sec)

That's from spider.pl and it is written to stderr.  That output can be 
disabled by setting SPIDER_QUIET=1 in your environment when indexing.
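
As a rough sketch of the SPIDER_QUIET setting, a small wrapper can set 
the variable for just one command.  The function name quiet_index is 
made up for illustration; pass it whatever indexing command you already 
run from cron.

```shell
# quiet_index COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND with SPIDER_QUIET=1 set in its
# environment only, so spider.pl does not print its summary to stderr.
quiet_index() {
    SPIDER_QUIET=1 "$@"
}
```

Setting the variable this way leaves the rest of your cron environment 
untouched.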

> Removing very common words...
  [...]
> 66602 unique words indexed.
> 9 properties sorted.
> 16824 files indexed.  246865706 total bytes.  12192895 total words.
> Elapsed time: 00:42:28 CPU time: 00:04:02

And that's from the swish-e binary and, by default, is written to 
stdout.  You can use -E <file> to append swish-e's output to a file, 
or, without <file>, send the output to stderr instead of stdout.  Or 
just redirect stdout to a file.

So that gives you a few options since you can pick which output goes 
where.  You might capture the spider output one place and the swish-e 
summary someplace else.
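
For example, plain shell redirection is enough to send the two streams 
to different files.  This is only a sketch: the function name and both 
file arguments are placeholders, and it assumes the spider summary 
arrives on stderr and the swish-e summary on stdout as described above.

```shell
# split_output OUT_FILE ERR_FILE COMMAND [ARGS...]
# Hypothetical helper: appends COMMAND's stdout to OUT_FILE and its
# stderr to ERR_FILE, then returns COMMAND's exit status.
split_output() {
    out_file=$1
    err_file=$2
    shift 2
    "$@" >>"$out_file" 2>>"$err_file"
}
```

Because both files are opened in append mode, each weekly run adds to 
the existing record instead of replacing it.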

Note that when indexing, swish-e shows a progress report and uses \r to 
overwrite its percent-complete messages.  Those will look ugly in a 
file or an email, so you will probably want to filter swish-e's output.
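
One way to filter them, as a sketch only: keep just the text after the 
last \r on each line, which is what a terminal would have left on 
screen.  The function name strip_progress is made up, and I'm assuming 
the progress messages share a line with the final text that overwrites 
them.

```shell
# strip_progress
# Hypothetical filter: reads stdin and, for every line, prints only the
# text that follows the last carriage return on that line.
strip_progress() {
    awk '{ n = split($0, parts, "\r"); print parts[n] }'
}
```

You would pipe swish-e's stdout through it before saving or mailing the 
result.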

What I'd do is save output to a file while indexing.  If swish-e exits
with a non-zero exit code then email the entire file (just use cat in
cron and then it should automatically be sent).  If swish-e exits
with a zero exit code then use grep to extract just the data you want
emailed, or write a little script to append interesting data to a
(comma separated values?) file for later processing.
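
A minimal sketch of that approach follows.  The function name, the log 
path and the grep pattern are all placeholders you would choose 
yourself; a pattern such as 'files indexed|unique words indexed' would 
match lines like those in the summary quoted above.

```shell
# run_and_report LOG PATTERN COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND with all output saved to LOG.
# On success it prints only the lines matching PATTERN; on failure it
# prints the whole log and returns COMMAND's exit status.  Anything
# printed here is what cron would mail.
run_and_report() {
    log=$1
    pattern=$2
    shift 2
    if "$@" >"$log" 2>&1; then
        # Success: report only the interesting lines (none is not an error).
        grep -E "$pattern" "$log" || true
    else
        status=$?
        # Failure: report everything so the problem can be diagnosed.
        cat "$log"
        return "$status"
    fi
}
```

Appending the grep output to a file instead of printing it would give 
you the running record of how the site grows.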


-- 
Bill Moseley
moseley@hank.org
Received on Thu May 29 13:04:21 2003