Thanks for doing this. I am testing your routine at one of our sites:
http://www2.ucsc.edu/cats/sc/tools/search.shtml
and here are my experiences with it:
1. I had to increase the number of characters read to 2048;
otherwise the extract often disappeared altogether once the various
tags at the start of a document had been eliminated (meta tags and
other proprietary tags inserted automatically by some web editors).
2. By the same token, I've decreased the number of words returned to no
more than 50.
3. I've inserted the line:
s/<!--.*?-->//gs; # remove comment tags (non-greedy, across newlines)
to remove comment tags. I do this first.
4. You are using the "description" meta tag to extract the description
of the page. Is this use universal? I'm curious to learn whether
there is a well-defined standard (I plead ignorance about this), or
whether there is a variety of meta tags in use (e.g. "abstract",
"subject").
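
For reference, here is a rough sketch of how points 1 to 4 might fit together. It is not the original routine: the name make_extract, its parameters, and the list of meta names are hypothetical, chosen only for illustration. It assumes the document is already in a string, that a meta tag has its name attribute before its content attribute, and that the content is double-quoted. The comment pattern is written non-greedy with /s so that it stops at the first "-->" and can span line breaks.

```perl
use strict;
use warnings;

# Hypothetical helper (not the original routine): build a short extract
# from the first $max_chars characters of an HTML document.
sub make_extract {
    my ($html, $max_chars, $max_words) = @_;
    $max_chars = 2048 unless defined $max_chars;   # point 1
    $max_words = 50   unless defined $max_words;   # point 2

    my $chunk = substr($html, 0, $max_chars);

    # Point 3: remove comments first, before looking at anything else.
    $chunk =~ s/<!--.*?-->//gs;

    # Point 4: use a meta tag if one is present. The alternative names
    # are only the ones raised in the question, not a known standard.
    for my $name (qw(description abstract subject)) {
        if ($chunk =~ /<meta\s+name\s*=\s*["']?\Q$name\E["']?\s+content\s*=\s*"([^"]*)"/i) {
            return $1;
        }
    }

    # Otherwise strip the remaining tags, including a tag cut in half
    # by the character limit, and keep at most $max_words words.
    $chunk =~ s/<[^>]*>//g;
    $chunk =~ s/<[^>]*\z//;
    my @words = split ' ', $chunk;
    splice(@words, $max_words) if @words > $max_words;
    return join ' ', @words;
}

print make_extract('<html><!-- editor --><body><p>Hello <b>world</b></p></body></html>'), "\n";
```

Note that this simple tag stripping keeps the text of elements such as the title, so a real routine would probably want to treat those separately.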
- Jacques Delsemme (jacques@cats.ucsc.edu)
Workstation Support - CATS
University of California
Santa Cruz, CA 95064
(831) 459-2642
Received on Fri Dec 4 13:26:48 1998