Re: Re: Extracting descriptions

From: Paul J. Lucas <pjl(at)not-real.ptolemy.arc.nasa.gov>
Date: Fri Dec 04 1998 - 23:01:33 GMT

On Fri, 4 Dec 1998, Jacques Delsemme wrote:

> 1. I had to increase the number of characters read to 2048 characters, 

	OK.

> 2. By the same token, I've decreased the number of words returned to no 
> more than 50.

	I'll make this a parameter to the function.

> 3. I've inserted the line:
> 
> 	s/<!--.*-->//gi;                    # remove comments tags
> 
> to remove comments tags.  I do this first.

	I don't understand why.  The line:

		s!<.*?>!!g;

	in my code will remove comments also.  I don't see why it has to
	be done first.  Please explain.
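
	To make the comparison concrete, here is a minimal sketch that
	runs both substitutions over a few made-up one-line inputs.  The
	helper names and the sample strings are assumptions for
	illustration only; they are not taken from the actual function.

```perl
use strict;
use warnings;

# Hypothetical helpers, named only for this illustration.

# Tag removal alone: non-greedy, so each match stops at the first '>'.
sub strip_tags {
    my ($html) = @_;
    $html =~ s!<.*?>!!g;
    return $html;
}

# Comment removal first, then tag removal.  The comment pattern is
# greedy, so it runs from the first '<!--' to the last '-->' on a line.
# Neither pattern uses /s, so '.' never crosses a newline.
sub strip_comments_then_tags {
    my ($html) = @_;
    $html =~ s/<!--.*-->//g;
    $html =~ s!<.*?>!!g;
    return $html;
}

my @samples = (
    'A<!-- plain comment -->B',          # both give "AB"
    'A<!-- if x > y -->B',               # tag removal stops at the inner '>'
    'A<!-- one -->KEEP<!-- two -->B',    # greedy comment removal eats "KEEP"
);

for my $s (@samples) {
    printf "%-32s tags only: [%s]  comments first: [%s]\n",
        $s, strip_tags($s), strip_comments_then_tags($s);
}
```

	In this sketch the two approaches differ only when a comment
	contains a ">" or when one line holds more than one comment.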

> 4. You are using the "description" meta tag to extract the description of the
> page.  Is this use universal?

	Probably not universal, but fairly common.  See:

		http://www.w3.org/TR/REC-html40/appendix/notes.html#recs

	under "Provide keywords and descriptions."  Although it says:

		The value of the name attribute sought by a
		search engine is not defined by this
		specification.

	the example given uses "description."  For what it's worth,
	AltaVista uses "description"; see:

		http://www.altavista.com/av/content/addurl_meta.htm

	Excite doesn't use META tags at all.  Hotbot points one to:

		http://searchenginewatch.internet.com/webmasters/meta.html

	which also uses "description."  (They also point you to a
	"Search Engine Features" page, but that page states that
	AltaVista doesn't use META tags, which is wrong.)

	There is also the "Dublin Core" set of names:

		http://purl.oclc.org/dc/

	All of their names start with "DC." so their description would
	look like:

		<META NAME="DC.description" CONTENT="blah blah">

	I've changed the regular expression in the Perl function to
	allow an optional "DC." before "description":

		(?:DC\.)?description
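
	As a rough sketch of how such a pattern could be used, with a
	"?" after the group so that the "DC." prefix really is optional:
	the helper name is hypothetical, and the sketch assumes that
	NAME comes before CONTENT and that both values are double-quoted,
	which real pages will not always satisfy.

```perl
use strict;
use warnings;

# Hypothetical helper; the real function is not shown in this message.
# Returns the CONTENT value of a "description" or "DC.description"
# META tag, or undef when no such tag is found.
sub meta_description {
    my ($html) = @_;
    # Assumes NAME precedes CONTENT and both values are double-quoted.
    if ( $html =~ /<META\s+NAME="(?:DC\.)?description"\s+CONTENT="([^"]*)"/i ) {
        return $1;
    }
    return;
}

for my $tag (
    '<META NAME="description" CONTENT="plain form">',
    '<META NAME="DC.description" CONTENT="Dublin Core form">',
    '<META NAME="keywords" CONTENT="not a description">',
) {
    my $d = meta_description($tag);
    print defined $d ? "matched: $d\n" : "no match\n";
}
```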

	- Paul
Received on Fri Dec 4 15:02:23 1998