Re: Re: Extracting descriptions

From: Jacques Delsemme <jacques(at)not-real.cats.UCSC.EDU>
Date: Fri Dec 04 1998 - 23:58:47 GMT
> > 3. I've inserted the line:
> > 
> > 	s/<!--.*-->//gi;                    # remove comments tags
> > 
> > to remove comments tags.  I do this first.
> 
> 	I don't understand why.  The line:
> 
> 		s!<.*?>!!g;
> 
> 	in my code will remove comments also.  I don't see why it has to
> 	be done first.  Please explain.

Comments can contain other tags, and your regular expression only 
removes characters up to the next > (which is right for every other 
tag, since none of them can contain tags).  For instance, consider the 
comment:

	<!-- always insert UC Santa Cruz in your <title> tag -->

Your expression stops at the > that closes <title>, so the text 
" tag -->" is left behind in the description.

There was also an error in the regular expression I provided to remove 
comments.  It should have a question mark after the .* to match only up 
to the first -->, that is:

	s/<!--.*?-->//gi;

since comment tags cannot be nested within one another (although it's 
OK for them to contain other tags).  You may also need to provide a way 
to remove comments without a closing comment tag (Ugh!).
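
To make the ordering concrete, here is a minimal sketch of the three 
substitutions applied in sequence.  The sub name strip_markup is 
hypothetical and not part of the function under discussion.  The /s 
modifier (so that a comment may span several lines) and the policy of 
dropping everything after an unclosed <!-- are assumptions of mine, 
not something settled in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only.
sub strip_markup {
    my ($html) = @_;

    # 1. Closed comments first.  Non-greedy, so each match stops at the
    #    nearest "-->"; /s lets "." match newlines inside a comment.
    $html =~ s/<!--.*?-->//gs;

    # 2. Assumed policy for a comment with no closing "-->":
    #    discard everything from "<!--" to the end of the input.
    $html =~ s/<!--.*//s;

    # 3. All remaining tags.
    $html =~ s/<.*?>//gs;

    return $html;
}

my $sample =
    "A<!-- always insert UC Santa Cruz in your <title> tag -->B<b>C</b>";
print strip_markup($sample), "\n";    # prints "ABC"
```

Running only step 3 on the same sample leaves "A tag -->BC", which is 
the leftover described above.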

Another glitch I ran into concerns binary files.  For instance, on our 
site we index the file names (but not the contents) of GIF files.  The 
function you generously supplied will return their first 1024 
characters even though they are meaningless as a description.  Is there 
an easy way to recognize whether a file is binary?  It'd be tedious to 
list every binary extension (.gif, .jpg, .exe, ...) and filter those 
files out before getting the description.
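
One possible approach, sketched here without any claim that it suits 
every site, is Perl's built-in -T file test, which reads the start of 
a file and guesses whether it is text.  The sub name looks_like_text 
is hypothetical.  The test is a heuristic: it can misjudge some files, 
and it reports an empty file as text.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $path is a plain file that Perl's
# -T heuristic considers text, 0 otherwise.
sub looks_like_text {
    my ($path) = @_;
    return (-f $path && -T $path) ? 1 : 0;
}

# Usage: report on each file named on the command line.
for my $path (@ARGV) {
    printf "%s: %s\n", $path,
        looks_like_text($path) ? "text" : "binary or unreadable";
}
```

A description would then be extracted only for files where 
looks_like_text returns 1, with no list of extensions to maintain.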

Thanks for the information on the use of the meta "description" tag.

- Jacques (jacques@cats.ucsc.edu)
Received on Fri Dec 4 16:00:57 1998