> > 3. I've inserted the line:
> >
> > s/<!--.*-->//gi; # remove comments tags
> >
> > to remove comments tags. I do this first.
>
> I don't understand why. The line:
>
> s!<.*?>!!g;
>
> in my code will remove comments also. I don't see why it has to
> be done first. Please explain.

Comments can contain other tags, and your regular expression removes
characters only up to the next > (as it should, since no other tag can
contain tags). For instance, consider the comment:

<!-- always insert UC Santa Cruz in your <title> tag -->

Your substitution would stop at the > of <title> and leave " tag -->"
behind in the text.
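
To make the ordering issue concrete, here is a minimal sketch that applies the two substitutions in both orders to that comment. The variable names and the surrounding "before"/"after" text are made up for illustration; the comment-removal pattern is the non-greedy form.

```perl
use strict;
use warnings;

# Sample text (an assumption for illustration): the comment from this
# message, surrounded by two ordinary words.
my $html = 'before <!-- always insert UC Santa Cruz in your <title> tag --> after';

# Generic tag removal first: the match starting at "<!--" stops at the
# ">" that closes <title>, so the tail of the comment survives.
(my $tags_first = $html) =~ s!<.*?>!!g;

# Comment removal first, then generic tag removal.
(my $comments_first = $html) =~ s/<!--.*?-->//gi;
$comments_first =~ s!<.*?>!!g;

print "tags first:     [$tags_first]\n";
print "comments first: [$comments_first]\n";
```

With the generic substitution alone, the text " tag -->" is left in the output; removing comments first leaves nothing of the comment.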

There was also an error in the regular expression I provided to remove
comments. It should have a question mark after the .* so that it matches
only up to the first -->, that is:

s/<!--.*?-->//gi;

since comments cannot be nested within one another (although it's OK
for them to contain other tags). You may also need to provide a way to
remove comments that have no closing --> (Ugh!).
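
As a rough sketch of both points, the following compares the greedy and non-greedy patterns and adds one possible way to deal with an unclosed comment. The helper name strip_comments is hypothetical, and treating everything after an unclosed <!-- as comment text is an assumption, not the only reasonable policy. The sketch also uses /s so that . matches newlines, which matters when a comment spans several lines; the /i flag is dropped because the pattern contains no letters.

```perl
use strict;
use warnings;

# Hypothetical helper: remove HTML comments from a string.
sub strip_comments {
    my ($text) = @_;

    # Non-greedy .*? stops at the first "-->"; /s lets "." match
    # newlines so multi-line comments are removed as well.
    $text =~ s/<!--.*?-->//gs;

    # Assumption: an opening "<!--" with no closing "-->" hides
    # everything up to the end of the text.
    $text =~ s/<!--.*\z//s;

    return $text;
}

my $sample = 'a <!-- one --> b <!-- two --> c';

# The greedy form swallows the text between the two comments.
(my $greedy = $sample) =~ s/<!--.*-->//g;

print "greedy:     [$greedy]\n";
print "non-greedy: [", strip_comments($sample), "]\n";
```

The greedy pattern removes the "b" that sits between the two comments, while the non-greedy one keeps it.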

Another glitch I ran into concerns binary files. For instance, on our
site we index the file names (but not the contents) of GIF files. The
function you generously supplied will return their first 1024
characters, even though they are meaningless as a description. Is there
an easy way to recognize whether a file is binary? It'd be tedious to
list every binary extension (.gif, .jpg, .exe, ...) and filter those
files out before getting the description.
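
One candidate, offered only as a sketch: Perl's built-in -B file test guesses whether a file is binary by inspecting its first block of data. It is a heuristic, so it can misjudge unusual files, and an empty file passes both -T and -B. The helper name looks_binary below is hypothetical, and deciding that empty or non-regular files are "not binary" is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the file looks binary and should be
# skipped when extracting a description.
sub looks_binary {
    my ($path) = @_;

    # Assumption: directories, missing files and empty files are not
    # treated as binary (an empty file passes both -T and -B).
    return 0 unless -f $path && -s $path;

    # -B is Perl's heuristic test based on the first block of the file.
    return -B $path ? 1 : 0;
}

# Report on any files named on the command line.
for my $file (@ARGV) {
    printf "%-40s %s\n", $file, looks_binary($file) ? 'binary' : 'text';
}
```

The indexing code could then call such a check before reading the first 1024 characters, and use an empty description when it returns true.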

Thanks for the information on the use of the meta "description" tag.

- Jacques (jacques@cats.ucsc.edu)
Received on Fri Dec 4 16:00:57 1998