Re: Re: Extracting descriptions

From: Paul J. Lucas <pjl(at)>
Date: Sat Dec 05 1998 - 00:10:24 GMT

On Fri, 4 Dec 1998, Jacques Delsemme wrote:

> Comment tags can contain other tags, and your regular expression only 
> removes characters up to the next > (as it should, for all the other 
> tags cannot contain other tags).

	Oh, thanks for the explanation: makes sense.

> There was also an error in the regular expression I provided to remove 
> comments.  It should have a question mark after the .* to match only up 
> to the first -->, that is:
> 	s/<!--.*?-->//gi;

	OK.  Nit: you don't need the 'i'.
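
	A minimal sketch of the greedy vs. non-greedy difference, assuming
	the whole document is held in one string.  The sample text and the
	variable names are made up for illustration.  It also shows that,
	without the /s modifier, '.' does not match a newline, so a comment
	spanning several lines is left alone:

```perl
use strict;
use warnings;

# Hypothetical sample: two comments on one line, then a multi-line comment.
my $html = "a<!-- one -->b<!-- two -->c\n<!-- multi\nline -->d\n";

# Greedy: .* runs to the LAST --> on the line, so the "b" between the
# two comments is removed as well.
(my $greedy = $html) =~ s/<!--.*-->//g;

# Non-greedy: .*? stops at the FIRST -->, so "b" survives.
(my $lazy = $html) =~ s/<!--.*?-->//g;

# With /s, '.' also matches newlines, so the multi-line comment goes too.
(my $lazy_s = $html) =~ s/<!--.*?-->//gs;

print $greedy, "--\n", $lazy, "--\n", $lazy_s;
```

	Whether /s is wanted depends on whether the input is processed line
	by line or as a single string.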

> Another glitch I ran into concerns binary files.  ...  Is there an easy way
> to recognize if a file is binary?

	Yes, the Perl -B file test.  (See p. 85 in the Programming Perl
	book, 2nd. ed.)
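
	A rough sketch of how the test might be used.  The helper name
	is_probably_binary is hypothetical.  Note that -B is a heuristic
	based on the first block of the file, and that an empty file passes
	both -T and -B, so it is handled separately here:

```perl
use strict;
use warnings;

# Hypothetical helper: true if $path is a plain file that Perl's -B
# heuristic guesses to be binary.
sub is_probably_binary {
    my ($path) = @_;
    return 0 unless -f $path;    # not a plain file (or does not exist)
    return 0 if -z $path;        # empty files pass both -T and -B
    return -B $path ? 1 : 0;     # heuristic guess from the first block
}

# Usage: report which of the files named on the command line look binary.
for my $file (@ARGV) {
    print "$file: ", (is_probably_binary($file) ? "binary" : "text"), "\n";
}
```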

> It'd be tedious to list all of them (.gif, .jpg, .exe, ...), and filter them
> out before getting the description.

	Wouldn't simply checking the filename extension for /\.txt$/
	work?  You most likely don't want the first 50 words of a
	PostScript file (say) even though a PostScript file is a text
	file.
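
	Roughly what that check could look like.  This is only a sketch:
	the helper name description_for and its default of 50 words are
	assumptions for illustration, not the actual extraction code:

```perl
use strict;
use warnings;

# Hypothetical helper: return up to the first $max words of a .txt file,
# or undef if the filename does not end in .txt or cannot be opened.
sub description_for {
    my ($path, $max) = @_;
    $max = 50 unless defined $max;
    return unless $path =~ /\.txt$/;     # filter on filename extension
    open(my $fh, '<', $path) or return;
    my @words;
    while (my $line = <$fh>) {
        push @words, split ' ', $line;   # split on runs of whitespace
        last if @words >= $max;
    }
    close $fh;
    $#words = $max - 1 if @words > $max; # trim any overshoot
    return join ' ', @words;
}
```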

	- Paul
Received on Fri Dec 4 16:11:15 1998