On Tue, 1 Dec 1998, Brian Rankin wrote:
> The Perl script pulls-out the first 300 non-html characters from each
> document that matches the query, and uses that as the desription. This
> happens interactively for each search result. The software all resides on an
> old Sparc-10 with 48MB RAM (about as fast as a 486-66). Not too shabby for a
> Perl script...
I'd like to include a Perl library function to do just this in
the next release of SWISH++ so people don't have to keep
reinventing the wheel. Here's what I've come up with:
-------------------------------------------------------------------------------
sub extract_description {
my $path_name = shift;
open( FILE, $path_name ) or die "can not open $path_name\n";
read( FILE, $_, 1024 ); # pull in chunk o' text
close FILE;
s/\s+/ /g; # turn into one big long string
if ( $path_name =~ /\.s?html?$/ ) { # file is HTML
return $3 if /<META\s[^>]*?NAME\s*=\s*(['"])?description\1?\s[^>]*?CONTENT\s*=\s*(['"])([^>]+)\2[^>]*?>/i;
return $2 if /<META\s[^>]*?CONTENT\s*=\s*(['"])([^>]+)\1\s[^>]*?NAME\s*=\s*(?:['"])?description[^>]*?>/i;
s!<(SCRIPT|STYLE|TITLE).*?>.*?</\1>!!gi;
s!<(?:SCRIPT|STYLE|TITLE).*$!!i;
s!<.*?>!!g; # zap all other HTML tags
s!<.*$!!g; # zap partial trailing tag
}
s/^\s+//; # zap leading whitespace
s/^((?:[^\s]+\s+){100}).*/$1/; # return at most 100 words
return $_;
}
-------------------------------------------------------------------------------
For HTML files, if it contains a META description, take that
string and be done with it; otherwise, strip out all text
between SCRIPT, STYLE, and TITLE begin/end tags, and then strip
all remaining tags.
At this point for an HTML file (or initially for plan text
files), throw away all but the first 100 words (at most).
Sound reasonable?
- Paul
Received on Thu Dec 3 15:08:10 1998