Dave Thomson wrote:
> open(FILE,"$file") || print "error: $file";
> while ($chars <2500)
> {
> $mychar = getc(FILE);
Hello,
I didn't look at the rest, but if I were you, I would read
the 2500 chars with a single read() call instead of calling
getc() 2500 times and concatenating each character...
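A minimal sketch of that single read, assuming a plain file on disk. The helper name read_head and its arguments are made up for illustration; they are not from the quoted script.

```perl
use strict;
use warnings;

# Hypothetical helper: return up to $count characters from the start of $file.
sub read_head {
    my ($file, $count) = @_;
    open(my $fh, '<', $file) or die "error: $file: $!";
    my $buffer = '';
    # One read() call replaces the getc() loop. It returns the number of
    # characters actually read, 0 at end of file, and undef on error.
    my $got = read($fh, $buffer, $count);
    close($fh);
    die "read failed on $file: $!" unless defined $got;
    return $buffer;
}
```

Calling read_head($file, 2500) would then give the same text the getc() loop builds, or less if the file is shorter.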
> $junk =~ s/<(([^>]|\n)*)>//g;
This won't work for comments that contain an HTML tag,
e.g.: <!-- This is a <B>comment</B> -->
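To see the problem, here is a small sketch that runs the quoted substitution, unchanged, on that very comment. The sample string around the comment is made up for the demonstration.

```perl
use strict;
use warnings;

my $junk = 'before <!-- This is a <B>comment</B> --> after';
# The substitution from the quoted code, unchanged.
$junk =~ s/<(([^>]|\n)*)>//g;
# The first match stops at the ">" of "<B>", so the comment text and
# the closing "-->" leak into the result.
print "$junk\n";   # prints "before comment --> after"
```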
I am using a "slightly" more complex method myself:
# No need to set $* or $/ here: read() ignores $/, and the
# newlines are gone before any pattern is applied.
# Read first 4KB
read(S, $cbuffer, 4096);
close(S);
# We work on the whole buffer. Let's remove the newlines.
$cbuffer=~s/\n/ /g;
# I don't want the head: the title tag is already in the DB anyway.
$cbuffer=~s/<\s*head\s*>.*<\s*\/head\s*>//gi;
# Discard comments.
$cbuffer=~s/-->/\0/g;
$cbuffer=~s/<!--[^\0]*\0//g;
$cbuffer=~s/\0/ /g;
# Discard other html tags
$cbuffer=~s/<[^>]+>/ /g;
# No use displaying runs of spaces; one is enough.
$cbuffer=~s/\s+/ /g;
# Keep the first few characters
$context=substr($cbuffer, 0, 520);
# Remove the last "word" in case it was cut
$context=~s/\s+\S+$//;
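For anyone who wants to try it on a string, the steps above can be wrapped as a small function. This is only a sketch: the name html_excerpt, the $max parameter and the sample HTML are assumptions added for illustration, and the substitutions are the same ones as above.

```perl
use strict;
use warnings;

# Hypothetical wrapper: takes raw HTML text and returns a plain-text
# excerpt of at most $max characters.
sub html_excerpt {
    my ($cbuffer, $max) = @_;
    $cbuffer =~ s/\n/ /g;                            # newlines to spaces
    $cbuffer =~ s/<\s*head\s*>.*<\s*\/head\s*>//gi;  # drop the head
    $cbuffer =~ s/-->/\0/g;                          # mark comment ends
    $cbuffer =~ s/<!--[^\0]*\0//g;                   # drop whole comments
    $cbuffer =~ s/\0/ /g;                            # leftover markers
    $cbuffer =~ s/<[^>]+>/ /g;                       # drop other tags
    $cbuffer =~ s/\s+/ /g;                           # squeeze whitespace
    my $context = substr($cbuffer, 0, $max);
    $context =~ s/\s+\S+$//;                         # drop the last word
    return $context;
}

my $html = "<html><head><title>T</title></head>\n"
         . "<body><!-- a <B>comment</B> -->\n"
         . "<p>Hello <b>big</b> world again</p></body></html>";
print html_excerpt($html, 520), "\n";
```

Note that the last substitution cannot tell whether the final word was really cut, so it also drops the last word of a short text that ends on a non-space character.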
I'm not saying it's foolproof... but it should behave properly
in most cases... (and even with all those regexps, it should still
be faster than making several thousand getc() calls... :)
By the way, I don't like the filesystem method... It would be
rather awkward to password-protect an area of the site only to
display the content of those files in the search results :)
(that's just one problem among others)
Cheers,
Yann S.
--
-------------------------------------------------------------------
TheNet - Internet Services AG CohProg SaRL
stettler@thenet.ch stettler@cohprog.com
http://www.thenet.ch/ http://www.cohprog.com/
---**---
Anime and Manga Services http://www.animanga.com/
Received on Sat Dec 12 14:04:14 1998