Re: Displaying index results with summaries (no meta tags

From: Yann Stettler <stettler(at)not-real.cohprog.com>
Date: Sat Dec 12 1998 - 22:07:36 GMT
Dave Thomson wrote:

>    open(FILE,"$file") || print "error: $file";
>    while ($chars <2500)
>      {
>      $mychar = getc(FILE);

Hello,
I didn't look at the rest, but if I were you I would read
the 2500 chars directly in one system call instead of doing
2500 getc() calls and concatenating each character onto a string.
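
A minimal sketch of that suggestion, assuming a reasonably current Perl.
The helper name read_head is hypothetical and not part of the quoted
script; it replaces the getc() loop with a single read() call:

```perl
use strict;
use warnings;

# Hypothetical helper: return up to $limit characters from the start of $file.
sub read_head {
    my ($file, $limit) = @_;
    open(my $fh, '<', $file) or die "error: $file: $!";
    my $buffer = '';
    # One read() call instead of $limit getc() calls plus string appends.
    my $got = read($fh, $buffer, $limit);
    close($fh);
    die "read failed on $file: $!" unless defined $got;
    return $buffer;
}
```

Calling read_head($file, 2500) would then stand in for the whole
while loop. A file shorter than the limit simply yields a shorter string.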

> $junk =~ s/<(([^>]|\n)*)>//g;

This won't work for a comment that contains an HTML tag,
e.g.: <!-- This is a <B>comment</B> -->
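
To make the failure concrete, here is a small test with the quoted
substitution. The surrounding words "Before" and "after" are only there
for illustration:

```perl
use strict;
use warnings;

my $junk = 'Before <!-- This is a <B>comment</B> --> after';

# The quoted pattern stops at the first '>', which is the one closing <B>,
# not the one that ends the comment.
(my $stripped = $junk) =~ s/<(([^>]|\n)*)>//g;

print "$stripped\n";   # prints "Before comment --> after"
```

The word "comment" and the closing "-->" leak into the output, because
the first match ends right after <B.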

I am using a "little" more complex method myself:

   $*=1;
   $/="\x00";

   # Read first 4KB
   read(S, $cbuffer, 4096);
   close(S);
   # We work on the whole buffer. Let's remove the newlines.
   $cbuffer=~s/\n/ /g;
   # I don't want the head : Title tag is already in the DB anyway.
   $cbuffer=~s/<\s*head\s*>.*<\s*\/head\s*>//gi;                 
   # Discard comments.
   $cbuffer=~s/-->/\0/g;
   $cbuffer=~s/<!--[^\0]*\0//g;
   $cbuffer=~s/\0/ /g;
   # Discard other html tags
   $cbuffer=~s/<[^>]+>/ /g;
   # There is no point in displaying runs of spaces. One is enough.
   $cbuffer=~s/\s+/ /g;
   # Keep the first few characters
   $context=substr($cbuffer, 0, 520);
   # Remove the last "word" in case it was cut
   $context=~s/\s+\S+$//;

I'm not saying it's foolproof, but it should behave properly
in most cases (and even with all those regexps, it should still
be faster than doing several thousand system calls :)
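
For a current Perl, the substitutions above could be wrapped in a
subroutine along these lines. This is only a sketch: the name
html_summary and the $max parameter are hypothetical, and unlike the
snippet above it drops the last word only when the text was actually
cut. It does not set $* or $/, since none of the patterns depend on
them once the newlines have been replaced:

```perl
use strict;
use warnings;

# Hypothetical helper: build a short plain-text summary from raw HTML.
sub html_summary {
    my ($html, $max) = @_;
    $max = 520 unless defined $max;

    my $text = substr($html, 0, 4096);             # work on the first 4KB only
    $text =~ s/\n/ /g;                             # remove newlines
    $text =~ s/<\s*head\s*>.*<\s*\/head\s*>//gi;   # drop the head section
    $text =~ s/-->/\0/g;                           # mark each comment end
    $text =~ s/<!--[^\0]*\0//g;                    # drop comments up to the mark
    $text =~ s/\0/ /g;                             # stray marks become spaces
    $text =~ s/<[^>]+>/ /g;                        # drop the remaining tags
    $text =~ s/\s+/ /g;                            # collapse runs of whitespace

    my $context = substr($text, 0, $max);
    # Remove the last "word" only if the text was truncated.
    $context =~ s/\s+\S+$// if length($text) > $max;
    return $context;
}
```

Something like html_summary($cbuffer) would then give the text to show
next to each search result.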

By the way, I don't like working with the filesystem method. It
would be rather awkward to password-protect an area of the site only
to have the content of those files displayed in search results :)
(that's just one problem among others)

Cheers,
Yann S.

-- 
-------------------------------------------------------------------
TheNet - Internet Services AG              CohProg SaRL
stettler@thenet.ch                         stettler@cohprog.com
http://www.thenet.ch/                      http://www.cohprog.com/
                              ---**---
Anime and Manga Services                   http://www.animanga.com/
Received on Sat Dec 12 14:04:14 1998