RE: Document Summaries/Descriptions

From: Bas Meijer <bas(at)not-real.antraciet.nl>
Date: Wed Nov 15 2000 - 14:45:14 GMT
At 05:58 -0800 15-11-2000, Bill Moseley wrote:
>...
>OTOH, I'm not sure that this feature can't be handled outside of swish if
>Properties won't work in some case.  It's faster to access the document
>summaries if they are in the index, but it might come at the expense of
>speed when searching -- and that is swish's main job.
>
>If using the file system you can always access the documents from your CGI
>front-end to show a summary of the first x characters.  If indexing with
>the httpd method then maybe the spider can extract the first x characters
>and save it to a local file or database depending on your needs.  That
>would be better with HTML as you could use HTML::TreeBuilder to extract out
>correct HTML instead of just chopping it off after x number of characters.
>

Lookup-1.6.0 takes this 'outside' approach for summaries (abstracts); 
for an example, see:

http://bas.antraciet.nl/cgi-bin/search.cgi?search=stat&results=0&index=apache.swish&numpp=5&abstracts=CHECKED

Abstracts are generated at runtime using abstract.pm (attached), 
based on code by Steve van der Burg, with this routine:

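sub abstract {
	my ($file) = @_;

	# HTML::DocHead is the parser class that comes with abstract.pm
	my $dhp = HTML::DocHead->new;

	# Read the whole document; return nothing if it cannot be opened
	my $content = '';
	open(my $fh, '<', $file) or return;
	while (<$fh>) {
		$content .= $_;
	}
	close $fh;

	# Feed the document to the parser and return the collected abstract
	$dhp->parse($content);
	$dhp->eof;
	return $dhp->out;
}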


Steve originally used it for swishspider and stored the results in a gdbm file;
an AnyDBM_File construction would be more portable.

The point is that you want to reduce the runtime load of search.cgi.
When the result set is over 25 files, extracting abstracts takes too 
much time IMHO.
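
One way to take that work out of the search request is to cache each
abstract the first time it is generated, along the lines of Steve's gdbm
store. What follows is only a rough sketch, not code from Lookup: the
name cached_abstract, the "mtime plus newline plus text" record layout
and the cache path in the comment are all assumptions made for
illustration. The generator is passed in as a code reference, so the
abstract() routine above (or anything else) can be plugged in.

```perl
use strict;
use warnings;
use Fcntl;          # O_RDWR, O_CREAT for the tie shown in the comment below
use AnyDBM_File;    # uses whichever DBM library this perl was built with

# Hypothetical helper (not part of Lookup or swish): return the cached
# abstract for $file, calling $generate only when the cache has no entry
# for the file or the file has changed since the entry was stored.
#
# $cache is a hash reference. For a persistent cache, tie it first, e.g.
#   tie my %cache, 'AnyDBM_File', '/tmp/abstracts', O_RDWR|O_CREAT, 0644
#       or die "cannot open abstract cache: $!";
sub cached_abstract {
    my ($cache, $file, $generate) = @_;

    my $mtime = (stat $file)[9];
    return unless defined $mtime;       # file is missing or unreadable

    # Each entry is stored as "<mtime>\n<abstract>", so a modified file
    # invalidates its own entry.
    if (defined(my $entry = $cache->{$file})) {
        my ($stamp, $text) = split /\n/, $entry, 2;
        return $text if defined $text && $stamp == $mtime;
    }

    my $text = $generate->($file);
    $text = '' unless defined $text;
    $cache->{$file} = "$mtime\n$text";
    return $text;
}
```

search.cgi would then call cached_abstract(\%cache, $path, \&abstract)
for each hit and only pay the parsing cost on a cache miss. Note that
some DBM back ends limit the size of a key/value pair (SDBM to roughly
1 KB), so this assumes the abstracts are kept short.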


regards,




Bas Meijer



-- 

--  /'''     Bas Meijer mailto:bas@antraciet.com
     c-OO     http://antraciet.com Web Services
     \  >     Kerkstraat 19 Postbus 256 1400 AG Bussum
      \&&     t. +31 35 7502100  f. +31 35 7502111
Received on Wed Nov 15 14:46:45 2000