On Thu, 4 Jun 1998, Dan Brickley wrote:
> On Thu, 4 Jun 1998, Roy Tennant wrote:
> > what I'm talking about. I'd much prefer to put abstracts in my files and
> > fetch those.
> Me too. Has anybody done any work along these lines? eg. building
> something like a cache of extracted metadata from the indexed pages, so
> that result-sets could include Title/Description/Keywords/Subject etc.
> based on contents of META tags? Extracting these manually each time a
> query occurs would presumably be a little inefficient.
I understand and support the desire to be efficient, as long as it
doesn't prevent useful projects from happening. I've cobbled together
some terribly inefficient projects using SWISH-E and Perl that are
nonetheless effective. In one of them, I used SWISH-E to index around
18,000 files that consist solely of META tags and their contents. For
example:
<LINK REL="SCHEMA.dc" HREF="http://purl.org/metadata/dublin_core">
<META NAME="DC.publisher" CONTENT="The Library, University of California,
Berkeley">
<META NAME="DC.creator" CONTENT="National Information Standards
Organization (NISO)">
<META NAME="DC.title" CONTENT="Serial Item and Contribution Identifier
(SICI)">
<META NAME="DC.identifier" CONTENT="http://sunsite.Berkeley.EDU/SICI/">
<META NAME="DC.description" CONTENT="The SICI standard (ANSI/NISO
Z39.56-1996, Version 2) provides an extensible mechanism for the unique
identification of either an issue of a serial title or a contribution
(e.g., article) contained within a serial, regardless of the distribution
medium (paper, electronic, microform, etc.).">
<META NAME="DC.type" CONTENT="text">
<META NAME="DC.language" CONTENT="eng">
<META NAME="DC.date" CONTENT="1997">
<META NAME="DC.relation" CONTENT="Online version of the paper document
published by the National Information Standards Organization (NISO).">
<META NAME="DC.rights" CONTENT="Copyright (c) 1997 ANSI/NISO.">
<META NAME="DC.format" CONTENT="text/html">
<META NAME="DC.subject" CONTENT="standard">
<META NAME="DC.subject" CONTENT="identifiers">
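(For the curious, the extraction step looks roughly like the following
Perl. This is only a sketch -- the subroutine name and the sample data
are made up for illustration, not taken from my actual scripts:)

```perl
#!/usr/bin/perl -w
# Sketch: pull DC.* META tags out of a chunk of HTML with a regex,
# building a hash of field name => list of values. Hypothetical code,
# not my production scripts.
use strict;

sub parse_dc_meta {
    my ($html) = @_;
    my %dc;
    while ($html =~ /<META\s+NAME="(DC\.[\w.]+)"\s+CONTENT="([^"]*)"/sgi) {
        my ($name, $content) = ($1, $2);
        $content =~ s/\s+/ /g;        # collapse newlines inside CONTENT
        $content =~ s/^\s+|\s+$//g;   # trim leading/trailing whitespace
        push @{ $dc{lc $name} }, $content;
    }
    return \%dc;
}

my $sample = <<'HTML';
<META NAME="DC.title" CONTENT="Serial Item and Contribution Identifier">
<META NAME="DC.subject" CONTENT="standard">
<META NAME="DC.subject" CONTENT="identifiers">
HTML

my $dc = parse_dc_meta($sample);
print "Title: $dc->{'dc.title'}[0]\n";
print "Subjects: @{ $dc->{'dc.subject'} }\n";
```

(A real parser would be more careful about attribute order and quoting,
but for files I generate myself, a regex like this is good enough.)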
This means that when a search is performed, I must *parse every hit* in
order to extract the information for display. And this is in Perl, not a
compiled language. To see the response time for a search that retrieves
364 items, go to (or do your own search at)
You will find that, aside from the time it takes the Web client to
download the images, the initial search takes the longest chunk of time,
since that is when I write out a temporary file that speeds up the
response when another page of results is requested. But even so, I think
you will find the response time decent, particularly for a prototype.
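(The temporary-file trick is nothing fancy. In rough outline -- with
made-up paths, page size, and subroutine names -- it works like this:)

```perl
#!/usr/bin/perl -w
# Sketch of the temporary-file idea: on the first search, save the full
# result list to a temp file; later pages reread that file instead of
# rerunning the search and re-parsing every hit. Paths and page size
# are hypothetical.
use strict;

my $PAGE_SIZE = 20;

sub save_results {
    my ($session, @hits) = @_;
    my $tmp = "/tmp/search.$session";
    open(TMP, "> $tmp") or die "can't write $tmp: $!";
    print TMP "$_\n" for @hits;   # one "title\tURL" line per hit
    close TMP;
}

sub read_page {
    my ($session, $page) = @_;
    my $tmp = "/tmp/search.$session";
    open(TMP, "< $tmp") or die "can't read $tmp: $!";
    chomp(my @hits = <TMP>);
    close TMP;
    my $start = ($page - 1) * $PAGE_SIZE;
    my $end   = $start + $PAGE_SIZE - 1;
    $end = $#hits if $end > $#hits;
    return @hits[$start .. $end];
}

# First request: run the search once, cache all 45 hits.
save_results("demo", map { "Hit $_\thttp://example.org/$_" } 1 .. 45);
# Later request: page 2 comes straight from the temp file.
my @page2 = read_page("demo", 2);
print scalar(@page2), " hits on page 2\n";
```

(The search and parse happen once; every subsequent page is just a file
read and an array slice, which is why later pages come back quickly.)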
So while I do work to increase efficiency, I too often run into people
who won't do something *at all* because it would be too inefficient.
In my opinion, good enough is often just that -- good enough. And CPU
cycles don't do you one bit of good until you burn them.
Received on Fri Jun 5 07:24:16 1998