On Sun, 15 Mar 1998, Paul J. Lucas wrote:
>> I don't know what sort of overhead would be involved.
> Copying the first 50-100 words of every file into the index.
There is another way... at least for text and HTML files. I maintain
the web archives for a couple of astronomy-related mailing lists and use
swish-e to index and search them. One archive contains 24,000 messages
and the other has 14,000 messages. Each message is an individual HTML
file, between 2K and 20K bytes in size, generated by MHonArc (Unix).
The Perl search script first calls swish-e to produce the list of filenames
that contain the text we searched for. Then the script takes those
filenames and pulls out the context information directly from the files
in question. No need to keep the context in an index file when you can
get it directly from the original document.
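A rough sketch of that two-step flow, written in Python rather than the Perl the post describes. The name `search_with_context` and the `find_files` callable are hypothetical: `find_files` stands in for whatever runs the indexer and returns the matching filenames, and nothing here reflects the actual script.

```python
def search_with_context(term, find_files, max_hits=20):
    """Step 1: ask the index for filenames. Step 2: read context from the files.

    find_files(term) is an assumed stand-in for the indexer call; it must
    return an iterable of paths to the original documents.
    """
    needle = term.lower()
    results = []
    for path in list(find_files(term))[:max_hits]:
        # The context comes straight from the document, not from the index.
        with open(path, encoding="latin-1", errors="replace") as fh:
            lines = fh.read().splitlines()
        matching = [ln.strip() for ln in lines if needle in ln.lower()]
        results.append((path, matching))
    return results
```

The index only has to answer "which files?", so it stays small; the cost is one extra file read per displayed hit, which is why the sketch caps the number of hits.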
If a person just wanted to retrieve the line that contains the search
term... it would be very easy to use something like "grep" to do it. I
also wanted the line before and the line after the search term, the term
itself to be bold, and all HTML to be stripped from the result. So, we
did the whole thing in the perl5 script (added about 50 lines to the
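The context formatting described above (the matching line plus its neighbours, term in bold, HTML stripped) might look roughly like this in Python. This is an illustrative sketch, not the Perl code from the post; `strip_tags` and `context_snippets` are made-up names, and the regex-based tag stripping assumes no tag spans a line break.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")

def strip_tags(line):
    # Crude tag removal; assumes each tag opens and closes on the same line.
    return TAG_RE.sub("", line).strip()

def context_snippets(lines, term):
    """Return one snippet per matching line: previous, matching, and next line."""
    plain = [strip_tags(ln) for ln in lines]
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    snippets = []
    for i, line in enumerate(plain):
        if not pattern.search(line):
            continue
        # Slice bounds are clamped so the first and last lines also work.
        window = plain[max(i - 1, 0): i + 2]
        text = " ".join(part for part in window if part)
        # Wrap every occurrence of the term in <b>...</b>, keeping its case.
        snippets.append(pattern.sub(lambda m: "<b>" + m.group(0) + "</b>", text))
    return snippets
```

For example, `context_snippets(["<p>We tested the water", "for <i>nitrate</i> levels", "last week.</p>"], "Nitrate")` yields a single snippet with the stripped three-line window and `nitrate` wrapped in `<b>` tags.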
If anyone would like to look at the search engine, go to:
Enter the word "nitrate", select "Detailed", then press "Search". Note
that due to the size of what I am indexing, I chose to split the index
files up into pieces. The messages from 1995-1997 never change so I saw
no reason to reindex them every time (plus swish-e gets unhappy on my
machine with 24,000 files to index). So, I have a search index for each
year and the script actually runs swish-e once for each year. Even
executing swish-e 4 times and displaying the context... the script still
runs pretty quickly on our very old, very slow server (HP9000 715/75
running HP-UX v9.03 and Netscape's Enterprise v2.01 server).
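The per-year split can be sketched as a loop over index files, again in Python and under assumptions: `run_search(index_path, term)` is a hypothetical stand-in for one swish-e invocation against one index, and the index filenames in the usage note are invented.

```python
def search_all_years(term, indexes, run_search):
    """Query each yearly index separately and merge the hits.

    indexes maps a year to its index file path. run_search is an assumed
    callable that performs one search and returns a list of filenames.
    """
    hits = []
    for year in sorted(indexes, reverse=True):  # newest messages first
        for filename in run_search(indexes[year], term):
            hits.append((year, filename))
    return hits
```

A call such as `search_all_years("nitrate", {1997: "atm-1997.idx", 1998: "atm-1998.idx"}, run_search)` then touches each index once. Only the current year's index ever needs rebuilding, since the older archives no longer change.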
I hope this spurs some ideas on how to do this! I'm sorry that I can't
give away the Perl code for this at the present time. The programming
time to create it was donated instead of a monetary contribution to the
Andy Steere - email@example.com
http://www.system.missouri.edu/atm/ <- Amateur Telescope Makers archive
http://www.system.missouri.edu/apml/ <- AstroPhotography MailingList archive
http://www.system.missouri.edu/andy/ <- my homepage
Received on Fri Mar 20 08:35:48 1998