I agree with David when he says that the index shouldn't hold everything everyone wants.
Not only could the index grow larger than the indexed files, but search time could also increase just to return information that most people don't need...
On my client's site, which uses Swish, I run a daily job that builds 18 different indexes covering 30,000 well-known documents (each one is listed in the conf files). Indexing everything takes four or five hours, but the files are stored on an NFS filesystem, which is slow and leaves the CPU free for the Web service.
All of this works fine, and I have no performance problems, even on a not-so-fast machine (I don't know exactly, but I think it is a P200 or a little more, with 128 MB RAM, and it hosts dozens of sites and databases).
I know that I still have a lot of optimization to do, and I think many people have the same optimizations to do if their response times are slow:
* index only the files that have changed, and merge the old index with the new one,
* use mod_perl on the server, instead of CGI for Perl scripts,
* even change the programming language, for some parts of the site,
* carefully design the scripts, and optimize the main parts,
* adjust the HTTP server config,
* in the case of timestamps, maybe write this information in a separate index (maybe stored in a DBFile hash table, or something like that).
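
As a rough illustration of the first and last points, here is a minimal Python sketch that keeps modification times in a small hash-table file, separate from the search index, and reports which documents have changed since the last run. The function name `changed_files` and the database path are made up for this example; the returned list is what you would hand to the indexer before merging.

```python
import dbm
import os


def changed_files(paths, db_path):
    """Return the paths whose mtime differs from the one stored in db_path.

    The stored mtimes are updated as a side effect, so the next call only
    reports files modified after this one.
    """
    changed = []
    # "c" opens the database read/write and creates it if it is missing.
    with dbm.open(db_path, "c") as db:
        for path in paths:
            key = path.encode("utf-8")
            mtime = str(os.stat(path).st_mtime_ns).encode("ascii")
            if key not in db or db[key] != mtime:
                changed.append(path)
                db[key] = mtime
    return changed
```

The same file could also be read by the search script to display a timestamp next to each hit without touching the index itself.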
From: David Norris [SMTP:email@example.com]
Date: Tuesday, 24 August 1999 23:32
To: Multiple recipients of list
Subject: [SWISH-E] RE: timestamps in the database?
>> timestamps will become outdated as well. This may or may not be a
> I could live with this. If you update your index file with cron at
> midnight and move documents before midnight the index will be inconsistent too.
> The problem of inconsistency can only be solved by generating the index at
> runtime... nobody wants this.
I agree, in many cases this is insignificant because old documents don't change often. Where
documents update more often, it would be bad to indicate that this document was updated last
week when in fact it was updated yesterday. How many people update their index as often as
every day? It takes a long time to index a large site, most folks I know do it on Sundays or
their normal low traffic times. I'd be surprised if many people do it more than once a week.
Grabbing filemtime at runtime gives an indicator that the file has changed since the index was
updated. New documents wouldn't be shown, but, old documents which have changed will be shown
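
A minimal Python sketch of that runtime check, assuming the index is a single file whose own modification time marks the last indexing run (the function name `modified_since_index` is hypothetical):

```python
import os


def modified_since_index(doc_path, index_path):
    """Return True if the document was modified after the index was written."""
    try:
        return os.path.getmtime(doc_path) > os.path.getmtime(index_path)
    except OSError:
        # A document (or index) that can no longer be stat'ed is treated
        # as out of date as well.
        return True
```

A results script could call this for each hit and flag the entries for which it returns True.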
>> My search results rarely
>> takes more than 30 - 50 milliseconds to generate results while reading
> However it becomes more important if your server hardware is not that
What do you mean by not fast? 386/486? My server machine is a Pentium 100 MHz with 16 MB RAM
running Linux 2.2.5 (SuSE 6.1)... PHP is extremely fast when compared to almost anything,
especially on old hardware. Similar Perl scripts take seconds to parse search results even
on a fast machine. In comparison, the latency in my PHP script is almost completely caused by
the file I/O routines in Linux and slow hardware.
>> in the index, then why even have them in the file system. Convenience?
> :-) You are right. But it would not need more space than the file size
> actually takes.
I agree, adding one thing or another wouldn't take as much space as the file. But, adding
everything everyone wants to have returned in the results could become a significant percentage.
I was referring to the trend of trying to add everything into the index. Paragraph+ descriptions,
file size, time stamps, keywords, etc would become large. And, the only time it saves is
measured in milliseconds.
World Wide Web - http://www.webaugur.com/dave
Page via mail - firstname.lastname@example.org
ICQ Universal Internet Number - 412039
E-Mail - email@example.com
Received on Wed Aug 25 01:48:57 1999