At the moment I use htdig to index the site of a big university in
Germany (the University of Erlangen-Nuremberg,
http://www.uni-erlangen.de, which has about 21,000 students). I'm very
happy with htdig's ability to index parts of the site, but when it comes
to the overall index that should include all sites hosted by the
university's computing center, all (stable) versions of htdig I know
either crash (it may not crash right away, but after 2 days of
consuming 100% CPU time without any visible progress I find it
crashed) or run into 2 GB file size limit problems (at least on Linux,
which is the platform the search engine must run on). These problems
are not caused by kernel or filesystem limitations (kernel 2.4.18 and
an ext3 filesystem, both capable of handling files in the TB range). So
what I need is a search engine that will index .doc, .pdf, .ps and all
kinds of HTML and text, that can deal with umlauts, that doesn't crash
when the amount of data to be indexed is a bit bigger than usual, and
that returns search results within a reasonable time even though the
database might be several GB in size.
The available hardware is an HP server with 2 Pentium III Xeon 1 GHz
CPUs, 1 GB of memory and 100 GB of SCSI RAID-30 disk space. The server
has nothing else to do but host the search engine (and its web server).
Maybe you can give me a hint whether I should try swish-e, whether I
can make use of both CPUs, whether swish-e has incremental indexing...
and so on. I have no problem using a bleeding-edge development version
as long as it is not capable of breaking out of a chroot (so that no
matter what it does, it won't harm the rest of the installation).
(almost) desperate part-time search-engine administrator looking for
something that works.
Received on Mon Jul 29 10:27:59 2002