On Mon, 23 Feb 1998, Jacques Delsemme wrote:
> 1- When indexing, swish-e goes very fast at first, then slows down more
> and more, until it literally crawls when you have a lot of data to
> index. It would be great to have it report periodically on its use
> of system resources to be able to learn where the bottleneck is located.
It's eating up memory and your machine is swapping. Also, from
looking at the code, it appears to use unbalanced binary trees
for the words (although it uses hashes for most everything else).
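To illustrate why an unbalanced tree can contribute to that kind of slowdown, here is a minimal sketch (not SWISH-E's actual code; NaiveTree and max_insert_cost are hypothetical names made up for this example). It counts how many nodes a single insertion has to walk past. If the words happen to arrive in sorted order, the tree degenerates into a linked list and each insertion costs as much as the number of words already stored.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical naive (unbalanced) binary search tree. Nodes live in a vector
// and refer to each other by index, so no manual memory management is needed.
struct NaiveTree {
    struct Node {
        std::string word;
        int left;   // index of the left child, or -1 if none
        int right;  // index of the right child, or -1 if none
    };
    std::vector<Node> nodes;

    // Inserts word if it is absent and returns the number of existing
    // nodes that were visited on the way down.
    std::size_t insert(const std::string& word) {
        if (nodes.empty()) {
            nodes.push_back(Node{word, -1, -1});
            return 0;
        }
        std::size_t visited = 0;
        int cur = 0;
        for (;;) {
            ++visited;
            if (word == nodes[cur].word)
                return visited;  // already present
            bool go_left = word < nodes[cur].word;
            int next = go_left ? nodes[cur].left : nodes[cur].right;
            if (next == -1) {
                int fresh = static_cast<int>(nodes.size());
                if (go_left) nodes[cur].left = fresh;
                else         nodes[cur].right = fresh;
                nodes.push_back(Node{word, -1, -1});
                return visited;
            }
            cur = next;
        }
    }
};

// Inserts all words in the given order and returns the cost (in visited
// nodes) of the most expensive single insertion.
std::size_t max_insert_cost(const std::vector<std::string>& words) {
    NaiveTree tree;
    std::size_t worst = 0;
    for (std::size_t i = 0; i < words.size(); ++i) {
        std::size_t cost = tree.insert(words[i]);
        if (cost > worst) worst = cost;
    }
    return worst;
}
```

Feeding this n distinct words in sorted order makes the last insertion visit n-1 nodes, whereas a balanced container such as std::map guarantees logarithmic insertion regardless of input order.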
Because of this and many other limitations of SWISH-E, I've
written SWISH++. I'm putting the finishing touches on the docs
now, so it should be available next week sometime. Briefly
(from the in-progress README):
1. 8-10 times faster at indexing. It achieves this speed by using:
    a) mmap(2) instead of stdio to read files
    b) very little explicit dynamic memory allocation
    c) more inlining and fewer function calls in inner loops
    d) better data structures and algorithms by virtue of
       using STL (The C++ Standard Template Library), e.g.,
       maps rather than linked lists
2. Better results format of:
    rank path_name file_size file_title
By placing the file_title, which may contain spaces, last,
you can easily parse it, e.g.:
    ($rank,$path,$size,$title) = split( / /, $_, 4 );
---> 3. Automatically splits and remerges large file sets.
4. Parses hexadecimal numeric character entity references of
the form "&#xhhh;" in addition to decimal ones.
5. Searches are practically instantaneous because the index
file is mmap(2)'ed and binary-searchable immediately.
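As a rough illustration of items 1(a) and 1(d), the sketch below maps a file with mmap(2) and tallies its words in a std::map. This is not SWISH++'s code: the names WordCounts, count_words and count_words_in_file, and the rule that a word is a run of alphanumeric characters folded to lower case, are assumptions made for the example.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdio>
#include <map>
#include <string>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef std::map<std::string, int> WordCounts;

// Counts the words in the byte range [begin, end). A "word" is assumed to
// be a run of alphanumeric characters; it is lowercased before counting.
void count_words(const char* begin, const char* end, WordCounts& counts) {
    const char* p = begin;
    while (p != end) {
        // Skip separators.
        while (p != end && !std::isalnum(static_cast<unsigned char>(*p)))
            ++p;
        const char* start = p;
        while (p != end && std::isalnum(static_cast<unsigned char>(*p)))
            ++p;
        if (p != start) {
            std::string word(start, p);
            for (std::size_t i = 0; i < word.size(); ++i)
                word[i] = static_cast<char>(
                    std::tolower(static_cast<unsigned char>(word[i])));
            ++counts[word];
        }
    }
}

// Maps the file read-only and counts its words without going through
// stdio buffers. Returns false if the file cannot be opened or mapped.
bool count_words_in_file(const char* path, WordCounts& counts) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) { std::perror("open"); return false; }

    struct stat st;
    if (fstat(fd, &st) == -1) { std::perror("fstat"); close(fd); return false; }
    if (st.st_size == 0) { close(fd); return true; }  // mmap rejects length 0

    std::size_t len = static_cast<std::size_t>(st.st_size);
    void* addr = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping remains valid after the descriptor is closed
    if (addr == MAP_FAILED) { std::perror("mmap"); return false; }

    const char* data = static_cast<const char*>(addr);
    count_words(data, data + len, counts);
    munmap(addr, len);
    return true;
}
```

The tokenizer works directly on the mapped bytes, so the only dynamic allocation is for the words actually stored in the map. The same mapping call, applied to a sorted index file, is what would let a search binary-search the index in place as item 5 describes.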
For example, on a SPARC Ultra 2, it indexes 5 million words (1
million unique) in just under 8 minutes. Smokin'!
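For item 4 of the README excerpt, decoding a numeric character reference might look roughly like the sketch below. It handles both the decimal form "&#ddd;" and the hexadecimal form "&#xhhh;". The function name decode_numeric_entity and the cap at 0x10FFFF are assumptions for the example, not a description of what SWISH++ actually does.

```cpp
#include <cstddef>
#include <string>

// Decodes a single numeric character reference such as "&#65;" or "&#x41;".
// On success stores the code point in out and returns true; on malformed
// input returns false and leaves out untouched.
bool decode_numeric_entity(const std::string& ref, unsigned long& out) {
    // The shortest valid form is "&#d;" (4 characters).
    if (ref.size() < 4 || ref[0] != '&' || ref[1] != '#' ||
        ref[ref.size() - 1] != ';')
        return false;

    std::size_t pos = 2;
    unsigned base = 10;
    if (ref[pos] == 'x' || ref[pos] == 'X') {  // hexadecimal form
        base = 16;
        ++pos;
    }

    const std::size_t end = ref.size() - 1;  // index of the closing ';'
    if (pos == end)
        return false;  // no digits at all, e.g. "&#x;"

    unsigned long value = 0;
    for (; pos < end; ++pos) {
        const char c = ref[pos];
        unsigned digit;
        if (c >= '0' && c <= '9')
            digit = static_cast<unsigned>(c - '0');
        else if (base == 16 && c >= 'a' && c <= 'f')
            digit = static_cast<unsigned>(c - 'a') + 10;
        else if (base == 16 && c >= 'A' && c <= 'F')
            digit = static_cast<unsigned>(c - 'A') + 10;
        else
            return false;  // not a digit in this base
        value = value * base + digit;
        if (value > 0x10FFFFUL)
            return false;  // assumed upper bound; also prevents overflow
    }
    out = value;
    return true;
}
```

With this, "&#233;" and "&#xE9;" both decode to the same code point, so either spelling of an accented character would end up indexed identically.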
- Paul J. Lucas
NASA Ames Research Center        Caelum Research Corporation
Moffett Field, California        San Jose, California
<pjl AT ptolemy DOT arc DOT nasa DOT gov>
Received on Fri Feb 27 09:20:13 1998