Re: [SWISH-E:150] Two minor suggestions

From: Paul J. Lucas <pjl(at)>
Date: Fri Feb 27 1998 - 17:11:44 GMT

On Mon, 23 Feb 1998, Jacques Delsemme wrote:

> 1- When indexing, swish-e goes very fast at first, then slows down more 
> and more, until it literally crawls when you have a lot of data to 
> index.  It would be great to have it report periodically on its use 
> of system resources to be able to learn where the bottleneck is located.

	It's eating up memory and your machine is swapping.  Also, from
	looking at the code, it appears to use unbalanced binary trees
	for the words (although it uses hashes for most everything
	else).
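
	A minimal sketch of why that hurts (illustrative only, not SWISH-E's
	actual code; Node, insert, depth and depth_after_inserting are names
	invented here): a binary tree that never rebalances turns into a
	linked list when words arrive in sorted order, so each lookup walks
	every node.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical node type for illustration; not taken from SWISH-E.
struct Node {
    std::string word;
    std::unique_ptr<Node> left, right;
    explicit Node(const std::string& w) : word(w) {}
};

// Plain binary-search-tree insert with no rebalancing.
void insert(std::unique_ptr<Node>& root, const std::string& word) {
    std::unique_ptr<Node>* slot = &root;
    while (*slot) {
        if (word == (*slot)->word) return;   // word already present
        slot = (word < (*slot)->word) ? &(*slot)->left : &(*slot)->right;
    }
    assert(!*slot);                          // loop only exits at an empty slot
    slot->reset(new Node(word));
}

// Height of the tree, i.e. the worst-case number of nodes a lookup visits.
int depth(const Node* n) {
    if (!n) return 0;
    int l = depth(n->left.get());
    int r = depth(n->right.get());
    return 1 + (l > r ? l : r);
}

// Insert the words in the order given and report the resulting depth.
int depth_after_inserting(const std::vector<std::string>& words) {
    std::unique_ptr<Node> root;
    for (const std::string& w : words) insert(root, w);
    return depth(root.get());
}
```

	Seven words inserted in sorted order give a depth of 7; the same
	seven words inserted middle-first give a depth of 3.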

	Because of this and many other limitations of SWISH-E, I've
	written SWISH++.  I'm putting the finishing touches on the docs
	now, so it should be available next week sometime.  Briefly
	(from the in-progress README):

	1. 8-10 times faster at indexing.  It achieves this speed by using:
		a) mmap(2) instead of stdio to read files
		b) very little explicit dynamic memory allocation
		c) more inlining and fewer function calls in inner loops
		d) better data structures and algorithms by virtue of
		   using STL (The C++ Standard Template Library), e.g.,
		   maps rather than linked lists

	2. Better results format of:

		rank path_name file_size file_title

	   By placing the file_title, which may contain spaces, last,
	   you can easily parse it, e.g.:

		($rank,$path,$size,$title) = split( / /, $_, 4 );

--->	3. Automatically splits and remerges large file sets.

	4. Parses hexadecimal numeric character entity references of
	   the form "&#xhhh;" in addition to decimal ones.

	5. Searches are practically instantaneous because the index
	   file is mmap(2)'ed and binary-searchable immediately.
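
	A rough sketch of the idea behind item 5, assuming a hypothetical
	index of fixed-width records sorted by word (the Entry layout and
	find_word are made up for illustration; the real SWISH++ index format
	is not described here). Once such a file is mapped, the records can
	be binary-searched in place with no parsing or loading step:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical fixed-width index record, assumed for this sketch only.
struct Entry {
    char     word[24];    // NUL-terminated; records are sorted by this field
    unsigned file_count;  // e.g. number of files containing the word
};

// Binary-search a sorted, contiguous run of n records. `first` may point
// into an mmap(2)'ed region just as well as into an ordinary array.
// Returns the matching record, or nullptr if the word is not indexed.
const Entry* find_word(const Entry* first, std::size_t n, const char* word) {
    assert(first != nullptr || n == 0);
    const Entry* last = first + n;
    const Entry* it = std::lower_bound(
        first, last, word,
        [](const Entry& e, const char* w) { return std::strcmp(e.word, w) < 0; });
    return (it != last && std::strcmp(it->word, word) == 0) ? it : nullptr;
}
```

	Each lookup costs about log2(n) string comparisons, so roughly 20
	for a million unique words.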

	For example, on a SPARC Ultra 2, it indexes 5 million words (1
	million unique) in just under 8 minutes.  Smokin'!
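
	To make items 1a and 1d concrete, here is a minimal sketch (not
	SWISH++ source; WordCounts, count_words and count_words_in_file are
	invented names, and the "word" rule is deliberately naive) that maps
	a file with mmap(2) and tallies words in an STL map. It assumes a
	POSIX system:

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>

#include <fcntl.h>      // open
#include <sys/mman.h>   // mmap, munmap
#include <sys/stat.h>   // fstat
#include <unistd.h>     // close

typedef std::map<std::string, unsigned long> WordCounts;

// Count words in [begin, end). A "word" here is just a run of
// alphanumeric characters, lowercased; a real indexer is pickier.
void count_words(const char* begin, const char* end, WordCounts& counts) {
    assert(begin <= end);
    std::string word;
    for (const char* p = begin; p != end; ++p) {
        unsigned char c = static_cast<unsigned char>(*p);
        if (std::isalnum(c)) {
            word += static_cast<char>(std::tolower(c));
        } else if (!word.empty()) {
            ++counts[word];
            word.clear();
        }
    }
    if (!word.empty()) ++counts[word];       // word running up to the end
}

// Map the whole file read-only and scan it straight out of the mapping,
// with no stdio buffer in between. Returns false on any failure.
bool count_words_in_file(const char* path, WordCounts& counts) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return false; }
    if (st.st_size == 0) { close(fd); return true; }   // nothing to map
    void* mem = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                               // the mapping outlives the fd
    if (mem == MAP_FAILED) return false;
    const char* text = static_cast<const char*>(mem);
    count_words(text, text + st.st_size, counts);
    munmap(mem, st.st_size);
    return true;
}
```

	The size of the resulting map is the number of unique words, and
	the sum of its counts is the total number of words seen.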

	- Paul J. Lucas
	  NASA Ames Research Center		Caelum Research Corporation
	  Moffett Field, California		San Jose, California
	  <pjl AT ptolemy DOT arc DOT nasa DOT gov>
Received on Fri Feb 27 09:20:13 1998