Skip to main content.
home | support | download

Back to List Archive

Re: [SWISH-E:181] Re: More

From: Paul J. Lucas <pjl(at)>
Date: Fri Mar 06 1998 - 17:33:28 GMT
On Thu, 5 Mar 1998 wrote:

> Suppose we select an arbitrary page as our root -- call it "Home". 
> Home has links to A, B, C and D, so we record those links and then
> follow them in sequence.  We find that A has links to i, ii, iii and
> iv, that B has links to v, vi, vii and viii, etc.  Page i then has
> links to a, b, c and ii links to d, e, f and g, etc.  But some of
> those links (assume iv) may be off-site.

	It's easy to check the domain of the URL to see if it's on the
	same site.  There are other problems in that links can be
	circular and aliased, e.g., "" is the same as
	"" (usually) so you'd have to perform
	some sort of fuzzy URL matching in order to eliminate

> In the context of indexing, "visiting" a node (page) would correspond
> to indexing the contents of that page.  Certainly noone with any sense
> whatever would insist that all links from a page must be followed
> before the words of the page can be added to the index.  In all
> probability, the entire page will be read and indexed before any of
> the links are followed.  Any sensible indexing program will index the
> contents of Home and then follow links to A, B, C and D rather than
> waiting to index Home until after all of A, B, C and D (and everything
> they link to) are indexed.  I therefore take issue with your
> characterization of depth-first as never completing a page because of
> following links first.

	You can do as you please; however, it's very easy to imagine
	that an implementation will follow a link upon encountering it.
	It is the "easy" solution since you don't have to queue it.

> Suppose Altavista (or swish++) is currently indexing at level 'n' of
> its search.  Level 'n' consists of nodes 3,123,456-3,654,321 and while
> indexing node 3,245,678 it finds a link to your page at
> It records that url as node 6,743,822, but, of course, it does not yet
> actually index it, since it is part of level 'n+1' rather than level
> 'n'.

	Then it's not depth-first.  I'll answer questions on SWISH++,
	but I don't see it as my (or anyone's task) to be a substitute
	for an algorithms course or textbook.

> Does my explanation above provide any insight regarding why I (still) have
> trouble understanding why you regard breadth-first as more suitable?

	Not really.

> I also fail to see the relevance of an algorithms text here.  You say below
> that efficiency is not the reason.  Most algorithms texts of my acquaintence
> concern themselves primarily with efficiency.

	To me it seems like you don't understand the basic algorithm.

> My question had to do with *why* you felt that breadth-first was better.

	And I stated so in as clear English as I can.  I can say it
	again if you like.

> Just as there is a simple explanation for *why* any of several
> different sorting algorithms beat the conceptually simpler bubble sort
> (and each other depending on circumstances), I was assuming that there
> was a corresponding explanation for why swish++ is faster than
> swish-e.

	There is: read the README.

> Usually, when you get an order-of-magnitude improvement, it is because of a
> better, faster algorithm, not just from a better implementation of the same
> algorithm.

	Absolutely wrong.  Please, please, consult an algorithms text.
	An order of magnitude improvment (or even thousands of orders of
	magnitude improvement) has *NOTHING* to do with algorithms
	because such are *CONSTANT* *FACTORS* in the polynomial of the
	running time.  The order of 10n is O(n).

	There are several factors that contribute to the speed
	improvement of SWISH++ (and being breadth-first vs. depth-first
	has *NOTHING* to do with it nor did I ever say it did).  SWISH++
	essentially uses exactly the same algorithm for indexing words;
	however, it does use a better data structure for storing the
	word index.

	I could be wrong, but it appears that SWISH-E uses an
	unbalanced binary tree.  The worst case running time for
	inserts and lookups is O(n) because the degenerate case for
	such a tree is a linked list.  SWISH++ uses a balanced binary
	tree.  The GNU STL implementation of the map class uses a
	red-black tree (but an AVL tree would do nicely as well).
	There, the worst case running time for inserts and lookups is

	So while *I* personally am not using a different algorithm
	overall, the implementation of the data structure I'm using "off
	the shelf" is far superior.

	Also, my "order of magnitude" claim is based on a few test
	cases.  I encourage others to compare SWISH-E and SWISH++ and
	share their results.

> I incorrectly assumed that the algorithm mentioned in your message to the
> mailing list was the algorithm that was responsible for the improvement.

	It's only partly responsible, but for large data sets, may turn
	out to be the dominant reason even though I never stated so.

	- Paul J. Lucas
	  NASA Ames Research Center		Caelum Research Corporation
	  Moffett Field, California		San Jose, California
	  <pjl AT ptolemy DOT arc DOT nasa DOT gov>
Received on Fri Mar 6 09:42:00 1998