Skip to main content.
home | support | download

Back to List Archive

Re: Fw: Re: More

From: <00prneubauer(at)not-real.bsuvc.bsu.edu>
Date: Thu Mar 05 1998 - 20:53:52 GMT
Paul J. Lucas wrote:  
>On Thu, 5 Mar 1998 00prneubauer@bsuvc.bsu.edu wrote:
>> Paul J. Lukas wrote:  
>> >	The breadth-first strategy of SWISH++ rather than the depth-
>> >	first one of SWISH-E is certainly better suited to web
>> >	indexing.
>> 
>> This is an interesting proposition, but I'm afraid it's not obvious to
>> me why this should be so.
>
>	The reason is simple: depth first would take an extremely long
>	time to complete and blow your stack moste likely since, every
>	time you come to a link, you'd follow it to another page and
>	possibly to another site.  You'd never get around to finishing
>	the site you started at until months later.  Breadth-first
>	would completes an entire page (and therefore an entire sitre)
>	at a time before moving on.

I believe I see your point that the relevance of breadth-first rather
than depth-first is related to searching one's own site versus
searching the entire web.  In searching one's own site, you have
access to directory structures and can reliably traverse them in
either order, whereas in searching the web you generally don't have
such access and must follow links.

I can see that you want to complete an entire page at a time, but in
general, links off a page are not restricted to a single site.  The
only way to complete an entire site before moving on is to forgo (or
at least postpone) following links off site.

Suppose we select an arbitrary page as our root -- call it "Home". 
Home has links to A, B, C and D, so we record those links and then
follow them in sequence.  We find that A has links to i, ii, iii and
iv, that B has links to v, vi, vii and viii, etc.  Page i then has
links to a, b, c and ii links to d, e, f and g, etc.  But some of
those links (assume iv) may be off-site.

In the context of indexing, "visiting" a node (page) would correspond
to indexing the contents of that page.  Certainly noone with any sense
whatever would insist that all links from a page must be followed
before the words of the page can be added to the index.  In all
probability, the entire page will be read and indexed before any of
the links are followed.  Any sensible indexing program will index the
contents of Home and then follow links to A, B, C and D rather than
waiting to index Home until after all of A, B, C and D (and everything
they link to) are indexed.  I therefore take issue with your
characterization of depth-first as never completing a page because of
following links first.  A depth-first search would more fairly be
characterized as indexing A and the nodes it links to: i, ii, iii and
iv, along with their linked local pages a, b, etc before indexing B, C
or D.

A depth-first search would *index* in an order resembling Home, A, i,
a, b, c, ii, d, e, f, g, iii, (any on-site links off iii), (NOT iv),
v, ... where a breadth-first search would *index* Home, A, B, C, D, i,
ii, iii, (NOT iv), v, vi, ..., a, b, c, ... . In either case, if we
intend to index an entire site at a time, we will have to record the
existence of iv, but postpone indexing it. In either case, once we run
out of on-site links from a page, we pop the stack and look for more
unvisited (unindexed) links at the next higher level.

>> At first glance, a breadth-first strategy looks like it would spend
>> all its time listing sites and never get around to individual
>> pages/documents.
>
>	You don't understand.

I think it's pretty obvious I don't understand.  That's why I was
asking for clarification, after all.  Let's see if I can make it
clearer *what* I don't understand.

Suppose Altavista (or swish++) is currently indexing at level 'n' of
its search.  Level 'n' consists of nodes 3,123,456-3,654,321 and while
indexing node 3,245,678 it finds a link to your page at
http://www.best.com/~pjl/software.html.  

It records that url as node 6,743,822, but, of course, it does not yet
actually index it, since it is part of level 'n+1' rather than level
'n'.  At some point during the indexing of level 'n+1' it will have
indexed your software page and recorded the links to your other pages,
but they will not themselves be indexed until somewhere in level
'n+2'.  If, on the other hand, the breadth-first search were modified
in a manner such that it visits (indexes) internal links whenever it
indexes a page (in a breadth-first manner, of course) then *that* is
itself a form of depth-first traversal, since it would be interrupting
the next higher level of breadth-first to complete a whole site before
moving on to the other external links from the higher level.

In comparison, a correspondingly modified depth-first search would
have the indexing engine finding a reference to your page somewhere
else and when the time came to visit the link to your page, the
depth-first search engine might then continue following the internal
links within your site before leaving to continue with the external
links from your site or the referring site.

Thus, your site would not be fully indexed in a coherent chunk of
time/process in that unmodified breadth-first search any more than in
the correspondingly unmodified depth-first search.  I can easily see
how there might be different sorts of either breadth-first or
depth-first, but if you want to index a whole site at a time, the
scenarios I described strike me as reasonably plausible.  Is it now
clearer *what* I don't understand about your remark?

>> Still, even for a limited number of sites (e.g., one) it's not obvious
>> to me why breadth-first is going to be faster, more efficient or
>> otherwise better suited than depth-first.
>
>	Well, what can I say.  Consult an algorithms text.

Does my explanation above provide any insight regarding why I (still)
have trouble understanding why you regard breadth-first as more
suitable?  I also fail to see the relevance of an algorithms text
here.  You say below that efficiency is not the reason.  Most algorithms
texts of my acquaintence concern themselves primarily with efficiency. 
Your opinion regarding the suitability of the different search
strategies would appear to be based either on a mischaracterization of
depth-first or a misunderstanding on my part of what you really mean. 
I will grant the plausibility of the latter, in fact, it is what I
have been assuming all along.

>> Depth-first strikes me as more obvious/natural,
>
>	Bubble sort strikes people as natural, but it sucks as a
>	sorting algorithm.

Exactly my point.  My question had to do with *why* you felt that
breadth-first was better.  As I said (reordered for purposes of
exposition):

>> Paul L. says that swish++ is "an order of magnitude faster than SWISH-E."  I
>> am not going to dispute the truth of this statement, but if the factor that
>> makes it true is a breadth-first algorithm rather than a depth-first
>> algorithm, then there is probably a simple explanation for why this is so.

Just as there is a simple explanation for *why* any of several
different sorting algorithms beat the conceptually simpler bubble sort
(and each other depending on circumstances), I was assuming that there
was a corresponding explanation for why swish++ is faster than
swish-e.  Usually, when you get an order-of-magnitude improvement, it
is because of a better, faster algorithm, not just from a better
implementation of the same algorithm.  I incorrectly assumed that the
algorithm mentioned in your message to the mailing list was the
algorithm that was responsible for the improvement.

>	I never cited that as a reason precicely because it isn't a
>	reason.  Read the README for the real reasons why it is faster.

I'll do that.

>> but I have certainly not ever thought about it in any detail.
>
>	So let me understand: you've never though about it in any
>	detail, yet you're coming to conclusions?

Certainly not.  My point was that I had never tried to create an
Altavista, Lycos, etc and that I was wondering *why* you thought that
a particular sort of search strategy was better than a different
search strategy.  I am more than willing to be enlightened.  I never
claimed to have the right answer.  In fact, I was explicitly saying
that I had not come to a conclusion, so I was looking for an
explanation of why my feeling of naturalness was wrong.  As far as I
can see, if we want to make sure of indexing as much of each visited
site as possible (and that strikes me as a reasonable thing to want),
then we will have to modify either sort of strategy to (recursively)
index all internal links before visiting any external links. 
Furthermore, if we are using a breadth-first strategy, we will need to
keep track of whether we arrived at page P (from O) by an internal or
an external link so that we will know whether we should continue
indexing the pages with (internal) links from page P (something of a
modified depth-first strategy embedded in the overall breadth-first
strategy) or whether we should return to R to index other pages linked
from R before we index the links from P.  I hope my point is clear.  

I do see your point about stack overflow, so I believe that I
understand the point that you cannot expect to index the entire web
with a depth-first search, so if I were to take on the task of writing
the next Altavista, I would probably want to use a suitably modified
breadth-first search.  OTOH, I thought we were talking about indexing
one or two sites. (I could be and probably am wrong.)  With a
reasonably finite task like that and external pages not indexed,
either search strategy seems usable.  I will not, of course, be very
surprised to find that one actually turns out to be much better than
the other.

>	The name is "Lucas" with a 'C', the far more common spelling.

Oops!  My bad.  I don't approve of misspelling people's names.  All I
can do is offer my apologies.

	Paul
========
Paul Neubauer         prn@bsu.edu            00prneubauer@bsuvc.bsu.edu
For PGP Public Key send mail with subject="Send PGP Public Key" 
1024 bits -- Key ID: 3FEB993D
Key Fingerprint: 85 AA A5 91 00 49 7A 7B  23 26 F7 B8 DB 72 C9 48
Received on Thu Mar 5 13:02:28 1998