You wrote: "But if you find one of the pages that is indirectly referenced,
you get the page only."
This is very true, and if one were spidering indiscriminately, it would be
a problem because there is probably no way of knowing that the page you had
found *was* indirectly referenced. However, most of my indexing so far has
been just the first page of a Web site, which means that my approach to
reading through the frames is probably safe. Each Web site will already
have been looked at by a human being and its basic structure understood.
If you can think of a case you would like to see handled that isn't handled
by the approach I am using, I would really appreciate it if you could
supply a URL for me to try out.
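
To make the frame-reading approach from my earlier message (quoted below)
concrete, here is a rough Python sketch. It is not the actual spider code:
the names TagCollector and merge_frameset, the fetch callback, and the
example.com pages are all invented for illustration. It only shows the idea
of reading every "frame src" into one big file and treating the "a href"
links found in those files as if they came from that file.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TagCollector(HTMLParser):
    """Collect <frame src> and <a href> values from one HTML document."""

    def __init__(self):
        super().__init__()
        self.frames = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "frame" and attrs.get("src"):
            self.frames.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])


def merge_frameset(url, fetch):
    """Return (merged_html, links) for the frameset page at `url`.

    `fetch` is a hypothetical callback that maps a URL to its HTML text;
    a real spider would do an HTTP request here.
    """
    top = TagCollector()
    top.feed(fetch(url))

    merged, links = [], []
    for src in top.frames:
        frame_url = urljoin(url, src)
        body = fetch(frame_url)
        merged.append(body)  # every frame goes into the one big file

        inner = TagCollector()
        inner.feed(body)
        # Resolve each link against its own frame, but report it as
        # belonging to the merged frameset document.
        links.extend(urljoin(frame_url, href) for href in inner.links)

    return "\n".join(merged), links


# Made-up site held in memory, standing in for real HTTP fetches.
PAGES = {
    "http://example.com/index.html":
        '<frameset><frame src="nav.html"><frame src="main.html"></frameset>',
    "http://example.com/nav.html": '<a href="a.html">A</a>',
    "http://example.com/main.html": '<p>Welcome</p><a href="/b.html">B</a>',
}

merged, links = merge_frameset("http://example.com/index.html",
                               PAGES.__getitem__)
```

Starting from index.html, the whole frameset is merged and both links are
picked up. Starting from a.html instead would give that page only, because
nothing in a.html says which frameset it belongs to; that is the case you
describe.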
From: Ron Samuel Klatchko [SMTP:email@example.com]
Sent: Saturday, February 26, 2000 1:29 AM
Cc: Multiple recipients of list
Subject: Re: [SWISH-E] Re: Swish-E and HTML documents with frames
Chris Humphries wrote:
> The way my system works, all the "frame src" links are read to create one
> big file, and *any* "a href" links found in any of those files are treated
> as if they were from that one big file. This means that to get at any <A>
> tags in the HTML pages you describe, one would need to set the spider to
> read to a depth of 2.
That's not what I'm worried about. If you find one of the pages
directly referenced from the frameset, then you get the entire
frameset. But if you find one of the pages that is indirectly
referenced, you get the page only. Is that behavior acceptable? The
first one is nicer but the second one will be more common.
Ron Samuel Klatchko - Software Jester
Brightmail Inc - firstname.lastname@example.org
Received on Sat Feb 26 06:47:13 2000