AutoSwish - How to index non-linked pages

From: PropheZine Owner <bob(at)not-real.prophezine.com>
Date: Sat Feb 26 2000 - 11:57:30 GMT
Hi:

Let's say I want to index a site with 1,000 pages that nothing links to: an
archive that is currently indexed with the swish file system method.

How would these pages be indexed using the HTTP method?  What would be a
good method?  If I did not have an index.html page, the web server's
generated index would link to all the pages, so a spider could reach them.
But I want an index.html page there so people cannot get the list of files
in the directory.

Is there a good method that someone is already using?
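
One idea, offered only as a rough sketch and not as something known to be in
use: generate a separate page that links to every file in the archive, save
it under a name that is not advertised anywhere, and hand that URL to the
spider as its starting point. The function name build_link_page, the
example.com URL and the list of extensions are all made up for illustration;
nothing here is part of swish-e or AutoSwish.

```python
import html
import os
from urllib.parse import quote


def build_link_page(doc_root, base_url, extensions=(".html", ".htm")):
    """Return an HTML page with one link per matching file under doc_root.

    doc_root   -- directory on disk that holds the archive (assumed layout)
    base_url   -- URL the web server maps to doc_root (assumed mapping)
    extensions -- file suffixes worth indexing (illustrative default)
    """
    links = []
    for dirpath, _dirnames, filenames in os.walk(doc_root):
        for name in filenames:
            if not name.lower().endswith(extensions):
                continue
            # Path relative to the archive root, with URL-style separators.
            rel = os.path.relpath(os.path.join(dirpath, name), doc_root)
            rel = rel.replace(os.sep, "/")
            url = base_url.rstrip("/") + "/" + quote(rel)
            links.append('<a href="%s">%s</a><br>'
                         % (html.escape(url, quote=True), html.escape(rel)))
    links.sort()
    return "<html><body>\n" + "\n".join(links) + "\n</body></html>\n"


if __name__ == "__main__":
    # Hypothetical paths; adjust to the real archive before use.
    print(build_link_page("/var/www/archive", "http://example.com/archive/"))
```

The real index.html stays in place, so visitors still cannot browse the
directory; only someone who knows the name of the generated page sees the
list. The page has to be regenerated whenever files are added.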

Thanks.

Bob

-----Original Message-----
From: swish-e@sunsite.berkeley.edu
[mailto:swish-e@sunsite.berkeley.edu]On Behalf Of Chris Humphries
Sent: Saturday, February 26, 2000 6:42 AM
To: Multiple recipients of list
Subject: [SWISH-E] Re: Swish-E and HTML documents with frames


Dear Ron,

You wrote: "But if you find one of the pages that is indirectly referenced,
you get the page only."

This is very true, and if one were spidering indiscriminately, it would be
a problem because there is probably no way of knowing that the page you had
found *was* indirectly referenced. However, most of my indexing so far has
been just the first page of a Web site, which means that my approach to
reading through the frames is probably safe. Each Web site will already
have been looked at by a human being and its basic structure understood.

If you can think of a case you would like to see handled that isn't handled
by the approach I am using, I would really appreciate it if you could
supply a url for me to try out.

Many thanks,

Chris Humphries

-----Original Message-----
From:	Ron Samuel Klatchko [SMTP:rsk@brightmail.com]
Sent:	Saturday, February 26, 2000 1:29 AM
To:	ChrisJMH@vermilion99.freeserve.co.uk
Cc:	Multiple recipients of list
Subject:	Re: [SWISH-E] Re: Swish-E and HTML documents with frames

Chris Humphries wrote:
> The way my system works, all the "frame src" links are read to create one
> big file, and *any* "a href" links found in any of those files are returned
> as if they were from that one big file. This means that to get at any <A>
> tags in the HTML pages you describe, one would need to set the spider to
> read to a depth of 2.

That's not what I'm worried about.  If you find one of the pages
directly referenced from the frameset, then you get the entire
frameset.  But if you find one of the pages that is indirectly
referenced, you get the page only.  Is that behavior acceptable?  The
first one is nicer but the second one will be more common.
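
To make the two cases concrete, here is a minimal sketch of the behavior
described in the quoted text. It is not Chris's actual code; the names
links_via_frames and fetch are invented for illustration, and fetch is
assumed to be any function that takes a URL and returns the page's HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _TagCollector(HTMLParser):
    """Collect <frame src> and <a href> values from one HTML document."""

    def __init__(self):
        super().__init__()
        self.frames = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "frame" and attrs.get("src"):
            self.frames.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])


def links_via_frames(url, fetch):
    """Return absolute <a href> links for url, treating its frames as one file.

    Links found in each framed page are reported as if they came from the
    page at url. Only one level of frames is followed in this sketch.
    """
    top = _TagCollector()
    top.feed(fetch(url))
    links = [urljoin(url, href) for href in top.links]
    for src in top.frames:
        frame_url = urljoin(url, src)
        inner = _TagCollector()
        inner.feed(fetch(frame_url))
        # Resolve relative links against the framed page's own URL.
        links.extend(urljoin(frame_url, href) for href in inner.links)
    return links
```

Called on the frameset URL, this returns the links from every framed page.
Called on a page that is only reached through one of those links, it finds
no frames and returns that page's links alone, which is the second case.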

moo
------------------------------------------------------------
           Ron Samuel Klatchko - Software Jester
            Brightmail Inc - rsk@brightmail.com
Received on Sat Feb 26 07:03:54 2000