Re: HTTP indexing: internal site spidering

From: Ron Samuel Klatchko <rsk(at)not-real.corpmail.brightmail.com>
Date: Tue May 30 2000 - 23:54:46 GMT
> I meant that the web pages I want indexed are part of a company
> INTRANET,in that, there definitely are links on the main page( page to start
> spidering from) only to web sites that are within the company intranet

Okay, in this sentence you talk about pages and sites.  I need to know
whether you are describing your environment exactly or whether you are
being inexact in your phrasing.  What is your definition of "web site"
as used in the above question?  For that matter, what is your definition
of an "intranet"?

moo

arajamani@excite.com wrote:
> 
> Hello everyone and Mr. Klatchko,
>   I agree with you, Mr. Klatchko, when you say that in the HTTP method we
> don't look at the file system at all. Also, when I said "NOT visible to the
> WWW" I meant that the web pages I want indexed are part of a company
> INTRANET,in that, there definitely are links on the main page( page to start
> spidering from) only to web sites that are within the company intranet and
> not to any WWW sites. I want these internal links to be spidered. I really
> appreciate your taking time out to answer my questions.
> Sincerely,
> Ashok
> 
> On Fri, 26 May 2000 15:56:32 -0700 (PDT), rsk@corpmail.brightmail.com wrote:
> 
> >  arajamani@excite.com wrote:
> >  >   Thanks for pointing out the errors. I have gone ahead and changed the
> >  > config file and the HTTP indexing works just fine! (I have enclosed the
> >  > modified config file.) However, it is unable to spider down the links
> >  > and index them too. All the links are a part of intra-net and are NOT
> >  > visible to the WWW. Is this what's preventing the spider from spidering
> >  > down? Thanks once again for your help.
> >
> >  The spider works by indexing the first page (depth 1).  It then finds
> >  all links on that page that are on the same (or equivalent as defined in
> >  the config file) server.  It then indexes each of those pages (depth 2)
> >  and follows those links.  It does this until it reaches its max depth
> >  or all files on a server are indexed.
> >
> >  The most important thing is that it can only find pages that you tell it
> >  to index or that it can find a URL on one of the pages it indexes.  If
> >  your comment that they are "NOT visible to the WWW" means there are no
> >  links to the pages, then no, they won't be indexed.  How would the
> >  spider know they exist?  (And don't suggest that it look at the file
> >  system; the HTTP method was built to index foreign sites where it has no
> >  access to the fs.)
> >
> >  moo
> >  ------------------------------------------------------------
> >          Ron Samuel Klatchko - Senior Software Jester
> >              Brightmail Inc - rsk@brightmail.com
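
The depth-limited crawl described in the quoted message above can be
sketched roughly as follows.  This is only an illustration in Python, not
the spider's actual code: the crawl() function, the get_links callable,
the equivalent_hosts parameter (standing in for the "equivalent" servers
from the config file) and the intranet.example URLs are all made up for
the example, and a toy dictionary replaces real HTTP fetching.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit


def crawl(start_url, get_links, max_depth, equivalent_hosts=()):
    """Return {url: depth} for every page reachable from start_url.

    get_links(url) is a hypothetical callable returning the hrefs found
    on a page; a real spider would fetch the page over HTTP and parse it.
    """
    # Only the start server and any servers declared equivalent are followed.
    allowed = {urlsplit(start_url).hostname, *equivalent_hosts}
    seen = {start_url: 1}              # the start page is depth 1
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = seen[url]
        if depth >= max_depth:
            continue                   # page is indexed, links are not followed
        for href in get_links(url):
            link = urljoin(url, href)  # resolve relative links
            if urlsplit(link).hostname not in allowed:
                continue               # link points at a different server
            if link not in seen:
                seen[link] = depth + 1
                queue.append(link)
    return seen


# Toy intranet: each page mapped to the links it contains (assumed data).
SITE = {
    "http://intranet.example/": ["/hr.html", "http://www.example.org/"],
    "http://intranet.example/hr.html": ["forms.html"],
    "http://intranet.example/forms.html": [],
    "http://intranet.example/orphan.html": [],  # no page links here
}

pages = crawl("http://intranet.example/", lambda u: SITE.get(u, []), max_depth=3)
```

In this toy run orphan.html never shows up in the result, because no
indexed page links to it, and the link to www.example.org is skipped
because it is on a different server.  Whether a page is reachable from
the public WWW plays no part; only the links on the indexed pages matter.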

-- 
------------------------------------------------------------
        Ron Samuel Klatchko - Senior Software Jester
            Brightmail Inc - rsk@brightmail.com
Received on Tue May 30 19:57:23 2000