Re: Crawling Sub-domains

From: James <swish.enhanced(at)not-real.gmail.com>
Date: Thu Jan 04 2007 - 18:07:29 GMT
I am also wondering if there is a way to get Swish-e's spider to
automatically follow links to subdomains of the same domain, without having
it follow off-site links to other domains.  Do you know what I mean?  I
would rather not have to spider each subdomain separately in the code, and
I would like Swish-e and/or its spider to keep track of the links related
to each sub-domain.
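
To make the rule I have in mind concrete, here is a rough sketch in Python.
It is not Swish-e code and does not use the spider's configuration; the
function names are made up for illustration and example.com stands in for
the real domain. It only shows the test I would like applied to each link:
follow it if its host is the base domain or any sub-domain of it, and keep
the full host name in the stored URL.

```python
from urllib.parse import urljoin, urlsplit


def host_of(url: str) -> str:
    """Return the lowercased hostname of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def in_domain(url: str, base_domain: str) -> bool:
    """True if the URL's host is base_domain itself or a sub-domain of it."""
    host = host_of(url)
    base = base_domain.lower().lstrip(".")
    # The leading dot stops "notexample.com" from matching "example.com".
    return host == base or host.endswith("." + base)


def links_to_follow(page_url, hrefs, base_domain):
    """Resolve hrefs against page_url and keep only in-domain http(s) links.

    The absolute URL is kept unchanged, so sub.example.com/mypage.html is
    recorded with its sub-domain intact.
    """
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        scheme = urlsplit(absolute).scheme
        if scheme in ("http", "https") and in_domain(absolute, base_domain):
            kept.append(absolute)
    return kept


if __name__ == "__main__":
    found = links_to_follow(
        "http://www.example.com/index.html",
        ["/a.html", "http://sub.example.com/mypage.html", "http://other.org/x"],
        "example.com",
    )
    for url in found:
        print(url)
```

Under this rule www.example.com and example.com count as the same site, so
a redirect from the www host to the non-www host would still be followed.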

By the way, it appears that my original post on this topic was the first on
the Swish-e discussion group for 2007!  So, Happy New Year, everyone!

On 1/4/07, James <swish.enhanced@gmail.com> wrote:
>
> I have been trying to spider/crawl an off-site sub-domain several times and
> it doesn't seem to be working.  I also seem to have a problem trying to
> spider/crawl a certain regular domain.  I can't figure out the problem.  I
> know there is a redirect, from the www to the non-www.  The spider picks up
> the robots.txt and nothing more.  Are there things I need to be aware of
> about the spider that are not in the documentation?  Also, when will the
> spider be updated next?  And when will Swish-e be updated for UTF-8?
>
> Also, I am concerned about something I read in the documentation about
> spidering sub-domains, that the index may point the links to the pages
> without the sub-domain.  In other words, sub.domain.com/mypage.html would be
> indexed as domain.com/mypage.html, unless some tweaking of the code is
> done.  Is this true?
>
> I know that though the questions are specific, some of the details are
> vague.  I apologize.  I would rather not post the actual URLs I am trying
> to crawl.

Received on Thu Jan 4 10:07:30 2007