Re: Crawling Sub-domains

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Jan 04 2007 - 18:05:37 GMT
On Thu, Jan 04, 2007 at 09:50:46AM -0800, James wrote:
> I have been trying to spider/crawl an off-site sub-domain several times and
> it doesn't seem to be working.  I also seem to have a problem trying to
> spider/crawl a certain regular domain.  I can't figure out the problem.  I
> know there is a redirect, from the www to the non-www.  The spider picks up
> the robots.txt and nothing more.  Are there things I need to be aware of
> about the spider that are not in the documentation?

Just things that are in the docs. ;)  Did you turn on any of the
debugging features to find out why it's not fetching the pages you
think it should be fetching?


> Also, when will the spider be updated next?

In what way?


> And when will Swish-e be updated for UTF-8?

That's a large task, and it depends on when there's a big block of
developer time available.


> Also, I am concerned about something I read in the documentation about
> spidering sub-domains, that the index may point the links to the pages
> without the sub-domain.  In other words, sub.domain.com/mypage.html would be
> indexed as domain.com/mypage.html, unless some tweaking of the code is
> done.  Is this true?

There's a way to tell the spider that two host names are the same site
(used, for example, where a site's pages can be accessed with or
without the leading "www.").

-- 
Bill Moseley
moseley@hank.org

Unsubscribe from or help with the swish-e list: 
   http://swish-e.org/Discussion/

Help with Swish-e:
   http://swish-e.org/current/docs
   swish-e@sunsite.berkeley.edu
Received on Thu Jan 4 10:05:39 2007