Re: Crawling Sub-domains

From: James <swish.enhanced(at)not-real.gmail.com>
Date: Thu Jan 04 2007 - 18:23:20 GMT
I will try to turn on some more debugging features.

The spider is awesome, so don't get me wrong.  I just wondered if there was
anything in development.  Perhaps there are some things that can be done
with it to make it more robust.  Don't take that as an insult.  As I said,
the Swish-e spider is really awesome and does a good job.

I think the rewrite of the code for UTF-8 is critical.  You may not agree
that it is "critical", but crawling/indexing sites in foreign languages is
greatly hampered without it.  One of your open source competitors,
mnoGoSearch, works well with UTF-8.  I would still put Swish-e ahead of
mnoGoSearch for lots of reasons, but this is probably the #1 drawback
that may discourage someone from choosing Swish-e over mnoGoSearch.

Can you point me to more information about:

"There's a way to say that two domains are the same domain (used, for
example, where a site's pages can be accessed with or without the
leading 'www.')."
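
As a rough illustration of what that sentence describes, here is a minimal Python sketch of treating several host names as one site. It is not Swish-e code (the spider itself is a Perl script), and the names CANONICAL_HOST, SAME_HOSTS and canonicalize are hypothetical, made up for this example. As far as I recall, the spider documentation covers the real setting under a `same_hosts` key in the spider config, but please check the docs rather than rely on that.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical values for illustration only; not taken from Swish-e.
CANONICAL_HOST = "example.com"
SAME_HOSTS = {"example.com", "www.example.com"}


def canonicalize(url):
    """Rewrite the host of `url` to CANONICAL_HOST if it is one of SAME_HOSTS.

    Returns None for hosts that are not listed as equivalent, so a
    sub-domain such as sub.example.com is never silently folded into
    the main domain.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in SAME_HOSTS:
        return None
    # Keep an explicit port if the original URL had one.
    netloc = CANONICAL_HOST
    if parts.port is not None:
        netloc = f"{CANONICAL_HOST}:{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )
```

With a mapping like this, a www-to-non-www redirect lands on a host the crawler already treats as the same site, while a host that is not listed (a separate sub-domain, say) is left out instead of being indexed under the wrong name.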

Thanks for your help, Bill!  I truly appreciate your efforts!

On 1/4/07, Bill Moseley <moseley@hank.org> wrote:
>
> On Thu, Jan 04, 2007 at 09:50:46AM -0800, James wrote:
> > I have been trying to spider/crawl an off-site sub-domain several times and
> > it doesn't seem to be working.  I also seem to have a problem trying to
> > spider/crawl a certain regular domain.  I can't figure out the problem.  I
> > know there is a redirect, from the www to the non-www.  The spider picks up
> > the robots.txt and nothing more.  Are there things I need to be aware of
> > about the spider that are not in the documentation?
>
> Just things that are in the docs. ;)  Did you turn on any of the
> debugging features to find out why it's not fetching the pages you
> think it should be fetching?
>
>
> > Also, when will the spider be updated next?
>
> In what way?
>
>
> > And when will Swish-e be updated for UTF-8?
>
> That's a large task, and it depends on when there's a big block of
> developer time available.
>
>
> > Also, I am concerned about something I read in the documentation about
> > spidering sub-domains, that the index may point the links to the pages
> > without the sub-domain.  In other words, sub.domain.com/mypage.html would be
> > indexed as domain.com/mypage.html, unless some tweaking of the code is
> > done.  Is this true?
>
> There's a way to say that two domains are the same domain (used, for
> example, where a site's pages can be accessed with or without the
> leading "www.").
>
> --
> Bill Moseley
> moseley@hank.org
>
> Unsubscribe from or help with the swish-e list:
>    http://swish-e.org/Discussion/
>
> Help with Swish-e:
>    http://swish-e.org/current/docs
>    swish-e@sunsite.berkeley.edu
>
>



Received on Thu Jan 4 10:23:21 2007