Re: [swish-e] Passing username and password when spidering restricted websites

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Wed Jun 16 2010 - 13:31:33 GMT

Troy Wical wrote on 06/16/2010 12:07 AM:

>
> When I run "./spider.pl spider.config > output.txt" I get the following:
>
> ###########################
> Use of uninitialized value in sprintf at /usr/lib/swish-e/spider.pl line 38.
> Use of uninitialized value in sprintf at /usr/lib/swish-e/spider.pl line 38.

You can ignore those uninitialized-value warnings; they are fixed in svn.

> /usr/lib/swish-e/spider.pl: Reading parameters from 'spider.config'
> Warning: document 'http://restricted-website.com' has no content
>
> Summary for: http://restricted-website.com
> Connection: Close: 1  (1.0/sec)
>       Total Bytes: 1  (1.0/sec)
>        Total Docs: 1  (1.0/sec)
>       Unique URLs: 1  (1.0/sec)
> ###########################


> 
> Now, there are two things that I have noticed. When I login to this
> website via browser, the url ends in dashboard.action, as opposed to
> something more common like .php etc. Also, the pop up window to login
> is being handled by a second url that takes care of all the
> authentication. I'm wondering if this isn't throwing a curve ball to
> swish-e when it comes to logging in.
> 

I'm sure it is. spider.pl only uses the HTTP Basic authentication
mechanism, so it will not handle a login form served from a separate
URL on its own.
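
For reference, HTTP Basic authentication amounts to sending one extra
request header built from the username and password. A minimal sketch in
Python (not spider.pl code; the function name is made up for
illustration):

```python
import base64


def basic_auth_header(username, password):
    """Build the value of the Authorization header for HTTP Basic auth."""
    # The credentials are joined with a colon and base64-encoded.
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Basic {token}"


if __name__ == "__main__":
    # Hypothetical credentials, for illustration only.
    print("Authorization:", basic_auth_header("someuser", "somepass"))
```

A site that logs you in through a form and a session cookie never looks
at this header, which would explain a "no content" result.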

Try turning on debugging to confirm:
http://swish-e.org/docs/spider.html#debug

You probably need to hack spider.pl or use the get_password callback to
do the authentication piece before the spider actually does its work. If
that second window sets a cookie, you could do a POST to that login URL
with your credentials, capture the returned cookie, and set it on the
spider.pl user agent so it is sent for the rest of the site.
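
A rough sketch of that flow, in Python rather than as a patch to
spider.pl, just to show the HTTP exchange. Everything specific here is
an assumption: the /login path, the username/password form fields, the
"session" cookie name, and the tiny DemoSite server that stands in for
the restricted site so the example runs without network access. In
spider.pl itself the equivalent step would be done on its Perl user
agent instead.

```python
import http.cookiejar
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoSite(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the restricted site."""

    def do_POST(self):
        # Accept a form login on /login and hand back a session cookie.
        length = int(self.headers.get("Content-Length", 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode())
        ok = (self.path == "/login"
              and form.get("username") == ["demo"]
              and form.get("password") == ["secret"])
        self.send_response(200 if ok else 403)
        if ok:
            self.send_header("Set-Cookie", "session=abc123; Path=/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        # Serve content only when the session cookie is presented.
        authed = "session=abc123" in self.headers.get("Cookie", "")
        body = b"dashboard" if authed else b""
        self.send_response(200 if authed else 401)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def login_and_fetch(base_url, username, password):
    """POST credentials to the login URL, then reuse the cookie for a page."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),  # ignore any proxy set in the environment
        urllib.request.HTTPCookieProcessor(jar),  # store and resend cookies
    )
    form = urllib.parse.urlencode({"username": username, "password": password})
    # Step 1: log in; the Set-Cookie from the response lands in the jar.
    opener.open(base_url + "/login", data=form.encode()).close()
    # Step 2: later requests through the same opener carry the cookie.
    with opener.open(base_url + "/dashboard.action") as resp:
        return resp.read().decode(), [cookie.name for cookie in jar]


if __name__ == "__main__":
    server = HTTPServer(("127.0.0.1", 0), DemoSite)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}"
    page, cookies = login_and_fetch(base, "demo", "secret")
    print(cookies, page)
    server.shutdown()
```

Against a real site you would need the actual login URL and form field
names, which you can usually read from the login form's HTML.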

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Wed Jun 16 09:31:35 2010