Re: [swish-e] Passing username and password when spidering restricted websites

From: Troy Wical <troy(at)not-real.wical.com>
Date: Wed Jun 16 2010 - 14:06:31 GMT
> On Jun 16, 2010, at 7:31 AM, Peter Karman wrote:
> 
>> Troy Wical wrote on 06/16/2010 12:07 AM:
>> 
>> Now, there are two things that I have noticed. When I log in to this
>> website via a browser, the URL ends in dashboard.action, as opposed to
>> something more common like .php. Also, the pop-up login window
>> is handled by a second URL that takes care of all the
>> authentication. I'm wondering if this isn't throwing a curve ball to
>> swish-e when it comes to logging in.
>> 
> 
> I'm sure it is. spider.pl only uses the HTTP Basic authentication
> mechanism.
> 
> Try turning on debugging to confirm:
> http://swish-e.org/docs/spider.html#debug

With debugging turned on:

##########################################################
user@host:/home/www/search# /usr/lib/swish-e/spider.pl spider.config > spider-output.txt
Use of uninitialized value in sprintf at /usr/lib/swish-e/spider.pl line 38.
Use of uninitialized value in sprintf at /usr/lib/swish-e/spider.pl line 38.
Argument "DEBUG_REDIRECTS" isn't numeric in bitwise or (|) at spider.config line 1.
/usr/lib/swish-e/spider.pl: Reading parameters from 'spider.config'

 -- Starting to spider: http://restricted-website.com/dashboard.action --

vvvvvvvvvvvvvvvv HEADERS for http://restricted-website.com/dashboard.action vvvvvvvvvvvvvvvvvvvvv

---- Request ------
GET http://restricted-website.com/dashboard.action
Accept-Encoding: gzip, x-gzip, deflate
Authorization: Basic ZGVub3BzOk11TmszRmIxMQ==
From: my@email.com
User-Agent: swish-e http://swish-e.org/


---- Response ---
Status: 200 OK
Cache-Control: no-cache, must-revalidate
Connection: Keep-Alive
Date: Wed, 16 Jun 2010 13:58:07 GMT
Server: Apache-Coyote/1.1
Content-Length: 0
Content-Type: text/html;charset=UTF-8
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Client-Date: Wed, 16 Jun 2010 13:58:07 GMT
Client-Peer: 10.254.228.100:80
Client-Response-Num: 1
Keep-Alive: timeout=5, max=99
Set-Cookie: JSESSIONID=5D6893B657037E417589BCABDBD20C74; Path=/
X-Confluence-Request-Time: 1276696687212

^^^^^^^^^^^^^^^ END HEADERS ^^^^^^^^^^^^^^^^^^^^^^^^^^

>> +Fetched 0 Cnt: 1 GET  http://restricted-website.com/dashboard.action  200 OK text/html ??? parent: depth:0
Warning: document 'http://restricted-website.com/dashboard.action' has no content

Summary for: http://restricted-website.com/dashboard.action
Connection: Close: 1  (1.0/sec)
      Total Bytes: 1  (1.0/sec)
       Total Docs: 1  (1.0/sec)
      Unique URLs: 1  (1.0/sec)
user@host:/home/www/search#
##########################################################
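
For anyone reading the log above: the request did carry an Authorization: Basic header, yet the server answered 200 OK with Content-Length: 0 and a fresh JSESSIONID cookie. That looks consistent with a site that ignores Basic credentials and expects a form login tied to a session cookie, though the log alone does not prove it. Below is a minimal Python sketch, not part of spider.pl, of how a Basic header is built and of a rough check for that response pattern. The function names and the sample credentials are made up for illustration.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Return the value of an HTTP Basic 'Authorization' header.

    Basic auth is just base64("username:password") prefixed with 'Basic '.
    """
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def looks_like_session_login(status: int, headers: dict) -> bool:
    """Heuristic only (an assumption, not a rule): a 200 response with an
    empty body that also sets a cookie hints that the server wants a
    session-based login rather than Basic auth."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return bool(
        status == 200
        and lowered.get("content-length") == "0"
        and "set-cookie" in lowered
    )


# Hypothetical credentials, only to show the header format.
print(basic_auth_header("user", "pass"))
print(looks_like_session_login(
    200, {"Content-Length": "0", "Set-Cookie": "JSESSIONID=abc; Path=/"}))
```

Note that Base64 is an encoding, not encryption, so anyone who sees the header can recover the username and password.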

I will read up more on get_password and on using POST. A brief look at both left me a bit confused, but hacking spider.pl sounds like fun too :)
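
To make the POST idea concrete, here is a rough sketch in Python (not Perl, and not how spider.pl works) of a form login that keeps the session cookie for later requests. Everything site-specific is an assumption: the login URL and the form field names "username" and "password" are placeholders, and the real ones would have to be read from the login form's HTML.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_opener():
    """Build an opener that stores cookies (e.g. JSESSIONID) between requests."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, username, password,
                        user_field="username", pass_field="password"):
    """Create a POST request carrying the login form fields.

    user_field and pass_field are hypothetical names; check the real form.
    """
    body = urllib.parse.urlencode({user_field: username, pass_field: password})
    # Supplying data makes urllib send the request as a POST.
    return urllib.request.Request(login_url, data=body.encode("ascii"))


def fetch_after_login(opener, login_request, page_url):
    """Log in, then fetch a page with the same opener so the cookie is reused."""
    opener.open(login_request, timeout=30).close()
    with opener.open(page_url, timeout=30) as response:
        return response.read()


# Building the request does not touch the network; the URL is a placeholder.
request = build_login_request("http://example.com/login.action", "alice", "s3cret")
print(request.get_method(), request.full_url)
```

If a login like this works by hand, the same session cookie is what the spider would need to send with each request.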

Thanks, Troy Wical
Received on Wed Jun 16 10:06:35 2010