Re: [swish-e] Passing username and password when spidering restricted websites

From: William Conlon <bill(at)not-real.tothept.com>
Date: Fri Jun 18 2010 - 06:30:06 GMT
My recollection is that the spider Perl script (spider.pl) provides
for HTTP Basic or Digest authentication. If you want to spider a web
application that requires login credentials, then in addition to using
cookies, you need to provide credentials to the application itself.
For me it was easier to create a back door in my app through which the
swish-e spider could authenticate via GET; human users would still
enter their credentials and POST the login form.
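
A rough sketch of what such a spider config might look like, in the
same format as the config quoted below. The login path and the query
parameter names (login.action, user, pass) are made up for
illustration; your app would have to expose something equivalent.
I am also assuming the use_cookies option behaves as described in the
spider.pl documentation, so that the session cookie set by the login
request is sent with the requests that follow.

```perl
# spider.config: sketch only, not a tested configuration.
@servers = (
    {
        # HYPOTHETICAL back-door URL: the app accepts credentials
        # via GET here and answers with a session cookie plus a page
        # that links to the content to be indexed.
        base_url    => 'http://restricted-website.com/login.action?user=username&pass=password',
        email       => 'my@email.com',
        delay_sec   => 0,

        # Keep a cookie jar while spidering so the session survives
        # from one request to the next (assumed spider.pl option).
        use_cookies => 1,
    },
);

1;
```

Note that this leaves out the "credentials" setting on purpose: that
setting is for HTTP authentication, whereas here the application does
its own login. Since the password ends up in the URL, a back door like
this should only be reachable from the indexing host.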

On Jun 15, 2010, at 10:07 PM, Troy Wical wrote:

> On Jun 15, 2010, at 7:30 PM, Peter Karman wrote:
>
>> Troy Wical wrote on 6/15/10 9:09 AM:
>>> Had my down time, now getting back into this again. This time it's  
>>> for the workplace. We have several internal documentation sites,  
>>> and searching all of them individually can be a pain. So I decided to
>>> spider all of them and make them all searchable via swish.cgi.  I  
>>> have it working fairly well so far, but am having a hard time  
>>> spidering sites that require authentication.  All the sites are  
>>> being indexed individually, and this is the basic conf that I am  
>>> using:
>>>
>>> ###############################
>>>
>>> IndexDir spider.pl
>>> SwishProgParameters default http://restricted-website.com/dir/index.php
>>> IndexFile /path/to/indexes/restricted-website.index
>>> StoreDescription HTML* <body> 200000
>>>
>>
>> Instead of "default" above you need to create a spider config file  
>> and put
>> "credentials" in it:
>>
>> http://swish-e.org/docs/spider.html#credentials
>
>
> Gave that a shot, but no luck. Below is the config I am working with.
>
> ###########################
> @servers = (
>        {
>            base_url    => 'http://restricted-website.com',
>            email       => 'my@email.com',
>            delay_sec   => '0',
>            credentials => 'username:password',
>        },
>    );
> ###########################
>
> When I run "./spider.pl spider.config > output.txt" I get the  
> following:
>
> ###########################
> Use of uninitialized value in sprintf at /usr/lib/swish-e/spider.pl  
> line 38.
> Use of uninitialized value in sprintf at /usr/lib/swish-e/spider.pl  
> line 38.
> /usr/lib/swish-e/spider.pl: Reading parameters from 'spider.config'
> Warning: document 'http://restricted-website.com' has no content
>
> Summary for: http://restricted-website.com
> Connection: Close: 1  (1.0/sec)
>      Total Bytes: 1  (1.0/sec)
>       Total Docs: 1  (1.0/sec)
>      Unique URLs: 1  (1.0/sec)
> ###########################
>
> Now, there are two things that I have noticed. When I login to this  
> website via browser, the URL ends in dashboard.action, as opposed to
> something more common like .php etc. Also, the pop up window to  
> login is being handled by a second url that takes care of all the  
> authentication. I'm wondering if this isn't throwing a curve ball to  
> swish-e when it comes to logging in.
>
> Troy
> _______________________________________________
> Users mailing list
> Users@lists.swish-e.org
> http://lists.swish-e.org/listinfo/users

Received on Fri Jun 18 02:30:16 2010