Re: crawling protected site

From: intervolved none <intervolved(at)not-real.yahoo.com>
Date: Thu May 12 2005 - 15:22:37 GMT
I have tried it, but it basically behaves the same way (401 error).
 
In IIS the authentication is set to "Integrated Windows authentication", which is NTLM authentication. I checked the Perl documentation and the spider.pl program but could not get it working; I still got a 401 error.
 
Document that discusses NTLM authentication: \Swishe242\perl\html\site\lib\LWP\Authen\Ntlm.html

Has anyone added NTLM authentication to spider.pl or know how to get it working?
 
Thanks
 
I basically used this code to run some tests. I do not write Perl that often, so I may have made mistakes when adapting it. I also put the following code in a standalone Perl program, but that returned a 400 error.
 
***********************************************************************
# code from \perl\html\site\lib\LWP\Authen\Ntlm.html
use LWP::UserAgent;
use HTTP::Request::Common;
my $url = 'http://www.company.com/protected_page.html';

# Set up the ntlm client and then the base64 encoded ntlm handshake message
my $ua = new LWP::UserAgent(keep_alive=>1);
$ua->credentials('www.company.com:80', '', "MyDomain\\MyUserCode", 'MyPassword');

$request = GET $url;
print "--Performing request now...-----------\n";
$response = $ua->request($request);
print "--Done with request-------------------\n";

if ($response->is_success) {print "It worked!->" . $response->code . "\n"}
else {print "It didn't work!->" . $response->code . "\n"}

 

*******************************************
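
When a test like the one above keeps returning 401, it can help to see which authentication schemes the server actually advertises. IIS with Integrated Windows authentication may offer Negotiate as well as NTLM. The following is only a minimal sketch: offered_schemes is a hypothetical helper name, not part of LWP or spider.pl, and the demo header values are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: given the values of a response's WWW-Authenticate
# header(s), return the scheme name that starts each challenge.
sub offered_schemes {
    my @challenges = @_;
    my @schemes;
    for my $challenge (@challenges) {
        # The scheme is the first token, e.g. "Basic" in 'Basic realm="x"'.
        push @schemes, $1 if $challenge =~ /^\s*([A-Za-z0-9_-]+)/;
    }
    return @schemes;
}

# Offline demonstration with made-up header values.
my @demo = ('Negotiate', 'NTLM', 'Basic realm="intranet"');
print join(', ', offered_schemes(@demo)), "\n";
```

With a real LWP response the call would look like offered_schemes($response->header('WWW-Authenticate')), since header() returns every value of a repeated header in list context. If NTLM is listed and the request still fails, check that the Authen::NTLM module the LWP::Authen::Ntlm documentation asks for is installed and that the host:port passed to credentials() matches the URL being fetched.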

 

Peter Karman <peter@peknet.com> wrote:
have you tried spider.pl instead? Much better than the -S http method.

I expect that the -S http method will, in fact, be deprecated in a future version.
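
For reference, switching from the -S http method to spider.pl might look roughly like the fragment below. It is a sketch under assumptions, not a tested setup: the file name and URL are taken from the original post, and "default" tells spider.pl to use its built-in settings instead of a spider config file.

```text
# mytestsite.config (sketch; leave the other directives as they are)
IndexDir spider.pl
SwishProgParameters default http://mysite.com/main.html

# then index with the prog method instead of http:
#   swish-e.exe -v 3 -S prog -c mytestsite.config
```

On Windows you may need to give the full path to spider.pl or run it through perl. As far as I know, MaxDepth and Delay apply only to the http method, so the equivalent limits would go in a spider config file.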

intervolved none scribbled on 5/10/05 4:33 PM:
> I need to crawl a website that is protected by Windows authentication, but when swish-e tries to crawl it, the server returns a 401 error. I pass the username and password in the URL, the same way I tried in IE ( http://username:password@www.somedomain.com ), and it does not work with swish-e. I have attached a condensed config file and the output generated when I run the command to index the site. Thanks in advance.
> 
> 
> c:> type mytestsite.config (subset of config file)
> 
> MaxDepth 0
> Delay 0
> IndexContents HTML2 .htm .html .shtml
> IndexContents TXT .pdf 
> IndexFile newprimarycare.index
> StoreDescription HTML2 200
> StoreDescription TXT 200
> DefaultContents HTML2 
> IndexDir http://myusrname:mypassword@mysite.com/main.html
> 
> 
> 
> c:> swish-e.exe -v 3 -S http -c "mytestsite.config"
> 
> ..
> Now fetching "http://myusrname:mypassword@mysite.com/main.html"... Status: 401.
> ..
> 
> 
> 
> 
> 
> *********************************************************************
> Due to deletion of content types excluded from this list by policy,
> this multipart message was reduced to a single part, and from there
> to a plain text message.
> *********************************************************************

-- 
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com


Received on Thu May 12 08:22:50 2005