Re: [swish-e] How do I index via HTTP when authentication is

From: William M Conlon <bill(at)not-real.tothept.com>
Date: Wed Feb 20 2008 - 17:30:00 GMT
What does the debug show when swish-e GETs the login page?

Do you need cookies enabled?

Is the username szUsername (as used for the form POST), or is it szID  
as in the swish-e GET below?
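
If cookies turn out to be the problem, something like the following might be worth trying. This is only an untested sketch: it reuses the settings from your config quoted below, adds use_cookies (the option from my earlier snippet), and assumes the logout link really is "/logout/" as you describe. If so, your current test_url, which only matches when "login" is in the path, would not skip it, so the sketch tests for "logout" in either the path or the query string instead.

```perl
# SwishSpiderConfig.pl (sketch; credentials are the placeholders from the post)
@servers = ({
    base_url    => 'https://www.venmarces.com/login/?szID=idhere&szPWD=passwordhere',
    same_hosts  => [ qw/www.venmarces.com/ ],
    agent       => 'swish-e spider http://swish-e.org/',
    email       => 'webmaster@domainnamehere.com',

    use_cookies => 1,   # keep the session cookie handed out by the login page
    keep_alive  => 1,
    delay_sec   => 1,

    # Assumption: the logout link is "/logout/". Skip any URL that mentions
    # logout in its path or query string so the session is not ended mid-crawl.
    test_url    => sub {
        my $uri = shift;
        return if $uri->path =~ /logout/;
        return if ( $uri->query || '' ) =~ /logout/;
        return 1;
    },

    debug       => DEBUG_URL | DEBUG_SKIPPED | DEBUG_HEADERS,
});
1;
```

With DEBUG_HEADERS on, the output for the login request should show whether the server sends a Set-Cookie header and whether later requests send the cookie back.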

Bill


On Feb 20, 2008, at 9:10 AM, Adam Douglas wrote:

> Hi. I have implemented this method to index the part of my web site
> that requires authentication. I have also altered my login to allow
> Swish-e to log in via a URL query string, and I manually verified that
> logging in via the query string works. However, when I index the web
> site it does not log in and just acts as a non-authenticated client. I
> am obviously missing something here. Any suggestions as to what I'm
> doing wrong?
>
> I index the web site like so, "swish-e -S prog -c
> swishe.venmarces.private.conf".
> Here is what I have in my SwishSpiderConfig.pl. The rest is just
> comments.
>
>  @servers = ({
>
>         base_url    =>
> 'https://www.venmarces.com/login/?szID=idhere&szPWD=passwordhere',
>         same_hosts  => [ qw/www.venmarces.com/ ],
>         agent       => 'swish-e spider http://swish-e.org/',
>         email       => 'webmaster@domainnamehere.com',
>
>         # skip login URLs whose query string contains "logout"
>         test_url    => sub {
>                   my  $ok =  !($_[0]->path =~ /login/ &&
>                   $_[0]->query =~ /logout/);
>                   return 1 if $ok;
>                   return; },
>
>         delay_sec   => 1,         # Delay in seconds between requests
>         max_time    => 10,        # Max time to spider in minutes
>         max_files   => 100,       # Max Unique URLs to spider
>         max_indexed => 20,        # Max number of files to send to swish for indexing
>         keep_alive  => 1,         # enable keep-alive requests
>         debug       => DEBUG_URL | DEBUG_SKIPPED | DEBUG_HEADERS,
>     } );
>     1;
>
> Also, how would I get Swish-e, once it finishes indexing, to go to the
> URL "/logout/" to log out of the web site, ending the
> session/authentication?
>
> By the way, I have the following configuration files setup.
>
> Search configuration - .venmarces.public.swishcgi.conf
> Search configuration - .venmarces.private.swishcgi.conf
> Swish-e configuration - swishe.venmarces.public.conf
> Swish-e configuration - swishe.venmarces.private.conf
> Spider.pl configuration - SwishSpiderConfig.pl (I have not made one
> for public yet).
>
> Best,
> Adam
>
>>> I'm not sure why it's any more dangerous to require/allow the swish-e
>>> spider to log in to an application than any other user agent that
>>> presents credentials.  In fact, for a public-facing application, far
>>> more checks can be applied (username/password; IP address;
>>> one-of-a-kind user agent) to the spider than is feasible with a
>>> normal user's login.
>>>
>>> Merely enabling cookies by itself presents just as much risk of
>>> forgery.
>>>
>>> Anyway, here's a snip from my @servers:
>>>
>>> @servers = (
>>>          {
>>>          base_url    => 'http://my.domain.com/login.app?_function=checkpw&userid=swishe&password=swishe&remember=no',
>>>          use_cookies => 1,
>>> #        debug => DEBUG_URL | DEBUG_SKIPPED | DEBUG_FAILED | DEBUG_HEADERS,
>>>          delay_sec => 1,
>>>          test_url    => sub {
>>>                  my  $ok =  !($_[0]->path =~ /login.app/ &&
>>>                  $_[0]->query =~ /_function=logout/);
>>>                  return 1 if $ok;
>>>                  return; },
>>> ...
>>>
>>> Essentially, the spider logs in as the user 'swishe' so it sees the
>>> same content as any similarly privileged user.
>>> remember=no means don't give swish-e a long-term cookie to
>>> re-authenticate with.
>>> use_cookies allows the application to provide, and swish-e to use,
>>> the session cookies needed for access.
>>> test_url keeps the spider from following a link to log out, to
>>> assure we follow all links.
>

Received on Wed Feb 20 12:30:06 2008