Re: [swish-e] How do I index via HTTP when authentication is

From: Adam Douglas <ADouglas(at)not-real.venmarces.com>
Date: Wed Feb 20 2008 - 17:10:22 GMT
Hi. I have implemented this method to index the part of my web site
that requires authentication. I have also altered my login page so that
Swish-e can log in via a URL query string, and I verified manually that
logging in through the query string works. However, when I index the web
site, the spider does not log in and simply acts as a non-authenticated
client. I am obviously missing something here. Any suggestions as to
what I'm doing wrong?

I index the web site like so, "swish-e -S prog -c
swishe.venmarces.private.conf".
Here is what I have in my SwishSpiderConfig.pl. The rest is just
comments.

@servers = ({

        base_url    => 'https://www.venmarces.com/login/?szID=idhere&szPWD=passwordhere',
        same_hosts  => [ qw/www.venmarces.com/ ],
        agent       => 'swish-e spider http://swish-e.org/',
        email       => 'webmaster@domainnamehere.com',

        # skip URLs whose path matches /login/ and whose query matches /logout/
        test_url    => sub {
                  my $ok = !($_[0]->path =~ /login/ &&
                             $_[0]->query =~ /logout/);
                  return 1 if $ok;
                  return; },

        delay_sec   => 1,         # Delay in seconds between requests
        max_time    => 10,        # Max time to spider in minutes
        max_files   => 100,       # Max Unique URLs to spider
        max_indexed => 20,        # Max number of files to send to swish for indexing
        keep_alive  => 1,         # Enable keep-alive requests
        debug       => DEBUG_URL | DEBUG_SKIPPED | DEBUG_HEADERS,
    } );
    1;

Also, how would I get Swish-e to visit the URL "/logout/" when it
finishes indexing, so that it logs out of the web site and ends the
session/authentication?
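
One possible approach, offered only as an untested sketch: spider.pl
documents a spider_done callback that runs after a server has been
processed and receives the server hash. The sketch assumes that
use_cookies is enabled (as in the quoted example below), that the
spider's LWP user agent is reachable as $server->{ua}, and that the
logout page is https://www.venmarces.com/logout/. All three are
assumptions to check against the installed spider.pl.

```perl
# SwishSpiderConfig.pl (sketch; credentials are placeholders)
@servers = ({
    base_url    => 'https://www.venmarces.com/login/?szID=idhere&szPWD=passwordhere',
    same_hosts  => [ qw/www.venmarces.com/ ],
    email       => 'webmaster@domainnamehere.com',
    use_cookies => 1,    # keep the session cookie between requests

    # Assumption: called once spidering of this server has finished.
    spider_done => sub {
        my ($server) = @_;

        # Assumption: spider.pl stores its LWP user agent under 'ua'.
        my $ua = $server->{ua} or return;

        # Reuse the same agent so the session cookie is sent with the request.
        my $response = $ua->get('https://www.venmarces.com/logout/');
        warn "Logout request returned: ", $response->status_line, "\n";
    },
});
1;
```

A separate request made from outside the spider would not carry the
spider's session cookie, which is why the sketch reuses the spider's own
user agent.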

By the way, I have the following configuration files set up.

Search configuration - .venmarces.public.swishcgi.conf
Search configuration - .venmarces.private.swishcgi.conf
Swish-e configuration - swishe.venmarces.public.conf
Swish-e configuration - swishe.venmarces.private.conf
Spider.pl configuration - SwishSpiderConfig.pl (I have not made one for
public yet).

Best,
Adam

> > I'm not sure why it's any more dangerous to require/allow the swish-e
> > spider to login to an application than any other user agent that
> > presents credentials.  In fact for a public facing application, far
> > more checks can be applied (username/password; IP_address;
> > one-of-a-kind user agent) to the spider than is feasible with a
> > normal user's login.
> >
> > Merely enabling cookies by itself presents just as much risk of
> > forgery.
> >
> > Anyway, here's a snip from my @servers:
> > 
> > @servers = (
> >          {
> >          base_url    => 'http://my.domain.com/login.app?_function=checkpw&userid=swishe&password=swishe&remember=no',
> >          use_cookies => 1,
> > #        debug => DEBUG_URL | DEBUG_SKIPPED | DEBUG_FAILED | DEBUG_HEADERS,
> >          delay_sec => 1,
> >          test_url    => sub {
> >                  my $ok = !($_[0]->path =~ /login.app/ &&
> >                             $_[0]->query =~ /_function=logout/);
> >                  return 1 if $ok;
> >                  return; },
> > ...
> > 
> > Essentially, the spider logs in as the user 'swishe' so it sees the
> > same content as any similarly privileged user.
> > remember=no means don't give swish-e a long-term cookie to
> > re-authenticate with.
> > use_cookies allows the application to provide, and swish-e to use,
> > the session cookies needed for access.
> > test_url keeps the spider from following a link to log out, to
> > assure we follow all links.
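
Applied to the configuration at the top of this message, the points in
the quoted reply suggest something like the sketch below. The visible
difference is use_cookies, which the original configuration does not
set. Skipping any path that starts with /logout/ is an assumption based
on the logout URL mentioned earlier; the credentials remain
placeholders.

```perl
# SwishSpiderConfig.pl (sketch; placeholders kept from the original message)
@servers = ({
    base_url    => 'https://www.venmarces.com/login/?szID=idhere&szPWD=passwordhere',
    same_hosts  => [ qw/www.venmarces.com/ ],
    agent       => 'swish-e spider http://swish-e.org/',
    email       => 'webmaster@domainnamehere.com',

    use_cookies => 1,    # accept and resend the session cookie set at login

    # Assumption: the logout link is the path /logout/, so never follow it.
    test_url    => sub {
        my $uri = shift;
        return if $uri->path =~ m{^/logout/};
        return 1;
    },

    delay_sec   => 1,    # Delay in seconds between requests
});
1;
```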

Received on Wed Feb 20 12:10:23 2008