Re: Language and Swishspider

From: Ron Klatchko <ron(at)not-real.library.ucsf.edu>
Date: Wed May 26 1999 - 16:53:27 GMT
At 09:41 AM 5/26/99 -0700, Antonio Cisternino wrote:
>Is it possible to replace the swishspider with the following one?

It looks fine.  Did you test it?
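
One way to check the header without touching a server is sketched below. It is only a sketch: build_request is a hypothetical helper that is not part of swishspider, and the URL is a placeholder. It builds the request the same way the patch does and prints it, so the Accept-Language line can be inspected.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTTP::Headers;
use HTTP::Request;

# Hypothetical helper (not part of swishspider): builds the GET request the
# same way the patched script does, so the header can be inspected offline.
sub build_request {
    my ($url) = @_;

    # Fall back to "en" when SWISH_LANG is unset or empty, as the ||= in
    # the patch does.
    my $language = $ENV{SWISH_LANG} || "en";

    # HTTP::Headers turns the underscore in the field name into a dash.
    my $header_lang = HTTP::Headers->new( Accept_language => $language );

    return HTTP::Request->new( "GET", $url, $header_lang );
}

# http://www.example.com/ is a placeholder URL; nothing is fetched.
my $request = build_request("http://www.example.com/");
print $request->as_string;
```

Running this with SWISH_LANG set to it and again with it unset should show the header change. To confirm the server side as well, the helper itself could be run by hand against a MultiViews URL once per language and the two .contents files compared.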


>I've added just three lines to send a language request to web servers.
>Our site uses the MultiViews option of Apache, and the server serves a
>page depending on the language requested (we support it and en).
>I want to use swish over HTTP, but I want to build two indexes: one for
>English pages and the other for Italian pages.
>So I've added the following lines to the swishspider helper:
>
>my $language = $ENV{SWISH_LANG};
>$language ||= "en"; # These two lines set the language (en is the default).
>my $header_lang = new HTTP::Headers(Accept_language => $language);
>
>These lines choose the language by looking at the SWISH_LANG environment
>variable. If this variable is not defined, "en" is assumed. An additional
>header for the HTTP::Request is then created.
>Finally, I've changed the line that builds the HTTP request as follows:
>
>my $request = new HTTP::Request( "GET", $url, $header_lang );
>
>Now the spider can be controlled by the environment for the language
>to use.
>
>-- Antonio
>
>Swishspider:
>
>#!/usr/bin/perl
>
>use LWP::UserAgent;
>use HTTP::Headers;
>use LWP::RobotUA;
>use HTTP::Request;
>use HTTP::Status;
>use HTML::LinkExtor;
>
>if (scalar(@ARGV) != 2) {
>    print STDERR "Usage: SwishSpider localpath url\n";
>    exit(1);
>}
>
>my $language = $ENV{SWISH_LANG};
>$language ||= "en";
>
>my $header_lang = new HTTP::Headers(Accept_language => $language);
>
>my $ua = new LWP::UserAgent;
>$ua->agent( "SwishSpider" );
>$ua->from( "ron\@ckm.ucsf.edu" );
>
>my $localpath = shift;
>my $url = shift;
>
>my $request = new HTTP::Request( "GET", $url, $header_lang );
>my $response = $ua->simple_request( $request );
>
>#
># Write out important meta-data.  This includes the HTTP code.  Depending
># on the code, we write out other data.  Redirects have the location
># printed, everything else gets the content-type.
>#
>open( RESP, ">$localpath.response" ) || die( "Could not open response file $localpath.response" );
>print RESP $response->code() . "\n";
>if( $response->code() == RC_OK ) {
>    print RESP $response->header( "content-type" ) . "\n";
>} elsif( $response->is_redirect() ) {
>    print RESP $response->header( "location" ) . "\n";
>}
>close( RESP );
>
>#
># Write out the actual data assuming the retrieval was successful.  Also, if
># we have actual data and it's of type text/html, write out all the links it
># refers to
>#
>if( $response->code() == RC_OK ) {
>    my $contents = $response->content();
>
>    open( CONTENTS, ">$localpath.contents" ) || die( "Could not open contents file $localpath.contents\n" );
>    print CONTENTS $contents;
>    close( CONTENTS );
>
>    if( $response->header("content-type") eq "text/html" ) {
>	open( LINKS, ">$localpath.links" ) || die( "Could not open links file $localpath.links\n" );
>	$p = HTML::LinkExtor->new( \&linkcb, $url );
>	$p->parse( $contents );
>
>	close( LINKS );
>    }
>}
>
>
>sub linkcb {
>    my($tag, %links) = @_;
>    if (($tag eq "a") && ($links{"href"})) {
>	my $link = $links{"href"};
>
>	#
>	# Remove fragments
>	#
>	$link =~ s/(.*)#.*/$1/;
>	
>	#
>	# Remove ../  This is important because the abs() function
>	# can leave these in and cause never ending loops.
>	#
>	$link =~ s/\.\.\///g;
>	
>	print LINKS "$link\n";
>    }
>}
>
>
>
----------------------------------------------------------------------
          Ron Klatchko - Manager, Advanced Technology Group           
           UCSF Library and Center for Knowledge Management           
                        ron@library.ucsf.edu                
Received on Wed May 26 09:48:56 1999