no_index and querystrings

From: Tim Hartley <tim.hartley(at)not-real.planetpdf.com>
Date: Mon Oct 11 2004 - 02:58:29 GMT

Hi all,

I'm having trouble narrowing my index down in a few key areas. Basically, I have an ASP page that is reached through a number of different querystrings, some of which I want indexed and some I don't.
I'm pretty sure it's my sucky Perl coding that's causing the problems, so I just wanted to run it past the forum.

The querystrings that I want to catch with the 'no_index' flag in the callback function all start with "www.planetpdf.com/tools.asp?webpageid=615&SearchType=Product". There are many different endings to this string (heaps of different 'Product' keywords), but I want any URL like this to be spidered only, not indexed.

DETAILS:

--- START Conf file ---

IndexFile c:\swish-e\toolsIndex.index
IndexDir spider.pl
IndexReport 3
IgnoreMetaTags script
obeyRobotsNoIndex yes
StoreDescription HTML2 <description> 500
SwishProgParameters toolsSpider.config

--- END Conf file ---

--- START Spider Config file, toolsSpider.config ---
@servers = (
    {
        debug      => DEBUG_SKIPPED | DEBUG_ERRORS,
        base_url   => 'http://cm3.planetpdf.com/tools.asp',
        email      => 'binary@binarything.com',
        agent      => 'pp_tools',
        keep_alive => 1,
        test_response => sub {
            my $server = $_[1];
            # try and not index tools.asp?webpageid=615&SearchType=Product..
            $server->{no_index}++ if $_[0]->path =~ /tools\.asp\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->path =~ /\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->query_form =~ /\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->query_form =~ /tools\.asp\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->path_query =~ /\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->path_query =~ /tools\.asp\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->query =~ /\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->query =~ /tools\.asp\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->query_keywords =~ /\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->query_keywords =~ /tools\.asp\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->path_segments =~ /\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->path_segments =~ /tools\.asp\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->path_query =~ /\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->path_query =~ /tools\.asp\?webpageid\=615\&SearchType\=Product$/;
            return 1;
        },
        use_md5 => 1,
    },
);

--- END Spider Config file ---

--- Run Command ---
C:\SWISH-E>swish-e -S prog -c PPTools.conf

--- 

As you can see, I've played around with a variety of different URI methods, and none of them seems able to stop that URL from being indexed. The query_form, query and query_keywords calls all produce a "Use of uninitialized value in pattern match (m//) at line blah" warning too.
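
One thing that may explain the behaviour: every pattern in the config ends in '$', so it can only match a URL that ends exactly in "SearchType=Product", while the real URLs continue past that point. In addition, the URI module's path method excludes the query string and its query method excludes the leading '?', so a pattern containing '\?' cannot match either of those values. Below is a minimal sketch of an unanchored check, written against the kind of string that path_query returns. The helper name is_product_listing and the sample URLs are hypothetical, and the wiring into test_response shown in the comment is an untested assumption rather than a confirmed fix.

```perl
use strict;
use warnings;

# Hypothetical helper: true when a path-plus-query string (the kind of value
# URI's path_query method returns) belongs to the product listings.
sub is_product_listing {
    my ($path_query) = @_;

    # Guard against undef so the match never warns about an
    # uninitialized value.
    return 0 unless defined $path_query;

    # No trailing '$': whatever follows "Product" is allowed to vary.
    return $path_query =~ /tools\.asp\?webpageid=615&SearchType=Product/ ? 1 : 0;
}

# Possible wiring in the spider config (assumption, not tested here):
#   test_response => sub {
#       my ( $uri, $server ) = @_;
#       $server->{no_index}++ if is_product_listing( $uri->path_query );
#       return 1;
#   },

# Sample inputs are made up for illustration only.
for my $pq (
    '/tools.asp?webpageid=615&SearchType=Product',
    '/tools.asp?webpageid=615&SearchType=Product&example=1',
    '/tools.asp?webpageid=700',
) {
    printf "%-55s %s\n", $pq, is_product_listing($pq) ? 'no_index' : 'index';
}
```

The warnings probably come from pages that have no query string at all, where query returns undef; the defined check in the helper is there to cover that case.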

Any pointers?

Tim
Received on Sun Oct 10 19:58:42 2004