Hi all,
I'm having trouble narrowing my index down in a few key areas. Basically, I have an ASP page that contains a number of different querystrings, some of which I want indexed and some of which I don't.
I'm pretty sure that it's my sucky Perl coding that's causing the problems, so I just wanted to run it past the forum.
The querystrings that I want to run through the 'no_index' callback function all start with "www.planetpdf.com/tools.asp?webpageid=615&SearchType=Product". There are many different endings to this string (heaps of different 'Product' keywords), but I want any URL of this form to be spidered only, not indexed.
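To make the intended rule concrete, here is a minimal standalone sketch of the match described above. It assumes the URI module is installed and uses its path_query method (path plus query string). The helper name is_product_search and the 'Acrobat' keyword ending are hypothetical, made up purely for illustration. The pattern is anchored at the start only, so anything following "Product" is still accepted.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: returns 1 when the URL is one of the product-search
# pages that should be spidered but kept out of the index, 0 otherwise.
sub is_product_search {
    my ($url) = @_;
    my $uri = URI->new($url);
    # path_query is the path, then '?', then the query string (if any).
    # There is no trailing '$' anchor, so any ending after "Product" matches.
    return $uri->path_query =~ m{^/tools\.asp\?webpageid=615&SearchType=Product} ? 1 : 0;
}

# 'Acrobat' is a made-up keyword ending, not taken from the real site.
for my $url (
    'http://www.planetpdf.com/tools.asp?webpageid=615&SearchType=ProductAcrobat',
    'http://www.planetpdf.com/tools.asp?webpageid=615&SearchType=Product',
    'http://www.planetpdf.com/tools.asp?webpageid=100',
    'http://www.planetpdf.com/tools.asp',
) {
    printf "%s => %s\n", $url, is_product_search($url) ? 'no_index' : 'index';
}
```

The same test could presumably sit inside a test_response callback in place of the list of attempts in the config, though that part is untested here.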
DETAILS:
--- START Conf file ---
IndexFile c:\swish-e\toolsIndex.index
IndexDir spider.pl
IndexReport 3
IgnoreMetaTags script
obeyRobotsNoIndex yes
StoreDescription HTML2 <description> 500
SwishProgParameters toolsSpider.config
--- END Conf file ---
--- START Spider Config file, toolsSpider.config ---
@servers = (
    {
        debug      => DEBUG_SKIPPED | DEBUG_ERRORS,
        base_url   => 'http://cm3.planetpdf.com/tools.asp',
        email      => 'binary@binarything.com',
        agent      => 'pp_tools',
        keep_alive => 1,
        test_response => sub {
            my $server = $_[1];
            # try and not index tools.asp?webpageid=615&SearchType=Product..
            $server->{no_index}++ if $_[0]->path =~ /tools\.asp\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->path =~ /\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->query_form =~ /\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->query_form =~ /tools\.asp\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->path_query =~ /\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->path_query =~ /tools\.asp\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->query =~ /\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->query =~ /tools\.asp\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->query_keywords =~ /\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->query_keywords =~ /tools\.asp\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->path_segments =~ /\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->path_segments =~ /tools\.asp\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->path_query =~ /\?webpageid\=615\&SearchType\=Product$/;
            $server->{no_index}++ if $_[0]->path_query =~ /tools\.asp\?webpageid\=615\&SearchType\=Product$/;
            return 1;
        },
        use_md5 => 1,
    },
);
--- END Spider Config file ---
--- Run Command ---
C:\SWISH-E>swish-e -S prog -c PPTools.conf
---
As you can see, I've played around with a variety of different URI methods, and none of them seems to stop that URL from being indexed. The query_form, query and query_keywords calls also all produce a "Use of uninitialised value in pattern match (m//) at line blah" warning.
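In case it helps with diagnosis, here is a small standalone sketch that prints what three of the scalar URI accessors return for the base_url and for a URL carrying the query string. It assumes the URI module is installed, and the describe helper is a hypothetical name used only here. As far as I can tell, path never includes the query string, and query returns undef when the URL has no query string at all, which would be consistent with the warning on the base_url. query_form is documented as returning a list of key/value pairs, so it is left out of this scalar comparison.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: collect the values of three scalar accessors.
sub describe {
    my ($url) = @_;
    my $uri        = URI->new($url);
    my $path       = $uri->path;        # path only, never the query string
    my $query      = $uri->query;       # undef when there is no '?' part
    my $path_query = $uri->path_query;  # path plus '?' plus query, if any
    return { path => $path, query => $query, path_query => $path_query };
}

for my $url (
    'http://cm3.planetpdf.com/tools.asp',
    'http://cm3.planetpdf.com/tools.asp?webpageid=615&SearchType=Product',
) {
    my $d = describe($url);
    print "$url\n";
    for my $name (qw(path query path_query)) {
        # Show undef explicitly rather than triggering a warning.
        my $value = defined $d->{$name} ? $d->{$name} : '(undef)';
        print "  $name: $value\n";
    }
}
```

Note that the value returned by query has no leading '?', so a pattern that begins with \? would not match it even when a query string is present.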
Any pointers?
Tim
Received on Sun Oct 10 19:58:42 2004