Re: indexing dynamic sites and robots.txt

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Apr 11 2002 - 22:47:01 GMT
At 02:39 PM 04/11/02 -0700, Michael wrote:
>Is it possible to index dynamic sites that use URL's of the form
>
>http://somewhere.com/page.cfm?object=1234
>a new page is represented by
>http://somewhere.com/page.cfm?object=1235
>
>but I believe that swish-e thinks 
>
>http://somewhere.com/page.cfm
>
>has already been indexed and move on without looking at the new page.
>Is this what happens??

No, swish-e treats the query string as part of the URL, so it will index each URL separately.[1]
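
To illustrate why the three URLs are not collapsed into one, here is a minimal sketch of a spider's "already seen" check keyed on the full URL string. This is not swish-e's actual code; the @links list and the %seen hash are made up for the example, and it only assumes that de-duplication compares complete URLs, query string included.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical list of links found while spidering (note the repeated one).
my @links = qw(
    http://somewhere.com/page.cfm
    http://somewhere.com/page.cfm?object=1234
    http://somewhere.com/page.cfm?object=1235
    http://somewhere.com/page.cfm?object=1234
);

# Key the "seen" hash on the whole URL, so only exact repeats are skipped.
my %seen;
my @to_index = grep { !$seen{$_}++ } @links;

print "$_\n" for @to_index;
```

Only the exact duplicate is dropped; page.cfm, page.cfm?object=1234 and page.cfm?object=1235 each stay in the list.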

>While 
>
>User-agent *
>Disallowed: \

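That is not valid robots.txt syntax: each field name needs a colon, and the
directive is "Disallow" with a forward slash. See
http://www.robotstxt.org/wc/exclusion-admin.html

User-agent: *
Disallow: /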
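
As a rough illustration of why the spelling matters, the sketch below reads a robots.txt body and collects the Disallow prefixes that apply to all robots. It is a deliberately simplified reader written for this example (the function names disallowed_prefixes and is_allowed are made up, and it ignores Allow lines, wildcards, and agent-specific records), not the parser that swish-e or any particular spider uses.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect Disallow prefixes from records that apply to all robots ("*").
# Simplified: lines without "field: value" form, and unknown fields, are skipped.
sub disallowed_prefixes {
    my ($robots_txt) = @_;
    my ( @prefixes, $applies );
    for my $line ( split /\r?\n/, $robots_txt ) {
        $line =~ s/#.*//;    # strip comments
        next unless $line =~ /^\s*([A-Za-z-]+)\s*:\s*(.*?)\s*$/;
        my ( $field, $value ) = ( lc $1, $2 );
        if ( $field eq 'user-agent' ) {
            $applies = ( $value eq '*' );
        }
        elsif ( $field eq 'disallow' && $applies && length $value ) {
            push @prefixes, $value;
        }
    }
    return @prefixes;
}

# A path is blocked if it starts with any collected prefix.
sub is_allowed {
    my ( $path, @prefixes ) = @_;
    for my $prefix (@prefixes) {
        return 0 if index( $path, $prefix ) == 0;
    }
    return 1;
}

my $broken = "User-agent *\nDisallowed: \\\n";
my $fixed  = "User-agent: *\nDisallow: /\n";

for my $case ( [ broken => $broken ], [ fixed => $fixed ] ) {
    my ( $name, $text ) = @$case;
    my @prefixes = disallowed_prefixes($text);
    printf "%s: /page.cfm?object=1234 is %s\n", $name,
        is_allowed( '/page.cfm?object=1234', @prefixes ) ? 'allowed' : 'blocked';
}
```

With this reader, the misspelled record contributes no rules at all, so the page stays allowed, while the corrected record blocks every path.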


[1] You can try it yourself:

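#!/usr/local/bin/perl -w
use strict;
use CGI;

my $cgi = CGI->new;

print $cgi->header, $cgi->start_html;

# Scalar assignment avoids calling param() in list context.
my $object = $cgi->param('object');

if ( defined $object ) {
    # Sub-page: echo the object id so each URL has distinct content.
    print $cgi->escapeHTML($object);
} else {
    # Main page: link to two sub-pages that differ only in the query string.
    print <<EOF;

Hello main page!
<a href="p.cgi?object=1234">word1234</a>
<a href="p.cgi?object=ABCD">wordABCD</a>
EOF
}
print $cgi->end_html;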

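After indexing that script, a search that matches every document
("not dkdk") lists all three URLs:

> ./swish-e -w not dkdk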
# SWISH format: 2.1-dev-25
# Search words: not dkdk
# Number of hits: 3
# Search time: 0.001 seconds
# Run time: 0.040 seconds
1000 /p.cgi?object=ABCD "Untitled Document" 277
1000 /p.cgi?object=1234 "Untitled Document" 277
1000 /p.cgi "Untitled Document" 373

I tried with 2.0.5 and current CVS using both -S prog and -S http.



-- 
Bill Moseley
mailto:moseley@hank.org
Received on Thu Apr 11 22:48:55 2002