problem with "swishspider" perl helper -- needs patch

From: Michael <michael(at)not-real.bizsystems.com>
Date: Wed Jun 16 1999 - 19:40:21 GMT
I ran into a problem indexing a site that looked perfectly fine. The 
header of the file(s) in question contained the META TAG

<META HTTP-EQUIV="content-type" CONTENT="text/html;">

libwww concatenates multiple headers that share the same name. Since 
the HTTP server already sends this header in the form

Content-type: text/html

the value returned by $response->header("content-type")
in 'swishspider' ends up as "text/html, text/html"

This fails the test on line 50
if( $response->header("content-type") eq "text/html" ) {

so links on the page are not followed.
Changing the perl script to read:
50c50
<     if( $response->header("content-type") eq "text/html" ) {
---
>     if( $response->header("content-type") =~ m|^text/html| ) {

solves the problem by matching any header value that begins with 
'text/html'
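
To see the difference between the two tests without running the spider, 
here is a small standalone sketch. The helper name is_html is made up 
for illustration and is not part of 'swishspider'; the combined value 
is simply the string reported above, hard-coded rather than fetched 
through libwww.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of swishspider): returns 1 when the
# header value starts with text/html, 0 otherwise.
sub is_html {
    my ($content_type) = @_;
    return 0 unless defined $content_type;
    return $content_type =~ m|^text/html| ? 1 : 0;
}

# The combined value described above, hard-coded for the demonstration.
my $combined = "text/html, text/html";

# Original test: exact string comparison.
print "eq test:    ", ($combined eq "text/html" ? "match" : "no match"), "\n";

# Patched test: anchored prefix match.
print "regex test: ", (is_html($combined) ? "match" : "no match"), "\n";
```

Note that the prefix match is deliberately loose: it also accepts 
values such as "text/html; charset=iso-8859-1", which the exact 
comparison would reject as well.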


Michael
michael@bizsystems.com

BTW, I'd still like to know how to enable FileRules while spidering. 
This would be most helpful in eliminating useless index information on 
very large sites that have specific interest archives.
Received on Wed Jun 16 11:36:43 1999