Re: swishspider fix for content-type check at line 50

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Apr 02 2001 - 01:59:26 GMT
At 06:23 PM 04/01/01 -0700, David Wood wrote:
>Looks like version 2.1-dev-20 might have broken something that was okay in 
>2.0.5.  In swishspider, line 50 should change as follows:
>
>50c50
><     if( $response->header("content-type") eq "text/html" ) {
>---
> >     if( $response->header("content-type") =~ /text\/html/i ) {

Thanks.   I just committed the change to CVS.  You might think about using
CVS to get your sources.
http://sunsite.berkeley.edu/SWISH-E/archive/2526.html
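
For anyone wondering why the exact comparison can fail: a server may send the Content-Type header with a parameter attached (for example a charset) or in a different case, and then "eq" no longer matches. The sketch below compares the two tests on a few sample header values. The helper names and sample values are made up for illustration and are not part of swishspider.

```perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only (not from swishspider).

# The old test: true only for the bare string "text/html".
sub is_html_exact {
    my ($ct) = @_;
    return ( defined $ct && $ct eq "text/html" ) ? 1 : 0;
}

# The new test: true if "text/html" appears anywhere, ignoring case.
sub is_html_match {
    my ($ct) = @_;
    return ( defined $ct && $ct =~ /text\/html/i ) ? 1 : 0;
}

# Sample header values (assumed, not taken from a real server log).
my @samples = (
    "text/html",
    "text/html; charset=iso-8859-1",
    "TEXT/HTML",
    "text/plain",
);

for my $ct (@samples) {
    printf "%-32s exact=%d match=%d\n",
        $ct, is_html_exact($ct), is_html_match($ct);
}
```

Only the first sample passes the exact test, while the first three pass the pattern match. If you are using LWP, the content_type method of HTTP::Headers is another option, since it returns the lowercased media type without any parameters.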

Also, if you are spidering, it would be really great if you would try the
new spider included in the distribution.  I'd be curious how it works
(other than in my simple tests).  It uses a new 2.2 feature of swish where
swish can call an external program, and the external program feeds documents
to swish.  

In the prog-bin directory is an example (yet full featured) spider.pl
program.  This program demonstrates how to write a program to use the
"prog" feature.  The advantage over the built-in spider is that it doesn't
fork and run (and compile) a perl script for every document, and you have
full control over what gets indexed (and how) in the perl program.  The
spider.pl program, for example, can easily filter content, so with about
three lines of code you should be able to index pdf files, too.  (A
module is included to convert pdf to xml for indexing.)
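
To give a rough idea of what "feeding documents to swish" looks like, here is a minimal sketch of a "prog"-style program. It assumes the header format described in the documentation for later SWISH-E releases (Path-Name, Content-Length and Last-Mtime lines, a blank line, then the document body). The development version discussed here may differ, so check the docs shipped with your distribution. The format_doc helper and the sample documents are hypothetical, and this is not the code from spider.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: build the text for one document the way a
# "prog" program would write it to STDOUT for swish to read.
# ASSUMPTION: header names follow later SWISH-E documentation.
sub format_doc {
    my ( $path, $mtime, $content ) = @_;
    return "Path-Name: $path\n"
         . "Content-Length: " . length($content) . "\n"   # byte count for plain byte strings
         . "Last-Mtime: $mtime\n"
         . "\n"                                           # blank line ends the headers
         . $content;
}

# Made-up documents standing in for whatever the program fetches.
my @docs = (
    [ "http://localhost/one.html", "<html><body>First page</body></html>" ],
    [ "http://localhost/two.html", "<html><body>Second page</body></html>" ],
);

# One long-running program emits every document, so nothing is
# forked or recompiled per document.
for my $doc (@docs) {
    my ( $path, $content ) = @$doc;
    print format_doc( $path, time(), $content );
}
```

Because the program decides what to print, filtering or converting content (for example, pdf to xml) would happen before format_doc is called.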

Again, it would be great if you can try it out and report back.




Bill Moseley
mailto:moseley@hank.org
Received on Mon Apr 2 02:05:18 2001