Re: Problem on Parser with TXT/HTML and Spider.pl

From: Robert Keith <Robert(at)not-real.technolords.com>
Date: Wed Apr 30 2003 - 08:18:38 GMT

That fixed it!

Thanks.  I'll see if I can dig something else up.

rgs,
Robert Keith

> -----Original Message-----
> From: swish-e@sunsite.berkeley.edu
> [mailto:swish-e@sunsite.berkeley.edu]On Behalf Of moseley@hank.org
> Sent: Tuesday, April 29, 2003 11:58 PM
> To: Multiple recipients of list
> Subject: [SWISH-E] Re: Problem on Parser with TXT/HTML and Spider.pl
> 
> 
> On Tue, Apr 29, 2003 at 03:30:23PM -0700, Robert Keith wrote:
> > 
> > I am having a strange problem indexing a combination of MSWord, .txt and PHP
> > documents using spider.pl and feeding this into swish-e.  If I index the PHP
> > urls first, the documents are parsed and loaded as HTML.  If I select the
> > MSWord and other documents, which are filtered by the spider.pl filter
> > routines, the MSWord and other documents are parsed as TXT (correctly), then
> > when the subsequent PHP and HTML documents are parsed, they are parsed as
> > TXT.  The SwishSpiderConfig.pl file contains two entries, the URL with the
> > MSWord links, and the URL with only PHP links.
> 
> This is a better fix (I actually tried it this time!)
> 
> --- extprog.c.old       2003-04-29 23:51:34.000000000 -0700
> +++ extprog.c   2003-04-29 23:52:04.000000000 -0700
> @@ -272,7 +272,10 @@
>  
>              /* Set the doc type from the header */
>              if ( docType )
> +            {
>                  fprop->doctype   = docType;
> +                docType = 0;
> +            }
>  
>  
>              /* set real_path, doctype, index_no_content, filter, stordesc */
> 
> 
> That error doesn't show up on the dev version because the doctype is
> set on all files instead of just the filtered ones.
> 
> Sorry for the trouble.
> 
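
For anyone reading this thread later, here is a minimal sketch of why resetting docType in the patch above fixes the symptom. It is not the swish-e source: assign_types(), the TYPE_* constants, the DEFAULT_TYPE fallback and the reset_after_use switch are all hypothetical names made up for illustration. The only thing taken from the patch is the idea that a doc type read from one document's header is held in a variable that outlives that document.

```c
/* Hypothetical document types; these names are illustrative, not swish-e's. */
enum { TYPE_NONE = 0, TYPE_TXT = 1, TYPE_HTML = 2 };

/* Assumed fallback for documents whose header names no type. */
#define DEFAULT_TYPE TYPE_HTML

/*
 * header_type[i] is the type announced in document i's header, or
 * TYPE_NONE when the header names none.  out[i] receives the type that
 * ends up being used for document i.  reset_after_use = 0 models the
 * unpatched behaviour, 1 models the patched behaviour.
 */
static void assign_types(const int *header_type, int *out, int n,
                         int reset_after_use)
{
    int docType = TYPE_NONE;   /* survives across loop iterations */

    for (int i = 0; i < n; i++) {
        int doctype = DEFAULT_TYPE;      /* per-document value */

        if (header_type[i])              /* header supplied a type */
            docType = header_type[i];

        /* Set the doc type from the header */
        if (docType) {
            doctype = docType;
            if (reset_after_use)
                docType = TYPE_NONE;     /* the line the patch adds */
        }

        out[i] = doctype;
    }
}
```

With headers of { TYPE_TXT, TYPE_NONE, TYPE_NONE }, the unpatched path labels all three documents TXT, because the stale value is reused. The patched path labels the first TXT and lets the other two fall back to the default, which matches the behaviour reported above once the fix was applied.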
Received on Wed Apr 30 08:22:53 2003