Re: Problem on Parser with TXT/HTML and Spider.pl

From: <moseley(at)not-real.hank.org>
Date: Wed Apr 30 2003 - 04:36:57 GMT
On Tue, Apr 29, 2003 at 03:30:23PM -0700, Robert Keith wrote:
> 
> I am having a strange problem indexing a combination of MSWord, .txt and PHP
> documents using spider.pl and feeding this into swish-e.  If I index the PHP
> urls first, the documents are parsed and loaded as HTML.  If I select the
> MSWord and other documents, which are filtered by the spider.pl filter
> routines, the MSWord and other documents are parsed as TXT (correctly), then
> when the subsequent PHP and HTML documents are parsed, they are parsed as
> TXT.  The SwishSpiderConfig.pl file contains two entries, the URL with the
> MSWord links, and the URL with only PHP links.

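Ah yep, I see the problem.  If you look below you'll notice that
$server->{parser_type} is only set when the document is filtered, and
nothing clears it afterward.  So the value left by the last filtered
document carries over to the text/* documents that follow, since those
return before the filter runs.  It needs to be cleared at the top of
filter_content().  Try adding the unquoted line below.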

I don't know why request-specific data is in that global structure.  Put it
on my todo list...
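
To see the carry-over in isolation, here is a minimal sketch.  It does not
use spider.pl or SWISH::Filter at all: filter_content_sketch(), the
%parser_for table and parser_types_seen() are made-up stand-ins, and the
'TXT' value for MSWord is just an assumption taken from your description.
It only shows how a key left in a hash shared across requests leaks into
the next document unless it is deleted first.

```perl
use strict;
use warnings;

# Hypothetical stand-in for what a real filter would report per content type.
my %parser_for = ( 'application/msword' => 'TXT' );

sub filter_content_sketch {
    my ( $server, $content_type, $clear_first ) = @_;

    # The fix: forget whatever the previous document left behind.
    delete $server->{parser_type} if $clear_first;

    # text/* is not filtered, so parser_type is never assigned on this path.
    return 1 if !$content_type || $content_type =~ m!^text/!;

    $server->{parser_type} = $parser_for{$content_type} || '';
    return 1;
}

sub parser_types_seen {
    my ($clear_first) = @_;
    my $server = {};    # one hash shared by every request, as in the report
    my @seen;
    for my $type ( 'application/msword', 'text/html' ) {
        filter_content_sketch( $server, $type, $clear_first );
        push @seen, $server->{parser_type} // 'default';
    }
    return join ',', @seen;
}

print "without delete: ", parser_types_seen(0), "\n";    # TXT,TXT
print "with delete:    ", parser_types_seen(1), "\n";    # TXT,default
```

Without the delete, the text/html document still sees the MSWord
document's 'TXT'; with it, the key is gone and the default applies.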


> The prof1.pl spider.pl config file contains:

[...]

> sub filter_content {
>     my ( $uri, $server, $response, $content_ref ) = @_;

      delete $server->{parser_type};

> 
>     my $content_type = $response->content_type;
> 
>     # Ignore text/* content type -- no need to filter
>     return 1 if !$content_type || $content_type =~ m!^text/!;
> 
>     # Load the module - returns FALSE if cannot load module.
>     unless ( $filter ) {
>         eval { require SWISH::Filter };
>         if ( $@ ) {
>             $server->{abort} = $@;
>             return;
>         }
>         $filter = SWISH::Filter->new;
>         unless ( $filter ) {
>             $server->{abort} = "Failed to create filter object";
>             return;
>         }
>     }
> 
>     # If not filtered return false and doc will be ignored (not indexed)
> 
>     return unless $filter->filter(
>         document => $content_ref,
>         name     => $response->base,
>         content_type => $content_type,
>     );
> 
>     # nicer to use **char...
>     $$content_ref = ${$filter->fetch_doc};
> 
>     # let's see if we can set the parser.
>     $server->{parser_type} = $filter->swish_parser_type || '';
> 
>     return 1;
> }
> 
> 
> 
> 
> 
> # Must return true...
> 
> 1;
> 
Received on Wed Apr 30 04:44:00 2003