Re: PDF to HTML causing swish-e to crash

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Oct 11 2002 - 00:05:28 GMT
At 03:48 PM 10/10/02 -0700, Greg Fenton wrote:
>The temp file from step 2 is different: same size, different sum.
>xpdf blows up when run against this file.  I compared the files via "od
>-c" and it seems that most (all?) of the \0 in the original PDF have
>been converted to \n.

Yep.  I'm not sure if I'd call it a bug or a design flaw.  I'll explain below.

I didn't realize you were using spider.pl.  Your fix is to filter your docs
in spider.pl before sending them to swish-e.

There are two ways to do it, and both are given as examples in the
SwishSpiderConfig.pl file.  One way calls the pdf2html.pm module in the
prog-bin directory.  Here I'm testing from the distribution src directory.

Here's a complete config --

~/swish-e/src > cat SwishSpiderConfig.pl 

@servers = (
    {
        base_url    => 'http://localhost/xfig.pdf',
        email       => 'swish@domain.invalid',
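        test_response   => sub {
            my ( $uri, $server, $response ) = @_;
            # only accept content types we know how to index
            my $content_type = $response->content_type;
            return grep { $_ eq $content_type } 
               qw{ text/html text/plain application/pdf };
        },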
        filter_content  => [ \&pdf ],
    },
);    

use lib '../prog-bin';
use pdf2html;  # included example pdf converter module
sub pdf {
   my ( $uri, $server, $response, $content_ref ) = @_;

   return 1 unless $response->content_type eq 'application/pdf';

   # for logging counts
   $server->{counts}{'PDF transformed'}++;

   $$content_ref = ${pdf2html( $content_ref, 'title' )};
   $$content_ref =~ tr/ / /s;
   return 1;
}


1;

That's kind of the old way, but for now it's probably the fastest.

There's a new set of modules called SWISH::Filter which you may want to
look at when you have more time.

Here's the explanation of the bug.

The old swish (before the libxml2-based HTML2 parser) would read entire
files into memory as a single string.  That's a NUL-terminated string.  Some
people had problems indexing files, reporting that swish was not indexing
the entire file.  The problem was that their text files had embedded null
chars, so only the text up to the first null was processed.  The fix was to
replace the nulls with a \n and issue a warning.  You must have seen that:

 "Substituted possible embedded null character(s) in file"
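
To make the effect concrete, here is a rough Perl sketch of the two
behaviours described above.  It is only an illustration: swish-e itself is
written in C, and the helper names c_string_view and substitute_nulls are
made up for this example.  It shows why a substituted file keeps the same
size but no longer has the same contents.

```perl
use strict;
use warnings;

# Hypothetical helper: what code treating the buffer as a
# NUL-terminated string would "see", i.e. everything before the first \0.
sub c_string_view {
    my ($buf) = @_;
    my $pos = index( $buf, "\0" );
    return $pos < 0 ? $buf : substr( $buf, 0, $pos );
}

# Hypothetical helper: mimic the workaround by turning each \0 into \n.
# Works on a copy; returns the modified copy and the number of bytes replaced.
sub substitute_nulls {
    my ($buf) = @_;
    my $count = ( $buf =~ tr/\0/\n/ );
    return ( $buf, $count );
}

my $original = "abc\0def\0ghi";
my ( $fixed, $count ) = substitute_nulls($original);

printf "seen as C string: %s\n", c_string_view($original);
printf "replaced %d null(s); length %d -> %d\n",
    $count, length($original), length($fixed);
```

The length is unchanged after the substitution, which matches the "same
size, different sum" observation: harmless for text, destructive for a
binary format.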

The bug, as far as I'm concerned, is that when using a filter with -S prog
I reused the code that slurps the file into a buffer before writing the
data out to disk, and that code is what replaces the nulls.  A PDF is
binary data, so the substitution corrupts the temp file handed to the
filter, which is why xpdf blows up.  That will be fixed.
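
If you want to check whether a document survives a trip through a temp
file byte for byte, something along these lines should do it.  This is a
sketch in Perl, not the swish-e code: write_raw and read_raw are
hypothetical helpers, and the sample content is made up.  Digest::MD5 and
File::Temp ship with Perl.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Temp qw(tempfile);

# Hypothetical helper: write a buffer to a temp file with no
# translation of any kind and return the file name.
sub write_raw {
    my ($content_ref) = @_;
    my ( $fh, $name ) = tempfile( UNLINK => 1 );
    binmode $fh;    # raw bytes: no newline or encoding layers
    print {$fh} $$content_ref or die "write failed: $!";
    close $fh or die "close failed: $!";
    return $name;
}

# Hypothetical helper: read a whole file back as raw bytes.
sub read_raw {
    my ($name) = @_;
    open my $fh, '<:raw', $name or die "cannot open $name: $!";
    local $/;       # slurp mode
    my $data = <$fh>;
    close $fh;
    return $data;
}

# Made-up sample standing in for a PDF with embedded nulls.
my $content = "%PDF-1.3\n\0\0binary\0data";
my $copy    = read_raw( write_raw( \$content ) );

print md5_hex($copy) eq md5_hex($content)
    ? "round trip intact\n"
    : "content was altered\n";
```

Comparing digests is the same check as comparing sums on the original and
the temp file, just done from inside a script.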



-- 
Bill Moseley
mailto:moseley@hank.org
Received on Fri Oct 11 00:09:02 2002