Re: Indexing PDFs on Windows - Revisited....

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Sep 24 2004 - 19:19:03 GMT
On Fri, Sep 24, 2004 at 12:00:17PM -0700, Anthony Baratta wrote:
>   -Skipped http://local.dev.port.com/pdf/real_ccr.pdf due to 
> 'filter_content' user supplied function #1 death 'Skipping
>     http://local.dev.port.com/pdf/real_ccr.pdf
>     due to content type: application/pdf may be binary'

The filter isn't converting it for some reason.

Works like this:

   spider.pl gets a link.
   test_url() has the option to skip based on the URL alone
   spider.pl fetches first chunk of document
   test_response() has the option to skip based on content type
   spider.pl fetches the rest of the document
   filter_content() has the option to filter the content returned
     - the document and content-type are passed to SWISH::Filter
     - SWISH::Filter calls the filters one-by-one and each filter
     looks at the content-type to see if it can handle it.
     - SWISH::Filter::Pdf2HTML does:

          return unless $filter->content_type =~ m!application/pdf!;

     and then it filters the content (by using pdfinfo
     and pdftotext) and changes the content type to "text/html" and
     returns the doc saying "It's been filtered!"
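
To make the content-type dispatch concrete, here is a minimal sketch
of the idea. It is not the real SWISH::Filter code: the @filters
list, the hash keys and run_filters() are hypothetical names made up
for illustration, and the conversion itself is stubbed out.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for real filters. Each one looks at the
# content type and returns nothing when it cannot handle the document.
my @filters = (
    {
        name => 'pdf_to_html',
        run  => sub {
            my ($doc) = @_;
            return unless $doc->{content_type} =~ m!application/pdf!;

            # A real filter would run pdfinfo and pdftotext here.
            return {
                content      => '<html><body>converted</body></html>',
                content_type => 'text/html',
            };
        },
    },
);

# Try each filter in turn; the first one that returns a document wins.
sub run_filters {
    my ($doc) = @_;
    for my $filter (@filters) {
        my $filtered = $filter->{run}->($doc);
        return ( $filtered, 1 ) if $filtered;
    }
    return ( $doc, 0 );    # nobody claimed it: content type is unchanged
}

my ( $result, $was_filtered ) = run_filters(
    { content => '%PDF-1.4 ...', content_type => 'application/pdf' } );
print "$result->{content_type} filtered=$was_filtered\n";
```

If no filter claims the document, the content type stays
"application/pdf", and that is the case where a later check can skip
it as "may be binary".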

So, if you are getting that then maybe your version of the filter
is not looking at the correct content type?

Did you try running this?

   swish-filter-test -verbose http://local.dev.port.com/pdf/real_ccr.pdf

> I have not been able to capture when this error first occurs but it 
> appears that after it shows up once, it fails to attempt to index every 
> PDFs found there after with the same type of error message.

Interesting.  Filters can get disabled if they abort (by calling die).
In Filter.pm it does this:

        eval {
            local $SIG{__DIE__};
            $filtered_doc = $filter->filter($doc_object);
        };
        
        if ( $@ ) {
            $self->mywarn("Problems with filter '$filter'.  Filter disabled:\n -> $@");
            next;
        }

That traps an exception in the individual filter.  Are you seeing that
warning?  It would include the filter's error message.  And after that
point the filter would not be used.
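
Here is a small self-contained sketch of that trap-and-disable
pattern, to show why one failure could make every later PDF get
skipped.  It is only an illustration: the "disabled" flag, the
'flaky_pdf' filter and its error text are assumptions, not what
Filter.pm actually does internally.

```perl
use strict;
use warnings;

my $calls = 0;
my @warnings;

# Hypothetical filter that always aborts, e.g. because a helper
# program it depends on cannot be run.
my @filters = (
    {
        name     => 'flaky_pdf',
        disabled => 0,
        run      => sub {
            $calls++;
            die "helper program failed\n";
        },
    },
);

sub run_filters {
    my ($doc) = @_;
    for my $filter (@filters) {
        next if $filter->{disabled};

        # Trap the exception so one bad filter does not kill the run.
        my $filtered = eval {
            local $SIG{__DIE__};
            $filter->{run}->($doc);
        };

        if ($@) {
            push @warnings,
              "Problems with filter '$filter->{name}'.  Filter disabled: $@";
            $filter->{disabled} = 1;    # assumed mechanism, see above
            next;
        }
        return $filtered if $filtered;
    }
    return;    # document was not filtered
}

# Three PDFs in a row: only the first one reaches the filter.
run_filters( { content_type => 'application/pdf' } ) for 1 .. 3;
print "filter called $calls time(s), ", scalar(@warnings), " warning(s)\n";
```

In this sketch the warning shows up exactly once, on the first
document, and every PDF after that falls through unfiltered.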

If that's what is happening then that error message would be very
helpful.

> Any clues?
-- 
Bill Moseley
moseley@hank.org

Received on Fri Sep 24 12:19:17 2004