Re: Indexing PDFs on Windows - Revisited....

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Sep 24 2004 - 20:58:44 GMT

On Fri, Sep 24, 2004 at 01:38:34PM -0700, Anthony Baratta wrote:
> http://local.dev.port.com/portnyou/agendas/publ040826.asp
>   - Using HTML2 parser -
>   -Skipped http://local.dev.port.com/pdf/audi_shee_040722.pdf
>    due to 'filter_content' user supplied function #1 death
>    'Skipping http://local.dev.port.com/pdf/audi_shee_040722.pdf
>    due to content type: application/pdf may be binary'
> 
> I've run the swish-filter-test against the first PDF to fail with 
> "death" and the PDF that was filtered just before the first failure, 
> both filtered successfully.

Too bad I already had lunch -- I could have just stopped by.

Can you set up a small HTML page with a few links to PDFs that I could
spider and that shows the problem?
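
Something like this would do for generating such a page.  It's only a
sketch: the two PDF URLs and the build_test_page name are made up for
illustration, so swap in the PDFs that actually fail for you.

```perl
use strict;
use warnings;

# Hypothetical PDF URLs; replace with the ones that fail in your spider run.
my @pdf_urls = (
    'http://example.com/pdf/first.pdf',
    'http://example.com/pdf/second.pdf',
);

# Build a minimal HTML page containing one link per URL.
sub build_test_page {
    my @urls = @_;
    my $links = join '', map { qq{    <li><a href="$_">$_</a></li>\n} } @urls;
    return <<"HTML";
<html>
<head><title>PDF filter test</title></head>
<body>
  <ul>
$links  </ul>
</body>
</html>
HTML
}

my $page = build_test_page(@pdf_urls);
print $page;
```

Redirect the output to a file under your web root and point the spider
at just that one page.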

In spider.pl it has this code:

    my $content_type = $response->content_type;
    # Ignore text/* content type -- no need to filter
    return 1 if !$content_type || $content_type =~ m!^text/!;

    my $doc = $filter->convert(
        document     => $content_ref,
        name         => $response->base,
        content_type => $content_type,
    );

    return 1 unless $doc; # so just proceed as if not using filter

    if ( $doc->is_binary ) {  # ignore "binary" files (not text/* mime type)
        die "Skipping " . $response->base . " due to content type: " . $doc->content_type ." may be binary\n";
    }

So that would indicate that $filter->convert is being called but the
document is not being filtered.  (Which I guess you know by now.)  You
can turn on filter debugging by setting the environment variable
FILTER_DEBUG to something true (like 1 or some text).
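
To make that concrete, here is a rough, self-contained sketch of the
same control flow.  MockFilter and MockDoc are made-up stand-ins, not
the real filter module, and the assumptions are mine: that convert()
hands back the document unchanged when no handler deals with the
content type, and that "binary" just means "not text/*" as the comment
in the excerpt says.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the document object returned by convert().
package MockDoc;
sub new          { my ($class, %args) = @_; return bless {%args}, $class }
sub content_type { $_[0]{content_type} }
# Assumption: "binary" means any content type that is not text/*.
sub is_binary    { $_[0]{content_type} !~ m!^text/! }

# Hypothetical stand-in for the filter object.
package MockFilter;
sub new {
    my ($class, %args) = @_;
    return bless { handlers => $args{handlers} || {} }, $class;
}
sub convert {
    my ($self, %args) = @_;
    my $handler = $self->{handlers}{ $args{content_type} };
    if ($handler) {
        # A handler ran, so the result is now plain text.
        return MockDoc->new(
            content_type => 'text/plain',
            content      => $handler->( ${ $args{document} } ),
        );
    }
    # Only report the missing handler when debugging is switched on.
    warn "no handler for $args{content_type}\n" if $ENV{FILTER_DEBUG};
    # Assumption: the document comes back unfiltered, still binary.
    return MockDoc->new(
        content_type => $args{content_type},
        content      => ${ $args{document} },
    );
}

package main;

# Same decisions as the spider.pl excerpt, with the response flattened
# into plain arguments.
sub filter_content {
    my ($filter, $url, $content_type, $content_ref) = @_;
    return 1 if !$content_type || $content_type =~ m!^text/!;
    my $doc = $filter->convert(
        document     => $content_ref,
        name         => $url,
        content_type => $content_type,
    );
    return 1 unless $doc;
    die "Skipping $url due to content type: " . $doc->content_type . " may be binary\n"
        if $doc->is_binary;
    return 1;
}

my $pdf = "%PDF-1.4 fake";
my $url = 'http://example.com/test.pdf';

# Case 1: a handler exists for PDFs, so the document is converted.
my $with_handler = MockFilter->new(
    handlers => { 'application/pdf' => sub { "extracted text" } },
);
my $ok = eval { filter_content($with_handler, $url, 'application/pdf', \$pdf) };

# Case 2: no handler, so the document stays binary and the call dies.
my $broken       = eval { filter_content(MockFilter->new, $url, 'application/pdf', \$pdf) };
my $broken_error = $@;

print "with handler: ", ($ok ? "indexed" : "skipped"), "\n";
print "without handler: $broken_error";
```

The second case dies with the same "may be binary" text you are seeing,
and in this sketch the reason only shows up on STDERR when FILTER_DEBUG
is set.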

> I did not find any error messages regarding the filter being disabled.

Yes, looks like that message is only displayed when FILTER_DEBUG is
enabled.  That's a drag.

> P.S. I'm still unable to get the Descriptions to work for non-PDF pages. 
> I've spidered the site with PDF filtering off via the test_url option 
> and I can't get the descriptions to appear. There must be something 
> weird about our HTML pages in order to mess up the indexer.

Maybe.  Again, make a tiny, simple HTML page, spider it, and see if
it works.  If so, then you know it's not your config.  Then try one of
your HTML pages and see what happens.  If you get nothing, then turn on
ParserWarnLevel 9 in the swish config file and/or validate the page's
HTML.

> You can run this against the test.portofoakland.com after dumbing down 
> the test_url to skip pdfs then run a search against the create index 
> file. I still get no descriptions.

I'll try later, but I need to get some paying work done first.  But,
I'd be trying all the things I just said -- just start small, divide
up the problem and you will get it working.  Of course, I hope it's
not some weird Windows issue.

Are you only a Windows shop?

-- 
Bill Moseley
moseley@hank.org

Received on Fri Sep 24 13:58:59 2004