Bill Moseley wrote:
>
> Can you set up a small HTML page with a few links to PDFs that I could
> spider that shows the problem?
I'm starting to think this might be a buffer overflow caused by the
number of PDFs being skipped, either for size or for the Error Missing
Endstream errors we were discussing earlier.
Here's a page with 56 links to PDFs. 33 should succeed, 20 should be
skipped (if you drop the max file size down to 1 MB) and 3 should
"fail". The last three on the list are the failures. Something about the
previous 53 causes the filter to blow up.
http://test.portofoakland.com/PDF_TestPage.html
I've run the spider twice locally and against the URL above. Same output
every time.
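
For anyone who wants to build a similar test page, here is a minimal sketch in Python. The function name and the example.com URLs are placeholders of mine, not the actual files on the page above.

```python
from html import escape


def build_link_page(urls, title="PDF test page"):
    """Return a bare-bones HTML page with one link per URL."""
    items = "\n".join(
        f'<li><a href="{escape(u, quote=True)}">{escape(u)}</a></li>'
        for u in urls
    )
    return (
        "<html><head><title>" + escape(title) + "</title></head>\n"
        "<body>\n<ul>\n" + items + "\n</ul>\n</body></html>\n"
    )


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    print(build_link_page(["http://example.com/a.pdf",
                           "http://example.com/b.pdf"]))
```

Save the output as a static .html file and point the spider at it.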
> So that would indicate that $filter->convert is being called but it's
> not being filtered. (Which I guess you know by now.) You can turn on
> filter debugging by setting the environment variable FILTER_DEBUG to something
> true (like 1 or some text).
Found it. I thought I had that set up but I typo'd it. Sigh.
>> Starting to process new document: application/pdf
++Checking filter [SWISH::Filters::Doc2txt=HASH(0x258f670)] for application/pdf
++ application/pdf was not filtered by SWISH::Filters::Doc2txt=HASH(0x258f670)
++Checking filter [SWISH::Filters::Pdf2HTML=HASH(0x257df08)] for application/pdf
(202 words)
Problems with filter 'SWISH::Filters::Pdf2HTML=HASH(0x257df08)'. Filter disabled:
-> open2: IO::Pipe: Can't spawn-NOWAIT: Resource temporarily unavailable at C:\Progra~1\SWISH-E\lib\swish-e\perl/SWISH/Filter.pm line 1158
Final Content type for http://local.dev.port.com/pdf/audi_shee_040722.pdf is application/pdf
*No filters were used
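
In case it helps anyone reproduce this, here is a rough Python sketch of how the debug output could be captured and scanned. It only assumes what is shown above: that FILTER_DEBUG is read from the environment, and that an unfiltered document ends with a "Final Content type for ..." line followed by "*No filters were used". The helper names are mine, and other versions may word the messages differently.

```python
import os
import re

# Pattern based on the debug output quoted above (an assumption for other versions).
UNFILTERED = re.compile(
    r"Final Content type for (\S+) is (\S+) \*No filters were used"
)


def debug_env(base=None):
    """Copy an environment mapping and switch on FILTER_DEBUG."""
    env = dict(os.environ if base is None else base)
    env["FILTER_DEBUG"] = "1"
    return env


def unfiltered_documents(text):
    """Return (url, content_type) pairs for documents no filter handled."""
    flat = " ".join(text.split())  # undo any line wrapping in the log
    return UNFILTERED.findall(flat)
```

The mapping from debug_env() can be passed as env= to subprocess.run() with whatever command you normally use to start the spider, and the captured output handed to unfiltered_documents().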
>>P.S. I'm still unable to get the Descriptions to work for non-PDF pages.
>>I've spidered the site with PDF filtering off via the test_url option
>>and I can't get the descriptions to appear. There must be something
>>weird about our HTML pages in order to mess up the indexer.
>
>
> Maybe. Again, make a tiny simple HTML page and spider it and see if
> it works. If so, then you know it's not your config. Then try one of
> your HTML pages and see what happens. If nothing then turn on
> ParserWarnLevel 9 in the swish config file and/or validate the page's
> html.
OK - I figured out what is going on. The "Local Time - ..." that is
appearing in the description *is* being harvested from the html. I
grabbed a view source from one of our pages (since 90% of the content is
coming out of a database for many of our hub pages) and ran swish-e
against this now flat html file.
The Local Time is embedded as the first real text in the page via a time
function. But strangely, none of the other text on the page shows up,
just a lot of "space" filling up the rest of the description buffer.
Well, it looks like we'll have to wrap the Local Time in those special
comments <!-- noindex --> <!-- index --> to keep this "text" out of the
index.
I've done a bit of testing with this route and it appears promising. I
just need to set the comments up so they don't interfere with the link
spidering.
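
Roughly what I have in mind, as a Python sketch. The comment strings are the ones named above (check the swish-e docs for the exact form your version expects), and the function name and sample markup are made up for illustration. Wrapping only the text fragment, and not the surrounding tags, should leave the links alone.

```python
NOINDEX = "<!-- noindex -->"
INDEX = "<!-- index -->"


def wrap_noindex(html, fragment):
    """Wrap the first occurrence of fragment in the noindex/index comments."""
    if fragment not in html:
        return html  # nothing to wrap, return the page unchanged
    return html.replace(fragment, NOINDEX + fragment + INDEX, 1)


if __name__ == "__main__":
    # Hypothetical page snippet, not our real markup.
    page = "<p>Local Time - 3:33 PM</p><a href='/maritime/'>Maritime</a>"
    print(wrap_noindex(page, "Local Time - 3:33 PM"))
```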
Received on Fri Sep 24 15:33:23 2004