Bill Moseley wrote:
> Can you setup a small HTML page with a few links to PDFs that I could
> spider that shows the problem?
I'm starting to think this might be a buffer overflow caused by the number
of PDFs being skipped for size, or by the "Error: Missing Endstream" errors
we were discussing earlier.
Here's a page with 56 links to PDFs: 33 should succeed, 20 should be
skipped (if you drop the max file size down to 1 MB), and 3 should
"fail". The last three on the list are the failures; something about the
previous 53 causes the filter to blow up.
I've run the spider twice, locally and against the URL above; same output each time.
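For reference, a minimal sketch of a spider config that applies the 1 MB cap mentioned above. The `max_size` parameter is spider.pl's per-file byte limit; the URL is the one from this thread, and the other values are illustrative:

```perl
# SwishSpiderConfig.pl sketch -- illustrative values only.
# max_size is spider.pl's per-file byte limit; files larger than
# this are skipped rather than fetched and filtered.
@servers = (
    {
        base_url => 'http://local.dev.port.com/pdf/',
        max_size => 1_000_000,   # skip anything over 1 MB
    },
);
1;
```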
> So that would indicate that $filter->convert is being called but it's
> not being filtered. (Which I guess you know by now.) You can turn on
> filter debugging by setting the environment variable FILTER_DEBUG to something
> true (like 1 or some text).
Found it. I thought I had that set up, but I typo'd it. Sigh.
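For anyone following along, enabling the trace is just an environment variable. A sketch for a Unix shell; on Windows cmd.exe (as used in this thread) it would be `set FILTER_DEBUG=1`:

```shell
# Turn on SWISH::Filter debug tracing; any true value works.
export FILTER_DEBUG=1
# Then run the spider as usual, e.g. (illustrative invocation):
#   perl spider.pl default http://local.dev.port.com/ > spider.out
echo "FILTER_DEBUG=$FILTER_DEBUG"
```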
>> Starting to process new document: application/pdf
++Checking filter [SWISH::Filters::Doc2txt=HASH(0x258f670)] for
++ application/pdf was not filtered by
++Checking filter [SWISH::Filters::Pdf2HTML=HASH(0x257df08)] for
Problems with filter 'SWISH::Filters::Pdf2HTML=HASH(0x257df08)'. Filter
-> open2: IO::Pipe: Can't spawn-NOWAIT: Resource temporarily unavailable
at C:\Progra~1\SWISH-E\lib\swish-e\perl/SWISH/Filter.pm line 1158
Final Content type for
http://local.dev.port.com/pdf/audi_shee_040722.pdf is application/pdf
*No filters were used
>>P.S. I'm still unable to get the Descriptions to work for non-PDF pages.
>>I've spidered the site with PDF filtering off via the test_url option
>>and I can't get the descriptions to appear. There must be something
>>weird about our HTML pages that messes up the indexer.
> Maybe. Again, make a tiny simple HTML page and spider it and see if
> it works. If so, then you know it's not your config. Then try one of
> your HTML pages and see what happens. If nothing then turn on
> ParserWarnLevel 9 in the swish config file and/or validate the page's
OK - I figured out what is going on. The "Local Time - ..." that appears
in the description *is* being harvested from the HTML. I grabbed a
view-source from one of our pages (since 90% of the content for many of
our hub pages comes out of a database) and ran swish-e against this
now-flat HTML file.
The Local Time is embedded as the first real text in the page via a time
function. But strangely, none of the other text on the page shows up;
just a lot of "space" filling up the rest of the description buffer.
Looks like we'll have to wrap the Local Time in those special
comments <!-- noindex --> <!-- index --> to keep this "text" from being indexed.
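A minimal sketch of what that markup might look like, using the comment names given in this thread; the clock markup itself is hypothetical:

```html
<!-- Fence off the generated "Local Time" text so swish-e skips it
     when building the description. -->
<!-- noindex -->
<p>Local Time - <span id="clock">12:34</span></p>
<!-- index -->
<p>Real page content resumes here and is indexed normally.</p>
```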
I've done a bit of testing with this route and it appears promising;
I just need to set the comments up so they don't interfere with the link spidering.
Received on Fri Sep 24 15:33:23 2004