On Fri, Sep 24, 2004 at 03:30:53PM -0700, Anthony Baratta wrote:
Seems to work ok for me. So far I've grabbed 28
$ fgrep Path-Name spider.out | wc -l
Some of those PDFs are killer, though. pdftotext sucks up my CPU.
That check for too-big files is a bit dumb -- it doesn't use the
Content-Length header but instead downloads the whole file first. A new
spider.pl will be created soon...
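
A minimal sketch of what a header-based size check could look like,
assuming the server actually sends Content-Length. The helper name
exceeds_max_size and the 5 MB limit are made up for illustration and
are not taken from spider.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the reported Content-Length is larger
# than $max_bytes. Returns false when the header is missing or not a
# plain number, because the size is then unknown until the body is read.
sub exceeds_max_size {
    my ($content_length, $max_bytes) = @_;
    return 0 unless defined $content_length && $content_length =~ /^\d+$/;
    return $content_length > $max_bytes ? 1 : 0;
}

# Usage sketch (needs LWP and network access, so it is commented out;
# the URL is a placeholder):
# use LWP::UserAgent;
# my $ua  = LWP::UserAgent->new;
# my $res = $ua->head('http://example.com/big.pdf');
# print "skipping\n"
#     if exceeds_max_size(scalar $res->header('Content-Length'), 5_000_000);
```

A HEAD request only helps when the header is present, so a fallback
limit on the bytes actually read would still be needed.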
> ++Checking filter [SWISH::Filters::Pdf2HTML=HASH(0x257df08)] for
> (202 words)
> Problems with filter 'SWISH::Filters::Pdf2HTML=HASH(0x257df08)'. Filter
> -> open2: IO::Pipe: Can't spawn-NOWAIT: Resource temporarily
> unavailable at C:\Progra~1\SWISH-E\lib\swish-e\perl/SWISH/Filter.pm line 1158
Yuck. I wonder what resource they are talking about.
Well, maybe you can help.
On non-Windows systems I use fork/exec to run an external program.
Can't do that on Windows.
So the issue is how to run an external program *safely* under Windows.
What I'm currently using is IPC::Open3, which under Windows is supposed
to avoid the shell (although it seems like the data is still messed
with by Windows).
I'm sure there's a better way to run a program from Perl under
Windows, but I've never found anyone who could help.
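
For reference, a minimal sketch of the list form of open3, which passes
the command as separate arguments instead of one shell string. The
helper name run_command is hypothetical and this is not the code in
Filter.pm. Two caveats: on Windows the argument list still has to be
joined into a single command line for the child process, so quoting can
still be altered, and the explicit waitpid is only a guess at what
might matter for the "Resource temporarily unavailable" error, not a
diagnosis.

```perl
use strict;
use warnings;
use IPC::Open3;

# Hypothetical helper: run a command given as a list of arguments
# (no shell command string) and return its output and exit status.
sub run_command {
    my @cmd = @_;

    # Passing undef as the third handle sends the child's STDERR to the
    # same handle as its STDOUT, so only one pipe has to be drained.
    my $pid = open3(my $in, my $out, undef, @cmd);
    close $in;    # nothing is written to the child's STDIN

    # Read everything the child prints.
    my $output = do { local $/; <$out> };
    close $out;

    # Reap the child so finished processes do not accumulate.
    waitpid($pid, 0);
    return ($output, $? >> 8);
}
```

Something like run_command('pdftotext', $file, '-') would be the
intended use, with $file standing in for whatever path the filter is
handed.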
> The Local Time is embedded at the first real text in the page via a time
> function. But strangely none of the other text on the page shows up.
Are you asking it to capture enough bytes?
StoreDescription HTML* <body> 1000000
Received on Fri Sep 24 15:57:49 2004