[swish-e] Swish-e and HarvestMan

From: Anand Pillai <abpillai(at)>
Date: Tue May 08 2007 - 07:04:16 GMT

Hello list,

  I am not sure if this is the right forum for this question. I searched
the swish-e website for a developer list, but could not find one. If I am
posting in the wrong forum, please excuse me!

 I am the developer and maintainer of HarvestMan, an open source web
crawler program written in Python. As part of a request from a user, I
have integrated HarvestMan with swish-e, enabling HarvestMan to work as
an external program for web crawling, using the "-S prog" option.
This work is complete.

The crawling and indexing work well for small crawls of up to about
50-100 files. However, when crawling and indexing sites with a lot of
HTML files, swish-e keeps failing with a "Broken pipe" error. I am
assuming that swish-e indexes the external program's output by opening
a pipe to read the program's STDOUT.

The following is a snippet of the error when indexing the current module
documentation of Python:

anand@anand-laptop:~/projects/HarvestMan-2.0/HarvestMan$ swish-e -c
examples/swish-config.conf -S prog
Indexing Data Source: "External-Program"
Indexing "./"
External Program found: ./

Warning: Unknown header line: 'me:' from program
err: External program failed to return required headers Path-Name:
anand@anand-laptop:~/projects/HarvestMan-2.0/HarvestMan$ Exception in
thread fetcher0:
Traceback (most recent call last):
  File "", line 442, in __bootstrap
  File "/home/anand/projects/HarvestMan-2.0/HarvestMan/",
line 201, in run
  File "/home/anand/projects/HarvestMan-2.0/HarvestMan/",
line 696, in action
  File "/home/anand/projects/HarvestMan-2.0/HarvestMan/common/",
line 80, in method
    post(self, x, *args, **kwargs)
  File "/home/anand/projects/HarvestMan-2.0/HarvestMan/plugins/",
line 40, in process_url_further
IOError: [Errno 32] Broken pipe

Python clearly reports this as a broken pipe. HarvestMan is a
multithreaded program, which means that multiple threads are crawling and
downloading files at the same time and writing to STDOUT. I have tried
increasing the interval between writes to STDOUT by different threads, and
have also tried running the program with only 1-2 threads. Still not much
success with large crawls.
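
One reading of the "Unknown header line: 'me:'" warning above is that the reader lost track of where a record starts, which can happen if records from different threads interleave or if Content-Length does not match the number of bytes written. The following is only a minimal sketch of one way to rule that out, assuming all output can be funnelled through a single function. The names `emit_document` and `_stdout_lock` are hypothetical and are not part of HarvestMan or swish-e; only the Path-Name and Content-Length headers come from the swish-e "prog" input format.

```python
import threading

# Hypothetical lock shared by every thread that writes records.
_stdout_lock = threading.Lock()


def emit_document(url, content, out):
    """Write one record (headers, blank line, body) as a single unit.

    `out` is a binary stream such as sys.stdout.buffer. Using bytes
    throughout keeps Content-Length equal to the bytes actually written.
    """
    body = content if isinstance(content, bytes) else content.encode("utf-8")
    header = "Path-Name: {}\nContent-Length: {}\n\n".format(url, len(body))
    record = header.encode("utf-8") + body
    # Hold the lock for the whole record so no other thread can write
    # between the headers and the body.
    with _stdout_lock:
        out.write(record)
        out.flush()
```

A fetcher thread would then call something like `emit_document(url, page_bytes, sys.stdout.buffer)` instead of printing the headers and the body separately. This does not by itself explain the broken pipe, which is raised once the reading side has already closed its end, but it removes one way of corrupting the stream.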

The swish-e support is implemented as a HarvestMan plugin. The current
source code can be seen and downloaded from BerliOS CVS.

I have been able to run HarvestMan with the swish-e plugin when it
just prints the required information to STDOUT without actually calling swish-e.
This works fine without any issues.
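
A standalone run like this can also be checked mechanically rather than by eye. The sketch below walks a captured byte stream record by record and fails on the first record whose headers or length do not line up. It is an illustration only: `check_prog_stream` is a hypothetical helper, it is more lenient than a real indexer might be (header names are matched case-insensitively), and it does not claim to reproduce swish-e's own parser.

```python
def check_prog_stream(data):
    """Return the Path-Name of every record in `data` (bytes).

    Raises ValueError on the first malformed record.
    """
    paths = []
    pos = 0
    while pos < len(data):
        # Headers end at the first blank line.
        end = data.find(b"\n\n", pos)
        if end == -1:
            raise ValueError("no blank line after headers at byte %d" % pos)
        headers = {}
        for line in data[pos:end].split(b"\n"):
            name, sep, value = line.partition(b":")
            if not sep:
                raise ValueError("bad header line %r at byte %d" % (line, pos))
            headers[name.strip().lower()] = value.strip()
        for required in (b"path-name", b"content-length"):
            if required not in headers:
                raise ValueError(
                    "missing %s at byte %d" % (required.decode(), pos)
                )
        length = int(headers[b"content-length"])
        body_start = end + 2
        if body_start + length > len(data):
            raise ValueError("body truncated at byte %d" % body_start)
        paths.append(headers[b"path-name"].decode("utf-8", "replace"))
        # The next record must begin exactly where this body ends.
        pos = body_start + length
    return paths
```

Redirecting the crawler's output to a file and passing the file's bytes to this function on a large crawl would show whether the stream is already malformed before swish-e ever reads it. An overstated Content-Length, for example, makes the next record's headers start mid-line, much like the truncated 'me:' header in the warning.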

Can someone let me know where in the swish-e source code I should look
to try and fix this issue? Is there any configuration parameter that
controls the input buffer and piping when reading external program output?

On a happier note, I have been able to crawl and index smaller sites
as mentioned. For example, the swish-e docs URL indexes without any
issue. The swish-e integration is one of the better features of this
release of HarvestMan (2.0), so it would be nice if this annoying bug
could be fixed.

Thanks for your help.

Received on Tue May 8 03:04:20 2007