Re: [swish-e] Swish-e and HarvestMan

From: Anand Pillai <abpillai(at)>
Date: Tue May 08 2007 - 07:06:22 GMT
I am using the latest version of swish-e. Here is the version information.

anand@anand-laptop:~/projects/HarvestMan-2.0/HarvestMan$ swish-e -V
SWISH-E 2.4.4

Running on Ubuntu 6.10 on an Intel dual-core 1.83 GHz machine with 1 GB RAM.


On 5/8/07, Anand Pillai <> wrote:
> Hello list,
>   I am not sure if this is the right forum to post this question. I searched
> the swish-e website for a developer list, but could not find any. If I am
> posting in the wrong forum, please excuse!
>  I am the developer and maintainer of an open source web crawler
> program in Python named HarvestMan
> ( As part of a request from a user, I have integrated HarvestMan
> with swish-e, enabling HarvestMan to work as an external program for
> web crawling, using the "-S prog" option. This work is complete.
> The crawling and indexing work well for small crawls of, say, up to
> a maximum of 50-100 files. However, when crawling and indexing sites with
> a lot of HTML files, swish-e keeps failing with a "Broken Pipe" error. I am
> assuming that the way swish-e does the indexing of the external
> program's output is to open a pipe to read the program's STDOUT and
> index it.
> The following is a snippet of the error when indexing the current module
> documentation of Python at
> anand@anand-laptop:~/projects/HarvestMan-2.0/HarvestMan$ swish-e -c
> examples/swish-config.conf -S prog
> Indexing Data Source: "External-Program"
> Indexing "./"
> External Program found: ./
> Warning: Unknown header line: 'me:
>' from program
> ./
> err: External program failed to return required headers Path-Name:
> .
> anand@anand-laptop:~/projects/HarvestMan-2.0/HarvestMan$ Exception in
> thread fetcher0:
> Traceback (most recent call last):
>   File "", line 442, in __bootstrap
>   File "/home/anand/projects/HarvestMan-2.0/HarvestMan/",
> line 201, in run
>     self.action()
>   File "/home/anand/projects/HarvestMan-2.0/HarvestMan/",
> line 696, in action
>     self.process_url()
>   File "/home/anand/projects/HarvestMan-2.0/HarvestMan/common/",
> line 80, in method
>     post(self, x, *args, **kwargs)
>   File "/home/anand/projects/HarvestMan-2.0/HarvestMan/plugins/",
> line 40, in process_url_further
>     sys.stdout.flush()
> IOError: [Errno 32] Broken pipe
> </QUOTE>
> Python is clearly showing that this is a case of a broken pipe. HarvestMan is
> a multithreaded program which means that multiple threads are crawling and
> downloading files at the same time and writing to STDOUT. I have tried
> increasing the time period between two threads writing to STDOUT and also
> tried to run the program with 1-2 threads. Still not much success with large
> crawls.
> The swish-e support is added as a HarvestMan plugin. The current source
> code can be seen and downloaded from berlios CVS.
> I have been able to run HarvestMan with the swish-e plugin when it
> just prints the required information to STDOUT without actually calling swish-e.
> This works fine without any issues.
> Can someone let me know where in the swish-e source code I should look
> to try and fix this issue? Is there any configuration parameter that
> controls the input buffer and piping when reading external program output?
> On a happier note, I have been able to crawl and index smaller sites
> as mentioned.
> For example the swish-e docs URL {} indexes without
> any issue. The swish-e integration is one of the better features for
> this release of HarvestMan (2.0), so it would be nice if this annoying
> bug is fixed.
> Thanks for your help.
> Regards
> --
> -Anand
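
The warning in the quoted output ("Unknown header line: 'me:") looks like the tail of a "Path-Name:" header, which would be consistent with records from several threads being interleaved on STDOUT, or with a document body whose length does not match what was announced. This is only a guess from the log, not a confirmed diagnosis. A minimal Python sketch of one way to rule that out is to build each record completely and write it under a single lock. The names format_record and emit_record are hypothetical and are not part of HarvestMan or swish-e, and the Content-Length header is taken from the swish-e "-S prog" documentation as I understand it rather than from the output above.

```python
import sys
import threading

# One lock shared by every crawler thread that writes records.
_stdout_lock = threading.Lock()


def format_record(path, body):
    """Build one complete record: headers, a blank line, then the body.

    `body` must be bytes so that Content-Length counts bytes, not characters.
    """
    header = "Path-Name: %s\nContent-Length: %d\n\n" % (path, len(body))
    return header.encode("utf-8") + body


def emit_record(path, body, out=None):
    """Write one record atomically with respect to the other threads.

    `out` defaults to the binary STDOUT stream; any object with
    write() and flush() can be passed instead.
    """
    if out is None:
        out = sys.stdout.buffer
    record = format_record(path, body)
    with _stdout_lock:
        # Header and body go out in a single write, so another thread
        # cannot slip its own header in between them.
        out.write(record)
        out.flush()
```

Each fetcher thread would call emit_record(url, data) instead of printing the headers and the document separately. If the broken pipe still occurs with this in place, interleaving is probably not the cause and the problem lies elsewhere.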

Received on Tue May 8 03:06:25 2007