Re: [swish-e] Swish-e and HarvestMan

From: Anand Pillai <abpillai(at)not-real.gmail.com>
Date: Tue May 08 2007 - 07:06:22 GMT
I am using the latest version of swish-e. Here is the version information.

anand@anand-laptop:~/projects/HarvestMan-2.0/HarvestMan$ swish-e -V
SWISH-E 2.4.4

Running on Ubuntu 6.10 on an Intel dual-core 1.83 GHz machine with 1 GB RAM.

-Anand

On 5/8/07, Anand Pillai <abpillai@gmail.com> wrote:
> Hello list,
>
>   I am not sure if this is the right forum to post this question. I searched
> the swish-e website for a developer list, but could not find any. If I am
> posting in the wrong forum, please excuse!
>
> I am the developer and maintainer of HarvestMan, an open source web
> crawler written in Python
> (http://developer.berlios.de/projects/harvestman). At a user's request,
> I have integrated HarvestMan with swish-e, so that HarvestMan can act as
> the external web-crawling program for the "-S prog" option.
> This work is complete.
>
> Crawling and indexing work well for small crawls of up to about
> 50-100 files. However, when crawling and indexing sites with
> a lot of HTML files, swish-e keeps failing with a "Broken pipe" error. I am
> assuming that swish-e indexes the external program's output by opening
> a pipe to the program's STDOUT and reading from it.
>
> The following is a snippet of the error when indexing the current module
> documentation of Python at
> http://www.python.org/doc/current/modindex.html.
>
> <QUOTE>
> anand@anand-laptop:~/projects/HarvestMan-2.0/HarvestMan$ swish-e -c
> examples/swish-config.conf -S prog
> Indexing Data Source: "External-Program"
> Indexing "./harvestman.py"
> External Program found: ./harvestman.py
>
> Warning: Unknown header line: 'me:
> http://www.python.org/doc/current/lib/module-main.html' from program
> ./harvestman.py
> err: External program failed to return required headers Path-Name:
> .
> anand@anand-laptop:~/projects/HarvestMan-2.0/HarvestMan$ Exception in
> thread fetcher0:
> Traceback (most recent call last):
>   File "threading.py", line 442, in __bootstrap
>     self.run()
>   File "/home/anand/projects/HarvestMan-2.0/HarvestMan/crawler.py",
> line 201, in run
>     self.action()
>   File "/home/anand/projects/HarvestMan-2.0/HarvestMan/crawler.py",
> line 696, in action
>     self.process_url()
>   File "/home/anand/projects/HarvestMan-2.0/HarvestMan/common/methodwrapper.py",
> line 80, in method
>     post(self, x, *args, **kwargs)
>   File "/home/anand/projects/HarvestMan-2.0/HarvestMan/plugins/swish-e.py",
> line 40, in process_url_further
>     sys.stdout.flush()
> IOError: [Errno 32] Broken pipe
> </QUOTE>
>
> Python is clearly showing that this is a case of a broken pipe. HarvestMan is
> a multithreaded program, which means that multiple threads are crawling and
> downloading files at the same time and writing to STDOUT. I have tried
> increasing the delay between writes to STDOUT by different threads, and I
> have also tried running the program with only 1-2 threads. Large crawls
> still fail most of the time.
>
> The swish-e support is added as a HarvestMan plugin. The current source
> code can be seen and downloaded from berlios CVS.
>
> http://cvs.berlios.de/cgi-bin/viewcvs.cgi/harvestman/HarvestMan-2.0/
>
> I have been able to run HarvestMan with the swish-e plugin when it
> just prints the required information to STDOUT without actually calling swish-e.
> This works fine without any issues.
>
> Can someone let me know where in the swish-e source code I should look
> to try and fix this issue? Is there any configuration parameter that
> controls the input buffer and piping when reading external program
> output?
>
> On a happier note, I have been able to crawl and index smaller sites
> as mentioned. For example, the swish-e docs URL (http://swish-e.org/docs)
> indexes without any issue. The swish-e integration is one of the better
> features of this release of HarvestMan (2.0), so it would be nice if this
> annoying bug were fixed.
>
> Thanks for your help.
>
> Regards
> --
> -Anand
>
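
One possible reading of the "Unknown header line: 'me: ...'" warning in the quoted output is that writes from two threads interleaved on STDOUT and split a "Path-Name:" header in the middle. That is only a guess, not a confirmed diagnosis. Below is a minimal Python sketch of serializing the output, assuming the plugin emits the Path-Name and Content-Length headers that the swish-e "prog" method expects. The names format_record and emit_record are hypothetical and are not taken from the HarvestMan source.

```python
import sys
import threading

# One lock shared by all crawler threads (hypothetical; not HarvestMan's code).
_write_lock = threading.Lock()


def format_record(url, body):
    """Build one "prog" record: headers, a blank line, then the content."""
    # Content-Length must count bytes, not characters, so encode first.
    data = body if isinstance(body, bytes) else body.encode("utf-8")
    header = "Path-Name: %s\nContent-Length: %d\n\n" % (url, len(data))
    return header.encode("utf-8") + data


def emit_record(url, body, out=None):
    """Write a complete record while holding the lock, then flush."""
    if out is None:
        out = sys.stdout.buffer  # binary stream, so no newline translation
    record = format_record(url, body)
    with _write_lock:
        # Headers and content go out together, so another thread's record
        # cannot land between them.
        out.write(record)
        out.flush()
```

Each fetcher thread would call emit_record(url, html) instead of printing the headers and the body separately. The sketch does not handle the case where swish-e has already exited and closed its end of the pipe; a write would then still raise a broken-pipe error.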


-- 
-Anand
_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Received on Tue May 8 03:06:25 2007