Re: [swish-e] Swish-e and HarvestMan

From: William M Conlon <bill(at)not-real.tothept.com>
Date: Tue May 08 2007 - 07:24:39 GMT
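Sometimes this "Unknown header line" warning is an indication that you
have a multi-byte character in your content, so that the Content-Length
header no longer matches the number of bytes actually sent.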
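
To illustrate the point above: as I understand the "-S prog" protocol, swish-e reads the headers for each document and then the number of bytes given by Content-Length, so a length counted in characters rather than bytes can leave it out of step with the stream. Below is a minimal Python 3 sketch, assuming UTF-8 output. The names format_record and emit_record are hypothetical helpers, not part of HarvestMan or swish-e, and the lock is only an assumption about how to keep records written by several crawler threads from interleaving.

```python
import sys
import threading

# Assumption: one process-wide lock so that only one thread at a time
# writes a complete record (headers + body) to the pipe.
_stdout_lock = threading.Lock()


def format_record(path_name, content):
    """Build one record: headers, a blank line, then the document body."""
    # Encode first, then measure: Content-Length must count bytes,
    # not characters, or multi-byte text will make it too small.
    body = content.encode("utf-8") if isinstance(content, str) else content
    headers = "Path-Name: %s\nContent-Length: %d\n\n" % (path_name, len(body))
    return headers.encode("utf-8") + body


def emit_record(path_name, content, out=None):
    """Write one complete record to the binary output stream and flush it."""
    if out is None:
        out = sys.stdout.buffer  # binary stream, no newline translation
    record = format_record(path_name, content)
    with _stdout_lock:
        out.write(record)
        out.flush()
```

A crawler thread would call emit_record(url, html) once per downloaded page. For the string "caf\u00e9" this reports a Content-Length of 5, whereas len() on the unencoded string gives 4.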

Bill



On May 8, 2007, at 12:06 AM, Anand Pillai wrote:

> I am using the latest version of swish-e. Here is the version  
> information.
>
> anand@anand-laptop:~/projects/HarvestMan-2.0/HarvestMan$ swish-e -V
> SWISH-E 2.4.4
>
> Running on Ubuntu 6.10 on an Intel dual-core 1.83 GHz with 1 GB RAM
>
> -Anand
>
> On 5/8/07, Anand Pillai <abpillai@gmail.com> wrote:
>> Hello list,
>>
>>   I am not sure if this is the right forum to post this question. I
>> searched the swish-e website for a developer list, but could not find
>> any. If I am posting in the wrong forum, please excuse!
>>
>>  I am the developer and maintainer of an open source web crawler
>> program in Python named HarvestMan
>> (http://developer.berlios.de/projects/harvestman). At the request of
>> a user, I have integrated HarvestMan with swish-e, enabling HarvestMan
>> to work as an external program for web crawling, using the "-S prog"
>> option. This work is complete.
>>
>> The crawling and indexing works well for small crawls of, say, up to
>> a maximum of 50-100 files. However, when crawling and indexing sites
>> with a lot of HTML files, swish-e keeps failing with a "Broken Pipe"
>> error. I am assuming that the way swish-e does the indexing of the
>> external program's output is to open a pipe to read the program's
>> STDOUT and index it.
>>
>> The following is a snippet of the error when indexing the current
>> module documentation of Python at
>> http://www.python.org/doc/current/modindex.html.
>>
>> <QUOTE>
>> anand@anand-laptop:~/projects/HarvestMan-2.0/HarvestMan$ swish-e -c examples/swish-config.conf -S prog
>> Indexing Data Source: "External-Program"
>> Indexing "./harvestman.py"
>> External Program found: ./harvestman.py
>>
>> Warning: Unknown header line: 'me: http://www.python.org/doc/current/lib/module-main.html' from program ./harvestman.py
>> err: External program failed to return required headers Path-Name:
>> .
>> anand@anand-laptop:~/projects/HarvestMan-2.0/HarvestMan$ Exception in thread fetcher0:
>> Traceback (most recent call last):
>>   File "threading.py", line 442, in __bootstrap
>>     self.run()
>>   File "/home/anand/projects/HarvestMan-2.0/HarvestMan/crawler.py", line 201, in run
>>     self.action()
>>   File "/home/anand/projects/HarvestMan-2.0/HarvestMan/crawler.py", line 696, in action
>>     self.process_url()
>>   File "/home/anand/projects/HarvestMan-2.0/HarvestMan/common/methodwrapper.py", line 80, in method
>>     post(self, x, *args, **kwargs)
>>   File "/home/anand/projects/HarvestMan-2.0/HarvestMan/plugins/swish-e.py", line 40, in process_url_further
>>     sys.stdout.flush()
>> IOError: [Errno 32] Broken pipe
>> </QUOTE>
>>
>> Python is clearly showing that this is a case of a broken pipe.
>> HarvestMan is a multithreaded program, which means that multiple
>> threads are crawling and downloading files at the same time and
>> writing to STDOUT. I have tried increasing the time period between
>> two threads writing to STDOUT and also tried to run the program with
>> 1-2 threads. Still not much success with large crawls.
>>
>> The swish-e support is added as a HarvestMan plugin. The current
>> source code can be seen and downloaded from berlios CVS.
>>
>> http://cvs.berlios.de/cgi-bin/viewcvs.cgi/harvestman/HarvestMan-2.0/
>>
>> I have been able to run HarvestMan with the swish-e plugin when it
>> just prints the required information to STDOUT without actually
>> calling swish-e. This works fine without any issues.
>>
>> Can someone let me know where in the swish-e source code I should
>> look to try to fix this issue? Is there any configuration parameter
>> that controls the input buffer and piping when reading external
>> program output?
>>
>> On a happier note, I have been able to crawl and index smaller sites
>> as mentioned. For example, the swish-e docs URL
>> (http://swish-e.org/docs) indexes without any issue. The swish-e
>> integration is one of the better features for this release of
>> HarvestMan (2.0), so it would be nice if this annoying bug were fixed.
>>
>> Thanks for your help.
>>
>> Regards
>> --
>> -Anand
>>
>
>
> -- 
> -Anand
> _______________________________________________
> Users mailing list
> Users@lists.swish-e.org
> http://lists.swish-e.org/listinfo/users

Received on Tue May 8 03:24:45 2007