Re: wvWare

From: <Allan_Watts(at)>
Date: Tue Jan 27 2004 - 05:52:25 GMT

Thanks.  I do appreciate the suggestions.  I was just getting round to
figuring out how the wvHtml script worked...  Just reporting back on how it

wvWare does seem to be better behaved than catdoc.  It still has problems
with my most difficult Word documents (but it handles long file names, of
course, and doesn't return the ?????????? which your version of catdoc.exe
did for some of my files).  And swish-e seems to be able to recover when
wvWare.exe fails (despite a nasty-looking message box, something like:
wvWare.exe application error .... reference memory at 0x00000000).

Unfortunately waitpid causes problems with wvWare - it seems to wait for
something that is not going to finish, and hangs in the middle of
processing the first file.

Without waitpid, I get to 64 files processed and then the "open2: IO::Pipe:
Can't spawn-NOWAIT" error.
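
One possible reading of both symptoms, offered as an assumption rather than a
diagnosis: calling waitpid before reading lets the child block once it fills
the pipe, and never calling waitpid leaves finished children unreaped until
Perl on Windows runs out of slots for them. A minimal sketch that reads first
and reaps afterwards follows. The helper name run_and_capture is hypothetical,
and the running perl ($^X) stands in for wvWare.exe so the example can be run
anywhere.

```perl
use strict;
use warnings;
use IPC::Open2;

# Hypothetical helper: run a command, collect all of its stdout,
# and only then reap the child process.
sub run_and_capture {
    my @cmd = @_;
    my ($rdr, $wtr);
    my $pid = open2($rdr, $wtr, @cmd);   # list form, no shell quoting needed
    close $wtr;                          # we send nothing to the child's stdin
    local $/;                            # slurp mode, restored on return
    my $content = <$rdr>;                # drain the pipe before waiting
    close $rdr;
    waitpid $pid, 0;                     # reap the child so its slot is freed
    my $status = $? >> 8;                # exit code of the child
    return ($content, $status);
}

# Stand-in for the converter: a child that writes more than a pipe buffer holds.
my ($out, $status) = run_and_capture($^X, '-e', 'print "x" x 100_000');
print STDERR length($out), " bytes read, exit status $status\n";
```

The order is the point of the sketch: drain the reader, close it, then wait.
Whether this also avoids the crash dialog from wvWare.exe is not something the
sketch can show.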

I guess we have to solve the windows_fork problem.  Output from swish-e
(with some interesting messages from wvWare coming through for some
particular files...) and perl script are below.


C:\Program Files\SWISH-E>swish-e -e -S prog -c c:

Indexing Data Source: "External-Program"
Indexing "perl.exe"
External Program found: C:\Perl\bin\/perl.exe
AW - File is: c:/cat/0373.doc
I won't mmap that file, using a slower method
Panic: broken stream, truncating to block 2282996
Invalid seekInvalid seekInvalid seekAW - File is: c:/cat/0374.doc
c:/cat/0373.doc - Using HTML2 parser - **Adding automatic MetaName 'generator' found in file 'c:/cat/0373.doc'
 (10 words)
I won't mmap that file, using a slower method
Invalid seekInvalid seekInvalid seekInvalid seekInvalid seekInvalid seekInvalid seekInvalid seekInvalid seekc:/cat/0374.doc - Using HTML2 parser - AW - File is: c:/cat/0375.doc
 (696 words)
I won't mmap that file, using a slower method
Panic: broken stream, truncating to block 2282996
Invalid seekInvalid seekInvalid seekc:/cat/0375.doc - Using HTML2 parser -

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 302 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
302 unique words indexed.
5 properties sorted.
3 files indexed.  16374 total bytes.  716 total words.
Elapsed time: 00:01:52 CPU time: 00:01:51
Indexing done!

use strict;

use File::Find;  # for recursing a directory tree
use IPC::Open2;

my $rdrfh;
my $wtrfh;

# run through a few of the test files that were causing problems before...
for (my $k = 373; $k <= 375; $k++) {
      my $filename = "c:/cat/".substr("0000$k",-4).".doc";
      print STDERR "AW - File is: $filename\n";
      #my $command = "c:\\data\\swish\\catdoc\\catdoc.exe $filename";
      #my $command = "c:\\progra~1\\swish-e\\lib\\swish-e\\catdoc.exe
      my $command = "c:\\progra~1\\gnuwin32\\bin\\wvWare.exe -x c:/progra~1/gnuwin32/share/wv/wvHtml.xml $filename" ;
      my $pid = IPC::Open2::open2($rdrfh, $wtrfh, "$command" );
      #waitpid $pid,0;
      binmode $rdrfh, ':crlf';
      $/ = undef;   # slurp mode: read the converter's whole output at once

      my $content =  <$rdrfh>;
      my $mtime  = (stat $filename)[9];
      my $size = length $content;

      # headers for swish-e, then a blank line, then the document itself
      print <<EOF;
Content-Length: $size
Last-Mtime: $mtime
Document-Type: HTML*
Path-Name: $filename

EOF
      print $content;
}
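
The output the script hands to swish-e for each file is a small header block,
a blank line, and then the converted document. A minimal sketch of that
framing as a standalone function follows. The name format_doc and the sample
values are hypothetical, and it assumes the content is an undecoded byte
string so that length gives the byte count.

```perl
use strict;
use warnings;

# Hypothetical helper: frame one document for swish-e's -S prog input,
# using the same four headers as the script in this message.
sub format_doc {
    my ($path, $mtime, $content) = @_;
    my $size = length $content;          # byte count if $content holds raw bytes
    return "Path-Name: $path\n"
         . "Content-Length: $size\n"
         . "Last-Mtime: $mtime\n"
         . "Document-Type: HTML*\n"
         . "\n"                          # blank line ends the headers
         . $content;
}

# Illustrative values only; a real run would use stat() and the converter output.
print format_doc('c:/cat/0373.doc', 0, "<html><body>test</body></html>\n");
```

Keeping the framing in one place makes it easier to check that Content-Length
matches exactly what is printed after the blank line.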

David L Norris <> on 23/01/2004 05:14:05 PM

To:    Multiple recipients of list <>
Subject:    [SWISH-E] wvWare

OK, it has come to my attention that wvHtml is a Bourne shell script.
So, I did a little testing with WINE (not exactly Windows, but it's handy).

Here is the wvWare Windows installer:

Install it to c:\gnu, for example.

Then you would run wvWare something like this:
  c:\gnu\bin\wvWare.exe -x C:/gnu/share/wv/wvHtml.xml yourfile.doc -

The last few lines of the wvHtml shell script show how to execute

 David Norris
  ICQ - 412039

Received on Mon Jan 26 21:52:25 2004