Re: PowerPoint module for spider.pl

From: Alan Ivey <ai4891(at)not-real.yahoo.com>
Date: Thu Jul 08 2004 - 13:52:31 GMT
Wow, thanks a lot! It all makes sense now! The
lightbulb finally turned on, lol. 

I edited Doc2txt.pm like you showed, and now I'm
trying to write a Ppt2txt.pm. There isn't a binary
that converts ppt to txt, only to HTML (à la
ppthtml). The only problem is, the <TITLE> is the
full filename and path, which, with SWISH-E, comes out
as something like /tmp/sddwt4g490 or whatever.

I know I can pipe the output through w3m with some
options to strip the HTML tags to make it text, but
I'm having a hard time figuring out how to make it
work in a module. Using Doc2txt.pm as an example,
I tried about 20 different things I was hoping would
work, but no luck.

How would I change the line...
my $content = $filter->run_program( $self->{ppthtml}, $file )

To do the bash equivalent of...
ppthtml [filegoeshere] | w3m -dump -T text/html | perl -pe 's/\xa0/ /g'
?
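
One possible workaround, sketched here on the assumption that run_program simply executes the program it is given with the file as its only argument (which is what the line above suggests): wrap the whole pipeline in a small shell script and hand that script to run_program in place of ppthtml. The script name ppt2txt.sh and the function name ppt_to_text are made up for illustration; ppthtml, w3m and the perl substitution are the ones from the pipeline above.

```shell
#!/bin/sh
# ppt2txt.sh (hypothetical name): convert a PowerPoint file to plain text.
# Assumes ppthtml, w3m and perl are all on PATH.

ppt_to_text() {
    # ppthtml writes HTML to stdout, w3m renders it as plain text,
    # and perl replaces non-breaking spaces (0xA0) with ordinary spaces.
    ppthtml "$1" | w3m -dump -T text/html | perl -pe 's/\xa0/ /g'
}

# Run the conversion only when a file argument is supplied.
if [ "$#" -ge 1 ]; then
    ppt_to_text "$1"
fi
```

If run_program behaves as assumed, pointing $self->{ppthtml} at this script should leave the rest of the module unchanged. This is untested against SWISH::Filter itself.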

Unless someone is hoarding a Ppt2txt.pm, I'd love some
help :)

I greatly appreciate all the help thus far! If I can
get this working, we'll go live with it!

--- Bill Moseley <moseley@hank.org> wrote:
> On Wed, Jul 07, 2004 at 11:33:36AM -0700, Alan Ivey wrote:
> > @servers = (
> >     # Localhost
> >     {
> >         skip        => 0,
> >         
> >         base_url    => 'http://localhost',
> >         same_hosts  => [ qw/127.0.0.1/ ],
> >         agent       => 'swish-e spider http://swish-e.org/',
> >         email       => 'alan@localhost',
> > 
> >         delay_sec   => 2,
> 
> Turn on Keep Alives and don't use a delay.
> 
> 
> >         max_time    => 10,
> >         max_files   => 100,
> >         max_indexed => 20, 
> >         keep_alive  => 1,  
> >         filter_content  => \&filter_content,
> >     },
> > );    
> > 
> > I've read the docs several times, and searched on the
> > mailing list, and I'm just not getting it. But like
> > I've said before on this list, I'm fairly new to
> > Linux, and I don't really know much of anything in the
> > way of Perl. So, my question is... do I just have to
> > put modules in the
> > /usr/local/lib/swish-e/perl/SWISH/Filters directory,
> > and then they'll automatically be processed? Don't I
> > have to set the content type somewhere? Wherever they
> > go doesn't jump out at me, a newbie, in the sample
> > file.
> 
> Well, first read http://swish-e.org/current/docs/Filter.html;
> that should give some overview.  Then just pick an existing
> filter and copy it as your new filter.
>
> You can put the filters anyplace, they just need to be in the
> SWISH::Filters name space.  It's not as complex as it sounds --
> SWISH::Filter (SWISH/Filter.pm) takes Perl's @INC array and
> appends "SWISH/Filters" to each path to make a full path to a
> directory.  It then looks in that directory for filters.
> 
> So, you can make a directory called $HOME/SWISH/Filters and add
> a module called PowerPoint.pm to it (the module is
> SWISH::Filters::PowerPoint) and then set PERL5LIB=$HOME and
> SWISH::Filter will find the module.
>
> That make any sense?  SWISH::Filter uses @INC to find the
> filters.
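
For reference, the setup described above might look roughly like this in a shell. It is only a sketch: the function names setup_filter_dir and list_filter_dirs are invented for illustration, and the one-liner just prints every @INC entry that has a SWISH/Filters subdirectory, which is the lookup described above rather than anything taken from SWISH::Filter's code.

```shell
# Create <root>/SWISH/Filters and put <root> on Perl's module search path.
# Note that PERL5LIB gets the root, not the SWISH/Filters directory itself.
setup_filter_dir() {
    root=$1
    mkdir -p "$root/SWISH/Filters" || return 1
    PERL5LIB="$root${PERL5LIB:+:$PERL5LIB}"
    export PERL5LIB
}

# Print each @INC entry that contains a SWISH/Filters directory,
# i.e. the places where a filter module could be picked up.
list_filter_dirs() {
    perl -e 'for (@INC) { print "$_/SWISH/Filters\n" if -d "$_/SWISH/Filters" }'
}
```

Calling setup_filter_dir "$HOME", copying PowerPoint.pm into $HOME/SWISH/Filters, and then running list_filter_dirs should show the new directory among the results.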
> 
> > I wish I knew more Perl :( Tis frustrating.
> 
> Me too.
> 
> > I ran swish-filter-test and it seems there needs to be
> > more than just an existing module. The first time I
> > ran it, it said I needed MIME::Type and MIME::Types so
> > I added those to a suitable Perl folder. Here are the
> > results of my .doc test, even with Doc2txt.pm being in
> > the SWISH Filter folder...
> 
> MIME::Type shouldn't be required -- it's just used if available
> to map from file extensions to content-types.  There are a few
> built-in maps if MIME::Types isn't installed.  But PowerPoint is
> not in there by default.
> 
> 
> > >> Loading filter: [SWISH/Filters/Doc2txt.pm]
> > Find path of [catdoc] in /usr/local/bin:/usr/bin:/bin:/usr/local/lib/swish-e
> >  * Found program at: [/usr/local/bin/catdoc]
> 
> Ok, so that filter found "catdoc" so it's available.
> 
> 
> >  
> > >> Starting to process new document: application/x-msword
> 
> And your document (from MIME::Types, I guess) is marked as
> x-msword.
> 
> >  ++Checking filter [SWISH::Filters::Doc2txt=HASH(0x8ff3b3c)] for application/x-msword
> >  ++ application/x-msword was not filtered by SWISH::Filters::Doc2txt=HASH(0x8ff3b3c)
> 
> For some reason Doc2txt didn't accept the file for filtering.
> What SWISH::Filter does is pass the document to all filters,
> one by one, until it's accepted by a filter.  It's up to the
> filter to determine if it can filter the document -- normally by
> checking the content type.
> 
> It MAY be that Doc2txt doesn't know about that content type.  I
> think at one point it only checked for application/msword and
> then MIME::Types was updated for x-msword.  But I'm not sure.
> Just look at Doc2txt.pm and see what it does.
> 
> moseley@bumby:~/swish-e/filters/SWISH/Filters$ fgrep msword Doc2txt.pm
>     return unless $filter->content_type =~ m!application/(x-)?msword!;
> 
> So the filter is just returning if the content type doesn't
> match.
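
To see which content types that pattern lets through, the same expression can be tried from a shell. This is only a sketch: accepts_msword is a made-up helper that applies the pattern as an extended regex via grep -E, and application/vnd.ms-powerpoint is included on the assumption that it is what a .ppt file would be labelled as.

```shell
# Mirror of the Doc2txt.pm check: succeed when the content type
# matches application/msword or application/x-msword.
accepts_msword() {
    printf '%s\n' "$1" | grep -Eq 'application/(x-)?msword'
}

for type in application/msword application/x-msword application/vnd.ms-powerpoint; do
    if accepts_msword "$type"; then
        echo "$type: accepted"
    else
        echo "$type: skipped"
    fi
done
```

Both msword variants match and the PowerPoint type does not, so a Ppt2txt.pm would presumably need its own check along the same lines.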
> 
> 
> 
> -- 
> Bill Moseley
> moseley@hank.org


Received on Thu Jul 8 06:52:49 2004