Re: Made a filter for powerpoint (ppt), included. Have questions.

From: Randy <randyest(at)not-real.gmail.com>
Date: Thu Feb 03 2005 - 00:25:11 GMT
Thank you very much for the fast and helpful reply.  You gave me
plenty to work with, and clarified many issues for me with very few
words.  I appreciate the time you've saved me in experimentation. 
While you're free to do as you will with the filter I attached, I will
post an improved version (at least with the title fix) to this list by
the weekend if all goes well.

I assume that's the preferred way to submit stuff, but I notice you
have a wiki now.  Would you prefer I post it there when done?  Also, I
think your documentation is excellent, especially after studying it a
while, but for some reason it's hard for me to grasp in a short time. 
I think an overview/intro summarizing how the program works (which I
think I get now) would be very helpful and I would like to volunteer a
contribution or two toward this end.  Would you prefer doc submissions
via the mailing list, the wiki, or otherwise?

Best regards,
Randy


On Wed, 2 Feb 2005 09:23:09 -0800, Bill Moseley <moseley@hank.org> wrote:
> On Wed, Feb 02, 2005 at 08:35:58AM -0800, Randy wrote:
> > One thing I was missing was a ppt filter; I saw a lot of requests for
> > such a filter in the archive, but no working code.  It wasn't hard to
> > make a basic working one, here it is (just put it in your Filters
> > directory, and make sure the ppthtml executable is in your path)
> 
> Cool.  I'll add that to the distribution.
> 
> > I have not yet figured out how to pass a more useful title back to
> > Filter.pm.  The code above generates doc titles like "/tmp/foo1234"
> > where I'd like to have the actual name of the .ppt file instead.  I'm
> > still reading all the docs, so I'm sure I'll get to the answer
> > eventually, but if anyone wants to give me a hint I won't mind :)
> 
> What about something like (not tried or tested):
> 
>   $$content =~ s{<title>[^<]+</title>}{'<title>' . $doc->name . '</title>'}e;
> 
> > Another small item I miss from my htdig setup is automatic indexing
> > inside .zip, .Z, .gz, .tar archives.  I'm not really sure how to chain
> > the filters so that, after unzipping an archive, the ppt, doc, xls,
> > html, txt, etc. files inside will be passed to the appropriate filter.
> >  Does this recursion happen automatically, or do I have to specify it
> > in my config?
> 
> There are two different things happening there.  One is encoding and one
> is the file format (mime type).
> 
> For encoding, spider.pl, for example, sets:
> 
>  $request->header('Accept-encoding', 'gzip; deflate') if $can_uncompress;
> 
> telling the server it will accept that encoding and the spider will then
> automatically uncompress the document.  Its content type will still
> be the uncompressed content type sent by the server.
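
The distinction between encoding and content type can be illustrated with a small sketch. It is written in Python purely for illustration and is not spider.pl code; `decode_body` and the plain `headers` dict are hypothetical names. The point it shows is that undoing the encoding changes the bytes but not the reported content type.

```python
import gzip
import zlib


def decode_body(headers, body):
    """Undo the HTTP content encoding; the content type is left untouched."""
    encoding = headers.get("Content-Encoding", "").lower()
    if encoding == "gzip":
        body = gzip.decompress(body)
    elif encoding == "deflate":
        # Assumption: zlib-wrapped deflate; some servers send raw deflate instead.
        body = zlib.decompress(body)
    # The type reported is the one the server sent for the uncompressed document.
    content_type = headers.get("Content-Type", "application/octet-stream")
    return content_type, body
```

A response carrying `Content-Encoding: gzip` and `Content-Type: text/html` therefore still reaches the HTML filter as `text/html` once the body has been decompressed.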
> 
> Now, if the file is a compressed mime-type then you could
> use a filter.  You can set a filter to run early in the sort order of
> filters and then after filtering you set a flag saying that filtering
> should continue -- as in a chain of filters.
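
The chaining idea described above might look roughly like the following. This is a Python sketch of the concept only, not the SWISH::Filter API: the `Doc` class, the priority numbers, and the `(handled, keep_going)` return convention are all assumptions made for the example.

```python
import gzip
import mimetypes


class Doc:
    """Hypothetical stand-in for a document moving through the filters."""

    def __init__(self, name, content_type, data):
        self.name = name
        self.content_type = content_type
        self.data = data


def gunzip_filter(doc):
    """Early filter: undo gzip, fix up the type, and ask the chain to continue."""
    if doc.content_type != "application/x-gzip":
        return False, True
    doc.data = gzip.decompress(doc.data)
    if doc.name.endswith(".gz"):
        doc.name = doc.name[:-3]
    # Assumption: the inner type is guessed from the remaining file extension.
    doc.content_type = mimetypes.guess_type(doc.name)[0] or "application/octet-stream"
    return True, True


def text_filter(doc):
    """Later filter: trims plain text and ends the chain."""
    if doc.content_type != "text/plain":
        return False, True
    doc.data = doc.data.strip()
    return True, False


# (sort order, label, function); a lower number runs earlier.
FILTERS = [(50, "text", text_filter), (10, "gunzip", gunzip_filter)]


def run_chain(doc, filters=FILTERS):
    """Apply filters in sort order until one handles the doc and says stop."""
    applied = []
    for _order, label, func in sorted(filters, key=lambda f: f[0]):
        handled, keep_going = func(doc)
        if handled:
            applied.append(label)
            if not keep_going:
                break
    return applied
```

Running the chain on a `Doc` named `notes.txt.gz` with type `application/x-gzip` applies the gunzip step first and then hands the decompressed text to the text filter.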
> 
> .zip or .tar is another matter, as it can also be a collection of files.
> Again, I think that would be better dealt with inside spider.pl (or
> whatever calls the filter code).  You would need to unpack all the
> files and then, one by one, set the content-type and process each.
> 
> > Would it be possible to use FileFilter directives (even though I'm
> > using prog / spider.pl )?  Something like:
> >
> > FileFilter .gz gzip "-c '%p'"
> > FileFilter .zip unzip "-p '%p'"
> > etc. for all compression/archive types?
> 
> No, not really.  Maybe for .gz if just a single file is compressed.
> 
> > Will the files inside each archive be passed along to the next
> > appropriate filter?  How about (unfortunate cases) where there's a .gz
> > or .tar file inside a .zip file?  I'd like to dig as deep as possible.
> 
> In spider.pl there's a filter content callback.  What I'd do is a
> recursive uncompression (decompression?) into temporary directories
> and, for each file, set the content-type and then call spider's
> output_content function.
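
A minimal sketch of that recursive unpacking, in Python for illustration only. It works in memory rather than in temporary directories, and `unpack`, the `emit` callback (standing in for something like spider's output_content), and the `max_depth` guard are hypothetical names, not part of spider.pl.

```python
import gzip
import io
import mimetypes
import posixpath
import tarfile
import zipfile


def unpack(name, data, emit, depth=0, max_depth=5):
    """Recursively open gzip/tar/zip layers and emit each leaf file."""
    if depth < max_depth:
        if data[:2] == b"\x1f\x8b":  # gzip magic bytes
            inner = name[:-3] if name.endswith(".gz") else name
            unpack(inner, gzip.decompress(data), emit, depth + 1, max_depth)
            return
        try:
            # "r:" means uncompressed tar only; gzip is handled above.
            with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tf:
                for member in tf.getmembers():
                    if member.isfile():
                        body = tf.extractfile(member).read()
                        unpack(name + "/" + member.name, body, emit,
                               depth + 1, max_depth)
            return
        except tarfile.ReadError:
            pass  # not a tar archive
        if zipfile.is_zipfile(io.BytesIO(data)):
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                for info in zf.infolist():
                    if not info.is_dir():
                        unpack(name + "/" + info.filename, zf.read(info), emit,
                               depth + 1, max_depth)
            return
    # Leaf file (or depth limit reached): set a content type and hand it on.
    ctype = mimetypes.guess_type(posixpath.basename(name))[0]
    emit(name, ctype or "application/octet-stream", data)
```

The depth limit is there so that a maliciously nested archive cannot recurse forever. Note that formats which are themselves zip containers would also be opened by this sketch, so a real version would probably decide by content type first.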
> 
> But what if one of the compressed files is .html.  Would you want to
> search it for links to follow?  ;)
> 
> BTW -- I've been planning on rewriting spider.pl for quite a while.  I
> want to make the spider a class so that instead of having call-back
> functions you would sub-class the spider to override its methods.
> 
> --
> Bill Moseley
> moseley@hank.org
Received on Wed Feb 2 16:25:17 2005