
Re: Made a filter for powerpoint (ppt), included. Have questions.

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Feb 02 2005 - 17:24:45 GMT
On Wed, Feb 02, 2005 at 08:35:58AM -0800, Randy wrote:
> One thing I was missing was a ppt filter; I saw a lot of requests for
> such a filter in the archive, but no working code.  It wasn't hard to
> make a basic working one, here it is (just put it in your Filters
> directory, and make sure the ppthtml executable is in your path)

Cool.  I'll add that to the distribution.

> I have not yet figured out how to pass a more useful title back to
> Filter.pm.  The code above generates doc titles like "/tmp/foo1234"
> where I'd like to have the actual name of the .ppt file instead.  I'm
> still reading all the docs, so I'm sure I'll get to the answer
> eventually, but if anyone wants to give me a hint I won't mind :)

What about something like (not tried or tested):

   $$content =~ s{<title>[^<]+</title>}{'<title>' . $doc->name . '</title>'}e;
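
Here's a quick standalone check of that substitution, with a made-up HTML
string and a stand-in variable for $doc->name (still untested against the
real filter code):

   my $html    = '<html><head><title>/tmp/foo1234</title></head><body>x</body></html>';
   my $content = \$html;
   my $name    = 'slides/report.ppt';   # stand-in for $doc->name
   $$content =~ s{<title>[^<]+</title>}{<title>$name</title>};
   print $$content, "\n";               # the title now carries the real file name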

> Another small item I miss from my htdig setup is automatic indexing
> inside .zip, .Z, .gz, .tar archives.  I'm not really sure how to chain
> the filters so that, after unzipping an archive, the ppt, doc, xls,
> html, txt, etc. files inside will be passed to the appropriate filter.
>  Does this recursion happen automatically, or do I have to specify it
> in my config?

There are two different things happening there.  One is the content
encoding and the other is the file format (mime type).

For the encoding part, spider.pl for example sets:

  $request->header('Accept-encoding', 'gzip; deflate') if $can_uncompress;

telling the server that it will accept that encoding, and the spider will
then automatically uncompress the document.  Its content type will still
be the content type sent by the server, i.e. that of the uncompressed
document.
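
The decompression side would look roughly like this -- a sketch only, not
spider.pl's actual code, using Compress::Zlib's memGunzip and a placeholder
URL:

  use LWP::UserAgent;
  use HTTP::Request;
  use Compress::Zlib;   # provides memGunzip

  my $ua  = LWP::UserAgent->new;
  my $req = HTTP::Request->new( GET => 'http://example.com/doc.html' );
  $req->header( 'Accept-encoding', 'gzip' );

  my $res     = $ua->request( $req );
  my $content = $res->content;

  # Undo the transfer encoding if the server used it.  The Content-Type
  # header still describes the document itself (text/html here), not the
  # gzip wrapper.
  $content = memGunzip( $content )
      if ( $res->header( 'Content-Encoding' ) || '' ) =~ /gzip/;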

Now, if the file itself is of a compressed mime type then you could use a
filter.  You can set a filter to run early in the sort order of filters,
and after filtering set a flag saying that filtering should continue -- as
in a chain of filters.
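
Something like this, maybe (untested -- the package name is made up, and
the document methods I'm calling below, fetch_doc, set_content_type and
set_continue, are from memory, so double check them against the
SWISH::Filter docs):

  package SWISH::Filters::Gunzip;
  use strict;
  use Compress::Zlib;   # provides memGunzip

  sub new {
      my ( $class ) = @_;
      # Claim gzip'd documents; the exact constructor shape may differ.
      return bless { mimetypes => [ qr!application/x-gzip! ] }, $class;
  }

  sub filter {
      my ( $self, $doc ) = @_;

      my $compressed = $doc->fetch_doc;              # assumed: ref to raw content
      my $plain      = memGunzip( $$compressed ) or return;

      # A real filter would guess the type from the file name or magic
      # bytes instead of hard-coding it.
      $doc->set_content_type( 'text/html' );

      # Tell SWISH::Filter we're not done -- the next filter in the chain
      # should still get a look at the unpacked content.
      $doc->set_continue;

      return \$plain;
  }

  1;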

A .zip or .tar is another matter, as an archive can be a collection of
files.  Again, I think that would be better dealt with inside spider.pl
(or whatever calls the filter code).  You would need to unpack all the
files and then, one by one, set the content type and process each.
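
For a plain .tar the unpack loop might look like this (untested;
'bundle.tar' is a placeholder, and output_content() here is just a stub
for however the caller hands one document to the indexer):

  use Archive::Tar;

  my %type_for = (
      '.html' => 'text/html',
      '.txt'  => 'text/plain',
      '.ppt'  => 'application/vnd.ms-powerpoint',
      '.doc'  => 'application/msword',
  );

  my $tar = Archive::Tar->new( 'bundle.tar' ) or die "can't read bundle.tar";

  for my $member ( $tar->list_files ) {
      my ( $ext ) = $member =~ m{(\.[^./]+)$};
      next unless $ext && $type_for{ lc $ext };    # skip types we don't know

      my $content = $tar->get_content( $member );
      # one by one: set the content type, then process this member as if
      # it were a stand-alone document
      output_content( \$content, $member, $type_for{ lc $ext } );
  }

  # Stub so the sketch runs on its own:
  sub output_content {
      my ( $content_ref, $name, $type ) = @_;
      printf "would index %s (%s, %d bytes)\n", $name, $type, length $$content_ref;
  }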

> Would it be possible to use FileFilter directives (even though I'm
> using prog / spider.pl)?  Something like:
> 
> FileFilter .gz gzip "-dc '%p'"
> FileFilter .zip unzip "-p '%p'"
> etc. for all compression/archive types?

No, not really.  Maybe for .gz if just a single file is compressed.

> Will the files inside each archive be passed along to the next
> appropriate filter?  How about (unfortunate cases) where there's a .gz
> or .tar file inside a .zip file?  I'd like to dig as deep as possible.

In spider.pl there's a filter content callback.  What I'd do is a
recursive uncompression (decompression?) into temporary directories, and
for each file set the content type and then call spider.pl's
output_content function.
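
A very rough sketch of that recursion (untested; it shells out to tar and
unzip, the starting path is a placeholder, and index_one_file() stands in
for the set-content-type-and-output_content step):

  use strict;
  use File::Temp qw( tempdir );
  use File::Find qw( find );

  sub index_one_file {
      my ( $path ) = @_;
      # placeholder: here you'd set the content type and call output_content()
      print "would index $path\n";
  }

  sub process_path {
      my ( $path ) = @_;

      if ( $path =~ /\.(?:zip|tar|tar\.gz|tgz)$/i ) {
          my $dir = tempdir( CLEANUP => 1 );

          my @cmd;
          if    ( $path =~ /\.zip$/i )             { @cmd = ( 'unzip', '-qq', $path, '-d', $dir ) }
          elsif ( $path =~ /\.(?:tgz|tar\.gz)$/i ) { @cmd = ( 'tar', '-xzf', $path, '-C', $dir ) }
          else                                     { @cmd = ( 'tar', '-xf',  $path, '-C', $dir ) }
          system( @cmd ) == 0 or return;

          # recurse -- anything we just unpacked may itself be an archive
          find( sub { process_path( $File::Find::name ) if -f }, $dir );
      }
      else {
          index_one_file( $path );
      }
  }

  process_path( 'incoming/bundle.zip' );   # placeholder starting point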

But what if one of the compressed files is .html?  Would you want to
search it for links to follow?  ;)

BTW -- I've been planning on rewriting spider.pl for quite a while.  I
want to make the spider a class so that instead of having call-back
functions you would sub-class the spider to override its methods.


-- 
Bill Moseley
moseley@hank.org

Unsubscribe from or help with the swish-e list: 
   http://swish-e.org/Discussion/

Help with Swish-e:
   http://swish-e.org/current/docs
   swish-e@sunsite.berkeley.edu