Thank you very much for the fast and helpful reply. You gave me
plenty to work with, and clarified many issues for me with very few
words. I appreciate the time you've saved me in experimentation.
While you're free to do as you will with the filter I attached, I will
post an improved version (at least with the title fix) to this list by
the weekend if all goes well.
I assume that's the preferred way to submit stuff, but I notice you
have a wiki now. Would you prefer I post it there when done? Also, I
think your documentation is excellent, especially after studying it a
while, but for some reason it's hard for me to grasp in a short time.
I think an overview/intro summarizing how the program works (which I
think I get now) would be very helpful and I would like to volunteer a
contribution or two toward this end. Would you prefer doc submissions
via the mailing list, the wiki, or otherwise?
Best regards,
Randy
On Wed, 2 Feb 2005 09:23:09 -0800, Bill Moseley <moseley@hank.org> wrote:
> On Wed, Feb 02, 2005 at 08:35:58AM -0800, Randy wrote:
> > One thing I was missing was a ppt filter; I saw a lot of requests for
> > such a filter in the archive, but no working code. It wasn't hard to
> > make a basic working one, here it is (just put it in your Filters
> > directory, and make sure the ppthtml executable is in your path)
>
> Cool. I'll add that to the distribution.
>
> > I have not yet figured out how to pass a more useful title back to
> > Filter.pm. The code above generates doc titles like "/tmp/foo1234"
> > where I'd like to have the actual name of the .ppt file instead. I'm
> > still reading all the docs, so I'm sure I'll get to the answer
> > eventually, but if anyone wants to give me a hint I won't mind :)
>
> What about something like (not tried or tested):
>
> $$content =~ s{<title>[^<]+</title>}{'<title>' . $doc->name . '</title>'}e;
>
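
A minimal sketch of that title substitution, for reference. The helper name
retitle is made up for illustration (it is not part of Filter.pm), and it
assumes the ppthtml output contains a single <title> element. It also
HTML-escapes the file name, which the one-liner above does not do.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the first <title> element in the HTML
# referenced by $content_ref with $name, HTML-escaping the name first.
# Returns true if a title was found and replaced.
sub retitle {
    my ($content_ref, $name) = @_;
    my %esc = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;');
    (my $safe = $name) =~ s/([&<>])/$esc{$1}/g;
    return $$content_ref =~ s{<title>[^<]*</title>}{<title>$safe</title>}i;
}
```

Inside the filter this would be called as retitle($content, $doc->name),
with $content being the same scalar reference used in the suggestion above.
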
> > Another small item I miss from my htdig setup is automatic indexing
> > inside .zip, .Z, .gz, .tar archives. I'm not really sure how to chain
> > the filters so that, after unzipping an archive, the ppt, doc, xls,
> > html, txt, etc. files inside will be passed to the appropriate filter.
> > Does this recursion happen automatically, or do I have to specify it
> > in my config?
>
> There are two different things happening there. One is encoding and one
> is the file format (mime type).
>
> For encoding, spider.pl, for example, sets:
>
> $request->header('Accept-encoding', 'gzip; deflate') if $can_uncompress;
>
> telling the server it will accept that encoding and the spider will then
> automatically uncompress the document. Its content type will still
> be the uncompressed content type sent by the server.
>
> Now, if the file is a compressed mime-type then you could
> use a filter. You can set a filter to run early in the sort order of
> filters and then after filtering you set a flag saying that filtering
> should continue -- as in a chain of filters.
>
> .zip or .tar is another matter, as it can also be a collection of files.
> Again, I think that would be better dealt with inside spider.pl (or
> whatever calls the filter code). You would need to unpack all the
> files and then one-by-one set the content-type and then process.
>
> > Would it be possible to use FileFilter directives (even though I'm
> > using prog / spider.pl )? Something like:
> >
> > FileFilter .gz gzip "-c '%p'"
> > FileFilter .zip unzip "-p '%p'"
> > etc. for all compression/archive types?
>
> No, not really. Maybe for .gz if just a single file is compressed.
>
> > Will the files inside each archive be passed along to the next
> > appropriate filter? How about (unfortunate cases) where there's a .gz
> > or .tar file inside a .zip file? I'd like to dig as deep as possible.
>
> In spider.pl there's a filter content callback. What I'd do is a
> recursive uncompression (decompression?) into temporary directories
> and for each one set the content-type and then call spider's
> output_content function.
>
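
A rough sketch of that recursive unpacking, as I understand it. It is not
taken from spider.pl: the expand function, the $emit callback and the
$MAX_DEPTH limit are all hypothetical names, and it only handles .gz and
.tar by file extension, using modules that ship with recent Perls. The
callback marks the spot where the content-type would be set and the
document handed on to the spider.

```perl
use strict;
use warnings;
use Archive::Tar;
use File::Temp qw(tempfile);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

my $MAX_DEPTH = 5;    # arbitrary guard against deeply nested archives

# Recursively unwrap $data (named $name) and call $emit->($name, $data)
# once for every file that is not itself a .gz or .tar wrapper.
sub expand {
    my ($name, $data, $emit, $depth) = @_;
    $depth ||= 0;

    # Single compressed file: strip the suffix, gunzip in memory, recurse.
    if ($depth < $MAX_DEPTH && $name =~ s/\.gz\z//i) {
        my $plain;
        gunzip(\$data => \$plain) or die "gunzip $name: $GunzipError";
        return expand($name, $plain, $emit, $depth + 1);
    }

    # Collection of files: spool to a temp file, then recurse per member.
    if ($depth < $MAX_DEPTH && $name =~ /\.tar\z/i) {
        my ($fh, $path) = tempfile(UNLINK => 1);
        binmode $fh;
        print {$fh} $data;
        close $fh or die "close $path: $!";
        my $tar = Archive::Tar->new($path)
            or die "tar $name: " . Archive::Tar->error;
        for my $member ($tar->get_files) {
            next unless $member->is_file;
            expand("$name/" . $member->full_path,
                   $member->get_content, $emit, $depth + 1);
        }
        return;
    }

    $emit->($name, $data);    # leaf file: set content-type and index here
}
```

Calling expand('bundle.tar.gz', $bytes, sub { my ($name, $data) = @_; ... })
would then invoke the callback once per file inside the tarball, with names
like bundle.tar/a.txt. A .zip branch would follow the same pattern.
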
> But what if one of the compressed files is .html? Would you want to
> search it for links to follow? ;)
>
> BTW -- I've been planning on rewriting spider.pl for quite a while. I
> want to make the spider a class so that instead of having call-back
> functions you would sub-class the spider to override its methods.
>
> --
> Bill Moseley
> moseley@hank.org
>
> Unsubscribe from or help with the swish-e list:
> http://swish-e.org/Discussion/
>
> Help with Swish-e:
> http://swish-e.org/current/docs
> swish-e@sunsite.berkeley.edu
Received on Wed Feb 2 16:25:17 2005