Good documentation is always appreciated, especially examples.
I think this list is the best place to post contributions that might end up in
the distribution: docs, code, etc. The wiki is a place for conversation, project
ideas, etc., but not a very good place to send code. Besides, some of us aren't
in the habit of checking it as often as our email. :)
Randy wrote on 2/2/05 6:24 PM:
> Thank you very much for the fast and helpful reply. You gave me
> plenty to work with, and clarified many issues for me with very few
> words. I appreciate the time you've saved me in experimentation.
> While you're free to do as you will with the filter I attached, I will
> post an improved version (at least with the title fix) to this list by
> the weekend if all goes well.
> I assume that's the preferred way to submit stuff, but I notice you
> have a wiki now. Would you prefer I post it there when done? Also, I
> think your documentation is excellent, especially after studying it a
> while, but for some reason it's hard for me to grasp in a short time.
> I think an overview/intro summarizing how the program works (which I
> think I get now) would be very helpful and I would like to volunteer a
> contribution or two toward this end. Would you prefer doc submissions
> via the mailing list, the wiki, or otherwise?
> Best regards,
> On Wed, 2 Feb 2005 09:23:09 -0800, Bill Moseley <firstname.lastname@example.org> wrote:
>>On Wed, Feb 02, 2005 at 08:35:58AM -0800, Randy wrote:
>>>One thing I was missing was a ppt filter; I saw a lot of requests for
>>>such a filter in the archive, but no working code. It wasn't hard to
>>>make a basic working one; here it is (just put it in your Filters
>>>directory, and make sure the ppthtml executable is in your path)
>>Cool. I'll add that to the distribution.
>>>I have not yet figured out how to pass a more useful title back to
>>>Filter.pm. The code above generates doc titles like "/tmp/foo1234"
>>>where I'd like to have the actual name of the .ppt file instead. I'm
>>>still reading all the docs, so I'm sure I'll get to the answer
>>>eventually, but if anyone wants to give me a hint I won't mind :)
>>What about something like (not tried or tested):
>> $$content =~ s{<title>[^<]+</title>}{'<title>' . $doc->name . '</title>'}e;
>>>Another small item I miss from my htdig setup is automatic indexing
>>>inside .zip, .Z, .gz, .tar archives. I'm not really sure how to chain
>>>the filters so that, after unzipping an archive, the ppt, doc, xls,
>>>html, txt, etc. files inside will be passed to the appropriate filter.
>>> Does this recursion happen automatically, or do I have to specify it
>>>in my config?
>>There are two different things happening there. One is encoding and one
>>is the file format (mime type).
>>For encoding, spider.pl, for example, sets:
>> $request->header('Accept-encoding', 'gzip; deflate') if $can_uncompress;
>>telling the server it will accept that encoding and the spider will then
>>automatically uncompress the document. Its content type will still
>>be the uncompressed content type sent by the server.
>>Now, if the file is a compressed mime-type then you could
>>use a filter. You can set a filter to run early in the sort order of
>>filters and then after filtering you set a flag saying that filtering
>>should continue -- as in a chain of filters.
>>.zip or .tar is another matter, as it can also be a collection of files.
>>Again, I think that would be better dealt with inside spider.pl (or
>>whatever calls the filter code). You would need to unpack all the
>>files and then, one by one, set the content-type and process each file.
>>>Would it be possible to use FileFilter directives (even though I'm
>>>using prog / spider.pl )? Something like:
>>>FileFilter .gz gzip "-c '%p'"
>>>FileFilter .zip unzip "-p '%p'"
>>>etc. for all compression/archive types?
>>No, not really. Maybe for .gz if just a single file is compressed.
>>>Will the files inside each archive be passed along to the next
>>>appropriate filter? How about (unfortunate cases) where there's a .gz
>>>or .tar file inside a .zip file? I'd like to dig as deep as possible.
>>In spider.pl there's a filter content callback. What I'd do is a
>>recursive uncompression (decompression?) into temporary directories and for each one
>>set the content-type and then call spider's output_content function.
>>But what if one of the compressed files is .html? Would you want to
>>search it for links to follow? ;)
>>BTW -- I've been planning on rewriting spider.pl for quite a while. I
>>want to make the spider a class so that instead of having call-back
>>functions you would sub-class the spider to override its methods.
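
The recursive unpacking idea quoted above could be sketched roughly as follows. This is only an illustration, not code from spider.pl or the Swish-e distribution: the name unpack_recursive, the %TYPE_FOR extension map, and the handler callback are all hypothetical, and the handler merely stands in for wherever you would set the content-type and hand the file on for indexing. It handles .gz and .tar (so .tar.gz too) using core Perl modules only; .zip and .Z are left out.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Basename qw(basename);
use File::Spec;
use Archive::Tar;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Hypothetical extension-to-type map; adjust to whatever your filters expect.
my %TYPE_FOR = (
    html => 'text/html',
    txt  => 'text/plain',
    ppt  => 'application/vnd.ms-powerpoint',
    doc  => 'application/msword',
    xls  => 'application/vnd.ms-excel',
);

# Peel .gz and .tar layers off $path into temporary directories, then call
# $handler->($file, $content_type) for every non-archive file found.
sub unpack_recursive {
    my ($path, $handler, $depth) = @_;
    $depth //= 0;
    die "archive nesting too deep: $path\n" if $depth > 10;

    if ($path =~ /\.gz\z/i) {
        # Single compressed file: strip the .gz suffix and look again.
        my $dir = tempdir(CLEANUP => 1);
        (my $name = basename($path)) =~ s/\.gz\z//i;
        my $out = File::Spec->catfile($dir, $name);
        gunzip($path => $out) or die "gunzip failed: $GunzipError\n";
        unpack_recursive($out, $handler, $depth + 1);
    }
    elsif ($path =~ /\.tar\z/i) {
        # Collection of files: write each member out and recurse into it.
        my $dir = tempdir(CLEANUP => 1);
        my $tar = Archive::Tar->new($path)
            or die "cannot read $path: ", Archive::Tar->error, "\n";
        my $n = 0;
        for my $entry ($tar->get_files) {
            next unless $entry->is_file;
            # Use only the base name (numbered to avoid clashes) so that
            # member paths cannot escape the temporary directory.
            my $out = File::Spec->catfile(
                $dir, sprintf('%04d-%s', $n++, basename($entry->full_path)));
            open my $fh, '>:raw', $out or die "cannot write $out: $!\n";
            print {$fh} $entry->get_content // '';
            close $fh or die "cannot close $out: $!\n";
            unpack_recursive($out, $handler, $depth + 1);
        }
    }
    else {
        # Plain file: guess a content type from the extension.
        my ($ext) = $path =~ /\.([^.\/]+)\z/;
        my $type = $TYPE_FOR{ lc($ext // '') } // 'application/octet-stream';
        $handler->($path, $type);
    }
}

# Print "type<TAB>file" for each file found in archives named on the command line.
unpack_recursive($_, sub { print "$_[1]\t$_[0]\n" }) for @ARGV;
```

The temporary directories are removed when the program exits, so the handler has to read or copy each file while it runs. Whether unpacked .html files should also be searched for links is the open question raised above; this sketch does not attempt it.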
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
"One of the best things to come out of the home computer revolution
could be the general and widespread understanding of how severely limited logic
really is."
- Frank Herbert (1920-1986, American Writer)
Received on Wed Feb 2 17:05:17 2005