Re: RE: LWP,HTTP and HTML modules

From: Mark Gaulin <gaulin(at)>
Date: Tue Jan 19 1999 - 20:20:02 GMT
Just jumping into the discussion here...

Checking mime types makes a lot of sense because a URL
can be as simple as "http://myserver" (or "http://myserver/"),
which has no extension, and can also be something like
"http://myserver/scripts/doit.dll", which may return a gif or
may return an HTML page. So relying on just the file extension
in the URL will exclude dynamic pages or, if they are all allowed,
will attempt to index dynamically generated non-HTML content,
like gifs, etc.
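
To make the point concrete, here is a minimal Python sketch (not anything
taken from swishspider) of what an extension-only guess can and cannot
tell you. The helper name guess_from_extension is made up for the
illustration; it only uses the standard mimetypes table.

```python
import mimetypes
from urllib.parse import urlparse


def guess_from_extension(url):
    """Guess a MIME type from the URL path alone (hypothetical helper).

    Returns None when the path carries no recognizable extension.
    """
    path = urlparse(url).path  # drops scheme, host and any query string
    mime, _encoding = mimetypes.guess_type(path)
    return mime


if __name__ == "__main__":
    for url in (
        "http://myserver",
        "http://myserver/",
        "http://myserver/scripts/doit.dll",
        "http://myserver/logo.gif",
    ):
        # The first two have no extension at all; the .dll guess says
        # nothing about what the script will actually return.
        print(url, "->", guess_from_extension(url))
```

Whatever the table says for ".dll", it describes the file on disk, not
the response the server generates when the URL is requested.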

The reason you and I can use file extensions to index files
is because *we* control those extensions... we know what
they mean and they are by definition not dynamically generated.

The down-side to checking mime types is that some web servers
do not support the HEAD method, and so the algorithm must fall
back to a GET (which I assume it does... I have not looked at it).
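
A rough Python sketch of that fallback, under my own assumptions (I have
not checked how swishspider does it): try HEAD first, and only if the
server answers 405 or 501 repeat the request as a GET. The function name
fetch_content_type and the opener parameter are invented for the example;
opener defaults to urllib.request.urlopen and exists only so the function
can be exercised without a network.

```python
import urllib.error
import urllib.request


def _content_type(url, method, opener, timeout):
    """Issue one request and return the bare Content-Type of the reply."""
    request = urllib.request.Request(url, method=method)
    with opener(request, timeout=timeout) as response:
        # get_content_type() strips parameters such as "; charset=..."
        # and falls back to "text/plain" when the header is missing.
        return response.headers.get_content_type()


def fetch_content_type(url, opener=urllib.request.urlopen, timeout=10):
    """Return the Content-Type for url, using HEAD and then GET if needed."""
    try:
        return _content_type(url, "HEAD", opener, timeout)
    except urllib.error.HTTPError as err:
        # 405 Method Not Allowed / 501 Not Implemented: assume the server
        # does not support HEAD. Any other error is passed on.
        if err.code not in (405, 501):
            raise
    return _content_type(url, "GET", opener, timeout)
```

The GET fallback closes the response without reading the body, but the
server may already have started sending it, so it is still the more
expensive of the two requests.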

I think there may be a middle ground where both methods can be
used... mime types to include urls, and file extensions to exclude
the common non-text extensions.
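
Sketched in Python, the middle ground might look something like this.
Everything here is an assumption for illustration: the two sets, the
function name should_index, and the lookup_mime callable (any function
that asks the server for a url's mime type) are mine, not SWISH's.

```python
import posixpath
from urllib.parse import urlparse

# Assumed example lists; a real configuration would supply its own.
EXCLUDED_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".zip", ".mpg"}
INDEXABLE_TYPES = {"text/html", "text/plain"}


def should_index(url, lookup_mime):
    """Decide whether to index url.

    lookup_mime is a hypothetical callable taking a url and returning the
    mime type reported by the server.
    """
    extension = posixpath.splitext(urlparse(url).path)[1].lower()
    if extension in EXCLUDED_EXTENSIONS:
        # Negative logic: rejected without contacting the server at all.
        return False
    # Positive logic: everything else, including urls with no extension
    # or an unknown one, is decided by what the server says.
    return lookup_mime(url) in INDEXABLE_TYPES
```

The extension check only ever rejects, so a url such as
"http://myserver/scripts/doit.dll" still gets a request and is judged
by the mime type that comes back.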

My 2 cents...

	Mark Gaulin

At 12:04 PM 1/19/99 -0800, Yann Stettler wrote:
>Ron Klatchko wrote:
>> I'm not sure why it is hanging, but it's not due to looking for URLs in
>> binary files.  If you check swishspider, you'll see it only does the check
>> for URLs in file with a mime type of text/html.
>Hmm, yes, possible: it was just an idea why it could hang, but
>I didn't really check. I am more concerned by the transfer of
>the file in the first place... :)
>> >And if you really want to optimize things, you should bring
>> >the "NoContents" directive back into SWISH for the HTTP method.
>> >(why was it done only for the file system method anyway ???).
>> >That would avoid forking a process and running a Perl program
>> >just to realize that the document shouldn't be indexed after
>> >all...
>> I completely disagree with that.  The only way to prevent a request is to
>> use the file extension and I believe that the HTTP method should rely
>> solely on the mime type as this is the definitive statement as to the type
>> of the file. 
>Why? The mime-type is also based on the file extension, so it
>amounts to the same thing. You could actually implement the mime-type
>format in SWISH, but that's overkill...
>Huh, I don't really understand you: you say that you don't agree
>with me, that the only way is to use the file extension... and
>that's exactly what I suggest doing. I am just saying that SWISH
>doesn't need to get the mime-type from the server to know that
>a ".jpg" is not of "text/*" type...
>> Even if it works differently from Swish, I think it makes
>> more sense to use positive logic (only index files with certain mime types)
>> as opposed to the file systems negative logic (index everything except
>> files with certain extensions).
>I don't agree at all!
>1) There is _absolutely nothing_ that says that
>   the "3 character extension" notation has to be used under
>   Unix or any other OS! If people want to name their documents
>   without an extension, that's their right. Note that this is also
>   similar to the default of most web servers, which assume a
>   "text" type when they can't recognize the extension of the
>   document.
>2) Many servers use a database to generate semi-static documents.
>   Those documents generally don't change (for example, newspaper
>   articles, catalog descriptions) but are the result of a query
>   on the database passed directly in the URL.
>   It would be impossible to list all the "possible" meaningless
>   extensions, but we definitely want to index those documents...
>> As for avoiding the fork/exec overhead, I'd love to see the perl helper
>> script swishspider rewritten in C and pulled into the actual swish
>> executable.  The only reason I wrote it in Perl in the first place was to
>Oh, that won't solve everything: it still means that for each
>picture, video and so on, you will make a connection to the
>server to get its mime-type. That will take _a lot_ of time,
>especially if the server is far away. Much better not to make
>the connection at all...
>> anyone know of an HTTP library that we could integrate with Swish?  It
>> should meet the following requirements:
>Doing the connection to the server, passing the request and
>getting the data is not a problem. What is more annoying
>is parsing the URLs... (meaning : I don't have a library already
>coded for that :)
>> 2) Works with most Unices and Win32.
>I am dubious... I don't have the faintest idea how socket programming
>is done under win32... I dunno if my own libraries would work
>or would need to be totally rewritten for Windows...
>Yann Stettler
>TheNet - Internet Services AG              CohProg SaRL
>                              ---**---
>Anime and Manga Services         
Received on Tue Jan 19 12:13:31 1999