Re: RE: LWP,HTTP and HTML modules

From: Yann Stettler <stettler(at)>
Date: Tue Jan 19 1999 - 20:11:38 GMT
Ron Klatchko wrote:

> I'm not sure why it is hanging, but it's not due to looking for URLs in
> binary files.  If you check swishspider, you'll see it only does the check
> for URLs in file with a mime type of text/html.

Hmm, yes, possible: it was just an idea of why it could hang, but
I didn't really check. I am more concerned by the transfer of
the file in the first place... :)

> >And if you really want to optimize things, you should implement
> >the "NoContents" directive back into SWISH for the HTTP method
> >(why was it done only for the file system method anyway???).
> >That would avoid forking a process and running a Perl program
> >just to realize that the document shouldn't be indexed after
> >all...

> I completely disagree with that.  The only way to prevent a request is to
> use the file extension and I believe that the HTTP method should rely
> solely on the mime type as this is the definitive statement as to the type
> of the file. 

Why? The mime-type is also based on the file extension, so it
amounts to the same thing. You could actually implement the
mime-type format in SWISH, but that's overkill...

Huh, I don't really understand you: you say that you don't agree
with me, that the only way is to use the file extension... and
that's exactly what I suggest doing. I am just saying that SWISH
doesn't need to get the mime-type from the server to know that
a ".jpg" is not of "text/*" type...
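
As a rough illustration of that point (a sketch only, not code from
SWISH or swishspider), the check could look like the Python below.
The function name is_obviously_not_text and the example URLs are
made up for this example; the lookup relies on Python's standard
mimetypes table, which stands in for whatever table SWISH would use.

```python
import mimetypes
from urllib.parse import urlsplit


def is_obviously_not_text(url):
    """Return True when the URL's extension maps to a known non-text type.

    Hypothetical helper: it never contacts the server, it only looks at
    the path part of the URL (the query string is ignored).
    """
    path = urlsplit(url).path
    guessed_type, _encoding = mimetypes.guess_type(path)
    # Unknown or missing extensions return None, so they are NOT skipped.
    return guessed_type is not None and not guessed_type.startswith("text/")


if __name__ == "__main__":
    for url in ("http://example.com/pics/logo.jpg",
                "http://example.com/index.html",
                "http://example.com/cgi-bin/article?id=42"):
        print(url, is_obviously_not_text(url))
```

Only URLs whose extension is positively known to be non-text are
rejected; anything else would still go through the normal request.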

> Even if it works differently from Swish, I think it makes
> more sense to use positive logic (only index files with certain mime types)
> as opposed to the file systems negative logic (index everything except
> files with certain extensions).

I don't agree at all!

1) There is _absolutely nothing_ that says that
   the "3 characters extension" notation has to be used under
   Unix or other OSes! If people want to name their documents
   without an extension, that's their right. Note that it's also
   similar to the default of most web servers, which assume a
   "text" type when they can't recognize the extension of the

2) Many servers use a database to generate semi-static documents.
   Those documents generally don't change (for example, newspaper
   articles, catalog descriptions) but are the result of a query
   on the database passed directly in the URL.
   It would be impossible to list all the "possible" meaningless
   extensions, but we definitely want to index those documents...
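
To make the "negative logic" in the two points above concrete, here
is a minimal Python sketch, assuming a NoContents-style skip list
applied to URLs. SKIP_EXTENSIONS, should_fetch and the URLs are
invented for the illustration and are not SWISH settings.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical skip list, in the spirit of a NoContents-style directive.
SKIP_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".mpg", ".avi", ".zip"}


def should_fetch(url):
    """Negative logic: fetch everything except known binary extensions."""
    path = urlsplit(url).path              # drops the query string
    extension = posixpath.splitext(path)[1].lower()
    return extension not in SKIP_EXTENSIONS


if __name__ == "__main__":
    for url in ("http://example.com/news/article",         # no extension
                "http://example.com/cgi-bin/show?id=42",    # database query
                "http://example.com/img/photo.JPG"):        # skipped
        print(url, should_fetch(url))
```

Documents without an extension and documents generated from a query
in the URL are both still fetched; only the listed extensions are
dropped before any connection is made.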

> As for avoiding the fork/exec overhead, I'd love to see the perl helper
> script swishspider rewritten in C and pulled into the actual swish
> executable.  The only reason I wrote it in Perl in the first place was to

Oh, that won't solve everything: it still means that for each
picture, video and so on, you will open a connection to the
server to get its mime-type. That will take _a lot_ of time,
especially if the server is far away. Much better not to make
the connection at all...

> anyone know of an HTTP library that we could integrate with Swish?  It
> should meet the following requirements:

Making the connection to the server, passing the request and
getting the data is not a problem. What is more annoying
is parsing the URLs... (meaning: I don't have a library already
coded for that :)

> 2) Works with most Unices and Win32.

I am dubious... I don't have the faintest idea how socket
programming is done under Win32... I don't know if my own libraries
would work or would need to be totally rewritten for Windows...

Yann Stettler

TheNet - Internet Services AG              CohProg SaRL                           
Anime and Manga Services         
Received on Tue Jan 19 12:03:38 1999