Ron Klatchko wrote:
> I'm not sure why it is hanging, but it's not due to looking for URLs in
> binary files. If you check swishspider, you'll see it only does the check
> for URLs in files with a mime type of text/html.
Hmm, yes, possible: it was just an idea about why it could hang, but
I didn't really check. I am more concerned about the transfer of
the file in the first place... :)
> >And if you really want to optimize things, you should implement
> >the "NoContents" directive back into SWISH for the HTTP method.
> >(Why was it done only for the file system method anyway?)
> >That would avoid forking a process and running a Perl program
> >just to realize that the document shouldn't be indexed after
> I completely disagree with that. The only way to prevent a request is to
> use the file extension and I believe that the HTTP method should rely
> solely on the mime type as this is the definitive statement as to the type
> of the file.
Why? The mime type is also based on the file extension, so it
amounts to the same thing. You could actually implement the mime-type
format in SWISH, but that's overkill...
Huh, I don't really understand you: you say that you don't agree
with me, that the only way is to use the file extension... and
that's exactly what I suggest doing. I'm just saying that SWISH
doesn't need to get the mime type from the server to know that
a ".jpg" is not of "text/*" type...
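
To illustrate the point, here is a minimal sketch in Python (not SWISH's own
C code) of guessing a mime type from the extension alone, without contacting
the server. The helper names `guessed_type` and `looks_like_text` are made up
for this example, and treating unknown extensions as text is an assumption
that mirrors the web server default mentioned further down.

```python
import mimetypes


def guessed_type(name):
    """Guess a mime type from the file extension only (no network access)."""
    mime, _encoding = mimetypes.guess_type(name)
    return mime


def looks_like_text(name):
    """Return True if the extension suggests a "text/*" document."""
    mime = guessed_type(name)
    # Assumption: an unknown or missing extension is treated as text.
    return mime is None or mime.startswith("text/")
```

With this, a name ending in ".jpg" is rejected locally, while "index.html"
and an extension-less "README" are kept.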
> Even if it works differently from Swish, I think it makes
> more sense to use positive logic (only index files with certain mime types)
> as opposed to the file system's negative logic (index everything except
> files with certain extensions).
I don't agree at all!
1) There is _absolutely nothing_ that says that
the "3-character extension" notation has to be used under
Unix or any other OS! If people want to name their documents
without an extension, that's their right. Note that this is also
similar to the default of most web servers, which assume a
"text" type when they can't recognize the extension of the
2) Many servers use a database to generate semi-static documents.
Those documents generally don't change (for example, newspaper
articles, catalog descriptions) but are the result of a query
on the database passed directly in the URL.
It would be impossible to list all the "possible" meaningless
extensions, but we definitely want to index those documents...
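
The negative logic argued for above can be sketched as follows, in Python
rather than SWISH's C. The skip list `NO_CONTENTS` and the function
`should_fetch` are hypothetical names for this example and not the actual
"NoContents" implementation. The URL is fetched unless its path ends in a
listed extension, so extension-less and query-generated documents pass.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical skip list, standing in for a "NoContents"-style directive.
NO_CONTENTS = {".gif", ".jpg", ".jpeg", ".png", ".mpg", ".avi", ".zip"}


def should_fetch(url, skip_exts=NO_CONTENTS):
    """Return False only when the URL path ends in a listed extension."""
    path = urlsplit(url).path          # the query string is ignored
    ext = posixpath.splitext(path)[1].lower()
    return ext not in skip_exts
```

Here "http://www.example.com/cgi-bin/article?id=42" and
"http://www.example.com/news/today" are both fetched, while
"http://www.example.com/img/logo.GIF" is skipped without any connection.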
> As for avoiding the fork/exec overhead, I'd love to see the perl helper
> script swishspider rewritten in C and pulled into the actual swish
> executable. The only reason I wrote it in Perl in the first place was to
Oh, that won't solve everything: it still means that for each
picture, video and so on, you will make a connection to the
server to get its mime type. That will take _a lot_ of time,
especially if the server is far away. Much better not to make
the connection at all...
> anyone know of an HTTP library that we could integrate with Swish? It
> should meet the following requirements:
Making the connection to the server, sending the request and
getting the data is not a problem. What is more annoying
is parsing the URLs... (meaning: I don't have a library already
coded for that :)
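
For a sense of what that URL handling involves, here is a small sketch using
Python's standard `urllib.parse` (an illustration only; a C library for
Swish would have to provide the equivalent). The function name
`normalize_link` is made up for this example. It resolves a link found in a
page against that page's URL and drops the fragment, so the same document is
not queued twice.

```python
from urllib.parse import urldefrag, urljoin


def normalize_link(base_url, href):
    """Resolve href relative to base_url and strip any "#fragment" part."""
    absolute = urljoin(base_url, href)
    return urldefrag(absolute).url
```

For a page at "http://www.example.com/docs/index.html", the link
"../img/logo.gif" resolves to "http://www.example.com/img/logo.gif".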
> 2) Works with most Unices and Win32.
I am dubious... I don't have the faintest idea how socket programming
is done under Win32... I dunno if my own libraries would work
or would need to be totally rewritten for Windows...
TheNet - Internet Services AG CohProg SaRL
Anime and Manga Services http://www.animanga.com/
Received on Tue Jan 19 12:03:38 1999