When filtering, spider.pl extracts the file name from the URI:
    my $doc = $filter->convert(
        document     => $content_ref,
        name         => $response->base,
        content_type => $content_type,
    );
This works fine when the file is served from the file system, but not
when it is served out of a database, where the filename is not present
in the URI but in the Content-Disposition header instead. Here's an
example of the header output from swish-e:
----HEADERS for http://oakhill.tothept.com/viewdoc.taf?_uid1=71 ---
Connection: close
Date: Sun, 28 Nov 2004 12:09:49 GMT
Accept-Ranges: bytes
Server: Apache/2.0.48 (Unix)
Content-Length: 123392
Content-Type: application/msword
Last-Modified: 2004-10-27 13:11:13
Client-Date: Sun, 28 Nov 2004 12:09:49 GMT
Client-Peer: 66.201.42.33:80
Client-Response-Num: 1
Content-Disposition: inline; filename=test.doc
-----END HEADERS----
In this case the document name will be stored in the index as
'viewdoc.taf?_uid1=71' instead of 'test.doc'. But if the same URL is
viewed in a browser, the file will be downloaded and named test.doc.
Does it make more sense to modify spider.pl to test for the existence
of a filename in the Content-Disposition header, or to do this as part
of filtering?
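
To make the first option concrete, here is a rough sketch of what the
test might look like. The helper names filename_from_disposition and
document_name are made up for illustration and are not part of spider.pl
or SWISH::Filter. The parsing is a simple regex that handles only the
plain filename=... and filename="..." forms, so treat it as a starting
point rather than a complete header parser.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of spider.pl): return the filename
# parameter from a Content-Disposition header value, or undef if the
# header is missing or carries no usable filename.
sub filename_from_disposition {
    my ($disposition) = @_;
    return unless defined $disposition;

    # Accept filename="quoted value" or filename=bare-token.
    if ( $disposition =~ /\bfilename\s*=\s*(?:"([^"]*)"|([^;\s]+))/i ) {
        my $name = defined $1 ? $1 : $2;

        # Keep only the last path component in case a path was sent.
        $name =~ s{^.*[/\\]}{};
        return $name if length($name);
    }
    return;
}

# Hypothetical helper: prefer the header's filename and fall back to
# the URI-derived name that is used today.
sub document_name {
    my ( $disposition, $uri ) = @_;
    my $name = filename_from_disposition($disposition);
    return defined $name ? $name : $uri;
}

# Using the values from the header dump above; prints "test.doc".
print document_name(
    'inline; filename=test.doc',
    'http://oakhill.tothept.com/viewdoc.taf?_uid1=71',
), "\n";
```

In spider.pl the first argument would presumably come from
$response->header('Content-Disposition') and the second from
$response->base, with the result passed as the name argument to
$filter->convert. Whether that belongs in the spider or in the filter
is still the open question.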
Received on Mon Nov 29 17:25:43 2004