I had tried the file:/// ... URL option earlier, but it threw the
following errors:
Use of uninitialized value in -e at
/usr/lib/perl5/vendor_perl/5.8.0/LWP/Protocol/file.pm line 58.
Use of uninitialized value in concatenation (.) or string at
/usr/lib/perl5/vendor_perl/5.8.0/LWP/Protocol/file.pm line 59.
>> -Failed 1 Cnt: 4 file:///tmp/file01/search/PDF%2F 404 File `' does
not exist Unknown content type ??? parent:file:///tmp/file01/search/
Well, listing which files need to be skipped is a difficult task. The
users (a bunch of scientists ;-) like to give different filenames to their
publications/documents, and since it's a collaborative environment there
are bound to be copies of the same document floating around. With a few
thousand documents in the collection, listing them by hand gets tedious.
I will try hacking the DirTree.pl file. Thanks for the advice.
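For what it's worth, here is a rough sketch of the kind of MD5 check Bill suggests, using only core Perl modules (File::Find and Digest::MD5). This is not the actual DirTree.pl code — the callback name and the indexing hook are illustrative; the real script would need the digest check wired into its own file-handling routine.

```perl
#!/usr/bin/perl
# Sketch: skip files whose content has already been seen, keyed by MD5.
use strict;
use warnings;
use File::Find;
use Digest::MD5;

my %seen;    # hex digest => first path seen with that content

sub wanted {
    return unless -f $_;
    open my $fh, '<', $_ or do { warn "can't open $_: $!"; return };
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;

    if ( exists $seen{$digest} ) {
        # Duplicate content: skip instead of indexing it again.
        print STDERR "skipping duplicate: $File::Find::name\n";
        return;
    }
    $seen{$digest} = $File::Find::name;
    # ... hand the file off to swish-e / the indexing code here ...
}

find( \&wanted, '/tmp/file01/search' );    # illustrative root dir
```

Since the duplicates are byte-identical copies this should catch them; files that differ even slightly (e.g. re-saved PDFs) would still get distinct digests.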
Bill Moseley wrote:
>[I'm sending this back to the list]
>On Thu, Sep 16, 2004 at 03:37:54PM -0700, Sebastian Jayaraj wrote:
>>Thanks for the quick response. It looks like the spider.pl program does
>>the MD5 filtering only for HTTP-like URLs. In my case, the documents
>>reside on a Windows server, Samba-mounted on a Unix machine, on which I
>>run the swish-e program using the -S fs option to index them.
>>Is there a way to run spider.pl on the local file system? The other
>>(roundabout) option I was considering was to expose my source dirs on a
>>webserver and then run the spider to index them.
>Well, if you were really lucky you might be able to "spider" locally
>with file:///path/to/whatever/index.html -- but I've never tried that.
>If you don't mind a tiny bit of Perl programming you could use the
>DirTree.pl program and do your own MD5 checking in that program.
>Can you just list what files need to be skipped with FileRules?
Received on Thu Sep 16 16:37:26 2004