

Re: Identical Documents

From: Sebastian Jayaraj <jayaraj(at)>
Date: Thu Sep 16 2004 - 23:37:10 GMT

I had tried the file:/// ... URL option earlier, but it threw the following errors:

    Use of uninitialized value in -e at /usr/lib/perl5/vendor_perl/5.8.0/LWP/Protocol/ line 58.
    Use of uninitialized value in concatenation (.) or string at /usr/lib/perl5/vendor_perl/5.8.0/LWP/Protocol/ line 59.
    >> -Failed 1 Cnt: 4 file:///tmp/file01/search/PDF%2F 404 File `' does not exist Unknown content type ??? parent:file:///tmp/file01/search/

Well, listing the files that need to be skipped is a difficult task. The 
users (a bunch of scientists ;-) like to give different filenames to their 
publications and documents, and since it's a collaborative environment 
there are bound to be copies of the same document floating around. With a 
few thousand files in the collection, maintaining such a list gets tedious.
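Just to illustrate what that would involve: a FileRules skip list in the 
swish-e config would look roughly like the lines below. The patterns here 
are made up for illustration, and that is exactly the problem -- with 
unpredictable filenames there is no small, stable set of patterns to write 
down.

    # hypothetical patterns only -- the real filenames vary too much to enumerate
    FileRules filename contains copy_of draft .bak
    FileRules pathname contains /old/ /backup/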

I will try hacking the file. Thanks for the advice.
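To make the plan concrete, here is the sort of thing I have in mind -- only 
a rough sketch, not the stock swish-e code. It assumes the documents sit 
under a single local directory (the /tmp/file01/search path is copied from 
the error output above purely as a placeholder), walks the tree with 
File::Find, hashes every regular file with Digest::MD5, and prints only the 
first path seen for each digest. The resulting list of unique files could 
then be handed to swish-e, or the same check folded into a -S prog program.

    #!/usr/bin/perl -w
    # Rough sketch: print each file path only once per unique MD5 digest.
    use strict;
    use File::Find;
    use Digest::MD5;

    my $top = shift || '/tmp/file01/search';   # placeholder top directory
    my %seen;                                  # digest => first path with that content

    find( sub {
        return unless -f $_;

        open my $fh, '<', $_ or do { warn "$File::Find::name: $!\n"; return };
        binmode $fh;
        my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;

        if ( exists $seen{$digest} ) {
            warn "skipping duplicate: $File::Find::name (same as $seen{$digest})\n";
            return;
        }
        $seen{$digest} = $File::Find::name;
        print "$File::Find::name\n";
    }, $top );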


Bill Moseley wrote:

>[I'm sending this back to the list]
>On Thu, Sep 16, 2004 at 03:37:54PM -0700, Sebastian Jayaraj wrote:
>>Hi Bill,
>>Thanks for the quick response. It looks like the program does the MD5 
>>filtering only for HTTP-like URLs. In my case, the documents reside on a 
>>Windows server, Samba-mounted on a Unix machine, on which I run swish-e 
>>with the -S fs option to index them.
>>Is there a way to run that check on the local file system? The other 
>>(roundabout) option I was thinking of was to expose my source dirs on a 
>>webserver and then run the spider to index them.
>Well, if you were really lucky you might be able to "spider" locally
>with file:///path/to/whatever/index.html -- but I've never tried that.
>If you don't mind a tiny bit of Perl programming you could use the
> program and do your own MD5 checking in that program.
>Can you just list what files need to be skipped with FileRules?