Re: Identical Documents

From: Sebastian Jayaraj <jayaraj(at)not-real.kosan.com>
Date: Thu Sep 16 2004 - 23:37:10 GMT
Bill,

I tried the file:/// ... URL option earlier, but it threw the 
following errors:
-------------
Use of uninitialized value in -e at /usr/lib/perl5/vendor_perl/5.8.0/LWP/Protocol/file.pm line 58.
Use of uninitialized value in concatenation (.) or string at /usr/lib/perl5/vendor_perl/5.8.0/LWP/Protocol/file.pm line 59.
 >> -Failed 1 Cnt: 4 file:///tmp/file01/search/PDF%2F 404 File `' does not exist Unknown content type ??? parent:file:///tmp/file01/search/
-------------

Well, listing which files need to be skipped is a difficult task. The 
users (a bunch of scientists ;-) like to give different filenames to their 
publications/documents. And since it's a collaborative environment, there 
are bound to be copies of the same document floating around. With a few 
thousand documents in the collection, listing them by hand gets tedious.

I will try hacking the DirTree.pl file. Thanks for the advice.
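
For illustration, the MD5 check discussed in this thread could look 
roughly like the sketch below. It is written in Python rather than Perl, 
it is not code from DirTree.pl or spider.pl, and the function names 
(md5_of_file, unique_files) are made up for the example. It only shows 
the filtering step: walk a directory tree, checksum each file, and skip 
any file whose content has already been seen under another name.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unique_files(root):
    """Yield one path for each distinct file content found under root.

    Hypothetical helper: the first path seen (in sorted order) for a
    given checksum is kept, later copies are skipped.
    """
    seen = set()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            checksum = md5_of_file(path)
            if checksum in seen:
                continue  # same content already yielded under another name
            seen.add(checksum)
            yield path
```

Only the paths that unique_files() yields would then be handed to the 
indexer. Which copy of a duplicated document gets kept depends purely on 
sort order here, which may or may not be what the users want.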

cheers
Sebastian



Bill Moseley wrote:

>[I'm sending this back to the list]
>
>
>On Thu, Sep 16, 2004 at 03:37:54PM -0700, Sebastian Jayaraj wrote:
>
>>Hi Bill,
>>
>>Thanks for the quick response. It looks like the spider.pl program does 
>>the MD5 filtering only for HTTP-like URLs. In my case, the documents 
>>reside on a Windows server that is Samba-mounted on a Unix machine, 
>>where I run the swish-e program with the -S fs option to index them.
>>
>>Is there a way to run spider.pl on the local file system? The other 
>>(roundabout) option I was thinking of was to expose my source dirs on a 
>>web server and then run the spider to index them.
>>
>
>Well, if you were really lucky you might be able to "spider" locally
>with file:///path/to/whatever/index.html -- but I've never tried that.
>
>If you don't mind a tiny bit of Perl programming you could use the
>DirTree.pl program and do your own MD5 checking in that program.
>
>Can you just list what files need to be skipped with FileRules?
>
Received on Thu Sep 16 16:37:26 2004