On Fri, Dec 12, 2003 at 10:58:30AM -0800, John Angel wrote:
> > > > The md5 hashes are stored globally, but the option to check them is
> > > > per server.
> > >
> > > Is it possible to check duplicates globally (for all servers)?
> > I'm not sure what you are asking, John. Didn't I answer that question
> > above?
> > The md5 hashes are stored globally -- that means they are not specific
> > to a server.
> I understand that md5 hashes are stored globally, but is it possible to
> check them globally, not only per server?
> E.g. if I have two duplicate documents from different servers, I don't want
> them being indexed two times.

Yes, that is exactly what I mean by "globally".

There's a global hash that stores the md5 keys for each document. It's
global -- it's not specific to a single server, it's used for all
servers, it's the same hash. It's global. That global hash stores the
md5 hashes for all servers in the same global hash. There's only one
hash and it lives for the life of spider.pl execution. And it's global.

So if you enable md5 checking for a given server its pages will have the
md5 check made and those md5 keys will be stored in that global hash I
mentioned above. But it only stores the md5 hash of the documents when
it's configured to do so. And that configuration is on a per-server
basis. So, if you configure each server section to check md5 values for
each document it will calculate the md5 for each document and check it
against the hash that stores globally (I might add) all the md5 hashes
regardless of what server it is checking. That is, when it checks the
md5 hash it's comparing with all other md5s that have been stored
because the hash is used globally.
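A minimal sketch of that logic, written in Python rather than spider.pl's Perl. All names are made up for illustration (`seen_digests`, `should_index`, and the `use_md5` key are assumptions, not copied from the spider.pl source): one table shared by every server, with the check switched on or off per server.

```python
import hashlib

# One table shared by all servers for the life of the run.
# It stands in for the single global hash described above.
seen_digests = {}

def should_index(server, url, content):
    """Return True if the document should be indexed, False if it is a duplicate."""
    # The check is enabled per server; servers without it skip the table entirely.
    if not server.get("use_md5"):
        return True
    digest = hashlib.md5(content).hexdigest()
    # The lookup is against every digest stored so far, whatever server it came from.
    if digest in seen_digests:
        return False
    seen_digests[digest] = url
    return True

# Hypothetical server sections: two with the md5 check enabled, one without.
server_a = {"base_url": "http://a.example/", "use_md5": True}
server_b = {"base_url": "http://b.example/", "use_md5": True}
server_c = {"base_url": "http://c.example/", "use_md5": False}

body = b"identical page content"
first = should_index(server_a, "http://a.example/doc", body)   # True: first time seen
second = should_index(server_b, "http://b.example/doc", body)  # False: duplicate across servers
third = should_index(server_c, "http://c.example/doc", body)   # True: check not enabled here
```

In this sketch the duplicate from the second server is skipped only because both servers have the check enabled. The third server neither consults nor adds to the shared table.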
It might also be helpful for you to look at the spider.pl source code.
The md5 stuff is fewer lines than what I just wrote. You might notice a
comment on the %visited hash where I questioned if it made any sense to
use a global hash or not.

Hope that clears it up for you.
Received on Fri Dec 12 19:35:58 2003