Re: An idea for calculating ranking

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Fri Jun 10 2005 - 11:29:00 GMT

koszalekopalek scribbled on 6/9/05 5:04 PM:

> This only proves that they do something similar to what I
> proposed. Adobe ranks first because there is a gazillion of
> links out there that say click >>here<< to download Adobe.
> 
> The other thing is how much data you need to make accurate
> guesses. Maybe what works for billions of pages will not
> work for a tiny index...


Maybe only experimentation will show whether the words used in <a> tags to 
describe a doc are more relevant than others. That requires a certain faith in 
the html markup; in 10 billion pages, some consistency emerges just by virtue of 
the sample size; in a small collection (< 1 million), consistency is shakier.

That said, here are some random ideas about what you're describing.

1. It can all be done with spider.pl (i.e., swish-e does not require any new 
features). You'd need to send the output to a file instead of piping directly to 
swish-e's stdin, because (as you hinted at) you'd need two passes: one to 
collect the data and do the initial fetch/filter, and a second to place the 
'swishlinkedas' words in the target docs in the file. While spider.pl writes 
the file, it could keep a hash in memory (depending on collection size) of 
url => [ @swishlinkedas_words ] and dump it with Data::Dumper when finished; 
a post-filter then adds the swishlinkedas words to the file.

   $ spider.pl > file                   # dumps a_words.dmp when finished
   $ add_a_words.pl a_words.dmp file    # adds words from a_words.dmp to file

where add_a_words.pl reads the hash from a_words.dmp and adds the words for each 
url as a <meta> tag in the matching doc in file.
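
To make the two passes concrete, here is a rough sketch. It is in Python rather 
than Perl and is not the add_a_words.pl described above; it assumes the fetched 
docs are already available as a plain url => html mapping, so it ignores the 
real spider.pl output format entirely. The names AnchorCollector, 
collect_linked_as and add_meta are made up for illustration; only the 
'swishlinkedas' metaname comes from this thread.

```python
from collections import defaultdict
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin
import re


class AnchorCollector(HTMLParser):
    """Collect (href, link text) pairs from one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []    # list of (href, text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._text)))
            self._href = None


def collect_linked_as(docs):
    """Pass 1: map each target url to the words other docs use to link to it.

    `docs` is assumed to be a dict of url -> html text.
    """
    linked_as = defaultdict(list)
    for url, body in docs.items():
        parser = AnchorCollector()
        parser.feed(body)
        parser.close()
        for href, text in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target != url:            # skip links a doc makes to itself
                linked_as[target].extend(text.lower().split())
    return linked_as


def add_meta(body, words):
    """Pass 2: insert the words as a <meta> tag right after <head>.

    Docs without a <head> tag are returned unchanged in this sketch.
    """
    if not words:
        return body
    tag = '<meta name="swishlinkedas" content="%s">' % escape(" ".join(words))
    return re.sub(r"(<head(\s[^>]*)?>)", lambda m: m.group(1) + tag,
                  body, count=1, flags=re.I)
```

In a real run the first pass would happen while spidering and the second pass 
would rewrite the saved file, adjusting whatever length headers that file 
carries; the sketch skips both of those details.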

2. You're talking about saturating a document's metaname space with all the 
words used to refer to that document in <a> links in other docs. That *might* 
work depending on the collection. The word 'here,' if used in a billion docs to 
refer to Adobe, makes some sense. In a small collection, you might have the word 
'here' used to refer to lots of other docs, thus making it relatively useless, 
since it has a high collection frequency. Example: if every doc were referred to 
at least once by 'here', then the word would become useless (a stopword, in 
effect).
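
One way to check for that effect before indexing is to count, for each link 
word, how many docs it is used to refer to, and drop the words that point at 
too large a share of the collection. A rough Python sketch follows; the 
function name prune_common_words and the 0.5 cutoff are arbitrary choices for 
illustration, not anything swish-e does, and the input is assumed to be a 
url => [words] mapping like the hash described in item 1.

```python
from collections import Counter


def prune_common_words(linked_as, max_doc_fraction=0.5):
    """Drop link words that refer to more than max_doc_fraction of the docs.

    `linked_as` is assumed to map each url to the list of words used in
    <a> links pointing at it. The 0.5 default is an arbitrary cutoff.
    """
    n_docs = len(linked_as)
    if n_docs == 0:
        return {}

    # For each word, count the number of distinct docs it refers to.
    doc_freq = Counter()
    for words in linked_as.values():
        doc_freq.update(set(words))

    too_common = {w for w, df in doc_freq.items()
                  if df / n_docs > max_doc_fraction}

    return {url: [w for w in words if w not in too_common]
            for url, words in linked_as.items()}
```

With three docs that are all linked as 'here', the word 'here' is dropped from 
every list while words unique to one doc survive. The right cutoff would have 
to come out of experimentation on the actual collection.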

Let us know how it goes. :)


-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Fri Jun 10 04:29:02 2005