koszalekopalek scribbled on 6/9/05 5:04 PM:
> This only proves that they do something similar to what I
> proposed. Adobe ranks first because there is a gazillion of
> links out there that say click >>here<< to download Adobe.
> The other thing is how much data you need to make accurate
> guesses. Maybe what works for billions of pages will not
> work for a tiny index...
Maybe only experimentation will show whether the words used in <a> tags to
describe a doc are more relevant than others. That requires a certain faith in
the HTML markup; in 10 billion pages, some consistency emerges just by virtue of
the sample size; in a small collection (< 1 million), consistency is shakier.
That said, here are some random ideas about what you're describing.
1. It can all be done with spider.pl (i.e., swish-e does not require any new
features). You'd need to send the output to a file instead of piping directly to
swish-e's stdin, because (as you hinted at) you'd need two passes: one to
collect the data and do the initial fetch/filter, and a second to place the
'swishlinkedas' words in the target docs in the file. While spider.pl writes
that file, it could keep a hash in memory (depending on collection size) of
url => [ @swishlinkedas_words ] and dump it with Data::Dumper when the crawl
finishes; a post-filter then adds the swishlinkedas words to the file.
$ spider.pl > file # dumps a_words.dmp when finished
$ add_a_words.pl a_words.dmp file # adds a_words.dmp to file
where add_a_words.pl reads the hash from a_words.dmp and adds each doc's words
as a <meta> tag in file.
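
To make the two passes concrete, here is a rough sketch in Python rather than Perl. It is not spider.pl code: it assumes the fetched pages are already available as a dict of url => html, and the names AnchorCollector, collect_link_words and add_meta are made up for illustration. Only the 'swishlinkedas' metaname comes from the discussion above.

```python
import re
from collections import defaultdict
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class AnchorCollector(HTMLParser):
    """Collects (absolute target url, link text) pairs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # list of (target_url, link_text)
        self._href = None    # target of the <a> we are currently inside
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links and drop any #fragment.
                self._href = urldefrag(urljoin(self.base_url, href)).url
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            if text:
                self.links.append((self._href, text))
            self._href = None


def collect_link_words(docs):
    """Pass 1: map each target url to the words other docs use to link to it."""
    linked_as = defaultdict(list)
    for url, html in docs.items():
        parser = AnchorCollector(url)
        parser.feed(html)
        parser.close()
        for target, text in parser.links:
            if target != url:  # ignore self-links
                linked_as[target].extend(text.lower().split())
    return linked_as


def add_meta(html, words):
    """Pass 2: put the collected words into a <meta> tag before </head>."""
    if not words:
        return html
    content = escape(" ".join(words), quote=True)
    meta = '<meta name="swishlinkedas" content="%s">' % content
    match = re.search(r"</head>", html, re.IGNORECASE)
    if match is None:
        return meta + html  # no <head> found: prepend (a simplification)
    return html[:match.start()] + meta + html[match.start():]
```

With spider.pl itself, the collecting would presumably happen while the pages are being fetched, and the second pass would rewrite the saved output file instead of an in-memory dict.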
2. You're talking about saturating a document's metaname space with all the
words used to refer to that document in <a> links in other docs. That *might*
work depending on the collection. The word 'here,' if used in a billion docs to
refer to Adobe, makes some sense. In a small collection, you might have the word
'here' used to refer to lots of other docs, thus making it relatively useless,
since it has a high collection frequency. Example: if every doc were referred to
at least once by 'here', then the word becomes useless (a stopword, in effect).
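
One way to check for this on a small collection is to count how many link targets each word is used for, and drop the words that point at too many of them. This is only a sketch in Python: the function name drop_common_link_words and the 0.5 cutoff are arbitrary assumptions, and the fraction is taken over docs that are linked to at all, not over the whole index.

```python
from collections import Counter


def drop_common_link_words(linked_as, max_fraction=0.5):
    """Remove link words that are used to refer to too many different docs.

    linked_as maps url => list of words used in links pointing at that url.
    A word is dropped when it is used for more than max_fraction of the
    linked-to docs (hypothetical cutoff; tune it by experiment).
    """
    n_targets = len(linked_as)
    doc_freq = Counter()
    for words in linked_as.values():
        doc_freq.update(set(words))  # count each word once per target doc
    too_common = {w for w, df in doc_freq.items() if df / n_targets > max_fraction}
    return {
        url: [w for w in words if w not in too_common]
        for url, words in linked_as.items()
    }
```

For example, if 'here' is used to link to every doc in a three-doc collection, it is dropped everywhere while the rarer words survive:

```python
linked_as = {
    "a.html": ["here", "adobe"],
    "b.html": ["here", "manual"],
    "c.html": ["here", "download", "faq"],
}
filtered = drop_common_link_words(linked_as)
# filtered["a.html"] is ["adobe"]
```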
Let us know how it goes. :)
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Received on Fri Jun 10 04:29:02 2005