Re: Ranking, even with strong bias

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Fri Feb 04 2005 - 11:31:57 GMT
Thomas R. Bruce wrote on 2/4/05 5:22 AM:

> Peter Karman wrote:
> 
>> indexing as html will artificially inflate the number of occurrences 
>> whenever a word matches in the <title>.
>>  
>>
> This does help, but not enough for some applications.  A real problem 
> with relevance-ranked searches of collections of judicial opinions is 
> that it's hard to force title weight high enough to overcome large 
> numbers of term-occurrences in the body text -- which is exactly what 
> you get with important legal cases, because really important rulings are 
> heavily cited.  So the cases that repeatedly cite (e.g.) Brown v. Board 
> of Education inevitably rank higher, all the more maddening because the 
> more important the case being sought by the user the more likely it is 
> to be swamped by cases citing it.   I guess other literatures manage to 
> avoid this because citations don't give the title of the cited document 
> in full as they do in judicial opinions.
> 
> Anyway, our cheap kludge for dealing with this is to run a title-only 
> search separately and prepend those results to the hit list for 
> full-text search.  We tried jiggering the rankings as described in this 
> thread and it helped, but not enough.
> 
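
For reference, the prepend step described above might look roughly like the
sketch below. It is only an illustration: the function name and the use of
plain integer doc ids are made up, and dropping duplicates from the full-text
list is an assumption of mine, not something stated in the description.

```c
#include <stddef.h>

/* Hypothetical helper: merge two hit lists of document ids.
 * Title-only hits come first, in their own order; full-text hits follow,
 * skipping any id already emitted from the title list (assumed dedupe).
 * `out` must have room for n_title + n_full ids.
 * Returns the number of ids written to `out`. */
static size_t prepend_title_hits(const int *title, size_t n_title,
                                 const int *full, size_t n_full,
                                 int *out)
{
    size_t n = 0;

    /* Title-only results go to the top of the list. */
    for (size_t i = 0; i < n_title; i++)
        out[n++] = title[i];

    /* Append full-text results that were not already listed. */
    for (size_t i = 0; i < n_full; i++) {
        int seen = 0;
        for (size_t j = 0; j < n_title; j++) {
            if (full[i] == title[j]) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            out[n++] = full[i];
    }
    return n;
}
```

The nested loop is quadratic, which should be fine for the short title-only
lists this approach implies.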

If you're comfortable with C, you might look at rank.c and add another Rank 
Scheme to the two already there. You might want to actually reverse the idea in 
scheme 1, where index frequency is considered.

For example, if 'Brown vs. Board of Education' appears most often in cases that 
cite it and you don't want those cases at the top of your rankings, you might 
invert the frequency, so that the *fewer* the instances of a word in a doc 
relative to the index collection as a whole, the higher the doc will rank. Right 
now it's just the opposite: greater frequency = greater ranking.
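
A rough, standalone sketch of what such an inverted scheme could compute is
below. None of these names come from rank.c; the struct, the function, the
TITLE_BONUS constant and the 1/(1+ratio) damping are all assumptions made up
for illustration, so treat it as a starting point rather than a patch.

```c
/* Hypothetical per-hit data; NOT the structures used in swish-e's rank.c. */
typedef struct {
    int title_hits;  /* occurrences of the term in the title */
    int body_hits;   /* occurrences of the term in the body text */
    int doc_words;   /* total number of words in this document */
} HitInfo;

/* Assumed fixed bonus for any title match. */
#define TITLE_BONUS 2.0

/* Inverted frequency: the denser the term is in this doc relative to the
 * collection as a whole, the SMALLER the body contribution becomes.
 * coll_hits  = occurrences of the term across the whole index
 * coll_words = total words across the whole index */
static double inverted_rank(const HitInfo *h, long coll_hits, long coll_words)
{
    if (h->doc_words <= 0 || coll_hits <= 0 || coll_words <= 0)
        return 0.0;

    double doc_density  = (double)h->body_hits / (double)h->doc_words;
    double coll_density = (double)coll_hits / (double)coll_words;
    double ratio        = doc_density / coll_density;

    /* ratio 0 gives 1.0; large ratios approach 0.0 */
    double body_score  = 1.0 / (1.0 + ratio);
    double title_score = (h->title_hits > 0) ? TITLE_BONUS : 0.0;

    return title_score + body_score;
}
```

With this shape, a case that names the term in its title and mentions it a
couple of times outranks a case that cites it thirty times in the body.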

This isn't a foolproof approach of course, since really you would want something 
more like Google's PageRank, which raises the rank of docs that other docs refer 
to. If you had a way of consistently marking citations in the body text, you 
could probably figure out a way of approaching that kind of feature, by lowering 
the rank bias for those tagsets.
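
To make the idea concrete, here is a minimal sketch of per-field weighting,
assuming citations have been marked up so that their hits can be counted
separately. The field names ("title", "cite", "body"), the weights and the
function names are all hypothetical and are not taken from the swish-e source.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record: how many times the term occurred in one field. */
typedef struct {
    const char *field;  /* e.g. "title", "cite", "body" (made-up names) */
    int hits;
} FieldHits;

/* Assumed weights: title matches count heavily, matches inside marked
 * citations count for very little, everything else counts normally. */
static double field_weight(const char *field)
{
    if (strcmp(field, "title") == 0)
        return 10.0;
    if (strcmp(field, "cite") == 0)
        return 0.1;
    return 1.0;
}

/* Sum the weighted hits over all fields of one document. */
static double weighted_rank(const FieldHits *fh, size_t n)
{
    double score = 0.0;
    for (size_t i = 0; i < n; i++)
        score += field_weight(fh[i].field) * (double)fh[i].hits;
    return score;
}
```

Under these assumed weights, a doc with one title hit and three body hits
scores higher than a doc with forty hits inside citations and five in the body.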

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Fri Feb 4 03:31:58 2005