Thomas R. Bruce wrote on 2/4/05 5:22 AM:
> Peter Karman wrote:
>> indexing as HTML will artificially inflate the number of occurrences
>> whenever a word matches in the <title>.
> This does help, but not enough for some applications. A real problem
> with relevance-ranked searches of collections of judicial opinions is
> that it's hard to force title weight high enough to overcome large
> numbers of term-occurrences in the body text -- which is exactly what
> you get with important legal cases, because really important rulings are
> heavily cited. So the cases that repeatedly cite (e.g.) Brown v. Board
> of Education inevitably rank higher, all the more maddening because the
> more important the case being sought by the user the more likely it is
> to be swamped by cases citing it. I guess other literatures manage to
> avoid this because citations don't give the title of the cited document
> in full as they do in judicial opinions.
> Anyway, our cheap kludge for dealing with this is to run a title-only
> search separately and prepend those results to the hit list for
> full-text search. We tried jiggering the rankings as described in this
> thread and it helped, but not enough.
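The title-first kludge described above could be sketched in C roughly as follows. This is only an illustration: prepend_title_hits() and the integer document IDs are hypothetical, not anything from swish-e, and dropping duplicates from the full-text list is an assumption the original description does not spell out.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: build one hit list with the title-only hits first,
 * followed by the full-text hits that were not already listed.
 * 'out' must have room for ntitle + nfull entries.
 * Returns the number of entries written to 'out'.
 */
size_t prepend_title_hits(const int *title_hits, size_t ntitle,
                          const int *full_hits, size_t nfull,
                          int *out)
{
    size_t n = 0;

    assert(out != NULL);

    /* Title-only hits keep their own order and go first. */
    for (size_t i = 0; i < ntitle; i++)
        out[n++] = title_hits[i];

    /* Append full-text hits, skipping docs already present (assumed dedup). */
    for (size_t i = 0; i < nfull; i++) {
        int seen = 0;
        for (size_t j = 0; j < ntitle; j++) {
            if (full_hits[i] == title_hits[j]) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            out[n++] = full_hits[i];
    }
    return n;
}
```

The linear scan for duplicates is fine for a page of results; a large hit list would want a hash set instead.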
If you're comfortable with C, you might look at rank.c and add another Rank
Scheme to the two already there. You might want to actually reverse the idea in
scheme 1, where index frequency is considered.
For example, if 'Brown vs. Board of Education' appears most often in cases that cite it
and you don't want those cases at the top of your rankings, you might want to
invert the frequency, so that the *fewer* the instances of a word in a doc
relative to the index collection as a whole, the higher it will rank. Right now
it's just the opposite: greater frequency = greater ranking.
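A minimal sketch of what inverting the frequency might look like, assuming you have the word's count in the doc and its count across the whole index. The function name and parameters are hypothetical; this is not the code or the interface in rank.c, just the arithmetic of the idea.

```c
#include <assert.h>

/*
 * Hypothetical weight: the smaller this doc's share of the word's
 * index-wide occurrences, the larger the result.
 *   freq_in_doc   = occurrences of the word in this doc (>= 1)
 *   freq_in_index = occurrences of the word in the whole index
 */
double inverted_freq_weight(unsigned long freq_in_doc,
                            unsigned long freq_in_index)
{
    assert(freq_in_doc >= 1 && freq_in_doc <= freq_in_index);

    /* Plain ratio; a real scheme would probably dampen this somehow. */
    return (double)freq_in_index / (double)freq_in_doc;
}
```

With 100 occurrences in the index, a doc containing the word once gets a weight of 100, while a doc containing it 50 times gets 2. An undamped ratio like this will also favor docs that mention the word only in passing, so it would need to be combined with other factors such as title weight.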
This isn't a foolproof approach, of course, since really you would want something
more like Google's PageRank, which raises the rank for docs referred to
by other docs. If you had a way of consistently marking citations in the body
text, you could probably figure out a way of approaching that kind of feature,
by lowering the rank bias for those tagsets.
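As a rough sketch of the lowered-bias idea, assume each hit can be labeled by where it occurred (title, body, or marked-up citation). The enum, the bias values, and biased_rank() below are all made up for illustration and do not correspond to anything in swish-e.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical contexts a word hit can occur in. */
enum hit_context { HIT_TITLE, HIT_BODY, HIT_CITATION, HIT_CONTEXT_COUNT };

/* Assumed bias per context: citations count for much less than body text. */
static const int context_bias[HIT_CONTEXT_COUNT] = { 20, 4, 1 };

/* Sum the per-context bias over all hits of the search word in one doc. */
long biased_rank(const enum hit_context *hits, size_t nhits)
{
    long rank = 0;

    for (size_t i = 0; i < nhits; i++) {
        assert((unsigned)hits[i] < HIT_CONTEXT_COUNT);
        rank += context_bias[hits[i]];
    }
    return rank;
}
```

With these made-up numbers, a doc with one title hit and one body hit scores 24, while a doc with five hits that all sit inside citations scores 5, so the cited case outranks the cases citing it.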
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Received on Fri Feb 4 03:31:58 2005