Thomas R. Bruce wrote:
> This does help, but not enough for some applications. A real problem
> with relevance-ranked searches of collections of judicial opinions is
> that it's hard to force title weight high enough to overcome large
> numbers of term-occurrences in the body text -- which is exactly what
> you get with important legal cases, because really important rulings are
> heavily cited. So the cases that repeatedly cite (eg.) Brown v. Board
> of Education inevitably rank higher, all the more maddening because the
> more important the case being sought by the user the more likely it is
> to be swamped by cases citing it. I guess other literatures manage to
> avoid this because citations don't give the title of the cited document
> in full as they do in judicial opinions.
> Anyway, our cheap kludge for dealing with this is [snip]
Here's an alternate cheap kludge that may (or may not) add value.
I transform the XML into pseudo-HTML before passing it to the indexer.
The <title> in HTML seems to be valued more highly in the default
ranking schemes than elements from XML schemas, although <title> is
probably an exception to that. I leave that to those who have parsed /
written the code.
More to the point, because the only time this particular document is
going to be "read" is by the indexer, in the course of the
transformation I can double, triple or 50x (heck, it's only a loop) the
number of times a particular string like the title is presented to the
indexer. So feed the indexer the title fifty times and see if that
doesn't shift its ranking. I do that in the HTML <body> element so that
it doesn't appear extra times in the swish output, and I also control
what goes into the description so that it doesn't pop up multiple times
there as well.
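For anyone who wants to try it, here is a rough sketch of that loop in
Python. It is not my actual transformation, just an illustration: the
XML element names (title, body), the function name to_pseudo_html and
the repeat count are all made up for the example, and how much the
repetition moves the ranking will depend on your indexer settings.

```python
import html
import xml.etree.ElementTree as ET


def to_pseudo_html(xml_text, repeat=50):
    """Build an indexer-only HTML page with the title repeated in <body>.

    Assumes a hypothetical source schema with <title> and <body> children
    under the root element; adjust the lookups for a real schema.
    """
    root = ET.fromstring(xml_text)

    # Pull the title text and escape it for safe use in HTML.
    title = html.escape((root.findtext("title") or "").strip())

    # Flatten the body to plain text with whitespace collapsed.
    body_el = root.find("body")
    if body_el is None:
        body_text = ""
    else:
        body_text = " ".join("".join(body_el.itertext()).split())
    body = html.escape(body_text)

    # It's only a loop: repeat the title inside <body> so the indexer
    # sees it many times, while <title> itself appears exactly once.
    boosted = "\n".join(f"<p>{title}</p>" for _ in range(repeat))

    return (
        "<html>\n<head><title>" + title + "</title></head>\n"
        "<body>\n" + boosted + "\n<p>" + body + "</p>\n</body>\n</html>\n"
    )


if __name__ == "__main__":
    sample = (
        "<case><title>Brown v. Board of Education</title>"
        "<body>Text of the opinion.</body></case>"
    )
    print(to_pseudo_html(sample, repeat=3))
```

Since nobody but the indexer reads the generated page, the repeated
paragraphs cost nothing in presentation; the repeat count is the only
knob to tune.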
I deeply respect those who are going about this the *right* way and look
forward to the results of their work. Until then, this "sort" of gets
the job done.
Received on Fri Feb 4 06:09:40 2005