Sounds like you're going to miss a lot of results that should have been
returned from hits in the 2nd paragraph onwards. I know this has been
mentioned before, but RankScheme(1) really solved the problem of big
chunks of text always winning for me.
As an aside, at first I was freaked out by titles that contained
keywords being ranked lower than body text hits. But after actually
looking at the results, I discovered the documents were generally at
least as relevant as the title hits.
On Fri, 2005-02-04 at 07:40 -0800, Tac Tacelosky wrote:
> Thanks for the many suggestions from this list. The "hack" I got to work
> for my application was to only index the first "paragraph" (loosely
> defined), which shortened the description (and the relevant words were
> generally near the top). Most descriptions were then the same length, which
> evened out the problem of the big ones always winning.
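A minimal sketch of the first-"paragraph" trimming described above, assuming paragraphs are separated by blank lines (the original message only says "loosely defined"). The function name first_paragraph is made up for illustration and is not part of swish-e. Note the trade-off: anything after the first block is never indexed, so it cannot be found.

```python
import re

def first_paragraph(text):
    """Return the first non-empty block of text.

    Assumption: a "paragraph" is a run of lines delimited by blank lines.
    """
    for block in re.split(r"\n\s*\n", text):
        if block.strip():
            return block.strip()
    return ""

description = "\n\nIntro line.\nStill intro.\n\nSecond paragraph, never indexed."
print(first_paragraph(description))
```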
> I like the title repetition idea, too; I may try that next round (though the
> bias adjustments should do that, and maybe they do but it's still not in the
> merged indexes).
> Thanks again, everyone, for the ideas, it's been an interesting discussion!
> -----Original Message-----
> From: email@example.com [mailto:firstname.lastname@example.org]
> On Behalf Of Bill Moseley
> Sent: Friday, February 04, 2005 10:13 AM
> To: Multiple recipients of list
> Subject: [SWISH-E] Re: Ranking, even with strong bias
> On Fri, Feb 04, 2005 at 03:23:38AM -0800, Thomas R. Bruce wrote:
> > Peter Karman wrote:
> > >indexing as html will artificially inflate the number of occurrences
> > >whenever a word matches in the <title>.
> > >
> > >
> > This does help, but not enough for some applications. A real problem
> > with relevance-ranked searches of collections of judicial opinions is
> > that it's hard to force title weight high enough to overcome large
> > numbers of term-occurrences in the body text.
> Yes, that's the problem with our relatively simple ranking system. I once
> hacked rank.c to stop counting word frequency above some reasonably small
> number, and that kept the huge docs from always winning.
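The frequency cap mentioned above can be illustrated roughly as follows. This is only a sketch of the idea, not the actual change to rank.c; the function name capped_score and the default cap of 5 are assumptions chosen for the example.

```python
from collections import Counter

def capped_score(doc_words, query_words, cap=5):
    """Sum per-term counts, counting each query term at most `cap` times.

    Assumption: a plain occurrence count stands in for the real rank formula.
    """
    counts = Counter(w.lower() for w in doc_words)
    return sum(min(counts[q.lower()], cap) for q in query_words)

short_doc = "foo appears in this short note about foo".split()
long_doc = ("foo " * 200 + "and a great deal of unrelated text").split()

print(capped_score(short_doc, ["foo"]))  # 2
print(capped_score(long_doc, ["foo"]))   # 5, not 200
```

With the cap in place, a document that repeats a term hundreds of times scores no higher on that term than one that uses it a handful of times.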
> Sometimes it's not that helpful to search for a term "foo" and be told that,
> yes, it is in that 100 page document. So another approach is to split your
> docs into smaller chunks and index them separately. And if you can link
> into sections of your docs (like with URI #fragments) then your search
> results are even more targeted. That can help with ranking
> a bit, but doesn't help much if you are searching for a common term.
> Sounds like you need a better ranking system in general -- something that
> tries to figure out what a document is *about*.
> > Anyway, our cheap kludge for dealing with this is to run a title-only
> > search separately and prepend those results to the hit list for
> > full-text search. We tried jiggering the rankings as described in
> > this thread and it helped, but not enough.
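One way to picture the "title-only search first" kludge is below. It assumes the two searches have already been run and each produced an ordered list of document paths; the function name prepend_title_hits and the file names are made up for illustration.

```python
def prepend_title_hits(title_hits, fulltext_hits):
    """Return title hits first, then full-text hits not already listed.

    Each input keeps its own order; duplicates are dropped on second sight.
    """
    seen = set()
    merged = []
    for path in list(title_hits) + list(fulltext_hits):
        if path not in seen:
            seen.add(path)
            merged.append(path)
    return merged

title_hits = ["b.html", "c.html"]
fulltext_hits = ["a.html", "b.html", "d.html"]
print(prepend_title_hits(title_hits, fulltext_hits))
# ['b.html', 'c.html', 'a.html', 'd.html']
```

As written, any title hit outranks every body-only hit, however common the matching word is.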
> Does that mean if you have a word hit in the title then it will always be on
> top of results without a word hit in the title? So a very common word in
> the title would still bring it to the top?
> One thing I would suggest (not really related to above) is to use -T to dump
> your index (of maybe a small set of files) and look over the words swish is
> indexing. You might want to filter your queries of common words for your
> corpus when they are not used in an explicit phrase search.
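A rough sketch of that query filtering, done before the query is handed to the search engine. The word list here is invented; in practice it would come from looking over the -T dump of your own index. The function name filter_query is also an assumption, and double-quoted phrases are passed through untouched.

```python
import re

# Hypothetical list of words that are too common in this corpus to be useful.
COMMON_WORDS = {"court", "the", "of"}

def filter_query(query, common=COMMON_WORDS):
    """Drop common bare words; keep double-quoted phrases as they are."""
    kept = []
    # Each match is either a "quoted phrase" or a single bare word.
    for token in re.findall(r'"[^"]*"|\S+', query):
        if token.startswith('"') or token.lower() not in common:
            kept.append(token)
    return " ".join(kept)

print(filter_query('the court "court of appeals" negligence'))
# "court of appeals" negligence
```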
Received on Fri Feb 4 08:15:47 2005