On Fri, Feb 04, 2005 at 03:23:38AM -0800, Thomas R. Bruce wrote:
> Peter Karman wrote:
> >indexing as html will artificially inflate the number of occurrences whenever a
> >word matches in the <title>.
> This does help, but not enough for some applications. A real
> problem with relevance-ranked searches of collections of judicial
> opinions is that it's hard to force title weight high enough to
> overcome large numbers of term-occurrences in the body text
Yes, that's the problem with our relatively simple ranking system.
Once I hacked rank.c to just not count word frequency over some
reasonably small number and that kept the huge docs from always
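
To make the frequency-cap idea concrete, here is a rough sketch in Python. It is not the actual rank.c change; the function name, the cap of 5 and the title bonus of 10 are all made-up values for illustration.

```python
def capped_score(query_terms, body_counts, title_words, cap=5, title_bonus=10):
    """Score one document, ignoring body occurrences beyond `cap`.

    query_terms: iterable of search words
    body_counts: dict mapping word -> occurrences in the body text
    title_words: set of words found in the title
    cap, title_bonus: hypothetical tuning values, not swish-e defaults
    """
    score = 0
    for term in query_terms:
        # A word seen 500 times counts no more than a word seen `cap` times.
        score += min(body_counts.get(term, 0), cap)
        if term in title_words:
            score += title_bonus
    return score


# A huge opinion that repeats the term versus a short one with it in the title.
huge = capped_score(["estoppel"], {"estoppel": 500}, set())
short = capped_score(["estoppel"], {"estoppel": 2}, {"estoppel"})
```

With the cap in place the short document with the title hit outscores the huge one, which is the effect described above.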
Sometimes it's not that helpful to search for a term "foo" and be told
that, yes, it is in that 100 page document. So another approach is to
split your docs into smaller chunks and index them separately. And
if you can link into sections of your docs (like with URI #fragments)
then your search results are even more targeted. That can help with
ranking a bit, but doesn't help much if you are searching for a common
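
A rough sketch of the chunking idea, in Python. It assumes you already have the document as a list of paragraphs and that you can arrange for matching anchors to exist in the published page; the "#chunk-N" naming and the 200-word limit are invented for the example.

```python
def split_into_chunks(url, paragraphs, max_words=200):
    """Group paragraphs into chunks of at most roughly `max_words` words.

    Returns a list of (chunk_url, chunk_text) pairs, where chunk_url is
    the document URL plus a hypothetical "#chunk-N" fragment.
    A single paragraph longer than max_words becomes its own chunk.
    """
    chunks = []
    current, count = [], 0
    for para in paragraphs:
        words = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the limit.
        if current and count + words > max_words:
            chunks.append(current)
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append(current)
    return [
        (f"{url}#chunk-{i}", "\n\n".join(chunk))
        for i, chunk in enumerate(chunks, 1)
    ]


parts = split_into_chunks("doc.html", ["a b c", "d e", "f g h i"], max_words=5)
```

Each pair can then be fed to the indexer as if it were its own document, so a hit points at a section rather than at the whole 100 pages.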
Sounds like you need a better ranking system in general -- something
that tries to figure out what a document is *about*.
> Anyway, our cheap kludge for dealing with this is to run a
> title-only search separately and prepend those results to the hit
> list for full-text search. We tried jiggering the rankings as
> described in this thread and it helped, but not enough.
Does that mean if you have a word hit in the title then it will always
be on top of results without a word hit in the title? So a very
common word in the title would still bring it to the top?
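
If I understand the kludge, it amounts to something like the following Python sketch. The function name is hypothetical, and the two lists stand in for the results of the two separate searches, each already in rank order.

```python
def merge_results(title_hits, fulltext_hits):
    """Put title-only hits first, then full-text hits not already listed."""
    seen = set()
    merged = []
    for path in list(title_hits) + list(fulltext_hits):
        if path not in seen:
            seen.add(path)
            merged.append(path)
    return merged


merged = merge_results(["b.html", "c.html"], ["a.html", "b.html", "d.html"])
```

Here a.html is the best full-text hit but still lands third, behind both title hits, which is why a very common title word would float to the top.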
One thing I would suggest (not really related to above) is to use -T
to dump your index (of maybe a small set of files) and look over the
words swish is indexing. You might want to filter your queries of
common words for your corpus when they are not used in an explicit
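
A minimal sketch of that kind of query filtering, in Python. The common-word list is something you would build by hand after looking over the dumped index; the function name and sample words are invented, and the sketch makes no attempt to recognize query operators or other explicit query syntax.

```python
def filter_query(query, common_words):
    """Drop corpus-specific common words from a plain-word query.

    If every word would be dropped, return the query unchanged so the
    user still gets some result.
    """
    words = query.split()
    kept = [w for w in words if w.lower() not in common_words]
    return " ".join(kept) if kept else query


common = {"court", "opinion", "the"}  # hypothetical list for a case-law corpus
cleaned = filter_query("the court estoppel", common)
unchanged = filter_query("The Court", common)
```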
Received on Fri Feb 4 07:13:58 2005