On Tue, Mar 09, 2004 at 12:38:17AM -0800, email@example.com wrote:
> >You could try tweaking those, but the other problem is that swish
> >considers to some degree the number of hits in a file, so a large file
> >may out-rank a smaller file with the word in the title.
> Doesn't swish-e convert frequencies into percentages? Would that be a bad idea?
You should look at rank.c. That and the query processing are
long-standing problems that need attention. Ranking is very basic.
There's a mode to consider the length of the document in the rank
calculations but when I tested the feature it didn't seem to make much
difference in the ranking -- and in some cases made it worse.
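
To make the "frequencies as percentages" idea concrete, here is a minimal
sketch of length-normalized counting. It is only an illustration, not what
rank.c does; the function name and the sample numbers are made up.

```python
def relative_frequency(hits, doc_word_count):
    """Return the hit count as a fraction of the document's total words.

    Hypothetical helper for illustration; not part of swish-e.
    """
    if doc_word_count <= 0:
        return 0.0
    return hits / doc_word_count


# Made-up example: a large file with many hits versus a small file with few.
large = relative_frequency(200, 50_000)
small = relative_frequency(10, 500)
```

With raw counts the large file wins (200 hits against 10); with relative
frequency the small file scores higher, because a larger share of its words
match.
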
It's subjective, of course. What I did was index a few small (< 10,000
pages) sites and then compare search results with Google. I spent a day
playing with small tweaks to rank.c and it was clear that very large
files throw off the rank. One true hack was to limit the number of
word hits per document, and that one thing alone made the results match
Google's ranking more closely. I just limited the frequency count to 100.
How's that for an ugly hack?
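
The frequency cap can be sketched in a few lines. This is a stand-alone
illustration of the idea, not the actual change made to rank.c; the names
below are hypothetical, and only the limit of 100 comes from the message.

```python
MAX_HITS = 100  # the cap mentioned above


def capped_frequency(hits, cap=MAX_HITS):
    """Clamp a per-document word hit count so huge files cannot dominate.

    Hypothetical helper for illustration; not part of swish-e.
    """
    return min(hits, cap)
```

A file with 5,000 hits and a file with 100 hits then contribute the same
frequency to the rank, while files below the cap are unaffected.
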
I had also tried limiting the counts to the first X word positions, but
that had less of an effect than I was expecting.
If you are looking for a document about something you might think that
it would be discussed early on in the document.
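
Counting only early hits could look roughly like the sketch below. It
assumes the word positions for a document are available as a list of
integers; the function name is hypothetical, and the cutoff is left as a
parameter because the message does not give a value for X.

```python
def hits_in_first_positions(positions, max_position):
    """Count only the hits that occur at or before max_position.

    positions: word positions (1-based) where the search term occurs.
    Hypothetical helper for illustration; not part of swish-e.
    """
    return sum(1 for p in positions if p <= max_position)
```

For example, with hits at positions 3, 40, 900 and 12000 and a cutoff of
1000, only the first three hits would count toward the rank.
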
Swish-e has been used for indexing reasonably small sets of documents,
so effective searching is often as helpful as the ranking. Still, I
hope someone who knows something about ranking and has some time comes
along and can update swish-e's code.
Received on Tue Mar 9 05:47:30 2004