Skip to main content.
home | support | download

Back to List Archive

Re: [swish-e] parallelism and Swish-e

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Fri Mar 27 2009 - 02:51:35 GMT
Andrew Smith wrote on 3/26/09 12:43 PM:
> Hi,
> 
> Thanks a lot for doing this. I just checked r2285 out from svn, built it,
> and tested it on some of my indexes (I assumed I could just use previously
> created indexes from the 2.4.5 version and wouldn't need to rebuild them,
> but let me know if otherwise). 

that is correct. 2.4.5, 2.4.6 and 2.4.7 should all be index-compatible.

> And it returns the raw scores now (they start
> out in the 10000s). However, the rank order of the results matches the
> results for IDF (i.e. -R 1) but not the default ranking scheme. This is fine
> for me (I'm using IDF, -R 1) but it might be nice to give the option of
> getting raw scores for the default ranking scheme too. Would it be hard to
> modify the code to give the raw scores for the default ranking scheme?
> Alternatively, maybe you should instead of calling these new ranking
> schemes, just add a new flag, say '--raw' which if passed gives the raw
> scores for whatever ranking scheme is in effect.

I went for the easiest change, since adding a new common line option requires a
little more invasive work.

In the five years since I wrote the RankScheme feature I've not heard anyone say
they prefer the default (0) over the IDF (1). I'd be tempted to make 1 the
default except that it requires re-indexing if you don't have
IgnoreTotalWordCountWhenRanking set to 0 so it's a back-compat issue.

That said, adding another RankScheme for raw ranks for the default is not hard.
Just a copy/paste in rank.c and a check in docprop.c for the new RankScheme
number. Here's the changeset I did yesterday:

http://dev.swish-e.org/changeset/2285

> 
> Finally, you would still need to merge all independent partial result sets
> (which presumably would have been created by separate parallel processes and
> each have raw scores), then normalize the scores in the merged set, and
> finally sort them all for the final ranked result set. I could just write
> code to do this myself, but did you make changes to support this as well?

No.

> Again, clearly there is some subroutine or code in the Swish-e source that
> does this merging, normalizing, and sorting and could a hook be provided to
> this? For example, add a new flag to Swish-e, say '--raw-files', which takes
> a list of files each of which contains an independently generated partial
> result set with raw scores --- Swish-e then concatenates all these files and
> passes the result into the normalize/sort subroutine and then returns the
> final result.

The chief problem is that when you are comparing IDF/TF scores between indexes,
your numbers are going to be off because the term and document frequencies are
not the same in each index, esp if the indexes are radically different sizes.

IDF/TF is a good start, but compared to the ranking algorithms in most high
scale systems these days, IDF/TF is very naive. And for purists, broken in the
current Swish-e implementation when dealing with multiple indexes (for the
reason I state above).

This is actually one of the main reasons I started Swish3, because I wanted to
play with alternate ranking schemes and I saw that the 2.x architecture wasn't
really suited to it. That, and UTF-8.

> 
> Anyway, If I need to do this merging/normalizing/sorting procedure myself, I
> assume that to normalize I would just take the highest raw score in the
> merged set (call it H) and then transform each raw score S to a normalized
> score N by: N = (S/H)*1000. Is this correct? 

looks ok to me. Look at how it's done (differently) in docprop.c for comparison.

> Although I noticed for IDF the
> top score is not always 1000 --- how would I normalize in this case?
> 

see how docprop.c does it.

> Again, thanks so much for this. This modification could be really valuable
> and make Swish-e scale to much larger corpus sizes and give better
> performance over multiprocessors. And if you combined this with some kind of
> interprocess communication (e.g. PVM or just communicate over TCP sockets)
> you could harness a cluster of workstations in parallel for large-scale
> Swish-e searching. Maybe we can have a David and Goliath moment and Swish-e
> can slay the Google beast! :)

more power to you. :)

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Received on Thu Mar 26 22:51:40 2009