Re: [swish-e] parallelism and Swish-e

From: Andrew Smith <andrewksmith(at)not-real.gmail.com>
Date: Thu Mar 26 2009 - 17:43:51 GMT
Hi,

Thanks a lot for doing this. I just checked r2285 out from svn, built it,
and tested it on some of my indexes (I assumed I could just use previously
created indexes from the 2.4.5 version and wouldn't need to rebuild them,
but let me know if otherwise). And it returns the raw scores now (they start
out in the 10000s). However, the rank order of the results matches the
results for IDF (i.e. -R 1) but not the default ranking scheme. This is fine
for me (I'm using IDF, -R 1) but it might be nice to give the option of
getting raw scores for the default ranking scheme too. Would it be hard to
modify the code to give the raw scores for the default ranking scheme?
Alternatively, instead of exposing these as new ranking schemes, maybe you
could just add a new flag, say '--raw', which if passed gives the raw
scores for whatever ranking scheme is in effect.

Finally, you would still need to merge all independent partial result sets
(which presumably would have been created by separate parallel processes and
each have raw scores), then normalize the scores in the merged set, and
finally sort them all for the final ranked result set. I could just write
code to do this myself, but did you make changes to support this as well?
Again, there is presumably some subroutine in the Swish-e source that already
does this merging, normalizing, and sorting. Could a hook be provided to it?
For example, add a new flag to Swish-e, say '--raw-files', which takes
a list of files each of which contains an independently generated partial
result set with raw scores --- Swish-e then concatenates all these files and
passes the result into the normalize/sort subroutine and then returns the
final result.

Of course there might be some difficulties with this around supporting the
-b and -m flags (but maybe that just needs to be left for later). Although
there are some easy things you could do such as only getting 20 results in
each partial result set if you only want the top 20. And for paging you
could just keep state of how far you got in each partial result set in
generating the current page, and then just fetch e.g. the next 20 from these
saved positions for the next page.
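
The paging idea above can be sketched roughly as follows. This is only an
illustration, not anything that exists in Swish-e: the function name
next_page, the (raw score, path) tuple shape, and the positions list are all
hypothetical, and each partial result set is assumed to be already sorted by
descending raw score.

```python
import heapq
from typing import List, Tuple

Hit = Tuple[float, str]  # (raw score, document path); hypothetical shape


def next_page(
    partials: List[List[Hit]],
    positions: List[int],
    page_size: int = 20,
) -> Tuple[List[Hit], List[int]]:
    """Return the next page of hits plus the updated per-set positions.

    partials[i] must be sorted by descending raw score.
    positions[i] is how many hits of partials[i] earlier pages consumed.
    """
    new_positions = list(positions)
    heap = []  # entries are (-score, set index, path) so the best score pops first

    # Seed the heap with the next unconsumed hit of every partial set.
    for i, part in enumerate(partials):
        if new_positions[i] < len(part):
            score, path = part[new_positions[i]]
            heapq.heappush(heap, (-score, i, path))

    page: List[Hit] = []
    while heap and len(page) < page_size:
        neg_score, i, path = heapq.heappop(heap)
        page.append((-neg_score, path))
        new_positions[i] += 1
        # Refill from the same partial set, if it has more hits.
        if new_positions[i] < len(partials[i]):
            score, nxt = partials[i][new_positions[i]]
            heapq.heappush(heap, (-score, i, nxt))

    return page, new_positions
```

Calling next_page again with the returned positions gives the following
page, so only the saved positions need to be kept between requests.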

Anyway, if I need to do this merging/normalizing/sorting procedure myself, I
assume that to normalize I would just take the highest raw score in the
merged set (call it H) and then transform each raw score S to a normalized
score N by: N = (S/H)*1000. Is this correct? Although I noticed for IDF the
top score is not always 1000 --- how would I normalize in this case?
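
To make the question concrete, here is a minimal sketch of the procedure as I
understand it: concatenate the partial result sets, sort by raw score, and
scale so the highest score H maps to 1000 via N = (S/H)*1000. The function
name and the (raw score, path) tuple shape are made up for illustration, and
this is not claimed to match what Swish-e does internally, in particular for
IDF where the top score is not always 1000.

```python
from typing import Iterable, List, Tuple

Hit = Tuple[float, str]  # (raw score, document path); hypothetical shape


def merge_and_normalize(
    partials: Iterable[List[Hit]],
    top: float = 1000.0,
) -> List[Hit]:
    """Merge partial result sets, sort by raw score, and scale to `top`."""
    # Concatenate all independently generated partial result sets.
    merged = [hit for part in partials for hit in part]
    if not merged:
        return []

    # Sort by descending raw score; the first hit then carries H.
    merged.sort(key=lambda hit: hit[0], reverse=True)
    highest = merged[0][0]
    if highest <= 0:
        # Nothing sensible to scale by; report all scores as zero.
        return [(0.0, path) for _, path in merged]

    # N = (S / H) * top for every hit.
    return [(score / highest * top, path) for score, path in merged]
```

For example, raw scores 20000, 10000 and 5000 spread over two partial sets
come back as 1000, 500 and 250.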

Again, thanks so much for this. This modification could be really valuable
and make Swish-e scale to much larger corpus sizes and give better
performance on multiprocessors. And if you combined this with some kind of
interprocess communication (e.g. PVM or just communicate over TCP sockets)
you could harness a cluster of workstations in parallel for large-scale
Swish-e searching. Maybe we can have a David and Goliath moment and Swish-e
can slay the Google beast! :)

cheers,
Andrew

On Wed, Mar 25, 2009 at 10:24 PM, Peter Karman <peter@peknet.com> wrote:

> Andrew Smith wrote on 3/25/09 4:01 PM:
>
> >> If I were going to implement this feature, I would probably add a new
> >> RankScheme that doesn't normalize the rank scores but instead returns
> >> the raw rank scores. Then the search manager could merge and sort using
> >> the raw rank scores.
> >
> >
> > So is this all you would need to do --- just get all the raw rank scores,
> > merge and sort based on those, and finally normalize all the scores to
> > 1000 at the end? There is no way currently to get Swish-e to return you
> > raw rank scores? This doesn't seem like it would be a very hard change
> > --- presumably Swish-e calls some subroutine that normalizes the scores,
> > so the change would be to just get rid of this subroutine call (which
> > would presumably cause raw rank scores to be returned). Finally, if you
> > wanted normalized scores you presumably could just call the normalize
> > subroutine after the search manager has merged and sorted the raw rank
> > scores. Does this sound about right? If this is it, maybe I could take a
> > stab at it.
>
> you're right. it wasn't a very hard change, provided you knew where to make
> it, which I didn't at first.
>
> After a little hunting, r2285 in svn trunk implements raw rank as
> RankScheme 2. Check out svn trunk and post back here to let us know if it
> works for you or not. I still haven't made the 2.4.7 release, so it can be
> in there if I hear positive feedback about it in the next few days.
>
> pek
>
>
> --
> Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
> _______________________________________________
> Users mailing list
> Users@lists.swish-e.org
> http://lists.swish-e.org/listinfo/users
>


Received on Thu Mar 26 13:43:51 2009