Re: [swish-e] parallelism and Swish-e

From: Andrew Smith <andrewksmith(at)not-real.gmail.com>
Date: Fri Mar 27 2009 - 21:12:51 GMT
>
> > Hi,
> >
> > Thanks a lot for doing this. I just checked r2285 out from svn, built it,
> > and tested it on some of my indexes (I assumed I could just use previously
> > created indexes from the 2.4.5 version and wouldn't need to rebuild them,
> > but let me know if otherwise).
>
> that is correct. 2.4.5, 2.4.6 and 2.4.7 should all be index-compatible.
>
> > And it returns the raw scores now (they start
> > out in the 10000s). However, the rank order of the results matches the
> > results for IDF (i.e. -R 1) but not the default ranking scheme. This is
> > fine for me (I'm using IDF, -R 1) but it might be nice to give the option
> > of getting raw scores for the default ranking scheme too. Would it be hard
> > to modify the code to give the raw scores for the default ranking scheme?
> > Alternatively, maybe you should instead of calling these new ranking
> > schemes, just add a new flag, say '--raw' which if passed gives the raw
> > scores for whatever ranking scheme is in effect.
>
> I went for the easiest change, since adding a new command line option
> requires a little more invasive work.
>
> In the five years since I wrote the RankScheme feature I've not heard anyone
> say they prefer the default (0) over the IDF (1). I'd be tempted to make 1
> the default except that it requires re-indexing if you don't have
> IgnoreTotalWordCountWhenRanking set to 0 so it's a back-compat issue.
>
> That said, adding another RankScheme for raw ranks for the default is not
> hard. Just a copy/paste in rank.c and a check in docprop.c for the new
> RankScheme number. Here's the changeset I did yesterday:
>
> http://dev.swish-e.org/changeset/2285


On a separate but related note, I'm considering developing my own ranking
scheme. From reading the source, it seems I need to add a call to the new
rank scheme function in getrank in rank.c, and then define the new ranking
function (similar to getrankDEF and getrankIDF). Is that correct, or am I
missing any key steps? Are there any examples of other contributed ranking
functions? Is there a high-level overview of the code, or should I just read
the comments? And is there another place (wiki, development list, etc.)
where development questions would be more appropriate than this list?


> > Finally, you would still need to merge all independent partial result
> > sets (which presumably would have been created by separate parallel
> > processes and each have raw scores), then normalize the scores in the
> > merged set, and finally sort them all for the final ranked result set. I
> > could just write code to do this myself, but did you make changes to
> > support this as well?
>
> No.
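
OK, then I'll do the merge myself. Roughly what I have in mind is below: a
minimal sketch in Python, not anything taken from the Swish-e source. It
assumes each parallel process hands back a list of (raw_score, path) pairs,
and that the final ranks should be scaled so the best hit gets 1000. The
function name, the Hit layout, and the tie-breaking rule are all my own
inventions for illustration.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (raw_score, document path).
Hit = Tuple[float, str]


def merge_normalize_sort(
    partials: Iterable[List[Hit]], top: int = 1000
) -> List[Tuple[int, str]]:
    """Concatenate partial result sets, rescale so the best hit gets
    `top`, and return the hits sorted by descending rank."""
    merged = [hit for part in partials for hit in part]
    if not merged:
        return []
    best = max(score for score, _ in merged)
    if best <= 0:
        # Nothing sensible to scale against; report every hit as 0.
        return [(0, path) for _, path in merged]
    # Scale relative to the single best raw score across ALL partial sets,
    # keeping a floor of 1 so that no matching document drops to rank 0.
    scaled = [
        (max(1, round(score * top / best)), path) for score, path in merged
    ]
    # Descending rank, then path, so the order is reproducible.
    scaled.sort(key=lambda hit: (-hit[0], hit[1]))
    return scaled
```

This only makes sense if the raw scores from the separate indexes are
comparable in the first place, which is the question further down.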
>
> > Again, clearly there is some subroutine or code in the Swish-e source
> > that does this merging, normalizing, and sorting and could a hook be
> > provided to this? For example, add a new flag to Swish-e, say
> > '--raw-files', which takes a list of files each of which contains an
> > independently generated partial result set with raw scores --- Swish-e
> > then concatenates all these files and passes the result into the
> > normalize/sort subroutine and then returns the final result.
>
> The chief problem is that when you are comparing IDF/TF scores between
> indexes, your numbers are going to be off because the term and document
> frequencies are not the same in each index, esp if the indexes are
> radically different sizes.
>
> IDF/TF is a good start, but compared to the ranking algorithms in most high
> scale systems these days, IDF/TF is very naive. And for purists, broken in
> the current Swish-e implementation when dealing with multiple indexes (for
> the reason I state above).


So you are saying that technically the current Swish-e is buggy when doing
IDF for multiple index files (i.e. '-f indexfile1 indexfile2 ...')? Also,
for parallelism you would just divide up all the files to be indexed
randomly and evenly among the parallel processes, so each independent index
file would be about the same size and, because the split is random, would
have almost the same IDF statistics. So in practice it shouldn't be a
problem.
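
To make that concrete, here is a small sanity check in Python. It is only a
sketch under stated assumptions: the corpus is synthetic, the helper names
are mine, and the IDF formula is the textbook smoothed form
log((N + 1) / (df + 1)), which is not necessarily what Swish-e computes in
rank.c.

```python
import math
import random
from typing import List, Set


def idf(docs: List[Set[str]], term: str) -> float:
    """Textbook smoothed IDF; NOT Swish-e's exact formula."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((len(docs) + 1) / (df + 1))


def shard_randomly(
    docs: List[Set[str]], n_shards: int, seed: int = 0
) -> List[List[Set[str]]]:
    """Shuffle the documents, then deal them out round-robin."""
    shuffled = docs[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_shards] for i in range(n_shards)]


# Synthetic corpus: "swish" appears in 10% of 10,000 documents.
corpus = [
    {"swish", "index"} if i % 10 == 0 else {"index"} for i in range(10_000)
]
shards = shard_randomly(corpus, n_shards=4)

global_idf = idf(corpus, "swish")
shard_idfs = [idf(shard, "swish") for shard in shards]
# Largest difference between any one shard's IDF and the whole-corpus IDF.
max_gap = max(abs(value - global_idf) for value in shard_idfs)
print(f"global={global_idf:.3f} max per-shard gap={max_gap:.3f}")
```

With a random split I would expect the gap to be small for a term this
common. Rare terms, or indexes that are not split randomly, are where the
per-index statistics would drift apart.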


> This is actually one of the main reasons I started Swish3, because I wanted
> to play with alternate ranking schemes and I saw that the 2.x architecture
> wasn't really suited to it. That, and UTF-8.


Sounds nice, looking forward to seeing it. Any ETA on it?


Received on Fri Mar 27 17:12:53 2009