Re: [swish-e] parallelism and Swish-e

From: Andrew Smith <andrewksmith(at)not-real.gmail.com>
Date: Wed Mar 25 2009 - 21:01:37 GMT
>
> >>> For indexing, I could just handle it myself via the prog input method
> >>> (i.e. just fork parallel processes which each independently index part
> >>> of a directory tree, e.g. each process is given a number N and indexes
> >>> 1/Nth of the documents). Then I could merge the indexes at the end (or
> >>> just pass them all to Swish-e using the -f option when searching) But
> >>> it would be easier if I could just do this via the simple file system
> >>> index method; is there any configuration option where you can specify
> >>> that Swish-e only indexes every Nth file it encounters?
> >> no
> >
> > This would seem to be a very useful feature to make it easy to index
> > in parallel. And it would seem to be pretty easy to add such support
> > to Swish-e. May I suggest this as a feature to add in future versions
> > of Swish-e? I.e., pass a new flag to swish-e that tells it which proc
> > num it is; also pass a new flag telling how many total procs are
> > running. Then swish-e will only index document N that it encounters
> > if:
> > (N % total_num_procs) == proc_num
> >
>
> I guess the only way to know how easy it is to add such a feature is to try
> writing it. I'd suggest patching against svn trunk at:
>
>  http://svn.swish-e.org/swish-e/trunk
>
> If you need any more pointers, feel free to ask here on this list.
>
> I look forward to seeing your code.


Well, that might be a bit adventuresome for me at the moment, but maybe I'll
take a peek at the code if I get a chance. I actually just wrote Perl code
to handle it myself via the prog method. If there were interest, maybe I
could contribute that code.
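
For anyone following along, here is a minimal sketch of the partitioning rule quoted above, (N % total_num_procs) == proc_num. It is in Python, not Perl, and is not the code mentioned in this message. The function names (shard_paths, walk_sorted, emit_prog_document) are made up for illustration. The output layout assumes the -S prog convention of a Path-Name header and a Content-Length header, a blank line, then the document body; check the SWISH-RUN docs before relying on it.

```python
import os
from typing import BinaryIO, Iterable, Iterator


def shard_paths(paths: Iterable[str], proc_num: int,
                total_num_procs: int) -> Iterator[str]:
    """Yield each path whose position n satisfies
    n % total_num_procs == proc_num."""
    if not 0 <= proc_num < total_num_procs:
        raise ValueError("proc_num must be in range(total_num_procs)")
    for n, path in enumerate(paths):
        if n % total_num_procs == proc_num:
            yield path


def walk_sorted(root: str) -> Iterator[str]:
    """Walk a directory tree in a fixed order, so that every process
    numbers the files identically and the shards do not overlap."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sorting in place fixes the traversal order
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def emit_prog_document(path: str, out: BinaryIO) -> None:
    """Write one document as headers, a blank line, then the raw bytes.
    (Assumed -S prog layout; Content-Length is a byte count.)"""
    with open(path, "rb") as fh:
        data = fh.read()
    header = f"Path-Name: {path}\nContent-Length: {len(data)}\n\n"
    out.write(header.encode("utf-8"))
    out.write(data)
```

Each of the parallel processes would loop over shard_paths(walk_sorted(root), proc_num, total_num_procs) and call emit_prog_document(path, sys.stdout.buffer) for every path. The sorted walk matters: if two processes saw the files in different orders, some documents would be indexed twice and others skipped.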


>
>
> >
> >>> Next, for searching can Swish-e take advantage of parallelism? For
> >>> example, does it know it is running on a multiprocessor and internally
> >>> execute the search in parallel? If not, again, I could conceivably
> >>> handle this myself as follows. If I want to search in parallel on,
> >>> say, 8 processors I would create 8 separate indexes as above, each
> >>> covering 1/8th of the files in the corpus of documents to be searched.
> >>> Then when searching I fork 8 processes where each one independently
> >>> searches one of the 8 separate indexes. Finally, I collate the results
> >>> of each of these 8 parallel searches into one final result set. Would
> >>> this work? Or would it somehow screw up relevance ranking since the
> >>> indexes are being searched independently?
> >> the latter. The ranking is scaled to a 1000 baseline, not a raw rank score,
> >> so you wouldn't be able to reliably interweave the results.
> >
> > Is there any way we could achieve the correct result manually
> > ourselves? Again, this would seem to be a very useful feature and
> > seemingly not too difficult to implement (basically, Swish-e already
> > does exactly this internally when it searches against multiple
> > separate index files and merges the results --- why couldn't this
> > functionality be made available for external use to enable independent
> > parallel processes, each searching a separate index file, to correctly
> > collate swish-e results?) I would suggest this also as a useful
> > feature to add to future Swish-e versions.
>
> I believe the rank normalization happens in src/search.c.
>
> If I were going to implement this feature, I would probably add a new
> RankScheme that doesn't normalize the rank scores but instead returns the
> raw rank scores. Then the search manager could merge and sort using the
> raw rank scores.


So is this all you would need to do --- just get all the raw rank scores,
merge and sort based on those, and finally normalize all the scores to 1000
at the end? Is there currently no way to get Swish-e to return raw rank
scores? This doesn't seem like it would be a very hard change --- presumably
Swish-e calls some subroutine that normalizes the scores, so the change
would be to just get rid of this subroutine call (which would presumably
cause raw rank scores to be returned). Finally, if you wanted normalized
scores you presumably could just call the normalize subroutine after the
search manager has merged and sorted the raw rank scores. Does this sound
about right? If this is it, maybe I could take a stab at it.
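
To make the proposal concrete, here is a rough Python sketch of the merge step under some loud assumptions: that each parallel search could hand back (raw_rank, path) pairs, which Swish-e does not do today according to this thread; that raw ranks from separate indexes are directly comparable; and that normalizing means linearly scaling so the best hit gets 1000. The function name merge_and_normalize and the tuple layout are hypothetical, not anything in the Swish-e source.

```python
from typing import Iterable, List, Tuple

Hit = Tuple[float, str]  # (raw_rank, path); hypothetical result shape


def merge_and_normalize(shard_results: Iterable[List[Hit]],
                        top: int = 1000) -> List[Tuple[int, str]]:
    """Pool the hits from every per-index search, sort them by raw rank
    (highest first), then scale so the best hit scores `top`."""
    merged = [hit for hits in shard_results for hit in hits]
    merged.sort(key=lambda hit: hit[0], reverse=True)
    if not merged:
        return []
    best = merged[0][0]
    if best <= 0:
        # Nothing sensible to scale against; report all hits as 0.
        return [(0, path) for _, path in merged]
    # Assumed normalization: linear scaling against the single best hit.
    return [(int(raw * top / best), path) for raw, path in merged]
```

For example, merge_and_normalize([[(50.0, "a"), (10.0, "b")], [(200.0, "c"), (25.0, "d")]]) puts "c" first with 1000, followed by "a", "d", and "b". The point of normalizing only after the merge is that every hit is scaled against the same global maximum rather than against the best hit of its own index.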


>
> --
> Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
> _______________________________________________
> Users mailing list
> Users@lists.swish-e.org
> http://lists.swish-e.org/listinfo/users
>


Received on Wed Mar 25 17:01:38 2009