Re: [swish-e] parallelism and Swish-e

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Wed Mar 18 2009 - 02:59:52 GMT
Andrew Smith wrote on 3/17/09 4:47 PM:

>>> For indexing, I could just handle it myself via the prog input method
>>> (i.e. just fork parallel processes which each independently index part
>>> of a directory tree, e.g. each process is given a number N and indexes
>>> 1/Nth of the documents). Then I could merge the indexes at the end (or
>>> just pass them all to Swish-e using the -f option when searching). But
>>> it would be easier if I could just do this via the simple file system
>>> index method; is there any configuration option where you can specify
>>> that Swish-e only indexes every Nth file it encounters?
>> no
> 
> This would seem to be a very useful feature to make it easy to index
> in parallel. And it would seem to be pretty easy to add such support
> to Swish-e. May I suggest this as a feature to add in future versions
> of Swish-e? I.e., pass a new flag to swish-e that tells it which proc
> num it is; also pass a new flag telling how many total procs are
> running. Then swish-e will index the Nth document it encounters only
> if:
> (N % total_num_procs) == proc_num
> 

I guess the only way to know how easy it is to add such a feature is to try
writing it. I'd suggest patching against svn trunk at:

 http://svn.swish-e.org/swish-e/trunk

If you need any more pointers, feel free to ask here on this list.

I look forward to seeing your code.
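In the meantime, the modulo partition can be done entirely outside Swish-e via the prog input method the original poster mentioned. Here's a minimal sketch of such a driver script (the script name, argument order, and sort-by-path convention are my assumptions, not anything Swish-e ships); each of the N parallel swish-e invocations would run it with a different proc number, e.g. `swish-e -S prog -i ./shard_prog.py`:

```python
# shard_prog.py -- hypothetical "prog" input script for Swish-e.
# Run N copies of swish-e in parallel, each driving this script with a
# different proc_num; each copy sees only its 1/Nth share of the files.
import os
import sys

def shard(paths, proc_num, total_procs):
    """The modulo rule from the thread: this process owns document N
    iff (N % total_num_procs) == proc_num. Paths must be sorted the
    same way in every process so the shards don't overlap."""
    return [p for i, p in enumerate(paths) if i % total_procs == proc_num]

def emit(path, content, out):
    """Write one document in Swish-e's prog input format:
    headers, a blank line, then the document content."""
    out.write("Path-Name: %s\n" % path)
    out.write("Content-Length: %d\n\n" % len(content))
    out.write(content)

def main(root, proc_num, total_procs):
    # Deterministic ordering is what makes the shards disjoint.
    paths = sorted(
        os.path.join(d, f)
        for d, _, files in os.walk(root)
        for f in files
    )
    for p in shard(paths, proc_num, total_procs):
        with open(p, "r", errors="replace") as fh:
            emit(p, fh.read(), sys.stdout)
```

(Content-Length here counts characters after decoding, which is only correct for ASCII input; a real script would measure bytes.)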


> 
>>> Next, for searching can Swish-e take advantage of parallelism? For
>>> example, does it know it is running on a multiprocessor and internally
>>> execute the search in parallel? If not, again, I could conceivably
>>> handle this myself as follows. If I want to search in parallel on,
>>> say, 8 processors I would create 8 separate indexes as above, each
>>> covering 1/8th of the files in the corpus of documents to be searched.
>>> Then when searching I fork 8 processes where each one independently
>>> searches one of the 8 separate indexes. Finally, I collate the results
>>> of each of these 8 parallel searches into one final result set. Would
>>> this work? Or would it somehow screw up relevance ranking since the
>>> indexes are being searched independently?
>> the latter. Each index's ranking is scaled to a 1000-point baseline rather than
>> a raw rank score, so you wouldn't be able to reliably interleave results from
>> independently searched indexes.
> 
> Is there any way we could achieve the correct result manually
> ourselves? Again, this seems like a very useful feature and not too
> difficult to implement: Swish-e already does exactly this internally
> when it searches against multiple separate index files and merges the
> results. Why couldn't that functionality be exposed for external use,
> so that independent parallel processes, each searching a separate
> index file, could correctly collate Swish-e results? I would suggest
> this also as a useful feature to add to future Swish-e versions.

I believe the rank normalization happens in src/search.c.

If I were going to implement this feature, I would probably add a new RankScheme
that doesn't normalize the rank scores but instead returns the raw rank scores.
Then the search manager could merge and sort using the raw rank scores.
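Assuming such a raw-score RankScheme existed, the search manager's collation step is just a k-way merge by descending raw score. A sketch (the `collate` function and the (score, doc) tuple shape are my invention for illustration):

```python
import heapq

def collate(result_streams):
    """Merge per-index result lists, each already sorted by descending
    raw score, into one globally ordered list. Each hit is a
    (raw_score, doc) pair. This is only valid if the scores are raw,
    i.e. the per-index 1000-baseline normalization has NOT been applied,
    since normalized scores from different indexes aren't comparable."""
    # heapq.merge combines sorted ascending streams, so negate the
    # scores on the way in and restore them on the way out.
    merged = heapq.merge(
        *(((-score, doc) for score, doc in stream) for stream in result_streams)
    )
    return [(-neg_score, doc) for neg_score, doc in merged]
```

Each parallel worker would search its own index, hand back its sorted hits, and the parent would call `collate` on the lot.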

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Received on Tue Mar 17 22:59:43 2009