Re: [swish-e] parallelism and Swish-e

From: Andrew Smith <andrewksmith(at)>
Date: Tue Mar 17 2009 - 21:47:13 GMT

Thanks for your reply (see below for further questions).

>> Hi,
>> I'm using the latest version of Swish-e and I have it working fine,
>> but I am wondering if and how Swish-e has any support for parallelism
>> and multiprocessors, in particular both for indexing and searching.
> In short, there is no built-in support for either.
> A few years ago someone worked up a search cluster manager:
> I've not used it myself. It appears to have been abandoned.
>> For indexing, I could just handle it myself via the prog input method
>> (i.e. just fork parallel processes which each independently index part
>> of a directory tree, e.g. each process is given a number N and indexes
>> 1/Nth of the documents). Then I could merge the indexes at the end (or
>> just pass them all to Swish-e using the -f option when searching). But
>> it would be easier if I could just do this via the simple file system
>> index method; is there any configuration option where you can specify
>> that Swish-e only indexes every Nth file it encounters?
> no

This would seem to be a very useful feature for making parallel
indexing easy, and it looks pretty simple to add to Swish-e. May I
suggest it as a feature for future versions of Swish-e? I.e., pass
swish-e one new flag giving this process's number (proc_num) and
another giving the total number of processes running
(total_num_procs). Each swish-e process would then index the Nth
document it encounters only if
(N % total_num_procs) == proc_num
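
Until such a flag exists, the same rule can be applied from a program
run via the prog input method. The following is only a sketch under
stated assumptions: shard_paths and emit_document are names made up for
this example, the traversal is sorted so that every worker counts files
in the same order, and the Path-Name / Content-Length headers follow my
reading of the -S prog documentation and should be checked against your
Swish-e version.

```python
import os


def shard_paths(root, proc_num, total_num_procs):
    """Yield the files under root whose encounter index N satisfies
    N % total_num_procs == proc_num (hypothetical helper)."""
    n = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Sort in place so every worker walks the tree in the same order
        # and therefore agrees on which file is number N.
        dirnames.sort()
        for name in sorted(filenames):
            if n % total_num_procs == proc_num:
                yield os.path.join(dirpath, name)
            n += 1


def emit_document(path, out):
    """Write one document as headers, a blank line, then the raw body.
    The header names are an assumption taken from the -S prog docs."""
    with open(path, "rb") as f:
        body = f.read()
    header = "Path-Name: %s\nContent-Length: %d\n\n" % (path, len(body))
    out.write(header.encode("utf-8") + body)
```

Each worker would call shard_paths(root, proc_num, total_num_procs) with
its own proc_num and pass every yielded path to emit_document, writing
to its binary standard output. Together the workers cover every file
exactly once, provided the tree does not change while they run.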

>> Next, for searching can Swish-e take advantage of parallelism? For
>> example, does it know it is running on a multiprocessor and internally
>> execute the search in parallel? If not, again, I could conceivably
>> handle this myself as follows. If I want to search in parallel on,
>> say, 8 processors I would create 8 separate indexes as above, each
>> covering 1/8th of the files in the corpus of documents to be searched.
>> Then when searching I fork 8 processes where each one independently
>> searches one of the 8 separate indexes. Finally, I collate the results
>> of each of these 8 parallel searches into one final result set. Would
>> this work? Or would it somehow screw up relevance ranking since the
>> indexes are being searched independently?
> the latter. The ranking is scaled to a 1000 baseline, not a raw rank score, so
> you wouldn't be able to reliably interweave the results.
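
To make the problem concrete, here is a small illustration with made-up
numbers. It assumes, purely for the sake of the example, that scaling
means "the best hit in each result set becomes 1000"; the actual
Swish-e formula may differ, and scale_to_1000 and the raw scores are
hypothetical.

```python
def scale_to_1000(raw_scores):
    """Rescale so the best hit gets 1000 (assumed scaling rule)."""
    top = max(raw_scores.values())
    return {doc: round(1000 * s / top) for doc, s in raw_scores.items()}


def ranked(scores):
    """Document names ordered by descending score, ties broken by name."""
    return [doc for doc, _ in sorted(scores.items(),
                                     key=lambda kv: (-kv[1], kv[0]))]


# Hypothetical raw scores from two independently searched indexes.
index_a = {"a1": 80.0, "a2": 40.0}
index_b = {"b1": 8.0, "b2": 6.0}

# Interleaving by the per-index scaled rank...
by_scaled = ranked({**scale_to_1000(index_a), **scale_to_1000(index_b)})
# ...versus the order the raw scores would have given.
by_raw = ranked({**index_a, **index_b})

print(by_scaled)
print(by_raw)
```

Under these assumptions the weak hits from index_b are lifted to the
same scale as the strong hits from index_a, so the interleaved order
differs from the raw-score order.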

Is there any way we could achieve the correct result manually
ourselves? Again, this would seem to be a very useful feature and
seemingly not too difficult to implement (basically, Swish-e already
does exactly this internally when it searches against multiple
separate index files and merges the results --- why couldn't this
functionality be made available for external use to enable independent
parallel processes, each searching a separate index file, to correctly
collate swish-e results?) I would suggest this also as a useful
feature to add to future Swish-e versions.
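
A rough sketch of the collation step I have in mind, assuming each
worker could return raw, unscaled scores that are comparable across
indexes (which, per the reply above, Swish-e does not expose today).
Everything here is hypothetical: search_shard stands in for one
independent search of one index, and the shards are plain dictionaries
rather than real index files. Threads are used only to keep the example
short; real workers would be separate swish-e processes.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, query):
    """Hypothetical stand-in for searching one index.
    shard maps path -> {term: raw_score}; returns (score, path)
    pairs sorted from best to worst."""
    hits = [(scores[query], path)
            for path, scores in shard.items() if query in scores]
    return sorted(hits, reverse=True)


def parallel_search(shards, query):
    """Search every shard concurrently, then k-way merge the sorted
    per-shard result lists into one list ordered by raw score."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = list(pool.map(lambda s: search_shard(s, query), shards))
    # heapq.merge with reverse=True expects inputs sorted descending.
    return list(heapq.merge(*per_shard, reverse=True))
```

The merge itself is cheap; the hard part is getting scores out of each
index that mean the same thing in every shard.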

> Swish-e's architecture was never designed to scale the way you are describing.
> You might be able to take the approach you describe and use multiple indexes. I
> know that some folks have used that simply to allow for multi-million document
> collections.
> OTOH, you might look at Swish3, since the Xapian backend can scale for
> distributed searching[1].
> [1]
Received on Tue Mar 17 17:47:12 2009