Re: Greetings Swish-e developers

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Dec 01 2004 - 04:17:06 GMT
On Tue, Nov 30, 2004 at 06:36:35PM -0800, Dave Seff wrote:
> 1. Numeric range searching - We need a way to take numbers and figures
> and search based on a numeric range. For example: I have an XML tag in
> my documents that store the number of employees for 1.5 million
> companies. Each company has XML documents with these figures. I would
> like to search for those documents for the companies that have between
> 500 and 1000 employees. So We would need to add a <BETWEEN> tag in the
> swish-e syntax. 

Swish's index is a hash lookup which prevents that kind of search in
the main index.  But, swish does have a way to limit searches between
two numbers (unsigned integers), and it's a by-product of the way
swish can sort results by properties.  It's not very scalable, though.

One way to limit the results would be to store the number you want as
a property and then when generating results consult each result's
(file's) property and see if it's within range.  That's way too slow,
though, so swish uses the pre-sort tables to speed things up.

To speed up sorting by a given property at search time, swish creates
an integer sort table for each property at indexing time.  This allows
faster sorting at run time because the results need only be sorted by
this number instead of by, say, a long string comparison.
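
A rough sketch of that idea in Python (this is not swish's actual C
code; build_sort_table and the sample data are made up for
illustration, and giving equal values the same rank is an assumption):

```python
def build_sort_table(prop_values):
    """prop_values[file_no] is the property value for that file.

    Returns a list where entry file_no is the integer rank of that
    file's property value (0 = smallest).  Built once, at index time.
    """
    # File numbers arranged in property order.
    order = sorted(range(len(prop_values)), key=lambda f: prop_values[f])
    ranks = [0] * len(prop_values)
    rank = 0
    for pos, file_no in enumerate(order):
        # Move to the next rank only when the value actually changes.
        if pos > 0 and prop_values[file_no] != prop_values[order[pos - 1]]:
            rank += 1
        ranks[file_no] = rank
    return ranks


# Index time: one pass over the (possibly long) string property.
titles = ["zebra crossing", "apple pie", "mango lassi", "apple pie"]
sort_table = build_sort_table(titles)

# Search time: sort the matching file numbers by integer rank only.
results = [0, 2, 3]
sorted_results = sorted(results, key=lambda f: sort_table[f])
```

At search time only the small integers in sort_table are compared;
the title strings are never touched.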

This table is also used to limit results based on a range.  For
example, say you have a property that is a date (as a unix timestamp).
Swish takes the sort table for that property, sorts it in property
order, flags every entry between the two limits as "keep" and all the
others as "reject", and then sorts the table again in file number
order.  Then when generating results swish can look at that table by
file number and determine whether each given result should be kept or
rejected.
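
The flagging step might look something like this in Python.  It is
only a sketch under assumptions: flag_by_range is a hypothetical name,
the sort table is taken to be a list of integer ranks indexed by file
number, and the two limits are assumed to have been translated into
ranks already.

```python
def flag_by_range(sort_table, first_rank, last_rank):
    """sort_table[file_no] is the integer sort rank for that file.

    Returns a list of booleans indexed by file number:
    True means "keep", False means "reject".
    """
    # Put the table in property order: (rank, file_no) pairs.
    by_rank = sorted((rank, f) for f, rank in enumerate(sort_table))
    # Flag each entry depending on whether its rank is within the limits.
    flagged = [(f, first_rank <= rank <= last_rank) for rank, f in by_rank]
    # Sort once more, this time back into file number order.
    flagged.sort()
    return [keep for _, keep in flagged]


sort_table = [2, 0, 1, 0, 3]
keep = flag_by_range(sort_table, 1, 2)

# When generating results, one lookup per file number decides its fate.
results = [0, 1, 4]
limited = [f for f in results if keep[f]]
```

Here files 0 and 2 are flagged "keep", so of the three results only
file 0 survives.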

The table is only integers, not the actual time stamps, of course, so
swish has to read the property table to find where the two limits
fall.  Swish uses two binary searches to limit the number of accesses
of the property tables.

   swish-e -w foo -L date 1100000000 1101874327
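
To illustrate the two binary searches, here is a small Python sketch
using the same limits as the command above.  It is not swish's
implementation: PropertyStore, find_range and the sample timestamps
are invented for the example, and the read counter is only there to
show how few property lookups are needed.

```python
class PropertyStore:
    """Stand-in for the property table; counts how often it is read."""

    def __init__(self, values):
        self._values = values
        self.reads = 0

    def read(self, file_no):
        self.reads += 1
        return self._values[file_no]


def find_range(order, store, low, high):
    """order holds file numbers arranged by ascending property value.

    Returns (start, end) so that order[start:end] are exactly the files
    whose property value v satisfies low <= v <= high.
    """
    # First binary search: first position whose value is >= low.
    lo, hi = 0, len(order)
    while lo < hi:
        mid = (lo + hi) // 2
        if store.read(order[mid]) < low:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Second binary search: first position whose value is > high.
    hi = len(order)
    while lo < hi:
        mid = (lo + hi) // 2
        if store.read(order[mid]) <= high:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


timestamps = [1101000000, 1099000000, 1101874327,
              1102000000, 1100000000, 1090000000]
# Index time: file numbers in property order (the pre-sort step).
order = sorted(range(len(timestamps)), key=lambda f: timestamps[f])

store = PropertyStore(timestamps)
start, end = find_range(order, store, 1100000000, 1101874327)
kept_files = sorted(order[start:end])
```

Each search halves the remaining span on every read, so the number of
property lookups grows with the logarithm of the number of files
rather than with the number of files itself.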


> 2. Search Daemon and scalability - We would like to be able to scale out
> using multiple machines searching multiple collections (100+ Million
> Docs). This would require making swish-e cluster-aware.

Looks like a bit of work.  Swish was designed for indexing a few
hundred pages originally.  Some people are indexing a few million now.

> As we start implimenting new features, Is this list the appropriate
> forum to submit patches to or is there a designated person dealing with
> that?

Discuss your work here.  If it gets involved and detailed we can take
it to the developer's list.

-- 
Bill Moseley
moseley@hank.org

Unsubscribe from or help with the swish-e list: 
   http://swish-e.org/Discussion/

Help with Swish-e:
   http://swish-e.org/current/docs
   swish-e@sunsite.berkeley.edu
Received on Tue Nov 30 20:17:07 2004