Re: [swish-e] Aid with Swish3 Unicode feature

From: Itamar Syn-Hershko <itamar(at)>
Date: Mon Jan 14 2008 - 20:46:09 GMT
> It sounds like Xapian might be more up your alley:
> Swish3 will use it as one possible backend.

Why not Lucene? (CLucene)

I've been studying its internals over the last few days, and frankly this is
THE thing. It sounds as though Swish3 will no longer use index files of its
own (if Xapian is going to be a backend, it will do all of the index work
itself), so why not CLucene, which I would think is better and has far more
users and support?


-----Original Message-----
[] On Behalf Of Peter Karman
Sent: Monday, January 07, 2008 5:51 PM
To: Swish-e Users Discussion List
Subject: Re: [swish-e] Aid with Swish3 Unicode feature

On 01/06/2008 04:37 PM, Itamar Syn-Hershko wrote:
> Hi all,
> I'm a C++ developer, and found Swish-e not too long ago while 
> researching the net for an indexing service (or algorithm) I could use 
> for a private project. With this project, I'm aiming to provide a 
> good tool for indexing content and to make the index files portable and 
> searchable by accompanying software. This application of mine should 
> take into account that it is going to be run on possibly weaker systems 
> and (occasionally) from a CD-ROM drive.

Hi Itamar,

It sounds like Xapian might be more up your alley:

Swish3 will use it as one possible backend.

> I was wondering whether someone could explain in simple terms what the 
> index file looks like in detail: what data structure the words 
> and their related info are stored in, and, in short, how the reading 
> process works. I have it half figured out by now, but the whole COMPRESS 
> and DECOMPRESS business got me lost (I would also appreciate it if 
> someone would explain that briefly). Once I see how Swish-e does it, 
> I will either claim mine is better and share it, or use that approach 
> myself and perhaps tweak it...

The Swish 2.4.x index comes in 2 flavors: the 'native' (default)
format and the 'btree' format. Neither of them is well documented (as you
have discovered), and the btree format is still labeled experimental.

The 2.6 branch uses Berkeley
DB as a backend. I think that code would be easier to grok, and it supports
the incremental features that 2.4 native does not.

I would guess that the compress/decompress stuff you are seeing is for the
properties file, which functions semi-independently of the index proper. The
properties file just stores parsed textual content (often compressed) from
the original document collection for later retrieval. It is not used in
searches at all; it is used only for reporting results.
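
To make that split concrete, here is a minimal sketch in Python. It is not
Swish-e's actual on-disk format or code; the class and method names are
hypothetical, and zlib merely stands in for whatever compression the
properties file uses. It only illustrates the idea that searching consults
the word index alone, while the compressed property text is decompressed
only when a result is reported.

```python
import zlib


class TinyIndex:
    """Toy model (hypothetical, not Swish-e's format): a word index for
    searching plus a separate, compressed property store for reporting."""

    def __init__(self):
        self.postings = {}    # word -> set of document ids (search side)
        self.properties = {}  # document id -> compressed text (report side)

    def add(self, doc_id, text):
        # Index every whitespace-separated word, lowercased.
        for word in text.lower().split():
            self.postings.setdefault(word, set()).add(doc_id)
        # Store the original text compressed, independent of the postings.
        self.properties[doc_id] = zlib.compress(text.encode("utf-8"))

    def search(self, word):
        # Only the postings are consulted; nothing is decompressed here.
        return sorted(self.postings.get(word.lower(), set()))

    def describe(self, doc_id):
        # Decompression happens only when a hit is displayed.
        return zlib.decompress(self.properties[doc_id]).decode("utf-8")
```

In this sketch, search() would behave the same if the property store were
deleted; only describe() depends on it.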

> Last but not least, how does the Hashing function used in Swish-e (at 
> least with 2.x) work, and would it work properly for both English and 
> Hebrew words with no hash collision?

2.4 only supports single-byte encodings, so that version seems like a
non-starter for you, but in any case, I don't know enough about the hashing
functions in 2.4 to answer.
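
As a general point, independent of Swish-e: a hash function that works on
raw bytes does not care which language the bytes spell, so UTF-8 encoded
Hebrew hashes as readily as ASCII English. No fixed-size hash can promise
zero collisions, though, so a hash table still has to compare the stored
keys. The sketch below uses FNV-1a purely as an illustration; it is an
assumption of mine, not the function Swish-e 2.x uses, and bucket() is a
hypothetical helper.

```python
FNV_OFFSET_BASIS = 2166136261  # standard 32-bit FNV offset basis
FNV_PRIME = 16777619           # standard 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a over raw bytes (illustrative, not Swish-e's hash)."""
    h = FNV_OFFSET_BASIS
    for b in data:
        h ^= b
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # keep the value within 32 bits
    return h


def bucket(word: str, n_buckets: int = 1024) -> int:
    """Hypothetical helper: map a word to a bucket via its UTF-8 bytes."""
    return fnv1a_32(word.encode("utf-8")) % n_buckets


if __name__ == "__main__":
    for w in ("index", "שלום"):
        print(w, len(w.encode("utf-8")), bucket(w))
```

The Hebrew word is simply a longer byte string (two bytes per letter in
UTF-8), which is why a tool limited to single-byte encodings has trouble
with it long before hashing comes into play.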

> BTW, have you tried ICU yet? (it has C libraries afaik, and also a 
> regex library):

I did look at ICU for Swish3 but rejected it because it seemed too large.
But I may re-visit that decision eventually, and there is currently support
in libswish3 for alternate tokenizers.

> Also, as far as HTML/XML tokenizers for the indexing process, you 
> should have a look at this one:

That's cool.

Swish3 is using libxml2 for more than just parsing; it has buffer, hashing,
iconv and i/o features that are helpful too.

Peter Karman  .  peter(at)  .

Users mailing list
Received on Mon Jan 14 15:46:29 2008