[swish-e] Aid with Swish3 Unicode feature

From: Itamar Syn-Hershko <itamar(at)not-real.divrei-tora.com>
Date: Sun Jan 06 2008 - 22:37:34 GMT
Hi all,
 
I'm a C++ developer, and I found Swish-e not long ago while searching the
net for an indexing service (or algorithm) I could use for a private
project. With this project, I'm aiming to provide a good tool for indexing
content, making the index files portable and searchable by accompanying
software. The application has to take into account that it may run on
weaker systems and, occasionally, from a CD-ROM drive.
 
So, I've been digging into Swish-e 2.x and found it unsuitable for my
purpose, mainly because of its lack of Unicode support, but also because of
its server-oriented architecture. As far as I understand the code (I'm not
too familiar with C libraries - I'm more into OOP and C++...), it is aimed
at indexing at a server level, hence the need for features like Incremental
Indexing and the various APIs. In short, what I'm looking for is a
lightweight search engine that looks words up in index files. I really
don't care how long those files take to generate, since this is done once
before releasing the archiving application (updates may be supported at a
binary level - diff - later on). The search should be quick, yet not take up
too much memory. And, of course, it should support non-English languages.
 
My mother tongue is Hebrew, and my application is aimed at both Hebrew and
English documents. I lack advanced knowledge of indexing technologies, and I
came up with an algorithm of my own, but when I noticed Swish3 I thought I
would ask you for some guidance. In return, I can provide some help and
guidance in the Unicode area - I'm now developing algorithms for wildcards
and query analysis suitable for Hebrew as well (where the ? wildcard is
actually valid...).
 
I was wondering whether someone could explain, in simple terms, what the
index file looks like in detail - what data structure the words and their
related info are stored in, and, briefly, how the reading process works. I
have it half-figured out by now, but the whole COMPRESS and DECOMPRESS
business got me lost... (I would also appreciate a short explanation of
that). Once I see how Swish-e does it, I will either claim mine is better
and share it, or use that approach myself and perhaps tweak it...
Last but not least, how does the hashing function used in Swish-e (at least
in 2.x) work, and would it work properly for both English and Hebrew words
with no hash collisions?
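
To make the two questions above concrete, here is a minimal C++ sketch of
the kind of thing I have in mind. Neither part is taken from the Swish-e
sources: the variable-byte integer coding is a generic technique that I
assume is similar in spirit to what COMPRESS/DECOMPRESS do, and FNV-1a is a
swapped-in stand-in for whatever hash Swish-e really uses. All function
names here (encode_varint, decode_varint, fnv1a) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Generic variable-byte coding (NOT Swish-e's actual format): 7 payload
// bits per byte, low-order group first; the high bit is set on every
// byte except the last one of a number.
std::vector<std::uint8_t> encode_varint(std::uint32_t value) {
    std::vector<std::uint8_t> out;
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
    return out;
}

// Decodes one number starting at `pos` and advances `pos` past it.
std::uint32_t decode_varint(const std::vector<std::uint8_t>& in,
                            std::size_t& pos) {
    std::uint32_t value = 0;
    int shift = 0;
    while (true) {
        assert(pos < in.size());  // input must not end mid-number
        assert(shift <= 28);      // at most 5 bytes for a 32-bit value
        const std::uint8_t byte = in[pos++];
        value |= static_cast<std::uint32_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) break;
        shift += 7;
    }
    return value;
}

// 32-bit FNV-1a over raw bytes (a stand-in, not Swish-e's hash). It only
// sees bytes, so a UTF-8 Hebrew word is hashed exactly like an ASCII one.
std::uint32_t fnv1a(const std::string& bytes) {
    std::uint32_t h = 2166136261u;  // FNV offset basis
    for (unsigned char c : bytes) {
        h ^= c;
        h *= 16777619u;             // FNV prime
    }
    return h;
}
```

Small numbers (such as gaps between word positions) take one byte this way
instead of four. As for the hash, any fixed-width hash can collide for some
pair of words in any language, so I assume the lookup has to compare the
stored word after hashing rather than rely on collisions never happening.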
 
Feel free to hit me with any Unicode-related questions you have; I will do
my best to answer. For starters, Hebrew doesn't have upper/lower case
versions, none that matter anyway ("Ktav", the closest parallel to
lowercase, is what we call handwritten text, as opposed to printed text,
which is called "Dfus"). I already mentioned wildcards: in Hebrew, wildcards
are widely used as the first character of a word too -- not always
explicitly, but the searching application might need to do this (this
perhaps requires some introduction). Hebrew also has Nikkud
(http://en.wikipedia.org/wiki/Nikkud), which I think should be stripped from
search queries when present. This means some Hebrew soundex libraries should
be created to come up with more related results in searches, but I really
see no reason to index those phonetic signs.
BTW, have you tried ICU yet? (It has C libraries AFAIK, and also a regex
library): http://www.icu-project.org/.
Also, as for HTML/XML tokenizers for the indexing process, you should
have a look at this one:
http://www.codeproject.com/cpp/HTML_XML_Scanner.asp.
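
On the Nikkud point above, here is a minimal C++ sketch of what dropping
the points could look like before a word is indexed or searched. It assumes
the text is valid UTF-8 and removes only the vowel points and related dots
(U+05B0 to U+05BD, U+05BF, U+05C1, U+05C2, U+05C7); cantillation marks and
punctuation such as maqaf are left alone. The function name strip_nikkud is
hypothetical, and a real implementation might well use ICU instead.

```cpp
#include <cstddef>
#include <string>

// Returns a copy of `utf8` with Hebrew Nikkud code points removed.
// In UTF-8, U+0580..U+05BF are encoded as 0xD6 0x80..0xBF and
// U+05C0..U+05FF as 0xD7 0x80..0xBF, so two-byte matching is enough.
std::string strip_nikkud(const std::string& utf8) {
    std::string out;
    out.reserve(utf8.size());
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (i + 1 < utf8.size()) {
            const unsigned char next =
                static_cast<unsigned char>(utf8[i + 1]);
            // U+05B0..U+05BD (vowel points, dagesh, meteg), U+05BF (rafe)
            const bool d6_point =
                lead == 0xD6 &&
                ((next >= 0xB0 && next <= 0xBD) || next == 0xBF);
            // U+05C1 (shin dot), U+05C2 (sin dot), U+05C7 (qamats qatan)
            const bool d7_point =
                lead == 0xD7 &&
                (next == 0x81 || next == 0x82 || next == 0x87);
            if (d6_point || d7_point) {
                ++i;  // skip both bytes of the point
                continue;
            }
        }
        out.push_back(utf8[i]);
    }
    return out;
}
```

Applying the same function to both the indexed words and the query terms
would make a pointed and an unpointed spelling of the same word match.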
 
Itamar.



Received on Sun Jan 6 17:37:52 2008