On Fri, Oct 21, 2005 at 01:54:31AM -0700, Nikolay A. Panov wrote:
> Hi Johan,
> My Cyrillic docs were indexed perfectly by swish-e on a Linux system.
> I do not use libxml2 (my docs were indexed as TEXT only), since libxml
> unfortunately cannot work with Cyrillic charsets (koi8-r, cp1251, etc.) now.
> Furthermore, I use stemming_ru for morphology-independent searching...
It's not a problem with libxml2, it's a problem with swish. libxml2
uses utf-8 internally, and swish-e uses only 8-bit encodings, so as an
ugly hack (until swish can be rewritten) swish blindly converts utf-8
to 8859-1.
Without libxml2, swish just assumes the input data is in an 8-bit
encoding and takes whatever data it is given.
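
To see why a blind utf-8 to 8859-1 conversion hurts Cyrillic text, here is a minimal Python sketch. It is not swish-e's code: the helper name `unmappable_chars` and the sample string are made up for illustration, and it says nothing about what swish actually does with characters it cannot convert. It only shows that Cyrillic letters have no slot in ISO-8859-1, while koi8-r holds them all.

```python
def unmappable_chars(utf8_data, encoding):
    """Return the characters in utf8_data that `encoding` cannot represent."""
    missing = []
    for ch in utf8_data.decode("utf-8"):
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


# Hypothetical sample: one ASCII word and one Cyrillic word.
sample = "swish Привет".encode("utf-8")

lost_in_latin1 = unmappable_chars(sample, "iso-8859-1")
lost_in_koi8r = unmappable_chars(sample, "koi8-r")

print("not representable in 8859-1:", lost_in_latin1)
print("not representable in koi8-r:", lost_in_koi8r)
```

Every Cyrillic letter ends up in the first list and none in the second, which is the whole problem in miniature.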
Be aware that the non-libxml2 parsers have some problems. They are mostly
minor, but those parsers may not index as "correctly" as the libxml2
parser. I don't remember any more exactly what the differences are. It's
a fun exercise to index a few docs using both parsers and then compare
the words indexed.
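
If you try that comparison, something like the following Python sketch can do the diffing. It assumes you have already dumped the words from each index into a plain list by whatever means you prefer; the function name `compare_word_lists` and both word lists are invented placeholders, not real swish-e output.

```python
def compare_word_lists(words_a, words_b):
    """Return (words only in A, words only in B), each sorted."""
    set_a, set_b = set(words_a), set(words_b)
    return sorted(set_a - set_b), sorted(set_b - set_a)


# Placeholder data standing in for the words each parser indexed.
libxml2_words = ["alpha", "beta", "gamma"]
other_parser_words = ["alpha", "beta", "delta"]

only_libxml2, only_other = compare_word_lists(libxml2_words, other_parser_words)
print("only in libxml2 index:", only_libxml2)
print("only in other index:  ", only_other)
```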
Another approach would be to hack parser.c and replace the 8859-1
conversion with one that converts to whatever 8-bit encoding you
need to work with.
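
The real change would be C code inside parser.c, which is not shown here. As a rough Python sketch of the idea only, the conversion target becomes a parameter instead of being fixed. The function name `utf8_to_8bit` and the sample word are hypothetical.

```python
def utf8_to_8bit(utf8_data, target="koi8-r"):
    """Convert UTF-8 bytes to the given 8-bit encoding.

    Uses strict error handling, so a character the target cannot
    represent raises UnicodeEncodeError instead of being silently mangled.
    """
    return utf8_data.decode("utf-8").encode(target)


word = "поиск"  # hypothetical sample word
as_utf8 = word.encode("utf-8")

as_koi8r = utf8_to_8bit(as_utf8, "koi8-r")
as_cp1251 = utf8_to_8bit(as_utf8, "cp1251")

print(len(as_utf8), len(as_koi8r), len(as_cp1251))
```

Each Cyrillic letter takes two bytes in utf-8 and one byte in either 8-bit encoding, but koi8-r and cp1251 assign different byte values, so the index and the queries have to agree on one of them.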
You might also want to check that sorting works like you expect.
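
One reason to check: if sorting ends up comparing raw byte values (an assumption here, not something verified against the swish-e source), the result depends on the encoding. koi8-r lays out its Cyrillic letters in an order that does not follow the Russian alphabet, while cp1251 mostly does. A small Python sketch with two made-up sample words:

```python
words = ["цапля", "вода"]  # hypothetical sample words

# Alphabetical order for these two words (Unicode code point order).
by_codepoint = sorted(words)

# Order you get when comparing the encoded bytes instead.
by_koi8r_bytes = sorted(words, key=lambda w: w.encode("koi8-r"))
by_cp1251_bytes = sorted(words, key=lambda w: w.encode("cp1251"))

print("code points: ", by_codepoint)
print("koi8-r bytes:", by_koi8r_bytes)
print("cp1251 bytes:", by_cp1251_bytes)
```

With koi8-r bytes the two words come out in the opposite order from the alphabetical one.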
Received on Fri Oct 21 07:21:13 2005