Re: TranslateCharacters - clarification required

From: Bill Moseley <moseley(at)>
Date: Tue Feb 25 2003 - 18:59:26 GMT

On Tue, 25 Feb 2003, David L Norris wrote:

> On Tue, 2003-02-25 at 03:19, Tref Gare wrote:
> > I'm sure swish-e is fully capable of indexing accented characters
> > (latin-1) but for some reason my swish-e setup seems to be unable to
> > manage it - 

Internally, swish only works with 8 bit chars, and really has no idea
about encodings.  The old HTML, XML, and TXT parsers just work with

But the libxml2 parser does care about encodings.

Libxml2 converts to UTF-8 internally.  For XML docs it looks for

  <?xml version="1.0" encoding="ISO-8859-1"?>

and for HTML docs it looks for 

   <html lang="en">

Now I'm not clear what happens if an encoding or language is not
specified.  I assume that the current locale setting is used.

I suppose a useful -T option would be to dump as bytes the UTF-8 strings
that libxml2 is passing to swish.  That would be helpful in debugging.
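
As a rough illustration of what such a byte dump could look like, here is a
minimal Python sketch.  It is not swish-e code and no such -T option exists
in the transcript below; the function name dump_utf8_bytes is hypothetical.

```python
def dump_utf8_bytes(text: str) -> str:
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


if __name__ == "__main__":
    # An accented Latin-1 character becomes two bytes in UTF-8.
    print(dump_utf8_bytes("caf\u00e9"))  # 63 61 66 c3 a9
```

Seeing the two-byte sequence c3 a9 for e-acute would confirm that the parser
handed over UTF-8 rather than raw 8859-1 bytes.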

Swish-e then blindly tries to convert the UTF-8 to 8859-1.  A warning
is displayed when a UTF-8 character cannot be mapped to 8859-1.
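
The conversion step can be sketched as follows.  This is a Python
illustration under stated assumptions, not the actual swish-e
implementation: the function name utf8_to_latin1 is hypothetical, and
substituting "?" for an unmappable character is an assumption made only so
the example has something to emit.

```python
import sys


def utf8_to_latin1(data: bytes, replacement: bytes = b"?") -> bytes:
    """Convert UTF-8 bytes to ISO-8859-1, warning on unmappable characters."""
    out = bytearray()
    for ch in data.decode("utf-8"):
        code = ord(ch)
        if code <= 0xFF:
            # ISO-8859-1 maps one-to-one onto code points U+0000..U+00FF.
            out.append(code)
        else:
            # Assumption: warn and substitute; the real behaviour may differ.
            print(f"warning: U+{code:04X} cannot be mapped to 8859-1",
                  file=sys.stderr)
            out += replacement
    return bytes(out)


if __name__ == "__main__":
    print(utf8_to_latin1("caf\u00e9".encode("utf-8")))  # fits in 8859-1
    print(utf8_to_latin1("\u20ac5".encode("utf-8")))    # euro sign does not
```

Anything above U+00FF (the euro sign, for example) has no 8859-1 byte, which
is the case that triggers the warning.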

> I don't think translated characters are stored.  I think it simply
> translates those characters during indexing and searching.  But don't
> hold me to that.  ;-)  Bill would know for sure.

The TranslateCharacters settings are stored in the index.  That's needed
so the same translation can be applied while searching.

TranslateCharacters is applied early in both indexing and searching,
before WordCharacters is tested.

moseley@bumby:~/swish-e/src$ cat c doc
TranslateCharacters b x


moseley@bumby:~/swish-e/src$ ./swish-e -c c -i doc -v0 -T indexed_words
    Adding:[1:swishdefault(1)]   'xill'   Pos:2  Stuct:0x9 ( BODY FILE )
                              translated on indexing

moseley@bumby:~/swish-e/src$ ./swish-e -w bill -H9
# SWISH format: 2.3.4
# Search words: bill
# Parsed Words: xill <<< translated on search
1000 doc "doc" 6
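
The ordering shown in the transcript above (translate first, then test
against the word characters) can be sketched in Python.  This is only a
model of the behaviour described, not swish-e's code; the names
make_translator and split_words are hypothetical, and the word-character
set is passed in as a plain Python set.

```python
def make_translator(src: str, dst: str) -> dict:
    """Build a character map like 'TranslateCharacters b x'."""
    if len(src) != len(dst):
        raise ValueError("source and target strings must be the same length")
    return str.maketrans(src, dst)


def split_words(text: str, table: dict, word_chars: set) -> list:
    """Translate characters first, then split on non-word characters."""
    words, current = [], []
    for ch in text.translate(table):
        if ch in word_chars:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


if __name__ == "__main__":
    table = make_translator("b", "x")
    # 'b' is deliberately not a word character here; because translation
    # runs first, the word still comes out whole as 'xill'.
    print(split_words("bill", table, set("ilx")))
```

Running the same function over the document text and over the query is
what makes a search for "bill" match the indexed "xill".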

Bill Moseley
Received on Tue Feb 25 19:00:16 2003