> Yes, that's been on my todo list for a long time. Just adding iconv
> support to parser.c would not be too hard. It's all the other stuff
> that goes along with that that's the issue.
What other stuff would need to be modified?
> > Beside all 8-bit charsets supported that way, there should be one more
> > possible value (e.g. TargetCharset "as-is"), suggesting that documents
> > should be indexed exactly in the same encoding as they were originally.
> As I said yesterday, that doesn't make sense. I tried to explain why I
> don't think it can work. Maybe you can explain in detail how it can
It is the same implementation as for a target charset, so I don't see why
it shouldn't be done. It makes a lot of sense when you try to index
documents in different languages and encodings.
E.g. try to index a website which is translated into different languages
using several encodings. The results may not be perfect, but "as-is"
conversion is the best (and the only) thing we can do.
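To make the idea concrete, here is a rough sketch of how I imagine it,
assuming the standard iconv(3) interface. The function name
convert_for_index and its arguments are made up for illustration and are
not taken from parser.c. The only special case is the "as-is" target,
which copies the bytes unchanged instead of opening a converter.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of parser.c.
 * Converts in_len bytes of `in` (encoded as `source`) into `out`.
 * Returns the number of bytes written, or (size_t)-1 on failure. */
size_t convert_for_index(const char *target, const char *source,
                         const char *in, size_t in_len,
                         char *out, size_t out_cap)
{
    assert(target && source && in && out);

    /* "as-is": keep the document's original bytes, no conversion at all. */
    if (strcmp(target, "as-is") == 0) {
        if (in_len > out_cap)
            return (size_t)-1;
        memcpy(out, in, in_len);
        return in_len;
    }

    /* Any other value is treated as a charset name understood by iconv. */
    iconv_t cd = iconv_open(target, source);
    if (cd == (iconv_t)-1)
        return (size_t)-1;          /* unsupported charset pair */

    char *src = (char *)in;         /* iconv() takes a non-const pointer */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    if (rc != (size_t)-1)
        /* Flush any pending shift state (a no-op for 8-bit charsets). */
        rc = iconv(cd, NULL, NULL, &dst, &dst_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;          /* invalid input or buffer too small */
    return out_cap - dst_left;
}
```

With TargetCharset set to "as-is" a Latin-1 document and a Latin-2
document would each be indexed byte for byte, while any other value
would send both through iconv to the one target charset.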
All other open source engines have similar full 8-bit support.
ht://dig has a "translate_latin1" attribute for conversion to latin1. If
set to false, it acts as I described: "as-is" conversion.
Received on Sun Dec 14 08:03:35 2003