Re: [SWISH-E:417] Swish-e: Problems with Non-ASCII-Chars (e.g. German Umlaut)

From: Dirk-Willem van Gulik <dirk.vangulik(at)>
Date: Mon Aug 10 1998 - 15:45:25 GMT
On Mon, 10 Aug 1998, Rainer Scherg RTC wrote:

> I've made some enhancements to swish-e 1.1 to index Non-Text or HTML files
> (e.g. to get PDF-files indexed) [I've sent the code changes to Roy].

> But I've got some problems searching words with German umlauts in the swish 
> database. The problem also occurs when searching for words (with umlauts) 
> in simple html pages.
> The non ascii chars are done using ISO coding within the cgi search script / 
> html form.
> Are there any known problems or any solutions?

	We use the C3 API to 'solve' this by first normalizing any
Unicode input (UTF-8, UTF-7). Then we use it again to convert everything
into 7-bit ASCII using a look-alike conversion, also part of the C3 API.
So things like the 'u-umlaut' become a 'u' (rather than the sound-alike
conversion, which gives you 'ue').

	This is the text we index.

	We do the same magic to the search string. Though not very
beautiful, it does kind of work :-)
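
	For illustration only, here is a rough sketch of the look-alike
folding idea in Python. It does not use the C3 API (its calls are not
shown in this thread); it swaps in the standard unicodedata module, and
the names fold_to_ascii, index_words and matches are made up for this
example. It assumes the input has already been decoded to Unicode.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Look-alike fold: map accented letters to their base ASCII letter."""
    # Decompose characters, e.g. 'ü' becomes 'u' + a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks so only the base letters remain.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Discard whatever is still outside 7-bit ASCII. Characters without a
    # decomposition (such as 'ß') are simply dropped in this sketch.
    return stripped.encode("ascii", "ignore").decode("ascii")


def index_words(text: str) -> set:
    """Hypothetical indexer: store folded, lowercased words."""
    return {fold_to_ascii(word).lower() for word in text.split()}


def matches(query: str, index: set) -> bool:
    """Apply the same folding to the search string before the lookup."""
    return fold_to_ascii(query).lower() in index


index = index_words("Familie Müller wohnt in Köln")
print(matches("Müller", index))
print(matches("muller", index))
```

	Because the document text and the query go through the same
function, a search for 'Müller' and one for 'Muller' hit the same index
entry. The price is that words differing only in their accents can no
longer be told apart.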

