Re: Indexing UTF-8 IIS Pages

From: Bill Moseley <moseley(at)>
Date: Wed Aug 04 2004 - 14:34:48 GMT
On Wed, Aug 04, 2004 at 04:50:46AM -0700, wrote:
> Hi everybody,
> I am trying to spider an IIS 6.0 server that delivers pages with utf-8 in
> the HTTP header. As far as I understand the manual, swish-e converts UTF-8
> to ISO-8859-1 if I use libxml2 (the HTML2 parser). Unfortunately, special
> characters such as German umlauts are not recognized when I search through
> the swish.cgi front end. Results containing umlauts are also not displayed
> correctly. swish-e runs on a Sun E450 with Solaris 5.8. Any ideas?

Basically what Peter said.  One thing you should try: while spidering
and indexing a few small test files, use the options

    -T parsed_words indexed_words 

which will show you which white-space-separated words are being fed to
swish and how they are converted into the words stored in the index (via
the WordCharacters setting).
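
To make the two stages concrete, here is a rough Python sketch of the
idea. It is a simplified model, not swish-e's actual code: the function
names, the lowercasing step, and the two character sets are assumptions
chosen for illustration. It shows how a word containing an umlaut could
be broken apart if the umlaut is not among the allowed word characters.

```python
def parsed_words(text):
    """Stage 1 (model): split the input on white space."""
    return text.split()


def indexed_words(raw_words, word_characters):
    """Stage 2 (model): lowercase each raw word and break it wherever a
    character is not in word_characters. This only approximates the
    effect of a WordCharacters-style setting."""
    allowed = set(word_characters)
    result = []
    for raw in raw_words:
        current = []
        for ch in raw.lower():
            if ch in allowed:
                current.append(ch)
            elif current:
                # A disallowed character ends the current word.
                result.append("".join(current))
                current = []
        if current:
            result.append("".join(current))
    return result


# Hypothetical character sets, for illustration only.
ASCII_ONLY = "abcdefghijklmnopqrstuvwxyz0123456789"
WITH_UMLAUTS = ASCII_ONLY + "äöüß"

raw = parsed_words("Müller kauft Brötchen")
print("parsed: ", raw)
print("ascii:  ", indexed_words(raw, ASCII_ONLY))
print("umlauts:", indexed_words(raw, WITH_UMLAUTS))
```

In this model the ASCII-only set turns "Müller" into the fragments "m"
and "ller", so a search for "müller" would find nothing. Comparing the
two trace outputs for a small test file is a way to check whether
something similar is happening in the real index.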

Bill Moseley

Received on Wed Aug 4 07:35:06 2004