Re: Indexing UTF-8 IIS Pages

From: Bill Moseley <moseley(at)>
Date: Wed Aug 04 2004 - 14:34:48 GMT
On Wed, Aug 04, 2004 at 04:50:46AM -0700, wrote:
> Hi everybody,
> I am trying to spider an IIS 6.0 server that delivers pages with UTF-8 in
> the HTTP header. As far as I understood the manual, swish-e converts UTF-8
> to ISO-8859-1 if I use libxml2 (the HTML2 parser). Unfortunately, special
> characters like German umlauts are not recognized when I search through the
> swish.cgi frontend. Results with umlauts are also not displayed correctly.
> swish-e runs on a Sun E450 with Solaris 5.8. Any ideas?

Basically what Peter said.  One thing you should try: while spidering
and indexing a few small test files, add the options

    -T parsed_words indexed_words 

which will show you which white-space separated words are being fed to
swish and how they are converted into the words stored in the index (via
the WordCharacters setting).
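
To make those two stages concrete, here is a rough Python sketch of the idea.
It is not swish-e's actual code: the function names and both character sets
are made up for illustration. It assumes the page is decoded from UTF-8 and
squeezed into ISO-8859-1, split on white space (the "parsed" words), and then
each word is broken at any character that is not in the configured word
characters (the "indexed" words).

```python
# Illustrative only: mimics the parsed-word / indexed-word idea,
# NOT swish-e's real implementation. All names here are hypothetical.

# Hypothetical character sets standing in for a WordCharacters setting.
ASCII_WORD_CHARS = set("abcdefghijklmnopqrstuvwxyz0123456789")
LATIN1_WORD_CHARS = ASCII_WORD_CHARS | set("äöüß")


def parsed_words(utf8_bytes):
    """Decode UTF-8, reduce to ISO-8859-1, and split on white space."""
    text = utf8_bytes.decode("utf-8")
    # Characters that do not exist in ISO-8859-1 become "?".
    latin1 = text.encode("iso-8859-1", errors="replace").decode("iso-8859-1")
    return latin1.lower().split()


def indexed_words(word, word_chars):
    """Break a parsed word at every character outside word_chars."""
    pieces, current = [], []
    for ch in word:
        if ch in word_chars:
            current.append(ch)
        elif current:
            pieces.append("".join(current))
            current = []
    if current:
        pieces.append("".join(current))
    return pieces


def trace(utf8_bytes, word_chars):
    """Return (parsed word, indexed pieces) pairs for a small test page."""
    return [(w, indexed_words(w, word_chars)) for w in parsed_words(utf8_bytes)]
```

Calling trace("Grüße aus München".encode("utf-8"), ASCII_WORD_CHARS) breaks
"münchen" into "m" and "nchen", while LATIN1_WORD_CHARS keeps it whole. If
the real trace output shows words broken up like that, the word character
settings are the first place to look.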

Bill Moseley

Received on Wed Aug 4 07:35:06 2004