Re: Indexing UTF-8 IIS Pages

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Aug 04 2004 - 14:34:48 GMT
On Wed, Aug 04, 2004 at 04:50:46AM -0700, Mammitzsch.T@zdf.de wrote:
> Hi everybody,
> 
> I am trying to spider an IIS 6.0 server that delivers pages with UTF-8
> declared in the HTTP header. As far as I understand the manual, swish-e
> converts UTF-8 to ISO-8859-1 if I use libxml2 (the HTML2 parser).
> Unfortunately, special characters such as German umlauts are not
> recognized when I search through the swish.cgi front end, and results
> containing umlauts are not displayed correctly. swish-e runs on a
> Sun E450 with Solaris 5.8. Any ideas?

Basically what Peter said.  One thing to try: while indexing and
spidering a few small test files, add the options

    -T parsed_words indexed_words 

These show which whitespace-separated words are being fed to swish-e
and how they are converted into the words stored in the index (via the
WordCharacters setting).
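
When reading that debug output, it may help to know what an unconverted
umlaut looks like.  The following is only a rough Python sketch of the
byte-level difference, not anything swish-e itself runs; the sample word
is made up for illustration and does not come from the pages in question.

```python
# Hypothetical sample word ("Mueller" spelled with a u-umlaut), for illustration only.
word = "M\u00fcller"

# Bytes as a server declaring UTF-8 would send them: the umlaut takes two bytes.
utf8_bytes = word.encode("utf-8")

# Bytes after a correct UTF-8 -> ISO-8859-1 conversion: the umlaut takes one byte.
latin1_bytes = word.encode("iso-8859-1")

# What the word looks like if the UTF-8 bytes are mistaken for ISO-8859-1:
# the single umlaut shows up as two unrelated characters.
misread = utf8_bytes.decode("iso-8859-1")

print(len(utf8_bytes), len(latin1_bytes))
print(misread)
```

If the words shown in the debug output resemble the misread form (two odd
characters where one umlaut should be), that would suggest the UTF-8 input
is not being converted before the words are built.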

-- 
Bill Moseley
moseley@hank.org

Received on Wed Aug 4 07:35:06 2004