Re: Indexing UTF-8 IIS Pages

From: Peter Karman <karman(at)not-real.cray.com>
Date: Wed Aug 04 2004 - 13:52:46 GMT
I believe a search of the discussion archives will tell you that
supporting UTF-8 (and other Unicode encodings) would require a
significant recoding of swish-e. So far, no one has stepped forward to
do that.

In your case, if the UTF-8 characters in your pages have direct
ISO-8859-1 equivalents, you might experiment with the WordCharacters and
TranslateCharacters config settings. That way characters in the upper
ISO-8859-1 range (values of 128 and above) might work for you.
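
To see which characters in a page actually fall into that category, a
small script can help. The following is only a rough sketch in Python 3,
not part of swish-e; the function names and the sample string are made up
for illustration. It splits the characters of a text into those that
ISO-8859-1 can represent and those it cannot, and then picks out the ones
in the upper range, which are the candidates you would want to look at
when adjusting WordCharacters.

```python
def split_by_latin1(text):
    """Separate characters ISO-8859-1 can represent from those it cannot."""
    representable, missing = [], []
    for ch in sorted(set(text)):
        try:
            ch.encode("iso-8859-1")
            representable.append(ch)
        except UnicodeEncodeError:
            missing.append(ch)
    return representable, missing


def high_range(chars):
    """Keep only characters whose ISO-8859-1 byte value is 128 or above."""
    return [ch for ch in chars if ch.encode("iso-8859-1")[0] >= 128]


# Hypothetical sample text; substitute the decoded body of one of your pages.
sample = "Grüße aus Köln, 5 €"
ok, missing = split_by_latin1(sample)
print("upper-range ISO-8859-1 characters:", "".join(high_range(ok)))
print("no ISO-8859-1 equivalent:", "".join(missing))
```

Anything reported as having no ISO-8859-1 equivalent would not be helped
by those two settings, as far as I can tell.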

Please, someone with better encoding knowhow than me, correct this if it 
is wrong.

Mammitzsch.T@zdf.de wrote on 8/4/04 6:51 AM:

> Hi everybody,
> 
> I am trying to spider an IIS 6.0 server which delivers pages with UTF-8 in
> the HTTP header. As far as I understood the manual, swish-e converts UTF-8
> to ISO-8859-1 if I use libxml2 (the HTML2 parser). Unfortunately, special
> characters like German umlauts are not recognized if I search through the
> swish.cgi frontend. Also, results with umlauts are not displayed correctly.
> swish-e runs on a Sun E450 with Solaris 5.8. Any ideas?
> 
> best regards,
> 
> _______________________________________ 
> 
> Thomas Mammitzsch

-- 
Peter Karman - Software Publications Engineer - Cray Inc
phone: 651-605-9009 - mailto:karman@cray.com
Received on Wed Aug 4 06:52:59 2004