AW: Indexing UTF-8 IIS Pages

From: <Mammitzsch.T(at)>
Date: Wed Aug 04 2004 - 16:00:37 GMT
> On Wed, Aug 04, 2004 at 04:50:46AM -0700, wrote:
> > Hi everybody,
> > 
> > i try to spider an IIS 6.0 which delivers pages with utf-8 in the
> > http-header. As far as i understood the manual, swish-e converts utf-8 to
> > iso-8859-1 if i use libxml2 (html2-parser). Unfortunately special chars like
> > german umlauts are not recognized if i search through the swish.cgi
> > frontend. Also results with umlauts are not displayed correctly. swish-e
> > runs on a sun e450 with solaris 5.8. Any ideas?
>
> Basically what Peter said.  One thing you should try is while indexing
> and spidering (a few small test files) use the options 
>     -T parsed_words indexed_words 
> which will show you what white-space separated words are being fed to
> swish and how they are converted into words stored in the index (via
> WordCharacters setting).
OK, the indexer reported, for example:

White-space found word 'Saarbrucken'
    Adding:[648:swishdefault(1)]   'saarbrucken'   Pos:397  Stuct:0x9 ( BODY

That looks good to me, but searching for saarbrucken returns lots of results
in which "saarbrucken" does not appear.
Other words with umlauts return no results (except one PDF which I found).

Why doesn't it work when searching?
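
One thing that might be worth ruling out (an assumption, not something confirmed in this thread) is an encoding mismatch between the index and the query: if the index holds ISO-8859-1 words but the search form submits the query as UTF-8, a word with an umlaut will not match byte for byte. The following is only a small Python sketch of that difference; the word and the helper name `to_latin1` are made up for illustration and are not part of swish-e or swish.cgi.

```python
# Illustration only: an assumed example word containing an umlaut.
word = "Saarbrücken"

utf8_bytes = word.encode("utf-8")         # bytes a UTF-8 page or form would carry
latin1_bytes = word.encode("iso-8859-1")  # bytes an ISO-8859-1 index would hold

# The same word is a different byte sequence in each encoding,
# so a raw comparison between the two fails.
print(utf8_bytes)
print(latin1_bytes)

# Reading UTF-8 bytes as if they were ISO-8859-1 yields the typical
# two-character garbage in place of the umlaut.
misread = utf8_bytes.decode("iso-8859-1")
print(misread)


def to_latin1(raw: bytes) -> bytes:
    """Hypothetical helper: convert a UTF-8 query to ISO-8859-1 bytes.

    Characters that have no ISO-8859-1 equivalent are replaced with '?'.
    """
    return raw.decode("utf-8").encode("iso-8859-1", errors="replace")


# After conversion the query bytes equal the ISO-8859-1 form of the word.
print(to_latin1(utf8_bytes) == latin1_bytes)
```

If the garbled form printed by `misread` resembles what the result page shows, that would point at the display or query side rather than at the indexer.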

bye, Thomas Mammitzsch
Received on Wed Aug 4 09:01:39 2004