On Wed, Aug 04, 2004 at 05:58:37PM +0200, Mammitzsch.T@zdf.de wrote:
> > On Wed, Aug 04, 2004 at 04:50:46AM -0700, Mammitzsch.T@zdf.de wrote:
> > > Hi everybody,
> > >
> > > i try to spider an IIS 6.0 which delivers pages with utf-8 in the
> > > http-header. As far as i understood the manual, swish-e converts utf-8 to
> > > iso-8859-1 if i use libxml2 (html2-parser). Unfortunately special chars like
> > > german umlauts are not recognized if i search through the swish.cgi
> > > frontend. Also results with umlauts are not displayed correctly. swish-e
> > > runs on a sun e450 with solaris 5.8. Any ideas?
> > Basically what Peter said. One thing you should try is while indexing
> > and spidering (a few small test files) use the options
> > -T parsed_words indexed_words
> > which will show you what white-space separated words are being fed to
> > swish and how they are converted into words stored in the index (via
> > WordCharacters setting).
> ok, indexer did e.g.
> White-space found word 'Saarbrucken'
> Adding:[648:swishdefault(1)] 'saarbrucken' Pos:397 Stuct:0x9 ( BODY FILE )
> looks good for me, but searching for saarbrucken returns lots of results
> where "saarbrucken" is not included.
So the umlauts got stripped along the way. But, you can see them when
moseley@bumby:~$ cat txt
moseley@bumby:~$ swish-e -i txt -v0 -T indexed_words
Adding:[1:swishdefault(1)] 'saarbrücken' Pos:5 Stuct:0x9 ( BODY FILE )
moseley@bumby:~$ swish-e -w saarbrücken -H0
1000 txt "txt" 13
moseley@bumby:~$ swish-e -w saarbrücken -H9 | grep Parsed
# Parsed Words: saarbrücke
> other words with umlauts return no results (except 1 pdf which i found).
Well, the -T option shows what text is placed in the index. And
"Parsed Words" shows what words are searched for in the index. Those
two things will help you figure out why you can or cannot search for
I suppose there could be some encoding issue, but even if that
were true I would not expect it to matter unless you are
somehow using different encodings when indexing and when searching.
I've never seen that to be the case.
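
To make the encoding point concrete, here is a minimal sketch in plain
Python (not swish-e code; the word is just the one from this thread). It
only shows that the same word is a different byte sequence in UTF-8 and in
ISO-8859-1, so an index written in one encoding would not match a query
sent in the other.

```python
# Illustration only: plain Python, not part of swish-e.
word = "saarbrücken"

utf8_bytes = word.encode("utf-8")         # 'ü' becomes two bytes: 0xC3 0xBC
latin1_bytes = word.encode("iso-8859-1")  # 'ü' becomes one byte: 0xFC

print(utf8_bytes)
print(latin1_bytes)

# A byte-wise comparison fails when the two sides use different encodings.
print(utf8_bytes == latin1_bytes)

# Reading UTF-8 bytes as if they were ISO-8859-1 gives the typical
# mangled display of umlauts.
mojibake = utf8_bytes.decode("iso-8859-1")
print(mojibake)
```

If search results show umlauts as two odd characters, that is a hint the
display side is treating UTF-8 bytes as ISO-8859-1.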
> why isn't it working when searching?
That's something you need to answer by using the debugging options.
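
One way to do that comparison, sketched in Python under some assumptions:
the "Adding:" line is the one quoted earlier in this thread, the
"# Parsed Words:" line is a hypothetical query header made up for the
example, and the helper names are my own, not anything swish-e provides.
The script only pulls the words out of both kinds of debug output and
reports which searched words are missing from the index.

```python
import re

# Indexer line as quoted in this thread (from -T indexed_words).
index_output = (
    "Adding:[648:swishdefault(1)] 'saarbrucken' Pos:397 Stuct:0x9 ( BODY FILE )"
)
# Hypothetical search header, assumed for illustration only.
search_output = "# Parsed Words: saarbrücken"


def indexed_words(text):
    """Collect the quoted words from 'Adding:[...]' debug lines."""
    return set(re.findall(r"^Adding:\[[^\]]*\]\s+'([^']*)'", text, flags=re.M))


def parsed_words(text):
    """Return the words listed on a '# Parsed Words:' header line."""
    match = re.search(r"^# Parsed Words:\s*(.*)$", text, flags=re.M)
    return match.group(1).split() if match else []


def missing_words(index_text, search_text):
    """List searched words that never appear among the indexed words."""
    stored = indexed_words(index_text)
    return [w for w in parsed_words(search_text) if w not in stored]


print(missing_words(index_output, search_output))
```

With these sample lines the query word carries the umlaut while the
indexed word does not, so the word is reported as missing.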
Received on Wed Aug 4 10:38:50 2004