Hi all,
I've just found the following unpredictable/strange/buggy behavior of the HTML2
parser when parsing iso-8859-2 documents.
Indexed document:
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=iso-8859-2">
<title></title>
</head>
<body>
bałałajka
</body>
</html>
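
To reproduce this, the test file has to be saved as iso-8859-2 bytes and not as UTF-8. Here is a small Python sketch that writes it; the file name test-latin2.html and the helper build_document are my own choices for illustration, not anything swish-e requires.

```python
# Hypothetical helper for reproducing the report; names are arbitrary.
HTML = """<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=iso-8859-2">
<title></title>
</head>
<body>
bałałajka
</body>
</html>
"""


def build_document(text=HTML):
    # Encode as ISO-8859-2 so that each "ł" becomes the single byte 0xB3.
    return text.encode("iso-8859-2")


if __name__ == "__main__":
    data = build_document()
    with open("test-latin2.html", "wb") as fh:
        fh.write(data)
    # Report how many 0xB3 bytes (the letter "ł") ended up in the file.
    print(data.count(b"\xb3"), "bytes with value 0xB3 written")
```

Checking the byte count is a quick way to confirm the editor did not silently save the file in another encoding.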
swish-e -T INDEXED_WORDS gives different results in similar situations.
In each of them the iso-8859-2 characters are ignored
(WordChars is correct in my conf file; all iso-8859-2 chars are included).
------------------------------------------------------------
1. When the body is of the form:
<body>
bałałajka
</body>
swish-e gives
Adding:[1:swishdefault(1)] 'ba' Pos:4 Stuct:0x9 ( BODY FILE )
2. When the body is of the form:
<body>
any word bałałajka
</body>
swish-e gives
Adding:[1:swishdefault(1)] 'any' Pos:4 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'word' Pos:5 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'ba' Pos:6 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'a' Pos:7 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'ajka' Pos:8 Stuct:0x9 ( BODY FILE )
3.
<body>
bałałajka
word
</body>
gives
Adding:[1:swishdefault(1)] 'ba' Pos:4 Stuct:0x9 ( BODY FILE )
4.
<body>
bałałajka
any word
</body>
gives
Adding:[1:swishdefault(1)] 'ba' Pos:4 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'a' Pos:5 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'ajka' Pos:6 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'any' Pos:7 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'word' Pos:8 Stuct:0x9 ( BODY FILE )
etc.
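
The output in cases 2 and 4 looks like what you would get if ł were treated as a non-word character. Below is a minimal Python sketch of that kind of splitting. It is only an illustration: WORD_CHARS and split_words are made-up names, this is not swish-e code, and it does not explain cases 1 and 3, where only 'ba' is indexed.

```python
# Illustration only: WORD_CHARS and split_words are made-up names,
# not part of swish-e.
import string

# Word characters WITHOUT the iso-8859-2 letters: plain ASCII letters and digits.
WORD_CHARS = set(string.ascii_letters + string.digits)


def split_words(text, word_chars=WORD_CHARS):
    """Split text into runs of word characters; anything else ends a word."""
    words, current = [], []
    for ch in text:
        if ch in word_chars:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


if __name__ == "__main__":
    print(split_words("any word bałałajka"))
    print(split_words("bałałajka\nany word"))
```

With ł left out of the word characters, the two calls yield ba / a / ajka around the other words, in the same order as the Adding: lines in cases 2 and 4.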
The strange thing is that everything is perfect when I replace
<meta http-equiv="content-type" content="text/html; charset=iso-8859-2">
by
<meta http-equiv="content-type" content="text/html">
in the head of the document. Then swish-e gives:
Adding:[1:swishdefault(1)] 'bałałajka' Pos:4 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'any' Pos:5 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'word' Pos:6 Stuct:0x9 ( BODY FILE )
regards
Received on Fri Sep 13 13:57:10 2002