On Sat, Dec 06, 2003 at 03:19:58AM -0800, John Angel wrote:
> Hi Bill,
>
> Specific example is if you try to index word containing ASCII 154 char.
That's a Windows extension to 8859-1, as far as I know. I would not be
surprised to find that libxml2 didn't support it.
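To make the "Windows extension" point concrete, here is a small Python sketch (not part of swish-e; the variable names are just for illustration) that decodes byte 154 under both encodings. In ISO-8859-1 the byte falls in the C1 control range, while Windows-1252 assigns it a printable letter.

```python
# Byte 154 (0x9a) decoded under two single-byte encodings.
raw = bytes([0x9A])

as_cp1252 = raw.decode("cp1252")    # Windows-1252: a printable letter
as_latin1 = raw.decode("latin-1")   # ISO-8859-1: a C1 control character

print(f"cp1252:  U+{ord(as_cp1252):04X}")
print(f"latin-1: U+{ord(as_latin1):04X}")
```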
But it seems like I'm indexing with that character without any problem:
moseley@bumby:~$ swish-e -V
SWISH-E 2.4.1
Ok, I have a file "word" that contains that character:
moseley@bumby:~$ hexdump -C word
00000000 61 62 63 9a 64 65 66 0a |abc.def.|
                  ^^
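For anyone wanting to reproduce the test file, a minimal Python sketch follows. It writes the same eight bytes shown in the hexdump above into a temporary directory; the temporary location is an assumption made so the example does not overwrite anything.

```python
import tempfile
from pathlib import Path

# Write "abc", the raw byte 0x9a, "def" and a newline, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "word"
    path.write_bytes(b"abc\x9adef\n")
    data = path.read_bytes()

# Space-separated hex, comparable to the hexdump -C output.
print(data.hex(" "))
```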
And here's the config file:
moseley@bumby:~$ cat c
WordCharacters abcšdef
BeginCharacters a
EndCharacters f
Here it is with hexdump:
moseley@bumby:~$ hexdump -C c
00000000 0a 57 6f 72 64 43 68 61 72 61 63 74 65 72 73 20 |.WordCharacters |
00000010 61 62 63 9a 64 65 66 0a 42 65 67 69 6e 43 68 61 |abc.def.BeginCha|
00000020 72 61 63 74 65 72 73 20 61 0a 45 6e 64 43 68 61 |racters a.EndCha|
00000030 72 61 63 74 65 72 73 20 66 0a |racters f.|
0000003a
Now index. You can see that it is indeed indexed (9a is your character):
moseley@bumby:~$ swish-e -i word -T indexed_words -v0 -c c | hexdump -C
00000000 20 20 20 20 41 64 64 69 6e 67 3a 5b 31 3a 73 77 | Adding:[1:sw|
00000010 69 73 68 64 65 66 61 75 6c 74 28 31 29 5d 20 20 |ishdefault(1)] |
00000020 20 27 61 62 63 9a 64 65 66 27 20 20 20 50 6f 73 | 'abc.def' Pos|
00000030 3a 32 20 20 53 74 75 63 74 3a 30 78 39 20 28 20 |:2 Stuct:0x9 ( |
00000040 42 4f 44 59 20 46 49 4c 45 20 29 0a |BODY FILE ).|
0000004c
Now try searching:
moseley@bumby:~$ perl -le '$word = "abc".chr(154)."def"; print `swish-e -w $word -H0`'
1000 word "word" 8
So it found the word.
This doesn't find it (different character):
moseley@bumby:~$ perl -le '$word = "abc".chr(153)."def"; print `swish-e -w $word`'
# SWISH format: 2.4.1
# Search words: abc™def
# Removed stopwords:
err: no results
.
> I assume the problem is conversion to Latin1 - as you said, it is not 100%
> 8-bit clean. Is there some other function we could use to translate UTF-8 to
> 8-bit chars, instead of UTF8Toisolat1()? Or even better - to completely
> avoid conversion to UTF-8, but to leave every char as it is originally.
libxml2 works with UTF-8, and there's nothing I can do about that. For
really 8-bit-clean indexing you might try indexing with HTML (not HTML2
or HTML*), which will use the old built-in (and reasonably broken) HTML
parser.
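As a rough illustration of why a UTF-8 to Latin-1 step is not 8-bit clean, here is a Python sketch. The helper utf8_to_latin1 is a hypothetical stand-in, not libxml2's UTF8Toisolat1(), and it assumes the failure comes from the character having been read as Windows-1252 (U+0161) rather than as the Latin-1 control character (U+009A).

```python
def utf8_to_latin1(data: bytes) -> bytes:
    """Hypothetical stand-in for a UTF-8 -> ISO-8859-1 conversion step."""
    return data.decode("utf-8").encode("latin-1")

# Byte 0x9a read as ISO-8859-1 becomes U+009A, which round-trips.
from_latin1 = "abc\u009adef".encode("utf-8")
print(utf8_to_latin1(from_latin1))

# Byte 0x9a read as Windows-1252 becomes U+0161, which Latin-1 cannot hold.
from_cp1252 = "abc\u0161def".encode("utf-8")
try:
    utf8_to_latin1(from_cp1252)
except UnicodeEncodeError as exc:
    print("not representable in Latin-1:", exc.reason)
```

So whether the character survives depends on which encoding the document was read as before the conversion back to 8-bit.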
--
Bill Moseley
moseley@hank.org
Received on Sat Dec 6 15:37:36 2003