Re: 8-bit chars

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sat Dec 06 2003 - 15:37:32 GMT

On Sat, Dec 06, 2003 at 03:19:58AM -0800, John Angel wrote:
> Hi Bill,
> 
> Specific example is if you try to index word containing ASCII 154 char.

That's a Windows extension to 8859-1, as far as I know.  I would not be
surprised to find that libxml2 didn't support it.
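
To make the encoding point concrete, here is a small Python sketch (not part of
the original test; it only uses Python's standard codecs). It assumes the
"Windows extension" in question is code page 1252, and shows how the same byte
154 (0x9a) is read under each encoding:

```python
import unicodedata

raw = bytes([0x9A])  # the byte under discussion (decimal 154)

# Assumption: the Windows extension meant here is code page 1252.
as_cp1252 = raw.decode("cp1252")
# ISO-8859-1 maps every byte straight to the code point of the same value.
as_latin1 = raw.decode("latin-1")

print(repr(as_cp1252), unicodedata.name(as_cp1252))
print(repr(as_latin1), unicodedata.category(as_latin1))
```

Under cp1252 the byte is a printable letter (U+0161); under strict ISO-8859-1
it lands in the C1 control range (category "Cc"), which is why tools that
assume plain Latin-1 may not treat it as a word character.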

But it seems like I'm indexing with that character without any problem:

moseley@bumby:~$ swish-e -V
SWISH-E 2.4.1

Ok, I have a file "word" that contains that character:

moseley@bumby:~$ hexdump -C word
00000000  61 62 63 9a 64 65 66 0a                           |abc.def.|
                   ^^
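
For anyone wanting to reproduce the test file, a minimal Python sketch follows.
It writes the same eight bytes shown in the hexdump above into a temporary
file (the temporary path is an assumption for illustration; the original file
is simply named "word") and reads them back:

```python
import os
import tempfile

# Same bytes as the hexdump: "abc", byte 0x9a, "def", newline.
data = b"abc" + bytes([0x9A]) + b"def\n"

# Write in binary mode so no text encoding is applied to the 0x9a byte.
path = os.path.join(tempfile.mkdtemp(), "word")
with open(path, "wb") as fh:
    fh.write(data)

with open(path, "rb") as fh:
    read_back = fh.read()

hex_view = read_back.hex(" ")  # space-separated hex, similar to hexdump -C
print(hex_view)
```
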
And here's the config file:

moseley@bumby:~$ cat c

WordCharacters abcdef
BeginCharacters a
EndCharacters f

Here it is with hexdump:

moseley@bumby:~$ hexdump -C c
00000000  0a 57 6f 72 64 43 68 61  72 61 63 74 65 72 73 20  |.WordCharacters |
00000010  61 62 63 9a 64 65 66 0a  42 65 67 69 6e 43 68 61  |abc.def.BeginCha|
00000020  72 61 63 74 65 72 73 20  61 0a 45 6e 64 43 68 61  |racters a.EndCha|
00000030  72 61 63 74 65 72 73 20  66 0a                    |racters f.|
0000003a

Now index it.  You can see that the word is indeed indexed (9a is your character):

moseley@bumby:~$ swish-e -i word -T indexed_words -v0 -c c | hexdump -C
00000000  20 20 20 20 41 64 64 69  6e 67 3a 5b 31 3a 73 77  |    Adding:[1:sw|
00000010  69 73 68 64 65 66 61 75  6c 74 28 31 29 5d 20 20  |ishdefault(1)]  |
00000020  20 27 61 62 63 9a 64 65  66 27 20 20 20 50 6f 73  | 'abc.def'   Pos|
00000030  3a 32 20 20 53 74 75 63  74 3a 30 78 39 20 28 20  |:2  Stuct:0x9 ( |
00000040  42 4f 44 59 20 46 49 4c  45 20 29 0a              |BODY FILE ).|
0000004c

Now try searching:

moseley@bumby:~$ perl -le '$word = "abc".chr(154)."def"; print `swish-e -w $word -H0`'
1000 word "word" 8

So it found the word.
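
The same search could be driven from Python; the sketch below is an assumption
about how one might do it, not something taken from swish-e's documentation.
The key detail is that the search word is built as raw bytes, so byte 154 is
passed through unchanged instead of being re-encoded as UTF-8. Only the -w and
-H0 options from the command above are used, and run_search is a hypothetical
helper that is defined but not called here:

```python
import subprocess

def build_search_argv(word: bytes) -> list:
    # Same options as the perl one-liner above: -w <word> -H0
    return [b"swish-e", b"-w", word, b"-H0"]

def run_search(argv: list) -> bytes:
    # Hypothetical helper: needs swish-e installed and an index in the
    # current directory, so it is not invoked in this sketch.
    return subprocess.run(argv, capture_output=True).stdout

# Raw bytes, equivalent to perl's "abc".chr(154)."def" written without a UTF-8 layer.
word = b"abc" + bytes([154]) + b"def"
argv = build_search_argv(word)
print(argv)
```

Passing bytes arguments to subprocess works on POSIX systems; on other
platforms the arguments would need different handling.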

This doesn't find it (different character):

moseley@bumby:~$ perl -le '$word = "abc".chr(153)."def"; print `swish-e -w $word`'
# SWISH format: 2.4.1
# Search words: abc™def
# Removed stopwords: 
err: no results
.

> I assume the problem is conversion to Latin1 - as you said, it is not 100% 
> 8-bit clean. Is there some other function we could use to translate UTF-8 to 
> 8-bit chars, instead of UTF8Toisolat1()? Or even better - to completely 
> avoid conversion to UTF-8, but to leave every char as it is originally.

libxml2 works with UTF-8 internally.  Nothing I can do about that.  For
truly 8-bit-clean indexing you might try indexing with HTML (not HTML2 or
HTML*), which will use the old built-in (and reasonably broken) HTML parser.
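
To illustrate why a UTF-8 to Latin-1 step cannot be fully 8-bit clean, here is
a rough Python sketch. It is only an analogy using Python's own codecs, not
libxml2's UTF8Toisolat1(), and the helper name utf8_to_latin1 is made up for
the example. A character that exists in Latin-1 converts fine, while U+0161
(what byte 154 becomes if the input is read as cp1252) has no Latin-1 byte:

```python
def utf8_to_latin1(data: bytes) -> bytes:
    # Hypothetical stand-in for a UTF-8 -> ISO-8859-1 conversion step.
    return data.decode("utf-8").encode("latin-1")

# U+00E9 (e with acute) exists in Latin-1, so it converts to one byte.
e_acute = utf8_to_latin1("\u00e9".encode("utf-8"))

# U+0161 (s with caron) is outside Latin-1, so the conversion fails.
s_caron_utf8 = "\u0161".encode("utf-8")
try:
    utf8_to_latin1(s_caron_utf8)
    converted = True
except UnicodeEncodeError:
    converted = False

print(e_acute, converted)
```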


-- 
Bill Moseley
moseley@hank.org
Received on Sat Dec 6 15:37:36 2003