Hi
I read the thread on indexing German umlauts and I have a similar
problem.
I made a Word document for testing.
The document contains the following two words:
Överskottslager
boy
When I run
swish-e -c swish_se.conf -i test.doc -T indexed_words -v0
I get the following:
Adding:[1:swishdocpath(11)] 'test' Pos:1 Stuct:0x1 ( FILE )
Adding:[1:swishdocpath(11)] 'doc' Pos:2 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'a' Pos:1 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'verskottslager' Pos:2 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'boy' Pos:3 Stuct:0x1 ( FILE )
It would appear that the Ö is being indexed as a separate word ('a')
and split off from the word it actually belongs to.
I have the following in my config file:
TranslateCharacters :ascii7:
WordCharacters &0123456789_abcdefghijklmnopqrstuvwxyzÅåÄäÖö
BeginCharacters &0123456789_abcdefghijklmnopqrstuvwxyzÅåÄäÖö
EndCharacters +0123456789_abcdefghijklmnopqrstuvwxyzÅåÄäÖö
Does anyone have any ideas as to what is doing this?
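I could use the answer from the previous thread and do something
like this:
for ($query) {    # normalize and trim the query string
    s/Ö/O/g;      # fold every Ö to O, not just the first
    s/\s+$//;     # strip trailing whitespace
    s/^\s+//;     # strip leading whitespace
}
but since the letter is being split off at index time, that doesn't
really help me.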
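For completeness, here is a slightly fuller version of that query-side
folding. It is only a minimal sketch: fold_query is a hypothetical name,
and it assumes the query is already a decoded Perl character string
rather than raw bytes. It only covers the search side and does nothing
about the split that happens while indexing.

```perl
use strict;
use warnings;
use utf8;    # this source file contains literal non-ASCII characters

# Hypothetical helper: fold Swedish letters to ASCII and trim whitespace.
# Assumes $query is a decoded character string, not raw bytes.
sub fold_query {
    my ($query) = @_;
    for ($query) {
        tr/ÅÄÖåäö/AAOaao/;    # fold every occurrence, both cases
        s/^\s+//;             # strip leading whitespace
        s/\s+$//;             # strip trailing whitespace
    }
    return $query;
}

binmode STDOUT, ':encoding(UTF-8)';
print fold_query("  Överskottslager boy  "), "\n";    # prints "Overskottslager boy"
```

This would only be useful if the indexed words were folded the same way,
so that 'Överskottslager' ends up in the index as one word.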
Thanks
Received on Mon Dec 12 12:09:22 2005