Re: Indexing umlauts

From: Thomas Nyman <thomas(at)>
Date: Tue Dec 13 2005 - 10:07:00 GMT
I have discovered the cause.

I run catdoc on Word documents with the following settings:

FileFilter .doc /usr/local/bin/catdoc "-b -s8859-1 -dutf-8 '%p'"

which works fine, generating output that displays umlauts correctly
in my browser, both for the content of Word docs and for filenames,
which are utf-8 encoded on my system. However, when the output is
utf-8 I get the splitting effect on umlauts for indexing.
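
A rough sketch of why utf-8 output could split words at umlauts. It assumes the indexer reads its input one byte at a time as iso-8859-1, which is an assumption and not something confirmed in this thread. The functions byte_tokens and fold_ascii and the word-character set are hypothetical stand-ins, not swish-e code. In utf-8, Ö is the two bytes c3 96; read as iso-8859-1 those become Ã followed by a control character, and the control character ends the word.

```python
import re
import unicodedata

WORD = "Överskottslager"

# Hypothetical word-character set: ASCII letters and digits plus the
# lowercase Latin-1 letters (0xDF-0xF6 and 0xF8-0xFF).
WORD_CHARS = re.compile(r"[a-z0-9\u00df-\u00f6\u00f8-\u00ff]+")


def byte_tokens(data: bytes) -> list[str]:
    """Tokenize as if every byte were exactly one iso-8859-1 character."""
    text = data.decode("latin-1").lower()
    return WORD_CHARS.findall(text)


def fold_ascii(token: str) -> str:
    """Strip accents, roughly what a 7-bit ASCII translation would do."""
    decomposed = unicodedata.normalize("NFKD", token)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


# One byte for the umlaut: the word stays whole.
print(byte_tokens(WORD.encode("latin-1")))
# Two bytes for the umlaut: the second byte is not a word character.
print(byte_tokens(WORD.encode("utf-8")))
print([fold_ascii(t) for t in byte_tokens(WORD.encode("utf-8"))])
```

If an accent-folding translation is active in the config, the stray first byte would fold to a plain 'a', which would be consistent with the 'a' / 'verskottslager' pair in the quoted debug output further down.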

If I set catdoc's output to iso-8859-1 I have no issues with
splitting, but then umlauts in filenames will not display correctly in
the browser. In other words, the body of the indexed document displays
fine when searching, but I'm back to my issue with filenames. The only
way I can see filenames correctly is if the output sent to the browser
is in utf-8 format.
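
A small illustration of the display mismatch, under the assumption that the results page mixes fields in two encodings while the browser decodes the whole page with a single charset. The filename and the helper shown_as are hypothetical, and Python's decoder only approximates what a browser does.

```python
NAME = "Överskottslager.doc"  # hypothetical filename

latin1_bytes = NAME.encode("latin-1")  # umlaut stored as one byte (d6)
utf8_bytes = NAME.encode("utf-8")      # umlaut stored as two bytes (c3 96)


def shown_as(data: bytes, page_charset: str) -> str:
    """Decode `data` the way a page declared as `page_charset` would."""
    return data.decode(page_charset, errors="replace")


# iso-8859-1 bytes on a utf-8 page: the umlaut becomes a replacement character.
print(repr(shown_as(latin1_bytes, "utf-8")))
# utf-8 bytes on an iso-8859-1 page: the umlaut becomes two characters.
print(repr(shown_as(utf8_bytes, "latin-1")))
```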

As things stand, it looks like I have two options. Either I get the
word splitting effect, with correct display of both filenames and body
text in the swish.cgi output, but then of course I am unable to search
using umlauts. Or I avoid the splitting, and filenames containing
umlauts display incorrectly.

I'm hoping there are some Germans out there who have looked at this
problem. I thought I had it solved, but as it turns out there is a
little "aber".

What I need is a way to index utf-8 output correctly, or perhaps do a
substitution in the query, but I will admit I do not know what to
substitute. My unicode/encoding knowledge is very limited.
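
One possible shape for the "substitute in the query" idea, sketched in Python only to show the byte-level conversion (swish.cgi itself is Perl). It assumes the index was built from iso-8859-1 text while the search page is served as utf-8. Both helper names are hypothetical, and nothing here is an existing swish-e option.

```python
def query_to_index_encoding(query: str) -> bytes:
    """Encode a decoded utf-8 form query as iso-8859-1 for searching.

    Characters that iso-8859-1 cannot represent become '?'.
    """
    return query.encode("latin-1", errors="replace")


def result_to_page_encoding(field: bytes) -> bytes:
    """Re-encode an iso-8859-1 field from the index as utf-8 for display."""
    return field.decode("latin-1").encode("utf-8")


# The query arrives from the browser as utf-8 bytes.
raw_query = "Överskottslager".encode("utf-8")
index_query = query_to_index_encoding(raw_query.decode("utf-8"))
print(index_query.hex(" "))

# A field read back from the index is converted before it is printed.
print(result_to_page_encoding(index_query).decode("utf-8"))
```

The point of converting at both edges is that the index only ever sees one byte per umlaut, while the browser only ever sees utf-8.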

I always thought Perl worked internally with utf-8 and it should
therefore not be a problem.

I guess my question would be: when the script goes through and
indexes the output of catdoc, what determines the character encoding?
Is there a variable I can set to use utf-8 or convert utf-8 to

I can't help but think that this problem I'm having must have a
solution. I can't imagine there not being international users who have
encountered the same issues.

Just for the record .. I think swish-e is great !!

On 13 Dec 2005, at 02:37, Bill Moseley wrote:

> On Mon, Dec 12, 2005 at 12:03:22PM -0800, Thomas Nyman wrote:
>> I made a word document for testing.
>> The document contains the following two words
>> Överskottslager
>> boy
>> when i run swish-e -c swish_se.conf -i test.doc -T indexed_words -v0
>> i get the following
>> Adding:[1:swishdocpath(11)]   'test'   Pos:1  Stuct:0x1 ( FILE )
>>      Adding:[1:swishdocpath(11)]   'doc'   Pos:2  Stuct:0x1 ( FILE )
>>      Adding:[1:swishdefault(1)]   'a'   Pos:1  Stuct:0x1 ( FILE )
>>      Adding:[1:swishdefault(1)]   'verskottslager'   Pos:2  Stuct:0x1
>> ( FILE )
>>      Adding:[1:swishdefault(1)]   'boy'   Pos:3  Stuct:0x1 ( FILE )
> Odd, works for me.
> moseley@bumby:~$ cat word
> Överskottslager
> boy
> moseley@bumby:~$ swish-e -i word -T indexed_words -v0
>     Adding:[1:swishdefault(1)]   'överskottslager'   Pos:5  Stuct: 
> 0x9 ( BODY FILE )
>     Adding:[1:swishdefault(1)]   'boy'   Pos:6  Stuct:0x9 ( BODY  
> FILE )
> moseley@bumby:~$ cat c
> TranslateCharacters :ascii7:
> moseley@bumby:~$ swish-e -i word -T indexed_words -c c  -v0
>     Adding:[1:swishdefault(1)]   'overskottslager'   Pos:5  Stuct: 
> 0x9 ( BODY FILE )
>     Adding:[1:swishdefault(1)]   'boy'   Pos:6  Stuct:0x9 ( BODY  
> FILE )
> Is it possible your config or source file is in a different encoding?
> Doesn't seem likely, but I can't think of why it wouldn't be working.
> I just cut from your email so seems like it would be the same
> encoding.
> moseley@bumby:~$ od -t x1c  word
> 0000000 d6 76 65 72 73 6b 6f 74 74 73 6c 61 67 65 72 0a
>           Ö   v   e   r   s   k   o   t   t   s   l   a   g   e   r  \n
> 0000020 62 6f 79 0a
>           b   o   y  \n
> -- 
> Bill Moseley
Received on Tue Dec 13 02:07:12 2005