I have discovered the cause.
I run catdoc on Word documents with the following setting:
FileFilter .doc /usr/local/bin/catdoc "-b -s8859-1 -dutf-8 '%p'"
which works fine, generating output that displays umlauts
correctly in my browser, both in the content of the Word docs and in
filenames, which are UTF-8 encoded on my system.
However, when the output is UTF-8 I get the splitting effect on
umlauts during indexing.
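For illustration, the splitting can be reproduced outside swish-e. The Python sketch below assumes the indexer reads its input one byte per character as ISO-8859-1; that is an assumption about swish-e's behaviour, not something confirmed in this thread, and `split_words` is a hypothetical stand-in for its tokenizer, not its real word-splitting rules.

```python
# Sketch: why a UTF-8 umlaut can split a word for an 8-bit (Latin-1) indexer.
# ASSUMPTION: the indexer treats every byte as one ISO-8859-1 character.

def split_words(text):
    """Hypothetical tokenizer: a word is a run of alphabetic characters."""
    words, current = [], []
    for ch in text:
        if ch.isalpha():
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

word = "Överskottslager"
utf8_bytes = word.encode("utf-8")          # 'Ö' becomes two bytes: c3 96
latin1_bytes = word.encode("iso-8859-1")   # 'Ö' becomes one byte:  d6

# What a byte-per-character reader sees in each case.
seen_from_utf8 = utf8_bytes.decode("iso-8859-1")
seen_from_latin1 = latin1_bytes.decode("iso-8859-1")

print(utf8_bytes[:2].hex(" "), "->", split_words(seen_from_utf8))
print(latin1_bytes[:1].hex(" "), "->", split_words(seen_from_latin1))
```

Under that assumption the first UTF-8 byte (0xC3) reads as the letter "Ã" and the second (0x96) is not a letter at all, so the word breaks right after its first character. That would be consistent with the 'a' / 'verskottslager' pair in the indexed_words output quoted below, if "Ã" is then lowercased and folded to "a".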
If I set catdoc's output to ISO-8859-1 I have no issues with
splitting, but then umlauts in filenames do not display correctly in
the browser. In other words, the body of the indexed document
displays fine when searching, but I'm back to my issue with
filenames. The only way I can see filenames correctly is if the
output sent to the browser is in UTF-8.
As things stand, it looks like I have two choices: either I get the
word-splitting effect, with correct display of both filenames and body
text in the swish.cgi output but no ability to search using umlauts,
or I can search using umlauts but filenames containing umlauts
display incorrectly.
I'm hoping there are some Germans out there who have looked at this
problem. I thought I had it solved, but as it turns out there is a
little "aber".
What I need is a way to index UTF-8 output correctly, or perhaps to do
a substitution in the query, but I will admit I do not know what to
substitute. My Unicode/encoding knowledge is very limited.
I always thought Perl worked internally with UTF-8, so it should
not be a problem.
I guess my question would be: when the script goes through and
indexes the output of catdoc, what determines the character encoding?
Is there a variable I can set to use UTF-8, or to convert UTF-8 to
ISO-8859-1?
I can't help but think that this problem must have a
solution. I can't imagine there not being international users who have
encountered the same issues.
Just for the record .. I think swish-e is great !!
On 13 Dec 2005, at 02:37, Bill Moseley wrote:
> On Mon, Dec 12, 2005 at 12:03:22PM -0800, Thomas Nyman wrote:
>> I made a word document for testing.
>> The document contains the following two words:
>>
>> Överskottslager
>>
>> boy
>>
>> When I run swish-e -c swish_se.conf -i test.doc -T indexed_words -v0
>>
>> I get the following:
>>
>> Adding:[1:swishdocpath(11)] 'test' Pos:1 Stuct:0x1 ( FILE )
>> Adding:[1:swishdocpath(11)] 'doc' Pos:2 Stuct:0x1 ( FILE )
>> Adding:[1:swishdefault(1)] 'a' Pos:1 Stuct:0x1 ( FILE )
>> Adding:[1:swishdefault(1)] 'verskottslager' Pos:2 Stuct:0x1 ( FILE )
>> Adding:[1:swishdefault(1)] 'boy' Pos:3 Stuct:0x1 ( FILE )
>
> Odd, works for me.
>
> moseley@bumby:~$ cat word
> Överskottslager
> boy
> moseley@bumby:~$ swish-e -i word -T indexed_words -v0
> Adding:[1:swishdefault(1)] 'överskottslager' Pos:5 Stuct:0x9 ( BODY FILE )
> Adding:[1:swishdefault(1)] 'boy' Pos:6 Stuct:0x9 ( BODY FILE )
>
> moseley@bumby:~$ cat c
> TranslateCharacters :ascii7:
> moseley@bumby:~$ swish-e -i word -T indexed_words -c c -v0
> Adding:[1:swishdefault(1)] 'overskottslager' Pos:5 Stuct:0x9 ( BODY FILE )
> Adding:[1:swishdefault(1)] 'boy' Pos:6 Stuct:0x9 ( BODY FILE )
>
>
> Is it possible your config or source file is in a different encoding?
> Doesn't seem likely, but I can't think of why it wouldn't be working.
> I just cut from your email so seems like it would be the same
> encoding.
>
>
> moseley@bumby:~$ od -t x1c word
> 0000000 d6 76 65 72 73 6b 6f 74 74 73 6c 61 67 65 72 0a
>          Ö  v  e  r  s  k  o  t  t  s  l  a  g  e  r \n
> 0000020 62 6f 79 0a
>          b  o  y \n
>
> --
> Bill Moseley
> moseley@hank.org
>
Received on Tue Dec 13 02:07:12 2005