On Thu, Nov 17, 2005 at 10:30:12AM +0100, Thomas Nyman wrote:
> Anyway, if i change the following setting in TempleteDefault - my
> $output = $q->header . page_header( $results ); - to my $output =
> $q->header(-charset=>'UTF-8') . page_header( $results ); then
> filenames are displayed correctly with regards to umlauts .. however
> the content of swishdescription displays incorrectly then.
Sorry, this kind of thing takes a lot of time. And I have not worked
enough with different encodings. When I've looked at encodings in the
past I spent a lot of time dumping Perl SVs, using wget, and using od
to dump bytes of my source files.
Again, swish uses libxml2 for parsing documents. libxml2 can parse
utf-8 (or most other encodings) and uses utf-8 internally but then
swish takes that utf-8 and converts it to 8859-1 encoding. So any
characters that don't map to 8859-1 are replaced by a space and the
parser should warn if that's happening (see ParserWarnLevel).
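To make that concrete, here is a rough Python sketch of the lossy step described above. It is not swish's actual code (that lives in C); the function name to_latin1_lossy is made up for illustration, and it only mimics the "replace with a space" behavior:

```python
def to_latin1_lossy(text: str) -> bytes:
    """Encode text as ISO-8859-1, substituting a space for any
    character that has no 8859-1 equivalent (illustrative only)."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("iso-8859-1")
        except UnicodeEncodeError:
            # No mapping in 8859-1: mimic the replacement with a space.
            out += b" "
    return bytes(out)


# U+00E4 (a-umlaut) exists in 8859-1; U+20AC (euro sign) does not.
converted = to_latin1_lossy("\u00e4\u20ac")
print(converted)
```

The umlaut survives as a single byte, while the euro sign is silently lost, which is why the parser warning is worth turning on.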
Now, I would think that if you just took an 8859-1 encoded document and
put it on the web with a content-type of utf-8 then only the first 128
chars (the ASCII range) would display correctly, since those map to the
same chars in utf-8.
You could take the output from swish and use iconv to convert 8859-1
to utf-8 before sending to the browser, and I would think that you
would then be able to see all the chars.
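A small Python sketch of both cases, assuming the swish output really is 8859-1. The sample word is made up, and the last step is meant to be the rough equivalent of running iconv -f ISO-8859-1 -t UTF-8 on the output:

```python
# A sample word with an umlaut, as 8859-1 bytes (one byte per char).
latin1_bytes = "M\u00e4rz".encode("iso-8859-1")

# Roughly what a browser does when told the page is utf-8:
# the lone 0xE4 byte is not valid utf-8, so it becomes U+FFFD.
as_seen = latin1_bytes.decode("utf-8", errors="replace")

# Convert 8859-1 to utf-8 before sending it out.
utf8_bytes = latin1_bytes.decode("iso-8859-1").encode("utf-8")

print(latin1_bytes, repr(as_seen), utf8_bytes)
```

The ASCII letters come through either way; only the byte above 127 breaks, and after conversion it becomes a two-byte utf-8 sequence.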
Now, things are much more complex. There's Perl. For one thing, you
have edited the Perl source files with an editor (on OS X?) and so
when you saved that file it was encoded into utf-8 (I assume). And if so,
then you might need to tell Perl that your source files are utf-8.
See perldoc utf8 and perldoc encoding.
The output from swish also goes through perl, of course. What happens
to those characters? I'm not quite sure with current Perl versions.
For a while there was a utf8 flag on scalars (SVs) and sometimes Perl
would have that flag set. But, you might need to tell Perl that it's
8859-1. (And it gets tricky, because SWISH::API uses Perl's xs (C
interface) and dumps swish data right into Perl variables without
consideration of encoding.)
And then for those using templates like Template Toolkit, there are
other issues with how the templates are encoded.
> Since the bulk of the documents are word documents being parsed
> through catdoc i changed my swish.conf as follows
> FileFilter .doc /usr/local/bin/catdoc "-b -s8859-1 -dutf-8 '%p' "
Are your Word docs really encoded in 8859-1? (Or do they contain
UTF-16 and then -s8859-1 is ignored?)
> The results now show correct filenames with umlauts however there are
> still some parts displaying incorrectly. The descriptions of the file
> contents and highlighting is pretty much correct with one or two
> faulty representations but now parts of the form are displaying
> incorrectly. I'm enclosing a screendump of what it looks like.
> Oh, and the browsers default encoding is utf-8
My guess there is that Perl is not reading your source files
correctly. If you are saving them as utf-8 you might need to "use
utf8" at the top of the file.
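If that guess is right, the symptom would look something like the following. This is only a Python sketch of the general "utf-8 bytes read as 8859-1 and then encoded again" problem, not a trace of what Perl does internally:

```python
# An a-umlaut as saved by an editor writing utf-8: two bytes.
source_bytes = "\u00e4".encode("utf-8")

# Read without being told it is utf-8: each byte is taken as
# its own 8859-1 character, giving two characters instead of one.
misread = source_bytes.decode("iso-8859-1")

# Sent out as utf-8 afterwards: the text is now double-encoded.
double_encoded = misread.encode("utf-8")

print(source_bytes, repr(misread), double_encoded)
```

One umlaut turning into two odd-looking characters in the form is the usual sign of this kind of double encoding.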
> The issue seems to point towards some part of the html page being
> produced is setting an encoding other than utf-8..question is where
> this is being set?
That is the question. ;) I, quite unfortunately, do things the slow
way so I'd be sitting there with od and looking at the various bits of
the output from the perl script directly. I would then move to using
wget to fetch the output via the web server to see if anything changes.
The web browser is kind of a wild card, so it's best to make sure you
know what the bytes say first. If the bytes are really utf-8 then the
browser needs to be told that.
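For checking what the bytes say, something along these lines can stand in for od. It is a minimal sketch; the helper names dump and guess_encoding are made up, and the guess is only a heuristic (bytes that are not valid utf-8 might be 8859-1 or something else entirely):

```python
def dump(data: bytes) -> str:
    """Return the bytes as space-separated hex, a bit like od -t x1."""
    return " ".join(f"{b:02x}" for b in data)


def guess_encoding(data: bytes) -> str:
    """Heuristic only: classify bytes as ascii, valid utf-8, or neither."""
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "not utf-8 (possibly 8859-1)"


for sample in (b"Marz", b"M\xc3\xa4rz", b"M\xe4rz"):
    print(dump(sample), "->", guess_encoding(sample))
```

Running the saved script output and the wget-fetched page through the same check shows whether the bytes differ before the browser gets involved.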
Or maybe just adding "use utf8" to the perl files you edit will be enough.
Received on Thu Nov 17 05:25:02 2005