Re: [swish-e] Problem with filenames/character sets

From: Peter Karman <peter(at)>
Date: Wed Apr 04 2007 - 13:13:26 GMT
Rainer Hofmann scribbled on 4/4/07 6:47 AM:
> Hi,
> the following situation gives me a headache:
> Windows clients (Lang=de, CP1252) put PDF-Files onto a fileserver (Linux 
> Lang=en_us.utf) via Samba.
> Those files are periodically indexed by swish-e located on the server. 
> Works pretty well so far.
> But if clients use non-ASCII characters like  in their file or 
> directory names, they run into trouble when searching for these files.

Sounds like a messy encoding problem. I assume Windows doesn't use UTF-8 for its 
filesystem, and Swish-e converts UTF-8 to Latin1 (ISO-8859-1) where possible. 
And who knows what Samba does wrt converting (or not) filenames from the Windows 
fs encoding to the destination Linux fs encoding.

might address part of your issue.
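
To make the mismatch concrete, here is a rough Python sketch (not anything 
Swish-e or Samba actually runs) showing how one filename turns into different 
bytes depending on the encoding, and what happens when those bytes are read 
back with the wrong one. The filename and the helper decode_or_none are made 
up for illustration.

```python
# Hypothetical filename; any name with non-ASCII characters behaves similarly.
name = "Übersicht.pdf"

# The same name as raw bytes under three encodings.
as_utf8 = name.encode("utf-8")      # what a UTF-8 Linux filesystem would store
as_cp1252 = name.encode("cp1252")   # what a German Windows client works with
as_latin1 = name.encode("latin-1")  # ISO-8859-1


def decode_or_none(raw, encoding):
    """Decode raw bytes, returning None instead of raising on invalid input."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Reading UTF-8 bytes as Latin1 never fails, it just yields the wrong characters.
mojibake = as_utf8.decode("latin-1")

print(as_utf8, as_cp1252, as_latin1)
print(repr(mojibake))
print(decode_or_none(as_cp1252, "utf-8"))
```

For this particular name the CP1252 and Latin1 bytes are identical, while the 
UTF-8 form is one byte longer, and the CP1252 bytes are not valid UTF-8 at all.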

The ideal is to do everything in UTF-8, since it can encode every Unicode 
character and is ASCII compatible. But (as is oft repeated here) Swish-e 
doesn't yet handle UTF-8 well. In the meantime, I'd suggest standardizing on 
Latin1, since that seems like the least evil compromise. Convert your filenames 
to Latin1 with convmv, then index with Swish-e, and then your GUI will need to 
map between Latin1 and the Windows encoding (CP1252?) if retrieving from the 
Windows fs (instead of from Samba).
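
As a minimal sketch of the mapping step, assuming the names on the server start 
out as UTF-8 and the clients really do use CP1252: the function names below are 
hypothetical, and this is not a replacement for convmv, just an illustration of 
what the conversion amounts to.

```python
def fits_latin1(name: str) -> bool:
    """Return True if every character of the name exists in Latin1."""
    try:
        name.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def utf8_name_to_latin1(raw: bytes) -> bytes:
    """Re-encode a UTF-8 filename as Latin1.

    Raises UnicodeEncodeError if a character has no Latin1 equivalent.
    """
    return raw.decode("utf-8").encode("latin-1")


def latin1_to_cp1252(raw: bytes) -> bytes:
    """Map a Latin1 filename to CP1252 for lookup on the Windows side."""
    return raw.decode("latin-1").encode("cp1252")


# Hypothetical filename as it might sit on a UTF-8 Linux filesystem.
stored = "Müller.pdf".encode("utf-8")
latin1_name = utf8_name_to_latin1(stored)
windows_name = latin1_to_cp1252(latin1_name)
print(latin1_name, windows_name)
```

For ordinary accented letters such as the German umlauts, the Latin1 and CP1252 
bytes come out the same, so the mapping is effectively a no-op. The two differ 
in the 0x80-0x9F range: the euro sign, for example, exists in CP1252 but not in 
Latin1, so a name containing it fails fits_latin1 and can't be converted 
without loss.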

Peter Karman  .  .  peter(at)
