Re: [swish-e] Problem with filenames/character sets

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Wed Apr 04 2007 - 13:13:26 GMT
Rainer Hofmann scribbled on 4/4/07 6:47 AM:
> Hi,
> 
> The following situation gives me a headache:
> 
> Windows clients (Lang=de, CP1252) put PDF-Files onto a fileserver (Linux 
> Lang=en_us.utf) via Samba.
> Those files are periodically indexed by swish-e located on the server. 
> Works pretty well so far.
> But if clients use non-ASCII characters in their file or 
> directory names, they run into trouble when searching for these files.
>

Sounds like a messy encoding problem. I assume Windows doesn't use UTF-8 for its 
filesystem, and Swish-e converts UTF-8 to Latin1 (ISO-8859-1) where possible. 
And who knows what Samba does wrt converting (or not) filenames from the Windows 
fs encoding to the destination Linux fs encoding.

http://j3e.de/linux/convmv/man/#how_to_repair_samba_files

might address part of your issue.
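
Before converting anything, it may help to find out which filenames are actually 
affected. A minimal Python sketch, assuming the names on the Linux side are 
either plain ASCII, valid UTF-8, or some legacy 8-bit encoding. The function 
names are made up for illustration, and the check can't tell CP1252 from Latin1; 
it only flags names that are not valid UTF-8:

```python
import os


def guess_filename_encoding(raw: bytes) -> str:
    """Classify raw filename bytes as 'ascii', 'utf-8', or 'legacy'."""
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: probably CP1252 or Latin1, but we can't say which.
        return "legacy"


def list_legacy_names(directory: str) -> list:
    """Return the raw names in `directory` that are not valid UTF-8."""
    # Passing a bytes path makes os.listdir return undecoded bytes names.
    names = os.listdir(os.fsencode(directory))
    return [n for n in names if guess_filename_encoding(n) == "legacy"]
```

Running list_legacy_names() over the Samba share should show whether the 
Windows clients' names arrived as UTF-8 or as raw 8-bit bytes.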

The ideal is to do everything in UTF-8, since it can encode every Unicode 
character and is ASCII compatible. But (as is oft repeated here) Swish-e 
doesn't yet handle UTF-8 well. In the meantime, I'd suggest standardizing on 
Latin1, since that seems like the least evil compromise. Convert your filenames 
with convmv to Latin1, then index with Swish-e, and then your GUI will need to 
map between Latin1 and the Windows encoding (CP1252?) if retrieving from the 
Windows fs (instead of from Samba).
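
The mapping step could look roughly like the Python sketch below. It assumes the 
Windows side really is CP1252, and the function names are hypothetical. CP1252 
and Latin1 agree on bytes 0xA0-0xFF (so umlauts pass through unchanged), but 
CP1252 puts printable characters such as the euro sign in 0x80-0x9F, and those 
have no Latin1 equivalent:

```python
def utf8_name_to_latin1(raw: bytes) -> bytes:
    """Re-encode a UTF-8 filename as Latin1.

    Raises UnicodeEncodeError if the name contains a character
    that Latin1 cannot represent.
    """
    return raw.decode("utf-8").encode("latin-1")


def cp1252_to_latin1(raw: bytes, errors: str = "strict") -> bytes:
    """Map a CP1252 filename to Latin1.

    With errors="replace", characters missing from Latin1
    (for example the euro sign) become "?".
    """
    return raw.decode("cp1252").encode("latin-1", errors=errors)


def latin1_to_cp1252(raw: bytes) -> bytes:
    """Map a Latin1 filename to CP1252.

    Raises UnicodeEncodeError for the C1 control bytes 0x80-0x9F,
    which should not appear in sane filenames anyway.
    """
    return raw.decode("latin-1").encode("cp1252")
```

For typical German names the two 8-bit encodings give identical bytes, so the 
mapping only matters for the handful of characters in the 0x80-0x9F range.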

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Received on Wed Apr 4 09:13:28 2007