Indexing accent characters

From: Greg Ford <greg(at)>
Date: Tue Aug 12 2003 - 14:04:48 GMT

Some of the data I am (considering) indexing includes Unicode 
(Latin Extended-A) characters. 

I've looked at the FAQ and done some tests. It seems that if those
files are in XML or HTML, libxml2 will convert them to 8859-1.
But in my tests, Latin characters with accents, e.g. AMACRON (&#257;),
are not indexed. I was hoping they would be converted
to the plain letter (a); stripping the accents off would make my
data conveniently searchable.

I note that the FAQ suggests full Unicode support is a way off, but
could stripping of Unicode accents be achieved with
reasonable effort?
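
To illustrate what I mean by stripping accents, here is a minimal sketch
in Python of the kind of preprocessing that could be run on the text
before it reaches the indexer. This is not swish-e code and not anything
the FAQ describes; the function name strip_accents is made up for the
example. It assumes the input has already been decoded to Unicode text,
decomposes each character into its base letter plus combining marks, and
then drops the marks.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining accent marks removed (hypothetical helper)."""
    # NFD splits a precomposed letter such as U+0101 (a with macron)
    # into the base letter 'a' followed by U+0304 (combining macron).
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for combining marks,
    # so keeping only class 0 drops the accents and keeps the base letters.
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


if __name__ == "__main__":
    print(strip_accents("M\u0101ori"))
```

Note that this only helps for letters that have a canonical decomposition.
Some Latin Extended-A letters, such as L WITH STROKE (U+0142), are not
base letter plus accent in Unicode terms and would pass through unchanged,
so they would need an explicit mapping table.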

By the way, I used swish-e.exe (Windows) compiled from the current 
CVS to test this behaviour.

Greg Ford
Received on Tue Aug 12 14:10:27 2003