Re: indexing utf-8 under windows.

From: Bill Moseley <moseley(at)>
Date: Wed Mar 23 2005 - 18:41:54 GMT
On Wed, Mar 23, 2005 at 10:28:38AM -0800, Carmelo Carchedi wrote:
> I have a typical xml file like this in utf-8:
> maybe the problem is "accented characters".
> If I have accented characters in the <testomassima> tag, I cannot find
> any word (with or without accent) in the xml file.
> Why?
> Is it correct to index utf8 files?

It's fine.  In fact all documents parsed by libxml2 are in utf8
internally and then converted to 8-bit encoding (namely 8859-1) at
indexing time.
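
To illustrate what a utf8 to 8859-1 conversion does to accented
characters, here is a minimal Python sketch. It is not swish-e code: the
helper to_latin1 is a hypothetical name, and replacing unconvertible
characters with '?' is an assumption of this example, not a statement
about how swish-e treats them.

```python
# Sketch only: mimics a UTF-8 -> ISO-8859-1 conversion; not swish-e code.

def to_latin1(utf8_bytes: bytes) -> bytes:
    """Decode UTF-8 input and re-encode it as ISO-8859-1.

    Characters with no ISO-8859-1 equivalent are replaced by '?'
    (an assumption made for this example).
    """
    text = utf8_bytes.decode("utf-8")
    return text.encode("iso-8859-1", errors="replace")


# The accented letter takes two bytes in UTF-8 but one in ISO-8859-1.
utf8_word = "città".encode("utf-8")
latin1_word = to_latin1(utf8_word)
print(len(utf8_word), len(latin1_word))

# The euro sign is not part of ISO-8859-1, so it cannot be converted.
print(to_latin1("€".encode("utf-8")))
```

Accented Western European letters fit in 8859-1, so they survive the
conversion; characters outside that range do not.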

The trick to debugging is to index a single file:

   swish-e -i test.xml -c swish.config -T indexed_words

That -T indexed_words option will have swish display all the words
that are indexed.  Those are the words that you can search for.  Make
sure that the entire document is being indexed -- there are cases
where bad XML will make libxml2 abort processing in the middle of a
document.

Then when searching do:

   swish-e -w foo -H9 | grep Parsed

and that will show you the word(s) swish is searching for in the

The other thing is to set ParserWarnLevel 9 in your config file so that
libxml2 will report any errors in processing.
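
As a rough cross-check outside swish-e, you can also test whether the
file is well-formed XML on its own. The sketch below uses Python's
standard xml.etree.ElementTree parser (which is built on expat, not
libxml2, so the messages will differ from what swish-e reports). The
function name check_xml and the sample data are made up for this example.

```python
import xml.etree.ElementTree as ET


def check_xml(data: bytes):
    """Return None if data parses as XML, else (line, column, message)."""
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


# Well-formed UTF-8 input with an accented character parses cleanly.
good = "<doc><testomassima>città</testomassima></doc>".encode("utf-8")
print(check_xml(good))

# The same text saved as ISO-8859-1 without an encoding declaration is
# not valid UTF-8, so the parser rejects it.
bad = "<doc><testomassima>città</testomassima></doc>".encode("iso-8859-1")
print(check_xml(bad))
```

To check a real file, pass its contents instead, for example
check_xml(open("test.xml", "rb").read()).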

> Is it better to convert the utf-8 file to another charset?

Doesn't matter.

Bill Moseley

Received on Wed Mar 23 10:41:54 2005