Re:

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri May 10 2002 - 16:44:26 GMT
At 04:14 AM 05/10/02 -0700, Cristiano Corsani wrote:
>For many records swish says:
>CFI0515272.xml:1: warning: Failed to convert internal UTF-8 to Latin-1.
>Replacing non ISO-8859-1 char with char ' '
>697;d</subproperty><subproperty type="700.b">, Vsevolod
>&#278;mil&#697;evi&#269
>it does not like the &#269; I think that the problem is that my html is
>encoded with utf8 and swish latin-1 does not recognize it. For my goals I
>need that swish put in property the string as it is (with &#269;) ... How
>can I manage it?

The problem is that the same parser is used both for indexing text and for
storing properties.  For indexing you do want the entities decoded, but
since swish works with 8-bit characters it gives a warning when it can't
convert a UTF-8 character to Latin-1.
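
To illustrate what that warning means, here is a rough Python sketch (not
swish's actual C code).  The function name to_latin1_lossy and its return
shape are made up for this example; the only behavior taken from the warning
is that a character outside ISO-8859-1 is replaced with a space.

```python
import html


def to_latin1_lossy(text, replacement=" "):
    """Decode entities, then squeeze the result into the Latin-1 range.

    Hypothetical helper: mimics the behavior described by the warning,
    not the real swish implementation.
    """
    decoded = html.unescape(text)  # "&#269;" becomes U+010D, etc.
    out = []
    replaced = 0
    for ch in decoded:
        if ord(ch) < 256:          # representable in ISO-8859-1
            out.append(ch)
        else:                      # no 8-bit equivalent: substitute
            out.append(replacement)
            replaced += 1
    return "".join(out), replaced


if __name__ == "__main__":
    text, lost = to_latin1_lossy("&#278;mil&#697;evi&#269;")
    print(repr(text), lost)
```

With the name from the report, all three decoded characters fall outside
Latin-1, so each one is lost to a space.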

For properties, it's debatable what the correct thing to do is.  Since
properties (at this time) store only the text (not the markup), it also
makes sense to decode the entities.  But, again, there's the 8-bit limitation.

It would take a lot of rewriting to separate the extraction of text for
indexing from the extraction of text for properties.  Currently, the code
extracts the text once, and then the same text is processed for both
indexing and properties.  There's only one buffer that holds the parsed text.

I'll look at the code to see if there's a way to process properties
separately, so that you could then enable or disable entity decoding for
just properties.
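
Roughly, the idea would look like the Python sketch below.  This is only an
illustration of the design, not swish code: extract_text, process and the
decode_property_entities flag are all hypothetical names, and the regex tag
stripper stands in for the real parser.

```python
import html
import re

# Crude tag stripper, standing in for the real XML/HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def extract_text(markup):
    """Strip the markup but leave entities such as &#269; untouched."""
    return TAG_RE.sub("", markup)


def process(markup, decode_property_entities=True):
    """Return (index_text, property_text) from one extraction pass.

    Hypothetical option: when decode_property_entities is False, the
    property keeps the entities exactly as written in the source.
    """
    raw = extract_text(markup)
    index_text = html.unescape(raw)  # indexing always wants decoded text
    if decode_property_entities:
        property_text = html.unescape(raw)
    else:
        property_text = raw
    return index_text, property_text


if __name__ == "__main__":
    doc = '<subproperty type="700.b">Vsevolod &#278;mil&#697;evi&#269;</subproperty>'
    print(process(doc, decode_property_entities=False))
```

The point is that the text is still extracted only once, but the two
consumers no longer have to share the same decoded buffer.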

-- 
Bill Moseley
mailto:moseley@hank.org
Received on Fri May 10 16:45:49 2002