> Probably, it is working this way for historical reasons.
I think you are exactly right.
> Should it be changed?
Eventually, I think the ideal solution is to internally use Unicode/UTF
with a real HTML/XML parser. I think that would completely solve
character issues. That's very easy for me to say, though ;-) Complex
solution, I'm afraid.
It occurred to me that using a recode filter may be possible as a short
term hack. Recode can translate a document to/from HTML entities. That
would give you consistent entries in the index. Then just make sure
that searching works as expected. Numerical entities may cause problems.
Example (using latin-1):
recode -d latin-1:html
This treats the input as latin-1 text but -d limits the output
conversion to "diacritic" characters. This would prevent the HTML
markup from being converted.
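
For anyone without recode handy, here is a rough Python sketch of what I understand the -d conversion to do. It is only an approximation of recode's behaviour, not its actual implementation: the function name diacritics_to_entities is made up for illustration, and the rule "anything outside ASCII becomes an entity" is my assumption about what counts as a "diacritic" character.

```python
from html.entities import codepoint2name


def diacritics_to_entities(text: str) -> str:
    """Replace non-ASCII characters with HTML entities, leaving markup alone.

    Hypothetical helper: approximates `recode -d latin-1:html`, it is not
    recode's own algorithm.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            # ASCII, including <, >, & and quotes, passes through unchanged,
            # so existing HTML tags are not escaped.
            out.append(ch)
        elif code in codepoint2name:
            # A named entity exists for this character, e.g. &eacute;
            out.append(f"&{codepoint2name[code]};")
        else:
            # No named entity: fall back to a numeric reference.
            out.append(f"&#{code};")
    return "".join(out)


# Decode the raw bytes as latin-1 first, then convert.
raw = b"<b>caf\xe9</b>".decode("latin-1")
print(diacritics_to_entities(raw))  # prints <b>caf&eacute;</b>
```

The point of skipping ASCII is the same as with -d: the tags survive, and only the accented characters end up as entities in the index.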
recode -d html:latin-1
This would do the reverse: convert HTML entities into latin-1. This
may be the better (more correct?) approach.
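
The reverse direction can be sketched in Python as well. Again this is an approximation under my own assumptions, not recode's implementation: entities_to_latin1 is a hypothetical name, and I only decode entities that map to printable upper-half latin-1 characters (160 to 255), so that markup entities such as &lt; and &amp; stay as they are. Numeric references are handled alongside named ones.

```python
import re
from html.entities import name2codepoint

# Matches named (&eacute;), decimal (&#233;) and hex (&#xE9;) references.
_ENTITY = re.compile(r"&(#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def entities_to_latin1(text: str) -> str:
    """Decode entities for upper-half latin-1 characters only.

    Hypothetical helper: approximates `recode -d html:latin-1`.
    """

    def replace(match):
        body = match.group(1)
        if body.startswith(("#x", "#X")):
            code = int(body[2:], 16)
        elif body.startswith("#"):
            code = int(body[1:])
        else:
            code = name2codepoint.get(body)  # None for unknown names
        # Assumption: only printable upper-half latin-1 is decoded, so
        # &lt;, &gt;, &amp; and anything outside latin-1 are left untouched.
        if code is not None and 160 <= code <= 255:
            return chr(code)
        return match.group(0)

    return _ENTITY.sub(replace, text)


print(entities_to_latin1("<b>caf&eacute; &amp; cr&#232;me</b>"))
# prints <b>café &amp; crème</b>
```

Running both the documents and the search terms through the same conversion is what would keep the index entries consistent.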
Dave's Web - http://www.webaugur.com/dave/
Received on Tue Nov 28 02:50:29 2000