Michael Peters wrote on 3/26/09 3:08 PM:
>
> But regardless, I need to be able to show UTF-8 characters safely in the
> descriptions of my search results, so escaping those into HTML entities is
> needed for that.
That might not be doing what you think it does, depending on the specific UTF-8
characters you are escaping and how you are returning the results.
When you escape your UTF-8 characters to entities, libxml2 resolves them back to
the UTF-8 octets they represent, and then swish-e converts them to Latin1. So
your UTF-8 chars are converted to Latin1 if they can be, and are ignored
otherwise. What you get back when you retrieve the PropertyName is Latin1, not
UTF-8. It's a lossy deal.
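A rough sketch of that round trip in plain Perl, in case it helps to see the loss happen. This is only a simulation, not swish-e's or libxml2's actual code: the helper names resolve_numeric_entities and to_latin1_lossy are made up for illustration, and dropping unmappable characters is an assumption based on the behaviour described above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: resolve numeric entities such as &#931; into
# characters, roughly what the parser does before the indexer sees the text.
sub resolve_numeric_entities {
    my ($text) = @_;
    $text =~ s/&#(\d+);/chr($1)/ge;
    return $text;
}

# Hypothetical helper: keep only characters that exist in Latin1 and
# silently drop everything else (an assumption about the lossy step).
sub to_latin1_lossy {
    my ($text) = @_;
    ( my $fits = $text ) =~ s/[^\x{00}-\x{FF}]//g;
    return encode( 'iso-8859-1', $fits );
}

my $eacute = to_latin1_lossy( resolve_numeric_entities('&#233;') );
my $sigma  = to_latin1_lossy( resolve_numeric_entities('&#931;') );

printf "eacute: %d byte(s) stored\n", length $eacute;    # 1
printf "Sigma:  %d byte(s) stored\n", length $sigma;     # 0
```

The e-acute fits in Latin1 and comes back as a single byte, while the Sigma has no Latin1 equivalent and is gone.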
If you really have UTF-8 characters that you want preserved in PropertyNames for
HTML display, you need to double escape them so that the entity is preserved.
&#931; -> &amp;#931;
etc. Caveat: they can't be searched on, but then, they wouldn't be anyway if you
left them as UTF-8 characters.
Here's an example (depending on how my mail gets converted (or not) you might
see these as Latin1 chars or not, but if you run this in a terminal with the
display set to Latin1, you'll see the Latin1 that swish-e returns):
[karpet@pekmac:~/tmp]$ cat entmaker.pl
use Search::Tools::XML;
print "<html><body>\n";
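for my $name ( sort keys %Search::Tools::XML::HTML_ents ) {
    my $num = $Search::Tools::XML::HTML_ents{$name};
    print "$name = &#$num;\n";        # single escape: resolved, then converted to Latin1
    print "$name = &amp;#$num;\n";    # double escape: the entity text is preserved
}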
print "</body></html>\n";
[karpet@pekmac:~/tmp]$ perl entmaker.pl >ents.html
[karpet@pekmac:~/tmp]$ cat conf
IndexContents HTML2 .html
StoreDescription HTML2 <body>
[karpet@pekmac:~/tmp]$ swish-e -i ents.html -c conf
Indexing Data Source: "File-System"
Indexing "ents.html"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 477 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
477 unique words indexed.
5 properties sorted.
1 file indexed. 8,610 total bytes. 824 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!
[karpet@pekmac:~/tmp]$ swish-e -w amp -p swishdescription
# SWISH format: 2.5.6
# Search words: amp
# Removed stopwords:
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.007 seconds
1000 ents.html "ents.html" 8610 "AElig = Æ AElig = Æ Aacute = Á Aacute =
Á Acirc = Â Acirc = Â Agrave = À Agrave = À Alpha = Alpha =
Α Aring = Å Aring = Å Atilde = Ã Atilde = Ã Auml = Ä Auml =
Ä Beta = Beta = Β Ccedil = Ç Ccedil = Ç Chi = Chi = Χ
Dagger = Dagger = ‡ Delta = Delta = Δ ETH = Ð ETH = Ð Eacute =
É Eacute = É Ecirc = Ê Ecirc = Ê Egrave = È Egrave = È Epsilon =
Epsilon = Ε Eta = Eta = Η Euml = Ë Euml = Ë Gamma = Gamma =
Γ Iacute = Í Iacute = Í Icirc = Î Icirc = Î Igrave = Ì Igrave =
Ì Iota = Iota = Ι Iuml = Ï Iuml = Ï Kappa = Kappa = Κ
Lambda = Lambda = Λ Mu = Mu = Μ Ntilde = Ñ Ntilde = Ñ Nu = Nu
= Ν OElig = OElig = Œ Oacute = Ó Oacute = Ó Ocirc = Ô Ocirc =
Ô Ograve = Ò Ograve = Ò Omega = Omega = Ω Omicron = Omicron =
Ο Oslash = Ø Oslash = Ø Otilde = Õ Otilde = Õ Ouml = Ö Ouml =
Ö Phi = Phi = Φ Pi = Pi = Π Prime = Prime = ″ Psi = Psi
= Ψ Rho = Rho = Ρ Scaron = Scaron = Š Sigma = Sigma = Σ
THORN = Þ THORN = Þ Tau = Tau = Τ Theta = Theta = Θ Uacute = Ú
Uacute = Ú Ucirc = Û Ucirc = Û Ugrave = Ù Ugrave = Ù Upsilon =
Upsilon = Υ Uuml = Ü Uuml = Ü Xi = Xi = Ξ Yacute = Ý Yacute =
Ý Yuml = Yuml = Ÿ Zeta = Zeta = Ζ aacute = á aacute = á
acirc = â acirc = â acute = ´ acute = ´ aelig = æ aelig = æ
agrave = à agrave = à alefsym = alefsym = ℵ alpha = alpha = α
amp = & amp = & and = and = ∧ ang = ang = ∠ apos = ' apos =
' aring = å aring = å asymp = asymp = ≈ atilde = ã atilde =
ã auml = ä auml = ä bdquo = bdquo = „ beta = beta = β
brvbar = ¦ brvbar = ¦ bull = bull = • cap = cap = ∩ ccedil =
ç ccedil = ç cedil = ¸ cedil = ¸ cent = ¢ cent = ¢ chi = chi =
χ circ = circ = ˆ clubs = clubs = ♣ cong = cong = ≅
copy = © copy = © crarr = crarr = ↵ cup = cup = ∪ curren = ¤
curren = ¤ dArr = dArr = ⇓ dagger = dagger = † darr = darr =
↓ deg = ° deg = ° delta = delta = δ diams = diams = ♦
divide = ÷ divide = ÷ eacute = é eacute = é ecirc = ê ecirc = ê
egrave = è egrave = è empty = empty = ∅ emsp = emsp =   ensp
= ensp =   epsilon = epsilon = ε equiv = equiv = ≡ eta =
eta = η eth = ð eth = ð euml = ë euml = ë euro = euro = €
exist = exist = ∃ fnof = fnof = ƒ forall = forall = ∀ frac12
= ½ frac12 = ½ frac14 = ¼ frac14 = ¼ frac34 = ¾ frac34 = ¾ frasl
= frasl = ⁄ gamma = gamma = γ ge = ge = ≥ gt = > gt = >
hArr = hArr = ⇔ harr = harr = ↔ hearts = hearts = ♥ hellip
= hellip = … iacute = í iacute = í icirc = î icirc = î iexcl =
¡ iexcl = ¡ igrave = ì igrave = ì image = image = ℑ infin =
infin = ∞ int = int = ∫ iota = iota = ι iquest = ¿ iquest =
¿ isin = isin = ∈ iuml = ï iuml = ï kappa = kappa = κ
lArr = lArr = ⇐ lambda = lambda = λ lang = lang = 〈 laquo =
« laquo = « larr = larr = ← lceil = lceil = ⌈ ldquo = ldquo
= “ le = le = ≤ lfloor = lfloor = ⌊ lowast = lowast =
∗ loz = loz = ◊ lrm = lrm = ‎ lsaquo = lsaquo = ‹
lsquo = lsquo = ‘ lt = < lt = < macr = ¯ macr = ¯ mdash = mdash
= — micro = µ micro = µ middot = · middot = · minus = minus =
− mu = mu = μ nabla = nabla = ∇ nbsp = nbsp =   ndash
= ndash = – ne = ne = ≠ ni = ni = ∋ not = ¬ not = ¬
notin = notin = ∉ nsub = nsub = ⊄ ntilde = ñ ntilde = ñ nu =
nu = ν oacute = ó oacute = ó ocirc = ô ocirc = ô oelig = oelig
= œ ograve = ò ograve = ò oline = oline = ‾ omega = omega =
ω omicron = omicron = ο oplus = oplus = ⊕ or = or = ∨
ordf = ª ordf = ª ordm = º ordm = º oslash = ø oslash = ø otilde
= õ otilde = õ otimes = otimes = ⊗ ouml = ö ouml = ö para = ¶
para = ¶ part = part = ∂ permil = permil = ‰ perp = perp =
⊥ phi = phi = φ pi = pi = π piv = piv = ϖ plusmn = ±
plusmn = ± pound = £ pound = £ prime = prime = ′ prod = prod =
∏ prop = prop = ∝ psi = psi = ψ quot = " quot = " rArr =
rArr = ⇒ radic = radic = √ rang = rang = 〉 raquo = » raquo
= » rarr = rarr = → rceil = rceil = ⌉ rdquo = rdquo =
” real = real = ℜ reg = ® reg = ® rfloor = rfloor = ⌋
rho = rho = ρ rlm = rlm = ‏ rsaquo = rsaquo = › rsquo =
rsquo = ’ sbquo = sbquo = ‚ scaron = scaron = š sdot = sdot
= ⋅ sect = § sect = § shy = shy = ­ sigma = sigma = σ
sigmaf = sigmaf = ς sim = sim = ∼ spades = spades = ♠ sub =
sub = ⊂ sube = sube = ⊆ sum = sum = ∑ sup = sup = ⊃
sup1 = ¹ sup1 = ¹ sup2 = ² sup2 = ² sup3 = ³ sup3 = ³ supe =
supe = ⊇ szlig = ß szlig = ß tau = tau = τ there4 = there4 =
∴ theta = theta = θ thetasym = thetasym = ϑ thinsp = thinsp =
  thorn = þ thorn = þ tilde = tilde = ˜ times = × times =
× trade = trade = ™ uArr = uArr = ⇑ uacute = ú uacute =
ú uarr = uarr = ↑ ucirc = û ucirc = û ugrave = ù ugrave =
ù uml = ¨ uml = ¨ upsih = upsih = ϒ upsilon = upsilon = υ
uuml = ü uuml = ü weierp = weierp = ℘ xi = xi = ξ yacute = ý
yacute = ý yen = ¥ yen = ¥ yuml = ÿ yuml = ÿ zeta = zeta =
ζ zwj = zwj = ‍ zwnj = zwnj = ‌"
--
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Received on Thu Mar 26 22:07:38 2009