Re: [swish-e] problems with spidering UTF8

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Fri Mar 27 2009 - 02:07:32 GMT

Michael Peters wrote on 3/26/09 3:08 PM:

> 
> But regardless, I need to be able to show UTF-8 characters safely in the 
> descriptions of my search results, so escaping those into HTML entities is 
> needed for that. 

That might not be doing what you think it does, depending on the specific UTF-8
characters you are escaping and how you are returning the results.

When you escape your UTF-8 characters to entities, libxml2 resolves them back to
the UTF-8 octets they represent, and then swish-e converts those octets to
Latin1. So your UTF-8 chars are converted to Latin1 if they can be, and are
ignored otherwise. What you get back when you retrieve the property is Latin1,
not UTF-8. The conversion is lossy.
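
To make the lossy step concrete, here is a toy model in Perl. It is only a
sketch of the behavior described above, not swish-e's or libxml2's actual code;
the helper name to_latin1_lossy is made up for illustration, and it assumes
that characters outside Latin1 are simply dropped.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Toy model of the lossy step (an assumption, NOT swish-e's real code):
# keep what Latin1 can represent, drop everything else.
sub to_latin1_lossy {
    my ($text) = @_;                         # decoded Perl character string
    $text =~ s/[^\x{00}-\x{FF}]//g;          # U+0100 and above has no Latin1 form
    return encode( 'ISO-8859-1', $text );    # remaining chars map 1:1 to bytes
}

for my $cp ( 0xE9, 0x3A3 ) {                 # e-acute, Greek capital Sigma
    my $bytes = to_latin1_lossy( chr $cp );
    printf "U+%04X -> %d byte(s)\n", $cp, length $bytes;
}
```

Under that assumption the e-acute survives as a single Latin1 byte, while the
Sigma leaves nothing behind.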

If you really have UTF-8 characters that you want preserved in properties for
HTML display, you need to double escape them so that the entity itself is
preserved.

 &#931;  -> &amp;#931;

etc. Caveat: they can't be searched on, but then, they wouldn't be anyway if you
left them as UTF-8 characters.
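
A minimal sketch of that double escaping, assuming you preprocess the text
yourself before handing it to swish-e. The helper name
double_escape_non_latin1 is hypothetical (it is not part of swish-e or
Search::Tools), and leaving the Latin1 range untouched is a design choice
based on the conversion described above, not a requirement.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite every character above U+00FF as a
# double-escaped numeric entity, e.g. U+03A3 becomes "&amp;#931;".
# Characters in the Latin1 range are left alone.
sub double_escape_non_latin1 {
    my ($text) = @_;    # expects a decoded Perl character string
    $text =~ s/([^\x{00}-\x{FF}])/'&amp;#' . ord($1) . ';'/ge;
    return $text;
}

print double_escape_non_latin1("Sigma = \x{3A3}"), "\n";
# prints: Sigma = &amp;#931;
```

After the parser resolves &amp; the stored text reads &#931;, so the property
has to be written into the results page without escaping the ampersand again.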

Here's an example. Depending on how my mail gets converted (or not), you might
see these as Latin1 chars or not, but if you run this in a terminal with the
display set to Latin1, you'll see the Latin1 that swish-e returns:

[karpet@pekmac:~/tmp]$ cat entmaker.pl
use Search::Tools::XML;

print "<html><body>\n";
for my $name ( sort keys %Search::Tools::XML::HTML_ents ) {
    my $num = $Search::Tools::XML::HTML_ents{$name};
    print "$name = &#$num;\n";
    print "$name = &amp;#$num;\n";
}
print "</body></html>\n";
[karpet@pekmac:~/tmp]$ perl entmaker.pl >ents.html
[karpet@pekmac:~/tmp]$ cat conf
IndexContents HTML2 .html
StoreDescription HTML2 <body>
[karpet@pekmac:~/tmp]$ swish-e -i ents.html -c conf
Indexing Data Source: "File-System"
Indexing "ents.html"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 477 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
477 unique words indexed.
5 properties sorted.
1 file indexed.  8,610 total bytes.  824 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!
[karpet@pekmac:~/tmp]$ swish-e -w amp -p swishdescription
# SWISH format: 2.5.6
# Search words: amp
# Removed stopwords:
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.007 seconds
1000 ents.html "ents.html" 8610 "AElig =  AElig = &#198; Aacute =  Aacute =
&#193; Acirc =  Acirc = &#194; Agrave =  Agrave = &#192; Alpha =  Alpha =
&#913; Aring =  Aring = &#197; Atilde =  Atilde = &#195; Auml =  Auml =
&#196; Beta =  Beta = &#914; Ccedil =  Ccedil = &#199; Chi =  Chi = &#935;
Dagger =  Dagger = &#8225; Delta =  Delta = &#916; ETH =  ETH = &#208; Eacute =
 Eacute = &#201; Ecirc =  Ecirc = &#202; Egrave =  Egrave = &#200; Epsilon =
 Epsilon = &#917; Eta =  Eta = &#919; Euml =  Euml = &#203; Gamma =  Gamma =
&#915; Iacute =  Iacute = &#205; Icirc =  Icirc = &#206; Igrave =  Igrave =
&#204; Iota =  Iota = &#921; Iuml =  Iuml = &#207; Kappa =  Kappa = &#922;
Lambda =  Lambda = &#923; Mu =  Mu = &#924; Ntilde =  Ntilde = &#209; Nu =  Nu
= &#925; OElig =  OElig = &#338; Oacute =  Oacute = &#211; Ocirc =  Ocirc =
&#212; Ograve =  Ograve = &#210; Omega =  Omega = &#937; Omicron =  Omicron =
&#927; Oslash =  Oslash = &#216; Otilde =  Otilde = &#213; Ouml =  Ouml =
&#214; Phi =  Phi = &#934; Pi =  Pi = &#928; Prime =  Prime = &#8243; Psi =  Psi
= &#936; Rho =  Rho = &#929; Scaron =  Scaron = &#352; Sigma =  Sigma = &#931;
THORN =  THORN = &#222; Tau =  Tau = &#932; Theta =  Theta = &#920; Uacute = 
Uacute = &#218; Ucirc =  Ucirc = &#219; Ugrave =  Ugrave = &#217; Upsilon =
Upsilon = &#933; Uuml =  Uuml = &#220; Xi =  Xi = &#926; Yacute =  Yacute =
&#221; Yuml =  Yuml = &#376; Zeta =  Zeta = &#918; aacute =  aacute = &#225;
acirc =  acirc = &#226; acute =  acute = &#180; aelig =  aelig = &#230;
agrave =  agrave = &#224; alefsym =  alefsym = &#8501; alpha =  alpha = &#945;
amp = & amp = &#38; and =  and = &#8743; ang =  ang = &#8736; apos = ' apos =
&#39; aring =  aring = &#229; asymp =  asymp = &#8776; atilde =  atilde =
&#227; auml =  auml = &#228; bdquo =  bdquo = &#8222; beta =  beta = &#946;
brvbar =  brvbar = &#166; bull =  bull = &#8226; cap =  cap = &#8745; ccedil =
 ccedil = &#231; cedil =  cedil = &#184; cent =  cent = &#162; chi =  chi =
&#967; circ =  circ = &#710; clubs =  clubs = &#9827; cong =  cong = &#8773;
copy =  copy = &#169; crarr =  crarr = &#8629; cup =  cup = &#8746; curren = 
curren = &#164; dArr =  dArr = &#8659; dagger =  dagger = &#8224; darr =  darr =
&#8595; deg =  deg = &#176; delta =  delta = &#948; diams =  diams = &#9830;
divide =  divide = &#247; eacute =  eacute = &#233; ecirc =  ecirc = &#234;
egrave =  egrave = &#232; empty =  empty = &#8709; emsp =  emsp = &#8195; ensp
=  ensp = &#8194; epsilon =  epsilon = &#949; equiv =  equiv = &#8801; eta =
eta = &#951; eth =  eth = &#240; euml =  euml = &#235; euro =  euro = &#8364;
exist =  exist = &#8707; fnof =  fnof = &#402; forall =  forall = &#8704; frac12
=  frac12 = &#189; frac14 =  frac14 = &#188; frac34 =  frac34 = &#190; frasl
=  frasl = &#8260; gamma =  gamma = &#947; ge =  ge = &#8805; gt = > gt = &#62;
hArr =  hArr = &#8660; harr =  harr = &#8596; hearts =  hearts = &#9829; hellip
=  hellip = &#8230; iacute =  iacute = &#237; icirc =  icirc = &#238; iexcl =
 iexcl = &#161; igrave =  igrave = &#236; image =  image = &#8465; infin =
infin = &#8734; int =  int = &#8747; iota =  iota = &#953; iquest =  iquest =
&#191; isin =  isin = &#8712; iuml =  iuml = &#239; kappa =  kappa = &#954;
lArr =  lArr = &#8656; lambda =  lambda = &#955; lang =  lang = &#9001; laquo =
 laquo = &#171; larr =  larr = &#8592; lceil =  lceil = &#8968; ldquo =  ldquo
= &#8220; le =  le = &#8804; lfloor =  lfloor = &#8970; lowast =  lowast =
&#8727; loz =  loz = &#9674; lrm =  lrm = &#8206; lsaquo =  lsaquo = &#8249;
lsquo =  lsquo = &#8216; lt = < lt = &#60; macr =  macr = &#175; mdash =  mdash
= &#8212; micro =  micro = &#181; middot =  middot = &#183; minus =  minus =
&#8722; mu =  mu = &#956; nabla =  nabla = &#8711; nbsp =   nbsp = &#160; ndash
=  ndash = &#8211; ne =  ne = &#8800; ni =  ni = &#8715; not =  not = &#172;
notin =  notin = &#8713; nsub =  nsub = &#8836; ntilde =  ntilde = &#241; nu =
 nu = &#957; oacute =  oacute = &#243; ocirc =  ocirc = &#244; oelig =  oelig
= &#339; ograve =  ograve = &#242; oline =  oline = &#8254; omega =  omega =
&#969; omicron =  omicron = &#959; oplus =  oplus = &#8853; or =  or = &#8744;
ordf =  ordf = &#170; ordm =  ordm = &#186; oslash =  oslash = &#248; otilde
=  otilde = &#245; otimes =  otimes = &#8855; ouml =  ouml = &#246; para = 
para = &#182; part =  part = &#8706; permil =  permil = &#8240; perp =  perp =
&#8869; phi =  phi = &#966; pi =  pi = &#960; piv =  piv = &#982; plusmn = 
plusmn = &#177; pound =  pound = &#163; prime =  prime = &#8242; prod =  prod =
&#8719; prop =  prop = &#8733; psi =  psi = &#968; quot = " quot = &#34; rArr =
 rArr = &#8658; radic =  radic = &#8730; rang =  rang = &#9002; raquo =  raquo
= &#187; rarr =  rarr = &#8594; rceil =  rceil = &#8969; rdquo =  rdquo =
&#8221; real =  real = &#8476; reg =  reg = &#174; rfloor =  rfloor = &#8971;
rho =  rho = &#961; rlm =  rlm = &#8207; rsaquo =  rsaquo = &#8250; rsquo =
rsquo = &#8217; sbquo =  sbquo = &#8218; scaron =  scaron = &#353; sdot =  sdot
= &#8901; sect =  sect = &#167; shy =  shy = &#173; sigma =  sigma = &#963;
sigmaf =  sigmaf = &#962; sim =  sim = &#8764; spades =  spades = &#9824; sub =
 sub = &#8834; sube =  sube = &#8838; sum =  sum = &#8721; sup =  sup = &#8835;
sup1 =  sup1 = &#185; sup2 =  sup2 = &#178; sup3 =  sup3 = &#179; supe =
supe = &#8839; szlig =  szlig = &#223; tau =  tau = &#964; there4 =  there4 =
&#8756; theta =  theta = &#952; thetasym =  thetasym = &#977; thinsp =  thinsp =
&#8201; thorn =  thorn = &#254; tilde =  tilde = &#732; times =  times =
&#215; trade =  trade = &#8482; uArr =  uArr = &#8657; uacute =  uacute =
&#250; uarr =  uarr = &#8593; ucirc =  ucirc = &#251; ugrave =  ugrave =
&#249; uml =  uml = &#168; upsih =  upsih = &#978; upsilon =  upsilon = &#965;
uuml =  uuml = &#252; weierp =  weierp = &#8472; xi =  xi = &#958; yacute = 
yacute = &#253; yen =  yen = &#165; yuml =  yuml = &#255; zeta =  zeta =
&#950; zwj =  zwj = &#8205; zwnj =  zwnj = &#8204;"
.
-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Thu Mar 26 22:07:38 2009