words, entities and accents

From: <liste(at)not-real.artware.qc.ca>
Date: Thu Jun 08 2000 - 22:05:33 GMT
Swish-e 1.3.2 doesn't index documents containing HTML entities properly. 
Because WORDCHARS doesn't contain ';', a word like "montr&eacute;al" is
indexed as two words "montré" and "al".  I cured this temporarily by adding
';' to WORDCHARS.


While digging in index.c to discover this, I saw the following comment:

       /* Ok, can now go to lowercase, the whole problem
          was with entities &Aacute; would become &aacute;
        */

I find this strange, because it's EXACTLY what I want.  Otherwise,
"R&Eacute;SEAU" becomes "rÉseau", and won't be found if you search for
"réseau".  While I realise that I should be using locales so that
tolower() does the right thing, I'd rather not go there.

Is there ever a case where it is undesirable for an HTML entity to be
converted to lower case as-is?  After all, we are going to convert it to
lower case after converting it to an ISO-Latin-1 char anyway.
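
For illustration, here is a minimal sketch of lowercasing ISO-Latin-1 text
without touching locales, applied after entities have become single bytes.
The names latin1_tolower() and latin1_lower_str() are hypothetical and are
not part of swish-e; the sketch assumes the text really is ISO-8859-1.

```c
/* Locale-free lowercasing for ISO-8859-1 bytes (a sketch; these
   helpers are hypothetical and not part of swish-e). */
static unsigned char latin1_tolower(unsigned char c)
{
    /* ASCII A-Z, plus accented capitals 0xC0-0xDE except 0xD7,
       which is the multiplication sign and has no lowercase form. */
    if ((c >= 'A' && c <= 'Z') || (c >= 0xC0 && c <= 0xDE && c != 0xD7))
        return (unsigned char)(c + 0x20);
    return c;
}

/* Lowercase a NUL-terminated Latin-1 string in place. */
static void latin1_lower_str(char *s)
{
    for (; *s; s++)
        *s = (char)latin1_tolower((unsigned char)*s);
}
```

With this, the decoded form of "R&Eacute;SEAU" (byte 0xC9 for the E-acute)
folds to the same bytes as "réseau" in Latin-1.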


Also, it seems to me a better idea to convert all entities *before*
looking for word boundaries.  This means that "&;" can be removed from
WORDCHARS.  Is there any particular reason this isn't done now?  
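
A rough sketch of the "decode first" idea, assuming ISO-8859-1 output and a
deliberately tiny entity table.  The struct, the table and decode_entities()
are all hypothetical; this is not how index.c does it today.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, deliberately tiny entity table (ISO-8859-1 code points).
   A real indexer would need the full HTML entity list. */
struct entity { const char *name; unsigned char ch; };
static const struct entity entities[] = {
    { "amp", '&' }, { "eacute", 0xE9 }, { "Eacute", 0xC9 },
    { "aacute", 0xE1 }, { "Aacute", 0xC1 },
};

/* Copy src to dst, replacing known "&name;" entities with one Latin-1
   byte.  Unknown or unterminated entities are copied through unchanged.
   dst needs room for strlen(src) + 1 bytes; the output never grows. */
static void decode_entities(const char *src, char *dst)
{
    while (*src) {
        if (*src == '&') {
            const char *semi = strchr(src, ';');
            if (semi) {
                size_t i, n = sizeof entities / sizeof entities[0];
                size_t len = (size_t)(semi - src - 1);
                for (i = 0; i < n; i++) {
                    if (strlen(entities[i].name) == len &&
                        strncmp(entities[i].name, src + 1, len) == 0)
                        break;
                }
                if (i < n) {            /* known entity: emit one byte */
                    *dst++ = (char)entities[i].ch;
                    src = semi + 1;
                    continue;
                }
            }
        }
        *dst++ = *src++;
    }
    *dst = '\0';
}
```

After this pass "montr&eacute;al" contains neither '&' nor ';', so the word
splitter sees a single run of word characters.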


I note that the code to split words up is duplicated in four functions
(countwords(), countwordstr(), parsecomment() and
parseMetaData()).  This makes things like changing the entities handling a
tad error-prone, to say the least.  Wouldn't it be better for each
function to look for strings that are to be converted to words, then call
addstring() (say), which does word splitting, entity handling and calls
addentry()?  I'll write the code, but would like opinions first.  And, if
anyone has a torture test or coverage test to make sure I don't break
something, I'll be needing that....
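
To make the proposal concrete, here is a sketch of the shape such a shared
routine might take.  It covers only the word-splitting part; the point where
entity decoding and case folding would go is marked with a comment.  The
callback type stands in for addentry(), whose real signature is not shown
here, and MAXWORDLEN, is_wordchar() and remember_last() are likewise made up
for the example.

```c
#include <stddef.h>
#include <string.h>

#define MAXWORDLEN 64   /* hypothetical limit for this sketch */

/* Hypothetical stand-in for addentry(). */
typedef void (*word_fn)(const char *word, void *ctx);

/* Assumed word characters once entities are already decoded:
   ASCII letters and digits plus Latin-1 letters; no '&' or ';'. */
static int is_wordchar(unsigned char c)
{
    return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= 0xC0 && c != 0xD7 && c != 0xF7);
}

/* Split text into words and hand each one to emit().
   Returns the number of words emitted. */
static int addstring(const char *text, word_fn emit, void *ctx)
{
    char word[MAXWORDLEN + 1];
    size_t len = 0;
    int count = 0;

    /* Entity decoding and case folding would happen here, once,
       instead of separately in each of the four callers. */
    for (;; text++) {
        unsigned char c = (unsigned char)*text;
        if (c != '\0' && is_wordchar(c)) {
            if (len < MAXWORDLEN)       /* silently truncate long words */
                word[len++] = (char)c;
        } else {
            if (len > 0) {              /* end of a word: emit it */
                word[len] = '\0';
                emit(word, ctx);
                count++;
                len = 0;
            }
            if (c == '\0')
                break;
        }
    }
    return count;
}

/* Example callback: remembers the last word seen in ctx, which must
   point to a buffer of at least MAXWORDLEN + 1 bytes. */
static void remember_last(const char *word, void *ctx)
{
    strncpy((char *)ctx, word, MAXWORDLEN);
    ((char *)ctx)[MAXWORDLEN] = '\0';
}
```

Each of the four functions would then only locate the text to index and call
addstring() on it, so a change to entity handling lands in one place.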

-Philip
Received on Thu Jun 8 18:08:12 2000