Re: words, entities and accents

From: Jose Manuel Ruiz <jmruiz(at)not-real.boe.es>
Date: Fri Jun 09 2000 - 06:50:18 GMT
Hi, Philip

liste@artware.qc.ca wrote:
> 
> Swish-e 1.3.2 doesn't index documents containing HTML entities properly.
> Because WORDCHARS doesn't contain ';', a word like "montr&eacute;al" is
> indexed as two words "montré" and "al".  I cured this temporarily by adding
> ';' to WORDCHARS.
> 

You are right, but you can also add ';' to WordCharacters in your config
file.
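
To illustrate why the missing ';' splits the word, here is a minimal sketch of a splitter driven by a word-character set. It is not the swish-e code: split_words() and MAX_WORD_LEN are hypothetical names, and the character sets used in the usage note are assumptions chosen only to reproduce the behaviour described above.

```c
#include <stddef.h>
#include <string.h>

#define MAX_WORD_LEN 64

/* Split `text` into runs of characters that all appear in `wordchars`.
 * Stores at most `max_words` words in `out` and returns how many were stored.
 * Hypothetical helper for illustration; not taken from the swish-e source. */
size_t split_words(const char *text, const char *wordchars,
                   char out[][MAX_WORD_LEN], size_t max_words)
{
    size_t count = 0;
    const char *p = text;

    while (*p != '\0' && count < max_words) {
        size_t len, copy;

        /* Skip characters that cannot be part of a word. */
        p += strcspn(p, wordchars);

        /* Measure the run of word characters that follows. */
        len = strspn(p, wordchars);
        if (len == 0)
            break;

        /* Copy the word, truncating it if it does not fit. */
        copy = len < MAX_WORD_LEN - 1 ? len : MAX_WORD_LEN - 1;
        memcpy(out[count], p, copy);
        out[count][copy] = '\0';
        count++;
        p += len;
    }
    return count;
}
```

With a set such as "abcdefghijklmnopqrstuvwxyz&" (no ';'), the input "montr&eacute;al" yields two words, "montr&eacute" and "al". Adding ';' to the set keeps it as the single word "montr&eacute;al", which can then be entity-decoded as a whole.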

> While digging in index.c to discover this, I see the following comment:
> 
>        /* Ok, can now go to lowercase, the whole problem
>           was with entities &Aacute; would become &aacute;
>         */
> 
> I find this strange, because it's EXACTLY what I want.  Otherwise,
> "R&Eacute;SEAU" becomes "rÉseau", and won't be found if you search for
> "réseau".  While I realise that I should be using locales so that
> tolower() does the right thing, I'd rather not go there.
> 
> Is there ever a case where it is undesirable for an HTML entity to be
> converted to lower case as-is?  Seeing as how we are going to convert it
> to lower case after converting to an ISO-latin-1 char anyway.
> 
> Also, it seems to me a better idea to convert all entities *before*
> looking for word boundaries.  This means that "&;" can be removed from
> WORDCHARS.  Is there any particular reason this isn't done now?
> 

Entities are converted in the convertentities function (which also calls
converttonamed and converttoascii). convertentities is executed before
lowercasing and before stripping the first and last characters, so
montr&eacute; becomes montré before it is lowercased.
In my language (Spanish), I also need montré to become montre, to avoid
search errors when people misspell words. This is why I added the
TranslateCharacters directive to the code (see my previous message to
the list).
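
The order of operations described above (decode entities, then lowercase, then translate accented characters) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the swish-e implementation: every function name here is hypothetical, the entity table is a four-entry subset, text is assumed to be ISO-8859-1, and the hard-coded é/á mapping merely stands in for whatever a TranslateCharacters setting would supply.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative subset of named entities with their ISO-8859-1 byte values. */
struct entity { const char *name; unsigned char latin1; };
static const struct entity ENTITIES[] = {
    { "&eacute;", 0xE9 }, { "&Eacute;", 0xC9 },
    { "&aacute;", 0xE1 }, { "&Aacute;", 0xC1 },
};

/* Step 1: replace known named entities, in place, with one Latin-1 byte. */
void decode_entities(char *s)
{
    char *dst = s;
    size_t n = sizeof ENTITIES / sizeof ENTITIES[0];

    while (*s != '\0') {
        size_t i;
        for (i = 0; i < n; i++) {
            size_t len = strlen(ENTITIES[i].name);
            if (strncmp(s, ENTITIES[i].name, len) == 0) {
                *dst++ = (char)ENTITIES[i].latin1;
                s += len;
                break;
            }
        }
        if (i == n)            /* no entity matched: copy the byte as-is */
            *dst++ = *s++;
    }
    *dst = '\0';
}

/* Step 2: lowercase ASCII and Latin-1 letters without relying on locales. */
void lower_latin1(char *s)
{
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if ((c >= 'A' && c <= 'Z') ||
            (c >= 0xC0 && c <= 0xDE && c != 0xD7))   /* 0xD7 is the times sign */
            *s = (char)(c + 0x20);
    }
}

/* Step 3: map each byte found in `from` to the byte at the same index in
 * `to`. Both strings must have the same length. */
void translate_chars(char *s, const char *from, const char *to)
{
    for (; *s != '\0'; s++) {
        const char *hit = strchr(from, *s);
        if (hit != NULL)
            *s = to[hit - from];
    }
}

/* Full pipeline in the order discussed above. */
void normalize_word(char *s)
{
    decode_entities(s);
    lower_latin1(s);
    translate_chars(s, "\xE9\xE1", "ea");   /* assumed mapping: é->e, á->a */
}
```

Because the entity is decoded before lowercasing, "R&Eacute;SEAU" passes through "RÉSEAU" and "réseau" and ends up as "reseau", and "montr&eacute;" ends up as "montre", so accented and unaccented spellings index to the same word.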

> I note that the code to split words up is duplicated in 4 functions
> (countwords(), countwordstr(), parsecomment() and
> parseMetaData()).  This makes things like changing the entities handling a
> tad error-prone, to say the least.  Wouldn't it be better for each
> function to look for strings that are to be converted to words, then call
> addstring() (say), which does word splitting, entity handling and calls
> addentry()?  I'll write the code, but would like opinions first.  And, if
> anyone has a torture test or coverage test to make sure I don't break
> something, I'll be needing that....
> 

You are right. swish-e-1.3.2 has several weaknesses in its code. Take a
look at the memory handling: you will see many more malloc, realloc and
strdup calls than free calls. As for buffer overruns and performance,
swish-e has serious flaws in its design.
Fortunately, many people have worked on it. I have tried to merge all
these patches, along with many new features and better performance, into
the package. Take a look at:
http://www.boe.es/swish-e

Have a nice day 

Jose Ruiz

jmruiz@boe.es
Received on Fri Jun 9 02:56:31 2000