I think this harks back to the Unicode vs. ISO-8859-1 debacle that seems
to recur (someone correct me here).
Your Unicode character entities are being converted to ISO-8859-1 and
stored that way, as (I think) whitespace, in the index, because Japanese
characters have no 8859-1 equivalent. So searching on entities won't work.
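
A rough sketch in Python of why the search comes back empty. This is not Swish-e's code: to_latin1_lossy is a hypothetical helper, the decimal entities are one possible spelling of the query, and replacing unmappable characters with a space is only the assumption stated above.

```python
import html


def to_latin1_lossy(text: str) -> str:
    """Replace every character outside ISO-8859-1 with a space.

    Hypothetical stand-in for the conversion described above; the
    space replacement is an assumption, not confirmed Swish-e behaviour.
    """
    return "".join(ch if ord(ch) < 256 else " " for ch in text)


# Numeric character references for five hiragana characters
# (illustrative spelling of the query as HTML entities).
entity_query = "&#12400;&#12435;&#12372;&#12399;&#12435;"

decoded = html.unescape(entity_query)  # entities -> Unicode characters
indexed = to_latin1_lossy(decoded)     # what would survive an 8859-1 index

print(repr(decoded))
print(repr(indexed))

# Latin-1 text, by contrast, passes through unchanged.
print(repr(to_latin1_lossy("caf\u00e9")))
```

Under that assumption nothing but whitespace reaches the index for these words, so neither the entities nor the native characters have anything to match.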
The ConvertHTMLEntities config option only applies when *not* using
libxml2 as your parser, so you're out of luck that way too.
Not good news, if I am understanding this correctly.
Search the email archives for more on the Unicode thread. As I
understand it, supporting Unicode would take a major rewrite, and no one
has stepped forward to say "me! me! I'll do it!".
Pieter Claerhout supposedly wrote on 03/25/2004 09:17 AM:
> Hi all,
> I recently started using Swish-E for indexing some HTML content. The
> indexing works just fine, but I'm still struggling with the search part
> using the command line.
> In the HTML I index, there are a lot of HTML entities embedded. So far, no
> problem as everything indexes just fine.
> However, if I want to do a search, the command line doesn't accept HTML
> entities in the search string, but requires the original Unicode characters.
> Is there a way to have it accept HTML entities for searching?
> An example:
> The document that gets indexed looks as follows:
> <title>beInformed 1.0</title>
> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
> The search command I tried is as follows:
> C:\>swish-e -w "ばんごはん"
> # SWISH format: 2.4.1
> # Search words: ばんごはん
> # Removed stopwords:
> err: no results
> Is there a way to make this work? I don't want to use the native characters
> in the command line (they are Japanese)...
> Thanks in advance,
Peter Karman - Software Publications Programmer - Cray Inc
phone: 651-605-9009 - mailto:firstname.lastname@example.org
Received on Thu Mar 25 08:10:39 2004