Problem with foreign characters

From: Zambra - Michael <michael(at)>
Date: Fri Dec 21 2001 - 12:31:33 GMT

I'd appreciate any help with my problem (see below). I have posted several messages to the list but haven't been able to solve it yet.
swish-e seems to index correctly, but the CGI script provided doesn't display foreign characters (...). You can see it here:

The index contains a page with the word "Camarón".
If I search for "Camarón", the search engine shows the hit, but without the accented character. Bill pointed out that the indexer might still be working incorrectly, because it could be indexing "camar" and "n" and interpreting "ó" as a blank. I don't think so, because the engine is unable to find "camar" or "n".
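
A minimal sketch of the behaviour Bill describes, assuming a simplified tokenizer that lowercases the input and treats every character outside the WordCharacters set as a separator. The function `split_words` and the two character sets are hypothetical and for illustration only; this is not Swish-e's actual code.

```python
import string

def split_words(text, word_chars):
    """Split text into lowercase words, treating any character
    outside word_chars as a separator."""
    words, current = [], []
    for ch in text.lower():
        if ch in word_chars:
            current.append(ch)
        elif current:
            # A non-word character ends the word collected so far.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

ascii_only = set(string.ascii_lowercase)   # no accented characters
with_accent = ascii_only | {"ó"}           # accented character allowed

print(split_words("Camarón", ascii_only))   # ['camar', 'n']
print(split_words("Camarón", with_accent))  # ['camarón']
```

Under this model, an index built without the accented character would match searches for "camar" and "n". Those searches returning nothing suggests the word was indexed whole.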

Any assistance is greatly appreciated.


Previous message history==========================


I have installed Swish-e (latest dev) on a Unix system (Sun OS 5.7). I think I have done it successfully. I have linked to the xml-parsing and the zlib libraries. 

I can index without problems and have used the swish.cgi script to run searches. It yields correct results, but always omits the foreign characters, as [].

I find this VERY strange, because if it finds words with foreign characters (I did a search for "Camarón" and it yielded results), why does it show the results without these characters in the page title lines?
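
One way a hit can be found yet displayed without its accented character is a step that only passes 7-bit ASCII. A minimal sketch, assuming the page is stored as ISO-8859-1 and that some stage between the index and the browser drops bytes above 0x7F. Whether swish.cgi or the server actually does this here is an assumption, not something confirmed.

```python
# "ó" is the single byte 0xF3 in ISO-8859-1.
title = "Camarón".encode("iso-8859-1")
print(title)                          # b'Camar\xf3n'

# A stage that keeps only 7-bit ASCII silently drops that byte.
stripped = title.decode("ascii", errors="ignore")
print(stripped)                       # Camarn

# Decoding with the matching charset preserves it.
preserved = title.decode("iso-8859-1")
print(preserved)                      # Camarón
```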

Thanks in advance for any help.



Hello Bill,

it's great to hear from one of the Swish-developers!
Thanks for your detailed reply.

> The WordCharacters setting is missing the .  I think that's a mistake.  I
> have to look at ISO-8859-1 chars for another project, so I'll try to
> Swish-e's default wordcharacter settings.

Great. Before compiling I had modified config.h in order to suit it to my
needs, and included "". But it didn't work (I don't know why).

# WordCharacters:

I added the following directives to the config file:

WordCharacters  .abcdefghijklmnopqrstuvwxyz
BeginCharacters abcdefghijklmnopqrstuvwxyz
EndCharacters   abcdefghijklmnopqrstuvwxyz

but that didn't work either. Strangely enough, version 2.0.5 had worked

The explanation about the search results ("Camarón" actually being indexed
as "Camar" "n") makes a lot of sense. Thanks.

Adding the WordCharacters directive should solve the problem, shouldn't it?
The problem



Dear Bill,

I have installed the Windows port, generated an index file from my local
file system, uploaded the index to my Unix host, and it works perfectly!

Unfortunately, I have no root permissions, so I always have additional work
to do when installing software on my host (I have to use -PREFIX and
environment variables all the time...). Could this behaviour (ignoring
foreign characters as WordCharacters) of my Swish-e be caused by a faulty
installation? Is there a way to check this?



Dear Bill,

thanks for all your efforts. I have followed your instructions, bar one
thing. It is funny, but I'm no longer able to type in extended characters,
although I was able to until yesterday. I've been playing with RC_LANG and
RC_LC_ALL, but I have not been able to type in accented vowels in my Telnet

I think the page is being indexed correctly. Instead of the command line
search I have done a search through your script:

The problem is there.

===================log of session
bash$ cat swish-e.conf

IndexFile /opt2/zambra/httpd/cgi-bin/sw/idx/index.swish-e
IndexDir ../../../htdocs/camaron.html
ReplaceRules replace "../../../htdocs" ""
FollowSymLinks yes

WordCharacters  .abcdefghijklmnopqrstuvwxyz
BeginCharacters abcdefghijklmnopqrstuvwxyz
EndCharacters   abcdefghijklmnopqrstuvwxyz
IgnoreFirstChar .
IgnoreLastChar  .

bash$ cat ../../../htdocs/camaron.html

Test page for search term <b>Camarn</b>


bash$ ./swish-e -c swish-e.conf -T parsed_words indexed_words -v 0
Indexing Data Source: "File-System"
White-space found word 'Camarn'
    Adding:[swishdefault:1]   'camarn'   Pos:1  Stuct:0x7 ( HEAD TITLE
White-space found word 'Test'
    Adding:[swishdefault:1]   'test'   Pos:2  Stuct:0x9 ( BODY FILE )
White-space found word 'page'
    Adding:[swishdefault:1]   'page'   Pos:3  Stuct:0x9 ( BODY FILE )
White-space found word 'for'
    Adding:[swishdefault:1]   'for'   Pos:4  Stuct:0x9 ( BODY FILE )
White-space found word 'search'
    Adding:[swishdefault:1]   'search'   Pos:5  Stuct:0x9 ( BODY FILE )
White-space found word 'term'
    Adding:[swishdefault:1]   'term'   Pos:6  Stuct:0x9 ( BODY FILE )
White-space found word 'Camarn'
    Adding:[swishdefault:1]   'camarn'   Pos:7  Stuct:0x49 ( EM BODY FILE )
Indexing done!
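
Since the Telnet session may no longer pass extended characters, it could be worth confirming that the accented bytes actually reached the config file. A minimal sketch, assuming the file is ISO-8859-1 encoded, so each accented character is one byte above 0x7F. `high_bytes_by_directive` is a hypothetical helper, not part of Swish-e, and the sample bytes are made up for the demonstration.

```python
def high_bytes_by_directive(config_bytes):
    """Map each directive name to the bytes above 0x7F found in its value."""
    report = {}
    for line in config_bytes.splitlines():
        parts = line.split(None, 1)
        # Skip blank lines, comments, and directives without a value.
        if len(parts) < 2 or parts[0].startswith(b"#"):
            continue
        name, value = parts
        key = name.decode("ascii", errors="replace")
        report[key] = [hex(b) for b in value if b > 0x7F]
    return report

# Made-up sample: 0xF3 and 0xE9 are "ó" and "é" in ISO-8859-1.
sample = b"WordCharacters .abc\xf3\xe9\nBeginCharacters abc\n"
print(high_bytes_by_directive(sample))
# {'WordCharacters': ['0xf3', '0xe9'], 'BeginCharacters': []}
```

To run it against the real file, pass `open("swish-e.conf", "rb").read()`. An empty list for WordCharacters would mean the accented characters never made it into the file.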

Received on Fri Dec 21 12:31:43 2001