Re: xml format errors

From: <brad(at)not-real.auroraquanta.com>
Date: Tue Aug 05 2003 - 02:00:43 GMT
Interesting. Changing my XML declaration from:

<?xml version="1.0"?>

to

<?xml version="1.0" encoding="ISO-8859-1"?>

has eliminated all of the errors. I am building an index now to see if it
searches OK.
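
For anyone hitting the same errors: a likely explanation is that an XML parser has to assume UTF-8 (or UTF-16) when the declaration names no encoding, so raw ISO-8859-1 bytes such as 0xF1 (ñ) are rejected as malformed. Below is a minimal sketch in Python. It uses the standard library's expat-based parser rather than libxml2, and the sample record is made up for illustration, so treat it as a demonstration of the principle and not of what swish-e does internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: "El Nino" with n-tilde stored as ISO-8859-1,
# so the accented letter is the single byte 0xF1.
body = b"<doc>El Ni\xf1o</doc>"

def try_parse(prolog: bytes):
    """Return the element text, or None if the parser rejects the bytes."""
    try:
        return ET.fromstring(prolog + body).text
    except ET.ParseError:
        return None

# No encoding named: the parser assumes UTF-8 and 0xF1 is not valid there.
without_decl = try_parse(b'<?xml version="1.0"?>')

# Encoding named: the same bytes are read as ISO-8859-1.
with_decl = try_parse(b'<?xml version="1.0" encoding="ISO-8859-1"?>')

print(repr(without_decl))
print(repr(with_decl))
```

The declaration only helps if the files really are ISO-8859-1; if some records are in another encoding they would parse without errors but index the wrong characters.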

Thanks for all the help, everyone.

Brad
------------------------------------------------------------
 Brad Miele
 Chief Technology Officer
 Aurora & Quanta Productions
 bmiele@auroraquanta.com
 (207)828-8787 x110

Seeing emptiness, have compassion. --Milarepa

On Mon, 4 Aug 2003, Bill Moseley wrote:

> On Mon, Aug 04, 2003 at 03:28:27PM -0700, Brad Miele wrote:
> > Was there a way to tell libxml to accept the characters? I am concerned as
> > we are starting to index a lot of German and Spanish stuff, and it seems
> > that these records are the culprit.
>
> Libxml2 should detect the input encoding of the document.  I actually
> have not tested that before, though.  Internally, libxml2 uses UTF-8
> encoding.  libxml2 has a built-in function to convert UTF-8 into
> Latin1/8859-1 and swish-e uses that function to convert the output from
> libxml2 (UTF-8) into 8859-1 for indexing.  That's done because swish-e
> only works with 8-bit characters at this time.
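
The UTF-8 to 8859-1 step described above can be sketched in Python as follows. This is only an illustration of the conversion, not swish-e's code: swish-e does this in C through libxml2 (presumably the `UTF8Toisolat1` function), and the helper name `utf8_to_latin1` is made up here.

```python
def utf8_to_latin1(data: bytes) -> bytes:
    """Decode UTF-8 bytes and re-encode them as ISO-8859-1 (one byte per character)."""
    return data.decode("utf-8").encode("iso-8859-1")

# "Nino" with n-tilde (U+00F1) and "Gross" with sharp s (U+00DF), as UTF-8 input.
nino = utf8_to_latin1("Ni\u00f1o".encode("utf-8"))
gross = utf8_to_latin1("Gro\u00df".encode("utf-8"))

# Characters outside Latin-1, here the euro sign (U+20AC), have no 8-bit form,
# so the conversion cannot succeed for them.
try:
    utf8_to_latin1("\u20ac".encode("utf-8"))
    euro_ok = True
except UnicodeEncodeError:
    euro_ok = False

print(nino, gross, euro_ok)
```

Spanish and German text fits entirely within ISO-8859-1, which is why the 8-bit limitation does not matter for those languages.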
>
> Libxml2 should also know how to convert most entities, for example
> &copy;.
>
> Spanish and German should be no problem.
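
As a rough illustration of the entity handling, here is what decoding those named entities looks like with Python's standard `html` module. This is a stand-in for libxml2's HTML parser, not the same code path, and the variable names are only for this sketch.

```python
import html

# Named entities from the sample document, decoded to Unicode characters.
line1 = html.unescape("El Ni&ntilde;o")
line2 = html.unescape("Das ist Gro&szlig;")
copy = html.unescape("&copy;")

# All three results fit in ISO-8859-1, so they survive the 8-bit conversion.
latin1 = [s.encode("iso-8859-1") for s in (line1, line2, copy)]

print(latin1)
```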
>
> moseley@bumby:~$ echo $LANG
> en_US
>
> moseley@bumby:~$ cat 1.html
> <html>
> <head><title>Title</title>
> </head>
> <body>
> El Ni&ntilde;o
> Das ist Gro&szlig;
> </body>
> </html>
>
> (I assume this will make it through my mailer)
>
> moseley@bumby:~$ swish-e -i 1.html -T parsed_words indexed_words -v0
>
> White-space found word 'Title'
>     Adding:[1:swishdefault(1)]   'title'   Pos:2  Stuct:0x7 ( HEAD TITLE FILE )
> White-space found word 'El'
>     Adding:[1:swishdefault(1)]   'el'   Pos:5  Stuct:0x9 ( BODY FILE )
> White-space found word 'Niño'
>     Adding:[1:swishdefault(1)]   'niño'   Pos:6  Stuct:0x9 ( BODY FILE )
> White-space found word 'Das'
>     Adding:[1:swishdefault(1)]   'das'   Pos:7  Stuct:0x9 ( BODY FILE )
> White-space found word 'ist'
>     Adding:[1:swishdefault(1)]   'ist'   Pos:8  Stuct:0x9 ( BODY FILE )
> White-space found word 'Groß'
>     Adding:[1:swishdefault(1)]   'groß'   Pos:9  Stuct:0x9 ( BODY FILE )
>
>
> moseley@bumby:~$ swish-e -w groß -H0
> 1000 1.html "Title" 98
>
>
> moseley@bumby:~$ xml2-config --version
> 2.5.7
>
>
>
> > I am afraid that I am a newbie to the worlds of both XML and
> > character sets.
>
> Yes, it's bad living in the world of ASCII and 8859-1 because I have not
> needed to learn to work with other character sets.
>
>
>
> --
> Bill Moseley
> moseley@hank.org
>
>
Received on Tue Aug 5 02:00:53 2003