Re: Fw: Re: 8-bit chars

From: John Angel <angel_john(at)not-real.hotmail.com>
Date: Sun Dec 14 2003 - 20:05:20 GMT
You are right, but your examples are theoretical.

In practice, there are no cats and dogs both represented by 123.

What is the alternative to the proposed solution, since we don't have UTF-8
yet?


----- Original Message ----- 
From: "Bill Conlon" <bill@tothept.com>
To: "Multiple recipients of list" <swish-e@sunsite.berkeley.edu>
Sent: Sunday, December 14, 2003 20:34
Subject: [SWISH-E] Re: Fw: Re: 8-bit chars


> Let me try:
>
> Two documents:
>
> Document 1 Encoding A:  has the word 'cat' represented as numbers 123.
> Document 2 Encoding B:  has the word 'dog' represented as numbers 123.
>
> Both documents are spidered.  So the index has some pointers, and the
> pointer for the word represented as '123' points to both Document 1 and
> Document 2.
>
> The index does not know the encoding.  So when I search for 'cat' I get
> two documents, even though one only contains the word with the meaning
> 'cat'.
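
[Illustration] The collision described above can be reproduced with real
code pages. This is only a minimal sketch, not how Swish-e stores words: the
document list, build_raw_index, and the choice of cp1251 and latin-1 are
assumptions made for the example. The bytes EA EE F2 spell "кот" (Russian
for "cat") in cp1251 and a different three-letter string in latin-1.

```python
from collections import defaultdict

# Hypothetical corpus: (document id, declared encoding, raw bytes of its one word).
documents = [
    ("doc1", "cp1251", b"\xea\xee\xf2"),   # these bytes mean "кот" ("cat") in cp1251
    ("doc2", "latin-1", b"\xea\xee\xf2"),  # the same bytes are a different word in latin-1
]


def build_raw_index(docs):
    """Index words by their raw bytes, ignoring the declared encoding."""
    index = defaultdict(list)
    for doc_id, _encoding, word_bytes in docs:
        index[word_bytes].append(doc_id)
    return index


raw_index = build_raw_index(documents)

# Searching for "кот" as cp1251 bytes matches both documents,
# although only doc1 actually contains that word.
hits = raw_index["кот".encode("cp1251")]
print(hits)
```

Because the index key carries no encoding, the lookup cannot tell the two
documents apart.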
>
> What I think you are asking for is to add the encoding to the index, so
> instead of just a representation:
> 123 --> Document1, Document 2
>
> you want
> A123 --> Document 1
> B123 --> Document 2
>
> Now what do you do about Encoding C, where 'cat' is also represented as
> 123?
>
> C123 --> Document 3
>
> Now I search for cat = A123, and only obtain Document 1, even though
> semantically, I want both Document 1 and Document 3.
>
> The index is useful because it captures 'meaning'.  How do you propose to
> build in a semantic parser so that the index can know that the word 'cat'
> is what is meant by different encodings?  That is, how do we know that
>
> A123 is equivalent to C123, but is different from B123?
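
[Illustration] A sketch of the A123/B123/C123 problem, and of one way the
equivalence can be recovered: decode every word to a common representation
(Unicode strings here) before indexing. Nothing below is Swish-e code; the
function names and the corpus are hypothetical, and cp1251, latin-1 and
mac_cyrillic merely stand in for Encodings A, B and C. The sketch relies on
cp1251 and mac_cyrillic encoding "кот" with the same bytes.

```python
from collections import defaultdict

# Hypothetical corpus: (document id, declared encoding, raw bytes of its one word).
documents = [
    ("doc1", "cp1251", b"\xea\xee\xf2"),        # Encoding A: "кот"
    ("doc2", "latin-1", b"\xea\xee\xf2"),       # Encoding B: same bytes, different word
    ("doc3", "mac_cyrillic", b"\xea\xee\xf2"),  # Encoding C: "кот" again
]


def build_tagged_index(docs):
    """Key each word by (encoding, raw bytes): the A123 / B123 / C123 scheme."""
    index = defaultdict(list)
    for doc_id, encoding, word_bytes in docs:
        index[(encoding, word_bytes)].append(doc_id)
    return index


def build_decoded_index(docs):
    """Decode each word with its declared encoding and key by the decoded text."""
    index = defaultdict(list)
    for doc_id, encoding, word_bytes in docs:
        index[word_bytes.decode(encoding)].append(doc_id)
    return index


tagged = build_tagged_index(documents)
decoded = build_decoded_index(documents)

# Tagged keys separate A from B, but also separate A from C.
tagged_hits = tagged[("cp1251", "кот".encode("cp1251"))]

# Decoded keys group A with C and keep B apart.
decoded_hits = decoded["кот"]
other_hits = decoded[b"\xea\xee\xf2".decode("latin-1")]

print(tagged_hits, decoded_hits, other_hits)
```

The decoded index needs each document's encoding to be known at indexing
time, and needs an index that can hold the decoded characters, which is the
point under dispute in this thread.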
>
> >> There is NO WAY to store more than one encoding in the index as it is
> >> currently designed.
> >>
> >> And that's exactly what you are asking to do.  You want to have libxml2
> >> convert the document back to its original encoding when storing the
> >> words in the index -- "as-is" -- and that's trying to store more than
> >> one encoding in the index at the same time.
> >
> >
> >Yes, that is exactly what I am asking to do.
> >
> >Forget about encodings, or you won't see the wider picture.
> >
> >Think: how can we index documents presented in 3 different languages
> >(without utf-8 support)? This is the only solution, and it works.
> >
>
>
> Bill Conlon
>
> To the Point
> 345 California Avenue Suite 2
> Palo Alto, CA 94306
>
> office: 650.327.2175
> fax:    650.329.8335
> mobile: 650.906.9929
> e-mail: mailto:bill@tothept.com
> web:    http://www.tothept.com
>
>
>
Received on Sun Dec 14 20:05:28 2003