On Sun, Dec 14, 2003 at 12:05:10PM -0800, John Angel wrote:
> You are right, but your examples are theory.
> In practice, there are no cats and dogs both represented with 123.
No, Bill Conlon's example is right on. I suggest you read it again. I
don't think he was being literal about "cat" and "dog" represented with
123. Did you?
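To make Bill's point less abstract, here is a rough Python sketch of the collision with real bytes. The tiny indexer and the document names are made up for illustration and are not how swish-e stores anything; the only assumption is that the indexer sees bytes and nothing about the encoding.

```python
from collections import defaultdict

# Hypothetical mini-indexer: it sees only bytes, never the encoding.
def build_raw_index(docs):
    index = defaultdict(set)
    for name, raw in docs.items():
        for word in raw.split():
            index[word].add(name)
    return index

# The Russian word for "cat", stored as Windows-1251 bytes.
cat_bytes = "кот".encode("cp1251")

# The very same bytes are also valid Latin-1, where they spell different text.
latin1_text = cat_bytes.decode("latin-1")

docs = {
    "doc1": cat_bytes,                      # Encoding A: the word "кот"
    "doc2": latin1_text.encode("latin-1"),  # Encoding B: same bytes, other text
}

raw_index = build_raw_index(docs)
hits = raw_index[cat_bytes]
print(sorted(hits))
```

Both documents come back for the one key, although only doc1 contains the word that was searched for.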
> What is the alternative for proposed solution, since we don't have utf-8
For what you are asking? There is none. The solution IS utf-8. That's
why there is utf-8 (and Unicode) after all, to solve just this problem.
Maybe read this too:
or one of the other many, many sites that explain this.
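As a minimal sketch of what "the solution is utf-8" means in practice: decode every document from its own encoding, then index the words in one common encoding. This assumes the encoding of each document is known at index time; the function name and the three sample documents are hypothetical, and this is not swish-e's actual code.

```python
from collections import defaultdict

# Hypothetical indexer that normalizes every document to UTF-8 first.
def build_utf8_index(docs):
    index = defaultdict(set)
    for name, (raw, encoding) in docs.items():
        text = raw.decode(encoding)            # bytes -> characters
        for word in text.split():
            index[word.encode("utf-8")].add(name)  # one encoding in the index
    return index

word = "кот"  # Russian for "cat"
docs = {
    "doc1": (word.encode("cp1251"), "cp1251"),   # Encoding A: "cat"
    "doc2": (word.encode("cp1251"), "latin-1"),  # Encoding B: same bytes, other text
    "doc3": (word.encode("koi8-r"), "koi8-r"),   # Encoding C: "cat", different bytes
}

index = build_utf8_index(docs)
hits = index[word.encode("utf-8")]
print(sorted(hits))
```

Here doc1 and doc3 end up under the same key because they contain the same characters, and doc2 gets its own key even though its raw bytes are identical to doc1's.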
> ----- Original Message -----
> From: "Bill Conlon" <firstname.lastname@example.org>
> To: "Multiple recipients of list" <email@example.com>
> Sent: Sunday, December 14, 2003 20:34
> Subject: [SWISH-E] Re: Fw: Re: 8-bit chars
> > Let me try:
> > Two documents:
> > Document 1 Encoding A: has the word 'cat' represented as numbers 123.
> > Document 2 Encoding B: has the word 'dog' represented as numbers 123.
> > Both documents are spidered. So the index has some pointers, and the
> > pointer for the word represented as '123' points to both Document 1 and
> > Document 2.
> > The index does not know the encoding. So when I search for 'cat' I get
> > two documents, even though only one of them contains the word with the
> > meaning 'cat'.
> > What I think you are asking for is to add the encoding to the index, so
> > instead of just a representation:
> > 123 --> Document1, Document 2
> > you want
> > A123 --> Document 1
> > B123 --> Document 2
> > Now what do you do about Encoding C, where 'cat' is also represented as
> > 123?
> > C123 --> Document 3
> > Now I search for cat = A123, and only obtain Document 1, even though
> > semantically, I want both Document 1 and Document 3.
> > The index is useful because it captures 'meaning'. How do you propose to
> > build in a semantic parser so that the index can know the word 'cat' is
> > what is meant by different encodings? That is, how do we know that
> > A123 is equivalent to C123, but is different from B123?
> > >> There is NO WAY to store more than one encoding in the index as it is
> > >> currently designed.
> > >>
> > >> And that's exactly what you are asking to do. You want to have libxml2
> > >> convert the document back to its original encoding when storing the
> > >> words in the index -- "as-is" -- and that's trying to store more than
> > >> one encoding in the index at the same time.
> > >
> > >
> > >Yes, that is exactly what I am asking to do.
> > >
> > >Forget about encodings, you won't see the wider picture.
> > >
> > >Think how can we index documents presented in 3 different languages
> > >utf-8 support)? This is the only solution, and it works.
> > >
> > Bill Conlon
> > To the Point
> > 345 California Avenue Suite 2
> > Palo Alto, CA 94306
> > office: 650.327.2175
> > fax: 650.329.8335
> > mobile: 650.906.9929
> > e-mail: mailto:firstname.lastname@example.org
> > web: http://www.tothept.com
Received on Sun Dec 14 21:05:47 2003