Let me try:
Two documents:
Document 1 Encoding A: has the word 'cat' represented as numbers 123.
Document 2 Encoding B: has the word 'dog' represented as numbers 123.
Both documents are spidered. So the index has some pointers, and the
pointer for the word represented as '123' points to both Document 1 and
Document 2.
The index does not know the encoding. So when I search for 'cat' I get
two documents, even though only one of them contains the word meaning
'cat'.
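
To make the collision concrete, here is a rough sketch in Python. It is
only an illustration of the example above, not how the real indexer is
written: ENCODING_A, ENCODING_B and build_index are made-up names, and
the "encodings" are toy lookup tables.

```python
from collections import defaultdict

# Hypothetical toy encodings: each maps a word to the number stored on disk.
ENCODING_A = {"cat": 123}
ENCODING_B = {"dog": 123}

def build_index(documents):
    """Map each stored number to the set of documents that contain it."""
    index = defaultdict(set)
    for name, numbers in documents:
        for number in numbers:
            # The encoding is not recorded; only the raw number is the key.
            index[number].add(name)
    return index

documents = [
    ("Document 1", [ENCODING_A["cat"]]),
    ("Document 2", [ENCODING_B["dog"]]),
]
index = build_index(documents)

# Searching for 'cat' (as written in Encoding A) also returns Document 2.
hits = sorted(index[ENCODING_A["cat"]])
print(hits)
```

The lookup cannot tell the two documents apart because the key carries
no information about which encoding produced it.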
What I think you are asking for is to add the encoding to the index, so
instead of just a representation:
123 --> Document 1, Document 2
you want
A123 --> Document 1
B123 --> Document 2
Now what do you do about Encoding C, where 'cat' is also represented as
123?
C123 --> Document 3
Now I search for cat = A123, and only obtain Document 1, even though
semantically, I want both Document 1 and Document 3.
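
Again as a rough Python sketch of the scheme just described, with the
encoding tag made part of the key. ENCODINGS and the document list are
hypothetical stand-ins for illustration, not anything in the actual
index format.

```python
from collections import defaultdict

# Hypothetical toy encodings; A and C both store 'cat' as 123.
ENCODINGS = {
    "A": {"cat": 123},
    "B": {"dog": 123},
    "C": {"cat": 123},
}

documents = [
    ("Document 1", "A", ["cat"]),
    ("Document 2", "B", ["dog"]),
    ("Document 3", "C", ["cat"]),
]

# Key each entry by (encoding tag, stored number), e.g. ("A", 123).
index = defaultdict(set)
for name, enc, words in documents:
    for word in words:
        index[(enc, ENCODINGS[enc][word])].add(name)

# A search for 'cat' typed in Encoding A only matches the A-tagged key.
hits = sorted(index[("A", ENCODINGS["A"]["cat"])])
print(hits)
```

Document 2 is correctly excluded now, but Document 3 is missed as well,
because nothing in the index says that ("A", 123) and ("C", 123) mean
the same word.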
The index is useful because it captures 'meaning'. How do you propose to
build in a semantic parser so that the index can know that the word 'cat'
is what is meant in the different encodings? That is, how do we know that
A123 is equivalent to C123, but is different from B123?
>> There is NO WAY to store more than one encoding in the index as it is
>> currently designed.
>>
>> And that's exactly what you are asking to do. You want to have libxml2
>> convert the document back to its original encoding when storing the
>> words in the index -- "as-is" -- and that's trying to store more than
>> one encoding in the index at the same time.
>
>
>Yes, that is exactly what I am asking to do.
>
>Forget about encodings, you won't see the wider picture.
>
>Think how can we index documents presented in 3 different languages (without
>utf-8 support)? This is the only solution, and it works.
>
Bill Conlon
To the Point
345 California Avenue Suite 2
Palo Alto, CA 94306
office: 650.327.2175
fax: 650.329.8335
mobile: 650.906.9929
e-mail: mailto:bill@tothept.com
web: http://www.tothept.com
Received on Sun Dec 14 19:34:59 2003