RE: Character Encoding Question

At 5:18 PM -0800 11/29/00, John Boyer wrote:
>Actually, in the sentence directly after the one from which you cited, I
>quote:
>
>"The Unicode 16-bit encoding form is identical to the ISO/IEC 10646
>transformation format UTF-16."

Correct. Again, that is talking about encodings, not about character 
sets or repertoires.

>As to 'badly' misreading the UTF-8 spec, perhaps you could define how this
>differs in your mind from simply misreading.

I apologize for my harsh wording. I was reacting to you saying 
"clearly" in many places where it was not only not clear, it was the 
opposite of what you said.

>   Your characterization seems a
>bit harsh considering I've already said that I don't have any access to
>UCS-2 documentation, so I am having to guess from all of the shrouded half
>statements in the documents that I do have.

UCS-2 is an encoding that covers only the BMP. The encoding is "spit 
out the code point represented as two octets".
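
To make that concrete, here is a minimal sketch of the rule in Python. The function name encode_ucs2 is made up for illustration, and big-endian octet order is an assumption; the description above does not fix a byte order.

```python
def encode_ucs2(text: str) -> bytes:
    """Sketch of UCS-2: each BMP code point becomes exactly two octets.

    Big-endian order is an assumption made for this illustration.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Outside the BMP: UCS-2 has no way to represent this code point.
            raise ValueError(f"U+{cp:04X} is outside the BMP")
        out += cp.to_bytes(2, "big")
    return bytes(out)


# U+0041 and U+00E9 are both in the BMP, so each takes two octets.
print(encode_ucs2("A\u00e9").hex())  # 004100e9
```

Anything above U+FFFF simply raises an error here, which is the point: there is nothing for UCS-2 to spit out.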

>   The examples in Section 4 do in
>fact have triplets of UCS-2 characters that represent 'something', and I
>have no way of knowing really whether this is considered to be a single
>defined sequence as far as UCS-2 is concerned or whether it represents
>characters in a three character word, or whether two of the three 16-bit
>values represent a single thing.

Your previous message said "These examples clearly show triplets of 
UCS-2 values being used to form a single character, which does not 
appear to be permissible under UTF-16." The examples in the UTF-8 
document show how to take a string of characters represented as code 
points and encode them as a string of UTF-8. The result is not "a 
single character".
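
A short Python sketch may help show the distinction. It assumes the examples in question are the ones in Section 4 of the UTF-8 RFC, and uses the three code points U+65E5 U+672C U+8A9E as the input; the variable names are only for illustration.

```python
# Three separate characters, given as code points (all in the BMP).
code_points = [0x65E5, 0x672C, 0x8A9E]
text = "".join(chr(cp) for cp in code_points)

# Encoding the string as UTF-8 yields three octets per character here.
encoded = text.encode("utf-8")
print(encoded.hex(" "))  # e6 97 a5 e6 9c ac e8 aa 9e

# Decoding gives back three characters, not one.
decoded = encoded.decode("utf-8")
print(len(decoded))  # 3
```

The groups of three are octets per character in the UTF-8 output, and the three input values remain three distinct characters throughout.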

>It would be more helpful, since you seem to know, to tell us whether or not
>UCS-2 == Unicode, which is the single most important bit of information we
>need.

I disagree that this is what you need. Given how confused even people 
in this group have been over the sentence, it appears that you need 
to change the sentence that refers to "non-Unicode encodings" because 
that is ambiguous. It is not clear even to me if a "Unicode encoding" 
means just UTF-8 and UTF-16 (which are defined in the Unicode 
Standard), or any encoding of the character set that is often called 
Unicode. Our stabbing at various definitions will not help.

UCS-2 is not the same as Unicode because there is no formal 
definition of "Unicode" that matches the phrase you used.

>   If UCS-2 != Unicode, does UCS-2 have the same representation power as
>UCS-4?

It depends on what you mean by "representation power". UCS-2 can 
fully encode all the characters in the BMP, but *only* in the BMP. 
UCS-4 can fully encode all characters in ISO 10646. Both of them are 
clear and unambiguous, but they cover different-sized sets of 
characters.
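
As a rough illustration of the difference in coverage, here is a Python sketch. The helper names fits_in_ucs2 and encode_ucs4 are hypothetical, big-endian order is an assumption, and U+10000 is used only because it is the first code point outside the BMP.

```python
def fits_in_ucs2(text: str) -> bool:
    """True if every code point is in the BMP (at most U+FFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def encode_ucs4(text: str) -> bytes:
    """Sketch of UCS-4: four octets per code point, big-endian assumed."""
    return b"".join(ord(ch).to_bytes(4, "big") for ch in text)


bmp_only = "\u65e5\u672c\u8a9e"   # three BMP characters
beyond_bmp = "\U00010000"         # first code point outside the BMP

print(fits_in_ucs2(bmp_only))          # True
print(fits_in_ucs2(beyond_bmp))        # False
print(encode_ucs4(beyond_bmp).hex())   # 00010000
```

Both encodings are fixed-width and unambiguous for what they cover; the only difference the sketch shows is which code points fit.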

Maybe it would be best for you to wait for Martin to clear this up.

--Paul Hoffman, Director
--Internet Mail Consortium

Received on Wednesday, 29 November 2000 20:49:56 UTC