- From: Martin Duerst <duerst@w3.org>
- Date: Wed, 18 Dec 2002 07:56:09 +0900
- To: Dan Connolly <connolly@w3.org>
- Cc: www-i18n-comments@w3.org, w3c-i18n-ig@w3.org
At 16:06 02/12/17 -0600, Dan Connolly wrote:

>On Tue, 2002-12-17 at 15:40, Martin Duerst wrote:
> > Hello Dan,
> >
> > Many thanks for your comments on the Character Model.
>
>Hi... thanks for getting back to me quickly.
>
>I hope you have time to explain a bit more; I'm
>still a little confused...

Ok, I'll try.

> > We do not see any particular conflict here. There is a clear difference
> > between "character encoding" and "character encoding scheme". The latter
> > is taken from
> > http://www.unicode.org/unicode/reports/tr17/#Character%20Encoding%20Model
> > while the former probably goes back to RFC 2070 or so. The latter refers to
> > a very particular part of the overall process of character encoding,
> > the former to the overall thing.
>
>Which term names the class that utf-8 and us-ascii fall into?

You most probably should use 'character encoding', which denotes the
overall mapping between char* and byte*. The other terms, such as
'character encoding scheme' and 'character encoding form', are really
only relevant when looking at the details of that process. In
particular, 'character encoding form' was specifically created for
UTF-16, which can neither be explained as a sequence of bytes (in
particular in memory, where endianness issues are invisible) nor as a
sequence of naturally (binary) represented integers.

> > >It's technically arbitrary whether it goes from byte*
> > >to character* or the other way around, but the
> > >use of 'encoding' in the name strongly suggests
> > >encoding characters, i.e. going from characters
> > >to bytes.
> >
> > In actual fact, it's technically not arbitrary.
> > There are quite a few character encodings where the
> > char*->byte* mapping allows choices.
>
>Ah; OK. Please note that in the spec.

Ok, we'll try to do so.

> > This in particular
> > applies for the iso-2022-based encodings. This is the
> > reason why the formal definition is 'the wrong way
> > round'.
> > Nevertheless, it would be extremely weird for
> > anybody to change to calling these things 'decodings'.
> >
> > >The charmod spec gets the specification of IANA charsets
> > >right, indirectly...
> > >
> > >"A CES, together with the CCSes it is used with, is
> > >identified by an IANA charset identifier."
> > >
> > >but it's not nearly so mathematically precise as just
> > >saying that IANA charsets identify character encoding
> > >schemes, and character encoding schemes are
> > >(invertible) functions from character sequences
> > >to byte sequences.
> >
> > As explained above, it doesn't say that these are functions
> > because they are not functions.
>
>They are functions when they go the "wrong" way,
>from byte sequences to character
>sequences, no?

On paper, yes. In actual practice, there are small differences between
the conversions implemented by each of the major vendors/technologies.
But this is different from the 'right' way, where in some cases the
actual specification allows different choices.

> > >Please fix.
> >
> > We don't think that you have brought up any actual issues.
>
>Well, I need a term for the class that utf-8 and us-ascii
>are in; i.e. mappings between character sequences
>and byte sequences. Please let me know which term
>you've chosen for that concept. I thought it
>was "character encoding scheme", from RFC 2070 and
>such.

I just reviewed RFC 2070. There are indeed cases where it uses
'character encoding scheme', in particular in the abstract. But later
it seems to be quite clear, in particular when it says:

   The term "charset" in MIME is used to designate a character
   encoding, rather than merely a coded character set as the term may
   suggest. A character encoding is a mapping (possibly many-to-one)
   of sequences of octets to sequences of characters taken from one
   or more character repertoires.

Thinking back, it's easily possible that different authors used
somewhat different terminology :-).
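[Editor's note: the asymmetry discussed above (byte* -> char* is a
function, but char* -> byte* allows choices in iso-2022-based encodings)
can be illustrated with a small sketch, here using Python's codecs; the
specific byte sequences are the editor's illustration, not from the
original mail.]

```python
# In iso-2022-jp, two distinct byte sequences can decode to the same
# character sequence: the designation escape ESC ( B (re)selects ASCII,
# so 'A' can be written with or without it. Decoding (byte* -> char*)
# is thus a function, while encoding (char* -> byte*) involves choices.
plain = b'A'            # 'A' in the initial (ASCII) state
escaped = b'\x1b(BA'    # same 'A' after an explicit ASCII designation

assert plain != escaped
assert plain.decode('iso2022_jp') == escaped.decode('iso2022_jp') == 'A'
```

Any conformant encoder must pick one of the valid byte sequences, which
is why the spec defines the mapping in the byte-to-character direction.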
Please also note http://www.ietf.org/rfc/rfc2130.txt, which uses mostly
the same distinction between 'coded character set' and 'character
encoding scheme' as in http://www.unicode.org/unicode/reports/tr17/,
but then confuses things with statements such as:

   This report recommends the use of ISO 10646 as the default Coded
   Character Set, and UTF-8 as the default Character Encoding Scheme
   in the creation of new protocols...

Here 'UTF-8' refers to a generic method of variable-length encoding
from 32-bit integers to byte sequences, which in theory could be
applied to any kind of Coded Character Set. However, UTF-8 has
fortunately never been used in combination with anything other than
ISO 10646, so that it now always stands for the specific
variable-length encoding of ISO 10646/Unicode, and it is in this sense
(i.e. as a 'character encoding' rather than as a 'character encoding
scheme') that it is used in the 'charset' parameter or the 'encoding'
pseudo-attribute.

Regards,    Martin.
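[Editor's note: the variable-length nature of UTF-8 mentioned above can
be seen directly; this short Python sketch, added for illustration,
shows code points from different ranges of ISO 10646/Unicode encoding
to different numbers of bytes.]

```python
# UTF-8 encodes each Unicode code point into 1 to 4 bytes, depending
# on the code point's value.
samples = ['A', '\u00e9', '\u3042', '\U0001f600']
for ch in samples:
    encoded = ch.encode('utf-8')
    print(f'U+{ord(ch):04X} -> {len(encoded)} byte(s)')
# U+0041 -> 1 byte(s)
# U+00E9 -> 2 byte(s)
# U+3042 -> 3 byte(s)
# U+1F600 -> 4 byte(s)
```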
Received on Tuesday, 17 December 2002 17:56:27 UTC