- From: Dan Connolly <connolly@w3.org>
- Date: 17 Dec 2002 16:06:07 -0600
- To: Martin Duerst <duerst@w3.org>
- Cc: www-i18n-comments@w3.org, w3c-i18n-ig@w3.org
On Tue, 2002-12-17 at 15:40, Martin Duerst wrote:
> Hello Dan,
>
> Many thanks for your comments on the Character Model.

Hi... thanks for getting back to me quickly. I hope you
have time to explain a bit more; I'm still a little confused...

> Because our last call period is long closed, we do
> not plan to list your comment as an issue in our
> last call disposition. But the W3C I18N WG
> Core Task Force has looked at your mail briefly at
> its teleconf today and has actioned me to write back
> to you.
>
> However, I plan to take up the issue of terminology
> streamlining again at the next teleconf, to make sure
> that we can increase the clarity and consistency of our document.
>
> At 08:33 02/12/17 -0600, Dan Connolly wrote:
>
> >In another working group, I was just going to
> >cite the definition of a character encoding scheme,
> >but I see it's wrong in the charmod spec:
> >
> >"A CES is a mapping of the code units of a CEF into well-defined
> >sequences of bytes"
> > -- http://www.w3.org/TR/charmod/#sec-Digital
> > http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-Digital
> >
> >No, a character encoding scheme maps a sequence of
> >characters to a sequence of bytes.
> >
> >This goes back at least as far as HTML 4.0:
> >
> >"The "charset" parameter identifies a character encoding, which is a
> >method of converting a sequence of bytes into a sequence of characters."
> > -- http://www.w3.org/TR/html401/charset.html
>
> We do not see any particular conflict here. There is a clear difference
> between "character encoding" and "character encoding scheme". The latter
> is taken from
> http://www.unicode.org/unicode/reports/tr17/#Character%20Encoding%20Model
> the former probably goes back to RFC 2070 or so. The latter refers to
> a very particular part of the overall process of character encoding,
> the former to the overall thing.

Which term names the class that utf-8 and us-ascii fall into?

> There may be some places in the Character Model that may be confusing
> (but that you don't cite). For example,
> http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-UniqueEncoding has:
>
> [S] Specifications SHOULD avoid using the terms 'character set' and
> 'charset' to refer to a character encoding, except when the latter
> is used to refer to the MIME charset parameter or its IANA-registered
> values. The terms 'character encoding', 'character encoding form' or
> 'character encoding scheme' are RECOMMENDED.
>
> I propose that we either take out 'character encoding form' and
> 'character encoding scheme' here, or that we be more specific in
> which cases which term should be used.
>
> There are also a few other cases where 'encoding scheme' is used
> colloquially; we should also review these for clarity.

Please do, and please let me know how it turns out.

> >It's technically arbitrary whether it goes from byte*
> >to character* or the other way around, but the
> >use of 'encoding' in the name strongly suggests
> >encoding characters, i.e. going from characters
> >to bytes.
>
> In actual fact, it's technically not arbitrary.
> There are quite a few character encodings where the
> char*->byte* mapping allows choices.

Ah; OK. Please note that in the spec.

> This in particular
> applies for the iso-2022-based encodings. This is the
> reason why the formal definition is 'the wrong way
> round'. Nevertheless, it would be extremely weird for
> anybody to change to calling these things 'decodings'.
>
> >The charmod spec gets the specification of IANA charsets
> >right, indirectly...
> >
> >"A CES, together with the CCSes it is used with, is
> >identified by an IANA charset identifier."
> >
> >but it's not nearly so mathematically precise as just
> >saying that IANA charsets identify character encoding
> >schemes, and character encoding schemes are
> >(invertible) functions from character sequences
> >to byte sequences.
>
> As explained above, it doesn't say that these are functions
> because they are not functions.

They are functions when they go the "wrong" way, from byte
sequences to character sequences, no?

> >Please fix.
>
> We don't think that you have brought any actual issues.

Well, I need a term for the class that utf-8 and us-ascii
are in; i.e. mappings between character sequences and
byte sequences. Please let me know which term you've chosen
for that concept. I thought it was "character encoding scheme",
from RFC2070 and such.

> We will look at the issues that we have found 'en passant'
> at our next teleconference.
>
> Regards, Martin.
>
> >--
> >Dan Connolly, W3C http://www.w3.org/People/Connolly/

--
Dan Connolly, W3C http://www.w3.org/People/Connolly/
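
Illustration (not part of the original message): the point that the
char*->byte* direction "allows choices" for iso-2022-based encodings,
while byte*->char* behaves as a function, can be sketched with Python's
built-in codecs. The helper names decode and encode are hypothetical,
and the charset labels are the aliases Python happens to accept. This is
a minimal sketch of the idea, not a statement of how charmod or the IANA
registry defines these mappings.

```python
def decode(data: bytes, charset: str) -> str:
    """Byte sequence -> character sequence for the named charset."""
    return data.decode(charset)


def encode(text: str, charset: str) -> bytes:
    """Character sequence -> the byte sequence this codec chooses to emit."""
    return text.encode(charset)


# utf-8 and us-ascii: the encoder has no choice of byte sequence here.
utf8_bytes = encode("é", "utf-8")      # b'\xc3\xa9'
ascii_bytes = encode("A", "us-ascii")  # b'A'

# iso-2022-jp: "ESC ( B" designates ASCII, which is already the initial
# state, so the escape is redundant but still legal input. Two different
# byte sequences therefore decode to the same character sequence.
plain = b"A"
with_escape = b"\x1b(BA"
same_text = decode(plain, "iso-2022-jp") == decode(with_escape, "iso-2022-jp")
```

Under these assumptions same_text is True even though plain and
with_escape differ, so for iso-2022-jp only the decoding direction is
single-valued; an encoder has to pick one of several valid byte
sequences.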
Received on Tuesday, 17 December 2002 17:06:13 UTC