Re: character encoding schemes go from character*->byte* from Dan Connolly on 2002-12-17 (www-i18n-comments@w3.org from December 2002)

From: Dan Connolly <connolly@w3.org>
Date: 17 Dec 2002 16:06:07 -0600
To: Martin Duerst <duerst@w3.org>
Cc: www-i18n-comments@w3.org, w3c-i18n-ig@w3.org
Message-Id: <1040162767.14858.11.camel@dirk.dm93.org>
On Tue, 2002-12-17 at 15:40, Martin Duerst wrote:
> Hello Dan,
> 
> Many thanks for your comments on the Character Model.

Hi... thanks for getting back to me quickly.

I hope you have time to explain a bit more; I'm
still a little confused...

> 
> Because our last call period is long closed, we do
> not plan to list your comment as an issue in our
> last call disposition. But the W3C I18N WG
> Core Task Force has looked at your mail briefly at
> its teleconf today and has actioned me to write back
> to you.
> 
> However, I plan to take up the issue of terminology
> streamlining again at the next teleconf, to make sure
> that we can increase the clarity and consistency of our document.
> 
> At 08:33 02/12/17 -0600, Dan Connolly wrote:
> 
> >In another working group, I was just going to
> >cite the definition of a character encoding scheme,
> >but I see it's wrong in the charmod spec:
> >
> >"A CES is a mapping of the code units of a CEF into well-defined
> >sequences of bytes"
> >  -- http://www.w3.org/TR/charmod/#sec-Digital
> >  http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-Digital
> >
> >No, a character encoding scheme maps a sequence of
> >characters to a sequence of bytes.
> >
> >This goes back at least as far as HTML 4.0:
> >
> >"The "charset" parameter identifies a character encoding, which is a
> >method of converting a sequence of bytes into a sequence of characters."
> >  -- http://www.w3.org/TR/html401/charset.html
> 
> We do not see any particular conflict here. There is a clear difference
> between "character encoding" and "character encoding scheme". The latter
> is taken from
> http://www.unicode.org/unicode/reports/tr17/#Character%20Encoding%20Model
> the former probably goes back to RFC 2070 or so. The latter refers to
> a very particular part of the overall process of character encoding,
> the former to the overall thing.

Which term names the class that utf-8 and us-ascii fall into?
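
For reference, the layering behind the two terms can be sketched in code.
This is an illustration only, assuming Python 3 and its built-in
utf-16-be / utf-16-le codecs; the helper names utf16_code_units and
serialize are hypothetical. In TR17 terms, the CEF step turns characters
into 16-bit code units and the CES step serializes those units into bytes;
a "character encoding" in the HTML 4 sense is the whole path between
characters and bytes.

```python
import struct

def utf16_code_units(text):
    """CEF step: characters -> 16-bit code units (hypothetical helper)."""
    raw = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))

def serialize(units, big_endian=True):
    """CES step: code units -> bytes, in the chosen byte order."""
    fmt = (">" if big_endian else "<") + "%dH" % len(units)
    return struct.pack(fmt, *units)

text = "\U0001F600"             # one character outside the BMP
units = utf16_code_units(text)  # two surrogate code units
be = serialize(units, big_endian=True)
le = serialize(units, big_endian=False)
print([hex(u) for u in units], be.hex(), le.hex())
```

The same code units yield two different byte sequences depending on the
byte order, which is the part of the process the TR17 term "character
encoding scheme" singles out.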


> There may be some places in the Character Model that may be confusing
> (but that you don't cite). For example,
> http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-UniqueEncoding has:
> 
>     [S] Specifications SHOULD avoid using the terms 'character set' and
>     'charset' to refer to a character encoding, except when the latter
>     is used to refer to the MIME charset parameter or its IANA-registered
>     values. The terms 'character encoding', 'character encoding form' or
>     'character encoding scheme' are RECOMMENDED.
> 
> I propose that we either take out 'character encoding form' and
> 'character encoding scheme' here, or that we be more specific in
> which cases which term should be used.
> 
> There are also a few other cases where 'encoding scheme' is used
> colloquially; we should also review these for clarity.
> 

Please do, and please let me know how it turns out.

> 
> >It's technically arbitrary whether it goes from byte*
> >to character* or the other way around, but the
> >use of 'encoding' in the name strongly suggests
> >encoding characters, i.e. going from characters
> >to bytes.
> 
> In actual fact, it's technically not arbitrary.
> There are quite a few character encodings where the
> char*->byte* mapping allows choices.

Ah; OK. Please note that in the spec.

>  This in particular
> applies for the iso-2022-based encodings. This is the
> reason why the formal definition is 'the wrong way
> round'. Nevertheless, it would be extremely weird for
> anybody to change to calling these things 'decodings'.
> 
> 
> 
> >The charmod spec gets the specification of IANA charsets
> >right, indirectly...
> >
> >"A CES, together with the CCSes it is used with, is
> >identified by an IANA charset identifier."
> >
> >but it's not nearly so mathematically precise as just
> >saying that IANA charsets identify character encoding
> >schemes, and character encoding schemes are
> >(invertible) functions from character sequences
> >to byte sequences.
> 
> As explained above, it doesn't say that these are functions
> because they are not functions.

They are functions when they go the "wrong" way,
from byte sequences to character
sequences, no?
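
A small sketch of the point about iso-2022-based encodings, assuming
Python 3 and its built-in iso2022_jp codec (the variable names are
illustrative only). Several distinct byte sequences decode to the same
character sequence, so byte*->character* behaves as a function, while
character*->byte* involves a choice that a particular encoder makes.

```python
CODEC = "iso2022_jp"

# Three different byte sequences for the single character "A".
candidates = [
    b"A",              # ASCII is the initial state; no escape needed
    b"\x1b(BA",        # redundant "ESC ( B": designate ASCII explicitly
    b"\x1b(JA\x1b(B",  # "ESC ( J": JIS X 0201 Roman, then back to ASCII
]

# Decoding: each byte sequence maps to exactly one character sequence.
decoded = [b.decode(CODEC) for b in candidates]

# Encoding: this encoder picks one of the possible byte sequences.
chosen = "A".encode(CODEC)

print(decoded, chosen)
```

Under these assumptions the decoder is many-to-one, so it has no unique
inverse; an encoder is one of several possible right inverses.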

> 
> >Please fix.
> 
> We don't think that you have brought any actual issues.

Well, I need a term for the class that utf-8 and us-ascii
are in; i.e. mappings between character sequences
and byte sequences. Please let me know which term
you've chosen for that concept. I thought it
was "character encoding scheme", from RFC 2070 and
such.

> We will look at the issues that we have found 'en passant'
> at our next teleconference.
> 
> Regards,    Martin.
>
-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
Received on Tuesday, 17 December 2002 17:06:13 UTC