
character encoding schemes go from character*->byte*

From: Dan Connolly <connolly@w3.org>
Date: 17 Dec 2002 08:33:13 -0600
To: www-i18n-comments@w3.org
Message-Id: <1040135593.11346.132.camel@dirk.dm93.org>

In another working group, I was just going to
cite the definition of a character encoding scheme,
but I see it's wrong in the charmod spec:

"A CES is a mapping of the code units of a CEF into well-defined
sequences of bytes"
 -- http://www.w3.org/TR/charmod/#sec-Digital
 http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-Digital

No, a character encoding scheme maps a sequence of
characters to a sequence of bytes.

This goes back at least as far as HTML 4.0:

"The "charset" parameter identifies a character encoding, which is a
method of converting a sequence of bytes into a sequence of characters."
 -- http://www.w3.org/TR/html401/charset.html

It's technically arbitrary whether it goes from byte*
to character* or the other way around, but the
use of 'encoding' in the name strongly suggests
encoding characters, i.e. going from characters
to bytes.
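
To make the two directions concrete, here is a minimal sketch in
Python 3. It assumes Python's "iso-8859-1" codec stands in for the
charset of the same name; the variable names are illustrative only.

```python
# Encoding goes from characters to bytes; decoding goes the other way.
text = "naïve"                        # a sequence of characters (str)
data = text.encode("iso-8859-1")      # a sequence of bytes (bytes)
restored = data.decode("iso-8859-1")  # back to a sequence of characters

print(type(text).__name__, "->", type(data).__name__)
print(data)
print(restored == text)
```

Either direction determines the other, which is why the choice is
arbitrary; the method names simply follow the "encoding" reading.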

The charmod spec gets the specification of IANA charsets
right, indirectly...

"A CES, together with the CCSes it is used with, is
identified by an IANA charset identifier."

but it's not nearly so mathematically precise as just
saying that IANA charsets identify character encoding
schemes, and character encoding schemes are
(invertible) functions from character sequences
to byte sequences.
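
As a rough illustration of that reading, the sketch below treats a
charset label as selecting a function from character sequences to
byte sequences. The helper names (encode, decode, round_trips) are
hypothetical, and it assumes Python's codec names line up with the
corresponding IANA charset labels.

```python
def encode(charset: str, text: str) -> bytes:
    """Apply the encoding scheme named by `charset` to a character sequence."""
    return text.encode(charset)


def decode(charset: str, data: bytes) -> str:
    """Apply the inverse mapping: bytes back to characters."""
    return data.decode(charset)


def round_trips(charset: str, text: str) -> bool:
    """True if `text` is in the scheme's domain and decoding inverts encoding."""
    try:
        return decode(charset, encode(charset, text)) == text
    except UnicodeEncodeError:
        # The scheme has no byte sequence for some character in `text`.
        return False


# Different labels select different functions for the same characters.
print(encode("utf-8", "é"))
print(encode("iso-8859-1", "é"))

# Invertibility holds only on the characters a scheme can represent.
print(round_trips("utf-8", "€"))
print(round_trips("iso-8859-1", "€"))
```

Note that the function is partial for charsets with a limited
repertoire: iso-8859-1 has no encoding for the euro sign, so the
inverse is only defined on the byte sequences the scheme can produce.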

Please fix.



-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
Received on Tuesday, 17 December 2002 09:33:18 GMT
