character encoding schemes go from character*->byte*

In another working group, I was just going to
cite the definition of a character encoding scheme,
but I see it's wrong in the charmod spec:

"A CES is a mapping of the code units of a CEF into well-defined
sequences of bytes"
 -- http://www.w3.org/TR/charmod/#sec-Digital
 http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-Digital

No, a character encoding scheme maps a sequence of
characters to a sequence of bytes.

This goes back at least as far as HTML 4.0:

"The "charset" parameter identifies a character encoding, which is a
method of converting a sequence of bytes into a sequence of characters."
 -- http://www.w3.org/TR/html401/charset.html

It's technically arbitrary whether it goes from byte*
to character* or the other way around, but the
use of 'encoding' in the name strongly suggests
encoding characters, i.e. going from characters
to bytes.
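
To make the two directions concrete, here is a minimal sketch in
Python. The language is just a convenient choice for illustration;
neither charmod nor HTML 4 depends on it. The helper names
encode_chars and decode_bytes are hypothetical; str.encode and
bytes.decode are the standard library calls they wrap.

```python
def encode_chars(chars: str, charset: str = "utf-8") -> bytes:
    """character* -> byte*: the direction the word 'encoding' suggests."""
    return chars.encode(charset)


def decode_bytes(data: bytes, charset: str = "utf-8") -> str:
    """byte* -> character*: the direction the HTML 4 wording describes."""
    return data.decode(charset)


# One character (U+00E9) becomes two bytes in UTF-8.
encoded = encode_chars("\u00e9")
# Going the other way recovers the original character.
decoded = decode_bytes(encoded)
```

Either function determines the other on the text the charset
covers, which is why the choice of direction is a matter of
naming rather than substance.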

The charmod spec gets the specification of IANA charsets
right, indirectly...

"A CES, together with the CCSes it is used with, is
identified by an IANA charset identifier."

but it's not nearly so mathematically precise as just
saying that IANA charsets identify character encoding
schemes, and character encoding schemes are
(invertible) functions from character sequences
to byte sequences.
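
As a rough illustration of that reading, the sketch below treats a
charset label as the name of a function from character sequences to
byte sequences and checks that it round-trips on one sample. It
assumes Python's codec registry as the mapping from labels to
functions; that is one implementation's view, not the IANA registry
itself. The helper name round_trips and the sample string are made
up for the example, and a passing check on one sample is not a
proof of invertibility.

```python
def round_trips(chars: str, charset: str) -> bool:
    """True if decoding the encoded bytes gives back the same characters."""
    data = chars.encode(charset)          # character* -> byte*
    return data.decode(charset) == chars  # byte* -> character*, then compare


# Four characters; the last one is U+00E9.
sample = "caf\u00e9"
charsets = ("ISO-8859-1", "UTF-8", "UTF-16BE")

# Same characters, different byte sequences, depending on the charset.
lengths = {name: len(sample.encode(name)) for name in charsets}
results = {name: round_trips(sample, name) for name in charsets}
```

For this sample the three charsets yield 4, 5, and 8 bytes
respectively, and each one round-trips. Note that the functions are
partial: ISO-8859-1, for instance, is only defined on the characters
in its repertoire.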

Please fix.



-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/

Received on Tuesday, 17 December 2002 09:33:18 UTC