Re: Concrete syntax, character sets from Jon Bosak on 1996-09-11 (w3c-sgml-wg@w3.org from September 1996)

From: Jon Bosak <bosak@atlantic-83.Eng.Sun.COM>
Date: Tue, 10 Sep 1996 22:32:04 -0700
To: gtn@ebt.com
CC: tbray@textuality.com, w3c-sgml-wg@w3.org
Message-Id: <199609110532.WAA04380@boethius.eng.sun.com>

[Gavin:]

> some people confuse coded character sets and encodings

Yeah.  Me, for instance.  I apologize for getting this mixed up (once
again) in an earlier comment.

Let's review the basics for us slower members of the class.

* The character set allowed in markup or data (e.g., 10646 BMP) and
the character encoding put on the wire or in a disk file (e.g., UTF-8)
are completely different issues.

* We can specify 10646 as the character set (in the SGML sense) for
all XML documents while still allowing a much more limited encoding
(such as 7-bit ASCII) for the transfer of a particular document
instance, assuming that information about which encoding is being used
is conveyed when the transmission is established.

* Notwithstanding the previous point, we can (if we want) go further than
saying that 10646 is the XML character set (in the SGML sense) and
also say that UTF-8 is the only (or recommended, or expected, or
default) encoding that we shall/should/expect to find when we open a
file containing an XML document or receive a byte stream during an
HTTP session.

Right?
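
For anyone who wants to poke at the distinction, here is a minimal
sketch in Python. The sample string and the variable names are made up
for illustration, and the use of numeric character references to squeeze
non-ASCII characters through a 7-bit encoding is one assumed way of
doing it, not something settled above.

```python
# A document is a sequence of characters drawn from one character set
# (here ISO 10646 / Unicode code points). The sample text is invented.
text = "na\u00efve \u2013 ok"

# Character-set view: abstract code points, no bytes involved yet.
code_points = [ord(ch) for ch in text]

# Encoding view: the same characters serialized as different byte streams.
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-be")

# Both byte streams decode back to the identical character sequence,
# provided the receiver is told which encoding was used.
assert as_utf8.decode("utf-8") == text
assert as_utf16.decode("utf-16-be") == text

# A 7-bit ASCII encoding cannot carry every character directly...
try:
    text.encode("ascii")
    ascii_direct_ok = True
except UnicodeEncodeError:
    ascii_direct_ok = False

# ...but numeric character references (an assumption for this sketch)
# let the character set stay 10646 while only ASCII bytes go on the wire.
as_ascii = text.encode("ascii", errors="xmlcharrefreplace")

print(len(code_points), len(as_utf8), len(as_utf16))
print(as_ascii)
```

The character count stays the same in every case; only the byte
representation changes with the encoding chosen for transfer.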

Jon

Received on Wednesday, 11 September 1996 01:34:21 UTC