
character sets - a summary and a proposal



Working through the postings about character sets, I have come to
believe we may be passing some things over in silence, because we think
them too obvious to merit discussion.

We all seem to agree that XML documents should be able to contain any
character in ISO 10646.

We all seem to agree that characters other than A-Z, a-z, 0-9, -, and .
should be legal name- and name-start characters.  (This will change a
trivial task in parser development into one most of us have never
performed, but everyone in the discussion seems willing to make a leap
of faith here and believe Gavin Nicol when he says it's not very hard.)

If I understand the postings correctly, the remaining differences of
opinion start here, and there are several positions staked out, which
I attempt to summarize below in ways that make clear how they agree
and differ, or else expose my misunderstanding of people's views.

1.  Hard Minimalism (Tim Bray):
  - all XML data streams must be in UTF-8 form
  - all XML systems must accept UTF-8 data
  - when data on disk is in non-UTF-8 form, responsibility for
    conversion rests outside the XML system
  - it's not said whether XML systems may accept data streams in other
    formats (e.g. Shift-JIS); XML parsers which feed data to applications
    must, however, feed them UTF-8

2.  The Dual-Track Approach (James Clark):
  - all XML data streams must be in UTF-8 or UTF-16
  - all XML systems must accept either UTF-8 or UTF-16 data, telling the
    difference by means of the xFEFF character conventionally used as
    a byte-order label in UTF-16 data streams
  - when data on disk is in non-Unicode form, responsibility for
    conversion rests outside the XML system (? I'm not sure JC was
    explicit about this)
  - it's not said whether XML systems may accept data streams in other
    formats (e.g. Shift-JIS)
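For concreteness, the Dual-Track detection rule might be sketched like this (an illustrative fragment of my own, not anything James Clark posted; the function name is invented):

```python
def detect_dual_track(data: bytes) -> str:
    """Distinguish UTF-16 from UTF-8 input by the leading byte-order mark.

    Under the Dual-Track rule, UTF-16 streams begin with the xFEFF
    character, which appears on the wire as FE FF (big-endian) or
    FF FE (little-endian); anything else is taken to be UTF-8.
    """
    if data[:2] == b'\xfe\xff':
        return 'UTF-16BE'
    if data[:2] == b'\xff\xfe':
        return 'UTF-16LE'
    return 'UTF-8'
```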

3.  Let 100 Flowers Bloom:  Gavin Nicol and Todd Bauman have argued for a
third position which I understand to have the following salient points:
  - XML data streams can be in any known or documentable encoding
  - XML systems may accept data streams in any format(s) they choose
    to support; they are encouraged but not required to accept UTF-8
  - all XML systems must implement and rely on external specification of
    the coded character set / encoding, such as MIME or attributes on
    an FSI
  - each XML system must support content negotiation so clients and
    servers can avoid sending or receiving XML data in unsupported
    encodings

This position seems, in some ways, to be even more minimalist than Tim
Bray's, since there is *no* coded character set or encoding which *all*
XML systems are required to support.  ("XML browsers would not need to
support any encodings other than those deemed important by the companies
producing them" was Gavin Nicol's way of putting it.) A conforming XML
system could legitimately restrict itself to handling ASCII or ISO
8859-1, or 96-character EBCDIC.  For this reason, I propose naming this
the Let-100-Flowers-Bloom position.  Those uncomfortable with the
allusion to Mao might prefer to call it the Laissez-Faire approach.

4.  The Hard Maximalist Position:  this is what I originally understood
Nicol and Bauman to be arguing for; it's not wholly unlike the apparent
intent of ISO 8879, as I understand it, though there are some obvious
differences of detail.
  - XML data streams can be in any known or documentable encoding
  - all XML systems to implement and rely on external specification of
    the coded character set / encoding, such as MIME or attributes on
    an FSI
  - all XML systems to support parse-time specification of arbitrary
    7- or 8-bit coded character sets, or any known Unicode encoding
  - each XML system to support content negotiation so clients and
    servers can know when to send a parse-time character-set
    specification and/or font
  - when data on disk is in a form not built in to the XML system,
    responsibility for declaring it rests with the user, and
    responsibility for using the declaration to convert the data
    into an appropriate internal form rests with the XML system

5.  The Eclectic Compromise (DeRose):  a slight extension of the
Dual-Track approach:
  - XML data streams may be in any known or documentable encoding
  - all XML systems must accept UTF-8 data but may reject other formats
  - XML systems are encouraged to accept UTF-16 data, telling the
    difference by means of the xFEFF character conventionally used as
    a byte-order label in UTF-16 data streams
  - XML systems may at their option accept data in other formats; how
    they recognize the format (autodetection, external labels, internal
    label) is not specified
  - XML systems must be able to emit a normalized form of any document
    they can accept; the normalized form is in UTF-8 (and thus can
    be read by any XML system)
  - when data on disk is in a non-supported form, responsibility for
    conversion rests outside the XML system

It seems to me the differences among these proposals pose several
questions, some of them surprising to me since I hadn't expected
any differences of opinion:

Q1 should there be any minimal function required of all conforming XML
systems, any coded character set or character encoding they are all
required to accept as input, whether across the net or from disk?

Q2 should conforming XML systems be prohibited from accepting any
input format they are not required to accept?

Q3 if XML systems may accept different sets of input formats (whether
or not these sets overlap), can we ensure interoperability
in some way, or is that a lost cause?

Q4 if XML systems may *only* accept Unicode (whether just UTF-8 or
also UTF-16), is there anything that can be done to make life
easier for users of current systems which rely on ASCII, ISO 8859-1
or 8859-*, JIS, Shift-JIS, EUC, etc.?

It seems to me that there must be at least one encoding accepted by
all XML systems; a parser that accepts ASCII only may be XML-Like,
but it should not be XML.  Period.

It seems to me that requiring all users to fit filters on the front
and back ends of all XML tools, to accomplish their Local-to-UTF8
and UTF8-to-Local conversions, raises an unnecessary barrier to
acceptance; to avoid this, it seems essential to allow an XML
parser, at least one for use on local files, to read the native
coded character set without prostheses.  On the other hand, to
ensure interoperability, we don't want such variations to be
globally visible.

Here's yet another proposal.

6.  Limited Modified Eclecticism:  compromise between Eclectic
Compromise and 100 Flowers:
  - XML data streams may be in any of a number of supported encodings:
    UTF-8, UTF-16, UCS-4, ISO 8859
  - XML data streams must label themselves as to which supported
    encoding they are using, by means of a PI which must be the first
    data in each XML entity.
  - all XML systems must accept XML data in any supported encoding,
    detecting the encoding in use from the internal label;
    they may reject data in other encodings.
    (See note on autodetection, below.)
  - XML systems may optionally check the internal labeling for
    consistency with external labels (MIME, FSI, ...) and warn about
    inconsistencies or errors.
  - if the encoding of a data stream is not supported, the data stream
    is strictly in error; an XML system may however optionally recover
    from that error, e.g. to support a well known encoding in local use.
    At the user's option, warning messages for this error may be
    suppressed.  Conforming XML systems must however allow a user option
    to have such errors reported (e.g. for the use of users about to
    send data to other sites which may not handle unsupported
    encodings).
  - XML systems must be able to emit a normalized form of any document
    they can accept; the normalized form is in UTF-8 (and thus can
    be read by any XML system)
  - when data on disk is in a non-supported form, responsibility for
    conversion rests outside the XML system
  - when data on disk is in a supported form, responsibility for
    conversion to the XML system's internal form rests with the XML
    system

What this boils down to is an attempt to allow XML systems to accept
data in commonly used formats, without impeding interoperability.
Systems are allowed to accept commonly used character encodings, just
not to hide from their users the fact that XML strictly speaking
requires one of the supported encodings.

If we restrict the supported character encodings to UTF-8 and UTF-16, I
think this proposal is only trivially different from the Dual-Track
proposal or the Eclectic Compromise.  If we add 8859 to the list, then
the implementation burdens are only trivially increased (a UTF-8-based
system has to autodetect the character encoding and translate to UTF-8
before actually reading the data), but the users' burdens seem
substantially lighter.  (Very few users want to have to deal with
character-set problems, even if we put the filters on their disk for
them.)  I hesitate to add Shift-JIS etc.  to the list of supported
formats mostly because few programmers outside of Japan understand JIS,
Shift-JIS, and EUC, and only a few more would be willing to learn enough
to understand them if they had the opportunity.

Note on autodetection of character sets.

Before a parser can read the internal label, it has to know what
character set is in use -- which is what the internal label is trying to
tell it.  This is why the SGML declaration doesn't provide fully
automatic handling of foreign data.  But if we limit ourselves to a
finite set of supported formats, and give ourselves some clear text to
begin with, then autodetection is a soluble problem.

If each XML entity begins with a PI looking something like this:

  <?XML charset='...'>

then the first part of the entity *must* be the characters '<?XML' and
any conforming processor can detect, after four octets of input, which
of the following cases apply (it may help to know that in Unicode,
'<' is 0000 003C and '?' is 0000 003F):

  1 x00 00 00 3C - UCS-4, big-endian machine (1234)
    x3C 00 00 00 - UCS-4, little-endian machine (4321)
    x00 00 3C 00 - UCS-4, weird machine (2143)
    x00 3C 00 00 - UCS-4, weird machine (3412)

  2 x00 3C 00 3F - UCS-2, big-endian
    x3C 00 3F 00 - UCS-2, little-endian

  3 x3C 3F 58 4D - ASCII, some part of 8859, or UTF-8
                   or any other ISO-flavor 7- or 8-bit set

  4 x4C 6F E7 D4 - EBCDIC (in some flavor)

  5 other        - the data are corrupt, fragmentary, or enclosed in
                   a wrapper of some kind (e.g. a Mime wrapper)

Knowing that, it ought to be possible to handle things properly --
whether by invoking a separate lexical scanner for each case, or by
calling the proper conversion function on each character of input.  Tim
and Gavin have already shown code fragments for this.
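For the record, one way the four-octet test might look in code (a sketch of my own, paralleling the table above; the function name and return strings are invented):

```python
def detect_family(first4: bytes) -> str:
    """Classify an XML entity's encoding family from its first four octets.

    Assumes the entity begins with the characters '<?XML', so the byte
    patterns below are the possible renderings of '<' '?' 'X' 'M'.
    This yields only a coarse family; the charset PI must still be read
    to pin down the exact encoding (e.g. UTF-8 vs. a part of 8859).
    """
    patterns = {
        b'\x00\x00\x00\x3c': 'UCS-4 big-endian (1234)',
        b'\x3c\x00\x00\x00': 'UCS-4 little-endian (4321)',
        b'\x00\x00\x3c\x00': 'UCS-4 weird (2143)',
        b'\x00\x3c\x00\x00': 'UCS-4 weird (3412)',
        b'\x00\x3c\x00\x3f': 'UCS-2 big-endian',
        b'\x3c\x00\x3f\x00': 'UCS-2 little-endian',
        b'\x3c\x3f\x58\x4d': 'ASCII-family (UTF-8, 8859, ...)',
        b'\x4c\x6f\xe7\xd4': 'EBCDIC',
    }
    return patterns.get(first4[:4], 'unknown (corrupt, fragmentary, or wrapped)')
```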

This level of autodetection is enough to read the initial processing
instruction with the pointer to the declarations and the character set
identifier, which is still necessary to distinguish UTF-8 from 8859, and
the parts of 8859 from each other (as well as the varieties of EBCDIC
and so on).
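Once the family is known, reading the label itself is a matter of decoding the first line just well enough to pattern-match it.  A hypothetical sketch (function name and details invented; the ASCII-family case only):

```python
import re

def read_charset_label(entity: bytes, family_codec='ascii'):
    """Extract the charset name from a leading <?XML charset='...'> PI.

    Assumes autodetection has already identified the encoding family,
    so 'family_codec' is whatever coarse decoding that family permits
    (here defaulted to ASCII for the ASCII-family case).  Returns the
    label string, or None if no recognizable PI is present.
    """
    head = entity[:80].decode(family_codec, errors='replace')
    m = re.match(r"<\?XML\s+charset\s*=\s*'([^']*)'", head, re.IGNORECASE)
    return m.group(1) if m else None
```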

Like any self-labeling system, this can break if software changes the
character set or encoding and doesn't update the label.  I get lots of
mail with MIME labels saying (in EBCDIC) that the data are in ASCII.  I
don't think ASCII-EBCDIC gateways are the only places where such
translations occur.  So I still think that we need clear rules about
network transfer, and what to do if you don't control the gateways (e.g.
if you are going through someone else's ftp server or client, or via
email).

Perhaps we should say that network transmissions (or HTTP transmissions)
should always be in UTF-8, and the other supported formats are only
for local use on disk ...

Is this compromise workable?

-C. M. Sperberg-McQueen
 ACH / ACL / ALLC Text Encoding Initiative
 University of Illinois at Chicago
 tei@uic.edu

