XML character sets: a proposal

> From: Gavin Nicol <gtn@ebt.com>
> Date: Tue, 10 Sep 1996 19:55:42 GMT

> >I think there are many good reasons - overwhelmingly good - for the
> >SGML case. For XML, we should think about sweeping all the
> >charset/encoding subtleties under the rug: it's UTF8 and that's all
> >there is to it.
> UTF8 doesn't solve the worlds problems. I think we can fix the
> character repertoire, but fixing the encoding is arbitrary, and
> prescribes certain implementation details. It also complicates usage.

I agree that fixing the character repertoire on ISO 10646 is the right
way to go.  (There appears to be consensus on this.)  How much
flexibility to allow with encodings is a difficult issue.

Going with UTF-8 makes a lot of sense in the Unix world.  Within a few
years I expect to see most Unix platforms supporting UTF-8.  But the
Microsoft Windows world is going with UCS-2 (or UTF-16 which is an
extension of UCS-2); system tools (such as Notepad) on Windows NT
already support UCS-2, and I would expect Windows 9x to follow suit

However, although many platforms do not currently have support for
either UTF-8 or UCS-2/UTF-16, my impression is that most of them are
planning support for one or the other.  This suggests the possibility
of allowing just these two encodings for XML.

If we do this, we then have to face the issue of how an XML system
will know which encoding is being used.  I think there's a easy
solution to this.  The Unicode standard recommends that files start
with a byte order mark (0xFEFF); since the character 0xFFFE is not a
legal character this allows a Unicode system to determine whether
big-endian or little-endian byte order is being used.  Fortunately
neither 0xFF nor 0xFE can start the UTF-8 representation of a
character (actually I don't think they're legal anywhere).  So XML
could require that a byte order mark be used whenever UCS-2/UTF-16 is
being used, and assume UTF-8 when the mark is not present.

Restricting XML to these two encodings would, as Gavin says, be
somewhat arbitrary and would indeed complicate usage at least in the
short term for Japanese and European markets.  I'm very sympathetic to
Gavin's point of view here, but what's the alternative?

- Allow any encoding at all?  That's an implementation nightmare and
users lose the guarantee that any XML system will be able to handle
any XML document.

- If not, what encodings do we allow?  If we allow SJIS amd EUC,
shouldn't we also support the Chinese, Taiwanese, Korean encodings as
well, and all the ISO 8859-[123456789] for the Europeans.  There are
many other encodings that have active user communities.

And if we have lots of possible encodings, we certainly can't
auto-detect which are being used.  The possibilities I can think of

a) rely on some external mechanism to specify the encoding

b) use ISO 2022 escape sequences

c) require each XML file to have a MIME header with a charset

d) use an attribute in a formal system identifier

Whichever way we go, there's an enormous step up in complexity over
the UTF-8/UCS-2 route.


Follow-Ups: References: