
XML character sets: the hard-minimalist manifesto



I think the quality of debate on this has been excellent.  However, to 
my own surprise, I am unconvinced.  I still believe strongly in the
hard-minimalist position: XML is UTF8.  I have a suspicion that I'm probably
going to lose this argument, but here is one final blast of rhetoric on the 
advantages of minimalism.  Parenthetically, assuming we don't take the 
hard-minimalist position, I would support James Clark's proposal, for reasons 
listed at the end of the message.

Here is a restatement of the proposed hard-minimalist standard language:

 All parseable XML entities shall contain ISO 10646 characters encoded
 in UTF8. 

An ancillary design goal:

 The W3C SGML ERB/WG shall encourage the development of a repository of 
 2-way conversion modules between a wide variety of character encodings
 and 10646/UTF8.

Here are some virtues of hard minimalism:

1. Level Playing Field (Ref. design principle #2)

Right now, there is a small (but growing) amount of product support for UTF-8  
and UCS-2.  But the overwhelming majority of real documents in the real world 
are an unholy dog's breakfast of ASCII, EBCDIC, ISO-Latin-XXX, 3 flavors of JIS,
Big5, etc etc etc.  In the short term, it is clearly ridiculous for XML to try
to support all these.  In the hard-minimalist world, everybody has to do
exactly the same thing, whether they're operating in ISO-Latin1, EUC, or
11-bit reverse-polarity Ojibway:

 Use whatever tools you want to use.  To store your data in XML, and to use
 your tools on data extracted from XML, you need a 2-way converter from your 
 format to 10646/UTF-8.

I would argue that at the moment, UTF8 is the clear favorite in terms of a
format that you'd want to build universal converters to and from.
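
As a rough illustration of what one such 2-way converter involves, here is a
minimal sketch for ISO-Latin1.  The function names and the error convention
are assumptions made for this example, not part of any proposal or library.

```c
#include <stddef.h>

/* Hypothetical helper: converts n bytes of ISO-Latin1 to UTF-8.
   `out` must have room for 2*n bytes.  Returns the bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            out[j++] = c;                                   /* ASCII is unchanged */
        } else {
            out[j++] = (unsigned char)(0xC0 | (c >> 6));    /* lead byte  110000xx */
            out[j++] = (unsigned char)(0x80 | (c & 0x3F));  /* trail byte 10xxxxxx */
        }
    }
    return j;
}

/* Hypothetical helper: the reverse direction.  `out` needs room for n bytes.
   Returns the bytes written, or (size_t)-1 if the input is malformed or
   holds a character that ISO-Latin1 cannot represent. */
size_t utf8_to_latin1(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            out[j++] = c;
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < n
                   && (in[i + 1] & 0xC0) == 0x80) {
            /* Two-byte sequence covering U+0080 through U+00FF. */
            out[j++] = (unsigned char)(((c & 0x03) << 6) | (in[i + 1] & 0x3F));
            i++;
        } else {
            return (size_t)-1;
        }
    }
    return j;
}
```

The reverse direction is where the real design decisions sit: a converter has
to say what happens to characters the target encoding lacks.  This sketch
simply refuses them.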

2. Programmer Friendliness (Ref design principle #4)

I don't think any decent programmer is going to have difficulty understanding
that display module X needs encoding EsubX, and full-text-indexer Y needs
encoding EsubY and linguistic-analyzer Z needs encoding EsubZ - nor, given a
decent library of converters to/from UTF8, any particular trouble in using them 
to interchange data with X, Y, and Z in the formats they want.

But the gains in simplicity, robustness, and performance that are going
to come from a programmer being able to address an XML byte stream directly,
particularly a byte stream on which we can use the tools we have today, 
are immense.  I suspect such programmers will have enough real input problems 
dealing with disk files and socket feeds and OODBMS blobs and RDBMS tuple
sequences - let's not make it any harder.
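
One property behind the "tools we have today" point: in UTF-8, every byte of
a multi-byte sequence has its high bit set, so an ASCII delimiter such as '<'
can be found with ordinary byte-oriented routines and no decoding step.  The
sketch below assumes a buffer already in memory; the function name is made up
for the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts '<' characters in a UTF-8 buffer.
   No decoding is needed, because no byte inside a multi-byte UTF-8
   sequence can equal '<' (0x3C). */
size_t count_open_angles(const char *buf, size_t n)
{
    size_t count = 0;
    const char *p = buf;
    const char *end = buf + n;

    /* memchr is plain ISO C; it knows nothing about character encodings. */
    while (p < end && (p = memchr(p, '<', (size_t)(end - p))) != NULL) {
        count++;
        p++;    /* step past the match and keep scanning */
    }
    return count;
}
```

The same scan over a UCS2 stream would not be safe as written, since a byte
0x3C there may be one half of an unrelated 16-bit character.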

3. SGML compliance (Ref design principle #3)

We can write an SGML declaration to support 10646/UTF8.  We can also write
one to support 10646/UCS2.  Can we get away with having 0xfffe at the front
of the file and still be SGML-compliant?

4. Shorter is Better (Ref design principle #8)

The hard-minimalism proposal is more concise.  This might not seem like a big 
issue, but we are going to be facing a couple of hundred design choices as we 
work through this.  In general, I think that in every case we have to take the 
choice with the minimal number of options and maximal conciseness, unless the 
consequences of doing otherwise are really serious.  (This is why James was 
correct to speak up against allowing omission of LIT/LITA on attribute values). 
I am far from convinced that the consequences of going with "XML is UTF8" are 
serious at all.  (In practical terms, the consequences of not supporting 
ISO-Latin-X and JIS are much more serious, but (sorry Gavin) we are probably 
heading in that direction.)

5. Network problems of self-labelling (Ref. design principle #1)

Self-labelling, no matter how it's done, has a nonzero risk of failure, and
based on the evidence, has a very high risk of failure.  ISO 2022 has not
exactly burned up the track.  We've all received UUENCODED messages 
claiming to be BinHexed, and ASCII messages claiming to be EBCDIC, and
SJIS pages claiming to be EUC.  (Also it bothers me that the 0xfffe is
not really part of the data, in the SGML sense - thus it's more of a
small package/wrapper).  Hard minimalism makes all this go away.

Finally, emailing XML is going to present some problems anyhow - but clearly 
will be a desirable thing to do - having one hardwired encoding will simplify 
this.

6. Going with the Flow

VRML2 is UTF8, period.  Java is (supposed to be) UTF8, period.

7. Problems of external entities (Ref. design principles 1 and 3)

Since we'd like XML usable over the net, I'm pretty sure that, de facto, 
we'll see a lot of external entity refs that are, uh, highly non-local.
I posted a question recently to the group: either we assume them all to
be in the same encoding (modulo NDATA), or we apply our encoding-recognition
mechanism every time we interpolate an entity.  The hard-minimalist position
makes this go away.

8. Abstract Philosophy

The answer to our basic question about which encodings to support is 
stated in the notation of a prehistoric number system: One, Two, and Many.  
Thought of in an abstract way, either "One" or "Many" seems easier to 
live with than "Two".  OK, just kidding.

9. Why The UTF8/UCS2 dual encoding proposal is liveable

If I'm going to write a program for processing XML, I'm going to use the
tools I have on the computer that sits in front of me.  They can deal
with UTF8 today.  Thus, I'm going to write a pure UTF8 program, with callouts
to converters for interchange with various other facilities.

And at the front I'm going to have a little kludge along the lines of the
following:

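 /* Sniff the first byte: 0xfe or 0xff signals the UCS2 marker. */
 FirstByte = getc(stream);
 /* Parenthesized because == binds tighter than &; EOF is excluded. */
 if (FirstByte != EOF && (FirstByte & 0xfe) == 0xfe)
 {
   SecondByte = getc(stream); 
   TempName = tempfile_name();   /* call once so we reopen the same file */
   temp = fopen(TempName, "w");
   ConvertUCS2StreamToUTF8(FirstByte, SecondByte, stream, temp);
   fclose(temp);                 /* flush converted data before rereading */
   fclose(stream);
   stream = fopen(TempName, "r");
   FirstByte = getc(stream);
 }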

But it kind of bothers me that my program will have built-in ready-to-go 
handling for UCS2 but not for JIS, ISO-Latin, EBCDIC, and so on.
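
The message names ConvertUCS2StreamToUTF8 without defining it.  A minimal
sketch of the per-character step such a routine would need might look like
the following.  Both function names are hypothetical, and the sketch assumes
plain 16-bit values with no special handling of surrogates.

```c
#include <stddef.h>

/* Hypothetical helper: combines two bytes from the stream into one
   16-bit value.  `big_endian` is nonzero when the marker arrived as
   0xFE 0xFF, zero when it arrived as 0xFF 0xFE. */
unsigned int ucs2_from_bytes(unsigned char first, unsigned char second,
                             int big_endian)
{
    return big_endian ? (((unsigned int)first << 8) | second)
                      : (((unsigned int)second << 8) | first);
}

/* Hypothetical helper: encodes one 16-bit value as UTF-8.
   Writes 1 to 3 bytes into `out` and returns how many were written. */
size_t ucs2_to_utf8(unsigned int code, unsigned char out[3])
{
    code &= 0xFFFF;                       /* assumption: 16-bit input only */
    if (code < 0x80) {
        out[0] = (unsigned char)code;     /* 0xxxxxxx */
        return 1;
    }
    if (code < 0x800) {
        out[0] = (unsigned char)(0xC0 | (code >> 6));           /* 110xxxxx */
        out[1] = (unsigned char)(0x80 | (code & 0x3F));         /* 10xxxxxx */
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (code >> 12));              /* 1110xxxx */
    out[1] = (unsigned char)(0x80 | ((code >> 6) & 0x3F));      /* 10xxxxxx */
    out[2] = (unsigned char)(0x80 | (code & 0x3F));             /* 10xxxxxx */
    return 3;
}
```

A full stream converter would loop over byte pairs, call these two helpers,
and write the result to the temporary file.  A converter for JIS or EBCDIC
would instead need mapping tables, which is part of why the UCS2 case looks
so cheap by comparison.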

I think that is all from me on this thread.
 
Cheers, Tim Bray
tbray@textuality.com http://www.textuality.com/ +1-604-488-1167

