Re: XML character sets: a proposal

>I agree that fixing the character repertoire on ISO 10646 is the right
>way to go.  (There appears to be consensus on this.)  

Good. One down.

>However, although many platforms do not currently have support for
>either UTF-8 or UCS-2/UTF-16, my impression is that most of them are
>planning support for one or the other.  This suggests the possibility
>of allowing just these two encodings for XML.

I will tell you exactly what will happen if we do this:
   1) People who use ASCII, will pretend they are using UTF-8
   2) People who do not use ASCII (SJIS, EUC, JOHAB etc.) will
      *ignore* this requirement and implement systems that handle the
      encodings they use every day.

As you, and many other must also be aware, UNICODE doesn't solve the
worlds problems either, just 95% of the more common ones ;-) We have
to think of XML being used to deliver things like the classical
buddhist texts, dead languages, etc.

>- Allow any encoding at all?  That's an implementation nightmare and
>users lose the guarantee that any XML system will be able to handle
>any XML document.

I think it would be quite hard to guarantee that all XML systems will
be able to meaningfully intrepret any arbitrary XML document
anyway. We are defining a data syntax, and parsers, but not an entire
set of application semantics. 

In addition, allowing any encoding to be used is not the nightmare you
make it out to be (it's not particularly pleasant either, and in fact, I
argued very strongly for UTF-8 to become the lingua franca of the WWW
a year and a half ago). I have no problem with recommending that
UTF-8 or UTF-16 be used for XML, and I will even applaude that
stance, but I cannot stand by and watch them become the *only* choice.  

The problem of unlimited encodings can be approched in one of two

   1) Limit the encoding an application can accept. For browsers,
      content labelling and negotiation are sufficient to handle this. 
   2) Make the application open-ended by using dynamic link libraries,
      scripting languages, or whatever, so that encoding conversion
      modules can be downloaded.

I am very much in favor of (2), and have a prototype for a langauge
that could be used to accomplish this (and Rick also has an idea for a 
language that would be quite interesting here). 

>And if we have lots of possible encodings, we certainly can't
>auto-detect which are being used.  The possibilities I can think of
>a) rely on some external mechanism to specify the encoding

MIME content labelling. This is why I have been fighting for close to
2 years to get client and servers to actually accept, generate, and
use the charset parameter on the HTML MIME type:

   Content-Type: text/html; charset=shift-jis

I also proposed a data storage format that could be used to combine 
such meta-data and the actual data together in a single file.

>b) use ISO 2022 escape sequences

God forbid....

>c) require each XML file to have a MIME header with a charset

Right. This is pretty much what my *.mim file type did: it just has a
group of MIME headers followed by CRLF, then the data.

>d) use an attribute in a formal system identifier

A nice way to go...

>Whichever way we go, there's an enormous step up in complexity over
>the UTF-8/UCS-2 route.

It depends. Actually *using* UTF-8/UTF-16 in a meaningfull way in an
application would be just as difficult as any other encoding, and in
the case of UTF-16, maybe even more so. Having only these two
encodings also prescribes certain implementation decisions.

Supporting a wide range of encodings is certainly less work that
supporting a wide range of languages in a browser.

Follow-Ups: References: