Re: XML character sets: the hard-minimalist manifesto

[In response to Tim]

I should note again that I actually sympathise *strongly* with the
hard-minimalist side, because I have also argued from that perspective
more than once. However given the history of such things in Asia,
and especially Japan, I will tell you that it will cause more problems
than most are aware.

>In the short term, it clearly ridiculous for XML to try to support
>all these.  In the hard-minimalist world, everybody has to do exactly
>the same thing, whether they're operating in ISO-Latin1, EUC, or
>11-bit reverse-polarity Ojibway: 

I have never argued that all XML parsers needs to support all
encodings. I have only said that we should give XML parsers freedom to
support the encodings they choose to (I could even go as far as to say
that we might perhaps require that all XML parser be able to at least
*accept* data in UTF-8 and UCS-2, though that would also prescribe
certain implementation details, which I feel to be undesirable).

>I would argue that at the moment, UTF8 is the clear favorite in terms of a
>format that you'd want to build universal converters to and from.

Except that it bloats data size for certain languages...

>But the gains in simplicity, robustness, and performance that are going
>to come from a programmer being able to address an XML byte stream directly,
>particularly a byte stream on which we can use the tools we have today, 
>are immense.  I suspect such programmers will have enough real input problems 
>dealing with disk files and socket feeds and OODBMS blobs and RDBMS tuple
>sequences - let's not make it any harder.

It's not any harder, just different.

>We can write an SGML declaration to support 10646/UTF8.  We can also write
>one to support 10646/UCS2.  Can we get away with having 0xfffe at the front
>of the file and still be SGML-compliant?

This is a red herring. I have always said that we should fix the
document character set to ISO 10646.

>I am far from convinced thatthe consequences of going with "XML is UTF8" are 
>serious at all.  (In practical terms, the consequences of not supporting 
>ISO-Latin-X and JIS are much more serious, but (sorry Gavin) we are probably 
>heading in that direction.)

If we follow this route, I will guarantee that people will build
non-conformant systems, and use them (witness the WWW). We should
recognise this up front, and give people the flexibility to build such
systems in a conformant manner, with a warning that they face
incompatability in using anthing other than UTF-8 or UTF-16. 

>5. Network problems of self-labelling (Ref. design principle #1)

This is a red herring. There are standards that are in place. Standards
are meant to be conformed to. 

>Finally, emailing XML is going to present some problems anyhow - but
>clearly will be a desirable thing to do - having one hardwired
>encoding will simplify this.

How? I

>VRML2 is UTF8, period.  Java is (supposed to be) UTF8, period.

VRML2 is UTF-8, yes. I argued against this, and some sites in Japan
already have data that use iso-2022-jp. Java is not UTF-8, nor ever
was. Java is UNICODE based (the char in JAVA is 16 bits, and character
classification takes place via UNICODE). However, JAVA also has the
notion of "localised input streams", which basically are an I/O
streams with a X<->UNICODE converter on them. I should note that there
are a number of efforts to "localise" JAVA in Japan (ie, to make SJIS

>7. Problems of external entities (Ref. design principles 1 and 3)

Given correct protocol-level labelling (as is in HTTP 1.1, and MIME)
this is a non-problem.

>If I'm going to write a program for processing XML, I'm going to use the
>tools I have on the computer that sits in front of me.  They can deal
>with UTF8 today.  Thus, I'm going to write a pure UTF8 program, with callouts
>to converters for interchange with various other facilities.

I'm going to write a *proper* application that does the following:

  StorageObject object = network.fetch(entity.get_system_id());
  InputStream   stream = new InputStream(object.get_stream(), 

  InputStream::InputStream(InputStream stream, String encoding)
     m_decoder = Decoder::resolve(encoding);
     m_stream  = stream;