- From: Uche Ogbuji <uche@ogbuji.net>
- Date: Wed, 12 Sep 2012 07:47:36 -0600
- To: Tony Graham <tgraham@mentea.net>
- Cc: public-microxml@w3.org
- Message-ID: <CAPJCua33PFcg4McDjs5JtXaxNqZ86oURUywArahyOP2wK+4kqA@mail.gmail.com>
On Wed, Sep 12, 2012 at 7:16 AM, Tony Graham <tgraham@mentea.net> wrote:

> On Mon, September 10, 2012 2:49 am, James Clark wrote:
> ...
> > So my question for Tony would be: what is the difference between
> >
> > - 0xFFFE - 0xFFFF, and
> > - the other 64 noncharacters
> >
> > that justifies forbidding the former but not the latter?
>
> Nothing. If I were using the other non-characters instead and had stated
> that I would find it personally inconvenient if they were eventually
> disallowed by the tools that I wanted to use, then you could ask the same
> question the other way around just as easily.
>
> > You could argue that the right approach for noncharacters is to recommend
> > against their use for interchange rather than forbid them, but given that
> > XML 1.0 has forbidden U+FFFE-U+FFFF, it seems to me that the cleanest
> > approach is to forbid all noncharacters.
>
> Without arguing for or against the inclusion of non-characters, I don't
> understand the motivation for forbidding them. If the goal is radical
> simplicity, then it would be simpler to allow the whole slew of
> characters. If the goal is to "complement rather than replace XML, JSON
> and HTML" [1] then if one of the three disallows them (I don't know about
> JSON), they should be forbidden.
>

The goal of backward compatibility is stronger than the goal of simplicity, so allowing the full repertoire isn't an option. And to be fair, XML has tried to ban non-characters for a while, and almost certainly would in an XML 2.0. The reason non-characters are only "discouraged" in the latest editions of the spec (i.e. much like an RFC "SHOULD") is that XML can't, by rule, make forward-incompatible changes until a major version update. So there is an argument to be made for pragmatic complementarity with XML in banning non-characters.
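For concreteness, the noncharacters under discussion are U+FDD0 through U+FDEF plus the last two code points of each of the 17 planes (U+FFFE and U+FFFF among them), 66 in all. A parser that forbids them only needs a couple of comparisons per code point. The following is a sketch in Python, not anything taken from a MicroXML draft; the names `is_noncharacter` and `find_noncharacters` are hypothetical.

```python
def is_noncharacter(cp):
    """Return True if code point cp is one of Unicode's 66 noncharacters."""
    # Contiguous block of 32 noncharacters in the BMP.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # Last two code points of every plane: U+FFFE, U+FFFF, U+1FFFE, ...,
    # U+10FFFF. Their low 16 bits are 0xFFFE or 0xFFFF.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def find_noncharacters(text):
    """Return (index, 'U+XXXX') pairs for every noncharacter in text."""
    return [
        (i, "U+%04X" % ord(ch))
        for i, ch in enumerate(text)
        if is_noncharacter(ord(ch))
    ]


if __name__ == "__main__":
    total = sum(is_noncharacter(cp) for cp in range(0x110000))
    # Noncharacters other than U+FFFE and U+FFFF.
    others = total - 2
    print(total, others)
    print(find_noncharacters("a\uFDD0b\U0001FFFE"))
```

Counting over the whole code space this way gives the "other 64" from James's question once U+FFFE and U+FFFF are set aside, so forbidding all noncharacters costs no more than the existing two-code-point check plus one range test.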
> I don't know whether this has been discussed, but while the current draft
> specifies UTF-8 only, but another way to simplify the character processing
> (post-parser) would be to also specify Normalization Form C [2][3], which
> would mean there would be only one way in MicroXML documents to represent
> particular characters.
>

I don't think it has come up again since this CG was created, but early on in the MicroXML discussion on XML-DEV the encoding question was brought up, and there was strong consensus on UTF-8-only. For one thing, there are a fair number of platforms, languages, etc. whose Unicode support goes no further than UTF-8, so this would make it easier to implement MicroXML for a wider variety of environments.

Incidentally, on the complementarity point, JSON is UTF-8 only as well.

--
Uche Ogbuji http://uche.ogbuji.net
Founding Partner, Zepheira http://zepheira.com
http://wearekin.org
http://www.thenervousbreakdown.com/author/uogbuji/
http://copia.ogbuji.net
http://www.linkedin.com/in/ucheogbuji
http://twitter.com/uogbuji
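To illustrate the Normalization Form C suggestion quoted above: without normalization, canonically equivalent text can arrive as different code point sequences, and therefore as different UTF-8 bytes. The sketch below uses Python's standard `unicodedata` module; the helper name `is_nfc` is hypothetical, and nothing here is taken from a MicroXML draft.

```python
import unicodedata

# "é" as one precomposed code point, and as "e" + COMBINING ACUTE ACCENT.
composed = "\u00E9"
decomposed = "e\u0301"

# The two strings are canonically equivalent but compare unequal,
# and they encode to different UTF-8 byte sequences.
assert composed != decomposed
assert composed.encode("utf-8") == b"\xc3\xa9"
assert decomposed.encode("utf-8") == b"e\xcc\x81"

# Normalizing to NFC maps both to the same (precomposed) form.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFC", composed) == composed


def is_nfc(text):
    """Return True if text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text
```

A processor that required NFC could reject input for which `is_nfc` is false, or normalize it on the way in. Which of those a spec would choose is a separate design decision that this sketch does not settle.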
Received on Wednesday, 12 September 2012 13:48:11 UTC