Re: 12. Are C1 controls and Unicode non-characters disallowed?

On Wed, Sep 12, 2012 at 7:16 AM, Tony Graham <tgraham@mentea.net> wrote:

> On Mon, September 10, 2012 2:49 am, James Clark wrote:
> ...
> > So my question for Tony would be: what is the difference between
> >
> > - 0xFFFE - 0xFFFF, and
> > - the other 64 noncharacters
> >
> > that justifies forbidding the former but not the latter?
>
> Nothing.  If I were using the other non-characters instead and had stated
> that I would find it personally inconvenient if they were eventually
> disallowed by the tools that I wanted to use, then you could ask the same
> question the other way around just as easily.
>
> > You could argue that the right approach for noncharacters is to recommend
> > against their use for interchange rather than forbid them, but given that
> > XML 1.0 has forbidden U+FFFE-U+FFFF, it seems to me that the cleanest
> > approach is to forbid all noncharacters.
>
> Without arguing for or against the inclusion of non-characters, I don't
> understand the motivation for forbidding them.  If the goal is radical
> simplicity, then it would be simpler to allow the whole slew of
> characters.  If the goal is to "complement rather than replace XML, JSON
> and HTML" [1] then if one of the three disallows them (I don't know about
> JSON), they should be forbidden.
>

The goal of backward compatibility is stronger than the goal of simplicity,
so allowing the full repertoire isn't an option.  And to be fair, XML has
tried to ban non-characters for a while, and almost certainly would in an
XML 2.0.  The reason non-characters are only stated as "discouraged" in the
latest editions of the specs (i.e. much like RFC "SHOULD") is that XML
can't, by rule, make forward-incompatible changes until a major version
update.

So there is an argument to be made for pragmatic complementarity with XML
in banning non-characters.
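
To make concrete what "all noncharacters" covers, here is a minimal Python
sketch. It assumes the Unicode definition of the 66 noncharacters: the block
U+FDD0..U+FDEF plus the last two code points of each of the 17 planes. The
function name is_noncharacter is hypothetical and is not taken from any
MicroXML draft.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # U+FDD0..U+FDEF: a contiguous run of 32 noncharacters in the BMP.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # U+nFFFE and U+nFFFF: the last two code points of each plane (n = 0..16).
    return (cp & 0xFFFE) == 0xFFFE


# Enumerate every noncharacter across the whole code space.
noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]

# The two that XML 1.0 already forbids, and "the other 64" from the thread.
xml10_forbidden = [0xFFFE, 0xFFFF]
others = [cp for cp in noncharacters if cp not in xml10_forbidden]
```

Under this definition, forbidding all noncharacters amounts to one range
check and one mask test per code point, on top of the two code points XML
1.0 already excludes.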



> I don't know whether this has been discussed, but while the current draft
> specifies UTF-8 only, another way to simplify the character processing
> (post-parser) would be to also specify Normalization Form C [2][3], which
> would mean there would be only one way in MicroXML documents to represent
> particular characters.
>

I don't think it has come up again since this CG was created, but early on
in the MicroXML discussion on XML-DEV the encoding question was brought up,
and there was strong consensus on UTF-8-only.  For one thing, there are a
fair number of platforms, languages, etc. whose Unicode support goes no
further than UTF-8, so this would make it easier to
implement MicroXML for a wider variety of environments.  Incidentally, on
the complementarity point, JSON is UTF-8 only as well.
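
On the Normalization Form C suggestion quoted above, a small Python sketch
may help show what "only one way to represent particular characters" means
in practice. It uses the standard library's unicodedata.normalize; the
variable names are illustrative only, and nothing here is taken from the
MicroXML draft.

```python
import unicodedata

# Two different code point sequences for the same visible character.
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

# Without normalization they compare unequal and encode to different bytes.
same_before = decomposed == precomposed
bytes_decomposed = decomposed.encode("utf-8")    # 3 bytes
bytes_precomposed = precomposed.encode("utf-8")  # 2 bytes

# After NFC both collapse to the precomposed form.
nfc = unicodedata.normalize("NFC", decomposed)
same_after = nfc == precomposed
```

UTF-8-only fixes the byte encoding of each code point, but it does not by
itself make these two spellings equal; that is the gap NFC would close.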


-- 
Uche Ogbuji                       http://uche.ogbuji.net
Founding Partner, Zepheira        http://zepheira.com
http://wearekin.org
http://www.thenervousbreakdown.com/author/uogbuji/
http://copia.ogbuji.net
http://www.linkedin.com/in/ucheogbuji
http://twitter.com/uogbuji

Received on Wednesday, 12 September 2012 13:48:11 UTC