Re: 12. Are C1 controls and Unicode non-characters disallowed? from James Clark on 2012-09-08 (public-microxml@w3.org from September 2012)

From: James Clark <jjc@jclark.com>
Date: Sat, 8 Sep 2012 12:14:33 +0700
To: John Cowan <cowan@mercury.ccil.org>
Cc: public-microxml@w3.org
Message-ID: <CANz3_EbJ47rWRcoBSfiUoW1rsR6rqYcboyL2eHvmBxdWGjoJtw@mail.gmail.com>

Excellent question.

I find the case for excluding non-characters pretty compelling. I would
state it like this:

a) Unicode defines a class of non-character code-points:
    [#xFDD0-#xFDEF], [#xFFFE-#xFFFF],
    [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
    [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
    [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
    [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
    [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
    [#x10FFFE-#x10FFFF].

b) The Unicode stability policy guarantees that no code-points will ever be
added to or removed from this class.

c) XML 1.0 already excludes two members of the non-character class,
specifically #xFFFE-#xFFFF.

d) There's nothing from a Unicode perspective I know of that distinguishes
the two members that XML 1.0 excludes from the other members of the class.

e) Although it's a long list, it's simpler than it looks: (c & 0xFFFE) ==
0xFFFE || (c & 0xFFF0) == 0xFDD0 (did I get that right?)
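
One way to answer the question in (e) is to brute-force the test over the
whole code space. The following is a minimal sketch with hypothetical
function names. It assumes the non-character set is U+FDD0..U+FDEF plus the
last two code points of each of the 17 planes, which is how Unicode defines
it, and compares that set against the bit test exactly as written in (e).

```python
def is_nonchar_expr(c: int) -> bool:
    """The bit test exactly as written in item (e)."""
    return (c & 0xFFFE) == 0xFFFE or (c & 0xFFF0) == 0xFDD0


def is_nonchar_unicode(c: int) -> bool:
    """Assumed reference set: U+FDD0..U+FDEF plus the last two
    code points of every plane (U+nFFFE and U+nFFFF)."""
    return 0xFDD0 <= c <= 0xFDEF or (c & 0xFFFE) == 0xFFFE


# Every Unicode code point, U+0000 through U+10FFFF.
ALL_CODE_POINTS = range(0x110000)

# Code points the bit test flags although they are not in the reference set.
false_positives = [
    c for c in ALL_CODE_POINTS
    if is_nonchar_expr(c) and not is_nonchar_unicode(c)
]

# Code points in the reference set that the bit test misses.
false_negatives = [
    c for c in ALL_CODE_POINTS
    if is_nonchar_unicode(c) and not is_nonchar_expr(c)
]

if __name__ == "__main__":
    print("false positives:", len(false_positives))
    print("false negatives:", len(false_negatives))
```

Under those assumptions the plane-end half of the test holds up, but the
second half does not: masking with 0xFFF0 ignores the plane, so it also
matches code points such as #x1FDD0, and it stops at #xFDDF. A plain range
comparison on the full value of c for the FDD0 block would avoid both
problems.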

On the downside, it makes the spec a bit longer (although we could make up
a notation to make it shorter).

The situation with control codes seems a bit murkier to me, particularly as
regards #x85. We don't allow #xC (form-feed), although it's defined in
Unicode, so why should we allow #x85? It seems to me that the more
consistent policy would be only to allow control codes that we define to be
white-space.
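
As a rough illustration of that policy, here is a minimal sketch with
hypothetical names. It assumes the white-space set is the one in XML 1.0's
S production (#x9, #xA, #xD, #x20); the set MicroXML ends up defining may
differ. Control codes are taken to be the C0 range, DEL, and the C1 range.

```python
# Assumption: white-space as in XML 1.0's S production.
WHITESPACE = {0x9, 0xA, 0xD, 0x20}


def is_control(c: int) -> bool:
    """C0 controls (#x0-#x1F), DEL (#x7F), and C1 controls (#x80-#x9F)."""
    return c <= 0x1F or 0x7F <= c <= 0x9F


def allowed_under_policy(c: int) -> bool:
    """A control code is allowed only if it is defined as white-space;
    this check says nothing about non-control code points."""
    return not is_control(c) or c in WHITESPACE
```

With this rule #xC and #x85 are treated alike: both are control codes,
neither is in the assumed white-space set, so both are rejected.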

James

On Sat, Sep 8, 2012 at 9:19 AM, John Cowan <cowan@mercury.ccil.org> wrote:

> I've added a new issue: 12. Are C1 controls and Unicode non-characters
> disallowed?
>
> In XML 1.0 3e, the following text was added to 2.2, Characters:
>
>     The characters defined in the following ranges are discouraged. They
>     are either control characters or permanently undefined Unicode
>     characters:
>
>     [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
>     [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
>     [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
>     [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
>     [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
>     [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
>     [#x10FFFE-#x10FFFF].
>
> These codepoints are either not very useful in interchange (the C1
> controls [#x7F-#x84] and [#x86-#x9F], because Unicode doesn't say
> what they mean) or are non-characters, code points permanently reserved
> from being assigned to characters and meant for internal use only (all
> the rest).
>
> They couldn't be banned from XML 1.0 because of backward compatibility,
> but I'd like to consider banning them from MicroXML.
>
> Comments?
>
> --
> As you read this, I don't want you to feel      John Cowan
> sorry for me, because, I believe everyone       cowan@ccil.org
> will die someday.                               http://www.ccil.org/~cowan
>         --From a Nigerian-type scam spam
>
>

Received on Saturday, 8 September 2012 05:15:21 UTC