- From: James Clark <jjc@jclark.com>
- Date: Sat, 8 Sep 2012 12:14:33 +0700
- To: John Cowan <cowan@mercury.ccil.org>
- Cc: public-microxml@w3.org
- Message-ID: <CANz3_EbJ47rWRcoBSfiUoW1rsR6rqYcboyL2eHvmBxdWGjoJtw@mail.gmail.com>
Excellent question.
I find the case for excluding non-characters pretty compelling. I would
state it like this:
a) Unicode defines a class of non-character code-points:
[#xFDD0-#xFDEF], [#xFFFE-#xFFFF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].
b) The Unicode stability policy guarantees that no code-points will ever be
added to or removed from this class.
c) XML 1.0 already excludes two members of the non-character class,
specifically #xFFFE-#xFFFF.
d) There's nothing from a Unicode perspective I know of that distinguishes
the two members that XML 1.0 excludes from the other members of the class.
e) Although it's a long list, it's simpler than it looks: (c & 0xFFFE) ==
0xFFFE || (c >= 0xFDD0 && c <= 0xFDEF) (did I get that right?)
On the down side, it makes the spec a bit longer (although we could make up
a notation to make it shorter).
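
To make the test concrete, here is a minimal sketch in Python. It assumes the Unicode definition of the class (the block U+FDD0..U+FDEF plus the last two code points of each of the 17 planes); the function name is made up for illustration.

```python
def is_noncharacter(c: int) -> bool:
    """Return True if code point c is a Unicode non-character."""
    # Last two code points of every plane: U+nFFFE and U+nFFFF.
    if (c & 0xFFFE) == 0xFFFE:
        return True
    # The contiguous block U+FDD0..U+FDEF in the BMP (32 code points).
    return 0xFDD0 <= c <= 0xFDEF


# Count the members over the whole code space, U+0000..U+10FFFF.
count = sum(is_noncharacter(c) for c in range(0x110000))
print(count)  # 66 (17 planes * 2, plus 32)
```

Note that the FDD0 block has to be tested against the full code point rather than its low 16 bits, otherwise code points such as #x1FDD0 would be caught as well.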
The situation with control codes seems a bit murkier to me, particularly as
regards #x85. We don't allow #xC (form-feed), although it's defined in
Unicode, so why should we allow #x85? It seems to me that the more
consistent policy would be only to allow control codes that we define to be
white-space.
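
As a rough sketch of what that policy would mean in practice (Python again), assuming the white-space set is the one XML 1.0 defines (#x9, #xA, #xD, #x20); MicroXML might settle on a different set, and the names here are hypothetical:

```python
# Assumption: white-space as defined by XML 1.0 (#x9, #xA, #xD, #x20).
WHITESPACE = {0x09, 0x0A, 0x0D, 0x20}


def is_control(c: int) -> bool:
    """C0 controls (#x0-#x1F), DEL (#x7F), and C1 controls (#x80-#x9F)."""
    return c <= 0x1F or 0x7F <= c <= 0x9F


def allowed_under_policy(c: int) -> bool:
    """A control code is allowed only if it is defined to be white-space."""
    return not is_control(c) or c in WHITESPACE
```

Under this rule #xC and #x85 are treated the same way (both rejected), while #x9 and #xA remain allowed.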
James
On Sat, Sep 8, 2012 at 9:19 AM, John Cowan <cowan@mercury.ccil.org> wrote:
> I've added a new issue: 12. Are C1 controls and Unicode non-characters
> disallowed?
>
> In XML 1.0 3e, the following text was added to 2.2, Characters:
>
> The characters defined in the following ranges are discouraged. They
> are either control characters or permanently undefined Unicode
> characters:
>
> [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF],
> [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
> [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
> [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
> [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
> [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
> [#x10FFFE-#x10FFFF].
>
> These codepoints are either not very useful in interchange (the C1
> controls [#x7F-#x84] and [#x86-#x9F], because Unicode doesn't say
> what they mean) or are non-characters, code points permanently reserved
> from being assigned to characters and meant for internal use only (all
> the rest).
>
> They couldn't be banned from XML 1.0 because of backward compatibility,
> but I'd like to consider banning them from MicroXML.
>
> Comments?
>
> --
> John Cowan    cowan@ccil.org    http://www.ccil.org/~cowan
> As you read this, I don't want you to feel sorry for me, because,
> I believe everyone will die someday.
>         --From a Nigerian-type scam spam
>
>
Received on Saturday, 8 September 2012 05:15:21 UTC