- From: James Clark <jjc@jclark.com>
- Date: Sat, 8 Sep 2012 12:14:33 +0700
- To: John Cowan <cowan@mercury.ccil.org>
- Cc: public-microxml@w3.org
- Message-ID: <CANz3_EbJ47rWRcoBSfiUoW1rsR6rqYcboyL2eHvmBxdWGjoJtw@mail.gmail.com>
Excellent question.

I find the case for excluding non-characters pretty compelling. I would state it like this:

a) Unicode defines a class of non-character code-points:

[#xFDD0-#xFDDF], [#xFFFE-#xFFFF], [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF],
[#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], [#xAFFFE-#xAFFFF],
[#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF],
[#xFFFFE-#xFFFFF], [#x10FFFE-#x10FFFF].

b) The Unicode stability policy guarantees that no code-points will ever be added to or removed from this class.

c) XML 1.0 already excludes two members of the non-character class, specifically #xFFFE-#xFFFF.

d) There's nothing from a Unicode perspective I know of that distinguishes the two members that XML 1.0 excludes from the other members of the class.

e) Although it's a long list, it's simpler than it looks:

(c & 0xFFFE) == 0xFFFE || (c & 0xFFF0) == 0xFDD0

(did I get that right?)

On the down side, it makes the spec a bit longer (although we could make up a notation to make it shorter).

The situation with control codes seems a bit murkier to me, particularly as regards #x85. We don't allow #xC (form-feed), although it's defined in Unicode, so why should we allow #x85? It seems to me that the more consistent policy would be to allow only control codes that we define to be white-space.

James

On Sat, Sep 8, 2012 at 9:19 AM, John Cowan <cowan@mercury.ccil.org> wrote:

> I've added a new issue: 12. Are C1 controls and Unicode non-characters
> disallowed?
>
> In XML 1.0 3e, the following text was added to 2.2, Characters:
>
>     The characters defined in the following ranges are discouraged. They
>     are either control characters or permanently undefined Unicode
>     characters:
>
>     [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
>     [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
>     [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
>     [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
>     [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
>     [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
>     [#x10FFFE-#x10FFFF].
>
> These codepoints are either not very useful in interchange (the C1
> controls [#x7F-#x84] and [#x86-#x9F], because Unicode doesn't say
> what they mean) or are non-characters, code points permanently reserved
> from being assigned to characters and meant for internal use only (all
> the rest).
>
> They couldn't be banned from XML 1.0 because of backward compatibility,
> but I'd like to consider banning them from MicroXML.
>
> Comments?
>
> --
> As you read this, I don't want you to feel      John Cowan
> sorry for me, because, I believe everyone       cowan@ccil.org
> will die someday.                               http://www.ccil.org/~cowan
>         --From a Nigerian-type scam spam
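
As a rough way to answer the "did I get that right?" in point (e), the sketch below compares the bit-mask test, transcribed as written, against the ranges listed in point (a). The function names and the `LISTED` set are hypothetical helpers introduced only for this check; nothing here comes from a MicroXML draft. Under these assumptions the plane-end half of the test, (c & 0xFFFE) == 0xFFFE, agrees with the list, while (c & 0xFFF0) == 0xFDD0 ignores the plane bits and so also matches xFDD0-xFDDF in planes 1 through 16. Separately, the Unicode Standard defines the BMP non-character block as U+FDD0..U+FDEF (32 code points), so the #xFDDF upper bound in the list itself may be worth rechecking.

```python
# Ranges from point (a), built explicitly so the comparison is independent
# of the bit-mask trick. LISTED is a hypothetical helper for this sketch.
LISTED = set(range(0xFDD0, 0xFDDF + 1))
for plane in range(0x11):  # planes 0 through 16
    LISTED.add((plane << 16) | 0xFFFE)
    LISTED.add((plane << 16) | 0xFFFF)


def in_listed_ranges(c):
    """True if c falls in one of the ranges listed in point (a)."""
    return c in LISTED


def matches_mask_test(c):
    """The expression from point (e), transcribed as written."""
    return (c & 0xFFFE) == 0xFFFE or (c & 0xFFF0) == 0xFDD0


def matches_plane_aware(c):
    """One possible adjustment (an assumption, not from the thread):
    keep the plane-end mask but restrict the FDD0 block to the BMP."""
    return (c & 0xFFFE) == 0xFFFE or 0xFDD0 <= c <= 0xFDDF


def disagreements():
    """Code points where the mask test and the listed ranges differ."""
    return [c for c in range(0x110000)
            if matches_mask_test(c) != in_listed_ranges(c)]


if __name__ == "__main__":
    diff = disagreements()
    print(len(diff), hex(min(diff)))
```

With the list as given, the mismatches are exactly the 256 code points whose low 16 bits fall in FDD0-FDDF outside the BMP, starting at 0x1FDD0; the plane-aware variant has no mismatches against the list.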
Received on Saturday, 8 September 2012 05:15:21 UTC