Re: 12. Are C1 controls and Unicode non-characters disallowed? from James Clark on 2012-09-10 (public-microxml@w3.org from September 2012)

From: James Clark <jjc@jclark.com>
Date: Mon, 10 Sep 2012 08:49:42 +0700
To: public-microxml@w3.org
Cc: tgraham@mentea.net, Uche Ogbuji <uche@ogbuji.net>
Message-ID: <CANz3_Ea2K50jYR7UbHhc3=+1SHdC8ccYpQ86WNdXgQcE6R0UTQ@mail.gmail.com>
Section 16.7 of the Unicode Standard says:

> Noncharacters are code points that are permanently reserved in the
> Unicode Standard for internal use. They are forbidden for use in open
> interchange of Unicode text data. See Section 3.4, Characters and
> Encoding, for the formal definition of noncharacters and conformance
> requirements related to their use.
>
> The Unicode Standard sets aside 66 noncharacter code points. The last
> two code points of each plane are noncharacters: U+FFFE and U+FFFF on
> the BMP, U+1FFFE and U+1FFFF on Plane 1, and so on, up to U+10FFFE and
> U+10FFFF on Plane 16, for a total of 34 code points. In addition,
> there is a contiguous range of another 32 noncharacter code points in
> the BMP: U+FDD0..U+FDEF. For historical reasons, the range
> U+FDD0..U+FDEF is contained within the Arabic Presentation Forms-A
> block, but those noncharacters are not “Arabic noncharacters” or
> “right-to-left noncharacters,” and are not distinguished in any other
> way from the other noncharacters, except in their code point values.
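
The arithmetic in that passage (34 end-of-plane code points plus 32 in
U+FDD0..U+FDEF, 66 in total) can be checked with a short sketch. The
names is_noncharacter and NONCHARACTERS are illustrative only and are
not taken from any specification or library:

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the Unicode noncharacter code points."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (cp,))
    # The contiguous block of 32 noncharacters in the BMP.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane end in FFFE or FFFF.
    return (cp & 0xFFFE) == 0xFFFE


# Enumerate every noncharacter across planes 0..16.
NONCHARACTERS = [cp for cp in range(0x110000) if is_noncharacter(cp)]

print(len(NONCHARACTERS))  # 66 = 17 planes * 2 + 32
```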


So my question for Tony would be: what is the difference between

- U+FFFE and U+FFFF, and
- the other 64 noncharacters

that justifies forbidding the former but not the latter?

You could argue that the right approach for noncharacters is to recommend
against their use for interchange rather than forbid them, but given that
XML 1.0 has forbidden U+FFFE-U+FFFF, it seems to me that the cleanest
approach is to forbid all noncharacters.
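
As a rough illustration of the difference between the two rules, here is
a minimal sketch that scans a string under each of them. The function
names are hypothetical, and this is not how any existing XML or MicroXML
parser is known to implement the check:

```python
from typing import List, Tuple


def xml10_forbidden(text: str) -> List[Tuple[int, str]]:
    """Report only U+FFFE and U+FFFF, the noncharacters XML 1.0 forbids."""
    return [(i, "U+%04X" % ord(ch))
            for i, ch in enumerate(text)
            if ord(ch) in (0xFFFE, 0xFFFF)]


def all_noncharacters(text: str) -> List[Tuple[int, str]]:
    """Report every noncharacter: U+FDD0..U+FDEF and the last two of each plane."""
    hits = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            hits.append((i, "U+%04X" % cp))
    return hits


sample = "a\ufdd0b\U0001fffe"
print(xml10_forbidden(sample))    # []
print(all_noncharacters(sample))  # [(1, 'U+FDD0'), (3, 'U+1FFFE')]
```

The sample string passes the XML 1.0 rule but is rejected once all 66
noncharacters are treated alike.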

James

On Mon, Sep 10, 2012 at 5:43 AM, Uche Ogbuji <uche@ogbuji.net> wrote:

> On Fri, Sep 7, 2012 at 8:19 PM, John Cowan <cowan@mercury.ccil.org> wrote:
>
>> I've added a new issue: 12. Are C1 controls and Unicode non-characters
>> disallowed?
>>
>> In XML 1.0 3e, the following text was added to 2.2, Characters:
>>
>>     The characters defined in the following ranges are discouraged. They
>>     are either control characters or permanently undefined Unicode
>>     characters:
>>
>>     [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
>>     [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
>>     [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
>>     [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
>>     [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
>>     [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
>>     [#x10FFFE-#x10FFFF].
>>
>> These codepoints are either not very useful in interchange (the C1
>> controls [#x7F-#x84] and [#x86-#x9F], because Unicode doesn't say
>> what they mean) or are non-characters, code points permanently reserved
>> from being assigned to characters and meant for internal use only (all
>> the rest).
>>
>> They couldn't be banned from XML 1.0 because of backward compatibility,
>> but I'd like to consider banning them from MicroXML.
>>
>> Comments?
>>
>
>
> I asked Tony Graham for his thoughts.  His response:
>
>> My first thought is that it's only half a list, since if you're going to
>> ban [#xFDD0-#xFDDF], then you might as well also ban #xFFFC, OBJECT
>> REPLACEMENT CHARACTER, since it's meant to be meaningless without the
>> out-of-stream information about the object it's meant to be replacing, or
>> ban #xE0000-#xE007F since they're meant for protocols that don't support
>> markup identification.
>>
>> Has anyone gone through UTR #20, "Unicode in XML and other Markup
>> Languages" (http://www.unicode.org/reports/tr20/) to evaluate its
>> recommendations w.r.t. what you want from MicroXML?  In principle, if you
>> disallowed all the characters that UTR #20 says browsers should discard,
>> then everything would be simpler (apart from the MicroXML parsers that
>> would then have to check that those characters weren't present).
>>
>> The C1 controls are difficult, since they aren't well defined.  What's
>> gained, other than purity of approach, if they are banned?
>>
>> Personally, I wouldn't like to see [#xFDD0-#xFDDF] banned since I often
>> use one of those characters in XSLT stylesheets, e.g., when joining
>> multiple strings together to make a key lookup value, and I'd have to find
>> a different technique if there was ever a MicroXML-only XSLT processor
>> that didn't allow those characters.  If you searched hard enough, you'd
>> probably find somebody, somewhere who's using every one of those
>> characters or the end-of-plane characters for their own internal use, just
>> like it says on the tin.
>>
>> In fact, just last week I was thinking about using characters from
>> #xE0000-#xE007F to spell 'XSpec' for use as the XSpec-specific namespace
>> prefix when XSpec munges a user's XSpec tests to make the stylesheet that
>> the framework actually runs (on the grounds that there is unlikely to be a
>> user's stylesheet that used that particular prefix), so maybe I'd want to
>> see them retained, too, despite what I said above.
>>
>> Hope I haven't muddied the waters too much.
>>
>> Regards,
>>
>>
>> Tony Graham                                   tgraham@mentea.net
>> Consultant                                 http://www.mentea.net
>> Mentea       13 Kelly's Bay Beach, Skerries, Co. Dublin, Ireland
>>  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --
>>     XML, XSL-FO and XSLT consulting, training and programming
>
>
>
> --
> Uche Ogbuji                       http://uche.ogbuji.net
> Founding Partner, Zepheira        http://zepheira.com
> http://wearekin.org
> http://www.thenervousbreakdown.com/author/uogbuji/
> http://copia.ogbuji.net
> http://www.linkedin.com/in/ucheogbuji
> http://twitter.com/uogbuji
>
>
Received on Monday, 10 September 2012 01:50:31 UTC