Re: 12. Are C1 controls and Unicode non-characters disallowed?

On Fri, Sep 7, 2012 at 8:19 PM, John Cowan <cowan@mercury.ccil.org> wrote:

> I've added a new issue: 12. Are C1 controls and Unicode non-characters
> disallowed?
>
> In XML 1.0 3e, the following text was added to 2.2, Characters:
>
>     The characters defined in the following ranges are discouraged. They
>     are either control characters or permanently undefined Unicode
>     characters:
>
>     [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
>     [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
>     [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
>     [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
>     [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
>     [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
>     [#x10FFFE-#x10FFFF].
>
> These codepoints are either not very useful in interchange (the C1
> controls [#x7F-#x84] and [#x86-#x9F], because Unicode doesn't say
> what they mean) or are non-characters, code points permanently reserved
> from being assigned to characters and meant for internal use only (all
> the rest).
>
> They couldn't be banned from XML 1.0 because of backward compatibility,
> but I'd like to consider banning them from MicroXML.
>
> Comments?
>
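
As a rough illustration of the check a MicroXML parser would need if these code points were banned, here is a minimal Python sketch. The ranges are copied as-is from the list quoted above; the names `is_discouraged` and `find_discouraged` are hypothetical and do not come from any spec or existing parser.

```python
# Ranges below #x10000, as quoted in the message above.
# Note: Unicode's noncharacter block runs through #xFDEF; the quoted
# upper bound (#xFDDF) is kept here unchanged.
DISCOURAGED_BMP = [(0x7F, 0x84), (0x86, 0x9F), (0xFDD0, 0xFDDF)]


def is_discouraged(cp: int) -> bool:
    """Return True if code point cp falls in one of the quoted ranges."""
    if any(lo <= cp <= hi for lo, hi in DISCOURAGED_BMP):
        return True
    # The last two code points of planes 1 through 16:
    # #x1FFFE-#x1FFFF, #x2FFFE-#x2FFFF, ..., #x10FFFE-#x10FFFF.
    return cp >= 0x10000 and (cp & 0xFFFF) >= 0xFFFE


def find_discouraged(text: str) -> list:
    """List (index, '#xHEX') pairs for each discouraged character in text."""
    return [
        (i, f"#x{ord(ch):X}")
        for i, ch in enumerate(text)
        if is_discouraged(ord(ch))
    ]
```

For example, `find_discouraged("a\x80b")` reports the C1 control at index 1, while #x85 (NEL) passes because the quoted list deliberately skips it.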


I asked Tony Graham for his thoughts.  His response:

> My first thought is that it's only half a list, since if you're going to
> ban [#xFDD0-#xFDDF], then you might as well also ban #xFFFC, OBJECT
> REPLACEMENT CHARACTER, since it's meant to be meaningless without the
> out-of-stream information about the object it's meant to be replacing, or
> ban #xE0000-#xE007F since they're meant for protocols that don't support
> markup identification.
>
> Has anyone gone through UTR #20, "Unicode in XML and other Markup
> Languages" (http://www.unicode.org/reports/tr20/) to evaluate its
> recommendations w.r.t. what you want from MicroXML?  In principle, if you
> disallowed all the characters that UTR #20 says browsers should discard,
> then everything would be simpler (apart from the MicroXML parsers that
> would then have to check that those characters weren't present).
>
> The C1 controls are difficult, since they aren't well defined.  What's
> gained, other than purity of approach, if they are banned?
>
> Personally, I wouldn't like to see [#xFDD0-#xFDDF] banned since I often
> use one of those characters in XSLT stylesheets, e.g., when joining
> multiple strings together to make a key lookup value, and I'd have to find
> a different technique if there was ever a MicroXML-only XSLT processor
> that didn't allow those characters.  If you searched hard enough, you'd
> probably find somebody, somewhere who's using every one of those
> characters or the end-of-plane characters for their own internal use, just
> like it says on the tin.
>
> In fact, just last week I was thinking about using characters from
> #xE0000-#xE007F to spell 'XSpec' for use as the XSpec-specific namespace
> prefix when XSpec munges a user's XSpec tests to make the stylesheet that
> the framework actually runs (on the grounds that there is unlikely to be a
> user's stylesheet that used that particular prefix), so maybe I'd want to
> see them retained, too, despite what I said above.
>
> Hope I haven't muddied the waters too much.
>
> Regards,
>
>
> Tony Graham                                   tgraham@mentea.net
> Consultant                                 http://www.mentea.net
> Mentea       13 Kelly's Bay Beach, Skerries, Co. Dublin, Ireland
>  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --
>     XML, XSL-FO and XSLT consulting, training and programming
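
To make the key-joining technique Tony mentions concrete, here is a minimal sketch in Python rather than XSLT. It assumes #xFDD0 never occurs in the data being joined; `SEP` and `make_key` are hypothetical names used only for illustration, not anything from Tony's stylesheets.

```python
SEP = "\uFDD0"  # a noncharacter, assumed never to appear in real data


def make_key(*parts: str) -> str:
    """Join parts into one lookup key that cannot collide across boundaries."""
    if any(SEP in part for part in parts):
        raise ValueError("a part already contains the separator")
    return SEP.join(parts)


# Plain concatenation would map both pairs to "abc"; the separator
# keeps them distinct.
index = {
    make_key("ab", "c"): "first",
    make_key("a", "bc"): "second",
}
```

The point of using a noncharacter is that no visible separator needs to be reserved; a processor that rejected #xFDD0 would force a different choice of `SEP`.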

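The idea of spelling 'XSpec' in characters from #xE0000-#xE007F can be sketched as follows, again in Python. It relies on the tag characters #xE0020-#xE007E mirroring printable ASCII at an offset of #xE0000; `to_tag_chars` is a hypothetical helper, not part of XSpec.

```python
TAG_OFFSET = 0xE0000  # tag character = ASCII code point + this offset


def to_tag_chars(s: str) -> str:
    """Map printable ASCII characters to their tag-character counterparts."""
    for ch in s:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError(f"not printable ASCII: {ch!r}")
    return "".join(chr(TAG_OFFSET + ord(ch)) for ch in s)


# A candidate prefix that is unlikely to clash with a user's own prefixes.
prefix = to_tag_chars("XSpec")
```

Whether a given processor accepts such a prefix depends on which characters it allows in names, so this only shows the mapping itself.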


-- 
Uche Ogbuji                       http://uche.ogbuji.net
Founding Partner, Zepheira        http://zepheira.com
http://wearekin.org
http://www.thenervousbreakdown.com/author/uogbuji/
http://copia.ogbuji.net
http://www.linkedin.com/in/ucheogbuji
http://twitter.com/uogbuji

Received on Sunday, 9 September 2012 22:44:20 UTC