Re: 12. Are C1 controls and Unicode non-characters disallowed?

On Wed, Sep 12, 2012 at 7:25 AM, Mike Sokolov <sokolov@falutin.net> wrote:

> On 09/12/2012 09:16 AM, Tony Graham wrote:
>
>>
>> I don't know whether this has been discussed, but while the current draft
>> specifies UTF-8 only, another way to simplify the character processing
>> (post-parser) would be to also specify Normalization Form C [2][3], which
>> would mean there would be only one way in MicroXML documents to represent
>> particular characters.
>>
>>
> NFC is called out in the Editor's Draft; I think the idea is you can use
> what you want, but parsers are free to normalize, caveat emptor, you might
> not get what you expect unless you use NFC.  At least that was my breezy
> interpretation :) Read the spec if you want precision...
>

It's rather stronger than that:

4.1 Document conformance
...

[Unicode] says that canonically equivalent sequences of characters ought to
be treated as identical. However, documents that are canonically equivalent
according to Unicode but that use distinct code point sequences are
considered distinct by MicroXML parsers. This gives rise to the possibility
that the user might unintentionally create sequences of characters that are
canonically equivalent but are treated as distinct by MicroXML parsers. To
avoid this possibility, all documents SHOULD be in Normalization Form C as
described by [Unicode].

That's an RFC 2119 SHOULD, which means use Normalization Form C unless you
have an absolutely compelling reason not to.
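
A minimal Python sketch of the situation that paragraph describes, using only
the standard unicodedata module. The helper names is_nfc and
canonically_equivalent are hypothetical, chosen for illustration; they are not
part of the MicroXML draft or of any parser. It shows two spellings of "é"
that Unicode treats as canonically equivalent but that a comparison by code
point (which is what the draft says MicroXML parsers do) treats as distinct:

```python
import unicodedata

# "é" written two ways: one precomposed code point (U+00E9) versus
# "e" followed by a combining acute accent (U+0065 U+0301).
composed = "caf\u00e9"
decomposed = "cafe\u0301"


def is_nfc(text: str) -> bool:
    """Return True if text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


def canonically_equivalent(a: str, b: str) -> bool:
    """Return True if a and b have the same NFC form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Code-point comparison: the two strings differ.
distinct_by_code_points = composed != decomposed

# Unicode canonical equivalence: they are the same text.
equivalent = canonically_equivalent(composed, decomposed)

# Normalizing to NFC before writing the document removes the ambiguity.
normalized = unicodedata.normalize("NFC", decomposed)

print(distinct_by_code_points, equivalent, normalized == composed)
print(is_nfc(composed), is_nfc(decomposed))
```

Run over names and text before serializing, a check like is_nfc would catch
documents that do not follow the SHOULD.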

I do wonder why not just strengthen it that little bit to a MUST.


-- 
Uche Ogbuji                       http://uche.ogbuji.net
Founding Partner, Zepheira        http://zepheira.com
http://wearekin.org
http://www.thenervousbreakdown.com/author/uogbuji/
http://copia.ogbuji.net
http://www.linkedin.com/in/ucheogbuji
http://twitter.com/uogbuji

Received on Wednesday, 12 September 2012 13:39:18 UTC