RE: HTML5 and Unicode Normalization Form C

> As for using non-NFC outside attributes, then I don't know if there are issues
> which can justify a warning. But according to Unicode technical report 15, then
> the "W3C Character Model for the World Wide Web [ snip ] and other W3C
> Specifications (such as XML 1.0 5th Edition) recommend using Normalization
> Form C for all content." [4]

There has been some confusion about what Charmod-Norm says (and what the Internationalization WG thought it meant when it said it). I'd like to clarify somewhat. Please note that this is a *personal* email, with my chair hat off.

The normative bits of Charmod-Norm live at [1]. Items C300 and C301 use the RFC 2119 keyword "SHOULD" in requiring that content and specifications be fully-normalized or include-normalized. These requirements used to say "MUST" because the original intent was that "early uniform normalization" (EUN) would be required by the Character Model.

In 2004/2005, the Internationalization Working Group decided that early uniform normalization was dead and that requiring normalization of content (such that applications could assume that content was already normalized) was no longer a reasonable position for Charmod. The debate was whether to relax the "MUST" requirement to "SHOULD" or "MAY", or to remove it altogether. The WG felt, at that time, that normalized content was desirable even if applications and formats could not count on normalization having been applied. Therefore, the recommendation was kept at "SHOULD" rather than weakened to "MAY" or dropped. Further, it was felt that new formats might wish to require normalization even if existing formats did not.

It would be unreasonable, in my opinion, to treat HTML5 as a *new* format, so I think any expectations for adding a normalization requirement to HTML are unrealistic.

With EUN dropped, other requirements were added or modified to deal with the fact that content could no longer be guaranteed to be in a normalized form. The RFC 2119 keyword "SHOULD" has a very strong normative meaning (only a little less strong than "MUST"), but the WG's intent was significantly weaker than that. Once you cannot assume that content is normalized, you must perform normalization-sensitive operations carefully or suffer the consequences.
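
To illustrate what a normalization-sensitive operation looks like, here is a minimal sketch in Python using the standard unicodedata module. The helper name nfc_equal is my own invention for this example and does not come from Charmod-Norm or any other specification; it simply normalizes both sides to NFC before comparing, which is one way (not the only way) to avoid depending on the content already being normalized.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Two encodings of the same visible character:
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# A plain code point comparison treats them as different strings.
raw_equal = precomposed == decomposed

# Normalizing first makes the comparison insensitive to the input form.
normalized_equal = nfc_equal(precomposed, decomposed)

print(raw_equal, normalized_equal)
```

An identifier or attribute-value match done with the plain comparison would fail here even though a user sees the same text in both cases, which is the kind of consequence meant above.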

Charmod-Norm was not intended to be advanced in its current form, for precisely the reasons we are discussing on this thread. Removing EUN adds complexity, since specifications and formats must then deal with normalization independently, especially when it comes to things such as identifiers. The I18N Core WG has recently agreed to work on normalization guidelines again. There is (and has always been) little enthusiasm for working on the Character Model, but having read the normalization document again this weekend, I suspect that Charmod-Norm will have to be replaced rather than just worked around.



Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.


Received on Sunday, 29 May 2011 20:54:57 UTC