RE: HTML5 and Unicode Normalization Form C

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Mon, 30 May 2011 03:43:00 +0200
To: "Phillips, Addison" <addison@lab126.com>
Cc: www-validator@w3.org, www-international@w3.org, public-i18n-core@w3.org
Message-ID: <20110530034300186618.564a4c65@xn--mlform-iua.no>
Phillips, Addison, Sun, 29 May 2011 13:54:34 -0700:
>> 
>> As for using non-NFC outside attributes, then I don't know if 
>> there are issues which can justify a warning. But according
>> to Unicode technical report 15, then the "W3C Character Model
>> for the World Wide Web [ snip ] and other W3C Specifications
>> (such as XML 1.0 5th Edition) recommend using Normalization
>> Form C for all content." [4]
  [...]
> The normative bits of Charmod-Norm live at [1]. Items C300 and C301 
> use the RFC 2119 keyword "SHOULD" in requiring that content and 
> specifications be fully-normalized or include-normalized.
  [...]
> It would be unreasonable, in my opinion, to treat HTML5 as a *new* 
> format, so I think any expectations for adding a normalization 
> requirement to HTML are unrealistic.

However, HTML5 warns against not using UTF-8 because of the "unexpected 
results" this can cause in form submissions and links. It would seem 
in tune with this spirit to, if possible, let HTML5/validators point to 
how to eliminate the problems that can cause unexpected results even 
with UTF-8, no?

Btw, it seems to be unclear, from HTML5, whether two @id attributes 
that differ only in their normalization are to be considered unique. 
All HTML5 says is that @id attributes must be unique, but it does not 
say what actually makes them unique. [1]
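
To make the question concrete, here is a minimal sketch, assuming 
Python's standard unicodedata module, of how a checker could spot @id 
values that differ only in normalization. The function name and the 
sample values are made up for illustration; nothing here is taken from 
HTML5 or from any validator.

```python
import unicodedata


def ids_differing_only_by_normalization(ids):
    """Group id values by their NFC form and return the groups that
    contain more than one distinct code point sequence.

    Hypothetical helper, for illustration only.
    """
    groups = {}
    for raw in ids:
        nfc = unicodedata.normalize("NFC", raw)
        groups.setdefault(nfc, set()).add(raw)
    return {nfc: sorted(raws) for nfc, raws in groups.items() if len(raws) > 1}


composed = "caf\u00e9"      # "é" as the single code point U+00E9
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# A plain code point comparison sees two different strings ...
plain_equal = composed == decomposed

# ... while comparing the NFC forms sees one and the same identifier.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed

clashes = ids_differing_only_by_normalization([composed, decomposed, "intro"])
```

Under a plain code point comparison the two values count as unique; 
under an NFC comparison they clash. Which of the two HTML5 intends is 
exactly what seems unspecified.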

Related to the uniqueness: 
  * On the Mac, when serving a file from the preinstalled Apache2, 
normalized link values (provided they are not cool IRIs with decomposed 
letters) do target files with non-normalized file names. How come? Is 
it because Apache performs a normalization of the HTTP request? 
  * Inside a document, however (with the exception of Safari on Windows 
[2]), browsers treat composed and decomposed identifiers as distinct 
identifiers. 
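
As a rough illustration of why the first point is surprising, the 
sketch below, assuming Python's unicodedata and urllib.parse modules 
and a made-up file name, shows that the composed and decomposed forms 
of the same name percent-encode to different byte sequences in a 
request path. It says nothing about what Apache or the file system 
does with them afterwards.

```python
import unicodedata
from urllib.parse import quote

# Hypothetical file name, for illustration only.
name = "m\u00e5lform.html"  # "å" as the single code point U+00E5

nfc_name = unicodedata.normalize("NFC", name)
nfd_name = unicodedata.normalize("NFD", name)  # "a" + U+030A COMBINING RING ABOVE

# quote() percent-encodes the UTF-8 bytes of each form.
nfc_path = quote(nfc_name)
nfd_path = quote(nfd_name)

print(nfc_path)  # m%C3%A5lform.html
print(nfd_path)  # ma%CC%8Alform.html
```

So if the normalized link reaches a file whose name is stored 
decomposed, some layer between the request and the file lookup must be 
treating the two byte sequences as equivalent.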

  [...]
> The I18N Core WG has recently agreed 
> to work on normalization guidelines again. There is (and has ever 
> been) little enthusiasm for working on the Character Model, but 
> having read the normalization document again this weekend, I suspect 
> that Charmod-Norm will probably have to be replaced, rather than just 
> worked around.

Good to hear you are looking at it!

> [1] http://www.w3.org/TR/charmod-norm/#sec-NormalizationApplication

[1] http://dev.w3.org/html5/spec/elements.html#the-id-attribute
[2] http://lists.w3.org/Archives/Public/www-validator/2011May/0052
-- 
Leif H Silli
Received on Monday, 30 May 2011 01:43:29 GMT