Re: HTML5 and Unicode Normalization Form C

Andreas Prilop, Fri, 27 May 2011 17:33:35 +0200 (CEST):
> On Fri, 27 May 2011, Michael[tm] Smith wrote:

>> But if you think it's wrong to even have it emit a warning,
>> then let me know and I talk to Henri and to the internationalization
>> folks about whether it should be or not. But from what I have been
>> told by the internationalization folks so far, I think they would
>> like to for it to be generating a warning here.

> In my opinion, you should not even emit a warning since Unicode
> itself does not require NFC to be used everywhere.
> It is the choice of the author to take any character encoding
> and any valid Unicode representation. This has nothing to do
> with "valid HTML" and should therefore not be reported by
> an HTML validator.

Actually, as discussed on www-international in February, use of non-NFC 
is likely to be a surprising and hard-to-debug result of interaction 
with a tool or a file system which does not use/convert to NFC, rather 
than a conscious choice. [1]

Use of non-NFC in file names is a problem in itself: unless the URL 
uses the same (de)composition, the file name and the link don't 
match. And even when e.g. a link and a file name both use non-NFC, 
there might be interaction problems related to CSS in some user agents 
(:visited and :link styling).
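
To illustrate the mismatch, here is a minimal Python sketch. The file 
name "café.html" is a made-up example, and real servers and file 
systems may normalize on their own, so take it only as a demonstration 
of why the two forms do not compare equal:

```python
import unicodedata
from urllib.parse import quote

# Hypothetical file name, once precomposed (NFC) and once decomposed (NFD).
name_nfc = unicodedata.normalize("NFC", "caf\u00e9.html")
name_nfd = unicodedata.normalize("NFD", "caf\u00e9.html")

# Both render as "café.html", but they are different code point sequences.
print(name_nfc == name_nfd)   # False

# Percent-encoding the UTF-8 bytes makes the difference visible in a URL.
url_nfc = quote(name_nfc)
url_nfd = quote(name_nfd)
print(url_nfc)                # caf%C3%A9.html
print(url_nfd)                # cafe%CC%81.html
```

A link written with one form will therefore not match a file stored 
under the other, unless something in between normalizes.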

HTML5 already warns against use of non-UTF-8 with the justification 
that it can cause problems with, quote: [2] "form submission and URL 
encodings". And hence, because non-NFC could cause the same kind of 
problems, a warning for use of non-NFC in links and idrefs does seem 
to be in order. This seems worth mentioning in HTML5 itself - perhaps 
a bug should be filed.
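
As a rough idea of what such a warning could look like, here is a 
sketch in Python. It is not how any existing validator does it; the 
class name, the function name and the list of checked attributes are 
my own assumptions, chosen only for illustration:

```python
import unicodedata
from html.parser import HTMLParser

# Assumption: a hand-picked set of link and idref-like attributes.
CHECKED_ATTRS = {"href", "src", "id", "for", "name"}


class NFCWarningParser(HTMLParser):
    """Collects (line, tag, attribute) for attribute values not in NFC."""

    def __init__(self):
        super().__init__()
        self.warnings = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name not in CHECKED_ATTRS or value is None:
                continue
            # A value is in NFC if normalizing it changes nothing.
            if unicodedata.normalize("NFC", value) != value:
                line, _offset = self.getpos()
                self.warnings.append((line, tag, name))


def find_non_nfc_attributes(html_text):
    parser = NFCWarningParser()
    parser.feed(html_text)
    parser.close()
    return parser.warnings


# First link uses e + combining acute (non-NFC), second uses precomposed é.
sample = '<a href="cafe\u0301.html">x</a>\n<a href="caf\u00e9.html">y</a>'
for line, tag, attr in find_non_nfc_attributes(sample):
    print(f"warning: line {line}: <{tag} {attr}> value is not in NFC")
```

Only the first link in the sample should be reported; whether this 
should be a warning or merely an informational message is of course 
the question under discussion.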

I don't know if CSS selectors are affected - if so, then any attribute 
with a non-NFC value should potentially have a warning. CSS 
namespaces are perhaps another problem area - which falls under CSS, 
though. [3]

As for using non-NFC outside attributes, I don't know if there are 
issues which can justify a warning. But according to Unicode Technical 
Report 15, the "W3C Character Model for the World Wide Web [snip] and 
other W3C Specifications (such as XML 1.0 5th Edition) recommend using 
Normalization Form C for all content." [4]

[1] http://lists.w3.org/Archives/Public/www-international/2011JanMar/0046
[2] http://www.w3.org/TR/html5/semantics.html#charset
[3] http://lists.w3.org/Archives/Public/www-style/2011May/0076
[4] http://unicode.org/reports/tr15/ 
-- 
Leif Halvard Silli

Received on Sunday, 29 May 2011 17:21:36 UTC