RE: HTML5 and Unicode Normalization Form C from Leif Halvard Silli on 2011-06-01 (www-validator@w3.org from June 2011)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Wed, 1 Jun 2011 03:26:09 +0200
To: www-validator@w3.org
Cc: www-international <www-international@w3.org>
Message-ID: <20110601032609398692.0bf8815e@xn--mlform-iua.no>
( Adding www-validator@ again. )

Phillips, Addison, Tue, 31 May 2011 09:34:23 -0700:
>> 
>> No problem. And it is true that my main focus is on linking.
> 
> Linking is a special case. The IRI WG is also discussing 
> normalization. That's the best place to deal with that issue, I 
> think. Other comparisons in HTML (attributes and text values) do not 
> have externally provided requirements and thus HTML (or CSS or...) 
> need to define them.

Thanks for the tip w.r.t. the IRI WG - I've just subscribed.

Some more words on the HTML5 validator, though: its current behaviour, 
where non-NFC is stamped as an error, means that the HTML5 validator 
does not perform - or display - IRI syntax warnings whenever decomposed 
characters are used. Instead of giving an IRI-relevant warning message, 
the validator stamps the character as an outright error, regardless of 
where it occurs (@href or in "content").

By contrast, if one inserts a U+FF74 (a halfwidth Katakana letter, 
which is in NFC) into @href, then the HTML5 validator gives a proper, 
IRI-related warning:

]]
Warning: Bad value #ｴ for attribute href on element a: 
Compatibility character in fragment component.  [ snip ]
Syntax of IRI reference: [ snip ] Characters should be 
represented in NFC and spaces should be escaped as %20.
[[
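
To illustrate the difference between the two cases, here is a minimal 
Python sketch. It is not what the validator does internally; the helper 
names is_nfc and is_nfkc are my own, and it only relies on the standard 
unicodedata module. It shows that U+FF74 passes an NFC check (it is a 
compatibility character, which only NFKC changes), whereas a decomposed 
letter does not.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """True if the text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


def is_nfkc(text: str) -> bool:
    """True if the text is unchanged by NFKC normalization."""
    return unicodedata.normalize("NFKC", text) == text


halfwidth_e = "\uFF74"   # HALFWIDTH KATAKANA LETTER E
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

print(is_nfc(halfwidth_e))   # already in NFC
print(is_nfkc(halfwidth_e))  # NFKC would replace it with U+30A8
print(is_nfc(decomposed))    # NFC would compose it to U+00E9
```

So a check that only looks at NFC cannot, by itself, explain the 
"compatibility character" warning; that one comes from the IRI syntax 
check.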

If the HTML5 validator is to issue a warning for use of decomposed 
characters in content, then it should at least make sure to treat IRIs 
(in @href) separately from "content" - they should not be conflated. 
There could be a general warning against use of decomposed characters. 
But separate from that, there should be an IRI warning as well. 

>> HTML5 supports IRIs, which: [1] "Allows native representation of Unicode in
>> resources without % escaping". 
> 
> While this is a general way of defining IRIs, it's also misleading. 

That excellent quote stems from one of the authors behind the IRI spec 
- Michel Suignard. I like very much that it, in such plain and direct 
English, explains the purpose of IRIs.

> While IRIs represent the vast preponderance of Unicode code points 
> without escaping, percent escaping is still required in a number of 
> cases.

I accept this as your view of what needs to be communicated. From my 
perspective, what the quote says is important to communicate. 

The IRI RFC is much duller than that quote. Coming from HTML4, where 
non-ASCII inside @href and @id is forbidden, but where it is still 
possible to use percent encoding (and the @name attribute in place of 
@id) to represent non-ASCII, I want to see it explicitly stated that 
directly typed non-ASCII characters are allowed - that it is not the 
case that they are allowed only if you escape them!

Btw, the section "Converting URIs to IRIs" in the IRI RFC [1] points 
to 3 other sections which define restrictions, including the section 
'Limitations on UCS Characters Allowed in IRIs'. [2] Despite the 
restrictions, the purpose of IRI nevertheless is to allow non-ASCII 
characters in URLs. (I suppose some of the restrictions, such as the 
restriction on using halfwidth Katakana, are not technical restrictions 
but "philosophical" restrictions, related to the need to avoid visual 
look-alikes. As is the recommendation to use NFC.)

 [ snip ]

>>>> As it has turned out, however, it was an error of the HTML5 validator
>>>> to show an error for use of NFC. But *that* only increases the
>>>> importance of offer helpful recommendations w.r.t. links.
>>> 
>>> Thank you for the explanation of the background I wasn't aware of.
>> 
>> I should have pointed it out when I CC-ed this list. Sorry.
> 
> If you have concerns about links/web addresses, the best place to 
> discuss it is on public-iri@w3.org (the IETF IRI WG's mailing list). 
> The IRI effort needs all the help it can get.
> 
> As I mentioned before, my impression is that IRI is headed down the 
> path of *not* requiring any particular normalization form, although 
> NFC is recommended ("SHOULD") and early uniform normalization is 
> explicitly assumed.

As noted above, the HTML5 validator does implement that "SHOULD" with 
regard to non-NFC in IRIs. 

At least, it is my interpretation that, as long as it gets rid of the 
general error message (and also does not introduce a similar, 
indiscriminate *warning*) for *any* use of decomposed letters, then 
the HTML5 validator would still warn against use of non-NFC inside IRIs.

> Comparison of IRIs in the current draft addresses 
> comparison by defining equivalence at the code point level. See: 
> http://tools.ietf.org/html/draft-duerst-iri-bis-07#section-5.3.2 

It seems this is the most recent variant:
http://tools.ietf.org/html/draft-ietf-iri-3987bis-05#section-5.3.2 

That section defines "character normalization" as part of "syntax-based 
normalization".  But none of the user agents of the dominating Web 
browser families include character/Unicode normalization when they 
compare an IRI with @id. That they don't can indeed lead to "false 
negatives". So it would be good if they did what the bis draft 
recommends. 
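
A small Python sketch of such a "false negative" (the function name 
find_by_fragment and the sample id are hypothetical, chosen only for 
illustration; this does not describe any actual browser code). With 
plain code point comparison, a decomposed fragment fails to find a 
precomposed @id; with NFC applied to both sides, it is found.

```python
import unicodedata


def find_by_fragment(ids, fragment, normalize=False):
    """Return the first id equal to the fragment, or None.

    With normalize=False the comparison is code point by code point.
    With normalize=True both sides are converted to NFC first.
    """
    def key(text):
        return unicodedata.normalize("NFC", text) if normalize else text

    target = key(fragment)
    for candidate in ids:
        if key(candidate) == target:
            return candidate
    return None


ids = ["bl\u00e5b\u00e6r"]       # @id typed with precomposed U+00E5
fragment = "bla\u030ab\u00e6r"   # link typed with "a" + U+030A

print(find_by_fragment(ids, fragment))                  # no match
print(find_by_fragment(ids, fragment, normalize=True))  # match
```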

I think we need to start by stating that two @id attributes in HTML5 
are not to be considered valid "unique identifiers" if the only 
difference between them is the normalization form. Filed as a bug: 
http://www.w3.org/Bugs/Public/show_bug.cgi?id=12839


(Because, unless there is such a requirement that no two @id-s can 
differ only with regard to the normalization, then the recommendation 
of the IRI bis spec would mean that only the first occurring @id would 
be found.)
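
As a rough sketch of the check I have in mind (Python, standard library 
only; the function name and the sample ids are my own invention, and 
this is not how the validator is implemented), one could group the 
@id values of a document by their NFC form and report every group with 
more than one distinct spelling:

```python
import unicodedata
from collections import defaultdict


def ids_differing_only_by_normalization(ids):
    """Map each NFC form to the distinct ids that normalize to it,
    keeping only the forms reached by more than one distinct id."""
    groups = defaultdict(list)
    for value in ids:
        nfc = unicodedata.normalize("NFC", value)
        if value not in groups[nfc]:
            groups[nfc].append(value)
    return {nfc: values for nfc, values in groups.items() if len(values) > 1}


# Two spellings of the same id, plus one unrelated id.
document_ids = ["caf\u00e9", "cafe\u0301", "intro"]
print(ids_differing_only_by_normalization(document_ids))
```

Exact duplicates are deliberately not reported here, since those are 
already covered by the ordinary uniqueness rule for @id.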

[1] http://tools.ietf.org/html/rfc3987#section-3.2 
    (BIS variant of [1]: 
http://tools.ietf.org/html/draft-ietf-iri-3987bis-05#section-3.7 )
[2] http://tools.ietf.org/html/rfc3987#section-6.1

-- 
leif halvard silli
Received on Wednesday, 1 June 2011 01:28:12 UTC