Re: Unicode NFC - status, and RDF Concepts from Leif Halvard Silli on 2011-10-12 (public-rdf-wg@w3.org from October 2011)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Wed, 12 Oct 2011 03:28:21 +0000
To: John Cowan <cowan@mercury.ccil.org>
Cc: "Phillips, Addison , Martin J. Dürst , Jeremy Carroll , www-international@w3.org , RDF Working Group WG" <addison@lab126.com>
Message-Id: <20111012052735619095.8ef99b5b@xn--mlform-iua.no>

John Cowan, Tue, 11 Oct 2011 10:57:45 -0400:
> Phillips, Addison scripsit:
> 
>> XML is an interesting case because it makes the opposite decision
>> consciously: two canonically-equivalent but unequal identifiers are
>> not equal. 
> 
> And this applies to both XML names and to namespace URIs.

One - probably strong - reason why HTML5 could end up with the same 
solution as XML is that HTML5 has XML 1.0 compatibility as a design goal. 
For that reason, it is also probably smart to focus on XML 1.0 if one 
wants to drive HTML5 in a particular direction ...

Btw, I filed bug 12839 on 1st of June to make the HTML5 spec say that 
normalization should be performed on @id attributes before establishing 
whether they are unique or not.[1] If the proposal went through, 
then <p id='&#xe5;'> and <p id='a&#x30a;'> would be considered to have 
the same value, and the document would thus be invalid due to identical 
@id-s.
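
To make the two @id values concrete, here is a minimal Python sketch using the standard unicodedata module. The variable names are only for illustration; it shows the difference between comparing the raw code points and comparing after NFC, not how any HTML5 parser or validator actually does it.

```python
import unicodedata

composed = "\u00e5"      # å as one precomposed code point (&#xe5;)
decomposed = "a\u030a"   # a followed by COMBINING RING ABOVE (a&#x30a;)

# Compared code point by code point, the two id values differ.
raw_equal = composed == decomposed

# After NFC normalization both become the same string.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

print(raw_equal, nfc_equal)
```

Under the bug 12839 proposal the second comparison would decide uniqueness; without normalization, the first one does.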

In the discussion inside the bug report, the others, including Henri, 
wanted @id-s that differ only w.r.t. NFC and NFD to be considered 
unique. Still, Validator.nu would consider the @id variant with the 
decomposed character invalid because it isn't NFC normalized. However, 
I think HTML5 says nothing yet about normalization. So I think this at 
best speaks about what Henri thinks HTML5 should say: that only early 
normalization should occur (read: @id values not in NFC form should be 
illegal). But if two equivalent variants of the same character occur in 
the same document, then parsers should still consider them different.
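
The "early normalization only" position described above could be sketched roughly as follows. The function check_ids is hypothetical and is not Validator.nu's implementation; it only illustrates flagging non-NFC values while still comparing ids without normalizing them.

```python
import unicodedata
from collections import Counter

def check_ids(ids):
    """Hypothetical check: report ids that are not in NFC, and ids that
    are duplicated when compared code point by code point."""
    not_nfc = [i for i in ids if unicodedata.normalize("NFC", i) != i]
    duplicates = [i for i, count in Counter(ids).items() if count > 1]
    return not_nfc, duplicates

# å in composed and in decomposed form, as in the example ids above.
not_nfc, duplicates = check_ids(["\u00e5", "a\u030a"])
print(not_nfc, duplicates)
```

Here the decomposed id is reported as not NFC, but the two ids are not reported as duplicates of each other.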

W.r.t. the CharmodNormSummary document: for C005, I'd like to 
suggest a few examples of when the author might want to avoid NFC. 
First, if the author wants to style different parts of a composed 
character differently - e.g. in different colors. HTML5 just made this 
legal - see bug 13502.

Another example: some tests I made showed that, apart from 
file searching (with IE as an exception to that again), 'accént' in 
decomposed form was treated more meaningfully than 'accént' in composed 
form. I tested, amongst other things, the screenreaders Jaws, VoiceOver 
and NVDA to reach that conclusion, which surprised me. Simply 
put, the decomposed variant was the only variant that was universally 
meaningfully 'screen-read'.

A third example could be authors who want to take advantage of NFD's 
symmetrical shape: e.g. if you want to sort words based on word length 
in a primitive fashion.
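
One possible reading of the "symmetrical shape" point, sketched in Python: in NFD every accented letter is a base letter plus combining mark(s), whereas in NFC only those letters that happen to have a precomposed form collapse to one code point. The helper nfd_len and the sample strings are assumptions made for illustration.

```python
import unicodedata

def nfd_len(word):
    """Length in code points after NFD (hypothetical helper)."""
    return len(unicodedata.normalize("NFD", word))

a_tilde = "\u00e3"    # ã, which has a precomposed form
q_tilde = "q\u0303"   # q + COMBINING TILDE, which has no precomposed form

# NFC gives uneven counts; NFD counts base letter + mark in both cases.
nfc_lengths = [len(unicodedata.normalize("NFC", w)) for w in (a_tilde, q_tilde)]
nfd_lengths = [nfd_len(w) for w in (a_tilde, q_tilde)]

# A primitive sort by word length, using the NFD length as the key.
by_length = sorted(["caf\u00e9s", "no", "\u00e5"], key=nfd_len)
print(nfc_lengths, nfd_lengths, by_length)
```

Note that this counts code points, not user-perceived characters, so it is only "primitive" sorting in the sense of the paragraph above.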

[1] http://www.w3.org/Bugs/Public/show_bug.cgi?id=12839
[2] http://www.w3.org/International/wiki/
-- 
leif halvard silli

Received on Thursday, 13 October 2011 05:47:23 UTC