Re: Unicode NFC - status, and RDF Concepts from Martin J. Dürst on 2011-10-14 (public-rdf-wg@w3.org from October 2011)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Fri, 14 Oct 2011 06:53:40 +0000
To: "Eric Prud'hommeaux" <eric@w3.org>
Cc: David Wood <david@3roundstones.com>, "Phillips, Addison", Jeremy Carroll, John Cowan, www-international@w3.org, RDF Working Group WG
Message-Id: <4E97DC33.4060209@it.aoyama.ac.jp>

On 2011/10/12 7:58, Eric Prud'hommeaux wrote:
> * David Wood<david@3roundstones.com>  [2011-10-11 17:00-0400]
>> 
>> On Oct 11, 2011, at 16:49, "Phillips, Addison"<addison@lab126.com>  wrote:
>> 
>>>>> B)
>>>>> 2) drop the "SHOULD use NFC" requirement on literals
>>>> 
>>>> I'm good with this one, unless we decide to do something around our ISSUE-63:
>>>>  http://www.w3.org/2011/rdf-wg/track/issues/63
>>>> 
>>> 
>>> For reasons I just outlined, I think this would be a mistake. By avoiding denormalized text, RDF users can help ensure interoperability. In practice, this is a no-op for implementers.
>> 
>> Why do you see it as a noop?
> 
> I guess it depends on which implementors we're talking about, but most of the current stack (OWL, SPARQL, RIF implementers) are invoked after the implied pre-normalization step. They don't have to do any normalization. Exceptions would be those creating RDF from user input or mapping non-RDF data (e.g. RDBs) to RDF. For those folks, the advice to pre-normalize could help them to converge on one of many possible representations of e.g. product names.

That, and also the fact that in many cases, input is already highly normalized, and in some cases can be guaranteed to be in NFC (e.g. when converting from certain legacy encodings to Unicode,...).
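
A minimal sketch of that point, assuming Python's standard unicodedata module. ISO-8859-1 is picked here only as an illustration of a legacy encoding whose decoded output is already in NFC, and the helper name is_nfc is hypothetical:

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if NFC normalization leaves the text unchanged."""
    return unicodedata.normalize("NFC", text) == text


# Bytes in ISO-8859-1, a legacy encoding chosen purely for illustration.
legacy_bytes = b"D\xfcrst"
decoded = legacy_bytes.decode("iso-8859-1")

# A decomposed spelling of the same name: "u" followed by U+0308
# (COMBINING DIAERESIS). This is the kind of input that is not in NFC.
decomposed = "Du\u0308rst"

print(is_nfc(decoded))      # the converted legacy text needs no further work
print(is_nfc(decomposed))   # the decomposed spelling does
print(unicodedata.normalize("NFC", decomposed) == decoded)
```

A converter that starts from such an encoding has nothing to normalize; only text arriving from other sources would need the check.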

> I'm pretty confident that we don't want to rule out having non-normalized forms in the domain of discourse (especially since applying the same codepoint comparison works regardless of normalization), but that we'd like to *advise* folks to converge where it's in their interest to do so and advising NFKC is a good path to that end. Thus, if we say "It is recommended to use Unicode Normal Form KC [NFKC] for both literals and IRIs when there is no explicit reason to preserve the non-normalized form.", we probably hit the sweet spot (and most present implementors don't have to do anything).

I suddenly see NFKC here, with no motivation given, where the discussion so far was about NFC. In many cases, going the additional step to NFKC is not a bad idea. But the compatibility equivalents that NFKC normalizes (in addition to the canonical equivalents that NFC normalizes) fall into various and sundry categories, so applying NFKC is a bad idea far more often than applying NFC is (for NFC, the problems are mostly edge cases involving the interaction between markup and content, and meta-level material such as talking about Unicode normalization in RDF).
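
To make the difference concrete, here is a small sketch, again assuming Python's standard unicodedata module. The sample characters are illustrative choices from a few compatibility categories (ligature, superscript, circled digit, fullwidth letter), not an exhaustive list:

```python
import unicodedata


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


def nfkc(text: str) -> str:
    return unicodedata.normalize("NFKC", text)


# Canonical equivalence: both forms compose "u" + U+0308 into U+00FC.
decomposed = "Du\u0308rst"
assert nfc(decomposed) == nfkc(decomposed) == "D\u00fcrst"

# Compatibility equivalents: NFC leaves them alone, NFKC rewrites them.
samples = {
    "\ufb01": "fi",    # LATIN SMALL LIGATURE FI
    "x\u00b2": "x2",   # SUPERSCRIPT TWO
    "\u2460": "1",     # CIRCLED DIGIT ONE
    "\uff21": "A",     # FULLWIDTH LATIN CAPITAL LETTER A
}
for original, folded in samples.items():
    assert nfc(original) == original
    assert nfkc(original) == folded
```

The superscript case shows the kind of loss involved: after NFKC, a literal meaning "x squared" can no longer be told apart from the two-character string "x2".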

Regards,    Martin.

Received on Friday, 14 October 2011 07:41:16 UTC