Re: Unicode NFC - status, and RDF Concepts from Andy Seaborne on 2011-10-10 (public-rdf-wg@w3.org from October 2011)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Mon, 10 Oct 2011 11:10:56 +0100
To: public-rdf-wg@w3.org
Message-ID: <4E92C4B0.7030604@epimorphics.com>

On 10/10/11 10:34, Eric Prud'hommeaux wrote:
> * "Martin J. Dürst"<duerst@it.aoyama.ac.jp>  [2011-10-10 06:43+0000]
>> Hello Jeremy,
>>
>> Great to hear from you again after a long time!
>
> Ahh, nice to see the crew assembled.
>
>> On 2011/10/10 14:19, Jeremy Carroll wrote:
>>>
>>> Several years ago, I was an editor of RDF Concepts and we included the
>>> following:
>>> http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/
>>> [[
>>> The string in both plain and typed literals is recommended to be in
>>> Unicode Normal Form C [NFC]. This is motivated by [CHARMOD] particularly
>>> section 4 Early Uniform Normalization.
>>> ]]
>>> and
>>> [[
>>> All literals have a lexical form being a Unicode [UNICODE] string, which
>>> SHOULD be in Normal Form C [NFC].
>>> ]]
>>>
>>> As we review this document, it has been noted that the CHARMOD reference
>>> is out-of-date: the reference to section 4 of
>>> http://www.w3.org/TR/2003/WD-charmod-20030822/#sec-Normalization
>>> has been replaced by the fairly different
>>> http://www.w3.org/TR/charmod-norm/#sec-EarlyUniformNormalization
>>> and that WD seems to have been abandoned, and no consensus reached.
>>>
>>> What advice, if any, do I18N experts offer the RDF WG, updating the
>>> advice of 2002?
>>
>> I'd recommend keeping the text the same, and just tweaking or removing the reference. Unfortunately I didn't have enough time to follow changes in charmod-norm in detail, but I hope to be able to catch up with more active members of the WG next week at the Internationalization and Unicode Conference in San Jose.
>
> While you're at it, could you get a sense of the implementation burden of normalization?
>
> By imposing early normalization (NFC, in our case) we minimize the matching burden and define a behavior which is adequate for non-normalized data as well. If someone produces data with normalized terms
>    <http://example.com/~bob/resumé>  # U00E9
> and someone queries for it using the same term, life's good, predictable, and in spec. If someone produces data with non-normalized terms
>    <http://example.com/~bob/resumé>  # U0065 U0301
> and queries using the same term, life's still OK, less predictable (it won't match the normalized (correct) term), and explicitly out of spec. As SemWeb tools become more industry-hardened, it would be nice to see input tools (e.g. interactive data and query builders) normalize e.g. user paste events. How much does that cost? Is it any cheaper to detect non-normalized input (in a term-validating parser) than it is to C-normalize?
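
For illustration only, the two spellings in the quoted example can be compared with a minimal Python sketch. It assumes the standard-library unicodedata module (is_normalized needs Python 3.8 or later); the helper names to_nfc and is_nfc are hypothetical, and nothing here measures the actual cost of checking versus normalizing.

```python
import unicodedata

# The two spellings from the quoted example (escapes make the difference visible).
composed = "http://example.com/~bob/resum\u00e9"     # precomposed: U+00E9
decomposed = "http://example.com/~bob/resume\u0301"  # U+0065 followed by U+0301


def to_nfc(term: str) -> str:
    """Return the term in Unicode Normal Form C (hypothetical helper name)."""
    return unicodedata.normalize("NFC", term)


def is_nfc(term: str) -> bool:
    """Report whether the term is already in NFC, without building a new string."""
    return unicodedata.is_normalized("NFC", term)


# A plain code-point comparison treats the two spellings as different terms.
same_as_written = composed == decomposed

# After normalizing both sides, the terms compare equal.
same_after_nfc = to_nfc(composed) == to_nfc(decomposed)
```

A term-validating parser could call is_nfc and reject or warn, while an input tool could call to_nfc on pasted text; which of the two is cheaper in practice is the open question above.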

The RDF Concepts text only applies to the lexical form for a literal.

What do the RFCs say about IRIs?

What happens in XML for qnames?  I don't think they are normalized.

(There is a serious issue here for phishing attacks.)

 Andy

>
>
>> Regards,    Martin.
>>
>
>
>

Received on Monday, 10 October 2011 10:11:28 UTC