Re: Unicode NFC - status, and RDF Concepts from Eric Prud'hommeaux on 2011-10-10 (public-rdf-wg@w3.org from October 2011)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Mon, 10 Oct 2011 05:34:03 -0400
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Cc: Jeremy Carroll <jeremy@topquadrant.com>, "www-international@w3.org, RDF Working Group WG" <public-rdf-wg@w3.org>
Message-ID: <20111010093402.GB12800@w3.org>

* "Martin J. Dürst" <duerst@it.aoyama.ac.jp> [2011-10-10 06:43+0000]
> Hello Jeremy,
> 
> Great to hear from you again after a long time!

Ahh, nice to see the crew assembled.

> On 2011/10/10 14:19, Jeremy Carroll wrote:
> > 
> > Several years ago, I was an editor of RDF Concepts and we included the
> > following:
> > http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/
> > [[
> > The string in both plain and typed literals is recommended to be in
> > Unicode Normal Form C [NFC]. This is motivated by [CHARMOD] particularly
> > section 4 Early Uniform Normalization.
> > ]]
> > and
> > [[
> > All literals have a lexical form being a Unicode [UNICODE] string, which
> > SHOULD be in Normal Form C [NFC].
> > ]]
> > 
> > As we review this document, it has been noted that the CHARMOD reference
> > is out-of-date, the reference to, section 4 of
> > http://www.w3.org/TR/2003/WD-charmod-20030822/#sec-Normalization
> > has been replaced by the fairly different
> > http://www.w3.org/TR/charmod-norm/#sec-EarlyUniformNormalization
> > and that WD seems to have been abandoned, and no consensus reached.
> > 
> > What advice, if any, do I18N experts offer the RDF WG, updating the
> > advice of 2002?
> 
> I'd recommend to keep the text the same, and just tweak or remove the reference. I unfortunately didn't have enough time to follow changes in charmod-norm in detail, but I hope to be able to catch up with more active members of the WG next week at the Internationalization and Unicode Conference in San Jose.

While you're at it, could you get a sense of the implementation burden of normalization?

By imposing early normalization (NFC, in our case) we minimize the matching burden and define a behavior which is adequate for non-normalized data as well. If someone produces data with normalized terms
  <http://example.com/~bob/resumé> # U00E9
and someone queries for it using the same term, life's good, predictable, and in spec. If someone produces data with non-normalized terms
  <http://example.com/~bob/resumé> # U0065 U0301
and queries using the same term, life's still OK, less predictable (it won't match the normalized (correct) term), and explicitly out of spec. As SemWeb tools become more industry-hardened, it would be nice to see input tools (e.g. interactive data and query builders) normalize their input, e.g. on user paste events. How much does that cost? Is it any cheaper to detect non-normalized input (in a term-validating parser) than it is to C-normalize?
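
To make the two cases concrete, here is a minimal sketch in Python, assuming the standard-library unicodedata module (is_normalized needs Python 3.8 or later). The helper names to_nfc and is_nfc are mine, purely for illustration. It shows the two spellings of the term failing to match as raw strings, matching once the decomposed one is C-normalized, and a detection-only check of the kind a term-validating parser might run.

```python
import unicodedata

# The two spellings of the same term from the examples above.
composed = "http://example.com/~bob/resum\u00e9"     # e-acute as U+00E9 (already NFC)
decomposed = "http://example.com/~bob/resume\u0301"  # U+0065 U+0301 (not NFC)


def to_nfc(term: str) -> str:
    """C-normalize a term, e.g. in an input tool handling a paste event."""
    return unicodedata.normalize("NFC", term)


def is_nfc(term: str) -> bool:
    """Detect non-normalized input without rewriting it."""
    return unicodedata.is_normalized("NFC", term)


print(composed == decomposed)                # False: raw code points differ
print(composed == to_nfc(decomposed))        # True: they match after NFC
print(is_nfc(composed), is_nfc(decomposed))  # True False
```

Timing to_nfc against is_nfc over a representative corpus would be one rough way to get numbers for the cost question; I have not done that here.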

> Regards,    Martin.
> 

-- 
-ericP

Received on Monday, 10 October 2011 09:34:46 UTC