RE: Unicode NFC - status, and RDF Concepts

Hello Jeremy et al,

The Internationalization working group recently began working in earnest on normalization again, and there are several developments in this area.

The working group's consensus, for some time, has been that while Early Uniform Normalization is desirable and should be recommended for content by specifications (such as RDF), the lack of normative force behind normalization in most of the core specs (such as HTML, XML, etc.) means that some documents will not be normalized.

The WG is preparing to update CharMod-Norm [1] in the near future to this effect. Sometime this week, in fact, you should see the current document replaced with one indicating this as our intention. The new recommendations are being developed on a Wiki page [2]. We are also engaged in a discussion with the TAG about having a finding on normalization. 

The main thrust of the I18N WG's current consensus is that identifiers must be compared as if normalized in one of the Unicode canonical normalization forms (i.e. NFC or NFD, not NFKC or NFKD). In addition, specs should recommend that identifiers use NFC for interoperability. Content (such as text within a document) should use a normalized form whenever possible, but it should not be automatically normalized by processors (such as parsers, renderers, etc.).
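
As a rough illustration of "compared as if normalized" (a sketch only, not text from CharMod-Norm; the function name identifiers_match is hypothetical), the comparison can normalize temporary copies to NFC while leaving the stored content untouched. Because only canonical forms are used, compatibility characters such as the "fi" ligature stay distinct from their plain-letter spellings:

```python
import unicodedata


def identifiers_match(a: str, b: str) -> bool:
    """Compare two identifiers as if both were in NFC.

    Only temporary normalized copies are compared; neither input is changed.
    NFC is a canonical form, so compatibility equivalents (NFKC/NFKD) are
    deliberately not folded together.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "caf\u00e9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT

print(composed == decomposed)                    # False: code points differ
print(identifiers_match(composed, decomposed))   # True: canonically equivalent
print(identifiers_match("\ufb01", "fi"))         # False: compatibility-only difference
```

Comparing in NFD instead would give the same answers here; NFC is used only because it is the form recommended for interchange.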

In my opinion, RDF literals fit the definition of "identifiers". The current normative language for the encoding of strings ("SHOULD") is still correct. However, because "SHOULD" is not "MUST", some literals will not be in NFC, so comparison of literals should take normalization into account. RDF should therefore update its references to CharMod/CharMod-Norm (bearing in mind that we intend to publish a substantially different normalization document this year), but the existing recommendations needn't change.
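
To make the "SHOULD" concrete, here is a minimal sketch (my own assumption about how a tool might behave, not anything RDF Concepts requires; literal_is_nfc is a hypothetical name) of a check that reports whether a literal's lexical form is already in NFC without rewriting it:

```python
import unicodedata


def literal_is_nfc(lexical_form: str) -> bool:
    """Return True if the lexical form is already in Unicode Normal Form C.

    The string is only inspected, never rewritten, so a processor can warn
    about a non-NFC literal while passing the original content through as-is.
    """
    return unicodedata.normalize("NFC", lexical_form) == lexical_form


for literal in ("caf\u00e9", "cafe\u0301"):
    status = "ok" if literal_is_nfc(literal) else "not NFC (SHOULD be normalized at the source)"
    print(repr(literal), status)
```

A check like this fits a validator or authoring tool better than a parser, since the goal is to flag the content for its author rather than to change it in transit.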

Please note that, in addition to the Unicode conference, I18N WG members will be available at TPAC (coming up in a few weeks).

Best regards,

Addison

[1] http://www.w3.org/TR/charmod-norm

[2] http://www.w3.org/International/wiki/CharmodNormSummary

[3] http://www.w3.org/International/wiki/NormalizationProposal 


Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.




> -----Original Message-----
> From: www-international-request@w3.org [mailto:www-international-request@w3.org] On Behalf Of "Martin J. Dürst"
> Sent: Sunday, October 09, 2011 11:43 PM
> To: Jeremy Carroll
> Cc: www-international@w3.org; RDF Working Group WG
> Subject: Re: Unicode NFC - status, and RDF Concepts
> 
> Hello Jeremy,
> 
> Great to hear from you again after a long time!
> 
> On 2011/10/10 14:19, Jeremy Carroll wrote:
> >
> > Several years ago, I was an editor of RDF Concepts and we included the
> > following:
> > http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/
> > [[
> > The string in both plain and typed literals is recommended to be in
> > Unicode Normal Form C [NFC]. This is motivated by [CHARMOD]
> > particularly section 4 Early Uniform Normalization.
> > ]]
> > and
> > [[
> > All literals have a lexical form being a Unicode [UNICODE] string,
> > which SHOULD be in Normal Form C [NFC].
> > ]]
> >
> > As we review this document, it has been noted that the CHARMOD
> > reference is out of date: the reference to section 4 of
> > http://www.w3.org/TR/2003/WD-charmod-20030822/#sec-Normalization
> > has been replaced by the fairly different
> > http://www.w3.org/TR/charmod-norm/#sec-EarlyUniformNormalization
> > and that WD seems to have been abandoned, and no consensus reached.
> >
> > What advice, if any, do I18N experts offer the RDF WG, updating the
> > advice of 2002?
> 
> I'd recommend keeping the text the same, and just tweaking or removing the
> reference. I unfortunately didn't have enough time to follow changes in
> charmod-norm in detail, but I hope to be able to catch up with more active
> members of the WG next week at the Internationalization and Unicode
> Conference in San Jose.
> 
> Regards,    Martin.

Received on Monday, 10 October 2011 15:28:52 UTC