
Re: Unicode NFC - status, and RDF Concepts

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Thu, 13 Oct 2011 01:35:25 +0000
Cc: Jeremy Carroll <jeremy@topquadrant.com>, www-international@w3.org, RDF Working Group WG
Message-Id: <4E96403A.1020308@it.aoyama.ac.jp>
To: "Phillips, Addison" <addison@lab126.com>


On 2011/10/11 0:28, Phillips, Addison wrote:
> Hello Jeremy et al,
> 
> The Internationalization working group recently began working in earnest on normalization again and there are several developments in this area.
> 
> The working group's consensus, for some time, has been that while Early Uniform Normalization is desirable and should be recommended for content by specifications (such as RDF), the lack of normative force behind normalization in most of the core specs (such as HTML, XML, etc.) means that some documents will not be normalized.

Yes, essentially there are two forms of Early Uniform Normalization; I'd call one 'strict' and the other 'best effort'.

Strict Early Uniform Normalization means that everything on the Web (or for a specific format) has to be in NFC, and everything else is invalid. This led us into a rathole, because:

1) There are cases (in content, not identifiers) where you want/need unnormalized stuff (e.g. isolated combining marks).
2) There are issues with the interaction between escaping and normalization.
3) Operations (again, on content, rather than identifiers, but there's not always a clear boundary) may occasionally lead to non-normalized stuff.
4) Input systems (keyboards, etc.) don't always produce NFC. In particular, there are some very deep issues on Windows that make this almost impossible at the moment for some cases (Michael Kaplan told me the details once).

So Strict Early Uniform Normalization essentially didn't work out.
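
As a rough illustration of point 2), here is a minimal Python sketch. It assumes an XML/HTML-style numeric character reference as the escape mechanism; the sample word and the helper name is_nfc are made up for the example. The escaped text is pure ASCII and therefore passes an NFC check, but the text it stands for does not.

```python
import html
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


# 'e' followed by an escaped U+0301 COMBINING ACUTE ACCENT.
# The escaped form consists of ASCII characters only.
escaped = "caf" + "e&#x301;"

# Resolving the escape yields 'e' + U+0301, which NFC would compose to U+00E9.
unescaped = html.unescape(escaped)

print(is_nfc(escaped))    # True
print(is_nfc(unescaped))  # False
print(unicodedata.normalize("NFC", unescaped) == "caf\u00e9")  # True
```

So a check made before escapes are resolved can give a different answer from one made afterwards, which is one reason a strict "everything else is invalid" rule is hard to state precisely.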


Best Effort Early Uniform Normalization, on the other hand, is still an extremely worthwhile effort. First, in this case, 'Uniform' doesn't mean "everything is exactly the same", but rather "when you normalize, do it in the same direction as everybody else".

The creation and use of NFC (yes, NFC was actually created on request and in collaboration between the W3C and the Unicode Consortium) for this purpose was deliberate: It is equivalent or close to existing practice for most Web content.

This has the big advantage that pushing people to adopt it is less of an uphill battle. However, it turns out that it also has a disadvantage: only a small share of content, authors, and software is affected, and not very often, so it's easy to neglect the problem (which is what some W3C specs and most implementations have done).
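
The "close to existing practice" point can be sketched in Python as follows. The choice of ISO-8859-1 (Latin-1) as the example legacy encoding is an assumption made for illustration, not something taken from the discussion above: text decoded from it consists of precomposed characters, so NFC leaves it alone, whereas NFD would rewrite it.

```python
import unicodedata

# Every character in the upper half of ISO-8859-1 (U+00A0..U+00FF),
# decoded to a Unicode string.
latin1_text = bytes(range(0xA0, 0x100)).decode("latin-1")

# NFC keeps the precomposed letters as they are.
unchanged_by_nfc = unicodedata.normalize("NFC", latin1_text) == latin1_text

# NFD splits letters such as U+00C0 into base letter plus combining mark.
unchanged_by_nfd = unicodedata.normalize("NFD", latin1_text) == latin1_text

print(unchanged_by_nfc)  # True
print(unchanged_by_nfd)  # False
```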

There are two more advantages of (Best Effort) Early Uniform Normalization:
a) When you realize you have normalization problems, and want to normalize, it tells you which way, and you know that this will help you not only for your own data, but for lots of other data, too, even if not for everything.
b) When you find that some things don't match because of normalization problems, even though you expect them to match, you know which side is "to blame", i.e. which side should work on fixing the problem. This is way better than having long "no it's your fault" arguments back and forth.
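
Points a) and b) can be sketched in a few lines of Python. The function name compare_as_if_normalized and the sample strings are hypothetical, chosen only for this example: both sides are compared after normalizing toward NFC, and any side that was not already in NFC is reported as the one that should be fixed.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalize text to NFC, the agreed-upon direction."""
    return unicodedata.normalize("NFC", text)


def compare_as_if_normalized(left: str, right: str):
    """Return (match, offenders).

    match is True if the two strings are equal once both are in NFC.
    offenders lists which inputs ("left", "right") were not already NFC.
    """
    match = nfc(left) == nfc(right)
    offenders = [
        name
        for name, value in (("left", left), ("right", right))
        if nfc(value) != value
    ]
    return match, offenders


composed = "D\u00fcrst"     # U+00FC, precomposed u with diaeresis
decomposed = "Du\u0308rst"  # 'u' followed by U+0308 COMBINING DIAERESIS

print(composed == decomposed)                          # False
print(compare_as_if_normalized(composed, decomposed))  # (True, ['right'])
```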


So in conclusion, please make sure that the message is not "we gave up on Early Uniform Normalization", but "Early Uniform Normalization is important; it's a best-effort thing".

Regards,   Martin.


> The WG is preparing to update CharMod-Norm [1] in the near future to this effect. Sometime this week, in fact, you should see the current document replaced with one indicating this as our intention. The new recommendations are being developed on a Wiki page [2]. We are also engaged in a discussion with the TAG about having a finding on normalization.
> 
> The main thrust of the I18N WG's current consensus is that identifiers must be compared as if normalized in one of the Unicode canonical normalization forms (i.e. NFC or NFD, not NFKC or NFKD). In addition, specs should recommend that identifiers use NFC for interoperability. Content (such as text within a document) should use a normalized form whenever possible, but should not be automatically normalized by processors (such as parsers, renderers, etc.).
> 
> In my opinion, RDF literals fit the definition of "identifiers". The current normative language for the encoding of strings ("SHOULD") is still correct. Comparison of literals should take normalization into account, given that "SHOULD" is not "MUST". So RDF should update references to CharMod/CharMod-Norm (bearing in mind that we intend to publish an extensively different normalization document this year) but existing recommendations needn't change.
> 
> Please note that, in addition to the Unicode conference, I18N WG members will be available at TPAC (coming up in a few weeks).
> 
> Best regards,
> 
> Addison
> 
> [1] http://www.w3.org/TR/charmod-norm
> [2] http://www.w3.org/International/wiki/CharmodNormSummary
> [3] http://www.w3.org/International/wiki/NormalizationProposal
> 
> 
> Addison Phillips
> Globalization Architect (Lab126)
> Chair (W3C I18N WG)
> 
> Internationalization is not a feature.
> It is an architecture.
> 
> 
> 
> 
>> -----Original Message-----
>> From: www-international-request@w3.org [mailto:www-international-
>> request@w3.org] On Behalf Of "Martin J. Dürst"
>> Sent: Sunday, October 09, 2011 11:43 PM
>> To: Jeremy Carroll
>> Cc: www-international@w3.org; RDF Working Group WG
>> Subject: Re: Unicode NFC - status, and RDF Concepts
>> 
>> Hello Jeremy,
>> 
>> Great to hear from you again after a long time!
>> 
>> On 2011/10/10 14:19, Jeremy Carroll wrote:
>>> 
>>> Several years ago, I was an editor of RDF Concepts and we included the
>>> following:
>>> http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/
>>> [[
>>> The string in both plain and typed literals is recommended to be in
>>> Unicode Normal Form C [NFC]. This is motivated by [CHARMOD]
>>> particularly section 4 Early Uniform Normalization.
>>> ]]
>>> and
>>> [[
>>> All literals have a lexical form being a Unicode [UNICODE] string,
>>> which SHOULD be in Normal Form C [NFC].
>>> ]]
>>> 
>>> As we review this document, it has been noted that the CHARMOD
>>> reference is out-of-date: the reference to section 4 of
>>> http://www.w3.org/TR/2003/WD-charmod-20030822/#sec-Normalization
>>> has been replaced by the fairly different
>>> http://www.w3.org/TR/charmod-norm/#sec-EarlyUniformNormalization
>>> and that WD seems to have been abandoned, and no consensus reached.
>>> 
>>> What advice, if any, do I18N experts offer the RDF WG, updating the
>>> advice of 2002?
>> 
>> I'd recommend keeping the text the same, and just tweaking or removing the
>> reference. I unfortunately didn't have enough time to follow changes in
>> charmod-norm in detail, but I hope to be able to catch up with more active
>> members of the WG next week at the Internationalization and Unicode
>> Conference in San Jose.
>> 
>> Regards,    Martin.
> 



Received on Thursday, 13 October 2011 05:47:47 GMT
