RE: Unicode NFC - status, and RDF Concepts from Phillips, Addison on 2011-10-11 (www-international@w3.org from October to December 2011)

From: Phillips, Addison <addison@lab126.com>
Date: Tue, 11 Oct 2011 07:49:03 -0700
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, John Cowan <cowan@mercury.ccil.org>
CC: Jeremy Carroll <jeremy@topquadrant.com>, "www-international@w3.org" <www-international@w3.org>, RDF Working Group WG <public-rdf-wg@w3.org>
Message-ID: <131F80DEA635F044946897AFDA9AC3476A96D8A71E@EX-SEA31-D.ant.amazon.com>

Martin replied:
> >
> >> The main thrust of the I18N WG's current consensus is that
> >> identifiers must be compared as if normalized in one of the Unicode
> >> canonical normalization forms (i.e. NFC or NFD, not NFKC or NFKD).
> >
> > I don't understand what "as if normalized" means.  Does that mean that
> > an identifier comparison routine can assume its inputs are normalized,
> > or that it must normalize them (non-destructively) before comparing?
> > The implementation implications couldn't be more different.
> 
> The intent is that they should be normalized (again) before comparison, unless
> you're completely sure they already are.

That's the general idea: if you cannot be sure that the input is normalized, then the comparison needs to ensure it, either by normalizing at comparison time or by interning the identifiers and normalizing them ahead of time.
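
As a rough sketch of those two options (Python's standard unicodedata module; the names identifiers_equal and IdentifierTable are invented for illustration, and NFC is just one of the two canonical forms that would work):

```python
import unicodedata


def nfc(s: str) -> str:
    """Return s in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", s)


def identifiers_equal(a: str, b: str) -> bool:
    """Option 1: normalize both inputs at comparison time."""
    return nfc(a) == nfc(b)


class IdentifierTable:
    """Option 2: intern identifiers, normalizing once on the way in.

    Hypothetical helper: every canonically equivalent spelling maps to the
    same stored string, so later comparisons need no further normalization.
    """

    def __init__(self) -> None:
        self._table: dict[str, str] = {}

    def intern(self, ident: str) -> str:
        key = nfc(ident)
        # setdefault returns the already-stored string if one exists.
        return self._table.setdefault(key, key)


composed = "caf\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

table = IdentifierTable()
same_object = table.intern(composed) is table.intern(decomposed)
```

The first option costs a normalization pass per comparison; the second pays that cost once per identifier, which tends to suit parsers that already keep a symbol table.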

> 
> But giving this a MUST is tough, because it's not actually done currently (except
> for IDNs, but in that case also only for IDN 2003 and/or TR 46, not for pure IDN
> 2008).

Tough compared to doing nothing, certainly. But perhaps not so tough when restricted to identifiers, and a lot more likely to be adopted than either early uniform normalization or late normalization strategies (in which whole documents are normalized). In many cases, introducing normalization is relatively non-disruptive because non-ASCII identifiers have been rare, unimplemented, or undefined until recently. With the exception of XML, in most cases we feel this represents a clarification of intent and an improvement in interoperability.

XML is an interesting case because it consciously makes the opposite decision: two identifiers that are canonically equivalent but differ in their code point sequences are not equal. XML provides recommended naming rules that avoid the various problems of normalization (Appendix J in XML 1.0, 5th Edition) for precisely this reason. This is not an invalid solution, but the I18N WG is trying to highlight this issue one last time before all specifications go down this route. I do not think we're going to change XML, but this should not dissuade other document formats, even those based on XML, from normatively addressing normalization issues.
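
To make the XML behavior concrete, here is a small illustration using Python's xml.etree.ElementTree (chosen only for convenience; the element name is made up). It parses the same visible name in composed and decomposed form and compares the results:

```python
import unicodedata
import xml.etree.ElementTree as ET

# The same visible element name, spelled two canonically equivalent ways.
composed = ET.fromstring("<caf\u00e9/>").tag     # precomposed U+00E9
decomposed = ET.fromstring("<cafe\u0301/>").tag  # "e" + U+0301

# XML compares names code point by code point, so these are different names.
names_match_as_parsed = composed == decomposed

# A format layered on XML could choose to compare under NFC instead.
names_match_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
```

Both documents are well-formed, yet a consumer looking for one spelling will not find the other unless the format built on XML says that names are compared in normalized form.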

> 
> >> In my opinion, RDF literals fit the definition of "identifiers".
> >
> > I can't imagine why you think so.  RDF literals are strings (except
> > when they are typed as numbers, dates, etc.)
> 
> Correct. I think Addison meant RDF URIs, i.e. the things that are used to identify
> resources.
> 
That is what I had in mind. Thanks.

Addison

Received on Tuesday, 11 October 2011 14:50:22 UTC