Re: Unicode NFC - status, and RDF Concepts from Alex Hall on 2011-10-10 (public-rdf-wg@w3.org from October 2011)

From: Alex Hall <alexhall@revelytix.com>
Date: Mon, 10 Oct 2011 12:34:01 -0400
To: "Eric Prud'hommeaux" <eric@w3.org>
Cc: Andy Seaborne <andy.seaborne@epimorphics.com>, public-rdf-wg@w3.org
Message-ID: <CAFq2bix6=uikdrVww30zkBqkpqsWZk0sOBeaKv6_sb2eomF5rA@mail.gmail.com>

On Mon, Oct 10, 2011 at 11:49 AM, Eric Prud'hommeaux <eric@w3.org> wrote:
>
> > > > What do the RFCs say about IRIs?
> > >
> > > http://tools.ietf.org/html/rfc3987#section-5.3.2.2 says (paraphrased)
> > > "don't futz with it; if you want to compare IRIs, you'd better have
> > > normalized them before you call strcmp". It calls out explicitly NFC
> > > (U+0065 U+0301 → U+00E9) and NFKC (maps between half-width and
> > > full-width characters).
> > >
> > >
> > There's also http://tools.ietf.org/html/rfc3987#section-7.5, which
> > essentially says "use NFKC when allocating new IRIs unless you have a
> > good reason not to".
>
> If visibly distinct representations of information are in our domain of
> discourse, then we have a good reason not to.
>
>  product:18 css:mobile_label      "ﾀﾑｭﾁ"@ja,    "tamagochi"@en ;
>             css:full-screen_label "ダムゴチ"@ja, "ｔａｍａｇｏｃｈｉ"@en .
>
> We can also pass this advice on to RDF producers, suggesting that they
> SHOULD use NFKC unless they wish to explicitly preserve the distinction
> between the normalized and non-normalized forms.


> Use cases for preserving distinctions in IRIs are a little harder to dream
> up, given that we're probably not trying to map RDF nodes to an existing set
> of IRIs which are already backed by filesystem resources like
> www-data/résumé. These are usually transformed to local filesystem codings
> like iso-latin-1 anyways.


The use case cited in the RFC is an IRI with non-normalized characters in
the query portion which might indicate a search explicitly looking for
non-normalized text.  That doesn't strike me as a particularly common use of
IRIs in RDF, but we certainly shouldn't preclude that.
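
To make the NFC/NFKC distinction concrete, here is a minimal Python sketch. It is not from the RFC or the Concepts draft; it uses only the standard-library unicodedata module, the labels are copied from the example quoted above, and the "résumé" string is just an illustrative input. It suggests why NFKC would erase the width distinction that the labels are meant to preserve, while NFC leaves it alone.

```python
import unicodedata

# Labels copied from the quoted example; used here only as test inputs.
half_width_ja = "ﾀﾑｭﾁ"
full_width_en = "ｔａｍａｇｏｃｈｉ"

# NFC composes canonically equivalent sequences: e + U+0301 becomes U+00E9.
decomposed = "re\u0301sume\u0301"
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == "r\u00e9sum\u00e9")  # True

# NFC does not touch width variants, so the labels stay distinct.
print(unicodedata.normalize("NFC", full_width_en) == full_width_en)  # True

# NFKC also folds compatibility variants, so width information is lost.
nfkc_en = unicodedata.normalize("NFKC", full_width_en)
nfkc_ja = unicodedata.normalize("NFKC", half_width_ja)
print(nfkc_en)                   # tamagochi
print(nfkc_ja != half_width_ja)  # True
```
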


> That said, I expect we want consistency between literals and IRIs.
>

Agreed.  For IRIs, Unicode character normalization should be mentioned in
the same breath as percent-encoding normalization.  Indeed, the current
editor's draft for Concepts cites IRIs with non-NFC characters as
non-normalized IRI forms which should (but not SHOULD) be avoided.

Here's the current summary:
RFC 3987: IRIs SHOULD be NFC, should be NFKC
RDF Concepts: IRIs should be NFC; literals SHOULD be NFC
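
As a rough illustration of what checking the NFC guidance could look like in practice, here is a small Python sketch. The helper names (is_nfc, non_nfc_terms) and the example.org IRIs are hypothetical, not taken from any spec or library, and the check looks only at the string as given; it does not decode percent-encoded octets, which would need separate handling.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


def non_nfc_terms(terms):
    """Return the terms (IRIs or literal lexical forms) that are not in NFC."""
    return [t for t in terms if not is_nfc(t)]


# Hypothetical inputs for illustration only.
terms = [
    "http://example.org/r\u00e9sum\u00e9",    # precomposed: already NFC
    "http://example.org/re\u0301sume\u0301",  # decomposed: not NFC
    "ｔａｍａｇｏｃｈｉ",                      # full-width: NFC, but not NFKC
]

flagged = non_nfc_terms(terms)
print(len(flagged))  # 1
```

Note that the full-width literal passes this check, which matches the point made earlier in the thread: requiring NFC keeps width distinctions available, whereas requiring NFKC would not.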

-Alex

Received on Monday, 10 October 2011 16:59:03 UTC