- From: Alex Hall <alexhall@revelytix.com>
- Date: Mon, 10 Oct 2011 12:34:01 -0400
- To: "Eric Prud'hommeaux" <eric@w3.org>
- Cc: Andy Seaborne <andy.seaborne@epimorphics.com>, public-rdf-wg@w3.org
- Message-ID: <CAFq2bix6=uikdrVww30zkBqkpqsWZk0sOBeaKv6_sb2eomF5rA@mail.gmail.com>
On Mon, Oct 10, 2011 at 11:49 AM, Eric Prud'hommeaux <eric@w3.org> wrote:

> > > > What do the RFCs say about IRIs?
> > >
> > > http://tools.ietf.org/html/rfc3987#section-5.3.2.2 says (paraphrased)
> > > "don't futz with it; if you want to compare IRIs, you'd better have
> > > normalized them before you call strcmp". It calls out explicitly NFC
> > > (U0065 U0301 → U00E9) and NFKC (maps between half-width and full-width
> > > characters).
> >
> > There's also http://tools.ietf.org/html/rfc3987#section-7.5, which
> > essentially says "use NFKC when allocating new IRIs unless you have a
> > good reason not to".
>
> If visibly distinct representations of information are in our domain of
> discourse, then we have a good reason not to.
>
> product:18 css:mobile_label "タムュチ"@ja, "tamagochi"@en ;
>            css:full-screen_label "ダムゴチ"@ja, "tamagochi"@en .
>
> We can also pass this advice on to RDF producers, suggesting that they
> SHOULD use NFKC unless they wish to explicitly preserve the distinction
> between the normalized and non-normalized forms.
>
> Use cases for preserving distinctions in IRIs are a little harder to dream
> up, given that we're probably not trying to map RDF nodes to an existing set
> of IRIs which are already backed by filesystem resources like
> www-data/résumé. These are usually transformed to local filesystem codings
> like iso-latin-1 anyways.

The use case cited in the RFC is an IRI with non-normalized characters in the
query portion, which might indicate a search explicitly looking for
non-normalized text. That doesn't strike me as a particularly common use of
IRIs in RDF, but we certainly shouldn't preclude it.

> That said, I expect we want consistency between literals and IRIs.

Agreed. For IRIs, Unicode character normalization should be mentioned in the
same breath as percent-encoding normalization.
Indeed, the current editor's draft for Concepts cites IRIs with non-NFC
characters as non-normalized IRI forms which should (but not SHOULD) be
avoided. Here's the current summary:

RFC 3987: IRIs SHOULD be NFC, should be NFKC
RDF Concepts: IRIs should be NFC; literals SHOULD be NFC

-Alex
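
For readers who want to see the NFC/NFKC distinction from this thread in concrete terms, here is a minimal sketch using Python's standard `unicodedata` module. The helper `same_after` and the `http://example.org/` IRIs are hypothetical names chosen for illustration; they do not come from the RFC or from RDF Concepts. The sketch only shows that a plain string comparison treats the two forms as different unless both sides are normalized first.

```python
import unicodedata


def same_after(form: str, a: str, b: str) -> bool:
    """Compare two strings after applying one Unicode normalization form.

    Hypothetical helper for illustration; `form` is "NFC", "NFKC", etc.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Decomposed vs. precomposed e-acute: U+0065 U+0301 vs. U+00E9.
# The example.org IRIs are made up for this sketch.
decomposed = "http://example.org/re\u0301sume\u0301"
precomposed = "http://example.org/r\u00e9sum\u00e9"

# Half-width vs. full-width katakana KA: U+FF76 vs. U+30AB.
half_width = "\uff76"
full_width = "\u30ab"

print(decomposed == precomposed)                    # False: raw comparison
print(same_after("NFC", decomposed, precomposed))   # True: NFC composes e + U+0301
print(same_after("NFC", half_width, full_width))    # False: NFC keeps width variants
print(same_after("NFKC", half_width, full_width))   # True: NFKC folds width variants
```

The last two lines are the case Eric's labels rely on: under NFC the half-width and full-width forms stay distinct, while NFKC erases the difference.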
Received on Monday, 10 October 2011 16:59:03 UTC