Re: Unicode NFC - status, and RDF Concepts from Eric Prud'hommeaux on 2011-10-10 (public-rdf-wg@w3.org from October 2011)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Mon, 10 Oct 2011 11:49:56 -0400
To: Alex Hall <alexhall@revelytix.com>
Cc: Andy Seaborne <andy.seaborne@epimorphics.com>, public-rdf-wg@w3.org
Message-ID: <20111010154954.GD26200@w3.org>
* Alex Hall <alexhall@revelytix.com> [2011-10-10 10:50-0400]
> On Mon, Oct 10, 2011 at 10:01 AM, Eric Prud'hommeaux <eric@w3.org> wrote:
> 
> > * Andy Seaborne <andy.seaborne@epimorphics.com> [2011-10-10 11:10+0100]
> > >
> > >
> > > On 10/10/11 10:34, Eric Prud'hommeaux wrote:
> > > >* "Martin J. Dürst"<duerst@it.aoyama.ac.jp>  [2011-10-10 06:43+0000]
> > > >>Hello Jeremy,
> > > >>
> > > >>Great to hear from you again after a long time!
> > > >
> > > >Ahh, nice to see the crew assembled.
> > > >
> > > >>On 2011/10/10 14:19, Jeremy Carroll wrote:
> > > >>>
> > > >>>Several years ago, I was an editor of RDF Concepts and we included the
> > > >>>following:
> > > >>>http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/
> > > >>>[[
> > > >>>The string in both plain and typed literals is recommended to be in
> > > >>>Unicode Normal Form C [NFC]. This is motivated by [CHARMOD]
> > particularly
> > > >>>section 4 Early Uniform Normalization.
> > > >>>]]
> > > >>>and
> > > >>>[[
> > > >>>All literals have a lexical form being a Unicode [UNICODE] string,
> > which
> > > >>>SHOULD be in Normal Form C [NFC].
> > > >>>]]
> > > >>>
> > > >>>As we review this document, it has been noted that the CHARMOD
> > reference
> > > >>>is out-of-date, the reference to section 4 of
> > > >>>http://www.w3.org/TR/2003/WD-charmod-20030822/#sec-Normalization
> > > >>>has been replaced by the fairly different
> > > >>>http://www.w3.org/TR/charmod-norm/#sec-EarlyUniformNormalization
> > > >>>and that WD seems to have been abandoned, and no consensus reached.
> > > >>>
> > > >>>What advice, if any, do I18N experts offer the RDF WG, updating the
> > > >>>advice of 2002?
> > > >>
> > > >>I'd recommend to keep the text the same, and just tweak or remove the
> > reference. I unfortunately didn't have enough time to follow changes in
> > charmod-norm in detail, but I hope to be able to catch up with more active
> > members of the WG next week at the Internationalization and Unicode
> > Conference in San Jose.
> > > >
> > > >While you're at it, could you get a sense of the implementation burden
> > of normalization?
> > > >
> > > >By imposing early normalization (NFC, in our case) we minimize the
> > matching burden and define a behavior which is adequate for non-normalized
> > data as well. If someone produces data with normalized terms
> > > >   <http://example.com/~bob/résumé>  # U00E9
> > > >and someone queries for it using the same term, life's good,
> > predictable, and in spec. If someone produces data with non-normalized terms
> > > >   <http://example.com/~bob/résumé>  # U0065 U0301
> > > >and queries using the same term, life's still OK, less predictable (it
> > won't match the normalized (correct) term), and explicitly out of spec. As
> > SemWeb tools become more industry-hardened, it would be nice to see input
> > tools (e.g. interactive data and query builders) normalize e.g. user paste
> > events. How much does that cost? Is it any cheaper to detect non-normalized
> > input (in a term-validating parser) than it is to C-normalize?
> > >
> > > The RDF Concepts text only applies to the lexical form for a literal.
> >
> > Ahh, I didn't realize that. Like with literals, saying nothing about
> > normalizing IRIs means we lose convergence for some graphs, which is why we
> > usually bother with standards.
> >
> >
> > > What do the RFCs say about IRIs?
> >
> > http://tools.ietf.org/html/rfc3987#section-5.3.2.2 says (paraphrased)
> > "don't futz with it; if you want to compare IRIs, you'd better have
> > normalized them before you call strcmp". It calls out explicitly NFC (U0065
> > U0301 → U00E9) and NFKC (maps between half-width and full-width characters).
> >
> >
> There's also http://tools.ietf.org/html/rfc3987#section-7.5, which
> essentially says "use NFKC when allocating new IRIs unless you have a good
> reason not to".

If visibly distinct representations of information are in our domain of discourse, then we have a good reason not to.

  product:18 css:mobile_label      "ﾀﾑｭﾁ"@ja,    "tamagochi"@en ; 
             css:full-screen_label "ダムゴチ"@ja, "ｔａｍａｇｏｃｈｉ"@en .

We can also pass this advice on to RDF producers, suggesting that they SHOULD use NFKC unless they wish to explicitly preserve the distinction between the normalized and non-normalized forms.
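
A minimal Python sketch of that trade-off, using only the standard-library unicodedata module. It takes the two English labels from the example above and checks that NFC keeps them distinct while NFKC folds the full-width form into plain ASCII. The helper name folds_under_nfkc is hypothetical, chosen only for illustration.

```python
import unicodedata

def folds_under_nfkc(a: str, b: str) -> bool:
    """Return True if two distinct strings become equal after NFKC."""
    nfkc_a = unicodedata.normalize("NFKC", a)
    nfkc_b = unicodedata.normalize("NFKC", b)
    return a != b and nfkc_a == nfkc_b

mobile = "tamagochi"                  # ASCII label
full_screen = "ｔａｍａｇｏｃｈｉ"   # full-width Latin label from the example

# NFC only applies canonical mappings, so the full-width form survives.
nfc_keeps_distinction = unicodedata.normalize("NFC", full_screen) != mobile

# NFKC also applies compatibility mappings, so the two labels collapse.
nfkc_loses_distinction = folds_under_nfkc(mobile, full_screen)
```

A producer who wants both labels to remain different literals would therefore stop at NFC for that data.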

Use cases for preserving distinctions in IRIs are a little harder to dream up, given that we're probably not trying to map RDF nodes onto an existing set of IRIs that are already backed by filesystem resources like www-data/résumé. Those are usually transformed to local filesystem encodings like iso-latin-1 anyway. That said, I expect we want consistency between literals and IRIs.
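
To make the IRI side concrete, here is a small Python sketch, again assuming nothing beyond the standard-library unicodedata module (is_normalized needs Python 3.8 or later). It builds the two spellings of the résumé IRI discussed earlier in the thread, shows that a plain string comparison treats them as different terms, and that applying NFC to both makes them converge. The helper name nfc is hypothetical.

```python
import unicodedata

def nfc(term: str) -> str:
    """Return the NFC form of an IRI or literal lexical form."""
    return unicodedata.normalize("NFC", term)

# Same visible IRI, two code point sequences.
composed = "http://example.com/~bob/r\u00e9sum\u00e9"        # U+00E9
decomposed = "http://example.com/~bob/re\u0301sume\u0301"    # U+0065 U+0301

# Code-point comparison (what strcmp would do) sees two different terms.
raw_match = composed == decomposed

# Normalizing both sides first makes them the same term.
normalized_match = nfc(composed) == nfc(decomposed)

# A validating parser could flag input instead of rewriting it.
needs_fixing = not unicodedata.is_normalized("NFC", decomposed)
```

The same nfc helper can be applied to literal lexical forms, which is one way to get the literal/IRI consistency mentioned above. Whether checking is cheaper than normalizing is not something this sketch measures.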


> -Alex
> 
> 
> 
> >
> > > What happens in XML for qnames?  I don't think they are normalized.
> >
> > My reading is that they default to a behavior consistent with early
> > normalization, i.e. do nothing during XML processing and leave it to the
> > folks generating the XML to generate terms to engineer convergence
> > conventions.
> >
> >
> > > (There is serious issue here for phishing attacks)
> >
> > My guess is that the most serious risk of phishing comes from similar domain
> > names; that the proprietors of example.com will have some pressure to
> > mediate between <http://example.com/~Dürst/> and <
> > http://example.com/~Du[U0308]rst/> if those sites are real phishing
> > opportunities.
> >
> >
> > >       Andy
> > >
> > > >
> > > >
> > > >>Regards,    Martin.
> > > >>
> > > >
> > > >
> > > >
> > >
> >
> > --
> > -ericP
> >
> >

-- 
-ericP
Received on Monday, 10 October 2011 15:50:29 UTC