Re: Unicode NFC - status, and RDF Concepts

* Andy Seaborne <andy.seaborne@epimorphics.com> [2011-10-10 11:10+0100]
> 
> 
> On 10/10/11 10:34, Eric Prud'hommeaux wrote:
> >* "Martin J. Dürst"<duerst@it.aoyama.ac.jp>  [2011-10-10 06:43+0000]
> >>Hello Jeremy,
> >>
> >>Great to hear from you again after a long time!
> >
> >Ahh, nice to see the crew assembled.
> >
> >>On 2011/10/10 14:19, Jeremy Carroll wrote:
> >>>
> >>>Several years ago, I was an editor of RDF Concepts and we included the
> >>>following:
> >>>http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/
> >>>[[
> >>>The string in both plain and typed literals is recommended to be in
> >>>Unicode Normal Form C [NFC]. This is motivated by [CHARMOD] particularly
> >>>section 4 Early Uniform Normalization.
> >>>]]
> >>>and
> >>>[[
> >>>All literals have a lexical form being a Unicode [UNICODE] string, which
> >>>SHOULD be in Normal Form C [NFC].
> >>>]]
> >>>
> >>>As we review this document, it has been noted that the CHARMOD reference
> >>>is out of date: the reference to section 4 of
> >>>http://www.w3.org/TR/2003/WD-charmod-20030822/#sec-Normalization
> >>>has been replaced by the fairly different
> >>>http://www.w3.org/TR/charmod-norm/#sec-EarlyUniformNormalization
> >>>and that WD seems to have been abandoned, and no consensus reached.
> >>>
> >>>What advice, if any, do I18N experts offer the RDF WG, updating the
> >>>advice of 2002?
> >>
> >>I'd recommend to keep the text the same, and just tweak or remove the reference. I unfortunately didn't have enough time to follow changes in charmod-norm in detail, but I hope to be able to catch up with more active members of the WG next week at the Internationalization and Unicode Conference in San Jose.
> >
> >While you're at it, could you get a sense of the implementation burden of normalization?
> >
> >By imposing early normalization (NFC, in our case) we minimize the matching burden and define a behavior which is adequate for non-normalized data as well. If someone produces data with normalized terms
> >   <http://example.com/~bob/résumé>  # U00E9
> >and someone queries for it using the same term, life's good, predictable, and in spec. If someone produces data with non-normalized terms
> >   <http://example.com/~bob/résumé>  # U0065 U0301
> >and queries using the same term, life's still OK, but less predictable (it won't match the normalized (correct) term) and explicitly out of spec. As SemWeb tools become more industry-hardened, it would be nice to see input tools (e.g. interactive data and query builders) normalize e.g. user paste events. How much does that cost? Is it any cheaper to detect non-normalized input (in a term-validating parser) than it is to C-normalize?
> 
> The RDF Concepts text only applies to the lexical form for a literal.

Ahh, I didn't realize that. As with literals, saying nothing about normalizing IRIs means we lose convergence for some graphs, and convergence is why we usually bother with standards.


> What do the RFCs say about IRIs?

http://tools.ietf.org/html/rfc3987#section-5.3.2.2 says (paraphrased) "don't futz with it; if you want to compare IRIs, you'd better have normalized them before you call strcmp". It explicitly calls out NFC (e.g. U0065 U0301 → U00E9) and NFKC (e.g. mapping between half-width and full-width characters).
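
To make "normalize before you call strcmp" concrete, here's a rough sketch in Python using the standard unicodedata module. The helper name iri_equal and the example IRIs are mine, purely for illustration; nothing here comes from the RFC text itself, and real IRI comparison involves more than character normalization.

```python
import unicodedata


def iri_equal(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two IRI strings after applying the same normalization form.

    Hypothetical helper: it only addresses character normalization,
    not case, percent-encoding, or scheme-specific normalization.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Precomposed form: "é" is the single code point U+00E9.
composed = "http://example.com/~bob/r\u00e9sum\u00e9"
# Decomposed form: "e" (U+0065) followed by combining acute (U+0301).
decomposed = "http://example.com/~bob/re\u0301sume\u0301"

# A plain code point comparison (the strcmp case) treats them as different.
plain_match = composed == decomposed
# Normalizing both sides to NFC first makes them converge.
nfc_match = iri_equal(composed, decomposed, "NFC")

# NFKC additionally folds compatibility characters, e.g. U+FF21
# (FULLWIDTH LATIN CAPITAL LETTER A) onto plain "A"; NFC does not.
fullwidth = "http://example.com/\uff21"
ascii_a = "http://example.com/A"
nfc_fullwidth_match = iri_equal(fullwidth, ascii_a, "NFC")
nfkc_fullwidth_match = iri_equal(fullwidth, ascii_a, "NFKC")
```

The point is only that the two sides have to agree on a form ahead of time; which form (if any) a spec mandates is the open question.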


> What happens in XML for qnames?  I don't think they are normalized.

My reading is that they default to a behavior consistent with early normalization, i.e. do nothing during XML processing and leave it to the folks generating the XML to engineer conventions under which their terms converge.


> (There is serious issue here for phishing attacks)

My guess is that the most serious risk of phishing comes from similar domain names, and that the proprietors of example.com will have some pressure to mediate between <http://example.com/~Dürst/> and <http://example.com/~Du[U0308]rst/> if those sites are real phishing opportunities.
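
For what it's worth, a server (or an input tool) could flag the second spelling cheaply. Below is a small sketch, again in Python, assuming Python 3.8 or later for unicodedata.is_normalized. The function name needs_nfc is made up for this example, and I'm not claiming anything about how the check performs relative to full normalization.

```python
import unicodedata


def needs_nfc(term: str) -> bool:
    """Return True if the string is not already in Normal Form C.

    Hypothetical helper; relies on unicodedata.is_normalized,
    which is available from Python 3.8 onward.
    """
    return not unicodedata.is_normalized("NFC", term)


# "ü" as the single precomposed code point U+00FC.
precomposed = "http://example.com/~D\u00fcrst/"
# "u" followed by combining diaeresis U+0308.
combining = "http://example.com/~Du\u0308rst/"

# The two spellings are distinct strings...
distinct = precomposed != combining
# ...only the second is flagged as non-normalized...
flag_precomposed = needs_nfc(precomposed)
flag_combining = needs_nfc(combining)
# ...and C-normalizing the second yields the first.
converges = unicodedata.normalize("NFC", combining) == precomposed
```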


>  Andy
> 
> >
> >
> >>Regards,    Martin.
> >>
> >
> >
> >
> 

-- 
-ericP

Received on Monday, 10 October 2011 14:01:35 UTC