- From: Alex Hall <alexhall@revelytix.com>
- Date: Mon, 10 Oct 2011 10:50:58 -0400
- To: "Eric Prud'hommeaux" <eric@w3.org>
- Cc: Andy Seaborne <andy.seaborne@epimorphics.com>, public-rdf-wg@w3.org
- Message-ID: <CAFq2biwZ+y5NjpBAHWf-_7w0_upDWoJ3ryjO=on2d6f7Ytu=jA@mail.gmail.com>
On Mon, Oct 10, 2011 at 10:01 AM, Eric Prud'hommeaux <eric@w3.org> wrote:

> * Andy Seaborne <andy.seaborne@epimorphics.com> [2011-10-10 11:10+0100]
> >
> > On 10/10/11 10:34, Eric Prud'hommeaux wrote:
> > >* "Martin J. Dürst"<duerst@it.aoyama.ac.jp> [2011-10-10 06:43+0000]
> > >>Hello Jeremy,
> > >>
> > >>Great to hear from you again after a long time!
> > >
> > >Ahh, nice to see the crew assembled.
> > >
> > >>On 2011/10/10 14:19, Jeremy Carroll wrote:
> > >>>
> > >>>Several years ago, I was an editor of RDF Concepts and we included the
> > >>>following:
> > >>>http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/
> > >>>[[
> > >>>The string in both plain and typed literals is recommended to be in
> > >>>Unicode Normal Form C [NFC]. This is motivated by [CHARMOD] particularly
> > >>>section 4 Early Uniform Normalization.
> > >>>]]
> > >>>and
> > >>>[[
> > >>>All literals have a lexical form being a Unicode [UNICODE] string, which
> > >>>SHOULD be in Normal Form C [NFC].
> > >>>]]
> > >>>
> > >>>As we review this document, it has been noted that the CHARMOD reference
> > >>>is out-of-date, the reference to, section 4 of
> > >>>http://www.w3.org/TR/2003/WD-charmod-20030822/#sec-Normalization
> > >>>has been replaced by the fairly different
> > >>>http://www.w3.org/TR/charmod-norm/#sec-EarlyUniformNormalization
> > >>>and that WD seems to have been abandoned, and no consensus reached.
> > >>>
> > >>>What advice, if any, do I18N experts offer the RDF WG, updating the
> > >>>advice of 2002?
> > >>
> > >>I'd recommend to keep the text the same, and just tweak or remove the
> > >>reference. I unfortunately didn't have enough time to follow changes in
> > >>charmod-norm in detail, but I hope to be able to catch up with more active
> > >>members of the WG next week at the Internationalization and Unicode
> > >>Conference in San Jose.
> > >
> > >While you're at it, could you get a sense of the implementation burden
> > >of normalization?
> > >
> > >By imposing early normalization (NFC, in our case) we minimize the
> > >matching burden and define a behavior which is adequate for non-normalized
> > >data as well. If someone produces data with normalized terms
> > >  <http://example.com/~bob/résumé> # U00E9
> > >and someone queries for it using the same term, life's good,
> > >predictable, and in spec. If someone produces data with non-normalized terms
> > >  <http://example.com/~bob/résumé> # U0065 U0301
> > >and queries using the same term, life's still OK, less predictable (it
> > >won't match the normalized (correct) term), and explicitly out of spec. As
> > >SemWeb tools become more industry-hardened, it would be nice to see input
> > >tools (e.g. interactive data and query builders) normalize e.g. user paste
> > >events. How much does that cost? Is it any cheaper to detect non-normalized
> > >input (in a term-validating parser) than it is to C-normalize?
> >
> > The RDF Concepts text only applies to the lexical form for a literal.
>
> Ahh, I didn't realize that. Like with literals, saying nothing about
> normalizing IRIs means we lose convergence for some graphs, which is why we
> usually bother with standards.
>
> > What do the RFCs say about IRIs?
>
> http://tools.ietf.org/html/rfc3987#section-5.3.2.2 says (paraphrased)
> "don't futz with it; if you want to compare IRIs, you'd better have
> normalized them before you call strcmp". It calls out explicitly NFC (U0065
> U0301 → U00E9) and NFKC (maps between half-width and full-width characters).

There's also http://tools.ietf.org/html/rfc3987#section-7.5, which
essentially says "use NFKC when allocating new IRIs unless you have a good
reason not to".

-Alex

> > What happens in XML for qnames? I don't think they are normalized.
>
> My reading is that they default to a behavior consistent with early
> normalization, i.e. do nothing during XML processing and leave it to the
> folks generating the XML to generate terms to engineer convergence
> conventions.
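
A minimal sketch of the NFC/NFKC points quoted above, assuming Python 3.8 or later (for `unicodedata.is_normalized`). The helper name `same_iri_after_nfc` is hypothetical, and both "é" characters in the example IRIs are varied here purely for illustration; it says nothing about the relative cost of detecting versus normalizing:

```python
import unicodedata

# Precomposed spelling: each "é" is the single code point U+00E9.
composed = "http://example.com/~bob/r\u00e9sum\u00e9"
# Decomposed spelling: "e" (U+0065) followed by combining acute (U+0301).
decomposed = "http://example.com/~bob/re\u0301sume\u0301"


def same_iri_after_nfc(a: str, b: str) -> bool:
    """Hypothetical helper: compare two IRIs after NFC-normalizing both."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A plain code point comparison (the strcmp case) sees two different terms.
plain_equal = composed == decomposed
# After NFC the two spellings converge on the same string.
nfc_equal = same_iri_after_nfc(composed, decomposed)

# Detecting non-normalized input without building the normalized string.
composed_is_nfc = unicodedata.is_normalized("NFC", composed)
decomposed_is_nfc = unicodedata.is_normalized("NFC", decomposed)

# NFKC additionally folds compatibility characters such as full-width letters;
# NFC leaves them untouched.
fullwidth = "\uff41\uff42\uff43"  # full-width "abc"
nfkc_folded = unicodedata.normalize("NFKC", fullwidth)
nfc_kept = unicodedata.normalize("NFC", fullwidth)
```

A term-validating parser could reject or flag input where `is_normalized` returns false, while an input tool handling paste events could call `normalize` instead.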
>
> > (There is a serious issue here for phishing attacks)
>
> My guess is that the most serious risk of phishing comes from similar domain
> names; that the proprietors of example.com will have some pressure to
> mediate between <http://example.com/~Dürst/> and
> <http://example.com/~Du[U0308]rst/> if those sites are real phishing
> opportunities.
>
> > Andy
> >
> > >>Regards, Martin.
>
> --
> -ericP
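
To make the two look-alike IRIs above concrete, here is a small sketch assuming Python's standard `unicodedata` module. The function name `distinct_but_nfc_equal` is hypothetical; it only shows that the two spellings differ as code point sequences yet coincide under NFC:

```python
import unicodedata

# "ü" as the single precomposed code point U+00FC.
precomposed = "http://example.com/~D\u00fcrst/"
# "u" followed by combining diaeresis U+0308.
combining = "http://example.com/~Du\u0308rst/"


def distinct_but_nfc_equal(a: str, b: str) -> bool:
    """Hypothetical check: different code points, same string after NFC."""
    if a == b:
        return False
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The combining spelling is one code point longer than the precomposed one.
length_gap = len(combining) - len(precomposed)
# The pair renders alike but compares unequal until normalized.
lookalike = distinct_but_nfc_equal(precomposed, combining)
# NFC maps the combining spelling onto the precomposed one.
converged = unicodedata.normalize("NFC", combining) == precomposed
```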
Received on Monday, 10 October 2011 14:51:28 UTC