- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Thu, 05 Jan 2006 11:39:33 +0900
- To: Jeremy Carroll <jjc@hpl.hp.com>, "www-international@w3.org" <www-international@w3.org>
- Cc: public-iri@w3.org
Hello Jeremy,

[Cross-posting the IRI mailing list.]

At 20:53 06/01/04, Jeremy Carroll wrote:
>
>I am implementing RFC 3987.
>(BSD style license, Java, part of Jena semantic web framework, motivated by SPARQL dependency on RFC 3987)
>
>Two parts I would like advice with are the following:
>
>http://www.apps.ietf.org/rfc/rfc3987.html#sec-7.5

Please note that Section 7 as a whole is entitled URI/IRI Processing Guidelines (informative), and although it contains words such as 'should' and 'recommended', these are not upper-cased, and therefore not RFC 2119. For the next version of the spec, we may want to think about changing the language so that these words aren't used anymore.

Another caveat is that subsection 7.5 is talking about IRI creation. It is important that, e.g. in a Java context, this isn't confused with the creation of an object of an IRI class. Most of these objects will be created from pre-existing IRIs, to which these considerations don't apply directly (there may still be rather similar security considerations that may apply).

>A: single script
>[[
>To avoid such cases, only IRIs should be created where all the characters in a single component are used together in a given language. This usually means that all of these characters will be from the same script, but there are languages that mix characters from different scripts (such as Japanese).
>]]
>
>On A, could someone articulate that in an automatable fashion please.
>e.g. use such and such a table from unicode.org, and for each IRI component map each character to its script code, and then the component is OK if the set of script codes used is either a singleton set, or the set { hiragana, kanji, katakana } or { ... }.

John Cowan, in http://lists.w3.org/Archives/Public/www-international/2006JanMar/0003.html, already gave a pointer to http://www.unicode.org/Public/UNIDATA/Scripts.txt as well as some more language examples.
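To make the script-set idea in A concrete, here is a minimal Java sketch. It assumes Java 8 or later and uses the JDK's own copy of the script data (Character.UnicodeScript) rather than parsing Scripts.txt directly. The class name ScriptCheck, the method names, and the single Japanese allow-list are hypothetical; a real allow-list would need more language-specific combinations, and skipping the Common and Inherited scripts is a choice made here, not something RFC 3987 prescribes.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

final class ScriptCheck {
    // Hypothetical allow-list: scripts that are legitimately mixed in Japanese.
    private static final Set<UnicodeScript> JAPANESE =
        EnumSet.of(UnicodeScript.HAN, UnicodeScript.HIRAGANA, UnicodeScript.KATAKANA);

    /** Collects the scripts used in a component, skipping COMMON and INHERITED. */
    static Set<UnicodeScript> scriptsOf(String component) {
        Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        component.codePoints().forEach(cp -> {
            UnicodeScript s = UnicodeScript.of(cp);
            // Digits, punctuation and combining marks are shared across scripts.
            if (s != UnicodeScript.COMMON && s != UnicodeScript.INHERITED) {
                scripts.add(s);
            }
        });
        return scripts;
    }

    /** True if the component uses at most one script or an allowed combination. */
    static boolean looksSingleLanguage(String component) {
        Set<UnicodeScript> scripts = scriptsOf(component);
        return scripts.size() <= 1 || JAPANESE.containsAll(scripts);
    }
}
```

With this sketch, a component mixing Latin letters with a Cyrillic U+0430 is flagged, while a component using kanji, hiragana and katakana together is accepted.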
Please also have a look through http://www.unicode.org/reports/tr36/ (Unicode Security Considerations) and http://www.unicode.org/reports/tr39/tr39-1.html (Unicode Security Mechanisms, proposed draft). Although creation (you try to create an IRI that survives well) is different from security (others try to avoid IRIs that may make them believe they are something different than they are), the actual data/algorithms needed for detection are very close.

>B: NFKC
>[[
>Although there may be exceptions, newly created resource names should generally be in NFKC [UTR15] (which means that they are also in NFC).
>]]
>
>For B, my code does an initial pass of the characters in each component, looking for problematic characters e.g. "--" in the host, or "/./" in the path. If it finds such problematic characters it may trigger more expensive processing (e.g. IDNA syntax checking). What are the characters I should be looking for in the component? i.e. please suggest a set of characters such that if none of these characters is in the IRI then it is necessarily in NFKC? An example would be the set [^\x20-\x7F] which would at least allow me to avoid NFKC checking for URIs. Again I am expecting an answer in terms of some table from unicode.org. e.g. if each character is neither a compatibility character nor a composing character then the component is in NFKC.

Bjoern already pointed to http://lists.w3.org/Archives/Public/www-international/2006JanMar/0002.html.

Why do you want to avoid NFKC checking? Ideally, you would use an NFKC checking implementation that did a quick first pass internally, wouldn't you?

>Given the weak language in both these assertions, violations would by default produce warnings.

I think getting your API right in this respect is probably the most challenging. Unlike with end-user software, you have to rely on the API user to do the right thing with the warning. This requires quite a bit of knowledge to avoid wrong decisions (e.g. just turning all warnings into errors, or just ignoring all warnings) that will hurt users.

>(I suspect I will send further messages about bidi)

Please do!

Regards,    Martin.
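As an illustration of point B, a minimal Java sketch of an NFKC check with a cheap first pass might look as follows. It assumes Java 6 or later, where java.text.Normalizer.isNormalized is available; the class name NfkcCheck and its method names are hypothetical. The ASCII shortcut relies on the fact that ASCII characters are unchanged by every Unicode normalization form, and a good library may already do a similar quick pass internally, in which case the shortcut is redundant.

```java
import java.text.Normalizer;

final class NfkcCheck {
    /** Cheap first pass: an ASCII-only string is already in NFKC. */
    static boolean isAscii(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    /** True if the component is in NFKC; only non-ASCII input reaches the library. */
    static boolean isNfkc(CharSequence component) {
        if (isAscii(component)) {
            return true;
        }
        return Normalizer.isNormalized(component, Normalizer.Form.NFKC);
    }
}
```

A component containing the ligature U+FB01 or a decomposed "e" followed by U+0301 would be reported as not in NFKC, and the caller could then decide whether to issue a warning.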
Received on Thursday, 5 January 2006 04:53:14 UTC