Re: question about IRI spec from Martin Duerst on 2006-01-05 (www-international@w3.org from January to March 2006)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Thu, 05 Jan 2006 11:39:33 +0900
To: Jeremy Carroll <jjc@hpl.hp.com>, "www-international@w3.org" <www-international@w3.org>
Cc: public-iri@w3.org
Message-Id: <6.0.0.20.2.20060105104712.075d1710@localhost>

Hello Jeremy,

[Cross-posting the IRI mailing list.]

At 20:53 06/01/04, Jeremy Carroll wrote:
 >
 >
 >I am implementing RFC 3987.
 >(BSD style license, Java, part of Jena semantic web framework, motivated
 >by SPARQL dependency on RFC 3987)
 >
 >Two parts I would like advice with are the following:
 >
 >http://www.apps.ietf.org/rfc/rfc3987.html#sec-7.5

Please note that Section 7 as a whole is entitled
"URI/IRI Processing Guidelines (informative)". Although it contains words
such as 'should' and 'recommended', these are not upper-cased, and are
therefore not RFC 2119 keywords. For the next version of the spec, we may
want to think about changing the language so that these words aren't used
anymore.

Another caveat is that subsection 7.5 is talking about IRI creation.
It is important that, e.g. in a Java context, this isn't confused with
the creation of an object of an IRI class. Most of these objects will
be created from pre-existing IRIs, to which these considerations don't
apply directly (although rather similar security considerations may
still apply).

 >A: single script
 >[[
 >To avoid such cases, only IRIs should be created where all the characters
 >in a single component are used together in a given language. This usually
 >means that all of these characters will be from the same script, but there
 >are languages that mix characters from different scripts (such as Japanese).
 >]]
 >
 >On A, could someone articulate that in an automatable fashion please.
 >e.g. use such and such a table from unicode.org, and for each IRI
 >component map each character to its script code, and then the component is
 >OK if the set of script codes used is either a singleton set, or the set {
 >hiragana, kanji, katakana } or { ... }.

John Cowan, in 
http://lists.w3.org/Archives/Public/www-international/2006JanMar/0003.html,
already gave a pointer to http://www.unicode.org/Public/UNIDATA/Scripts.txt
as well as some more language examples. Please also have a look through
http://www.unicode.org/reports/tr36/ (Unicode Security Considerations) and
http://www.unicode.org/reports/tr39/tr39-1.html (Unicode Security Mechanisms,
proposed draft). Although creation (you try to create an IRI that survives
well) is different from security (others try to avoid IRIs that may mislead
them about what the IRIs actually identify), the actual data/algorithms
needed for detection are very close.
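
A minimal sketch of such a per-component script check, assuming Java 7 or
later, where java.lang.Character.UnicodeScript exposes the data from
Scripts.txt. The class name ScriptCheck, its method names, and the single
allowed mix { Han, Hiragana, Katakana } are illustrative assumptions, not part
of RFC 3987 or of any existing library; a real list of acceptable mixes would
have to come from language data such as that discussed in UTR 36/39.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper: collects the scripts used in one IRI component.
final class ScriptCheck {
    // Assumption: the only multi-script mix accepted here is the Japanese one
    // mentioned in the question. Other languages would need their own sets.
    static final Set<UnicodeScript> JAPANESE = EnumSet.of(
            UnicodeScript.HAN, UnicodeScript.HIRAGANA, UnicodeScript.KATAKANA);

    static Set<UnicodeScript> scriptsOf(String component) {
        Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        for (int i = 0; i < component.length(); ) {
            int cp = component.codePointAt(i);
            UnicodeScript s = UnicodeScript.of(cp);
            // COMMON (digits, punctuation), INHERITED (combining marks) and
            // UNKNOWN (unassigned) are not counted as a script of their own.
            if (s != UnicodeScript.COMMON && s != UnicodeScript.INHERITED
                    && s != UnicodeScript.UNKNOWN) {
                scripts.add(s);
            }
            i += Character.charCount(cp);
        }
        return scripts;
    }

    // True if the component uses at most one script, or only the allowed mix.
    static boolean isPlausibleMix(String component) {
        Set<UnicodeScript> scripts = scriptsOf(component);
        return scripts.size() <= 1 || JAPANESE.containsAll(scripts);
    }
}
```

With this sketch, a component mixing Latin and Cyrillic letters would be
flagged, while a pure-Latin or a Han/Hiragana/Katakana component would pass.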


 >B: NFKC
 >[[
 >Although there may be exceptions, newly created resource names should
 >generally be in NFKC [UTR15] (which means that they are also in NFC).
 >]]
 >
 >For B, my code does an initial pass of the characters in each component,
 >looking for problematic characters, e.g. "--" in the host, or "/./" in the
 >path. If it finds such problematic characters it may trigger more expensive
 >processing (e.g. IDNA syntax checking). What are the characters I should be
 >looking for in the component? I.e. please suggest a set of characters
 >such that if none of these characters is in the IRI then it is necessarily
 >in NFKC. An example would be the set [^\x20-\x7F], which would at least
 >allow me to avoid NFKC checking for URIs. Again I am expecting an answer in
 >terms of some table from unicode.org, e.g. if each character is neither a
 >compatibility character nor a composing character then the component is in NFKC.

Bjoern already pointed to
http://lists.w3.org/Archives/Public/www-international/2006JanMar/0002.html.
Why do you want to avoid NFKC checking? Ideally, you would use an NFKC
checking implementation that did a quick first pass internally, wouldn't you?
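
A minimal sketch of that arrangement, assuming Java 6 or later, where
java.text.Normalizer.isNormalized is available. The class name NfkcCheck and
its method names are illustrative assumptions. The cheap first pass relies
only on the fact that ASCII-only strings are unchanged by every Unicode
normalization form; anything else is handed to the library check.

```java
import java.text.Normalizer;

// Hypothetical helper: NFKC check with a cheap ASCII-only first pass.
final class NfkcCheck {
    // True if every UTF-16 code unit is in the ASCII range.
    static boolean isAscii(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    // ASCII-only components are necessarily in NFKC, so skip the full check.
    static boolean isNfkc(CharSequence component) {
        if (isAscii(component)) {
            return true;
        }
        return Normalizer.isNormalized(component, Normalizer.Form.NFKC);
    }
}
```

A finer-grained first pass (for example, one driven by the quick-check
properties in the Unicode data files) would skip the full check more often,
but the structure would stay the same.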

 >Given the weak language in both these assertions, violations would by
 >default produce warnings.

I think getting your API right in this respect is probably the most
challenging part. Unlike with end-user software, you have to rely on the
API user to do the right thing with the warning. The API user needs
quite a bit of knowledge to avoid wrong decisions (e.g. just turning
all warnings into errors, or just ignoring all warnings)
that will hurt users.
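
One way to make the all-or-nothing choice less tempting is to let the API
user decide per kind of violation. The following is only a sketch of that
idea; WarningPolicy, Kind, Violation and errors are hypothetical names and do
not describe the Jena API or any existing library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: violations carry a kind, and the caller supplies a
// policy deciding which kinds to treat as errors; the rest stay warnings.
final class WarningPolicy {
    enum Kind { MIXED_SCRIPT, NOT_NFKC }

    static final class Violation {
        final Kind kind;
        final String component;

        Violation(Kind kind, String component) {
            this.kind = kind;
            this.component = component;
        }
    }

    // Returns only those violations that the caller's policy escalates.
    static List<Violation> errors(List<Violation> all,
                                  Predicate<Violation> treatAsError) {
        List<Violation> out = new ArrayList<>();
        for (Violation v : all) {
            if (treatAsError.test(v)) {
                out.add(v);
            }
        }
        return out;
    }
}
```

A caller could, for instance, escalate only NOT_NFKC violations while still
logging MIXED_SCRIPT ones as warnings.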

 >(I suspect I will send further messages about bidi)

Please do!

Regards,
Martin.
Received on Thursday, 5 January 2006 04:53:11 UTC